BOSC 2012, day 2 am: Carole Goble on open-source community building; Software Interoperability talks

Talk notes from the 2012 Bioinformatics Open Source Conference.

Carole Goble: If I Build It Will They Come?

Carole’s goal is to discuss experience building communities around open source software that facilitates reuse and reusability. 3 areas of emphasis: computational methods and scientific workflows (Taverna), social collaboration like MyExperiment and knowledge acquisition like RightField.

Goal of MyExperiment is to collaboratively share workflow. Awesomely, it now interoperates with Galaxy workflows. For knowledge acquisition, Seek4Science handles data sharing and inter-operates with IsaTab.

General philosophy: laissez-faire philosophy trying to encourage inclusiveness and ability to evolve over time. Some difficulty with being too free since led to some ugly metadata when tools are too general. People prefer simple interfaces that work and they can adopt to. By having flexibility and extensibility, have been able to work across multiple communities and widen adoption. Majority of work paid for in projects outside of Biology. Currently supporting 16 full time informaticians.

How can you get users? Be useful for something, by somebody, some of the time. Need to under promise and over deliver. 4 things that drive adoption: 1. Provide added value 2. Provide a new asset 3. Keep up with the field 4. Because there is no choice.

7 thinks that hinder adoption: 1. Not enough added value 2. Doesn’t work for non-expert users 3. No time or capacity to take on learning. The first 3 sum up to: the software sucks and is difficult to improve. Good solutions exist to help: Software Carpentry. 4. Cost of disruption 5. Exposure to risk 6. No community 7. Changes to work practice. Last 4 boil down to being too costly. Tipping points for usage are normally not technical. Another issue is that people haven’t heard of it so need to promote and discuss.

Some other problems that happen: adoption is incidental, familial adoption for people like me. Difficult to build for others that are not like me. Need to motivate others to help fill in gaps. Need to fully interoperate and not tweak to be non-back-compatible without specific breaks.

Adoption model: need to build an initial seed of users that can advocate and encourage others to use. Trickiest part is to establish this initial set of friends that love your software. This can primarily be a relationship building process: need trust and usability even for the best backend implementations. Need to understand where your projects fit and target the right people even if they are not exactly like you. Difficult.

In social environments, need to understand what drives and fears people have with regards to sharing. A real problem is people not giving credit for things they use. How can we improve micro-attribution in science? How do we harness this competitiveness? Provide reputation and attribution for what you do. Protect and preserve data to help make people more productive.

If you are lucky enough to build a community, then move to the next level and need to maintain and nurture. As you concentration on maintenance can lose focus on the initial things that drove adoption. Version 2 syndrome: featuritis.

In summary, need to focus on: What is it that you provide and people value? Who are you targeting? Need to be able to be agile and provide improvements continuously to existing users, but be careful not to end up with hideous code that is unmaintainable. Difficult balance between being general and specific. Final trick is to be long term once you’ve got adoption: how can you keep software around and provide the funding to continue developing and improving it?

Social and technical aspects of adoption both equally important.

Richard Holland: Pistoia Alliance Sequence Squeeze: Using a competition model to spur development of novel open-source algorithms

Pistoia Alliance Sequence Squeeze competition to develop improved sequence compression algorithms. Pistoia alliance is an not-for-profit alliance of life-science companies by trying to collaborate on shared research and development. Goal of competition was to come up with approach for compression that is not linear but retains 100% of the information. Work on FASTQ data and all software is open-source.

Overall has 12 entrants with 108 distinct submissions. Provided a public leaderboard that encouraged competition. Winner was James Bonfield. Other useful compression approaches like PAQ fared well.

Michael Reich: GenomeSpace: An open source environment for frictionless bioinformatics

GenomeSpace tackles the difficulty of switching between different tools. Creates a connection layer between genome analysis tools with biological data types. Based supported tools are Cytospace, Galaxy, GenePattern, Genomica, IGV and the UCSC browser. GenomeSpace aimed at non-programming users and support automatic cross-tool compatibility. It’s a great resource to connect tools.

Dannon Baker: Galaxy Project Update

Details on some of the fun stuff that the Galaxy team has been working on. Lots of development on the API, including providing a set of wrapper methods that wrap the base REST API. API allows running of tools in Galaxy.

Automated parallelism is now built into Galaxy. Can split up BLAST jobs into pieces, run on cluster and then combine back together. Main concerns are: overhead of splitting plus temporary space requirement. Set use_tasked_jobs=True in configuration and supports BLAST, BWA and Bowtie. There is a parallelism tag in Galaxy XML that enables it. Advanced splitting has a FUSE later that makes the file look like a directory to avoid re-copying files.

Galaxy tool shed allows automated input of tools into Galaxy instance.

Enis Afgan: Zero to a Bioinformatics Analysis Platform in 4 Minutes

Enis currently working with Australian Nectar national cloud to port CloudMan to private clouds in addition to currently supported Amazon EC2. Idea is to make resources instantly available to users and provide a layer on top of programmer shell interfaces. 4 projects put together: BioCloudCentral, CloudMan, CloudBioLinux, Galaxy. The Blend library provides a python API on top of Galaxy and CloudMan, with full docs.

Alexandros Kanterakis: PyPedia: A python crowdsourcing development environment for bioinformatics and computational biology

PyPedia provides a collaborative programming web environment. General idea is that wiki can provide a clean way to upload and run python code: Google App Engine + wiki. This has lots of useful example code, like converting VCF to reference. PyPedia provides a REST interface and everything so everything is fully reproducible and re-runnable.

This would be great to host an
interactive cookbook for Biopython with automated tests and ability to fork.

Alexis Kalderimis: InterMine – Embeddable Data-Mining Components

InterMine is an integrated data warehouse with customizable backend storage. It provides a web interface with query functionality. Provides a nice technology stack underneath with web service and Java API interfaces. Provides libraries in Python, Perl, Java that give a nice intuitive interface for querying pragmatically. Can write custom analysis widgets with some nice javascript interfaces: built using CoffeeScript with Backbone.js and Underscore.js so lots of pretty javascript underlying it.

Bruno Kinoshita: Creating biology pipelines with BioUno

BioUno is a biology workflow system built using Jenkins build server. Provides 5 plugins for biological work. Jenkins is an open source continuous integration system that makes it easy to write plugins. Bruno bravely shows a live demo including fabulous Java war files.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s