Day 2 of the Bioinformatics Open Source Conference (BOSC) kicked off bright and early on Saturday with a very nice talk from Ross Gardler about building open development communities.
Ross Gardler — Community Development at the Apache Software Foundation
Ross discusses how the community development system at Apache could be useful to open source communities in biology. Apache has 70 active projects and 30+ in development; there are 2500 regular contributors with commit access. The Apache foundation started in 1995 to fix up the UIUC server and became an official organization in 1999 to provide legal protection for members. The mission statement was broad and general: more about a way of doing things in an open manner than about specific projects. The foundation exists to get the legal nastiness and whatnot out of the way so folks can write code and documentation with minimal resistance. Apache provides indirect financial support: it doesn’t pay for code, but many developers are paid by third parties to work on Apache projects.
Apache is a meritocracy, and everyone has a voice and vote. Contribution of value produces merit within projects. If you earn merit in multiple projects then you can earn membership at the foundation level. Consensus is made via debate and code, although occasionally a vote is required via the mailing list. The rule of lazy consensus is that trusted folks can just code away: once you have code you can evaluate it more easily and move forward if everyone agrees.
Growing the foundation from the original Apache server to 70+ projects has been a challenge. Jakarta was developed as an umbrella for sub-projects underneath Apache, and it had some failures; in response, the organization was modified to keep a flat structure without any umbrella projects. This allows projects to be reviewed by the Apache folks who have lots of experience evaluating development communities. The Apache foundation doesn’t consider technical issues, but rather things like stagnating communities, undue commercial influence and other potential problems.
What are the characteristics of a good Apache project? Diversity: at least 3 committers unrelated to each other outside the project. Fully audited code for IP issues, which makes the work more palatable to companies who are contributing. Projects should be generic and reusable, so the component parts are available. The idea is that the components can be used outside of your field so you can build a wider community.
How do you scale the community? More projects bring in additional volunteers and don’t stress the overhead too much, but create the potential for dilution of the Apache foundation values and brand. The flat structure gives power to new members since there are low barriers to entry, but this can result in the blind leading the blind. However, hierarchy is inefficient. Peer review is one of the answers to helping the community self-regulate.
Mentoring helps bring new folks into Apache. In the incubator, mentors guide new project teams and teach them the Apache way. Google Summer of Code brings in some community members, and the Apache mentoring project goes beyond this to provide mentoring on a year-round basis.
Summary of lessons: the foundation should handle the brand, infrastructure, and legal aspects of projects. This also allows for cross-project community discussions. The project handles technical issues and contributors. Lazy consensus is used to avoid management by committee and keep the power in the hands of the people who do things. You need to think about how to generalise your project components and get outside of your niche. Excellent things to think about for the biology community, where we are used to trying to specialize.
Chris Fields — BioPerl
Chris will talk about current things happening in BioPerl, and then focus on some changes that are happening in the community: making things easier for new users, using modern Perl features and dealing with BioPerl being monolithic. BioPerl has been around since 1996 and has an impressive number of current and past contributors. Lincoln Stein's next-gen tools, Bio-SamTools and Bio-BigFile, are separate CPAN distributions. GBrowse talk later.
Summer of Code is happening for the 3rd year. The alignment subsystem is being cleaned up to include the capability to deal with large datasets via indexing and reduced in-memory representations.
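To make the indexing idea concrete, here is a minimal sketch in Python (not BioPerl code, and not how the BioPerl alignment subsystem is implemented). It assumes FASTA-style records and builds a byte-offset index so that a single record can be read without loading the whole file. The function names and the sample data are hypothetical.

```python
import io


def build_offset_index(handle):
    """Map each record id to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            parts = line[1:].split()
            record_id = parts[0].decode() if parts else ""
            index[record_id] = offset
    return index


def fetch_sequence(handle, index, record_id):
    """Seek to one record and read only its sequence lines."""
    handle.seek(index[record_id])
    handle.readline()  # skip the header line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.strip())
    return b"".join(chunks).decode()


# Hypothetical in-memory data standing in for a large file on disk.
data = io.BytesIO(b">seq1 first\nACGT\nACGT\n>seq2\nGGCC\n")
index = build_offset_index(data)
seq2 = fetch_sequence(data, index, "seq2")
seq1 = fetch_sequence(data, index, "seq1")
```

Only the small id-to-offset dictionary stays in memory; the sequences themselves are read on demand, which is the trade-off that makes large datasets manageable.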
Moving forward, how can the current code be improved and modularized? To lower the barrier to entry, the BioPerl repository was migrated to GitHub. The monolithic nature of BioPerl makes things very hard to maintain and release. One idea is to make BioPerl a front end installer that adds specific individual packages based on interests and needs. They have initial prototypes using Moose for BioPerl objects, and for BioPerl on Perl 6.
Raoul Bonnal — BioRuby
Overview of Ruby itself: a nice language with object orientation, functional aspects and reflection. BioRuby works with both standard C Ruby and JRuby. The last BioRuby update presentation was at BOSC 2008, and there has been tons of development since, including 3 Hackathons and 1 Codefest. New features include support for BioSQL, which allows interoperable storage of sequences, PhyloXML support from a GSOC project, Fastq parsing support, NCBI REST access, and TogoWS support.
BioRuby has frequent meetings via mail, Skype and IRC. There is a very strict requirement for tests as they continue to move to an agile programming style. BioRuby has a plugin system with a standard naming scheme: bioruby-plugin-NAME. They provide a script interface to download and install plugins.
Peter Rice — EMBOSS
EMBOSS received continued funding last year, which allowed new development as opposed to the bug fix and maintenance releases of the previous two years. EMBOSS aims at both developers and end users, and is targeted at the command line. There are over 100 interfaces, including Galaxy. The new release supports BAM and other new next-gen features. 3 open source books are coming out soon, which will lock down much of the library functionality.
Fastq and other parsing was improved by thinking about truncated failure cases and building up a standard set of problem cases. The new EMBOSS accesses BioMart and Ensembl. Planned additions are DAS, GMOD and BioSQL. They provide a standard definition format for defining databases; an awesome way to avoid re-doing all of the database-specific processing. Other planned features include improved ontology support.
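The approach of collecting truncated and otherwise broken inputs as a standard set of problem cases can be sketched roughly as follows. This is a Python illustration rather than EMBOSS code, and it assumes the simple four-line FASTQ layout; the parser, the exception class and the cases themselves are all hypothetical.

```python
import io


class TruncatedFastqError(ValueError):
    """Raised when a FASTQ record is incomplete or inconsistent."""


def parse_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not header.startswith("@"):
            raise TruncatedFastqError(f"bad header: {header!r}")
        if not plus.startswith("+"):
            raise TruncatedFastqError("missing '+' separator line")
        seq, qual = seq.strip(), qual.strip()
        if len(seq) != len(qual):
            raise TruncatedFastqError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


# A small, made-up set of problem cases: one valid file and three truncations.
PROBLEM_CASES = {
    "valid": "@r1\nACGT\n+\nIIII\n",
    "truncated_quality": "@r1\nACGT\n+\nII",
    "missing_quality": "@r1\nACGT\n+\n",
    "missing_separator": "@r1\nACGT\n",
}


def check(text):
    """Return the parsed records, or an error string if parsing fails."""
    try:
        return list(parse_fastq(io.StringIO(text)))
    except TruncatedFastqError as err:
        return f"error: {err}"


results = {name: check(text) for name, text in PROBLEM_CASES.items()}
```

Keeping the cases in one named collection means every parser change can be checked against the same known failures.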
Tiago Antao — population genetics in Python
The HapMap project develops a haplotype map of the human genome: 11 populations, 90-180 individuals in each. It contains SNPs, CNVs, genotypes, pedigree info. UCSC known genes are most useful for overlapping with data from HapMap. The Python library accesses both HapMap and UCSC with Biopython, matplotlib, GenePop and Entrez data. The Ensembl Variation API covers a similar area in Perl.
The structure is SQLite based: remote data is downloaded once, then stored and indexed. Interface examples look straightforward for retrieval and querying. Very nice demonstration plots of data with matplotlib.
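The download-once, store-and-index pattern might look something like the sketch below, using Python's standard sqlite3 module. This is not Tiago's actual library: the table layout, the function names, the fake fetch function and the SNP rows are all assumptions made for illustration.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Create the cache tables and a positional index if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snps ("
        "rs_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_snps_region ON snps (chrom, pos)")
    conn.execute("CREATE TABLE IF NOT EXISTS fetched (chrom TEXT PRIMARY KEY)")
    return conn


def load_chromosome(conn, chrom, fetch_remote):
    """Download a chromosome's SNPs only if they are not cached yet."""
    cached = conn.execute(
        "SELECT 1 FROM fetched WHERE chrom = ?", (chrom,)
    ).fetchone()
    if cached:
        return False  # already stored locally, no download needed
    rows = fetch_remote(chrom)
    with conn:  # commit both inserts together
        conn.executemany("INSERT OR REPLACE INTO snps VALUES (?, ?, ?)", rows)
        conn.execute("INSERT INTO fetched VALUES (?)", (chrom,))
    return True


def snps_in_region(conn, chrom, start, end):
    """Query cached SNPs in a region, served by the (chrom, pos) index."""
    cur = conn.execute(
        "SELECT rs_id, pos FROM snps "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


calls = []


def fake_fetch(chrom):
    """Stand-in for a remote download; the rows are made up."""
    calls.append(chrom)
    return [("rs1", chrom, 100), ("rs2", chrom, 250), ("rs3", chrom, 900)]


conn = open_cache()
first = load_chromosome(conn, "chr1", fake_fetch)   # triggers the "download"
second = load_chromosome(conn, "chr1", fake_fetch)  # served from the cache
hits = snps_in_region(conn, "chr1", 50, 300)
```

Passing a file path instead of ":memory:" would make the cache persist between sessions, so the remote download really happens only once.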
Laurent Gautier — Bioconductor and Python
rpy2 provides a way to natively access libraries implemented in R from Python. Bioconductor is one really useful target for biologists: tons of open source packages in R. Laurent shows an awesome diagram of the biological data landscape: what Python handles well and what R/Bioconductor handles well. R is heavily statistical while Python is more focused on data processing.
The idea of rpy2 is to bridge the Python and R communities. Community-wise, this encourages interpreters between the two who can share the usefulness of each separate community. Nice example of using edgeR from Python to look at differential expression of RNA-seq data.
Eric Talevich — Bio.Phylo package in python
Eric developed a phylogenetics library for Biopython that makes it easy to explore tree data. There are a bunch of phylogenetics formats: Newick, Nexus, PhyloXML and NeXML.
Eric provides a demo of using PhyloXML to parse a Newick tree and visualize it in multiple ways: as a text tree and as networkx-style graphical trees. With PhyloXML you can specify attributes of a tree and annotate it, and then store all this in the XML format. It is easy to promote standard Newick to the more representative PhyloXML.
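To give a feel for what promoting Newick to PhyloXML involves structurally, here is a rough sketch using only the Python standard library. It is not Bio.Phylo's implementation or API: the parser handles only names and branch lengths (no quoted labels or comments), the function names are hypothetical, and marking the tree as rooted is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_newick(text):
    """Parse a simple, well-formed Newick string into nested dicts."""
    text = "".join(text.split())  # drop all whitespace
    pos = 0

    def parse_clade():
        nonlocal pos
        clade = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                clade["children"].append(parse_clade())
                if text[pos] == ",":
                    pos += 1
                    continue
                if text[pos] == ")":
                    pos += 1
                    break
                raise ValueError(f"unexpected character at position {pos}")
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        clade["name"] = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":" and read the branch length
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            clade["length"] = float(text[start:pos])
        return clade

    return parse_clade()


def clade_to_xml(clade):
    """Turn one parsed clade into a PhyloXML-style <clade> element."""
    elem = ET.Element("clade")
    if clade["name"]:
        ET.SubElement(elem, "name").text = clade["name"]
    if clade["length"] is not None:
        ET.SubElement(elem, "branch_length").text = str(clade["length"])
    for child in clade["children"]:
        elem.append(clade_to_xml(child))
    return elem


def newick_to_phyloxml(text, tree_name):
    """Wrap the clade tree in phyloxml/phylogeny elements and serialize it."""
    root = ET.Element("phyloxml", xmlns="http://www.phyloxml.org")
    phylogeny = ET.SubElement(root, "phylogeny", rooted="true")  # assumed rooted
    ET.SubElement(phylogeny, "name").text = tree_name
    phylogeny.append(clade_to_xml(parse_newick(text)))
    return ET.tostring(root, encoding="unicode")


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4);")
xml_text = newick_to_phyloxml("((A:0.1,B:0.2)AB:0.3,C:0.4);", "demo")
```

Once the tree lives in XML, each clade element has room for further annotations, which is where PhyloXML goes beyond what a Newick string can hold.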