Notes from the 2012 Bioinformatics Open Source Conference.
Jonathan Eisen: Science Wants to Be Open – If Only We Could Get Out of Its Way
Jonathan Eisen starts off by mentioning this is the first time he’s giving an open science talk to a friendly audience. He’s associated, and obsessed, with the PLoS open access journals. History of PLoS: started with a petition, growing out of the free microarray specifications from Michael Eisen and Pat Brown, to make journal articles freely available. 25,000 people signed, but it had only a small impact on open access support, mainly because the available open access journals were not high profile enough. PLoS started as a selective, high-profile open access journal to fill this gap.
Ft Lauderdale agreement debated how to be open with genomic data. Sean Eddy argued for openness in data and source code by promoting the advantages of collaboration and feedback. First open data experiment at TIGR was Tetrahymena thermophila. The data were openly released and got useful biological feedback. Published next paper on Wolbachia in PLoS Biology despite overtures from Science/Nature.
A medical emergency with his wife was the real motivator for openness: he could not get access to the journals he needed to research it and help. Terrible lack of access for people outside of big academic institutions. He hit the same problem when trying to make all of his father’s research papers available.
Open access definition: free, immediate access online with unrestricted distribution and re-use. Authors retain rights to paper. PLoS uses a broad Creative Commons license. Why is reuse important? Thought example: what if you had to pay for access to each sequence when searching for BRCA1 homologs? Then what if it were free, but you couldn’t re-analyze it in a new paper? Science is built on re-use and re-purposing of results. Extends to education and fair use of figures. Additional areas to consider are open discussion and open reviews.
What are things you can do to support openness? Share things as openly as possible, participate in open discussion, consider being more open pre-publication. Risk to sharing is low, and benefit is high with help and discussion. Important to judge people by contributions, instead of surrogates like journal impact factor. Enhance and embrace open material while giving credit to everything you can. Support jobs and places that are into openness.
Great talk and a nice way to start off the meeting by focusing on why folks here care about open source.
Sebastian Schönherr: Cloudgene – an execution platform for MapReduce programs in public and private clouds
How to support scientists when using MapReduce programs? Goal is to simplify access to a working MapReduce cluster. Cloudgene is designed to handle these usability improvements. Sebastian talks through all of the work to set up a cluster: build cluster, HDFS, run program, and retrieve results. It’s a lot of steps. Cloudgene builds a unified interface for all of these.
Cloudgene merges software like Myrna under one unified interface. It works on both public and private clouds. New programs are integrated into Cloudgene via a simple configuration file in YAML format. Cool web interface similar to BioCloudCentral but on top of MapReduce work and with lots more metrics and functionality. Configuration files map to web forms à la Galaxy or GenePattern.
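To make the "configuration files map to web forms" idea concrete, here is a minimal sketch in Python. The field names and the tool description are made up for illustration and are not Cloudgene's actual YAML schema; the dictionary stands in for what a YAML parser would return.

```python
from html import escape

# Hypothetical tool description, as it might look after parsing a YAML file.
# None of these keys are taken from Cloudgene's real configuration format.
TOOL = {
    "name": "wordcount-example",
    "inputs": [
        {"id": "input", "label": "Input folder", "type": "folder"},
        {"id": "reducers", "label": "Number of reducers", "type": "number", "default": 4},
    ],
}


def render_form(tool):
    """Turn each declared input into one HTML form field."""
    rows = [f'<form action="/submit/{escape(tool["name"])}" method="post">']
    for field in tool["inputs"]:
        html_type = "number" if field["type"] == "number" else "text"
        value = escape(str(field.get("default", "")))
        rows.append(
            f'  <label>{escape(field["label"])} '
            f'<input type="{html_type}" name="{escape(field["id"])}" value="{value}"></label>'
        )
    rows.append('  <button type="submit">Run</button>')
    rows.append("</form>")
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_form(TOOL))
```

The point of the design is that adding a new program only means writing a new description; the form-rendering code stays the same.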
C Titus Brown: Data reduction and division approaches for assembling short-read data in the cloud
Titus has loads of useful code on his lab’s GitHub page and shares on blog, twitter and preprints: fully open. Uses approaches that are single-pass, involve compression and work with low-memory data structures. Goal is to supplement existing tools, like assemblers, with pre-processing algorithms. Digital normalization aims to remove unnecessary coverage for metagenomic assembly. Downsamples based on a de Bruijn graph to normalized coverage. Analysis is streaming, so it doesn’t require pre-loading a graph in memory and uses fixed memory. Avoids the nasty high memory requirements for assembly to allow it to run on commodity hardware, like EC2.
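A rough sketch of the digital normalization idea in Python, as I understand it: keep a read only if the median count of its k-mers seen so far is below a coverage cutoff. This is not the lab's actual code. The `CountTable` class, the single MD5-based hash and the default parameters are assumptions chosen to keep the example short; a real implementation would use a more careful probabilistic counting structure.

```python
import hashlib
from statistics import median


class CountTable:
    """Fixed-size k-mer counter; hash collisions can only inflate counts."""

    def __init__(self, size=1_000_003):
        self.size = size
        self.counts = [0] * size  # memory is fixed up front, whatever the input size

    def _slot(self, kmer):
        digest = hashlib.md5(kmer.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def get(self, kmer):
        return self.counts[self._slot(kmer)]

    def add(self, kmer):
        self.counts[self._slot(kmer)] += 1


def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=20, cutoff=20, table=None):
    """Yield reads whose median k-mer count so far is below the cutoff."""
    if table is None:
        table = CountTable()
    for read in reads:  # single pass; reads can come straight from a file iterator
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        if median(table.get(w) for w in words) < cutoff:
            for w in words:
                table.add(w)  # only kept reads contribute to coverage
            yield read
```

Because `normalize` is a generator over any iterable of reads, nothing is pre-loaded: redundant copies of a high-coverage read stop being emitted once their k-mers reach the cutoff, while reads from low-coverage regions keep passing through.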
Removing redundant reads has nice side-effect of removing errors. Effective in combination with other error-correction approaches. Results in virtually identical contig assembly after normalization.
Need both improved algorithms and additional capacity and infrastructure. Some tough things to overcome: biologists hate throwing away data; normalization gets rid of abundance data. New approaches are to use this for streaming error correction. Error correction is the biggest data problem left in sequencing.
All figures and information from the paper can be entirely reproduced as an IPython notebook. Can redo data analysis from any point: awesome. Approach has been useful in teaching and training as well as new projects.
Andrey Tovchigrechko: MGTAXA – a toolkit and a Web server for predicting taxonomy of the metagenomic sequences with Galaxy front-end and parallel computational back-end
MGTAXA predicts taxonomic classifications for bacterial metagenomic sequences. Uses an ICM (Interpolated Context Model) to help extract signal from shorter sequences: better than using a fixed k-mer model. Started off using self-organizing maps to identify taxonomy, but not possible in complex cases where clades cluster together.
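For comparison, here is a sketch of the simpler fixed-order baseline the ICM improves on: train one fixed-order Markov model per clade and assign a sequence to the clade whose model gives it the highest log-likelihood. This is not MGTAXA's code and not an ICM, which as I understand it interpolates across context lengths rather than fixing one. The function names, order and add-one smoothing are my own choices.

```python
import math
from collections import defaultdict


def train(seqs, order=3):
    """Count which base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def log_likelihood(model, seq, order=3, alphabet="ACGT"):
    """Log-probability of seq under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx = model.get(seq[i - order:i], {})
        seen = sum(ctx.values())
        total += math.log((ctx.get(seq[i], 0) + 1) / (seen + len(alphabet)))
    return total


def classify(models, seq, order=3):
    """Name of the clade model that scores the sequence highest."""
    return max(models, key=lambda name: log_likelihood(models[name], seq, order))


if __name__ == "__main__":
    # Toy training data, purely illustrative.
    models = {
        "at_rich": train(["ATATATATATATATATATAT"]),
        "gc_rich": train(["GCGCGCGCGCGCGCGCGCGC"]),
    }
    print(classify(models, "ATATATAT"))
```

With a fixed order, a short sequence offers only a handful of contexts, many of them unseen in training, which is the weakness a variable-context model is meant to address.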
ICM used to identify phage relationship with host based on shared k-mer composition of virus and host. Parallelization approach used to scale model training with multiple backends: serial, SGE, and the Makeflow workflow engine. Cool, I didn’t know about Makeflow. Andrey suggests it as a nice fit for Galaxy tool parallelization. Also implemented an MPI version of BLAST+ using MapReduce-MPI, which gives fault tolerance while exposing an MPI library.
Python source is available on GitHub and integrated with a Galaxy frontend. Network architecture set up at JCVI to allow access to the firewalled cluster for processing. Uses AMQP messaging for communication with Apache Qpid.
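One way to picture the multiple-backend approach: describe each training task once, then either run the tasks serially or write them out as Make-style rules for a workflow engine to schedule. The sketch below is an assumption about the general shape, not MGTAXA's implementation. `train_model` is a placeholder command name, and the rule text only follows the generic target/source/command layout that Make-like engines use.

```python
import subprocess


def training_tasks(clades):
    """Map each output file to its (input file, shell command)."""
    # "train_model" is a hypothetical placeholder, not a real MGTAXA executable.
    return {
        f"{c}.model": (f"{c}.fasta", f"train_model {c}.fasta > {c}.model")
        for c in clades
    }


def run_serial(tasks, runner=subprocess.run):
    """Serial backend: run every command one after another."""
    for _target, (_source, command) in tasks.items():
        runner(command, shell=True, check=True)


def to_make_rules(tasks):
    """Workflow backend: emit one Make-style rule per task."""
    rules = []
    for target, (source, command) in tasks.items():
        rules.append(f"{target}: {source}\n\t{command}\n")
    return "\n".join(rules)


if __name__ == "__main__":
    print(to_make_rules(training_tasks(["cladeA", "cladeB"])))
```

Since each rule names its inputs and outputs, an engine can work out that the per-clade tasks are independent and run them in parallel on whatever cluster backend it supports.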
Katy Wolstencroft: Workflows on the Cloud – Scaling for National Service
Building workflows for genetic testing on the cloud with the National Health Service in the UK. Done in collaboration with Eagle Genomics. Diagnostic testing today uses a small number of variants, but will soon be scaling up to whole genomes.
Using Taverna workflows to run the analyses. Workflow is to identify variants, then annotate with dbSNP, 1000 Genomes and conservation data. Gather evidence to classify variants as problematic or not. Taverna provides workflow provenance so it’s accessible, secure and reproducible.
Architecture currently uses Amazon cloud, Taverna server with data in S3 buckets. Modified Taverna to work better with Amazon and improvements will be available in the next Taverna release.
Andreas Prlic: How to use BioJava to calculate one billion protein structure alignments at the RCSB PDB website
Andreas combines interests in PDB for work and open-source contributions to BioJava. He describes a workflow to find novel relationships between systematic structural alignments. Work uses Open Science Grid by pushing a custom job management system talking on port 80. Converts a CPU-bound problem into an IO problem. Alignment comparison and visualization code is available from BioJava.
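To get a feel for the scale, here is a small Python sketch of splitting pairwise comparisons into work units that grid workers could fetch. It assumes the alignments are all-against-all pairs, which the talk title suggests but my notes don't confirm. The function names and unit size are made up, and this is not the RCSB job management system.

```python
from itertools import combinations, islice


def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def work_units(ids, unit_size):
    """Lazily yield lists of up to unit_size pairs, without storing them all."""
    pairs = combinations(ids, 2)
    while True:
        unit = list(islice(pairs, unit_size))
        if not unit:
            return
        yield unit


if __name__ == "__main__":
    # Illustrative number only: 45,000 items already give about a billion pairs.
    print(pair_count(45_000))
    for unit in work_units(["s1", "s2", "s3", "s4"], unit_size=4):
        print(unit)
```

Each alignment is cheap on its own, so once the pairs are spread over many grid nodes the hard part becomes handing out units and collecting results, which fits the CPU-to-IO shift described in the talk.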