I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 morning session focused on software interoperability.
- Day 1 morning talks: Cameron Neylon and Open Science
- Day 2 afternoon talks: Visualization and project updates
Biological sequence analysis in the post-data era
Sean Eddy starts off with a self-described embarrassing personal history of how he found his scientific direction. The biology background: multiple self-splicing introns in a bacteriophage, unexpected in such a highly streamlined genome. These introns are self-splicing transposable elements, which makes them difficult for the organism to purge from the genome. They show no sequence conservation, only structural conservation, and at the time there were no tools to detect this. Sean was an experimental biologist and used this as a motivating problem to search for an algorithmic solution. He couldn't crack it straight off until he learned about HMM approaches, then reimplemented HMMs and re-invented stochastic context free grammars to model the structure as a tree. The embarrassing part: his post-doc lab work on GFP was going poorly and got scooped, so he wrote a postdoc grant update to switch to computational biology. This switch led to HMMER, Infernal and the Biological Sequence Analysis book.
From this, some general advice: the usual warning against incremental engineering is wrong. A lot of great work has come from incremental engineering: automobiles, and sequence analysis itself (Smith-Waterman -> BLAST -> PSI-BLAST -> HMMER). Engineering is a valuable part of science, though it requires insane dedication to a single problem. The truth: science rewards you for how much impact you have, not how many papers you write. He recommends an arbitrage approach to science: take existing ideas and tools and make them usable for the biologists who need them. It's not traditionally valued, but it is useful, so you can carve out a niche.
He describes the general approach behind Pfam that helps tame the exponential growth of sequences: curate representative seed alignments, build scalable models with HMMER and Infernal, sweep the full database with them, then automate the whole process. It keeps scaling as you get more data; a sketch of the build-and-sweep loop follows.
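A minimal sketch of the build-and-sweep step, assuming HMMER3's hmmbuild and hmmsearch are on the path (the seed alignment and database filenames are placeholders):

```python
import subprocess

def build_and_sweep(seed_alignment, sequence_db, profile_out="family.hmm"):
    """Build a profile HMM from a representative seed alignment,
    then sweep the full sequence database with it (HMMER3)."""
    # hmmbuild: estimate a profile HMM from the curated seed alignment
    subprocess.check_call(["hmmbuild", profile_out, seed_alignment])
    # hmmsearch: score every database sequence against the profile;
    # --tblout writes a parseable per-sequence hit table
    subprocess.check_call(["hmmsearch", "--tblout", "hits.tbl",
                           profile_out, sequence_db])

# hypothetical inputs: a Stockholm seed alignment and a FASTA database
build_and_sweep("PF00069_seed.sto", "uniprot_sprot.fasta")
```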
Scientific publication is a 350 year old tradition of open science. The first journal with peer review appeared in 1665: scientific priority and fame in return for publication and disclosure. This quid pro quo still exists today. The intent of the system has been open data since the beginning; the tricky part now is that the material you want to be open no longer fits into the paper. In computational science especially, the paper is an advertisement, not a delivery mechanism.
Two magic tricks. We need sophisticated infrastructure, but most of the time we're exploring. For one-off data analysis, the premium is on expert biology and on tools that are as simple as possible. Trick 1: use control experiments over statistical tests; this needs trusted methods, available data and the command line. Trick 2: take small random sub-samples of large datasets. He reviews an example where this approach caught an algorithmic error in a spliced aligner. A sketch of the sub-sampling trick follows.
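A minimal sketch of trick 2, assuming a standard four-line FASTQ file (filenames and read count are placeholders); reservoir sampling gives a uniform random subsample in a single pass, without knowing the total read count in advance:

```python
import random
from itertools import islice

def subsample_fastq(in_fastq, out_fastq, n=10000, seed=42):
    """Reservoir-sample n reads from a FASTQ file in one pass."""
    random.seed(seed)
    reservoir = []
    with open(in_fastq) as handle:
        # each FASTQ record is 4 lines: header, sequence, '+', qualities
        records = iter(lambda: list(islice(handle, 4)), [])
        for i, record in enumerate(records):
            if i < n:
                reservoir.append(record)
            else:
                # replace an existing entry with decreasing probability
                j = random.randint(0, i)
                if j < n:
                    reservoir[j] = record
    with open(out_fastq, "w") as out:
        for record in reservoir:
            out.writelines(record)

# hypothetical filenames
subsample_fastq("full_run.fastq", "subsample.fastq", n=10000)
```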
The take-home for bioinformatics: data analysis needs to be part of the science. Biologists need to be fluent in computational analysis, and strong computational tools will always be in demand. A great end to a brilliant talk.
BioBlend – Enabling Pipeline Dreams
BioBlend is a Python wrapper around the Galaxy and CloudMan APIs. The goal is to enable creation of automated and scalable pipelines. For some workflows the Galaxy GUI isn't enough, because metadata needs to drive the analysis. Luckily Galaxy has a documented REST API that supports most of its functionality. To scale Galaxy out, CloudMan automates the entire process of spinning up an instance, creating an SGE cluster, and managing data and tools. Galaxy is the execution engine and CloudMan is the infrastructure manager. BioBlend has extensive documentation and lots of community contributions.
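A hedged sketch of what driving Galaxy through BioBlend looks like (the URL, API key, input file and workflow input step ID are all placeholders):

```python
from bioblend.galaxy import GalaxyInstance

# connect to a running Galaxy instance via its REST API
gi = GalaxyInstance(url="http://127.0.0.1:8080", key="your-api-key")

# list available histories and stored workflows
for history in gi.histories.get_histories():
    print(history["name"], history["id"])
workflows = gi.workflows.get_workflows()

# upload a local file into a fresh history, then drive a stored
# workflow with it; dataset_map keys are the workflow's input step IDs
history = gi.histories.create_history(name="bioblend-demo")
upload = gi.tools.upload_file("reads.fastq", history["id"])
dataset_id = upload["outputs"][0]["id"]
gi.workflows.run_workflow(workflows[0]["id"],
                          dataset_map={"0": {"id": dataset_id, "src": "hda"}},
                          history_id=history["id"])
```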
Taverna Components: Semantically annotated and shareable units of functionality
Taverna components are well-described parts that plug into a workflow. A component needs curation, documentation, and to work (and fail) in predictable ways. It hides the complexity of calling the wrapped tool or service. This is a full part of the Taverna 2.5 release, in both the workbench and the server. Components are semantically annotated to describe their inputs and outputs according to domain ontologies. They are not just nested workflows: because they obey a set of rules, you can treat them as black boxes and drill in only when needed. Components enable additional abstraction, making workflows more modular: people can work independently on components and on high-level workflows, with updates picked up as new versions. The long term goal is to treat the entire workflow as an RDF model to improve searching.
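This isn't Taverna's actual implementation; as a purely hypothetical illustration of the semantic annotation idea, here is an rdflib sketch that tags a component's input port with an EDAM ontology term (the component URIs and properties are made up for the example):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# hypothetical component vocabulary plus the EDAM domain ontology
COMP = Namespace("http://example.org/component#")
EDAM = Namespace("http://edamontology.org/")

g = Graph()
component = URIRef("http://example.org/components/blast_wrapper")
in_port = URIRef("http://example.org/components/blast_wrapper#query")

g.add((component, RDF.type, COMP.Component))
g.add((component, COMP.hasInput, in_port))
# annotate the input port with a domain term (EDAM data_0849: Sequence record)
g.add((in_port, COMP.acceptsDataType, EDAM.data_0849))
g.add((in_port, COMP.label, Literal("query sequence")))

print(g.serialize(format="turtle"))
```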
UGENE Workflow Designer – flexible control and extension of pipelines with scripts
Reproducible Quantitative Transcriptome Analysis with Oqtans
Starts off the talk with a poll from the RNA-seq blog: the most immediate needs for the community are standard bioinformatics pipelines and skilled bioinformatics specialists. oqtans (online quantitative transcriptome analysis, with code available on GitHub) drives an automated pipeline built from a vast assortment of RNA-seq data analysis tools. Some of the tools used: PALMapper for mapping, rDiff for differential expression analysis, and rQuant for alternative transcripts. oqtans is available from a public Galaxy instance and as Amazon AMIs.
MetaSee: An interactive visualization toolbox for metagenomic sample analysis and comparison
MetaSee provides an online tool for visualizing metagenomic data. It's a general visualization tool that integrates multiple input types, with displays built specifically for metagenomics showing the taxa present in a population. A nice example is the MetaSee mouth, which maps metagenomics onto a model of the human mouth. (Also: pictures of teeth without gums are scary.) Meta-Mesh is a metagenomic database and analysis system.
PhyloCommons: community storage, annotation and reuse of phylogenies
PhyloCommons provides an annotated repository of phylogenetic trees. Trees are key to biological analyses and growing in number, but they are difficult to reuse and build on: most are not archived, and even those that are tend to be images or other formats that are hard to use programmatically. PhyloCommons uses Biopython to convert trees into RDF and allows querying through the Virtuoso RDF database. Code is available on GitHub.
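A rough sketch of the conversion step (filenames are placeholders): Biopython's Bio.Phylo can write CDAO, an RDF-based phylogenetics format, when the rdflib package is installed, and the resulting triples can then be loaded into an RDF store such as Virtuoso for querying.

```python
from Bio import Phylo

# read a Newick tree (filename is a placeholder)
tree = Phylo.read("example.nwk", "newick")
print(tree.count_terminals(), "taxa")

# convert to CDAO, an RDF-based format; Biopython's CDAO writer
# requires rdflib to be installed
Phylo.convert("example.nwk", "newick", "example.cdao", "cdao")
```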