Bioinformatics Open Source Conference 2013, day 2 morning: Sean Eddy and Software Interoperability

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 morning session focused on software interoperability.

Biological sequence analysis in the post-data era

Sean Eddy

Sean starts off with a self-described embarrassing personal history about how he developed his scientific direction. Biology background: multiple self-splicing introns in a bacteriophage, unexpected in a highly streamlined genome. The introns are self-removing transposable elements that are difficult for the organism to remove from the genome. They have no sequence conservation, only structural conservation, and there were no tools to detect this. Sean was an experimental biologist and used this as a motivating problem to search for an algorithmic/programming solution. He was not able to accomplish this straight off, until he learned about HMM approaches. He reimplemented HMMs and re-invented stochastic context-free grammars to model the structure as a tree. The embarrassing part was that his postdoc lab work on GFP was not going well and got scooped, so he wrote a postdoc grant update to switch to computational biology. This switch led to HMMER, Infernal and the book Biological Sequence Analysis.

From this: the general advice to not do incremental engineering is wrong. A lot of great work came from incremental engineering: automobiles, sequence analysis (Smith-Waterman -> BLAST -> PSI-BLAST -> HMMER). Engineering is a valuable part of science. It requires insane dedication to a single problem. The truth: science rewards you for how much impact you have, not how many papers you write. Arbitrage approach to science: take ideas and tools and make them usable for biologists who need them. This is not traditionally valued, but it is useful, so you can carve out a niche.

The general approach in Pfam helps tame the exponential growth of sequences. The strategy is to use representative seed alignments, sweep the full database, use scalable models in HMMER and Infernal, then automate. This scales as you get more data.

Scientific publication is a 350-year-old tradition of open science. First journal with peer review in 1665: scientific priority and fame in return for publication and disclosure. This quid pro quo still exists today. The intent of the system has been open data since the beginning, but the tricky part now is that the part you want to be open does not fit into the paper. Specifically in computational science, the paper is an advertisement, not a delivery mechanism.

Two magic tricks. We need sophisticated infrastructure, but most of the time we’re exploring. For one-off data analysis, the premium is on expert biology and tools that are as simple as possible. Trick 1: use control experiments over statistical tests. Things you need: trusted methods, data availability, command line. Trick 2: take small random sub-samples of large datasets. Sean reviewed an example that used this approach to catch an algorithmic error in a spliced aligner.
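As a rough illustration of trick 2 (my own sketch, not code from the talk), here is one way to pull a small uniform random sub-sample of reads out of a large FASTQ file in a single pass, using reservoir sampling. The function names and the assumption of plain 4-line FASTQ records are mine.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

# One FASTQ record: header, sequence, separator, quality (assumes 4-line records).
FastqRecord = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield complete 4-line FASTQ records; a trailing partial record is dropped."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if len(record) < 4:
            return
        yield record


def reservoir_sample(records: Iterable[FastqRecord], k: int,
                     seed: Optional[int] = None) -> List[FastqRecord]:
    """Return k records chosen uniformly at random, reading the input once."""
    rng = random.Random(seed)
    sample: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large FASTQ file.
    fake_lines = []
    for n in range(1000):
        fake_lines += ["@read%d\n" % n, "ACGT\n", "+\n", "IIII\n"]
    subset = reservoir_sample(read_fastq(fake_lines), k=5, seed=42)
    print(len(subset), "reads sampled")
```

Because the file is streamed, memory use depends only on the sample size, so the same sketch should work on inputs that do not fit in memory. Fixing the seed makes the sub-sample reproducible.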

Bioinformatics: data analysis needs to be part of the science. Biologists need to be fluent in computational analysis and strong computational tools will always be in demand. Great end to a brilliant talk.

Software Interoperability

BioBlend – Enabling Pipeline Dreams

Enis Afgan

BioBlend is a Python wrapper around the Galaxy and CloudMan APIs. The goal is to enable creation of automated and scalable pipelines. For some workflows the Galaxy GUI workflow isn’t enough because we need metadata to drive the analysis. Luckily Galaxy has a documented REST API that supports most of the functionality. To support scaling out Galaxy, CloudMan automates the entire process of spinning up an instance, creating an SGE cluster and managing data and tools. Galaxy is an execution engine and CloudMan is the infrastructure manager. BioBlend has extensive documentation and lots of community contributions.
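To make the "metadata drives the analysis" idea concrete, here is a minimal sketch of my own, not code shown in the talk. It assumes a recent BioBlend release (the calls used are `GalaxyInstance`, `histories.create_history`, `tools.upload_file` and `workflows.invoke_workflow`). The sample table, the `assay` field, the workflow IDs and the input step index `"0"` are all hypothetical placeholders.

```python
from typing import Dict, List


def plan_runs(samples: List[Dict[str, str]],
              workflows: Dict[str, str]) -> List[Dict[str, str]]:
    """Choose a workflow for each sample from its (hypothetical) 'assay' field.

    Samples whose assay has no registered workflow are skipped.
    """
    plan = []
    for sample in samples:
        workflow_id = workflows.get(sample["assay"])
        if workflow_id is None:
            continue
        plan.append({"name": sample["name"],
                     "path": sample["path"],
                     "workflow_id": workflow_id})
    return plan


def run_plan(galaxy_url: str, api_key: str, plan: List[Dict[str, str]]) -> list:
    """Upload each sample into its own history and invoke the chosen workflow."""
    # Imported here so the planning step can be used without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=galaxy_url, key=api_key)
    invocations = []
    for run in plan:
        history = gi.histories.create_history(name=run["name"])
        upload = gi.tools.upload_file(run["path"], history["id"])
        dataset_id = upload["outputs"][0]["id"]
        # Assumption: the workflow takes a single dataset at input step "0".
        inputs = {"0": {"id": dataset_id, "src": "hda"}}
        invocations.append(
            gi.workflows.invoke_workflow(run["workflow_id"],
                                         inputs=inputs,
                                         history_id=history["id"]))
    return invocations


if __name__ == "__main__":
    samples = [{"name": "s1", "path": "s1.fastq", "assay": "rnaseq"}]
    print(plan_runs(samples, {"rnaseq": "hypothetical-workflow-id"}))
```

Keeping the planning step separate from the Galaxy calls means the metadata logic can be checked without a running Galaxy server.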

Taverna Components: Semantically annotated and shareable units of functionality

Donal Fellows

Taverna components are well described parts that plug into a workflow. They need curation and documentation, and must work (and fail) in predictable ways. The component hides the complexity of calling the wrapped tool or service. This is a full part of the Taverna 2.5 release: both workbench and server. Components are semantically annotated to describe inputs/outputs according to domain ontologies. Components are not just nested workflows, since they obey a set of rules, so you can treat them as a black box and drill in only if needed. Components enable additional abstraction, allowing workflows to be more modular: people can work separately on individual components and on high level workflows, with updates for new versions. The long term goal is to treat the entire workflow as an RDF model to improve searching.

UGENE Workflow Designer – flexible control and extension of pipelines with scripts

Yuriy Vaskin

UGENE focuses on integration of biological tools using a graphical interface. It has a workflow designer like Galaxy and Taverna and runs on local machines. It also offers a Python API for scripting through UGENE. Nice example code feeding Biopython inputs into the API natively.

Reproducible Quantitative Transcriptome Analysis with Oqtans

Vipin Sreedharan

Starts off the talk with a poll from the RNA-seq blog. The most immediate needs for the community are standard bioinformatics pipelines and skilled bioinformatics specialists. oqtans is online quantitative transcriptome analysis, with code available on GitHub. It drives an automated pipeline with a vast assortment of RNA-seq data analysis tools. Some useful tools used: PALMapper for mapping, rDiff for differential expression analysis, rQuant for alternative transcripts. oqtans is available from a public Galaxy instance and as Amazon AMIs.

MetaSee: An interactive visualization toolbox for metagenomic sample analysis and comparison

Xiaoquan Su

MetaSee provides an online tool for visualizing metagenomic data. It’s a general visualization tool and integrates multiple input types. It has nice tools specifically for metagenomics to display taxa in a population. They have a nice MetaSee mouth example which maps metagenomics of the mouth. Also, pictures of teeth are scary without gums. Meta-Mesh is a metagenomic database and analysis system.

PhyloCommons: community storage, annotation and reuse of phylogenies

Hilmar Lapp

PhyloCommons provides an annotated repository of phylogenetic trees. Trees are key to biological analyses and increasing in number, but difficult to reuse and build on. Most are not archived, and even when they are, they are stored as images or in other formats that are hard to use automatically. It uses Biopython to convert trees into RDF and allows queries through the Virtuoso RDF database. Code is available on GitHub.

GEMBASSY: an EMBOSS associated package for genome analysis using G-language SOAP/REST web services

Hidetoshi Itaya

GEMBASSY provides an EMBOSS package that integrates with the G-language using a web service. This gives you command line access through EMBOSS to a wide variety of visualization and analysis tools. Nice integration examples show it working directly in a command line workflow.

Rubra – flexible distributed pipelines for bioinformatics

Clare Sloggett

Rubra provides flexible distributed pipelines for bioinformatics, built on top of Ruffus. It was used to build a variant calling pipeline based on bwa, GATK and ENSEMBL.

