The Bioinformatics Open Source Conference (BOSC 2010) is taking place in Boston on Friday, July 9th and Saturday, July 10th. It focuses on open source software for biology, and is a technical conference for folks in the data trenches. These are my notes from the talks and discussions on the first morning.
Guy Coates — Clouds: all fluff and no substance
Guy is from the Sanger Institute and starts with an overview of the amazing infrastructure they have for churning out sequencing data. They've been experiencing exponential growth in both storage and compute; Moore's law is too slow to keep up with the increase in computational needs.
So a natural area to explore is on-demand Cloud Computing. Where is cloud computing on the hype cycle?
3 use cases to explore:
- Web presence
- HPC workload
- Data warehousing
As an example, consider Ensembl. The web presence has 10k unique visitors a day and 126k page views. The HPC workload is automated analysis of 51 vertebrate genomes.
The approach to improving website responsiveness had two parts. The first was improving website code and caching to avoid large page loads. The second was adding a US mirror in California, which takes about 1/3 of web traffic; this was a traditional mirror in a co-location facility. A US east mirror was then built on Amazon Web Services. Tuning was necessary for the cloud infrastructure, especially for the 1TB Ensembl database.
How does the cloud web presence compare to traditional co-location? Having no physical hardware saved on startup time and management infrastructure. Virtual machines provide free hardware upgrades, so you don't have to sweat the 3 year cycle of hardware obsolescence.
Is the cloud cost effective? You need to consider your comparison: how much would it cost to do locally, and how many times do you need to run it? Also consider the total cost of operating a server, including power, cooling and admin. Comparing to a co-location facility: $120k for hardware plus $51k per year works out to $91k per year over a 3 year lifecycle. Amazon is $77k/year, so it is cost effective.
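The annualized comparison above works out as simple arithmetic (figures are from the talk; the breakdown is a sketch):

```python
# Annualized cost of a co-location facility vs. Amazon (talk figures).
hardware = 120_000           # one-off hardware purchase ($)
recurring_per_year = 51_000  # power, cooling, admin, hosting ($/year)
lifecycle_years = 3          # typical hardware refresh cycle

colo_per_year = hardware / lifecycle_years + recurring_per_year
amazon_per_year = 77_000     # quoted Amazon cost ($/year)

print(f"co-location: ${colo_per_year:,.0f}/year")   # $91,000/year
print(f"amazon:      ${amazon_per_year:,.0f}/year")  # $77,000/year
```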
Additional benefits of the cloud: the website and data are packaged together, so there is a ready-to-go Amazon image with everything, including the data in an Amazon public dataset. An open question is how scale-out on Amazon will compare to experience at Ensembl and co-location facilities.
Some lessons learned in building the mirror: it took more time than expected to move code outside of Ensembl. Overall they are happy with Amazon and plan to consider Amazon's far-east servers. Virtual servers can also be useful for other Sanger services. In terms of the hype cycle, we are in the plateau of boring ol' usefulness, which is good.
The second use case for Amazon is Ensembl's compute pipeline for gene calling and annotation. The code base is object-oriented Perl running core algorithms in C, calling out to over 200 external binaries. The workflow is embarrassingly parallel and IO heavy; it takes ~500 CPU days for a moderate sized genome and needs a high performance file system underneath. They wanted to explore how difficult it would be to move some of this to the cloud to cope with increases in data. The other important goal is democratizing the pipeline so others can use it via ready-to-go AMIs on Amazon.
It did not end up working well. Porting the queuing system to Amazon was difficult; LSF licensing and fiddling got in the way of using LSF/SGE. Moving data to the cloud was very difficult; if you look at most cloud success stories, they are not big data applications. Transfer speeds across the network are too slow, and it is difficult to get a handle on what exactly the bottlenecks are. For physics they develop dedicated pipelines to deal with this problem, but biology collaborations are not conducive to this. Within the cloud there are no global filesystems, since NFS is not so hot and EC2 inter-node networking is not great.
Why not S3/Hadoop/MapReduce? A lot of code expects files on a filesystem, not S3, so many existing applications would need to be rewritten. How do you manage both Hadoop and standard apps? The barrier to entry is also higher for biologists. So for HPC, the cloud is currently still in the trough of disillusionment on the hype cycle.
The problem with current data archives is that they are centralized in a location where you can put/get the data, but not compute on it. Big question: is data in an archive like this really useful? Example use case: 100TB of data in the short read archive; the estimate is 3 months to pull down the data and get it ready. Can the cloud help? It sounds good to move the CPU to the data, but how do you expose the data, and how do you organize it to make it compute efficient? In terms of funding, whose cloud do we use? Most resources are funded to provide data but not compute, which implies a commercial solution like Amazon makes sense. This does solve networking problems, but would need investment in high speed links at Amazon. In terms of the hype cycle, computable archives are still at the peak of inflated expectations: we don't know how this will turn out in practice.
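The 3-month estimate is easy to sanity-check with back-of-the-envelope arithmetic (the 100 Mbit/s sustained rate below is my illustrative assumption, not a figure from the talk):

```python
# How long does it take to pull a dataset out of a central archive
# at a given sustained network rate?
def transfer_days(terabytes, sustained_mbit_per_s):
    bits = terabytes * 1e12 * 8                   # total bits to move
    seconds = bits / (sustained_mbit_per_s * 1e6)  # at the sustained rate
    return seconds / 86_400                        # seconds per day

# 100 TB at a sustained 100 Mbit/s: roughly 93 days, i.e. ~3 months.
print(f"{transfer_days(100, 100):.0f} days")
```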
Ron Taylor — Overview of Hadoop/MapReduce/HBase
Ron is presenting a general overview of Hadoop and MapReduce to frame the afternoon talks. Hadoop is a Java software framework designed to handle very large datasets; it simplifies the development of large-scale, fault-tolerant distributed apps on clusters of commodity machines. Data is replicated, and the only single point of failure is the head node.
MapReduce divides program execution into a map and reduce step, separated by data transfer between nodes. It’s a functional approach that aggregates/reduces data based on key/value pairs. Can fit a large number of tasks into this framework.
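The contract can be illustrated with a tiny in-memory sketch, using word counting as the canonical example: map emits key/value pairs, the framework groups them by key, and reduce aggregates each group.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map step: emit a (key, value) pair per word."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce step: aggregate all values seen for one key."""
    return (key, sum(values))

lines = ["the cat", "the dog"]
# Sorting stands in for the shuffle phase, which groups pairs by key
# as data is transferred between nodes.
pairs = sorted(kv for line in lines for kv in map_fn(line))
result = dict(reduce_fn(k, [v for _, v in group])
              for k, group in groupby(pairs, key=itemgetter(0)))
print(result)  # {'cat': 1, 'dog': 1, 'the': 2}
```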
HBase adds random real-time read/write access to data stored in a distributed, column-oriented database. It is used as input and output for MapReduce jobs. Data is stored in tables like a relational database, and the data at each row and column is versioned. Flexible modification of columns allows changing the data model on the fly.
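A toy model of that data layout, assuming nothing about HBase's actual API, is a nested map of row key to column to timestamped versions:

```python
from collections import defaultdict

# Toy model of HBase's layout: row key -> column -> {timestamp: value}.
# Columns can be added on the fly, and every cell keeps its history.
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, value, ts):
    table[row][column][ts] = value

def get(row, column):
    versions = table[row][column]
    return versions[max(versions)]  # the latest version wins

put("genome1:chr1:1000", "calls:genotype", "A/T", ts=1)
put("genome1:chr1:1000", "calls:genotype", "A/A", ts=2)  # newer version
print(get("genome1:chr1:1000", "calls:genotype"))  # A/A
```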
Pig provides a high level data language that is designed for batch processing of data within Hadoop.
Hive is a data warehouse infrastructure built on top of Hadoop that models a relational database with rows/columns and an SQL-like query language.
Cascading is a java library that sits on top of Hadoop MapReduce that operates at a higher level.
Mahout builds scalable machine learning libraries on top of Hadoop.
Amazon EC2 provides Hadoop as a first class service. One example of a bioinformatics project running there is Crossbow, which we'll hear about later.
Matt Hanna — The Genome Analysis Toolkit (GATK)
Matt will discuss the Broad's toolkit for next-gen sequencing analysis. The idea is that dataset size greatly increases the complexity of analysis; how can the common problems associated with big datasets be abstracted away?
There are multiple ways to apply MapReduce to a next-gen sequencing pipeline; GATK provides a traversal infrastructure to simplify this. Data is sharded into small chunks that can be processed independently. This data is then streamed based on groupings, for instance by gene loci, and can be processed either serially or in parallel.
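The sharding idea can be sketched as splitting each reference contig into fixed-size intervals that workers process independently (the interval layout here is hypothetical, not GATK's actual scheme):

```python
# Shard a reference genome into fixed-size chunks for independent
# processing. Each shard is a (contig, start, end) half-open interval.
def shard(contig_lengths, chunk=1_000_000):
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk):
            yield (contig, start, min(start + chunk, length))

shards = list(shard({"chr1": 2_500_000, "chrM": 16_571}))
print(shards[0])   # ('chr1', 0, 1000000)
print(len(shards))  # 4: three chr1 chunks plus one chrM chunk
```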
At a high level an end user will write a walker that deals with the underlying libraries. Standard traversals are available to use, like traverse by loci.
An example is a simple Bayesian genotyper that calls bases at each position in a reference genome based on reads: SNP calling. A walker needs to be written that specifies the data access pattern and command-line arguments. The second step is to write a reduce function that filters pileups on demand; these are then output based on the filters. Throwing processors at a job improves performance nearly linearly, but very few changes need to happen in the actual code. It works best on CPU-bound, not IO-bound, processes.
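A Python sketch of the walker pattern (GATK walkers are written in Java, and the real genotyper is Bayesian; the naive majority-vote caller below just illustrates the map/reduce access pattern):

```python
from collections import Counter

def call_base(ref, pileup, min_depth=4):
    """Map step: naive call at one locus from its pileup of read bases."""
    if len(pileup) < min_depth:
        return None
    base, _count = Counter(pileup).most_common(1)[0]
    return base if base != ref else None  # report only non-reference calls

def reduce_calls(loci):
    """Reduce step: keep only loci where a variant was called."""
    return [(pos, call) for pos, call in loci if call is not None]

# position -> (reference base, pileup of read bases at that position)
pileups = {100: ("A", "TTTAT"), 101: ("C", "CCCCC"), 102: ("G", "GG")}
calls = reduce_calls((pos, call_base(ref, list(bases)))
                     for pos, (ref, bases) in pileups.items())
print(calls)  # [(100, 'T')]
```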
Brian O’Connor — SeqWare Query Engine
SeqWare is designed to standardize the process of making biological queries on data. It is made up of a REST interface and an HBase backend; analysis is done with the SeqWare pipeline. The web service is accessed with a REST XML client API, or can be presented as HTML forms.
The back end needed to support a rich level of annotation on objects, support large variant databases, be distributed across a cluster, support querying, and scale to crazy data growth.
HBase focuses on random access to data in column-oriented tables. Its flexibility allows storage of arbitrary data types, with keys made up of munged-together string data. Data access is through both the HBase API and custom MapReduce queries. Since multiple genomes are stored in the same table, it is easy to run MapReduce jobs across them. Performance compares favorably to Berkeley DB and is really good for retrieval. Katta (distributed Lucene) might help with queries.
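A sketch of what a "munged together" string row key can look like, with fixed-width fields so that lexicographic key order matches genomic order (the field layout here is my guess, not SeqWare's actual schema):

```python
# Build a composite string row key from genome, contig and position.
# Zero-padding the position means a lexicographic scan over keys
# visits variants in genomic order.
def variant_key(genome, contig, position):
    return f"{genome}.{contig}.{position:012d}"

k1 = variant_key("hg18", "chr1", 500)
k2 = variant_key("hg18", "chr1", 40_000)
print(k1)       # hg18.chr1.000000000500
print(k1 < k2)  # True: padding preserves positional order in a scan
```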
Judy Qiu — Cloud Technologies Applications
Judy starts off talking about the data deluge and how it can be addressed with cloud technologies, looking at public health data, PubChem and sequence alignments. The goal is to evaluate whether the cloud can hide the complexity of dealing with infrastructure. Microsoft's DryadLINQ is comparable to Hadoop for parallelization, and in tests using CAP3 assembly, Amazon costs are similar to local cluster costs for data processing.
Twister is an open source iterated MapReduce infrastructure for simplifying data mining analysis tasks.
Ben Langmead — Cloud scale genomics: examples and lessons
Crossbow is cloud-enabled software for genotyping. It aligns reads using Bowtie, aggregates the alignments into pileups, and then calls SNPs. It is parallelized by read; genomic bins drive the reduction step using Hadoop.
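The Crossbow dataflow can be sketched as a map step that aligns each read and keys it by genomic bin, and a reduce step that sees all alignments for a bin and calls SNPs. The `align` and `call_snps` functions below are stubs standing in for Bowtie and the SNP caller; this shows the partitioning, not Crossbow's code.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # genomic bin width for the reduce step

def map_read(read, align):
    """Map step: align one read and key it by (contig, bin)."""
    contig, pos = align(read)  # Bowtie's role in the real pipeline
    return ((contig, pos // BIN_SIZE), (pos, read))

def run(reads, align, call_snps):
    """Map all reads, shuffle into bins, then reduce each bin."""
    bins = defaultdict(list)
    for read in reads:  # map + shuffle
        key, value = map_read(read, align)
        bins[key].append(value)
    return {key: call_snps(values) for key, values in bins.items()}

# Demo with a lookup-table "aligner" and a caller that just counts
# alignments per bin.
fake_pos = {"ACGT": ("chr1", 150_000), "TTAA": ("chr1", 150_010)}
result = run(fake_pos, fake_pos.get, len)
print(result)  # {('chr1', 1): 2}
```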
Myrna is a cloud approach for looking at differential expression using sequencing of transcriptional data. Reads are aligned in parallel, then aggregated into genomic bins. Bins are normalized based on total reads, and the count data are re-aggregated into a set of p-values for expression.
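The normalization step can be sketched as scaling each bin's count by its sample's total reads, so expression is comparable across samples (illustrative scaling only; Myrna's actual statistics are richer than this):

```python
# Scale per-bin read counts by each sample's total read count, so
# samples sequenced to different depths become comparable.
def normalize(counts_by_sample):
    out = {}
    for sample, bin_counts in counts_by_sample.items():
        total = sum(bin_counts.values())
        out[sample] = {b: c / total for b, c in bin_counts.items()}
    return out

counts = {"tumor":  {"binA": 30, "binB": 70},
          "normal": {"binA": 10, "binB": 90}}
normalized = normalize(counts)
print(normalized["tumor"]["binA"])  # 0.3
```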
Architecture of Crossbow: a cloud driver script prepares the pipeline, a wrapper runs Bowtie, a wrapper runs SOAPsnp, and then a postprocessing step finishes up; Hadoop ties the parts together. To run this on non-cloud architecture, you can write a Hadoop driver. The third mode is non-Hadoop, implemented in Perl.
Enis Afgan — Deploying Galaxy on the cloud
Enis describes his work deploying Galaxy on the cloud. My coverage of this is not going to be great since I heard this talk earlier at the Galaxy developers' conference; see the previous summary.