AWS genomics event: Deepak Singh, Allen Day; practical pipeline scaling

High Performance Computing in the Cloud: Deepak Singh, AWS

After the coffee break, Deepak starts off the late-morning talks at the AWS genomics event. He begins by discussing the types of jobs in biology: batch jobs and data-intensive computing. Four things go into running these workloads:

  • infrastructure — instances on AWS that you access through an API. On EC2, this infrastructure is elastic and programmable. Amazon's cluster compute instances make it easier to connect this infrastructure into something you can run distributed work on; scaling on these clusters works well for the MPI jobs common in computational chemistry and physics. GPU instances are another choice if your code distributes onto GPUs.

  • provisioning and management — lots of choices here: Ruby scripts, Amazon CloudFormation, Chef, Puppet. Also standard cluster options: Condor, SGE, LSF, Torque, Rocks+. MIT's StarCluster was shown with awesome demos of making cluster setup less tricky. Cycle Computing leverages these tools to build massive clusters and monitoring tools.

  • applications — Galaxy CloudMan, CloudBioLinux, and MapReduce for genomics: Contrail, DNAnexus, SeqCentral, Nimbus Bioinformatics.

  • people — the most valuable resource, and one you can maximize by removing constraints. Having access to effectively unlimited instances to leverage when needed is a big advantage.
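The elasticity in the first and last bullets can be made concrete with a small sketch. This is purely illustrative and not from the talk: a policy function that sizes a batch request against the pending work, with an optional budget cap standing in for the constraints the talk suggests removing. The function name and job counts are invented for the example.

```python
# Hypothetical sizing policy for an elastic batch cluster. With
# programmable infrastructure, the natural policy is "enough instances
# to run everything at once", capped only by an optional budget limit.

def instances_needed(pending_jobs, jobs_per_instance, max_instances=None):
    """Return how many instances to provision for a batch of jobs."""
    if pending_jobs <= 0:
        return 0
    # Round up: 10 jobs at 4 jobs/instance still needs 3 instances.
    needed = -(-pending_jobs // jobs_per_instance)
    if max_instances is not None:
        # A budget cap reintroduces a constraint; removing it lets the
        # run finish in one wave instead of queueing people's work.
        needed = min(needed, max_instances)
    return needed

print(instances_needed(10, 4))                       # 3
print(instances_needed(1000, 8))                     # 125
print(instances_needed(1000, 8, max_instances=100))  # 100
```

The point of the sketch is the contrast with a fixed cluster: on elastic infrastructure the cap is a choice, not a physical limit.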

Elastic Analysis Pipelines: Allen Day, Ion Flux

Allen from Ion Flux talks about a system for processing Ion Torrent data: a production-oriented system that moves from initial data to final results behind a well-packaged front end. They decided to build on the cloud to serve smaller labs without existing infrastructure, plus gain all the benefits of scale. It is an end-to-end solution from the Torrent machine to results.

The pipeline pulls data into S3, aligns it, performs realignments, and produces variant calls. The workflow uses Cascading to describe Hadoop jobs, which talk to HBase to store results. On the LIMS side, messaging queues pass analysis requests to the workflow side. Jenkins is used for continuous integration.
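The LIMS-to-workflow handoff can be sketched as a queue-driven dispatcher. This is a minimal stand-in, not Ion Flux's code: `queue.Queue` plays the role of the real messaging service, and the stage names are invented from the pipeline steps described above.

```python
# Illustrative sketch of a queue-driven pipeline: the LIMS side enqueues
# an analysis request, and a workflow worker drains the queue, running
# one stage per message and re-enqueueing the next stage.
# queue.Queue stands in for a real messaging service.
import queue

PIPELINE = ["pull_to_s3", "align", "realign", "call_variants"]

def lims_submit(q, run_id):
    """LIMS side: announce that a sequencing run is ready for analysis."""
    q.put({"run_id": run_id, "stage": 0})

def worker_step(q, log):
    """Workflow side: take one message and advance the run by one stage."""
    msg = q.get()
    log.append((msg["run_id"], PIPELINE[msg["stage"]]))
    if msg["stage"] + 1 < len(PIPELINE):
        q.put({"run_id": msg["run_id"], "stage": msg["stage"] + 1})

q, log = queue.Queue(), []
lims_submit(q, "run-001")
while not q.empty():
    worker_step(q, log)
print(log)
# [('run-001', 'pull_to_s3'), ('run-001', 'align'),
#  ('run-001', 'realign'), ('run-001', 'call_variants')]
```

Decoupling the two sides through a queue is what lets the workflow half scale independently of the LIMS half.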

What does the Hadoop part do? Distributed sorting of BAM files, co-grouping, and detection of outliers. The idea is to distribute filtering work to prioritize variants or regions of interest. The data is self-describing, which allows restarting or recovering at arbitrary points; Cascading enables this serialization by defining schemes for BAM, VCF, and FASTQ files.
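The co-grouping and outlier-filtering steps can be shown in miniature. The sketch below is a pure-Python stand-in for what a Cascading CoGroup pipe would do at scale: group records from two sources (here, per-position alignment depths and candidate variant calls) on a shared (chromosome, position) key, then flag calls supported by outlier-low depth. The data and the depth threshold are invented for illustration.

```python
# Pure-Python sketch of MapReduce-style co-grouping on a shared key,
# followed by a reduce-side outlier filter. In the real pipeline this
# would run as a distributed Cascading/Hadoop job over BAM and VCF data.
from collections import defaultdict

# (chrom, pos, depth) and (chrom, pos, call) records; values are made up.
alignments = [("chr1", 100, 52), ("chr1", 200, 7), ("chr2", 50, 40)]
variants   = [("chr1", 100, "A>G"), ("chr1", 200, "C>T")]

def cogroup(left, right):
    """Group both record streams under their (chrom, pos) key."""
    groups = defaultdict(lambda: ([], []))
    for chrom, pos, depth in left:
        groups[(chrom, pos)][0].append(depth)
    for chrom, pos, call in right:
        groups[(chrom, pos)][1].append(call)
    return groups

def low_depth_calls(groups, min_depth=10):
    """Reduce side: keep variant calls whose supporting depth is an outlier."""
    flagged = []
    for key, (depths, calls) in sorted(groups.items()):
        if calls and depths and min(depths) < min_depth:
            flagged.extend((key, c) for c in calls)
    return flagged

print(low_depth_calls(cogroup(alignments, variants)))
# [(('chr1', 200), 'C>T')]
```

Because each group is independent once keyed, the filtering work distributes naturally across reducers, which is the prioritization idea the talk describes.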

How did they work on scaling problems? An index server pre-computes results and feeds them into the MapReduce cluster, but the index server became a bottleneck. They improved this by moving the index into a BitTorrent swarm that serves it to the MapReduce nodes. The continuous integration system does the work of creating index files as EBS snapshots; the BitTorrent swarm then serves these snapshots.
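A toy calculation shows why the swarm relieves the bottleneck. This is my illustration, not numbers from the talk: with a single origin, every node fetches every index chunk from one server; in a swarm, once any node holds a chunk it can re-serve it to peers, so the origin only needs to upload each chunk roughly once.

```python
# Toy load model for a single index server vs. a swarm. Chunk and node
# counts are illustrative, not from the Ion Flux deployment.

def origin_only_load(n_nodes, n_chunks):
    """Requests hitting the index server when it serves everyone directly."""
    return n_nodes * n_chunks

def swarm_origin_load(n_chunks):
    """Requests hitting the origin when peers re-serve chunks to each
    other: ideally the origin uploads each chunk only once."""
    return n_chunks

print(origin_only_load(100, 50))  # 5000 requests to the origin
print(swarm_origin_load(50))      # 50 requests to the origin
```

The swarm turns origin load from (nodes x chunks) into roughly (chunks), which is why pairing it with CI-built EBS snapshots scales with cluster size.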

