AWS genomics event: Distributed Bio, Cycle Computing talks: practical scaling

The afternoon of the AWS Genomics Event features longer, more detailed tutorials on using Amazon resources for biological tasks.

Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio

Chris and Giles work at Distributed Bio, which provides consulting services on building, well, distributed biology pipelines. Chris starts by talking about some recommended tools:

  • Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP transport, like in iRODS, provides a substantial speed improvement over standard TCP. Simple demo: 3 minutes over standard TCP versus 18 seconds with iput/iget for a ~1GB file.

  • iRODS: a catalog on top of the filesystem, which is ideal for massive data collections. Lets you easily manage organizing, sharing and storing data between machines. It’s a database of files with metadata: definitely worth investigating. Projects to interact with S3 and HDFS are in the works.

  • GlusterFS: easy to set up and use; performance equivalent to or better than NFS.

  • Queuing and scheduling: Openlava is basically open-source LSF, with an EC2-friendly scheduler.
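To put the transfer demo in perspective, here is a quick back-of-the-envelope calculation. The two timings come from the talk; taking the file size as roughly one gigabyte (1000 MB) is my assumption, since the notes only give an approximate size, so treat the throughput numbers as rough.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average throughput in MB/s for a transfer of size_mb taking `seconds`."""
    return size_mb / seconds

# Assumption: the demo file was about one gigabyte (1000 MB).
FILE_SIZE_MB = 1000
SLOW_SECONDS = 3 * 60   # the 3 minute transfer from the demo
FAST_SECONDS = 18       # the 18 second transfer from the demo

speedup = SLOW_SECONDS / FAST_SECONDS
slow_rate = throughput_mb_per_s(FILE_SIZE_MB, SLOW_SECONDS)
fast_rate = throughput_mb_per_s(FILE_SIZE_MB, FAST_SECONDS)

print(f"speedup: {speedup:.0f}x")
print(f"slow: {slow_rate:.1f} MB/s, fast: {fast_rate:.1f} MB/s")
```

The speedup ratio does not depend on the file size assumption; only the absolute MB/s figures do.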

Giles continues with 3 different bioinformatics problems: antibody/epitope docking, sequence annotation and NGS library analysis. For docking, they architected an S3 and EC2 solution that ran on 1000 cores over 3.5 hours for $350. That was 10 times faster than the local cluster, enabling new science.
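The docking numbers are easy to turn into a per-core-hour price. This is only a sketch of the arithmetic, assuming all 1000 cores ran for the full 3.5 hours and that the $350 covers compute only; neither detail was spelled out in the talk.

```python
def core_hour_cost(cores, hours, total_cost_usd):
    """Return (core_hours, usd_per_core_hour) for a run.

    Assumes every core was busy for the whole duration.
    """
    core_hours = cores * hours
    return core_hours, total_cost_usd / core_hours

# Figures quoted in the talk.
core_hours, rate = core_hour_cost(cores=1000, hours=3.5, total_cost_usd=350.0)
print(f"{core_hours:.0f} core-hours at ${rate:.2f} per core-hour")

# "10 times faster than the local cluster" implies roughly this local wall time.
local_hours = 3.5 * 10
print(f"implied local run time: about {local_hours:.0f} hours")
```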

The sequence annotation work was to annotate a large set of genes in batch. It was changing so fast that the cluster could not keep up with updates. They moved to a hybrid AWS architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k at AWS.

NGS antibody library analysis looks at the variability of specific regions in the heavy-chain V region. Uses BLAST and HMMs to find unique H3 clones. VDJFasta software available; paper. On the practical side, iRODS is used to synchronize local and AWS file systems.

Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS

Deepak starts off with a plug for AWS Education grants, which are a great way to get compute time. Matt follows with a demo to build an 8-node, 64-core cluster with monitoring. Create a cc1.4xlarge cluster instance with a 100GB EBS store attached. Allow SSH and HPC connections between nodes in the security group. Put the instance within a placement group so you can replicate 7 more of them later by creating an AMI from the first.
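The demo was done through the AWS console, but the same steps can be sketched as launch parameters. The function below only builds dictionaries shaped like the keyword arguments of boto3's EC2 `run_instances` call; it does not contact AWS. The AMI IDs, group names and the `/dev/sdf` device name are placeholders I made up, and the 8 cores per node figure is simply 64 cores divided by 8 nodes.

```python
def cluster_launch_params(ami_id, node_count,
                          placement_group="hpc-demo",      # hypothetical name
                          security_group="hpc-demo-sg",    # hypothetical name
                          ebs_size_gb=100, cores_per_node=8):
    """Build run_instances-style keyword arguments for cluster nodes.

    Returns (params, total_cores). Nothing is launched here.
    """
    params = {
        "ImageId": ami_id,
        "InstanceType": "cc1.4xlarge",
        "MinCount": node_count,
        "MaxCount": node_count,
        # Same placement group for every node, as in the demo.
        "Placement": {"GroupName": placement_group},
        # The security group is where SSH and inter-node traffic get allowed.
        "SecurityGroups": [security_group],
        # 100GB EBS volume attached to each node.
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sdf", "Ebs": {"VolumeSize": ebs_size_gb}}
        ],
    }
    return params, node_count * cores_per_node

# Step 1: one node from a base image (placeholder AMI ID).
first_node, first_cores = cluster_launch_params("ami-00000000", node_count=1)
# Step 2: seven more from an AMI created off the first node (placeholder ID).
other_nodes, other_cores = cluster_launch_params("ami-11111111", node_count=7)

print(f"total cores: {first_cores + other_cores}")
```

With real credentials, each dictionary could be passed on as `ec2.run_instances(**params)` from a boto3 EC2 client, after swapping in real image and group names.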

CycleCloud demo: Andrew Kaczorek

Andrew from Cycle Computing wraps up the day by talking about their 30,000 core cluster created using CycleCloud. Monitoring is done using Grill, a CycleServer plugin that monitors Chef jobs.

The live demo shows their interface being used to spin nodes up and down. It's a simple way to start clusters and scale them up and down; nice.

Also demos a SMRT interface to PacBio data and analyses. Provides the back end elastic compute to a PacBio instrument on Amazon.

