AWS genomics event: Distributed Bio, Cycle Computing talks: practical scaling

The afternoon of the AWS Genomics Event features longer detailed tutorials on using Amazon resources for biological tasks.

Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio

Chris and Giles work at Distributed Bio, which provides consulting services on building, well, distributed biology pipelines. Chris starts by talking about some recommended tools:

  • Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP-based transport, as used in iRODS, provides a substantial speed improvement over standard TCP. A simple demo moving a ~1GB file: 3 minutes versus 18 seconds with iRODS's iput/iget.

  • iRODS: a metadata catalog on top of the filesystem, ideal for massive data collections. It lets you easily manage organizing, sharing, and storing data between machines. It's a database of files with metadata: definitely worth investigating (a minimal usage sketch follows this list). Projects to interact with S3 and HDFS are in the works.

  • GlusterFS: easy to set up and use; performance equivalent to or better than NFS.

  • Queuing and scheduling: OpenLava is essentially open-source LSF, with an EC2-friendly scheduler.
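Since iRODS comes up repeatedly, here is a minimal sketch of the catalog-plus-metadata idea using the python-irodsclient package. This was not part of the talk, and the host, zone, paths, and metadata values are placeholders.

```python
# Minimal sketch of iRODS as a "database of files with metadata" via
# python-irodsclient. Connection details and names are placeholders.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret", zone="tempZone") as session:
    # Push a local file into the iRODS catalog (analogous to iput).
    session.data_objects.put("reads.fastq", "/tempZone/home/alice/reads.fastq")

    # Attach searchable metadata to the data object.
    obj = session.data_objects.get("/tempZone/home/alice/reads.fastq")
    obj.metadata.add("sample", "example-sample")
    obj.metadata.add("instrument", "example-sequencer")

    # Pull it back down elsewhere (analogous to iget).
    session.data_objects.get("/tempZone/home/alice/reads.fastq", "local_copy.fastq")
```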

Giles continues with three different bioinformatics problems: antibody/epitope docking, sequence annotation, and NGS library analysis. For docking, they architected an S3/EC2 solution that ran 1000 cores over 3.5 hours for $350: 10 times faster than the local cluster, enabling new science.

The sequence annotation work was to annotate a large set of genes in batch. The data was changing so fast that the local cluster could not keep up with updates, so they moved to a hybrid architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k on AWS.

NGS antibody library analysis looks at the variability of specific regions in the heavy-chain V region, using BLAST and HMM models to find unique H3 clones. The VDJFasta software is available, along with a paper. On the practical side, iRODS is used to synchronize the local and AWS file systems.

Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS

Deepak starts off with a plug for AWS Education grants, which are a great way to get grants for compute time. Matt follows with a demo building an 8-node, 64-core cluster with monitoring: create a cc1.4xlarge cluster compute instance with a 100GB EBS volume attached, allow SSH and HPC connections between nodes in the security group, and put the instance in a placement group so you can replicate 7 more of them later by creating an AMI from the first.
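The demo was done through the AWS console; as a rough equivalent in code, the same steps might look like this with the boto3 SDK. The AMI ID, key pair, and device name are placeholders.

```python
# Rough sketch of the demo's steps with boto3 (the talk used the AWS console);
# AMI ID, key pair, region, and device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placement group keeps cluster compute instances on a low-latency network segment.
ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

# Security group allowing SSH in, plus all traffic between cluster members.
sg = ec2.create_security_group(GroupName="hpc-demo-sg",
                               Description="SSH plus intra-cluster traffic")
sg_id = sg["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "-1",  # all traffic between nodes in the same group
         "UserIdGroupPairs": [{"GroupId": sg_id}]},
    ],
)

# First node: a cluster compute instance with a 100GB EBS volume attached.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",            # placeholder HPC-ready AMI
    InstanceType="cc1.4xlarge",
    KeyName="my-key",
    MinCount=1, MaxCount=1,
    Placement={"GroupName": "hpc-demo"},
    SecurityGroupIds=[sg_id],
    BlockDeviceMappings=[{"DeviceName": "/dev/sdf",
                          "Ebs": {"VolumeSize": 100}}],
)
# After configuring this node, create an AMI from it and launch 7 more copies
# into the same placement group to reach 8 nodes / 64 cores.
```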

CycleCloud demo: Andrew Kaczorek

Andrew from Cycle Computing wraps up the day by talking about their 30,000-node cluster created using CycleCloud. Monitoring is done with Grill, a CycleServer plugin that monitors Chef jobs.

The live demo shows using their interface to spin nodes up and down: a simple way to start clusters and scale them in either direction; nice.

He also demos a SMRT interface to PacBio data and analyses, which provides the elastic compute back end on Amazon for a PacBio instrument.


AWS genomics event: Deepak Singh, Allen Day; practical pipeline scaling

High Performance Computing in the Cloud: Deepak Singh, AWS

After the coffee break, Deepak starts off the late morning talks at the AWS genomics event. He begins by discussing the types of jobs in biology: batch jobs and data-intensive computing. Four things go into this:

  • infrastructure: instances on AWS that you access using an API. On EC2, this infrastructure is elastic and programmable. Amazon has cluster compute instances that make it easier to connect this infrastructure into something you can do distributed work on. Scaling on Amazon cluster compute instances works well for the MPI jobs common in computational chemistry and physics. Another choice is GPU instances, if your code distributes to them.

  • provision and manage: lots of choices here: Ruby scripts, Amazon CloudFormation, Chef, Puppet (a minimal CloudFormation sketch follows this list). Also standard cluster options: Condor, SGE, LSF, Torque, Rocks+. MIT StarCluster has awesome demos of making clusters less tricky. Cycle Computing leverages these tools to build massive clusters and monitoring tools.

  • applications: Galaxy CloudMan, CloudBioLinux, and MapReduce for genomics (Contrail, DNAnexus, SeqCentral, Nimbus Bioinformatics).

  • people: the most valuable resource, which you can maximize by removing constraints. Having access to effectively unlimited instances to leverage when needed is a big advantage.
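None of these provisioning tools were demoed in this talk, but as a toy illustration of driving CloudFormation from code, here is a hedged sketch with boto3; the one-instance template, stack name, and AMI are invented for the example.

```python
# Toy illustration of provisioning with CloudFormation from code: define a
# single-instance template and launch it as a stack with boto3. All names
# and the AMI are placeholders.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WorkerInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-xxxxxxxx",   # placeholder AMI
                "InstanceType": "c5.large",
            },
        }
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(StackName="genomics-worker-demo",
                 TemplateBody=json.dumps(template))

# CloudFormation then owns the lifecycle: delete_stack tears everything down.
# cfn.delete_stack(StackName="genomics-worker-demo")
```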

Elastic Analysis Pipelines: Allen Day, Ion Flux

Allen from Ion Flux talks about a system for processing Ion Torrent data: a production-oriented system that moves from initial data to final results behind a well-packaged front end. They decided to work in the cloud to better serve smaller labs without existing infrastructure, plus get all the benefits of scale. It's an end-to-end solution from the Torrent machine to results.

The pipeline pulls data into S3, aligns it, does realignments, and produces variant calls. The workflow uses Cascading to describe Hadoop jobs, which talk to HBase to store results. On the LIMS side, messaging queues pass analysis requests over to the workflow side. Jenkins is used for continuous integration.
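The talk didn't include code for the LIMS-to-workflow messaging; purely as a sketch of that pattern, here is how an analysis request could be passed through Amazon SQS with boto3. The queue name and message fields are invented.

```python
# Generic sketch of passing an analysis request from a LIMS to a workflow
# system via Amazon SQS (not Ion Flux's actual implementation; names invented).
import json
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.create_queue(QueueName="analysis-requests")

# LIMS side: enqueue a request describing the run to analyze.
queue.send_message(MessageBody=json.dumps({
    "run_id": "run-2011-10-05-demo",
    "input": "s3://example-bucket/runs/run-2011-10-05-demo/",
    "pipeline": "align-realign-call",
}))

# Workflow side: poll for work and kick off the pipeline.
for message in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=1):
    request = json.loads(message.body)
    print("starting pipeline", request["pipeline"], "for", request["run_id"])
    message.delete()  # acknowledge once the job is handed off
```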

What does the Hadoop part do? Distributed sorting of BAM files, co-grouping, and detection of outliers. The idea is to distribute filtering work in order to prioritize variants or regions of interest. Data is self-describing, which allows restarting or recovering at arbitrary points. Cascading allows serialization of these by defining schemes for BAM, VCF, and FASTQ files.
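Their implementation is Cascading on the JVM, so the following is only a rough Python stand-in for the co-grouping idea: an mrjob job that groups variant calls by genomic site so each site can be filtered or flagged in one reducer pass. The input layout and quality threshold are invented.

```python
# Rough Python stand-in for the co-grouping idea (the real pipeline uses
# Cascading/Hadoop on the JVM): group variant calls by genomic site with mrjob
# so each site can be filtered or flagged in a single reducer call.
from mrjob.job import MRJob

class GroupCallsBySite(MRJob):
    """Input lines: sample, chrom, pos, genotype, quality (tab-separated; invented layout)."""

    def mapper(self, _, line):
        sample, chrom, pos, genotype, qual = line.split("\t")
        yield (chrom, int(pos)), (sample, genotype, float(qual))

    def reducer(self, site, calls):
        calls = list(calls)
        mean_qual = sum(q for _, _, q in calls) / len(calls)
        # Flag sites where call quality is unusually low across samples.
        yield site, {"n_samples": len(calls),
                     "mean_qual": mean_qual,
                     "outlier": mean_qual < 20}

if __name__ == "__main__":
    GroupCallsBySite.run()
```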

How did they work on scaling problems? An index server pre-computes results and feeds them into the MapReduce cluster, but it became a bottleneck. They improved this by moving the index files into a BitTorrent swarm that serves them to the MapReduce nodes. The continuous integration system does the work of creating the index files as EBS snapshots, which the BitTorrent swarm then uses.

AWS Genomics Event: Matt Wood; Chris Dagdigian on cloud biology automation

I’m in Seattle at the AWS Genomics Event, excited for a fun day of talking about genomics in the cloud.

Introduction to Research in the Cloud: Matt Wood, AWS

Matt Wood starts off the day with an introduction to Amazon Web Services and details of Amazon's interest in genomics. The idea is to shift attention from data to materials and from compute to methods, focusing better on the science. Areas where Amazon interacts with science:

  • Reproducibility: the 1000 Genomes Project is a great example. Reproducibility improves the impact of science by easing reuse. You can package the environment as machine images, which is awesome since you can give collaborators exactly what you ran, and it allows working in new ways since you can share complex environments. CloudFormation lets you define all of the items in a cluster in JSON. Tools like Puppet and Chef provision software and configuration. Taverna can model the actual science workflow. Amazon provides SimpleDB as a key/attribute store to help model and store metadata associated with experiments or data. Galaxy is fully invested in reproducibility and community involvement within their infrastructure.

  • Constraint removal: avoid constraints that limit innovation and research. Expand your problem space by introducing an easy approach to scaling.

  • Algorithm development: infrastructure enables algorithms. Nice examples are GPU instances and Crossbow utilizing Hadoop.

  • Collaboration and sharing: data, data uses, and multiple users across lots of locations. The general idea is to move the compute to the data. Amazon has free inbound transfer; if that's too slow, there is also Import/Export via FedExed hard drives. You can do parallel uploads to S3 (see the sketch after this list).

  • Funding options: on-demand is the easiest approach, but the most costly. You can use reserved capacity to reduce the hourly rates. The spot market lets you bid on capacity and save money, but you need to architect for interruption.

  • Compliance: a shared responsibility: Amazon secures the infrastructure; users secure the instances and data. AWS is ISO 27001 and HIPAA compliant. Data is mirrored across availability zones, but local data stays local. GovCloud is for US-only usage.
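As a concrete example of the parallel S3 upload mentioned in the collaboration bullet, boto3's transfer manager splits a large file into parts and uploads them concurrently; the file, bucket, and thresholds below are placeholders.

```python
# Minimal sketch of parallel (multipart) upload to S3 with boto3's transfer
# manager; bucket name, key, and file are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Split anything over ~64MB into parts and upload up to 10 parts concurrently.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=10)

s3.upload_file("sample.bam", "example-genomics-bucket", "runs/sample.bam",
               Config=config)
```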

Some exciting things are coming soon in genomics. Getting closer to health and patient data will require security and data availability, and scaling to large numbers of users with elastic pipelines. It's important to put patients in charge of their own data.

Practical Cloud & Workflow Orchestration: Chris Dagdigian, The BioTeam

Chris Dagdigian discusses working on the hardware-geek side of science with AWS, covering three topics: time, laziness, and beauty. We're getting to the point where automated provisioning changes the lag time between wanting to do science and having the hardware ready to do it. Research infrastructure is 100% scriptable and automatable: be lazy and automate what you do. The beautiful bits are what you can build on top of Amazon infrastructure.

Demo time:

  • CloudInit gives you a hook into freshly booted systems. You don't need to maintain tons of AMIs; it's an easy way to configure a new system with a YAML configuration file (a user-data sketch follows this list).

  • Amazon CloudFormation allows you to turn on and off a large number of instances: create an elastic database cluster, webserver cluster, and monitoring, all from a JSON input file. The example JSON templates are a good place to get started.

  • Opscode Chef enables infrastructure as code. It's important that everything is idempotent so you can run it multiple times. Demo with knife, Chef's command-line tool: you can run commands over ssh on each node in a cluster, but also do searches, and with the searches find nodes with properties of interest and run commands on just those.

  • MIT StarCluster builds a ready-to-use cluster compute farm on AWS, which is especially useful for handling legacy use cases. There's a Slideshare example of running this.
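Chris's demo scripts weren't shared, but the CloudInit hook works by passing a cloud-config document as instance user data at launch; here is a hedged sketch of that with boto3, where the AMI, packages, and commands are just examples.

```python
# Hedged sketch of the CloudInit hook: pass a cloud-config document as user
# data when launching an instance, so a stock AMI configures itself on first
# boot. The AMI, packages, and commands are examples, not the demo's.
import boto3

USER_DATA = """#cloud-config
packages:
  - nfs-common
  - gridengine-client
runcmd:
  - [mkdir, -p, /data]
  - [mount, "fileserver:/export/data", /data]
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # stock cloud-init-enabled AMI (placeholder)
    InstanceType="m5.large",
    MinCount=1, MaxCount=1,
    UserData=USER_DATA,          # boto3 base64-encodes this for you
)
```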