The afternoon of the AWS Genomics Event features longer, more detailed tutorials on using Amazon resources for biological tasks.
Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio
Chris and Giles work at Distributed Bio, which provides consulting services on building, well, distributed biology pipelines. Chris starts by talking about some recommended tools:
Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP-based transport, as used in iRODS, provides a substantial speed improvement over standard TCP. Simple demo: 3 minutes versus 18 seconds for a ~1 GB file (with iput/iget).
iRODS: a catalog on top of the filesystem, which is ideal for massive data collections. It lets you easily manage organizing, sharing, and storing data between machines. It’s a database of files with metadata: definitely worth investigating. Projects to interact with S3 and HDFS are in the works.
GlusterFS: easy to set up and use; performance equivalent to or better than NFS.
Queuing and scheduling: OpenLava is essentially open-source LSF, with an EC2-friendly scheduler.
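Some back-of-the-envelope arithmetic (mine, not from the talk) puts the iput/iget demo above in perspective:

```python
# Back-of-the-envelope comparison of the transfer demo above:
# ~1 GB over standard TCP (scp) took ~3 minutes, while iRODS's
# UDP-based transport (iput/iget) took ~18 seconds.
GB = 10 ** 9  # decimal gigabyte, in bytes

tcp_seconds = 3 * 60   # ~3 minutes via scp
udp_seconds = 18       # ~18 seconds via iput/iget

speedup = tcp_seconds / udp_seconds
tcp_mbps = GB * 8 / tcp_seconds / 10 ** 6   # effective megabits/second
udp_mbps = GB * 8 / udp_seconds / 10 ** 6

print(f"speedup: {speedup:.0f}x")  # 10x
print(f"TCP: {tcp_mbps:.0f} Mbps, UDP: {udp_mbps:.0f} Mbps")
```

So the quoted times work out to roughly an order-of-magnitude throughput difference on this file.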
Giles continues with three different bioinformatics problems: antibody/epitope docking, sequence annotation, and NGS library analysis. For docking, they architected an S3/EC2 solution using 1,000 cores over 3.5 hours for $350. That's 10 times faster than their local cluster, enabling new science.
The sequence annotation work involved annotating a large set of genes in batch. The data was changing so fast that the local cluster could not keep up with updates. They moved to a hybrid architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k on AWS.
NGS antibody library analysis looks at the variability of specific regions in the heavy-chain V region, using BLAST and HMM models to find unique H3 clones. The VDJFasta software is available, along with an associated paper. On the practical side, iRODS is used to synchronize local and AWS file systems.
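The cost figures quoted above are easy to sanity-check (my arithmetic, not Giles's):

```python
# Sanity-check the cost figures quoted above (my arithmetic, not the talk's).

# Docking: 1,000 cores for 3.5 hours at a total cost of $350.
docking_total = 350.0
core_hours = 1000 * 3.5
cost_per_core_hour = docking_total / core_hours
print(f"effective rate: ${cost_per_core_hour:.2f}/core-hour")  # $0.10/core-hour

# Annotation: $100k in house versus $20k on AWS.
in_house, aws = 100_000, 20_000
print(f"savings: {1 - aws / in_house:.0%}")  # 80%
```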
Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS
Deepak starts off with a plug for AWS Education grants, which are a great way to get grants for compute time. Matt follows with a demo building an 8-node, 64-core cluster with monitoring: create a cc1.4xlarge cluster compute instance with a 100 GB EBS volume attached, allow SSH and HPC connections between nodes in the security group, and put the instance in a placement group so you can replicate 7 more of them later by creating an AMI from the first.
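The replication step above can be sketched as launch parameters. This is a hypothetical sketch using the modern boto3 SDK (which postdates this talk; Matt's demo used the console), and names like `hpc-demo` and the AMI ID are my own placeholders:

```python
# Sketch of the cluster replication step described above, expressed
# as boto3 run_instances parameters. Hypothetical: names and IDs
# below are placeholders, not from the demo.
launch_params = {
    "ImageId": "ami-00000000",            # placeholder AMI built from the first node
    "InstanceType": "cc1.4xlarge",        # cluster compute instance type
    "MinCount": 7, "MaxCount": 7,         # 7 more nodes, for 8 total
    "Placement": {"GroupName": "hpc-demo"},     # placement group for HPC networking
    "SecurityGroups": ["hpc-demo-sg"],    # allows SSH and inter-node traffic
    "BlockDeviceMappings": [{
        "DeviceName": "/dev/sdf",
        "Ebs": {"VolumeSize": 100},       # 100 GB EBS volume per node
    }],
}

# With credentials configured, the launch itself would be:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.run_instances(**launch_params)
```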
CycleCloud demo: Andrew Kaczorek
Andrew from Cycle Computing wraps up the day by talking about their 30,000 node cluster created using CycleCloud. Monitoring is done using Grill, a CycleServer plugin that monitors Chef jobs.
The live demo shows their interface spinning nodes up and down. A simple way to start clusters and scale them up and down; nice.
He also demos a SMRT interface to PacBio data and analyses, providing the elastic back-end compute on Amazon for a PacBio instrument.
I’m in Seattle at the AWS Genomics Event, excited for a fun day of talking about genomics in the cloud.
Introduction to Research in the Cloud: Matt Wood, AWS
Matt Wood starts off the day with an introduction to Amazon Web Services and details about Amazon’s interest in genomics. The idea is to move from data to materials, and from compute to methods, focusing better on the science. Areas where Amazon interacts with science:
Reproducibility: 1000 Genomes is a great example. Reproducibility improves the impact of science by easing reuse. You can package the environment as machine images, which is awesome since you can give collaborators exactly what you did, and it allows us to work in new ways since you can share complex environments. CloudFormation lets you define all of the items in a cluster in JSON. Tools like Puppet and Chef provision software and configuration, while Taverna can model the actual science workflow. Amazon provides SimpleDB as a key/attribute store to help model and store metadata associated with experiments or data. Galaxy is fully invested in reproducibility and community involvement within its infrastructure.
Constraint removal: avoid constraints that limit innovation and research. Expand your problem space by introducing an easy approach to scaling.
Algorithm development: infrastructure enables algorithms. Nice examples are GPU instances and Crossbow utilizing Hadoop.
Collaboration and sharing: data, data uses, and multiple users across lots of locations. The general idea is moving the compute to the data. Amazon has free inbound transfer; if that’s too slow, there is also Import/Export via FedExed hard drives. You can also do parallel uploads to S3.
Funding options: on-demand is the easiest approach, but the most costly. Reserved capacity reduces the hourly rates. The spot market lets you bid on capacity and save money, but you need to architect for interruption.
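The trade-off between the three options can be illustrated with made-up rates (all numbers below are hypothetical; real prices vary by instance type and region):

```python
# Illustration of the three pricing options above, using made-up
# hourly rates. All figures are hypothetical examples.
hours = 8760                   # one year of continuous compute

on_demand_rate = 1.60          # hypothetical $/hour
reserved_upfront = 4000.00     # hypothetical one-time reservation fee
reserved_rate = 0.56           # hypothetical discounted $/hour
spot_rate = 0.48               # hypothetical average winning spot bid

on_demand = on_demand_rate * hours
reserved = reserved_upfront + reserved_rate * hours
spot = spot_rate * hours       # assumes jobs tolerate interruption

print(f"on-demand: ${on_demand:,.0f}")
print(f"reserved:  ${reserved:,.0f}")
print(f"spot:      ${spot:,.0f}")
```

With sustained usage the reservation fee amortizes away, and spot is cheapest of all, provided the pipeline can survive losing instances mid-run.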
Compliance: a shared responsibility, where Amazon secures the infrastructure and users secure their instances and data. AWS is ISO 27001 certified and can support HIPAA compliance. Data is mirrored across availability zones, but local data stays local. GovCloud is for US-only usage.
Some exciting things are coming soon in genomics. Getting closer to health and patient data is going to require security and data availability, scaling to large numbers of users with elastic pipelines. It's important to put patients in charge of their own data.
Practical Cloud & Workflow Orchestration: Chris Dagdigian, The BioTeam
Chris Dagdigian discusses working on the hardware-geek side of science with AWS, covering three topics: time, laziness, and beauty. We're getting to the point where automated provisioning eliminates the lag between wanting to do science and having the hardware ready to do it. Research infrastructure is 100% scriptable and automatable: be lazy and automate what you do. The beautiful bits are what you can build on top of Amazon infrastructure.
CloudInit gives you a hook into freshly booted systems. You don't need to maintain tons of AMIs; it's an easy way to configure a new system with a YAML configuration file.
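A minimal cloud-config file of the kind Chris describes might look like this (a sketch; the package names and commands are arbitrary examples, not from the talk):

```yaml
#cloud-config
# Minimal CloudInit user-data sketch: install packages and run a
# bootstrap command on first boot. Package names are examples only.
packages:
  - nfs-common
  - gridengine-exec
runcmd:
  - [ mkdir, -p, /mnt/scratch ]
  - echo "node bootstrapped" >> /var/log/bootstrap.log
```

Passed as user data at launch, this configures a vanilla base AMI on boot instead of baking a custom image for every role.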
Amazon CloudFormation allows you to turn on/off a large number of instances: create an elastic database cluster, webserver cluster, and monitoring, all from a JSON input file. The example JSON templates are a good place to get started.
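A trimmed-down sketch of what such a template looks like (the structure follows CloudFormation's template format; resource names and IDs are my placeholders):

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch: one webserver instance plus its security group",
  "Resources": {
    "WebServer": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "ImageId": "ami-00000000",
        "InstanceType": "m1.large",
        "SecurityGroups": [ { "Ref": "WebSecurityGroup" } ]
      }
    },
    "WebSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Allow inbound HTTP",
        "SecurityGroupIngress": [
          { "IpProtocol": "tcp", "FromPort": "80", "ToPort": "80", "CidrIp": "0.0.0.0/0" }
        ]
      }
    }
  }
}
```

A real elastic stack would add database and monitoring resources in the same `Resources` map, which is what makes the whole-cluster-in-one-file idea work.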
Opscode Chef enables infrastructure as code. It's important that everything is idempotent so recipes can safely run multiple times. Demo with knife, Chef's command-line tool: it can run ssh commands on each node in a cluster, but also do searches, finding nodes with properties of interest and running commands on just those.
MIT StarCluster builds a ready-to-use cluster compute farm on AWS. It's especially useful for handling legacy use cases. There's a Slideshare example of running this.
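StarCluster is driven by an INI-style config file; a minimal sketch might look like this (key names follow StarCluster's config format as I understand it; all values are placeholders):

```ini
; Minimal StarCluster config sketch -- values are placeholders.
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 8
NODE_IMAGE_ID = ami-00000000
NODE_INSTANCE_TYPE = m1.large
```

With that in place, `starcluster start smallcluster` brings the cluster up and `starcluster terminate smallcluster` tears it down.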