The afternoon of the AWS Genomics Event features longer, more detailed tutorials on using Amazon resources for biological tasks.
Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio
Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP transport, as in iRODS, provides a substantial speed improvement over standard TCP. Simple demo: 3 minutes versus 18 seconds for a ~1Gb file (with iput/iget).
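As a quick check on the demo numbers, the speedup can be worked out directly. This is only arithmetic on the two quoted timings; the variable names are illustrative and nothing here was shown in the talk.

```python
# Timings quoted in the demo for the same ~1Gb file.
slow_seconds = 3 * 60   # the 3-minute transfer
fast_seconds = 18       # the 18-second transfer

# Ratio of the two wall-clock times.
speedup = slow_seconds / fast_seconds
print(f"Speedup: {speedup:.0f}x")
```

So the faster transfer is roughly an order of magnitude quicker on this single file.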
iRODS: a catalog on top of the filesystem, which is ideal for massive data collections. Lets you easily manage organizing, sharing and storing data between machines. It’s a database of files with metadata: definitely worth investigating. Projects to interact with S3 and HDFS are in the works.
GlusterFS: easy to set up and use; performance equivalent to or better than NFS
Queuing and scheduling: OpenLava is basically open-source LSF, with an EC2-friendly scheduler
Giles continues with 3 different bioinformatics problems: antibody/epitope docking, sequence annotation and NGS library analysis. For docking, they architected an S3/EC2 solution that used 1000 cores over 3.5 hours for $350. That was 10 times faster than the local cluster, enabling new science.
The sequence annotation work was to annotate a large set of genes in batch. It was changing so fast that the cluster could not keep up with updates. They moved to a hybrid AWS architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k at AWS.
NGS antibody library analysis looks at the variability of specific regions in the heavy-chain V region. It uses BLAST and HMMs to find unique H3 clones. The VDJFasta software is available, along with a paper. On the practical side, iRODS is used to synchronize local and AWS file systems.
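The docking figures imply a per-core-hour price, which a few lines of arithmetic make explicit. This is a sketch using only the numbers quoted in the talk; the local-cluster runtime is an estimate derived from the "10 times faster" remark, not a figure the speakers gave.

```python
# Figures quoted for the docking run on EC2.
cores = 1000
hours = 3.5
total_cost_usd = 350.0

# Total compute consumed and the implied unit price.
core_hours = cores * hours
cost_per_core_hour = total_cost_usd / core_hours

# Assumption: "10 times faster" refers to wall-clock time on the local cluster.
estimated_local_hours = hours * 10

print(f"{core_hours:.0f} core-hours at ${cost_per_core_hour:.2f} per core-hour")
print(f"Estimated local wall-clock time: {estimated_local_hours:.0f} hours")
```

That works out to 3500 core-hours at about $0.10 each, against an estimated 35 hours of wall-clock time locally.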
Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS
Deepak starts off with a plug for AWS Education grants, which are a great way to get compute time. Matt follows with a demo that builds an 8 node, 64 core cluster with monitoring. Create a cc1.4xlarge cluster instance with a 100GB EBS store attached. Allow SSH and HPC connections between nodes in the security group. Put the instance in a placement group so that you can later replicate 7 more of them by creating an AMI from the first.
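The steps in the demo could be scripted along these lines. This is a hedged sketch, not what was shown: it assumes the boto3 library, and the AMI ID, placement group name, device name and the helper `cluster_request` are all hypothetical. The block only builds the request parameters, so it runs without AWS credentials; the actual API calls are left as comments. The 8 cores per node is inferred from the 8 node, 64 core figure, and the cc1.4xlarge type may no longer be offered.

```python
def cluster_request(ami_id, group_name, node_count=8, cores_per_node=8, ebs_size_gb=100):
    """Build keyword arguments for an EC2 run_instances call (no network access)."""
    params = {
        "ImageId": ami_id,                       # hypothetical AMI made from the first node
        "InstanceType": "cc1.4xlarge",           # instance type named in the talk
        "MinCount": node_count,
        "MaxCount": node_count,
        "Placement": {"GroupName": group_name},  # keep all nodes in one placement group
        "BlockDeviceMappings": [
            # Assumed device name; attaches an EBS volume of the requested size.
            {"DeviceName": "/dev/sdf", "Ebs": {"VolumeSize": ebs_size_gb}}
        ],
    }
    total_cores = node_count * cores_per_node
    return params, total_cores


params, total_cores = cluster_request("ami-EXAMPLE", "hpc-demo")
print(f"Requesting {params['MaxCount']} nodes, {total_cores} cores in total")

# With credentials configured, the request could be submitted like this:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")
# ec2.run_instances(**params)
```

Unlike the demo, which starts one instance and replicates it afterwards, this sketch requests all nodes in a single call from an existing AMI; the security group rules for SSH and inter-node traffic would still need to be set up separately.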
CycleCloud demo: Andrew Kaczorek
The live demo shows their interface being used to spin nodes up and down. It is a simple way to start clusters and scale them up and down; nice.