Continuing on from my morning overview of the Amazon workshop on Storing, Analyzing and Sharing Genomic Data in the Cloud.
Peter Sirota — Hadoop and Amazon Elastic MapReduce
Amazon Elastic MapReduce builds on the Hadoop framework. Once you write your job in MapReduce, Hadoop handles parallelization, fault tolerance, data distribution and load balancing. Elastic MapReduce takes care of tuning Hadoop, managing the compute clusters, and monitoring jobs.
Customers using Elastic MapReduce include eHarmony and Razorfish. Razorfish, a digital advertising and marketing firm, deals with 200 Gb of data a day, with the goal of personalizing recommendations.
Bootstrap actions allow you to run arbitrary scripts before a job flow begins. These can be used to configure Hadoop or the cluster, or to install packages. Like Elastic MapReduce itself, this can be managed through the AWS console.
Higher level abstractions are available on top of Hadoop: Cascading is a Java API for MapReduce, with Clojure abstractions available on top of it; Apache Hive is a SQL-like query language; Apache Pig is a scripting language; and Hadoop streaming allows you to run jobs in R, Python, Ruby, Perl…
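Hadoop streaming's contract is simple enough to sketch in a few lines: the mapper and reducer are ordinary programs that read lines and emit tab-separated key/value pairs, with Hadoop sorting by key between the two stages. A minimal word-count-style sketch in Python (the function names are mine, not from any speaker's code):

```python
from itertools import groupby

# Sketch of the Hadoop streaming contract: mapper and reducer exchange
# tab-separated key/value lines; Hadoop's shuffle sorts by key between them.

def mapper(lines):
    """Emit (word, 1) for every word seen -- the classic streaming mapper."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum counts per key; input arrives sorted by key, as after a shuffle."""
    split = [p.split("\t") for p in pairs]
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

def run_local(lines):
    """Simulate the job locally: map, shuffle (sort by key), reduce.
    On a cluster these stages are wired together by hadoop-streaming.jar."""
    return list(reducer(sorted(mapper(lines))))
```

On EC2 the same two scripts would be handed to the streaming jar as `-mapper` and `-reducer`; the appeal is that the job logic stays in whatever language you already use.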
One cool project on top of MapReduce is Datameer, which puts a spreadsheet interface on top of MapReduce for big data analysis in a familiar environment.
Two biology examples of MapReduce follow the same general pattern: do alignments, segment into genome bins, then scan for SNPs. Crossbow implements this on top of Bowtie alignments, parallelizing by read, genome bin, sample and gene at each step, with Hadoop handling the details.
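The bin-and-scan pattern can be made concrete with a toy sketch (my own simplification, not Crossbow's actual code): the map step keys each alignment by its genome bin so the shuffle groups reads per bin, and the reduce step scans one bin's pileup for positions disagreeing with the reference.

```python
from collections import defaultdict

BIN_SIZE = 1000  # toy bin size; real pipelines use much larger partitions

def map_to_bins(alignments):
    """Map step: key each aligned base observation (chrom, pos, base) by
    genome bin, so all observations in a bin land on one reducer."""
    for chrom, pos, base in alignments:
        yield (chrom, pos // BIN_SIZE), (pos, base)

def call_snps_in_bin(records, reference):
    """Reduce step for one bin: flag positions where the most common
    observed base disagrees with the reference -- a crude stand-in for
    a real SNP caller, requiring at least two supporting reads."""
    piles = defaultdict(list)
    for pos, base in records:
        piles[pos].append(base)
    snps = []
    for pos, bases in sorted(piles.items()):
        top = max(set(bases), key=bases.count)
        if top != reference.get(pos) and bases.count(top) >= 2:
            snps.append((pos, top))
    return snps
```

Because each bin is scanned independently, the reduce stage parallelizes trivially across the cluster, which is exactly what Hadoop exploits.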
Andreas Sundquist — A Web 2.0 Platform to Store, Visualize, and Analyze Your Sequence Data in the Cloud
Scaling next gen sequencing means a shift from raw data to the actual biological information. The idea is to raise the level of abstraction to focus on the biology, like SNPs, where the amount of information to store is much smaller. Andreas is from DNAnexus, which provides web-based analysis on the cloud for smaller centers doing sequencing. The idea is to increase access to analysis tools.
Provides quality metrics and an embedded genome browser with tracks; everything is web based. Looks like some good stuff. This is implemented on Amazon EC2 + S3.
One engineering challenge on EC2 was reliability and consistency in the face of EC2 node failures. EBS is used for persistent data storage, relational databases use transaction commits, S3 operations can be re-run without issues, and jobs are restartable. Systems have retry logic with exponential back-off. Large EC2 instances have more reliable IO. For S3, use data structures that minimize query steps.
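The retry-with-back-off pattern is worth sketching. This is a generic version (names and parameters are my own, not DNAnexus code), applicable to any idempotent S3 or EC2 operation:

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.5, jitter=True):
    """Retry an idempotent operation (e.g. a re-runnable S3 GET/PUT)
    with exponential back-off, to ride out transient node or network
    failures. `fn` must be safe to call more than once."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, 4s, ...
            if jitter:
                delay *= random.uniform(0.5, 1.0)  # avoid retry storms
            time.sleep(delay)
```

The key property is that delays grow geometrically, so a briefly overloaded service sees rapidly thinning traffic rather than a hammering loop; the jitter spreads out many clients retrying at once.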
Getting data into AWS is a problem given the S3 file size limit: you need to compress and split files. Web browsers have a 2 Gb upload limit, so get around this with URL/FTP upload or command line access.
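Splitting before upload can be as simple as the following sketch (a hypothetical helper of my own, not an AWS tool): each part is small enough to upload individually and the parts concatenate back into the original on the other side.

```python
import os

CHUNK = 64 * 1024 * 1024  # 64 Mb parts; pick a size well under the limits above

def split_file(path, chunk_size=CHUNK):
    """Split a large file into numbered parts for upload. Each part can
    be pushed to S3 separately and the set reassembled on download
    (e.g. with `cat file.part*` on the receiving end)."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            part = f"{path}.part{index:04d}"
            with open(part, "wb") as dst:
                dst.write(data)
            parts.append(part)
            index += 1
    return parts
```

Compressing first (gzip/bzip2) shrinks what crosses the wire; splitting afterwards keeps every individual transfer retryable and under the size ceiling.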
Jon Sorenson — 3rd Generation sequencing in the cloud
[Pacific Biosciences] provides real time monitoring of DNA synthesis by detecting single molecules as bases are incorporated. The analysis pipeline goes from a movie of synthesis to traces to pulses to raw base calls to consensus base calls. Produces a Gb of data in 30 minutes.
Reads are filtered, aligned or assembled, consensus sequences are called, and variants are identified. Data goes from 1 Tb for a 30x human genome to a 150 Gb finished genome. The goal is full automation, including sample inputs: transparent automation for a small lab, with data pushed directly into a cloud environment for analysis.
Why deploy complex analysis on the cloud? It is more maintainable, and it makes costs explicit, so analysis becomes easier to budget.
PacBio data is more complex than that from 2nd generation platforms: the raw signal carries real information, such as signatures associated with DNA methylation. They want to abstract this information out to make it useful.
Lots of data is coming quickly: 10,000 genomes = 2 Pb of data, which will require parallelization frameworks like Hadoop. PacBio is not yet using Hadoop, but is rethinking this stance as Hadoop development has taken care of earlier issues.
Dione Bailey — Expanding sequencing services to the cloud
Complete Genomics provides on-demand human sequencing. CGI does prep, sequencing, assembly and analysis, and provides data back on Amazon. Output datasets are 500 Gb, containing summary statistics, variant information, annotations, evidence files for each variant, and the reads and mappings. Data is burned onto a drive by Amazon and shipped to the customer using AWS Import/Export.
Thinking about moving to web based delivery on Amazon; the main difficulty is overcoming people’s unfamiliarity with AWS. They have the capacity to sequence 500 human genomes a month.
Complete Genomics Analysis Tools (CGA tools): an open source project providing analysis tools. Not yet available. Will provide genome comparison tools and format conversion tools.
James Taylor — Dynamically Scalable Accessible Analysis for Next Generation Sequence Data
Galaxy provides web based analysis tools for biological data. Want to make analyses more accessible, allow those analyses to be transparently communicated and make them reproducible. Tools are the basic unit of analysis in Galaxy. These are bundled together into workflows, which can be shared and published allowing others to see what you’ve done and perform it on their own work.
Galaxy is designed to be installed locally if you have the resources to set this up. Without local resources, it can also be installed directly on cloud instances. The architecture has been reworked to run reliably on cloud infrastructure, using a messaging system.
An example deployment is discovery of mitochondrial heteroplasmy: tricky because there are lots of mitochondrial genomes, which have their own variation. The Amazon EC2 setup provides a web-based Galaxy console for deployment and adding compute nodes; nodes can be spun up on demand when existing ones are saturated. They ran 12 parallel workflows for the analysis, and at the end discovered several novel SNPs.
John Hogenesch and Angel Pizarro — How we went from, “‘omics data, cool”, to “‘omics data, uh oh”
The biological motivation: studying the circadian rhythm, controlled by a transcriptionally regulated metabolic pathway that is studied in alternative clocks: non 24-hour clocks. They did a time-course RNA-seq analysis of wildtype liver, using Bowtie and BLAT for mapping reads, with BLAT handling spliced reads.
Cost of running on Amazon is ~$25/lane, with 85% of reads mapped after the Bowtie + BLAT pipeline, done on a quadruple extra large instance. One unexpected finding was the presence of stable, spliced intronic sequences.
On the informatics side, simple job distribution systems like CloudCrowd, disco and Resque help reduce some of the barriers involved with Hadoop.
Ntino Krampis — From AMIs Running Monolithic Software, to Real Distributed Bioinformatics Applications in the Cloud
Ntino describes JCVI Cloud BioLinux, an AMI with 100+ bioinformatics tools on EC2. Once you have this, how do you leverage cloud services to scale compute? How can we do highly parallel alignments with EC2? The idea is to parallelize BLAST by distributing seeds across partitioned SimpleDB: approximately 2000 major seeds in the human genome, with the seed scan parallelized across multiple instances. The interesting question is how new bioinformatics algorithms can be designed to be more scalable on parallelized architectures.
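The seed-partitioning idea can be sketched with a toy hash-partitioned k-mer index (my own illustration; SimpleDB and the actual JCVI scheme would differ): each seed hashes to exactly one partition, so worker instances scan disjoint slices of the index in parallel and a query only touches the partition owning its seed.

```python
from hashlib import md5

N_PARTITIONS = 4  # stand-in for SimpleDB domains / worker instances

def partition_for(seed):
    """Assign a seed k-mer to a partition by hashing, spreading seeds
    evenly so each instance owns roughly 1/N of the index."""
    return int(md5(seed.encode()).hexdigest(), 16) % N_PARTITIONS

def build_partitions(genome, k=11):
    """Index every k-mer of a (toy) genome string into its partition,
    recording the positions at which the seed occurs."""
    parts = [dict() for _ in range(N_PARTITIONS)]
    for i in range(len(genome) - k + 1):
        seed = genome[i:i + k]
        parts[partition_for(seed)].setdefault(seed, []).append(i)
    return parts

def lookup(parts, seed):
    """A query consults only the single partition owning the seed."""
    return parts[partition_for(seed)].get(seed, [])
```

Seed lookup is the first step of a BLAST-style search; once hits are found, extension around each position can also proceed independently per instance, which is the scalability property the talk points at.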