Links: science communication, statistics, large data analysis

A long overdue set of links and quick thoughts:

Communicating research


Data analysis

  • Revolutions has advice for dealing with large data sets from the 2010 Workshop on Algorithms for Modern Massive Data Sets.

  • The larry package for manipulating tables in Python. This uses NumPy under the covers and is similar to dealing with data.frames in R.

  • Will describes how callbacks can drive an analysis pipeline. As analysis workflows get more complicated, your code can get to be a mess of special cases and become really fragile. Here he passes around functions through a standard runner to help generalize and abstract the process.
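To make the callback idea concrete, here is a minimal sketch of a standard runner that takes a list of functions and applies them in order. This is not Will's code: the names `run_pipeline`, `drop_missing` and `normalize` are hypothetical and only illustrate the pattern.

```python
def run_pipeline(data, steps):
    """Standard runner: apply each callback in order, passing the result along."""
    for step in steps:
        data = step(data)
    return data


def drop_missing(rows):
    """Hypothetical step: remove missing values."""
    return [r for r in rows if r is not None]


def normalize(rows):
    """Hypothetical step: scale values so they sum to 1."""
    total = sum(rows)
    return [r / total for r in rows]


# Adding or reordering analysis steps means editing this list, not the runner.
result = run_pipeline([1, None, 3], [drop_missing, normalize])
print(result)
```

Special cases become extra functions in the list, so the runner itself stays small and unchanged.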




Amazon Workshop on Genomics and Cloud Computing: Afternoon talks

Continuing on from my morning overview of the Amazon workshop on Storing, Analyzing and Sharing Genomic Data in the Cloud.

Peter Sirota — Hadoop and Amazon Elastic MapReduce

Elastic MapReduce on Amazon uses the Hadoop framework. Hadoop handles parallelization, fault tolerance, data distribution and load balancing once you write your job in MapReduce. Elastic MapReduce takes care of the issues of tuning Hadoop and managing the compute clusters, as well as monitoring jobs.

Customers using Elastic MapReduce include eHarmony and Razorfish. Razorfish deals with 200 Gb a day of data for digital advertising and marketing, with the goal of personalizing recommendations.

Bootstrap actions allow you to run arbitrary scripts before a job flow begins. This can be used to configure Hadoop or the cluster, or install packages. Like Elastic MapReduce, this can be managed through the AWS console.

Higher level abstraction is available on top of Hadoop: Cascading is a Java API for MapReduce; Clojure abstractions are available for Cascading; Apache Hive is a SQL-like language; Apache Pig is a scripting language; Hadoop streaming allows you to run jobs in R, Python, Ruby, Perl…

One cool project on top of MapReduce is Datameer, which provides a familiar spreadsheet interface for big data analysis.

Two biology examples of MapReduce. The general pattern is to do alignments, segment into bins, then scan for SNPs. Crossbow implements this on top of Bowtie alignments. Parallelize by read, genome bin, samples and genes at each step, using Hadoop to handle the details.
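The align, bin, scan pattern can be sketched in plain Python. This is a toy illustration and not Crossbow's implementation: the input format, the `BIN_SIZE` value and the `min_support` threshold are all assumptions, and Hadoop would normally handle the shuffle step.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical genome bin width, in bases


def map_alignment(aln):
    """Map step: emit (bin key, observation) for each mismatch in an aligned read."""
    chrom, start, mismatches = aln
    for offset, base in mismatches:
        pos = start + offset
        yield (chrom, pos // BIN_SIZE), (pos, base)


def shuffle(pairs):
    """Group values by key; Hadoop does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_bin(observations, min_support=2):
    """Reduce step: keep positions where enough reads agree on the same base."""
    counts = defaultdict(int)
    for pos, base in observations:
        counts[(pos, base)] += 1
    return sorted(k for k, n in counts.items() if n >= min_support)


# Hypothetical aligned reads: (chromosome, start, [(offset, mismatched base), ...])
alignments = [
    ("chr1", 100, [(5, "T")]),
    ("chr1", 102, [(3, "T")]),
    ("chr1", 1500, [(0, "G")]),
]
pairs = [kv for aln in alignments for kv in map_alignment(aln)]
candidates = {key: reduce_bin(obs) for key, obs in shuffle(pairs).items()}
print(candidates)
```

Each bin can be reduced independently, which is what makes the scan step easy to parallelize.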

Andreas Sundquist — A Web 2.0 Platform to Store, Visualize, and Analyze Your Sequence Data in the Cloud

Scaling next gen sequencing is a shift from raw data to the actual biological information. The idea is to raise the level of abstraction to focus on biology, like SNPs, where the amount of information to store is lower. Andreas is from DNAnexus, which provides web-based analysis on the cloud for smaller centers doing sequencing. The idea is to increase access to analysis tools.

Provides quality metrics and an embedded genome browser with tracks; everything is web based. Looks like some good stuff. This is implemented on Amazon EC2 + S3.

One engineering challenge on EC2 was reliability and consistency around EC2 node failures. EBS is used for persistent data storage, relational databases with transaction commits, S3 operations can be re-run without issues, and jobs are restartable. Systems have retry logic with exponential backoff. Large EC2 instances have more reliable IO. For S3, use data structures that minimize query steps.
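As an illustration of retry logic with exponential backoff, here is a minimal sketch. It is not DNAnexus code: the function name, the default delays and the small random jitter are assumptions, and a real system would catch only the specific transient errors it expects.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # assumption: any exception is treated as transient
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # A little jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

This only makes sense for operations that can safely be re-run, which is why the re-runnable S3 operations and restartable jobs matter.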

Getting data into AWS is a problem because of the S3 file size limit: files need to be compressed and split. Web browsers have a 2Gb limit, so to get around this use URL/FTP upload or command line access.
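Splitting a large file into fixed-size pieces before upload might look like the sketch below. The helper name `iter_chunks` and the tiny chunk size are assumptions for illustration; the upload step is left out rather than guessing at an API.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive pieces of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Tiny in-memory example; a real file would be opened with open(path, "rb").
parts = list(iter_chunks(io.BytesIO(b"ACGTACGTAC"), chunk_size=4))
print(parts)
```

Each piece can then be compressed and uploaded on its own, and the pieces concatenated in order to rebuild the original file.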

Jon Sorenson — 3rd Generation sequencing in the cloud

Pacific Biosciences provides real time monitoring of DNA synthesis by detecting single molecules as bases are incorporated. The analysis pipeline involves going from a movie of synthesis to traces to pulses to raw base calls to consensus base calls. Produces a Gb of data in 30 minutes.

Reads are filtered, aligned or assembled, consensus sequences are called, and variants are identified. Data goes from 1Tb for a 30x human genome to a 150Gb finished genome. The goal is full automation, including sample inputs, that is transparent for a small lab: data is pushed directly into a cloud environment for analysis.

Why deploy complex analysis on the cloud? It is more maintainable. This makes it easier to expose the costs and provide more budgetable analysis.

PacBio data is more complex than 2nd generation platforms: signal provides real information associated with DNA methylation. Want to abstract this information out to make it useful.

Lots of data is coming quickly: 10,000 genomes = 2 Pb of data that will require parallelization like Hadoop. Not yet using Hadoop at PacBio, but rethinking this stance as Hadoop development has taken care of issues.

Dione Bailey — Expanding sequencing services to the cloud

Complete Genomics provides on-demand human sequencing. CGI does prep, sequencing, assembly and analysis and provides data back on Amazon. Output datasets are 500Gb. Provide back summary statistics, variant information, annotations, evidence files for each variant, along with reads and mappings. Data is burned onto a drive by Amazon and shipped to the customer using AWS Import/Export.

Thinking about moving to web based delivery on Amazon. Difficulties are overcoming people’s unfamiliarity with AWS. Have capacity to sequence 500 human genomes a month.

Complete Genomics Analysis Tools (CGA tools): an open source project providing analysis tools. Not yet available. Will provide genome comparison tools and format conversion tools.

James Taylor — Dynamically Scalable Accessible Analysis for Next Generation Sequence Data

Galaxy provides web based analysis tools for biological data. Want to make analyses more accessible, allow those analyses to be transparently communicated and make them reproducible. Tools are the basic unit of analysis in Galaxy. These are bundled together into workflows, which can be shared and published allowing others to see what you’ve done and perform it on their own work.

Galaxy is designed to be installed locally if you have resources to set this up. Without local resources, this can also be installed directly on cloud instances. The architecture has been reworked to run reliably on cloud infrastructure with a messaging system.

Example deployment is discovery of mitochondrial heteroplasmy: tricky because there are lots of mitochondrial genomes which have their own variation. Amazon EC2 setup provides a web based Galaxy console for deployment and adding compute nodes. Can spin up nodes on demand when they are saturated. Ran 12 parallel workflows for analysis. At the end discovered several novel SNPs.

John Hogenesch and Angel Pizarro — How we went from, “‘omics data, cool”, to “‘omics data, uh oh”

The biological motivation: studying the circadian rhythm. It is controlled by a transcriptionally regulated metabolic pathway, which is being studied in alternative clocks: non 24-hour clocks. Do a time-course RNA-seq analysis of wildtype liver. Used Bowtie and BLAT for mapping reads. BLAT used to handle splicing.

Cost of running on Amazon is ~$25/lane. 85% of reads mapping after Bowtie + BLAT pipeline. Done using a quadruple extra large instance. One unexpected finding was the presence of stable, spliced intronic sequences.

On the informatics side, simple job distribution systems like CloudCrowd, disco and Resque help reduce some of the barriers involved with Hadoop.

Ntino Krampis — From AMIs Running Monolithic Software, to Real Distributed Bioinformatics Applications in the Cloud

Ntino describes JCVI Cloud BioLinux, an AMI with 100+ bioinformatics tools on EC2. Once you have this, how do you leverage cloud services to scale compute? How can we do highly parallel alignments with EC2? Notion is to parallelize BLAST by distributing seeds across partitioned SimpleDB: approximately 2000 major seeds in the human genome. Seed scan is parallelized across multiple instances. Interesting thing is to think about how new bioinformatics algorithms can be designed to be more scalable for parallelized architecture.

Amazon Workshop on Genomics and Cloud Computing: Morning talks

Today I’m attending an Amazon workshop on Storing, Analyzing and Sharing Genomic Data in the Cloud. Organized by Deepak Singh, it’s a chance for folks to get together and discuss ideas for utilizing cloud architecture to solve biological problems. The goal is to understand the practical issues: the what, how and why of working on the cloud.

Deepak kicked off the day with an overview of AWS services, which are an impressive display of layered architectures aimed at developers.

James Hamilton — Cloud Computing Economies of Scale

Economies of scale are a very important driver of cloud computing. Large buyers of hardware can get up to 7x cost savings over medium buyers: for network, storage and administration. You can pass along these savings to smaller buyers, while still maintaining a margin.

Infrastructure cost breakdown — how to make the most of resources.

James presents a detailed cost breakdown for huge server farms. The full numbers are available at his blog. Servers have a life of 3 years, the space 10 years, and power is paid monthly. Costs need to be examined on the same scale. If you do this, then 34% of the costs are related to power, networking is 8% of costs, and 54% of the cost is actually spent on the servers. Server utilization and efficiency are the most important things you can improve: use ’em while you’ve got ’em.

Amazon spot instances are the way to take care of periodicity in server usage. You can get cheaper prices when usage is lower; straight up economics based on demand. It is a market to bid on compute power. In the AWS console you can see the history of pricing for instances and try and forecast times to use.

Power distribution and mechanical system efficiency

Power Usage Effectiveness (PUE): the total amount of power you take in divided by how much is delivered to the servers. A good data center ratio is 1.5, so for every watt that reaches a server, .5W of power is lost due to distribution and cooling. Real ratios are often 2-3 and up because of idle computer time when servers are using 55% of their power just sitting there.
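The ratio is simple arithmetic. The sketch below uses the standard definition, total facility power divided by power delivered to the servers. The wattage figures are made-up inputs chosen to reproduce the 1.5 ratio, and the function names are hypothetical.

```python
def pue(total_facility_watts, server_watts):
    """PUE: total power drawn by the facility divided by power reaching the servers."""
    return total_facility_watts / server_watts


def overhead_per_server_watt(ratio):
    """Watts lost to distribution and cooling for each watt delivered to a server."""
    return ratio - 1.0


# Hypothetical facility: draws 1500 W and delivers 1000 W to the servers.
ratio = pue(total_facility_watts=1500.0, server_watts=1000.0)
print(ratio, overhead_per_server_watt(ratio))
```

At a ratio of 2 to 3, one to two watts are spent on overhead for every watt that does useful work.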

As power gets transferred from a high voltage power station to the servers, you lose 11%: high voltage to substation to UPS to two transformers to the server. Not much can be done to trim this, so it is not a productive target for cost savings, but it is useful to look at for reliability. Server power supplies can be improved with VRM/VRD on-board step-down: 80% to 95% efficiency.

The cooling system is 25% of total costs. The mechanical design of server rooms has been fairly constant over the last 30 years. General design: air moved down and cooled, runs through servers and gets heated, goes back to top, and cycle. Air-side economizer is new potential improvement: basically open the window in winter.

Good things to help with cooling:

  1. Raise data center temperatures: servers can run at up to 90 degrees.

  2. Avoid leaks around airflow.

  3. Use cooling towers instead of A/C.

  4. Use outside air.

Cloud computing economics and innovation

Deep automation is only affordable when scaled over a larger user base. Can get to full automation, which is awesome. Really nice parallels with open source: do you have enough people to do this?

Other scaled savings: software and hardware investments, focused people working on issues like cooling. Scale also allows multiple datacenters to be used which puts the servers closer to users, and allows for cross datacenter redundancy.

Server utilization at most datacenters maxes out at about 30%; most are at 10-20%. This is very hard to change due to peaks and valleys in demand. Going to Amazon allows you to try new things and innovate without investing in these partially used servers: can scale quickly to try out analytics. Similarly, AWS pace of innovation is very quick with lots of services coming on line.

Chris Dagdigian — Scriptable infrastructures for scientific computing

Chris is from BioTeam, which has been using the cloud since 2007 in its work bridging the gap between science, IT and high performance computing. Check out their cloud consulting services.

Scriptable infrastructures are the latest way to embrace Larry Wall’s laziness maxim. They help reduce the friction and barriers to doing difficult distributed work. You can fully script everything that you do with Amazon APIs: servers, storage, databases, scaling, and so on. Our IT infrastructure can be 100% scriptable.

The scripting infrastructure is the baseline for doing real work: putting together a beautiful array of complex systems and pipelines. Chef is a very useful platform for configuration management and integration. It’s natively aware of cloud platforms and cloud instance metadata. Use a few instances running continuously and scale out as needed with Chef. Recipes are maintained as source code in Git. Awesome.

What does the cloud mean for IT people? A huge restructuring that blurs the lines between IT and scientific research. The role of the system administrator moves toward that of a system architect; scientists should have more control over server distribution.

In assessing costs of moving to the cloud, the main challenge is actually accounting for all the internal costs associated with IT. Once you do that, then you can see how to make big savings.

David Dooling — Architecture for the Next Challenges in Genomics

Metagenomics in the context of the human microbiome: the 90 trillion microbial cells that live on the human body. 300 samples of 40 or 50 individuals will be sequenced; 3Tb of data.

Pipeline for analysis:

  1. Remove human hits.

  2. Align to known bacteria.

  3. Align to known viruses.

  4. Align the remainder, in protein space, to the nr database.

Protein space alignment is expensive compared to BWA or Bowtie: using blastx this will take 13 million core hours. In the genome center, this costs $0.03/core hr. In house, that comes to $400,000 for computation, and $300,000 to generate the data. Amazon costs are ~$3 million, but Amazon does have the scale to handle the computation in time. The tricky thing is that there are a lot of overhead costs that are very hard to count up at a university.
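The in-house computation figure follows from the numbers in the talk: 13 million core hours at $0.03 per core hour is $390,000, or roughly $400,000. The sketch below redoes that arithmetic. The implied Amazon rate is derived here by dividing the quoted ~$3 million by the same core hours; it is not a price stated in the talk.

```python
CORE_HOURS = 13_000_000   # blastx estimate from the talk
LOCAL_RATE = 0.03         # dollars per core hour in the genome center
AMAZON_TOTAL = 3_000_000  # approximate Amazon cost quoted in the talk

# In-house compute cost: core hours times the local rate.
local_compute_cost = CORE_HOURS * LOCAL_RATE

# Derived, not quoted: the per-core-hour rate implied by the Amazon total.
implied_amazon_rate = AMAZON_TOTAL / CORE_HOURS

print(f"in-house compute: ${local_compute_cost:,.0f}")
print(f"implied Amazon rate: ${implied_amazon_rate:.2f}/core hr")
```

The gap between the two rates is what the hard-to-count university overhead costs would need to explain.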

The right solution is a hybrid of local and cloud resources. Utilize local resources where scale makes sense because the cost is right; utilize the cloud where you can’t scale out with local resources. The challenge is architecting a solution that easily works over these heterogeneous resources.

As next-gen speeds up, these analyses will be less of a batch operation and more of a steady state system. You’ll have sequences continually coming off the line. This needs to be automated to keep up, but that is hard because of all the different systems.

Matt Tavis — Architectural design patterns in cloud computing

Matt works at Amazon helping customers move their work to the cloud, which often involves architectural changes to more efficiently use those resources. Scalability requires an architecture that takes advantage of the infrastructure. Scalability is a contract between the architecture and infrastructure. Seven lessons learned:

  1. Design for failure and nothing fails: avoid single points of failure in your system.

  2. Loose coupling sets you free: build in queues that allow different controllers in different systems. A messaging system ties the parts together.

  3. Implement elasticity: this is a fundamental property of the cloud. Don’t assume components are in fixed locations. Need designs that handle reboots and relaunch and dynamic configuration.

  4. Security in every layer: Each machine needs to be locked down and encrypted as needed. No longer hidden behind firewalls. Security groups handle this.

  5. Don’t fear constraints: re-think your architectural constraints to split the resources differently. Re-designing makes you more flexible over time and allows horizontal scaling.

  6. Think parallel: decompose jobs into simplest form and then parallelize using something like MapReduce.

  7. Leverage many storage options: object stores, local indexed data, persistent storage, relational databases. Easier to use many concurrently.
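Lessons 2 and 6 can be sketched together: a queue decouples the code that produces work from the workers that process it, and several workers run in parallel. This is an in-process stand-in only. In a cloud deployment the queue would be a messaging service and the workers separate instances; the squaring step is a placeholder for real analysis.

```python
import queue
import threading


def worker(tasks, results):
    """Pull work items until a None sentinel arrives; workers never talk to each other."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        results.put(item * item)  # placeholder for a real analysis step


tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for w in workers:
    w.start()

# The producer only knows about the queue, not about how many workers exist.
for n in range(5):
    tasks.put(n)
for _ in workers:
    tasks.put(None)  # one sentinel per worker
for w in workers:
    w.join()

squares = sorted(results.get() for _ in range(5))
print(squares)
```

Because the producer and workers share only the queue, workers can be added, removed or restarted without changing the rest of the system.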