Today I’m attending an Amazon workshop on Storing, Analyzing and Sharing Genomic Data in the Cloud. Organized by Deepak Singh, it’s a chance for folks to get together and discuss ideas for utilizing cloud architecture to solve biological problems. The goal is to understand the practical issues: the what, how and why of working on the cloud.
Deepak kicked off the day with an overview of AWS services, an impressive display of layered architectures aimed at developers.
James Hamilton — Cloud Computing Economies of Scale
Economies of scale are a very important driver of cloud computing. Large buyers of hardware can get up to 7x cost savings over medium-sized buyers on network, storage and administration. These savings can be passed along to smaller buyers while still maintaining a margin.
Infrastructure cost breakdown — how to make the most of resources.
James presents a detailed cost breakdown of huge server farms; the full numbers are available on his blog. Servers have a life of 3 years, the space 10 years, and power is billed monthly, so you need to examine costs on the same time scale. If you do this, then 34% of the costs are related to power, networking is 8% of costs, and 54% is actually spent on the servers themselves. Server utilization and efficiency are the most important things you can improve: use 'em while you've got 'em.
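The "same time scale" point can be sketched with straight-line amortization: convert each capital cost to a monthly figure before comparing it with the monthly power bill. The dollar amounts below are hypothetical, not from the talk; only the lifetimes (3 years for servers, 10 for space, monthly for power) come from it.

```python
# Amortize capital costs to monthly figures so they can be compared
# with the monthly power bill. All dollar figures are hypothetical.
def monthly(capital_cost, lifetime_years):
    """Straight-line amortization of a capital cost to dollars per month."""
    return capital_cost / (lifetime_years * 12)

servers = monthly(2_000_000, 3)    # servers: 3-year life
space   = monthly(1_200_000, 10)   # data center space: 10-year life
power   = 40_000                   # power is already billed monthly

total = servers + space + power
shares = {name: round(100 * cost / total, 1)
          for name, cost in [("servers", servers), ("space", space), ("power", power)]}
print(shares)
```

With these illustrative inputs the server share dominates, which is the same qualitative picture as the talk's breakdown.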
Amazon spot instances are a way to take advantage of periodicity in server usage. You can get cheaper prices when overall usage is lower: straight-up economics based on demand. It is a market for bidding on compute power. In the AWS console you can see the pricing history for instances and try to forecast good times to run.
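The forecasting idea can be sketched offline. Given (hour-of-day, price) samples, which in practice would come from the EC2 spot price history in the console or API, average by hour to find the cheap windows. The sample prices here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical spot-price samples: (hour of day, $/hour for an instance).
# Real data would come from the EC2 spot price history.
samples = [
    (2, 0.031), (2, 0.029), (9, 0.058), (9, 0.062),
    (14, 0.071), (14, 0.069), (22, 0.034), (22, 0.036),
]

# Group observed prices by hour of day.
by_hour = defaultdict(list)
for hour, price in samples:
    by_hour[hour].append(price)

# Average price per hour-of-day; the minimum is the window to bid in.
avg = {hour: sum(prices) / len(prices) for hour, prices in by_hour.items()}
cheapest_hour = min(avg, key=avg.get)
print(cheapest_hour)  # the overnight window is cheapest in this sample
```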
Power distribution and mechanical system efficiency
Power Usage Effectiveness (PUE): the total power a data center draws divided by the power actually delivered to the servers. A good data center ratio is 1.5, meaning 0.5W is lost to distribution and cooling for every 1W that reaches a server. Real ratios are often 2-3 and up, partly because of idle time: servers use about 55% of their power just sitting there.
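The PUE arithmetic is a one-liner; a quick sketch makes the direction of the ratio concrete:

```python
def pue(total_facility_watts, it_watts):
    """Power Usage Effectiveness: total power drawn / power delivered to IT gear."""
    return total_facility_watts / it_watts

# A "good" data center: 1.5 W drawn for every 1 W that reaches the servers,
# i.e. 0.5 W lost to distribution and cooling.
print(pue(1.5, 1.0))  # 1.5
print(pue(3.0, 1.0))  # the 2-3-and-up range seen in real facilities
```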
As power gets transferred from a high-voltage power station to the servers, you lose 11%: high voltage to substation to UPS through two transformers to the server. Not much can be done to trim this, so it's not a productive target for cost savings, but it is useful to look at for reliability. Server power supplies can be improved with VRM/VRD on-board step-down: from 80% to 95% efficiency.
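That ~11% loss is the compounding of small per-stage losses along the distribution path. A sketch with hypothetical per-stage efficiencies (the stages are from the talk, the numbers are illustrative) shows how quickly they multiply:

```python
from math import prod

# Hypothetical per-stage efficiencies along the distribution path:
# high-voltage lines -> substation -> UPS -> transformer -> transformer.
stage_efficiency = [0.997, 0.98, 0.96, 0.97, 0.98]

overall = prod(stage_efficiency)   # fraction of power that reaches the server
loss = 1 - overall
print(f"delivered: {overall:.1%}, lost: {loss:.1%}")  # roughly 11% lost
```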
The cooling system is 25% of total costs. The mechanical design of server rooms has been fairly constant over the last 30 years. The general design: air is moved down and cooled, runs through the servers and gets heated, rises back to the top, and the cycle repeats. An air-side economizer is a newer potential improvement: basically, open the window in winter.
Good things to help with cooling:

1. Raise data center temperatures: servers can run at up to 90 degrees.
2. Avoid leaks around airflow.
3. Use cooling towers instead of A/C.
4. Use outside air.
Cloud computing economics and innovation
Deep automation is only affordable when scaled over a large user base, but at scale you can get to full automation, which is awesome. There are really nice parallels with open source: do you have enough people to do this?
Other scaled savings: software and hardware investments, focused people working on issues like cooling. Scale also allows multiple datacenters to be used which puts the servers closer to users, and allows for cross datacenter redundancy.
Server utilization at most data centers maxes out at about 30%; most are at 10-20%. This is very hard to change due to peaks and valleys in demand. Going to Amazon allows you to try new things and innovate without investing in these partially used servers: you can scale quickly to try out analytics. Similarly, AWS's pace of innovation is very quick, with lots of services coming online.
Chris Dagdigian — Scriptable infrastructures for scientific computing
Scriptable infrastructures are the latest way to embrace Larry Wall's laziness maxim. They help reduce the friction and barriers to doing difficult distributed work. With Amazon's APIs you can fully script everything that you do: servers, storage, databases, scaling, and so on. Our IT infrastructure can be 100% scriptable.
The scripting infrastructure is the baseline for doing real work: putting together a beautiful array of complex systems and pipelines. Chef is a very useful platform for configuration management and integration; it's natively aware of cloud platforms and cloud instance metadata. Run a few instances continuously and scale out as needed with Chef. Recipes are maintained as source code in Git. Awesome.
What does the cloud mean for IT people? A huge restructuring that blurs the lines between IT and science research. The role of the system administrator moves toward system architect, and scientists get more control over server distribution.
In assessing the costs of moving to the cloud, the main challenge is actually accounting for all the internal costs associated with IT. Once you do that, you can see where to make big savings.
David Dooling — Architecture for the Next Challenges in Genomics
Metagenomics in the context of the human microbiome: the 90 trillion microbial cells that live on the human body. 300 samples from 40 or 50 individuals will be sequenced, producing 3Tb of data.
The analysis pipeline:

1. Remove human hits.
2. Align to known bacteria.
3. Align to known viruses.
4. Align the remainder, in protein space, to the nr database.

Protein-space alignment is expensive compared to BWA or bowtie: using blastx, this will take 13 million core-hours. At the genome center this costs $0.03/core-hour, so in house the computation costs about $400,000, on top of $300,000 to generate the data. Amazon costs are ~$3 million, but Amazon does have the scale to handle the computation in time. The tricky thing is that there are a lot of overhead costs that are very hard to count up at a university.
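The in-house figure follows directly from the quoted core-hour rate; a back-of-envelope check, using only the numbers from the talk:

```python
core_hours = 13_000_000   # blastx alignment estimate from the talk
rate_in_house = 0.03      # $/core-hour at the genome center

in_house_cost = core_hours * rate_in_house
print(f"${in_house_cost:,.0f}")  # $390,000, i.e. the quoted ~$400,000

# The ~$3M Amazon figure implies an effective per-core-hour rate:
rate_aws = 3_000_000 / core_hours
print(f"${rate_aws:.2f}/core-hour")  # roughly $0.23/core-hour
```

The gap between the two rates is exactly the hard-to-count university overhead mentioned above: the $0.03 rate excludes much of it, the Amazon rate bundles it in.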
The right solution is a hybrid of local and cloud resources: use local resources where the scale makes sense and the cost is right, and use the cloud where you can't scale out locally. The challenge is architecting a solution that easily works over these heterogeneous resources.
As next-gen sequencing speeds up, these analyses will be less of a batch operation and more of a steady-state system: you'll have sequences continually coming off the line. This needs to be automated to keep up, which is hard because of all the different systems involved.
Matt Tavis — Architectural design patterns in cloud computing
Matt works at Amazon helping customers move their work to the cloud, which often involves architectural changes to use those resources more efficiently. Scalability requires an architecture that takes advantage of the infrastructure; scalability is a contract between the architecture and the infrastructure. Seven lessons learned:
Design for failure and nothing fails: avoid single points of failure in your system.
Loose coupling sets you free: build in queues that allow different controllers in different systems. A messaging system ties the parts together.
Implement elasticity: this is a fundamental property of the cloud. Don't assume components are in fixed locations. You need designs that handle reboots, relaunches and dynamic configuration.
Security in every layer: each machine needs to be locked down, with encryption as needed, because machines are no longer hidden behind firewalls. Security groups handle this.
Don't fear constraints: re-think your architectural constraints and split the resources differently. Re-designing makes you more flexible over time and allows horizontal scaling.
Think parallel: decompose jobs into simplest form and then parallelize using something like MapReduce.
Leverage many storage options: object stores, local indexed data, persistent storage, relational databases. The cloud makes it easier to use several of them concurrently.
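The "loose coupling" lesson in miniature: put a queue between producer and consumer so either side can be restarted or scaled out without the other noticing. Here Python's stdlib `queue` stands in for a managed messaging service such as SQS; the doubling "work" is a placeholder for real processing.

```python
import queue
import threading

# The queue decouples the producer from the workers: parts only share
# messages, not locations, so workers can come and go freely.
work = queue.Queue()
results = []

def worker():
    while True:
        job = work.get()
        if job is None:          # sentinel: shut this worker down
            break
        results.append(job * 2)  # stand-in for real processing
        work.task_done()

# Scale out: three interchangeable workers draining one queue.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for job in range(5):
    work.put(job)
for _ in threads:
    work.put(None)               # one sentinel per worker
for t in threads:
    t.join()

print(sorted(results))  # [0, 2, 4, 6, 8]
```

Results arrive in no particular order, which is the point: the producer made no assumptions about which worker handles which job.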