The second day of the Bioinformatics Open Source Conference (BOSC) started off with a session on Cloud Computing.
Matt Wood — Into the Wonderful
Matt worked at Ensembl/Sanger on sequencing pipelines, now technology evangelist at Amazon dealing with Cloud Computing.
Start off by talking about data, well, lots of data. Challenges are distributing and making data available under lots of constraints: throughput, data management, software, availability, reproducibility, cost. So far, we’ve managed to move from Gb-scale to Tb-scale work. Open source software has played a role in making software easily available, and in building active development communities.
Work so far is the foundational blocks for next steps; how can we optimize with existing tools and infrastructure? Optimize for developer productivity, and also for the wider development community, because it lowers barriers to entry. Goals are to abstract away the difficult and tedious parts, maximizing time you can spend on the fun parts.
What are the building blocks that are available to do this? Cloud provides a collection of foundations: compute, storage, databases, automation; couple these with workflows, analytics, warehouses and visualization. Data and compute effectively move into your materials and methods. Usability is the most important metric for tools: available, flexible, and reliable. Cloud is an awesome example of this: quick to get a new image, and an API to flexibly access it.
Machine images are really key to sharing code, data, configuration and services with others. Also a reproducible representation of work that was done. You have a ton of moving parts and want to be able to capture these: tools like Puppet and Chef allow you to reproduce this as well.
Amazon data is replicated across multiple availability zones for redundancy, with each availability zone separate. However, data stays local to its region: it does not move, say, from the US to Europe unless you move it. There are a ton of options for building different infrastructure and managing costs: standard, reserved and spot instances. Spot instances are a great way to get access to cheap compute, but you need to architect for interruption.
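The "architect for interruption" point boils down to making work resumable. A minimal sketch of one common pattern, checkpointing progress to disk so a replacement instance can pick up where a terminated spot instance left off (the file name and work loop here are illustrative, not any Amazon API):

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def load_checkpoint():
    # Resume from the last completed item if a previous instance was interrupted.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as fh:
            return json.load(fh)["done"]
    return 0

def save_checkpoint(done):
    # Write to a temp file and rename atomically, so an interruption
    # mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"done": done}, fh)
    os.replace(tmp, CHECKPOINT)

def process(items):
    start = load_checkpoint()
    for i in range(start, len(items)):
        # ... do the real per-item work on items[i] here ...
        save_checkpoint(i + 1)
    return len(items) - start  # number of items actually processed this run
```

A restarted run then processes only what the interrupted run did not finish.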
For Hadoop, Amazon’s ElasticMapReduce takes away a lot of the pain of setting up a Hadoop server. Another example is Amazon’s Relational Database Service for MySQL/Oracle. General idea is lowering the barrier to utilization.
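The jobs ElasticMapReduce runs follow the standard map/reduce shape. A toy, locally runnable sketch of that pattern (the tab-separated "alignment summary" input format is invented for illustration):

```python
from collections import defaultdict

def mapper(line):
    # Emit (key, 1) pairs; here the key is a chromosome name taken
    # from the first column of a tab-separated line.
    chrom = line.split("\t")[0]
    yield chrom, 1

def reducer(pairs):
    # Sum counts per key, as a Hadoop reducer would after the shuffle.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["chr1\tread1", "chr2\tread2", "chr1\tread3"]
result = reducer(pair for line in lines for pair in mapper(line))
# result == {"chr1": 2, "chr2": 1}
```

On EMR the same mapper/reducer logic runs distributed across a cluster; the service handles provisioning and the shuffle between the two phases.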
Richard Holland — Securing and sharing bioinformatics in the cloud
Talking about commercial deployment of open source software. Proof-of-concept cloud architecture with Ensembl plus custom databases, and open source applications on top: Ensembl, PlasMapper and GeneAtlas.
For security, used OpenAM to authenticate, then encrypted data on disk, SSL encryption of communication, hide Apache information, and firewalled.
Recommendations: firewall externally and internally, validate file uploads, don’t store uploaded files in a web-accessible location, avoid GET parameters where possible.
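A minimal sketch of the upload recommendations above: whitelist extensions, discard the client-supplied name, and store outside the document root (the directory path and extension list are illustrative assumptions):

```python
import os
import secrets

ALLOWED_EXTENSIONS = {".fasta", ".fastq", ".txt"}  # example whitelist
UPLOAD_DIR = "/var/data/uploads"  # illustrative path outside the web root

def safe_upload_path(filename):
    # Validate the extension against a whitelist, then discard the
    # client-supplied name entirely (avoids path traversal) and
    # generate a random server-side name instead.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError("file type not allowed: %s" % ext)
    return os.path.join(UPLOAD_DIR, secrets.token_hex(16) + ext)
```

Since the stored path never incorporates user input, names like `../../etc/passwd` are rejected or neutralized before anything touches the filesystem.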
Chunlei Wu — Mygene.info: Gene Annotation as a Service – GAaaS
Chunlei starts off with a migration story for BioGPS, a gene-centric annotation data representation. They started with a relational database solution, then switched to a document-based solution: JSON-style objects in CouchDB. Infrastructure uses Tornado on top of that, with nginx in front. Web-based API for queries, alongside the web application: mygene.info.
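The appeal of the document model is that all annotation for a gene lives in one nested object instead of being joined across relational tables. A hypothetical gene document in that style, with a small dotted-path accessor of the kind document APIs commonly offer (the exact document fields here are illustrative, not mygene.info's actual schema):

```python
# A gene-centric document: one nested JSON object per gene.
doc = {
    "_id": "1017",
    "symbol": "CDK2",
    "name": "cyclin dependent kinase 2",
    "taxid": 9606,
    "refseq": {"rna": ["NM_001798", "NM_052827"]},
}

def field(document, dotted):
    # Walk a dotted path like "refseq.rna" through nested dicts.
    value = document
    for part in dotted.split("."):
        value = value[part]
    return value

print(field(doc, "refseq.rna"))
```

Adding a new annotation source then means adding a key to the document rather than migrating a relational schema.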
Ntino Krampis — Cloud BioLinux: open source, fully-customizable bioinformatics computing on the cloud for the genomics community and beyond
Paradigm shift associated with next-gen sequencing data and small sequencing machines. Now small labs can handle their own sequencing; the second step is how you analyze it. CloudBioLinux is a community project associated with JCVI and NEBC Bio-Linux. Ntino demos using CloudBioLinux, connecting with a graphical client and making data available to collaborators by sharing AMIs.
Olivier Sallou — OBIWEE : an open source bioinformatics cloud environment
OBIWEE is a bioinformatics framework based on the Torque job scheduler. It combines three pieces of software: a workflow authoring tool, a virtual Torque cluster and a set of deployment scripts for private or public clouds. SLICEE is the workflow authoring tool, with a front end to submit jobs. It has API, command-line and GUI interfaces for running.
Brian O’Connor — SeqWare: Analyzing Whole Human Genome Sequence Data on Amazon’s Cloud
SeqWare is an open source toolset for large-scale sequence analysis. The project ported it to EC2 for scaling out. It uses the Pegasus workflow engine to define workflows and run them on clusters. SeqWare has multiple levels to interact with: a workflow description language and a Java class interface. It can also provision and bundle dependencies.
To port to EC2, they used StarCluster with a custom AMI containing the dependencies. 9 human genomes analyzed, at a cost of $1000 per genome and $100 per exome.
Lars Jorgensen — Sequencescape – a cloud enabled Laboratory Information Management Systems (LIMS) for second and third generation sequencing
Sequencescape is the LIMS at the Sanger institute, so it can definitely scale. Supports all sequencing technologies. Development is open on GitHub, and what’s there matches what is running at Sanger currently. Sanger data needs to be publicly released 60 days after quality control. Really impressed.
LIMS handles pretty much everything: from freezer tracking, study management to automation, workflows to data release and reporting. Live demo is sweet and covers every use case you could imagine; runs on a laptop.
Enis Afgan — Enabling NGS Analysis with(out) the Infrastructure
Enis talks about CloudMan, Galaxy on the cloud with a reusable backend for scaling analyses. Lets you do NGS analyses on Amazon without needing any computational resources. Has even more tools and reference datasets than the Galaxy main site. It offers a wizard-guided setup directly in the browser, is customizable and can be shared with other users. Contains tons of NGS tools built on top of CloudBioLinux.
Aleksi Kallio — Hadoop-BAM: A Library for Genomic Data Processing
Chipster is the main project, and Hadoop-BAM was abstracted from it. Designed for dealing with the large numbers of BAM files coming out of NGS analyses. Detects BAM record boundaries from the compression blocks and record data so files can be split for parallel processing. Has a Picard-compatible API. Data import/export is slow, but otherwise it scales well with parallelization; used for batch pre-processing.
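The splitting trick relies on BAM's BGZF container format: each block is a gzip member whose header carries a "BC" extra subfield recording the block size, so candidate block starts can be found by scanning raw bytes. A simplified sketch of that detection (real splitters like Hadoop-BAM additionally verify that a plausible BAM record follows, which is omitted here):

```python
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic with the FEXTRA flag set

def candidate_block_starts(data):
    # Scan raw bytes for possible BGZF block starts: the gzip magic
    # followed, 12 bytes in, by the "BC" extra-subfield identifier
    # that BGZF uses to record the compressed block size.
    starts = []
    pos = data.find(BGZF_MAGIC)
    while pos != -1:
        if data[pos + 12:pos + 14] == b"BC":
            starts.append(pos)
        pos = data.find(BGZF_MAGIC, pos + 1)
    return starts
```

Finding these boundaries is what lets a Hadoop job hand different byte ranges of one BAM file to different workers without a prior index.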