Scientific Python 2013, Day 2: Bioinformatics frameworks, open science and reproducibility

I’m at the 2013 Scientific Python conference in beautiful Austin, Texas. I’m helping organize this year’s Bioinformatics Symposium and learning about Python scaling and reproducibility from the Scientific Python community. These are my notes from the second day. See also my notes from the first day.

Ian Rees – Bioinformatics symposium: electron microscopy platform

Ian talked about software to handle data challenges associated with imaging. They principally focus on imaging of macromolecules using Cryo-EM. There is then a ton of processing before this can get into PDB as structures. 300+ active projects. Focus on archival, automation of record keeping, understanding how protocols change over time and sharing results with collaborators.

EMEN2 is their solution: an object-oriented lab notebook. It uses a protocol ontology to allow flexible queries of approaches, and takes a general approach to connecting records. Impressive display of a long 10 year project with 15,500 records. Records look like JSON documents of key-value pairs. Built on top of BerkeleyDB, with a Twisted web server. Provides integrated plotting and viewing of results, in addition to table-based viewing of projects and samples. Hooks in with the microscopy software so data uploads automatically. It provides a public JSON API to query the data with a constraint-based query language. Also has Python hooks to create extensions with controllers and Mako templates.

The connected image processing toolkit is EMAN2.

Larsson Omberg – Bioinformatics symposium: Synapse platform

Larsson discussed the approaches and tools that Sage Bionetworks uses to help improve reproducibility of science. The Synapse tool tries to handle reproducibility on a distributed scale with multiple collaborating scientists, as opposed to other projects which focus on single researchers. Example of usage for The Cancer Genome Atlas: 10,000 patients, 24 cancer types, and multiple inputs: variations, RNA-seq. Biggest challenge is coordination of multiple data sources and inputs. Data is automatically pushed into Synapse, followed by data freezes to allow analysis. Analysis results get pushed back into Synapse as well.

Synapse is a web framework that allows multiple usages of tools in multiple places, and registers results back with Synapse to coordinate them. The Python API allows you to query with SQL syntax and retrieve specific datasets which have key/value style metadata annotations in addition to the raw data. Impressive demo of uploaded results with lots of metadata: nice way to understand custom analysis and review results.

Synapse focuses on avoiding the self-assessment trap by moving this assessment into a centralized location. Also run challenges that help formalize this: Dream8 Challenges.

Joshua Warner – scikit-fuzzy

Joshua talked about his implementation of a SciPy toolkit for fuzzy logic: scikit-fuzzy. Has fuzzy c-means for smaller uses and needs full Cythonization. Has 100% test coverage. Provides foundational tools for fuzzy logic but focusing on community building to provide additional tools. Good questions about the most useful places for fuzzy logic usage: it’s a good prototyping step which includes some insight into the logic intuition for understanding approaches. It’s not especially useful for categorical variables.

Jack Minardi – Raspberry Pi sensor control

Jack talked about interacting with Raspberry Pi using pyzmq: controlled LEDs and motors. Raspberry Pi has general purpose input output pins which allow you to interact with other external devices. Used the Occidentalis Wheezy based operating system which provides a lot of pre installed tools over the base installations that Raspberry Pi recommends. Jack live demos blinking an LED, which one ups live software demos for sure. Another cool demo uses pyzmq to stream the xyz location of a device to a real time plotting tool.

There are an incredible number of open source tools for Raspberry Pi on GitHub that help manage interacting with the different hardware. There is a cool community around working with it.

Raspberry Pi and other hackable hardware tools also provide a wonderful teaching environment for programming. It is so much more satisfying to make something happen in real life, and it teaches all the important skills of installing, learning and debugging that you need in any kind of hacking. On a larger scale in genomics, efforts like the polonator from George Church’s lab offer an opportunity to learn all the hardware behind sequencing.

Jeff Spies – The Open Science Framework

Jeff discussed work at the Center for Open Science to build infrastructure and community around opening up science to reduce the gap between scientific goals (open) and science’s practical needs (papers, funding). Problem is that published science is not synonymous with accurate science. Worry about unconscious biases like Motivated Reasoning. Approach of OSF is to provide tools that work within scientific workflows to enable and incentivize openness. The Open Science Framework provides a simplified front end to Git, handling archiving and versioning of study data. Provides unique URLs to tag specific versions for publication. Goals are to make components API driven to allow other interfaces like IPython notebooks.

Burcin Eröcal – scientific software distribution

Burcin discussed approaches to replicate, build on and improve scientific work. Shows an example of Sage, which has multiple requirements and installs well: installation matters, a lot. His approach is lmonade, which provides customizable distribution of scientific software. Burcin does not think virtual machines solve this problem because they are not programmable to add updates. I wonder if lightweight solutions like docker help mitigate some of these concerns. In general, I haven’t heard of any usage of virtual machines at SciPy, which makes me sad because I think this is an important path for moving forward with complex installations. He also compares to the nix package manager. The main issue with this is that it requires explicit definition of dependencies, so it is not as flexible as scientific software needs.

John Kitchin – emacs org-mode for reproducible research

John discussed using emacs org-mode to create reproducible documents with embedded Python code. John’s GitHub has tons of useful examples of using this for blogging and book writing. Impressive demos, I need to dig into his org files for tips and tricks.

Lightning talks

Travis talked about solutions to packaging problems in Python, including conda. These look like useful alternatives to pip that might help with lots of installation problems we see with multiple dependencies. The Conda recipes GitHub repo has lots of existing tools.

Matt Davis gave a great advertisement for Software Carpentry. It’s a wonderful resource for teaching scientists to program. They also need teachers and help from the community, so volunteer please.

BLZ is a high IO distributed format inside of Blaze. The BLZ format document has additional documentation. This all gives you a distributed NumPy operating on massive arrays.


Scientific Python 2013, Day 1: Bioinformatics, IPython, parallel processing and machine learning

I’m at the 2013 Scientific Python conference in beautiful Austin, Texas. I’m helping organize this year’s Bioinformatics Symposium and learning about Python scaling and reproducibility from the Scientific Python community. These are my notes from the first day.

Fernando Perez – IPython overview

Fernando’s IPython talk describes the methods driving IPython development. He runs his talk inside an IPython notebook for major bonus points. The lifecycle of scientific work is to explore, collaboratively develop, parallelize, publish and educate. The entire process is fluid and moves backwards as we explore new ideas.

Fernando walks through the history of IPython, including showing the entire 200 lines of source code from the first version. Current cool tools include:

Nice discussion of IPython parallel and StarCluster for scaling. Current scaling target for IPython is jobs up to thousands of nodes, not for larger 10k+ node clusters.

Nice discussion of lessons on building the IPython community and engaging external developers. Open to new ideas from outside members. Have full group meetings on Google+ that are available for anyone to view and comment on.

Gaël Varoquaux – Simple python patterns for large data processing

Gaël’s talk focuses on simple python patterns to deal with big data. Focuses on alternatives to Hadoop because of limitations of installability on traditional academic clusters. Toolkit is SciPy, NumPy and Joblib. His approach to work: provide easy to debug failures, avoid solving hard problems if you can still get your work done with easier solutions, dependencies are troublesome because of installation problems.

One way to improve speed is to reduce dimensions by reducing features, with approaches like sklearn.random_projection.
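To make the idea concrete, here is a minimal sketch of Gaussian random projection using only the standard library. It is not the scikit-learn implementation; the function name random_projection, the toy data and the choice of 50 output dimensions are assumptions for illustration.

```python
import math
import random


def random_projection(rows, n_components, seed=0):
    """Project each row onto n_components random Gaussian directions."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Entries are drawn from N(0, 1/n_components) so that squared lengths
    # are preserved on average.
    scale = 1.0 / math.sqrt(n_components)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
              for _ in range(n_features)]
    return [[sum(row[i] * matrix[i][j] for i in range(n_features))
             for j in range(n_components)]
            for row in rows]


# Toy data: 5 samples with 1000 features, reduced to 50 features.
data_rng = random.Random(1)
data = [[data_rng.random() for _ in range(1000)] for _ in range(5)]
reduced = random_projection(data, n_components=50)
print(len(reduced), len(reduced[0]))
```

Downstream steps then work on 50 columns instead of 1000, which is where the speed-up comes from.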

Parallel processing patterns with focus on embarrassingly parallel loops. Choosing the right scale means avoiding competition for resources: disk and memory. joblib.Parallel uses queues to dispatch jobs. Would like to move to a job management solution that helps scaling up.

Caching is important to handle re-computing. joblib has a memoize pattern that handles large data inputs by using hashes for inputs. Uses disk-based persistence.
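The memoize pattern can be sketched with the standard library alone. This is a simplified stand-in, not joblib's code: the decorator name disk_memoize, the pickle-based hashing and the cache file layout are all assumptions made for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def disk_memoize(cache_dir):
    """Cache a function's results on disk, keyed by a hash of its inputs."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            # Hash the pickled arguments so large inputs become a short key.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha1(payload).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as handle:
                    return pickle.load(handle)
            result = func(*args, **kwargs)
            with open(path, "wb") as handle:
                pickle.dump(result, handle)
            return result
        return wrapper
    return decorator


calls = []
cache_dir = tempfile.mkdtemp()


@disk_memoize(cache_dir)
def column_sums(rows):
    calls.append(1)  # record each real computation
    return [sum(col) for col in zip(*rows)]


first = column_sums([[1, 2], [3, 4]])
second = column_sums([[1, 2], [3, 4]])  # served from the on-disk cache
print(first, second, len(calls))
```

The second call never runs the function body, and because the result lives on disk it would also survive a restart of the interpreter.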

How to make IO fast: use fast compression to use more CPU and access sections of data. BAM’s blocked gzip approach is a good example of this in bioinformatics. On the Python side, PyTables handles this.
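A rough sketch of the blocked compression idea, using zlib from the standard library: compress fixed-size blocks independently and keep an index, so reading a slice only decompresses the blocks it touches. This is not the BGZF or PyTables format; the block size, function names and index layout are assumptions.

```python
import zlib

BLOCK_SIZE = 1024  # bytes of uncompressed data per block (arbitrary choice)


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as independent blocks and return (blob, index).

    index[i] is the (offset, length) of compressed block i inside blob.
    """
    chunks, index, offset = [], [], 0
    for start in range(0, len(data), block_size):
        chunk = zlib.compress(data[start:start + block_size])
        index.append((offset, len(chunk)))
        chunks.append(chunk)
        offset += len(chunk)
    return b"".join(chunks), index


def read_range(blob, index, start, end, block_size=BLOCK_SIZE):
    """Return data[start:end], decompressing only the blocks that overlap."""
    first, last = start // block_size, (end - 1) // block_size
    pieces = []
    for block in range(first, last + 1):
        offset, length = index[block]
        pieces.append(zlib.decompress(blob[offset:offset + length]))
    joined = b"".join(pieces)
    local_start = start - first * block_size
    return joined[local_start:local_start + (end - start)]


data = b"ACGT" * 5000  # 20,000 bytes of toy sequence
blob, index = compress_blocks(data)
print(read_range(blob, index, 4100, 4110) == data[4100:4110])
```

The trade-off is a slightly worse compression ratio than one big stream in exchange for cheap random access.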

More meta point in the talk: take seriously the cost of complexity in your code.

Leif Johnson – tools for coding and feature learning

Leif’s talk focuses on machine learning approaches to coding complex features into simple, easier to process alternatives. The two main advantages: easier classification and better interpretability of complex data. Simplest way to code is to project your data into the columns of a coding matrix, which preserves the dimensionality of the data. Sparse coding is a good general approach to doing this. Principal Component Analysis is widely used but some datasets fail its assumptions.

For subsampling, a good compromise is k-means to minimize without excessive bias. Restricted Boltzmann Machines (RBMs), specifically MORB implementation are worth looking at as well. Uses Theano to optimize under the covers.

Algorithms help you learn similar things but the tricky part is emphasizing how to encode your data.

Olivier Grisel – Trends in machine learning

Olivier’s talk focuses on scikit-learn blackbox models, probabilistic programming with PyMC, and deep learning with PyLearn2 and Theano. scikit-learn is a blackbox library and provides a unified API on top of multiple classifiers. Handles multiple inputs for classifiers: binary inputs, multitype inputs, higher level features extracted on top of raw data.

Limitations of machine learning as a blackbox. Feature extraction is highly domain specific. Models make statistical assumptions which may not hold for your specific data. Need tools to be able to differentiate which approaches to take. The flow chart on using machine learning models helps with this problem. The second problem is that blackbox models can’t explain what they learned. It might work but you don’t know why.

Probabilistic programming with generic inference engines helps avoid the lack of explainability problem. This models unknown causes with random variables, using Bayesian approaches to identify them. MCMC is the most widely used approach. There is a talk later on using the Variational Message Passing (VMP) approach. Probabilistic programming has lots of benefits related to telling a story around the data, but the costs are up front in learning how to build models and choose priors.

Third approach is deep learning: deep = architectural depth. It emphasizes the number of layers between inputs and outputs: linear models have depth 0, neural networks and decision trees have depth 1, and ensembles and two-layer neural networks have depth 2. Depth 0 handles linearly separable data, depth 1 solves the XOR problem, depth 2 solves the XOR problem in multiple dimensions. Depth 2 power is why Ensemble trees have been so successful on more difficult problems.
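The XOR point is easy to check by hand. The sketch below uses hand-picked weights and a step activation, which are my assumptions rather than anything shown in the talk: a coarse grid search finds no single linear unit that reproduces XOR, while one hidden layer does.

```python
from itertools import product


def step(value):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if value > 0 else 0


def linear_model(x1, x2, w1, w2, bias):
    """Depth 0 in the talk's terms: a single thresholded linear unit."""
    return step(w1 * x1 + w2 * x2 + bias)


def hidden_layer_model(x1, x2):
    """Depth 1: one hidden layer with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # OR and not AND == XOR


inputs = list(product([0, 1], repeat=2))
xor = [a ^ b for a, b in inputs]

# Search a coarse grid of weights: no single linear unit reproduces XOR.
grid = [i / 2 for i in range(-6, 7)]
linear_solves_xor = any(
    [linear_model(a, b, w1, w2, bias) for a, b in inputs] == xor
    for w1, w2, bias in product(grid, repeat=3)
)

print(linear_solves_xor)
print([hidden_layer_model(a, b) for a, b in inputs] == xor)
```

The grid search is only a demonstration, not a proof, but XOR is not linearly separable, so no choice of weights would succeed.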

More complicated models have multiple hidden representations via RBMs: an unsupervised training approach. The dropout approach trains neural networks with less overfitting. All of this recent code is in PyLearn2. Deep learning requires a lot of labeled data and GPU enabled code to be practical to run. Lots of research ongoing in this area.

Brian Granger – software engineering in IPython

Brian is talking about software engineering in IPython and what they’ve learned from multiple rounds of revisions. Brian wrote the current version of IPython notebook by leaving out lots of features in earlier versions that proved unmaintainable. General idea is to avoid over-architecting because features have hidden costs, due to developer time being a limited resource. Need a rational process for deciding how to spend developer time. Why are features problematic:

  • add complexity to code: makes it harder to understand and hack
  • add potential bugs
  • require documentation
  • require support
  • require developers to specialize
  • add complexity to the user experience
  • multiply like bunnies and are difficult to remove once implemented

Features have costs that need counting. Need to identify the simplest possible implementation that would be useful.

The hidden benefits of bugs: they’re a sign that people are using your software and tell you what features are useful. They’re an opportunity to improve testing of software.

This all requires a cultural solution. Hard to say no to enthusiastic developers and users. Approaches to ameliorate this:

  • Create a roadmap that discusses features/plans
  • Decide on a scope or vision for a project and communicate this

Rob Zinkov –

Rob discusses a Python interface to Infer.NET, a framework for Bayesian inference using graphical models. It allows you to stay in the Python ecosystem but call out to .NET. It’s not an IronPython wrapper, in which you’d lose a lot of familiar Python tools. The GitHub examples directory provides useful code to get started with the tool. Rob does a nice live demo that interacts with Infer.NET and matplotlib. Advantage of this over PyMC is that it provides non-MC code that can parallelize via message passing.

Chris Beaumont – Glue visualization

Chris’ talk discusses the Glue visualization framework. He differentiates different types of data expansion: big data = more data, and wide data = more experiments to integrate together. Most tools orient towards single datasets since the integration work is tricky and error prone. Glue provides multiple views on multiple datasets with linking, all in Python. Glue is a GUI that sits on top of matplotlib. By specifying logical connections between datasets, you can link multiple datasets and interactively select subsets and plot together. Glue interacts nicely with IPython notebook: can use qglue to go from IPython straight into Glue.

Bioinformatics mini-symposium

I’m chairing this session as well so these notes are extra scattered.

Aaron Quinlan talked about GEMINI, a framework for annotating and querying genomic variations.

Ryan Dale talked about metaseq, a framework for integrated plotting and analysis of ChIP-seq/RNA-seq data. Gives you a pure python approach to plotting and analysis, including parallelization over multiple cores using multiprocessing. Can ask questions like how do peaks cluster around transcription start sites: cluster with k-means from scikit-learn. metaseq also handles large tables like RNA-seq count results by wrapping pandas.

Brent Pedersen talked about his poverlap library for significance testing of interval overlaps: do two sets of intervals overlap more than expected? poverlap wraps and parallelizes all of the work and provides a simple interface to add locality to shuffling and restrict by genomic regions. Can handle different null models: does transcription factor A overlap with B more than C? Allows you to calculate custom metrics during processing in arbitrary languages. Parallelizes with multiprocessing and IPython.
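The underlying shuffle test can be sketched in a few lines. This is not poverlap's interface or implementation; the function names, the single-region genome and the toy intervals are assumptions, and a real analysis would add the locality and region restrictions mentioned above.

```python
import random


def count_overlaps(a_intervals, b_intervals):
    """Count intervals in A that overlap at least one interval in B."""
    return sum(
        any(a_start < b_end and b_start < a_end for b_start, b_end in b_intervals)
        for a_start, a_end in a_intervals
    )


def shuffle_intervals(intervals, genome_length, rng):
    """Move each interval to a random position, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def overlap_p_value(a_intervals, b_intervals, genome_length, n_shuffles=1000, seed=0):
    """Empirical p-value: how often shuffled B overlaps A at least as much."""
    rng = random.Random(seed)
    observed = count_overlaps(a_intervals, b_intervals)
    at_least = sum(
        count_overlaps(a_intervals, shuffle_intervals(b_intervals, genome_length, rng)) >= observed
        for _ in range(n_shuffles)
    )
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least + 1) / (n_shuffles + 1)


# Toy half-open intervals on a 100 kb region; B was built to sit on top of A.
a = [(1000 * i, 1000 * i + 100) for i in range(1, 21)]
b = [(start + 50, end + 50) for start, end in a]
observed, p_value = overlap_p_value(a, b, genome_length=100000)
print(observed, p_value)
```

Each shuffle is independent of the others, which is why this kind of test parallelizes so easily.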

Blake Borgeson described a use case of machine learning in bioinformatics: identifying protein complexes from mass spectrometry data. Separate based on charge, density and size. Integrates prior knowledge of existing interactions. So the features for machine learning are priors plus mass spec output. Uses scikit-learn and clustering tools (clusterOne and MCL) to separate. See cool differences in interactions for different evolutionary subsets.

Pat Francis-Lyon describes her work identifying gene interactions with the goal of identifying pathways for therapies. Defining interactions is hard: difficult to define what you mean by interactions. Shows nice examples of interactions under different models: additive and multiplicative interactions. Used genomeSIMLA to simulate genetic data under many different genetic models. Used multiple supervised learning methods: SVM and neural networks.

Jacob Barhak talked about a tool to support modeling of diseases. Micro-simulation simulates individuals then combines them together into a picture of the population. By extracting MIST from a previous toolkit, they simplified installation to make it more widely useful. MIST provides a simulation language that it compiles into a Python script. Allows users to define arbitrary input rules and define distributions of population values.
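As a rough illustration of the micro-simulation idea (not MIST's simulation language or generated code), the sketch below simulates each individual separately and then summarizes the population. The yearly risk range, function names and summary fields are hypothetical.

```python
import random


def simulate_individual(rng, years, yearly_risk):
    """Return the year an individual becomes sick, or None if they stay healthy."""
    for year in range(1, years + 1):
        if rng.random() < yearly_risk:
            return year
    return None


def simulate_population(size, years, seed=0):
    """Simulate each individual separately, then summarize the population."""
    rng = random.Random(seed)
    onset_years = []
    for _ in range(size):
        # Hypothetical input rule: each person draws their own yearly risk.
        yearly_risk = rng.uniform(0.01, 0.05)
        onset_years.append(simulate_individual(rng, years, yearly_risk))
    sick = [year for year in onset_years if year is not None]
    return {
        "population": size,
        "ever_sick": len(sick),
        "prevalence": len(sick) / size,
    }


summary = simulate_population(size=10000, years=10)
print(summary)
```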

Cool ideas from discussions

The best part of a conference is tips and tricks from discussions with other developers. Here are some ideas to explore that I picked up during conversations and lightning talks:

  • Python 3’s concurrent futures (backported to 2.x) provides a nicer interface to multiprocessing that mimics Java’s futures. HT to Brent Pedersen.
  • Chris Mueller of Lab7 gave a lightning talk on his general web-based pipeline manager that drives their sequencing analysis software. They also are developing BioBuild, which helps with building bioinformatics tools.
  • Travis Oliphant talked about Numba, which translates Python syntax to LLVM. Provides impressive speed ups on par with C implementations.
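For the concurrent futures tip, a minimal sketch of the submit/result pattern looks like this. It uses ThreadPoolExecutor so it runs anywhere; ProcessPoolExecutor exposes the same interface on top of multiprocessing. The gc_content function and the toy sequences are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def gc_content(sequence):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


sequences = {"seq1": "GGCC", "seq2": "ATAT", "seq3": "ACGT"}

# Swapping in ProcessPoolExecutor keeps the same submit/result calls but
# runs the work in separate processes.
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(gc_content, seq): name
               for name, seq in sequences.items()}
    results = {futures[future]: future.result()
               for future in as_completed(futures)}

print(sorted(results.items()))
```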

Galaxy Community Conference 2012, notes from day 2

These are my notes from day 2 of the 2012 Galaxy Community Conference.

Ira Cooke: Proteomics tools for Galaxy

Goal is to develop pipelines and interactive visualizations for proteomics work. Awesome tools that provide raw data + pipeline run as part of a visualization built into Galaxy. Connects to raw spectrum data from higher level summary reports. On a higher level, trying to integrate Proteomic and Genomic approaches inside Galaxy. Available from two bitbucket repositories: protk and Protvis.

James Ireland: Using Galaxy for Molecular Assay Design

James works at 5AM solutions, where they’ve been using Galaxy for a year. He’s working on molecular assay design: identifying oligos to detect or quantify molecular targets. Need to design short assays avoiding known locations of genomic variation. Developed a Galaxy workflow for assay design, including wrapping of primer3, prioritizing and filtering of designed primers, examination of secondary structure.

Richard LeDuc: The National Center for Genome Analysis Support and Galaxy

NCGAS provides large memory clusters and bioinformatics consulting. You can access infrastructure if you have NSF funding. They provide a virtual machine hosting Galaxy on top of cluster infrastructure. The VM approach allows them to spin up Galaxy instances for particular labs. Underlying infrastructure is the Lustre filesystem. Do custom work on libraries: helped improve Trinity resource usage.

Liram Vardi: Window2Galaxy – Enabling Linux-Windows Hybrid Workflows

Provides hybrid Galaxy workflows with steps done on Linux and Windows: transparent to the user. Works by creating an interaction between Linux and Windows VMs using Samba and a VM shared directory. The Galaxy tool uses a Window2Galaxy command which does all of the wrapping.

David van Enckevort: NBIC Galaxy to Strengthen the Bioinformatics Community in the Netherlands

NBIC BioAssist provides bioinformatics assistance to help with analysis of biological data. Galaxy is used for training, collaboration and sharing of developed tools and pipelines. Also used to deal with reproducible research workflows for publications. They provide an NBIC public instance and are moving to a cloud Galaxy VM plus Galaxy module repository.

Ted Liefeld: GenomeSpace

GenomeSpace aims to make it easier to do integrative analysis with multiple datasets and tools. Facilitates connections between tools: Cytoscape, Galaxy, GenePattern, IGV, Genomica, UCSC. Provides an online filesystem for data, with import and export between tools.

Greg von Kuster: Tool Shed and Changes to Galaxy Distributions

Galaxy Tool Shed improvements to integrate closer with Galaxy. Galaxy now provides ability to install tools directly from the user interface. Kicks into live demo mode: when importing workflows it will tell you missing tools that require installation from tool sheds. Tools handle custom datatypes. Allow removal of tools through user interface. Can install dependencies directly. Incredibly awesome automation and interaction improvements for managing tools. External dependencies linked with exact versions for full reproducibility.

Larry Helseth: Customizing Galaxy for a Hospital Environment

Larry describes a use case in a HIPAA environment: locked down internet and corporate browser standards. Bonuses are solid IT and resources. Exome sequence analysis work: annotation with SeattleSeq and Annovar. Everything requires validation before full clinical use.

Nate Coraor: Galaxy Object Store

Galaxy can access object stores like S3 and iRODS using plugin architecture. Extracted out access of data to not be directly on files, but rather through high level accessor methods. This lets you have complete flexibility for storage, managing where data is behind the scenes. This lets you push data to compute resources: so you could store on S3 and compute directly on Amazon.
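The accessor idea can be sketched as a small plugin interface. None of these class or method names come from Galaxy's code; they are hypothetical stand-ins that show tool code depending only on high level put/get calls while the backend is swapped underneath.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical accessor interface; not Galaxy's actual class."""

    @abstractmethod
    def put(self, dataset_id, data): ...

    @abstractmethod
    def get(self, dataset_id): ...


class DiskObjectStore(ObjectStore):
    """Stores each dataset as a file under a base directory."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, dataset_id):
        return os.path.join(self.base_dir, "dataset_%s.dat" % dataset_id)

    def put(self, dataset_id, data):
        with open(self._path(dataset_id), "wb") as handle:
            handle.write(data)

    def get(self, dataset_id):
        with open(self._path(dataset_id), "rb") as handle:
            return handle.read()


class InMemoryObjectStore(ObjectStore):
    """Stand-in for a remote backend such as S3 or iRODS."""

    def __init__(self):
        self._objects = {}

    def put(self, dataset_id, data):
        self._objects[dataset_id] = data

    def get(self, dataset_id):
        return self._objects[dataset_id]


def line_count(store, dataset_id):
    """Tool code only sees the accessor methods, never a file path."""
    return len(store.get(dataset_id).splitlines())


stores = [DiskObjectStore(tempfile.mkdtemp()), InMemoryObjectStore()]
for store in stores:
    store.put(1, b"chr1\t100\nchr1\t200\n")
counts = [line_count(store, 1) for store in stores]
print(counts)
```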

Jaimie Frey: Galaxy and Condor integration

Wrote Galaxy module to run tasks on Condor clusters. Checked into galaxy-central as of yesterday. Use Parrot virtual filesystem to manage disk I/O to analysis machines.

Brian Ondov: Krona

Krona displays hierarchical data as zoomable pie charts. Has a tool in tool shed and can interact with charts directly in Galaxy.

Clare Sloggett: Reusable, scalable workflows

Usage example: cuffdiff analysis for large number of inputs. How can you readily do this without a lot of clicking on different workflows? Current approach: write a script with the API, but not a great way to do this through the user interface currently. John steps up, ready to work on the problem.

John Chilton: Galaxy VM Launcher

Built a Galaxy workflow for clinical variant detection. One concern about CloudMan was storage costs: CloudMan depends heavily on EBS but you can save money by using the local instance store. Galaxy VM Launcher configures tools, genome indices, users and upload data all from commandline. Awesome.

Pratik Jagtap: Galaxy-P

Galaxy-P works with Galaxy for proteomics. Proteomics work is super popular at this year’s GCC. Trying to tie together lots of discussions today: windows access from Galaxy, visualization, and push to cloud resources.

Geir Kjetil Sandve: The Genomic Hyperbrowser: Prove Your Point

Genomic Hyperbrowser provides custom Galaxy with 100 built in statistical analyses for biological datasets. Provides top level questions, using the correct statistical test under the covers. Provides nice output with simplified and detailed answers along with full set of tests used.

Björn Grüning: ChemicalToolBoX

Provides a Galaxy instance for Cheminformatics: drug design work. Tools allow drawing of structures, upload into Galaxy. Wrapped lots of tools for chemical properties, structure search, compound plotting and molecular modification.

Breakout: Automation Strategies for Data, Tools, & Config

During the Galaxy breakout sessions, I joined folks who’ve been working on strategies to automate post-Galaxy tool and data installation. The goal was to consolidate implementations that install reference data, update Galaxy location files, and eventually install tools and software. The overall goal is to make moving to a production Galaxy instance as easy as getting up and running using ‘sh’

The work plan moving forward is:

  • Community members will look at building tools that include dependencies and sort out any issues that might arise with independent dependency installation scripts through Fabric.

  • Galaxy team is working on exposing tool installation and data installation scripts through the API to allow access through automated scripts. The current data installation code is in the Galaxy tree.

  • Community is going to work on consolidating preparation of pre-Galaxy base machines using the CloudBioLinux framework. The short term goal is to use CloudBioLinux flavors to generalize existing scripts. Longer term, we will explore moving to a framework like Chef that handles high level configuration details.

It was great to bring all these projects together and I’m looking forward to building well supported approaches to automating the full Galaxy installation process.

Galaxy Community Conference 2012, notes from day 1

These are my notes from day 1 of the 2012 Galaxy Community Conference. Apologies to the morning speakers since my flight got me here in time for the morning break.

Liu Bo: Integrating Galaxy with Globus Online: Lessons learned from the CVRG project

Work with Galaxy is part of the CardioVascular Research Grid project, which sets up infrastructure for sharing and analyzing cardiovascular data. Challenges they were tackling: distributed data at multiple locations, inefficient data movement and integration of new tools.

Integrated GlobusOnline as part of Galaxy: provides hosted file transfer. Provide 3 tools that put and pull data between GlobusOnline and Galaxy.

Wrapped: GATK tools to run through the Condor job scheduler, CummeRbund for visualization of Cufflinks results, and MISO for RNA-seq analysis.

Implemented Chef recipes to deploy Galaxy + GlobusOnline on Amazon EC2.

Gianmauro Cuccuru: Scalable data management and computable framework for large scale longitudinal studies

Part of CRS4 project studying autoimmune diseases, working to scale across distributed labs across Italy. It’s a large scale project with 28,000 biological samples with both genotyping and sequencing data. Use the OMERO platform, which is a client-server platform for visualization, then integrated specialized tools to deal with biobank data. Using seal for analysis on Hadoop clusters. Problem was that the programming/script interfaces were too complex for biologists, so wanted to put a Galaxy front end on all of these distributed services.

Integrated interactions with custom Galaxy tools, using Galaxy for short term history and the backend biobank for longer term storage. Used iRODS to handle file transfer across a heterogeneous storage system, providing interaction with Galaxy data libraries.

Valentina Boeva: Nebula – A Web-Server for Advanced ChIP-Seq Data Analysis

Nebula ChIP-seq server developed as a result of participation in GCC 2011. Awesome. Integrated 23 custom tools in the past year. The talk includes a live demo of tools in their environment, which has some snazzy CSS styling. Looks like lots of useful ChIP-seq functionality, and it would be useful to compare with Cistrome.

Sanjay Joshi: Implications of a Galaxy Community Cloud in Clinical Genomics

Want to move analysis to clinically actionable analysis: personalized, predictive, participatory and practical. The main current focus of individualized medicine is on two areas: early disease detection and intervention, and personalized treatments.

Underlying analysis of variants: lots of individual rare variants that have unknown disease relationships. Analysis architecture: thin, thick, thin: sequencer, storage, cluster. Tricky bit is that most algorithms are not parallel so adding more cores is not magic. Also need to scale storage along with compute.

Project Serengeti provides deployment of Hadoop clusters of virtualized hardware with VMware. Cloud infrastructure a great place to drive participatory approaches in analysis.

Enis Afgan: Establishing a National Genomics Virtual Laboratory with Galaxy CloudMan

Enis talking about new additions to CloudMan: support for alternative clouds like OpenStack, Amazon spot instances and using S3 Buckets as file systems.

New library called blend that provides an API to Galaxy and CloudMan. This lets you start CloudMan instances and manage them from the commandline, doing fun stuff like adding and removing nodes programmatically.

CloudMan being used on nectar: the Australian National Research Cloud. Provides a shell interface built on top with web-based interfaces via CloudMan and public data catalogues. Also building online tutorials and workshops for training on using best practice workflows.

Björn Grüning: Keeping Track of Life Science Data

Goal is to develop an update strategy to keep Galaxy associated databases up to date. Current approaches are FTP/rsync which have a tough time scaling with updates to PDB or NCBI. Important to keep these datasets versioned so analyses are fully reproducible.

Approach: use a distributed version control system for life science data. Provides updates and dataset revision history. Used PDB as a case study to track weekly changes in a git repository. Include revision version as part of dropdown in Galaxy tools, and version pushed into history for past runs for reproducibility.

The downside is that rollback and cloning are expensive since repositories get large quickly.

Vipin Sreedharan: Easier Workflows & Tool Comparison with Oqtans+

Oqtans+ (online quantitative transcript analysis) provides a set of easily interchangeable tools for RNA-seq analysis. Some tools wrapped: PALMapper, rQuant6-3, mTim. They have automated tool deployment via custom fabric scripts. The public instance with integrated tools is available for running, and there is also an Amazon instance.

Gregory Minevich: CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

Using C. elegans and looking for specific variations in progeny with neurological problems. Approach applied to other model organisms like Arabidopsis. Available as a Galaxy page with full workflows and associated data. Nice example of a complex variant calling pipeline. Provides nice variant mapping plots across the genome. Pipeline was bwa + GATK + snpEff + custom code to identify a final list of candidate genes.

Approach to identify deletions: look for unique uncovered regions within individual samples relative to full population. Annotate these with snpEff and identify potential problematic deletions.
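A small sketch of the uncovered-region comparison, assuming per-base coverage depths are already available as lists. This is not CloudMap's code; the function names, the min_length parameter and the toy coverage values are assumptions.

```python
def uncovered_regions(coverage, min_length=1):
    """Return half-open (start, end) runs where per-base coverage is zero."""
    regions, start = [], None
    for position, depth in enumerate(coverage):
        if depth == 0 and start is None:
            start = position
        elif depth != 0 and start is not None:
            if position - start >= min_length:
                regions.append((start, position))
            start = None
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


def unique_uncovered(sample_coverage, min_length=1):
    """Per sample, keep positions uncovered there but covered in every other sample."""
    zero_positions = {
        name: {pos for pos, depth in enumerate(cov) if depth == 0}
        for name, cov in sample_coverage.items()
    }
    unique = {}
    for name, cov in sample_coverage.items():
        others = set().union(*(z for n, z in zero_positions.items() if n != name))
        # Mask positions that are also uncovered elsewhere by marking them covered.
        masked = [1 if pos in others else depth for pos, depth in enumerate(cov)]
        unique[name] = uncovered_regions(masked, min_length)
    return unique


coverage = {
    "mutant_a": [5, 5, 0, 0, 0, 5, 0, 0, 5, 5],
    "mutant_b": [5, 5, 5, 5, 5, 5, 0, 0, 5, 5],
    "mutant_c": [5, 5, 5, 5, 5, 5, 0, 0, 0, 0],
}
print(unique_uncovered(coverage, min_length=2))
```

Positions 6 and 7 are uncovered in every sample, so they are treated as shared background rather than a candidate deletion in any one of them.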

Tin-Lap Lee: GDSAP — A Galaxy-based platform for large-scale genomics analysis

Genomic Data Submission and Analytical Platform (GDSAP): provides a customized Galaxy instance. Integrated tools like SOAP aligner and variant caller and now part of the toolshed. Push Galaxy workflows to MyExperiment: example RNA-seq workflow.

Platform also works with GigaScience to push and pull data in association with GigaDB.

Karen Reddy: NGS analysis for Biologists: experiences from the cloud

Karen is a biologist, talking about experiences using Galaxy. Moving from large ChIP-seq datasets back to analysis work. Used Galaxy CloudMan for analysis to avoid need to develop local resources. Custom analysis approach called DAMID-seq, translated into a Galaxy workflow all with standard Galaxy tools. Generally had great experience. Issues faced: is it okay to put data on the Cloud? It can be hard to judge capacity: use high memory extra large instances for better scaling.

Mike Lelivelt: Ion Torrent: Open, Accessible, Enabling

Key element of genomics software usage is that users want high level approaches but also be able to drill down into details = Galaxy. That’s why Ion Torrent is a sponsor of the Galaxy conference. IonTorrent software system sounds awesome: built on Ubuntu with a bunch of open source tools. It has a full platform to help hook into processing and runs. Have open source aligner and variant caller for IonTorrent data in the toolshed: code is all on iontorrent GitHub.

BOSC 2012 day 2 pm: Panel discussion on bioinformatics review and open source project updates

Talk notes from the 2012 Bioinformatics Open Source Conference.

Hervé Ménager: Mobyle Web Framework: New Features

Mobyle provides easy commandline tool integration. Provides a tool-based XML to describe programs that converts into web-based interface. BMPS provides easy pipeline design and execution. Workflow execution can be dynamically reused as a simple form. Provides data versioning with integrated correction: for instance, visualize an alignment in JalView, correct, then save as updated data file. Now that Taverna and Galaxy workflows integrate, it would be great to be able to do the same with Mobyle.

Eric Talevich: Biopython Project Update

Eric talks about Biopython, discussing new and exciting features in the past year. GenomeDiagram provides beautiful graphics of sequences and features. Lots of new format parsing included in Biopython: SeqXML, Abi chromatograms, Phylip relaxed format. Bio.phylo has merged in PAML wrappers and new drawing functionality, plus paper in late review. Now have BGZF support which helps with BAM and Tabix support. Working to support PyPy and Python 3. Bug fixes for PDB via Lenna, who is now a GSoC student that I’m mentoring doing variant work, including with PyVCF.

Hiroyuki Mishima: Biogem, Ruby UCSC API, and BioRuby

BioRuby update on latest work. The community has been working on ways to make being a BioRuby member easy. Original way to contribute is to be a committer or get patches accepted. To get more people involved, have moved to GitHub to help make it easier to accept pull requests. They’ve also introduced BioGems, a plugin system so that anyone can contribute associated packages. This includes a nice website displaying packages along with popularity metrics to make it easy to identify associated packages. bio-ucsc-api provides ActiveRecord API on top of UCSC tables. The future direction of BioGems will involve more quality control by peer-review, including required documentation and style.

Jeremy Goecks: A Framework for Interactive Visual Analysis of NGS Data using Galaxy

Jeremy talks about the Galaxy visualization framework to make highly interactive visual analysis for NGS datasets. The goal is to integrate visualizations + web tools. Jeremy then bravely launches into a live demo with Trackster. Trackster has dynamic filtering so can use sliders to view based on coverage, FPKM, or other metrics. Integration with Galaxy allows you to re-run tools with alternative parameters based on visualizations. Can create a cool tree of possible parameters that you can set in the Galaxy tool, easily varying selected parameters. This can then be dynamically re-run on a subset of the data letting you re-run and visualize multiple parameters easily. This is an incredibly easy way to find the best settings based on some known regions.
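
The parameter tree idea can be sketched outside Galaxy as a plain grid sweep over a small data subset, scored against known regions. Everything here (the toy filtering tool, the parameter names, the scoring rule) is a hypothetical stand-in I made up for illustration; it is not Trackster or Galaxy API code.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple


def parameter_grid(options: Dict[str, Iterable]) -> List[Dict[str, object]]:
    """Expand {name: [values]} into one dict per parameter combination."""
    names = sorted(options)
    return [dict(zip(names, values))
            for values in product(*(options[n] for n in names))]


def sweep(tool: Callable, data_subset, options: Dict[str, Iterable],
          score: Callable) -> List[Tuple[float, Dict[str, object]]]:
    """Re-run the tool for every combination and rank the settings by score."""
    results = []
    for params in parameter_grid(options):
        output = tool(data_subset, **params)
        results.append((score(output), params))
    # Sort on the score only; the parameter dicts themselves are not comparable.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Hypothetical tool: keep features passing coverage and expression cutoffs.
def filter_features(features, min_coverage, min_fpkm):
    return {name for name, cov, fpkm in features
            if cov >= min_coverage and fpkm >= min_fpkm}


# Toy subset of the data, plus regions we already trust.
subset = [("geneA", 30, 12.0), ("geneB", 5, 0.5),
          ("geneC", 22, 3.0), ("noise1", 8, 2.5)]
known = {"geneA", "geneC"}


def score_against_known(called):
    # Reward recovered known features, penalize everything else that passes.
    return len(called & known) - len(called - known)


ranked = sweep(filter_features, subset,
               {"min_coverage": [1, 10, 25], "min_fpkm": [1.0, 5.0]},
               score_against_known)
best_score, best_params = ranked[0]
```

Running on a subset keeps each re-run cheap, which is what makes trying many combinations practical.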

Spencer Bliven: Why Scientists Should Contribute to Wikipedia

New initiative through PLoS Computational Biology called Topic Pages. Why don’t scientists contribute more to Wikipedia? Some identified concerns: perceived inaccuracies, little time for outreach like this and no direct annotation or citation. If you contribute to Wikipedia, you get a citation. Don’t use it to fill up your CV. Topic pages are peer reviewed via Open Review, have a CC-BY license and are similar to a review article. Already have published topic pages and interest in contributing.

Markus Gumbel: scabio – a framework for bioinformatics algorithms in Scala

scabio contains algorithms written in Scala for the bioinformatics domain. Designed as teaching tool for a lecture + lab. Scala combines object oriented and functional paradigms. Akka framework provides concurrent and distributed functionality. Contains lots of teaching code for dynamic programming as a great resource. Easy BioJava 3 integration and reuse of existing libraries. Code available from GitHub.

Jianwu Wang: bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

bioKepler builds on top of Kepler, a scientific workflow system. It uses distributed frameworks for parallelization. Plans are to build bioActors for alignment, NGS mapping, gene prediction and more.

Limin Fu: Dao: a novel programming language for bioinformatics

Dao is a new programming language that supports concurrent programming, based on LLVM and easily loads C libraries with Clang. Provides native concurrent iterators, map, apply and find.

Scott Cain: GMOD in the Cloud

Generic Model Organism Database project has a running Cloud instance. Has Chado, Gbrowse, Jbrowse plus sample data. AMI information available from GMOD wiki. Tripal is a Drupal based web interface to Chado.

Ben Temperton: Bioinformatics Testing Consortium: Codebase peer-review to improve robustness of bioinformatic pipelines

Ben kicks off the panel discussion with a short lightning talk about the Bioinformatics Testing Consortium which provides a way to do peer-review on codebases. Idea came from dedicated unit testers, but need a "non-cost" way to do this that fits with current workflows: peer review. Idea is to register a project and have volunteer testers actually test it.

Panel discussion

BOSC wrapped up with a panel discussion centered around ideas to improve reviewing of the bioinformatics components of papers. The panel emerged from an open review discussion between myself and Titus Brown about ideas for establishing a base set of criteria for reviewers of bioinformatics methods. The 5 panelists were Titus Brown, a professor at Michigan State; Iain Hrynaszkiewicz, an open-access publisher with BMC; Hilmar Lapp, an editor at PLoS computational biology; Scott Markel from Accelrys; and Ben Temperton from the Bioinformatics Testing Consortium.

I took these notes while chairing, so apologies if I missed any points. Please send corrections and updates.

The main area of focus was around improving the bioinformatics component of papers at the time of review. Titus’ opening slides presented ideas to help improve replicability of results with the key idea being: does the software do what it claims in the paper?

  • Existing communities to connect with
  • When do tests get put in place? Last minute at review time is going to be painful for people. There is a lot of hard work involved overall.

    • Difficult to setup VM + replicable
    • Barriers to entry
    • On the other hand, are you doing good science? What is the baseline?
    • How can you help people do this?
    • Learning to develop this way from the start with training courses like Software Carpentry.
    • Can Continuous integration play a role? travis-ci
  • Defining what to do for reviewers
  • Tough question is that editors also must review as well, so job falls on both reviewers and editors. Get a before submission seal before being able to send in for review. This is where the Bioinformatics Testing Consortium could fit in.

  • Start up idea: provide service for testing software
    • Insight journals
    • Could you incentivize for testing? Provide journal credit.
  • Tight relationship between reviews + grants: need to enforce base level of actually having minimum criteria.

  • Provide incentives + credit

Another component of discussion was around openness of reviews:

  • BMC has even split between open and non-open peer review
  • What is the policy for who owns copyright on review?
  • From a testing side, it does need to be open to iterate
  • What effect can this have on your career? Bad reviews for a senior professor.

The final conclusion was to draw up a set of best practice guidelines for reviewers, publish this as a white paper, then move forward with website implementations that help make this process easier for scientists, editors and reviewers. If we as a community can set out what best practice is, and then make it as easy as possible, this should help spread adoption.

BOSC 2012, day 2 am: Carole Goble on open-source community building; Software Interoperability talks

Talk notes from the 2012 Bioinformatics Open Source Conference.

Carole Goble: If I Build It Will They Come?

Carole’s goal is to discuss experience building communities around open source software that facilitates reuse and reusability. 3 areas of emphasis: computational methods and scientific workflows (Taverna), social collaboration like MyExperiment and knowledge acquisition like RightField.

Goal of MyExperiment is to collaboratively share workflow. Awesomely, it now interoperates with Galaxy workflows. For knowledge acquisition, Seek4Science handles data sharing and inter-operates with IsaTab.

General philosophy: laissez-faire philosophy trying to encourage inclusiveness and ability to evolve over time. Some difficulty with being too free since it led to some ugly metadata when tools are too general. People prefer simple interfaces that work and that they can adapt to. By having flexibility and extensibility, have been able to work across multiple communities and widen adoption. Majority of work paid for in projects outside of Biology. Currently supporting 16 full time informaticians.

How can you get users? Be useful for something, by somebody, some of the time. Need to under promise and over deliver. 4 things that drive adoption: 1. Provide added value 2. Provide a new asset 3. Keep up with the field 4. Because there is no choice.

7 things that hinder adoption: 1. Not enough added value 2. Doesn’t work for non-expert users 3. No time or capacity to take on learning. The first 3 sum up to: the software sucks and is difficult to improve. Good solutions exist to help: Software Carpentry. 4. Cost of disruption 5. Exposure to risk 6. No community 7. Changes to work practice. Last 4 boil down to being too costly. Tipping points for usage are normally not technical. Another issue is that people haven’t heard of it so need to promote and discuss.

Some other problems that happen: adoption is incidental, familial adoption for people like me. Difficult to build for others that are not like me. Need to motivate others to help fill in gaps. Need to fully interoperate and not tweak to be non-back-compatible without specific breaks.

Adoption model: need to build an initial seed of users that can advocate and encourage others to use. Trickiest part is to establish this initial set of friends that love your software. This can primarily be a relationship building process: need trust and usability even for the best backend implementations. Need to understand where your projects fit and target the right people even if they are not exactly like you. Difficult.

In social environments, need to understand what drives and fears people have with regards to sharing. A real problem is people not giving credit for things they use. How can we improve micro-attribution in science? How do we harness this competitiveness? Provide reputation and attribution for what you do. Protect and preserve data to help make people more productive.

If you are lucky enough to build a community, then move to the next level and need to maintain and nurture. As you concentrate on maintenance, you can lose focus on the initial things that drove adoption. Version 2 syndrome: featuritis.

In summary, need to focus on: What is it that you provide and people value? Who are you targeting? Need to be able to be agile and provide improvements continuously to existing users, but be careful not to end up with hideous code that is unmaintainable. Difficult balance between being general and specific. Final trick is to be long term once you’ve got adoption: how can you keep software around and provide the funding to continue developing and improving it?

Social and technical aspects of adoption both equally important.

Richard Holland: Pistoia Alliance Sequence Squeeze: Using a competition model to spur development of novel open-source algorithms

Pistoia Alliance Sequence Squeeze competition to develop improved sequence compression algorithms. Pistoia Alliance is a not-for-profit alliance of life-science companies trying to collaborate on shared research and development. Goal of competition was to come up with approach for compression that is not linear but retains 100% of the information. Work on FASTQ data and all software is open-source.

Overall had 12 entrants with 108 distinct submissions. Provided a public leaderboard that encouraged competition. Winner was James Bonfield. Other useful compression approaches like PAQ fared well.

Michael Reich: GenomeSpace: An open source environment for frictionless bioinformatics

GenomeSpace tackles the difficulty of switching between different tools. Creates a connection layer between genome analysis tools with biological data types. Base supported tools are Cytoscape, Galaxy, GenePattern, Genomica, IGV and the UCSC browser. GenomeSpace is aimed at non-programming users and supports automatic cross-tool compatibility. It’s a great resource to connect tools.

Dannon Baker: Galaxy Project Update

Details on some of the fun stuff that the Galaxy team has been working on. Lots of development on the API, including providing a set of wrapper methods that wrap the base REST API. API allows running of tools in Galaxy.

Automated parallelism is now built into Galaxy. Can split up BLAST jobs into pieces, run on cluster and then combine back together. Main concerns are: overhead of splitting plus temporary space requirement. Set use_tasked_jobs=True in the configuration to enable it; BLAST, BWA and Bowtie are supported. There is a parallelism tag in Galaxy XML that enables it. Advanced splitting has a FUSE layer that makes the file look like a directory to avoid re-copying files.
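
To make the split, run and combine pattern concrete, here is a small sketch on FASTA records. It is not Galaxy's implementation: the function names are mine, a thread pool stands in for cluster jobs, and a GC-content calculation stands in for a real tool like BLAST.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Split into at most n_parts contiguous pieces of near-equal size."""
    n_parts = max(1, min(n_parts, len(records)))
    size, extra = divmod(len(records), n_parts)
    pieces, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        pieces.append(records[start:end])
        start = end
    return pieces


def run_split(records: List[Record], task: Callable, n_parts: int) -> list:
    """Run task on each piece concurrently, then merge in input order."""
    pieces = split_records(records, n_parts)
    with ThreadPoolExecutor(max_workers=len(pieces)) as pool:
        partial_results = list(pool.map(task, pieces))  # map keeps piece order
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged


def gc_content_task(piece: List[Record]) -> List[Tuple[str, float]]:
    """Stand-in for a real tool: GC fraction per record."""
    return [(header, sum(base in "GC" for base in seq) / len(seq) if seq else 0.0)
            for header, seq in piece]


fasta = ">r1\nACGT\n>r2\nGGCC\n>r3\nATAT\n>r4\nGCGA\n>r5\nTTTT\n"
records = parse_fasta(fasta)
merged = run_split(records, gc_content_task, n_parts=2)
```

The two concerns from the talk show up even in the toy: splitting adds work before anything runs, and every piece is a temporary copy of part of the input.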

Galaxy tool shed allows automated input of tools into Galaxy instance.

Enis Afgan: Zero to a Bioinformatics Analysis Platform in 4 Minutes

Enis currently working with Australian Nectar national cloud to port CloudMan to private clouds in addition to currently supported Amazon EC2. Idea is to make resources instantly available to users and provide a layer on top of programmer shell interfaces. 4 projects put together: BioCloudCentral, CloudMan, CloudBioLinux, Galaxy. The Blend library provides a python API on top of Galaxy and CloudMan, with full docs.

Alexandros Kanterakis: PyPedia: A python crowdsourcing development environment for bioinformatics and computational biology

PyPedia provides a collaborative programming web environment. General idea is that wiki can provide a clean way to upload and run python code: Google App Engine + wiki. This has lots of useful example code, like converting VCF to reference. PyPedia provides a REST interface, so everything is fully reproducible and re-runnable.

This would be great to host an interactive cookbook for Biopython with automated tests and ability to fork.

Alexis Kalderimis: InterMine – Embeddable Data-Mining Components

InterMine is an integrated data warehouse with customizable backend storage. It provides a web interface with query functionality. Provides a nice technology stack underneath with web service and Java API interfaces. Provides libraries in Python, Perl, Java that give a nice intuitive interface for querying programmatically. Can write custom analysis widgets with some nice javascript interfaces: built using CoffeeScript with Backbone.js and Underscore.js so lots of pretty javascript underlying it.

Bruno Kinoshita: Creating biology pipelines with BioUno

BioUno is a biology workflow system built using Jenkins build server. Provides 5 plugins for biological work. Jenkins is an open source continuous integration system that makes it easy to write plugins. Bruno bravely shows a live demo including fabulous Java war files.

BOSC 2012, day 1 pm: Genome-scale Data Management, Linked Data and Translational Knowledge Discovery

Talk notes from the 2012 Bioinformatics Open Source Conference.

Dana Robinson: Using HDF5 to Work With Large Quantities of Biological Data

HDF5 is a structured binary file format and abstract data model for describing data. Not client/server, has a C interface + other high level interfaces. HDF5 has loads of advantages in terms of technical details. One disadvantage is that querying is a bit more difficult since access is more low level. You write higher level APIs specific to your data, with speed advantages.

Aleksi Kallio: Large scale data management in Chipster 2 workflow environment

Chipster is an environment for biological data analysis aimed at non-computational users. Recent work reworked architecture to handle large NGS data. Hides data handling on the server side from the user to provide a higher level interface. With NGS data storing all the data becomes problematic so data is only moved when needed. Data stored in sessions which provide quotas and management of disk space. Handles shared filesystems invisibly to user.

Qingpeng Zhang: Khmer: A probabilistic approach for efficient counting of k-mers

Custom k-mer counting approach based on bloom filters. Allows you to trade off false positives against memory. This makes the approach highly scalable to large datasets. Accuracy is related to the k-mer size and the number of unique k-mers at that size. Time usage of khmer is comparable to other approaches like jellyfish but main advantage is memory efficiency and streaming.
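
A rough sketch of the probabilistic counting idea, written as a small Count-Min style structure with several fixed-size tables. This is my own illustration, not khmer's code: the class name, the SHA-1 based hashing and the table sizes are all assumptions. The property that matters is that counts can be too high when k-mers collide, but never too low, and memory stays fixed however many k-mers stream through.

```python
import hashlib
from typing import List


class KmerCounter:
    """Approximate k-mer counter using several fixed-size count tables."""

    def __init__(self, k: int, table_sizes: List[int]):
        self.k = k
        # Memory is fixed up front: one integer slot per table entry.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer: str) -> List[int]:
        # One hash value, reduced modulo each (ideally distinct prime) size.
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer: str) -> None:
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer: str) -> int:
        # Collisions only inflate a slot, so the minimum across tables is
        # the tightest available estimate and never an underestimate.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))

    def consume(self, sequence: str) -> None:
        """Stream every k-mer of a sequence into the counter."""
        for i in range(len(sequence) - self.k + 1):
            self.add(sequence[i:i + self.k])


counter = KmerCounter(k=4, table_sizes=[1009, 1013, 1019])
counter.consume("ACGTACGTAC")
estimate = counter.count("ACGT")  # true count is 2; estimate is >= 2
```

Smaller tables save memory at the price of more collisions, which is the false positive versus memory trade-off from the talk.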

Implementation available from GitHub.

Seth Carbon: AmiGO 2: a document-oriented approach to ontology software and escaping the heartache of an SQL backend

The Amigo Browser displays gene ontology information. Retrieves basic key/value pairs about items and connections to other data. As data has expanded the SQL backend is difficult to scale. The solution thus far has been Solr using a Lucene index to query documents. Decided to push additional information into Lucene, including complex stuff like hashes as JSON. Turns out to be a much better model for the underlying data. Downside is that you need to build additional software on top of the thin client.
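
As a sketch of what pushing "hashes as JSON" into an index can look like: scalar fields stay as plain searchable values, while a nested structure is serialized into a single string field and decoded again by the client. The field names and the toy term below are hypothetical, and nothing here talks to Solr; it only shows the encoding step.

```python
import json
from typing import Any, Dict


def to_index_document(term: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a term record into a document of simple fields.

    The nested neighborhood is stored as one JSON string, since a flat
    index document has no native slot for an arbitrary nested hash.
    """
    return {
        "id": term["id"],
        "label": term["label"],
        "synonym": list(term.get("synonyms", [])),
        "neighborhood_json": json.dumps(term.get("neighborhood", {}),
                                        sort_keys=True),
    }


def from_index_document(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Client side: rebuild the nested record from the flat document."""
    return {
        "id": doc["id"],
        "label": doc["label"],
        "synonyms": list(doc["synonym"]),
        "neighborhood": json.loads(doc["neighborhood_json"]),
    }


# Toy record with made-up identifiers.
term = {
    "id": "EX:0002",
    "label": "example child term",
    "synonyms": ["toy term"],
    "neighborhood": {"is_a": ["EX:0001"], "part_of": []},
}
doc = to_index_document(term)
restored = from_index_document(doc)
```

The decoding step is the extra client-side software the talk mentions as the downside.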

Jens Lichtenberg: Discovery of motif-based regulatory signatures in whole genome methylation experiments

Software to detect regulatory elements in NGS data. Goal is to correlate multiple sources of NGS data: peak calling + RNA-seq + methylation. These feed into motif-discovery algorithms. Looking at Hematopoietic Stem Cell differentiation in mouse. The framework is Perl-based and uses bedtools and MACS under the covers. The future goal is to re-write it in C++ to parallelize and speed up the approach.

Philippe Rocca-Serra: The open source ISA metadata tracking framework: from data curation and management at the source, to the linked data universe

ISA metadata framework for describing experimental information in a structured way. Build a set of tools to allow people to create and edit metadata, producing valid ISA for describing the experiments. ISATab now has a plugin infrastructure for specialized experiments.

Newest work focuses on version control for distributed, decentralized groups of users. OntoMaton provides search and ontology tagging for Google Spreadsheets which helps make ontologies available to a wider variety of users.

ISATab is working on exporting to an RDF representation and OWL ontologies. Some issues involve gaps in OBO ontologies for representation.

Julie Klein: KUPKB: Sharing, Connecting and Exposing Kidney and Urinary Knowledge using RDF and OWL

Built specialized ontology for kidney disease with domain experts. Difficulty is dealing with existing software. Provided a spreadsheet interface which was easier for biologists to work with, called Populous. Ended up with SPARQL endpoint and built a web interface on top for biologists: KUPKB. Nice example of using RDF under the covers to answer interesting questions but exposing it in a way that lets biologists query and manipulate the data.

Sophia Cheng: eagle-i: development and expansion of a scientific resource discovery network

eagle-i connects researchers to resources to help them get work done. eagle-i provides open access to data; the software and ontologies are open-source. Built using semantic web technologies under the covers. Provides downloads of all RDF. Federated architecture built around a Sesame RDF store with SPARQL and CRUD REST APIs on top. Can pull available code stack from subversion along with docs.

BOSC 2012, day 1 am: Jonathan Eisen on open science; cloud and parallel computing

Notes from the 2012 Bioinformatics Open Source Conference.

Jonathan Eisen: Science Wants to Be Open – If Only We Could Get Out of Its Way

Jonathan Eisen starts off by mentioning this is the first time he’s giving an open science talk to a friendly audience. He’s associated, and obsessed, with the PLoS open access journal. History of PLoS: started with a petition coming out of free microarray specifications from Michael Eisen and Pat Brown to make journals available. 25,000 people signed, but had a small impact on open access support mainly because available open-access journals were not high profile enough. PLoS started as selective high-profile open-access journal to fill this gap.


Ft Lauderdale agreement debated how to be open with genomic data. Sean Eddy argued for openness in data and source code by promoting advantages of collaborations and feedback. First open data experiment at TIGR was Tetrahymena thermophila. Openly released and got useful biological feedback. Published next paper on Wolbachia in PLoS Biology despite overtures from Science/Nature.


Medical emergency with wife was real motivator for openness. Could not get access to journals to research to help. Terrible lack of access for people outside of big academic institutions. The same access problem came up when trying to make all of his father’s research papers available.


Open access definition: free, immediate access online with unrestricted distribution and re-use. Authors retain rights to paper. PLoS uses broad creative commons license. Why is reuse important? Thought example: what if you had to pay for access to each sequence when searching for BRCA1 homologs? Then what if it were free, but you couldn’t re-analyze it in a new paper? Science built off re-use and re-purposing of results. Extends to education and fair use of figures. Additional areas to consider are open discussion and open reviews.


What are things you can do to support openness? Share things as openly as possible, participate in open discussion, consider being more open pre-publication. Risk to sharing is low, and benefit is high with help and discussion. Important to judge people by contributions, instead of surrogates like journal impact factor. Enhance and embrace open material while giving credit to everything you can. Support jobs and places that are into openness.


Great talk and a nice way to start off the meeting by focusing on why folks here care about open source.


Sebastian Schönherr: Cloudgene – an execution platform for MapReduce programs in public and private clouds


How to support scientists when using MapReduce programs? Goal is to simplify access to a working MapReduce cluster. Cloudgene designed to handle these usability improvements. Sebastian talks through all of the work to setup a cluster: build cluster, HDFS, run program, and retrieve. It’s a lot of steps. Cloudgene builds a unified interface for all of these.


Cloudgene merges software like Myrna under one unified interface. It works on both public and private clouds. New programs integrated into Cloudgene via a simple configuration file in YAML format. Cool web interface similar to BioCloudCentral but on top of MapReduce work and with lots more metrics and functionality. Configuration files map to web forms ala Galaxy or Genepattern.


Java/Javascript codebase at GitHub.


C Titus Brown: Data reduction and division approaches for assembling short-read data in the cloud


Titus has loads of useful code on his lab’s GitHub page and shares on blog, twitter and preprints: fully open. Uses approaches that are single-pass, involve compression and work with low-memory data structures. Goal is to supplement existing work, like assembly, with pre-processing algorithms. Digital normalization aims to remove unnecessary coverage for metagenomic assembly. Downsamples based on a de Bruijn graph to normalized coverage. Analysis is streaming so doesn’t require pre-loading a graph in memory and uses fixed memory. Avoids the nasty high memory requirements for assembly to allow it to run on commodity hardware, like EC2.
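
Here is my understanding of digital normalization as a toy sketch: stream the reads once, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. For readability this uses an exact Counter, so unlike the real approach it is not fixed-memory; the function names, k and the cutoff are assumptions for illustration.

```python
from collections import Counter
from statistics import median
from typing import Iterable, Iterator, List


def kmers(read: str, k: int) -> List[str]:
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads: Iterable[str], k: int = 4,
                          cutoff: int = 2) -> Iterator[str]:
    """Single pass: yield reads whose estimated coverage is below cutoff."""
    counts = Counter()  # exact stand-in for a fixed-memory k-mer counter
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k carries no k-mers
        # Median k-mer abundance so far approximates this read's coverage.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read


# Five copies of one read (redundant coverage) plus one novel read.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = list(digital_normalization(reads, k=4, cutoff=2))
```

On this toy input two copies of the repeated read and the novel read survive, so the redundant coverage is dropped while the rarely seen sequence is kept.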


Removing redundant reads has nice side-effect of removing errors. Effective in combination with other error-correction approaches. Results in virtually identical contig assembly after normalization.


Need approaches that improve algorithms in combination with additional capacity and infrastructure. Some tough things to overcome: biologists hate throwing away data; normalization gets rid of abundance data. New approaches are to use this for streaming error correction. Error correction is the biggest data problem left in sequencing.


All figures and information from paper can be entirely reproduced as an ipython notebook. Can redo data analysis from any point: awesome. Approach has been useful in teaching and training as well as new projects.


Andrey Tovchigrechko: MGTAXA – a toolkit and a Web server for predicting taxonomy of the metagenomic sequences with Galaxy front-end and parallel computational back-end


MGTAXA predicts taxonomic classifications for bacterial metagenomic sequences. Uses an ICM (Interpolated Context Model) to help extract signal from shorter sequences: better than using a fixed k-mer model. Started off using self-organized maps to identify taxonomy, but not possible in complex cases where clades cluster together.


ICM used to identify phage relationship with host based on shared k-mer composition of virus and host. Parallelization approach used to scale model training with multiple backends: serially, SGE, Makeflow workflow engine. Cool, I didn’t know about Makeflow. Andrey suggests it as a nice fit for Galaxy tool parallelization. Also implemented a BLAST+ MPI implementation using MapReduce-MPI which gives fault tolerance exposing an MPI library.


Python source is available on GitHub and integrated with Galaxy frontend. Network architecture setup at JCVI to allow access to firewalled cluster for processing. Uses AMQP messaging for communication with Apache Qpid.


Katy Wolstencroft: Workflows on the Cloud – Scaling for National Service


Building workflows for genetic testing on the cloud with National Health Service in the UK. Done in collaboration with Eagle Genomics. Diagnostic testing today uses a small number of variants, but will soon be scaling up to whole genomes.


Using Taverna workflows to run the analyses. Workflow is to identify variants, annotate with dbSNP, 1000 genomes and conservation data. Gather evidence to classify variants as problematic or not. Taverna provides workflow provenance so it’s accessible, secure and reproducible.


Architecture currently uses Amazon cloud, Taverna server with data in S3 buckets. Modified Taverna to work better with Amazon and improvements will be available in the next Taverna release.


Andreas Prlic: How to use BioJava to calculate one billion protein structure alignments at the RCSB PDB website


Andreas combines interests in PDB for work and open-source contributions to BioJava. He describes a workflow to find novel relationships between systematic structural alignments. Work uses Open Science Grid by pushing a custom job management system talking on port 80. Converts CPU bound problem to IO problems. Alignment comparison and visualization code is available from BioJava.

AWS genomics event: Distributed Bio, Cycle Computing talks: practical scaling

The afternoon of the AWS Genomics Event features longer detailed tutorials on using Amazon resources for biological tasks.

Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio

Chris and Giles work at Distributed Bio, which provides consulting services on building, well, distributed biology pipelines. Chris starts by talking about some recommended tools:

  • Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP transport, like in iRODS, provides a substantial speed improvement over standard TCP. Simple demo: 3 minutes versus 18 seconds for ~1Gb file (with iput/iget).

  • iRODS: catalog on top of filesystem, which is ideal for massive data collections. Lets you easily manage organizing, sharing, storing between machines. It’s a database of files with metadata: definitely worth investigating. Projects to interact with S3 and HDFS are in the works.

  • GlusterFS: easy to setup and use; performance equivalent or better to NFS

  • Queuing and scheduling: Openlava is basically open-source LSF, with EC2 friendly scheduler

Giles continues with 3 different bioinformatics problems: antibody/epitope docking, sequence annotation and NGS library analysis. For docking, architected a S3 EC2 solution with 1000 cores over 3.5 hours for $350. 10 times faster than local cluster work, enabling new science.
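
A quick back-of-the-envelope check on the docking numbers, using only the figures quoted above (1000 cores, 3.5 hours, $350). The helper name is mine.

```python
from typing import Tuple


def core_hour_cost(cores: int, hours: float,
                   total_cost: float) -> Tuple[float, float]:
    """Return (core-hours used, cost per core-hour)."""
    core_hours = cores * hours
    return core_hours, total_cost / core_hours


# 1000 cores for 3.5 hours at a total of $350.
core_hours, per_core_hour = core_hour_cost(cores=1000, hours=3.5,
                                           total_cost=350.0)
# core_hours is 3500.0 and per_core_hour works out to $0.10.
```

So the run used 3,500 core-hours at roughly ten cents per core-hour.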

Sequence annotation work was to annotate large set of genes in batch. Was changing so fast cluster could not keep up with updates. Moved to hybrid AWS architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k at AWS.

NGS antibody library analysis looks at variability of specific regions in heavy-chain V region. Uses Blast and HMM models to find unique H3 clones. VDJFasta software available; paper. On the practical side, iRODS used to synchronize local and AWS file systems.

Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS

Deepak starts off with a plug for AWS Education grants, which are a great way to get grants for compute time. Matt follows with a demo to build an 8 node, 64 core cluster with monitoring. Create a cc1.4xlarge cluster instance with a 100Gb EBS store attached. Allow SSH and HPC connections between nodes in the security group. Put the instance within a placement group so you can replicate 7 more of them later by creating an AMI from the first.

CycleCloud demo: Andrew Kaczorek

Andrew from Cycle Computing wraps up the day by talking about their 30,000 node cluster created using CycleCloud. Monitoring done using Grill, a CycleServer plugin that monitors Chef jobs.

The live demo shows utilizing their interface to spin up and down nodes. Simple way to start clusters and scale up and down; nice.

Also demos a SMRT interface to PacBio data and analyses. Provides the back end elastic compute to a PacBio instrument on Amazon.

AWS genomics event: Deepak Singh, Allen Day; practical pipeline scaling

High Performance Computing in the Cloud: Deepak Singh, AWS

After the coffee break, Deepak starts off the late morning talks at the AWS genomics event. He begins by discussing the types of jobs in biology: batch jobs and data intensive computing. 4 things go into it:

  • infrastructure: instances on AWS that you access using an API. On EC2, this infrastructure is elastic and programmable. Amazon has cluster compute instances that make it easier to connect this infrastructure into something you can do distributed work on. Scaling on Amazon cluster is good for the MPI jobs common in computational chemistry and physics. Another choice is GPU instances if your code distributes there.

  • provision and manage: Lots of choices here: ruby scripts, Amazon CloudFormation, chef, puppet. Also standard cluster options: Condor, SGE, LSF, Torque, Rocks+. MIT StarCluster examples with awesome demos of making clusters less tricky. Cycle Computing leverages these tools to build massive clusters and monitoring tools.

  • applications: Galaxy CloudMan, CloudBioLinux, Map Reduce for Genomics: Contrail, DNAnexus, SeqCentral, Nimbus Bioinformatics

  • people: Most valuable resource that you can maximize by removing constraints. Big advantage to have access to unlimited instances to leverage when needed.

Elastic Analysis Pipelines: Allen Day, Ion Flux

Allen from IonFlux talks about a system for processing Ion Torrent data: a production-oriented system that moves from initial data to final results behind a well-packaged front end. They decided to work on the Cloud to better serve smaller labs without existing infrastructure, plus all the benefits of scale. End to end solution from torrent machine to results.

The pipeline pulls data into S3, aligns it, does realignments, and produces variant calls. The workflow uses Cascading to describe Hadoop jobs; this talks to HBase to store results. On the LIMS side, it uses messaging queues to pass analysis needs to the workflow side. Jenkins is used for continuous integration.
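The hand-off between the LIMS and the workflow side can be pictured with a small in-process sketch. This is only an illustration using Python's standard `queue` module: the stage names, the `run_id` field and the `drain` helper are hypothetical, and the real system submits Hadoop jobs through Cascading rather than calling functions locally.

```python
import queue

# Hypothetical stage names taken from the pipeline description above; the
# real system runs these as cluster jobs, not in-process steps.
STAGES = ["fetch_from_s3", "align", "realign", "call_variants"]


def run_pipeline(request):
    """Walk the stages in order and record which ones completed."""
    completed = []
    for stage in STAGES:
        # A real worker would submit a cluster job here and wait for it.
        completed.append(stage)
    return {"run_id": request["run_id"], "completed": completed}


def drain(inbox):
    """Workflow side: consume analysis requests until the queue is empty."""
    results = []
    while True:
        try:
            request = inbox.get_nowait()
        except queue.Empty:
            break
        results.append(run_pipeline(request))
        inbox.task_done()
    return results


# LIMS side: post analysis requests without knowing how they get executed.
inbox = queue.Queue()
inbox.put({"run_id": "run-001"})
inbox.put({"run_id": "run-002"})
results = drain(inbox)
```

The point of the queue is decoupling: the LIMS only describes what needs analyzing, and the workflow side decides when and where to run it.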

What does the Hadoop part do? Distributed sorting of BAM files, co-grouping, and detection of outliers. Idea is to distribute filtering work to prioritize variants or regions of interest. Data is self-describing, which allows restarting or recovering at arbitrary points. Cascading allows serialization of these by defining schemes for BAM, VCF and FASTQ files.
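For readers unfamiliar with co-grouping, here is a minimal single-machine sketch of the idea in plain Python. It is not the Cascading API and not Ion Flux's code: the region keys, read names and depth values are made up, and the outlier rule (median absolute deviation with a threshold of 3) is an assumption chosen for illustration, since the talk did not say how outliers are detected.

```python
from collections import defaultdict
from statistics import median


def cogroup(left, right):
    """Group two (key, value) streams by key: key -> (left values, right values)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def outlier_keys(depths, threshold=3.0):
    """Flag keys whose value sits far from the median, scaled by the MAD."""
    values = list(depths.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return sorted(k for k, v in depths.items() if abs(v - center) / mad > threshold)


# Hypothetical inputs keyed by genomic region.
reads = [("chr1:0-1000", "readA"), ("chr1:0-1000", "readB"), ("chr1:1000-2000", "readC")]
variants = [("chr1:0-1000", "A>G"), ("chr2:0-1000", "C>T")]
by_region = cogroup(reads, variants)

# Hypothetical per-region read depths; r5 is far from the rest.
depths = {"r1": 30, "r2": 32, "r3": 29, "r4": 31, "r5": 400}
flagged = outlier_keys(depths)
```

In a Hadoop setting the grouping happens in the shuffle, so each reducer sees all reads and variants for one region together and can filter or prioritize them independently.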

How did they work on scaling problems? They started with an index server that pre-computes results and feeds them into the MapReduce cluster, but the index server became a bottleneck. They can improve this by moving the index server into a bittorrent swarm that serves the index files to the MapReduce nodes. The continuous integration system does the work of creating index files as EBS snapshots; the bittorrent swarm uses these snapshots.
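A rough way to see why a swarm relieves the bottleneck is to compare the standard textbook lower bounds on distribution time for a single server versus peer-to-peer sharing. The talk gave no figures, so every number below (file size, node count, bandwidths) is hypothetical, and the functions are a back-of-the-envelope model rather than a description of their setup.

```python
def server_only_time(file_gb, nodes, server_up, peer_down):
    """Lower bound (seconds) when one server uploads a full copy to every node."""
    return max(
        nodes * file_gb / server_up,  # server uploads one copy per node
        file_gb / peer_down,          # each node must download the whole file
    )


def swarm_time(file_gb, nodes, server_up, peer_up, peer_down):
    """Lower bound (seconds) when nodes also upload pieces to each other."""
    return max(
        file_gb / server_up,          # server must send every piece at least once
        file_gb / peer_down,          # each node must download the whole file
        nodes * file_gb / (server_up + nodes * peer_up),  # total upload capacity
    )


# Hypothetical numbers: sizes in GB, bandwidths in GB/s.
single = server_only_time(file_gb=10, nodes=100, server_up=1.0, peer_down=0.1)
swarm = swarm_time(file_gb=10, nodes=100, server_up=1.0, peer_up=0.1, peer_down=0.1)
```

With a single server, the time grows linearly with the number of nodes; in the swarm, each added node also adds upload capacity, so the bound stays roughly flat as the cluster grows.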