Notes: Bio in Docker Symposium 2015 day 2: Docker infrastructure and hackathon

I’m at day 2 of the Bio in Docker Symposium at the Wellcome Collection in London. This is a great 2 day set of talks and coding sessions around the use of Docker containerization in biology workflows. The idea is to coordinate around using these tools to improve our ability to do science. There is a hackathon for half of today, but the morning talks focus on tools within the Docker ecosystem and learning how to use them to design better applications for biology. There is a ton of cool engineering ongoing, but untangling all of the different components is a challenge that these talks will hopefully help with.

  • Notes for day 1: NGS pipelines, workflow tools and data volume management

Publication

F1000Research – a publishing platform for the Docker community

Thomas Ingraham & Michael Markie

F1000Research is an open publisher that handles both traditional articles and outputs like posters and presentations. They do open post-publication peer review and are generally a great way to publish research. Articles are only indexed and officially published when passing peer review, and articles have versions so you can see changes over time. Collections organize related content for easier discoverability; one example is the Bioinformatics Open Source Conference (BOSC) channel. Today they’re announcing a channel for Container virtualization in informatics – a great way to make Docker-enabled posts publicly available.

New Docker functionality

Weaving Containers in Amazon’s ECS

Alfonso Acosta

Microservice-oriented architectures work around small components that each do a single job. This helps avoid complexity, but introduces challenges in coordinating and putting things together. Weaveworks has tools to connect, observe and control containers. Weave connects containers, manages IP allocation, sorts out DNS, and does load balancing between identically named containers. It handles node failures with recovery. Weave does not require a distributed key/value store – it uses gossip in a peer-to-peer fashion, so it is resistant to failures. On each machine you start up a new Weave client. It only needs to know the name of one other host on the network, then they’re all connected. Then you start and register containers in the network. Weave sits on top of Amazon ECS, providing all of the container registry and discovery.
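A minimal sketch of what this looks like in practice, assuming the 2015-era weave CLI (host names and the redis image are just examples):

```bash
# Start Weave on the first host, then point each additional host at
# any one existing peer to join the network.
host1$ weave launch
host2$ weave launch $HOST1_IP

# Route docker through the Weave proxy so new containers are attached
# to the network and registered in Weave's DNS.
host1$ eval $(weave env)
host2$ eval $(weave env)
host2$ docker run -d --name redis redis
host1$ docker run --rm -ti ubuntu ping redis.weave.local
```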

Orchestrating containers with docker compose

Aanand Prasad

Docker Compose allows defining and running multi-container Docker applications. The motivation is that multi-container apps are a hassle to set up, configure and coordinate. It coordinates with Docker Machine, which provisions Docker hosts, and Docker Swarm, which provides clustering of Docker hosts. From the previous talk, it looks like Weave can work with Swarm directly. For a demo, Aanand shows Docker Compose talking directly to a Swarm cluster. The Swarm integration with Docker is nice – all of the standard docker commands work when you’re running a coordinated cluster, and it does a great job of distributing containers across multiple hosts. Docker Machine makes it easy to run Docker inside VMs, but you lose some of the native performance of Docker by going this route.
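A minimal sketch of a Compose file, using the 2015-era v1 syntax (the services and images are just examples):

```bash
# docker-compose.yml: two linked services defined declaratively.
cat > docker-compose.yml <<'EOF'
web:
  build: .
  ports:
    - "5000:5000"
  links:
    - redis
redis:
  image: redis
EOF

docker-compose up -d   # builds, creates and starts both containers
docker-compose ps      # the same commands work against a Swarm cluster
```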

Manage your infrastructure like Google

Matt Barker & Matt Bates

JetStack helps with management and orchestration of containers. At Google everything runs in a container, with 2 billion containers launched a week. Kubernetes is the open source version of their internal cluster management tool. Pods look really cool and are a way to group together multiple applications with shared volumes. The replication controller does the work of maintaining a consistent state of pods, restarting them when things fail. Kubernetes’ shared-state scheduler does the job of scheduling containers to machines based on resource requests. Services provide a way to label pods and then do discovery based on the naming, instead of introducing this complexity into your code. As of the new version, Kubernetes now has a Job object which explicitly manages batch jobs. Nice live example wrapping the Mykrobe predictor inside a Docker container and then running it with Kubernetes – the Mykrobe predictor spits out JSON which gets sucked into MongoDB by a watcher in another container. Kubernetes is impressive and the discoverability of coordinating between services is nice.
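A hedged sketch of the pod pattern from the demo – two containers sharing a volume, written against the 2015-era v1 API (the image names are hypothetical):

```bash
cat > pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: predictor
  labels:
    app: predictor
spec:
  volumes:
    - name: results
      emptyDir: {}        # shared scratch space for the pod
  containers:
    - name: mykrobe
      image: example/mykrobe-predictor   # hypothetical image
      volumeMounts:
        - name: results
          mountPath: /output             # writes JSON here
    - name: watcher
      image: example/json-to-mongo       # hypothetical watcher image
      volumeMounts:
        - name: results
          mountPath: /watch              # loads new JSON into MongoDB
EOF
kubectl create -f pod.yaml
```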

Docker and real world problems

Clive Stringer and Adam Hatherly

King’s College Hospital is an NHS hospital with big-hospital problems: many of the components don’t fit well together, so things don’t interoperate. The major need is components that slot together easily. CIAO (Care Integration and Orchestration) is open source middleware that makes it easier to use standards and share information. The NHS is a complex set of organizations, so it is not set up to work as a whole system. CIAO encapsulates the complex XML integration code inside components composed in Docker containers. Each component is a self-contained microservice, built around a reference standard for libraries. A second focus is on trying to characterize genes based on genetic components. The challenge is to make medical records available.

oSwitch: one-line access to other operating systems

Yannick Wurm

Yannick works with ants, and starts with a plug for ants being cool, showing an example of leaf cutter ants using symbiotic fungus to digest leaves. Sold. Then he motivates with the difficulties of working with computational tools to answer biological questions. Also sold. oSwitch is an awesome tool that provides a small wrapper around Docker. It creates an interactive instance around a tool, runs things on the files in the current directory, then exits. It removes all of the abstraction around Docker. We should have docs on debugging bcbio runs with oSwitch when running more with Docker.
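A minimal sketch of the workflow, assuming the oswitch CLI behaves as demoed (the image and file names are examples):

```bash
# Switch into a container with the current directory mounted, run the
# tool against local files, then exit back to the host shell.
$ oswitch biodckr/blast            # example image name
container$ blastp -query proteins.fa -db nr -out hits.txt
container$ exit
$ ls hits.txt                      # results persist on the host
```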

Hackathon

The afternoon of the workshop is for working on problems and getting coding done. The organizers have nicely set up tutorials ranging from learning Docker, to advanced Docker concepts, to working with tools like Nextflow and the Nextflow Workbench that build on top of Docker. I’m working on continuing to improve CWL support in bcbio, moving towards the vision of bcbio running on alternative infrastructure presented in my talk at the conference.


Notes: Bio in Docker Symposium 2015 day 1: NGS pipelines, workflow tools, volume management

I’m at the Bio in Docker Symposium at the Wellcome Collection in London. This is a great 2 day set of talks and coding sessions around the use of Docker containerization in biology workflows. The idea is to coordinate around using these tools to be able to improve our ability to do science.

Pipelines and Workflows

Evaluating and ranking bioinformatics software using Docker containers, plus an overview of the BioBoxes project

Peter Belmann

BioBoxes is a community effort driven by two projects: CAMI and nucleotid.es. The goal is to improve reproducibility and reuse, avoiding issues with compilation and file format management to focus on putting things together. BioBoxes is a standard for creating interchangeable software containers. It groups software based on interface definitions. For example, an assembler takes a list of FASTQ files and produces FASTA. This allows you to plug in any set of assemblers that work with this interface. They built a command line interface that hides the configuration and usage of Docker to make it transparent. The key is agreeing on specifications to implement, which the community can iterate on. Good question about how to handle external data and mount it into containers.
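A hedged sketch of the interchangeability idea; the flags are from memory of the 2015-era biobox CLI and may differ, and the image names are illustrative:

```bash
# Any image implementing the short-read-assembler interface can be
# swapped in without changing the shape of the command.
biobox run short_read_assembler bioboxes/velvet \
    --input=reads.fq.gz --output=velvet-contigs.fa
biobox run short_read_assembler bioboxes/megahit \
    --input=reads.fq.gz --output=megahit-contigs.fa
```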

Portable workflow and tool descriptions with Common Workflow Language and Rabix

Nebojsa Tijanic

Nebojsa works at Seven Bridges on building reproducible computational analyses. A large issue is that it’s hard to set up the environment to run someone else’s analysis. Docker solves these problems for us by encapsulating the environment and code. Remaining issues: understanding the resources needed, how to supply inputs and how to capture outputs. A second usage pattern, which is what we do in bcbio, is to provide a Docker image with multiple tools together, running it in parts. Once you start to orchestrate these individual tools, you need a workflow engine to organize and run things. The result of having multiple workflow engines is a push for a single specification for describing workflows: the Common Workflow Language (CWL). The goal of CWL is to have a specification for portable workflows. Some implementations run CWL now: the reference implementation, Rabix and Arvados. Work in progress implementations: Cromwell (Broad) and Galaxy. Design choices for CWL: declarative, extendable, uses Docker for packaging, YAML/JSON encoding, and re-uses existing work as much as possible. It contains a description for defining tools, which defines all of the inputs and outputs; combined with a community sourced specification like BioBoxes, this creates a standard for interoperability between workflow runners. Beyond tools, it contains a directed acyclic graph of workflow steps, with support for parallelization through scatter/gather. Rabix is the open source toolkit for CWL from Seven Bridges and will focus on consolidating with the reference implementation.
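A minimal sketch of a CWL tool description; this uses the current v1.0 syntax rather than the 2015 draft format, and the tool choice is just an example:

```bash
# A single containerized tool with declared inputs and captured outputs.
cat > linecount.cwl <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
requirements:
  DockerRequirement:
    dockerPull: ubuntu:14.04
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
stdout: counts.txt
outputs:
  counts:
    type: stdout
EOF

# Run with the reference implementation; inputs become CLI flags.
cwltool linecount.cwl --infile reads.txt
```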

Manage reproducibility in genomics pipelines with Nextflow and Docker containers

Paolo Di Tommaso

Paolo starts by talking about the results of a survey from Nick Loman to figure out common obstacles to running computational workflows. Why are things so hard? They’re complex, experimental and run on heterogeneous hardware. Containers solve a lot of these issues and are more lightweight than VMs – smaller images, fast start times and almost native performance. From a reproducibility standpoint, they have transparent build processes. To scale out, tools like Swarm, Fleet, Kubernetes and Mesos orchestrate containers. These are not for task scheduling, but for orchestration. The main cost is the startup time – it’s small for Docker but matters if you’re running a lot of very short tasks.

Paolo helps develop Nextflow, which manages compute, a data registry and the local filesystem. Benchmarking native runs versus Docker showed very close execution times. It provides a DSL on top of the JVM with high level parallelization, with workflows based on dataflow processing. Nextflow is well thought out, including specifications for resources used and scaling. It works locally, on clusters with multiple schedulers, and on AWS with ClusterK. A nice feature is direct interaction with GitHub – you can pull a workflow from a repository automatically. Brave live demo of an RNA-seq analysis – a cool example that doesn’t work locally because of tool installation issues, but does if you use Docker containers.
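A hedged sketch of what a containerized Nextflow process looks like, using the 2015-era DSL (the file paths and image are examples):

```bash
cat > main.nf <<'EOF'
// Each FASTQ file becomes a parallel task running inside the container.
process countLines {
    container 'ubuntu:14.04'

    input:
    file reads from Channel.fromPath('data/*.fastq')

    output:
    file 'counts.txt' into counts

    """
    wc -l ${reads} > counts.txt
    """
}

counts.subscribe { println it }
EOF

nextflow run main.nf -with-docker
# Workflows can also be pulled straight from GitHub, as in the demo:
nextflow run nextflow-io/rnatoy -with-docker
```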

Next generation sequencing pipelines in Docker

Amos Folarin and Stephen Newhouse

Amos and Steve, the organizers of this great workshop, present their work on NGSeasy. The idea is to explore the use of Docker for workflow pipelines. Amos describes the motivations and good bits of Docker containers. There is a lot of promise but some technical challenges: Docker is moving quickly, and user namespaces are not yet supported (but planned for the next release) so you need root-equivalent access. In biology you create a large number of interfaces, potentially introducing complexity. NGSeasy strikes a balance in terms of the separation of tools inside containers: in general, they use a larger container with a large number of available tools. This is similar to the approach in bcbio, although NGSeasy has smaller suites of tools instead of everything. It uses bash scripts for orchestration, calling out to docker. The goal of the conference is to bring all of these projects together to work with each other.
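A hypothetical sketch of the bash-orchestration pattern described, where each pipeline stage is a docker run against a shared project volume (the image name is illustrative):

```bash
#!/bin/bash
# One pipeline stage: mount the project directory and run one tool.
PROJECT=/data/ngs_projects
docker run --rm \
    -v ${PROJECT}:/ngs_projects \
    example/ngseasy-fastqc \
    fastqc /ngs_projects/sample_R1.fastq.gz
```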

Pipelines to analyse data from the 100,000 Genomes Project as part of the Genomics England Clinical Interpretation Partnership (GeCIP)

Tim Hubbard

The goal of Genomics England is to transform the NHS to use genomics on a large scale – the focus is on treatment, not research design. It does all clinical whole genome sequencing for rare disease and cancer. It is also meant to build up genomics infrastructure in England, leaving a legacy of infrastructure, human capacity and capability. For pipelines, they split components into sub-pipelines and get these from local companies. For health records, they use OpenClinica. It’s an impressive model around helping patients with genomic sequencing. For research work with the data, they are setting up a Clinical Interpretation Partnership (GeCIP). How can we improve the 50% interpretation rate for rare diseases? Feed cases to research groups to improve interpretation for specific diseases. Practically there are 11 Genomic Medicine Centers, 70 hospitals and 9000 participants consented. Infrastructure is fully leased and virtualized with rented compute. For data sharing there are a couple of current models: open to all (1000 genomes) and managed repositories like dbGaP. The new model for Genomics England is managed access but no redistribution – you have to work inside the environment. Long term goals of GEL: an engine for NHS transformation to genomics, data standardization, and acceptance of data centers for securely processing patient data.

MetaR and the Nextflow Workbench: application of Docker and language workbench technology to simplify bioinformatics training and data analysis.

Fabien Campagne

MetaR tries to help newer users do reproducible analysis. New users often prefer a graphical user interface, but this is challenging to reproduce and scale. MetaR is a workbench for teaching R analysis to new users. Nextflow Workbench builds on top of Nextflow, trying to make it easier to write Nextflow workflows. The workbench allows you to organize workflows into modules for reproducibility, and provides a lot of nice user interface elements around Nextflow. There is impressive interaction with Docker, along with specification of the resources required. Fabien shows a real example of building Salmon transcript indexes, which requires transcripts plus other resources to pull in; the data is managed inside the Docker image with Nextflow Workbench. I wonder if there are better ways to handle this data and link it into Docker containers. The workbench also allows writing a bash script and easily converting it to workflow processes. It is interactive with auto-completion – overall a nice development environment. It builds on top of the JetBrains language workbench.

Bioinformatics and the packaging melee

Elijah Charles

Elijah makes the good point that we need distributions that have pre-built tools so we can start at a higher level. Docker provides solutions for configuration management, isolation and versioning.

Data, Volumes and portability with Flocker

Kai Davenport

ClusterHQ provides data solutions for working with Docker. Excited to hear about ideas for this. Why containers? Isolated: immutable environments you can keep separate as needed. Expedient: pre-built binary images. Compact: better resource usage compared to VMs. Images save each layer, making everything pluggable. Now on to Docker storage: volumes. Each container has its own mount namespace, but volumes are bind-mounted directories from the host into the container, maintained beyond the lifetime of the container. The challenge is scaling this to multiple machines. Flocker solves this problem by orchestrating storage for a cluster of machines. The Flocker control service pushes volumes to specific machines running clients, and supports a range of backend storage drivers. It integrates orchestration platforms (Mesos, Marathon, Kubernetes, Docker Swarm and friends) with storage (EMC and friends). The orchestration tools hide the hardware and manage a pool of compute resources, handling monitoring and failures. Looks like a really nice abstraction to make this easier.
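A hedged sketch of the two layers described, assuming the Flocker Docker plugin (the volume name and image are examples):

```bash
# Plain Docker volume: a host directory bind-mounted into the
# container, surviving beyond the container's lifetime.
docker run -d -v /srv/db:/data/db mongo

# Flocker-managed volume: provisioned on backend storage and moved
# with the container when it is rescheduled to another machine.
docker run -d --volume-driver=flocker -v mongo-data:/data/db mongo
```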