Notes: Bioinformatics Open Source Conference 2016 day 2 afternoon: developer tools and reproducible analyses

I’m at the 2016 Bioinformatics Open Source Conference (BOSC) in Orlando and these are the notes from the afternoon session on the second day, focused on developer tools and libraries, approaches for improving open science and reproducibility and a set of lightning talks to wrap up the day. It was another great conference thanks to Nomi, Peter, Hilmar, Moni, Heather, Chris, Karsten and the rest of the Open Bioinformatics Community.

Other notes:

  • Day 1 morning notes – Jennifer Gardy keynote on open data and infectious diseases, plus workflow development
  • Day 1 afternoon notes – Talks on standards and a panel on growing communities.
  • Day 2 morning notes – Steven Salzberg keynote on open source, data and access in science plus a talk session on data science.

Developer tools and libraries

Christian Brueffer – Biopython Project Update 2016

Christian talks through the history of Biopython from 1999 until today. OpenHub has some great statistics on new contributors and contributions over the past year from GitHub. There is an entire slide filled with new contributors over the last year – an amazing platform for bringing new people into the community. Some cool changes – updates to Bio.SeqIO, Bio.PDB, Bio.Restriction, Bio.Data… Amazing diversity of new changes; cool to have them catalogued in a single place and to highlight contributors. New things coming for Biopython 1.68: updates to Bio.pairwise2 to make it faster – cool, since pairwise2 was one of the initial things written by Jeff Chang early in the project. Christian is personally working on standardizing coding style in Biopython, since past contributions used widely different styles; standard style checking now runs through TravisCI.
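
For a flavor of the module getting the speedups, here is a minimal Bio.pairwise2 sketch (real API, toy sequences):

```python
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Global alignment with simple scoring: matches count 1,
# mismatches and gaps count 0 (the "xx" suffix).
alignments = pairwise2.align.globalxx("ACCGGT", "ACGT")

# Print the top-scoring alignments in a readable form.
for aln in alignments:
    print(format_alignment(*aln))
```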

Peter Humburg – ReportMD: Writing complex scientific reports in R

Provide reproducibility of data analysis through literate programming. Rmarkdown makes this easy in R, but the flow of an analysis may limit how results can be presented – you want to communicate the story without maintaining two separate documents. ReportMD generates multi-page HTML and manages dependencies between markdown files. Lots of useful flexibility in analyses.

Shaun Jackman – Linuxbrew and Homebrew-Science to Navigate the Software Dependency Labyrinth

Shaun talks about the importance of reproducibility in analyses – being able to generate the same results, both yourself and others. Installing the tools to do this is a huge challenge, hence the incredible work on Linuxbrew and Homebrew-Science. Amazing work porting Homebrew to Linux and building a community to install scientific packages.

Benjamin Hitz – SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata

The ENCODE DCC manages data for the ENCODE project; encodeD provides a flexible metadata model for representing samples, and SnoVault is the general purpose object database it builds on. Uses JSON Schema and JSON-LD to validate and normalize the data. The database can be exported as RDF and put into a SPARQL store, and has an auditing system to check cross-object data integrity. Cool ideas – would love to understand how it compares to Datomic for representation.
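
As a sketch of the JSON Schema idea (a hypothetical, much-simplified schema – not encodeD's actual metadata model), validation in Python looks like:

```python
import jsonschema

# Hypothetical schema in the spirit of encodeD's per-object validation.
biosample_schema = {
    "type": "object",
    "properties": {
        "accession": {"type": "string", "pattern": "^ENCBS[0-9A-Z]+$"},
        "organism": {"type": "string"},
        "treatments": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["accession", "organism"],
}

sample = {"accession": "ENCBS000AAA", "organism": "human", "treatments": []}

# Raises jsonschema.ValidationError if the document doesn't conform.
jsonschema.validate(instance=sample, schema=biosample_schema)
```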

Open Science and Reproducibility

Bastian Greshake – State of the openSNP.org Union: Dockerizing, Crowdfunding & Opening for Contributors

openSNP leverages the availability of direct-to-consumer genetic testing. Allows users to directly share their data, and many people are willing to share phenotypes as well. Different sharing for traits (hair color) versus diseases (allergy). Can also share if you prefer vi or emacs (both, thank you Spacemacs). Lots of examples of open discussion, but also the danger of too much information if people don't want to know. Some current growing pains (4,500 users) – unassociated with any institution, so looking for funding to support infrastructure and development. Patreon and Gratipay for community support; SevenBridges helping as well. Broke up the infrastructure into individual machines, managed with Docker containers deployed across machines. Lots of great community work to bring more people into the project. Three Google Summer of Code students working on the project: adding Fitbit support, phenotype integration, and a UI overhaul.

Michael Reich – The GenePattern Notebook Environment

Analysis approaches: analysis notebook environments like Jupyter let you describe, document and run an analysis if you have programming skills. Bioinformatics tool aggregation portals wrap existing tools and make them available to non-programmers (Mobyle, Galaxy, GenePattern). The GenePattern Notebook environment makes Jupyter available to research users without requiring any programming. Michael switches over to a demonstration of Jupyter versus GenePattern Notebook for SVM training and prediction. Jupyter requires coding up a bunch of loading code, while GenePattern avoids this by baking in lots of machine learning methods. Almost available and ready to go. Shows a movie demo of trying to predict pediatric medulloblastoma outcome from genomic data.
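
To illustrate the "bunch of loading" a plain Jupyter approach needs – a hedged scikit-learn sketch, not the actual demo code, with stand-in file names:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-ins for parsing an expression matrix and outcome labels;
# the real demo would load GenePattern GCT/CLS files instead.
X = np.loadtxt("expression_matrix.txt")  # samples x genes
y = np.loadtxt("class_labels.txt")       # outcome per sample

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = SVC(kernel="linear")
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```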

Brett Beaulieu-Jones – Reproducibility in computationally intensive workflows with continuous analysis

Emphasizes that open science doesn't equal reproducibility, since it can be so much work to reproduce others' analyses. Data changes over time, so without versions specified you cannot get back what you did previously. Continuous analysis combines Docker and continuous integration: you need a Docker container plus a script to integrate with CI. Available on GitHub: Continuous analysis.
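
The core trick – CI re-running the analysis inside a pinned Docker image on every commit – could look something like this driver script (a sketch; the image name and analysis script are hypothetical):

```python
import os
import subprocess

# Pin the image by digest so CI rebuilds stay reproducible over time
# (hypothetical image reference).
IMAGE = "quay.io/example/analysis@sha256:..."

def run_analysis():
    # Mount the repository checkout into the container and re-run the
    # full analysis; the CI service calls this on every push.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}:/work", "-w", "/work",
         IMAGE, "python", "run_analysis.py"],
        check=True,
    )

if __name__ == "__main__":
    run_analysis()
```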

Nils Gehlenborg – Reproducible Research in the Cloud with the Refinery Platform

The Refinery Project provides a set of visualization tools built around an ISA-Tab data repository. Uses Galaxy and CloudMan for analysis platforms, running on AWS to spin up CloudMan for analysis. Nice work on handling provenance graphs by collapsing and expanding them, so you can see the right representation based on interest: called Avocado.

Lightning talks

Mónica Muñoz-Torres – Apollo Genome Annotation Editor: Latest Updates, Including Galaxy Integration

Apollo provides real time interactive collaborative editing. Architecture is an Apollo Server connected to clients – web clients, jBrowse. Latest news: export and update Chado databases, annotate multiple organisms per server, integration with Galaxy. Can bring in annotations from many different inputs.

Michael Zentner – An invitation to the bioinformatics community to participate in the HUBzero® open source release

HUBzero is a software platform for creating websites for scientific research and teaching. Tries to help folks provide web interfaces for tools. Organized through containers, showing examples from molecular dynamics. Looking at hosting Jupyter and arbitrary web apps. Remidi Central is a collection of 250 hospitals that share data through HUBzero. Looking for partners in the bioinformatics community.

Peter Rose – PDB on steroids – compressive structural bioinformatics

The PDB needs to work with larger structures and more data: 120,000 structures as of June 2016. Need ways to compress and manage the data for storage. For analysis, hold it in memory on a cluster and run analyses using Apache Spark. Provides custom compression in the MacroMolecular Transmission Format (MMTF), using MessagePack for compressed JSON-like binary encoding.
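
The MessagePack half of that is easy to sketch in Python (not MMTF's actual field layout, just the JSON-like-to-binary idea):

```python
import msgpack

# A JSON-like structure standing in for simplified structural fields.
structure = {
    "structureId": "XXXX",  # hypothetical entry
    "numAtoms": 3,
    "xCoordList": [12.1, 12.9, 13.4],
}

packed = msgpack.packb(structure)    # compact binary encoding
restored = msgpack.unpackb(packed)   # round-trips to the same dict
assert restored["numAtoms"] == 3
```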

Robin Andeer – Puzzle: VCF/GEMINI interface for genetic disease analysis

Puzzle is designed to handle human genome analysis of VCF files. VCFs are complex, so need standardization for improved analysis, including visualization. Will suck up a directory of files and prepare them for visualization. Can handle VCF, BCF and GEMINI databases. Viewers for structural variants and standard variants.

Keiichiro Ono – Modernization of the Cytoscape ecosystem

Cytoscape is a popular platform for network analysis and visualization. It has a single small core, layered with core apps and community third-party apps – ~300 applications. It's an application publishing platform for network biology. Many tools are written in non-Java languages – how can you integrate well with Cytoscape? Support for multiple plugins is now documented at CyWiki.

Abigail Cabunoc Mayes – Collaborative Software Development: Lessons from Open Source

Abby is the lead developer for open source engagement at Mozilla Science Lab. History: Netscape released as free software – they want to bring this ethos to open science. Need to be public and participatory: structuring events so outsiders can participate and become insiders. Inspiration from Disney: they do a great job of showing you the right way to go. Open source checklist: public repo, open license, README, roadmap, code of conduct, contributing guide and mentorship. Need to be better at this in bcbio. Mozilla Fellowships – bring the open ethos to your community.

Notes: Bioinformatics Open Source Conference 2016 day 2 morning: Open research and data science

I’m at the 2016 Bioinformatics Open Source Conference (BOSC) in Orlando and these are the notes from the morning session on the second day. This includes a keynote from Steven Salzberg on openness in science and a session on data science.

Other notes:

  • Day 1 morning notes – Jennifer Gardy keynote on open data and infectious diseases, plus workflow development
  • Day 1 afternoon notes – Talks on standards and a panel on growing communities.

Keynote

Steven Salzberg – Open source, open access, and open data: why science moves faster in an open world

Steven talks about the power of open science in doing better work. Tells a story about Glimmer – developed at TIGR – the initial open source release from Steven. Lawyers preferred to try and sell Glimmer, but he ended up making an executive decision to release it open source. The next open source package was MUMmer 3, which did super fast large scale alignments (moving to GitHub soon). Also very popular and well cited, thanks to being open. Citations are the source of credit in academia – the currency we need. Bowtie/TopHat/Cufflinks – open software for NGS with over 20k citations. HISAT2 is the successor to TopHat and StringTie is the successor to Cufflinks. If you develop useful open software, people will use it. Mentions GATK as an example of not following this model – initially open source, supported by public grants, now has a free-only-to-academics license.

Open data is a bigger issue, and he talks through the human genome sequencing race (1998-2001): Celera versus the public project. The ABI 3700 was the technology enabling this. Questions about how Celera would make money: one idea was patenting human genes, now invalidated thanks to the BRCA Supreme Court decision. At the time NHGRI released data immediately into the public domain so it couldn't be patented. A personal example from Steven – sequencing of 12 Drosophila species. Wolbachia bacteria co-evolve with most fruit flies. Question from Michael Eisen – did any of the 11 new Drosophila have Wolbachia? Yes – found 3 new species, possible due to sharing of the data. Another open project – rapid data release from the Influenza Genome Project: 19,000 genomes and counting. At the start they asked influenza researchers for samples – paid for them, with the only restriction being that the data would be publicly available. Most declined because they did not want to release data publicly. Some examples of major genomics projects that failed at data sharing. ENCODE was a pure data generation project without any hypothesis. Data was released but embargoed, and in 2014 the policy changed due to outside pressure – so the 2003-2014 data was harder to use than it should have been. GTEx – massive RNA-seq analysis project. V3: could not publish until the consortium did. V4: 9 month restriction, and finally in V5 restrictions were lifted. Human data can be harder to share because of restrictions; however, there are loopholes that allow producers to sit on data for 5 years or more until publication. Some positive developments: Biden's launch of a major open access database to advance cancer research. Good examples: the ADNI Alzheimer's project and Sage, with great access policies. Releasing data helps correct bad science: example of the tardigrade genome, with an initial paper claiming lots of horizontal gene transfer. A second group did not see transfer – they looked at new data plus the original and found everything was contamination in the first paper. There are downsides to sharing data: if you publish a dataset, you lose control over authorship. A current problem where we need credit. How can we include better credit/co-authorship to help the people who produced the data?

Open access papers: everyone has access. Journal distribution is an artifact of the 1800s journal service to scientists – now we have the internet. Patients forced the change to make NIH-funded papers open access after 12 months. Why 12 months? Lobbying from journal publishers. Science kept secret is no different from science that was never done. What is the goal of science? Solve problems, and communicate to others so they can help. Progress is slower without working together. A great reminder about working on hard problems together to help the world become a better place.

Data Science

Alyssa Morrow – Mango: Data Exploration on Large Genomic Datasets

Why do we need distributed visualization? Data explosion – from the human genome (10Gb) to TCGA (3Pb). Current browsers don't scale – a single node shows 0.00125% of the entire genome. Ideally you'd be multicore and able to explore thousands of whole genome samples. The Mango browser uses Parquet and GA4GH schemas, runs on Apache Spark managed with Toil, and uses ADAM for data management; avocado – scalable variant calling – plus gnocchi – parallel variant analysis. Architecture: ADAM, access optimization on top of it, Scalatra and pileup.js on the frontend side. The goal is to have interactive latencies on top of a batch-oriented platform. Provides persistent store optimizations with a front end cache, plus memory optimizations for data already in memory – interval trees for region selection. Implemented a discovery mode to do intelligent pre-fetching based on feature/variant density. Scales nicely over lots of nodes.
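
The interval tree region lookup can be sketched with the Python intervaltree package (illustrative only – Mango's implementation is on the JVM):

```python
from intervaltree import IntervalTree

# Index features by coordinate for fast overlap queries on one contig.
tree = IntervalTree()
tree[100:200] = "feature-A"
tree[150:300] = "feature-B"

# Fetch everything overlapping a browser viewport, e.g. bases 180-250.
hits = tree[180:250]
print(sorted(iv.data for iv in hits))  # ['feature-A', 'feature-B']
```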

Frank Nothaft – ADAM Enables Distributed Analyses Across Large Scale Genomic Datasets

The ADAM genome analysis platform is a great project for building analyses on top of Apache Spark. Frank shows examples of GATK best practice runs and run times – slow because they are designed around flat file formats on cluster systems. Formats optimize for specific use cases, and flat architectures expose bad programming interfaces that are not productive to work with. In ADAM you start with a schema data model and build from that. Successes in other fields (networking) came thanks to data formats (IP): you can optimize whatever you want on both sides as long as you have the same schema. ADAM/Spark has horizontal scalability – nice graph of utilization over 1024 cores. Working on a validation project of the ADAM framework against GATK best practices using Toil for parallelization, with 260 individuals from the Simons Genome Diversity Project. The Toil pipeline system is designed for the cloud and takes advantage of spot pricing while tolerating failures. To run on Toil, they adapted it to start up a service job that manages Spark; the workflow then speaks to the Spark cluster. Produces statistically similar results to GATK, 30x faster and 3x cheaper. Working pipeline using hg19 and hg38.
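
A rough feel for the schema-first approach from Python, assuming reads already converted to Parquet with ADAM (the path and column names here are assumptions, not ADAM's exact schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adam-sketch").getOrCreate()

# ADAM stores reads as Parquet records with an explicit schema, so any
# Spark client can query them without bespoke file parsers.
reads = spark.read.parquet("sample.alignments.adam")  # hypothetical path

# A region query expressed against schema fields.
region = reads.filter(
    (reads.referenceName == "chr20")
    & (reads.start < 20000000)
    & (reads.end > 19990000)
)
print(region.count())
```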

Hannes Hettling – SUPERSMART – A Self-Updating platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa

SUPERSMART works to understand global biodiversity and biogeography by handling large scale phylogenies. Typical processing workflow: data retrieval, filtering/quality control, multiple sequence alignment, then tree inference. SUPERSMART integrates all of these components with scaling for very large phylogenies, using cleaned data from PhyLoTA as a baseline. As an integrated tool it has a lot of dependencies; to make installation easier they provide virtualized installs with everything versioned – uses Vagrant/VirtualBox so there are only two dependencies. They host a box at Naturalis/supersmart. There is also a Docker container, and it integrates with Galaxy.

Lorena Pantano Rubino – Characterization of the small RNA transcriptome using the bcbio-nextgen python framework

Lorena talks about her great work integrating small RNA analysis into bcbio. microRNAs have important functionality and lots of diversity with isomiRs – small, complex and biologically important. bcbio has variant calling, RNA-seq and small RNA-seq, and contains over 200 peer reviewed tools installed via Bioconda. The small RNA pipeline does everything you can imagine wanting to do with short RNAs – lots of custom tools also built by Lorena, including seqcluster to deal with multi-mapped reads on the same small RNA, with a built-in visualization interface for selection. Fits into MultiQC for interactive summarization of results. Used miRQC for validation and quality control of the pipeline. Shows plots of resource usage. Starting an open project for small RNA annotation and analysis: MIRTop.

Fabien Campagne – MetaR: simple, high-level languages for data analysis with the R ecosystem

MetaR aims to provide an interactive data analysis environment as part of R. IDEs like RStudio are primarily useful for experienced programmers. Interactive notebooks like IPython provide some nice features, but notebooks can be challenging because the analysis gets broken up into fragmented chunks. MetaR tries to combine the advantages of a UI with the ability to program, using language workbench technology. Provides composable R – you can combine a Biomart statement with R and it then generates pure R from it. You end up with seamless interaction, trying to remove the disconnect between different interactive languages. Works with source control and provides better reproducibility by combining MetaR with Docker.

Jorge Duitama – Development of NGSEP as an open-source comprehensive solution for analysis of high throughput sequencing data

NGSEP is, in Jorge's words, yet another tool to help with sequencing analysis. Does alignment and variant calling, including custom algorithms. Contains an implementation of indel realignment and structural variant detection with CNVnator plus custom algorithms. Integrates with Galaxy, iPlant and DNAnexus. Used for WGS of 100 rice genomes, now part of the Rice 3000 Genomes Project. Cool examples of structural variation detected in important rice genes.

Kam Dahlquist – GRNmap and GRNsight: open source software for dynamical systems modeling and visualization of medium-scale gene regulatory networks

GRNmap and GRNsight are used as a platform for talking about the open science ecosystem: how can open science facilitate teaching and connections? Students benefit from open source and open data – it helps them become involved with the community and analysis. Software development is done for teaching and as student projects – a cool way of integrating wet lab research into teaching undergraduates. GRNmap is a collaboration with math students, GRNsight with computer science students. The math collaborators were not used to working with open source repositories; GRNsight with the computer science folks used GitHub and open collaboration from the beginning. We need to teach software best practices while developing useful code. Great opportunities for undergraduates.

Notes: Bioinformatics Open Source Conference 2016 day 1 afternoon: Standards; Panel on growing communities

I’m at the 2016 Bioinformatics Open Source Conference (BOSC) in Orlando and these are the notes from the afternoon sessions from the first day focusing on standards and a panel on growing and sustaining open source communities.

Other notes:

  • Day 1 morning notes – Jennifer Gardy keynote on open data and infectious diseases, plus workflow development

Standards

Andre Masella – Enhancements to MISO: An open-source community-driven LIMS

Andre describes the use of MISO LIMS at OICR integrating with SeqWare – cool because it's a different workflow system than the development team's at TGAC/Earlham Institute. 1 developer at Earlham and 4 developers at OICR. They require code reviews for everything, with at least two developers, and have a mainline with site-specific repositories for configuration. Nice development processes. Lots of configurable plug-ins to externalize the configuration, so it handles lots of new integration features. Planning to expand the tissue processing workflow used at OICR, and to continue improving the UI with a new Excel-like Handsontable interface.

Chunlei Wu – Biothings APIs: high-performance bioentity-centric web services

Building a unified API for biological entities. Represents heterogeneous data in a document database – JSON documents with nested key/value pairs. MyGene.info and MyVariant.info are great resources for querying and retrieving information about genes and variants. Easy to use for developers – two endpoints, no API keys or sign-ins. Python and R clients are available. Incredible update rate: weekly for genes, monthly for variants. BioThings.io is a great resource for the community. Scales to 5M hits/day with 99.9%+ availability and quick return times. Used by CIViC and JBrowse.
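
The two endpoints are simple enough to use directly; this follows the documented v3 REST API:

```python
import requests

# Query endpoint: search genes by symbol.
res = requests.get("https://mygene.info/v3/query",
                   params={"q": "symbol:CDK2", "species": "human"})
hit = res.json()["hits"][0]

# Annotation endpoint: fetch full details for the matched gene.
gene = requests.get(f"https://mygene.info/v3/gene/{hit['_id']}").json()
print(gene["symbol"], gene["name"])
```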

Seth Carbon – The Noctua Modeling Tool

Noctua provides collaborative editing of RDF instance graphs for modeling biological processes (models). Uses a LEGO abstraction for modeling biology, and under the covers uses OWL to model the LEGO pieces. Noctua is really an abstract graph editor: it avoids the pitfalls of tabular models that cause issues. By being highly collaborative you can work together at the same time remotely. Nice demo of using Noctua for chaining together complicated biological interactions: feedback loops with positive and negative feedback.

Chris Mungall – Processing phenotype data using Phenopackets-API and PXFTools

Chris emphasizes that standards enable computation, but for phenotypic information there is a big lack of standards compared to genomics (GFF, VCF, BED). Phenotypes are hard, so he walks us through an important example using Mickey Mouse – hard to capture in words how perfect and funny this is. In seriousness, there is a lot of squishiness in defining things, and we can improve this using ontologies. Shows how hard this is – in many cases you need to combine multiple terms to describe a phenotype. To improve this situation they developed PXF: the Phenotype Exchange Format. It handles much of the complexity required in real life descriptions, and PXFTools makes it easier to work with these formats. Can use JSON-LD/RDF magic that I don't understand to convert into triples for modeling and query. Has a reference API in Java, plus Python and Javascript bindings.
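
A phenopacket is essentially structured JSON around ontology terms; a heavily hedged sketch of the general shape (the field names are illustrative, not the actual PXF spec):

```python
import json

# Illustrative only: consult the PXF spec for the real structure.
packet = {
    "id": "patient-1",
    "phenotypes": [
        {
            "type": {"id": "HP:0001631", "label": "Atrial septal defect"},
            "onset": {"id": "HP:0003577", "label": "Congenital onset"},
        }
    ],
}
print(json.dumps(packet, indent=2))
```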

John Bradley – Towards traceable, scriptable, and efficient data distribution for next-generation genomics

John talks about the data lifecycle in research – wants to have reproducibility, provenance and versioning. Typically we have terrible ID management based on filesystems. The Duke Data Service handles provenance and management better, with a full workflow around it. Cool – similar ideas to Arvados.

Panel: Growing and sustaining open source communities

This is the always great BOSC discussion panel. The goal of this panel is to discuss how open source communities can grow and develop. We have five great panelists:

  • Natasha Wood – Establishing a bioinformatics community in the Western Cape of South Africa using bioinformatics unseminars.
  • Bastian Greshake – Created openSNP, a crowdsourced environment to share genomic data. Growing quickly with lots of users and 5-10 contributors on the coding side.
  • Abby Cabunoc Mayes – Developer for the Mozilla Science Lab, engaging external researchers. To build a community, you need to make things.
  • John Chilton – Developer in the Galaxy project. It exists for the community and has a strong focus on users and researchers. Awesome 200 person Galaxy conference showing the strong devoted community.
  • Jamie Whitacre – Background in software development at Smithsonian, now technical project manager for Jupyter (IPython) project. Core team of 25 developers.

Discussion questions:

What is the motivation for community development? Natasha – knowing people from a large number of backgrounds helps with getting things done. Networking helps everyone. Abby – the initial motivation for most people is selfish: you have to offer something useful, then they start to care about the bigger mission. Jamie and John – Galaxy origins: very bare bones initially, but gained users as it became more useful. For IPython, an early small project that expanded as it got picked up. Bastian – different situation because people

Project solving a need, but how to define that need and differentiate to grow the community? Abby – learn things from startups. Very clear about value proposition and understanding what makes them unique. People come because they care about your mission so you have to be clear about that. John – easy to be technically focused but need to stay focused on use cases, and helping people. Jamie – having a really good idea and a strong personality. Fernando helps draw people to the project and picks a great team.

What is the value of empathy in the community building process? Jamie – recommends The Art of Community to learn more about what your community needs. John – understanding what people want is key to bringing people in. Example from Galaxy: fully opened up the development process and gave the community a voice.

Building bridges between communities while maintaining identity and ownership. Abby – a difficult problem; unfortunate about Mozilla Persona. OpenID. Jamie – JupyterHub wants to maintain a single identity so people can take their research and work with them.

How to foster volunteers to continue developing within established projects that are often ingrained? Abby – being able to delegate tasks; leave small tasks for others to do as well. As a leader, don't do too much. Idea: have easier bugs that you can mentor people through. John – seconds Abby's comments. In addition, breaking a project into smaller units allows people to have ownership of specific components, and this ownership provides incentives to contribute. Abby – make people feel appreciated and show what they accomplished.

How can you convert non-initiated people into your project? Abby – it's okay if everyone is not intimately involved; you can get good feedback from those closely associated with the project.

What about contributor badges for community contributions? Abby – a great idea. Badges show what people actually accomplished, uniquely identified with ORCID. John – names, names, names. Highlight what people did in release notes to show you value their contributions.

What is the best strategy for dissemination and growth? Natasha – get people involved in a project; hackathons worked. What hasn't worked: getting non-bioinformaticians involved, as they are looking for bioinformaticians rather than wanting to get involved themselves. Jamie – Jupyter gets its best developers through a long tail of interested contributors. Lots of ways to communicate and get involved with discussions: newsletter and mailing list. They also attend a lot of events and host specific events. John – full time people doing outreach and help (Dave, Jennifer); put in the effort to get the word out. Abby – want things to be more open by focusing on face to face events; the Mozilla Global Sprint is a good example of the latter. Study groups meet in different cities around the world.

Advice for people starting a project with a couple of developers; pitfalls? Abby – be open so everyone knows what is happening; don't forget about the rest of the world. Bastian – patience, it can take time before something comes of it. Natasha – useful to have a clear exit strategy for people who want to be part time contributors. John – build a community by becoming part of a larger community. A lot of projects benefit from using other projects instead of developing their own things; this helps find commonalities and collaborations.

When is a project large enough to recruit people to do community development? John – Dave Clements early on, 7th employee of Galaxy. Abby – even one person can do this. Michael – CWL, first person.

Hilmar Lapp – Open Bioinformatics Foundation (OBF) update

Hilmar starts by taking a poll of the audience to see who is at BOSC for the first time – ~50% of the audience, awesome to have so many new people. Provides an overview of all the work OBF does, and describes some difficult things currently affecting the community: running infrastructure with volunteers is hard – easier to set up than to maintain over time. Open Bioinformatics is involved in Google Summer of Code thanks to Kai Blin, the organization admin: 8 projects, none of them original Bio* projects, showing the cool growth of the community. This year there is an OBF Travel Fellowship to improve diversity in the community: 3 travel awards from the April 15th round, and applications for the August 15th round are now open.

Notes: Bioinformatics Open Source Conference 2016 day 1 morning — Open Data and infectious disease, workflows

I’m at the 2016 Bioinformatics Open Source Conference (BOSC) in Orlando, Florida. BOSC is a two day community conference devoted to open scientific development communities. Nomi Harris starts the day off with an introduction to the 17th annual BOSC. The theme this year is about connecting Communities of Communities and Nomi emphasizes the importance of bringing together multiple independent groups to create a larger community that can solve important problems. This is the key goal of BOSC and why we come together to share and learn from each other.

Keynote

Jennifer Gardy – The open-source outbreak: can data prevent the next pandemic?

Jennifer starts the conference off talking about the role of open data in infectious disease. She has some great stories about presentations from BOSC 2004 and trips to Orlando when she was in high school. On the science side, Jennifer talks about using genomic data to understand where new diseases come from and how we can understand their spread. There are 56 million deaths a year and 1/3 of these are due to infectious disease, with a 5x higher death rate in low income countries. Tuberculosis has been found in mummies from 2050BC, so we have both old and new diseases to deal with. Old diseases can change and become resistant, so lots of things to worry about. Beautiful map of where diseases emerge on a global scale. We know where they come from but are not looking for new diseases in a systematic way. Most locations do not have systems for collecting and sharing data, being mostly resource constrained. A good example is Ebola: December 6th the first death, March 22nd the first ProMED e-mail alert, August the declaration of emergency. Public health is quite bureaucratic and most data is kept private to research groups.

Awesome work on crowd sourcing the problem: EpiCollect for your phone, EpiHack for in-depth problems – called digital epidemiology. HealthMap provides reports on issues located nearby; they identified Ebola before ProMED. nEmesis monitors tweets for food poisoning updates.

Lots of bioinformatics opportunities in disease detection: rapid ID of pathogens from metagenomic surveillance data. Example of the great ZiBRA project sequencing Zika in Brazil and releasing data in real time.

Shows the incredible John Snow graphic from the cholera outbreak in London, which identified the infected water pump. Still a useful technique – identify isolates, use molecular epidemiology to identify clustered isolates, then try to find connections between clusters. However, there are lots of current limitations: you don't get the order/direction of transmission, the size and membership of clusters varies, and there is lots of manual work to identify the underlying transmission structure. Genomic epidemiology: use genomic sequence to track how a disease spreads without needing to manually talk to everyone. The data is simple to use – 21 base pairs vary over the whole genome in a tuberculosis outbreak, and you can compare only these. Easy to visualize and see sub-groups with a high resolution picture. Next work – want to automatically infer transmission from this structure. Infer a transmission tree from a phylogenetic tree: uses BEAST to draw the tree of the outbreak, and identifies locations where there are jumps, using this to infer an infection network.
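
The "compare only the variable sites" idea reduces to a tiny pairwise distance computation (a sketch with made-up SNP profiles):

```python
from itertools import combinations

# Concatenated alleles at the 21 variable sites, one string per isolate
# (made-up data for illustration).
isolates = {
    "case1": "ACGTACGTACGTACGTACGTA",
    "case2": "ACGTACGTACGAACGTACGTA",
    "case3": "TCGTACGTACGAACGTACGGA",
}

def snp_distance(a, b):
    """Number of variable sites at which two isolates differ."""
    return sum(x != y for x, y in zip(a, b))

for (n1, s1), (n2, s2) in combinations(isolates.items(), 2):
    print(n1, n2, snp_distance(s1, s2))
```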

Cool examples of open data and analyses available to anyone: Virological, NextFlu. The open-source, crowd-sourced analysis of the E. coli outbreak paper. So incredible – how can the community do it regularly? The challenge is to bridge the evolutionary-math-bioinformatics gap. Better information visualization and interpretation needed.

Workflows

Ted Liefeld – GenomeSpace: Open source interoperability platform with crowd-sourced analysis recipes

GenomeSpace is an open source tool to connect bioinformatics programs. Provides a standard way to organize files in the cloud, then ship them off to integrated tools. Has publicly shared files. 20 GenomeSpace tools, including the cBio portal, Galaxy, GenePattern, ISATools and Cytoscape. A great example of connecting tools together and creating a community of communities. To integrate a tool, it needs to be able to handle authorization and read files from URLs. They have a recipe resource with step-by-step instructions for doing integrative analyses – an awesome community distributed way to document and share.

Michael Crusoe – This is Why We Can Have Nice Things: Getting to 1.0 of the Common Workflow Language

Michael talks about the release of Common Workflow Language (CWL) v1.0. The motivation is that there are many workflow standards – could we move workflows between them? Standards create a surface for collaboration that promotes innovation. Works on both shared-nothing clusters (cloud) and academic clusters with shared filesystems. Michael does a great job of explaining the goals of CWL (a practical standard) and the community (a large set of members) and lessons learned. He also presents an awesome vision of building fully reproducible workflows with the workshop for sustainable software in science (WSSSPE).

Dan Leehr – CWL in Practice: Experiences, challenges, and results from adopting Common Workflow Language

Dan talks about his experience adopting CWL for practical usage in a biological research project. Some challenges: a change in paradigm and a new way of thinking. Advantages: better representation of workflows and portability. Required changing an architecture from bash scripts into CWL tools, using sub-workflows to group them into steps and high level workflows to run the full thing. Need to think through the data flow dependencies. Shows an example of a ChIP-seq workflow with quality control. Some things to adjust to in CWL: no branching/conditionals, so have distinct workflows for each code path; use scatter/gather instead of loops. Useful things: simple javascript expressions, embraces linux conventions and requirement specifications. Uses a different CWL implementation, Toil, to run distributed on SLURM.

Peter Amstutz – Using the Common Workflow Language (CWL) to run portable workflows with Arvados and Toil

Peter works on the Arvados project and talks about running CWL pipelines in multiple environments. What kind of software can we have if the baseline assumption is that we can move workflows between systems? Can we run an unmodified workflow using completely different workflow software, cloud providers, storage systems and schedulers? Used bcbio to run in two environments: Toil and Arvados. Toil: running on AWS, S3 storage, Mesos scheduler, converts CWL to a Toil workflow graph. Arvados is a managed multi-tenant architecture with a web workbench, running on Azure, Arvados Keep storage for files, Crunch + SLURM scheduler. Ran and got the exact same outputs – a great demonstration. To ensure this kind of compatibility there is continuous validation: a CI server continuously tests every implementation and provides guidance to users of CWL. If you can trust the ability to bring your own workflow, you can choose the platform that matches your needs. Portable APIs associated with this: the GA4GH Tool Registry API – Dockstore implements this API and is a usable implementation available now – and the GA4GH workflow submission API for further standardization.

John Chilton – Planemo – A Scientific Workflow SDK

The Galaxy philosophy on workflows – the most important user is the bench scientist using the GUI. Galaxy will never require an SDK; the SDKs are rather for bioinformaticians who prefer this approach over the GUI. Planemo (pronounced Plah-nemo – Nemo like the famous fish) is the way to develop Galaxy tools and focuses on developers. planemo creates a profile for testing workflows, then can re-run without needing setup every time. Galaxy's native workflow format is JSON – hard to read and impossible to write – so they swapped over to a Format 2 workflow which is very similar to CWL: CWL-inspired, and hopefully real CWL soon. Planemo also provides nice facilities to test workflows. CWL and Galaxy: right now CWL tools work within Galaxy. There is no support for CWL workflows yet, but that's a hopeful outcome for BOSC 2017. Planemo can lint CWL tools – useful functionality for standard CWL development. John describes other great work to make tool installation easier: Bioconda, Docker.

Daniel Blankenberg – Sample Size Does Matter: Scaling Up Analysis in Galaxy with Metagenomics

Dan talks about enabling metagenomic work with Galaxy, which handles a whole bunch of standard metagenomic tools. Dan is incredibly fast – having trouble keeping up. Handles normalization, metadata, graphs of differentiation between results, and integrates Phinch from Holly Bik (last year's keynote – awesome). Also handles large-scale multiple sample analysis – 500 samples now, 5000+ still under development.

Fabien Campagne – NextflowWorkbench: Reproducible and Reusable Workflows for Beginners and Experts

The NextflowWorkbench is an integrated development environment for Nextflow, with a nice typing system, auto-completion and error highlighting. It's a GUI environment that makes developers much more productive. Looks like a great environment; built on top of Nextflow, so it can work on laptops, clusters, or Google cloud. It's built with the JetBrains MPS language workbench.

Notes: Bio in Docker Symposium 2015 day 2: Docker infrastructure and hackathon

I'm at day 2 of the Bio in Docker Symposium at the Wellcome Collection in London. This is a great 2 day set of talks and coding sessions around the use of Docker containerization in biology workflows. The idea is to coordinate around using these tools to improve our ability to do science. There is a hackathon for half of today, but the morning talks focus on tools within the Docker ecosystem and learning how to use them to design better applications for biology. There is a ton of cool engineering ongoing, but untangling all of the different components is a challenge that hopefully these talks will help with.

  • Notes for day 1: NGS pipelines, workflow tools and data volume management

Publication

F1000Research – a publishing platform for the Docker community

Thomas Ingraham & Michael Markie

F1000Research is an open publisher that handles both traditional articles and outputs like posters and presentations. They do open post-publication peer review and are generally a great way to publish research. Articles are only indexed and officially published when passing peer review, and have versions so you can see changes over time. Collections organize related content for easier discoverability; one example is the Bioinformatics Open Source Conference (BOSC) channel. Today they're announcing a channel for container virtualization in informatics – a great way to make Docker-enabled publications publicly available.

New Docker functionality

Weaving Containers in Amazon’s ECS

Alfonso Acosta

The microservice oriented architecture model works around small components that each do a single job. This helps avoid complexity, but introduces challenges in coordinating and putting things together. WeaveWorks has tools to connect, observe and control containers. It connects containers, manages IP issues, sorts out DNS, and does load balancing between identically named containers. It handles node failures with recovery. Weave does not require a distributed key/value store – it uses gossip in a peer-to-peer fashion, so it is resistant to failures. On each machine you start up a new weave client; it only needs to know the name of one other host on the network, then they're all connected. Then you start and register containers in the network. Weave sits on top of Amazon ECS, providing all of the container registry and discovery.

Orchestrating containers with docker compose

Aanand Prasad

Docker Compose allows defining and running multi-container Docker applications. The motivation is that multi-container apps are a hassle to set up, configure and coordinate. It coordinates with Docker Machine, which creates Docker hosts, and Docker Swarm, which provides clustering of Docker containers. From the previous talk, it looks like Weave can work with Swarm directly. For a demo, Aanand shows Docker Compose talking directly to a Swarm cluster. The Swarm integration with Docker is nice – all of the standard docker commands work when you're running a coordinated cluster, and it does a great job of distributing containers across multiple hosts. Docker Machine makes running inside VMs easy, but you also lose the native goodness of Docker by going this route.

Manage your infrastructure like Google

Matt Barker & Matt Banes

JetStack helps with management and orchestration of containers. At Google everything runs in a container, with 2 billion containers started a week. Kubernetes is the open source version of their internal cluster management tool. Pods look really cool and are a way to group together multiple applications with shared volumes. The replication controller does the work of maintaining a consistent state of pods, restarting when things fail. Kubernetes' shared-state scheduler does the job of scheduling containers to machines based on resource requests. Services provide a way to label pods and then do discovery based on the naming, instead of introducing this complexity into your code. As of the new version, Kubernetes now has a Job object which explicitly manages batch jobs. Nice live example wrapping the Mykrobe predictor inside a Docker container and then running with Kubernetes – Mykrobe predictor spits out JSON which gets sucked into MongoDB by a watcher in another container. Kubernetes is impressive and the discoverability for coordinating between services is nice.

Docker and real world problems

Clive Stringer and Adam Hatherly

King's College London works with an NHS hospital and has big-hospital problems: many of the components don't fit well together, so things don't interoperate. A major need is components that slot together easily. CIAO (Care Integration and Orchestration) is open source middleware to make it easier to use standards and share information. The NHS is a complex set of organizations, so it is not set up to work as a whole system. CIAO encapsulates the complex XML integration code inside of components composed in Docker containers; each component is a self-contained microservice implementing a reference standard. A second focus is on trying to characterize genes based on genetic components. The challenge is to make medical records available.

OSwitch: One-line access to other operating systems.

Yannick Wurm

Yannick works with ants, and starts with a plug for ants being cool – shows an example of leaf cutter ants using symbiotic fungus to digest leaves. Sold. Then he motivates with the difficulties of working with computational tools to answer biological questions. Also sold. oSwitch is an awesome tool that provides a small wrapper around Docker: it creates an interactive instance around a tool, runs things on the files in the current directory, then exits. It removes all of the abstraction around Docker. We should have docs on debugging bcbio runs with oSwitch when running more with Docker.

Hackathon

The afternoon of the workshop is for working on problems and getting coding done. The organizers have nicely setup up tutorials ranging from learning Docker, to learning advanced Docker concepts to working with tools like Nextflow and Nextflow workbench that build on top of Docker. I’m working on continuing to improve CWL support in bcbio, moving towards the vision of bcbio running on alternative infrastructure presented in my talk at the conference.

Notes: Bio in Docker Symposium 2015 day 1: ngs pipelines, workflow tools, volume management

I’m at the Bio in Docker Symposium at the Wellcome Collection in London. This is a great 2 day set of talks and coding sessions around the use of Docker containerization in biology workflows. The idea is to coordinate around using these tools to be able to improve our ability to do science.

Pipelines and Workflows

Evaluating and ranking bioinformatics software using docker containers. plus Overview of the BioBoxes project

Peter Belmann

BioBoxes is a community effort driven by two projects: CAMI and nucleotid.es. The goal is to improve reproducibility and reuse, avoiding issues with compilation and file format management to focus on putting things together. BioBoxes is a standard for creating interchangeable software containers, grouping software based on interface definitions. For example: an Assembler takes a list of FASTQ files and produces FASTA. This allows you to plug in any set of assemblers that implement this interface. Built a command line interface that hides the configuration and usage of Docker to make it transparent. The key is agreeing on specifications to implement, which the community can iterate on. Good question about how to handle external data and mounting it into containers.

Portable workflow and tool descriptions with Common Workflow Language and Rabix

Nebojsa Tijanic

Nebojsa works at Seven Bridges on building reproducible computational analyses. The large issue: it is hard to set up the environment to run another analysis. Docker solves these problems for us by encapsulating the environment and code. Remaining issues: understanding the resources needed, how to supply inputs and how to capture outputs. A second usage pattern, which is what we do in bcbio, is to provide a docker image with multiple tools together, running in parts. Once you start to orchestrate these individual tools, you need a workflow engine to organize and run things. The result from multiple workflow engines is a single specification for describing workflows: the Common Workflow Language (CWL). The goal of CWL is to have a specification for portable workflows. Some implementations that run CWL now: the reference implementation, Rabix and Arvados. Work in progress implementations: Cromwell (Broad) and Galaxy. Design choices for CWL: declarative, extendable, uses Docker for packaging, YAML/JSON encoding, and re-uses existing work as much as possible. It contains a description for defining tools, which defines all inputs and outputs – combined with a community sourced specification like BioBoxes, this creates a standard for interoperability between workflow runners. Beyond tools, it contains a directed acyclic graph of workflow steps, and workflows support parallelization through scatter/gather. Rabix is the open source toolkit for CWL from Seven Bridges and will focus on consolidating with the reference implementation.

Manage reproducibility in genomics pipelines with Nextflow and Docker containers

Paolo Di Tommaso

Paolo starts by talking about the results of a survey from Nick Loman to figure out common obstacles to running computational workflows. Why are things so hard? They're complex, experimental and run on heterogeneous hardware. Containers solve a lot of these issues and are more lightweight than VMs – smaller images, fast start times and almost native performance. From a reproducibility standpoint, they have transparent build processes. To scale out, tools like Swarm, Fleet, Kubernetes and Mesos orchestrate containers – these are for orchestration, not task scheduling. The main cost is startup time – it's small for Docker, but matters if you're running a lot of very fast tasks.

Paolo helps develop Nextflow, which manages compute, a data registry and the local filesystem. Benchmarked native runs versus Docker and they are very close in execution time. It provides a DSL on top of the JVM with high level parallelization, with workflows based on dataflow processing. Nextflow is well thought out, including specifications for resources used and scaling. Works locally, on clusters with multiple schedulers, and on AWS with ClusterK. A nice feature is direct interaction with GitHub – you can pull a workflow from a repository automatically. Brave live demo of an RNA-seq analysis – a cool example that doesn't work locally because of tool installation, but does if you use Docker containers.

Next generation sequencing pipelines in Docker

Amos Folarin and Stephen Newhouse

Amos and Steve, the organizers of this great workshop, present their work on NGSeasy. The idea is to explore the use of Docker for workflow pipelines. Amos describes the motivations and good bits of Docker containers. There is a lot of promise but some technical challenges: Docker is moving quickly, and user namespaces are not yet supported (but planned for the next release), so you need root-equivalent access. In biology you create a large number of interfaces, potentially introducing complexity. NGSeasy strikes a balance in terms of separating tools inside containers: in general, use a larger container with a large number of available tools. Similar to the approach in bcbio, although NGSeasy has smaller suites of tools instead of everything. Uses bash scripts for orchestration and calling out to Docker. The goal of the conference is to bring together all of the projects to work together.

Pipelines to analyse data from the 100,000 Genomes Project as part of the Genomics England Clinical Interpretation Partnership (GeCIP)

Tim Hubbard

The goal of Genomics England is to transform the NHS to use genomics on a large scale – the focus is on treatment, not research design. It does all clinical whole genome sequencing for rare disease and cancer. Also meant to build up genomics infrastructure in England, leaving a legacy of infrastructure, human capacity and capability. For pipelines, they split up components into sub-pipelines and source these from local companies. For health records, they use OpenClinica. It's an impressive model around helping patients with genomic sequencing. For research work with the data, they are setting up the Genomics England Clinical Interpretation Partnership (GeCIP). How can we improve the 50% interpretation rate for rare diseases? Feed cases to research groups to improve interpretation for specific diseases. Practically there are 11 Genomic Medicine Centres, 70 hospitals and 9000 participants consented. The infrastructure is fully leased and virtualized with rented compute. For data sharing there are a couple of current models: open to all (1000 genomes) and managed repositories like dbGaP. The new model for Genomics England is managed access with no redistribution – you have to work inside the environment. Long term goals of GEL: an engine for NHS transformation to genomics, data standardization, and acceptance of data centers for securely processing patient data.

MetaR and the Nextflow Workbench: application of Docker and language workbench technology to simplify bioinformatics training and data analysis.

Fabien Campagne

MetaR tries to help newer users have reproducible analyses. New users often prefer a graphical user interface, but this is challenging to reproduce and scale. MetaR is a workbench for teaching R analysis to new users. The Nextflow Workbench builds on top of Nextflow, trying to make it easier to write Nextflow workflows. The workbench allows you to organize workflows into modules for reproducibility and provides a lot of nice user interface elements around Nextflow. Impressive interaction with Docker, along with specification of the resources required. Shows a real example of Salmon transcript indexes, which require transcripts plus other resources to pull in. Data is managed inside the Docker image with the Nextflow Workbench – I wonder if there are better ways to handle this data and link it into Docker containers. The workbench also allows writing a bash script and easily converting it to workflow processes. The workbench is interactive with auto-completion. Overall a nice development environment, built on top of the JetBrains language workbench.

Bioinformatics and the packaging melee

Elijah Charles

Elijah makes the good point that we need distributions that have pre-built tools so we can start at a higher level. Docker provides solutions for configuration management, isolation and versioning.

Data, Volumes and portability with Flocker

Kai Davenport

ClusterHQ provides data solutions for working with Docker – excited to hear about ideas for this. Why containers? Isolated: immutable environments you can keep separate as needed. Expedient: pre-built binary images. Compact: better resource usage compared to VMs. Images save each layer, making everything pluggable. Now on to Docker storage: volumes. Each container has its own mount namespace, but volumes are bind-mounted directories from the host into the container, persisting beyond the lifetime of the container. The challenge is scaling this to multiple machines. Flocker solves this problem by orchestrating storage for a cluster of machines: the Flocker control service pushes volumes to specific machines running clients, and supports a range of backend storage drivers. It integrates orchestration platforms (Mesos, Marathon, Kubernetes, Docker Swarm and friends) with storage (EMC and friends). The orchestration tools hide the hardware and manage a pool of compute resources, handling monitoring and failures. Looks like a really nice abstraction to make this easier.

Notes: Bioinformatics Open Source Conference 2015 day 2 afternoon: Translational, Visualization and Lightning Talks

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two day community conference devoted to open source and open science. These are notes about translational biology, visualization and last minute lightning talks.

My other notes from the conference:

Translational Bioinformatics

CIViC: Crowdsourcing the Clinical Interpretation of Variants in Cancer

Malachi Griffith

Large scale cancer genome sequencing is becoming routine. We can get lots of mutations, but the bottleneck is in visualization and interpretation of events. Shows example sample interpretations from Foundation Medicine done by paid curators. We should be doing this in public, and need resources to support this. Resources at WashU: DoCM and CIViC. The issue is that many hospitals and researchers are building up lists of variants we should care about in cancer – this needs to be done together. Existing resources aren't meant to be programmable and have non-open licenses, so they are hard to use. Principles of CIViC: interpretations should be freely available and debated openly; content needs to be transparent and kept up to date; there needs to be both an API and a web interface; access should remain free. The hope is that CIViC will end up in a precision medicine treatment cycle: capture information from trying to help late stage cancer patients who are fine with experimental treatments. CIViC is trying to capture known clinically actionable genes – very specific goals to avoid going in too many directions. Currently has 500 evidence statements from 230 published sources.

From Fastq To Drug Recommendation – Automated Cancer Report Generation using OncoRep & Omics Pipe

Tobias Meissner

Tobias talks about work on defining actionable targets that can get prescribed to the patient. They aim for a 2 week turnaround for sequencing and analysis. Most time is spent in analysis, so it benefits from automation and reproducibility improvements. Omics Pipe does the processing work and OncoRep prepares the clinical report. Introduces an example of a real patient who has been through multiple rounds of drug treatments. Omics Pipe implements best practice pipelines that run out of the box. OncoRep prepares an HTML patient report based on the calls, generated using knitr, with links back to evidence. Also provides a PDF patient report, generated with Sweave.

Cancer Informatics Collaboration and Computation: Two Initiatives of the U.S. National Cancer Institute

Ishwar Chandramouliswaran

Ishwar from the National Cancer Institute (NCI) presents the collaborative NCIP Hub initiative. The idea is to make tools available for biologists, and to fit these in with MOOCs for learning and training. NCIP Hub provides a home for content and keeps transparent metrics about usage. The second initiative is the Cancer Cloud Initiative with three implementations: Seven Bridges, the Broad and the Institute for Systems Biology. Please participate in the evaluation, with $2 million in cloud credits available.

Bioinformatics Open Source Project Updates

Biopython Project Update 2015

João Rodrigues

João talks about the Biopython project. Mentions all of the diverse contributions to the source code. Also talks about the benefit of Google Summer of Code (GSoC) for recruiting and retaining contributors: Eric Talevich was a student, then mentor, then administrator for GSoC – an open source career progression. João talks about improvements in Biopython over the last year and demos some cool functionality from KEGG. Beyond the code: Docker containers with Biopython plus dependencies that support IPython notebooks. Tiago wrote a book: Bioinformatics with Python Cookbook.
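
The KEGG functionality lives under Bio.KEGG; a minimal query sketch using the REST wrapper:

```python
from Bio.KEGG import REST

# List human pathways, then fetch the full record for the first one.
pathways = REST.kegg_list("pathway", "hsa").read()
first_id = pathways.splitlines()[0].split("\t")[0]

record = REST.kegg_get(first_id).read()
print(record[:200])
```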

The biogems community: Challenges in distributed software development in bioinformatics

George Githinji and Pjotr Prins

BioRuby migrated into BioGems. The idea is to decentralize the contribution process so there is no longer a central gatekeeper, and instead to promote and rank the new packages. They show a lot of metrics – downloads, GitHub issues, mailing list activity – and there was a good question about how to measure success. After publishing the paper on sambamba they saw a big uptick in downloads and GitHub issues: both bug reports and feature requests.

Apache Taverna: Sustaining research software at the Apache Software Foundation

Stian Soiland-Reyes

Apache Taverna is a workflow system that has been in development since 2001. Since 2006 they have productionized Taverna to make it easier to install and run, and since 2014 it has been an Apache incubating project. Stian describes the typical evolution of research software: incidentally open source, then developed ad hoc over time in directions different from those initially expected. There is a strong need for open development so the original starters aren’t the only leaders of the project. Move the focus towards the people that are doing things; move towards a do-ocracy. Looked at ways to change the legal ownership of Taverna and decided to move to Apache – they favor community over code and a move towards longer-term sustainability.

Visualization

Simple, Shareable, Online RNA Secondary Structure Diagrams

Peter Kerpedjiev

Peter is talking about making it easy to show RNA secondary structure: a tool called forna, with d3 goodness. The goal is to show things that are hard to visualize: simplify 3D structures back to 2D to make them easier to see, and convert 1D to 2D to make structure obvious. Nice examples. Another tool that does this is RNA-PDB. Can make more complex applications with d3 and the rnaPlot layout. The container component is fornac.

BioJS 2.0: an open source standard for biological visualization

Guy Yachdav

BioJS is a set of reusable blocks for representing biological data on the web. Have an online registry to make it easy to discover new packages. Uses npm for installation. Looking for new components and contributors.

Visualising Open PHACTS linked data with widgets

Ian Dunlop

Open PHACTS brings together a large number of pharmaceutical resources into an integrated infrastructure. Uses RDF under the covers but has an API to query. Lots of nice visualization widgets and compound displays built with BioJS.
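
For a flavor of the API side (as opposed to the widgets), a hedged sketch of a query; the host, version path, app_id/app_key scheme and the concept URI below are all assumptions to verify against the Open PHACTS docs:

```python
import requests

# Hedged sketch of an Open PHACTS API call. Host, API version, the
# app_id/app_key authentication scheme and the concept URI are all
# placeholders/assumptions -- check the Open PHACTS docs before use.
params = {
    "uri": "http://example.org/concept/PLACEHOLDER",  # a compound's concept URI
    "app_id": "YOUR_APP_ID",
    "app_key": "YOUR_APP_KEY",
    "_format": "json",
}
resp = requests.get("https://beta.openphacts.org/1.5/compound",
                    params=params, timeout=30)
print(resp.status_code, resp.headers.get("content-type"))
```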

Late-Breaking Lightning Talks

Biospectra-by-sequencing genetic analysis platform

Aurelie Laugraud

Originally called Genotyping by Sequencing (GBS) – a cheap and easy way to sequence only part of a genome. Used first on maize because there is lots of population data and a massive genome. The analysis pipeline is called TASSEL, with both reference and non-reference pipelines. BioSpectra-by-Sequencing (BSS) brings together a community to make tools available for existing data.

PhyloToAST: Bioinformatics tools for species-level analysis and visualization of complex microbial communities

Shareef Dabdoub

Shareef highlights issues found with QIIME that led them to develop PhyloToAST, which modifies and extends the main pipeline. Includes new plots through matplotlib – nice 2D + 3D views of the same data that make groups easy to distinguish. Also added automatic export of data into the Interactive Tree of Life (iTOL).
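
The 2D + 3D pairing is easy to picture with plain matplotlib; a generic sketch (not PhyloToAST's actual plotting code) with made-up ordination coordinates:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers 3D projection)

# Generic sketch of "same ordination, 2D and 3D side by side" -- plain
# matplotlib with synthetic coordinates, not PhyloToAST's code.
rng = np.random.default_rng(0)
coords = rng.normal(size=(30, 3))      # e.g. first three PCoA axes
groups = rng.integers(0, 2, size=30)   # two sample groups

fig = plt.figure(figsize=(8, 4))
ax2d = fig.add_subplot(1, 2, 1)
ax2d.scatter(coords[:, 0], coords[:, 1], c=groups)
ax2d.set(xlabel="PC1", ylabel="PC2")

ax3d = fig.add_subplot(1, 2, 2, projection="3d")
ax3d.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=groups)
ax3d.set_xlabel("PC1"); ax3d.set_ylabel("PC2"); ax3d.set_zlabel("PC3")
plt.show()
```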

Otter/ZMap/SeqTools: A productive alternative to web browser genome visualisation

Gemma Guest

Gemma talks about visualization and annotation tools from the Sanger. Otter does interactive graphical annotation. ZMap is a high-performance genome browser. SeqTools, with Blixem, provides visualization of sequence alignments at a higher level of detail than ZMap, and Dotter provides detailed comparisons of two sequences.

bioaRchive: enabling reproducibility of Bioconductor package versions

Nitesh Turaga

Nitesh is part of the Galaxy team at Johns Hopkins. The issue with Bioconductor is that it’s quite difficult to get older versions of tools – you can only really get the latest. bioaRchive provides a nice browsable website and packages of old versions of tools; you can use standard install.packages and point it at bioaRchive. For Galaxy, this now makes all versions available for full reproducibility. Future goals are to get Bioconductor involved in the process and integrate with biocLite.

Developing an Arvados BWA-GATK pipeline

Pjotr Prins

Pjotr is working at a HiSeq X Ten facility: 18k genomes per year and 50 genomes per day. The existing pipeline takes 3 days on the cluster, and the bottleneck is the shared filesystem. Decided to try Arvados based on conversations at BOSC last year. Took a week to port the Perl script over to Arvados. Runs in 2 days for a single run, with flat performance across 8 samples on AWS. Nice ability to share pipelines in Arvados.

Out of the box cloud solution for Next-Generation Sequencing analysis

Freerk van Dijk

Put together a VM for NGS analysis using Molgenis. You can download the image, upload data to the VM and then run. Uses the OpenStack framework for running and EasyBuild to install the software. Inputs are defined with a CSV file, and jobs are generated through Molgenis. Nice setup – creates a reproducible, scalable and portable system.

Notes: Bioinformatics Open Source Conference 2015 day 2 morning — Ewan Birney, Open Science and Reproducibility

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two-day community conference devoted to open source and open science. These are my notes on the morning session with a keynote from Ewan Birney and a section on Open Science and Reproducibility.

Keynote

Big Data in Biology and why open source matters

Ewan Birney

Ewan is a founder of BOSC and former director of OpenBio, so we’re excited to have him giving a keynote. We start with lots of stories and memories about his impact on the current set of community members in BOSC.

Ewan starts by reminding us that we’re going through a revolution. The cost of sequencing dropped in 10 years from millions to thousands: a mansion versus season tickets to Arsenal. He is brilliantly good at describing the high-level picture of the importance of sequencing technology and its changes.

3 reasons why open source code matters:

  1. Scientific transparency: providing access to your data and analysis approach is a fundamental process of science.
  2. Efficiency: library-scale code re-use. It’s a careful art and you need to do more than just make your code available. Bugs can have downstream consequences on conclusions – avoiding them is an extra big job, so we need to pool risk on key components.
  3. Community: sharing code allows people to specialize in specific areas, and outlasts any individual’s contribution. The challenge is to fund and support these community projects.

Infrastructures are crucial, and they are hard to work on because we only notice them when they fail. Life sciences is about small details: lots of data types and metadata. EMBL-EBI can scale in terms of data, but won’t scale in terms of people and expertise. ELIXIR makes this a joint problem tackled across Europe, with the goal of scaling in terms of expertise and collaboration.

Now Ewan talks about his work on storing information in DNA. Dreamed up over beers, then they came up with ways to encode binary as DNA in small, easily made chunks with redundancy. Stored a picture of EBI, Martin Luther King’s “I Have a Dream” speech, the Watson/Crick paper and Shakespeare’s sonnets.
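
As a toy illustration of the binary-to-DNA idea: a naive two-bits-per-base encoding looks like the sketch below. The actual published scheme was more involved (base-3 with a rotating code to avoid homopolymer runs, plus overlapping chunks for redundancy).

```python
# Toy illustration of "binary as DNA": pack each byte into four bases,
# two bits per base. Not the actual Goldman/Birney encoding scheme.
BASES = "ACGT"

def bytes_to_dna(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

message = b"I have a dream"
assert dna_to_bytes(bytes_to_dna(message)) == message
print(bytes_to_dna(message)[:24])
```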

Good questions around scaling and the cloud. There is a growing need to practice healthcare alongside research science. For access reasons we need federated systems that can share data – we will not be able to aggregate all the data in one place, so we need to make this federated future work for us.

Question about how to financially support open-source infrastructure developers. Need to look at how Linux/Apache foundations work and try to build off their success at supporting paid developers managing a larger volunteer community.

Open Science and Reproducibility

A curriculum for teaching Reproducible Computational Science bootcamps

Hilmar Lapp

There is a reproducibility issue in science – we’re not that good at reproducing results, and it’s time consuming to do. Reproducible science makes it easier and faster to build upon. Computationally it’s especially hard because of dependency and tool issues. The Reproducible Science Curriculum Workshop and Hackathon in Durham aimed at improving this. The earlier you start doing this in a project, the more the benefits accrue to yourself. Promoting literate programming via IPython/RStudio/knitr. The curriculum teaches both new users and those who have some expertise but could be convinced to switch to new approaches. Held two workshops so far, in May and June – changed to focus more on literate programming earlier, since attendees didn’t need convincing. There was a ton of awareness of and demand for these courses. Future plans are to tweak and improve the workshop and teach more. Need to figure out how to fund and sustain this effort – it’s funded through 2015. All materials are available on GitHub.

Research shared: http://www.researchobject.org

Norman Morrison

Research Objects are part of a framework to enable reproducible, transparent research. Want to have a manifest of the materials/resources available inside research containers. The manifest is a plain text file full of dates, DOIs and links to resources describing what it contains. Lots of scope for ontologies and naming. Tutorials and specifications for everything are available on GitHub. Lots of cool use cases showing how to make results fully computable and available. FAIR = Findable, Accessible, Interoperable, Reusable. Awesome graph of the time costs of moving to reproducibility.
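
To make the manifest idea concrete, here's a guessed-at example; the "aggregates" list echoes the Research Object manifest's flavor, but the field names are illustrative rather than the actual spec:

```python
import json

# Illustrative manifest for a research container: dates, DOIs and links
# describing what it holds. Field names are my sketch, not the RO spec.
manifest = {
    "created": "2015-07-10",
    "createdBy": "example author",
    "aggregates": [
        {"uri": "data/expression_counts.csv", "mediatype": "text/csv"},
        {"uri": "analysis/de_analysis.Rmd", "mediatype": "text/markdown"},
        {"uri": "https://doi.org/10.5281/zenodo.00000",  # placeholder DOI
         "role": "archived-copy"},
    ],
}
print(json.dumps(manifest, indent=2))
```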

Nextflow: a tool for deploying reproducible computational pipelines

Paolo Di Tommaso

Nextflow provides a declarative syntax for writing parallel and scalable workflows. It has a nice domain specific language (DSL) based on dataflow, a declarative model for concurrent processes. Under the covers it uses async channels and handles parallelization implicitly from the input/output definitions – think of the program as a network. It implicitly handles parallelizing over multiple inputs, and is platform agnostic, scaling out from multicore machines to all the standard cluster managers.
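
A rough Python analogue of the dataflow model, purely conceptual (Nextflow's own DSL is Groovy-based): a process is a function, a channel is a stream of inputs, and the runtime fans work out across workers based only on the declared inputs and outputs:

```python
from concurrent.futures import ProcessPoolExecutor

# Conceptual analogue of dataflow execution in plain Python. The align
# function stands in for a "process"; the samples list is its input
# "channel"; pool.map is the implicit fan-out Nextflow derives from
# input/output declarations.
def align(sample: str) -> str:
    return f"{sample}.bam"  # stand-in for running an aligner

def main() -> None:
    samples = ["sampleA.fq", "sampleB.fq", "sampleC.fq"]  # input channel
    with ProcessPoolExecutor() as pool:
        bams = list(pool.map(align, samples))             # output channel
    print(bams)

if __name__ == "__main__":
    main()
```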

Why is it so hard to deploy a pipeline? The big issue: many dependencies that change quickly. To mitigate this, manage a pipeline as a self-contained GitHub repository and use Docker to manage tools. Provides a nice example pipeline on GitHub to demonstrate. GitHub provides versioning that feeds right into Nextflow.

Free beer today: how iPlant + Agave + Docker are changing our assumptions about reproducible science

John Fonner

John works on iPlant, tackling problems in cyberinfrastructure for the plant community. They have a bunch of storage and compute, API-level services for command line access, then a web-based user interface for interactivity. iPlant has a Discovery Environment for managing data and analysis history. Atmosphere is an open cloud for the life sciences where you can request additional resources. Agave provides a programming interface for building things on top of iPlant and handles pretty much everything you’d need to build on it. Uses Docker to store dependencies and tools alongside a GitHub repo with code. Agave handles job provenance, sample data and platform integration along with sharing and cluster execution.

The 500 builds of 300 applications in the HeLmod repository will at least get you started on a full suite of scientific applications

Aaron Kitzmiller

Lots of ways to publish code, which is good. The problem is that there are so many different ways to install these tools that it takes a lot of work to build them. Built HeLmod on top of Lmod to manage a huge set of scientific dependencies in a clean environment. Has 500 specification files allowing installation of all these tools. Really nice shared resource – we should all be building modules together instead of separately at each facility. HeLmod GitHub repo

Bioboxes: Standardised bioinformatics tools using Docker containers

Peter Belmann

bioboxes is motivated by Docker-based benchmarking projects (CAMI and nucleotid.es). It's a standards project providing a way to specify the inputs and outputs of boxes, which lets you easily interchange tools for benchmarking. Nice community project for specifying these.

The perfect fit for reproducible interactive research: Galaxy, Docker, IPython

Björn Grüning

Björn talks about his brilliant work combining Galaxy, IPython and Docker. Galaxy can run Docker containers and has a rich set of visualizations; what was missing was a way to interact with and tweak your data. Invented the concept of an interactive environment in Galaxy – it spins up a Docker container that works against Galaxy data. This is a sharable Galaxy data object, so it has all those advantages. RStudio is also integrated if you prefer R over Python. Also has a Docker-based way to install and use Galaxy quickly.

COPO: Bridging the Gap from Data to Publication in Plant Science

Robert Davey

There are cultural issues in getting scientists to deposit metadata. The idea: make it easier and more connected so there is a bigger benefit to users, overcoming this cultural barrier. This allows you to build graphs of interconnected data and track usage of your data, which can be helpful in describing its value. COPO project.

ELIXIR UK building on Data and Software Carpentry to address the challenges in computational training for life scientists

Aleksandra Pawlik

Aleksandra has 1 slide for her lightning talk – brilliant. ELIXIR is adopting Software Carpentry as a training model. Really awesome to be spreading a single teaching model across multiple countries. It feels like we are finally not developing independent materials everywhere and can have good training for everyone.

Parallel recipes: towards a common coordination language for scientific workflow management systems

Yves Vandriessche

Yves builds tools for people who build tools. Scripts deal with the complexity of gluing together applications, but we need more distributed jobs, and the biggest issue in that move is ordering dependencies when running in parallel. Integrated the CWL workflow specification into precipes. Code on GitHub.

openSNP – personal genomics and the public domain

Bastian Greshake

OpenSNP is about open data and personal genomics. The idea is to provide a way to upload and share genotype and phenotype data, along with genotyping from 23andMe. Mines SNPedia to provide useful feedback to users. Provides complete dumps of the data and APIs. 2000 genetic datasets and 4000 people registered. Users have mined his data and provided interpretations back.
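
As a sketch of the API side, a hedged example pulling the shared phenotypes; the endpoint path and field names are from my memory of the openSNP API docs, so verify against https://opensnp.org before relying on them:

```python
import requests

# Hedged sketch: list shared phenotypes from openSNP's JSON API. The
# endpoint path and the "characteristic" field are assumptions and may
# have changed.
resp = requests.get("https://opensnp.org/phenotypes.json", timeout=30)
resp.raise_for_status()
for phenotype in resp.json()[:5]:
    print(phenotype.get("characteristic"))
```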

Notes: Bioinformatics Open Source Conference 2015 day 1 afternoon — Standards, Interoperability and Diversity Panel

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two-day community conference devoted to open source and open science. These are my notes on the day 1 afternoon session on standards and interoperability, and the panel on improving diversity in the BOSC community.

Standards and Interoperability

Portable workflow and tool descriptions with the CWL

Michael R. Crusoe

Michael provides an overview of the Common Workflow Language, a great standard developed out of Codefest 2014 and the BOSC community. It provides a way to define workflows and tools in a re-usable and interoperable way. They looked at many existing standards and approaches; a shout-out to Workflow4Ever, which was an awesome standard to build off.
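
For a feel of the format, here's a minimal tool description in the CWL style, embedded as a string in Python so the sketch stays runnable; the field names follow the later CWL v1.0 spec, while the draft in use at this BOSC differed in detail:

```python
import yaml  # PyYAML

# Minimal CWL-style tool description: count lines in an input file.
# Field names follow CWL v1.0; the 2015-era draft differed in detail.
TOOL = """
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  infile:
    type: File
    inputBinding: {position: 1}
outputs:
  count:
    type: stdout
"""

doc = yaml.safe_load(TOOL)
print(doc["class"], doc["baseCommand"])
```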

From peer-reviewed to peer-reproduced: a role for research objects in scholarly publishing in the life sciences

Alejandra Gonzalez-Beltran

Alejandra talking about research objects and the ability to use them to improve reproducibility. The work is published in PLOS ONE and builds off ISATools, Galaxy and GigaScience. Took an article on SOAPdenovo2 and built up the infrastructure that should have been present: description of the experimental steps in ISA-Tab, running in Galaxy. Published the findings as nanopublications in RDF and linked the statements of results with the experimental descriptions. The idea behind Research Objects is that they bring together data, analysis and conclusions.

Demystifying the Interoperability of Disparate Genomic Resources

Daniel Blankenberg

Dan starts by describing what Galaxy is, trying to correct some earlier misconceptions and define the goals of the project. An awesome new thing is the interactive support for IPython and RStudio. The main goal of the talk is getting data into Galaxy in more automated ways, including multiple files, metadata and GenomeSpace, and exporting data out of Galaxy to places like UCSC or to other sinks like GenomeSpace. Whew, a whirlwind tour.

Increasing the utility of Galaxy workflows

John Chilton

John is talking about Galaxy workflows, designed for biologists. However, they are only used by 15% of users, perhaps because of limitations. The engine previously only scheduled jobs, but now handles map/reduce-style operations and is a real workflow engine. Now has Collection types (list and paired), with support for dealing with these via a nice web interface where users build the sample relationships themselves. Can now do solid sample tracking, and tools can consume paired datasets. John shows some nice examples of community-developed workflows. The workflow engine and model are so nice now, with lots of great features. I’m looking forward to building on top of it with bcbio once everything is CWL.

Kipper: A software package for sequence database versioning for Galaxy bioinformatics servers

Damion Dooley

Kipper focuses on helping to recreate sequencing analyses by organizing reference databases. Some databases are nice and let you download previous versions, but large resources like NCBI are continuously updated. Kipper is a Python script that keeps track of everything.

Evolution of the Galaxy tool ecosystem – happier developers, happier users

Martin Čech

Martin talks about recent improvements to the Tool Shed to make it easier for developers to use. Galaxy is involved with the ICGC-TCGA DREAM challenge for re-running analyses. Awesome. Planemo helps in testing and developing new tools, treating Galaxy as a transparent dependency.

Bionode – Modular and universal bioinformatics

Bruno Vieira

Bionode is a JavaScript library for working with biological data. It works both in the browser and locally with node.js. Really nice looking code: code on GitHub. Also talks about the really cool oSwitch, which makes it easy to run local commands in Docker containers with the actual tools.

The EDAM Ontology

Hervé Ménager

The EDAM Ontology describes bioinformatics operations, data types and topics. It’s a critical way to define reproducible workflows so we know what is happening where and can convert between different tools that do the same work. It’s part of ELIXIR, so it will be in new infrastructure.

Panel Discussion – Open Source, Open Door: Increasing diversity in the bioinformatics open source community

Mónica Muñoz-Torres, Holly Bik, Michael R. Crusoe, Aleksandra Pawlik, Jason Williams

Moni is chairing the panel, talking about our goals at BOSC to have a more diverse community. We want to welcome underrepresented members of all types, to help enrich the set of skills and opinions in our community. The experience of the panelists is incredible: helping bring new scientists into the community at all levels. There are a large number of cultures and communities we interact with – how can we get better at that?

The goal of the panel is to start the conversation and hear people’s voices: to hear about the situations that make people feel uncomfortable and try to remedy them. How can we set up BOSC so that people don't feel excluded?

How do you get involved with the community? Holly’s recommendation: volunteer to help with organizing. Yes please, we’d love to have more help with BOSC organizing. Conversely, how do you bring more people into your community? Provide role models for the people you want to attract – we need to build and encourage a diverse set of role models.

How can you get more diverse applicants? Positions often get a large number of male applicants and very few female applicants. Suggestions: pay better, offer more flexibility in working hours, and pay undergraduates to work in the lab to create a more diverse environment. Co-mentor with other groups – say, one more biology focused and one more computational. It's also a numbers problem: we need to invest in the education of under-represented younger students. Can we provide a clear career path for bioinformatics so people want to go into it?

Suggestions for helping to improve diversity and open science: collaborate with folks who don’t normally do that, and bring your diverse, open worldview into the collaboration.

Bioinformatics is a chance for more people to get involved: all you need is an internet connection and a community. The barrier is the knowledge. We need to motivate and welcome people to become involved, then provide the training to get them there.

Notes: Bioinformatics Open Source Conference 2015 day 1 morning — Holly Bik and Data Science

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two-day community conference devoted to open source and open science. Nomi Harris starts the day off with an introduction to BOSC – this is the 16th annual year. The theme this year is Open Source, Open Door: we have a strong interest in improving diversity at BOSC. Some of the work in progress includes reaching out to many under-represented groups and sending personal invitations, adding a code of conduct to both BOSC and ISMB, appointing Sarah Hird as the BOSC outreach coordinator, and providing financial support for scientists who might otherwise not be able to attend.

Keynote

Holly Bik – Bioinformatics: Still a scary world for biologists

Holly plans to talk about her experience transitioning from biology to bioinformatics. 2010 is the year Holly saw the command line for the first time, and she will give her perspective as an end user. Holly works on marine nematodes, although environmental sequencing lets you sample everything you find. So you might be looking for nematodes, but you'll find plenty of other tiny eukaryotes to study: neglected taxa in understudied habitats.

Holly provides a motivating example: the impact of the Deepwater Horizon oil spill in September of 2010. She then describes her biology experience: collecting deep water samples, nematode classification, drawing of nematodes. None of that was good preparation for the command line. Holly had to adapt to a fast-moving field: collecting thousands of samples, dealing with bias and contamination error, using multiple approaches. Holly quotes Donald Rumsfeld about known unknowns. She feels confident knowing how to do basic scripting and debugging. Known unknowns: new technologies like Hadoop, Drupal, etc. create a barrier. Unknown unknowns: distributed computing, binary formatted files that aren’t obviously different from the text files that came before.

Holly mentions the useful point that what you call yourself influences people’s perceptions: “computational biologist” says one thing to people and “marine biologist” another. How do we properly present ourselves, both in jobs and grants?

Two options for biologists who need bioinformatics: hacking it together yourself or using pre-packaged tools. She awesomely compares this to the Oregon Trail game – I had no idea Oregon Trail was so comparable to biology + bioinformatics. Do not try to cross the river yourself.

How to learn and get going? The hardest part is intermediate learners: lots of resources exist for beginners, but fewer once you already know something and want to level up.

Holly talks about her open-source work on Phinch – research-driven data visualization, providing visualization of data in Chrome. Really nice interface designed by actual interface designers. It’s a prototype framework, currently with five unique visualizations for summarizing data. Allows you to look at both high- and low-level patterns. Visualization helps make this data available to citizen scientists.

Next phases of development: phylogenetic visualizations, open public APIs with a self-sustaining developer community. Code on GitHub.

Main points: provide more interdisciplinary training, collaboration and interaction for intermediate learners. Tools that actually take advantage of biology and reduce the need for biological expertise.

We have a new Q&A format we’re trying out, with questions from Twitter and index cards. On training courses – found that in-person courses are much more useful than online ones because people actually follow through and finish them. Teach people to Google their error messages – yes please.

Data Science

Mónica Muñoz-Torres – Apollo: Scalable & collaborative curation for improved comparative genomics

Moni works at LBL on WebApollo with Nathan Dunn and Suzi Lewis. The goal is to improve manual annotation and the scalability of this work. The architecture has 3 components: a web-based front end, an annotation editing engine and a server-side data service. Uses Google Web Toolkit on the front end and Grails on the backend, plus a single datastore with PostgreSQL. Moni describes a lot of the pain of refactoring and the difficulty of estimating time when working with new technologies. Code on GitHub.

Kévin Rue-Albrecht – GOexpress: Visualize gene expression using gene ontology

Kévin talks about the motivations behind GOexpress, an R/Bioconductor tool for making sense of gene expression data. Shows a huge map to demonstrate the difficulty of identifying the effect of a treatment: often there are multiple experimental factors, noise in the data and clustering driven by a few genes of large effect. For instance, MHC drives clustering in animals, and you need to group genes by effect to account for this and actually look at the signal. Uses random forest and ANOVA methods to classify genes and separates them by ontology. Exposes the selection of genes through a Shiny web application. The tricky part with random forests: you need to run permutations to get the p-values that everyone wants. Code on GitHub.
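
GOexpress itself is R/Bioconductor, but the core move is easy to sketch in Python with scikit-learn as a stand-in: score each gene's ability to separate the experimental groups with a random forest, then rank by importance (synthetic data, not the package's actual algorithm):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in sketch of the GOexpress idea: rank genes by how well they
# separate two experimental groups, using random forest importances.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 100))       # 40 samples x 100 genes
y = rng.integers(0, 2, size=40)      # two experimental groups
X[y == 1, :5] += 2.0                 # plant signal in the first 5 genes

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top genes:", ranking[:5])
```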

Peter Amstutz – Arvados: A Free Software Platform for Big Data Science

Peter talks about computational reproducibility with Arvados, first motivating with examples of how we can improve our ability to run complex pipelines. Components of Arvados: Keep, content-addressable storage – versioned and immutable, with manifests that allow reorganization. I wish I understood filesystems better to know how Hadoop-style file systems differ from this. Crunch is the computational engine: it uses Keep for data, Git for code and Docker for tools to give a full set of reproducible components. This architecture allows moving analyses between multiple instances, and Arvados provides facilities for this sharing – both public and between groups in private. Ends by mentioning the Common Workflow Language and the move towards better workflow standards, which will be in Michael’s talk later.
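
The content-addressable idea behind Keep is easy to sketch: blobs are stored and fetched by the hash of their content, so data is immutable and a manifest is just a list of locators (this toy ignores Keep's real manifest format, chunking and replication):

```python
import hashlib

# Toy content-addressable store in the spirit of Keep: data is immutable
# and addressed by a hash of its content. Not Keep's actual format.
class BlobStore:
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        locator = hashlib.md5(data).hexdigest()
        self._blobs[locator] = data
        return locator

    def get(self, locator: str) -> bytes:
        return self._blobs[locator]

store = BlobStore()
loc = store.put(b"ACGT" * 10)
assert store.get(loc) == b"ACGT" * 10
print(loc)
```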

Sebastian Schoenherr – Bringing Hadoop into Bioinformatics with Cloudgene and CloudMan

Sebastian motivates by recalling a BOSC talk from 2012 by Enis Afgan on CloudMan and the awesome work on automating Galaxy + AWS, as well as his own CloudGene talk from 2012. Now CloudGene is more of a software-as-a-service platform, providing dedicated services for a given workflow, and supports the full Hadoop stack: Spark, MRv2, Pig. There is lots in common with CloudMan, so they decided to combine projects, using CloudGene for Hadoop execution within CloudMan. Presents a cool use for Hadoop – the Michigan Imputation Server: a free service providing QC + phasing + imputation. Really nice, and provides a platform for building more services.

Michael Hoffman – Segway: semi-automated genome annotation

Segway finds patterns from multiple biological signal tracks like ChIP-seq. It discovers patterns and then provides annotation, visualization and interpretation. Genome segmentation breaks the genome into non-overlapping segments, then pushes the boundaries around to maximize similarity within regions. Uses a generalized HMM to discover structure in the inputs, segmenting by a specified number of classes. Michael makes the good point that coders are biologists too: he can use his knowledge of chromatin structure to develop hypotheses from these inputs and then test them, applying labels to each segment of the genome. This provides nice annotation tracks for making sense of variation in non-coding regions, and he was able to confirm that Segway signals match the expected biology. Michael makes another good point about the importance of looking in depth at specific regions, and then using this to do biological experiments to confirm.
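
Not Segway itself, which uses a richer dynamic Bayesian network model, but the core segmentation move can be sketched with the third-party hmmlearn package: fit an HMM with a fixed number of labels to a signal track and read the state path as a segmentation:

```python
import numpy as np
from hmmlearn import hmm

# Minimal sketch of HMM-based segmentation: fit a 2-state Gaussian HMM
# to a synthetic signal track and read the state path as labels.
rng = np.random.default_rng(2)
signal = np.concatenate([rng.normal(0, 1, 200),
                         rng.normal(4, 1, 200)]).reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, n_iter=50, random_state=0)
model.fit(signal)
labels = model.predict(signal)  # one label per genomic bin
print(np.bincount(labels))
```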

Konstantin Okonechnikov – QualiMap 2.0: quality control of high throughput sequencing data

QualiMap does a comprehensive job of quality control for sequencing data. The new version of BAM QC was redesigned to add many new metrics. Added a method to combine results from multiple BAM files and summarize them, including a PCA analysis to detect outliers in a group of results. Also redesigned the RNA-seq quality control. Konstantin highlights the large number of folks from the community who contribute to QualiMap.

Andrew Lonie – A Genomics Virtual Laboratory

Andrew talks about the Genomics Virtual Lab (GVL), which supplies compute along with a set of tools to work on top of it. An awesome resource for Australian bioinformatics, with CloudMan and Galaxy. Provides reproducible, redeployable platforms.

Tony Burdett – BioSolr: Building better search for bioinformatics

BioSolr provides an optimized approach to the complexity of life sciences data on top of Solr, Lucene and Elasticsearch. The mission is to build a community of users around improving search for biology. Provides faceting with ontologies and plugins for joining with external indexes to provide federated search.