Notes: Bioinformatics Open Source Conference 2014 day 2: Philip Bourne, Software Interoperability, Open Science

I’m at the 2014 Bioinformatics Open Source Conference (BOSC) in Boston. It’s a great two day conference devoted to open source, science and community. These are my notes from the day 2 morning session. Next Day Video recorded all the talks and they’re available on the Open Bio video site.

Biomedical Research as an Open Digital Enterprise

Philip Bourne

Philip starts off apologizing for not writing a line of code in the last 14 years. He talks about current funding issues in science: lack of growth in NIH funding but an exponential increase in biological data. Hard to quantify right now: how much do we spend on data and software? Further, how much should we be spending to achieve the maximum benefit? The biggest current issue is reproducibility; critical to improve public perception of biological research as well as fundamentally important to good science.

From a funder's perspective, it's a time to squeeze every penny to maximize the amount of research that can be done with the money available. Two approaches: top down and bottom up. Top down: regulations on data, data sharing policies, digital enablement and a move towards reproducibility. Emphasizes the importance of discussions between communities for working with large data. Bottom up approaches: collaboration, open source and standards.

Mentions the current issue that software developers are in great demand, and rewards outside of academia are greater than inside. Need new business models and software best practices. Current challenge: elements of the research life cycle are not connected. Presents a nice graph of areas where he feels we're doing well and where we're struggling.

Presents a public/private partnership called The Commons. The idea is to have agile pilots testing out ideas and new funding strategies. One example: porting dbGaP to the cloud. The Commons is a conceptual framework, analogous to the internet. Meant as a collaboration using research objects with unique identifiers and provenance. The Commons is meant to handle the long tail of data that does not fit anywhere, high throughput data from big facilities, and clinical data rules. Brilliant, smart and on-point ideas: excited about what NIH/NSF are doing.

What does the Commons enable? Dropbox-like storage, quality metrics and metadata, bringing compute next to data, and giving places to collaborate and discover. The most critical element is establishing a business model: current thinking is to provide a broker service. The idea is to have a series of pilots to evaluate, with the hope of making decisions by 2015.

Great discussion questions. Ann Loraine: 20% time on grants for more exploratory research. Titus Brown: how can we speed up review/grant process to work with agile processes? Thomas Down: can we improve incentives/fun for making data available? Currently not as fun as actually doing science. Really awesome to see so much discussion during Philip’s talk.

Great examples of current software projects to talk to: myExperiment, Galaxy. Would also suggest iPlant, Sage Synapse, Figshare. Lots of new Big Data to Knowledge (BD2K) grants coming.

Overall talk idea is to foster an ecosystem of biomedical research as a digital enterprise.

Software Interoperability

Pathview: an R/Bioconductor Package for Pathway-based Data Integration and Visualization

Weijun Luo

Pathview provides pathway visualization on top of KEGG-based pathways. It's a Bioconductor R package. High level API calls give very nice visualizations of multiple sample experiments layered on top of named pathways. Nice automated segmentation and labeling of attributes. Supports over 3000 species: everything in KEGG with sequenced genomes. Takes care of the ID mess by supporting everything. Nice workflow for running RNA-seq processing, then feeding into Pathview and GAGE for visualization.

Use of Semantically Annotated Resources in the Mobyle2 Web Framework

Hervé Ménager

Mobyle is a web-based bioinformatics workbench. Currently working on the Mobyle2 re-write, which includes groupware functionality, secure data sharing, a REST API and ontology-based annotations. The re-write deals with issues in the current classification and typing systems, and is specifically designed to provide improved sharing. Annotations describe relationships using the EDAM ontology. With these, they can now annotate formats of inputs/outputs to enable automated conversions and improved chaining of tools. Can also do cool stuff like automated splitting.

Towards Ubiquitous OWL Computing: Simplifying Programmatic Authoring of and Querying with OWL Axioms

Hilmar Lapp

Hilmar describes work done by Jim Balhoff at NESCent on using RDF and OWL to improve computational data mining. Done as part of the PhenoScape project with the goal of understanding causes of diversification. It's very difficult to author ontology axioms at scale: translating complex assertions into rules makes my head hurt. The Scowl tool provides a declarative approach to defining these. This also helps make ontologies easier to declare and version control. Similar work is Tawny-OWL from Phil Lord. The second tool, owlet, handles ontology-driven queries in SPARQL. The vision is to make programming with ontologies easy: small tools to fill gaps and holes.

Integrating Taverna Player into Scratchpads

Robert Haines

Describes work integrating Scratchpads and Taverna. Scratchpads are websites that hold data for you and your community. Scratchpads is a Drupal backend with lots of modules: 500 sites using this system, hosted on two application servers. Taverna provides the workflow system and uses Taverna Player, a Ruby on Rails plugin that talks to Taverna's REST interface. Wow, a lot of Taverna calling Taverna calling Taverna: very modular setup. The integration happened to join two biodiversity communities and make it easier to disseminate data via Scratchpads. Taverna then allows running workflows on this generalized, disseminated data. Interesting work to have multiple ways to do this: both tight and lightweight integration.

Small Tools for Bioinformatics

Pjotr Prins

Pjotr talks about approaches to improving how we integrate tools and manage workflows. Bioinformatics is often about not-invented-here and monolithic solutions. Does this happen because of technology and deployment? He wrote up a Bioinformatics Manifesto to build small tools, each of which should do the smallest possible task well. The idea is to make each part anti-fragile so the whole system can be more robust. 3 examples of tools that do this: Pfff is a replacement for md5 comparisons that samples files so it scales to large inputs. sambamba is a great tool with samtools-like functionality and parallelization. bio-vcf is a fast VCF parser and filtering tool.

Open Science and Reproducible Research

SEEK for Science: A Data Management Platform which Supports Open and Reproducible Science

Carole Goble

Carole talks about SEEK work enabling systems biology, linking models and experimental work. The idea is to preserve results, organize data, and exchange and share data. Difficulty in dealing with home-brewed solutions from each lab. Tricky to deal with both small and large data, with lots of different inputs to mix together. Cataloguing data is the critical component, using ISA metadata. They integrate an incredible number of standards and other tools in general; beautiful reuse. Use the Just Enough Results Model (JERM) to describe relationships between everything done in experiments. Research Objects provide a nice way to add tagging and provenance of research work. Funded to extend this as an open system for European systems biology data.

Arvados: Achieving Computational Reproducibility and Data Provenance in Large-Scale Genomic Analyses

Brett Smith

Arvados is an open source platform for managing and computing on biological data. Parts of Arvados: Keep provides immutable, content-addressable storage, providing git-like behavior for data. Objects are tracked and managed through the Arvados API server. Jobs are submitted as big ol' JSON documents. Uses Docker to manage images. Everything gets GUID identifiers for data, code and Docker images, so you get provenance for everything run.
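
Not Arvados' actual API, but a minimal Python sketch of the content-addressable idea behind Keep: a block's address is a hash of its contents, so identical data always maps to the same locator and any change produces a new one.

```python
import hashlib

class ContentAddressableStore:
    """Toy content-addressable store: blocks are keyed by the MD5 of their bytes,
    loosely mirroring how Keep-style locators are derived (illustrative only)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        locator = hashlib.md5(data).hexdigest() + "+" + str(len(data))
        self._blocks[locator] = data  # storing the same content twice is a no-op
        return locator

    def get(self, locator: str) -> bytes:
        return self._blocks[locator]

store = ContentAddressableStore()
loc = store.put(b"ACGTACGT")
assert store.get(loc) == b"ACGTACGT"
print(loc)  # the same content always yields the same locator
```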

Enhancing the Galaxy Experience through Community Involvement

Daniel Blankenberg

Dan describes work to involve the Galaxy community in development and analysis. He starts by describing all of the awesome stuff that Galaxy does, with a shout out to Galaxy on AWS using CloudMan. Interesting stats on running jobs on Galaxy main – leveling off due to complete usage of resources: need to move big jobs to cloud or local installations. Awesome plots of community code contributions, with 51 unique contributors in the past year. They've moved to a BioStar interface for answering questions.

Open as a Strategy for Durability, Reproducibility and Scalability

Jonathan Rees

Jonathan motivates with an example of chicken evolution; my son would love this. The problem is that the tree of life is hard to find, hence the Open Tree of Life. Nice resource and I love the continued chicken examples; ready for a viewing of Chicken Chicken Chicken. Has nice browser and API views of the tree. The tricky part is getting the trees, which references back to Ross' talk yesterday. Big emphasis on actually open data (CC0) and publications (CC-BY). All data going into GitHub as JSON.

Notes: Bioinformatics Open Source Conference 2014 day 1 morning: Titus Brown, Genome Scale Data, OBF

I’m at the 2014 Bioinformatics Open Source Conference (BOSC) in Boston. It’s a great two day conference devoted to open source, science and community. Nomi Harris starts things off with an introduction to the goals and history of BOSC. She describes the evolution of BOSC since 2000: movement from open source advocacy to open source plus a community of developers. The emphasis is now on how to work better together and enable open science.

A History of Bioinformatics (in the Year 2039)

Titus Brown

Titus introduces his talk: “It’s hard to make predictions, especially about the future” as he pretends it’s 25 years from now and gets to look back on Bioinformatics from that perspective: bioinformatics science fiction. So read these notes as if Titus is talking from the future.

In the 20s, there was a datapocalypse due to the increasing ability to sequence and sample everything: biology became a data-intensive science. Biology, however, had optimized itself for hypothesis-driven investigation, not data exploration.

Issue 2 was the reproducibility crisis: a large percentage of papers could not be replicated even with extensive effort. This was due to a lack of career/funding incentives for doing reproducible science. There was no independent replication.

Issue 3 was that biology was weak in terms of computing education. Many labs had datasets generated but without the expertise or plans for how to analyze them. As a result, there was an emphasis on easy-to-use tools; however, these embodied so many assumptions that many results were weak. Emphasis on bioinformatics work as sweatshops doing service bioinformatics without any future career path. As a result, well-trained students left for data science.

Came up with 3 rules for reviews: all data and source code must be in the paper, full methods included in primary paper review, and methods need publication in an associated paper. Answer: more pre-prints. Open peer review led to replication studies, and a community of practice developed around replication. Shift in thinking about biology: the biomedical enterprise rediscovers basic biology, rise of open science, investment in people.

The biomedical community moves away from translational medicine into veterinary and agricultural animals as model organisms. Biotech pressured congress to decrease funding since academic papers were often wrong without raw data, and the funding crunch joined hypothesis discovery with data interpretation. Resulted in university collapse, which led to a massive increase in creativity during research.

Sage Bionetworks: collected data from small consortia and made it available to everyone at publication. Led people to understand there was nothing to fear from making data available. NIH finally invested heavily in training: software, data and model carpentry.

Current problems (in 2039): still unknown functional annotations, career paths still uncertain, glam data replaces glam publications. Many complex diseases remain poorly understood.

BRAIN2050: a 10-year proposal to understand the brain, focusing on neurodegenerative diseases. Correlation is not causation: a problem with the current MIND project. Hard to extract data from recording all of the neurons. Computational modeling is critical: can we develop hypotheses that we can test against the data? A holistic approach is needed.

Focus less on reproducibility: strict requirement makes science slow. Can we compromise? Borrow idea of technical debt from software: replication debt. Do rapid iterations of analysis, then re-examine with semi-independent replication. Acknowledge debt to make it known to potential users of research.

Invest in infrastructure for collaboration: enable notification of analyses to allow collaboration between previously unconnected groups. Build commercial software when the basics are understood. Invest in training as a first class research citizen. Biology suggestion: we need to understand the full network system to understand complex biology.

Conclusion: there will be a tremendous amount of change that we cannot predict. We need to invest in people and process and must help figure out the right process and provide career incentives. However, economics matter a lot. Need to convince the public that support for science matters.

Plugs for other talks: Mike Schatz on next ten years in quantitative biology.

Genome-scale Data and Beyond

ADAM: Fast, Scalable Genomic Analysis

Frank Nothaft

Frank talks about work at UC Berkeley, Mt Sinai and the Broad on ADAM: a distributed framework for scaling genome analysis. Toolsets – avocado: distributed local assembler; RNAdam: RNA-seq analysis; bdg-services: ADAM clusters; Guacamole: distributed somatic variant caller. Provides a data format that can be cleanly parallelized. The ADAM stack uses Apache Spark, Avro and Parquet. The principle is to avoid locking data in and to play nicely with other tools. Parquet is an open source columnar file format that compresses well, limits IO and enables fast scans by only loading needed columns.
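
To illustrate the columnar point (a Python sketch with pyarrow, not ADAM's Scala API): a Parquet scan can load only the columns it needs, so a tool that only wants positions and mapping qualities never touches the sequence bytes on disk. The field names below are just stand-ins for what an ADAM-style read schema might carry.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical read records with a few fields a read schema might carry.
reads = pa.table({
    "contig": ["chr1", "chr1", "chr2"],
    "start": [100, 250, 75],
    "mapq": [60, 27, 60],
    "sequence": ["ACGT", "TTGA", "CCAT"],
})
pq.write_table(reads, "reads.parquet", compression="snappy")

# A downstream step that only needs positions and mapping quality reads just
# those columns; the sequence column is never loaded from disk.
subset = pq.read_table("reads.parquet", columns=["contig", "start", "mapq"])
print(subset.to_pydict())
```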

One approach is to reimplement base quality score recalibration (BQSR). They have a fully parallel version that splits by reads, only requiring a shared 3Gb read-only table of variants (from dbSNP) to mask known variants. 2x faster than Picard on single cores, and 50x faster on an Amazon cluster with smaller machines. Have >99% concordance with BQSR; actually better due to an error in the GATK implementation.

Automated RNA-seq differential expression validation

Rory Kirchner

Contrast: half a million hits for RNA-seq analysis pipelines, but in a poll on SeqAnswers the biggest problem in RNA-seq is a lack of reproducible pipelines. The complexity issue is the large number of tools and combinations of those tools. Implemented in bcbio-nextgen. Describes all the goals for bcbio, stealing everything I'm going to talk about tomorrow. Nice slide of the RNA-seq pipeline used. The validation framework enables evaluating actual changes in the pipeline: demonstrates an example with trimming versus non-trimming – no difference at all on a high-quality validation set. Another nice plot shows the difference between doing RNA-seq with 3 replicates at 100M reads/replicate versus 15 replicates at 20M reads. More replicates = better than deeper sequencing.

New Frontiers of Genome Assembly with SPAdes 3.1

Andrey Prjibelski

SPAdes was initially designed for single-cell assembly but also works well on standard multi-cell material. Ranked highly alongside the Salzberg lab's MaSuRCA. Handles tricky non-diploid genomes like plants. Works with IonTorrent for error correction: IonHammer corrects indels and mismatches, alongside BayesHammer for Illumina reads. SPAdes works on the Illumina BaseSpace platform, DNAnexus and Galaxy. Wow, integrated everywhere. With Illumina Nextera mate pairs, they have an improved distribution of correct read pairs. Velvet assembled these better than SPAdes according to N50-based metrics, but on quality metrics SPAdes shows better results. Shows the importance of establishing community benchmarks and values. Titus mentions memory usage is a problem on large genomes.

SigSeeker: An Ensemble for Analysis of Epigenetic Data

Jens Lichtenberg

SigSeeker handles analysis of epigenetic methylation and histone modifications: ChIP-seq. Motivation: lots of tools but no good evaluations of processes. The idea was to provide an ensemble-based approach to understand the tradeoffs of tools. Ways to correlate: by location in the genome, and by intensity. Removes outliers to help resolve ones called consistently between tools. Produced nice correlations between all of the different tools. Also layered on top of the biology of blood cell differentiation. Argues that adding tools adds power, alongside the addition of technical replicates.

Galaxy as an Extensible Job Execution Platform

John Chilton

John talks about Galaxy integration with clusters. The goal is to convince you that Galaxy runs jobs awesomely on clusters and clouds. The whole process is pluggable and John wants to convince you to use Galaxy as a platform. Galaxy Pulsar allows you to execute jobs in the same manner Galaxy does. Allows deployers to route jobs based on inputs: dynamic job destinations allow delaying of jobs, parameter collection from tools, and pulling resource usage from recent jobs. Dynamic job state handlers: enable resubmission of jobs to larger resources after hitting resource/time limits. New plugin infrastructure with Docker containers for installation.
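
Roughly what a dynamic job rule can look like, as a hedged sketch: Galaxy calls a Python function to pick a destination per job, but the attribute names, destination ids and size threshold below are assumptions for illustration, not Galaxy's documented interface.

```python
from collections import namedtuple

def route_by_input_size(job):
    """Send jobs with large total input size to a bigger cluster destination.
    Attribute names here are illustrative stand-ins for Galaxy's job model."""
    total = sum(d.size for d in job.input_datasets)
    if total > 10 * 1024 ** 3:       # more than 10 GB of input data
        return "slurm_highmem"       # destination id defined in job_conf.xml
    return "slurm_default"

# Quick stand-alone check with fake objects standing in for Galaxy's job model.
Dataset = namedtuple("Dataset", "size")
Job = namedtuple("Job", "input_datasets")
print(route_by_input_size(Job([Dataset(2 * 1024 ** 3)])))    # slurm_default
print(route_by_input_size(Job([Dataset(12 * 1024 ** 3)])))   # slurm_highmem
```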

Added job metrics, collecting information about job runtime, cores and compute using collectl. Need to look at what John is doing with initial work in bcbio. Pulsar (formerly LWR) getting lots of usage under the covers at Galaxy by running jobs on TACC. Prototype using Pulsar on top of Mesos.

Open Bioinformatics Foundation (OBF) update

Hilmar Lapp

Hilmar provides an update on what is happening with the Open Bioinformatics Foundation. He describes all the work Open Bio does as a non-profit, volunteer run organization: BOSC, Codefests, Hackathons and GSoC. Interestingly, there are 120 Open-Bio members but the Biopython/BioPerl mailing list communities are 1000+ people. OBF is now associated with Software in the Public Interest (SPI), so people can easily donate. Hilmar discusses the challenges associated with moving forward on progress using an all-volunteer organization.

Notes, Galaxy Community Conference 2014 day 1 afternoon: State of the Galaxy, IonTorrent, Lightning Talks

I’m at the 2014 Galaxy Community Conference in Baltimore. These are my notes from the day 1 afternoon session, following up on notes from the morning session.

State of the Galaxy

Anton Nekrutenko and James Taylor, Galaxy Project

Anton and James talk about the history and current status of Galaxy. They start off by recapping previous GCCs: went from 75 to 250 attendees. Nice numbers about growth in contributors over the past year. For Galaxy main, job usage became a crisis: more jobs than the hardware could handle. Galaxy main switched over to TACC last October. The biggest issue is that with more data, jobs are now longer.

Summarization of new features: new visualizations and full visualization framework, dataset collections, Galaxy BioStar community, Toolshed with automated installation, data managers.

New stuff that is coming: organization of the toolshed, want to make the toolshed process more straightforward. New workflow scheduling engine and streaming to improve scaling on large datasets. Visualization a planned focus for the next year. Goal is to think about a distributed Galaxy ecosystem that includes federation and data localization. This is hard, but would be so awesome. Also want to figure out use of Docker integration with Galaxy. A nice discussion by Anton about scalable training, to better coordinate how training works across institutions.

Update on Ion Torrent Sequencing – Accurate, Long Reads

Mike Lelivelt – Ion Torrent

Mike talks about the role of Ion Torrent sequencing in a world where Illumina dominates: challenging and driving cost and accuracy. Mike talks about work to improve indels given inherent chemistry limitations in homopolymers. New chemistry: Hi-Q, which provides improved resolution of SNPs and indels. Shows a nice IGV plot where errors are now random instead of systematic. The advantage is that depth and consensus can now resolve issues.

Mike talks through great work re-evaluating the Genome in a Bottle paper. The biggest issue was using bwa + GATK, which are not good with Ion Torrent inputs. The recommended workflow is TMAP + TVC. We need a place to find these best practices and tools. TMAP and Torrent Suite are available on GitHub, but it's not clear externally where to get and compile them. After registration on the IonCommunity site, you can find recommended approaches: RNA-seq recommendations. Need similar summaries of recommendations for variant calling. They are releasing new versions of the Torrent Variant Caller and TMAP soon as easy-to-use separate command line programs, so it's a great time to work on integrating them.

The Galaxy Tool Shed: A Framework for Building Galaxy Tools

Greg von Kuster, Penn State University

Greg talks about recent work on the Galaxy toolshed to enable building tool dependencies automatically. Shows a demo with the HVIS tool that makes Hilbert curves. He walks through the process of bootstrapping a toolshed installation on the local machine. This allows you to test and evaluate the tool locally in an entirely isolated environment. Once validated, this can get exported to the Galaxy toolshed, where it is run through a similar isolated framework for testing every 48 hours.

Integrating the NCBI BLAST+ suite into Galaxy

Peter Cock, The James Hutton Institute

Peter talks about his work integrating BLAST+ into Galaxy. He emphasizes the importance of making tools freely distributable so we can automate installation and distribution. They have done awesome work creating a community around developing these tools on GitHub, with functional tests automatically run via TravisCI. A good example of a tricky tool because they needed to add new Galaxy datatypes to support potential output formats. The BLAST tools had a lot of repeated XML and used macros to reduce this. The downside is the added complexity to the tool definitions.

On their local instance, they have Galaxy job splitting enabled which batches files into groups of 1000 query sequences. Peter also has a wrapper script which caches BLAST databases on individual nodes. Current work in progress is to create data managers for BLAST databases.
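
The batching idea is straightforward; here's a small sketch (not the actual Galaxy splitting code) that chunks a query FASTA into groups of 1000 sequences with Biopython, with each chunk then dispatchable as its own BLAST job. File names are placeholders.

```python
from Bio import SeqIO

def fasta_batches(path, batch_size=1000):
    """Yield lists of up to batch_size SeqRecords from a FASTA file."""
    batch = []
    for record in SeqIO.parse(path, "fasta"):
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Write each chunk to its own file, ready to be run as a separate BLAST job.
for i, chunk in enumerate(fasta_batches("queries.fasta")):
    SeqIO.write(chunk, "queries_%04d.fasta" % i, "fasta")
```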

deepTools: a flexible platform for exploring deep-sequencing data

Björn Grüning, University of Freiburg

deepTools aims to standardize the work of doing quality control and conversion to bigWig format. Some of the tools available: bamCorrelate compares similarity between multiple BAM files. bamCoverage does the conversion of BAM to bigWig. bamCompare shows the differences between two BAM files. heatmapper: beautiful visualization of BAM coverage across all genes. Really need to integrate this work into bcbio-nextgen. They also have an awesome Galaxy instance for exploring usage. Finally, they've got a Docker instance with everything pre-installed.

Lightning talks

Lots of 5 minute talks. Will try to keep up.

David van Enckevort talks about work building a scalable Galaxy cluster in the Netherlands. Using Galaxy in a clinical setting for NGS and proteomics work. Main bottleneck is I/O performance of file intermediates.

Marius talks about Mississippi, a tool suite for small RNA work. Main components: trimming of small RNAs, uploaded bowtie to handle short regions, and a cascade tool that provides a quick overview of small RNA targets. Have nice visualizations to look at the size distribution of reads and small RNA properties with good-looking faceted plots.

The next talk is on handling online streaming analytics for heart rate variability. Used BioBlend to retrieve and execute workflows, and provided a custom Django GUI for users to select workflows. Future work includes adding an intermediate distributed queue via Celery for streaming.
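
A minimal BioBlend sketch of that pattern (the server URL, API key, workflow name and input file are placeholders, and method names may differ slightly between BioBlend versions):

```python
from bioblend.galaxy import GalaxyInstance

# Placeholder server and key for the example.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Find a workflow by name and create a history to run it in.
workflow = next(w for w in gi.workflows.get_workflows()
                if w["name"] == "heart-rate-variability")
history = gi.histories.create_history(name="hrv-run")

# Upload an input dataset and invoke the workflow on it.
upload = gi.tools.upload_file("rr_intervals.txt", history["id"])
inputs = {"0": {"src": "hda", "id": upload["outputs"][0]["id"]}}
invocation = gi.workflows.invoke_workflow(workflow["id"], inputs=inputs,
                                          history_id=history["id"])
print(invocation["id"])
```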

Yvan talks about a use case for structuring the biologist community, where they did a project in France to bring together scientists working in multiple areas. Found that with more automation, you need additional human expertise to train and improve.

Ira talks about visualization of proteomics data in Galaxy. Protviz was an initial visualization tool built within Galaxy by communicating with an outside server. Unfortunately this leads to confusing error messages and issues during communication. The problem is that data sits in multiple places, leading to long lead times for visualization. Results are not self-contained and are difficult to install and maintain. The improved approach does processing up front and gathers results in a SQLite database, now integrated directly into Galaxy visualization based on a prototype from the Galaxy hackathon.

Nate talks about what the Galaxy team had to go through to move Galaxy main over to TACC, thanks to a collaboration with iPlant. Got a 10Gb/s connection to XSEDE via PSC. Tried using Globus Online and GridFTP, and ended up with rsync. To transfer 600Tb, they had to slow down because they were saturating the 10Gb line; it ended up taking 2 months. Used Pulsar and a hierarchical object store to help manage the infrastructure.

Notes, Galaxy Community Conference 2014 day 1 morning: Steven Salzberg, deployment, visualization, reproducibility

I'm at the 2014 Galaxy Community Conference in Baltimore. After two days at the Galaxy Hackathon and a presentation on CloudBioLinux at the Galaxy training day, it's now time for the conference. These are my notes from the talks and discussions from the morning session.

Transcriptome assembly: Computational challenges of NGS data

Steven Salzberg, Johns Hopkins

Steven’s lab focuses on exome sequencing, transcriptome sequencing and RNA-seq, microbiome studies and de novo assembly. He starts by talking about Bowtie2, TopHat and Cufflinks; they handle alignment, spliced alignment and transcript assembly, respectively.

The next generation of Tuxedo suite tools in development: bowtie2 -> HISAT -> StringTie -> Ballgown. Motivations: pseudogenes cause alignment issues, and there are 14,000 pseudogenes in the human genome. TopHat2 works around this by working in two stages: discovering splice sites and then realigning to avoid problems with reads incorrectly aligning to pseudogenes. In validation work, the biggest improvements are in regions with short-anchored reads. Based on the speed of STAR, motivated to make TopHat2 faster – HISAT: Hierarchical Indexing for Spliced Alignment of Transcripts. The major issue is that the BWT is not local, so you have to start over at each splice site instead of being able to take advantage of the FM index. Now they create local indexes corresponding to small 64k regions of the genome, overlapping by 1k. This creates 48,000 indexes, but is only 4.3Gb for the human genome. HISAT's algorithm: use the global index where a read falls fully within an index, use a local index for regions with a small anchor region, skipping to nearby local indexes when a read hits an intron. Big advantage for looking up short anchors since you don't have to dig through the whole index. This also works for large anchors, which is an easier problem.
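
A quick back-of-the-envelope check of those numbers (my own arithmetic, not from the slides): tiling a ~3Gb genome with 64kb windows that overlap by 1kb gives roughly the 48,000 local indexes quoted.

```python
# Rough arithmetic behind the local index layout described above.
genome_size = 3_000_000_000      # approximate human genome length in bases
window = 64_000                  # size of each local index region
overlap = 1_000                  # overlap between neighbouring regions

step = window - overlap          # each new window advances this far
num_local_indexes = genome_size // step + 1
print(num_local_indexes)         # ~47,600, in line with the ~48,000 quoted
```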

Measured performance of HISAT. First approach: simulated data including lots of different anchor sizes. Speed tests show reads/sec aligned does better than STAR. Sensitivity of HISAT is equivalent to, or slightly better than, TopHat with some additional tweaks to the algorithm: 3 versions of HISAT are currently under evaluation. These also handle tricky cases with 1-7 base anchors over exons.

Expectation for HISAT release is this summer – soon.

StringTie is a new method for transcriptome assembly. Provides two new ideas: assemble reads de novo to provide “super-reads”; create a flow network to assemble and quantitate simultaneously. First step uses MaSuRCA which creates long reads by extending original reads forward and backwards until not unique. These super reads get aligned to the genome, potentially capturing more exons. Following this, build an alternative splice graph and then use this to create the flow network.

Splice graph made up of nodes as exons and connections as potential combinations between the exons. Cufflinks idea is to create the most parsimonious representation of this splice graph. StringTie builds a splice graph, then converts into a flow network and finds the maximum flow through the network: flow = number of reads to assign to different isoforms.
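
To make the flow-network idea concrete, here's a toy sketch using networkx (not StringTie's algorithm or data): exons become nodes, junctions become edges with capacities standing in for read counts, and the maximum flow suggests how reads split across the isoform paths. All numbers are made up for illustration.

```python
import networkx as nx

# Toy splice graph: source -> exon1 -> {exon2 | exon3} -> exon4 -> sink.
G = nx.DiGraph()
G.add_edge("source", "exon1", capacity=100)
G.add_edge("exon1", "exon2", capacity=60)   # isoform A path
G.add_edge("exon1", "exon3", capacity=40)   # isoform B path
G.add_edge("exon2", "exon4", capacity=60)
G.add_edge("exon3", "exon4", capacity=40)
G.add_edge("exon4", "sink", capacity=100)

flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
print(flow_value)            # 100 reads routed through the network
print(flow_dict["exon1"])    # split of reads between the two isoform paths
```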

Accuracy, evaluated through simulated examples: both have nice sensitivity and precision improvements over Cufflinks – 20% increase in sensitivity. On transcripts with read support: 80% sensitivity and 75% precision. Evaluation on real data with “known” genes, also provides a good improvement over TopHat.

For speed, much faster than Cufflinks: 29 minutes for StringTie versus 81 minutes for Cufflinks, and 95 versus 941 minutes on another dataset. So 3-10x faster.

Awesome question from Angel Pizarro about independent benchmarks for measuring transcriptomes. Steven is happy to share his data but there's no standard benchmark available. We need a community doing this for RNA-seq.

The Galaxy framework as a unifying bioinformatics solution for multi-omic data analysis

Pratik Jagtap, University of Minnesota

Pratik works on integrating proteomic tools into Galaxy. Shows examples of proteomic workflows, which look so foreign when you deal with genomic data regularly. However, there are similar challenges to genomics to learn from: the quality of underlying database representations is essential for good results. They provide Galaxy workflows to help improve these representations. Have nice integrated genome viewers within Galaxy. Shows examples of biological results achieved with this platform; I like how they are working on the ground squirrel, described succinctly as a non-model organism. Proteomics results help improve the in-progress genome. The GalaxyP community project drives this software development and data submission process.

iReport: HTML Reporting in Galaxy

Saskia Hiltemann, Erasmus University Medical Center

iReport is a method for presenting workflows that have a lot of outputs, enabling the ability to view these directly in Galaxy. Outputs in the iFuse2 report include SVs, small variants and CNVs, with a visualization for each of these. Results are fully interactive via jQuery. iReport creates these types of HTML reports from workflow outputs without having to do all that work from scratch.

Saskia shows an iReport example output that demonstrates the capabilities. Awesome ability to combine multiple result types into useful output. I will definitely investigate this as a useful way to coordinate outputs for bcbio integrated into Galaxy.

Galaxy Deployment on Heterogenous Hardware

Carrie Ganote, National Center for Genome Analysis Support

Carrie is talking about approaches to putting Galaxy on multiple architectures. The National Center for Genome Analysis Support provides bioinformatics support and cluster access for free; fully grant funded. Awesome. Doing a collaboration with Brian Haas at the Broad to get Trinity working well with Galaxy. They have multiple Galaxy integrations connected with 3 different local computes with a shared filesystem, plus remote systems that do not have a shared filesystem. Carrie describes in-depth issues dealing with Galaxy: can't communicate with Torque due to PBS configuration changes, integration with DRMAA. To get things working with Cray, they needed to create a shell wrapper around the Galaxy wrapper submit script; ugh, too many wrappers. Also integrated with the Open Science Grid, dealing with unevenly distributed resources.

A journal’s experiences of reproducing published data analyses using Galaxy

Peter Li, GigaScience

GigaScience focuses on reproducibility of published results, but most papers do not provide enough information to reproduce them. Investigated whether results in a GigaScience paper could be reproduced in Galaxy. The pilot project was to reproduce SOAPdenovo2, specifically a table comparing it to SOAPdenovo1 and ALLPATHS. The paper has a tarball of shell scripts and data for re-running – good stuff. The idea was to convert these over to Galaxy: download the datasets used into a Galaxy history; wrap the tools used in the comparison.

The downloaded pipeline shell script did not have an automated way to go from run output to the N50 comparison stats used in the paper. The paper's methods did not have enough to replicate, and needed an additional step not described in the methods. Figured out by asking the authors, but not represented in the shell script or paper.

After adding the missing steps to the Galaxy workflow, they replicated the original results from the paper. Got similar results for SOAPdenovo1 and ALLPATHS, although the numbers were not identical.

Observations: reproduction work is difficult, required a lot of time and effort, help from authors. Sigh.

Enabling Dynamic Science with Flexible Infrastructure

Anushka Brownley and Aaron Gardner, BioTeam

BioTeam focuses on providing IT expertise for scientific work. They do great work and are well regarded in the scientific community. BioTeam SlipStream is a dedicated compute resource pre-installed and configured with Galaxy. The goal is to connect SlipStream with additional resources: Amazon and other local infrastructure. Aaron shows an example where jobs spill over from SlipStream to an existing SGE cluster when busy. Managed to do this both locally and on Amazon with StarCluster. In both cases they modified only SGE to achieve this integration.

The move forward is to use LWR to enable this, look at Docker containerization for the toolshed, and use Mesos for heterogeneous scheduling.

Notes: Strategies for Accelerating the Genome Sequencing Pipeline workshop at Mt Sinai

I’m in New York City at Mt Sinai’s workshop on Strategies for accelerating the genomic sequencing pipeline. It’s a great one day session focusing on approaches and tips for making genomics pipelines faster. The goal is to create a community of implementers interested in sharing approaches for avoiding bottlenecks while processing large genomic samples, and learning to better structure code and data to take advantage of large scale parallelization. Slides from the workshop are also available.

Parallelizing the Sequence Analysis Pipeline: What are the tradeoffs?

Toby Bloom – New York Genome Center

There are different requirements for individual patients and larger research studies. Patient: as fast as possible. Larger study: throughput. This is tricky when you've got a mix of the two. Bottlenecks are often network load, compute time and response time. Example: improving alignment by 100x, but what are the tradeoffs in terms of accuracy and throughput?

Good discussion of why IO is so painful in pipelines. Mark Duplicates: reads and writes a big file while only setting a flag. It works on the file as a blob; the alternative is streaming. An alternative way of thinking is the share-nothing Hadoop version of the world where data can be independently processed throughout. Not totally compatible with biological pipelines, where you need to combine data together and go back to the same data over and over again.

What are the differences between biological pipelines and the assumptions in most big data tools? Considerations: How frequently are data accessed? How often does computation require reshuffling data?

General ideas: streaming algorithms are good because they avoid IO. Keep data in parallel, column-oriented, database-style structures. Also need to evaluate all changes against results from the entire pipeline.
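
As a generic Python sketch of the streaming point (not any particular tool): chaining generators processes records one at a time, keeps memory flat, and avoids writing intermediate files between steps. The record format and duplicate key are made up for illustration.

```python
import sys

def parse_records(stream):
    """Yield records one at a time instead of loading the whole file."""
    for line in stream:
        yield line.rstrip("\n").split("\t")

def flag_duplicates(records):
    """Toy stand-in for a mark-duplicates style step: add a flag, don't rewrite the data."""
    seen = set()
    for rec in records:
        key = (rec[0], rec[1])          # e.g. (contig, position) as a crude duplicate key
        yield rec + ["DUP" if key in seen else "OK"]
        seen.add(key)

# Steps chain together; nothing is materialized until the final writer consumes it.
for rec in flag_duplicates(parse_records(sys.stdin)):
    print("\t".join(rec))
```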

Integrating new and existing tools within a sequencing analysis pipeline

Timothy Danford – Genome Bridge

Timothy talks about his experience running a genomics pipeline in the cloud. GenomeBridge is part of the Broad, but the long term goal is to take GATK best practices into a cloud platform. Want to contribute gains back to the Global Alliance for sharing of clinical and genomic data. Estimates for data storage: 3.5Pb and 9 million CPU-hours. The current pipeline is coarse-grained parallelism, sticking exome data on AWS x-large instances. Uses Docker to package up code and feeds into the Broad Firehose pipeline. Emphasizes that you need a development environment that appeals to bioinformaticians.

What did they learn from this approach? Multi-sample analysis is difficult. First tried replicating SGE/LSF on Amazon, but it was not super successful due to failing jobs, long running jobs and other failures. There's a fundamental conflict between distributed computation and the "bring your own code" approach where you have to integrate specific command line tools. Code needs to be able to work with distributed data to actually parallelize.

A better approach: incremental, distributed analysis. Needs code that exploits data locality and allows you to run joint analyses. The problem is that it requires a re-write of tools. Shout out to ADAM, which builds on top of Spark and Parquet/Avro for representation and looks like a good way forward.

Speeding up the DNA pipeline in practice

Paolo Narvaez – Intel

The approach is to look at the big picture for improving pipelines, not necessarily optimizing a single tool. Genomics efforts focus on understanding emerging workloads and current optimizations. Some case studies: the OHSU cancer genomics pipeline: bwa mem, Picard MarkDuplicates, GATK IndelRealignment, then MuTect. Improved pipelines from 177 to 44 hours with multithreading improvements in tools. Lots of great profiling of disk, bandwidth, memory. Found that disk is often not the bottleneck except at certain points. Memory also has spikes of utilization throughout. Summary: hardware is not used efficiently, and there are lots of different points where you're stressing only one of memory/CPU/IO/network while the others are lagging.

Practically, they worked with the Picard team at the Broad to improve compression libraries optimized for modern CPUs: IPP in Picard. Reduced compression time by 1/3. Also looked at Pair HMM acceleration in the GATK HaplotypeCaller using Intel Advanced Vector Extensions (AVX). Also worked on improving the Smith-Waterman algorithm with AVX.

G-Make, our Make-Based Infrastructure for Rapid Genome Characterization and the Genomes in a Bottle Consortium

Sheng Li – Cornell

Sheng works in Christopher Mason’s lab at Cornell. She talks about work to evaluate variants: Only 95% of variants agreed across 14 replicates; 3.4% of ClinVar sites gave multiple results in the replicates. Coverage helps.

Their approach to managing processes is G-Make, which builds on top of make, taking advantage of its existing features. They also have R-Make for RNA sequencing reads. Using replicates from this, they can identify numerous QC differences which contribute to different results between replicates. G-Make generates Makefiles for each sample based on GATK best practices for variant calling. It splits alignments by regions and runs them in parallel using native make support. To submit jobs to clusters, they use qmake through SGE.
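
G-Make's own code isn't shown in the talk, but the pattern is easy to sketch: a small, hypothetical Python snippet that writes per-region targets into a Makefile so `make -j` or qmake can run them in parallel. The `caller` command and file layout are placeholders, not G-Make's actual rules.

```python
# Hypothetical sketch of generating a per-region Makefile in the G-Make style.
regions = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]

with open("Makefile", "w") as mk:
    # The `all` target depends on one VCF per region.
    mk.write("all: " + " ".join("calls/%s.vcf" % r for r in regions) + "\n\n")
    for region in regions:
        mk.write("calls/%s.vcf: sample.bam\n" % region)
        mk.write("\tcaller --region %s --bam sample.bam --out $@\n" % region)

# Run with `make -j 8` locally, or submit through SGE's qmake for cluster-wide parallelism.
```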

Work on clinical standards through the Genome in a Bottle Consortium to ensure validation rates. Want to integrate tools into a meta-make tool that handles combining the information.

Accelerated GATK best-practices variant calling pipeline

Mauricio Carneiro – Broad

Mauricio provides an overview of GATK best practices, soon to expand to an RNA-seq analysis best practice as well. GATK is working on a new best practice pipeline which focuses entirely on streaming algorithms. Unveiling everything at AGBT. The new framework will also focus on exposing likelihoods for downstream analyses. That's the overview, but the talk today focuses on joint variant calling optimization. Also working on HaplotypeCaller joint-calling with incremental single sample discovery to help scale to multiple samples.

Mauricio emphasizes that HaplotypeCaller replaces UnifiedGenotyper and improves on its calls in every way; there's no reason to use UnifiedGenotyper, except that HaplotypeCaller is slow. Shows nice examples of how HaplotypeCaller can resolve tricky regions with heterozygous insertions/deletions. UnifiedGenotyper cannot do well on indels.

HaplotypeCaller falls into 4 steps: find regions, perform local de-novo assembly, do a pair-HMM evaluation of reads against all haplotypes, then genotype using the exact model developed in UnifiedGenotyper. ~70% of the runtime is in the pair-HMM step.

Approaches to improving performance: distribute with the Queue system, provide an alternative way to calculate likelihoods, or provide heterogeneous parallel compute. Distribution is already do-able (split by genomic regions) but not ideal since it requires infrastructure. To improve the HMM, they can constrain work by removing unrealistic alignments prior to feeding the pair-HMM. Need the `--graph-based-likelihoods` flag to make this work in GATK 2.8; provides a 4x improvement.

To parallelize better, they have been focusing on 3 areas: AVX, GPU, FPGA. Started with a C++ implementation; it is 10x better than GATK. Mauricio says C++ is better than Java; oops, GATK. These may eventually be available to GATK through JNI calls. The AVX improvements look good and AVX is already present on most machines. Provides a 35x improvement over the current Java GATK on 1 core. If you use AVX with 24 cores, you can get a 720x improvement. Provides almost perfect scaling from 1 to 24 cores: promising good scalability for the future. AVX is automatically included in the next shipments of the engine (post 2.8) and does not need additional flags.

GATK engine is not ready to leverage the increased parallelism. It uses synchronous traversal so is waiting for mappers/reducers in the GATK framework. Need to make the engine asynchronous.

Parallelizing and Optimizing Genomic Codes

Clay Breshears – Intel

Working on Intel-specific optimizations for bwa, HMMER, BLAST, Velvet, ABySS and bowtie. The goal is to make individual applications faster and roll changes back into the open source tools. A nice way of giving back and improving existing tools while also being helpful to Intel by fitting better with their processors. For bwa sampe, improved from 59 to 12 hours. Hope to also bring this to bwa mem. Changes were: using OpenMP instead of pthreads, providing overlapped I/O and vectorization of critical loops. The overlapped I/O and computation improvement swaps buffers so that two threads alternate between I/O and computation. Super nice approach.
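
Here's a toy Python sketch of that overlapped I/O idea (purely illustrative, nothing to do with bwa's actual C code): a reader thread fills buffers from disk while the compute thread drains them, with a small bounded queue playing the role of the swapped buffers. The input path and the "work" are placeholders.

```python
import queue
import threading

def reader(path, q, chunk_size=1 << 20):
    """Producer: read chunks from disk while the consumer is busy computing."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            q.put(chunk)
    q.put(None)  # sentinel: no more data

def process(q):
    """Consumer: compute on each chunk as soon as it is available."""
    total = 0
    while True:
        chunk = q.get()
        if chunk is None:
            break
        total += sum(chunk)  # stand-in for real per-chunk work
    return total

buf = queue.Queue(maxsize=2)          # two in-flight buffers, as in double buffering
t = threading.Thread(target=reader, args=("reads.fastq", buf))
t.start()
result = process(buf)
t.join()
print(result)
```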

HMMER optimizations: improved processing by 1.56x. BLAST got a 4.5x improvement for blastn, to be released in the BLAST 2.2.29+ release. For Velvet, provided a 10x memory reduction. The velour optimization that provides the improvement to the velveth step is open source and available on GitHub. Awesome. ABySS improvements: identified an initial improvement by looking at assembler code in a baseToCode approach: 1.3x speed up with only structural changes. Can also split the data into 10 parts to get a 4.2x speed up. Nice example of how thinking about a bottleneck identified a new idea for quick speed ups.

Worked on speeding up an RNA-seq pipeline at TGen: 1.8x speed-up on the pipeline. For the bowtie2 steps, can get a 13x speedup with 32 threads and 18.4x using multiple cores.

Genome in a Bottle Consortium 2013: standards for validating high-throughput variant detection

These are my notes from the 2013 Genome in a Bottle Consortium meeting, focused around developing reference materials and standards for measuring variation detection for human genome sequencing. The consortium is a fully open community that seeks to expand its scope to include additional materials and standards within the general goal of being able to accurately characterize how well a sequencing analysis performs. Marc Salit provided a nice introductory overview of these goals and presented an open invitation to define the scope and plans for the consortium.

Talks

Justin Zook described his work on developing the current Genome in a Bottle reference materials: an integrated call set for the well-sequenced NA12878 genome. The goal is to define genotypes with high confidence, including distinguishing homozygous reference and uncertain calls to assess both false positives and false negatives. It's a call set integrating 12 different inputs; after stringent filtering it has 2.3 million variants covering 82.3% of the genome. The reference calls are available on the GCAT variant comparison platform for comparison. One current difficulty with comparing high quality datasets is different variant representations: it's important to normalize these.
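
As a minimal illustration of that normalization issue (my own sketch, not Justin's pipeline): trimming shared leading and trailing bases reduces equivalent representations of the same indel to a single canonical form, so two call sets can be compared directly.

```python
def trim_variant(pos, ref, alt):
    """Trim shared suffix then shared prefix so equivalent representations match.
    A simplified version of standard variant normalization (no left-alignment here)."""
    # Remove common trailing bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove common leading bases, advancing the position as we go.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Two different ways of writing the same single-base deletion collapse to one form.
print(trim_variant(100, "CTCC", "CCC"))   # -> (100, 'CT', 'C')
print(trim_variant(100, "CTC", "CC"))     # -> (100, 'CT', 'C')
```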

Michael Eberle from Illumina talked about their Platinum Genomes initiative. The goal is to develop a catalog of known mutations and a toolkit to compare algorithms against it. Using a pedigree analysis on 17 individuals in the NA12878 family. The family structure allows generation of haplotypes, allowing identification of errors. They use calls from GATK, Cortex_var, Isaac and Complete Genomics. They identify CNVs using BreakDancer and Grouper. After initial predictions, they overlap between methods and corroborate based on read counts. Of the 772 total confident events identified by this method: 408 called by both, 266 by BreakDancer, 98 by Grouper. Another idea is to look at variants in these regions since you expect loss of heterozygosity in deleted regions. Found more deletions than amplifications in CNVs. The biggest current gap is in events from 30bp to 2kb.

Francisco De La Vega from Real Time Genomics discussed approaches to using trios to help develop reference materials. By assessing Mendelian inheritance from the full family using Bayesian priors for parents and children, they improve calling. By specifically investigating relationships, this does better than standard family-based calling, a la GATK populations. Got 150k more calls than GATK using this approach. Has 94.1% sensitivity in GiaB regions compared to 92.5% for GATK. Reduces Mendelian inconsistency dramatically, which makes good sense since the approach optimizes for that.

Deanna Church discussed the GeT-RM project from NCBI and CDC. GeT-RM incorporates data from 12 labs. Integration is a pain and they have to deal with each input format specifically. They have a nice browser that allows viewing the calls in-line with the BAM files. This helps identify discordant variants and potential causes from specific submissions. Identified regions verified by multiple submissions but with no alignment evidence. The website provides high quality variants in VCF format and a BED file of Sanger-verified regions. Plans going forward involve development of additional web tools, including incorporation of the new genome reference: GRCh38 is coming soon. A lot of realignment is coming as well.

Working groups

Reference Materials Selection

In progress cancer calling assessment work: NCI cancer spike-in, TGen-Illumina tumor-normal pairs, HorizonDX cell lines. Jason Lih from NCI has a set of plasmid constructs with engineered mutations relevant to cancer. Sanger-confirmed sequences. Stephanie Pond from Illumina is collaborating with TGen on COLO829B/COLO829: has 37 confirmed mutations. Also have 9 synthetic fusion constructs spiked into COLO829. Jonathan Frampton from Horizon Diagnostics has engineered cancer mutations in 3 cancer cell lines. Lots of nice setups for replicating FFPE and dilution series. Have ~36 additional mutations. Kara Norman from AcroMetrix discussed engineered cell line controls with 12-26 engineered variations from COSMIC.

Reference Material Characterization

General goal: get as much data as possible. Would like to be able to apply Moleculo to NA12878 to get phasing information. Also add in 20-30x PacBio data, optical mapping from BioNano Genomics, plus 454 and Ion data. For the PGP trio data, the plan is to characterize with Illumina, Complete Genomics, PacBio and Moleculo. Overall this should help with phasing completeness and help resolve conflicts.

Bioinformatics working group

Reviewed the data release policy, which follows the Fort Lauderdale approach: can get data but can’t publish on it until the consortium does. Existing resources for NA12878: NIST, X Prize, Illumina Platinum Genomes, 1000 genomes, Real Time Genomics, Broad chr20 and exome. Personalis has CNV calls for NA12878.

For data integration, trying to use pedigree methods and multi-platform methods together. Critical issue is handling growth in datasets. Emphasize importance of recognizing heterogeneity across the genome, which the performance metrics group handles. Quality of variants is difficult to assess since not calibrated across multiple approaches: need a way to map the quality onto a standard scale during data integration.

Data from NA12878 and parents NA12891/NA12892 available from FTP/Aspera site at NCBI and a mirrored giab Amazon S3 bucket.

Goal is to develop automated workflows for future reference materials so we can handle multiple new genomes including PGP data.

Performance metrics

  • Metrics to measure
    • Laboratory side: library prep, platform, lots of other metrics
    • How do you motivate people to make this worthwhile? Hard problem to capture and upload.
    • Other issues: lots of variables with lots of incompleteness, tricky statistics.
    • Decisions: flattened key/value JSON structure for metadata. Circulate list of things to provide.
  • Provide better characterization of trusted/non-trusted genome regions with BED files of:
    • Trusted regions: robust across multiple technologies and protocols
    • Less trusted regions: regions that are difficult to call but might be do-able with specific technologies or approaches, though not as reliable.
    • Non called regions: Regions we cannot characterize.
  • Development of infrastructure to assess and report accuracy. General goal is to merge existing software tools.
    • Get-RM
    • X Prize/Harvard School of Public Health: bcbio.variation handles comparison of variant calls to reference genomes and reporting of concordant/discordants
    • Harvard School of Public Health: bcbio-nextgen pipeline for variant calling and comparison
    • Chris Mason’s software (name?)

Bioinformatics Open Source Conference 2013, day 2 afternoon: cloud computing, translational genomics and funding

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 afternoon sessions focusing on cloud computing, infrastructure, translational genomics and funding of community open source tools.

Cloud and Genome scale computing

Towards Enabling Big Data and Federated Computing in the Cloud

Enis Afgan

Hadoop is a commonly used approach to distribute work using map-reduce. HTCondor scavenges cycles from idle machines. CloudMan organizes Amazon architecture and provides a management interface. The goal of the project is to combine these three to explore usage in biology. Provides a Hadoop-over-SGE integration component that spins up a Hadoop cluster with HDFS and master/worker nodes. Hadoop setup takes 2 minutes: a nice solution for using Hadoop on top of an existing cluster scheduler. HTCondor provides scaling out between local and cloud infrastructure, or between two cloud infrastructures. Currently a manual interchange process to connect the two clusters. Once connected, you can submit jobs across multiple clusters, transferring data via HTCondor.

MyGene.info: Making Elastic and Extensible Gene­-centric Web Services

Chunlei Wu

mygene.info provides a set of web services to query and retrieve gene annotation information. The goal is to avoid the need to update and maintain local annotation data: annotation as a service. Data is updated weekly from external resources and placed into the document database MongoDB. Exposed as a simple REST interface allowing query and retrieval via gene IDs. They have a Python API called mygene and a javascript autocomplete widget.
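
A quick sketch of the kind of query this enables, using the mygene Python client mentioned in the talk (the gene IDs and field names are just examples; check the mygene docs for current parameters):

```python
import mygene

mg = mygene.MyGeneInfo()

# Look up annotation for CDK2 by Entrez gene ID.
gene = mg.getgene(1017, fields="symbol,name,refseq.rna")
print(gene["symbol"], gene["name"])

# Or map a batch of symbols to Ensembl IDs in one call.
results = mg.querymany(["CDK2", "BRCA1"], scopes="symbol",
                       fields="ensembl.gene", species="human")
for hit in results:
    print(hit["query"], hit.get("ensembl", {}))
```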

An update on the Seal Hadoop-based sequence processing toolbox

Luca Pireddu

Seal provides distribution of bioinformatics algorithms on Hadoop. Motivated by the success Google has had scaling large data problems with this approach. Key idea: move to a different computing paradigm. Developed for the sequencing core at CRS4. Some tools implemented: Seqal – short read mapping with bwa; Demux – demultiplexing samples from multiplexed runs; RecabTable – recalibration of base qualities equivalent to GATK's CountCovariatesWalker; ReadSort – distributed sorting of read alignments. Provided Galaxy wrappers for the Seal tools, which is a bit tricky since Hadoop doesn't follow the Galaxy model.

Open Source Configuration of Bioinformatics Infrastructure

John Chilton

John is tackling the problem of configuring complex applications. His work builds on top of CloudBioLinux, which makes heavy use of Fabric. Fabric doesn't handle configuration management well: it's not a goal of the project. Two examples of configuration management systems: Puppet and Chef. John extended CloudBioLinux to allow use of Puppet modules and Chef cookbooks. A lightweight job runner tool called LWR sits on top of Galaxy using this. Also working on integration with the Globus toolkit. John advocates creating a community called bioconfig around these ideas.

An Open Source Framework for Gene Prioritization

Hoan Nguyen, Vincent Walter

The SM2PH (Structural Mutation to Pathology Phenotypes in Human) project helps prioritize the most promising features associated with a biological question: processes, pathologies or networks. It involves a three step process: building a model for training features, locally prioritizing these and then globally prioritizing. Developed a tool called GEPETTO that handles prioritization. Built in a modular manner for plugins and extension from the community. Integrates with the Galaxy framework via the tool shed. Prioritization modules: protein sequence alignment, evolutionary barcodes, genomic context, transcription data, protein-protein interactions, hereditary disease gene probability. Uses jBPM as a workflow engine. Applied GEPETTO prioritization to work on age-related macular degeneration (those scary eye pictures you see at the optometrist).

RAMPART: an automated de novo assembly pipeline

Daniel Mapleson

RAMPART provides an automated de-novo assembly pipeline as part of a core service. The motivation is that the TGAC core handles a heterogeneous input of data, so they need to support multiple approaches and parameters. One difficulty is that it's hard to assess which assembly is best. Some ideas: known genome length, most contiguous (N50), alignments of reads to the assembly and of the assembly to a reference. Nicely wrapped all of this up into a single tool that works across multiple assemblers and clusters. Broken into stages of error correction, assembly with multiple approaches, a decision on which assembly to use, then an assembly improver. Builds on top of the EBI's Conan workflow management application. Provides an external tool API to interface with third party software.

Flexible multi-omics data capture and integration tools for high-throughput biology

Joeri van der Velde

Molgenis provides a software generator and web interface for command-line tools based on a domain specific language. Provides customized front ends for a diverse set of tools. Nice software setup with continuous integration and deployment to 50 servers. The motivation is to understand genotype to phenotype with heterogeneous data inputs. The challenge is how to prepare the custom web interfaces when data is multi-dimensional in terms of comparisons. They treat this as a big matrix of comparisons between subjects and traits. Shows nice plots displaying QTLs for C. elegans projects warehoused in Molgenis. The same approach works well across multiple organisms: Arabidopsis.

Translational genomics

Understanding Cancer Genomes Using Galaxy

Jeremy Goecks

Jeremy's research model: find computing challenges, invent software to handle them, and demonstrate usefulness via genomics. The focus of this talk is pancreatic cancer transcriptome analysis. Jeremy builds tools on top of Galaxy. Added new tools for variant calling, fusion detection and VCF manipulation. Jeremy shows a Galaxy workflow for transcriptome analysis. Advantages of Galaxy workflows: recomputable, human readable, importable, sharable and publishable in Galaxy Pages. Uses the Cancer Cell Line Encyclopedia for comparisons. Now a more complex workflow with variants, gene expression and annotations to do targeted eQTL analysis. Custom visualizations provide the ability to extract partial sets of data, then publish the results of those views. Provides an API to plug in custom visualization tools. Shows a nice demo of recalling variants on only a single gene with adjusted parameters. Has another tool which does parameter sweeps and quickly shows how the output looks with different subsets of parameters.

Strategies for funding and maintaining open source software

BOSC ended with a panel discussion featuring Peter Cock, Sean Eddy, Carole Goble, Scott Markel and Jean Peccoud. We discussed approaches for funding long term open source scientific software. I chaired the panel so didn’t get to take the usual notes but will summarize the main points:

  • Working openly and sharing your work helps with your impact on science.
  • It is critical to be able to effectively demonstrate your impact to reviewers, granting agencies, and users of your tools. Sean Eddy shared the Deletion Metric for research impact: Were a researcher to be deleted, would there be any phenotype?
  • To demonstrate impact, be able to quantify usage in some way. Some of the best things are personal stories and recommendations about how your software helps enable science in ways other tools can’t.
  • Papers play an important role in educating, promoting and demonstrating usage of your software, but they are not the only metric.
  • We need to take personal responsibility as developers for capturing impact and usage. Downloads and views are not great metrics since they are hard to interpret. Better to engage with and understand usage, and to ask users to cite and recommend your tools. Lightweight, easy citation systems would go a long way towards enabling this.

Bioinformatics Open Source Conference 2013, day 2 morning: Sean Eddy and Software Interoperability

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 morning session focused on software interoperability.

Previous notes:

Biological sequence analysis in the post-data era

Sean Eddy

Sean starts off with a self-described embarrassing personal history about how he developed his scientific direction. Biology background: multiple self-splicing introns in a bacteriophage, unexpected in a highly streamlined genome. The introns are self-splicing transposable elements that are difficult for the organism to remove from the genome. There is no sequence conservation of these, only structural conservation, but there were no tools to detect this. Sean was an experimental biologist and used this as a motivating problem to search for an algorithmic/programming solution. He wasn't able to accomplish this straight off until he learned about HMM approaches. Reimplemented HMMs and re-invented stochastic context free grammars to model the structural work as a tree structure. The embarrassing part was that his post-doc lab work on GFP was not going well and got scooped, so he wrote a postdoc grant update to switch to computational biology. This switch led to HMMER, Infernal and the Biological Sequence Analysis book.

From this: the general advice to avoid incremental engineering is wrong. A lot of great work came from incremental engineering: automobiles, sequence analysis (Smith-Waterman -> BLAST -> PSI-BLAST -> HMMER). Engineering is a valuable part of science, but it requires insane dedication to a single problem. The truth: science rewards how much impact you have, not how many papers you write. Arbitrage approach to science: take ideas and tools and make them usable for the biologists who need them. Not traditionally valued, but useful, so you can carve out a niche.

Sean describes the general approach to Pfam that helps tame the exponential growth of sequences. The strategy is to use representative seed alignments, sweep the full database, use scalable models in HMMER and Infernal, then automate. This scales as you get more data.

Scientific publication is a 350 year old tradition of open science. The first journal with peer review appeared in 1665: scientific priority and fame in return for publication and disclosure. This quid pro quo still exists today. The intent of the system has been open data since the beginning, but the tricky part now is that the piece you want to be open no longer fits into the paper. Specifically in computational science, the paper is an advertisement, not a delivery mechanism.

Two magic tricks. We need sophisticated infrastructure, but most of the time we're exploring. For one-off data analysis, the premium is on expert biology and tools that are as simple as possible. Trick 1: use control experiments over statistical tests. Things you need: trusted methods, data availability and the command line. Trick 2: take small random sub-samples of large datasets. Sean reviews an example using this approach to catch an algorithmic error in a spliced aligner.
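
As a concrete illustration of trick 2, a small random sub-sample of a huge file can be taken in one pass with reservoir sampling. This is a minimal sketch in Python (my example, not Sean's code), with the file name as a placeholder.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown size."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Sub-sample 1,000 lines from a dataset too large to hold in memory.
with open("huge_dataset.txt") as handle:  # placeholder file name
    subset = reservoir_sample(handle, k=1000)
```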

Bioinformatics: data analysis needs to be part of the science. Biologists need to be fluent in computational analysis and strong computational tools will always be in demand. Great end to a brilliant talk.

Software Interoperability

BioBlend – Enabling Pipeline Dreams

Enis Afgan

BioBlend is a Python wrapper around the Galaxy and CloudMan APIs. The goal is to enable creation of automated and scalable pipelines. For some analyses the Galaxy GUI workflow isn't enough because we need metadata to drive the analysis. Luckily Galaxy has a documented REST API that supports most of the functionality. To support scaling out Galaxy, CloudMan automates the entire process of spinning up an instance, creating an SGE cluster and managing data and tools. Galaxy is an execution engine and CloudMan is the infrastructure manager. BioBlend has extensive documentation and lots of community contributions.
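
A minimal sketch of the BioBlend style of interaction, assuming a reachable Galaxy server; the URL and API key below are placeholders, not anything from the talk.

```python
from bioblend.galaxy import GalaxyInstance

# Connect to a Galaxy server (placeholder URL and key).
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# List histories and workflows visible to this account.
for history in gi.histories.get_histories():
    print(history["name"], history["id"])

for workflow in gi.workflows.get_workflows():
    print(workflow["name"], workflow["id"])
```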

Taverna Components: Semantically annotated and shareable units of functionality

Donal Fellows

Taverna components are well described parts that plug into a workflow. A component needs curation, documentation and to work (and fail) in predictable ways. The component hides the complexity of calling the wrapped tool or service. This is a full part of the Taverna 2.5 release: both workbench and server. Components are semantically annotated to describe inputs/outputs according to domain ontologies. Components are not just nested workflows since they obey a set of rules, so you can treat them as black boxes and drill in only if needed. Components enable additional abstraction, allowing workflows to be more modular: people can work on components and high level workflows separately, with updates for new versions. A long term goal is to treat the entire workflow as an RDF model to improve searching.

UGENE Workflow Designer – flexible control and extension of pipelines with scripts

Yuriy Vaskin

UGENE focuses on integration of biological tools using a graphical interface. It has a workflow designer like Galaxy and Taverna and runs on local machines. Also offers a python API for scripting through UGENE. Nice example code feeding Biopython inputs into the API natively.

Reproducible Quantitative Transcriptome Analysis with Oqtans

Vipin Sreedharan

Starts off the talk with a poll from the RNA-seq blog: the most immediate needs for the community are standard bioinformatics pipelines and skilled bioinformatics specialists. oqtans is online quantitative transcriptome analysis, with code available on GitHub. Drives an automated pipeline with a vast assortment of RNA-seq data analysis tools. Some useful tools used: PALMapper for mapping, rDiff for differential expression analysis, rQuant for alternative transcripts. oqtans is available from a public Galaxy instance and as Amazon AMIs.

MetaSee: An interactive visualization toolbox for metagenomic sample analysis and comparison

Xiaoquan Su

MetaSee provides an online tool for visualizing metagenomic data. It's a general visualization tool and integrates multiple input types. Nice tools specifically for metagenomics to display taxa in a population. They have a nice MetaSee mouth example which maps the metagenomics of the mouth. Also, pictures of teeth are scary without gums. Meta-Mesh is a metagenomic database and analysis system.

PhyloCommons: community storage, annotation and reuse of phylogenies

Hilmar Lapp

PhyloCommons provides an annotated repository of phylogenetic trees. Trees are key to biological analyses and increasing in number, but are difficult to reuse and build on. Most are not archived, and even when they are, they're stored as images or in other forms that are hard to use automatically. PhyloCommons uses Biopython to convert trees into RDF and allows queries through the Virtuoso RDF database. Code is available on GitHub.
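
To give a feel for the tree-to-RDF step, here is a minimal sketch (not PhyloCommons' actual converter) that parses a Newick tree with Biopython and emits triples with rdflib; the namespace and predicate names are made up for illustration.

```python
from io import StringIO
from Bio import Phylo
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/phylo/")  # hypothetical namespace

# Parse a small Newick tree with Biopython.
tree = Phylo.read(StringIO("((A:1,B:2):1,C:3);"), "newick")

# Convert each clade into a handful of RDF triples.
g = Graph()
for i, clade in enumerate(tree.find_clades()):
    node = EX["node%d" % i]
    g.add((node, RDF.type, EX.Clade))
    if clade.name:
        g.add((node, EX.label, Literal(clade.name)))
    if clade.branch_length is not None:
        g.add((node, EX.branchLength, Literal(clade.branch_length)))

print(g.serialize(format="turtle"))
```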

GEMBASSY: an EMBOSS associated package for genome analysis using G-language SOAP/REST web services

Hidetoshi Itaya

GEMBASSY provides an EMBOSS package that integrates with the G-Language using a web service. This gives you commandline access through EMBOSS for a wide variety of visualization and analysis tools. Nice integration examples show it working directly in a command line workflow.

Rubra – flexible distributed pipelines for bioinformatics

Clare Sloggett

Rubra provides flexible distributed pipelines for bioinformatics, built on top of Ruffus. It has been used to create a variant calling pipeline based on bwa, GATK and ENSEMBL.
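
For context, Ruffus pipelines chain Python functions through file-based dependencies; this is a minimal sketch of that style (illustrative stages and file names, not Rubra's actual pipeline).

```python
from ruffus import transform, suffix, pipeline_run

@transform("sample.fastq", suffix(".fastq"), ".bam")
def align_reads(input_file, output_file):
    # Placeholder for an alignment step (e.g. bwa).
    with open(output_file, "w") as out:
        out.write("aligned %s\n" % input_file)

@transform(align_reads, suffix(".bam"), ".vcf")
def call_variants(input_file, output_file):
    # Placeholder for a variant calling step (e.g. GATK).
    with open(output_file, "w") as out:
        out.write("variants from %s\n" % input_file)

if __name__ == "__main__":
    pipeline_run([call_variants])
```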

Bioinformatics Open Source Conference 2013, day 1 afternoon: visualization and project updates

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches to openly developed community software to support scientific research. These are my notes from the day 1 afternoon session focused on Open Science.

Previous notes:

Visualization

Refinery Platform – Integrating Visualization and Analysis of Large-Scale Biological Data

Nils Gehlenborg

The Refinery Platform provides an approach to manage and visualize data pipelines. TCGA: 10,000 patients, with mRNA, miRNA, methylation, expression, CNVs, variants and clinical parameters. Lots of heterogeneous data, made more extensive after processing. Need an approach to manage long running pipelines with numerous outputs. Want to integrate horizontally across all data types to gain biological insight, and vertically across data levels to provide confirmation and troubleshooting. ISA-Tab provides the data model for metadata and provenance evaluation. The web interface provides faceted views of all data based on metadata, and visualizations to explore attribute relationships. The underlying workflow engine is Galaxy. The approach is to set up workflows in Galaxy, then make them available in Refinery at a higher level. Uses the Galaxy API by developing custom workflows based on a template for hundreds of samples.

Two approaches to visualization in Refinery. The first is file-based visualization: connect to IGV and display raw BAM data along with associated metadata. Galaxy also supports this well, so the hope is to build off of this. The second approach is database-driven visualization that uses an associated Django server to read/write from a simple API server. Can also use REST callbacks built on TastyPie, so it's quick and easy to develop custom visualizations.
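
As a rough sketch of that Django/TastyPie pattern, a resource like the following exposes a model over REST; the Annotation model and app name are hypothetical, and this assumes an existing Django project with django-tastypie installed.

```python
# api.py (sketch): expose a hypothetical Annotation model over REST.
from tastypie.resources import ModelResource
from visualization.models import Annotation  # hypothetical Django app/model

class AnnotationResource(ModelResource):
    class Meta:
        queryset = Annotation.objects.all()
        resource_name = "annotation"
        allowed_methods = ["get", "post"]  # read/write, as described above

# urls.py (sketch): wire the resource into the Django URL conf.
# from django.conf.urls import include, url
# annotation_resource = AnnotationResource()
# urlpatterns = [url(r"^api/", include(annotation_resource.urls))]
```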

DGE-Vis: Visualisation of RNA-seq data for Differential Gene Expression analysis

David Powell

DGE-Vis provides a visualization framework to identify differentially expressed genes from RNA-seq analysis. Provides approaches to handle two-comparison differentially expressed gene lists. To generalize to three comparisons, it creates a Venn diagram and allows selection of each of the subcomponents to inspect individually. Given the limitations of this, they then developed a new approach. David shows a live demo of comparisons between 4 conditions, which identifies changes over the conditions. A heatmap groups conditions based on differential expression similarities. The heatmap is nicely linked to expression differences for each gene, and subselection shows you a list of genes. All three items are linked so they change in real time as the others adjust. Provides integrated pathway maps with colors linked to each experiment, allowing biologists to identify changed genes via pathways. Written with Haskell on the backend, R for analysis, and CoffeeScript and JavaScript using D3 for visualization.

Genomic Visualization Everywhere with Dalliance

Thomas Down

Thomas starts by motivating visualization: humans love to look at things, and practically, scientists write papers around a story told by the figures. Unfortunately we focus on print/old-school visualizations: what more could we present if they weren't so static? The Dalliance genome browser provides complete interactivity with easy loading of custom files and multiple tracks. Designed to fit into other applications easily, so you can embed it into your website. Also meant to be usable in more lightweight contexts: blog posts, slides, journal publications. It's a fully client-side implementation but does need the CORS allowed header on remote websites that feed data in.

Robust quality control of Next Generation Sequencing alignment data

Konstantin Okonechnikov

The goal is to avoid common traps in next-generation sequencing data: catch poor runs and platform/protocol-specific errors. Provides a more user-friendly tool in comparison to FastQC, samtools, Picard and RNA-SeQC. Konstantin's tool is QualiMap. Provides interactive plots inspired by FastQC's displays, and also does count quality control, transcript coverage and 5'/3' bias tools for RNA-seq analyses.

Visualizing bacterial sequencing data with GenomeView

Thomas Abeel

GenomeView provides a genome browser for interactive, real-time exploration of NGS data. Allows editing and curation of data, with configurability and extensibility through plug-ins. Designed for bacterial genomes, so it focuses on consensus plus gaps and missing regions. Handles automated mapping between multiple organisms, showing annotations across them. Handles 60,000 contigs for partially sequenced genomes, allowing selection by query to trim down to a reasonable number.

Genomics applications in the Cloud with DNANexus

Andrey Kislyuk

DNANexus has an open and comprehensive API to talk to the DNANexus platform. Provides genome browser, circos and other visualization tools. Have a nice set of GitHub repositories including client code for interacting with the API and documentation. StackOverflow clone called DNANexus Answers for question/answer and community interaction.

Open source project updates

BioRuby project updates – power of modularity in the community-based open source development model

Toshiaki Katayama

Toshiaki provides updates on the latest developments in the BioRuby community. Important changes in openness during the project: the move to GitHub and the BioGems system lower the barrier to joining the BioRuby community. Users can publish standalone packages that integrate with BioRuby. Some highlights: bio-gadget, bio-svgenes, bio-synreport, bio-diversity.

Two other associated projects. biointerchange provides RDF converters for GFF3, GVF, Newick and TSV; it was developed during the 2012 and 2013 BioHackathons. The second is basespace-ruby. See the Codefest 2013 report for more details on the project.

Biopython project update

Peter Cock

Peter provides the latest updates from the Biopython community. Biopython has been involved with GSoC for the last several years through both NESCent and the OpenBio Foundation. This has been a great source of new contributors as well as code development, and an important way to develop and train new programmers interested in open source and biology. Biopython uses continuous integration with BuildBots and Travis. Tests run on multiple environments: Python versions, Linux, Windows, MacOSX. The next release of Biopython supports Python 3.3 through the 2to3 converter; long term, the plan is to write code compatible with both. Nice tip from the discussion: the six tool for Python 2/3 compatibility checks, and a blog post on writing for 2 and 3. Peter describes thoughts on how to make Biopython more modular to encourage experimental contributions that could then make their way into officially supported releases later on: trying to balance the need for well-tested and documented code with encouraging new users.
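
As a tiny illustration of the kind of 2/3-compatible idiom that six enables (my example, not from the talk):

```python
from __future__ import print_function
import six

# Dictionary iteration that works the same on Python 2 and Python 3.
counts = {"contigs": 120, "reads": 1000000}
for key, value in six.iteritems(counts):
    print(key, value)

# A single text type to target instead of branching on str/unicode.
label = six.text_type("HMMER")
print(isinstance(label, six.text_type))  # True on both 2 and 3
```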

InterMine – Collaborative Data Mining

Alex Kalderimis

InterMine is a data-integration system and query engine that supplies data analysis tools, graphical web-app components and a REST API. It provides a modular set of parts that you can use to build tools in addition to the pre-packaged solution. The InterMOD consortium organizes all the InterMine installations so they can better interact and share data. Recent work: a rewrite of the InterMine JavaScript tools. External tools can also be used more cleanly: shows a nice interaction of JBrowse with InterMine. Working on rebuilding their web interface on top of the more modular approach.

The GenoCAD 2.2 Grammar Editor

Jean Peccoud

Jean argues for the importance of domain specific languages to make it easier to handle specific tasks: change the language to fit your problem. The idea behind GenoCAD is to empower end-users to develop their own DSL. A formal grammar is a set of rules describing how to form valid sentences in the language. Start by defining categories mapping to biological parts, followed by the rewriting rules. All of this happens in a graphical drag-and-drop interface. For parts, they can use BioBricks as inputs.
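
To make the categories-plus-rewriting-rules idea concrete, here is a toy formal grammar for a genetic construct written as plain Python (purely illustrative; GenoCAD itself is a graphical tool and these part names are just examples).

```python
import random

# Rewriting rules: each category expands into one of several sequences of
# categories or terminal parts (terminals are anything without a rule).
RULES = {
    "cassette": [["promoter", "rbs", "gene", "terminator"]],
    "promoter": [["pLac"], ["pTet"]],
    "rbs": [["B0034"]],
    "gene": [["gfp"], ["rfp"]],
    "terminator": [["B0015"]],
}

def expand(symbol, rng):
    """Recursively rewrite a category until only terminal parts remain."""
    if symbol not in RULES:
        return [symbol]
    production = rng.choice(RULES[symbol])
    parts = []
    for child in production:
        parts.extend(expand(child, rng))
    return parts

print(expand("cassette", random.Random(0)))  # e.g. ['pTet', 'B0034', 'gfp', 'B0015']
```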

Improvements and new features in the 7th major release of the Bio-Linux distro

Tim Booth

Bio-Linux is in its 10th year and recently released version 7. Bio-Linux is a set of Debian packages and a full bioinformatics Linux distribution you can get and live boot from a USB stick. Strong interactions with DebianMed and CloudBioLinux. Working on integration of Galaxy into Debian packages. Large emphasis on teaching and courses with Bio-Linux for learning commandline work.

Bioinformatics Open Source Conference 2013, day 1 morning: Cameron Neylon and Open Science

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches to openly developing research software to support scientific research. These are my notes from the morning 1 session focused on Open Science.

Open Science

Network ready research: The role of open source and open thinking

Cameron Neylon

Cameron keynotes the first day of the conference, discussing the value of open science. He begins with a historical perspective on a connected world: the internet, telegraphs and stagecoaches all the way to social networks, Twitter and GitHub. A nice overview of the human desire to connect. As the probability of connectivity rises, individual clusters of connected groups can reach a sudden critical point of large-scale connectivity. A nice scientific example is Tim Gowers' PolyMath work to solve difficult mathematical problems, coordinated through his blog and facilitated by internet connectivity. It is instructive to look at examples of successful large scale open science projects, especially in terms of organization and leadership.

Successful science projects exploit the order-disorder transition that occurs when the right people get together. By being open, you increase the probability that your research work will reach this critical threshold for discovery. Some critical requirements: document so people can use it, test so we can be sure it works, package so it’s easy to use.

What does it mean to be open? First idea: your work has value that can help people in ways you never initially imagined. The probability of helping someone is the interest, divided by friction (usability), times the number of people you can reach. Second idea: someone can help you in ways you never expected. The probability of getting help is the same: interest, usability/friction and the number of people. The goal of being open: minimize friction by making it easier to contribute and connect.

Challenge: how do we best make our work available with limited time? A good example is how useful VMs are: are they critical for recomputation, or do they create black boxes that are hard to reuse? Both are useful but work for different audiences: users versus developers. Since we want to enable unexpected improvements, it's not clear which should be your priority with limited time and money. The goal is to make both part of your general work so they don't require extra effort.

How can we build systems that allow sharing as a natural by-product of scientific work? A brutal reminder that you're not going to get a Nobel prize for building infrastructure. Can we improve the incentive system? One attempt to hack the system: the Open Research Computation journal, which had high standards for inclusion: 100% test coverage, easy to run and reproduce. It was difficult to get papers because the burden was too high.

Goal: build community architecture and foundations that become part of our day to day life. This makes openness part of the default. Where are the opportunities to build new connectivity in ways that make real change? An unsolved open question for discussion.

Open Science Data Framework: A Cloud enabled system to store, access, and analyze scientific data

Anup Mahurkar

The Open Science Data Framework comes from the NIH Human Microbiome Project. They needed to manage large collections of data sets and associated metadata, and developed a general, language-agnostic collaborative framework. It's a specialized document database with a RESTful API on top, and provides versioning and history. Under the covers, it stores JSON blobs in CouchDB, using ElasticSearch to provide rapid full-text search; the ElasticSearch indexes are kept in sync on updates to CouchDB. Provides a web based interface to build queries and a custom editor to update records. Future plans include replicated servers and Cloud/AWS images.
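
As a hedged sketch of the kind of REST interaction a document store like this exposes, here is a generic example with the Python requests library; the endpoint URL and document fields are hypothetical, not OSDF's actual API.

```python
import json
import requests

# A hypothetical metadata document stored as a JSON blob.
doc = {
    "type": "sample",
    "meta": {"body_site": "gut", "subject_id": "HMP-001"},
}

# POST it to a (hypothetical) node collection endpoint.
resp = requests.post(
    "https://osdf.example.org/nodes",
    data=json.dumps(doc),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # server-assigned id, version, etc.
```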

myExperiment Research Objects: Beyond Workflows and Packs

Stian Soiland-Reyes

Stian describes work on developing, maintaining and sharing scientific work. Uses Taverna, myExperiment and Workflow4Ever to provide a fully shared environment built around Research Objects. These objects bundle everything involved in a scientific experiment: data, methods, provenance and people. This creates a sharable, evolvable and contributable object that can be cited via ROI. The Research Object is a data model that contains everything needed to rerun and reproduce it. Major focus on provenance: where did data come from, how did it change, who did the work, when did it happen. Uses the W3C PROV standard for representation, and built a W3C community to discuss and improve research objects. There are PROV tools available for Python and Java.

Empowering Cancer Research Through Open Development

Juli Klemm

The National Cancer Informatics Program provides support for community developed software, looking to support sustainable, rapidly evolving, open work. The Open Development initiative is designed exactly to support and nurture open science work. It uses simple BSD licenses and hosts code on GitHub. They are moving hundreds of tools over to this model and need custom migrations for every project; old SVN repositories required a ton of cleanup. The next step is to establish communities around this code, which is diverse and attracts different groups of researchers. They hold hackathon events for specific projects.

DNAdigest – a not-for-profit organisation to promote and enable open-access sharing of genomics data

Fiona Nielsen

DNAdigest is an organization to share data associated with next-generation sequencing, with a special focus on trying to help with human health and rare diseases. Researchers have access to the samples they are working on, but these remain siloed in individual research groups. Comparison to other groups is crucial, but there are no methods/approaches for accessing and sharing all of this generated data. To handle security/privacy concerns, the goal is to share summarized data instead of individual genomes. DNAdigest's goal is to aggregate data and provide APIs to access the summarized, open information.

Jug: Reproducible Research in Python

Luis Pedro Coelho

Jug provides a framework to build parallelized processing pipelines in Python. Provides a decorator on each function that handles distribution, parallelization and memoization. Nice documentation is available.
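
A minimal sketch of the Jug pattern described here: decorate functions so their results become memoized tasks, then run the script in parallel with `jug execute` (the functions below are toy examples).

```python
from jug import TaskGenerator

@TaskGenerator
def double(x):
    # A toy computation standing in for a real processing step.
    return 2 * x

@TaskGenerator
def total(values):
    return sum(values)

# Building the task graph; results are memoized on disk and the tasks
# can be executed in parallel by running `jug execute` on this file.
results = [double(i) for i in range(10)]
final = total(results)
```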

OpenLabFramework: A Next-Generation Open-Source Laboratory Information Management System for Efficient Sample Tracking

Markus List

OpenLabFramework provides a Laboratory Information Management System to move away from spreadsheets. It handles vector clones and cell-line recombinant systems, for which there is not a lot of existing support. Written with Grails and built to be extended with new parts. Has nice documentation and deployment.

Ten Simple Rules for the Open Development of Scientific Software

Andreas Prlic, Jim Proctor, Hilmar Lapp

This is a discussion period around ideas presented in the published paper on Ten Simple Rules for the Open Development of Scientific Software. Andreas, Jim and Hilmar pick their favorite rules to start off the discussion. Be simple: minimize time sinks by automating good practice with testing and continuous integration frameworks. Hilmar talks about re-using and extending other code. The difficult thing is that the recognition system does not reward this well since it assumes a single leader/team for every project. He promotes ImpactStory, which provides alternative metrics around open source contributions. The Open Source Report Card also provides a nice interface around GitHub for summarizing contributions. Good discussion around how to measure usage of your project: you need to be able to measure the impact of your software.