Notes: Bioinformatics Open Source Conference 2014 day 2: Philip Bourne, Software Interoperability, Open Science

I’m at the 2014 Bioinformatics Open Source Conference (BOSC) in Boston. It’s a great two-day conference devoted to open source, science and community. These are my notes from the day 2 morning session. Next Day Video recorded all the talks and they’re available on the Open Bio video site.

Biomedical Research as an Open Digital Enterprise

Philip Bourne

Philip starts off apologizing for not having written a line of code in the last 14 years. He talks about current funding issues in science: lack of growth in NIH funding but exponential increase in biological data. Hard to quantify right now: how much do we spend on data and software? Further, how much should we be spending to achieve the maximum benefit? The biggest current issue is reproducibility; improving it is critical for the public perception of biological research as well as fundamentally important to good science.

From a funder’s perspective, it’s a time to squeeze every penny to maximize the research done with the money available. Two approaches: top down and bottom up. Top down: regulations on data, data sharing policies, digital enablement and a move towards reproducibility. Emphasizes the importance of discussions between communities working with large data. Bottom up: collaboration, open source and standards.

Mentions the current issue that software developers are in great demand, and rewards outside of academia are greater than inside. Need new business models and software best practices. Current challenge: elements of the research life cycle are not connected. Presents a nice graph of areas where he feels we’re doing well and where we’re struggling.

Presents a public/private partnership called The Commons. The idea is to have agile pilots testing out ideas and new funding strategies. One example: porting dbGaP to the cloud. The Commons is a conceptual framework, analogous to the internet. Meant as a collaboration using research objects with unique identifiers and provenance. The Commons is meant to handle the long tail of data that does not fit anywhere, high throughput data from big facilities, and the rules around clinical data. Brilliant, smart and on-point ideas: excited about what NIH/NSF is doing.

What does the Commons enable? Dropbox-like storage, quality metrics and metadata, bringing compute next to data, and giving places to collaborate and discover. The most critical element is establishing a business model: current thinking is to provide a broker service. The idea is to have a series of pilots to evaluate, with hopes of making decisions by 2015.

Great discussion questions. Ann Loraine: 20% time on grants for more exploratory research. Titus Brown: how can we speed up the review/grant process to work with agile approaches? Thomas Down: can we improve incentives/fun for making data available? Currently not as fun as actually doing science. Really awesome to see so much discussion during Philip’s talk.

Great examples of current software projects to talk to: MyExperiment, Galaxy. I’d also suggest iPlant, SAGE Synapse, Figshare. Lots of new Big Data to Knowledge (BD2K) grants coming.

Overall talk idea is to foster an ecosystem of biomedical research as a digital enterprise.

Software Interoperability

Pathview: an R/Bioconductor Package for Pathway-based Data Integration and Visualization

Weijun Luo

Pathview provides pathway visualization using KEGG-based pathways. It’s a Bioconductor R package. High level API calls give very nice visualizations of multi-sample experiments layered on top of named pathways. Nice automated segmentation and labeling of attributes. Supports over 3000 species: everything in KEGG with sequenced genomes. Takes care of the ID mess by supporting everything. Nice workflow for running RNA-seq processing, then feeding into Pathview and GAGE for visualization.

Use of Semantically Annotated Resources in the Mobyle2 Web Framework

Hervé Ménager

Mobyle is a web-based bioinformatics workbench. Currently working on the Mobyle2 re-write, which includes groupware functionality, secure data sharing, a REST API and ontology-based annotations. The re-write deals with issues in the current classification and typing systems, and is specifically designed to provide improved sharing. Annotations describe relationships using the EDAM ontology. With these, they can now annotate the formats of inputs/outputs to enable automated conversions and improved chaining of tools. Can also do cool stuff like automated splitting.
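
To make the annotation idea concrete, here’s a toy Python sketch of how format annotations can drive tool chaining. The EDAM term IDs are real (FASTA, BAM, VCF), but the tool table and the function are hypothetical illustrations, not Mobyle2’s API.

```python
# Hypothetical sketch: EDAM format annotations driving automated chaining.
EDAM_FASTA = "format_1929"  # EDAM: FASTA sequence format
EDAM_BAM = "format_2572"    # EDAM: BAM alignment format
EDAM_VCF = "format_3016"    # EDAM: VCF variant format

# Invented tool registry: each tool declares accepted and produced formats.
TOOLS = {
    "aligner": {"inputs": [EDAM_FASTA], "outputs": [EDAM_BAM]},
    "variant_caller": {"inputs": [EDAM_BAM], "outputs": [EDAM_VCF]},
}

def can_chain(upstream, downstream):
    """Two tools chain if some output format of the first is an
    accepted input format of the second."""
    out_formats = set(TOOLS[upstream]["outputs"])
    in_formats = set(TOOLS[downstream]["inputs"])
    return bool(out_formats & in_formats)

print(can_chain("aligner", "variant_caller"))  # True
```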

Towards Ubiquitous OWL Computing: Simplifying Programmatic Authoring of and Querying with OWL Axioms

Hilmar Lapp

Hilmar describes work done by Jim Balhoff at NESCent on using RDF and OWL to improve computational data mining. Done as part of the PhenoScape project with the goal of understanding causes of diversification. It’s very difficult to author ontology axioms at scale: translating complex assertions into rules makes my head hurt. The Scowl tool provides a declarative approach to defining these. This also makes ontologies easier to declare and version control. Similar work is Tawny OWL from Phil Lord. A second tool, owlet, handles ontology-driven queries in SPARQL. The vision is to make programming with ontologies easy: small tools to fill gaps and holes.

Integrating Taverna Player into Scratchpads

Robert Haines

Describes work integrating Scratchpads and Taverna. Scratchpads are websites that hold data for you and your community. Scratchpads is a Drupal backend with lots of modules: 500 sites use this system, hosted on two application servers. Taverna provides the workflow system via Taverna Player, a Ruby on Rails plugin that talks to Taverna’s REST interface. Wow, a lot of Taverna calling Taverna calling Taverna: very modular setup. The integration happened to join two biodiversity communities and make it easier to disseminate data via Scratchpads. Taverna then allows running workflows on this generalized, disseminated data. Interesting work having multiple ways to do this: both tight and lightweight integration.

Small Tools for Bioinformatics

Pjotr Prins

Pjotr talks about approaches to better integrate tools and manage workflows. Bioinformatics often suffers from not-invented-here syndrome and monolithic solutions. Does this happen because of technology and deployment? Wrote up a Bioinformatics Manifesto to build small tools, each of which should do the smallest possible task well. The idea is to make each part anti-fragile so the whole system can be more robust. 3 examples of tools that do this: pfff is a replacement for md5 comparisons that samples files, so it scales to large inputs. sambamba is a great tool with samtools-like functionality and parallelization. bio-vcf is a fast VCF parser and filtering tool.
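
As a concrete illustration of the small-tools philosophy, here’s a hedged sketch of driving two of these tools from Python; the flags follow each tool’s documentation as I remember it and may differ between versions, so treat them as illustrative and check the docs.

```python
import subprocess

# Each small tool does one job; composition happens at the pipeline level.
# Filter a BAM on mapping quality with sambamba (flags illustrative):
subprocess.check_call([
    "sambamba", "view",
    "--filter", "mapping_quality >= 30",
    "--format", "bam",
    "--output-filename", "filtered.bam",
    "input.bam",
])

# Filter a VCF on quality with bio-vcf, which streams stdin -> stdout:
with open("calls.vcf") as vcf_in, open("high_qual.vcf", "w") as vcf_out:
    subprocess.check_call(["bio-vcf", "--filter", "r.qual >= 20"],
                          stdin=vcf_in, stdout=vcf_out)
```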

Open Science and Reproducible Research

SEEK for Science: A Data Management Platform which Supports Open and Reproducible Science

Carole Goble

Carole talks about SEEK work enabling systems biology, linking models and experimental work. The idea is to preserve results, organize data, and exchange and share data. Difficulty in dealing with home-brewed solutions from each lab. Tricky to deal with both small and large data, and lots of different inputs to mix together. Cataloguing data is the critical component, using ISA metadata. They integrate a crazy number of standards and other tools; beautiful reuse. They use the Just Enough Results Model (JERM) to describe relationships between everything done in experiments. Research Objects provide a nice way to tag and track provenance of research work. Funded to extend this as an open system for European systems biology data.

Arvados: Achieving Computational Reproducibility and Data Provenance in Large-Scale Genomic Analyses

Brett Smith

Arvados is an open source platform for managing and computing on biological data. Parts of Arvados: Keep provides immutable, content-addressable storage, giving git-like behavior for data. Objects are tracked and managed through the Arvados API server. Jobs are submitted as big ol’ JSON documents. Uses Docker to manage images. Everything gets GUID identifiers: data, code and Docker images, so you get provenance for everything run.
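
The content addressing is easy to illustrate: a block’s locator is derived from an MD5 of its contents, so identical data always resolves to the same identifier and stored blocks are immutable by construction. A toy Python sketch of the idea (not the Arvados client API):

```python
import hashlib

class ToyKeep:
    """Toy content-addressable store: the locator is derived from the
    data itself, so a stored block can never change under its name."""
    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # Keep-style locator: md5 of the content plus its size
        locator = "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))
        self._blocks[locator] = data
        return locator

    def get(self, locator: str) -> bytes:
        return self._blocks[locator]

store = ToyKeep()
loc = store.put(b"ACGT")
assert store.get(loc) == b"ACGT"
# Writing the same content again yields the same locator: dedup for free.
assert store.put(b"ACGT") == loc
```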

Enhancing the Galaxy Experience through Community Involvement

Daniel Blankenberg

Dan describes work to involve the Galaxy community in development and analysis. He starts by describing all of the awesome stuff that Galaxy does, with a shout out to Galaxy on AWS using CloudMan. Interesting stats on running jobs on Galaxy main – leveling off due to complete usage of resources: need to diversify big jobs onto Cloud or local installations. Awesome plots of community code contributions: 51 unique contributors in the past year. They’ve moved to the BioStar interface for answering questions.

Open as a Strategy for Durability, Reproducibility and Scalability

Jonathan Rees

Jonathan motivates with an example of chicken evolution; my son would love this. The problem is that the tree of life is hard to find, hence the Open Tree of Life. Nice resource and love the continued chicken examples; ready for a viewing of Chicken Chicken Chicken. Has nice browser and API views of the tree. The tricky part is getting the trees, which references back to Ross’ talk yesterday. Big emphasis on actually open data (CC0) and publications (CC-BY). All data goes into GitHub as JSON.


Notes: Bioinformatics Open Source Conference 2014 day 1 morning: Titus Brown, Genome Scale Data, OBF

I’m at the 2014 Bioinformatics Open Source Conference (BOSC) in Boston. It’s a great two-day conference devoted to open source, science and community. Nomi Harris starts things off with an introduction to the goals and history of BOSC. She describes the evolution of BOSC since 2000: a movement from open source advocacy to open source plus a community of developers. The emphasis is now on how to work better together and enable open science.

A History of Bioinformatics (in the Year 2039)

Titus Brown

Titus introduces his talk: “It’s hard to make predictions, especially about the future” as he pretends it’s 25 years from now and gets to look back on Bioinformatics from that perspective: bioinformatics science fiction. So read these notes as if Titus is talking from the future.

In the 20s, there was a datapocalypse due to the increasing ability to sequence and sample everything: biology became a data-intensive science. Biology, however, had optimized itself for hypothesis-driven investigation, not data exploration.

Issue 2 was the reproducibility crisis: a large percentage of papers could not be replicated even with extensive effort. This was due to a lack of career/funding incentives for doing reproducible science. There was no independent replication.

Issue 3 was that biology was weak in computing education. Many labs had datasets generated but no expertise or plans for how to analyze them. As a result, there was an emphasis on easy-to-use tools: however, they embodied so many assumptions that many results were weak. Bioinformatics work was treated as sweatshop-style service work without any future career path. As a result, well-trained students left for data science.

Came up with 3 rules for review: all data and source code must be included with the paper, full methods must be included in the primary paper review, and methods need publication in an associated paper. Answer: more pre-prints. Open peer review led to replication studies, and a community of practice developed around replication. Shift in thinking about biology: the biomedical enterprise rediscovers basic biology, rise of open science, investment in people.

The biomedical community moved away from translational medicine into veterinary and agricultural animals as model organisms. Biotech pressured congress to decrease funding since academic papers were often wrong and lacked raw data, and the funding crunch joined hypothesis discovery with data interpretation. This resulted in a university collapse which led to a massive increase in creativity during research.

Sage Bionetworks collected data from small consortia and made it available to everyone at publication. This led people to understand there was nothing to fear from making data available. NIH finally invested heavily in training: software, data and model carpentry.

Current problems (in 2039): still unknown functional annotations, career paths still uncertain, glam data replaces glam publications. Many complex diseases remain poorly understood.

BRAIN2050: a 10-year proposal to understand the brain, focusing on neurodegenerative diseases. Correlation is not causation: a problem with the current MIND project. Hard to extract data from recording all of the neurons. Computational modeling is critical: can we develop hypotheses to test against the data? A holistic approach is needed.

Focus less on reproducibility: a strict requirement makes science slow. Can we compromise? Borrow the idea of technical debt from software: replication debt. Do rapid iterations of analysis, then re-examine with semi-independent replication. Acknowledge the debt to make it known to potential users of the research.

Invest in infrastructure for collaboration: enable notification of analyses to allow collaboration between previously unconnected groups. Build commercial software once the basics are understood. Invest in training as a first-class research citizen. For biology: we need to understand full network systems to understand complex biology.

Conclusion: there will be a tremendous amount of change that we cannot predict. We need to invest in people and process and must help figure out the right process and provide career incentives. However, economics matter a lot. Need to convince the public that support for science matters.

Plugs for other talks: Mike Schatz on next ten years in quantitative biology.

Genome-scale Data and Beyond

ADAM: Fast, Scalable Genomic Analysis

Frank Nothaft

Frank talks about work at UC Berkeley, Mt Sinai and the Broad on ADAM: a distributed framework for scaling genome analysis. Toolsets – avocado: distributed local assembler; RNAdam: RNA-seq analysis; bdg-services: ADAM clusters; Guacamole: distributed somatic variant caller. ADAM provides a data format that can be cleanly parallelized. The ADAM stack uses Apache Spark, Avro and Parquet. The principle is to avoid locking data in and to play nicely with other tools. Parquet is an open source columnar file format that compresses well, limits IO and enables fast scans by only loading needed columns.
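
To make the columnar point concrete, here’s a minimal PySpark sketch of the kind of scan Parquet makes cheap; the path and column names are illustrative rather than ADAM’s exact schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adam-sketch").getOrCreate()

# Parquet is columnar: this query only reads the three projected columns
# from disk, not whole records. Path and column names are illustrative.
reads = spark.read.parquet("hdfs:///data/sample.reads.parquet")
high_quality = (reads
                .select("contigName", "start", "mapq")  # projection = less IO
                .filter(reads.mapq >= 30))
print(high_quality.count())
```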

One approach is to reimplement base quality score recalibration (BQSR). They have a fully parallel version that splits by reads, only requiring a shared 3Gb read-only table of variants (from dbSNP) to mask known variant sites. It’s 2x faster than Picard on single cores, and 50x faster on an Amazon cluster of smaller machines. Results are >99% concordant with GATK’s BQSR; actually better due to an error in the GATK implementation.
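
The split-by-reads parallelism works because recalibration only needs shared read-only state: the known-sites table. A hedged sketch of that pattern with Spark broadcast variables (toy data throughout, not ADAM’s implementation):

```python
from pyspark import SparkContext

sc = SparkContext(appName="bqsr-pattern")

# The known-sites table is read-only and shared: broadcast it once per
# worker instead of shipping it with every task.
known_sites = sc.broadcast({("chr1", 10177), ("chr1", 10352)})

# Reads as (contig, position, base, quality) tuples -- stand-ins for
# real alignment records.
reads = sc.parallelize([
    ("chr1", 10177, "A", 30),
    ("chr1", 20000, "C", 25),
])

def covariates(read):
    contig, pos, base, qual = read
    # Mask known variant sites so real variation is not counted as error.
    if (contig, pos) in known_sites.value:
        return []
    return [(qual, base)]  # stand-in for the real covariate calculation

print(reads.flatMap(covariates).countByValue())
```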

Automated RNA-seq differential expression validation

Rory Kirchner

Contrast: 1/2 million hits for RNA-seq analysis pipelines, but a poll on SeqAnswers says the biggest problem in RNA-seq is a lack of reproducible pipelines. The complexity issue is the large number of tools and combinations of those tools. Implemented in bcbio-nextgen. Describes all the goals for bcbio, stealing everything I’m going to talk about tomorrow. Nice slide of the tools used in the RNA-seq pipeline. The validation framework enables evaluating actual changes in the pipeline: demonstrates an example with trimming versus non-trimming – no difference at all on a high-quality validation set. Another nice plot shows the difference between doing RNA-seq with 3 replicates at 100M reads/replicate and 15 replicates at 20M reads. More replicates = better than deeper sequencing.

New Frontiers of Genome Assembly with SPAdes 3.1

Andrey Prjibelski

SPAdes was initially designed for single-cell assembly but also works well on standard multi-cell material. Ranked high alongside the Salzberg lab’s MaSuRCA. Handles tricky non-diploid genomes like plants. Works with IonTorrent via error correction: IonHammer, which corrects indels and mismatches, alongside BayesHammer for Illumina reads. SPAdes works on the Illumina BaseSpace platform, DNAnexus and Galaxy. Wow, integrated everywhere. With Illumina Nextera mate pairs, they have an improved distribution of correct read pairs. Velvet assembled these better than SPAdes according to N50-based metrics, but SPAdes shows better quality metrics. Shows the importance of establishing community benchmarks and values. Titus mentions memory usage is a problem on large genomes.

SigSeeker: An Ensemble for Analysis of Epigenetic Data

Jens Lichtenberg

SigSeeker handles analysis of epigenetic methylation and histone modifications: ChIP-seq. Motivation: lots of tools but no good evaluations of the process. The idea is to provide an ensemble-based approach to understand the tradeoffs of tools. Ways to correlate: by location in the genome, and by intensity. Removes outliers to help resolve peaks called consistently between tools. Produced nice correlations between all of the different tools. Also layered on top of the biology of blood cell differentiation. Argues that adding tools adds power, alongside the addition of technical replicates.
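
Correlating “by location” boils down to interval overlap between each tool’s peak calls. A toy sketch of consensus-by-overlap in Python (variable names and numbers are illustrative, not SigSeeker’s code):

```python
def overlaps(a, b):
    """Two (start, end) intervals overlap if neither ends before the other starts."""
    return a[0] < b[1] and b[0] < a[1]

def consensus(calls_tool1, calls_tool2):
    """Keep peaks from tool 1 supported by at least one peak from tool 2."""
    return [p for p in calls_tool1
            if any(overlaps(p, q) for q in calls_tool2)]

# Toy peak calls from two hypothetical peak callers
tool1_peaks = [(100, 250), (500, 700), (900, 950)]
tool2_peaks = [(120, 260), (910, 930)]
print(consensus(tool1_peaks, tool2_peaks))  # [(100, 250), (900, 950)]
```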

Galaxy as an Extensible Job Execution Platform

John Chilton

John talks about Galaxy integration with clusters. The goal is to convince you that Galaxy runs jobs awesomely on clusters and clouds. The whole process is pluggable and John wants to convince you to use Galaxy as a platform. Galaxy Pulsar allows you to execute jobs in the same manner Galaxy does. Deployers can route jobs based on inputs: dynamic job destinations allow delaying jobs, collecting parameters from tools, and pulling resource usage from recent jobs. Dynamic job state handlers enable resubmission of jobs to larger resources after hitting resource/time limits. New plugin infrastructure with Docker containers for installation.
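
For flavor, dynamic destinations are plain Python functions that pick a destination ID at submission time. A hedged sketch of the idea, using stub objects so it runs outside Galaxy; destination names and the size threshold are invented, and the exact rule signature varies by Galaxy version:

```python
from collections import namedtuple

# Stubs standing in for Galaxy's job/dataset model so the rule can be
# exercised standalone. In a real deployment the function lives in a
# rules file and Galaxy passes its own job object.
Dataset = namedtuple("Dataset", ["size"])
Assoc = namedtuple("Assoc", ["dataset"])
Job = namedtuple("Job", ["input_datasets"])

GB = 1024 ** 3

def route_by_input_size(job):
    """Return a destination ID based on total input size (names invented)."""
    total = sum(a.dataset.size for a in job.input_datasets if a.dataset)
    return "cluster_highmem" if total > 10 * GB else "local_short"

small = Job([Assoc(Dataset(500 * 1024 ** 2))])
big = Job([Assoc(Dataset(50 * GB))])
print(route_by_input_size(small))  # local_short
print(route_by_input_size(big))    # cluster_highmem
```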

Added job metrics, collecting information about job runtime, cores and compute using collectl. Need to look at what John is doing with initial work in bcbio. Pulsar (formerly LWR) getting lots of usage under the covers at Galaxy by running jobs on TACC. Prototype using Pulsar on top of Mesos.

Open Bioinformatics Foundation (OBF) update

Hilmar Lapp

Hilmar provides an update on what is happening with the Open Bioinformatics Foundation. He describes all the work Open Bio does as a non-profit, volunteer-run organization: BOSC, Codefests, Hackathons and GSoC. Interestingly, there are 120 Open-Bio members but the Biopython/BioPerl mailing list communities are 1000+ people. OBF is now associated with Software in the Public Interest (SPI), so you can easily donate. Hilmar discusses the challenges of making progress with an all-volunteer organization.

Notes, Galaxy Community Conference 2014 day 1 afternoon: State of the Galaxy, IonTorrent, Lightning Talks

I’m at the 2014 Galaxy Community Conference in Baltimore. These are my notes from the day 1 afternoon session, following up on notes from the morning session.

State of the Galaxy

Anton Nekrutenko and James Taylor, Galaxy Project

Anton and James talk about the history and current status of Galaxy. They start off by recapping previous GCCs: they went from 75 to 250 attendees. Nice numbers about growth in contributors over the past year. For Galaxy main, job usage became a crisis: more jobs than the hardware could handle. Galaxy main switched over to TACC last October. The biggest issue is that with more data, jobs now run longer.

Summary of new features: new visualizations and a full visualization framework, dataset collections, the Galaxy BioStar community, Toolshed with automated installation, and data managers.

New stuff that is coming: reorganization of the toolshed, wanting to make the toolshed process more straightforward. A new workflow scheduling engine and streaming to improve scaling on large datasets. Visualization is a planned focus for the next year. The goal is to think about a distributed Galaxy ecosystem that includes federation and data localization. This is hard, but would be so awesome. They also want to figure out Docker integration with Galaxy. A nice discussion by Anton about scalable training, to better coordinate how training works across institutions.

Update on Ion Torrent Sequencing – Accurate, Long Reads

Mike Lelivelt – Ion Torrent

Mike talks about the role of Ion Torrent sequencing in a world where Illumina dominates: challenging and driving cost and accuracy. Mike talks about work to improve indel calling given inherent chemistry limitations in homopolymers. New chemistry, Hi-Q, provides improved resolution of SNPs and indels. Shows a nice IGV plot where errors are now random instead of systematic. The advantage is that depth and consensus can now resolve issues.

Mike talks through great work re-evaluating the Genome in a Bottle paper. The biggest issue was using bwa + GATK, which are not good with Ion Torrent inputs. The recommended workflow is TMAP + TVC. We need a place to find these best practices and tools. TMAP and Torrent Suite are available on GitHub, but it’s not clear externally where to get and how to compile them. After registration on the IonCommunity site, you can find recommended approaches: RNA-seq recommendations. We need similar summaries of recommendations for variant calling. They are releasing new versions of Torrent Caller and TMAP soon as easy-to-use separate command line programs, so it’s a great time to work on integrating them.

The Galaxy Tool Shed: A Framework for Building Galaxy Tools

Greg von Kuster, Penn State University

Greg talks about recent work on the Galaxy toolshed to enable building tool dependencies automatically. Shows a demo with the HVIS tool that makes Hilbert curves. He walks through the process of bootstrapping a toolshed installation on the local machine. This allows you to test and evaluate the tool locally in an entirely isolated environment. Once validated, this can get exported to the Galaxy toolshed, where it is run through a similar isolated framework for testing every 48 hours.

Integrating the NCBI BLAST+ suite into Galaxy

Peter Cock, The James Hutton Institute

Peter talks about his work integrating BLAST+ into Galaxy. He emphasizes the importance of making tools freely distributable so we can automate installation and distribution. They have done awesome work creating a community around developing these tools on GitHub, with functional tests automatically run via TravisCI. A good example of a tricky tool, because they needed to add new Galaxy datatypes to support potential output formats. The BLAST tools had a lot of repeated XML and used macros to reduce this. The downside is the added complexity of the tool definitions.

On their local instance, they have Galaxy job splitting enabled which batches files into groups of 1000 query sequences. Peter also has a wrapper script which caches BLAST databases on individual nodes. Current work in progress is to create data managers for BLAST databases.
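
The query batching itself is straightforward; a sketch of the idea with Biopython (my illustration, not the actual Galaxy splitting code):

```python
from Bio import SeqIO

def batch(records, size=1000):
    """Yield lists of up to `size` SeqRecords at a time."""
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Split a query FASTA into files of 1000 sequences for parallel BLAST jobs.
for i, chunk in enumerate(batch(SeqIO.parse("queries.fasta", "fasta"))):
    SeqIO.write(chunk, "queries_%03d.fasta" % i, "fasta")
```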

deepTools: a flexible platform for exploring deep-sequencing data

Björn Grüning, University of Freiburg

deepTools aims to standardize quality control work and conversion to bigWig format. Some of the available tools: bamCorrelate compares similarity between multiple BAM files; bamCoverage does the conversion of BAM to bigWig; bamCompare shows the differences between two BAM files; heatmapper gives beautiful visualization of BAM coverage across all genes. I really need to integrate this work into bcbio-nextgen. They also have an awesome Galaxy instance for exploring usage. Finally, they’ve got a Docker instance with everything pre-installed.
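
As an example of the conversion step, a hedged call to bamCoverage from Python; the flags follow the deepTools documentation but may differ by version, so check `bamCoverage --help` for yours.

```python
import subprocess

# Convert a BAM to binned bigWig coverage with deepTools' bamCoverage.
# Flags are illustrative; consult the documentation for your version.
subprocess.check_call([
    "bamCoverage",
    "--bam", "sample.bam",
    "--outFileName", "sample.bw",
    "--binSize", "50",
])
```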

Lightning talks

Lots of 5 minute talks. Will try to keep up.

David van Enckevort talks about work building a scalable Galaxy cluster in the Netherlands. Using Galaxy in a clinical setting for NGS and proteomics work. Main bottleneck is I/O performance of file intermediates.

Marius talks about Mississippi, a tool suite for small RNA work. Main components: trimming of small RNAs, bowtie set up to handle short regions, and a cascade tool providing a quick overview of small RNA targets. They have nice visualizations to look at the size distribution of reads and small RNA properties, with good-looking faceted plots.

The next talk is on handling online streaming analytics for heart rate variability. They used BioBlend to retrieve and execute workflows, and provided a custom Django GUI for users to select workflows. Future work includes adding an intermediate distributed queue via Celery for streaming.
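
BioBlend makes this kind of remote workflow execution a few lines of Python. A minimal sketch with placeholder URL, key and IDs; `run_workflow` was the BioBlend call of this era (later superseded by `invoke_workflow`):

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: point at your own Galaxy server and API key.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Find a workflow by name, then run it against a dataset in a history.
# The workflow name and dataset ID here are invented for illustration.
wf = next(w for w in gi.workflows.get_workflows()
          if w["name"] == "hrv-analysis")
dataset_map = {"0": {"src": "hda", "id": "dataset_id_here"}}
gi.workflows.run_workflow(wf["id"], dataset_map=dataset_map,
                          history_name="streaming-run")
```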

Yvan talks about a use case for structuring the biologist community, with a project in France that brought together scientists working in multiple areas. Found that with more automation, you need additional human expertise to train and improve.

Ira talks about visualization of proteomics data in Galaxy. Protviz was an initial visualization tool built within Galaxy that communicated with an outside server. Unfortunately this leads to confusing error messages and issues during communication. The problem is that data lives in multiple places, leading to long lead times for visualization. Results are not self-contained and are difficult to install and maintain. The improved approach does processing up front and gathers results in a SQLite database, now integrated directly into Galaxy visualization based on a prototype from the Galaxy hackathon.

Nate talks about what the Galaxy team had to go through to move Galaxy main over to TACC, thanks to a collaboration with iPlant. They got a 10Gb/s connection to XSEDE via PSC. Tried using GlobusOnline and GridFTP, and ended up with rsync. To transfer 600TB, they had to slow down because they were saturating the 10Gb/s line; it ended up taking 2 months. Used Pulsar and a hierarchical object store to help manage the infrastructure.

Notes, Galaxy Community Conference 2014 day 1 morning: Steven Salzberg, deployment, visualization, reproducibility

I’m at the 2014 Galaxy Community Conference in Baltimore. After two days at the Galaxy Hackathon and a presentation on CloudBioLinux at the Galaxy training day, it’s now time for the conference. These are my notes from the talks and discussions from the morning session.

Transcriptome assembly: Computational challenges of NGS data

Steven Salzberg, Johns Hopkins

Steven’s lab focuses on exome sequencing, transcriptome sequencing and RNA-seq, microbiome studies and de novo assembly. He starts by talking about Bowtie2, TopHat and Cufflinks; they handle alignment, spliced alignment and transcript assembly, respectively.

The next generation of Tuxedo suite tools in development: bowtie2 -> HISAT -> StringTie -> Ballgown. Motivation: pseudogenes cause alignment issues, and there are 14,000 pseudogenes in the human genome. TopHat2 works around this by working in two stages: discovering splice sites and then realigning to avoid problems with reads incorrectly aligning to pseudogenes. In validation work, the biggest improvements are in regions with short-anchored reads. Motivated by the speed of STAR to make TopHat2 faster – HISAT: Hierarchical Indexing for Spliced Alignment of Transcripts. The major issue is that the BWT is not local, so you have to start over at each splice site instead of being able to take advantage of the FM index. HISAT instead creates local indexes corresponding to small 64k regions of the genome, overlapping by 1k. That creates 48,000 indexes, but is only 4.3Gb for the human genome. HISAT’s algorithm: use the global index where a read falls fully within one index; use a local index for regions with a small anchor region, skipping to nearby local indexes when a read hits an intron. Big advantage for looking up short anchors since you don’t have to dig through the whole index. This also works for large anchors, which is an easier problem.
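
The arithmetic behind those numbers: 64kb windows overlapping by 1kb advance ~63kb at a time, so a ~3Gb genome needs roughly 48,000 local indexes. A toy sketch of mapping a position to its covering windows (my illustration, not HISAT’s code):

```python
WINDOW = 64 * 1024   # each local index covers 64kb
OVERLAP = 1024       # adjacent windows overlap by 1kb
STRIDE = WINDOW - OVERLAP

def local_indexes(pos):
    """Return the IDs of the local index windows covering a position.
    Positions inside an overlap region belong to two windows."""
    last = pos // STRIDE
    first = max(0, (pos - WINDOW) // STRIDE + 1)
    return list(range(first, last + 1))

genome_size = 3_100_000_000
print(genome_size // STRIDE)  # ~48,000 local indexes for a human genome
print(local_indexes(65_000))  # position in an overlap: two windows, [0, 1]
```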

Measured performance of HISAT. First approach: simulated data including lots of different types of anchor sizes. Speed tests show HISAT aligns more reads/sec than STAR. Sensitivity of HISAT is equivalent to or slightly better than TopHat, with some additional tweaks to the algorithm: 3 versions of HISAT are currently under evaluation. These also handle tricky cases with 1-7 base anchors over exons.

Expectation for HISAT release is this summer – soon.

StringTie is a new method for transcriptome assembly. It provides two new ideas: assemble reads de novo to provide “super-reads”; create a flow network to assemble and quantitate simultaneously. The first step uses MaSuRCA, which creates long reads by extending original reads forward and backward until the extension is no longer unique. These super-reads get aligned to the genome, potentially capturing more exons. Following this, build an alternative splice graph and then use this to create the flow network.

The splice graph is made up of nodes as exons and connections as potential combinations of the exons. Cufflinks’ idea is to create the most parsimonious representation of this splice graph. StringTie builds a splice graph, then converts it into a flow network and finds the maximum flow through the network: flow = number of reads to assign to different isoforms.
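
The flow-network idea in miniature: exons become nodes, junction read counts become edge capacities, and the maximum flow from transcript start to end gives the reads assignable across isoforms. A toy sketch with networkx (made-up numbers, not StringTie’s algorithm):

```python
import networkx as nx

# Toy splice graph: source -> exons -> sink, with capacities set from
# (made-up) junction read counts.
G = nx.DiGraph()
G.add_edge("source", "exon1", capacity=100)
G.add_edge("exon1", "exon2", capacity=60)   # isoform A keeps exon2
G.add_edge("exon1", "exon3", capacity=40)   # isoform B skips exon2
G.add_edge("exon2", "exon3", capacity=60)
G.add_edge("exon3", "sink", capacity=100)

flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
print(flow_value)          # 100 reads assigned across the two isoforms
print(flow_dict["exon1"])  # how the flow splits between the paths
```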

Accuracy, evaluated through simulated examples: nice sensitivity and precision improvements over Cufflinks – a 20% increase in sensitivity. On transcripts with read support: 80% sensitivity and 75% precision. Evaluation on real data with “known” genes also provides a good improvement over TopHat.

For speed, StringTie is much faster than Cufflinks: 29 versus 81 minutes (StringTie versus Cufflinks) on one dataset, and 95 versus 941 minutes on another. So 3-10x faster.

Awesome question from Angel Pizarro about independent benchmarks for measuring transcriptomes. Steven is happy to share his data, but no standard benchmark is available. We need a community doing this for RNA-seq.

The Galaxy framework as a unifying bioinformatics solution for multi-omic data analysis

Pratik Jagtap, University of Minnesota

Pratik works on integrating proteomic tools into Galaxy. Shows examples of proteomic workflows, which look so foreign when you deal with genomic data regularly. However, there are similar challenges to genomics to learn from: the quality of the underlying database representations is essential for good results. They provide Galaxy workflows to help improve these representations. Have nice integrated genome viewers within Galaxy. Shows examples of biological results achieved with this platform; I like how they are working on the ground squirrel, described succinctly as a non-model organism. Proteomics results help improve the in-progress genome. The GalaxyP community project drives this software development and data submission process.

iReport: HTML Reporting in Galaxy

Saskia Hiltemann, Erasmus University Medical Center

iReport is a method to present workflows that have a lot of outputs, enabling viewing them directly in Galaxy. Outputs in an iFuse2 report include SVs, small variants and CNVs, with a visualization for each of these. Results are fully interactive via jQuery. iReport creates these types of HTML reports from workflow outputs without having to do all that work from scratch.

Saskia shows an iReport example output that demonstrates the capabilities. Awesome ability to combine multiple result types into useful output. I will definitely investigate this as a useful way to coordinate outputs for bcbio integrated into Galaxy.

Galaxy Deployment on Heterogeneous Hardware

Carrie Ganote, National Center for Genome Analysis Support

Carrie is talking about approaches to putting Galaxy on multiple architectures. The National Center for Genome Analysis Support provides bioinformatics support and cluster access for free; fully grant funded. Awesome. They’re doing a collaboration with Brian Haas at the Broad getting Trinity working well with Galaxy. They have multiple Galaxy integrations: 3 different local compute resources with a shared filesystem, plus remote systems that do not have a shared filesystem. Carrie describes in-depth issues dealing with Galaxy: it couldn’t communicate with Torque due to PBS configuration changes, plus integration with DRMAA. To get things working with Cray, they needed to create a shell wrapper around the Galaxy wrapper submit script; ugh, too many wrappers. Also integrated with the Open Science Grid, dealing with unevenly distributed resources.

A journal’s experiences of reproducing published data analyses using Galaxy

Peter Li, GigaScience

GigaScience focuses on reproducibility of published results, but most papers do not provide enough information to reproduce them. They investigated whether results in a GigaScience paper could be reproduced in Galaxy. The pilot project was to reproduce SOAPdenovo2 results, specifically a table comparing it to SOAPdenovo1 and ALLPATHS. The paper has a tarball of shell scripts and data for re-running – good stuff. The idea was to convert these over to Galaxy: download the datasets used into a Galaxy history; wrap the tools used in the comparison.

The downloaded pipeline shell script did not have an automated way to go from run output to the N50 comparison stats used in the paper. The paper’s methods did not have enough detail to replicate, and needed an additional step not described in the methods. This was figured out by asking the authors, but is not represented in the shell script or paper.

After adding the missing steps to the Galaxy workflow, they replicated the original results from the paper. Got similar results for SOAPdenovo1 and ALLPATHS, although the numbers were not identical.

Observations: reproduction work is difficult, requiring a lot of time and effort, plus help from the authors. Sigh.

Enabling Dynamic Science with Flexible Infrastructure

Anushka Brownley and Aaron Gardner, BioTeam

BioTeam focuses on providing IT expertise for scientific work. They do great work and are well regarded in the scientific community. BioTeam SlipStream is a dedicated compute resource pre-installed and configured with Galaxy. The goal is to connect SlipStream with additional resources: Amazon and other local infrastructure. Aaron shows an example where jobs spill over from SlipStream to an existing SGE cluster when busy. They managed to do this both locally and on Amazon with StarCluster. In both cases they modified only SGE to achieve this integration.

Moving forward, the plan is to use LWR to enable this, look at Docker containerization for the toolshed, and use Mesos for heterogeneous scheduling.