Bioinformatics Open Source Conference 2011 — Day 2 afternoon talks

Ben Vandervalk — SADI for GMOD: Bringing Model Organism Data onto the Semantic Web

SADI is a Semantic Web framework built around RDF documents described with OWL. Ben wrote several SADI services to access sequence feature data from GMOD: a service takes RDF as input, runs a query, and returns feature descriptions back as RDF. The SADI GMOD service works off GFF files and is implemented as a CGI script; it can be obtained from the SADI GMOD Google site.
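As a rough illustration of that interaction pattern (not the project's own client code), a SADI-style service accepts RDF on a POST and returns RDF describing matching features. This is a minimal sketch; the service URL, region resource and property names are placeholders, not the real SADI GMOD vocabulary.

    import requests
    from rdflib import Graph

    # Hypothetical SADI for GMOD service endpoint (placeholder URL).
    SERVICE_URL = "http://example.org/sadi/gmod/features"

    # Input RDF describing the region we want features for (illustrative only).
    input_rdf = """
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/terms#">
      <rdf:Description rdf:about="http://example.org/region/chrI:1000-2000">
        <ex:reference>chrI</ex:reference>
        <ex:start>1000</ex:start>
        <ex:end>2000</ex:end>
      </rdf:Description>
    </rdf:RDF>
    """

    # SADI services consume and produce RDF over plain HTTP.
    response = requests.post(SERVICE_URL, data=input_rdf,
                             headers={"Content-Type": "application/rdf+xml"})

    # Parse the returned feature descriptions and list the triples.
    graph = Graph()
    graph.parse(data=response.text, format="xml")
    for subject, predicate, obj in graph:
        print(subject, predicate, obj)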

Stian Soiland-Reyes — Scufl2: because a workflow is more than its definition

Scufl2 is a new workflow format for Taverna. The goals are to handle workflows while being compatible with Semantic Web technologies, and to capture enough information to re-run and re-use any part of a workflow. The idea is to get a fully reproducible workflow from a paper.

The workflow is distributed as a bundled zip file. It contains a workflow bundle RDF file along with workflow, profile, annotation, and input/output RDF files. The main development page is at workflow4ever, with code on GitHub. Currently in alpha, aiming for releases later in the year.
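A quick sketch of inspecting such a bundle with Python's standard zipfile module; the bundle file name and the exact RDF entry names inside the archive are assumptions, since the layout is only summarized above.

    import zipfile

    # Placeholder name for a Scufl2-style workflow bundle.
    BUNDLE = "workflow.scufl2"

    with zipfile.ZipFile(BUNDLE) as bundle:
        # List every entry, flagging the RDF documents that make up the bundle.
        for name in bundle.namelist():
            marker = " (RDF)" if name.endswith((".rdf", ".ttl")) else ""
            print(name + marker)

        # Read the start of one of the RDF documents (entry name is an assumption).
        with bundle.open("workflowBundle.rdf") as handle:
            print(handle.read()[:200])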

Tomasz Adamusiak — OntoCAT – an integrated programming toolkit for common ontology application tasks

Tomasz starts off by calling everyone wolves, then mentions that the problem with the three little pigs is that they did not provide a consistent API. The relevance to RDF and ontology work is that we have multiple ontology repositories: EBI, BioPortal, plus local ontologies. OntoCAT is a database and browser, REST service and Google App. It provides an abstraction layer over the different resources, giving a set of common methods to access ontologies. It also has a web tool and an R package.
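The abstraction-layer idea can be sketched in a few lines of Python. This is purely illustrative of the pattern and is not OntoCAT's actual (Java/R) API; the repository classes and method names here are invented.

    class OntologyRepository:
        """Common interface every backing repository must provide."""
        def search_term(self, query):
            raise NotImplementedError

    class BioPortalRepository(OntologyRepository):
        def search_term(self, query):
            # A real client would call the BioPortal REST service here.
            return ["BioPortal hit for %s" % query]

    class LocalOboRepository(OntologyRepository):
        def search_term(self, query):
            # A real client would scan a locally loaded OBO file here.
            return ["local hit for %s" % query]

    class OntologyService:
        """Single entry point that fans a query out over all repositories."""
        def __init__(self, repositories):
            self.repositories = repositories

        def search_term(self, query):
            results = []
            for repo in self.repositories:
                results.extend(repo.search_term(query))
            return results

    service = OntologyService([BioPortalRepository(), LocalOboRepository()])
    print(service.search_term("apoptosis"))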

Steffen Moeller — Debian Med: individuals’ expertize and their sharing of package build instructions

Bioinformatics has an enormous number of specialized tools that can be challenging to install. Debian Med is a community package repository: publication of packages, collaboration and public engagement. Packaging is all volunteer work and is shared with Bio-Linux and other communities. Current work includes incorporating additional Java packages, establishing complete workflows, and data management with BioMaj.

Andreas Hildebrandt — The Biochemical Algorithms Library for Rapid Application Development in Structural Bioinformatics

Motivation: it is difficult to design new drugs, despite improvements in data generation, knowledge and techniques. The goal is to improve drug design. The approach is to generate structures: start with the PDB, parse the structures, add missing atoms, infer bonds and optimize atom positions. The Biochemical Algorithms Library (BALL) helps handle these steps. It has Python language bindings, so you can work with it directly from scripting languages. The viewer shows some beautiful pictures of 3D structures.

Simon Mercer — A Framework for Bioinformatics on the Microsoft Platform

Simon talks about the work Microsoft is doing to build a reusable bioinformatics toolkit on the .NET framework. The Microsoft bioinformatics framework is at version 2.0; it is cross-platform and works on Mono. The name of the project is going to change as they move to the OuterCurve Foundation, so it will no longer be guided or owned by Microsoft.


Bioinformatics Open Source Conference 2011 — Day 2 morning talks

The second day of the Bioinformatics Open Source Conference (BOSC) started off with a session on Cloud Computing.

Matt Wood — Into the Wonderful

Matt worked at Ensembl/Sanger on sequencing pipelines, now technology evangelist at Amazon dealing with Cloud Computing.

Starts off by talking about data, well, lots of data. The challenges are distributing data and making it available under lots of constraints: throughput, data management, software, availability, reproducibility, cost. So far, we've managed to move from Gb-scale to Tb-scale work. Open source software has played a role by making software easily available and by building active development communities.

The work so far provides foundational blocks for the next steps; how can we optimize with existing tools and infrastructure? Optimize for developer productivity, and also for the wider development community, because of lower barriers to entry. The goals are to abstract away the difficult and tedious parts, maximizing the time you can spend on the fun parts.

What are the building blocks available to do this? The cloud provides a collection of foundations: compute, storage, databases, automation; couple these with workflows, analytics, warehouses and visualization. Move data/compute into materials and methods. Usability is the most important metric for tools: available, flexible, and reliable. The cloud is an awesome example of this: it is quick to get a new image, with an API to flexibly access them.

Machine images are really key to sharing code, data, configuration and services with others. Also a reproducible representation of work that was done. You have a ton of moving parts and want to be able to capture these: tools like Puppet and Chef allow you to reproduce this as well.

Amazon data is replicated across multiple availability zones for redundancy, with each availability zone separate. However, data stays within its region: data stored in the US does not move to Europe. There are a ton of options for building different infrastructure and managing costs: standard, reserved and spot instances. Spot instances are a great way to get access to cheap compute, but you need to architect for interruption.

Matt talks about some of his favorite projects: Galaxy, Cloud BioLinux, Taverna, StarCluster, CloudCrowd. Also companies doing interesting stuff here: Cycle Computing, ionflux, DNANexus.

For Hadoop, Amazon’s ElasticMapReduce takes away a lot of the pain of setting up a Hadoop server. Another example is Amazon’s Relational Database Service for MySQL/Oracle. General idea is lowering the barrier to utilization.

The free tier and research grants are some no-cost ways to get started.

Richard Holland — Securing and sharing bioinformatics in the cloud

Talking about commercial deployment of open source software: PlasMapper and Ensembl. A proof-of-concept cloud architecture with Ensembl and custom databases, and open source applications on top: Ensembl, PlasMapper and GeneAtlas.

For security, they used OpenAM to authenticate, encrypted data on disk, used SSL encryption for communication, hid Apache server information, and firewalled the systems.

Some potential issues: PlasMapper writes to a tmp directory where your original data is available; if you don't secure this directory in Apache, others can grab your data. With Ensembl, you can do HTML injection by appending nasty JavaScript tricks to GET parameters. Ensembl also has global identifiers for things like BLAST results; this is security by obscurity, since anyone who had or guessed the id could look at your results. These identifiers need to be tied to a login.

Recommendations: firewall externally and internally, validate file uploads, don't store uploaded files in an accessible location, and avoid GET parameters where possible.
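A minimal sketch of two of those recommendations — validating an uploaded filename and writing the file outside the web-served directory — assuming a generic CGI/WSGI-style handler; the directory path and naming rule are placeholders, not the setup described in the talk.

    import os
    import re

    # Uploads go outside the document root so Apache/nginx never serves them directly.
    UPLOAD_DIR = "/srv/uploads"               # placeholder path, not web accessible
    ALLOWED_NAME = re.compile(r"^[\w.-]+$")   # no slashes, no spaces, no shell metacharacters

    def save_upload(filename, data):
        """Validate the client-supplied name and store the file out of reach of the web server."""
        basename = os.path.basename(filename)
        if not ALLOWED_NAME.match(basename) or basename.startswith("."):
            raise ValueError("rejected suspicious upload name: %r" % filename)
        destination = os.path.join(UPLOAD_DIR, basename)
        with open(destination, "wb") as handle:
            handle.write(data)
        return destination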

Chunlei Wu — Gene Annotation as a Service – GAaaS

Chunlei starts off with a migration story for BioGPS, a gene-centric annotation data representation. They started with a relational database solution, then switched to a document-based solution: JSON-style objects in CouchDB. The infrastructure uses Tornado on top of that, with nginx in front. A web-based API for queries sits alongside the web application.
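To make the document-based idea concrete, here is a rough sketch of storing and fetching a gene-centric JSON document through CouchDB's HTTP API with requests; the database name, document id and fields are invented for illustration, not BioGPS's actual schema.

    import json
    import requests

    COUCH = "http://localhost:5984"   # default CouchDB address
    DB = "genes"                      # placeholder database name

    # Create the database (CouchDB returns a conflict if it already exists).
    requests.put("%s/%s" % (COUCH, DB))

    # A gene-centric annotation document, stored as schema-free JSON.
    doc = {
        "symbol": "CDK2",
        "taxid": 9606,
        "synonyms": ["p33(CDK2)"],
        "go": {"MF": ["protein kinase activity"]},
    }
    requests.put("%s/%s/CDK2" % (COUCH, DB),
                 data=json.dumps(doc),
                 headers={"Content-Type": "application/json"})

    # Fetch it back with a plain GET.
    print(requests.get("%s/%s/CDK2" % (COUCH, DB)).json())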

Ntino Krampis — Cloud BioLinux: open source, fully-customizable bioinformatics computing on the cloud for the genomics community and beyond

There is a paradigm shift associated with next-gen sequencing data and small sequencing machines: small labs can now handle their own sequencing, and the second step is how to analyze it. CloudBioLinux is a community project associated with JCVI and NEBC Bio-Linux. Ntino demos using CloudBioLinux, connecting with a graphical client and making data available to collaborators by sharing AMIs.

Olivier Sallou — OBIWEE : an open source bioinformatics cloud environment

OBIWEE is a bioinformatics framework based on the Torque job scheduler. It combines three pieces of software: a workflow authoring tool, a virtual Torque cluster and a set of deployment scripts for private or public clouds. SLICEE is the workflow authoring tool, with a front end to submit jobs. It has API, command line and GUI interfaces for running.
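For context on what sits underneath such a framework, here is a minimal sketch of submitting a job to a Torque cluster by driving qsub from Python; the job script contents and resource request are placeholders and this is not OBIWEE/SLICEE code.

    import subprocess

    # A tiny Torque/PBS job script (placeholder command and resources).
    job_script = """#!/bin/bash
    #PBS -N blast_job
    #PBS -l nodes=1:ppn=4
    cd $PBS_O_WORKDIR
    blastn -query input.fa -db nt -out results.txt
    """

    # qsub reads the job script from stdin and prints the assigned job id.
    result = subprocess.run(["qsub"], input=job_script.encode(),
                            capture_output=True, check=True)
    print("submitted job:", result.stdout.decode().strip())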

Brian O’Connor — SeqWare: Analyzing Whole Human Genome Sequence Data on Amazon’s Cloud

SeqWare is an open source toolset for large-scale sequence analysis. The project ported it to EC2 for scaling out. It uses the Pegasus workflow engine to define workflows and run them on clusters. SeqWare has multiple levels to interact with: a workflow description language and a Java class interface. It can also provision and bundle dependencies.

To port to EC2, they used StarCluster with a custom AMI containing the dependencies. 9 human genomes were analyzed at a cost of $1000 per genome and $100 per exome.

Lars Jorgensen — Sequencescape – a cloud enabled Laboratory Information Management Systems (LIMS) for second and third generation sequencing

Sequencescape is the LIMS at the Sanger Institute, so it can definitely scale. It supports all sequencing technologies. Development is open on GitHub, and what's there matches what is running at Sanger currently. Sanger data needs to be publicly released 60 days after quality control. Really impressed.

LIMS handles pretty much everything: from freezer tracking, study management to automation, workflows to data release and reporting. Live demo is sweet and covers every use case you could imagine; runs on a laptop.

Enis Afgan — Enabling NGS Analysis with(out) the Infrastructure

Enis talks about CloudMan, Galaxy on the cloud with a reusable backend for scaling analyses. Lets you do NGS analyses on Amazon without needing any computational resources. Has even more tools and reference datasets than the Galaxy main site. It offers a wizard-guided setup directly in the browser, is customizable and can be shared with other users. Contains tons of NGS tools built on top of CloudBioLinux.

Aleksi Kallio — Hadoop-BAM: A Library for Genomic Data Processing

Chipster is the main project, and Hadoop-BAM was abstracted from it. It is designed for dealing with the large numbers of BAM files coming out of NGS analyses. It detects BAM record chunks based on the compressed data so files can be split up. Has a Picard-compatible API. Data import/export is slow, but otherwise it scales well with parallelization; used for batch pre-processing.

Bioinformatics Open Source Conference 2011 — Day 1 afternoon talks

The afternoon session at the 2011 Bioinformatics Open Source Conference is focused on 2 areas: Visualization and next-gen sequencing.

Michael Smoot — Cytoscape 3.0: Architecture for Extension

Cytoscape is a visualization framework for complex network analyses. It has a plugin architecture that allows customization by users, and has developed a strong community of contributors. Some issues are that the Cytoscape architecture is very complicated, which makes it difficult to change, and changes often break plugins which aren't updated regularly.

The challenge for Cytoscape is to improve this architecture: hence the new 3.0 version of Cytoscape. The new technologies used: OSGi defines the boundaries of modules, and Spring-DM helps manage OSGi based on XML configuration. Semantic versioning (major.minor.patch) is used to make version numbers meaningful; this allows you to specify ranges of working packages. Maven is used for dependency management.
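As a small illustration of why semantic versioning matters for plugin compatibility, here is a sketch in Python of checking whether an installed version falls within a declared working range; the version numbers are made up and this is not Cytoscape/OSGi code.

    def parse_version(version):
        """Split a 'major.minor.patch' string into a comparable tuple of ints."""
        return tuple(int(part) for part in version.split("."))

    def in_range(installed, minimum, below):
        """True if installed >= minimum and < below, e.g. the range [3.0.0, 4.0.0)."""
        return parse_version(minimum) <= parse_version(installed) < parse_version(below)

    # A plugin declaring it works against the 3.x API but not a future 4.x.
    print(in_range("3.2.1", "3.0.0", "4.0.0"))   # True
    print(in_range("4.0.0", "3.0.0", "4.0.0"))   # False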

Jeremy Goecks — Applying Visual Analytics to Extend the Genome Browser from Visualization Tool to Analysis Tool

Trackster is a Genome Browser integrated with Galaxy. 3 unique features:

  • Dynamic visualization of NGS data — Jeremy bravely does live demo and everything works. Whew.
  • Support visual analytics — use interactive visualization to reason about and solve problems. Sliders in trackster allow you to visually explore parameters, and then apply filters to entire dataset. Awesome part is that you can work only in a small region and re-run results in Galaxy. Running on a whole dataset would take a long time, but can quickly re-run on a small region. Demo shows doing this with Cufflinks.
  • Sharing working visualization — Can make a visualization link that others can pull up directly. Allow you to show exactly what you want to share but can also be dynamically manipulated.

Tools can be integrated with Trackster by specifying how they can be run on a local region, or how a global model can be re-used on a local region.

Nomi Harris — WebApollo: A web-based sequence annotation editor for community annotation

WebApollo succeeds the "old" Apollo, which was designed for community annotation but made collaborative annotation very difficult. WebApollo is, well, web-based instead of Java, and does real-time annotation updating for collaborative work.

It uses JBrowse for genome browsing, with extensions for annotation work. It accesses public data at UCSC and uses custom DAS servers as well. The demo server is impressive, and changes to annotated transcripts are pushed immediately to other users working on different servers.
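Since both WebApollo and Dalliance (below) lean on DAS, a quick sketch of what a DAS features request looks like from Python may help; the server URL and data source name are placeholders, but the URL pattern follows the DAS features command.

    import requests
    import xml.etree.ElementTree as ET

    # Placeholder DAS server and source; real servers follow the same URL pattern.
    DAS_SERVER = "http://example.org/das"
    SOURCE = "hg19_genes"

    # The DAS 'features' command returns XML annotations for a genomic segment.
    url = "%s/%s/features" % (DAS_SERVER, SOURCE)
    response = requests.get(url, params={"segment": "chr1:100000,200000"})

    # Pull out the FEATURE elements from the returned XML.
    root = ET.fromstring(response.content)
    for feature in root.iter("FEATURE"):
        print(feature.get("id"), feature.get("label"))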

Florian Breitwieser — The isobar R package: Analysis of quantitative proteomics data

isobar works with mass spectrometry data to visualize protein expression changes; it generates PDF and LaTeX reports. Overview of the technique: peptides are fragmented to get a spectrum, and isobaric peptide tags are used for quantitation, multiplexing up to 8 samples. isobar extracts identifications from databases and quantitative details from the mass spec data.

It handles normalization and correction for technical variability, with plots for sample variability and visualization. Analysis can be automated with Sweave to produce PDF reports, giving a fully reproducible approach along with the outputs.

Julian Catchen — Stacks: building and genotyping loci de novo from short-read sequences

Stacks is motivated by work on zebrafish, which underwent a genome duplication relatively recently in their evolutionary history. They use an outgroup fish (spotted gar) that did not undergo the duplication, and the RAD-seq (restriction-site associated DNA) technique to sample the genome at common SbfI cut sites. Stacks is then used to do comparative analyses between the non-duplicated and duplicated fish.

The algorithm in Stacks: reads are combined into regions called stacks, which are then broken down into k-mers loaded into a dictionary. Shared k-mers between stacks are used to establish similar regions in the duplications, and SNP variation can then be examined within the similar blocks.
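A toy Python sketch of that k-mer dictionary idea — index each stack's sequence by its k-mers and count shared k-mers between stacks. This only illustrates the matching step, not the Stacks implementation; the sequences and k-mer length are invented.

    from collections import defaultdict
    from itertools import combinations

    K = 7  # k-mer length (placeholder; Stacks picks this based on read length and mismatches)

    stacks = {
        "stack1": "ACGTACGTTGCAACGT",
        "stack2": "ACGTACGATGCAACGA",   # similar to stack1 with a couple of changes
        "stack3": "TTTTGGGGCCCCAAAA",   # unrelated sequence
    }

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    # Dictionary mapping each k-mer to the stacks containing it.
    index = defaultdict(set)
    for name, seq in stacks.items():
        for kmer in kmers(seq, K):
            index[kmer].add(name)

    # Count shared k-mers between every pair of stacks to find candidate matches.
    shared = defaultdict(int)
    for members in index.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1

    for pair, count in sorted(shared.items(), key=lambda item: -item[1]):
        print(pair, "share", count, "k-mers")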

Morris Swertz — Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands

MOLGENIS provides an XML interface to define tools, and they wanted to use this to analyze the Genome of the Netherlands. Sequencing was done at BGI, with 75% aligned and analyzed so far. They used technology approaches from the 1000 Genomes Project: alignment with BWA and SNP calling with GATK. The big challenges were in tracking samples and results, so they built a custom system to handle this. MOLGENIS results can be sent to Galaxy.

Raoul Bonnal — Bio-NGS: BioRuby plugin to conduct programmable workflows for Next Generation Sequencing data

Based on Bio-Gem, which is a general framework for extending BioRuby. Bio-NGS uses this modular framework to combine multiple tools for next-gen sequencing. It currently runs locally, but the next step is distributing jobs over multiple machines. Tasks and programs are defined with Ruby classes. Approaches to distributed tasks include messaging via Bio-Hub. Percolate is a related Ruby project which might be worth looking at for parallelization.

Kevin Dorff — Goby framework: native support in GSNAP, BWA and IGV 2.0

Goby provides file formats for next-gen sequencing that are more compact than BAM. They provide several algorithms and bridges to GSNAP, BWA and IGV.

Frank Drews — A Scalable Multicore Implementation of the TEIRESIAS Algorithm

TEIRESIAS is a motif discovery algorithm from IBM. It is available with binaries, but they are not useful for large datasets. It is used to discover common patterns within the human genome. The algorithm has two phases: an initial scan step and then an alignment/convolution step that resolves patterns.

To parallelize the scan, the word space is split up by initial letters: with 4 cores use 1 letter, with 16 cores use 2, and so on. For the parallel convolution, initial seeds need to be combined into similar groups and then processed separately within each group. They see a 4-10X speedup on 16-core machines, depending on k-mer size.
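A hedged Python sketch of that partitioning idea — split candidate words among workers by their first letter and scan in parallel with multiprocessing. The sequence and pattern length are made up, and this is not the TEIRESIAS implementation, just the splitting scheme.

    from collections import Counter
    from itertools import product
    from multiprocessing import Pool

    ALPHABET = "ACGT"
    L = 6            # pattern length to scan for (placeholder)
    SEQUENCE = "ACGTACGTGACGTTACGAACGTAC" * 1000  # toy input sequence

    def scan_prefix(prefix):
        """Count occurrences of all length-L words starting with the given prefix."""
        counts = Counter()
        for i in range(len(SEQUENCE) - L + 1):
            word = SEQUENCE[i:i + L]
            if word.startswith(prefix):
                counts[word] += 1
        return counts

    if __name__ == "__main__":
        # 4 workers -> 1-letter prefixes; 16 workers would use 2-letter prefixes, etc.
        prefixes = ["".join(p) for p in product(ALPHABET, repeat=1)]
        with Pool(len(prefixes)) as pool:
            partial = pool.map(scan_prefix, prefixes)

        # Merge per-prefix counts into the full word table.
        total = Counter()
        for counts in partial:
            total.update(counts)
        print(total.most_common(5))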

Future work to include regular expression patterns and distributed computing.

Jean-Frédéric Berthelot — Biomanycores, open-source parallel code for many-core bioinformatics

Biomanycores is a repository of parallel code that works on GPUs and multicore CPUs. OpenCL is used to generalize CUDA code across multiple machines. Implementations are available of Smith-Waterman, TFM-Cuda and UNAFold. They want to bridge the gap to biologists by building a pool of applications. Interfaces are available for Biopython, BioJava and BioPerl. Impressive speed-ups.

Kerensa McElroy — GemSIM: General, Error-Model Based Simulator of next-generation sequencing

Interested in trying to find low-frequency variations within bacterial populations. GemSIM handles error models and populations, makes reads, and then produces statistics. You want to generate an error model specific to the data you have, since Illumina error rates can be quite variable between runs. Shows graphs of 454 versus Illumina results and the importance of quality cutoffs.
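A toy Python sketch of error-model-based read simulation — draw substitution errors per position from a position-specific error rate. This only illustrates the idea; the rates, read length and reference are invented and are not GemSIM's models.

    import random

    random.seed(42)
    BASES = "ACGT"
    REFERENCE = "".join(random.choice(BASES) for _ in range(500))   # toy reference
    READ_LEN = 36
    # Toy position-specific error rates: quality degrades toward the read's 3' end.
    ERROR_RATES = [0.001 + 0.002 * i for i in range(READ_LEN)]

    def simulate_read(reference):
        start = random.randint(0, len(reference) - READ_LEN)
        read = list(reference[start:start + READ_LEN])
        for pos, rate in enumerate(ERROR_RATES):
            if random.random() < rate:
                # Substitute a different base at this position.
                read[pos] = random.choice([b for b in BASES if b != read[pos]])
        return start, "".join(read)

    for _ in range(3):
        print(simulate_read(REFERENCE))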

Bioinformatics Open Source Conference 2011 — Day 1 morning talks

I’m in beautiful Vienna, Austria at the Bioinformatics Open Source Conference on Friday and Saturday, July 15-16th. The conference emphasizes freely available biological software and the communities that contribute to them. These are my notes from the morning talks.

Larry Hunter — The role of openness in knowledge-based systems for biomedicine

The difficulty in Artificial Intelligence is capturing the common sense information that you expect an intelligent agent to know. There is a ton of this information, but thankfully in molecular biology this common sense is less of a barrier: you can capture everything known about molecular biology from textbooks, papers and databases. Can we write programs that get all this information?

Difficulty is that the interesting questions we want to answer are complex. The one gene, one disease model is extremely rare; in general we are looking at perturbations of complex models that change over time. Now that we are so good at sequencing, the hard problems in bioinformatics are understanding the data. This is not only about facts, but rather putting facts together to answer "why" questions. Judging if an explanation is plausible in AI requires a knowledge base and a way to score results.

Some knowledge based computational biology solutions are: BioCyc, AskHermes, Watson Medicine, GO over-representation analysis, HyQue; anything that uses ontologies like GO. 3 reasons that openness matters in these areas:

  • Productivity: very hard problem so need to build off results; can’t do it alone
  • Equity: allow anyone to contribute by lowering barriers and costs
  • Ethics: AI is a social concern; need to earn the trust of society

How do we get the process going?

  • Build on current open ontologies: OBO, Semantic Web, Open Access Publishing, Linked Life Data (wow)
  • Social infrastructure to work together to solve hard problems: cooperation and competition combined
  • Conform to shared infrastructure and avoid fears of losing ideas and credit to the community

Idea is to organize competitions that require open source code and using shared infrastructure, specifically with goals of combining existing communities (BioCreative, BioNLP). For this you need software to work off of, computational power, training data to work with, and significant prizes.

Larry’s group has made CRAFT available, an open source set of semantic annotations that uses existing community ontologies. This can be the basis of these competitions. Key is leveraging these existing standards to serve as a basis for future work so we are actually building off each other’s work.

Remaining challenges are ensuring openness of papers in a way that allows them to be bulk downloaded for AI text mining, and improving ontologies and their connections to existing text. The technical aspects are excellent targets for competitions.

Konstantin Okonechnikov — Unipro UGENE: an open source toolkit for complex genome analysis

UGENE integrates bioinformatics tools. It is written in C++/Qt with a plugin system, and provides a large library of bioinformatics algorithms: Smith-Waterman, MUSCLE, BLAST, HMM, Bowtie and more. It contains a visualization toolkit for sequence viewing and alignments. Algorithms are parallelized for multi-core CPUs and GPUs, with support for launching on clusters.

Contains a visual environment for constructing workflows. The workflow can be turned into a shell command to run from the commandline. Future plans are to develop a web environment, and support next-gen sequencing analysis.

Thomas Down — Exploring the genome with Dalliance

Existing genome browsers fall into two classes: heavy-weight clients that require installation, like IGV, or light-weight browser-based clients, like UCSC. Can we have a browser client that acts more like the heavy-weight clients but without installation? There are now a lot of web technologies to drive this: JavaScript, SVG/Canvas views, browsers focused on performance for games, and HTML5.

An interactive demo of the Dalliance genome browser shows nice scrolling and interaction fully within the web browser. For getting data it uses DAS: XML-based annotations from the web. This used to be limited by JavaScript same-origin policies, but now the server can be set to allow cross-origin requests in the response headers.
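For the cross-origin point, here is a minimal sketch of a server adding the relevant CORS header so a browser-based client can fetch the data; it uses Python's standard http.server and the permissive wildcard origin purely for illustration, not Dalliance's or any DAS server's actual code.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CORSHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<DASGFF></DASGFF>"   # placeholder response body
            self.send_response(200)
            # This header is what lets a page served from another origin read the response.
            self.send_header("Access-Control-Allow-Origin", "*")
            self.send_header("Content-Type", "application/xml")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), CORSHandler).serve_forever()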

Some alternatives to DAS include dense binary formats like BAM, BigBed and BigWig, with indexes for random access. Dalliance can support these directly. A nice interactive demo: quick, easy, and you can drill all the way down to reads with the BAM display.

Alex Kalderimis — InterMine – Using RESTful Webservices for Interoperability

InterMine is a data warehouse framework for biological experiments and raw data: FlyMine, modMine and more. The database is heavily de-normalised and is loaded and served as a read-only database: very performant. This is coupled with a read-write user database that references items in the data repository. They provide a web application interface to the repository, with custom query templates for biologists.

With the increase in the number of InterMine instances, they need ways to communicate between them. They use a REST API with clients for Java, Perl, JavaScript, Python and, soon, Ruby. This has a low threshold to usage, and can return data formats people are used to, like tab-delimited, as well as structured formats for programs. You can use this to build automated workflows that query one mine, grab identifiers, then get data from another one. The client API is improving to reduce the boilerplate code required.
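A rough sketch of querying a mine over plain HTTP and reading back tab-delimited rows. The query path and XML payload follow the general shape of InterMine's web service query endpoint, but the URL, constraint and columns below are assumptions rather than a tested query; the official client libraries hide these details.

    import requests

    # Placeholder mine URL; real mines expose a similar '/service' web service root.
    MINE = "http://www.flymine.org/query/service"

    # A PathQuery-style XML query (structure is a best-effort sketch, not verified).
    query_xml = """
    <query model="genomic" view="Gene.primaryIdentifier Gene.symbol Gene.organism.name">
      <constraint path="Gene.symbol" op="=" value="zen"/>
    </query>
    """

    response = requests.get(MINE + "/query/results",
                            params={"query": query_xml, "format": "tab"})
    # Tab-delimited rows: one gene per line with the requested columns.
    for line in response.text.splitlines():
        print(line.split("\t"))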

Some lessons learned: JSON is awesome; use GET/PUT to make the API more browser friendly; fail loudly with HTTP or JSON error codes so you actually know if you have a problem.

Bernat Gel — easyDAS: Automatic creation of DAS servers

easyDAS is a small web server that makes it easy to create a DAS server. DAS servers are meant to be simple, with smart clients using the simple servers. Most DAS servers are provided by larger institutes; how do you handle it if you are a small group without lots of resources?

easyDAS removes all the server configuration details; you only need to upload a data file. It is hosted at the EBI with a web interface. The maximum size is a million rows, so it is not suitable for full-genome base-level information, but works for lots of other data.

Kostas Karasavvas — Enacting Taverna Workflows through Galaxy

Taverna is a workflow management system; the goal is to integrate it with the Galaxy web framework. Taverna has a graphical interface to connect tools into a larger workflow, and provides a server that can run these workflows.

It is implemented as a Ruby gem that makes a Galaxy tool from a workflow in myExperiment, along with a connection to a Taverna server. Install the tool XML into Galaxy and then run it.

Hervé Ménager — Mobyle 1.0: new features, new types of services

[Mobyle] is a web user interface for running command line tools. It also allows chaining of jobs into workflows, with XML definitions used for both. They have also implemented viewers that visualize RNA structures, multiple alignments and phylogenetic trees. Alignments can be edited directly in the web interface. For workflows, jobs can run on LSF clusters to parallelize.

Junjun Zhang — BioMart 0.8 offers new tools, more interfaces, and increased flexibility through plug-ins

BioMart is an open source federated data management system meant to make in-house data available online. It is built as a Java system with lots of good software engineering. The new version provides additional ways to query the data backend. Used in several large scale collaborations.