The afternoon session at the 2011 Bioinformatics Open Source Conference focused on two areas: visualization and next-gen sequencing.
Michael Smoot — Cytoscape 3.0: Architecture for Extension
Cytoscape is a visualization framework for complex network analyses. Its plugin architecture allows customization by users and has fostered a strong community of contributors. One issue is that the Cytoscape architecture is very complicated, which makes it difficult to change. Changes often break plugins, which aren’t updated regularly.
The challenge for Cytoscape is to improve this architecture, hence the new 3.0 version. The new technologies used: OSGi defines the boundaries of modules, and Spring-DM helps manage OSGi based on XML configuration. Semantic Versioning (major.minor.patch) is used to make version numbers meaningful; this allows you to specify ranges of working packages. Maven is used for dependency management.
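To make the version range idea concrete, here is a minimal sketch in Python. It assumes a half-open range (lower bound included, upper bound excluded), and the function names parse_version and in_range are hypothetical, made up for illustration. It is not the Cytoscape or OSGi API.

```python
def parse_version(text):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def in_range(version, lower, upper):
    """Return True if lower <= version < upper (assumed half-open range)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# A plugin declaring that it works with any 3.x release:
print(in_range("3.1.4", "3.0.0", "4.0.0"))  # True
print(in_range("4.0.0", "3.0.0", "4.0.0"))  # False
```

Because a major version bump signals a breaking change under Semantic Versioning, a range that stops just below the next major version is a natural way for a plugin to say which releases it expects to work with.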
Jeremy Goecks — Applying Visual Analytics to Extend the Genome Browser from Visualization Tool to Analysis Tool
Trackster is a Genome Browser integrated with Galaxy. 3 unique features:
- Dynamic visualization of NGS data: Jeremy bravely does a live demo and everything works. Whew.
- Support for visual analytics: use interactive visualization to reason about and solve problems. Sliders in Trackster allow you to visually explore parameters, and then apply filters to the entire dataset. The awesome part is that you can work only in a small region and re-run results in Galaxy. Running on a whole dataset would take a long time, but you can quickly re-run on a small region. The demo shows doing this with Cufflinks.
- Sharing working visualizations: you can make a visualization link that others can pull up directly. This lets you show exactly what you want to share, while the view can still be dynamically manipulated.
Tools can be integrated with Trackster by specifying how they can be run on a local region, or how a global model can be re-used on a local region.
Nomi Harris — WebApollo: A web-based sequence annotation editor for community annotation
WebApollo succeeds the "old" Apollo, which was designed for community annotation but made collaborative annotation very difficult. WebApollo is, well, web-based instead of Java, and supports real-time annotation updates for collaborative work.
It uses JBrowse for genome browsing, with extensions for annotation work. It accesses public data at UCSC and uses custom DAS servers as well. The demo server is impressive, and changes to annotated transcripts are pushed immediately to other users working on different servers.
Florian Breitwieser — The isobar R package: Analysis of quantitative proteomics data
isobar works with mass spectrometry data to visualize protein expression changes; generates PDF and LaTeX reports. Overview of techniques: fragment peptides to get spectrum and use isobaric peptide tags for quantitation; multiplex up to 8 samples. isobar extracts identification from databases and quantitative details from mass spec.
Handles normalization and corrects for technical variability, with plots for sample variability and visualization. Analysis can be automated with Sweave to produce PDF reports, giving a fully reproducible approach along with the outputs.
Julian Catchen — Stacks: building and genotyping loci de novo from short-read sequences
Stacks was motivated by work on zebrafish, whose genome was duplicated relatively recently in their evolutionary history. An outgroup fish (spotted gar) that did not undergo the duplication is used for comparison. The RAD-seq (restriction-site associated DNA) technique samples the genome at common SbfI cut sites. Stacks is then used to do comparative analyses between the non-duplicated and duplicated fish.
Algorithm in Stacks: reads are combined into regions called stacks, then broken down into kmers that are loaded into a dictionary. The kmers shared between stacks are used to establish similar regions in duplications. SNP variation can then be examined within similar blocks.
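The kmer dictionary step can be sketched roughly as follows. This is an illustration of the general idea only, not the actual Stacks implementation: the function names, the toy sequences and the choice of k=4 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(stacks, k):
    """Map each kmer to the set of stack ids whose sequence contains it."""
    index = defaultdict(set)
    for stack_id, seq in stacks.items():
        for kmer in kmers(seq, k):
            index[kmer].add(stack_id)
    return index


def shared_kmer_counts(index):
    """Count distinct kmers shared by each pair of stacks."""
    counts = defaultdict(int)
    for stack_ids in index.values():
        for a, b in combinations(sorted(stack_ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy data (hypothetical): s1 and s2 differ by one base, s3 is unrelated.
stacks = {"s1": "ACGTACGTAC", "s2": "ACGTACGAAC", "s3": "TTTTGGGGCC"}
index = build_kmer_index(stacks, k=4)
pairs = shared_kmer_counts(index)
print(pairs)  # {('s1', 's2'): 4}
```

Pairs of stacks that share many kmers are candidates for being similar regions, for example the two copies of a duplicated locus, without having to compare every stack against every other stack directly.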
Morris Swertz — Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
MOLGENIS provides an XML interface to define tools, and the team wanted to use this to analyze the Genome of the Netherlands. Sequencing is done at BGI, and 75% has been aligned and analyzed so far. The technology approaches come from 1000 Genomes: alignment with BWA and SNP calling with GATK. The big challenges were in tracking samples and results, so a custom system was built to handle this. MOLGENIS results can be sent to Galaxy.
Raoul Bonnal — Bio-NGS: BioRuby plugin to conduct programmable workflows for Next Generation Sequencing data
Based on Bio-Gem, which is a general framework for extending BioRuby. Bio-NGS uses this modular framework to combine multiple tools for next-gen sequencing. It currently runs locally, but the next step is distributing jobs over multiple machines. Tasks and programs are defined with Ruby classes. Approaches to distributed tasks include messaging: Bio-Hub. Percolate is a related Ruby project which might be worth looking at for parallelization.
Kevin Dorff — Goby framework: native support in GSNAP, BWA and IGV 2.0
Goby provides file formats for next-gen sequencing that are more compact than BAM. They provide several algorithms and bridges to GSNAP, BWA and IGV.
Frank Drews — A Scalable Multicore Implementation of the TEIRESIAS Algorithm
TEIRESIAS is a motif discovery algorithm from IBM. Binaries are available, but they are not useful for large datasets. It is used to discover common patterns within the human genome. The algorithm has two phases: an initial scan step and then an alignment/convolve step that resolves patterns.
To parallelize the scan, split up the word space by initial letters: with 4 cores use 1 letter, with 16 cores use 2, and so on. For the parallel convolve, the initial seeds need to be combined into similar groups, with the convolve then applied separately to each group. This gives a 4-10X speedup on 16 core machines, depending on kmer size.
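The prefix-based split of the word space can be sketched like this. It is a simplified illustration under stated assumptions, not the TEIRESIAS scan itself: it counts plain fixed-length words over a DNA alphabet instead of discovering patterns, all function names are hypothetical, and the per-prefix work runs in an ordinary loop where a real implementation would hand each prefix to a separate core.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed 4-letter DNA alphabet


def prefix_length(cores):
    """Smallest prefix length giving at least `cores` distinct prefixes."""
    length = 1
    while len(ALPHABET) ** length < cores:
        length += 1
    return length


def count_words_with_prefix(seq, k, prefix):
    """Count the words of length k in seq that start with `prefix`."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word.startswith(prefix):
            counts[word] += 1
    return counts


def partitioned_scan(seq, k, cores):
    """Scan one prefix at a time and merge the per-prefix results."""
    plen = prefix_length(cores)
    prefixes = ["".join(p) for p in product(ALPHABET, repeat=plen)]
    merged = Counter()
    for prefix in prefixes:  # each iteration is independent of the others
        merged.update(count_words_with_prefix(seq, k, prefix))
    return merged


print(prefix_length(4), prefix_length(16))  # 1 2
```

Since every word starts with exactly one prefix, the partitions do not overlap and the merged result matches a single sequential scan. The sketch assumes k is at least as long as the prefix.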
Future work to include regular expression patterns and distributed computing.
Jean-Frédéric Berthelot — Biomanycores, open-source parallel code for many-core bioinformatics
Biomanycores is a repository of parallel code that works on GPUs and multicore CPUs. OpenCL is used to generalize CUDA code over multiple machines. Implementations are available for Smith-Waterman, TFM-Cuda and Unafold. The aim is to bridge the gap to biologists by building a pool of applications. Interfaces are available for Biopython, BioJava and BioPerl. Impressive speed ups.
Kerensa McElroy — GemSIM: General, Error-Model Based Simulator of next-generation sequencing
Interested in trying to find low frequency variations within bacterial populations. GemSIM handles error models, populations, makes reads, and then produces stats. Want to generate an error model specific to the data you have; Illumina error rates can be quite variable between runs. Shows graphs of 454 versus Illumina results and importance of quality cutoffs.