Galaxy Community Conference 2012, notes from day 2

These are my notes from day 2 of the 2012 Galaxy Community Conference.

Ira Cooke: Proteomics tools for Galaxy

Goal is to develop pipelines and interactive visualizations for proteomics work. Awesome tools that provide raw data + pipeline run as part of a visualization built into Galaxy. Connects to raw spectrum data from higher level summary reports. On a higher level, trying to integrate Proteomic and Genomic approaches inside Galaxy. Available from two bitbucket repositories: protk and Protvis.

James Ireland: Using Galaxy for Molecular Assay Design

James works at 5AM solutions, where they’ve been using Galaxy for a year. He’s working on molecular assay design: identifying oligos to detect or quantify molecular targets. Need to design short assays avoiding known locations of genomic variation. Developed a Galaxy workflow for assay design, including wrapping of primer3, prioritizing and filtering of designed primers, examination of secondary structure.

Richard LeDuc: The National Center for Genome Analysis Support and Galaxy

NCGAS provides large memory clusters and bioinformatics consulting. You can access the infrastructure if you have NSF funding. They provide a virtual machine hosting Galaxy on top of the cluster infrastructure. The VM approach allows them to spin up Galaxy instances for particular labs. The underlying infrastructure is a Lustre filesystem. They also do custom work on tools and libraries: for example, they helped improve Trinity resource usage.

Liram Vardi: Window2Galaxy – Enabling Linux-Windows Hybrid Workflows

Provides hybrid Galaxy workflows with steps done on Linux and Windows, transparent to the user. Works by creating an interaction between Linux and Windows VMs using Samba and a VM shared directory. A Window2Galaxy command in the Galaxy tool definition does all of the wrapping.

David van Enckevort: NBIC Galaxy to Strengthen the Bioinformatics Community in the Netherlands

NBIC BioAssist provides bioinformatics assistance to help with analysis of biological data. Galaxy is used for training, collaboration and sharing of developed tools and pipelines. It is also used for reproducible research workflows for publications. They provide an NBIC public instance and are moving to a cloud Galaxy VM plus a Galaxy module repository.

Ted Liefeld: GenomeSpace

GenomeSpace aims to make it easier to do integrative analysis with multiple datasets and tools. Facilitates connections between tools: Cytoscape, Galaxy, GenePattern, IGV, Genomica, UCSC. Provides an online filesystem for data, with importing and exporting between the connected tools.

Greg von Kuster: Tool Shed and Changes to Galaxy Distributions

Galaxy Tool Shed improvements to integrate closer with Galaxy. Galaxy now provides ability to install tools directly from the user interface. Kicks into live demo mode: when importing workflows it will tell you missing tools that require installation from tool sheds. Tools handle custom datatypes. Allow removal of tools through user interface. Can install dependencies directly. Incredibly awesome automation and interaction improvements for managing tools. External dependencies linked with exact versions for full reproducibility.

Larry Helseth: Customizing Galaxy for a Hospital Environment

Larry describes a use case in a HIPAA environment: locked down internet and corporate browser standards. Bonuses are solid IT and resources. Exome sequence analysis work: annotation with SeattleSeq and Annovar. Everything requires validation before full clinical use.

Nate Coraor: Galaxy Object Store

Galaxy can access object stores like S3 and iRODS using a plugin architecture. Data access has been extracted so it no longer works directly on files, but rather goes through high level accessor methods. This gives you complete flexibility for storage, managing where data lives behind the scenes. It also lets you push data to compute resources: for example, you could store on S3 and compute directly on Amazon.
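
As a rough sketch of the plugin idea, here is what a minimal object store interface might look like in Python; the method names are illustrative, not Galaxy's actual API:

    import abc
    import os
    import shutil

    class ObjectStore(abc.ABC):
        """Datasets are addressed by id; backends decide where the bytes actually live."""

        @abc.abstractmethod
        def exists(self, obj_id): ...

        @abc.abstractmethod
        def get_filename(self, obj_id): ...

        @abc.abstractmethod
        def update_from_file(self, obj_id, path): ...

    class DiskObjectStore(ObjectStore):
        """Plain directory-on-disk backend; an S3 or iRODS backend would swap in here."""

        def __init__(self, root):
            self.root = root

        def _path(self, obj_id):
            return os.path.join(self.root, "dataset_%s.dat" % obj_id)

        def exists(self, obj_id):
            return os.path.exists(self._path(obj_id))

        def get_filename(self, obj_id):
            return self._path(obj_id)

        def update_from_file(self, obj_id, path):
            shutil.copy(path, self._path(obj_id))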

Jaimie Frey: Galaxy and Condor integration

Wrote Galaxy module to run tasks on Condor clusters. Checked into galaxy-central as of yesterday. Use Parrot virtual filesystem to manage disk I/O to analysis machines.

Brian Ondov: Krona

Krona displays hierarchical data as zoomable pie charts. Has a tool in tool shed and can interact with charts directly in Galaxy.

Clare Sloggett: Reusable, scalable workflows

Usage example: cuffdiff analysis for a large number of inputs. How can you readily do this without a lot of clicking through different workflows? Current approach: write a script against the API, since there is not a great way to do this through the user interface currently. John steps up, ready to work on the problem.
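
A minimal sketch of that scripting approach against the Galaxy REST API; the URL, key and payload layout below are placeholders and only approximate the real API:

    import requests

    GALAXY_URL = "https://galaxy.example.org"   # placeholder instance
    API_KEY = "your-api-key"                    # placeholder key

    def run_workflow_on_datasets(workflow_id, input_step_id, dataset_ids):
        """Invoke the same workflow once per input dataset."""
        for ds_id in dataset_ids:
            payload = {
                "workflow_id": workflow_id,
                # Map the workflow's input step onto a history dataset.
                "ds_map": {input_step_id: {"src": "hda", "id": ds_id}},
            }
            resp = requests.post("%s/api/workflows" % GALAXY_URL,
                                 params={"key": API_KEY}, json=payload)
            resp.raise_for_status()
            print("launched:", resp.json())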

John Chilton: Galaxy VM Launcher

Built a Galaxy workflow for clinical variant detection. One concern about CloudMan was storage costs: CloudMan depends heavily on EBS, but you can save money by using the local instance store. Galaxy VM Launcher configures tools, genome indices and users, and uploads data, all from the commandline. Awesome.

Pratik Jagtap: Galaxy-P

Galaxy-P works with Galaxy for proteomics. Proteomics work is super popular at this year’s GCC. Trying to tie together lots of discussions today: windows access from Galaxy, visualization, and push to cloud resources.

Geir Kjetil Sandve: The Genomic Hyperbrowser: Prove Your Point

Genomic Hyperbrowser provides a custom Galaxy with 100 built-in statistical analyses for biological datasets. It provides top level questions, using the correct statistical test under the covers. Provides nice output with simplified and detailed answers along with the full set of tests used.

Björn Grüning: ChemicalToolBoX

Provides a Galaxy instance for cheminformatics and drug design work. Tools allow drawing of structures and uploading them into Galaxy. Wrapped lots of tools for chemical properties, structure search, compound plotting and molecular modification.

Breakout: Automation Strategies for Data, Tools, & Config

During the Galaxy breakout sessions, I joined folks who’ve been working on strategies to automate post-Galaxy tool and data installation. The goal was to consolidate implementations that install reference data, update Galaxy location files, and eventually install tools and software. The overall goal is to make moving to a production Galaxy instance as easy as getting up and running using ‘sh run.sh’.

The work plan moving forward is:

  • Community members will look at building tools that include dependencies and sort out any issues that might arise with independent dependency installation scripts through Fabric.

  • Galaxy team is working on exposing tool installation and data installation scripts through the API to allow access through automated scripts. The current data installation code is in the Galaxy tree.

  • Community is going to work on consolidating preparation of pre-Galaxy base machines using the CloudBioLinux framework. The short term goal is to use CloudBioLinux flavors to generalize existing scripts. Longer term, we will explore moving to a framework like Chef that handles high level configuration details.

It was great to bring all these projects together and I’m looking forward to building well supported approaches to automating the full Galaxy installation process.

Galaxy Community Conference 2012, notes from day 1

These are my notes from day 1 of the 2012 Galaxy Community Conference. Apologies to the morning speakers since my flight got me here in time for the morning break.

Liu Bo: Integrating Galaxy with Globus Online: Lessons learned from the CVRG project

Work with Galaxy is part of the CardioVascular Research Grid project, which sets up infrastructure for sharing and analyzing cardiovascular data. Challenges they were tackling: distributed data at multiple locations, inefficient data movement and integration of new tools.

Integrated GlobusOnline as part of Galaxy: provides hosted file transfer. Provide 3 tools that put and pull data between GlobusOnline and Galaxy.

Wrapped: GATK tools to run through the Condor job scheduler, CummeRbund for visualization of Cufflinks results, and MISO for RNA-seq analysis.

Implemented Chef recipes to deploy Galaxy + GlobusOnline on Amazon EC2.

Gianmauro Cuccuru: Scalable data management and computable framework for large scale longitudinal studies

Part of a CRS4 project studying autoimmune diseases, working to scale across distributed labs across Italy. It’s a large scale project with 28,000 biological samples with both genotyping and sequencing data. They use the OMERO platform, a client-server platform for visualization, and integrated specialized tools to deal with biobank data. Using Seal for analysis on Hadoop clusters. The problem was that the programming/script interfaces were too complex for biologists, so they wanted to put a Galaxy front end on all of these distributed services.

Integrated interactions with custom Galaxy tools, using Galaxy for short term history and the backend biobank for longer term storage. Used iRODS to handle file transfer across a heterogeneous storage system, providing interaction with Galaxy data libraries.

Valentina Boeva: Nebula – A Web-Server for Advanced ChIP-Seq Data Analysis

Nebula ChIP-seq server developed as a result of participation in GCC 2011. Awesome: they integrated 23 custom tools in the past year. She provides a live demo of tools in their environment, which has some snazzy CSS styling. Looks like lots of useful ChIP-seq functionality, and it would be useful to compare with Cistrome.

Sanjay Joshi: Implications of a Galaxy Community Cloud in Clinical Genomics

Want to move toward clinically actionable analysis: personalized, predictive, participatory and practical. The main current focus of individualized medicine is on two areas: early disease detection and intervention, and personalized treatments.

Underlying analysis of variants: lots of individual rare variants that have unknown disease relationships. Analysis architecture is thin, thick, thin: sequencer, storage, cluster. The tricky bit is that most algorithms are not parallel, so adding more cores is not magic. Also need to scale storage along with compute.

Project Serengeti provides deployment of Hadoop clusters on virtualized hardware with VMware. Cloud infrastructure is a great place to drive participatory approaches in analysis.

Enis Afgan: Establishing a National Genomics Virtual Laboratory with Galaxy CloudMan

Enis talking about new additions to CloudMan: support for alternative clouds like OpenStack, Amazon spot instances and using S3 Buckets as file systems.

New library called blend that provides an API to Galaxy and CloudMan. This lets you start CloudMan instances and manage them from the commandline, doing fun stuff like adding and removing nodes programmatically.
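
A rough sketch of what that node management looks like; the module and method names here follow bioblend, the later incarnation of blend, and are approximate:

    from bioblend.cloudman import CloudManInstance

    # Connect to a running CloudMan cluster (URL and password are placeholders).
    cm = CloudManInstance("http://<cloudman-host>", password="cluster-password")
    cm.add_nodes(2)      # grow the cluster by two worker nodes
    cm.remove_nodes(1)   # shrink it again when load drops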

CloudMan being used on nectar: the Australian National Research Cloud. Provides a shell interface built on top with web-based interfaces via CloudMan and public data catalogues. Also building online tutorials and workshops for training on using best practice workflows.

Björn Grüning: Keeping Track of Life Science Data

Goal is to develop an update strategy to keep Galaxy associated databases up to date. Current approaches are FTP/rsync which have a tough time scaling with updates to PDB or NCBI. Important to keep these datasets versioned so analyses are fully reproducible.

Approach: use a distributed version control system for life science data. Provides updates and dataset revision history. Used PDB as a case study to track weekly changes in a git repository. Include revision version as part of dropdown in Galaxy tools, and version pushed into history for past runs for reproducibility.

The downside is that rollback and cloning are expensive since repositories get large quickly.
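
A toy sketch of the idea: commit each weekly sync as a revision whose id a Galaxy tool could then expose in a dropdown and record in the history. Paths and commands are purely illustrative:

    import datetime
    import pathlib
    import subprocess

    REPO = pathlib.Path("/data/pdb-mirror")  # hypothetical local mirror kept under git

    def snapshot_weekly_update():
        """Commit this week's sync and return the revision id to record for reproducibility."""
        stamp = datetime.date.today().isoformat()
        subprocess.check_call(["git", "add", "-A"], cwd=REPO)
        subprocess.check_call(["git", "commit", "-m", "PDB weekly update %s" % stamp], cwd=REPO)
        rev = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"],
                                      cwd=REPO, text=True)
        return rev.strip()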

Vipin Sreedharan: Easier Workflows & Tool Comparison with Oqtans+

Oqtans+ (online quantitative transcript analysis) provides a set of easily interchangeable tools for RNA-seq analysis. Some tools wrapped: PALMapper, rQuant6-3, mTim. They have automated tool deployment via custom fabric scripts. The public instance with integrated tools is available for running, and there is also an Amazon instance.

Gregory Minevich: CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

Using C. elegans and looking for specific variations in progeny with neurological problems. The approach has been applied to other model organisms like Arabidopsis. Available as a Galaxy page with full workflows and associated data. Nice example of a complex variant calling pipeline. Provides nice variant mapping plots across the genome. The pipeline was bwa + GATK + snpEff + custom code to identify a final list of candidate genes.

Approach to identify deletions: look for unique uncovered regions within individual samples relative to full population. Annotate these with snpEff and identify potential problematic deletions.
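
A toy sketch of that deletion-detection idea, flagging stretches with zero coverage in one sample that are covered in the pooled population (a real pipeline would work from BAM coverage and then annotate candidates with snpEff); the function and cutoffs are made up for illustration:

    def uncovered_runs(sample_cov, pooled_cov, min_len=50):
        """Yield (start, end) regions uncovered in the sample but covered in the population."""
        start = None
        for pos, (s, p) in enumerate(zip(sample_cov, pooled_cov)):
            if s == 0 and p > 0:
                if start is None:
                    start = pos
            else:
                if start is not None and pos - start >= min_len:
                    yield (start, pos)
                start = None
        if start is not None and len(sample_cov) - start >= min_len:
            yield (start, len(sample_cov))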

Tin-Lap Lee: GDSAP — A Galaxy-based platform for large-scale genomics analysis

Genomic Data Submission and Analytical Platform (GDSAP): provides a customized Galaxy instance. Integrated tools like the SOAP aligner and variant caller, which are now part of the Tool Shed. Push Galaxy workflows to MyExperiment: example RNA-seq workflow.

Platform also works with GigaScience to push and pull data in association with GigaDB.

Karen Reddy: NGS analysis for Biologists: experiences from the cloud

Karen is a biologist, talking about experiences using Galaxy. Moving from large ChIP-seq datasets back to analysis work. Used Galaxy CloudMan for analysis to avoid the need to develop local resources. A custom analysis approach called DamID-seq was translated into a Galaxy workflow, all with standard Galaxy tools. Generally had a great experience. Issues faced: is it okay to put data on the cloud? It can be hard to judge capacity: use high memory extra large instances for better scaling.

Mike Lelivelt: Ion Torrent: Open, Accessible, Enabling

A key element of genomics software usage is that users want high level approaches but also want to be able to drill down into details, which is exactly what Galaxy offers. That’s why Ion Torrent is a sponsor of the Galaxy conference. The IonTorrent software system sounds awesome: built on Ubuntu with a bunch of open source tools. It has a full platform to help hook into processing and runs. They have an open source aligner and variant caller for IonTorrent data in the toolshed: code is all on the iontorrent GitHub.

BOSC 2012 day 2 pm: Panel discussion on bioinformatics review and open source project updates

Talk notes from the 2012 Bioinformatics Open Source Conference.

Hervé Ménager: Mobyle Web Framework: New Features

Mobyle provides easy commandline tool integration. It provides a tool description XML for programs that is converted into a web-based interface. BMPS provides easy pipeline design and execution. Workflow execution can be dynamically reused as a simple form. Provides data versioning with integrated correction: for instance, visualize an alignment in JalView, correct it, then save as an updated data file. Now that Taverna and Galaxy workflows integrate, it would be great to be able to do the same with Mobyle.

Eric Talevich: Biopython Project Update

Eric talks about Biopython, discussing new and exciting features from the past year. GenomeDiagram provides beautiful graphics of sequences and features. Lots of new format parsing included in Biopython: SeqXML, ABI chromatograms, Phylip relaxed format. Bio.Phylo has merged in PAML wrappers and new drawing functionality, plus a paper in late review. Biopython now has BGZF support, which helps with BAM and Tabix support. Working to support PyPy and Python 3. Bug fixes for PDB parsing came via Lenna, who is now a GSoC student I’m mentoring on variant work, including with PyVCF.
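
A small sketch exercising a few of the features mentioned; the file names are placeholders:

    from Bio import SeqIO, AlignIO, bgzf

    # Parse an ABI capillary trace, one of the newly supported formats.
    record = SeqIO.read("sample.ab1", "abi")
    print(record.id, len(record.seq))

    # Read an alignment in relaxed Phylip format.
    alignment = AlignIO.read("alignment.phy", "phylip-relaxed")

    # BGZF is the blocked gzip variant underlying BAM and Tabix; Biopython can
    # write it directly, so downstream tools can seek within the compressed file.
    handle = bgzf.BgzfWriter("records.fasta.bgz", "wb")
    for rec in alignment:
        handle.write((">%s\n%s\n" % (rec.id, rec.seq)).encode())
    handle.close()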

Hiroyuki Mishima: Biogem, Ruby UCSC API, and BioRuby

BioRuby update on the latest work. The community has been working on ways to make being a BioRuby member easy. The original way to contribute was to be a committer or get patches accepted. To get more people involved, they have moved to GitHub to help make it easier to accept pull requests. They’ve also introduced BioGems, a plugin system so that anyone can contribute associated packages. This includes a nice website displaying packages along with popularity metrics to make it easy to identify associated packages. bio-ucsc-api provides an ActiveRecord API on top of UCSC tables. The future direction of BioGems will involve more quality control by peer-review, including required documentation and style.

Jeremy Goecks: A Framework for Interactive Visual Analysis of NGS Data using Galaxy

Jeremy talks about the Galaxy visualization framework to make highly interactive visual analysis for NGS datasets. The goal is to integrate visualizations + web tools. Jeremy then bravely launches into a live demo with Trackster. Trackster has dynamic filtering, so you can use sliders to view based on coverage, FPKM, or other metrics. Integration with Galaxy allows you to re-run tools with alternative parameters based on visualizations. You can create a cool tree of possible parameters that you can set in the Galaxy tool, easily varying selected parameters. This can then be dynamically re-run on a subset of the data, letting you re-run and visualize multiple parameters easily. This is an incredibly easy way to find the best settings based on some known regions.

Spencer Bliven: Why Scientists Should Contribute to Wikipedia

New initiative through PLoS Computational Biology called Topic Pages. Why don’t scientists contribute more to Wikipedia? Some identified concerns: perceived inaccuracies, little time for outreach like this and no direct annotation or citation. If you contribute to Wikipedia, you get a citation. Don’t use it to fill up your CV. Topic pages are peer reviewed via Open Review, have a CC-BY license and are similar to a review article. Already have published topic pages and interest in contributing.

Markus Gumbel: scabio – a framework for bioinformatics algorithms in Scala

scabio contains algorithms written in Scala for the bioinformatics domain. Designed as a teaching tool for a lecture + lab. Scala combines object oriented and functional paradigms. The Akka framework provides concurrent and distributed functionality. Contains lots of teaching code for dynamic programming, making it a great resource. Easy BioJava 3 integration and reuse of existing libraries. Code is available from GitHub.

Jianwu Wang: bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

bioKepler builds on top of Kepler, a scientific workflow system. It uses distributed frameworks for parallelization. Plans are to build bioActors for alignment, NGS mapping, gene prediction and more.

Limin Fu: Dao: a novel programming language for bioinformatics

Dao is a new programming language that supports concurrent programming, based on LLVM and easily loads C libraries with Clang. Provides native concurrent iterators, map, apply and find.

Scott Cain: GMOD in the Cloud

Generic Model Organism Database project has a running cloud instance at https://cloud.gmod.org. It has Chado, GBrowse and JBrowse plus sample data. AMI information is available from the GMOD wiki. Tripal is a Drupal based web interface to Chado.

Ben Temperton: Bioinformatics Testing Consortium: Codebase peer-review to improve robustness of bioinformatic pipelines

Ben kicks off the panel discussion with a short lightning talk about the Bioinformatics Testing Consortium which provides a way to do peer-review on codebases. Idea came from dedicated unit testers, but need a "non-cost" way to do this that fits with current workflows: peer review. Idea is to register a project and have volunteer testers actually test it.

Panel discussion

BOSC wrapped up with a panel discussion centered around ideas to improve reviewing of the bioinformatics components of papers. The panel emerged from an open review discussion between myself and Titus Brown about ideas for establishing a base set of criteria for reviewers of bioinformatics methods. The 5 panelists were Titus Brown, a professor at Michigan State; Iain Hrynaszkiewicz, an open-access publisher with BMC; Hilmar Lapp, an editor at PLoS Computational Biology; Scott Markel from Accelrys; and Ben Temperton from the Bioinformatics Testing Consortium.

I took these notes while chairing, so apologies if I missed any points. Please send corrections and updates.

The main area of focus was around improving the bioinformatics component of papers at the time of review. Titus’ opening slides presented ideas to help improve replicability of results with the key idea being: does the software do what it claims in the paper?

  • Existing communities to connect with
  • When do tests get put in place? Last minute at review time is going to be painful for people. There is a lot of hard work involved overall.

    • Difficult to set up a VM + make things replicable
    • Barriers to entry
    • On the other hand, are you doing good science? What is the baseline?
    • How can you help people do this?
    • Learning to develop this way from the start with training courses like Software Carpentry.
    • Can continuous integration play a role (e.g. Travis CI)?
  • Defining what to do for reviewers
  • The tough question is that editors must also review, so the job falls on both reviewers and editors. Get a pre-submission seal of approval before being able to send in for review. This is where the Bioinformatics Testing Consortium could fit in.

  • Start up idea: provide service for testing software
    • Insight journals
    • Could you incentivize for testing? Provide journal credit.
  • Tight relationship between reviews + grants: need to enforce a base level of minimum criteria.

  • Provide incentives + credit

Another component of discussion was around openness of reviews:

  • BMC has an even split between open and non-open peer review
  • What is the policy for who owns copyright on review?
  • From a testing side, it does need to be open to iterate
  • What effect can this have on your career? For example, writing a bad review for a senior professor.

The final conclusion was to draw up a set of best practice guidelines for reviewers, publish this as a white paper, then move forward with website implementations that help make this process easier for scientists, editors and reviewers. If we as a community can set out what best practice is, and then make it as easy as possible, this should help spread adoption.

BOSC 2012, day 2 am: Carole Goble on open-source community building; Software Interoperability talks

Talk notes from the 2012 Bioinformatics Open Source Conference.

Carole Goble: If I Build It Will They Come?

Carole’s goal is to discuss experience building communities around open source software that facilitates reuse and reusability. 3 areas of emphasis: computational methods and scientific workflows (Taverna), social collaboration like MyExperiment and knowledge acquisition like RightField.

Goal of MyExperiment is to collaboratively share workflows. Awesomely, it now interoperates with Galaxy workflows. For knowledge acquisition, Seek4Science handles data sharing and interoperates with IsaTab.

General philosophy is laissez-faire, trying to encourage inclusiveness and the ability to evolve over time. There is some difficulty with being too free, since it led to some ugly metadata when tools are too general. People prefer simple interfaces that work and that they can adapt to. By having flexibility and extensibility, they have been able to work across multiple communities and widen adoption. The majority of the work is paid for by projects outside of biology. Currently supporting 16 full time informaticians.

How can you get users? Be useful for something, by somebody, some of the time. Need to under promise and over deliver. 4 things that drive adoption: 1. Provide added value 2. Provide a new asset 3. Keep up with the field 4. Because there is no choice.

7 things that hinder adoption: 1. Not enough added value 2. Doesn’t work for non-expert users 3. No time or capacity to take on learning. The first 3 sum up to: the software sucks and is difficult to improve. Good solutions exist to help: Software Carpentry. 4. Cost of disruption 5. Exposure to risk 6. No community 7. Changes to work practice. The last 4 boil down to being too costly. Tipping points for usage are normally not technical. Another issue is that people haven’t heard of it, so you need to promote and discuss.

Some other problems that happen: adoption is incidental, or familial, by people like yourself. It is difficult to build for others who are not like you. Need to motivate others to help fill in gaps. Need to fully interoperate and avoid non-backwards-compatible tweaks without specific breaks.

Adoption model: need to build an initial seed of users that can advocate and encourage others to use. Trickiest part is to establish this initial set of friends that love your software. This can primarily be a relationship building process: need trust and usability even for the best backend implementations. Need to understand where your projects fit and target the right people even if they are not exactly like you. Difficult.

In social environments, you need to understand the drives and fears people have with regards to sharing. A real problem is people not giving credit for things they use. How can we improve micro-attribution in science? How do we harness this competitiveness? Provide reputation and attribution for what you do. Protect and preserve data to help make people more productive.

If you are lucky enough to build a community, then you move to the next level and need to maintain and nurture it. As you concentrate on maintenance you can lose focus on the initial things that drove adoption. Version 2 syndrome: featuritis.

In summary, need to focus on: What is it that you provide and people value? Who are you targeting? Need to be able to be agile and provide improvements continuously to existing users, but be careful not to end up with hideous code that is unmaintainable. Difficult balance between being general and specific. Final trick is to be long term once you’ve got adoption: how can you keep software around and provide the funding to continue developing and improving it?

Social and technical aspects of adoption both equally important.

Richard Holland: Pistoia Alliance Sequence Squeeze: Using a competition model to spur development of novel open-source algorithms

Pistoia Alliance Sequence Squeeze is a competition to develop improved sequence compression algorithms. The Pistoia Alliance is a not-for-profit alliance of life-science companies trying to collaborate on shared research and development. The goal of the competition was to come up with a better compression approach that still retains 100% of the information. Entries work on FASTQ data and all software is open-source.

Overall there were 12 entrants with 108 distinct submissions. A public leaderboard encouraged competition. The winner was James Bonfield. Other useful compression approaches like PAQ fared well.

Michael Reich: GenomeSpace: An open source environment for frictionless bioinformatics

GenomeSpace tackles the difficulty of switching between different tools. It creates a connection layer between genome analysis tools and biological data types. The base supported tools are Cytoscape, Galaxy, GenePattern, Genomica, IGV and the UCSC browser. GenomeSpace is aimed at non-programming users and supports automatic cross-tool compatibility. It’s a great resource to connect tools.

Dannon Baker: Galaxy Project Update

Details on some of the fun stuff that the Galaxy team has been working on. Lots of development on the API, including providing a set of wrapper methods that wrap the base REST API. API allows running of tools in Galaxy.

Automated parallelism is now built into Galaxy. It can split up BLAST jobs into pieces, run them on the cluster and then combine the results back together. The main concerns are the overhead of splitting plus the temporary space requirement. Set use_tasked_jobs=True in the configuration; it supports BLAST, BWA and Bowtie. There is a parallelism tag in the Galaxy tool XML that enables it. Advanced splitting has a FUSE layer that makes the file look like a directory to avoid re-copying files.

The Galaxy Tool Shed allows automated installation of tools into a Galaxy instance.

Enis Afgan: Zero to a Bioinformatics Analysis Platform in 4 Minutes

Enis is currently working with the Australian Nectar national cloud to port CloudMan to private clouds in addition to the currently supported Amazon EC2. The idea is to make resources instantly available to users and provide a layer on top of programmer shell interfaces. 4 projects put together: BioCloudCentral, CloudMan, CloudBioLinux, Galaxy. The Blend library provides a Python API on top of Galaxy and CloudMan, with full docs.

Alexandros Kanterakis: PyPedia: A python crowdsourcing development environment for bioinformatics and computational biology

PyPedia provides a collaborative programming web environment. The general idea is that a wiki can provide a clean way to upload and run Python code: Google App Engine + wiki. It has lots of useful example code, like converting VCF to reference. PyPedia provides a REST interface, so everything is fully reproducible and re-runnable.

This would be great to host an interactive cookbook for Biopython with automated tests and ability to fork.

Alexis Kalderimis: InterMine – Embeddable Data-Mining Components

InterMine is an integrated data warehouse with customizable backend storage. It provides a web interface with query functionality. Provides a nice technology stack underneath with web service and Java API interfaces. Provides libraries in Python, Perl and Java that give a nice intuitive interface for querying programmatically. Can write custom analysis widgets with some nice JavaScript interfaces: built using CoffeeScript with Backbone.js and Underscore.js, so lots of pretty JavaScript underlying it.
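
A quick sketch of that programmatic querying using the Python client; the mine URL and query fields here are just examples:

    from intermine.webservice import Service

    # FlyMine is used as an example mine.
    service = Service("http://www.flymine.org/query/service")
    query = service.new_query("Gene")
    query.add_view("primaryIdentifier", "symbol", "organism.name")
    query.add_constraint("symbol", "=", "zen")
    for row in query.rows(size=10):
        print(row["symbol"], row["organism.name"])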

Bruno Kinoshita: Creating biology pipelines with BioUno

BioUno is a biology workflow system built using Jenkins build server. Provides 5 plugins for biological work. Jenkins is an open source continuous integration system that makes it easy to write plugins. Bruno bravely shows a live demo including fabulous Java war files.

BOSC 2012, day 1 pm: Genome-scale Data Management, Linked Data and Translational Knowledge Discovery

Talk notes from the 2012 Bioinformatics Open Source Conference.

Dana Robinson: Using HDF5 to Work With Large Quantities of Biological Data

HDF5 is a structured binary file format and abstract data model for describing data. Not client/server, has a C interface + other high level interfaces. HDF5 has loads of advantages in terms of technical details. One disadvantage is that querying is a bit more difficult since access is more low level. You write higher level APIs specific to your data, with speed advantages.
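
A small sketch of the kind of thing this enables, using the h5py Python bindings as one of those higher level interfaces: typed, chunked, compressed storage you can slice without reading the whole file (file and dataset names are made up):

    import h5py
    import numpy as np

    coverage = np.random.poisson(30, size=1000000).astype("int32")

    with h5py.File("coverage.h5", "w") as f:
        dset = f.create_dataset("chr1/coverage", data=coverage,
                                chunks=(65536,), compression="gzip")
        dset.attrs["genome_build"] = "hg19"

    # Later: read back only a slice without loading the whole array into memory.
    with h5py.File("coverage.h5", "r") as f:
        window = f["chr1/coverage"][100000:100500]
        print(window.mean())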

Aleksi Kallio: Large scale data management in Chipster 2 workflow environment

Chipster is an environment for biological data analysis aimed at non-computational users. Recent work reworked architecture to handle large NGS data. Hides data handling on the server side from the user to provide a higher level interface. With NGS data storing all the data becomes problematic so data is only moved when needed. Data stored in sessions which provide quotas and management of disk space. Handles shared filesystems invisibly to user.

Qingpeng Zhang: Khmer: A probabilistic approach for efficient counting of k-mers

Custom k-mer counting approach based on bloom filters. Allows you to trade off false positives against memory. This makes the approach highly scalable to large datasets. Accuracy is related to the k-mer size and the number of unique k-mers at that size. Time usage of khmer is comparable to other approaches like Jellyfish, but the main advantage is memory efficiency and streaming.

Implementation available from GitHub.
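
A toy illustration of the probabilistic counting idea using a count-min style table (khmer's real implementation is a C++ counting structure and far more sophisticated; table sizes and hashing here are arbitrary):

    import hashlib

    NUM_TABLES, TABLE_SIZE, K = 4, 1000003, 20
    tables = [[0] * TABLE_SIZE for _ in range(NUM_TABLES)]

    def _slots(kmer):
        # One slot per table, derived from independent-ish hashes of the k-mer.
        for i in range(NUM_TABLES):
            digest = hashlib.md5(("%d:%s" % (i, kmer)).encode()).hexdigest()
            yield i, int(digest, 16) % TABLE_SIZE

    def add(kmer):
        for i, idx in _slots(kmer):
            tables[i][idx] += 1

    def count(kmer):
        # May overestimate because of collisions, but never underestimates.
        return min(tables[i][idx] for i, idx in _slots(kmer))

    def add_read(seq):
        for j in range(len(seq) - K + 1):
            add(seq[j:j + K])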

Seth Carbon: AmiGO 2: a document-oriented approach to ontology software and escaping the heartache of an SQL backend

The AmiGO browser displays Gene Ontology information. It retrieves basic key/value pairs about items and connections to other data. As the data has expanded, the SQL backend has become difficult to scale. The solution thus far has been Solr, using a Lucene index to query documents. They decided to push additional information into Lucene, including complex structures like hashes as JSON. This turns out to be a much better model for the underlying data. The downside is that you need to build additional software on top of the thin client.
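
A sketch of the document-oriented idea: index an ontology term as a JSON document in Solr, with nested, hash-like structures serialized as JSON strings. The core name, term ids and fields are made up for illustration:

    import json
    import requests

    doc = {
        "id": "GO:0000001",
        "label": "example term",
        "isa_closure": ["GO:0000001", "GO:0000002"],
        # Complex, hash-like data can be stored as a serialized JSON string
        # and decoded again by the client.
        "topology_json": json.dumps({"parents": ["GO:0000002"], "children": []}),
    }

    resp = requests.post("http://localhost:8983/solr/amigo/update?commit=true",
                         json=[doc])
    resp.raise_for_status()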

Jens Lichtenberg: Discovery of motif-based regulatory signatures in whole genome methylation experiments

Software to detect regulatory elements in NGS data. The goal is to correlate multiple sources of NGS data: peak calling + RNA-seq + methylation. These feed into motif-discovery algorithms. They are looking at hematopoietic stem cell differentiation in mouse. The framework is Perl-based and uses bedtools and MACS under the covers. The future goal is to rewrite it in C++ to parallelize and speed up the approach.

Philippe Rocca-Serra: The open source ISA metadata tracking framework: from data curation and management at the source, to the linked data universe

ISA is a metadata framework for describing experimental information in a structured way. They built a set of tools to allow people to create and edit metadata, producing valid ISA descriptions of the experiments. ISATab now has a plugin infrastructure for specialized experiments.

Newest work focuses on version control for distributed, decentralized groups of users. OntoMaton provides search and ontology tagging for Google Spreadsheets which helps make ontologies available to a wider variety of users.

ISATab is working on exporting to an RDF representation and OWL ontologies. Some issues involve gaps in OBO ontologies for representation.

Julie Klein: KUPKB: Sharing, Connecting and Exposing Kidney and Urinary Knowledge using RDF and OWL

Built specialized ontology for kidney disease with domain experts. Difficulty is dealing with existing software. Provided a spreadsheet interface which was easier for biologists to work with, called Populous. Ended up with SPARQL endpoint and built a web interface on top for biologists: KUPKB. Nice example of using RDF under the covers to answer interesting questions but exposing it in a way that biologists query and manipulate the data.

Sophia Cheng: eagle-i: development and expansion of a scientific resource discovery network

eagle-i connects researchers to resources to help them get work done. eagle-i provides open access to the data, and the software and ontologies are open-source. Built using semantic web technologies under the covers. Provides downloads of all RDF. Federated architecture built around a Sesame RDF store with SPARQL and CRUD REST APIs on top. You can pull the available code stack from subversion along with docs.

BOSC 2012, day 1 am: Jonathan Eisen on open science; cloud and parallel computing

Notes from the 2012 Bioinformatics Open Source Conference.

Jonathan Eisen: Science Wants to Be Open – If Only We Could Get Out of Its Way

Jonathan Eisen starts off by mentioning this is the first time he’s giving an open science talk to a friendly audience. He’s associated, and obsessed, with the PLoS open access journals. History of PLoS: started with a petition, coming out of free microarray specifications from Michael Eisen and Pat Brown, to make journals available. 25,000 people signed, but it had a small impact on open access support, mainly because the available open access journals were not high profile enough. PLoS started as a selective high-profile open-access journal to fill this gap.

Ft Lauderdale agreement debated how to be open with genomic data. Sean Eddy argued for openness in data and source code by promoting the advantages of collaborations and feedback. The first open data experiment at TIGR was Tetrahymena thermophila. It was openly released and got useful biological feedback. Published the next paper, on Wolbachia, in PLoS Biology despite overtures from Science/Nature.

A medical emergency with his wife was a real motivator for openness. He could not get access to journals to research how to help. There is a terrible lack of access for people outside of big academic institutions. The same problem happened when trying to make all of his father’s research papers available.

Open access definition: free, immediate access online with unrestricted distribution and re-use. Authors retain rights to paper. PLoS uses broad creative commons license. Why is reuse important? Thought example: what if you had to pay for access to each sequence when searching for BRCA1 homologs? Then what if it were free, but you couldn’t re-analyze it in a new paper? Science built off re-use and re-purposing of results. Extends to education and fair use of figures. Additional areas to consider are open discussion and open reviews.

What are things you can do to support openness? Share things as openly as possible, participate in open discussion, consider being more open pre-publication. Risk to sharing is low, and benefit is high with help and discussion. Important to judge people by contributions, instead of surrogates like journal impact factor. Enhance and embrace open material while giving credit to everything you can. Support jobs and places that are into openness.

Great talk and a nice way to start off the meeting by focusing on why folks here care about open source.

Sebastian Schönherr: Cloudgene – an execution platform for MapReduce programs in public and private clouds

How to support scientists when using MapReduce programs? The goal is to simplify access to a working MapReduce cluster. Cloudgene is designed to handle these usability improvements. Sebastian talks through all of the work to set up a cluster: build the cluster, HDFS, run the program, and retrieve results. It’s a lot of steps. Cloudgene builds a unified interface for all of these.

Cloudgene merges software like Myrna under one unified interface. It works on both public and private clouds. New programs are integrated into Cloudgene via a simple configuration file in YAML format. Cool web interface similar to BioCloudCentral, but on top of MapReduce work and with lots more metrics and functionality. Configuration files map to web forms a la Galaxy or GenePattern.

The Java/JavaScript codebase is at GitHub.

C Titus Brown: Data reduction and division approaches for assembling short-read data in the cloud

Titus has loads of useful code on his lab’s GitHub page and shares on his blog, Twitter and preprints: fully open. Uses approaches that are single-pass, involve compression and work with low-memory data structures. The goal is to supplement existing tools, like assemblers, with pre-processing algorithms. Digital normalization aims to remove unnecessary coverage for metagenomic assembly. It downsamples based on a de Bruijn graph to normalized coverage. The analysis is streaming, so it doesn’t require pre-loading a graph in memory and uses fixed memory. This avoids the nasty high memory requirements for assembly, allowing it to run on commodity hardware, like EC2.
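
A much-simplified sketch of the digital normalization idea: keep a read only if the median abundance of its k-mers seen so far is below a coverage cutoff, so the stream is downsampled toward uniform coverage (the real implementation uses a fixed-memory probabilistic counting structure rather than a dict):

    from statistics import median

    K, CUTOFF = 20, 20
    counts = {}

    def keep_read(seq):
        """Return True if the read adds new coverage and should be kept."""
        kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
        if not kmers:
            return False
        if median(counts.get(km, 0) for km in kmers) >= CUTOFF:
            return False  # region already well covered; drop the read
        for km in kmers:
            counts[km] = counts.get(km, 0) + 1
        return True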

Removing redundant reads has nice side-effect of removing errors. Effective in combination with other error-correction approaches. Results in virtually identical contig assembly after normalization.

We need approaches that both improve algorithms and add capacity and infrastructure. Some tough things to overcome: biologists hate throwing away data, and normalization gets rid of abundance data. New approaches are to use this for streaming error correction. Error correction is the biggest data problem left in sequencing.

All figures and information from paper can be entirely reproduced as an ipython notebook. Can redo data analysis from any point: awesome. Approach has been useful in teaching and training as well as new projects.

Andrey Tovchigrechko: MGTAXA – a toolkit and a Web server for predicting taxonomy of the metagenomic sequences with Galaxy front-end and parallel computational back-end

MGTAXA predicts taxonomic classifications for bacterial metagenomic sequences. It uses an ICM (Interpolated Context Model) to help extract signal from shorter sequences: better than using a fixed k-mer model. They started off using self-organizing maps to identify taxonomy, but this is not possible in complex cases where clades cluster together.

The ICM is used to identify phage relationships with hosts based on the shared k-mer composition of virus and host. A parallelization approach is used to scale model training with multiple backends: serial, SGE, and the Makeflow workflow engine. Cool, I didn’t know about Makeflow. Andrey suggests it as a nice fit for Galaxy tool parallelization. Also implemented a BLAST+ MPI implementation using MapReduce-MPI, which gives fault tolerance on top of an MPI library.

Python source is available on GitHub and integrated with Galaxy frontend. Network architecture setup at JCVI to allow access to firewalled cluster for processing. Uses AMQP messaging for communication with Apache Qpid.

Katy Wolstencroft: Workflows on the Cloud – Scaling for National Service

Building workflows for genetic testing on the cloud with the National Health Service in the UK. Done in collaboration with Eagle Genomics. Diagnostic testing today uses a small number of variants, but will soon be scaling up to whole genomes.

Using Taverna workflows to run the analyses. Workflow is to identify variants, annotate with dbSNP, 1000 genomes and conservation data. Gather evidence to classify variants as problematic or not. Taverna provides workflow provenance so it’s accessible, secure and reproducible.

Architecture currently uses Amazon cloud, Taverna server with data in S3 buckets. Modified Taverna to work better with Amazon and improvements will be available in the next Taverna release.

Andreas Prlic: How to use BioJava to calculate one billion protein structure alignments at the RCSB PDB website

Andreas combines interests in the PDB for work and open-source contributions to BioJava. He describes a workflow to find novel relationships through systematic structural alignments. The work uses the Open Science Grid, with a custom job management system talking on port 80. This converts a CPU-bound problem into an IO-bound one. The alignment comparison and visualization code is available from BioJava.