Bioinformatics Open Source Conference 2011 — Day 1 morning talks

I’m in beautiful Vienna, Austria, at the Bioinformatics Open Source Conference on Friday and Saturday, July 15-16. The conference emphasizes freely available biological software and the communities that contribute to it. These are my notes from the morning talks.

Larry Hunter — The role of openness in knowledge-based systems for biomedicine

A difficulty in Artificial Intelligence is capturing the common sense information that you expect an intelligent agent to know. There is a ton of this information, but thankfully in molecular biology common sense is less of a barrier: you can capture everything known about molecular biology from textbooks, papers and databases. Can we write programs that get all this information?

The difficulty is that the interesting questions we want to answer are complex. The one gene, one disease model is extremely rare; in general we are looking at perturbations of complex models that change over time. Now that we are so good at sequencing, the hard problems in bioinformatics are in understanding the data. This is not only about facts, but about putting facts together to answer "why" questions. For an AI system, judging whether an explanation is plausible requires a knowledge base and a way to score results.

Some knowledge-based computational biology solutions are BioCyc, AskHermes, Watson Medicine, GO over-representation analysis and HyQue, plus anything that uses ontologies like GO. Three reasons that openness matters in these areas:

  • Productivity: very hard problem so need to build off results; can’t do it alone
  • Equity: allow anyone to contribute by lowering barriers and costs
  • Ethics: AI is a social concern; need to earn the trust of society

How do we get the process going?

  • Build on current open ontologies: OBO, Semantic Web, Open Access Publishing, Linked Life Data (wow)
  • Social infrastructure to work together to solve hard problems: cooperation and competition combined
  • Conform to shared infrastructure and avoid fears of losing ideas and credit to the community

The idea is to organize competitions that require open source code and the use of shared infrastructure, specifically with the goal of combining existing communities (BioCreative, BioNLP). For this you need software to work off of, computational power, training data to work with, and significant prizes.

Larry’s group has made CRAFT available, an open source set of semantic annotations that uses existing community ontologies. This can be the basis of these competitions. Key is leveraging these existing standards to serve as a basis for future work so we are actually building off each other’s work.

Remaining challenges are ensuring openness of papers in a way that they can be bulk downloaded for AI text mining, and improving ontologies and their connections to existing text. The technical aspects are excellent targets for competitions.

Konstantin Okonechnikov — Unipro UGENE: an open source toolkit for complex genome analysis

UGENE integrates bioinformatics tools. It is written in C++/Qt with a plugin system and provides a large library of bioinformatics algorithms: Smith-Waterman, MUSCLE, BLAST, HMM, Bowtie and more. It contains a visualization toolkit for viewing sequences and alignments. Algorithms are parallelized for multi-core CPUs and GPUs, and can be launched on clusters.

It also contains a visual environment for constructing workflows. A workflow can be turned into a shell command to run from the command line. Future plans are to develop a web environment and to support next-gen sequencing analysis.

Thomas Down — Exploring the genome with Dalliance

Existing genome browsers fall into two classes: heavy-weight clients that require installation, like IGV, or light-weight browser-based clients, like UCSC. Can we have a browser client that acts more like the heavy-weight clients but without installation? There are now a lot of web technologies to drive this: JavaScript, SVG/Canvas views, HTML5, and browsers focused on performance for games.

An interactive demo of the Dalliance Genome Browser shows nice scrolling and interaction fully within the web. For getting data, it uses DAS: XML-based annotations served over the web. This used to be limited by the browser's same-origin policy for JavaScript, but a server can now be set to allow cross-origin requests through its response headers.
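
As a rough sketch of those two pieces (in Python, not Dalliance's own JavaScript), the snippet below parses a small DAS-style features document and adds the `Access-Control-Allow-Origin` header a server would send to permit cross-origin requests. The sample XML is hand-written in the general shape of a DAS features response, and the function names are my own for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of a DAS features response;
# the segment and feature values are made up for illustration.
SAMPLE_DAS = """<DASGFF>
  <GFF>
    <SEGMENT id="1" start="1000" stop="5000">
      <FEATURE id="f1" label="example-gene">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_das_features(xml_text):
    """Extract (segment, label, start, end, strand) tuples from DAS-style XML."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append((
                segment.get("id"),
                feature.get("label"),
                int(feature.findtext("START")),
                int(feature.findtext("END")),
                feature.findtext("ORIENTATION", default="0"),
            ))
    return features


def add_cors_header(headers, allowed_origin="*"):
    """Return a copy of the response headers with cross-origin reads allowed."""
    updated = dict(headers)
    updated["Access-Control-Allow-Origin"] = allowed_origin
    return updated


features = parse_das_features(SAMPLE_DAS)
headers = add_cors_header({"Content-Type": "text/xml"})
```

A browser-based client can only read the XML from another domain if the data server includes that header in its response.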

Some alternatives to DAS include dense binary formats like BAM, BigBed and BigWig, with indexes for random access. Dalliance can support these directly. Nice interactive demo: quick, easy, and can drill all the way down to reads with the BAM display.
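
To give a feel for how an index makes random access possible, here is a sketch of the binning calculation described in the SAM/BAM specification for BAM indexes (BigBed and BigWig use a different, tree-based index). Each region is assigned to the smallest bin that fully contains it, so a client only has to fetch the file chunks for bins overlapping the current view. This illustrates the idea and is not taken from Dalliance.

```python
def reg2bin(beg, end):
    """Return the smallest bin containing the zero-based, half-open region [beg, end)."""
    end -= 1  # work with the last base covered by the region
    # Levels from finest (16 kb bins) to coarser (64 Mb bins); each offset is
    # the number of the first bin at that level.
    for shift, offset in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return offset + (beg >> shift)
    return 0  # bin 0 spans the whole 512 Mb range


small_region_bin = reg2bin(0, 16384)       # fits in a single 16 kb bin
spanning_region_bin = reg2bin(0, 16385)    # crosses a 16 kb boundary
```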

Alex Kalderimis — InterMine – Using RESTful Webservices for Interoperability

InterMine is a data warehouse framework for biological experiments and raw data: FlyMine, ModMine and more. The database is heavily de-normalised so is loaded and served as a read-only database: very performant. This is coupled with a read-write User database that references items in the data repository. Provide a web application interface to the repository, with custom query templates for biologists.

With the increase in number of InterMine instances, need ways to communicate between them. Use a REST API with clients for Java, Perl, JavaScript, Python and, soon, Ruby. This has a low threshold to usage, and can return data formats people are used to like tab delimited, but also structured formats for programs. Can use this to build automated workflows that query one mine, grab identifiers, then get data from another one. API for clients is improving to reduce boilerplate code required.
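
A minimal sketch of that kind of chained workflow is below. The mine URLs, the `/results` path and the `q`, `ids` and `format` parameters are placeholders I made up for illustration, not the actual InterMine API. The HTTP call is passed in as a function so the example runs offline.

```python
import csv
import io
from urllib.parse import urlencode


def build_url(base_url, params):
    """Compose a GET URL; the '/results' path is a placeholder, not a real endpoint."""
    return f"{base_url}/results?{urlencode(params)}"


def parse_tab(text):
    """Parse tab-delimited text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def chain_mines(fetch, first_mine, second_mine, query):
    """Query one mine, collect identifiers, then look them up in another mine."""
    first_url = build_url(first_mine, {"q": query, "format": "tab"})
    identifiers = [row[0] for row in parse_tab(fetch(first_url))]
    second_url = build_url(second_mine, {"ids": ",".join(identifiers), "format": "tab"})
    return parse_tab(fetch(second_url))


def fake_fetch(url):
    """Offline stand-in for an HTTP client, returning canned tab-delimited text."""
    if url.startswith("https://mine-a.example"):
        return "GENE1\tfirst gene\nGENE2\tsecond gene\n"
    return "GENE1\t12\nGENE2\t7\n"


rows = chain_mines(fake_fetch, "https://mine-a.example", "https://mine-b.example", "kinase")
```

For real use, `fake_fetch` would be swapped for a function that performs the HTTP request, for example one built on `urllib.request.urlopen`.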

Some lessons learned: JSON is awesome; use GET/POST to make it more browser friendly; fail loudly with HTTP or JSON error codes so you actually know if you have a problem.
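
The "fail loudly" lesson can be sketched from the client side as follows. The `error` key in the JSON payload is an assumption for illustration; a real service may report errors under a different name.

```python
import json


class ServiceError(Exception):
    """Raised when a web service response signals a failure."""


def load_or_raise(status_code, body):
    """Return the parsed JSON body, raising if the HTTP status or payload reports an error."""
    if status_code >= 400:
        raise ServiceError(f"HTTP {status_code}: {body[:200]}")
    payload = json.loads(body)
    # Assumed convention: errors are reported under an "error" key.
    if isinstance(payload, dict) and payload.get("error"):
        raise ServiceError(f"service error: {payload['error']}")
    return payload


ok_payload = load_or_raise(200, '{"results": [1, 2]}')
```

Checking both the status code and the payload means a failed query stops the workflow instead of being silently treated as an empty result.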

Bernat Gel — easyDAS: Automatic creation of DAS servers

easyDAS is a small web server that makes it easy to create a DAS server. DAS servers are meant to be easy, with smart clients using the simple servers. Most DAS servers are provided by larger institutes; what do you do if you are a small place without lots of resources?

easyDAS removes all the server configuration details, and you only need to upload a data file; it is hosted at EBI with a web interface. The maximum size is a million rows, which is not suitable for full-genome base-level information but works for lots of other data.

Kostas Karasavvas — Enacting Taverna Workflows through Galaxy

Taverna is a workflow management system; goal is to integrate with the Galaxy web framework. Taverna has a graphical interface to connect tools into a larger workflow. It provides a server that can run these workflows.

Implemented as a Ruby gem that makes a Galaxy tool from a workflow in myExperiment, along with a connection to a Taverna server. Install the tool XML into Galaxy and then run.
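
To illustrate the general idea (in Python here, while the actual generator is a Ruby gem), the sketch below turns a workflow description into a Galaxy-style tool XML wrapper. The workflow dictionary, the `run_taverna_workflow.py` command and the server URL are all hypothetical; only the overall tool/command/inputs/outputs layout follows Galaxy's tool format.

```python
import xml.etree.ElementTree as ET


def workflow_to_tool_xml(workflow, server_url):
    """Build a Galaxy-style tool wrapper (as an XML string) for one workflow."""
    tool = ET.Element("tool", id=workflow["id"], name=workflow["title"])
    ET.SubElement(tool, "description").text = workflow["description"]

    # Galaxy substitutes $name in the command with the value the user supplies.
    args = " ".join(f"--{name} '${name}'" for name in workflow["inputs"])
    # Hypothetical runner script that would submit the workflow to a Taverna server.
    ET.SubElement(tool, "command").text = (
        f"run_taverna_workflow.py --server {server_url} "
        f"--workflow {workflow['id']} {args} --out $output"
    )

    inputs = ET.SubElement(tool, "inputs")
    for name in workflow["inputs"]:
        ET.SubElement(inputs, "param", name=name, type="text", label=name)

    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format="txt")
    return ET.tostring(tool, encoding="unicode")


example = {
    "id": "wf42",
    "title": "Example workflow",
    "description": "Hypothetical workflow used only for this sketch",
    "inputs": ["sequence", "database"],
}
tool_xml = workflow_to_tool_xml(example, "http://taverna.example/server")
```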

Hervé Ménager — Mobyle 1.0: new features, new types of services

Mobyle is a web user interface for running command line tools. Also allows chaining of jobs into workflows. XML definitions are used for both. Also implemented viewers that visualize RNA structure, multiple alignments, phylogenetic trees. Can edit alignments directly in the web interface. For workflows, can run on LSF clusters to parallelize.

Junjun Zhang — BioMart 0.8 offers new tools, more interfaces, and increased flexibility through plug-ins

BioMart is an open source federated data management system meant to make in-house data available online. It is built as a Java system with lots of good software engineering. The new version provides additional ways to query the data backend. Used in several large scale collaborations.
