I’m in beautiful Vienna, Austria, at the Bioinformatics Open Source Conference on Friday and Saturday, July 15-16th. The conference emphasizes freely available biological software and the communities that contribute to it. These are my notes from the morning talks.
Larry Hunter — The role of openness in knowledge-based systems for biomedicine
A difficulty in Artificial Intelligence is capturing the common sense information that you expect an intelligent agent to know. There is a ton of this information, but thankfully in molecular biology common sense is less of a barrier: you can capture everything known about molecular biology from textbooks, papers, and databases. Can we write programs that get all this information?
The difficulty is that the interesting questions we want to answer are complex. The one gene, one disease model is extremely rare; in general we are looking at perturbations of complex models that change over time. Now that we are so good at sequencing, the hard problem in bioinformatics is understanding the data. This is not only about facts, but about putting facts together to answer "why" questions. Judging if an explanation is plausible in AI requires a knowledge base and a way to score results.
Some knowledge-based computational biology solutions are: BioCyc, AskHermes, Watson Medicine, GO over-representation analysis, HyQue; anything that uses ontologies like GO. Three reasons that openness matters in these areas:
- Productivity: very hard problem so need to build off results; can’t do it alone
- Equity: allow anyone to contribute by lowering barriers and costs
- Ethics: AI is a social concern; need to earn the trust of society
How do we get the process going?
- Build on current open ontologies: OBO, Semantic Web, Open Access Publishing, Linked Life Data (wow)
- Social infrastructure to work together to solve hard problems: cooperation and competition combined
- Conform to shared infrastructure and avoid fears of losing ideas and credit to the community
The idea is to organize competitions that require open source code and the use of shared infrastructure, specifically with the goal of combining existing communities (BioCreative, BioNLP). For this you need software to work off of, computational power, training data to work with, and significant prizes.
Larry’s group has made CRAFT available, an open source set of semantic annotations that uses existing community ontologies. This can be the basis of these competitions. Key is leveraging these existing standards to serve as a basis for future work so we are actually building off each other’s work.
Remaining challenges are ensuring openness of papers in a way that they can be bulk downloaded for AI text mining, and improving ontologies and their connections to existing text. The technical aspects are excellent targets for competitions.
Konstantin Okonechnikov — Unipro UGENE: an open source toolkit for complex genome analysis
UGENE integrates bioinformatics tools. Written in C++/Qt with a plugin system. It provides a large library of bioinformatics algorithms: Smith-Waterman, Muscle, Blast, HMM, Bowtie and more. It contains a visualization toolkit for sequence viewing, alignment. Algorithms are parallelized for multi-core CPUs, GPUs and support launching on clusters.
Contains a visual environment for constructing workflows. The workflow can be turned into a shell command to run from the commandline. Future plans are to develop a web environment, and support next-gen sequencing analysis.
Thomas Down — Exploring the genome with Dalliance
Some alternatives to DAS include dense binary formats like BAM, BigBed and BigWig with indexes for random access. Dalliance can support these directly. Nice interactive demo: quick, easy, and can drill all the way down to reads with BAM display.
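To make the "indexes for random access" idea concrete, here is a minimal sketch of a fixed-width bin index over genomic features. This is only an illustration of the general technique, not the actual index layout used by BAM, BigBed or BigWig, and not how Dalliance is implemented. The names `build_index`, `query` and `BIN_SIZE` are made up for this example.

```python
from collections import defaultdict

# Assumed bin width in bases; real formats use more elaborate multi-level schemes.
BIN_SIZE = 1000


def build_index(features, bin_size=BIN_SIZE):
    """Map each bin number to the list positions of features overlapping that bin.

    features is a list of (start, end, name) tuples using half-open
    coordinates [start, end).
    """
    index = defaultdict(list)
    for i, (start, end, _name) in enumerate(features):
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            index[b].append(i)
    return index


def query(features, index, start, end, bin_size=BIN_SIZE):
    """Return names of features overlapping [start, end).

    Only the bins covering the requested region are inspected, so a
    browser can fetch one window without scanning the whole data set.
    """
    hits = set()
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        for i in index.get(b, []):
            f_start, f_end, _name = features[i]
            if f_start < end and start < f_end:
                hits.add(i)
    return [features[i][2] for i in sorted(hits)]


# Toy data: three "reads" on a single chromosome.
features = [(100, 250, "read1"), (900, 1100, "read2"), (5000, 5050, "read3")]
index = build_index(features)
print(query(features, index, 950, 1000))
```

A genome browser zooming into a small window would issue a query like the last line, touching only the bins under that window.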
Alex Kalderimis — InterMine – Using RESTful Webservices for Interoperability
InterMine is a data warehouse framework for biological experiments and raw data: FlyMine, ModMine and more. The database is heavily de-normalised so is loaded and served as a read-only database: very performant. This is coupled with a read-write User database that references items in the data repository. Provide a web application interface to the repository, with custom query templates for biologists.
Some lessons learned: JSON is awesome; use GET/PUT to make it more browser friendly; fail loudly with HTTP or JSON error codes so you actually know if you have a problem.
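As a rough client-side illustration of the "fail loudly" lesson, the sketch below turns both HTTP-level and JSON-level failures into exceptions instead of silently returning partial data. The `error` field and the `ServiceError` and `parse_response` names are assumptions made for this example; they are not taken from the InterMine API.

```python
import json


class ServiceError(Exception):
    """Raised whenever the web service reports a problem."""


def parse_response(status, body):
    """Decode a JSON response body, raising on any reported failure.

    status is the HTTP status code and body is the response text.
    The "error" key is a hypothetical convention for this sketch.
    """
    if status >= 400:
        # HTTP-level failure: surface the code and the start of the body.
        raise ServiceError(f"HTTP {status}: {body[:200]}")
    payload = json.loads(body)
    if isinstance(payload, dict) and payload.get("error"):
        # The request succeeded at the HTTP level but the service flagged an error.
        raise ServiceError(f"service error: {payload['error']}")
    return payload


print(parse_response(200, '{"results": [1, 2]}'))
```

Checking both layers means a caller cannot mistake an error page or an error payload for an empty result set.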
Bernat Gel — easyDAS: Automatic creation of DAS servers
easyDAS is a small web server that makes it easy to create a DAS server. DAS servers are meant to be easy, with smart clients using the simple servers. Most DAS servers are provided by larger institutes; how do you handle it if you are a small place without lots of resources?
easyDAS removes all the server configuration details, and you only need to upload a data file; it is hosted at EBI with a web interface. The maximum size is a million rows; not suitable for full genome base level information but for lots of other information.
Kostas Karasavvas — Enacting Taverna Workflows through Galaxy
Taverna is a workflow management system; goal is to integrate with the Galaxy web framework. Taverna has a graphical interface to connect tools into a larger workflow. It provides a server that can run these workflows.
Hervé Ménager — Mobyle 1.0: new features, new types of services
Mobyle is a web user interface for running commandline tools. Also allows chaining of jobs into workflows. XML definitions are used for both. Also implemented viewers that visualize RNA structure, multiple alignments, phylogenetic trees. Can edit alignments directly in the web interface. For workflows, can run on LSF clusters to parallelize.
Junjun Zhang — BioMart 0.8 offers new tools, more interfaces, and increased flexibility through plug-ins
BioMart is an open source federated data management system meant to make in-house data available online. It is built as a Java system with lots of good software engineering. The new version provides additional ways to query the data backend. Used in several large scale collaborations.