These are some notes and thoughts from day 1 of BioHackathon 2010 in Tokyo, Japan.
Still feeling a bit strange from jet lag and a 13 hour flight, but I’m up and ready for BioHackathon 2010. The day started off with a tour of the wonderful Tokyo transit system during rush hour to head to the CBRC, which is located across town in Tokyo; it’s hard to be more specific, since I’m not sure I actually understand where we went. But it was a beautiful trip over the water on a sunny morning; it beats an underground T ride through damp, dark tunnels in Boston.
So we made it up to the 11th floor of the Computational Biology Research Center (CBRC) for a day of talks and discussions to kick off the coding session. The idea is to introduce various existing tools and try to build consensus about useful things to build. Here are my notes from the first day of presentations.
Toshiaki Katayama — Introduction
Hackathon generously sponsored by the Database Center for Life Science (DBCLS) and the National Institute of Advanced Industrial Science and Technology (AIST).
- Learning Semantic Web — OWL, RDF, SPARQL. What works best?
- RDF (Resource Description Framework): subject – predicate – object; referred to as a triple.
- SPARQL — query language to provide directed search
- OWL — Web Ontology Language
- Triple stores — how should we store and make data available?
- Open Bio* libraries — accessing RDF tools and SPARQL endpoints
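The triple model above can be sketched in a few lines of Python; the names and predicates here are made up for illustration, and `None` plays the role that a variable does in a SPARQL pattern:

```python
# A minimal sketch of the RDF triple model: statements are
# (subject, predicate, object) tuples, and queries are pattern
# matches where None acts as a wildcard, roughly like a SPARQL
# variable. All names and prefixes here are illustrative only.

def match(store, s=None, p=None, o=None):
    """Return all triples matching the given pattern."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

store = [
    ("ex:BRCA1", "ex:participates_in", "ex:DNA_repair"),
    ("ex:BRCA1", "ex:located_in", "ex:nucleus"),
    ("ex:CDK1",  "ex:participates_in", "ex:cell_cycle"),
]

# "What do we know about BRCA1?" -- analogous to the SPARQL query
#   SELECT ?p ?o WHERE { ex:BRCA1 ?p ?o }
print(match(store, s="ex:BRCA1"))
```

A real triple store does the same thing at scale, with indexes over all three positions.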
Erick Antezana — Semantic Systems Biology
Why? – Lots of data and lack of structure makes it difficult to analyze
Motivation – a problem to work on: the cell cycle and all processes in the Gene Ontology.
- Develop a knowledge representation. Need specific terms and meanings: an ontology. This is a controlled vocabulary of biological terms and their relations.
- Provides a Semantic Web: machine understandable content. This allows moving beyond keyword search to complex query formulation.
Example: the Cell Cycle Ontology (CCO), covering the what, where, and when of cell cycle processes. It is available in two formats: Open Biomedical Ontologies (OBO) and OWL. CCO is exported as RDF and can be queried using SPARQL.
BioGateway: builds on cell cycle to also include all processes in the Gene Ontology. Goal is to build complex queries over many organisms supported by GO.
Systems Biology cycle: start with mathematical model of a biological system, simulate the system and generate hypotheses, do biological wet lab experiments, analyze the data and then feed back in to adjust the model. An iterative process.
BioGateway supports this approach. Integrates several resources into an RDF database and triple store. Web page provides proposed queries into SPARQL that can then be edited.
Matthias Samwald — high level representation of the semantic web
aTag — web program allowing you to select text and then annotate it with semantic ontology terms. It is then available as RDF from a web page for automated discovery by programs. The idea is to read a paper and pull out facts into a format that can be queried. This is a step beyond blogging about a paper: you present both your interpretation of the data and a structured way for programs to query it.
For you to do: represent your data in a way that is compatible with aTags.
Thomas Kappler / Jerven Bolleman — UniProt in RDF
What is UniProt?
- SwissProt — manually annotated and curated protein sequences
- TrEMBL — computationally annotated, unreviewed protein sequences
- UniRef — clusters of protein sequences (at 100%, 90%, and 50% identity)
- UniParc — archive of all protein sequences
Experience with RDF:
- All data available as RDF, including things like Taxonomy
- Contains cross refs to many other databases
- Migration process to move from flat file and XML to RDF
- 85% of searches are simple keyword searches requiring full text indexing
- SPARQL queries are available at public endpoints
Francois Belleau — Bio2RDF
Bio2RDF applies semantic web rules to integrate bioinformatics data:
- Convert many public databases to RDF: 2.3 billion triples
- Stored in Virtuoso triple store
- Ask useful questions using SPARQL
Federated search is supported across all of the databases.
Tim Berners-Lee’s rules for linked data:
- Use URIs as names, with HTTP so people can look them up
- Provide useful information for names
- Include links to other URIs, allowing discovery
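A toy sketch of these rules, with a hypothetical namespace and made-up records: each resource is named by an HTTP URI, looking up a URI yields useful information, and records link to other URIs so a client can hop between them.

```python
# Hedged sketch of the linked-data rules. The namespace and the
# protein records below are invented for illustration; a real
# linked-data client would dereference the URIs over HTTP.

BASE = "http://example.org/protein/"   # hypothetical namespace

records = {
    BASE + "P38398": {
        "label": "BRCA1",
        "seeAlso": [BASE + "P04637"],  # link out to another URI
    },
    BASE + "P04637": {"label": "TP53", "seeAlso": []},
}

def lookup(uri):
    """'Dereference' a URI: return useful information for that name."""
    return records.get(uri)

# Follow a link from one resource to a related one:
start = lookup(BASE + "P38398")
linked = lookup(start["seeAlso"][0])
print(linked["label"])
```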
He proposes a new idea: the cognoscope, based on mash-ups. Build a specific database of just the items you are interested in, then query that. Break a SPARQL query into parts, and submit each part to the right workflow node. Based on Taverna: collect each result into a triple store, then query that triple store.
Get cognoscope by searching MyExperiment.
Heiko Horn — Reflect: text mining in semantic web
Gives users a way to get semantic data from papers or web pages. Current journals do not provide semantic information, since the incentives are not there for publishers or authors; it takes more time and money. Reflect identifies chemical and protein names in a web page and provides additional information, backed by 7.4 million chemicals and 2.6 million proteins.
New functionality: you can manually add or remove names that should be reflected in a document, which helps fix problems identifying names of interest.
Has API services: given a document, you can get back the tagged HTML document (GetHTML) or an XML list of found names (GetEntities). Both REST and SOAP interfaces are supported.
Practically, the document is searched for an organism, which is then used to identify relevant chemicals and proteins. You can also specify the organism if it is not mentioned, which helps reduce false positives.
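As a sketch, a client call to a REST tagging service like this might look as follows. The endpoint URL and the `document` parameter name are assumptions for illustration, not the documented Reflect API, so check the service documentation before using it:

```python
# Sketch of calling a REST text-mining service: POST the document
# text, then parse the returned tagged names. The URL and the
# "document" field name are hypothetical, not the real Reflect API.
import urllib.parse
import urllib.request

def build_request(text, url="http://reflect.example.org/REST/GetEntities"):
    """Build (but do not send) a POST request carrying the document text."""
    data = urllib.parse.urlencode({"document": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")

req = build_request("BRCA1 interacts with TP53 in human cells.")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req).read()
```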
Tetsuro Toyoda — RIKEN SciNeS
SciNeS is bioinformatics infrastructure in Japan. The goal is to fill the gap between database integrators and scientists. Scientists can create databases, make them accessible to collaborators, and then make the data publicly available at the time of publication. This helps bridge multiple national projects on different genomes. It provides a web interface to create and manage data and collaborators.
Semantic-JSON makes data available in several programming languages; a good point to query and retrieve semantic data from the databases.
Mark Wilkinson — SADI (Semantic Automated Discovery and Integration)
Why the Semantic Web? The relational database model does not fit knowledge-based problems, because knowledge, data, and schemas are constantly changing.
First step: use URIs instead of numeric primary keys.
Goal: integrating web services and semantic web, without inventing any new technologies or standards.
Web services create implicit biological relationships between objects. For instance: identifier -> has_sequence -> GATC. SADI makes those relationships explicit.
Next step: add in ontologies. Use OWL to collapse complicated relationships into a simpler query. The OWL is an XML-based description of the information that goes into the relationship; these descriptions will be defined by experts and can then be shared for others to use.
Andrea Splendiani — Visualization and analysis of biological networks
RDFScape — provides a Cytoscape interface for interacting with RDF Semantic Web data. It allows interactive viewing of items and connections, discovering relationships. Queries are supported graphically, without requiring SPARQL. You can customize the colors and overall structure of the graph to emphasize elements of interest. Finally, it can make inferences based on existing relationships.
Ondex — http://ondex.org: Generic analysis tools for the semantic web.
Toshiaki Katayama — TogoWS and Open Bio* libraries
TogoDB allows users to take data in CSV and make it available as a REST/SOAP service in TogoWS. The next step would be to use this as an RDF/SPARQL provider.
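A hedged sketch of that CSV-to-service step: each uploaded row becomes an addressable resource with one statement per column, and serving these as RDF triples would be the natural next step. The column names and base URI are made up:

```python
# Toy version of the TogoDB idea: turn CSV rows into per-row
# resources with one (resource, column, value) statement each.
# The base URI and columns are invented for illustration.
import csv
import io

BASE = "http://togodb.example.org/entry/"   # hypothetical base URI

raw = "id,name,organism\n1,BRCA1,human\n2,cdc2,yeast\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Each row gets a URI from its id; each remaining column is a statement.
triples = [(BASE + r["id"], col, r[col])
           for r in rows for col in ("name", "organism")]
print(triples[0])
```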