Day 3 of BioHackathon 2010 kicked off early in the morning; are they really forcing poor overworked coders to wake up at 4am to start work? Well, not to work, but rather to take an early morning trip to the Tsukiji Fish Market. A 4:30am Japanese train is comfortingly similar to early morning trains everywhere: the passengers are a mix of girls who've been up all night drinking, overworked businessmen who will fall behind if they're not on the first train of the day, and tourists trying to get to the fish market in time for the tuna auction.
The fish market is an unbelievably busy storm of trucks, forklifts and styrofoam. If you can avoid getting run over or yelled at in Japanese, you can make it through to the early morning fish auction, where restaurant providers grade and bid on massive tuna, sharks and other assorted enormous fish. The manic bidding results in prices of $5000 for the lucky recipient of a massive amount of sushi-grade tuna. I still have no idea how they know who won the auction and how much they bid, but everyone seems satisfied with their purchases.
Following that, we wandered through the sea of squid, orange eyed fish, seaweed with fish eggs, and other strange things from the ocean. Luckily, Toshiaki and Atsuko kept us oriented in the manic maze until we made it to a row of fabulous sushi restaurants. Our tiny place had amazing tuna bowls; don’t let anyone tell you that raw fish isn’t a breakfast food.
In the post-fish morning discussion, Francois led us through an introduction to SPARQL queries. SPARQL is an SQL-like syntax used to query RDF stores, and was the basis of the work I discussed yesterday. The general idea behind SPARQL is faceted querying based on properties of the various objects. Facet browsers are a nice way to navigate and explore RDF data; a good example of a powerful query generator over triple stores is the one in Freebase. The discussion also raised several practical interoperability issues:
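To give a flavor of the syntax, here is a minimal sketch of building a SPARQL query of the faceted kind described above: find proteins annotated with a given GO term. The prefixes, property names and `annotatedWith` predicate are illustrative placeholders, not BioGateway's actual schema.

```python
def build_go_query(go_id, limit=10):
    """Build a SPARQL query for proteins annotated with a GO term.

    The predicate and prefix URIs here are invented for illustration.
    """
    return """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX go: <http://www.geneontology.org/go#>

SELECT ?protein ?name
WHERE {
  ?protein go:annotatedWith go:%s .
  ?protein rdfs:label ?name .
}
LIMIT %d
""" % (go_id, limit)

print(build_go_query("GO_0006915"))
```

The query string would then be sent to a SPARQL endpoint over HTTP, with results coming back as XML or JSON bindings for the selected variables.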
- Identifiers: Are we sure we are using the same ID type to refer to a gene or other biological thing?
- Namespaces: How do you name different data associated with genes?
- Files: Do we know what file format and flavor of file format we are working with?
- Genome version: Reconciling various build names (Ensembl name, NCBI name, UCSC name, model organism names)
How do we resolve these issues? The answer is ontologies, but do ontologies already exist to solve these problems? If so, having providers adopt them could lead to some de facto standards. However, an issue is that many of these tools do not actually produce the data; they just redistribute it in more useful ways. Could data providers focus more strongly on semantics?
The overall practical conclusion is to establish a set of unique identifiers to use in tabular data as column headers. This would allow automated reasoning when intersecting tables from multiple sources. For instance, if you have a column of Ensembl identifiers, it would be useful to name it consistently. We need to establish a set of standard names for these common cases.
Our plan for establishing these names is to first dump the current column names from BioMart and InterMine, intersect them, and pull out the most useful ones. Then we would need to find some authority to bless the list and, practically, make sure the names are available and used in data-providing tools.
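The first step of that plan is plain set intersection. A toy sketch, with made-up column-name dumps standing in for the real BioMart and InterMine exports:

```python
# Hypothetical column-name dumps from two providers; the real lists
# would come from exporting headers out of BioMart and InterMine.
biomart_cols = {"ensembl_gene_id", "chromosome_name", "start_position", "go_id"}
intermine_cols = {"ensembl_gene_id", "chromosome_name", "symbol", "go_id"}

# Names used by both providers are the candidates for standardization.
shared = sorted(biomart_cols & intermine_cols)
print(shared)
```

Ranking the shared names by how often they appear across datasets would then surface the most useful candidates for blessing as standards.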
Following yesterday’s work on providing a Python client interface for BioGateway’s SPARQL query server, the next step is to try and generalize the API for multiple servers providing similar information. This can serve as a basis for an interface that:
- Helps users build useful queries without having to understand the underlying data structures.
- Returns results in a consistent tabular fashion.
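One way such a generalized interface might look (all names here are invented for illustration, not the actual BioGateway client): each server-specific client implements a single query method, and a shared base class guarantees that results come back in consistent tabular form.

```python
class BaseQueryClient:
    """Shared front end for multiple query servers.

    Subclasses implement _do_query for a specific backend (SPARQL,
    web services, ...); callers always get rows as uniform dicts.
    """
    def search(self, **filters):
        header, rows = self._do_query(filters)
        return [dict(zip(header, row)) for row in rows]

    def _do_query(self, filters):
        raise NotImplementedError

class MockClient(BaseQueryClient):
    """Stand-in for a real SPARQL or web-services backend."""
    def _do_query(self, filters):
        header = ["gene_id", "description"]
        rows = [("ENSG000001", "example gene")]
        return header, rows

client = MockClient()
print(client.search(organism="human"))
```

A real subclass would translate the keyword filters into a SPARQL or web-services query; the caller never sees that translation.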
Another really useful tool represented here is modMine, which is an InterMine interface to a whole ton of awesome raw data from the modEncode project for C. elegans and Drosophila. The InterMine web services interface allows building complex queries with an XML syntax; it's a very similar problem to building up SPARQL queries. Inspired by the current perl API for accessing the database, this implementation takes a slightly different tack on building a query. See the full code on GitHub.
Here we start with a query builder, and then define some common filtering operations, taking care of the query-specific magic behind the scenes. This is an improvement over the more raw interface built yesterday since it compartmentalizes the query-building logic in a single class and makes the actual query code easier to follow:
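The full implementation is in the GitHub code linked above; what follows is only a minimal sketch of the pattern, loosely modeled on InterMine's XML query syntax. The paths, class names and model name are illustrative, not modMine's actual data model.

```python
import xml.etree.ElementTree as ET

class QueryBuilder:
    """Build an InterMine-style XML query, hiding the XML details."""
    def __init__(self, root_class):
        self._root = root_class
        self._query = ET.Element("query", model="genomic",
                                 view="%s.primaryIdentifier" % root_class)
        self._code = ord("A")  # InterMine constraints carry letter codes

    def filter_equals(self, attribute, value):
        """Add an equality constraint on an attribute of the root class."""
        ET.SubElement(self._query, "constraint",
                      path="%s.%s" % (self._root, attribute),
                      op="=", value=value, code=chr(self._code))
        self._code += 1
        return self  # chainable, so queries read top to bottom

    def to_xml(self):
        return ET.tostring(self._query, encoding="unicode")

xml_query = (QueryBuilder("Gene")
             .filter_equals("organism.name", "Caenorhabditis elegans")
             .to_xml())
print(xml_query)
```

The calling code only ever chains `filter_equals`-style operations; the XML element construction stays inside the builder, which is the compartmentalization described above.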
The next step is to apply this to the BioGateway interface developed yesterday and expand the two interfaces to include additional query types.