BioHackathon 2010: Day 2 — Python SPARQL query builder

BioHackathon day two started off with a walk through the University of Tokyo campus. It's a beautiful place with a combination of European style college buildings and isolated ponds in the shape of Japanese kanji characters. We ended up at the Database Center for Life Science (DBCLS) to actually start coding. Since things normally break up into smaller groups at this point, my thoughts will reflect the things that I ended up working on, but I'll try to capture as much of the general work being done as possible.

Discussion

The morning kicked off with a viewing of Tim Berners-Lee’s TED talk on linked data and the semantic web. It’s an inspirational talk for anyone interested in being better at sharing and organizing data on the web.

François Belleau from Bio2RDF followed up with his own talk on the basics of representing data as RDF. The relevant URLs are bookmarked on his Delicious account.

Some useful tools are:

  • Tabulator is a semantic web browsing tool available as a Firefox extension. It's a handy way to test the RDF you produce and make sure it is well formed.

  • Virtuoso is an RDF triple store recommended by several RDF providers. It can provide a SPARQL endpoint for querying the database and retrieving results.

Code

Today several of the OpenBio folks from Biopython and BioRuby discussed plans for libraries that make querying and providing semantic resources easier. The grand plan is to serve KEGG data via a SPARQL endpoint built with BioRuby, and to retrieve it with a Biopython client.

To build towards this goal, I started work on a generic query builder interface in Python. The idea is to provide a programmer's API for querying semantic resources that does not require knowing all of the underlying details of RDF and SPARQL. This initial version uses the Biogateway SPARQL query endpoint, which provides semantic query access to GO terms and SwissProt data.

The full code is available from GitHub. It uses SPARQLwrapper to handle access to the query server, and provides a Python API to help determine what query to build. The returned result is a table-like object, provided as a numpy record array (recarray).
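As a rough sketch of the retrieval mechanics, the pattern looks something like the following: SPARQLwrapper runs the query against an endpoint and the JSON bindings are packed into a numpy record array. The endpoint URL and the trivial query here are placeholders for illustration, not the actual Biogateway query the library builds:

```python
import numpy as np
from SPARQLWrapper import SPARQLWrapper, JSON

def run_query(endpoint_url, query, names):
    """Run a SPARQL query and return the bindings as a numpy record array."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # One tuple per result row, with empty strings for unbound variables.
    rows = [tuple(b[n]["value"] if n in b else "" for n in names)
            for b in bindings]
    return np.rec.fromrecords(rows, names=names)

# Placeholder usage -- any public SPARQL endpoint and variable names will do.
results = run_query("http://sparql.example.org/endpoint",
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
                    ["s", "p", "o"])
print(results.s)
```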

Building the query involves passing both retrieval and select objects to a builder. In the example below, we search human proteins for those with GO annotations related to insulin and disease descriptions containing references to diabetes. We retrieve both the protein names and the names of proteins that interact with them:

https://gist.github.com/299035
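For a sense of how the builder pattern works, here is a purely illustrative toy version: callers describe what to select and filter on in plain Python, and the builder assembles the SPARQL text for them. The class names and the RDF predicates below are invented stand-ins and do not match the actual code on GitHub or the Biogateway vocabulary:

```python
class SelectAttribute:
    """An attribute to retrieve, mapped to a (hypothetical) RDF predicate."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

class FilterContains:
    """Restrict results to subjects whose attribute value contains some text."""
    def __init__(self, attribute, text):
        self.attribute = attribute
        self.text = text

class SparqlQueryBuilder:
    """Assemble a SPARQL SELECT query from select and filter objects."""
    def __init__(self, selects, filters):
        self.selects = selects
        self.filters = filters

    def build(self):
        variables = " ".join("?%s" % s.name for s in self.selects)
        patterns = ["?protein %s ?%s ." % (s.predicate, s.name)
                    for s in self.selects]
        for i, f in enumerate(self.filters):
            var = "?filter%d" % i
            patterns.append("?protein %s %s ." % (f.attribute.predicate, var))
            patterns.append('FILTER regex(str(%s), "%s", "i")' % (var, f.text))
        return "SELECT %s WHERE {\n  %s\n}" % (variables, "\n  ".join(patterns))

# Hypothetical attribute definitions; real prefixes/predicates would differ.
name = SelectAttribute("name", "rdfs:label")
go_term = SelectAttribute("go_name", "gw:has_go_term")
disease = SelectAttribute("disease", "gw:has_disease_text")

builder = SparqlQueryBuilder(
    selects=[name],
    filters=[FilterContains(go_term, "insulin"),
             FilterContains(disease, "diabetes")])
print(builder.build())
```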

By hiding the underlying details from the library user, the API keeps the focus on the items of biological interest. The next steps are to find good ways to generalize this style of query building to a wider range of queries, and to test it across multiple search providers.
