Link potpourri: Large file indexing and analysis, open science and visualization

A roundup of interesting links from my feed reader:

  • Brent provides two Python modules to deal with large scale genomic data. The first indexes a matrix file for rapid cell lookup, while the second indexes a custom flat file format. While it pains me to see GFF reimplemented in a different format, I do understand the retrieval issues with nested GFF3. Maybe what is needed here is a Flat-like implementation that sits on top of standard GFF files. He also passes along a pointer to GenomeTools, which contains a number of C bioinformatics libraries with an available Python interface.
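
As a rough illustration of the indexing idea (this is not Brent's implementation), a flat file can be indexed by record offset for constant-time lookup instead of scanning the whole file:

```python
import io

def build_index(handle):
    """Map the first column of each line to its offset in the file."""
    index = {}
    offset = 0
    for line in handle:
        index[line.split("\t")[0]] = offset
        offset += len(line)
    return index

def lookup(handle, index, key):
    """Seek directly to a record instead of scanning the whole file."""
    handle.seek(index[key])
    return handle.readline().rstrip("\n")

# with on-disk files, open in binary mode so offsets count bytes
handle = io.StringIO("gene1\t10\t200\ngene2\t15\t300\ngene3\t8\t150\n")
index = build_index(handle)
print(lookup(handle, index, "gene2"))
```

In practice the index itself would be persisted to disk so it is built only once per file.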

  • Amund Tveit at Atbrox describes parallelizing a classifier using MapReduce on Hadoop. They use Python with Hadoop streaming to split the computationally intensive matrix math over multiple nodes as the map step, combining the results in the reducer. This includes a nice discussion of ways to further speed up processing, and is the start of the Snabler project, which aims to collect parallel algorithms for Hadoop in Python.
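
The streaming pattern itself is simple enough to sketch. This is not Atbrox's classifier code; the mapper here just sums row chunks to stand in for the matrix math, and the pipeline is simulated locally rather than run on Hadoop:

```python
import itertools

def mapper(lines):
    # each input line: row_id <tab> space-separated chunk of values;
    # Hadoop streaming mappers read stdin and emit key<tab>value lines
    for line in lines:
        row_id, chunk = line.rstrip("\n").split("\t")
        partial = sum(float(x) for x in chunk.split())
        yield "%s\t%s" % (row_id, partial)

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce step,
    # so all values for one key arrive together
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%s" % (key, sum(float(v) for _, v in group))

# simulate the streaming pipeline locally (sorted() stands in for Hadoop's shuffle)
mapped = sorted(mapper(["r1\t1 2 3", "r2\t4 5", "r1\t10"]))
print(list(reducer(mapped)))
```

On a cluster, the same two functions would be wired together with `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`.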

  • Daniel Lemire persuasively argues for making your research, data and code open source. These are some nice points to have in the back of your head when discussing the issue with researchers who are worried about being scooped. The comment debate is also a good read, with legitimate concerns by folks navigating the world of academic incentives. Publish or perish; when can well-written, critiqued blog posts start counting as publishing?

  • Speaking of open science, I have been enjoying following Sean Eddy’s posts on HMMER3 development; it’s inspiring to see brilliant scientists working on widely used software candidly talking about the pros and cons of the software. The latest news is that the HMMER 3.0 release candidate is available.

  • UsabilityPost reviews the new book The Laws of Simplicity in the context of software design:

    • Reduce interfaces as much as possible.
    • Organize related items together into groups to differentiate similar functionality.
    • Make things quicker to do; fast seems simpler.
  • Seth Roberts discusses his thoughts on how exploratory and confirmatory statistical analyses work together. Things that are considered exploratory, like graphing, are actually also critical components of more complex confirmatory statistical tests.

  • Getting Genetics Done provides some nice visualization pointers:

    • LocusZoom is a web toolkit for plotting linkage disequilibrium, recombination rates and SNPs in the context of gene regions. The code is apparently written in R and Python, but I couldn’t find a download.
    • Plotting contingency tables with ggplot2 in R. Contingency tables examine the relationships between two categorical variables, and the plots help reveal trends. In the example presented the categories are pre-ordered, but one general trick would be getting them organized in a way that makes the trends apparent.
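
The plotting example is in R with ggplot2; for the table-building half, here is a stdlib Python sketch of cross-tabulating two categorical variables (the observations are invented for illustration):

```python
from collections import Counter

def contingency_table(pairs):
    """Count co-occurrences of two categorical variables."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # nested dict: table[row_category][col_category] -> count
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}

observations = [("smoker", "case"), ("smoker", "case"),
                ("smoker", "control"), ("nonsmoker", "control")]
table = contingency_table(observations)
print(table["smoker"]["case"])
```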

BioHackathon 2010: Day 5 — the final day

BioHackathon 2010 wrapped up today after five excellent days of discussion, coding and community. Sometimes it feels like you hit your stride right as things are wrapping up, which just goes to show how much a great set of people can get done once they are organized and comfortable working together. Or it could be Parkinson’s law kicking in.

Hackathon summary

After a full day of work, everyone came together in the late afternoon to help summarize the accomplishments this week. Toshiaki kicked us off with a summary of everything that was accomplished during the week, highlighting the community and a nice demo of RDF TogoDB populated with the international liquor selection assembled for evening discussion time.

Below are summaries for the various projects and groups. For more details check out:


If you’ve been reading these updates, you have a good idea of what I presented for the Biopython summary of our work: interfaces to query BioGateway and InterMine. Raoul followed this with a description of the BioRuby work. They were working with us to develop a similar API for accessing SPARQL endpoints at BioGateway and Bio2RDF. Thomas reported on the state of RDF support in Perl, recommending RDF::Trine and providing some working code on GitHub.


The G-language team presented an awesome JavaScript tool called cube which allows selection of web text and calling out to external services. We also got a demo of a Japanese video game circa 1986 which inspired the cube interface.


Andrea and Kei discussed their work with Cytoscape to handle Semantic Web formats like RDF. It can access triple stores, load RDF data directly from them, and query against SPARQL endpoints. See RDFScape for more details.

Text mining

Alberto discussed the work on semantic text mining with Heiko and folks from Reflect to get results as triples. This was also done with Whatzit, which recognizes items in biological text and makes them available as RDF.

Semantic Data Exchange

Gos talked about the data providers’ discussion on improving semantics so that results from multiple sources of data can readily be combined. You can see the full notes on this meeting. There were a few different levels of interoperability considered:

  • File formats
  • Specifying locations (1-based versus 0-based; chromosome names)
  • Controlling the namespaces of columns in tabular data
  • Specification of genome versions and annotations
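
The second point above is a classic source of off-by-one bugs. A minimal sketch of converting between the two conventions (the function names are mine, not from any library):

```python
def gff_to_bed(start, end):
    """1-based inclusive (GFF-style) -> 0-based half-open (BED-style)."""
    return start - 1, end

def bed_to_gff(start, end):
    """0-based half-open (BED-style) -> 1-based inclusive (GFF-style)."""
    return start + 1, end

# the first ten bases of a chromosome, in each convention
print(gff_to_bed(1, 10))   # 0-based half-open
print(bed_to_gff(0, 10))   # 1-based inclusive
```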


Christian spoke about uses of RDF in taxonomy. Use cases include biodiversity informatics and metagenomics. One big question: how to deal with uncertainty in biological information?

Converting to RDF from other formats

Pierre spoke about transforming XML resources to RDF using XSLT and an ontology: specifically, transforming genotype data at NCBI to RDF with xsltproc. A second example converted large XML files from dbSNP to RDF in a two-step process: first parse with Java into parts, then use XSLT to convert each part to RDF.


Mark gave an overview of work on SADI at the hackathon. He started with a bit of evangelism to encourage people to change their thinking to help with adopting RDF across multiple providers. Additional tools were added to access SADI from Perl and Java, and it was integrated with Taverna. Now older WSDL frameworks can be converted into SADI services without any coding.


Akira spoke about work this week integrating DDBJ, PDB and KEGG using RDF. The initial plan is to convert the data to RDF tables, and then link between tables. This was demonstrated with KEGG pathway to PDB queries. By transferring this to a powerful server, huge RDF stores like KEGG are accessible with rapid queries.


Jerven described the work on UniProt RDF this week. Using Pellet, they compared their current RDF output to the OWL description file. The consistency was improved greatly this week, improving the downstream applicability of RDF from UniProt.


Francois presented the work on generalizing Bio2RDF to multiple providers, along with several decisions made at the hackathon on using RDF: one important decision was on polite URIs, establishing conventions for how to name things.


I’m planning to put together a fully buttoned-up post for Blue Collar Bioinformatics for anyone who has been following along and is interested in using the libraries. The code is available on GitHub:

BioHackathon 2010: Day 4 — Improved python SPARQL query interface

Day 4 of BioHackathon 2010 is all about discussion and coding. Things settle in as everyone finds their group projects starting to get going. Then the reality hits that you only have two days left to finish your work, and things start moving at a more hectic pace.

Last night Peter and I had the chance to meet up with Michiel, another fellow Biopython coder who lives nearby in Japan. We had an authentic Japanese dinner consisting of some meat; I was unable to identify the animal or organ from which it was derived. More important than the fabulous food is the opportunity to reconnect with old friends and discuss some biology and programming. It’s amazing the things you can get accomplished just by talking through them.


After working up an improved coding interface to build queries for modMine yesterday, today I came back to the BioGateway interface from day 2 and expanded and simplified making queries. A nice discussion with Erick, who put together BioGateway, helped me understand some additional items that could be put into the queries.

Here is our query from day 2, redone and simplified. The query looks for human proteins that are:

  • Involved in generating or regulating insulin response.
  • Implicated in causing diabetes.

We retrieve the name of the proteins, along with the associated gene, known interacting proteins, and a Gene Ontology (GO) description of the protein function.
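
The actual query lives in the linked code; as a sketch of what such a SPARQL query looks like, here is an illustrative version assembled as a Python string. The `ex:` prefix and all predicate names are invented; the real BioGateway endpoint defines its own vocabulary:

```python
# illustrative only -- not the real BioGateway vocabulary
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/bio#>
SELECT DISTINCT ?name ?gene ?interactor ?go_desc
WHERE {
    ?protein rdfs:label ?name ;
             ex:encodedBy ?gene ;
             ex:interactsWith ?interactor ;
             ex:goAnnotation ?go ;
             ex:diseaseDescription ?disease .
    ?go rdfs:comment ?go_desc .
    FILTER regex(?go_desc, "insulin", "i")
    FILTER regex(?disease, "diabetes", "i")
}
"""
print(query)
```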

Here is a second query that looks at a different type of retrieval task. With a known protein, what papers should I look at to start understanding its function? The following query searches for your known protein name and returns references to the primary literature in PubMed.

To further automate this, the journal IDs can be used to automate retrieval of the paper details using Biopython and the Entrez interface:
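
Biopython’s Bio.Entrez module handles this step; as a dependency-free sketch, here is the underlying NCBI E-utilities efetch request being built with the standard library (the real code would fetch this URL, or call Bio.Entrez.efetch directly):

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def pubmed_fetch_url(pmids, rettype="abstract"):
    """Build the E-utilities efetch URL for a list of PubMed IDs."""
    params = urlencode({"db": "pubmed",
                        "id": ",".join(str(p) for p in pmids),
                        "rettype": rettype,
                        "retmode": "text"})
    return "%s?%s" % (EFETCH, params)

# the PubMed IDs here are placeholders for those returned by the query
print(pubmed_fetch_url(["19304878", "20003500"]))
```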

BioHackathon 2010: Day 3 — Fish, interoperating and data retrieval

Day 3 of BioHackathon 2010 kicked off early in the morning; are they really forcing poor overworked coders to wake up at 4am to start work? Well, not to work, but rather to take an early morning trip to the Tsukiji Fish Market. The 4:30am Japanese trains are comfortingly similar to early morning train rides everywhere: the passengers are a mix of girls who’ve been up all night drinking, overworked businessmen who will fall behind if they’re not on the first train of the day, and tourists trying to get to the fish market in time for the tuna auction.

The fish market is an unbelievably busy storm of trucks, forklifts and styrofoam. If you can avoid getting run over or yelled at in Japanese, you can make it through to the early morning fish auction, where restaurant providers grade and bid on massive tuna, sharks and other assorted giant fish. The manic bidding results in prices of $5000 for the lucky recipient of a massive amount of sushi-grade tuna. I still have no idea how they know who won the auction and how much they bid, but everyone seems satisfied with their purchases.

Following that, we wandered through the sea of squid, orange eyed fish, seaweed with fish eggs, and other strange things from the ocean. Luckily, Toshiaki and Atsuko kept us oriented in the manic maze until we made it to a row of fabulous sushi restaurants. Our tiny place had amazing tuna bowls; don’t let anyone tell you that raw fish isn’t a breakfast food.



In the post-fish morning discussion, Francois led us through an introduction to SPARQL queries. SPARQL is an SQL-like syntax used to query RDF stores, and was the basis of the work I discussed yesterday. The general idea behind SPARQL is faceted querying based on properties of the various objects. Facet browsers are a nice way to navigate and explore RDF data. A nice example of a powerful query generator over triple stores is the one in Freebase.


After lunch several data providers and manipulators, including folks from Galaxy, BioMart, and InterMine, got together to discuss interchange of metadata and interoperability. The major issues are:

  • Identifiers: Are we sure we are using the same ID type to refer to a gene or other biological thing?
  • Namespaces: How do you name different data associated with genes?
  • Files: Do we know what file format and flavor of file format we are working with?
  • Genome version: Reconciling various build names (Ensembl name, NCBI name, UCSC name, model organism names)

How do we resolve these issues? The answer is ontologies, but do ontologies already exist to solve these problems? If so, having providers adopt them could lead to some de facto standards. However, an issue is that many of these tools do not actually produce the data; they just redistribute them in more useful ways. Could data providers focus more strongly on semantics?

The overall practical conclusion is to establish a set of unique identifiers to use in tabular data as column headers. This would allow automated reasoning in intersecting tables from multiple sources. For instance, if you have a column of Ensembl identifiers, it would be useful to name it consistently. We need to establish a set of standard names for these common cases.

Our thoughts on establishing these names: first dump a set of current column names from BioMart and InterMine, intersect those, and pull out the most useful ones. Then we would need to find some authority to bless them and practically be sure they are available and used in data-providing tools.
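
As a toy sketch of the idea (the canonical vocabulary and header mappings here are invented, not a proposed standard), normalizing provider-specific headers to shared names is what makes tables joinable automatically:

```python
# invented mapping from provider-specific headers to canonical names
CANONICAL = {
    "Ensembl Gene ID": "ensembl.gene",          # BioMart-style header
    "Gene.primaryIdentifier": "ensembl.gene",   # InterMine-style path
}

def normalize_headers(headers):
    """Rewrite known headers to canonical names; leave the rest alone."""
    return [CANONICAL.get(h, h) for h in headers]

biomart = normalize_headers(["Ensembl Gene ID", "Chromosome"])
intermine = normalize_headers(["Gene.primaryIdentifier", "Gene.symbol"])
# shared canonical columns identify how the two tables can be joined
shared = set(biomart) & set(intermine)
print(shared)
```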


Following yesterday’s work on providing a Python client interface for BioGateway’s SPARQL query server, the next step is to try and generalize the API for multiple servers providing similar information. This can serve as a basis for an interface that:

  • Helps users build useful queries without having to understand the underlying data structures.
  • Returns results in a consistent tabular fashion.

Another really useful tool represented here is modMine, which is an InterMine interface to a whole ton of awesome raw data from the modEncode project for C. elegans and Drosophila. The InterMine web services interface allows building complex queries with an XML syntax; it’s a very similar problem to building up SPARQL queries. Inspired by the current Perl API for accessing the database, this implementation takes a slightly different tack on building a query. See the full code on GitHub.

Here we start with a query builder, and then define some common filtering operations, taking care of the query specific magic behind the scenes. This is an improvement over the more raw interface built yesterday since it compartmentalizes the query simplifying logic in a single class and makes the actual query building code easier to follow:
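
The real code is linked above on GitHub; as an illustrative sketch of the pattern (the paths, operators and method names here are my own, not the actual API), a builder can hide the query-specific magic behind filter methods and emit InterMine-style query XML:

```python
import xml.etree.ElementTree as ET

class QueryBuilder:
    """Sketch of a builder emitting InterMine-style query XML."""

    def __init__(self, root_class):
        self._query = ET.Element("query", model="genomic",
                                 view="", constraintLogic="and")
        self._root = root_class
        self._views = []

    def add_view(self, *paths):
        """Add output columns relative to the root class."""
        self._views.extend("%s.%s" % (self._root, p) for p in paths)
        return self

    def filter_organism(self, name):
        # encapsulate the "magic" of knowing where organism names live
        ET.SubElement(self._query, "constraint",
                      path="%s.organism.name" % self._root,
                      op="=", value=name)
        return self

    def to_xml(self):
        self._query.set("view", " ".join(self._views))
        return ET.tostring(self._query, encoding="unicode")

xml_query = (QueryBuilder("Gene")
             .add_view("primaryIdentifier", "symbol")
             .filter_organism("Caenorhabditis elegans")
             .to_xml())
print(xml_query)
```

The chained calls keep the actual query-building code short and readable, which is the improvement described above.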

The next step is to apply this to the BioGateway interface developed yesterday and expand the two interfaces to include additional query types.

BioHackathon 2010: Day 2 — Python SPARQL query builder

BioHackathon day two started off with a walk through the University of Tokyo campus. It’s a beautiful place with a combination of European style college buildings and isolated ponds in the shape of Japanese kanji characters. We ended up at the Database Center of Life Sciences (DBCLS) to start actually doing some coding. Since things normally break up into smaller groups at this point, my thoughts will reflect the things that I ended up working on but I’ll try to capture as much of the general work being done as possible.


The morning kicked off with a viewing of Tim Berners-Lee’s TED talk on linked data and the semantic web. It’s an inspirational talk for anyone interested in being better at sharing and organizing data on the web.

Francois Belleau from Bio2RDF followed that talk up with his own about the basics of RDF representation of data. The relevant URLs are bookmarked from his delicious account.

Some useful tools are:

  • Tabulator is a semantic web browsing tool that you can use as a Firefox extension. It’s a way to test the RDF you produce and be sure it is formed properly.

  • Virtuoso is an RDF triple store recommended by several RDF providers. It can provide a SPARQL endpoint for querying the database and retrieving results.


Today several of the OpenBio folks from Biopython and BioRuby discussed plans for providing libraries to make querying and providing semantic resources easier. The grand plan is to provide KEGG data via a SPARQL query with BioRuby and use a Biopython client for retrieval.

To build towards this goal, I started work on a generic query builder interface from Python. The idea is to provide a programmer’s API to query semantic resources that does not require knowing about all of the underlying details of RDF and SPARQL. This initial version uses the BioGateway SPARQL query endpoint, which provides semantic query access to GO terms and SwissProt data.

The full code is available from GitHub. It uses SPARQLwrapper to do the work of accessing the query server, and provides a Python API to help determine what query to build. The returned result is a table-like object provided as a numpy record array.

Building the query involves passing both retrieval and select objects to a builder. In the example below, we search human proteins for those with GO annotations related to insulin, and disease descriptions containing references to diabetes. We retrieve both protein names, and the names of proteins interacting with them:
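
The actual API is in the GitHub code; to show the shape of passing retrieval and select objects to a builder, here is a self-contained sketch with stand-in class names (none of these are the real library's names):

```python
class Select:
    """Stand-in for a filter: attribute must contain a substring."""
    def __init__(self, attribute, contains):
        self.attribute, self.contains = attribute, contains

class Retrieve:
    """Stand-in for a retrieval: attributes to return as columns."""
    def __init__(self, *attributes):
        self.attributes = list(attributes)

class QueryBuilder:
    def __init__(self, organism):
        self.organism, self.selects, self.retrievals = organism, [], []

    def add(self, item):
        (self.selects if isinstance(item, Select)
         else self.retrievals).append(item)
        return self

    def describe(self):
        """Summarize the query that would be sent to the endpoint."""
        wants = [a for r in self.retrievals for a in r.attributes]
        filters = ["%s contains '%s'" % (s.attribute, s.contains)
                   for s in self.selects]
        return "retrieve %s where %s" % (", ".join(wants),
                                         " and ".join(filters))

builder = (QueryBuilder("human")
           .add(Select("GO_term", "insulin"))
           .add(Select("disease_description", "diabetes"))
           .add(Retrieve("protein_name", "interactor_name")))
print(builder.describe())
```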

By avoiding exposing any of the underlying details to the library user, this helps provide focus on the items of interest. The next steps are to determine good ways to generalize this style of building to a wider range of queries, and to test it across multiple search providers.

BioHackathon 2010: Day 1

These are some notes and thoughts from day 1 of BioHackathon 2010 in Tokyo, Japan.

Still feeling a bit strange from jet lag and 13 hour flights but up and ready for BioHackathon 2010. The day started off with a tour of the wonderful Tokyo transit system during rush hour to head to the CBRC, which is located across town in Tokyo; it’s hard to be more specific, since I’m not sure I actually understand where we went. But it was a beautiful trip over the water on a sunny morning; it beats an underground T ride through damp dark tunnels in Boston.

So we made it up to the 11th floor of the Computational Biology Research Center (CBRC) for a day of talks and discussions to kick off the coding session. The idea is to introduce various existing tools and build consensus around useful things to work on. Here are notes from the first day of presentations.

Toshiaki Katayama — Introduction

Hackathon generously sponsored by: Database Center of Life Sciences (DBCLS) and the National Institute of Advanced Industrial Science and Technology (AIST)

Overall Goals

  • Learning Semantic Web — OWL, RDF, SPARQL. What works best?
    • RDF (Resource Description Framework): subject – predicate – object; referred to as a triple.
    • SPARQL — query language to provide directed search
    • OWL — Web Ontology Language
  • Triple stores — how should we store and make data available?
  • Open Bio* libraries — accessing RDF tools and SPARQL endpoints
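
The triple model above fits in a few lines of Python: a triple store is just a set of 3-tuples, and SPARQL-style querying is pattern matching with wildcards (`None` here). The data is invented for illustration:

```python
triples = {
    ("P53_HUMAN", "encoded_by", "TP53"),
    ("P53_HUMAN", "has_function", "tumor suppression"),
    ("TP53", "located_on", "chromosome 17"),
}

def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None matches anything."""
    return [(ts, tp, to) for ts, tp, to in store
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(match(triples, s="P53_HUMAN", p="encoded_by"))
```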

Erick Antezana — Semantic Systems Biology

Why? – Lots of data and a lack of structure make it difficult to analyze

Motivation – Problem to work on: cell cycle and all processes in Gene Ontology


  1. Develop a knowledge representation. Need specific terms and meanings: an ontology. This is a controlled vocabulary of biological terms and their relations.
  2. Provides a Semantic Web: machine understandable content. This allows moving beyond keyword search to complex query formulation.

Examples: the Cell Cycle Ontology (CCO) — the what, where, and when of cell cycle processes. Available in two formats: Open Biomedical Ontologies (OBO) and OWL. CCO is exported as RDF, and can be queried using SPARQL.

BioGateway: builds on cell cycle to also include all processes in the Gene Ontology. Goal is to build complex queries over many organisms supported by GO.

Systems Biology cycle: start with mathematical model of a biological system, simulate the system and generate hypotheses, do biological wet lab experiments, analyze the data and then feed back in to adjust the model. An iterative process.

BioGateway supports this approach. It integrates several resources into an RDF database and triple store. The web page provides proposed SPARQL queries that can then be edited.

Matthias Samwald — high level representation of the semantic web

aTag — web program allowing you to select text and then annotate it with semantic ontology terms. It is then available as RDF from a web page for automated discovery by programs. The idea is to read a paper and pull out facts into a format that can be queried. This is a step beyond blogging about a paper: you present both your interpretation of the data and a structured way for programs to query it.

For you to do: represent your data in a way that is compatible with aTags.

Thomas Kappler / Jerven Bolleman — UniProt in RDF

What is UniProt?

  • UniProtKB:
    • SwissProt — manually annotated and curated protein sequences
    • TrEMBL — computationally analyzed proteins
  • UniRef: clusters of protein sequences (100%, 90%, 50%)
  • UniParc: archive of all protein sequences

Experience with RDF:

  • All data available as RDF, including things like Taxonomy
  • Contains cross refs to many other databases
  • Migration process to move from flat file and XML to RDF
  • 85% of searches are simple keyword searches requiring full text indexing
  • SPARQL queries available at public endpoints:

Francois Belleau — Bio2RDF

Bio2RDF applies semantic web rules to integrate bioinformatics data:

Federated search across all databases:

Rules for linked data, from Tim Berners-Lee:

  • Use URIs as names, with HTTP so people can look them up
  • Provide useful information for names
  • Include links to other URIs, allowing discovery

Propose a new idea: cognoscope, based on mash ups. Build a specific database of the items you are interested in, then query that. Break up a SPARQL query into parts, and then submit the query of each part to the right workflow node. Based on Taverna: get each result, and put it in a triple store. Then query that triple store.

Get cognoscope by searching MyExperiment.

Heiko Horn — Reflect: text mining in semantic web

Give users a way to get semantic data from papers or web pages. Current journals do not provide semantic information, since the incentives are not there for publishers or authors; it takes more time and money. Reflect identifies chemical and protein names in a web page and provides additional information. Backed with 7.4 million chemicals and 2.6 million proteins.

New functionality: can add or remove names in a document that should be reflected. Helps fix manual problems identifying names of interest.

Has API services: given a document, you can get back the tagged HTML document (GetHTML) or an XML of found names (GetEntities). Supports REST and SOAP interfaces.

Practically, the document is searched for an organism, and this is used to identify relevant chemicals and proteins. You can also specify the organism if it is not mentioned, which helps reduce false positives.
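
As a sketch of calling a REST method like GetEntities, here is the request being built with the standard library. The base URL and parameter names are assumptions (the method name comes from the talk), so the request is only constructed, not sent:

```python
from urllib.parse import urlencode

REFLECT_BASE = "http://reflect.ws/REST"  # assumed base URL

def get_entities_request(document_url, organism=None):
    """Build (but do not send) a GetEntities request for a document."""
    params = {"document": document_url}
    if organism is not None:
        # specifying the organism helps reduce false positives
        params["organism"] = organism
    return "%s/GetEntities?%s" % (REFLECT_BASE, urlencode(params))

print(get_entities_request("http://example.org/paper.html", organism="9606"))
```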

Tetsuro Toyoda — RIKEN SciNeS

SciNeS is bioinformatics infrastructure in Japan. The goal is to fill the gap between database integrators and scientists. Scientists can create databases, make them accessible to collaborators, and then make the data publicly available at the time of publication. This helps bridge multiple national projects on different genomes. It provides a web interface to create and manage data and collaborators.

Semantic-JSON makes data available in several programming languages; a good point to query and retrieve semantic data from the databases.

Mark Wilkinson — SADI (Semantic Automated Discovery and Integration)

Why Semantic Web? Relational database model does not fit knowledge-based problems, because knowledge, data and schemas are constantly changing:

First step: use URIs instead of numeric primary keys.

Goal: integrating web services and semantic web, without inventing any new technologies or standards.

Web services create implicit biological relationships between objects. For instance: identifier -> has_sequence -> GATC. SADI makes those relationships explicit.

Next step: add in ontologies. Use OWL to collapse complicated relationships into a simpler query. This is an XML description of the information that goes into the relationship; these will be defined by experts and can then be shared for others to use.

Andrea Splendiani — Visualization and analysis of biological networks

RDFScape — provides a Cytoscape interface for interacting with RDF Semantic Web data. Allows interactive viewing of items and connections, discovering relationships. Queries are supported in a graphical manner, without requiring SPARQL. You can customize the colors and overall structure of the graph to emphasize elements of interest. Finally, it can make inferences based on existing relationships.

Ondex — Generic analysis tools for the semantic web.

Toshiaki Katayama — TogoWS and Open Bio* libraries

TogoDB allows users to take data in CSV and make it available as a REST/SOAP service in TogoWS. The next step would be to use this as an RDF/SPARQL provider.