Link potpourri: Large file indexing and analysis, open science and visualization

A roundup of interesting links from my feed reader:

  • Brent provides two Python modules for dealing with large-scale genomic data. The first indexes a matrix file for rapid cell lookup, while the second indexes a custom flat file format. While it pains me to see GFF reimplemented in a different format, I do understand the retrieval issues with nested GFF3. Maybe what is needed here is a Flat-like implementation that sits on top of standard GFF files; a minimal sketch of this style of offset indexing follows the list below. He also passes along a pointer to GenomeTools, which contains a number of C bioinformatics libraries with an available Python interface.

  • Amund Tveit at Atbrox describes parallelizing a classifier using MapReduce on Hadoop. They use Python with Hadoop streaming to split the computationally intensive matrix math over multiple nodes in the map step, combining the results in the reducer; a toy example of the streaming pattern appears after this list. The post includes a nice discussion of ways to further speed up processing, and is the start of the Snabler project, which aims to collect parallel algorithms for Hadoop in Python.

  • Daniel Lemire persuasively argues for making your research, data and code open source. These are some nice points to have in the back of your head when discussing the issue with researchers who are worried about being scooped. The comment debate is also a good read, with legitimate concerns from folks navigating the world of academic incentives. Publish or perish; when can well-written, well-critiqued blog posts start counting as publishing?

  • Speaking of open science, I have been enjoying following Sean Eddy’s posts on HMMER3 development; it’s inspiring to see brilliant scientists behind widely used software candidly discussing its pros and cons. The latest news is that the HMMER 3.0 release candidate is available.

  • UsabilityPost reviews the new book The Laws of Simplicity in the context of software design:

    • Reduce interfaces as much as possible.
    • Organize related items into groups to differentiate similar functionality.
    • Make things quicker to do; fast seems simpler.
  • Seth Roberts discusses how exploratory and confirmatory statistical analyses work together. Things considered exploratory, like graphing, are actually also critical components of more complex confirmatory statistical tests.

  • Getting Genetics Done provides some nice visualization pointers:

    • LocusZoom is a web toolkit for plotting linkage disequilibrium, recombination rates and SNPs in the context of gene regions. The code is apparently written in R and Python, but I couldn’t find a download.
    • Plotting contingency tables with ggplot2 in R. Contingency tables examine the relationship between two categorical variables, and the plots help reveal trends. In the example presented the categories are pre-ordered, but a generally useful trick is ordering the categories so the trends become apparent; a rough Python equivalent is sketched below.
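
A few of the items above lend themselves to short sketches. First, the flat-file indexing idea: this is not Brent's implementation, just a minimal illustration of the underlying trick of recording byte offsets up front so a single cell can later be fetched with one seek. The tab-delimited format (a header of column names, then one row id per line) and the function names are assumptions for the example.

    def index_matrix(fname, sep="\t"):
        """Index a matrix file so any cell can be fetched with one seek.

        Assumed layout: first line is a header whose first field labels
        the row id column; every following line starts with a row id.
        """
        with open(fname) as in_handle:
            header = in_handle.readline().rstrip("\n").split(sep)
            col_pos = dict((name, i) for i, name in enumerate(header[1:]))
            row_offsets = {}
            offset = in_handle.tell()
            line = in_handle.readline()
            while line:
                row_offsets[line.split(sep, 1)[0]] = offset
                offset = in_handle.tell()
                line = in_handle.readline()
        return row_offsets, col_pos

    def get_cell(fname, row_offsets, col_pos, row, col, sep="\t"):
        """Fetch one cell by row and column name using the prebuilt index."""
        with open(fname) as in_handle:
            in_handle.seek(row_offsets[row])
            vals = in_handle.readline().rstrip("\n").split(sep)
        return float(vals[col_pos[col] + 1])

    # usage, with a hypothetical expression matrix:
    # row_offsets, col_pos = index_matrix("expression.tsv")
    # print(get_cell("expression.tsv", row_offsets, col_pos, "AT1G01010", "leaf"))

The one-time indexing pass only splits strings, so it runs at roughly file-reading speed; every lookup afterwards is a single seek and readline.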
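Second, the Hadoop streaming pattern from the Atbrox post. Streaming runs any executable that reads stdin and writes tab-delimited key/value pairs, so the mapper and reducer are plain Python functions. This toy version sums squared feature values per item as a stand-in for the real matrix math in their classifier; the input format and script name are invented for illustration.

    #!/usr/bin/env python
    """Toy Hadoop streaming job; the same script serves as mapper and reducer:

        -mapper "toy_score.py map" -reducer "toy_score.py reduce"
    """
    import sys
    from itertools import groupby

    def mapper():
        # input line: <item_id> TAB <comma separated feature values>
        # the expensive vector math happens here, spread across map tasks
        for line in sys.stdin:
            item_id, vals = line.rstrip("\n").split("\t")
            partial = sum(float(v) * float(v) for v in vals.split(","))
            print("%s\t%s" % (item_id, partial))

    def reducer():
        # Hadoop sorts mapper output by key, so lines for the same item
        # arrive together; combine the partial scores into a final total
        for item_id, lines in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
            total = sum(float(l.rstrip("\n").split("\t")[1]) for l in lines)
            print("%s\t%s" % (item_id, total))

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]]()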
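Finally, the contingency table trick. The Getting Genetics Done example uses ggplot2 in R, but ordering the categories translates to any plotting stack; here is a rough Python equivalent with pandas and matplotlib, using made-up data.

    import pandas as pd
    import matplotlib.pyplot as plt

    # hypothetical data: responses by age group
    df = pd.DataFrame({
        "age_group": ["18-29", "30-49", "50+"] * 3,
        "response": ["agree", "agree", "neutral",
                     "neutral", "agree", "disagree",
                     "agree", "neutral", "disagree"],
    })
    # fix an explicit ordering so the trend reads across the plot;
    # otherwise the categories fall back to alphabetical order
    df["response"] = pd.Categorical(
        df["response"], categories=["disagree", "neutral", "agree"], ordered=True)

    table = pd.crosstab(df["age_group"], df["response"])
    table.plot(kind="bar", stacked=True)
    plt.ylabel("count")
    plt.show()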

6 thoughts on “Link potpourri: Large file indexing and analysis, open science and visualization”

  1. hi brad, regarding gff in a different format, i agree, it shouldn’t be necessary, and in the flatfeature repository, there’s a fatfeature.Fat (no "L") class that relies directly on gff for cases where one needs all the sub-types and alternative splicings and such that the .gff format makes available. likewise, the .flat file could likely be replaced by a method that parses a .gff file into the Flat format in memory… for now, having a .flat file does simplify things. thanks for the mention.

  2. Brent;
     Your implementation is a great worked case of how the API should be defined to be useful. I like this method of building something that works how you want it and then iteratively improving it. I’ve been coding a general GFF parser in Python: http://github.com/chapmanb/bcbb/tree/master/gff/ and a useful addition would be indexing it for retrieval by ID and chromosome location like you described. I will try and have a go at this when I have a chance; maybe we can build up enough functionality to encourage more people to use GFF.
     Thanks,
     Brad

  3. brad, i/we could hook your parser up to fatfeature, which then allows the query by ID and chr and location (via binary search on location). fatfeature needs work on the API and general syntax, but yes, it would be nice if more people were using GFF. unfortunately, most of the files that claim to be gff3 are not valid and i have to manually edit them. even the TAIR 9 gff is not valid gff3, and it must be one of the most used annotation sets in plants. but, it beats SQL. also, it’s worth checking out the genometools gff stuff, they have a command-line gff3 validator and the ctypes api is not bad, i’ve been helping to add stuff like iterators and other pythonic sugar.

  4. Brent;
     That sounds like a really good plan. The general idea would be to rapidly pre-index a GFF file, using something like bx-python’s interval indexing format: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py
     Then provide a fatfeature-like query API on top of this index. Underneath, it would pull out the items of interest using the GFF parser and return them as general records and features. These are Biopython-style SeqRecords and SeqFeatures right now.
     Agreed about the GFF syntax issues; a lot of the code in the parser deals with GFF3/GTF/GFF2 irregularities so they are exposed identically. I’ll put TAIR9 GFF on my list of ones to look at. If you have good examples of problematic lines that especially annoyed you, let me know and we’ll add ’em to the test suite.
     Thanks again,
     Brad

  5. Brent;
     I implemented some code based on all that rambling in my last post. Here is a simple script I used against the TAIR9 GFF: http://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/access_gff_index.py
     This indexes the GFF file using bx-python’s intervals, and then allows queries via your get_features_in_region API. It returns Biopython SeqFeature objects with all of the gene -> transcript -> CDS/exon/UTR parts appropriately nested.
     It is pretty fast. The one-time indexing only requires reading and splitting the strings, so is as fast as using the standard csv module. From there the queries are instantaneous for reasonably sized regions.
     For other high level queries like via gene IDs or feature types, we would need a separate index. A SQLite table with IDs, types and file offsets might make sense.
     Brad
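
A stripped-down sketch of the indexing pattern discussed in the comments above, in plain Python: one pass over the GFF file records byte offsets keyed by chromosome and coordinates, and region queries then seek straight to the matching lines. The real script linked in comment 5 builds on bx-python's interval index and returns nested Biopython SeqFeatures; this version skips both to keep the idea visible, and the function names are invented.

    from collections import defaultdict

    def index_gff(fname):
        """One pass over a GFF file: chromosome -> sorted (start, end, offset).

        Only the coordinate columns are parsed, so building the index runs
        at roughly file-reading speed. Comment lines and anything without
        GFF coordinates (e.g. an embedded FASTA section) are skipped.
        """
        index = defaultdict(list)
        with open(fname) as in_handle:
            offset = in_handle.tell()
            line = in_handle.readline()
            while line:
                parts = line.split("\t")
                if not line.startswith("#") and len(parts) >= 5 \
                        and parts[3].isdigit():
                    index[parts[0]].append((int(parts[3]), int(parts[4]), offset))
                offset = in_handle.tell()
                line = in_handle.readline()
        for chrom in index:
            index[chrom].sort()
        return index

    def features_in_region(fname, index, chrom, start, end):
        """Return raw GFF lines overlapping chrom:start-end (1-based, inclusive).

        A real implementation would binary search the sorted intervals,
        as fatfeature and bx-python do, instead of scanning them.
        """
        hits = []
        with open(fname) as in_handle:
            for f_start, f_end, offset in index.get(chrom, []):
                if f_start <= end and f_end >= start:
                    in_handle.seek(offset)
                    hits.append(in_handle.readline().rstrip("\n"))
        return hits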
