A roundup of interesting links from my feed reader:
Brent provides two Python modules to deal with large-scale genomic data. The first indexes a matrix file for rapid cell lookup, while the second indexes a custom flat file format. While it pains me to see GFF reimplemented in a different format, I do understand the retrieval issues with nested GFF3. Maybe what is needed here is a Flat-like implementation that sits on top of standard GFF files. He also passes along a pointer to GenomeTools, which contains a number of C bioinformatics libraries with an available Python interface.
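To make the indexing idea concrete, here is a minimal sketch of byte-offset indexing over a tab-delimited flat file. This is not Brent's code; the function names (`build_offset_index`, `fetch_record`) and the toy file layout are assumptions made up for illustration.

```python
import io


def build_offset_index(handle, key_column=0, sep=b"\t"):
    """Map each record's key to the byte offset where its line starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        # Skip comment/header lines and blank lines.
        if line.startswith(b"#") or not line.strip():
            continue
        key = line.rstrip(b"\r\n").split(sep)[key_column].decode()
        index[key] = offset
    return index


def fetch_record(handle, index, key, sep="\t"):
    """Seek straight to one record instead of scanning the whole file."""
    handle.seek(index[key])
    return handle.readline().rstrip(b"\r\n").decode().split(sep)


# Hypothetical toy data standing in for a large file opened in binary mode.
data = (
    b"#id\tchrom\tstart\tend\n"
    b"gene1\tchr1\t100\t500\n"
    b"gene2\tchr1\t900\t1500\n"
    b"gene3\tchr2\t50\t700\n"
)
handle = io.BytesIO(data)
index = build_offset_index(handle)
record = fetch_record(handle, index, "gene2")
print(record)
```

The index only stores one integer per record, so it stays small even when the file does not; a real implementation would also persist the index to disk rather than rebuild it each time.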
Amund Tveit at Atbrox describes parallelizing a classifier using MapReduce on Hadoop. They use Python with Hadoop streaming to split the computationally intensive matrix math over multiple nodes as the map step, combining the results in the reducer. This includes a nice discussion of ways to further speed up processing, and is the start of the Snabler project, which aims to collect parallel algorithms for Hadoop in Python.
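The general shape of that split can be sketched in plain Python. This is not the Atbrox code: it assumes, purely for illustration, that the expensive step is a Gram matrix (X transposed times X), which decomposes into per-row outer products that can be summed in any order. The `mapper` and `reducer` names and the `xtx` key are hypothetical, and the two "nodes" are simulated locally rather than run under Hadoop streaming.

```python
import json


def mapper(lines):
    """Emit one partial outer product per input row as 'key<TAB>json' text."""
    for line in lines:
        values = [float(v) for v in line.split(",")]
        outer = [[a * b for b in values] for a in values]
        yield "xtx\t" + json.dumps(outer)


def reducer(lines):
    """Sum the partial matrices element-wise to get the full product."""
    total = None
    for line in lines:
        _key, payload = line.split("\t", 1)
        partial = json.loads(payload)
        if total is None:
            total = partial
        else:
            total = [
                [t + p for t, p in zip(t_row, p_row)]
                for t_row, p_row in zip(total, partial)
            ]
    return total


# Three rows of a toy matrix, split across two simulated nodes.
rows = ["1,2", "3,4", "5,6"]
node_a = list(mapper(rows[:2]))
node_b = list(mapper(rows[2:]))
gram = reducer(node_a + node_b)
print(gram)
```

Under Hadoop streaming the same two functions would read from standard input and write to standard output, with the framework handling the shuffle between them.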
Daniel Lemire persuasively argues for making your research, data and code open. These are some nice points to have in the back of your head when discussing the issue with researchers who are worried about being scooped. The comment debate is also a good read, with legitimate concerns from folks navigating the world of academic incentives. Publish or perish; when can well-written, critiqued blog posts start counting as publishing?
Speaking of open science, I have been enjoying following Sean Eddy’s posts on HMMER3 development; it’s inspiring to see brilliant scientists working on widely used software candidly talking about the pros and cons of the software. The latest news is that the HMMER 3.0 release candidate is available.
UsabilityPost reviews the new book The Laws of Simplicity in the context of software design:
- Reduce interfaces as much as possible.
- Organize related items together into groups to differentiate similar functionality.
- Make things quicker to do; fast seems simpler.
Seth Roberts discusses his thoughts on how exploratory and confirmatory statistical analyses work together. Things that are considered exploratory, like graphing, are actually also critical components of more complex confirmatory statistical tests.
Getting Genetics Done provides some nice visualization pointers:
- LocusZoom is a web toolkit for plotting linkage disequilibrium, recombination rates and SNPs in the context of gene regions. The code is apparently written in R and Python, but I couldn’t find a download.
- Plotting contingency tables with ggplot2 in R. Contingency tables examine the relationships between two categorical variables, and the plots help reveal trends. In the example presented the categories are pre-ordered; in general, the trick is to order them so that the trends become apparent.
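As a rough illustration of the ordering point in the last item, here is a small Python sketch (the linked post uses R and ggplot2, so this is a stand-in, not its method). It builds a contingency table from paired categories and then reorders the rows by the share of each row falling in one column. The function names and the toy expression/status data are hypothetical.

```python
from collections import Counter


def contingency_table(pairs):
    """Count co-occurrences of two categorical variables."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def order_rows_by_share(rows, table, col_index):
    """Reorder rows by the fraction of each row falling in one column."""
    def share(i):
        total = sum(table[i])
        return table[i][col_index] / total if total else 0.0

    order = sorted(range(len(rows)), key=share)
    return [rows[i] for i in order], [table[i] for i in order]


# Hypothetical observations: (expression level, disease status).
pairs = (
    [("high", "case")] * 3 + [("high", "control")] * 1
    + [("low", "case")] * 1 + [("low", "control")] * 3
    + [("mid", "case")] * 2 + [("mid", "control")] * 2
)
rows, cols, table = contingency_table(pairs)
ordered_rows, ordered_table = order_rows_by_share(rows, table, cols.index("case"))
print(ordered_rows, ordered_table)
```

Alphabetical order puts the rows as high, low, mid, which hides the gradient; sorting by the case share recovers low, mid, high, and the same ordering can then be fed to whatever plotting tool you use.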