Links: science communication, statistics, large data analysis

A long overdue set of links and quick thoughts:

Communicating research


Data analysis

  • Revolutions has advice for dealing with large data sets from the 2010 Workshop on Algorithms for Modern Massive Data Sets.

  • The larry package for manipulating tables in Python. This uses NumPy under the covers and is similar to dealing with data.frames in R.

  • Will describes how callbacks can drive an analysis pipeline. As analysis workflows get more complicated, your code can become a fragile mess of special cases. Here he passes functions through a standard runner to generalize and abstract the process.
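Will's callback-runner idea can be sketched in a few lines (illustrative names, not his actual code): each analysis step is a plain function that takes and returns the working data, and a generic runner chains them in order.

```python
# A sketch of the callback-runner pattern: steps are plain functions,
# and the runner threads the data through them one after another.
def run_pipeline(data, steps):
    for step in steps:
        data = step(data)
    return data

def trim(reads):
    # hypothetical step: keep the first five bases of each read
    return [r[:5] for r in reads]

def dedupe(reads):
    # hypothetical step: drop duplicate reads
    return sorted(set(reads))

result = run_pipeline(["AAAAATT", "AAAAAGG", "CCCCCNN"], [trim, dedupe])
# result == ["AAAAA", "CCCCC"]
```

Special cases then become new small step functions rather than branches scattered through one script.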




Links: R code, clustering gotchas, visualization example and RNAi screening

Kicking off this roundup of blog posts of interest from the last few weeks, Pietro discusses how to improve the writing of technical blog posts: be concise, open posts with something interesting, and present your information as a story. He gives lots of nice examples from popular technical blogs.

  • R code examples are always interesting. Jeremy provides code to trim fastq sequencing reads using the Bioconductor ShortRead package. If you’re a self-taught R coder like myself, Chris’ summary of the R type system will be useful for untangling vectors and lists in your mind. He also describes how to use the reshape package to do pivot tables in R, transposing two-category data from a table into a matrix of values.

  • As a reminder that you should always be rethinking your data sources and analysis methods, Lars digs into an issue clustering proteins using Markov clustering and discovers a case where unconnected nodes are clustered together due to having many shared edges. Following up, he proposes a fix and demonstrates the issue in the wild in both ortholog and protein complex datasets.

  • Visualization and analysis examples are always good sources of inspiration for future projects. FlowingData has a visualization challenge to take a crossing line graph and make it easier to read; my favorite from the comments separated the graphs and provided a reference line. Another nice source of charts and comparisons is Juice Analytics’ analysis of survey results.

  • Rajarshi’s presentation on high throughput RNAi screens at the NIH is a great resource of techniques, approaches and high level questions. It is well worth reviewing both for the thoughtful approach and for tips and tricks.

  • Nico summarizes visualization tools for large graphs with the goal of dealing with RDF graphs and ontologies. Several of these tools are also useful for phylogenetic taxonomy work.
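The clustering gotcha Lars describes can be seen in a toy graph (made-up data, not his ortholog or protein complex sets): two nodes that are never directly connected but share all their neighbors.

```python
# Toy graph showing the failure mode: u and v have no edge between
# them, yet they share every neighbor, so flow-based methods like
# Markov clustering can merge them into one cluster anyway.
edges = {("u", n) for n in "abcd"} | {("v", n) for n in "abcd"}

def neighbors(node):
    return ({b for a, b in edges if a == node}
            | {a for a, b in edges if b == node})

shared = neighbors("u") & neighbors("v")
# shared == {"a", "b", "c", "d"}, while ("u", "v") is not an edge
```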

Link potpourri: Large file indexing and analysis, open science and visualization

A roundup of interesting links from my feed reader:

  • Brent provides two Python modules to deal with large-scale genomic data. The first indexes a matrix file for rapid cell lookup, while the second indexes a custom flat file format. While it pains me to see GFF reimplemented in a different format, I do understand the retrieval issues with nested GFF3. Maybe what is needed here is a Flat-like implementation that sits on top of standard GFF files. He also passes along a pointer to GenomeTools, which contains a number of C bioinformatics libraries with an available Python interface.

  • Amund Tveit at Atbrox describes parallelizing a classifier using MapReduce on Hadoop. They use Python with Hadoop streaming to split the computationally intensive matrix math over multiple nodes as the map step, combining the results in the reducer. This includes a nice discussion of ways to further speed up processing, and is the start of the Snabler project, which aims to collect parallel algorithms for Hadoop in Python.

  • Daniel Lemire persuasively argues for making your research, data and code open sourced. These are some nice points to have in the back of your head when discussing the issue with researchers who are worried about being scooped. The comment debate is also a good read, with legitimate concerns by folks navigating the world of academic incentives. Publish or perish; when can well written critiqued blog posts start counting as publishing?

  • Speaking of open science, I have been enjoying following Sean Eddy’s posts on HMMER3 development; it’s inspiring to see brilliant scientists working on widely used software candidly talking about the pros and cons of the software. The latest news is that the HMMER 3.0 release candidate is available.

  • UsabilityPost reviews the new book The Laws of Simplicity in the context of software design:

    • Reduce interfaces as much as possible.
    • Organize related items together into groups to differentiate similar functionality.
    • Make things quicker to do; fast seems simpler.
  • Seth Roberts discusses how exploratory and confirmatory statistical analyses work together. Things considered exploratory, like graphing, are actually also critical components of more complex confirmatory statistical tests.

  • Getting Genetics Done provides some nice visualization pointers:

    • LocusZoom is a web toolkit for plotting linkage disequilibrium, recombination rates and SNPs in the context of gene regions. The code is apparently written in R and Python, but I couldn’t find a download.
    • Plotting contingency tables with ggplot2 in R. Contingency tables examine the relationship between two categorical variables, and the plots help reveal trends. In the example presented the categories are pre-ordered, but a general trick is ordering them so the trends become apparent.
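Brent's byte-offset trick for indexing flat files can be sketched simply (a simplification, not his actual modules; the file format here is made up): record each line's starting offset once, then fetch any record with a single seek instead of a scan.

```python
# Sketch of byte-offset indexing: one pass builds a key -> offset map,
# after which any record is retrieved with seek plus one readline.
import io

def build_index(fh):
    """Map the first tab-separated field of each line to its offset."""
    index = {}
    offset = 0
    for line in fh:
        key = line.split("\t", 1)[0]
        index[key] = offset
        offset += len(line)
    return index

def fetch(fh, index, key):
    """Jump straight to a record without scanning the file."""
    fh.seek(index[key])
    return fh.readline().rstrip("\n")

flat = io.StringIO("gene1\t10\ngene2\t20\ngene3\t30\n")
index = build_index(flat)
# fetch(flat, index, "gene2") returns "gene2\t20"
```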
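The Hadoop streaming pattern from the Atbrox post boils down to a mapper emitting tab-separated key/value lines and a reducer combining values that arrive grouped by key. A hedged sketch, with a word count standing in for their matrix math:

```python
# Hadoop streaming in miniature: mapper emits "key\t1" lines, the
# framework sorts them, and the reducer sums each key's group.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))

# sorted() stands in for Hadoop's shuffle/sort between the two phases
counts = list(reducer(sorted(mapper(["to be or not to be"]))))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

In a real job each function would read stdin and write stdout on its own node; the local sort here plays the role of Hadoop's shuffle phase.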

Potpourri: borrowing and reusing, stats, key-value stores and testing

A collection of links that tackle using methods from outside your field of study, building a cohesive community, statistical techniques, distributed programming and testing:

Visualization examples: presenting data

Several nice visualization examples have been collecting in my feed reader.

Trust in Bioinformatics

Rajarshi summarizes a recent paper in Bioinformatics about trusting
bioinformatics results:

…paper by Ann Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue – “fishing for significance” and “publication bias.”
As I have been reading the microarray literature recently to help me with RNAi screening data, I have seen the problem firsthand. There are hundreds of papers on normalization techniques and gene selection methods. And each one claims to be better than the others. But in most cases, the improvements seem incremental. Is the difference really significant? It’s not always clear.

The second part is very relevant to the development of good open source tools. We need our scientific rewards system to encourage building on prior work.

Related to trust, Clay Shirky has a nice post regarding algorithmic
authority. It covers the point at which you trust the results from
an algorithm to be correct, relating it to our trust systems in

Code examples — MapReduce on Amazon and colored country maps

Posts with working code examples rock:

* Using Elastic MapReduce to access SimpleDB on Amazon Web Services.
This uses the boto Python library to access the database, and a
Ruby client configured with JSON to run the example on Hadoop:…

* Making choropleth maps: plotting out maps with metrics organized
by state or region. This is a useful tactic for any sort
of spatially oriented data. Flowing Data has a Python example:…

and folks at Revolutions take up the challenge in R:
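The core step shared by both choropleth examples can be sketched like so (hypothetical breakpoints, palette, and rates): bin each region's metric into a color, which then drives the fill of that region on the map.

```python
# Bin a per-state metric into a small color palette; painting the
# SVG regions with these colors is what makes the map a choropleth.
def color_for(value, breaks, palette):
    """Return the palette color for the first break the value falls under."""
    for cutoff, color in zip(breaks, palette):
        if value <= cutoff:
            return color
    return palette[-1]

palette = ["#deebf7", "#9ecae1", "#3182bd"]  # light to dark blues
breaks = [10, 20]                            # two cutoffs, three bins
state_rates = {"MA": 4.2, "TX": 14.8, "CA": 27.5}  # made-up metrics
state_colors = {state: color_for(rate, breaks, palette)
                for state, rate in state_rates.items()}
# MA gets the lightest shade, TX the middle, CA the darkest
```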