These are my notes on the afternoon talks from BOSC 2010.
Steffen Moeller — Community-driven computational biology with Debian and Taverna
Steffen describes the Debian Med initiative to provide Debian/Ubuntu packages for biological programs. How can this be generalized to cloud instances? Taverna provides the ability to expose tools as web services and avoid some of the burden of installing packages.
The final idea is shared public data made available on cloud images that would work on Eucalyptus. Generalized data is a really good idea, but I'm not sure about the technical aspects of providing images across providers.
Darin London — Dealing with the data deluge: what can the robotics community teach us?
Dealing with 50+ cell lines sequenced with multiple ChIP-seq antibodies. How to best manage this? Next gen data is very heterogeneous across time and types of data.
Can we think up any good ideas for dealing with this type of data by looking at things the robotics community has done? Behavior-based robots act via independent modules modeled after biological activity. Systems are fault tolerant since different modules can pick up when others fail to act. This can be parallelized since individual modules act autonomously instead of needing to be serialized.
One useful idea is to predict when problems might happen with running out of disk space or memory based on the system parameters.
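As a rough illustration of that prediction idea, the sketch below extrapolates disk usage linearly from a few readings. The function names and the straight-line model are my own assumptions, not something shown in the talk; a real agent might collect the readings with `shutil.disk_usage`.

```python
def predict_time_to_full(samples, capacity_bytes):
    """Estimate seconds until the disk fills, from (timestamp, used_bytes) samples.

    Hypothetical helper: draws a straight line through the first and last
    sample. Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    rate = (used1 - used0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return None
    return (capacity_bytes - used1) / rate


def should_pause(samples, capacity_bytes, job_seconds):
    """True when the disk is projected to fill before the job would finish."""
    remaining = predict_time_to_full(samples, capacity_bytes)
    return remaining is not None and remaining < job_seconds


readings = [(0, 100), (10, 200), (20, 300)]  # made-up readings
print(predict_time_to_full(readings, 1000))  # 70.0
```

An agent could call `should_pause` before launching a long job and hold off when it returns True.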
Developed a pipeline to generate data for ENCODE. Three types of agents: runner agents, processing agents launched by runner agents, and human agents. The task list is maintained in a Google spreadsheet. By adding tasks to the spreadsheet, humans can control the agents. Available as a Perl module on CPAN.
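To make the agent idea concrete, here is a minimal sketch in which a list of dictionaries stands in for the spreadsheet rows. The column names (`task`, `input`, `status`), the handler table, and the `gc_content` task are all hypothetical; the actual Perl module may be organized quite differently.

```python
def run_pending_tasks(task_rows, handlers):
    """Runner agent: start a processing agent for every pending row.

    task_rows stands in for the shared spreadsheet; each row is a dict
    with hypothetical 'task', 'input', and 'status' columns.
    """
    for row in task_rows:
        if row["status"] != "pending":
            continue  # already handled, possibly by another runner
        handler = handlers.get(row["task"])
        if handler is None:
            row["status"] = "unknown task"
            continue
        try:
            row["result"] = handler(row["input"])
            row["status"] = "done"
        except Exception as exc:  # one failed task must not stop the others
            row["status"] = "failed: %s" % exc
    return task_rows


def gc_content(sequence):
    """Toy processing agent: fraction of G and C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)


# A human agent "adds rows"; the runner picks them up on its next pass.
rows = [
    {"task": "gc_content", "input": "GGCC", "status": "pending"},
    {"task": "gc_content", "input": "", "status": "pending"},
]
run_pending_tasks(rows, {"gc_content": gc_content})
print([row["status"] for row in rows])
```

Because each row records its own status, a failed task stays visible in the task list without blocking the rest.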
Nyasha Chambwe — Goby framework
One issue with scaling and dealing with data is the proliferation of biological file formats. What are the desirable characteristics of file formats? Well specified, easy to parse, with support for compression and streaming. Developed new file formats for next gen data, along with a framework to analyze them.
Goby uses protocol buffers to provide a flexible and efficient mechanism for serializing. The data is defined as a message in a proto file. The file is split into chunks, and each chunk can be gzipped independently, which allows random access to individual chunks.
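The chunking idea can be sketched with the standard library alone. This is not Goby's actual on-disk layout: length-prefixed byte strings stand in for serialized protocol buffer messages, and the separate `(offset, length)` index is my own simplification.

```python
import gzip
import io
import struct


def write_chunked(records, chunk_size):
    """Compress records in independent gzip chunks.

    Returns (data, index), where index holds one (offset, length) pair
    per chunk. Each record is a bytes object.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(records), chunk_size):
        # Prefix every record with its length as a 4-byte big-endian integer.
        payload = b"".join(
            struct.pack(">I", len(r)) + r
            for r in records[start:start + chunk_size]
        )
        compressed = gzip.compress(payload)
        index.append((out.tell(), len(compressed)))
        out.write(compressed)
    return out.getvalue(), index


def read_chunk(data, index, chunk_number):
    """Decompress a single chunk without touching the others."""
    offset, length = index[chunk_number]
    payload = gzip.decompress(data[offset:offset + length])
    records, pos = [], 0
    while pos < len(payload):
        (size,) = struct.unpack_from(">I", payload, pos)
        pos += 4
        records.append(payload[pos:pos + size])
        pos += size
    return records


reads = [b"read1", b"read2", b"read3", b"read4", b"read5"]
data, index = write_chunked(reads, chunk_size=2)
print(read_chunk(data, index, 1))  # [b'read3', b'read4']
```

Since every chunk is a complete gzip stream, a reader only pays the decompression cost for the region it asks for.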
Demonstrate a full pipeline for RNA-seq analysis using Goby file formats.
Dana Robinson — BioHDF
Goals are to create a data model to describe data, a store to allow for efficient retrieval, and a toolkit for development.
BioHDF consists of a database schema in HDF for storing biological data, a library and C API (both still to come), and command-line tools similar to samtools. Data is stored hierarchically, organized by reads and alignments. The information stored is: reads, alignments, annotations, clusters of aligned reads, reference sequences and indexes. Additional user-specific data can be stored.
One exciting development being discussed on the samtools mailing list is switching the underlying representation of BAM to HDF and abstracting it behind a higher-level C API.
Jens Lichtenberg — Concurrent bioinformatics software for discovering genome-wide patterns
WordSeeker is a tool that does motif discovery: enumerate the word space using suffix and radix trees, score the motifs, cluster them based on word sizes, evaluate conservation using phastCons scores from UCSC, and look for biased distribution of motif locations.
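A toy version of the enumerate-and-score steps looks like the sketch below. It swaps the suffix and radix trees for a plain `Counter`, and the observed/expected ratio under independent base frequencies is an assumed scoring model; the notes do not say how WordSeeker actually scores words.

```python
from collections import Counter


def count_words(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def score_words(sequence, k):
    """Score each word as observed count / expected count.

    Expected counts assume independent bases drawn with the frequencies
    seen in the sequence itself (an assumption for illustration only).
    """
    n = len(sequence)
    base_freq = {base: count / n for base, count in Counter(sequence).items()}
    positions = n - k + 1
    scores = {}
    for word, observed in count_words(sequence, k).items():
        probability = 1.0
        for base in word:
            probability *= base_freq[base]
        scores[word] = observed / (probability * positions)
    return scores


print(count_words("ACGTACGT", 4)["ACGT"])  # 2
```

Words scoring well above 1 occur more often than the background model predicts, which makes them candidates for the later clustering and conservation steps.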
A scalable approach is necessary, so the enumeration of all words is parallelized. Scoring similarly needs frequent lookups. Both can be scaled via MPI for distributed memory processing or OpenMP for shared memory machines. Presented timing data for analysis on the Arabidopsis genome.
Chris Hemmerich — Automated Annotation of NGS Transcriptome Data using ISGA and Ergatis
Ergatis is a workflow management tool for running pipelines. Integrative Services for Genomic Analysis (ISGA) is a biologist’s tool for running and customizing Ergatis pipelines. It provides a graphical interface for setting up a pipeline and customizing input parameters. A specific transcriptome pipeline example is presented.
Mark Wilkinson — SADI
Mark discusses his semantic web solution for pulling together web services to make it easy to ask complex questions. The idea is to support the scientific method and discussion, where we have opinions and debate and are not necessarily 100% certain about what something means. The general notion is to create OWL ontologies that help define expressed hypotheses.
Aravind Venkatesan — Bio-Ontologies in Galaxy
ONTO-Toolkit is a collection of tools to manage ontologies represented in the OBO file format. It wraps ONTO-PERL, which provides a high-level API for querying ontologies. Two use cases:
Investigate the similarities between two different molecular functions. Look upstream of both and see how many of their ancestor terms are shared. Most specific common term can be used to assess this.
Identify overlapping annotations for a given pair of distinct biological process terms. Look for overlap between two distinct biological processes.
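Both use cases reduce to set operations on a term graph. The sketch below is plain Python rather than the ONTO-PERL API; the tiny ontology, the gene names, and the function names are made up for illustration.

```python
def ancestors(term, parents):
    """All ancestors of a term in a DAG given as {term: [parent terms]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def depth(term, parents):
    """Length of the longest path from a term up to a root."""
    direct = parents.get(term, [])
    return 0 if not direct else 1 + max(depth(p, parents) for p in direct)


def most_specific_common_ancestors(a, b, parents):
    """Shared ancestors of a and b that sit deepest in the graph."""
    shared = ancestors(a, parents) & ancestors(b, parents)
    if not shared:
        return set()
    deepest = max(depth(t, parents) for t in shared)
    return {t for t in shared if depth(t, parents) == deepest}


def overlapping_annotations(a, b, annotations):
    """Gene products annotated to both terms."""
    return annotations.get(a, set()) & annotations.get(b, set())


# Simplified toy ontology, not taken from a real OBO file.
parents = {
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}
print(most_specific_common_ancestors(
    "kinase activity", "hydrolase activity", parents))
# {'catalytic activity'}
```

The deeper the most specific common term sits, the more closely related the two functions are.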
Christian Zmasek — Connecting TOPSAN to computational analysis
The Open Protein Structure Annotation Network (TOPSAN). Structures are available in PDB, but there is very little annotation about them beyond the PDB titles. So TOPSAN provides a database for community annotation of proteins.
Most annotations are entered by humans, but structured data can also be provided in a simple format, TOPSAN Protein Syntax (TPS). This is an RDF triple of protein, predicate (homologous, encodedby, citation, memberof) and value.
Jianjiong Gao — Musite: Global Prediction of General and Kinase-Specific Phosphorylation Sites
Musite is an open source tool for protein phosphorylation prediction. Disordered regions typically contain phosphorylation sites, so it may also be useful for evaluating protein disorder.