Talk notes from the 2012 Bioinformatics Open Source Conference.
Dana Robinson: Using HDF5 to Work With Large Quantities of Biological Data
HDF5 is a structured binary file format and abstract data model for describing data. It is not client/server; it has a C interface plus other higher level interfaces. HDF5 has loads of technical advantages. One disadvantage is that querying is a bit more difficult since access is more low level. You write higher level APIs specific to your data, which brings speed advantages.
Aleksi Kallio: Large scale data management in Chipster 2 workflow environment
Chipster is an environment for biological data analysis aimed at non-computational users. Recent work reworked the architecture to handle large NGS data. It hides data handling on the server side from the user to provide a higher level interface. With NGS data, storing all the data becomes problematic, so data is only moved when needed. Data is stored in sessions, which provide quotas and management of disk space. Handles shared filesystems invisibly to the user.
Qingpeng Zhang: Khmer: A probabilistic approach for efficient counting of k-mers
Custom k-mer counting approach based on Bloom filters. Allows you to trade off false positives against memory. This makes the approach highly scalable to large datasets. Accuracy is related to the k-mer size and the number of unique k-mers at that size. Time usage of khmer is comparable to other approaches like Jellyfish, but the main advantages are memory efficiency and streaming.
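The memory versus false positive tradeoff can be illustrated with a small counting structure in the Bloom filter family. This is only a sketch under stated assumptions, not khmer's actual code: the class name `KmerCounter`, the `table_size` and `n_tables` parameters, and the choice of BLAKE2 hashing are all hypothetical choices made for illustration. Each k-mer increments one slot in each of several fixed-size tables, and its reported count is the minimum across those slots, so collisions can inflate a count but never reduce it.

```python
import hashlib


class KmerCounter:
    """Approximate k-mer counter (illustrative sketch, not khmer's code).

    Memory is fixed at n_tables * table_size counters, no matter how many
    distinct k-mers are seen. Smaller tables mean more collisions and
    therefore more over-counting.
    """

    def __init__(self, k, table_size, n_tables=4):
        self.k = k
        self.table_size = table_size
        self.tables = [[0] * table_size for _ in range(n_tables)]

    def _slots(self, kmer):
        # One independent hash per table, obtained by salting with the table index.
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(
                kmer.encode("ascii"), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.table_size

    def add(self, kmer):
        for i, slot in self._slots(kmer):
            self.tables[i][slot] += 1

    def count(self, kmer):
        # The minimum over all tables is an upper bound on the true count.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))

    def consume(self, sequence):
        # Streaming: each k-mer is hashed and discarded, never stored.
        for start in range(len(sequence) - self.k + 1):
            self.add(sequence[start:start + self.k])


counter = KmerCounter(k=4, table_size=1009)
counter.consume("ACGTACGTAC")
print(counter.count("ACGT"))  # at least 2: counts are never underestimated
```

Shrinking `table_size` reduces memory but raises the chance that unrelated k-mers share slots, which is the false positive side of the tradeoff.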
Seth Carbon: AmiGO 2: a document-oriented approach to ontology software and escaping the heartache of an SQL backend
The AmiGO browser displays gene ontology information. It retrieves basic key/value pairs about items and connections to other data. As data has expanded, the SQL backend has become difficult to scale. The solution thus far has been Solr, using a Lucene index to query documents. Decided to push additional information into Lucene, including complex stuff like hashes as JSON. Turns out to be a much better model for the underlying data. Downside is that you need to build additional software on top of the thin client.
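The idea of pushing hashes into the index as JSON can be sketched as follows. This is a minimal illustration, assuming a flat document with plain string fields; the field names (`id`, `label`, `parents_json`), the helper functions, and the `EX:` identifiers are all hypothetical and do not reflect AmiGO 2's actual schema. The nested structure is serialized into a single string field at index time and decoded by the client after retrieval.

```python
import json


def to_index_document(term_id, label, parents_by_relation):
    """Flatten a term into a document with only simple string fields.

    The nested hash is stored as one JSON string, so the index itself
    only ever sees flat key/value pairs.
    """
    return {
        "id": term_id,
        "label": label,
        "parents_json": json.dumps(parents_by_relation, sort_keys=True),
    }


def parents_of(doc, relation):
    """Client-side decoding of the JSON field after the document is fetched."""
    return json.loads(doc["parents_json"]).get(relation, [])


# Hypothetical term with placeholder identifiers.
doc = to_index_document(
    "EX:0003",
    "example term",
    {"is_a": ["EX:0001"], "part_of": ["EX:0002"]},
)
print(parents_of(doc, "is_a"))
```

The decoding step is the kind of additional software the notes mention: the index returns the string as-is, so the logic for interpreting it lives in the layer built on top of the client.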
Jens Lichtenberg: Discovery of motif-based regulatory signatures in whole genome methylation experiments
Software to detect regulatory elements in NGS data. Goal is to correlate multiple sources of NGS data: peak calling + RNA-seq + methylation. These feed into motif-discovery algorithms. Looking at hematopoietic stem cell differentiation in mouse. The framework is Perl-based and uses bedtools and MACS under the covers. The future goal is to rewrite it in C++ to parallelize and speed up the approach.
Philippe Rocca-Serra: The open source ISA metadata tracking framework: from data curation and management at the source, to the linked data universe
ISA is a metadata framework for describing experimental information in a structured way. Built a set of tools to allow people to create and edit metadata, producing valid ISA for describing the experiments. ISATab now has a plugin infrastructure for specialized experiments.
Newest work focuses on version control for distributed, decentralized groups of users. OntoMaton provides search and ontology tagging for Google Spreadsheets which helps make ontologies available to a wider variety of users.
ISATab is working on exporting to an RDF representation and OWL ontologies. Some issues involve gaps in OBO ontologies for representation.
Julie Klein: KUPKB: Sharing, Connecting and Exposing Kidney and Urinary Knowledge using RDF and OWL
Built a specialized ontology for kidney disease with domain experts. Difficulty is dealing with existing software. Provided a spreadsheet interface called Populous, which was easier for biologists to work with. Ended up with a SPARQL endpoint and built a web interface on top for biologists: KUPKB. Nice example of using RDF under the covers to answer interesting questions but exposing it in a way that lets biologists query and manipulate the data.
Sophia Cheng: eagle-i: development and expansion of a scientific resource discovery network
eagle-i connects researchers to resources to help them get work done. eagle-i provides open access to its data, and the software and ontologies are open source. Built using semantic web technologies under the covers. Provides downloads of all RDF. Federated architecture built around a Sesame RDF store with SPARQL and CRUD REST APIs on top. Can pull the available code stack from Subversion along with docs.