Bioinformatics Open Source Conference 2013, day 2 afternoon: cloud computing, translational genomics and funding

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 afternoon sessions focusing on cloud computing, infrastructure, translational genomics and funding of community open source tools.

Previous notes:

Cloud and Genome scale computing

Towards Enabling Big Data and Federated Computing in the Cloud

Enis Afgan

Hadoop is a commonly used approach to distributing work with map-reduce. HTCondor scavenges cycles from idle machines. CloudMan organizes Amazon infrastructure and provides a management interface. The goal of the project is to combine these three and explore their use in biology. It provides a Hadoop-over-SGE integration component that spins up a Hadoop cluster with HDFS and master/worker nodes. Hadoop setup takes 2 minutes: a nice solution for running Hadoop on top of an existing cluster scheduler. HTCondor provides scaling out between local and cloud infrastructure, or between two cloud infrastructures. Currently there is a manual interchange process to connect the two clusters. Once connected you can submit jobs across both clusters, transferring data via HTCondor.

Making Elastic and Extensible Gene-centric Web Services

Chunlei Wu

Chunlei provides a set of web services to query and retrieve gene annotation information. The goal is to avoid the need to update and maintain local annotation data: annotation as a service. Data is updated weekly from external resources and placed into the MongoDB document database. It is exposed as a simple REST interface allowing query and retrieval via gene IDs. There is a Python API called mygene and a JavaScript autocomplete widget.
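
As a rough illustration of the service (not code shown in the talk), querying it through the mygene Python package looks roughly like this; the gene ID and requested fields are just examples:

```python
# Sketch of querying the gene annotation service with the mygene package.
# The gene ID and requested fields below are illustrative examples.
import mygene

mg = mygene.MyGeneInfo()

# Retrieve annotation for human CDK2 (Entrez gene ID 1017)
gene = mg.getgene("1017", fields="symbol,name,refseq.rna")
print(gene["symbol"], gene["name"])

# Query by symbol through the REST interface
hits = mg.query("symbol:cdk2", species="human")
for hit in hits["hits"]:
    print(hit["_id"], hit.get("symbol"))
```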

An update on the Seal Hadoop-based sequence processing toolbox

Luca Pireddu

Seal provides distribution of bioinformatics algorithms on Hadoop, motivated by the success Google has had scaling large data problems with this approach. The key idea: move to a different computing paradigm. Developed for the sequencing core at CRS4. Some tools implemented: Seqal – short read mapping with bwa; Demux – demultiplexing samples from multiplexed runs; RecabTable – recalibration of base quality equivalent to GATK’s CountCovariatesWalker; ReadSort – distributed sorting of read alignments. Galaxy wrappers are provided for the Seal tools, which is a bit tricky since Hadoop doesn’t follow the Galaxy model.

Open Source Configuration of Bioinformatics Infrastructure

John Chilton

John is tackling the problem of configuring complex applications. His work builds on top of CloudBioLinux, which makes heavy use of Fabric. Fabric doesn’t handle configuration management well: it’s not a goal of the project. Two examples of configuration management systems are Puppet and Chef. John extended CloudBioLinux to allow the use of Puppet modules and Chef cookbooks. A lightweight job runner tool called LWR sits on top of Galaxy using this. He is also working on integration with the Globus toolkit. John advocates creating a community called bioconfig around these ideas.
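
For context, a minimal Fabric (1.x) task in the style CloudBioLinux builds on might look like this; a toy sketch, not actual CloudBioLinux code, and the host and package names are invented:

```python
# Toy Fabric (1.x) task in the style CloudBioLinux builds on; the target
# host and package names are illustrative only.
from fabric.api import env, sudo, task

env.hosts = ["ubuntu@my-cloud-instance"]  # hypothetical target machine

@task
def install_aligners():
    """Install a couple of example alignment tools via apt."""
    sudo("apt-get update")
    sudo("apt-get install -y bwa bowtie")
```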

An Open Source Framework for Gene Prioritization

Hoan Nguyen, Vincent Walter

The SM2PH (Structural Mutation to Pathology Phenotypes in Human) project helps prioritize the most promising features associated with a biological question: processes, pathologies or networks. It involves a three-step process: building a model for training features, prioritizing these locally, and then prioritizing globally. They developed a tool called GEPETTO that handles prioritization. It is built in a modular manner to allow plugins and extension from the community, and integrates with the Galaxy framework via the tool shed. Prioritization modules: protein sequence alignment, evolutionary barcodes, genomic context, transcription data, protein-protein interactions, hereditary disease gene probability. Uses jBPM as a workflow engine. Applied GEPETTO prioritization to work on age-related macular degeneration (those scary eye pictures you see at the optometrist).

RAMPART: an automated de novo assembly pipeline

Daniel Mapleson

RAMPART provides an automated de-novo assembly pipeline as part of a core service. The motivation is that the TGAC core handles a heterogeneous input of data, so it needs to support multiple approaches and parameters. One difficulty is that it’s hard to assess which assembly is best. Some ideas: known genome length, most contiguous (N50), alignments of reads to the assembly and of the assembly to a reference. All of this is nicely wrapped up into a single tool that works across multiple assemblers and clusters. It is broken into stages of error correction, assembly with multiple approaches, a decision on which assembly to use, then an assembly improver. Builds on top of the EBI’s Conan workflow management application and provides an external tool API to interface with third party software.
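
As a reminder of what one of those contiguity metrics measures, N50 can be computed from contig lengths with a few lines of Python; a self-contained sketch, not RAMPART code:

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    cover at least half of the total assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    return 0

# Prints 10000: the 20kb and first 10kb contigs already cover half
# of the 45.15kb total.
print(n50([50, 100, 5000, 10000, 20000, 10000]))
```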

Flexible multi-omics data capture and integration tools for high-throughput biology

Joeri van der Velde

Molgenis provides a software generator and web interface for commandline tools based on a domain specific language. It provides customized front ends for a diverse set of tools. Nice software setup with continuous integration and deployment to 50 servers. The motivation is to understand genotype to phenotype with heterogeneous data inputs. The challenge is how to prepare the custom web interfaces when the data is multi-dimensional in terms of comparisons. They treat this as a big matrix of comparisons between subjects and traits. Shows nice plots displaying QTLs for C. elegans projects warehoused in Molgenis. The same approach works well across multiple organisms, such as Arabidopsis.
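
To make that matrix framing concrete, here is a toy sketch of the kind of subjects-by-traits comparison grid; all names and values are invented, this is not Molgenis code:

```python
# Toy subjects-by-traits matrix illustrating the kind of comparison grid
# Molgenis warehouses; all names and values here are invented.
import pandas as pd

traits = pd.DataFrame(
    {"body_length": [1.1, 0.9, 1.3], "lifespan_days": [18, 21, 16]},
    index=["subject_1", "subject_2", "subject_3"],
)
markers = pd.DataFrame(
    {"marker_A": [0, 1, 2], "marker_B": [2, 1, 0]},
    index=traits.index,
)

# Correlate every trait against every marker: one cell per comparison
comparisons = pd.DataFrame(
    {m: traits.corrwith(markers[m]) for m in markers.columns}
)
print(comparisons)
```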

Translational genomics

Understanding Cancer Genomes Using Galaxy

Jeremy Goecks

Jeremy’s research model: find computing challenges, invent software to handle them, and demonstrate usefulness via genomics. The focus of this talk is pancreatic cancer transcriptome analysis. Jeremy builds tools on top of Galaxy, adding new tools for variant calling, fusion detection and VCF manipulation. Jeremy shows a Galaxy workflow for transcriptome analysis. Advantages of Galaxy workflows: recomputable, human readable, importable, sharable and publishable in Galaxy pages. Uses the Cancer Cell Line Encyclopedia for comparisons. Now a more complex workflow with variants, gene expression and annotations to do targeted eQTL analysis. Custom visualizations provide the ability to extract partial sets of data, then publish the results of those views. Provides an API to plug in custom visualization tools. Shows a nice demo of recalling variants on only a single gene with adjusted parameters. Another tool does parameter sweeps and quickly shows how the output looks with different subsets of parameters.

Strategies for funding and maintaining open source software

BOSC ended with a panel discussion featuring Peter Cock, Sean Eddy, Carole Goble, Scott Markel and Jean Peccoud. We discussed approaches for funding long term open source scientific software. I chaired the panel so didn’t get to take the usual notes but will summarize the main points:

  • Working openly and sharing your work helps with your impact on science.
  • It is critical to be able to effectively demonstrate your impact to reviewers, granting agencies, and users of your tools. Sean Eddy shared the Deletion Metric for research impact: Were a researcher to be deleted, would there be any phenotype?
  • To demonstrate impact, be able to quantify usage in some way. Some of the best things are personal stories and recommendations about how your software helps enable science in ways other tools can’t.
  • Papers play an important role in educating, promoting and demonstrating usage of your software, but they are not the only metric.
  • We need to take personal responsibility as developers for categorizing impact and usage. Downloads and views are not great metrics since they are hard to interpret. Better to engage with and understand usage, and ask users to cite and recommend. Lightweight, easy citation systems would go a long way towards enabling this.


Bioinformatics Open Source Conference 2013, day 2 morning: Sean Eddy and Software Interoperability

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 2 morning session focused on software interoperability.

Previous notes:

Biological sequence analysis in the post-data era

Sean Eddy

Sean starts off with a self-described embarrassing personal history about how he developed his scientific direction. The biology background: multiple self-splicing introns in a bacteriophage, unexpected in a highly streamlined genome. The introns are self-removing transposable elements that are difficult for the organism to remove from the genome. There is no sequence conservation of these, only structural conservation, but there were no tools to detect this. Sean was an experimental biologist and used this as a motivating problem to search for an algorithmic/programming solution. He wasn’t able to accomplish this straight off until he learned about HMM approaches. He reimplemented HMMs and re-invented stochastic context free grammars to model the structural work as a tree structure. The embarrassing part was that his post-doc lab work on GFP was not going well and got scooped, so he wrote a postdoc grant update to switch to computational biology. This switch led to HMMER, Infernal, and the Biological Sequence Analysis book.

From this: the general advice not to do incremental engineering is wrong. A lot of great work came from incremental engineering: automobiles, sequence analysis (Smith Waterman -> BLAST -> PSI-BLAST -> HMMER). Engineering is a valuable part of science, though it requires insane dedication to a single problem. The truth: science rewards you for how much impact you have, not how many papers you write. An arbitrage approach to science: take ideas and tools and make them usable for the biologists who need them. Not traditionally valued, but useful, so you can carve out a niche.

The general approach behind Pfam helps tame the exponential growth of sequences. The strategy is to use representative seed alignments, sweep the full database, use scalable models in HMMER and Infernal, then automate. This scales as you get more data.

Scientific publication is a 350 year old tradition of open science. The first journal with peer review appeared in 1665: scientific priority and fame in return for publication and disclosure. This quid pro quo still exists today. The intent of the system has been open data since the beginning, but the tricky part now is that the material you want to be open no longer fits into the paper. Specifically in computational science, the paper is an advertisement, not a delivery mechanism.

Two magic tricks. We need sophisticated infrastructure, but most of the time we’re exploring. For one-off data analysis, the premium is on expert biology and tools that are as simple as possible. Trick 1: use control experiments over statistical tests. Things you need: trusted methods, data availability, and a command line. Trick 2: take small random sub-samples of large datasets. He reviews an example of using this approach to catch an algorithmic error in a spliced aligner.
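
A minimal sketch of that second trick with Biopython; the file names and sample size are placeholders, not from the talk:

```python
# Sketch of the sub-sampling trick: analyze a small random subset of a big
# dataset. File names and sample size are placeholders; assumes Biopython.
import random
from Bio import SeqIO

# For truly huge files you would stream with reservoir sampling instead of
# loading everything, but this keeps the idea obvious.
records = list(SeqIO.parse("reads.fastq", "fastq"))
subset = random.sample(records, min(10000, len(records)))
SeqIO.write(subset, "reads_subsample.fastq", "fastq")
```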

Bioinformatics: data analysis needs to be part of the science. Biologists need to be fluent in computational analysis and strong computational tools will always be in demand. Great end to a brilliant talk.

Software Interoperability

BioBlend – Enabling Pipeline Dreams

Enis Afgan

BioBlend is a Python wrapper around the Galaxy and CloudMan APIs. The goal is to enable the creation of automated and scalable pipelines. For some workflows the Galaxy GUI isn’t enough because we need metadata to drive the analysis. Luckily Galaxy has a documented REST API that supports most of the functionality. To support scaling out Galaxy, CloudMan automates the entire process of spinning up an instance, creating an SGE cluster and managing data and tools. Galaxy is an execution engine and CloudMan is the infrastructure manager. BioBlend has extensive documentation and lots of community contributions.
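
A short example of the kind of scripting BioBlend enables; the Galaxy URL and API key are placeholders:

```python
# Connect to a Galaxy instance through BioBlend and list histories and
# workflows. The URL and API key are placeholders for a real server.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.example.org", key="YOUR_API_KEY")

for history in gi.histories.get_histories():
    print(history["name"], history["id"])

workflows = gi.workflows.get_workflows()
print("Available workflows:", [w["name"] for w in workflows])
```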

Taverna Components: Semantically annotated and shareable units of functionality

Donal Fellows

Taverna components are well described parts that plug into a workflow. A component needs curation, documentation, and to work (and fail) in predictable ways. The component hides the complexity of calling the wrapped tool or service. This is a full part of the Taverna 2.5 release: both workbench and server. Components are semantically annotated to describe inputs/outputs according to domain ontologies. Components are not just nested workflows since they obey a set of rules, so you can treat them as black boxes and drill in only if needed. Components enable additional abstraction, allowing workflows to be more modular: individual work on components and on high level workflows, with updates for new versions. The long term goal is to treat the entire workflow as an RDF model to improve searching.

UGENE Workflow Designer – flexible control and extension of pipelines with scripts

Yuriy Vaskin

UGENE focuses on the integration of biological tools using a graphical interface. It has a workflow designer like Galaxy and Taverna and runs on local machines. It also offers a Python API for scripting through UGENE, with nice example code feeding Biopython inputs into the API natively.

Reproducible Quantitative Transcriptome Analysis with Oqtans

Vipin Sreedharan

Vipin starts off the talk with a poll from an RNA-seq blog: the most immediate needs for the community are standard bioinformatics pipelines and skilled bioinformatics specialists. oqtans is online quantitative transcriptome analysis, with code available on GitHub. It drives an automated pipeline with a vast assortment of RNA-seq data analysis tools. Some useful tools used: PALMapper for mapping, rDiff for differential expression analysis, rQuant for alternative transcripts. oqtans is available from a public Galaxy instance and as Amazon AMIs.

MetaSee: An interactive visualization toolbox for metagenomic sample analysis and comparison

Xiaoquan Su

MetaSee provides an online tool for visualizing metagenomic data. It’s a general visualization tool and integrates multiple input types, with tools specifically for metagenomics to display taxa in a population. They have a nice MetaSee mouth example which maps metagenomics of the mouth. Also, pictures of teeth are scary without gums. Meta-Mesh is a metagenomic database and analysis system.

PhyloCommons: community storage, annotation and reuse of phylogenies

Hilmar Lapp

PhyloCommons provides an annotated repository of phylogenetic trees. Trees are key to biological analyses and increasing in number, but difficult to reuse and build off. Most are not archived, and even when they are, they exist as images or other formats that are hard to use automatically. It uses Biopython to convert trees into RDF and allows queries through the Virtuoso RDF database. Code is available on GitHub.
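
For flavor, reading and traversing a tree with Biopython, the kind of parsing the RDF conversion starts from, looks roughly like this; the file name is a placeholder:

```python
# Parse a Newick tree with Biopython and walk its clades -- the sort of
# parsing PhyloCommons builds on before converting to RDF. The file name
# is a placeholder.
from Bio import Phylo

tree = Phylo.read("example_tree.nwk", "newick")
for clade in tree.find_clades():
    if clade.name:
        print(clade.name, clade.branch_length)
```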

GEMBASSY: an EMBOSS associated package for genome analysis using G-language SOAP/REST web services

Hidetoshi Itaya

GEMBASSY provides an EMBOSS package that integrates with the G-Language using a web service. This gives you commandline access through EMBOSS for a wide variety of visualization and analysis tools. Nice integration examples show it working directly in a command line workflow.

Rubra – flexible distributed pipelines for bioinformatics

Clare Sloggett

Rubra provides flexible distributed pipelines for bioinformatics, built on top of Ruffus. It has been used to build a variant calling pipeline based on bwa, GATK and Ensembl.
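
A minimal Ruffus-style stage, to illustrate the underlying model; the input files, suffixes and shell command are invented examples, not Rubra's actual pipeline:

```python
# Minimal Ruffus pipeline stage illustrating the model Rubra builds on.
# The input files, suffixes and shell command are invented examples.
import subprocess
from ruffus import transform, suffix, pipeline_run

@transform(["sample1.fastq", "sample2.fastq"], suffix(".fastq"), ".sam")
def align_reads(input_file, output_file):
    # Placeholder alignment command; a real pipeline would configure this.
    subprocess.check_call(
        "bwa mem reference.fa {0} > {1}".format(input_file, output_file),
        shell=True,
    )

if __name__ == "__main__":
    pipeline_run([align_reads])
```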

Bioinformatics Open Source Conference 2013, day 1 afternoon: visualization and project updates

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 1 afternoon sessions focused on visualization and open source project updates.

Previous notes:


Refinery Platform – Integrating Visualization and Analysis of Large-Scale Biological Data

Nils Gehlenborg

The Refinery Platform provides an approach to manage and visualize data pipelines. TCGA: 10,000 patients, with mRNA, miRNA, methylation, expression, CNVs, variants and clinical parameters. Lots of heterogeneous data, made more extensive after processing. They need an approach to manage long running pipelines with numerous outputs. They want to integrate horizontally across all data types to gain biological insight, and vertically across data levels to provide confirmation and troubleshooting. ISA-Tab provides the data model for metadata and provenance evaluation. The web interface provides faceted views of all data based on metadata, and visualizations to explore attribute relationships. The underlying workflow engine is Galaxy: the approach is to set up workflows in Galaxy, then make them available in Refinery at a higher level, using the Galaxy API to drive template-based custom workflows across hundreds of samples.

There are two approaches to visualization in Refinery. The first is file-based visualization: connect to IGV and display raw BAM data along with associated metadata. Galaxy also supports this well, so the hope is to build off of it. The second approach is database-driven visualization that uses an associated Django server to read/write from a simple API server. Callbacks can also go through REST built on TastyPie, so it’s quick and easy to develop custom visualizations.
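
A sketch of how small such a TastyPie-backed endpoint can be; the Django model here is hypothetical, not Refinery's actual code:

```python
# Sketch of a read-only REST endpoint with Django + TastyPie, the pattern
# the database-driven visualizations build on. The Track model and app
# module are hypothetical.
from tastypie.resources import ModelResource
from myapp.models import Track  # hypothetical Django model


class TrackResource(ModelResource):
    class Meta:
        queryset = Track.objects.all()
        resource_name = "track"
        allowed_methods = ["get"]
```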

DGE-Vis: Visualisation of RNA-seq data for Differential Gene Expression analysis

David Powell

DGE-Vis provides a visualization framework to identify differentially expressed genes from RNA-seq analysis. It provides approaches to handle differentially expressed gene lists from two-way comparisons. To generalize to three-way comparisons it creates a Venn diagram and allows selection of each of the subcomponents to inspect individually. Given the limitations of this, they then developed a new approach. David shows a live demo of comparisons between 4 conditions, which identifies changes over the conditions. A heatmap groups conditions based on differential expression similarities. The heatmap is nicely linked to expression differences for each gene, and subselection shows you a list of genes. All three items are linked so they change in real time as the others adjust. It provides integrated pathway maps with colors linked to each experiment, allowing biologists to identify changed genes via pathways. Written with Haskell on the backend, R for analysis, and CoffeeScript and JavaScript using D3 for visualization.

Genomic Visualization Everywhere with Dalliance

Thomas Down

Thomas starts by motivating visualization: humans love to look at things, and practically, scientists write papers around a story told by the figures. Unfortunately we focus on print/old-school visualizations: what more could we present if they weren’t so static? The Dalliance genome browser provides complete interactivity, with easy loading of custom files and multiple tracks. It is designed to fit into other applications easily, so you can embed it into your own website. It is also meant to be usable in more lightweight contexts: blog posts, slides, journal publications. It’s a fully client side implementation but does need the CORS allowed header on remote websites that feed data in.

Robust quality control of Next Generation Sequencing alignment data

Konstantin Okonechnikov

The goal is to avoid common traps in next-generation sequencing data: poor runs and platform/protocol-specific errors. Konstantin’s tool, QualiMap, aims to be more user-friendly in comparison to FastQC, samtools, Picard and RNA-SeQC. It provides interactive plots inspired by FastQC’s displays, and also does count quality control, transcript coverage and 5'/3' bias tools for RNA-seq analyses.

Visualizing bacterial sequencing data with GenomeView

Thomas Abeel

GenomeView provides a genome browser for interactive, real-time exploration of NGS data. It allows editing and curation of data, with configurability and extensibility through plug-ins. It is designed for bacterial genomes, so it focuses on consensus plus gaps and missing regions. It handles automated mapping between multiple organisms, showing annotations across them. It handles 60,000 contigs for partially sequenced genomes, allowing selection by query to trim down to a reasonable number.

Genomics applications in the Cloud with DNANexus

Andrey Kislyuk

DNANexus has an open and comprehensive API to talk to the DNANexus platform. It provides a genome browser, circos and other visualization tools. They have a nice set of GitHub repositories including client code for interacting with the API and documentation. A StackOverflow clone called DNANexus Answers provides question/answer and community interaction.

Open source project updates

BioRuby project updates – power of modularity in the community-based open source development model

Toshiaki Katayama

Toshiaki provides updates on the latest developments in the BioRuby community. Important changes in openness during the project: the move to GitHub, and the BioGems system which lowers the barrier to joining the BioRuby community. Users can publish standalone packages that integrate with BioRuby. Some highlights: bio-gadget, bio-svgenes, bio-synreport, bio-diversity.

Two other associated projects: biointerchange provides RDF converters for GFF3, GVF, Newick and TSV, developed during the 2012 and 2013 BioHackathons; the second is basespace-ruby. See the Codefest 2013 report for more details on the project.

Biopython project update

Peter Cock

Peter provides updates on the latest work from the Biopython community. There has been involvement with GSoC for the last several years with both NESCent and the Open Bioinformatics Foundation. This has been a great source of new contributors as well as code development, and it’s an important way to develop and train new programmers interested in open source and biology. Biopython uses continuous integration with BuildBots and Travis. Tests run on multiple environments: Python versions, Linux, Windows, MacOSX. The next release of Biopython supports Python 3.3 through the 2to3 converter; long term, the plan is to write code compatible with both. A nice tip from the discussion: the six tool for Python 2/3 compatibility, and a blog post on writing for 2 and 3. Peter describes thoughts on how to make Biopython more modular to encourage experimental contributions that could then make their way into officially supported releases later on: trying to balance the need for well-tested and documented code with encouraging new users.
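
A tiny example of the six-based style for code that runs on both Python 2 and 3; a generic illustration, not actual Biopython code:

```python
# Writing for Python 2 and 3 at once with the six compatibility layer;
# a generic example rather than actual Biopython code.
from __future__ import print_function
import six

def describe(value):
    if isinstance(value, six.string_types):  # str on Py3, str/unicode on Py2
        return "text: " + value
    return "other: " + repr(value)

for key, val in six.iteritems({"organism": "E. coli", "length": 4641652}):
    print(key, describe(val))
```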

InterMine – Collaborative Data Mining

Alex Kalderimis

InterMine is a data-integration system and query engine that supplies data analysis tools, graphical web-app components and a REST API. It provides a modular set of parts that you can use to build tools in addition to the pre-packaged solution. The InterMOD consortium organizes all the InterMine installations so they can better interact and share data. Recent work: a re-write of the InterMine JavaScript tools. External tools can also be used more cleanly: he shows a nice interaction of JBrowse with InterMine. They are working on rebuilding their web interface on top of the more modular approach.
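
The Python client gives a feel for the query engine; a sketch against a public InterMine instance, with the service URL, view columns and constraint chosen for illustration:

```python
# Query an InterMine instance with the intermine Python client. The service
# URL, view columns and constraint below are illustrative examples.
from intermine.webservice import Service

service = Service("https://www.flymine.org/flymine/service")
query = service.new_query("Gene")
query.add_view("symbol", "primaryIdentifier", "organism.name")
query.add_constraint("symbol", "=", "eve")

for row in query.rows(size=10):
    print(row["symbol"], row["primaryIdentifier"], row["organism.name"])
```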

The GenoCAD 2.2 Grammar Editor

Jean Peccoud

Jean argues for the importance of domain specific languages to make it easier to handle specific tasks: change the language to fit your problem. The idea behind GenoCAD is to empower end-users to develop their own DSL. Formal grammars are sets of rules describing how to form valid sentences in a language. You start by defining categories mapping to biological parts, then follow with the rewriting rules. All of this happens in a graphical drag and drop interface. For parts, they can use BioBricks as inputs.
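
To give a flavor of what such a grammar encodes, here is a toy context-free grammar for a simple expression cassette; the categories and rules are a made-up illustration, not GenoCAD's actual rule set:

```python
# Toy context-free grammar for genetic constructs, in the spirit of GenoCAD's
# rewriting rules; the categories and part names are made up.
GRAMMAR = {
    "Cassette": [["Promoter", "RBS", "CDS", "Terminator"]],
    "Promoter": [["pLac"], ["pTet"]],
    "RBS": [["rbs_strong"], ["rbs_weak"]],
    "CDS": [["gfp"], ["rfp"]],
    "Terminator": [["term_T1"]],
}

def expand(symbol):
    """Recursively expand a category into all terminal part sequences."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal part
    designs = []
    for rule in GRAMMAR[symbol]:
        partial = [[]]
        for sym in rule:
            partial = [d + e for d in partial for e in expand(sym)]
        designs.extend(partial)
    return designs

# Enumerate every construct the grammar allows (8 designs here)
for design in expand("Cassette"):
    print(" - ".join(design))
```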

Improvements and new features in the 7th major release of the Bio-Linux distro

Tim Booth

Bio-Linux is in its 10th year and recently released version 7. Bio-Linux is a set of Debian packages and a full bioinformatics Linux distribution that you can live boot from a USB stick. There are strong interactions with Debian Med and CloudBioLinux, and ongoing work on integrating Galaxy into Debian packages. There is a large emphasis on teaching and courses with Bio-Linux for learning commandline work.

Bioinformatics Open Source Conference 2013, day 1 morning: Cameron Neylon and Open Science

I’m in Berlin at the 2013 Bioinformatics Open Source Conference. The conference focuses on tools and approaches for openly developed community software supporting scientific research. These are my notes from the day 1 morning session focused on Open Science.

Open Science

Network ready research: The role of open source and open thinking

Cameron Neylon

Cameron keynotes the first day of the conference, discussing the value of open science. He begins with a historical perspective on a connected world: the internet, telegraphs and stagecoaches, all the way to social networks, twitter and GitHub: a nice overview of the human desire to connect. As the probability of connectivity rises, individual clusters of connected groups can suddenly reach a critical point of large-scale connectivity. A nice scientific example is Tim Gowers' PolyMath work to solve difficult mathematical problems, coordinated through his blog and facilitated by internet connectivity. It is instructive to look at examples of successful large scale open science projects, especially in terms of organization and leadership.

Successful science projects exploit the order-disorder transition that occurs when the right people get together. By being open, you increase the probability that your research work will reach this critical threshold for discovery. Some critical requirements: document so people can use it, test so we can be sure it works, package so it’s easy to use.

What does it mean to be open? The first idea: your work has value that can help people in ways you never initially imagined. The probability of helping someone scales with interest, divided by friction (how hard the work is to use), times the number of people you can reach. The second idea: someone can help you in a way you never expected. The probability of getting help has the same form: interest, usability/friction and the number of people. The goal of being open: minimize friction by making it easier to contribute and connect.

Challenge: how do we best make our work available with limited time? A good example is the question of how useful VMs are: are they critical for recomputation, or do they create black boxes that are hard to reuse? Both are useful but work for different audiences: users versus developers. Since we want to enable unexpected improvements, it’s not clear which should be your priority with limited time and money. The goal is to make both part of your general work so they don’t require extra effort.

How can we build systems that allow sharing as the natural by-product of scientific work? A brutal reminder: you’re not going to get a Nobel prize for building infrastructure. Can we improve the incentive system? One attempt to hack the system was the Open Research Computation journal, which had high standards for inclusion: 100% test coverage, easy to run and reproduce. It was difficult to get papers because the burden was too high.

Goal: build community architecture and foundations that become part of our day to day life. This makes openness part of the default. Where are the opportunities to build new connectivity in ways that make real change? An unsolved open question for discussion.

Open Science Data Framework: A Cloud enabled system to store, access, and analyze scientific data

Anup Mahurkar

The Open Science Data Framework comes from the NIH Human Microbiome Project, which needed to manage large collections of data sets and associated metadata. They developed a general, language-agnostic collaborative framework. It’s a specialized document database with a RESTful API on top, and provides versioning and history. Under the covers it stores JSON blobs in CouchDB, using ElasticSearch to provide rapid full text search, with the ElasticSearch indexes kept in sync on updates to CouchDB. It provides a web based interface to build queries and a custom editor to update records. Future plans include replicated servers and Cloud/AWS images.
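
The access pattern is plain REST plus full-text search; here is a hypothetical sketch using the requests library, with the endpoint URL, node ID and query fields invented for illustration:

```python
# Hypothetical sketch of talking to a REST + full-text-search document store
# like OSDF; the endpoint URL, node ID and query fields are invented.
import requests

BASE = "https://osdf.example.org/nodes"

# Fetch a single metadata document by ID
node = requests.get("{0}/example-node-id".format(BASE)).json()
print(node.get("node_type"), node.get("meta", {}))

# Full-text search over indexed documents
results = requests.get(BASE, params={"query": "body_site:gut"}).json()
for hit in results.get("results", []):
    print(hit.get("id"))
```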

myExperiment Research Objects: Beyond Workflows and Packs

Stian Soiland-Reyes

Stian describes work on developing, maintaining and sharing scientific work. It uses Taverna, myExperiment and Workflow4Ever to provide a fully shared environment built on Research Objects. These objects bundle everything involved in a scientific experiment: data, methods, provenance and people. This creates a sharable, evolvable and contributable object that can be cited via a DOI. The Research Object is a data model that contains everything needed to rerun and reproduce the work. There is a major focus on provenance: where did data come from, how did it change, who did the work, when did it happen. It uses the W3C PROV standard for representation, and a W3C community group was created to discuss and improve research objects. There are PROV tools available for Python and Java.
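
The Python PROV toolkit mentioned lets you build such provenance records directly; a small sketch with an invented namespace and identifiers:

```python
# Build a minimal W3C PROV document with the Python prov package; the
# namespace and identifiers are invented for illustration.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

data = doc.entity("ex:raw-reads")
result = doc.entity("ex:aligned-reads")
run = doc.activity("ex:alignment-run")
analyst = doc.agent("ex:researcher")

doc.used(run, data)
doc.wasGeneratedBy(result, run)
doc.wasAssociatedWith(run, analyst)

# Serialize in the human-readable PROV-N notation
print(doc.get_provn())
```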

Empowering Cancer Research Through Open Development

Juli Klemm

The National Cancer Informatics Program provides support for community developed software, looking to support sustainable, rapidly evolving, open work. The Open Development initiative is designed exactly to support and nurture open science work. It uses simple BSD licenses and hosts code on GitHub. They are moving hundreds of tools over to this model and need custom migrations for every project; old SVN repositories required a ton of cleanup. The next step is to establish communities around this code, which is diverse and attracts different groups of researchers. They hold hackathon events for specific projects.

DNAdigest – a not-for-profit organisation to promote and enable open-access sharing of genomics data

Fiona Nielsen

DNAdigest is an organization to share data associated with next-generation sequencing, with a special focus on trying to help with human health and rare diseases. Researchers have access to the samples they are working on, but these remain siloed in individual research groups. Comparison to other groups is crucial, but there are no methods or approaches for accessing and sharing all of this generated data. To handle security/privacy concerns, the goal is to share summarized data instead of individual genomes. DNAdigest’s goal is to aggregate data and provide APIs to access the summarized, open information.

Jug: Reproducible Research in Python

Luis Pedro Coelho

Jug provides a framework to build parallelized processing pipelines in Python. Provides a decorator on each function that handles distribution, parallelization and memoization. Nice documentation is available.
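
A minimal jugfile shows the model; the functions and inputs are illustrative, not from the talk:

```python
# Minimal jugfile: each decorated call becomes a task that jug can run in
# parallel and memoize. The functions and inputs are illustrative.
from jug import TaskGenerator

@TaskGenerator
def count_gc(sequence):
    return sum(1 for base in sequence if base in "GC")

@TaskGenerator
def total(counts):
    return sum(counts)

counts = [count_gc(seq) for seq in ["ACGTGC", "GGGCCC", "ATATAT"]]
result = total(counts)
```

Running `jug execute jugfile.py` in several terminals distributes the tasks across processes, and `jug status jugfile.py` inspects progress.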

OpenLabFramework: A Next-Generation Open-Source Laboratory Information Management System for Efficient Sample Tracking

Markus List

OpenLabFramework provides a Laboratory Information Management System to move away from spreadsheets. It handles vector clones and cell-line recombinant systems, for which there is not a lot of support. It is written with Grails and built for extension with new parts, and has nice documentation and deployment.

Ten Simple Rules for the Open Development of Scientific Software

Andreas Prlic, Jim Proctor, Hilmar Lapp

This is a discussion period around ideas presented in the published paper on Ten Simple Rules for the Open Development of Scientific Software. Andreas, Jim and Hilmar pick their favorite rules to start off the discussion. Be simple: minimize time sinks by automating good practice with testing and continuous integration frameworks. Hilmar talks about re-using and extending other code. The difficult thing is that the recognition system does not reward this well since it assumes a single leader/team for every project. He promotes ImpactStory, which provides alternative metrics around open source contributions. The Open Source Report Card also provides a nice interface around GitHub for summarizing contributions. Good discussion around how to measure metrics of usage of your project: you need to be able to measure the impact of your software.