Notes: Bioinformatics Open Source Conference 2015 day 2 afternoon: Translational, Visualization and Lightning Talks

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two day community conference devoted to open source and open science. These are notes about translational biology, visualization and last minute lightning talks.

My other notes from the conference:

Translational Bioinformatics

CIViC: Crowdsourcing the Clinical Interpretation of Variants in Cancer

Malachi Griffith

Large scale cancer genome sequencing is becoming routine. We can find lots of mutations, but the bottleneck is in visualization and interpretation of those events. Shows example sample interpretations from Foundation Medicine done by paid curators. We should be doing this in public, and need resources to support this. Resources at WashU: DoCM and CIViC. The issue is that many hospitals and researchers are building up lists of variants we should care about in cancer – this needs to be done together. Existing resources aren’t meant to be programmable and have non-open licenses, so they are hard to use. Principles of CIViC: interpretations should be freely available and debated openly. Content needs to be transparent and kept up to date. Needs both an API and a web interface. Access should remain free. The hope is that CIViC will end up in a precision medicine treatment cycle: capture information from trying to help late stage cancer patients who are fine with experimental treatments. CIViC is trying to capture known clinically actionable genes – very specific goals to avoid going in too many directions. Currently has 500 evidence statements from 230 published sources.
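
The kind of query CIViC enables can be sketched with a toy example. The record fields below (gene, variant, evidence_level, source) are a simplified assumption for illustration, not the actual CIViC API schema:

```python
# Toy sketch of filtering clinically actionable evidence statements.
# Field names and the A-E level ordering are assumptions for illustration,
# simplified from whatever CIViC actually exposes.

def actionable(evidence, gene, min_level="B"):
    """Return evidence statements for a gene at or above a trust level."""
    order = "ABCDE"  # A (validated) is strongest
    cutoff = order.index(min_level)
    return [e for e in evidence
            if e["gene"] == gene and order.index(e["evidence_level"]) <= cutoff]

evidence = [
    {"gene": "BRAF", "variant": "V600E", "evidence_level": "A", "source": "source-1"},
    {"gene": "BRAF", "variant": "V600E", "evidence_level": "C", "source": "source-2"},
    {"gene": "EGFR", "variant": "L858R", "evidence_level": "B", "source": "source-3"},
]

hits = actionable(evidence, "BRAF")  # only the level-A statement passes
```

The point of an open, programmable resource is exactly that this kind of filtering can be automated rather than done by paid curators per report.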

From Fastq To Drug Recommendation – Automated Cancer Report Generation using OncoRep & Omics Pipe

Tobias Meissner

Tobias talks about work on defining actionable targets that can get prescribed to the patient. They aim for a 2 week turnaround for sequencing and analysis. Most time is spent in analysis, so it benefits from automation and reproducibility improvements. Omics Pipe does the process work and OncoRep prepares the clinical report. Introduces an example of a real patient who has been through multiple rounds of drug treatments. Omics Pipe implements best practice pipelines that run out of the box. OncoRep prepares an HTML patient report based on the calls, generated using knitr. Links back to evidence. Provides a PDF patient report, generated with Sweave.

Cancer Informatics Collaboration and Computation: Two Initiatives of the U.S. National Cancer Institute

Ishwar Chandramouliswaran

Ishwar from the National Cancer Institute (NCI) presenting the collaborative NCIP Hub initiative. The idea is to make tools available for biologists, and to fit these in with MOOCs for learning and training. NCIP Hub provides a home for content and keeps transparent metrics about usage. The second initiative is the Cancer Cloud Initiative with three implementations: Seven Bridges, Broad and Institute for Systems Biology. Please participate in the evaluation, which has $2 million in cloud credits.

Bioinformatics Open Source Project Updates

Biopython Project Update 2015

João Rodrigues

João talking about the Biopython project. Mentions all of the diverse contributions to the source code. Also talks about the benefit of Google Summer of Code (GSoC) for recruiting and retaining contributors. Eric Talevich was a student, then mentor, then administrator for GSoC – an open source career progression. João talks about improvements in Biopython over the last year and demos some cool functionality from KEGG. Beyond the code: Docker containers with Biopython + dependencies that support IPython notebooks. Tiago wrote a book: Bioinformatics with Python Cookbook.

The biogems community: Challenges in distributed software development in bioinformatics

George Githinji and Pjotr Prins

BioRuby migrated into BioGems. The idea is to decentralize the contribution process so there is no longer a single central project, and instead to promote and rank the new packages. Shows a lot of metrics of downloads, GitHub issues and mailing list activity: good question about how to measure success. Published a paper on sambamba and saw a big uptick in downloads and GitHub issues: both bug reports and feature requests.

Apache Taverna: Sustaining research software at the Apache Software Foundation

Stian Soiland-Reyes

Apache Taverna is a workflow system that has been in development since 2001. Since 2006, productionized Taverna to make it easier to install and run. Since 2014 it has been an Apache incubating project. Stian describes the typical evolution of research software: incidentally open source, then developed ad-hoc over time in different directions than initially expected. There is a strong need for open development so the original starters aren’t the only leaders of the project. Move the focus towards the people that are doing things; move towards a do-ocracy. Looked at ways to change the legal ownership of Taverna. Decided to move to Apache – they favor community over code and a move towards longer term sustainability.

Visualization

Simple, Shareable, Online RNA Secondary Structure Diagrams

Peter Kerpedjiev

Peter is talking about making it easy to show RNA secondary structure: tool called forna with d3 goodness. Goal of making these is to show things that are hard to visualize. Simplify 3d structures back to 2d to make them easier to see. Convert 1d to 2d to make them obvious. Nice examples. Another tool that does this is RNA-PDB. Can make more complex applications with d3 and rnaPlot layout. Container component is fornac.

BioJS 2.0: an open source standard for biological visualization

Guy Yachdav

BioJS is a set of reusable blocks for representing biological data on the web. Have an online registry to make it easy to discover new packages. Uses npm for installation. Looking for new components and contributors.

Visualising Open PHACTS linked data with widgets

Ian Dunlop

OpenPHACTS brings together a large number of pharmaceutical resources into an integrated infrastructure. Uses RDF under the covers but has an API to query. Lots of nice visualization widgets and compound displays included with BioJS.

Late-Breaking Lightning Talks

Biospectra-by-sequencing genetic analysis platform

Aurelie Laugraud

Originally called Genotyping by Sequencing (GBS) – cheap and easy way to sequence only part of a genome. Used first on maize because they have lots of population data and a massive genome. Analysis pipeline called TASSEL with both reference and non-reference pipelines. BioSpectra-by-Sequencing (BSS). Brings together a community to make tools available for existing data.

PhyloToAST: Bioinformatics tools for species-level analysis and visualization of complex microbial communities

Shareef Dabdoub

Shareef highlights issues found with QIIME that led them to develop PhyloToAST, which modifies and extends the main pipeline. Includes new plots through matplotlib – nice 2d + 3d on the same data to readily distinguish groups. Also added automatic export of data into the interactive Tree of Life (iTOL).

Otter/ZMap/SeqTools: A productive alternative to web browser genome visualisation

Gemma Guest

Gemma talks about visualization and annotation tools from the Sanger. Otter does interactive graphical annotation. ZMap is a high performance genome browser. SeqTools and Blixem provide visualization of sequence alignments at a higher level of detail than ZMap. Dotter provides detailed comparisons of two sequences.

bioaRchive: enabling reproducibility of Bioconductor package versions

Nitesh Turaga

Nitesh is part of the Galaxy team at Johns Hopkins. An issue with Bioconductor is that it’s quite difficult to get older versions of tools – you can only really get the latest. bioaRchive provides a nice browsable website and packages of old versions of tools. You can use standard install.packages and point it to bioaRchive. For Galaxy, this now makes all versions available for full reproducibility. Future goals are to get Bioconductor involved in the process and integrate with biocLite.

Developing an Arvados BWA-GATK pipeline

Pjotr Prins

Pjotr working at a HiSeq X-10 facility. 18k genomes per year and 50 genomes per day. Existing pipeline takes 3 days on the cluster. Bottleneck is the shared filesystem. Decided to try using Arvados based on conversations at BOSC last year. Took a week to port Perl script over to Arvados. Runs in 2 days with 1 run and flat performance with 8 samples on AWS. Nice ability to share pipelines in Arvados.

Out of the box cloud solution for Next-Generation Sequencing analysis

Freerk van Dijk

Put together a VM for NGS analysis using Molgenis. You can download the image, upload data to the VM and then run. Used the OpenStack framework for running and EasyBuild to install the software. Define the inputs with a CSV file, which generates jobs through Molgenis. Nice setup that creates a Reproducible, Scalable and Portable system.
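
The CSV-to-jobs pattern is simple enough to sketch; the column names and command template below are hypothetical, not the actual Molgenis conventions:

```python
# Sketch of the pattern: read sample definitions from a CSV and generate
# one job command per row. Column names and the command template are
# made up for illustration.
import csv
import io

SAMPLES = """sample,fastq1,fastq2
s1,s1_R1.fq.gz,s1_R2.fq.gz
s2,s2_R1.fq.gz,s2_R2.fq.gz
"""

def generate_jobs(csv_text, template="align.sh {sample} {fastq1} {fastq2}"):
    """Expand each CSV row into a shell command via the template."""
    return [template.format(**row) for row in csv.DictReader(io.StringIO(csv_text))]

jobs = generate_jobs(SAMPLES)  # one command string per sample row
```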

Notes: Bioinformatics Open Source Conference 2015 day 2 morning — Ewan Birney, Open Science and Reproducibility

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two day community conference devoted to open source and open science. These are my notes on the morning session with a keynote from Ewan Birney and a section on Open Science and Reproducibility.

My other notes from the conference:

Keynote

Big Data in Biology and why open source matters

Ewan Birney

Ewan is a founder of BOSC and former director of OpenBio, so we’re excited to have him giving a keynote. We start with lots of stories and memories about his impact on the current set of community members in BOSC.

Ewan starts by reminding us that we’re going through a revolution. The cost of sequencing in 10 years dropped from millions to thousands: mansion versus season tickets to Arsenal. He is brilliantly good at describing the high level picture of the importance of sequencing technology and changes.

3 reasons why open source code matters:

  1. Scientific transparency: providing access to your data and analysis approach is a fundamental part of science
  2. Efficiency: library scale code re-use. It’s a careful art and needs more than just making your code available. Bugs can have downstream consequences on conclusions – there is an extra big job to avoid bugs, so pool the risk on key shared components.
  3. Community: sharing code allows people to specialize in specific areas, outlasts any individual’s contribution. Challenge is to fund and support these community projects.

Infrastructures are crucial and we only notice them when they fail. Hard to work on because you only hear about it when something is wrong. Life sciences is about small details: lots of data types, metadata. EMBL-EBI can scale in terms of data, but won’t scale in terms of people and expertise. ELIXIR makes this a joint problem tackled across Europe. The goal is to scale in terms of expertise and collaboration.

Now Ewan talks about his work on storing information in DNA. Dreamed up over beers, and then came up with ways to encode binary as DNA in small easily made chunks with redundancy. Stored picture of EBI, Martin Luther King’s I have a dream speech, the Watson/Crick paper and Shakespeare’s sonnets.
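
The encoding idea can be shown in a few lines. This is a toy two-bits-per-base mapping only; the published scheme used a rotating code to avoid homopolymer runs, plus overlapping fragments for redundancy:

```python
# Toy sketch: encode bytes as DNA with a fixed 2-bits-per-base mapping.
# The real DNA-storage scheme is more elaborate (rotating code, overlapping
# redundant fragments); this only illustrates the core binary-to-base idea.

BASES = "ACGT"  # 2 bits per nucleotide

def encode(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):          # four 2-bit chunks per byte
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def decode(dna: str) -> bytes:
    out = bytearray()
    for i in range(0, len(dna), 4):         # four bases back into one byte
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

A round trip (`decode(encode(text)) == text`) shows why short synthesized chunks with redundancy matter in practice: any read error in a single base corrupts a byte, so the real scheme spreads each region across several overlapping fragments.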

Good questions around scaling and the cloud. Growing need to practice healthcare and work with research science. For access reasons, we need federated systems that can share data. We will not be able to aggregate data, so need to make this future work for us.

Question about how to financially support open-source infrastructure developers. Need to look at how Linux/Apache foundations work and try to build off their success at supporting paid developers managing a larger volunteer community.

Open Science and Reproducibility

A curriculum for teaching Reproducible Computational Science bootcamps

Hilmar Lapp

There is a reproducibility issue in science – we’re not that good at reproducing results and it’s time consuming to do it. Reproducible science makes it easier and faster to build upon. Computationally it’s especially hard because of dependency and tool issues. The Reproducible Science Curriculum Workshop and Hackathon in Durham aimed at improving this. The earlier you start doing this in a project, the more the benefits accrue to you. Promoting literate programming via IPython/RStudio/knitr. The curriculum teaches both new users and those that have some expertise but could be convinced to switch to new approaches. Held two workshops so far, in May and June – changed to focus more on literate programming earlier since you didn’t need to convince people. There was a ton of awareness of and demand for these courses. Future plans are to tweak and improve the workshop and teach more. Need to figure out how to fund and sustain this effort – it’s funded through 2015. All materials are available on GitHub.

Research shared: http://www.researchobject.org

Norman Morrison

Research Objects are part of a framework to enable reproducible, transparent research. Want to have a manifest of the materials/resources available inside research containers. The manifest is a plain text file full of dates, DOIs and links to resources about what it contains. Lots of scope for ontologies and naming. Tutorials and specifications for everything are available on GitHub. Lots of cool use cases showing how to make results fully computable and available. FAIR = Findable, Accessible, Interoperable, Reusable. Awesome graph of the time costs of moving to reproducibility.
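
The manifest idea can be sketched as a small structure listing what a container aggregates. The field names below are illustrative only; the real Research Object specification defines its own JSON-LD vocabulary:

```python
# Illustrative sketch of a research-container manifest: a creator, a date,
# and a list of aggregated resources with identifiers. Field names are
# simplified stand-ins, not the actual Research Object spec terms.
import json

def make_manifest(created_by, created_on, resources):
    """Build a manifest dict listing the resources a container aggregates."""
    return {
        "createdBy": created_by,
        "createdOn": created_on,
        "aggregates": [{"uri": uri, "mediatype": mt} for uri, mt in resources],
    }

manifest = make_manifest(
    "example-author", "2015-07-10",
    [("data/results.csv", "text/csv"),
     ("workflow/analysis.cwl", "text/yaml")],
)
text = json.dumps(manifest, indent=2)  # plain-text, diffable, archivable
```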

Nextflow: a tool for deploying reproducible computational pipelines

Paolo Di Tommaso

Nextflow provides a declarative syntax to write parallel and scalable workflows. It has a nice domain specific language (DSL) based on the dataflow model for concurrent processes. Under the covers it uses async channels and handles parallelization implicitly through input/output definitions: think of the program like a network, which implicitly parallelizes over multiple inputs. It is platform agnostic, so scales out from multicore to all the standard cluster managers.
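
Nextflow itself is a Groovy-based DSL, so the following is only a rough Python analogy of the dataflow model: processes read from input channels and write to output channels, the wiring defines the pipeline, and concurrency falls out for free:

```python
# Rough Python analogy of the dataflow model (not Nextflow syntax): each
# "process" consumes an input channel and feeds an output channel, so
# wiring channels together defines the pipeline and each stage runs
# concurrently in its own thread.
import queue
import threading

STOP = object()  # sentinel that closes a channel

def process(fn, inbox, outbox):
    def run():
        for item in iter(inbox.get, STOP):
            outbox.put(fn(item))
        outbox.put(STOP)                 # propagate channel close downstream
    t = threading.Thread(target=run)
    t.start()
    return t

reads, trimmed, aligned = queue.Queue(), queue.Queue(), queue.Queue()

# Two hypothetical steps; real Nextflow processes would invoke actual tools.
t1 = process(lambda r: r.rstrip("N"), reads, trimmed)   # "trim" stage
t2 = process(lambda r: (r, len(r)), trimmed, aligned)   # "align" stage

for r in ["ACGTNN", "TTGANN"]:
    reads.put(r)
reads.put(STOP)
t1.join(); t2.join()

results = []
while True:
    item = aligned.get()
    if item is STOP:
        break
    results.append(item)
```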

Why is it so hard to deploy a pipeline? Big issue: many dependencies that change quickly. To mitigate this, manage a pipeline as a self-contained GitHub repository and use Docker to manage tools. Provides a nice example pipeline on GitHub to demonstrate. GitHub provides versioning that feeds right into Nextflow.

Free beer today: how iPlant + Agave + Docker are changing our assumptions about reproducible science

John Fonner

John works on iPlant, tackling problems in cyberinfrastructure for the plant community. They have a bunch of storage and compute, API level services for command line access, then a web based user interface for interactivity. iPlant has a Discovery Environment for managing data and analysis history. Atmosphere is an open cloud for the life sciences where you can request additional resources. Agave provides a programming interface for building things on top of iPlant, and handles pretty much everything you’d need to build on it. Uses Docker to store dependencies and tools alongside a GitHub repo with code. Agave handles job provenance, sample data and platform integration along with sharing and cluster execution.

The 500 builds of 300 applications in the HeLmod repository will at least get you started on a full suite of scientific applications

Aaron Kitzmiller

Lots of ways to publish code, which is good. The problem is that there are so many different ways to install these tools that it takes a lot of work to build them. Built HeLmod on top of Lmod to manage a huge set of scientific dependencies in a clean environment. Has 500 specification files allowing installation of all these tools. Really nice shared resource – we should all be building modules together instead of separately at each facility. HeLmod GitHub repo

Bioboxes: Standardised bioinformatics tools using Docker containers

Peter Belmann

bioboxes is motivated by Docker-based benchmarking projects (CAMI and nucleotid.es). It’s a standards project to provide a way to specify the inputs and outputs of containers, which allows you to easily interchange tools for benchmarking. Nice community project for specifying these.

The perfect fit for reproducible interactive research: Galaxy, Docker, IPython

Björn Grüning

Björn talks about his brilliant work to combine Galaxy, IPython and Docker. Galaxy can run Docker containers and has a rich set of visualizations. What was missing was a way to interact and tweak your data. Invented the concept of an interactive environment in Galaxy – spins up Docker container that works against Galaxy data. This is a sharable Galaxy data object so has all those advantages. RStudio is also integrated if you prefer R over Python. Also has a Docker based way to install and use Galaxy quickly.

COPO: Bridging the Gap from Data to Publication in Plant Science

Robert Davey

There are cultural issues in getting scientists to deposit metadata. Idea: make it easier and more connected so there is a bigger benefit to users, overcoming this cultural barrier. This allows you to build graphs of interconnected data and track usage of your data, which can be helpful in describing the value of your data. COPO project.

ELIXIR UK building on Data and Software Carpentry to address the challenges in computational training for life scientists

Aleksandra Pawlik

Aleksandra has 1 slide for her lightning talk – brilliant. ELIXIR is adopting software carpentry as a training model. Really awesome to be spreading a single teaching model across multiple countries. It feels like finally we are not developing independent materials everywhere and can have good training for everyone.

Parallel recipes: towards a common coordination language for scientific workflow management systems

Yves Vandriessche

Yves builds tools for people who build tools. Scripts deal with the complexity of gluing together applications but we need more distributed jobs. The biggest issue in this move is ordering dependencies when running in parallel. Integrated CWL workflow specification into precipes. Code on GitHub.

openSNP – personal genomics and the public domain

Bastian Greshake

openSNP is about open data and personal genomics. The idea is to provide a way to upload and share genotype and phenotype data along with genotyping from 23andMe. Mines SNPedia to provide useful feedback to users. Provides complete dumps of data and APIs. 2000 genetic datasets and 4000 people registered. Users have mined his data and provided back interpretations.

Notes: Bioinformatics Open Source Conference 2015 day 1 afternoon — Standards, Interoperability and Diversity Panel

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two day community conference devoted to open source and open science. These are my notes on the day 1 afternoon session on standards and interoperability, and the panel on improving diversity in the BOSC community.

Standards and Interoperability

Portable workflow and tool descriptions with the CWL

Michael R. Crusoe

Michael provides an overview of the Common Workflow Language, a great standard developed out of Codefest 2014 and the BOSC community. It provides a way to define workflows and tools in a re-usable and interoperable way. Looked at many existing standards and approaches. Gives a shout-out to Workflow4Ever, which was an awesome standard to build off.

From peer-reviewed to peer-reproduced: a role for research objects in scholarly publishing in the life sciences

Alejandra Gonzalez-Beltran

Alejandra talking about research objects and the ability to use them to improve reproducibility. The work is published in PLOS ONE and builds off ISATools, Galaxy and GigaScience. Took an article on SOAPdenovo2 and built up the infrastructure that should have been present: description of the experimental steps in ISA-Tab, running in Galaxy. Published findings as nanopublications in RDF and linked the statements of results with the experimental descriptions. The idea behind Research Objects is that they bring together data, analysis and conclusions.

Demystifying the Interoperability of Disparate Genomic Resources

Daniel Blankenberg

Dan starts by describing what Galaxy is, trying to correct some earlier misconceptions and define the goals of the project. Awesome new thing is the interactive support for iPython and RStudio. Main goal of talk is talking about getting data into Galaxy in more automated ways, including multiple files and metadata and GenomeSpace. Export data out of Galaxy to places like UCSC or to other sinks like GenomeSpace. Whew, a whirlwind tour.

Increasing the utility of Galaxy workflows

John Chilton

John is talking about Galaxy workflows, designed for biologists. However they are only used by 15% of users, perhaps because of limitations. Galaxy previously only scheduled jobs, but now handles map/reduce style operations and has a real workflow engine. Now has Collection types (list and paired), with support for dealing with these via a nice web interface where users build the sample relationships themselves. Now can do solid sample tracking, and tools can consume paired datasets. John shows some nice examples of community developed workflows. The workflow engine and model is so nice now, with lots of great features. I’m looking forward to building on top of it with bcbio once everything is CWL.

Kipper: A software package for sequence database versioning for Galaxy bioinformatics servers

Damion Dooley

Kipper focuses on helping to recreate sequencing analyses by organizing reference databases. Some databases are versioned so you can download previous releases, but large resources like NCBI are continuously updated. Kipper is a Python script that keeps track of everything.
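
The core trick behind versioning a continuously updated database can be sketched as keyed version stamps: each record remembers the version it appeared in and the version it was removed in, so any historical snapshot can be reconstructed. This is a simplified illustration of the idea, not Kipper's actual file format:

```python
# Simplified sketch of versioned reference data (not Kipper's real format):
# each record carries added/removed version stamps, so any past snapshot
# can be rebuilt without storing full copies of every release.

class VersionedStore:
    def __init__(self):
        self.records = []   # dicts: key, value, added, removed (None = live)
        self.version = 0

    def snapshot_at(self, version):
        """Reconstruct the {key: value} state as of a given version."""
        return {r["key"]: r["value"] for r in self.records
                if r["added"] <= version
                and (r["removed"] is None or r["removed"] > version)}

    def update(self, snapshot):
        """Record a new release; only changed records are stored."""
        self.version += 1
        live = self.snapshot_at(self.version - 1)
        for r in self.records:             # retire changed/deleted records
            k = r["key"]
            if r["removed"] is None and snapshot.get(k) != r["value"]:
                r["removed"] = self.version
        for k, v in snapshot.items():      # add new/changed records
            if live.get(k) != v:
                self.records.append({"key": k, "value": v,
                                     "added": self.version, "removed": None})
        return self.version

store = VersionedStore()
store.update({"seq1": "ACGT"})                     # release 1
store.update({"seq1": "ACGA", "seq2": "TTT"})      # release 2
```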

Evolution of the Galaxy tool ecosystem – happier developers, happier users

Martin Čech

Martin talks about recent improvements to the tool shed to make it easier to use for developers. Galaxy involved with the ICGC-TCGA DREAM challenge for re-running. Awesome. Planemo helps in testing and developing new tools, treating Galaxy as a transparent dependency.

Bionode – Modular and universal bioinformatics

Bruno Vieira

Bionode is a JavaScript library for working with biological data. It works both in the browser and locally with node.js. Really nice looking code: Code in GitHub. Also talks about the really cool oSwitch to help make it easy to run local commands in Docker containers with the actual tools.

The EDAM Ontology

Hervé Ménager

The EDAM Ontology describes bioinformatics operations, data types and topics. It’s a critical way to define reproducible workflows so we know what is happening where and can convert between different tools that do the work. It’s part of ELIXIR so will be in new infrastructure.

Panel Discussion – Open Source, Open Door: Increasing diversity in the bioinformatics open source community

Mónica Muñoz-Torres, Holly Bik, Michael R. Crusoe, Aleksandra Pawlik, Jason Williams

Moni is chairing the panel, talking about our goals at BOSC to have a more diverse community. We want to welcome underrepresented members of all types, to help enrich the set of skills and opinions in our community. The experience of the panelists is incredible: helping bring new scientists into the community at all levels. There are a large number of cultures and communities we interact with; how can we be better at that?

The goal of the panel is in starting the conversation and hearing people’s voices. Goal is to hear about the situations that make people feel uncomfortable and try to remedy them. How can we set up BOSC to not make people feel excluded?

How do you start getting involved with the community? Holly’s recommendation: volunteer to help with organizing. Yes please, we’d love to have more help with BOSC organizing. Conversely, how to bring more people into your community? Provide role models for the people you want to attract. Need to build and encourage a diverse set of role models.

How can you get more diverse applicants? Often get a large number of male applicants and very few women applicants. Suggestions: pay better, offer more flexibility in working hours. Pay undergraduates to work in the lab to create a more diverse environment. Co-mentor with other groups – say, one more biology focused and one more computational focused. It’s also a numbers problem: we need to be investing in the education of under-represented younger students. Can we sell and provide a clear career path for bioinformatics so people want to go into it?

Suggestions for helping to improve diversity and open science. Collaborate with folks who don’t normally do that and bring your diverse, open worldview into the collaboration.

Bioinformatics is a chance for more people to get involved. All you need is an internet connection and community. The barrier is the knowledge. We need to motivate and welcome people to become involved, then provide the training to get them there.

Notes: Bioinformatics Open Source Conference 2015 day 1 morning — Holly Bik and Data Science

I’m at the 2015 Bioinformatics Open Source Conference (BOSC) in Dublin, Ireland. BOSC is a two day community conference devoted to open source and open science. Nomi Harris starts the day off with an introduction to BOSC – this is the 16th annual meeting. The theme this year is Open Source, Open Door: we have a strong interest in improving diversity at BOSC. Some of the work in progress includes: reaching out to many under-represented groups and sending personal invitations, adding a code of conduct to both BOSC and ISMB, appointing Sarah Hird as the BOSC outreach coordinator, and providing financial support for scientists who might otherwise not be able to attend.

Keynote

Holly Bik – Bioinformatics: Still a scary world for biologists

Holly plans to talk about her experience transitioning from biology to bioinformatics. 2010 is the year that Holly saw the command line for the first time, and will give her perspective as an end user. Holly works on marine nematodes, although environmental sequencing allows you to sample everything you’ll find. So you might be looking at nematodes, but will find plenty of other tiny eukaryotes to study: neglected taxa in understudied habitats.

Holly provides a motivating example: the impact of the Deepwater Horizon oil spill in September of 2010. She then describes her biology experience: collecting deep water samples, nematode classification, drawing of nematodes. None of that was good preparation for the command line. Holly had to adapt to a fast-moving field: collecting 1000s of samples, dealing with bias and contamination error, using multiple approaches. Holly quotes Donald Rumsfeld about known unknowns. Holly feels confident knowing how to do basic scripting and debugging. Known unknowns: new technologies like Hadoop, Drupal, etc create a barrier. Unknown unknowns: distributed computing, binary formatted files that aren’t clearly different from previous text files.

Holly mentions the useful point that what you call yourself influences people’s perceptions. Computational Biologist says one thing to people and Marine Biologist says another. How do we properly present ourselves in both jobs and grants?

Two options for biologists who need bioinformatics: hacking it together yourself or using pre-packaged tools. She awesomely compares this to the Oregon Trail game. I had no idea that Oregon Trail was so comparable to biology + bioinformatics. Do not try to cross the river yourself.

How to learn and get going? The hardest part is intermediate learners: there are lots of resources for beginners but fewer once you already know something and want to level up.

Holly talks about her open-source work on Phinch – research-driven data visualization. Provides visualization of data in Chrome. Really nice interface designed by actual interface designers. It’s a prototype framework that currently has five unique visualizations for summarizing data. Allows you to look at both high and low level patterns. Visualization helps make this data available to citizen scientists.

Next phases of development: phylogenetic visualizations, open public APIs with a self-sustaining developer community. Code on GitHub.

Main points: provide more interdisciplinary training, collaboration and interaction for intermediate learners. Tools that actually take advantage of biology and reduce the need for biological expertise.

We have a new Q/A format we’re trying out with questions from twitter and index cards. On training courses – found that in-person courses are much more useful than online because of actually following through and finishing them. Teach people to Google their error messages – yes please.

Data Science

Mónica Muñoz-Torres – Apollo: Scalable & collaborative curation for improved comparative genomics

Moni works at LBL on WebApollo with Nathan Dunn and Suzi Lewis. The goal is to improve manual annotation and the scalability of this work. The architecture has 3 components: a web based front end, an annotation editing engine and a server side data service. Uses Google Web Toolkit on the front end, Grails on the backend, plus a single datastore with PostgreSQL. Moni describes a lot of the pain of refactoring and the difficulty of estimating timelines when working with new technologies. Code on GitHub.

Kévin Rue-Albrecht – GOexpress: Visualize gene expression using gene ontology

Kévin talks about the motivations behind GOexpress, an R/Bioconductor tool for making sense of gene expression data. Shows a huge map to demonstrate the difficulty of identifying the effect of a treatment. Often have multiple experimental factors, noise in the data and clustering driven by a few genes of large effect. For instance, MHC drives clustering in animals, and you need to cluster genes by effects to determine this and actually look at the signal. Uses random forest and ANOVA methods to classify genes and separates them by ontology. Exposes selection of genes through a Shiny web application. Tricky parts with random forests – need to run permutations to get the p-values that everyone wants. Code on GitHub.
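
The scoring idea can be illustrated with a toy sketch. GOexpress itself is R/Bioconductor and also uses random forests; this Python sketch shows only the ANOVA half, with made-up gene names and GO annotations:

```python
# Conceptual sketch of ANOVA-based gene scoring (the real GOexpress is an
# R/Bioconductor package that also uses random forests): score each gene
# with a one-way F statistic across treatment groups, then aggregate the
# scores per ontology term. All names below are fabricated examples.

def f_statistic(groups):
    """One-way ANOVA F statistic over lists of expression values."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (between / df_between) / (within / df_within)

expression = {                    # gene -> (control values, treated values)
    "geneA": ([1.0, 1.1, 0.9], [5.0, 5.2, 4.8]),   # strong treatment effect
    "geneB": ([2.0, 2.1, 1.9], [2.0, 2.2, 1.8]),   # no effect
}
go_terms = {"GO:fake1": ["geneA"], "GO:fake2": ["geneB"]}

scores = {g: f_statistic(list(vals)) for g, vals in expression.items()}
term_scores = {t: sum(scores[g] for g in genes) / len(genes)
               for t, genes in go_terms.items()}
```

Ranking ontology terms by aggregated gene scores is what lets you see which biological processes, rather than individual genes, separate the conditions.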

Peter Amstutz – Arvados: A Free Software Platform for Big Data Science

Peter talks about computational reproducibility with Arvados, first motivating with examples of how we can improve our ability to run complex pipelines. Components of Arvados: Keep, a content addressable storage system. It’s versioned and immutable, and manifests allow reorganization. I wish I understood filesystems better to know how Hadoop-y file systems differ from this. Crunch is the computational engine. It uses Keep for data, Git for code and Docker for tools to have a full set of reproducible components. This architecture allows moving analyses between multiple instances. Arvados provides facilities for this sharing – both public and between groups in private. Ends by mentioning work on the Common Workflow Language and the move towards better workflow standards, which will be in a talk by Michael later.
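
Content-addressable storage is easy to sketch: blocks are stored under the hash of their contents, so data is immutable by construction and a "manifest" is just an ordered list of block addresses. This is a minimal illustration of the concept, not Arvados' actual manifest format:

```python
# Minimal sketch of content-addressable storage in the style of Keep:
# a block's address is the hash of its contents, making blocks immutable
# and deduplicated; a manifest is an ordered list of block addresses.
import hashlib

class BlockStore:
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        """Store a block and return its content-derived address."""
        addr = hashlib.md5(data).hexdigest()
        self.blocks[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        return self.blocks[addr]

store = BlockStore()
manifest = [store.put(b"chunk one"), store.put(b"chunk two")]
restored = b"".join(store.get(a) for a in manifest)
```

Because the address is derived from the contents, storing the same block twice yields the same address, which is what makes reorganizing data via manifests cheap: only the lists of addresses change, never the blocks.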

Sebastian Schoenherr – Bringing Hadoop into Bioinformatics with Cloudgene and CloudMan

Sebastian motivates by talking about a BOSC talk from 2012 by Enis Afgan on CloudMan, and describes the awesome work on automating Galaxy + AWS. Sebastian also gave a CloudGene talk in 2012. Now CloudGene is more of a software as a service platform, providing dedicated services for a given workflow. Supports the full Hadoop stack: Spark, MRv2, Pig. There is lots in common with CloudMan, so they decided to combine projects, using CloudGene for Hadoop execution within CloudMan. Presents a cool use for Hadoop – the Michigan Imputation Server: a free service providing QC + phasing + imputation. Really nice, and provides a platform for building more services.

Michael Hoffman – Segway: semi-automated genome annotation

Segway finds patterns from multiple biological signal tracks like ChIP-seq. It discovers patterns and then provides annotation, visualization and interpretation. Genome segmentation breaks up the genome into non-overlapping segments, then pushes around boundaries to maximize similarity within regions. Uses a generalized HMM to discover structure in the inputs based on a specific number of classifications to segment by. Michael makes good point that coders are biologists too and can use his knowledge of chromatin structure to develop hypotheses from these inputs and then test those. Use this knowledge to apply labels for each segment of the genome. This provides nice annotation tracks for making sense of variations in non-coding regions. Nicely able to ensure that Segway signals match with the expected biology. Michael makes another good point about the importance of looking in depth at specific regions, and then using this to do biological experiments to confirm.
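
Segway's model is far richer (a dynamic Bayesian network over continuous multi-track data), but the basic idea of labeling the genome by a most-likely hidden state path can be shown with a toy two-state Viterbi decode over a discretized signal. The state and observation names below are made up:

```python
# Toy two-state Viterbi segmentation of a discretized signal track. Segway
# itself uses a generalized HMM over continuous tracks; this only shows the
# core idea of labeling each position by the most likely hidden state.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            scores[s] = score + math.log(emit_p[s][o])
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    state = max(V[-1], key=V[-1].get)      # best final state, then backtrack
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

states = ("quiet", "active")               # hypothetical chromatin states
obs = ["low", "low", "high", "high", "high", "low"]
path = viterbi(
    obs, states,
    start_p={"quiet": 0.6, "active": 0.4},
    trans_p={"quiet": {"quiet": 0.8, "active": 0.2},
             "active": {"quiet": 0.2, "active": 0.8}},
    emit_p={"quiet": {"low": 0.9, "high": 0.1},
            "active": {"low": 0.2, "high": 0.8}},
)
```

The sticky self-transition probabilities (0.8) are what produce contiguous segments rather than position-by-position labels, which is the "pushing around boundaries to maximize similarity within regions" behavior described above.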

Konstantin Okonechnikov – QualiMap 2.0: quality control of high throughput sequencing data

Qualimap does a comprehensive job of running quality control of sequencing data. The new version of BAM QC was redesigned to add many new metrics. Added a method to combine results from multiple BAM files and summarize them. Runs a PCA analysis to detect outliers in a group of results. Also redesigned the RNA-seq quality control. Konstantin highlights the large number of folks from the community who contribute to QualiMap.

Andrew Lonie – A Genomics Virtual Laboratory

Andrew talks about the Genomics Virtual Lab (GVL), which supplies compute along with a set of tools to work on top of it. Awesome resource for Australian bioinformatics with CloudMan and Galaxy. Provides reproducible, redeployable platforms.

Tony Burdett – BioSolr: Building better search for bioinformatics

BioSolr provides an optimized approach for the complexity of life sciences data on top of Solr, Lucene and Elasticsearch. The mission is to build a community of users around improving search for biology. Provides faceting with ontologies and plugins for joining with external indexes to provide federated search.