Galaxy Community Conference 2012, notes from day 1

These are my notes from day 1 of the 2012 Galaxy Community Conference. Apologies to the morning speakers: my flight only got me here in time for the morning break.

Liu Bo: Integrating Galaxy with Globus Online: Lessons learned from the CVRG project

This work with Galaxy is part of the CardioVascular Research Grid (CVRG) project, which sets up infrastructure for sharing and analyzing cardiovascular data. Challenges they were tackling: data distributed across multiple locations, inefficient data movement, and integration of new tools.

Integrated Globus Online, which provides hosted file transfer, as part of Galaxy. They provide three tools that push and pull data between Globus Online and Galaxy.

Wrapped: GATK tools to run through the Condor job scheduler, CummeRbund for visualization of Cufflinks results, and MISO for RNA-seq analysis.

Implemented Chef recipes to deploy Galaxy + GlobusOnline on Amazon EC2.

Gianmauro Cuccuru: Scalable data management and computable framework for large scale longitudinal studies

Part of a CRS4 project studying autoimmune diseases, working to scale across distributed labs in Italy. It's a large-scale project with 28,000 biological samples carrying both genotyping and sequencing data. They use the OMERO platform, a client-server platform for visualization, with specialized tools integrated on top to deal with biobank data, and Seal for analysis on Hadoop clusters. The problem was that the programming/scripting interfaces were too complex for biologists, so they wanted to put a Galaxy front end on all of these distributed services.

They integrated the interactions via custom Galaxy tools, using Galaxy histories for short-term work and the backend biobank for longer-term storage. iRODS handles file transfer across a heterogeneous storage system, providing interaction with Galaxy data libraries.

Valentina Boeva: Nebula – A Web-Server for Advanced ChIP-Seq Data Analysis

The Nebula ChIP-seq server was developed as a result of participation in GCC 2011. Awesome: they have integrated 23 custom tools in the past year. She gave a live demo of the tools in their environment, which has some snazzy CSS styling. Looks like lots of useful ChIP-seq functionality, and it would be useful to compare with Cistrome.

Sanjay Joshi: Implications of a Galaxy Community Cloud in Clinical Genomics

Want to move toward clinically actionable analysis: personalized, predictive, participatory and practical. The main current focus of individualized medicine is two areas: early disease detection and intervention, and personalized treatments.

Underlying analysis of variants: lots of individual rare variants that have unknown disease relationships. Analysis architecture: thin, thick, thin: sequencer, storage, cluster. The tricky bit is that most algorithms are not parallel, so adding more cores is not magic. You also need to scale storage along with compute.
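The "adding more cores is not magic" point is just Amdahl's law: speedup is capped by the serial fraction of the algorithm. A minimal sketch with illustrative numbers (the 80% parallel fraction is hypothetical, not from the talk):

```python
# Illustrates why adding cores is not magic: Amdahl's law caps speedup
# by the serial fraction of the workload. Numbers are illustrative.

def amdahl_speedup(parallel_fraction, cores):
    """Maximum speedup for a workload where only `parallel_fraction`
    of the runtime can use extra cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# A tool that is 80% parallelizable tops out near 5x speedup
# no matter how many cores the cluster has:
for cores in (4, 16, 64, 1024):
    print(cores, round(amdahl_speedup(0.8, cores), 2))
```

With an 80% parallel fraction the speedup climbs from 2.5x at 4 cores to only about 5x at 1024, which is why storage and the serial parts of pipelines matter as much as core counts.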

Project Serengeti provides deployment of Hadoop clusters on virtualized hardware with VMware. Cloud infrastructure is a great place to drive participatory approaches to analysis.

Enis Afgan: Establishing a National Genomics Virtual Laboratory with Galaxy CloudMan

Enis talked about new additions to CloudMan: support for alternative clouds like OpenStack, Amazon spot instances, and using S3 buckets as file systems.

A new library called blend provides an API to Galaxy and CloudMan. This lets you start CloudMan instances and manage them from the command line, doing fun stuff like adding and removing nodes programmatically.
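A rough sketch of what programmatic node management looks like. The scaling helper is purely illustrative, and the commented-out `CloudManInstance` calls are an assumption based on blend's successor library BioBlend, not necessarily blend's exact interface at the time:

```python
# Sketch of programmatic cluster scaling, in the spirit of the blend API.
# plan_scaling is an illustrative helper, not part of any library.

def plan_scaling(current_nodes, target_nodes):
    """Decide how many worker nodes to add or remove to reach a target size."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return ("add", delta)
    if delta < 0:
        return ("remove", -delta)
    return ("noop", 0)

# Hypothetical usage against a running CloudMan instance, using the
# BioBlend-style interface (an assumption here):
# from bioblend.cloudman import CloudManInstance
# cm = CloudManInstance("http://<cloudman-host>", "<password>")
# action, n = plan_scaling(len(cm.get_nodes()), target_nodes=8)
# if action == "add":
#     cm.add_nodes(n)
# elif action == "remove":
#     cm.remove_nodes(n)
```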

CloudMan is being used on NeCTAR, the Australian national research cloud. It provides a shell interface built on top, with web-based interfaces via CloudMan and public data catalogues. They are also building online tutorials and workshops for training on best-practice workflows.

Björn Grüning: Keeping Track of Life Science Data

The goal is to develop an update strategy to keep Galaxy-associated databases current. Current approaches are FTP/rsync, which have a tough time scaling with updates to PDB or NCBI. It's important to keep these datasets versioned so analyses are fully reproducible.

Approach: use a distributed version control system for life science data, which provides updates plus a dataset revision history. They used PDB as a case study, tracking weekly changes in a git repository. The revision appears as part of a dropdown in Galaxy tools, and the version is pushed into the history for past runs, for reproducibility.
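The weekly-snapshot workflow can be sketched as follows. This is a minimal illustration of the idea (commit each update, tag it, and surface the tags to a tool dropdown); the file name, tag scheme, and helper functions are all hypothetical, not the talk's actual code:

```python
# Sketch of git-backed dataset versioning: each weekly snapshot is
# committed and tagged, so an analysis can be pinned to an exact revision.
# Layout and tag names are illustrative, not the talk's implementation.
import os
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout

def snapshot(repo, filename, content, tag):
    """Write a new dataset version, commit it, and tag the revision."""
    with open(os.path.join(repo, filename), "w") as fh:
        fh.write(content)
    git(repo, "add", filename)
    git(repo, "commit", "-m", f"weekly update {tag}")
    git(repo, "tag", tag)

repo = tempfile.mkdtemp()
git(repo, "init")
git(repo, "config", "user.email", "demo@example.org")
git(repo, "config", "user.name", "demo")
snapshot(repo, "pdb_subset.txt", "entry v1\n", "2012-07-18")
snapshot(repo, "pdb_subset.txt", "entry v2\n", "2012-07-25")
tags = git(repo, "tag").split()  # these revisions would feed the tool dropdown
```

Recording the chosen tag in the Galaxy history is what makes an old run reproducible: rerunning against the same tag checks out exactly the data that was used.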

The downside is that rollback and cloning are expensive since repositories get large quickly.

Vipin Sreedharan: Easier Workflows & Tool Comparison with Oqtans+

Oqtans+ (online quantitative transcriptome analysis) provides a set of easily interchangeable tools for RNA-seq analysis. Some tools wrapped: PALMapper, rQuant, mTim. They have automated tool deployment via custom Fabric scripts. The public instance with integrated tools is available for running, as is an Amazon instance.

Gregory Minevich: CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

Using C. elegans and looking for specific variations in progeny with neurological problems. The approach has been applied to other model organisms like Arabidopsis. Available as a Galaxy Page with full workflows and associated data. A nice example of a complex variant calling pipeline, with nice variant mapping plots across the genome. The pipeline was bwa + GATK + snpEff + custom code to identify a final list of candidate genes.

Approach to identify deletions: look for uniquely uncovered regions within individual samples relative to the full population, then annotate these with snpEff and identify potentially problematic deletions.
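The core of that deletion scan can be sketched in a few lines: flag stretches with zero coverage in one sample that are covered in every other sample. The per-base coverage arrays and the `min_other_depth` threshold are illustrative assumptions, not CloudMap's actual code:

```python
# Sketch of the deletion-detection idea: intervals uncovered in one sample
# but covered in all others are deletion candidates for that sample.

def candidate_deletions(coverage, sample, min_other_depth=1):
    """Return (start, end) intervals (end-exclusive) where `sample` has zero
    coverage but every other sample has depth >= min_other_depth."""
    others = [cov for name, cov in coverage.items() if name != sample]
    target = coverage[sample]
    intervals, start = [], None
    for i, depth in enumerate(target):
        uniquely_uncovered = depth == 0 and all(
            cov[i] >= min_other_depth for cov in others
        )
        if uniquely_uncovered and start is None:
            start = i  # open a candidate interval
        elif not uniquely_uncovered and start is not None:
            intervals.append((start, i))  # close it
            start = None
    if start is not None:
        intervals.append((start, len(target)))
    return intervals

# Toy example: the mutant drops out over positions 1-2 while both
# wild-type samples stay covered there.
cov = {"mutant": [5, 0, 0, 4], "wt1": [3, 2, 2, 3], "wt2": [4, 1, 3, 2]}
print(candidate_deletions(cov, "mutant"))  # → [(1, 3)]
```

Requiring coverage in the rest of the population is what separates real deletions from regions that are simply hard to sequence, which would be uncovered in everyone.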

Tin-Lap Lee: GDSAP — A Galaxy-based platform for large-scale genomics analysis

The Genomic Data Submission and Analytical Platform (GDSAP) provides a customized Galaxy instance. They integrated tools like the SOAP aligner and variant caller, which are now part of the Tool Shed, and push Galaxy workflows to myExperiment: for example, an RNA-seq workflow.

Platform also works with GigaScience to push and pull data in association with GigaDB.

Karen Reddy: NGS analysis for Biologists: experiences from the cloud

Karen is a biologist, talking about her experiences using Galaxy after moving from large ChIP-seq datasets back to analysis work. She used Galaxy CloudMan for analysis to avoid needing to develop local resources. Their custom analysis approach, DamID-seq, was translated into a Galaxy workflow built entirely with standard Galaxy tools. Generally a great experience. Issues faced: is it okay to put data on the cloud? It can also be hard to judge capacity: use high-memory extra-large instances for better scaling.

Mike Lelivelt: Ion Torrent: Open, Accessible, Enabling

A key element of genomics software usage is that users want high-level approaches but also want to be able to drill down into details = Galaxy. That's why Ion Torrent is a sponsor of the Galaxy conference. The Ion Torrent software system sounds awesome: built on Ubuntu with a bunch of open source tools, with a full platform to help hook into processing and runs. They have an open source aligner and variant caller for Ion Torrent data in the Tool Shed: the code is all on the iontorrent GitHub.

