These are my notes from day 1 of the 2012 Galaxy Community Conference. Apologies to the morning speakers: my flight only got me here in time for the morning break.
Liu Bo: Integrating Galaxy with Globus Online: Lessons learned from the CVRG project
Work with Galaxy part of the CardioVascular Research Grid project which sets up infrastructure for sharing and analyzing cardiovascular data. Challenges they were tackling: distributed data at multiple locations, inefficient data movement and integration of new tools.
Integrated GlobusOnline as part of Galaxy: provides hosted file transfer. Provide 3 tools that put and pull data between GlobusOnline and Galaxy.
Implemented Chef recipes to deploy Galaxy + GlobusOnline on Amazon EC2.
Gianmauro Cuccuru: Scalable data management and computable framework for large scale longitudinal studies
Part of a CRS4 project studying autoimmune diseases, working to scale across distributed labs across Italy. It’s a large scale project with 28,000 biological samples with both genotyping and sequencing data. They use the OMERO platform, which is a client-server platform for visualization, then integrated specialized tools to deal with biobank data. Using Seal for analysis on Hadoop clusters. The problem was that the programming/scripting interfaces were too complex for biologists, so they wanted to put a Galaxy front end on all of these distributed services.
Integrated interactions with custom Galaxy tools, using Galaxy for short term history and the backend biobank for longer term storage. Used iRODS to handle file transfer across a heterogeneous storage system, providing interaction with Galaxy data libraries.
Valentina Boeva: Nebula – A Web-Server for Advanced ChIP-Seq Data Analysis
Nebula ChIP-seq server developed as a result of participation in GCC 2011. Awesome. Integrated 23 custom tools in the past year. She provides a live demo of tools in their environment, which has some snazzy CSS styling. Looks like lots of useful ChIP-seq functionality, and it would be useful to compare with Cistrome.
Sanjay Joshi: Implications of a Galaxy Community Cloud in Clinical Genomics
Want to move analysis to clinically actionable analysis: personalized, predictive, participatory and practical. Individualized medicine currently focuses on two areas: early disease detection and intervention, and personalized treatments.
Underlying analysis of variants: lots of individual rare variants that have unknown disease relationships. Analysis architecture: thin, thick, thin: sequencer, storage, cluster. Tricky bit is that most algorithms are not parallel so adding more cores is not magic. Also need to scale storage along with compute.
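The point about cores not being magic is what Amdahl's law describes: if only a fraction p of a job can run in parallel, the speedup on n cores is at most 1 / ((1 - p) + p / n). The talk did not present this formula; the sketch below is my own illustration, and the 50% parallel fraction is an arbitrary assumption rather than a measurement of any real pipeline.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of core count;
    # only the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: half of the runtime is parallelizable.
    for cores in (1, 8, 64, 512):
        print(cores, round(amdahl_speedup(0.5, cores), 2))
```

With half the work serial, the speedup can never reach 2x no matter how many cores are added.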
Project Serengeti provides deployment of Hadoop clusters on virtualized hardware with VMware. Cloud infrastructure is a great place to drive participatory approaches in analysis.
Enis Afgan: Establishing a National Genomics Virtual Laboratory with Galaxy CloudMan
New library called blend that provides an API to Galaxy and CloudMan. This lets you start CloudMan instances and manage them from the command line, doing fun stuff like adding and removing nodes programmatically.
CloudMan is being used on NeCTAR: the Australian National Research Cloud. Provides a shell interface built on top with web-based interfaces via CloudMan and public data catalogues. Also building online tutorials and workshops for training on using best practice workflows.
Björn Grüning: Keeping Track of Life Science Data
Goal is to develop an update strategy to keep Galaxy associated databases up to date. Current approaches are FTP/rsync which have a tough time scaling with updates to PDB or NCBI. Important to keep these datasets versioned so analyses are fully reproducible.
Approach: use a distributed version control system for life science data. Provides updates and dataset revision history. Used PDB as a case study to track weekly changes in a git repository. Include revision version as part of dropdown in Galaxy tools, and version pushed into history for past runs for reproducibility.
The downside is that rollback and cloning are expensive since repositories get large quickly.
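To make the idea of pinning an analysis to a dataset revision concrete, here is a minimal sketch. The talk used git; this swaps in a plain content hash over an in-memory snapshot so it runs without a repository. The names `revision_id` and `DatasetHistory` are hypothetical and not part of Galaxy or the speaker's code.

```python
from __future__ import annotations

import hashlib


def revision_id(files: dict[str, bytes]) -> str:
    """Return a short, deterministic ID for a snapshot of named files."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sort so the ID ignores insertion order
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()[:12]


class DatasetHistory:
    """Append-only list of snapshots, oldest first (hypothetical helper)."""

    def __init__(self) -> None:
        self._revisions: list[tuple[str, dict[str, bytes]]] = []

    def commit(self, files: dict[str, bytes]) -> str:
        """Record a snapshot unless it is identical to the latest one."""
        rev = revision_id(files)
        if not self._revisions or self._revisions[-1][0] != rev:
            self._revisions.append((rev, dict(files)))
        return rev

    def revisions(self) -> list[str]:
        """Revision IDs, newest first, as a tool dropdown might list them."""
        return [rev for rev, _ in reversed(self._revisions)]

    def checkout(self, rev: str) -> dict[str, bytes]:
        """Return the files exactly as they were at the given revision."""
        for known, files in self._revisions:
            if known == rev:
                return dict(files)
        raise KeyError(f"unknown revision: {rev}")
```

A tool run would store the revision ID it used alongside its outputs, so the same inputs can be checked out again later. Keeping full copies of every snapshot, as this sketch does, also shows why storage grows quickly.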
Vipin Sreedharan: Easier Workflows & Tool Comparison with Oqtans+
Oqtans+ (online quantitative transcript analysis) provides a set of easily interchangeable tools for RNA-seq analysis. Some tools wrapped: PALMapper, rQuant6-3, mTim. They have automated tool deployment via custom fabric scripts. The public instance with integrated tools is available for running, and also an Amazon instance.
Gregory Minevich: CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences
Using C. elegans and looking for specific variations in progeny with neurological problems. Approach has also been applied to other model organisms like Arabidopsis. Available as a Galaxy page with full workflows and associated data. Nice example of a complex variant calling pipeline. Provides nice variant mapping plots across the genome. Pipeline was bwa + GATK + snpEff + custom code to identify a final list of candidate genes.
Approach to identify deletions: look for unique uncovered regions within individual samples relative to full population. Annotate these with snpEff and identify potential problematic deletions.
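A rough sketch of that deletion idea, as I understood it: find runs of zero coverage in one sample, then keep only the runs where every other sample has reads. This is not the CloudMap code. The per-base coverage lists, the function names and the "covered at every position in all other samples" rule are my assumptions for illustration.

```python
from __future__ import annotations


def uncovered_intervals(coverage: list[int]) -> list[tuple[int, int]]:
    """Return half-open [start, end) runs of zero coverage."""
    intervals = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth == 0 and start is None:
            start = pos  # a zero-coverage run begins here
        elif depth > 0 and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:  # run extends to the end of the region
        intervals.append((start, len(coverage)))
    return intervals


def unique_uncovered(
    sample: str,
    coverage_by_sample: dict[str, list[int]],
    min_length: int = 1,
) -> list[tuple[int, int]]:
    """Zero-coverage runs in `sample` that are fully covered in all other samples.

    Assumes every coverage list spans the same region and has the same length.
    """
    others = [cov for name, cov in coverage_by_sample.items() if name != sample]
    result = []
    for start, end in uncovered_intervals(coverage_by_sample[sample]):
        if end - start < min_length:
            continue
        if all(min(cov[start:end]) > 0 for cov in others):
            result.append((start, end))
    return result
```

Regions with no coverage in any sample are dropped, since those are more likely to be unmappable sequence than a deletion specific to one sample.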
Tin-Lap Lee: GDSAP — A Galaxy-based platform for large-scale genomics analysis
Genomic Data Submission and Analytical Platform (GDSAP): provides a customized Galaxy instance. Integrated tools like the SOAP aligner and variant caller, which are now part of the toolshed. Push Galaxy workflows to MyExperiment: example RNA-seq workflow.
Karen Reddy: NGS analysis for Biologists: experiences from the cloud
Karen is a biologist, talking about experiences using Galaxy. Moving from large ChIP-seq datasets back to analysis work. Used Galaxy CloudMan for analysis to avoid need to develop local resources. Custom analysis approach called DamID-seq, translated into a Galaxy workflow all with standard Galaxy tools. Generally had great experience. Issues faced: is it okay to put data on the Cloud? It can be hard to judge capacity: use high memory extra large instances for better scaling.
Mike Lelivelt: Ion Torrent: Open, Accessible, Enabling
Key element of genomics software usage is that users want high level approaches but also want to be able to drill down into details = Galaxy. That’s why Ion Torrent is a sponsor of the Galaxy conference. The Ion Torrent software system sounds awesome: built on Ubuntu with a bunch of open source tools. It has a full platform to help hook into processing and runs. Have open source aligner and variant caller for Ion Torrent data in the toolshed: code is all on the iontorrent GitHub.