Galaxy Developer Conference: Day 2 afternoon discussions

While the morning talks were mostly from Galaxy developers, the afternoon talks focused on Galaxy users at various sites sharing their stories and tips on running Galaxy and integrating their external applications with it.

Davide Cittaro — ZFS for NGS data analysis

Focus on dealing with next generation sequencing data in Galaxy. Major issues are with resource consumption: computations are expensive and there are lots of big files to store. Uses ZFS, a file system with support for large storage and block level compression. It helps a lot with non-compressed file formats like bigWig, bigBed; not as useful with already compressed formats like BAM. For a Galaxy workflow this is especially useful since you will have SAM files sitting around that are non-compressed.
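
To make the point about compressed versus non-compressed formats concrete, here is a minimal sketch. It does not use ZFS at all: it assumes zlib as a stand-in for the file system's block compression, and the SAM-like records are made up for illustration.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Stand-in for an uncompressed, text-based format such as SAM:
# many similar tab-separated records (invented for this example).
sam_like = b"".join(
    b"read%d\t0\tchr1\t%d\t60\t12M\t*\t0\t0\tACGTACGTACGT\tIIIIIIIIIIII\n"
    % (i, 1000 + i)
    for i in range(5000)
)

# Stand-in for an already compressed format such as BAM: the same
# records after one pass of compression.
bam_like = zlib.compress(sam_like)

text_ratio = compression_ratio(sam_like)
packed_ratio = compression_ratio(bam_like)
print(f"text-like data:     {text_ratio:.2f}")
print(f"already compressed: {packed_ratio:.2f}")
```

The text-like data shrinks a lot, while a second compression pass over already compressed data gains very little, which is why transparent compression pays off mainly for the plain-text intermediates.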

ZFS also has file deduplication, which saves space when files contain the same information. Really helpful if two people are running the same analysis. Only supported on OpenSolaris right now. If you are not on OpenSolaris, then take a look at Nexenta, or FreeBSD with ZFS. Another solution is to NFS mount Solaris boxes with ZFS filesystems on Linux machines where the actual work gets done.
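
The idea behind deduplication can be sketched in a few lines. This is an illustration only, not how ZFS is implemented: it assumes fixed 4096-byte blocks and SHA-256 digests, and the helper names `fake_output` and `deduplicated_size` are hypothetical.

```python
import hashlib


def fake_output(label: str, size: int = 16384) -> bytes:
    """Build deterministic pseudo-random bytes standing in for a result file."""
    chunks = [
        hashlib.sha256(f"{label}:{i}".encode()).digest()
        for i in range(size // 32)
    ]
    return b"".join(chunks)


def deduplicated_size(files, block_size=4096):
    """Return (logical_bytes, stored_bytes) when each distinct block is kept once."""
    seen = set()
    logical = stored = 0
    for data in files:
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            logical += len(block)
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                # First time this block content shows up: it costs space.
                seen.add(digest)
                stored += len(block)
    return logical, stored


# Two users run the same analysis and get identical output;
# a third file holds unrelated content.
result_a = fake_output("analysis-1")
result_b = fake_output("analysis-1")
other = fake_output("analysis-2")

logical, stored = deduplicated_size([result_a, result_b, other])
print(f"logical bytes: {logical}, stored bytes: {stored}")
```

The second copy of the shared analysis adds nothing to the stored total, which is the saving described above when two people run the same job.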

Setup is still a bit sysadmin heavy right now, but Galaxy runs fine on ZFS, so you get the savings if you go through the work.

Hans-Rudolf Hotz — Do-It-Yourself Bioinformatics with the FMI Galaxy Server

Describing Galaxy setup at FMI – Friedrich Miescher Institute for Biomedical Research. Desire to move people from ad-hoc analysis systems to using Galaxy. Decent results in number of users; would like to see more uptake in terms of jobs run.

Have local custom tools, including a jazzed up version of EMBOSS fuzzprot and fuzznuc, which is being rewritten using Biostrings from Bioconductor. Developed a custom tool to upload sequences that uses the user e-mail to determine if a user has the ability to access them. For next gen analysis have a pipeline of Perl scripts that is integrated with Galaxy or can run standalone. Allows users to directly dump their sequences to local export directories. Then tools work on information outside Galaxy using extraction tools.

Ed Kirton — Galaxy at US DOE Joint Genome Institute

Lessons learned from Ed’s experience rolling out Galaxy at JGI. They have a large number of sequencers and provide services for a wide variety of other institutes. Looking for a way to get data to collaborators and also provide analysis services.

Interesting point of view from a more business perspective that us mere coders don’t often deal with. General idea was to define a scope for rolling out Galaxy and define what counts as a successful implementation. Encountered a bit of concern from existing analysts who were not familiar with Galaxy; difficulty was getting buy-in from folks who were not sure about change.

Suggestions and challenges:

  • Limit people and pipelines that you plan to support and work with them initially. This helps avoid too many feature requests.
  • Empower biologists to publish workflows; help transfer responsibility to analysts who can help other analysts.
  • Stability of the system is very important.
  • Programmers like the command line and need to buy in to the Galaxy way of thinking about things.
  • Generalizing tools from different groups can be a problem.
  • Users often don’t understand their role in development; need to accept that there may be problems but can help with feedback to make things better.
  • The user community is very important and can help bootstrap support.

JGI has about 40 tools for the community site; lots of useful assembly tools.

Tao Liu — Cistrome Project: Integrative platform to analyze ChIP-chip/seq data

Cistrome — cis-acting targets of transcription factors at a genomic scale. Basic analysis steps: pull down with ChIP, sequence, align, predict peaks and identify motifs. This is deployed on the Amazon EC2 cloud.

Cistrome is loaded with more than 600 public ChIP-seq datasets and 500 modENCODE sets. Really awesome raw data ready to analyze. Includes Tao’s widely used peak calling software: MACS with direct integration. MACS is also included on the Galaxy main site. Cis-regulatory element annotation system (CEAS) also included for downstream analysis of peaks. Has plots for promoter/exon enrichment and to generate plots of binding profiles over gene regions. More plotting with clustered heatmaps. Provides correlation analysis between different experiments. Motif analysis done with SeqPos. Gene expression analysis with Bioconductor. Sweet. Definitely check it out, tons of really great tools that are ready to use. This is how we should all be working and sharing our code and data.

Juan Perin — Galaxy at Bioinformatics Core at the Children’s Hospital of Philadelphia Research Institute

Juan describes his experience deploying Galaxy in the CHOP Bioinformatics core, running on a 12 node cluster. Specialize in copy number variation (CNV) analysis. Open source Java toolkit CNV Workshop is available that exports data directly into Galaxy. Downstream analysis is done in CNV including plotting and genetic feature overlaps.

Next gen sequencing analysis supports Illumina, SOLiD and 454 machines. 400 samples analyzed through Galaxy including targeted re-sequencing, RNA-seq, ChIP-seq. Provide workflows specific to different experiment types with lots of publicly available tools: BWA, Bowtie, samtools, Mosaik, MACS, TopHat/Cufflinks. Galaxy very useful for tertiary analysis: format conversions, visualization. Keeping Galaxy running in a cluster environment is essentially a full-time job currently, with keeping up to date, managing software, and tuning the cluster.

Gunnar Ratsch — Transcriptome analysis with Galaxy

Gunnar is from the Friedrich Miescher Laboratory of the Max Planck Society. His group focuses on method development for sequence analysis, genome annotation and transcriptome reconstruction. Apply machine learning methods to computational biology problems. Using Galaxy to make these tools available to a wider audience. Include a /galaxy directory in open source releases. Check out the public Galaxy instance:

  • Machine learning toolbox: EasySVM
  • Predictors for transcription start and splice sites
  • Promoter analysis: Kirmes
  • Gene finding system: mGene, mTIM
  • RNA-seq toolbox
  • PALMapper — Accurate RNA-seq alignment: read mapping with spliced alignments.
  • rQuant — Alternative transcripts

Galaxy is set up using an internal and an external server. Major challenge is synchronizing storage between the two instances to share data publicly when published.

Generate extensive workflows during the gene finding process. One problem is that large workflows get confusing: would be useful to be able to group workflows and outputs. Other suggestions: tools need varying resources and it would be nice to be able to define these; a command-line API would be useful.

