Nate Coraor — Building scalable Galaxy
Target of the talk: make large sites run as well as a small instance. The standard build works fine for basic use, but there are several things that can be done to tune larger instances. Some general suggestions:
- Give Galaxy its own user
- Don’t share the database with other users
- Use a clean Python interpreter with virtualenv
- Galaxy can use NFS mounts for clustering
Get started with tuning:
- Disable developer settings in the config file
- Move off the default SQLite database, which does not scale well; PostgreSQL is recommended.
- Use a web proxy like nginx or Apache to serve static content. This allows caching and compression for downloads. See the wiki pages for Apache proxies and nginx proxies.
- Use a cluster to distribute jobs. This allows you to restart the server without restarting jobs. See the wiki Cluster setup page.
- Galaxy has a multiprocess model to avoid issues with Python’s global interpreter lock. Run multiple Galaxy processes on different ports.
- Setting metadata can be CPU intensive and block jobs; have Galaxy set it outside the web process.
- Tuning PostgreSQL: let it keep results in memory, allow more database connections, and use the threadlocal strategy for accessing the database.
- Downloading data from Galaxy: let the proxy serve the files instead of the Galaxy process.
- Uploading data to Galaxy: again, let the proxy handle it. This only works with nginx, and is a bit more complicated to set up.
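Most of the tuning above lands in Galaxy's universe_wsgi.ini. A hedged sketch — the option names match Galaxy documentation of this era, but the values, ports, and server names are illustrative:

```ini
; universe_wsgi.ini -- illustrative values, not recommendations from the talk

; Multiple web processes on different ports to work around the GIL;
; start each one by passing its server name to the startup script
[server:web0]
use = egg:Paste#http
port = 8080

[server:web1]
use = egg:Paste#http
port = 8081

[app:main]
; Move off SQLite to a dedicated PostgreSQL database
; (user/password/host here are placeholders)
database_connection = postgres://galaxy:secret@localhost:5432/galaxy
; SQLAlchemy threadlocal strategy for database access
database_engine_option_strategy = threadlocal
; Set metadata outside the web process so it does not block jobs
set_metadata_externally = True
```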
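On the PostgreSQL side, "keep results in memory" and "more connections" translate to a few postgresql.conf knobs. The values below are illustrative starting points to size against your hardware, not figures from the talk:

```ini
; postgresql.conf -- illustrative starting points
shared_buffers = 256MB        ; keep more data in memory
effective_cache_size = 1GB    ; planner hint about available OS cache
max_connections = 200         ; allow more database connections
```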
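And a minimal nginx sketch of the proxy doing the heavy lifting for static content, downloads, and uploads. All paths and location names are examples, and the upload block assumes the third-party nginx_upload_module is compiled in:

```nginx
# Serve Galaxy's static files directly, with caching
location /static {
    alias /home/galaxy/galaxy-dist/static;  # example path
    expires 24h;
}

# Proxy everything else to the Galaxy process, compressing responses
location / {
    proxy_pass http://localhost:8080;
    gzip on;
}

# Uploads handled by nginx_upload_module (directive names from that
# module; store path and locations are illustrative)
location /_upload {
    upload_store /home/galaxy/galaxy-dist/database/tmp/upload_store;
    upload_pass /_upload_done;
}
```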
All of this configuration is running on the main Galaxy site. Check out the wiki documentation for production servers.
Greg von Kuster — Libraries and Sample Tracking at NGS Facilities
Data libraries are hierarchical containers for datasets. You can upload directly to libraries from local directories on the server machine. This lets administrators manage files outside of Galaxy and still have them work internally. With library management, you can delete and undelete datasets.
Libraries allow users to import data into their history. Datasets are not copied, so large files used by multiple users are represented once on disk. Library security settings allow you to restrict access to certain groups of users. Security policies are flexible at the library, folder and dataset level: include access, add, modify and manage. Security settings are inherited downward but can be overridden at each level.
Data library templates associate information with a library and contents. The wiki description provides some nice examples.
Managing sequencing samples and getting them back into Galaxy: the basic process is to submit a request, move it through the lab process, and transfer the data back through a Galaxy library. This can be automated or semi-automated.
Labs can define the process and states of requests through the Admin interface and configure all the details.
Sample tracking is intended to complement an existing LIMS system, not be a LIMS replacement. Galaxy messaging engine has a simple XML API that can communicate with an existing system.
Bonus item: place to upload existing tools that integrate with Galaxy. See the brand new community tool site.
Kanwei Li — Building Custom Browsers with Trackster
Trackster is a track/data viewer integrated into Galaxy. Supports:
- Line Tracks — wiggle and bedGraph formats, with three display modes: regular line graph, intensity display and filled line graph.
- Feature Tracks — features with start/end regions in BED format.
- BAM tracks — uses pysam.
The front end uses HTML5 canvas elements and jQuery. Data is fetched through AJAX calls to Galaxy. Files are indexed using an array_tree from bx-python; the interval_index allows quick querying by region. The data provider defines get_data for retrieving regions in JSON format. See the BAM Python file definition for the code.
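The provider idea is easy to sketch. Below is a hypothetical, stdlib-only stand-in for what a get_data call does — real Trackster providers pull from an interval_index or pysam rather than an in-memory list, and the feature names here are invented:

```python
import json

# Toy stand-ins for indexed intervals: (chrom, start, end, name)
FEATURES = [
    ("chr1", 100, 500, "feat_a"),
    ("chr1", 450, 900, "feat_b"),
    ("chr2", 10, 50, "feat_c"),
]

def get_data(chrom, start, end):
    """Return a JSON string of features overlapping [start, end) on chrom."""
    hits = [
        {"chrom": c, "start": s, "end": e, "name": n}
        for c, s, e, n in FEATURES
        if c == chrom and s < end and e > start  # interval overlap test
    ]
    return json.dumps({"data": hits})

# The browser asks for a visible window; the provider answers in JSON
print(get_data("chr1", 400, 600))
```

The payload shape (a JSON object wrapping a list of features) is what the canvas-drawing code on the front end consumes after each AJAX call.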
Trackster allows sharing of visualizations that can be embedded in Galaxy pages.
Future developments: GFF file format and more display modes.
To use it in Galaxy, set enable_track=true. See the wiki page for more details on setting it up.
Jeremy Goecks — Reproducible and transparent computational science with Galaxy
Tries to take Galaxy analysis to the next step: how do you use an analysis to do good science? Want it to be accessible, reproducible and transparent. Transparency — communicating experimental outputs in a meaningful way that facilitates understanding and extending the analysis, as well as letting others learn how you do things.
Galaxy handles this by having an open web-based platform that allows others to reuse and see your analysis: workflows, display framework, annotations, and pages:
Workflows — abstract analyses that can be applied to many datasets. These can be built from a history automatically and then edited in the workflow editor. Jeremy bravely shows a live demo of a Cufflinks analysis. Really impressive. Next step is being able to move workflows between Galaxy analyses.
Display framework — can push workflows to a public repository of workflows and share with others. When you publish, you can add metadata and tags to the workflow so it is easily found by others. This is a public web page that you can make available to others.
Annotations and tagging — short pieces of text that can be added to files, workflows and pages. These provide context so others can see what you’ve done. You can annotate each step of a workflow so others see your comments and criteria. Similarly, short tags can be applied to analyses to make them easy to organize.
Pages — allow online publishing by creating an interactive reading experience. Good example is the Metagenomics windshield splatter page. This is awesome supplemental information directly linked to the actual analyses that were done. You can directly import the data and get going with it on Galaxy immediately.
This all is a cool way to share and make your analyses readily reusable. The next things the Galaxy folks are working on are long-term archival of data with Dryad, and developing best practices using social computing like comments on analyses.