I’m at the first Galaxy Developer Conference at beautiful Cold Spring Harbor Labs. James Taylor kicked things off with a quick introduction to the goal: bringing together the community involved with Galaxy to discuss issues of interest. For a dispersed group of folks, this is a chance to meet, share experiences, and learn from each other.
Enis Afgan — Deploying Galaxy on the Cloud
Galaxy is available and ready to run on Amazon: check out the online instructions.
Cloud computing definition — dynamically scalable shared resources delivered over the network. Three layers of cloud computing (‘aas’ -> as a service):
- IaaS — Provide the Infrastructure
- PaaS — Provide a Platform on top of infrastructure
- SaaS — Provide Software that is ready to use
Designed for small labs and individual researchers who would like to be able to scale analyses. The pluses: it allows you to scale costs relative to usage, outsources server management to cloud providers, is quicker to deploy, and is dynamically scalable. On the negative side: it can be expensive for 24/7 usage, you need parallel algorithms for your jobs, and transferring large datasets to the cloud can be difficult.
What does Galaxy have right now? A deployment of Galaxy on Amazon Web Services (AWS) with support for dynamic resource scaling. You use the Amazon console to get it running. Once it spins up, you browse to the IP address of your instance and see an administrator interface for managing your cloud instance. They’ve hidden all the back end details, so a web form is used to set up associated storage and worker nodes, and the web interface lets you add and remove instances based on your usage. Three EBS volumes are attached: the Galaxy software, index files and data for tools, and your personal Galaxy data.
To do list:
- Expand data storage
- Update software without manual effort
- Automatic cluster scaling
- Supporting private and hybrid clouds
- Automatic job scaling and parallelization
Dan Blankenberg — Integrating and scaling analysis tools
Dan provides some blow-by-blow examples of building XML tool configurations. Existing XML tool files are an excellent way to learn about all the useful bits of Galaxy; you can go a long way by finding a tool that does something similar to what you want and then copying the details from its XML.
He kicks off with an overview of tools in Galaxy. There are 3 types:
- regular tools — Run a program. Described by XML syntax.
- datasource tools — Pull information from external websites.
- data destination tools — Send data out of Galaxy.
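As a sketch of the regular-tool XML syntax, here is a minimal tool definition that wraps the Unix `wc` command; the tool id, parameter names, and formats are made up for illustration, but the overall shape follows the Galaxy tool config format:

```xml
<!-- Hypothetical minimal tool config; wraps `wc -l` -->
<tool id="line_count_example" name="Line Count">
  <description>count lines in a dataset</description>
  <!-- $input and $output are Cheetah placeholders filled in by Galaxy;
       the redirect must be XML-escaped as &gt; -->
  <command>wc -l $input &gt; $output</command>
  <inputs>
    <param name="input" type="data" format="tabular" label="Dataset to count"/>
  </inputs>
  <outputs>
    <data name="output" format="txt"/>
  </outputs>
  <help>Counts the number of lines in the input dataset.</help>
</tool>
```

Galaxy builds the tool form from `<inputs>`, fills the `<command>` template with the chosen values, and registers the result under `<outputs>` in the user’s history.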
Metadata describes the content of a dataset. For example, line/sequence counts, column assignments, genome builds, file indexes.
- Parameter validation: require the dbkey to be set
- Works on MAFs in the history or from a cached source
- Linked to a configuration file that specifies available MAF files in the system.
- Command line generation: uses the Cheetah template language.
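The points above can be sketched concretely. Requiring the dbkey amounts to a validator on the input parameter, and choosing between history and cached MAFs is a Cheetah conditional in the command template; the parameter names and the `maf_tool.py` script below are hypothetical:

```xml
<!-- Sketch only: illustrative names, not an actual Galaxy tool -->
<param name="input" type="data" format="maf" label="MAF file">
  <!-- refuses to run unless the dataset has a genome build (dbkey) assigned -->
  <validator type="unspecified_build" message="A genome build must be specified"/>
</param>

<command>
  #if $maf_source.source_type == "history"#
    maf_tool.py --input $maf_source.input
  #else#
    maf_tool.py --cached $maf_source.maf_id
  #end if#
</command>
```

The cached branch would draw its choices from the configuration file of available MAF files mentioned above.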
Another nice example XML file: Maf visualization. Some interesting bits:
- Repeat parameter which allows a variable number of a set of parameters.
- Hierarchical list of warnings that can be suppressed by the user
- The Gmaj result is a button that launches a Java applet displaying the data.
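The repeat parameter mentioned above lets the user add an arbitrary number of copies of a parameter group from the tool form. A hedged sketch, with hypothetical names:

```xml
<!-- Each click of "Add new Annotation Track" adds another copy of this group -->
<repeat name="annotations" title="Annotation Track">
  <param name="track_file" type="data" format="bed" label="Annotation file"/>
  <param name="color" type="text" value="red" label="Display color"/>
</repeat>
```

In the command template, the group is then iterated with Cheetah, along the lines of `#for $a in $annotations# --track $a.track_file #end for#`.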
Two ways to get data into Galaxy:
- Synchronous — ready to retrieve. User is directed to external site and interacts with site until ready to submit back to Galaxy. Example is UCSC Datasource Tool.
- Asynchronous — the external site needs to crunch the data first, then sends it back to Galaxy when ready.
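A synchronous data source tool is essentially a form redirect: the `<inputs action="...">` points at the external site, which posts results back to Galaxy. A simplified sketch modeled on the UCSC tool (abbreviated, and partly hypothetical in its details):

```xml
<tool name="UCSC Table Browser" id="ucsc_example" tool_type="data_source">
  <description>table browser</description>
  <!-- the user is sent to this URL; the site returns data to GALAXY_URL -->
  <inputs action="http://genome.ucsc.edu/cgi-bin/hgTables" method="get" check_values="false">
    <param name="GALAXY_URL" type="baseurl" value="/tool_runner"/>
    <param name="tool_id" type="hidden" value="ucsc_example"/>
  </inputs>
  <outputs>
    <data name="output" format="auto"/>
  </outputs>
</tool>
```

The hidden `GALAXY_URL` and `tool_id` parameters are what let the external site find its way back to the right Galaxy instance and tool.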
Sending data out of Galaxy:
- Data destination tools — Standard tools; an example is the epigraph configuration file.
- External display applications — links associated with datatypes. Check out the wiki tutorial.