Science Node article "Multiplying Science" features the late James Taylor, SGCI Steering Committee member and Gateways 2019 keynote
Published on Sunday, 05 April 2020 20:00
Tool-sharing platform for genomics analysis becomes so much more
James Taylor, professor of biology and computer science at Johns Hopkins University, died on April 2, 2020. Taylor was a trailblazer in computational biology and genomics research and one of the original developers of the Galaxy platform for data analysis.
Today, we're sharing our article about Taylor's work on Galaxy, and a portion of our interview with him last autumn at the Gateways 2019 conference in San Diego, CA.
The Galaxy Project helps researchers handle large amounts of DNA sequence data by connecting them with analytics tools.
The instructions for every single cell in our bodies are all contained in our DNA. If we want to understand what goes into the making of each one of us—and what diseases we may be predisposed to—we need to unravel the DNA and read those instructions.
But unraveled DNA yields large amounts of sequence data that must be analyzed. To even begin to work with such unwieldy material, scientists often must become inventors, building the software and tools that meet the specific needs of their research.
With a few tweaks, though, those custom tools could be applied to someone else’s research. Trouble is, few people beyond the scientist who needed a tool and the person who developed it know that it exists. When a similar need arises in another scientist’s lab, the process of invention begins all over again.
That's where the Galaxy Project comes in. This science gateway invites people who have created tools and software to share them, and then places them in a user-friendly, web-based platform. Researchers can even combine tools to create their own workflows for custom analyses.
“On one side, you have researchers. They can bring their data into Galaxy and use its analysis tools. On the other side, we have tool developers. They’re taking the tools they've built to analyze data, putting them in Galaxy, and getting their tools out to a wider audience. This way, a larger community of researchers is able to take advantage of their tools.”
Different kinds of data come with different use policies. Some genomics researchers study model organisms such as mice, fruit flies, and nematodes to better understand human biology and health. But others might work with human genomic data from specific individuals.
All of these types of data come with different levels of protection and use restrictions. It can become very complicated very quickly when researchers use data from several different cohorts in their research. Collaboration with the Anvil Project helps keep protected data secure.
Supported by the US National Human Genome Research Institute, Anvil provides data access, data management, and security enforcement, as well as the ability to run large-scale computations. Anvil can also run other environments within it, so Galaxy will be hosted as an analysis environment within Anvil.
“To make it easy for researchers to work with protected data, we need to provide an environment that makes it easy to access and analyze the data, all while respecting and enforcing the policies meant to protect the privacy of the individuals who are enrolled in these studies,” Taylor says. “That's where the Anvil Project comes in. It ensures that people using the platform can only access the data they're authorized to access.”
From its humble beginnings in a server room at Penn State, created by a small group of scrappy graduate students and researchers who saw a need and an opportunity, Galaxy has grown a lot over the last decade.
The main operations moved to the Texas Advanced Computing Center (TACC) eight years ago, but a lot of its computations actually take place on resources allocated through XSEDE (Extreme Science and Engineering Discovery Environment), a National Science Foundation-funded virtual organization that coordinates the sharing of advanced digital resources.
XSEDE provides access to heavy-hitting hardware, like supercomputers, as well as powerful cloud resources, like Jetstream.
“The great thing about Jetstream is because it's a cloud resource, we can scale up and down in a very dynamic way,” Taylor says. “As demand goes up from Galaxy users, we can add more virtual machines and run more jobs on Jetstream. We can also size those virtual machines differently depending on the work that we need to do.”
And if demand is any indication, Galaxy is going to need those scalable resources. As an affiliate of the Science Gateways Community Institute (SGCI), Taylor and the Galaxy team share both knowledge and software with other developers of science gateways.
“By building Galaxy, we have opened up the ability to do genome biology to a much, much larger community,” says Taylor. “And the really crazy thing has been that Galaxy's been picked up outside of genomics.”
“We have people using Galaxy in areas of science I never dreamed of. We have people using Galaxy for climate science. For proteomics. For natural language processing. It's just amazing how much impact we've been able to have. Way more than I ever imagined.”
A new open-source set of tools on Galaxy will help scientists studying the COVID-19 virus obtain transparent and reproducible results. These tools feature analyses in genomics, evolution, and cheminformatics.
Making these tools open-source helps to create a set of standards for handling and calculating COVID-19 data. These standards ensure that any computation will turn out the same way no matter how many times it is run or which lab or computer system carries it out.
The workflows are available to use now on any of the publicly distributed Galaxy instances.