Webinar: Reproducible big data science—A case study in continuous FAIRness using Globus and Globus Genomics
October 9, 2019
Presented by Ravi Madduri, Scientist, Data and Learning, Argonne National Laboratory, and Senior Scientist, University of Chicago Consortium for Advanced Science and Engineering
Big biomedical data create exciting opportunities for discovery but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility—thus ensuring that big data are not hard-to-(re)use data.
In this talk, we will describe the enhancements made to Globus Genomics to support working with datasets referenced by minids, analyzing BagIt-based research objects called BDBags, and executing software encapsulated in Docker containers with unique identifiers. We will describe the tools and services developed to create end-to-end reproducible analysis pipelines while adhering to FAIR principles.
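To make the idea of a checksummed research object concrete, here is a minimal sketch of the payload manifest that a BagIt-style bag carries: one line per file, pairing a SHA-256 digest with the file's path relative to the bag root. It uses only the Python standard library and is not the BDBag or minid tooling discussed in the talk. The function names (`sha256_of`, `write_manifest`, `demo`) and the sample file `reads.txt` are hypothetical, and a complete bag needs further tag files that this sketch omits.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> list[str]:
    """Write manifest-sha256.txt covering every file under bag_dir/data.

    Illustrative only: this is a hypothetical helper, not part of any
    Globus, BDBag, or minid API.
    """
    payload = bag_dir / "data"
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(payload.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
    return lines


def demo() -> list[str]:
    """Build a throwaway bag with one small payload file and return its manifest lines."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        (bag / "data" / "reads.txt").write_bytes(b"ACGT\n")
        return write_manifest(bag)
```

Calling `demo()` returns a single manifest line for `data/reads.txt`. Because the digest changes whenever the file's contents change, a recipient can recompute it to confirm that the data they received is the data that was analyzed.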
- The approach behind Globus Genomics applies more generally to gateways serving other domains and types of big (or not-so-big) data in the service of reproducibility. Hear Ravi Madduri's answer to this question at 37:56 in the video: Can a science gateway that offers computational resources but not long-term storage of users' data use or add your framework to offer the user some of the information needed for reproducibility? Do we need to add the code to our gateway, or can we call your REST API?