Gateways 2018 Keynote Mercè Crosas featured in Science Node Article "A transparent Dataverse"

Details: Published on Tuesday, 11 December 2018 19:00

Mercè Crosas of the Dataverse Project, an open-source repository from Harvard's Institute for Quantitative Social Science, has been featured in an article by Science Node. The interview with Crosas featured in the article (copied below) took place at the Gateways 2018 conference in Austin, TX, where she gave one of the keynotes, "Addressing the next challenges in data sharing with Dataverse."

A transparent Dataverse

How eliminating false leads helps science advance faster.

Speed read

Lack of reproducibility hinders scientific discovery
Dataverse repository helps researchers share, analyze, and cite data across disciplines
Greater transparency supports researchers working together to produce more reliable science

Reproducibility is an essential component of reliable science. But according to a survey conducted by Nature, more than 70 percent of researchers surveyed had failed to reproduce another scientist’s work.

If researchers can’t replicate previous results, they may waste time following false leads or publish inaccurate or incomplete information.

Making science more reliable. Harvard's Dataverse project is an open-source repository that provides a customizable space for researchers to share, analyze, preserve, cite, and explore data from a variety of fields so that it can be replicated more easily.

One solution to this problem is the Dataverse Project, an open-source repository from Harvard's Institute for Quantitative Social Science. The project provides a space for researchers to share, analyze, preserve, cite, and explore data from a variety of fields so that it can be replicated more easily.

Researchers can upload their own data and also search data from other users. The software allows organizations to set up their own data repositories—or dataverses—which they can customize for their needs. It also provides a space for a growing community of developers and users to promote data sharing and data access.

It begins with necessity

Not just for physical science. One dataverse repository houses election data from all 50 states in one common format, allowing it to be more easily analyzed, cross-compared, and applied to statistical models by political scientists and policy makers.

The Dataverse Project began in 2006 when researchers at the Harvard-MIT Data Center were growing tired of how difficult it was to share their datasets.

“Back then they had to make a CD just to be able to take the data away with them and bring it back,” says Mercè Crosas, Dataverse’s co-principal investigator. “At some point, this turned into ‘let’s build a web application to do this.’ Then that grew into ‘let’s build incentives so that the entire research community can share data sets with us.’”

Now, 35 organizations around the world run a Dataverse installation, each of which can host multiple repositories. To date, more than 50,000 datasets have been downloaded more than 4 million times. Dataverse’s open-source code allows researchers, journals, and institutions to easily access data from a variety of fields, from biomedical research to astronomy.

Making data visible

Crosas concedes that full reproducibility may not always be possible, but Dataverse can at least provide an additional layer of transparency.

“There are definitely cases where it would be very difficult to reproduce the entire process,” Crosas says. “On the other hand, that should not be an excuse for those cases.”

Data challenges. Harvard University’s Mercè Crosas presented the keynote talk at the Science Gateways Community Institute's Gateways 2018 conference in Austin, TX. Afterwards, she joined Science Node to expand on addressing the next challenges in data sharing. Courtesy Alicia Hosey.

Taking a big step towards transparency, Dataverse has integrated Code Ocean, a cloud-based computational reproducibility platform. This means that instead of setting up a separate environment to run the code, researchers can upload their data and their scientific code directly to their dataverse and run it right there using Code Ocean.

Other scientists can also run the code inside the dataverse, without installing anything on their personal computer. This makes it easier for outside researchers and publications to verify the reproducibility of the research.

And now many agencies that fund research and journals that publish results have started to require that research data be made publicly available.

“All of this drives home the importance of making data accessible, which has made Dataverse more widely known and more frequently used,” Crosas says.

Protecting sensitive data

Going forward, Crosas says Dataverse is growing more concerned about the security of sensitive data, not only in protecting it from hacks but also in educating users about how certain kinds of information should be handled.

Color sensitive. Dataverse has developed six color-coded levels of sensitivity to help researchers working across disciplinary boundaries better understand what data is private and what is not. Courtesy Dataverse.

“We need to make sure that we not only provide all of the security environment requirements needed for the data, but also that the data depositor—the researcher that has the data—and the user who wants to use that data understand what it means to share datasets that contain sensitive information,” Crosas says.

To that end, Dataverse has developed a color-coded chart that categorizes data into six levels of sensitivity, ranging from completely open to very sensitive.

Examples of highly sensitive data would be health data protected by HIPAA (Health Insurance Portability and Accountability Act), student data protected by FERPA (Family Educational Rights and Privacy Act), social media data, or data protected by GDPR (General Data Protection Regulation). However, Crosas says these distinctions are growing increasingly blurred.

“As we work with more complex data, not only the disciplines within the medical community and social sciences, but also research that is interdisciplinary with combined datasets from different resources, it becomes more difficult to understand what is private and what is not.”

Although there is still much to do, Crosas is proud of Dataverse’s evolution.

“The growth of the community has converted the project from something that I had to make work day by day to something that, now, has become a whole group contributing to make it work,” says Crosas.

The Dataverse Project facilitates in-person interactions between researchers through frequent regional meetings and an annual Global Dataverse Community Consortium, where users from around the world come together and collaborate.

As researchers continue to rely on each other more and more for reliable data across disciplines, teamwork and collaboration are the way of the future for reproducibility and accurate science.

This article was originally published on ScienceNode.org. Read the original article.