Gateways 2022 Abstracts
- Published on Wednesday, 31 August 2022 20:00
The Pulsar Science Collaboratory (PSC) is an out-of-school-time research project engaging high-school students, teachers, and undergraduates in real-world research by searching for pulsars in the data collected with the 100-m Green Bank Telescope (GBT). The primary goals are to stimulate student interest in Science-Technology-Engineering-Math (STEM) careers, to prepare teachers in implementing authentic research with students by training them within a professional scientific community, and to promote the student use of information technologies through online activities and workshops. After training, PSC students and teachers gain access to radio astronomy data collected by the GBT to search for new pulsars and participate in authentic pulsar research with the 20-m telescope at the Green Bank Observatory (GBO).
This case study highlights the initial work and challenges faced when converting the organically grown gateway underlying the Midwest Regional Climate Center to a modern, extensible framework and composable platform. Issues confronted include untangling undocumented scripts, multiple databases and user accounts, hundreds of scheduled processes, and decades of accumulated data dispersed across multiple machines.
The National Cancer Institute (NCI) Community Hub is a science gateway for cancer-related projects from around the world to share knowledge. Community members join the science gateway to publish resources, including datasets and computational tools with digital object identifiers (DOIs), teaching materials, workshop materials, and much more. Additionally, community members join the NCI Hub to collaborate via online community spaces. Funding for the NCI Hub is provided by the National Cancer Institute’s Center for Biomedical Informatics and Information Technology and by the National Institutes of Health.
One of the major challenges with creating open science data web applications is how to sustain and maintain the site after the initial funding runs out. During the Open Life Science mentorship program, I collaborated with iNaturalist project organizers to create a web application that allows users to explore iNaturalist data and environmental data (DataExplorers.info). Since my collaborators had zero budget for web development, I had to deeply rethink the way I build web applications. In the end, the limitation of zero budget turned out to be an asset to the project. By blending software ideas from the tech world and the research world, I was able to create an interactive website using Jupyter notebooks and a static site generator that runs on CSVs and can be hosted on GitHub Pages for free. Replacing database-backed sites with static pages greatly reduces the long-term maintenance and cost of the site. I hope this project can show other small open data projects what is possible with zero budget.
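As a rough illustration of the static-page idea described above, the following sketch renders a CSV file into a standalone HTML page that any free static host could serve. It is not the author's Jupyter and static-site-generator pipeline; the function name `csv_to_html_page` and the sample data are hypothetical, and only the Python standard library is used.

```python
import csv
import html
import io


def csv_to_html_page(csv_text, title):
    """Render CSV text as a self-contained static HTML page.

    Hypothetical helper for illustration only: the first CSV row is treated
    as the table header, and every cell is HTML-escaped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    head_html = "".join("<th>{}</th>".format(html.escape(c)) for c in header)
    body_html = "\n".join(
        "<tr>" + "".join("<td>{}</td>".format(html.escape(c)) for c in row) + "</tr>"
        for row in body
    )
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset='utf-8'><title>{t}</title></head>\n"
        "<body><h1>{t}</h1>\n<table>\n<tr>{h}</tr>\n{b}\n</table>\n</body></html>\n"
    ).format(t=safe_title, h=head_html, b=body_html)


if __name__ == "__main__":
    # Made-up sample observations; a real site would read a CSV file from disk.
    sample = "species,count\nQuercus alba,3\nAcer rubrum,5\n"
    print(csv_to_html_page(sample, "Observations"))
```

Because the output is a plain file, there is no database or server process to maintain; regenerating the page when the CSV changes is the only recurring step.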
Rapid advances in technology over the past decade have enabled collection of large amounts of data, in particular, through data streams from sensors and devices. Effective utilization of such data has been hampered by the lack of ready-to-use resources for data providers to manage the data and for data consumers to access the data through facile APIs. In this paper, we present StreamCI, a scalable cloud-based sensor data collection and analysis system that enables researchers to easily collect, process, store, and access large volumes of heterogeneous sensor data. StreamCI provides a web portal for users and administrators to easily register new data sources and monitor the status of data ingestion pipelines. The back end of StreamCI provides real-time data ingestion/query APIs, data access control, and data processing pipelines using an open source software stack including RabbitMQ, Node.js, MongoDB, Certbot, Grafana, and HUBzero. Containerization and orchestration of services using Kubernetes improve scalability, as demonstrated by our experimental results. The StreamCI system has been used in multiple research and education domains including the collection and processing of plant health sensor data by plant phenotype researchers, collection of real-world air quality sensor data and its use in data analysis coursework for ecology students, and the collection and analysis of advanced manufacturing data for cybersecurity research.
Currently, there is increasing demand for resources and tools made available through new and advanced computing technologies, such as networks, large data analysis, and sophisticated simulation tools that assist in the understanding of natural phenomena. High Performance Computing (HPC) plays an important part in this area. However, undergraduate students lack experience in how HPC functions because most curricula do not adequately cover HPC. To assist in solving this problem, the Science Gateways Community Institute obtained external funding to improve undergraduate computing education through enhanced courses. The goal is to incorporate HPC concepts and training across the computing curricula in multiple disciplines to motivate students’ interests in computing and increase their critical thinking skills, which will diversify and strengthen the future workforce in the United States. The panel on HPC Curriculum Enhancements is made up of Minority Serving Institutions (MSI) faculty who have trained in HPC and Gateway technology and are now working to integrate HPC into their assigned computer science classes. The panel will describe the efforts of the NSF Science Gateways Community Institute (SGCI) to support these faculty. The panel will begin with an overview of the SGCI Faculty HPC/Gateways Program. The faculty panel will also present their lessons learned and next steps in this endeavor. MSIs represented on the panel are Winston-Salem State University, Elizabeth City State University, Norfolk State University, Mississippi Valley State University, and Jackson State University.
CHEESEHub is a web-accessible, public science gateway that hosts containerized, hands-on demonstrations of cybersecurity concepts. There is now a plethora of services and tools designed to simplify modern gateway deployment and configuration, such as commercial and academic composable cloud, the Terraform infrastructure-as-code tool, Kubernetes and Helm for container orchestration, and CILogon for simplified user authentication. Despite leveraging these tools, our day-to-day experience with deploying, upgrading, scaling, and extending CHEESEHub has not been entirely straightforward. We describe here some of the major challenges we have encountered in managing CHEESEHub and developing web-accessible demonstrations for the last five years. We hope this will help both new and seasoned gateway developers to effectively leverage these modern tools while avoiding these same pitfalls, and will provide starting points for discussions around gateway development and deployment best practices.
Gateways provide access to computational codes as a service to their users. One popular design pattern is to use a single user account on an HPC system to run all workloads for the gateway's user community. This introduces several security concerns for the owner of the user account, the machine, and the integrity of the workloads run by the gateway. In this paper we present preliminary work on pieshell, a limited, secure Linux shell that runs in user space. We begin by discussing the use case pieshell addresses. Next we discuss how it fills a gap in the existing secure shell landscape. We then describe pieshell's design and usage within a gateway context before concluding with areas of future work.
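To make the idea of a limited shell concrete, here is a minimal sketch of one common approach: parse each command line without invoking a real shell and accept it only if the program name is on an allowlist. This is not pieshell's design, which the abstract does not describe; the names `ALLOWED_COMMANDS`, `FORBIDDEN_CHARS`, and `parse_command` are hypothetical.

```python
import shlex

# Hypothetical allowlist of programs a gateway's shared account may run.
ALLOWED_COMMANDS = {"ls", "cat", "sbatch", "squeue"}

# Characters that would allow chaining, redirection, or substitution
# if the line were ever handed to a real shell.
FORBIDDEN_CHARS = set(";|&`$><\n")


def parse_command(line, allowed=ALLOWED_COMMANDS):
    """Return an argv list for an allowed command, or raise an error."""
    if any(ch in FORBIDDEN_CHARS for ch in line):
        raise PermissionError("shell metacharacters are not permitted")
    argv = shlex.split(line)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in allowed:
        raise PermissionError("command not allowed: " + argv[0])
    return argv


if __name__ == "__main__":
    print(parse_command("squeue -u gateway"))
```

The returned list is meant to be executed directly (for example with `subprocess.run(argv)`, which does not start a shell by default), so the filtered line is never reinterpreted. A production tool would also need to restrict arguments, paths, and the environment, which this sketch does not attempt.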
Researchers and educators in the humanities, such as computational linguists, digital humanists, and those doing historical reconstructions, are increasingly heavy users of computational and/or data resources. Many know about activities, working groups, and initiatives around the FAIR (Findable, Accessible, Interoperable, Reusable) principles and are a driving force for improving the situation for sharing data and software. However, it seems humanities researchers are less aware of the science gateways community and the end-to-end solutions that science gateways could provide, and therefore lack a driving force for adoption of this technology. Some may be creating their own gateways outside the community; others may wish to use computational and data infrastructures but may perceive a lack of support or opportunities. Hypotheses about why the humanities are not well represented among gateway builders and users include lack of funding and lack of support by computer centers. This study will clarify some of the challenges and needs faced by computational researchers in the humanities that may explain their relatively low participation in the science gateways community. For this paper, we present the results of two interviews as proof of concept for the study. We plan to follow with 12-15 additional interviews for the larger study.
Science gateways adoption and diffusion can be increased and accelerated through influence and outreach by using opinion leaders or gateways influencers. In this paper, we describe how influencers can help accelerate the spread of technology and how they can be utilized in a gateways context. Specifically, we identify how current gateway staff can be trained to identify and recruit influencers and how to systematically prepare an ‘influencer recruitment plan’; we explain how influencers might differ across domains (e.g., science vs. humanities); and we offer suggestions on how influencers can be integrated into the gateways workforce. Our framework for identifying influencers could aid in ensuring the continued growth and sustainability of the gateways community in the long term.
The science gateway nanoHUB has operated and supported a growing user base for over 20 years. To move toward future sustainability, nanoHUB must retain existing users, cultivate new users, and foster community involvement. In order to support retention and growth of the community, the nanoHUB team has leveraged their existing user information and analytics along with the commercial customer relationship management (CRM) tool Salesforce to better understand and communicate with current, former, and potential users. This paper details our efforts in that direction.
Academic groups increasingly utilize interactive computing methods for various tasks related to their research. Jupyter notebooks, in particular, enable users to create literate computing documents that mix code, outputs, and text, suitable for dissemination of results, as well as teaching in the classroom. These capabilities complement the batch computing services provided by HPC centers exposed through traditional science gateways. However, integrating Jupyter into science gateways and other advanced computing ecosystems introduces new challenges related to scalability, collaboration and reproducibility. In this paper, we present Project Scinco (Scalable Interactive Computing), an open platform for scalable, reproducible, interactive computing, designed to be run at academic computing centers and incorporated into science gateways. We describe the primary features and architecture of Scinco and highlight some of the projects using it in different fields from astronomy to machine learning and civil engineering.
Scientific productivity is a long-standing issue across all cyberinfrastructure-backed research areas. Many research groups are known to have low productivity because FAIR principles are not practiced in their workflows and data products. Non-sharable and non-reusable code and datasets drive students and researchers to repeat the same data retrieval and cleaning procedures, wasting numerous hours on the same steps; this affects newly onboarded members and even original contributors who, after a while, have forgotten how to reproduce their own results. Geoweaver, a GUI-based scientific workflow management system, was developed to address this low-FAIRness-caused productivity issue while reducing the learning costs for researchers with a less technical background. This paper uses a machine learning workflow in Geoweaver as a demonstration example. The example use case utilizes PyTorch to create and deploy an operational crop classification model that classifies cropland in Kenya during the growing season (i.e., from partially observed time series) and produces a 10-meter resolution cropland probability map for the 2021-2022 growing season.
Science gateways can add value for researchers by empowering them to organize their data in a way that tells a compelling science story. To this end, the Web and Mobile Applications team at TACC has pursued data curation as a first-class feature in our gateways. We have previously reported on curation features in the DesignSafe cyberinfrastructure, which has supported the publication of over 500 annotated data sets since its launch in 2016. In DesignSafe, users associate their data with metadata entities that represent different aspects of an experiment or simulation, then assemble those entities into a hierarchical tree which represents the overall research project. We present a novel design for a data curation architecture which provides feature parity with the DesignSafe model, but which can be provided "out of the box" in containerized science gateway solutions. This provides a path forward for advanced data curation features in gateways supporting any conceivable scientific domain. Our proposed architecture represents the data curation hierarchy as a directed tree which can be manipulated using standard graph analysis libraries. We believe that this design is broadly applicable because it provides a strong separation of concerns between the construction of the entity tree and the management of its constituent metadata records.
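The separation of concerns described above can be sketched as follows: the tree stores only entity identifiers and parent/child links, while metadata records live in a separate mapping keyed by the same identifiers. This is an illustrative sketch, not the TACC implementation; the class `CurationTree`, its methods, and the entity names are hypothetical, and a plain dictionary stands in for a graph library.

```python
class CurationTree:
    """Directed tree of entity IDs; metadata is kept elsewhere."""

    def __init__(self, root_id):
        self.root = root_id
        self.children = {root_id: []}
        self.parent = {root_id: None}

    def add_entity(self, entity_id, parent_id):
        """Attach a new entity under an existing parent."""
        if parent_id not in self.children:
            raise KeyError("unknown parent: " + parent_id)
        if entity_id in self.children:
            raise ValueError("duplicate entity: " + entity_id)
        self.children[parent_id].append(entity_id)
        self.children[entity_id] = []
        self.parent[entity_id] = parent_id

    def walk(self):
        """Depth-first, pre-order traversal yielding (depth, entity_id)."""
        stack = [(0, self.root)]
        while stack:
            depth, node = stack.pop()
            yield depth, node
            # Push children in reverse so they are visited in insertion order.
            for child in reversed(self.children[node]):
                stack.append((depth + 1, child))

    def path_to_root(self, entity_id):
        """Return the chain of IDs from an entity up to the root."""
        path = []
        node = entity_id
        while node is not None:
            path.append(node)
            node = self.parent[node]
        return path


if __name__ == "__main__":
    tree = CurationTree("project")
    tree.add_entity("experiment-1", "project")
    tree.add_entity("model-config", "experiment-1")
    tree.add_entity("sensor-data", "experiment-1")

    # Metadata records are managed independently of the tree structure.
    metadata = {"experiment-1": {"title": "Example experiment"}}

    for depth, entity in tree.walk():
        print("  " * depth + entity, metadata.get(entity, {}))
```

Because the tree never inspects the metadata, a record's schema can change without touching the hierarchy, and the hierarchy can be rearranged without rewriting any record.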
The goal of a robust cyberinfrastructure (CI) ecosystem is to catalyse discovery and innovation. Tapis does this by offering a sustainable, production-quality set of API services to support modern science and engineering research, which increasingly spans geographically distributed data centers, instruments, experimental facilities, and a network of national and regional CI. Leveraging frameworks such as Tapis enables researchers to accomplish computational and data-intensive research in a secure, scalable, and reproducible way and allows them to focus on their research instead of the technology needed to accomplish it. This project aims to enable the integration of the Google Cloud Platform (GCP) and CloudyCluster resources into Tapis-supported science gateways to provide the on-demand scaling needed by computational workflows. The new functionality uses Tapis event-driven Abaco Actors and CloudyCluster to create an elastic distributed cloud computing system on demand. This integration allows researchers and science gateways to augment existing local and national computing resources with cloud resources.
The Tapis Streams API is a production-grade service that provides REST APIs for storing, processing, and analyzing real-time streaming data. This paper focuses on improvements made to the Tapis 1.0 Streams API to bring it up to date and make it easily accessible. The newer version, the Tapis 1.2 Streams API, adopts the latest version of InfluxDB, InfluxDB 2.X, which has built-in security features and supports next-generation data analytics and processing with the data processing language Flux. This paper also discusses the measures implemented in the Tapis 1.2 Streams API to mitigate the security risk of data streams being accessed by users who do not own them. Additionally, new data Channel Actions supporting third-party notifications and webhooks have been released. Lastly, the paper discusses Tapis UI, a self-contained serverless application for accessing Tapis services via REST calls. Tapis UI is a lightweight, browser-only client application that allows interactive access to Streams resources and real-time streaming data.
Effective communication is crucial for the success of academic projects, especially within multidisciplinary teams where researchers differ not only in personal and cultural background but also in discipline. This can lead to misunderstandings that may not even be obvious in meetings and project plans when the same terms are used for different concepts. Team members implicitly assume that all parties work with the same definition of terms. The project VisDict addresses the communication between workflow providers and domain researchers via the creation of a visual dictionary in a science gateway so that differences in the perception of terms are easily recognized and can be resolved in a timely manner. Dictionaries are used as translation tools between natural languages; using them to translate between computational science and research domains such as physics and biology is novel. In this paper, we detail our approach to building the dictionary in a science gateway, the lessons learned from carefully curating the first entries, and plans for automating its extension to a large set of relevant terms, including their illustrations.
The Atom portal, udel.edu/atom, provides the scientific community with easily accessible high-quality atomic data, including energies, transition matrix elements, transition rates, radiative lifetimes, branching ratios, polarizabilities, and hyperfine constants for atoms and ions. The data are calculated using a high-precision state-of-the-art linearized coupled-cluster method. All values include estimated uncertainties. Where available, experimental results are provided with references.
This paper describes some of the software engineering approaches applied in the development of the portal.
What could a research team accomplish if given access to the latest GPU hardware? A molecular dynamics research team could process protein-ligand tests using 100 GPUs on Delta in about 3.6 days, a task that would take their local lab a full year using their own GPU hardware. A team exploring the wonders of cosmic rays is now part of the multi-messenger astrophysics revolution; they can crunch ever more data with the additional resources. With the newly built Delta system, the research community will utilize the latest NVIDIA and AMD GPUs to empower their scientific exploration. In parallel with the Delta system launch, the Delta team conducted 44 interviews with research code leads and science gateway community leads. The observations and requirements discovered through these conversations, such as optimizing Delta to support AI/ML applications and scheduling, guide the system’s configuration and the design of the Delta science gateway.
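The figures quoted for the molecular dynamics example can be turned into a quick back-of-the-envelope comparison. The sketch below assumes that "a full year" means 365 days; the constant and function names are hypothetical, and the only inputs are the numbers stated in the abstract.

```python
DELTA_GPUS = 100     # GPUs used on Delta (from the abstract)
DELTA_DAYS = 3.6     # wall-clock days on Delta (from the abstract)
LOCAL_DAYS = 365.0   # assumption: "a full year" taken as 365 days


def wall_clock_speedup(local_days, delta_days):
    """Ratio of local turnaround time to Delta turnaround time."""
    return local_days / delta_days


def gpu_days(gpus, days):
    """Total GPU time consumed, in GPU-days."""
    return gpus * days


if __name__ == "__main__":
    print("speedup: about %.0fx" % wall_clock_speedup(LOCAL_DAYS, DELTA_DAYS))
    print("Delta usage: %.0f GPU-days" % gpu_days(DELTA_GPUS, DELTA_DAYS))
```

Under that assumption the turnaround improves by roughly a factor of 100, and the 360 GPU-days consumed on Delta are close to one GPU-year. That would be consistent with a local lab running on about one comparable GPU if the workload scaled linearly, though the abstract does not state the lab's hardware.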
Efforts to develop a workforce with applied skills in high-performance computing have highlighted a gap in the hands-on experience personnel need to apply HPC to scientific problems. The exposure provided in current academic programs, especially at the entry level, was deemed insufficient. To increase student participation, HackHPC, a collaboration between the Science Gateways Community Institute, Omnibond Systems, and the Texas Advanced Computing Center, was formed to address that gap through the application of methods used in events known as hackathons. Hackathons are time-bounded events during which participants form teams to work on a project that is of interest to them. Hosting a hackathon that has desired long-term outcomes involves a number of crucial decisions related to preparing, running, and following up on an event. The methodologies, participants, procedures, and refined implementation of practices used to plan and host hackathon events for this purpose have been termed the "HackHPC Model".
For reliable machine learning and statistical inference with large/big data, curing incomplete data is critical. Fractional hot deck imputation (FHDI) cures multivariate missing data by filling each missing unit with multiple observed values (thus, hot-deck) without resorting to distributional assumptions. By inheriting the power of FHDI and parallel computing, we developed ultra data-oriented parallel FHDI (named UP-FHDI) to cure ultra incomplete data with tremendous instances (big-n) and high dimensionality (big-p). We enabled scalable ultra incomplete data curing and also devised variance estimation via a parallel Jackknife method as well as efficient ultra data-oriented parallel linearization techniques. Results confirm that UP-FHDI can cure ultra datasets with up to millions of instances and 10,000 variables. We validate the accuracy and scalability of UP-FHDI, paving a new pathway for reliable machine learning and statistical inference with "cured" big data.
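The core idea of filling each missing unit with multiple observed values can be illustrated with a heavily simplified, single-variable sketch. This is not the FHDI or UP-FHDI algorithm: it assumes imputation cells are given by an existing grouping column, takes the first few observed values in a cell as donors, and gives them equal fractional weights. All function and field names are hypothetical.

```python
from collections import defaultdict


def fractional_hot_deck(records, cell_key, target, max_donors=3):
    """Return, per record, a list of (value, fractional_weight) pairs.

    Observed values keep weight 1.0. A missing value (None) is replaced by
    up to `max_donors` observed values from the same cell, equally weighted.
    """
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[rec[cell_key]].append(rec[target])

    imputed = []
    for rec in records:
        if rec[target] is not None:
            imputed.append([(rec[target], 1.0)])
            continue
        pool = donors[rec[cell_key]][:max_donors]
        if not pool:
            raise ValueError("no donors in cell %r" % (rec[cell_key],))
        weight = 1.0 / len(pool)
        imputed.append([(value, weight) for value in pool])
    return imputed


def weighted_mean(imputed):
    """Mean over units, each unit contributing its fractionally weighted values."""
    total = sum(value * weight for unit in imputed for value, weight in unit)
    return total / len(imputed)


if __name__ == "__main__":
    data = [
        {"cell": "a", "y": 1.0},
        {"cell": "a", "y": 3.0},
        {"cell": "a", "y": None},   # filled by donors 1.0 and 3.0
        {"cell": "b", "y": 10.0},
        {"cell": "b", "y": None},   # filled by donor 10.0
    ]
    print(weighted_mean(fractional_hot_deck(data, "cell", "y")))
```

Each missing unit still counts once in the estimate because its donor weights sum to one. Real hot deck procedures select donors and weights far more carefully than the equal-weight, first-come choice used here.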
Christina Koch and Robert Quick
The Open Science Pool (OSPool) is a virtual cluster operated by the OSG, with shared computing and data resources via distributed high-throughput computing (dHTC) technologies. The pool aggregates mostly opportunistic (“backfill”) computing resources from contributing clusters at campuses and other organizations, making them available to the US-based open science community. The OSPool and related services are an ideal backend for computational gateways that serve communities with dHTC-friendly workloads. In this tutorial, we will identify communities who might benefit from such a gateway, explore OSPool and HTCondor features that can be used to support gateways, and use a gateway that leverages the OSPool.
Apache Airavata is the gateway framework that will be used to demonstrate OSPool resource gateway integration. Airavata is a software suite that composes, manages, executes, and monitors large-scale applications and workflows on computational resources ranging from local clusters to national grids through web interfaces. Airavata consists of four components: (1) a workflow suite, enabling a user to compose and monitor workflows, (2) an application wrapper service to convert command line programs into services that can be used reliably on a network, (3) a registry service that records how workflows and wrapped programs have been deployed, and (4) a message brokering service to enable communication over possibly unreliable networks to clients behind organizations' firewalls. This tutorial will provide an overview of these components and how they are used in the demonstration gateway.
After completing this tutorial participants will have a basic understanding of distributed High Throughput Computing, available OSPool capacity and tooling, and how they can use an Apache Airavata web environment to leverage HTC capacity in the form of the OSPool.
Rajesh Kalyanam, Erik Gough, Brian Werts, Samuel Weekly and Eric Adams
Anvil is a capacity, high-performance computing (HPC) system that is currently allocated to researchers as part of the XSEDE national computing infrastructure program. In addition to the traditional HPC computing cluster, Anvil also includes a composable subsystem that is designed for non-traditional gateway-style workloads and deployments. The composable subsystem greatly simplifies the scalable, extensible, and portable deployment of containerized applications through Kubernetes via a user-friendly Rancher web interface. In this introductory tutorial we will provide attendees with an overview of the Anvil cluster, Kubernetes concepts, and hands-on experience in using Rancher to deploy gateway components such as databases, web servers, and interactive computing environments (e.g., Jupyter) all from their web browser. We will also demonstrate how automated scaling can be configured in Kubernetes to allow gateways to scale up or down in response to workloads.
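For readers unfamiliar with how Kubernetes decides when to scale, the documented rule used by the Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), bounded by configured minimum and maximum replica counts. The sketch below reproduces that arithmetic only; whether the tutorial uses this particular autoscaler is an assumption, and the function name and default bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the proportional scaling rule.

    The metric could be, for example, average CPU utilization per pod.
    The result is clamped to the [min_replicas, max_replicas] range.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods averaging 90% CPU against a 50% target: scale up.
    print(desired_replicas(2, 90, 50))
    # Four pods averaging 10% CPU against a 50% target: scale down.
    print(desired_replicas(4, 10, 50))
```

The real controller adds further safeguards, such as a tolerance band and stabilization windows, that this sketch leaves out.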
Lee Liming and Steve Turoscy
Science gateways hide complexity behind deceptively simple interfaces. These accessible, easy-to-use web interfaces enable a broad research audience to use sophisticated computing and data capabilities on highly specialized systems. Modern research instruments and facilities, such as sequencing cores, satellite-based systems, advanced light sources, and Cryo-EM, generate datasets at the TB+ scale. Behind the scenes, science gateways must stage data from instruments, submit compute jobs to analyze data (using shared or cloud-hosted computers), move results to more persistent storage, describe data products, and provide a means for collaborators to search, discover, reuse, and augment these data products. Myriad tools are available to enable all these tasks, but integrating them in a way that hides the complexity from users is a challenge. This scenario-driven, 180-minute tutorial with optional hands-on experiences introduces researchers and science gateway developers with intermediate experience to an approach that bootstraps science gateway development based on the Modern Research Data Portal design pattern. The solution uses a set of open source tools that build on the established Django web framework and Globus platform services. The Globus platform provides federated logins via InCommon and CILogon, a rich groups API for access management, data upload and download at scale, and a simple but powerful search API for indexing and describing datasets. The Django Globus Portal integrates these features in a single Django project that can be deployed in a few minutes and then customized to add features for a specific science gateway. Attendees will leave the tutorial with a working data portal and accompanying documentation and references for re-hosting the portal on their own systems and customizing it to their own needs.
Sean Cleveland, Anagha Jamthe, Steve Black, Joe Stubbs, Joon Chuah and Michael Packard
This tutorial will focus on providing attendees exposure to cutting-edge technologies for building reproducible, portable, and scalable scientific computing workloads, which can be easily run across Cloud and HPC machines. This tutorial will explain how to effectively leverage the NSF-funded Tapis v3 platform, an Application Program Interface (API) for distributed computation. We will include several hands-on exercises, which will enable the attendees to build a complete scientific workflow that can be seamlessly moved to different execution environments, including a small virtual machine and a national-scale supercomputer. Using techniques covered in the tutorial, attendees will be able to easily share their results and analyses with one or more additional users. This tutorial will make use of a specific machine learning image classifier analysis to illustrate the concepts, but the techniques introduced can be applied to a broad class of analyses in virtually any domain of science or engineering. This tutorial will also introduce the Tapis UI project, which can act as a base-level, easy-to-host science gateway.
Gerald Byrket, Jeff Ohrstrom, Travis Ravert, Alan Chalker, David Hudak and Robert Settlage
Open OnDemand (openondemand.org) is an NSF-funded open-source HPC platform currently in use at over 200 HPC centers around the world.(1-7) It is an intuitive, innovative, and interactive interface to remote computing resources that has recently been adopted as the portal of choice for the ACCESS program. For the user, Open OnDemand (OOD) offers an interface that is both intuitive and easy to use. For the center support staff, the intuitive, easy-to-use interface translates into a reduced help ticket load.
The Open OnDemand interface is browser based, runs as the user, and is extensible via customizable apps. These apps are effectively web forms that accept user input and reformat it into a scheduler request, which is then transformed into a link presented to the user for the software application of interest. The official OOD GitHub repo currently has links to software that appeals to a wide range of scientific disciplines, such as Jupyter, Abaqus, ANSYS, COMSOL, MATLAB, RStudio, Tensorboard, QGIS, VMD, RELION, STATA and Visual Studio.
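The step of reformatting web form input into a scheduler request can be pictured with the sketch below, which turns a dictionary of form fields into a Slurm batch script. It is illustrative only and is not how Open OnDemand implements its apps; Slurm is assumed as the scheduler, and the function `build_sbatch_script` and the form field names are hypothetical.

```python
def build_sbatch_script(form):
    """Translate web form fields (strings) into a Slurm batch script.

    Expected hypothetical fields: job_name, nodes, hours, partition, command.
    """
    nodes = int(form["nodes"])
    hours = int(form["hours"])
    if nodes < 1 or hours < 1:
        raise ValueError("nodes and hours must be at least 1")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name={}".format(form["job_name"]),
        "#SBATCH --nodes={}".format(nodes),
        "#SBATCH --time={:02d}:00:00".format(hours),
        "#SBATCH --partition={}".format(form["partition"]),
        "",
        form["command"],
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values as they might arrive from a submitted web form.
    submitted = {
        "job_name": "jupyter",
        "nodes": "2",
        "hours": "4",
        "partition": "debug",
        "command": "jupyter lab --no-browser",
    }
    print(build_sbatch_script(submitted))
```

A real portal would validate every field against site policy before submission and would track the resulting job so that a link to the running application can be shown to the user.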
This tutorial is for those who would like to use Open OnDemand as a platform for Gateway development. Building a Gateway on the same compute resource(s) that OOD already surfaces has several advantages, including reuse of the currently installed and configured OOD instance (hardware and software), which reduces the overhead required to support science Gateways. The tutorial will cover basic OOD dashboard customization and then work through creation of a dashboard Passenger app (Gateway). The tutorial will utilize a containerized high-performance computing cluster (2-node) that is available via GitHub and will run locally on the participants' laptops.