Guest Blog: The Distant Reader—A Jetstream-Powered Assist for an Age-Old Process
By Harmony Jankowski, Science Highlights Team, UITS Research Technologies, Indiana University
Scholars, students, and researchers in all fields have one activity in common: reading. With the increased prevalence of online publishing through formal and informal channels, the number of books and articles written about any and all topics has grown exponentially. What is a conscientious researcher to do when one just cannot possibly read everything?
Enter the Distant Reader gateway. Developed in 2019 at the University of Notre Dame by Eric Lease Morgan, the Distant Reader supplements the traditional reading process by helping the user make sense of a body of text too large for one mind to process. The tool works with a “corpus” supplied by the researcher, which can be a web-based text, a file, or a set of files. It “reads” the corpus, harvesting and caching the content. The gateway then applies natural language processing (NLP) and text mining, transforms the corpus into machine-readable formats and a relational database, summarizes what it learned, and creates a “study carrel” to which the researcher receives a link. The researcher can download the study carrel and begin investigating some of the big-picture findings about the texts in the corpus.
The Distant Reader excels at certain kinds of reading activities, such as identifying frequently appearing and statistically significant words or phrases; illuminating themes; categorizing and summarizing parts of speech, revealing what’s described and how; and listing the people and places mentioned. Far from an end point for researchers, the Distant Reader opens up a text or group of texts and is meant to be paired with other tools for visualization, topic modeling, and interpretation of results.
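To give a feel for the simplest of these activities, counting frequently appearing words, here is a minimal Python sketch. It is not the Distant Reader's own code: the function names, the tiny stop-word list, and the two-sentence sample corpus are all assumptions made for illustration, and a real tool would use a full NLP library and a much larger stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(documents, n=5):
    """Return the n most frequent non-stop words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in tokenize(doc) if w not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical two-document corpus.
corpus = [
    "The whale is the largest animal in the sea.",
    "A whale and a ship met in the open sea.",
]

print(top_words(corpus, n=2))
```

Run against a real corpus of hundreds of articles, a count like this is the kind of raw signal a reader might then hand to visualization or topic-modeling tools.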
To make all of this possible, the Distant Reader uses a dynamic virtual cluster deployed on the Jetstream cloud, the National Science Foundation’s (NSF) first production research and education cloud. Led by the Indiana University Pervasive Technology Institute and an SGCI Partner, Jetstream provides self-service cloud resources for researchers and educators who need a different set of capabilities from the traditional high performance computing environments typically used in research. It is designed to provide programmable cyberinfrastructure that gives researchers and educators access to interactive computing and data analysis resources on demand. Jetstream is a resource provided for the education and research communities and is allocated via the NSF’s eXtreme Science and Engineering Discovery Environment (XSEDE) (for more information, contact email@example.com).
Users access the Distant Reader gateway through a web interface, which uses Apache Airavata middleware to pass requests between the web interface and the virtual cluster on Jetstream. The Distant Reader leverages SciGaP as service infrastructure providing user authentication, authorization, and identity management, in addition to access to the Distant Reader tool. The execution environment is a virtual cluster made up of a single small head node and a dozen on-demand compute nodes, with SLURM scheduling and managing resources. The head node receives a request from the scheduler and uses OpenStack to provision one or more compute nodes for executing the request. Each compute node has 10 cores, 30 GB of RAM, and 60 GB of local disk, and all cluster nodes can access a shared file system.
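For a rough sense of scale, the per-node figures above can be multiplied out. The short Python sketch below does that arithmetic under two assumptions: that "a dozen" means exactly twelve compute nodes, and that all of them are provisioned at once. The head node is left out because no specifications are given for it, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    """Per-node resources as quoted in the article."""
    cores: int = 10
    ram_gb: int = 30
    disk_gb: int = 60


def cluster_capacity(node, node_count):
    """Total resources when node_count identical nodes are running."""
    return {
        "cores": node.cores * node_count,
        "ram_gb": node.ram_gb * node_count,
        "disk_gb": node.disk_gb * node_count,
    }


MAX_NODES = 12  # assumption: "a dozen" read as exactly twelve

capacity = cluster_capacity(ComputeNode(), MAX_NODES)
print(capacity)
```

With every compute node running, that comes to 120 cores, 360 GB of RAM, and 720 GB of local disk. Because nodes are provisioned on demand, the cluster may well be using far less than this at any given moment.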
The Distant Reader can be helpful to researchers and students in various scholarly situations, from scientists doing a literature review to undergraduates who want a “big picture” understanding of an entire semester’s readings. The gateway is free for anyone to use; for more information, check it out at https://distantreader.org.