If you have questions that are not answered in this guide, reach out to the Research Services Team Leader, Chris Guder.
You can also contact the librarian or archivist associated with your discipline area.
For direct support from HTRC, email htrc-help@hathitrust.org. You can also sign up for HTRC email announcements, which include details about monthly office hours.
HTRC Data Capsules are Linux virtual machines that let researchers run their own code against the full text of any volume in HathiTrust through the HTRC Data API, while restricting what can be exported to avoid violating copyright law. Data capsules give researchers the flexibility to work with "medium-sized" full-text corpora with the text in its original order, in contrast to the Extracted Features Dataset, which reduces each work to a bag of words.
As of April 2022, HTRC Data Capsules run Ubuntu 16 with the Anaconda 3 (4.2.0) distribution of Python preinstalled, along with other research tools including R, Mallet, Ant, Voyant Tools, Spark, and Hadoop. HTRC also provides a command line client for its API (the HTRC Workset Toolkit).
To allow computational access to the full texts of copyrighted works without leaking them, data capsules have two modes: maintenance and secure. In maintenance mode, the capsule can reach the internet but not the HTRC Data API, so this is when researchers install the code and tools they will use in secure mode. In secure mode, the capsule can reach the HTRC Data API but not the internet, so researchers can run their code against HathiTrust works. Results can then be placed on a mounted export drive, where HTRC staff review them before they are exported.
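The two modes can be summarized as a small access table. The sketch below is only an illustration of the rules described above and is not part of any HTRC tool; the names `CapsuleMode`, `ACCESS`, `can_reach`, and `mode_for`, along with the task labels, are hypothetical.

```python
from enum import Enum


class CapsuleMode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


# What each mode can reach, as described in this guide.
# The resource labels are illustrative, not HTRC identifiers.
ACCESS = {
    CapsuleMode.MAINTENANCE: {"internet": True, "htrc_data_api": False},
    CapsuleMode.SECURE: {"internet": False, "htrc_data_api": True},
}

# Hypothetical task labels grouped by the resource they depend on.
NEEDS_INTERNET = {"install_packages", "download_code"}
NEEDS_FULL_TEXT = {"fetch_volumes", "run_analysis"}


def can_reach(mode: CapsuleMode, resource: str) -> bool:
    """Return True if a capsule in `mode` can reach `resource`."""
    return ACCESS[mode][resource]


def mode_for(task: str) -> CapsuleMode:
    """Return the mode a task has to be done in."""
    if task in NEEDS_INTERNET:
        return CapsuleMode.MAINTENANCE
    if task in NEEDS_FULL_TEXT:
        return CapsuleMode.SECURE
    raise ValueError(f"unknown task: {task!r}")
```

The practical consequence is ordering: anything that needs the internet has to be finished in maintenance mode before the capsule is switched to secure mode.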
Allowable data capsule exports include derived facts about works such as:
Prohibited exports are any that would allow the reconstruction of copyrighted full text works.
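To illustrate the difference between a derived fact and reconstructable text, here is a minimal Python sketch that reduces a text to alphabetically sorted word counts, which discards the original word order. It assumes the text is already available as a string inside the capsule; the function names and the CSV layout are hypothetical, and whether any particular result may be exported is decided by HTRC staff review, not by this code.

```python
import csv
import re
from collections import Counter
from pathlib import Path

# Simple tokenizer: lowercase alphabetic words, with an optional apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_counts(text: str) -> Counter:
    """Count how often each word occurs, ignoring case and word order."""
    return Counter(TOKEN_RE.findall(text.lower()))


def write_counts(counts: Counter, out_path: Path) -> None:
    """Write counts as CSV rows sorted by token, not by position in the text."""
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token", "count"])
        for token, count in sorted(counts.items()):
            writer.writerow([token, count])
```

For example, `word_counts("The cat and the hat")` counts "the" twice and every other word once. The resulting table says nothing about where in the text each word appeared.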
Data capsules have two interfaces: a VNC remote desktop view that lets you interact live with the virtual machine's desktop in your browser, and a terminal or SSH mode that can only be accessed while the capsule is in maintenance mode.
A common practical HTRC data capsule workflow is to: