HathiTrust Research Center

Data Capsules

HTRC Data Capsules are Linux virtual machines that let researchers run their own code against the full text of any volume in HathiTrust through the HTRC Data API, while restricting what can be exported to avoid violating copyright law. Data capsules provide the flexibility to do research on "medium-sized" full-text corpora with the words in their original order, in contrast to the Extracted Features Dataset, which reduces each work to a bag of words.

  • Demo data capsules are available to anyone with an HTRC account for practice. Demo capsules can access only a small sample of public domain works, and no results can be exported.
  • Data capsules with access to all of HathiTrust's public domain works are also available to anyone with an HTRC account, and results can be exported.
  • Data capsules with access to the full text of copyrighted works are available only to researchers affiliated with HathiTrust member institutions, and results are manually checked by HTRC staff before they can be exported. Ohio University is a HathiTrust member institution, so Ohio University-affiliated researchers have computational, non-consumptive access to the full text of all HathiTrust works.

As of April 2022, HTRC Data Capsules run Ubuntu 16 with the Anaconda 3 (4.2.0) distribution of Python preinstalled, along with other research tools including R, MALLET, Ant, Voyant Tools, Spark, and Hadoop. HTRC also provides a command line client for its API (the HTRC Workset Toolkit).

To allow computational access to the full text of copyrighted works without leaking it, data capsules have two modes: maintenance and secure. In maintenance mode the capsule can reach the internet but not the HTRC Data API, so researchers can install the code and tools they will use in secure mode. In secure mode the capsule can reach the HTRC Data API but not the internet, so researchers can run their code against HathiTrust works. The results can then be added to a mounted export drive for HTRC staff review before being exported.
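The two-mode design can be illustrated with a small model. This is only a sketch of the rule described above, not HTRC code: the names `Mode`, `ACCESS`, `can_reach`, and `leak_path_exists` are hypothetical and exist purely for illustration.

```python
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


# Which resources each mode can reach, as described in the text.
# (Hypothetical table for illustration; not an HTRC configuration file.)
ACCESS = {
    Mode.MAINTENANCE: {"internet": True, "htrc_data_api": False},
    Mode.SECURE: {"internet": False, "htrc_data_api": True},
}


def can_reach(mode, resource):
    """Return True if a capsule in the given mode can reach the resource."""
    return ACCESS[mode][resource]


def leak_path_exists():
    """Return True if any single mode can reach every resource at once."""
    return any(all(flags.values()) for flags in ACCESS.values())
```

In this model no mode can reach both the internet and the Data API, which is the property that keeps full text from leaving the capsule over the network.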

Allowable data capsule exports include derived facts about works such as:

  • Topic models
  • Statistical summaries
  • Visualizations
  • Named entities (the output of named entity recognition)
  • N-grams

Prohibited exports are any that would allow the reconstruction of copyrighted full text works.
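As an illustration of a derived result of the kind listed above, the following minimal sketch counts bigrams in a short made-up string. The helper names `tokenize` and `ngram_counts` and the sample text are assumptions for illustration; whether a particular output may actually be exported is decided by HTRC staff review, not by this code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n=2):
    """Count each run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up sample text, standing in for a volume read in secure mode.
sample = "the whale the whale and the sea"
counts = ngram_counts(tokenize(sample), n=2)
```

Aggregate counts of short n-grams like these summarize a text without preserving its running order.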

Data capsules have two interfaces: a VNC remote desktop view that lets you interact live with the virtual machine's desktop in your browser, and a terminal or SSH mode that can only be accessed while the capsule is in maintenance mode.

A common practical HTRC data capsule workflow is to:

  1. Create a data capsule.
  2. In maintenance mode, install any tools or outside data in the capsule and debug them on the maintenance mode test data set.
  3. Switch to secure mode, run the analysis, and save results to the mounted secure volume.
  4. Submit the results for review and export.
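The analysis step of this workflow might look like the sketch below, which writes one row of summary statistics per text file to a CSV file. It assumes the volumes are already available as plain `.txt` files in a directory; the function names and any paths passed in are hypothetical, and the actual locations of the test data set and the export volume depend on the capsule. The code avoids newer syntax such as f-strings in case the preinstalled Python is an older 3.x release.

```python
import csv
import os


def summarize_volume(text):
    """Return token and distinct-word counts for one volume's text."""
    tokens = text.lower().split()
    return {"tokens": len(tokens), "types": len(set(tokens))}


def summarize_corpus(input_dir, output_csv):
    """Write per-volume summary statistics for every .txt file in input_dir."""
    rows = []
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(input_dir, name)
        with open(path, encoding="utf-8") as handle:
            stats = summarize_volume(handle.read())
        rows.append({"volume": name,
                     "tokens": stats["tokens"],
                     "types": stats["types"]})
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["volume", "tokens", "types"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A script like this can be debugged in maintenance mode against the test data set, then rerun unchanged in secure mode with `output_csv` pointing at the volume used for export.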