If you have questions that are not answered in this guide, reach out to the Research Services Team Leader, Chris Guder.
You can also contact the librarian or archivist associated with your discipline area.
For direct support from HTRC, email htrc-help@hathitrust.org. You can also sign up for HTRC email announcements, which include details about monthly office hours.
HTRC Algorithms are pre-configured versions of standard text-mining algorithms that HTRC runs on its own servers and then provides you with the results. All you need to run an Algorithm is an HTRC Workset of up to about 3,000 volumes. They are the easiest way to get started analyzing HathiTrust books.
In addition to the basic algorithms described on this page, HathiTrust+Bookworm is a minimal-effort, click-and-run visualization tool that graphs word trends over time, providing instant results from HathiTrust's millions of volumes.
Algorithm | Type of analysis | Description |
---|---|---|
Token Count and Tag Cloud Creator | Word frequencies | Analyzes word usage in a workset |
InPhO Topic Model Explorer | Themes | Extracts groups of words that represent the workset’s primary topics |
Named Entity Recognizer | Proper nouns | Identifies people, places, organizations, dates, and monetary values that appear throughout the workset |
The browser-based algorithms described on this page can be applied to worksets of fewer than 3,000 volumes, as long as the total size of the workset is less than 3 GB. Smaller worksets, down to a single volume, will produce more manageable datasets.
"Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where the size of the word is displayed in proportion to the number of times it occurred." - HTRC
Once you have logged in to HTRC, select the "Algorithms" tab to execute the Token Count and Tag Cloud Creator.
Provide a name for the job, select the workset to be analyzed, and identify the predominant language.
Define the parameters for stop words and replacement rules. Stop words will be filtered out of the results, while replacement rules standardize words with inconsistent spellings or non-standard characters. Click the "Use default" button to use the HTRC-provided parameters.
Choose whether words are converted to lowercase before counting. If set to "True," capitalized instances of the same word will not be counted separately.
The last two parameters apply only to the tag cloud visualization. The default regular expression value displays only words composed of letters, including hyphenated words. The maximum number of tokens controls how many words appear in the visualization.
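For illustration only (this is not necessarily HTRC's exact default expression), a pattern with that effect might look like the following, keeping purely alphabetic and hyphenated words while excluding tokens that contain digits or other symbols:

```python
import re

# Hypothetical pattern: one or more letter-only segments joined by hyphens.
# HTRC's actual default expression may differ.
pattern = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")

for token in ["rain-kiss", "trilling", "1979", "half&half"]:
    # Only "rain-kiss" and "trilling" match the pattern.
    print(token, bool(pattern.match(token)))
```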
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
Tag cloud visualization: Displays the words that occur most frequently throughout the workset. The larger the word, the more occurrences were detected by the algorithm. The color and arrangement of the words change slightly each time the job page is refreshed. To save the visualization, click the button to open it in a new tab and capture a screenshot.
Tag cloud generated from Jesse Stuart author workset. Click image to view in new tab.
Token counts: A downloadable CSV file that provides the exact number of occurrences of every word in the workset. The file can be opened in a spreadsheet program such as Microsoft Excel or Google Sheets, although a large dataset may prevent some software from opening it. In addition to commonly used words, researchers can gain insight into those used least frequently (see the sketch after the table below).
Token | Count
---|---
rain-kiss | 1
yonside | 1
frog-moaned | 1
harmonicas | 1
slushes | 1
once-green | 1
wind-sliced | 1
innerested | 1
click-clacking | 1
love-vined | 1
reforested | 1
windgust | 1
cawcawed | 1
trilling | 1
The table above shows several lines of token counts from the Jesse Stuart workset. Each of these words occurs only once, but their rarity speaks to the author's style and use of language.
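As a minimal sketch of the kind of analysis the token counts enable, the following Python snippet filters a downloaded counts file for hapax legomena (words that occur exactly once). The file name and column names are assumptions; adjust them to match your actual download:

```python
import pandas as pd

# Load the token counts exported by the algorithm.
# "token_counts.csv" and the column names are assumed; adjust to your file.
counts = pd.read_csv("token_counts.csv", names=["token", "count"])

# Hapax legomena: tokens that appear exactly once in the workset.
hapaxes = counts[counts["count"] == 1]

print(f"{len(hapaxes)} of {len(counts)} tokens occur only once")
print(hapaxes.head(20))
```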
Read more about the Jesse Stuart author collection housed in OHIO Libraries Archives & Special Collections.
A topic model employs machine learning to aid the discovery of abstract topics or themes present in a large or unstructured dataset.
For this kind of analysis, the text is chunked into "documents," and stop words (frequently used words such as "the," "and," and "if") are removed, since they reveal little about the substance of a text. The computer treats the documents as bags of words and guesses which words make up a "topic" based on their proximity to one another in the documents, with the idea that words that frequently co-occur are likely about the same thing. The results are groupings of words that the computer has statistically analyzed and determined are likely related to each other.
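The HTRC algorithm handles all of this on its own servers, but a minimal local sketch of the same idea, using scikit-learn's LDA implementation rather than InPhO's, looks like this (the tiny corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; each string stands in for one "document."
docs = [
    "the river flooded the valley after heavy spring rain",
    "rain and wind battered the coastal village all night",
    "the senate passed the budget after a long debate",
    "voters elected a new senate majority in the election",
]

# Bag-of-words representation with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic model; n_components and max_iter parallel the
# "number of topics" and "iterations" parameters described below.
lda = LatentDirichletAllocation(n_components=2, max_iter=200, random_state=0)
lda.fit(X)

# Print the top words for each discovered topic.
words = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [words[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")
```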
Once you have logged in to HTRC, select the "Algorithms" tab to execute the InPhO Topic Model Explorer.
Provide a name for the job and select the workset to be analyzed.
Define the parameter for the number of iterations. This determines the number of samples that the topic model will use to conduct its analysis. A lower number of iterations (e.g., 200) will process faster and is good for experimentation, while a higher number (e.g., 1,000) will give higher-quality results.
Indicate the number of topics you would like the algorithm to generate from your dataset. Multiple values, such as the default "20 40 60 80," are accepted. With this configuration, the model will run multiple times creating a list of 20, 40, 60, and 80 topics for comparison. An appropriate number will depend on your research inquiry and the size of the workset.
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
Bubble visualization: Displays each topic as a node or bubble. Hovering over a bubble shows the top words that were grouped into the topic. The colors of the bubbles are a loose representation of topics with similar themes. Click the "collision detection" box to minimize overlap between bubbles and improve readability. The numbers on the left side indicate the number of topics generated, as does the size of the bubbles. You can toggle the display of the topic clusters by clicking on the numbers.
Topic model visualization of the Cairns Collection of American Women Writers. Click image to open interactive visualization.
Topics json file: In addition to the bubble visualization, the algorithm generates several files, including one called topics.json. This file gives researchers a detailed view of the word groupings that constitute each topic. Each word has a decimal number that represents the probability of its appearance in the topic. Topics.json can be viewed directly in the browser or pasted into a conversion tool such as json2table.com for easier viewing; a small parsing sketch follows below.
Topics.json text showing grouped words and probabilities
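You can also inspect the file programmatically. The exact schema of topics.json is not documented here, so the snippet below assumes a simple structure in which each topic maps words to probabilities; adapt the access pattern to the structure you actually see in your file:

```python
import json

# Load the downloaded topics file (name taken from the job output).
with open("topics.json") as f:
    topics = json.load(f)

# Assumed structure: {"topic_0": {"word": 0.042, ...}, ...}.
# Print each topic's five most probable words.
for name, word_probs in topics.items():
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(name, ":", ", ".join(f"{w} ({p:.3f})" for w, p in top))
```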
Topic models require interpretation. They provide lists of words that may be related given the frequency with which they appear together throughout the corpus. It's up to the researcher to derive meaning from the word groupings and interpret the topic. The topic model generated for the Cairns Collection of American Women Writers reveals multiple word groupings that suggest overarching themes such as religion, romance, children's literature, travel, American history, and the role of women in society and the home. The dataset also contains some very specific word groupings that are less broadly represented, including those related to slavery, science, women's organizations, and book publishing.
The HTRC topic model explorer has limited parameters and often yields imprecise results. However, it may prove useful for identifying themes and outliers within a large, unfamiliar text corpus.
"Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. " - HTRC
Named Entity Recognition (NER) is a process by which parts of an unstructured text corpus are labeled and extracted as named entities. It relies on machine learning to structure components of the text and classify them using statistical models of word use. NER is a branch of Natural Language Processing (NLP), which uses computer systems to parse human language and extract data.
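HTRC runs its recognizer on its own servers, but you can get a feel for how NER behaves locally. The following sketch uses the spaCy library (not the engine HTRC uses) on an invented example sentence:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Zakes Mda taught at Ohio University in Athens, Ohio, in 2002.")

# Each entity carries a label such as PERSON, ORG, GPE, or DATE.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```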
Once you have logged in to HTRC, select the "Algorithms" tab to execute the Named Entity Recognizer.
Provide a name for the job, select the workset to be analyzed, and identify the predominant language.
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
This video walks through the process of applying the NER algorithm to a workset.
The resulting CSV file can be used to create a map visualization (see instructions and video below).
Table of named entities: A downloadable CSV file that lists the entity name and type, along with the volume ID and page sequence where each entity occurs. The file can be opened in a spreadsheet program such as Microsoft Excel or Google Sheets, although a large dataset may prevent some software from opening it.
Entity types include: date, location, miscellaneous proper nouns, money, organization, percent, person, and time. The corresponding volume ID and page sequence can be used to identify the source of a given entity by plugging the vol_id and page_seq into the HathiTrust page-viewer URL template:

https://babel.hathitrust.org/cgi/pt?id={vol_id}&seq={page_seq}

For example, with the volume ID "hvd.32044021584222" and the page sequence "33", the URL becomes https://babel.hathitrust.org/cgi/pt?id=hvd.32044021584222&seq=33. The link points to the HathiTrust Digital Library volume and page on which the entity appears.
vol_id | page_seq | entity | type
---|---|---|---
mdp.39015063160900 | 2 | 1979 | DATE
mdp.39015063160900 | 2 | 1978 | DATE
mdp.39015063160900 | 2 | 1963 | DATE
mdp.39015061208131 | 19 | New Ireland | LOCATION
mdp.39015061208131 | 19 | Papua New Guinea | LOCATION
mdp.39015061208131 | 19 | South Africa | LOCATION
mdp.39015061208131 | 19 | Zwelihle Township | LOCATION
mdp.39015055802758 | 20 | Gxarha River | LOCATION
mdp.39015061208131 | 21 | St Helena Bay | LOCATION
uc1.32106017886364 | 21 | Lesotho | LOCATION
uc1.32106017886364 | 21 | South Africa | LOCATION
uc1.32106017886364 | 21 | Lesotho | LOCATION
uc1.32106017886364 | 10 | African National Congress's Youth League | ORGANIZATION
uc1.32106017886364 | 10 | National Party | ORGANIZATION
uc1.32106017886364 | 10 | African National Congress | ORGANIZATION
uc1.32106017886364 | 10 | Pan Africanist Congress | ORGANIZATION
uc1.b3668076 | 10 | Peka High School | ORGANIZATION
uc1.b3668076 | 10 | Federated Union of Black Arts | ORGANIZATION
inu.30000055316172 | 3 | Zakes Mda | PERSON
inu.30000055316172 | 4 | Lea Glen | PERSON
inu.30000055316172 | 4 | Nangomso Jol | PERSON
mdp.39015063160900 | 4 | Dorothy Wheeler | PERSON
mdp.39015063160900 | 4 | Garth Erasmus | PERSON
uc1.b3668076 | 4 | Eddie Nhlapo | PERSON
uc1.b3668076 | 4 | James Mthoba | PERSON
The table above shows a selection of named entities from the Zakes Mda author workset.
One of the challenges associated with NER is the ambiguity of language. For example, the proper noun "Lincoln" could refer to a person, a place, or a car manufacturer. Locations are particularly problematic, since the same place name can occur in many different parts of the world. An NER dataset will inevitably require some scrutiny and data cleaning to identify and fix inaccurate labels.
The map below displays places mentioned in the Zakes Mda author workset. Due to copyright restrictions, the workset's six volumes are not available to view in the HathiTrust Digital Library; however, they can be analyzed with HTRC tools such as the Named Entity Recognizer.
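As a starting point for a map like this, a sketch along the following lines pulls the place names out of the entities CSV and tallies them. The file name is an assumption, and the column names follow the table above; geocoding and plotting would come afterward, with the cleaning caveats noted above:

```python
import pandas as pd

# Load the NER results (file name assumed; columns follow the table above).
entities = pd.read_csv("named_entities.csv",
                       names=["vol_id", "page_seq", "entity", "type"])

# Keep only place names and count how often each one is mentioned.
places = entities[entities["type"] == "LOCATION"]
place_counts = places["entity"].value_counts()

print(place_counts.head(10))

# Save for a later geocoding/mapping step (e.g., with geopy or a GIS tool).
place_counts.to_csv("place_counts.csv")
```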