If you have questions that are not answered in this guide, reach out to the Research Services Team Leader, Chris Guder.
You can also contact the librarian or archivist associated with your discipline area.
For direct support from HTRC, email htrc-help@hathitrust.org. You can also sign up for HTRC email announcements, which include details about monthly office hours.
HTRC Algorithms are pre-configured versions of standard text-mining algorithms that HTRC runs on its own servers and then provides you with the results. All you need to run an Algorithm is an HTRC Workset of up to about 3,000 volumes. They are the easiest way to get started analyzing HathiTrust books.
In addition to the basic algorithms described on this page, HathiTrust+Bookworm is a minimal-effort, click-and-run visualization tool that graphs word trends over time, providing instant results from HathiTrust's millions of volumes.
Algorithm | Type of analysis | Description |
---|---|---|
Token Count and Tag Cloud Creator | Word frequencies | Analyzes word usage in a workset |
InPhO Topic Model Explorer | Themes | Extracts groups of words that represent the workset’s primary topics |
Named Entity Recognizer | Proper nouns | Identifies people, places, organizations, dates, and monetary values that appear throughout the workset |
The browser-based algorithms described on this page can be applied to worksets of fewer than 3,000 volumes, as long as the total size of the workset is less than 3 GB. Smaller worksets, down to a single volume, will produce more manageable datasets.
"Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where the size of the word is displayed in proportion to the number of times it occurred." - HTRC
Once you have logged in to HTRC, select the "Algorithms" tab to execute the Token Count and Tag Cloud Creator.
Provide a name for the job, select the workset to be analyzed, and identify the predominant language.
Define the parameters for stop words and replacement rules. Stop words will be filtered out of the results, while replacement rules standardize words with inconsistent spellings or non-standard characters. Click the "Use default" button to use the HTRC-provided parameters.
Choose whether words are converted to lowercase before counting. If set to "True," capitalized instances of the same word will not be counted separately.
The last two parameters apply only to the tag cloud visualization. The default regular expression value displays only words composed of letters, including hyphenated words. The maximum number of tokens controls how many words appear in the visualization.
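For illustration only (this is not necessarily HTRC's exact default expression), a pattern with that effect might look like the following, keeping purely alphabetic and hyphenated words while excluding tokens that contain digits or other symbols:

```python
import re

# Hypothetical pattern: one or more letter-only segments joined by hyphens.
# HTRC's actual default expression may differ.
pattern = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")

for token in ["rain-kiss", "trilling", "1979", "half&half"]:
    # Only "rain-kiss" and "trilling" match the pattern.
    print(token, bool(pattern.match(token)))
```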
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
Tag cloud visualization: Displays the words that occur most frequently throughout the workset. The larger the word, the more occurrences were detected by the algorithm. The color and arrangement of the words change slightly each time the job page is refreshed. To save the visualization, click the button to open it in a new tab and capture a screenshot.
Tag cloud generated from Jesse Stuart author workset. Click image to view in new tab.
Token counts: A downloadable CSV file that provides the exact number of occurrences of every word in the workset. The file can be opened in a spreadsheet program such as Microsoft Excel or Google Sheets, although a large dataset may prevent some software from opening it. In addition to commonly used words, researchers can gain insight into those used least frequently (see the sketch after the table below).
Token | Count
---|---
rain-kiss | 1
yonside | 1
frog-moaned | 1
harmonicas | 1
slushes | 1
once-green | 1
wind-sliced | 1
innerested | 1
click-clacking | 1
love-vined | 1
reforested | 1
windgust | 1
cawcawed | 1
trilling | 1
The table above shows several lines of token counts from the Jesse Stuart workset. Each of these words occurs only once, but their rarity speaks to the author's style and use of language.
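As a minimal sketch of the kind of analysis the token counts enable, the following Python snippet filters a downloaded counts file for hapax legomena (words that occur exactly once). The file name and column names are assumptions; adjust them to match your actual download:

```python
import pandas as pd

# Load the token counts exported by the algorithm.
# "token_counts.csv" and the column names are assumed; adjust to your file.
counts = pd.read_csv("token_counts.csv", names=["token", "count"])

# Hapax legomena: tokens that appear exactly once in the workset.
hapaxes = counts[counts["count"] == 1]

print(f"{len(hapaxes)} of {len(counts)} tokens occur only once")
print(hapaxes.head(20))
```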
Read more about the Jesse Stuart author collection housed in OHIO Libraries Archives & Special Collections.
A topic model employs machine learning to aid the discovery of abstract topics or themes present in a large or unstructured dataset.
For this kind of analysis, the text is chunked into "documents," and stop words (frequently used words such as "the," "and," and "if") are removed, since they reveal little about the substance of a text. The computer treats the documents as bags of words and guesses which words make up a "topic" based on their proximity to one another in the documents, with the idea that words that frequently co-occur are likely about the same thing. The results are groupings of words that the computer has statistically analyzed and determined are likely related to each other.
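The HTRC algorithm handles all of this on its own servers, but a minimal local sketch of the same idea, using scikit-learn's LDA implementation rather than InPhO's, looks like this (the tiny corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; each string stands in for one "document."
docs = [
    "the river flooded the valley after heavy spring rain",
    "rain and wind battered the coastal village all night",
    "the senate passed the budget after a long debate",
    "voters elected a new senate majority in the election",
]

# Bag-of-words representation with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic model; n_components and max_iter parallel the
# "number of topics" and "iterations" parameters described below.
lda = LatentDirichletAllocation(n_components=2, max_iter=200, random_state=0)
lda.fit(X)

# Print the top words for each discovered topic.
words = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [words[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")
```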
Once you have logged in to HTRC, select the "Algorithms" tab to execute the InPhO Topic Model Explorer.
Provide a name for the job and select the workset to be analyzed.
Define the parameter for the number of iterations. This determines the number of samples that the topic model will use to conduct its analysis. A lower number of iterations (e.g., 200) will process faster and is good for experimentation, while a higher number (e.g., 1,000) will give higher-quality results.
Indicate the number of topics you would like the algorithm to generate from your dataset. Multiple values, such as the default "20 40 60 80," are accepted. With this configuration, the model will run multiple times creating a list of 20, 40, 60, and 80 topics for comparison. An appropriate number will depend on your research inquiry and the size of the workset.
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
Bubble visualization: Displays each topic as a node or bubble. Hovering over a bubble shows the top words that were grouped into the topic. The colors of the bubbles are a loose representation of topics with similar themes. Click the "collision detection" box to minimize overlap between bubbles and improve readability. The numbers on the left side indicate the number of topics generated, as does the size of the bubbles. You can toggle the display of the topic clusters by clicking on the numbers.
Topic model visualization of the Cairns Collection of American Women Writers. Click image to open interactive visualization.
Topics json file: In addition to the bubble visualization, the algorithm generates several files, including one called topics.json. This file gives researchers a detailed view of the word groupings that constitute each topic. Each word has a decimal number that represents the probability of its appearance in the topic. Topics.json can be viewed directly in the browser or pasted into a conversion tool such as json2table.com for easier viewing; a small parsing sketch follows below.
Topics.json text showing grouped words and probabilities
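You can also inspect the file programmatically. The exact schema of topics.json is not documented here, so the snippet below assumes a simple structure in which each topic maps words to probabilities; adapt the access pattern to the structure you actually see in your file:

```python
import json

# Load the downloaded topics file (name taken from the job output).
with open("topics.json") as f:
    topics = json.load(f)

# Assumed structure: {"topic_0": {"word": 0.042, ...}, ...}.
# Print each topic's five most probable words.
for name, word_probs in topics.items():
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(name, ":", ", ".join(f"{w} ({p:.3f})" for w, p in top))
```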
Topic models require interpretation. They provide lists of words that may be related given the frequency with which they appear together throughout the corpus. It's up to the researcher to derive meaning from the word groupings and interpret the topic. The topic model generated for the Cairns Collection of American Women Writers reveals multiple word groupings that suggest overarching themes such as religion, romance, children's literature, travel, American history, and the role of women in society and the home. The dataset also contains some very specific word groupings that are less broadly represented, including those related to slavery, science, women's organizations, and book publishing.
The HTRC topic model explorer has limited parameters and often yields imprecise results. However, it may prove useful for identifying themes and outliers within a large, unfamiliar text corpus.
"Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. " - HTRC
Named Entity Recognition (NER) is a process by which parts of an unstructured text corpus are labeled and extracted as named entities. It relies on machine learning to structure components of the text and classify them using statistical models of word use. NER is a branch of Natural Language Processing (NLP), which uses computer systems to parse human language and extract data.
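HTRC runs its recognizer on its own servers, but you can get a feel for how NER behaves locally. The following sketch uses the spaCy library (not the engine HTRC uses) on an invented example sentence:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Zakes Mda taught at Ohio University in Athens, Ohio, in 2002.")

# Each entity carries a label such as PERSON, ORG, GPE, or DATE.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```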
Once you have logged in to HTRC, select the "Algorithms" tab to execute the Named Entity Recognizer.
Provide a name for the job, select the workset to be analyzed, and identify the predominant language.
Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.
Clicking the name of a completed job displays the results and data available for download.
This video walks through the process of applying the NER algorithm to a workset.
The resulting CSV file can be used to create a map visualization (see instructions and video below).
Table of named entities: A downloadable CSV file that lists the entity name and type, along with the volume ID and page sequence where each entity occurs. The file can be opened in a spreadsheet program such as Microsoft Excel or Google Sheets, although a large dataset may prevent some software from opening it.
Entity types include: date, location, miscellaneous proper nouns, money, organization, percent, person, and time. The corresponding volume ID and page sequence can be used to identify the source of a given entity by plugging the vol_id and page_seq into the HathiTrust page-viewer URL template:

https://babel.hathitrust.org/cgi/pt?id={vol_id}&seq={page_seq}

For example, with the volume ID "hvd.32044021584222" and the page sequence "33", the URL becomes https://babel.hathitrust.org/cgi/pt?id=hvd.32044021584222&seq=33. The link points to the HathiTrust Digital Library volume and page on which the entity appears.
vol_id | page_seq | entity | type
---|---|---|---
mdp.39015063160900 | 2 | 1979 | DATE
mdp.39015063160900 | 2 | 1978 | DATE
mdp.39015063160900 | 2 | 1963 | DATE
mdp.39015061208131 | 19 | New Ireland | LOCATION
mdp.39015061208131 | 19 | Papua New Guinea | LOCATION
mdp.39015061208131 | 19 | South Africa | LOCATION
mdp.39015061208131 | 19 | Zwelihle Township | LOCATION
mdp.39015055802758 | 20 | Gxarha River | LOCATION
mdp.39015061208131 | 21 | St Helena Bay | LOCATION
uc1.32106017886364 | 21 | Lesotho | LOCATION
uc1.32106017886364 | 21 | South Africa | LOCATION
uc1.32106017886364 | 21 | Lesotho | LOCATION
uc1.32106017886364 | 10 | African National Congress's Youth League | ORGANIZATION
uc1.32106017886364 | 10 | National Party | ORGANIZATION
uc1.32106017886364 | 10 | African National Congress | ORGANIZATION
uc1.32106017886364 | 10 | Pan Africanist Congress | ORGANIZATION
uc1.b3668076 | 10 | Peka High School | ORGANIZATION
uc1.b3668076 | 10 | Federated Union of Black Arts | ORGANIZATION
inu.30000055316172 | 3 | Zakes Mda | PERSON
inu.30000055316172 | 4 | Lea Glen | PERSON
inu.30000055316172 | 4 | Nangomso Jol | PERSON
mdp.39015063160900 | 4 | Dorothy Wheeler | PERSON
mdp.39015063160900 | 4 | Garth Erasmus | PERSON
uc1.b3668076 | 4 | Eddie Nhlapo | PERSON
uc1.b3668076 | 4 | James Mthoba | PERSON
The table above shows a selection of named entities from the Zakes Mda author workset.
One of the challenges associated with NER is the ambiguity of language. For example, the proper noun "Lincoln" could refer to a person, a place, or a car manufacturer. Locations are particularly problematic, since the same place name can occur in many different parts of the world. An NER dataset will inevitably require some scrutiny and data cleaning to identify and fix inaccurate labels.
The map below displays places mentioned in the Zakes Mda author workset. Due to copyright restrictions, the workset's six volumes are not available to view in the HathiTrust Digital Library; however, they can be analyzed with HTRC tools such as the Named Entity Recognizer.
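As a starting point for a map like this, a sketch along the following lines pulls the place names out of the entities CSV and tallies them. The file name is an assumption, and the column names follow the table above; geocoding and plotting would come afterward, with the cleaning caveats noted above:

```python
import pandas as pd

# Load the NER results (file name assumed; columns follow the table above).
entities = pd.read_csv("named_entities.csv",
                       names=["vol_id", "page_seq", "entity", "type"])

# Keep only place names and count how often each one is mentioned.
places = entities[entities["type"] == "LOCATION"]
place_counts = places["entity"].value_counts()

print(place_counts.head(10))

# Save for a later geocoding/mapping step (e.g., with geopy or a GIS tool).
place_counts.to_csv("place_counts.csv")
```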