
HathiTrust Research Center

HTRC Algorithms

HTRC Algorithms are pre-configured versions of standard text-mining algorithms that HTRC runs on its own servers and then provides you with the results. All you need to run an Algorithm is an HTRC Workset of up to about 3,000 volumes. They are the easiest way to get started analyzing HathiTrust books.

In addition to the basic algorithms described on this page, HathiTrust+Bookworm is a minimal-effort, click-and-run visualization tool that graphs word trends over time, providing instant results from HathiTrust's millions of volumes.

Basic HTRC algorithms (see Advanced Features for additional tools)

Algorithm                           Type of analysis    Description
Token Count and Tag Cloud Creator   Word frequencies    Analyzes word usage in a workset
InPhO Topic Model Explorer          Themes              Extracts groups of words that represent the workset's primary topics
Named Entity Recognizer             Proper nouns        Identifies people, places, organizations, dates, and monetary values that appear throughout the workset

The browser-based algorithms described on this page can be applied to worksets of fewer than 3,000 volumes, as long as the total size of the workset is less than 3 GB. Worksets with fewer volumes, including single-volume worksets, produce more manageable result datasets.

How to apply basic algorithms

 

"Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where the size of the word is displayed in proportion to the number of times it occurred." - HTRC


  1. Once you have logged in to HTRC, select the "Algorithms" tab to execute the Token Count and Tag Cloud Creator.

  2. Provide a name for the job, select the workset to be analyzed, and identify the predominant language.

  3. Define the parameters for stop words and replacement rules. Stop words will be filtered out of results, while replacement rules will substitute words with inconsistent spellings and non-standard characters. Click the "Use default" button to use the HTRC-provided parameters.

  4. Choose whether words are converted to lowercase before counting. If set to "True," capitalized instances of the same word will not be counted separately.

  5. The last two parameters apply only to the tag cloud visualization. The default regular expression limits the display to words made up of letters, including hyphenated words, and the maximum number of tokens controls how many words appear in the visualization. (A rough sketch of how these parameters interact follows these steps.)

  6. Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.

  7. Clicking the name of a completed job displays the results and data available for download.
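
The browser form handles all of this for you, but as a rough picture of what these parameters control, here is a minimal Python sketch of the same kind of token count. The stop-word list, the regular expression, and the sample text are illustrative assumptions, not the HTRC defaults, and replacement rules are omitted.

    import re
    from collections import Counter

    # Illustrative stand-ins for the algorithm's parameters -- not the HTRC defaults.
    STOP_WORDS = {"the", "and", "of", "a", "to", "in"}        # step 3: stop words to filter out
    LOWERCASE = True                                          # step 4: lowercase before counting
    WORD_PATTERN = re.compile(r"^[a-z]+(?:-[a-z]+)*$", re.I)  # step 5: letters or hyphenated words
    MAX_TOKENS = 200                                          # step 5: cap on words shown in the cloud

    def token_counts(text):
        tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
        if LOWERCASE:
            tokens = [t.lower() for t in tokens]
        kept = [t for t in tokens if t and t not in STOP_WORDS and WORD_PATTERN.match(t)]
        return Counter(kept).most_common(MAX_TOKENS)

    print(token_counts("The rain-kissed hills and the once-green fields of the valley"))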


Output & Interpretation

 

Tag cloud visualization: Displays the words that occur most frequently throughout the workset. The larger the word, the more occurrences the algorithm detected. The color and arrangement of the words change slightly each time the job page is refreshed. To save the visualization, click the button to open it in a new tab and capture a screenshot.

Tag cloud generated from the Jesse Stuart author workset, showing 200 words in different colors and sizes.

Token counts: A downloadable CSV file that provides the exact number of occurrences of every word in the workset. The file can be opened using a spreadsheet program such as Microsoft Excel or Google Sheets. However, the size of the dataset could prohibit some software from opening the file. In addition to commonly used words, researchers can gain insight into the words used least frequently.
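
If you would rather regenerate the tag cloud yourself than capture a screenshot, the downloaded token counts can be fed to a word-cloud library. The sketch below is a rough example, assuming the first two columns of the file are the token and its count and that the pandas and wordcloud packages are installed; the file names are placeholders.

    import pandas as pd
    from wordcloud import WordCloud

    # File name is a placeholder; the first two columns are assumed to be token and count.
    counts = pd.read_csv("token_counts.csv")
    token_col, count_col = counts.columns[:2]

    # Map each token to its count and render the result as an image.
    frequencies = dict(zip(counts[token_col], counts[count_col]))
    cloud = WordCloud(width=1200, height=600, background_color="white", max_words=200)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file("tag_cloud.png")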

 

Token            Count
rain-kiss        1
yonside          1
frog-moaned      1
harmonicas       1
slushes          1
once-green       1
wind-sliced      1
innerested       1
click-clacking   1
love-vined       1
reforested       1
windgust         1
cawcawed         1
trilling         1

The table above shows several lines of token counts from the Jesse Stuart workset. Each of these words occurs only once, but their rarity speaks to the author's style and use of language.
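
To pull out these rarely used words yourself, filter the downloaded CSV for tokens that occur exactly once (hapax legomena). A minimal pandas sketch, again assuming the first two columns are the token and its count and using a placeholder file name:

    import pandas as pd

    counts = pd.read_csv("token_counts.csv")     # placeholder file name
    token_col, count_col = counts.columns[:2]    # assumed layout: token, then count

    # Tokens that occur exactly once often reveal distinctive coinages and dialect.
    hapax = counts[counts[count_col] == 1].sort_values(token_col)
    print(f"{len(hapax)} tokens occur only once")
    print(hapax.head(20).to_string(index=False))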

Read more about the Jesse Stuart author collection housed in OHIO Libraries Archives & Special Collections.

InPhO Topic Model Explorer

A topic model employs machine learning to aid the discovery of abstract topics or themes present in a large or unstructured dataset.

For this kind of analysis, the text is chunked into "documents", and stop words (frequently used words such as "the", "and", "if") are removed since they reveal little about the substance of a text. The computer treats the documents as bags of words, and guesses which words make up a "topic" based on their proximity to one another in the documents, with the idea that words that frequently co-occur are likely about the same thing. The results are groupings of words that the computer has statistically analyzed and determined are likely related to each other...

- HTRC, Digging Deeper, Reaching Further
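
HTRC runs the topic model for you, but the bag-of-words idea described above can be illustrated with any off-the-shelf implementation. The sketch below uses scikit-learn's LDA as a stand-in (it is not the InPhO implementation) on a few toy documents:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "the river flooded the valley after heavy spring rain",
        "the school board voted to fund the new library",
        "rain and wind battered the coastal valley for days",
        "students crowded the library before the school exams",
    ]

    # Bag-of-words matrix with common English stop words removed.
    vectorizer = CountVectorizer(stop_words="english")
    word_counts = vectorizer.fit_transform(documents)

    # Ask for two topics; a real workset would use far more documents and topics.
    lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
    lda.fit(word_counts)

    # Print the highest-weighted words in each topic.
    words = vectorizer.get_feature_names_out()
    for topic_id, weights in enumerate(lda.components_):
        top_words = [words[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {topic_id}: {', '.join(top_words)}")

As with the HTRC output, the resulting word groupings still require human interpretation.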


  1. Once you have logged in to HTRC, select the "Algorithms" tab to execute the InPhO Topic Model Explorer.

  2. Provide a name for the job and select the workset to be analyzed.

  3. Define the number of iterations. This determines the number of samples that the topic model will use to conduct its analysis. A lower number of iterations (e.g., 200) processes faster and is good for experimentation; a higher number (e.g., 1,000) produces higher-quality results.

  4. Indicate the number of topics you would like the algorithm to generate from your dataset. Multiple values, such as the default "20 40 60 80," are accepted. With this configuration, the model will run multiple times creating a list of 20, 40, 60, and 80 topics for comparison. An appropriate number will depend on your research inquiry and the size of the workset.

  5. Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.

  6. Clicking the name of a completed job displays the results and data available for download.


Output & Interpretation

Bubble visualization: Displays each topic as a node or bubble. Hovering over a bubble shows the top words grouped into that topic. The colors of the bubbles are a loose representation of topics with similar themes. Click the "collision detection" box to minimize overlap between bubbles and improve readability. The numbers on the left side correspond to the number of topics generated, as does the size of the bubbles; you can toggle the display of each topic cluster by clicking on its number.

 

Topic model visualization of the Cairns Collection of American Women Writers, showing topics as bubbles of different colors and sizes.

 

topics.json file: In addition to the bubble visualization, the algorithm generates several files, including one called topics.json. This file gives researchers a detailed view of the word groupings that constitute each topic. Each word is paired with a decimal number that represents the probability of its appearance in the topic. The file can be viewed directly in the browser or pasted into a conversion tool such as json2table.com for easier viewing.
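
If you want to work with the file programmatically rather than in a browser, a short script can list the most probable words per topic. The sketch below assumes each topic in the JSON maps words to probabilities; inspect your own topics.json first, since the actual layout may differ.

    import json

    # Assumed structure: {topic_label: {word: probability, ...}, ...}
    # Check your own topics.json -- the real layout may nest differently.
    with open("topics.json", encoding="utf-8") as f:
        topics = json.load(f)

    for topic_label, word_probs in topics.items():
        top = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)[:10]
        print(topic_label + ":", ", ".join(f"{word} ({prob:.3f})" for word, prob in top))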

 

Excerpt from topics.json showing grouped words and their probabilities.

 

Topic models require interpretation. They provide lists of words that may be related given the frequency with which they appear together throughout the corpus. It's up to the researcher to derive meaning from the word groupings and interpret the topic. The topic model generated for the Cairns Collection of American Women Writers reveals multiple word groupings that suggest overarching themes such as religion, romance, children's literature, travel, American history, and the role of women in society and the home. The dataset also contains some very specific word groupings that are less broadly represented, including those related to slavery, science, women's organizations, and book publishing.

The HTRC topic model explorer has limited parameters that often yield imprecise results. However, it may prove useful for identifying themes and outliers within a large, unfamiliar text corpus.

"Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. " - HTRC

Named Entity Recognition (NER) is a process by which parts of an unstructured text corpus are labeled and extracted as named entities. It relies on machine learning to structure components of the text and classify them using statistical models of word use. NER is a branch of Natural Language Processing (NLP), which uses computer systems to parse human language and extract data.
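
The HTRC algorithm does this labeling server-side, but the idea can be illustrated locally with an off-the-shelf NER library. The sketch below uses spaCy as a stand-in (not the recognizer HTRC runs) and assumes its small English model has been downloaded; the example sentence is invented for illustration.

    import spacy

    # spaCy is a stand-in here; install the model first with:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Invented example sentence, not drawn from any workset.
    text = "In 1979, Maya Dlamini joined the Harbor View Writers Guild in Cape Town."

    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)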


  1. Once you have logged in to HTRC, select the "Algorithms" tab to execute the Named Entity Recognizer.

  2. Provide a name for the job, select the workset to be analyzed, and identify the predominant language.

  3. Click "Submit" and you'll be redirected to the algorithm jobs subpage. The status bar for active jobs indicates when they are first staging, then running. Processing times will vary depending on the size of the workset and amount of text.

  4. Clicking the name of a completed job displays the results and data available for download.

This video walks through the process of applying the NER algorithm to a workset.

The resulting CSV file can be used to create a map visualization (see the instructions and video below).


Output & Interpretation

 

Table of named entities: A downloadable CSV file that displays the entity name and type, as well as the volume ID and page sequence. The file can be opened using a spreadsheet program such as Microsoft Excel or Google Sheets. However, the size of the dataset could prohibit some software from opening the file. 

Entity types include: date, location, miscellaneous proper nouns, money, organization, percent, person, and time. The corresponding volume ID and page sequence can be used to identify the source of a given entity by plugging the vol_id and page_seq into the following URL template:

https://babel.hathitrust.org/cgi/pt?id=hvd.32044021584222&seq=33 

In this example the volume id is "hvd.32044021584222" and the page sequence is "33". 

The link points to the HathiTrust Digital Library volume and page on which the entity appears.
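
If you are working with the CSV programmatically, the same template is easy to apply to any row. A minimal Python sketch (the function name is just for illustration):

    # Build a HathiTrust page-level URL from a named-entity row,
    # using the template shown above.
    def hathi_page_url(vol_id, page_seq):
        return f"https://babel.hathitrust.org/cgi/pt?id={vol_id}&seq={page_seq}"

    print(hathi_page_url("hvd.32044021584222", 33))
    # -> https://babel.hathitrust.org/cgi/pt?id=hvd.32044021584222&seq=33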

 

vol_id                page_seq   entity                                      type
mdp.39015063160900    2          1979                                        DATE
mdp.39015063160900    2          1978                                        DATE
mdp.39015063160900    2          1963                                        DATE
mdp.39015061208131    19         New Ireland                                 LOCATION
mdp.39015061208131    19         Papua New Guinea                            LOCATION
mdp.39015061208131    19         South Africa                                LOCATION
mdp.39015061208131    19         Zwelihle Township                           LOCATION
mdp.39015055802758    20         Gxarha River                                LOCATION
mdp.39015061208131    21         St Helena Bay                               LOCATION
uc1.32106017886364    21         Lesotho                                     LOCATION
uc1.32106017886364    21         South Africa                                LOCATION
uc1.32106017886364    21         Lesotho                                     LOCATION
uc1.32106017886364    10         African National Congress's Youth League    ORGANIZATION
uc1.32106017886364    10         National Party                              ORGANIZATION
uc1.32106017886364    10         African National Congress                   ORGANIZATION
uc1.32106017886364    10         Pan Africanist Congress                     ORGANIZATION
uc1.b3668076          10         Peka High School                            ORGANIZATION
uc1.b3668076          10         Federated Union of Black Arts               ORGANIZATION
inu.30000055316172    3          Zakes Mda                                   PERSON
inu.30000055316172    4          Lea Glen                                    PERSON
inu.30000055316172    4          Nangomso Jol                                PERSON
mdp.39015063160900    4          Dorothy Wheeler                             PERSON
mdp.39015063160900    4          Garth Erasmus                               PERSON
uc1.b3668076          4          Eddie Nhlapo                                PERSON
uc1.b3668076          4          James Mthoba                                PERSON

The table above shows a selection of named entities from the Zakes Mda author workset.

 

One of the challenges associated with NER is the ambiguity of language. For example, the proper noun "Lincoln" could refer to a person, a place, or a car manufacturer. Locations are particularly problematic, since the same place name can occur in many different parts of the world. An NER dataset will inevitably require some scrutiny and data cleaning to identify and fix inaccurate labels.


Google Maps Visualization

  1. Start a new blank spreadsheet in Google Sheets
  2. Import the CSV file of named entities.
  3. In the Data menu, select Create a Filter.
  4. Click the triangular icon in the "type" cell (row 1, column D).
  5. Select all the values listed except "LOCATION".
  6. Click the row number below the header row. Hold shift and scroll to the last row in the sheet to select all but row 1.
  7. Right-click to "Delete selected rows". (Large datasets will require longer wait times to complete.)
  8. In the Data menu, select Remove filter. You should now only see location entities.
  9. Click the upper left corner of the sheet to select all cells. In the Data menu, go to Data Cleanup, then Remove Duplicates.
  10. Check the header row indicator and select "Column A - vol_id" and "Column C - entity." This will remove all duplicate place names that are from the same volume. (A code-based alternative to this spreadsheet clean-up appears after these instructions.)

Screenshot of Google Sheets remove duplicates window with header row, column A, and column C boxes checked

  11. If you haven't already, give the Untitled spreadsheet a unique name.
  12. Go to Google My Maps and click the button to Create a New Map.
  13. In the maps layer menu, click Import. Search to find the newly created spreadsheet in your Google Drive.
  14. Select "entity" as the column to both position and title the placemarks.
  15. You can customize the style and appearance of the map before sharing it.
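
If you prefer to prepare the location list in code rather than in a spreadsheet, the filtering and de-duplication in steps 2-10 can be done with pandas. The column names follow the NER output shown above; the file names are placeholders. The resulting CSV can then be imported into Google My Maps as in steps 12-14.

    import pandas as pd

    # Column names (vol_id, page_seq, entity, type) follow the NER CSV shown above;
    # the file names are placeholders.
    entities = pd.read_csv("named_entities.csv")

    # Keep only location entities, then drop repeated place names within a volume.
    locations = entities[entities["type"] == "LOCATION"]
    locations = locations.drop_duplicates(subset=["vol_id", "entity"])

    locations.to_csv("locations.csv", index=False)
    print(f"{len(locations)} unique volume/place pairs written to locations.csv")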

 

The map below displays places mentioned in the Zakes Mda author workset. Due to copyright restrictions, the workset's six volumes are not available to view in the HathiTrust Digital Library; however, they can still be analyzed with HTRC tools such as the Named Entity Recognizer.