
Natural Language Processing and Named Entity Recognition

Tufts University

Natural Language Processing

NLP is a large field — we’ll only be able to scratch the surface today. We’ve already started working with elements of NLP, however: tokenization is a common first step in NLP, and it is also an NLP problem in itself.

As we’ve discussed, tokenization is not as simple as breaking on whitespace, nor even as simple as breaking on whitespace and punctuation. How should we handle hyphenated words, for example? What about “U.K.” or “U.S.”?
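To see the problem concretely, here is a toy comparison of two naive tokenization strategies in plain Python (just an illustration, not how spaCy tokenizes):

import re

sentence = "Pausanias describes the well-known sanctuary; scholars in the U.K. still cite him."

# Naive approach: split on whitespace. Punctuation stays glued to words
# ("sanctuary;"), and abbreviations keep their trailing period ("U.K.").
print(sentence.split())

# Splitting on every non-word character is no better: "U.K." shatters
# into "U" and "K", and "well-known" is broken into two tokens.
print(re.findall(r"\w+", sentence))

Neither strategy is obviously right, which is why tokenizers like spaCy's rely on language-specific rules and lists of exceptions.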

Named Entity Recognition

Another subtask of NLP is Named Entity Recognition (NER). NER itself is composed of other sub-problems, like named entity classification (“What kind of entity is this?”) and named entity linking (“To what specific entity does this refer?”).

Today, we’ll be focusing on named entity classification of Book 1 of Pausanias’ Periegesis. We’ll be able to look up these entities in a data dump from the Pleiades project and feed them back into ArcGIS along with their coordinates and other relevant information.

Loading the data

First, let’s read in the transcription of Book 1 that we’ll be using.

from pathlib import Path

paus_filename = Path("./txt/tlg0525.tlg001.theoi-eng.txt")

with open(paus_filename) as f:
    book_1 = f.read()

Simple enough. Let’s just peek at the data to make sure it looks sane.

book_1[100:200]

Seems pretty reasonable to me!

Installing spaCy

spaCy is a Python library for NLP. Unlike NLTK, which prioritizes teaching and research and often offers several ways to perform the same task, spaCy generally provides one recommended way of performing a given task. For our purposes, spaCy’s guided approach will be more than sufficient.

To get started, install spaCy like any other Python library:

%pip install spacy

But for spaCy to do anything useful, we also need to download a pretrained model. Models are essentially large mappings of tokens (or subtokens) to long vectors (lists) of numbers. The larger the model, the more accurately it can represent a text in numerical terms — but also the more expensive it is to run.

We’ll use the medium model today, as it hits the sweet spot for accuracy and usability.

%run -m spacy download en_core_web_md
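As a quick, optional sanity check that the download worked, you can load the model and peek at the vector attached to a single token. (This loads the model a second time relative to the next cell, which is harmless.)

import spacy

# Each token maps to a long vector of floats; for the medium English
# model this vector should have 300 dimensions.
nlp_check = spacy.load("en_core_web_md")
athens = nlp_check("Athens")[0]
print(athens.vector.shape)
print(athens.vector[:5])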

With the model downloaded, we can now run the text of book_1 through spaCy’s NER pipeline.

import spacy

# Load the model that we downloaded.
# If this line fails, make sure that
# you have downloaded the model that's
# referenced here.
nlp = spacy.load("en_core_web_md")

# Analyze `book_1` — this might take a bit.
doc = nlp(book_1)
# We'll inspect the results in separate cells below so that we
# don't need to re-run the full analysis each time.

ents = [(e.text, e.label_) 
        for e in doc.ents 
        if e.label_ not in ("CARDINAL", "ORDINAL")]

for ent in ents:
    print(ent)
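That prints thousands of lines. To get a quicker feel for what the model found, we can tally how often each label appears; this is a small sketch using Python's standard library rather than anything spaCy-specific:

from collections import Counter

# Count the entity labels to see which kinds of entities dominate Book 1.
label_counts = Counter(label for _text, label in ents)
label_counts.most_common()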

Taking a random sample

Python has a built-in random library for generating random numbers — or taking random samples of a list. If you wanted to get a random sample of 20 entities, you could run the following:

import random

my_ents = random.sample(ents, 20)

my_ents

But it might be better to look through particular sections or themes for entities that appear to be important in a given passage.
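One way to do that is to filter doc.ents by character position, since each entity records where it starts in the text. The offsets below are arbitrary placeholders; substitute the range covering the passage you care about:

# Hypothetical character range standing in for "the passage I care about".
start, end = 10_000, 20_000

section_ents = [(e.text, e.label_)
                for e in doc.ents
                if start <= e.start_char < end
                and e.label_ not in ("CARDINAL", "ORDINAL")]

section_ents[:20]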

Looking up coordinates

While these results are far from perfect — “Hyllus,” at least in my practice runs, was classified as a “PRODUCT” rather than a “PERSON” — they’re fairly useful in broad strokes for our purposes.

But we still need to add coordinates, and we have over 4000 entities to link. How can we go about doing this scalably?

Build a search tool

All of the data we need is available through Pleiades and ToposText, but the strings that are labeled by our NER model might not match the titles of places available from these sources. We could build a search index that lets us match titles more flexibly, but that is beyond the scope of our work for today.
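To give a flavor of what such a tool might look like, here is a minimal sketch that uses Python's built-in difflib for fuzzy matching. It assumes a hypothetical pleiades_names.csv with title, lat, and lon columns flattened out of the Pleiades data dump; the real dump is structured differently, so treat this purely as an illustration:

import csv
from difflib import get_close_matches

# Hypothetical file: a flattened export of Pleiades place names and coordinates.
with open("pleiades_names.csv", newline="") as f:
    places = {row["title"]: row for row in csv.DictReader(f)}

def lookup(name, cutoff=0.8):
    """Return the rows whose titles most closely match `name`."""
    matches = get_close_matches(name, places.keys(), n=3, cutoff=cutoff)
    return [places[m] for m in matches]

lookup("Athenai")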

Annotate by hand

Instead, working in groups, choose about 20 places from the NER list that you would like to map. You could even pull them out randomly, if you’d like.

Then, using Pleiades’s own search tool, find the coordinates for each location. Store this data, along with any contextual information or descriptions that you deem relevant, in a CSV or spreadsheet that you can upload to ArcGIS.
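If your group prefers to build the file programmatically, a minimal sketch using Python's csv module might look like the following; the column names are just a suggestion, since ArcGIS only needs fields it can interpret as latitude and longitude:

import csv

# Each row: the placename, its coordinates from Pleiades, and any notes
# you want to carry into ArcGIS. The Athens row is only an example;
# replace it with your own lookups.
rows = [
    {"name": "Athens", "latitude": 37.97, "longitude": 23.72,
     "notes": "example row"},
    # ...add the rest of your group's places here
]

with open("pausanias_places.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "latitude", "longitude", "notes"])
    writer.writeheader()
    writer.writerows(rows)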

If you find that your group is working particularly quickly, grab another 10 placenames, or experiment with mapping specific sections of Pausanias’ text.

Readings

Homework

  • Finish annotating placenames and uploading the results to an ArcGIS map
  • Share a link to the map on Canvas

References
  1. Romanello, M., & Najem-Meyer, S. (2024). A Named Entity-Annotated Corpus of 19th Century Classical Commentaries. Journal of Open Humanities Data, 10. 10.5334/johd.150
  2. Elliott, T., Talbert, R., Bagnall, R., Becker, J., Bond, S., Gillies, S., Holman, L., Horne, R., Moss, G., Rabinowitz, A., Robinson, E., & Turner, B. (2025). isawnyu/pleiades.datasets: Pleiades Datasets 4.0.1. New York University; University of North Carolina at Chapel Hill. 10.5281/ZENODO.1193921
  3. Blei, D. M. (2012). Topic Modeling and Digital Humanities. Journal of Digital Humanities, 2(1). 10.1145/2133806.2133826
  4. Brett, M. R. (2012). Topic Modeling: A Basic Introduction. Journal of Digital Humanities, 2(1). 10.1145/2133806.2133826
  5. Mimno, D. (2012). Computational Historiography: Data Mining in a Century of Classics Journals. Journal on Computing and Cultural Heritage, 5(1), 1–19. 10.1145/2160165.2160168
  6. Wellmon, C. (2015). Sacred Reading: From Augustine to the Digital Humanists [Magazine]. The Hedgehog Review, Fall 2015. https://hedgehogreview.com/issues/re-enchantment/articles/sacred-reading-from-augustine-to-the-digital-humanists