
Natural Language Processing and Named Entity Recognition

Tufts University

Natural Language Processing

NLP is a large field — we’ll only be able to scratch the surface today. We’ve already started working with elements of NLP, however: tokenization is a common first step in NLP, and it is also an NLP problem in itself.

As we’ve discussed, tokenization is not as simple as breaking on whitespace, nor even as simple as breaking on whitespace and punctuation. How should we handle hyphenated words, for example? What about “U.K.” or “U.S.”?
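To see the problem concretely, here is a toy comparison of two naive tokenization strategies in plain Python (just an illustration, not how spaCy tokenizes):

import re

sentence = "Pausanias describes the well-known sanctuary; scholars in the U.K. still cite him."

# Naive approach: split on whitespace. Punctuation stays glued to words
# ("sanctuary;"), and abbreviations keep their trailing period ("U.K.").
print(sentence.split())

# Splitting on every non-word character is no better: "U.K." shatters
# into "U" and "K", and "well-known" is broken into two tokens.
print(re.findall(r"\w+", sentence))

Neither strategy is obviously right, which is why tokenizers like spaCy's rely on language-specific rules and lists of exceptions.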

Named Entity Recognition

Another subtask of NLP is Named Entity Recognition (NER). NER itself is composed of other sub-problems, like named entity classification (“What kind of entity is this?”) and named entity linking (“To what specific entity does this refer?”).

Today, we’ll be focusing on named entity classification of Book 1 of Pausanias’ Periegesis. We’ll be able to look up these entities in a data dump from the Pleiades project and feed them back into ArcGIS along with their coordinates and other relevant information.

Loading the data

First, let’s read in the transcription of Book 1 that we’ll be using.

from pathlib import Path

paus_filename = Path("./txt/tlg0525.tlg001.theoi-eng.txt")

with open(paus_filename) as f:
    book_1 = f.read()

Simple enough. Let’s just peek at the data to make sure it looks sane.

book_1[100:200]

Seems pretty reasonable to me!

Installing spaCy

spaCy is a Python library for NLP. Unlike NLTK, which prioritizes teaching and research and often offers several ways to perform the same task, spaCy generally provides one recommended way of performing a given task. For our purposes, spaCy’s guided approach will be more than sufficient.

To get started, install spaCy like any other Python library:

%pip install spacy

But for spaCy to do anything useful, we also need to download a pretrained model. Models are essentially large mappings of tokens (or subtokens) to long vectors (lists) of numbers. The larger the model, the more accurately it can represent a text in numerical terms — but also the more expensive it is to run.

We’ll use the medium model today, as it hits the sweet spot for accuracy and usability.

%run -m spacy download en_core_web_md
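As a quick, optional sanity check that the download worked, you can load the model and peek at the vector attached to a single token. (This loads the model a second time relative to the next cell, which is harmless.)

import spacy

# Each token maps to a long vector of floats; for the medium English
# model this vector should have 300 dimensions.
nlp_check = spacy.load("en_core_web_md")
athens = nlp_check("Athens")[0]
print(athens.vector.shape)
print(athens.vector[:5])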

With the model downloaded, we can now run the text of book_1 through spaCy’s NER pipeline.

import spacy

# Load the model that we downloaded.
# If this line fails, make sure that
# you have downloaded the model that's
# referenced here.
nlp = spacy.load("en_core_web_md")

# Analyze `book_1` — this might take a bit.
doc = nlp(book_1)
# We'll inspect the results in separate cells below so that we
# don't need to re-run the full analysis each time.

ents = [(e.text, e.label_) 
        for e in doc.ents 
        if e.label_ not in ("CARDINAL", "ORDINAL")]

for ent in ents:
    print(ent)
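That prints thousands of lines. To get a quicker feel for what the model found, we can tally how often each label appears; this is a small sketch using Python's standard library rather than anything spaCy-specific:

from collections import Counter

# Count the entity labels to see which kinds of entities dominate Book 1.
label_counts = Counter(label for _text, label in ents)
label_counts.most_common()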

Taking a random sample

Python has a built-in random library for generating random numbers — or taking random samples of a list. If you wanted to get a random sample of 20 entities, you could run the following:

import random

my_ents = random.sample(ents, 20)

my_ents

But it might be better to look through particular sections or themes for entities that appear to be important in a given passage.
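One way to do that is to filter doc.ents by character position, since each entity records where it starts in the text. The offsets below are arbitrary placeholders; substitute the range covering the passage you care about:

# Hypothetical character range standing in for "the passage I care about".
start, end = 10_000, 20_000

section_ents = [(e.text, e.label_)
                for e in doc.ents
                if start <= e.start_char < end
                and e.label_ not in ("CARDINAL", "ORDINAL")]

section_ents[:20]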

Looking up coordinates

While these results are far from perfect — “Hyllus,” at least in my practice runs, was classified as a “PRODUCT” rather than a “PERSON” — they’re fairly useful in broad strokes for our purposes.

But we still need to add coordinates, and we have over 4000 entities to link. How can we go about doing this scalably?

Build a search tool

All of the data we need is available through Pleiades and ToposText, but the strings that are labeled by our NER model might not match the titles of places available from these sources. We could build a search index that lets us match titles more flexibly, but that is beyond the scope of our work for today.
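To give a flavor of what such a tool might look like, here is a minimal sketch that uses Python's built-in difflib for fuzzy matching. It assumes a hypothetical pleiades_names.csv with title, lat, and lon columns flattened out of the Pleiades data dump; the real dump is structured differently, so treat this purely as an illustration:

import csv
from difflib import get_close_matches

# Hypothetical file: a flattened export of Pleiades place names and coordinates.
with open("pleiades_names.csv", newline="") as f:
    places = {row["title"]: row for row in csv.DictReader(f)}

def lookup(name, cutoff=0.8):
    """Return the rows whose titles most closely match `name`."""
    matches = get_close_matches(name, places.keys(), n=3, cutoff=cutoff)
    return [places[m] for m in matches]

lookup("Athenai")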

Annotate by hand

Instead, working in groups, choose about 20 places from the NER list that you would like to map. You could even pull them out randomly, if you’d like.

Then, using Pleiades’s own search tool, find the coordinates for each location. Store this data, along with any contextual information or descriptions that you deem relevant, in a CSV or spreadsheet that you can upload to ArcGIS.
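If your group prefers to build the file programmatically, a minimal sketch using Python's csv module might look like the following; the column names are just a suggestion, since ArcGIS only needs fields it can interpret as latitude and longitude:

import csv

# Each row: the placename, its coordinates from Pleiades, and any notes
# you want to carry into ArcGIS. The Athens row is only an example;
# replace it with your own lookups.
rows = [
    {"name": "Athens", "latitude": 37.97, "longitude": 23.72,
     "notes": "example row"},
    # ...add the rest of your group's places here
]

with open("pausanias_places.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "latitude", "longitude", "notes"])
    writer.writeheader()
    writer.writerows(rows)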

If you find that your group is working particularly quickly, grab another 10 placenames, or experiment with mapping specific sections of Pausanias’ text.

Readings

Homework

  • Finish annotating placenames and uploading the results to an ArcGIS map
  • Share a link to the map on Canvas

References
  1. Romanello, M., & Najem-Meyer, S. (2024). A Named Entity-Annotated Corpus of 19th Century Classical Commentaries. Journal of Open Humanities Data, 10. 10.5334/johd.150
  2. Elliott, T., Talbert, R., Bagnall, R., Becker, J., Bond, S., Gillies, S., Holman, L., Horne, R., Moss, G., Rabinowitz, A., Robinson, E., & Turner, B. (2025). isawnyu/pleiades.datasets: Pleiades Datasets 4.0.1. New York University; University of North Carolina at Chapel Hill. 10.5281/ZENODO.1193921
  3. Blei, D. M. (2012). Topic Modeling and Digital Humanities. Journal of Digital Humanities, 2(1). 10.1145/2133806.2133826
  4. Brett, M. R. (2012). Topic Modeling: A Basic Introduction. Journal of Digital Humanities, 2(1). 10.1145/2133806.2133826
  5. Mimno, D. (2012). Computational Historiography: Data Mining in a Century of Classics Journals. Journal on Computing and Cultural Heritage, 5(1), 1–19. 10.1145/2160165.2160168
  6. Wellmon, C. (2015). Sacred Reading: From Augustine to the Digital Humanists [Magazine]. The Hedgehog Review, Fall 2015. https://hedgehogreview.com/issues/re-enchantment/articles/sacred-reading-from-augustine-to-the-digital-humanists