TF-IDF - An Introduction¶
Term Frequency — Inverse Document Frequency, or TF-IDF, is a powerful, if deceptively simple, way of determining keywords in a text. As its name suggests, it is composed of two parts:
Term Frequency (TF)¶
The number of times that a term — typically, a token or a phrase — appears in a document. We can normalize this using various techniques. Generally speaking, we want to use the relative frequency, or the raw frequency divided by the number of terms in the document.
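For instance, here is a quick sketch using a made-up eight-token "document" (the sentence is invented purely for illustration):
# A hypothetical "document" of eight tokens.
document = ["sing", "goddess", "the", "wrath", "of", "achilles", "the", "king"]

raw_tf = document.count("the")        # raw frequency: 2
relative_tf = raw_tf / len(document)  # relative frequency: 2 / 8 = 0.25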
Document Frequency (DF)¶
The number of documents in which a term appears. Again, there are common normalizations that we will typically perform, such as adding a small ε value to avoid dividing by zero.
Inverse document frequency, or IDF, is just the multiplicative inverse, or reciprocal, of DF.
In practice, however, IDF is usually logarithmically scaled, i.e., $\mathrm{idf}(t) = \log_{10}\left(\frac{N}{\mathrm{df}(t)}\right)$, where $N$ is the number of documents in the corpus.
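As a quick sketch, here is that calculation on a made-up three-document corpus, using the same base-10 logarithm we will use later in this notebook:
from math import log10

# A hypothetical corpus of three tiny "documents".
corpus = [
    ["sing", "goddess", "wrath"],
    ["wrath", "of", "achilles"],
    ["tell", "me", "muse"],
]

n_docs = len(corpus)                                   # N = 3
df_wrath = sum(1 for doc in corpus if "wrath" in doc)  # "wrath" appears in 2 documents
idf_wrath = log10(n_docs / df_wrath)                   # log10(3 / 2) ≈ 0.176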
Prefatory matters¶
In order to get a sense of a basic implementation of TF-IDF, we first need some text to work with. To get some text, we’ll need to review a few things about Python, and we’ll need to learn how to use the lxml
library to get text out of TEI XML files — the kinds of files that store much of the textual data and metadata that we use in digital humanities.
Install lxml¶
%pip (note the %) is a special command built into Jupyter notebooks that lets us use the Python package manager pip.
To install a package, we just enter %pip install
and then the package name in a code cell, like so:
%pip install lxml
Import libraries¶
With lxml
installed, we now need to import the etree
library, along with the Path
constructor from the built-in pathlib
.
from lxml import etree
from pathlib import Path
Initialize our path¶
We’ll use the Path
constructor to grab the XML files that we need to parse:
files = Path("./xml/tlg0012").glob("**/*perseus-eng*.xml")
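If you want to double-check which files the pattern matches, a quick sanity check like the following should work. Note that glob() returns a one-shot generator, so this example creates a fresh one rather than consuming files itself:
# Optional sanity check: list the files matched by the glob pattern.
for f in Path("./xml/tlg0012").glob("**/*perseus-eng*.xml"):
    print(f)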
Set up namespaces for TEI parsing¶
We also need to define a few namespaces to make it easier to find what we need in the XML files.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
NAMESPACES = {
"tei": TEI_NS,
"xml": XML_NS,
}
Process the files¶
Now we’re ready to iterate through the files and extract the text.
for file in files:
# print the name of the file as a sanity check
print(file)
# etree.parse() reads the file and turns the raw XML into an object that we can use in Python
tree = etree.parse(file)
# xpath() is a method that applies **xpath expressions** to search through the XML.
# This xpath expression says, "Find any `tei:div` element with a `subtype` of `'card'`.
# Under that element, get any text." The second argument, `namespaces=`, tells the
# method to use the supplied namespaces as shortcuts, so we don't have to type out
# "http://www.tei-c.org/ns/1.0" every time we want an element in the TEI namespace.
text = tree.xpath(f"//tei:div[@subtype='card']//text()", namespaces=NAMESPACES)
# xpath() returns an array of matches, so we initialize an empty array to store the
# results. We could use a list comprehension, but for now rewriting these
# lines as a list comprehension is left as an exercise for the reader.
cleaned_text = []
# Now we iterate through each string returned by `xpath()`
for t in text:
# `strip()` removes leading and trailing whitespace; if all that's left is an empty
# string, we don't care about it.
if t.strip() != "":
cleaned_text.append(t.strip())
# We make sure that we actually *have* text before writing just the text, without
# TEI elements, to a separate file. No need to write an empty file, right?
if len(cleaned_text) > 0:
# A lot is happening here:
#
# 1. `str(file)` turns the `Path` object into a `str`
# 2. `split("/")` splits the resulting string at every "/"
# 3. `[-1]` takes the last element of the list returned by `split("/")`
# 4. `replace(".xml", ".txt")` changes the extension of the file
#
# So something like "xml/tlg0012/tlg001/tlg0012.tlg001.perseus-eng3.xml"
# is transformed into "tlg0012.tlg001.perseus-eng3.txt".
with open(str(file).split("/")[-1].replace(".xml", ".txt"), "w+") as f:
# We write the text to the file, `join`-ing each element in
# `cleaned_text` with a newline ("\n")
f.write("\n".join(cleaned_text))
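If you want to peek at one possible answer to that exercise, the cleaning step above can be written as a single list comprehension. This sketch is equivalent to the for loop and if test we just used:
# One possible list-comprehension version of the cleaning step above.
cleaned_text = [t.strip() for t in text if t.strip() != ""]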
TF-IDF in Action¶
Now that we have our text files, we can start implementing some basic TF-IDF analysis.
First, let’s tokenize our text. In the past, we have used split()
and regular expressions to handle tokenization, but we need to start getting serious about handling punctuation. Let’s use the Natural Language Toolkit, or NLTK, instead.
Install the NLTK¶
%pip install nltk
Download support files¶
With the NLTK installed, we can now download the support files that it needs for tokenization.
import nltk
# download the files needed for tokenization
# the punkt tokenizer should be installed already,
# but let's download it just in case
nltk.download("punkt")
nltk.download("punkt_tab")
Tokenize texts¶
With the tokenizer downloaded, we can now read in and tokenize each text.
# Initialize the tokenizer
from nltk.tokenize import word_tokenize
# Initialize an empty dictionary to store the tokenized texts
tokenized_texts = {}
# Get a Path.glob() iterator for the .txt files that you've created in this directory.
# Can you figure out what the new `[1-4]` segment is doing?
text_files = Path(".").glob("tlg0012.tlg00*.perseus-eng[1-4].txt")
# Iterate through the text files, reading and tokenizing them one by one,
# then storing the list of tokens in our `tokenized_texts` dictionary —
# so we'll be getting a dictionary of lists.
for file in text_files:
name = str(file)
with open(file) as f:
# Notice we're lowercasing the text. You don't *have*
# to do this, but it helps eliminate some noise for
# our purposes.
text = f.read().lower()
tokens = word_tokenize(text)
# Let's just print the length of the tokens list to make
# sure we're getting sane results. We'll use string interpolation
# to identify which text we're working with.
print(f"There are {len(tokens)} tokens in {name}.")
# Store each file's `tokens` list in the `tokenized_texts`
# dictionary, using the filename as the key.
tokenized_texts[name] = tokens
Count the tokens¶
Now, we could count these tokens by hand, but why do that when Python gives us the Counter
object?
from collections import Counter
# Using our `tokenized_texts` dictionary, we'll iterate
# through each key-value pair — remember, the keys are
# filenames and the values are lists of tokens.
# We'll get a count of the tokens by passing the list to
# `Counter`, then we'll change the value for that key to
# a dictionary with its own keys, `tokens` and `counts`.
for filename, tokens in tokenized_texts.items():
counts = Counter(tokens)
tokenized_texts[filename] = {"tokens": tokens, "counts": counts}
Now we can check to see what our frequencies look like.
tokenized_texts["tlg0012.tlg001.perseus-eng3.txt"]["counts"]["odysseus"]
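Counter also provides a most_common() method, which gives a quick look at the highest-frequency tokens in a file and is handy for spot-checking the tokenization:
# Peek at the ten most frequent tokens in one of the files.
tokenized_texts["tlg0012.tlg001.perseus-eng3.txt"]["counts"].most_common(10)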
Calculate the document frequency for a given term¶
Let’s compare occurrences of the strings "odysseus"
and "achilles"
— we probably
expect the former to “matter” more for the Odyssey, and the latter for the Iliad. Let’s see if that’s the case.
df_achilles = 0
df_odysseus = 0
# Calculate the DF for "odysseus" and "achilles".
# We iterate through the dictionary, and then simply
# count the number of files in which we find each term.
# For these two terms, we should probably expect DFs of 4.
for filename, values in tokenized_texts.items():
if "odysseus" in values['counts']:
df_odysseus += 1
if "achilles" in values["counts"]:
df_achilles += 1
# Now we'll import the log function to calculate the IDF for each term.
from math import log10
n_docs = len(tokenized_texts.keys())
idf_achilles = log10(n_docs / df_achilles)
idf_odysseus = log10(n_docs / df_odysseus)
print(idf_achilles)
# Now let's calculate the TF-IDF "score" for each term in each document.
# Once again, iterate through the dictionary.
for filename, values in tokenized_texts.items():
# Get the total number of terms in each file — we'll
# use this to calculate the relative frequency as our
# TF.
total_terms = len(values['tokens'])
# Get the TF for each term in this file.
tf_achilles = values['counts']['achilles'] / total_terms
tf_odysseus = values['counts']['odysseus'] / total_terms
    # Remember, TF-IDF is just the term frequency multiplied by the
    # inverse document frequency (here, the log-scaled IDF we computed above).
tf_idf_achilles = tf_achilles * idf_achilles
tf_idf_odysseus = tf_odysseus * idf_odysseus
# Now we can report on the statistics for this file
print(f"""In {filename}:
TF of achilles: {tf_achilles}
TF of odysseus: {tf_odysseus}
TF-IDF of achilles: {tf_idf_achilles}
TF-IDF of odysseus: {tf_idf_odysseus}
""")
Well, the TF-IDF isn’t super interesting, but at least we have the TF to fall back on.
Overcoming frequent words¶
In such a small corpus, it can be difficult to guess words that don’t appear in all documents — certainly names of heroes like Achilles and Odysseus will appear in all of the documents, meaning their IDF, $\log_{10}(N/N) = \log_{10}(1)$, will always be 0.
But we can use the set()
constructor to find words that do not appear in all of the documents, and calculate TF-IDF on them.
In programming, a set is very similar to a list, except it guarantees that it contains at most one of each element.
Let’s start small:
my_list = [1, 1, 2, 3, 3]
set(my_list)
Notice how calling set()
on my_list
gets rid of the duplicate 1s and 3s. We can do likewise with strings.
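Sets also support subtraction: subtracting one set from another removes every shared element, leaving only the terms unique to the first set. That is the operation the loop below relies on (the words here are just made-up examples):
# A small, hypothetical example of set difference with strings.
words_a = {"wrath", "sing", "goddess"}
words_b = {"sing", "goddess", "muse"}
words_a - words_b  # {"wrath"}: the only term unique to words_a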
non_universal_terms = {}
for filename, values in tokenized_texts.items():
my_set = set(values['counts'].keys())
for other_file, other_values in tokenized_texts.items():
# make sure we don't compare the file
# to itself, otherwise the difference
# will be the empty set
if other_file != filename:
my_set -= set(other_values['counts'].keys())
# now push the remaining set of terms to the dictionary
non_universal_terms[filename] = my_set
# log `non_universal_terms` as a sanity check
non_universal_terms
Your turn¶
Now that you’ve seen the basics of TF-IDF, use the set
s that we’ve built to explore the values for several terms in this corpus.
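As one possible starting point, here is a minimal sketch, assuming the tokenized_texts and non_universal_terms dictionaries built above; the term itself is just a placeholder for whatever you choose to inspect:
from math import log10

term = "calypso"  # hypothetical example; substitute any term you find in non_universal_terms
n_docs = len(tokenized_texts)

# Document frequency: in how many files does the term appear at all?
df = sum(1 for values in tokenized_texts.values() if term in values["counts"])

if df > 0:
    idf = log10(n_docs / df)
    for filename, values in tokenized_texts.items():
        # Counters return 0 for missing keys, so this is safe even when
        # the term does not appear in a given file.
        tf = values["counts"][term] / len(values["tokens"])
        print(f"{filename}: TF = {tf}, TF-IDF = {tf * idf}")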
As you explore, make note of any findings that seem odd — are they errors in how we have run the analysis, or is the text just weird?
Consider, too, how you might improve the analysis in the future.
Then in your own words, describe what TF-IDF is really telling us about a given term in each text and within the corpus as a whole.
Finally, think about other objects of study for which TF-IDF might be useful.