Exploring Literature with the Stanza NLP Package

Using Natural Language Processing to Analyze Text

The Stanford NLP Group has long been an active player in natural language processing, particularly through their well-known CoreNLP Java toolkit. Until recently though, Stanford NLP has been a less well-known player in the Python community, which is a shame since many NLP practitioners work primarily in Python. But there’s good news! Stanford NLP’s Stanza Python library is coming into its own with the recent release of version 1.1.1!

The new Stanza version supports 66 different human languages (which is a big step forward, since NLP has long been very English-centric) and can carry out core NLP tasks like lemmatization and named entity recognition. Stanza is also customizable, which means that users can build their own pipelines and train their own models.

So, for all you Pythonistas out there, let’s take a look at Stanza and what it can do. We’ll start with a brief overview of core Stanza functionality and then we’ll use it to explore the characters in the classic novel, Moby Dick.

Pipeline

The Stanza Pipeline can be configured with a variety of options to select the language model, processors, etc. The language model must be downloaded before it can be used in a pipeline.

For our exploratory project, let’s use the first paragraph from Moby Dick.

Data Objects

Documents

Stanza Document objects include text, tokens, words, dependencies and entities attributes. The tokens, words, dependencies and entities attributes are lists, and individual items can be accessed by index.

Sentences

Each Sentence object contains doc, text, dependencies, tokens, words, and entities attributes. Individual items can be accessed by indexing the appropriate list.

Tokens

Each Token object includes text, words, start_char, and end_char attributes, among others. In cases where a token is a multi-word token, the words attribute will contain each of the underlying words.

Words

Each Word object includes various word-level annotations, as defined by the various processors (parts-of-speech, lemmatization, etc.), including text, lemma, upos, xpos, feats, and others.

Entities

Stanza includes a built-in named entity recognition (NER) module, with options for extension and customization. The default pipeline includes the built-in NERProcessor, which recognizes named entities for all token spans. Each Entity object includes attributes for text, tokens, type, start_char, end_char, and others.

Sentiment Analysis

Stanza includes a built-in sentiment analysis processor, which can be customized as needed. Each Sentence object in a Document includes a sentiment score, where 0 represents negative, 1 represents neutral, and 2 represents positive. To make this a bit more human-readable, we’ll covert the scores to a string descriptor.

Character Analysis

Now that we know a little bit about how to use Stanza, let’s use it to see if we can learn anything about the characters in Moby Dick. First, we’ll have to load up the full text. As many of you will remember, Moby Dick is a long novel, so putting it through the Stanza pipeline can take a while. If you happen to have access to GPUs though, Stanza is GPU-aware and the process will go much faster.

Moby Dick Characters

Lets use Stanza’s entity recognition function to identify all the characters in Moby Dick. We’ll do this by selecting only those entities that have the type PERSON. Since each entity points back to its containing sentence, we’ll go ahead and save the sentiment of that sentence for future use.

Now that we have all of the characters from Moby Dick, we can start to analyze the data to see what we can learn about them. First, how many characters are there?

Wow! 699 characters (or at least unique PERSON entities) is a lot. Most of those are mentioned just a single time, so perhaps we should take a look at just the major characters.

Character Counts

With our character dataframe in hand, we can now check which characters appear in the text most often. This will give us some idea about which characters are the most important.

Unsurprisingly for anyone who has read Moby Dick, Captain Ahab is the most-mentioned character in the book. Other members of his crew like Stubb, Queequeg, and Starbuck make appearances in the most-frequent list as well. And of course, Moby Dick himself is in the top 10.

Character Sentiment

Since each Entity also includes a pointer to its parent sentence, we can now use the sentence sentiment rating that we saved earlier to make a judgement about the overall character sentiment. We’ll do this by converting our sentiment descriptors to a value of -1 for “negative”, 0 for “neutral”, and 1 for “positive”. After that, we can group the various appearances of each character and sum the sentiment value for each sentence the character appears in. A negative sum indicates a negative overall character sentiment, and a positive sum the opposite. And the farther from 0 the sum is, the stronger the sentiment.

Phew. It would seem that Moby Dick is pretty grim! Almost no characters appear in majority positive sentences — and for those who do, the positivity is quite weak. As for Captain Ahab, his overall sentence sentiment sum is -42! Of course, we haven’t checked to see whether the sentiment is about Ahab, but merely the sentiment of sentences in which Ahab appears. Perhaps this is an indicator that Ahab lives a tortured and unhappy life — it would seem that he isn’t in a lot of happy sentences.

Next Steps

And that’s it for our quick look at Stanza! If you think Stanza could be a good fit for your needs, I highly encourage you to check it out — the documentation is excellent and has a good overview on usage. Perhaps you too can use it to explore your favorite novel. (And if you do, be sure to let us know the results!)

Writer | Developer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store