The Stanford NLP Group has long been an active player in natural language processing, particularly through their well-known CoreNLP Java toolkit. Until recently, though, Stanford NLP has been less well known in the Python community, which is a shame since many NLP practitioners work primarily in Python. But there’s good news! Stanford NLP’s Stanza Python library is coming into its own with the recent release of version 1.1.1!
The new Stanza version supports 66 different human languages (which is a big step forward, since NLP has long been very English-centric) and can carry out core NLP tasks like lemmatization and named entity recognition. Stanza is also customizable, which means that users can build their own pipelines and train their own models.
So, for all you Pythonistas out there, let’s take a look at Stanza and what it can do. We’ll start with a brief overview of core Stanza functionality and then we’ll use it to explore the characters in the classic novel, Moby Dick.
A Stanza Pipeline can be configured with a variety of options to select the language model, the processors to run, and so on. The language model must be downloaded before it can be used in a pipeline.
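As a rough sketch of that setup step, the helper below downloads a language model and builds a pipeline from it. The name build_pipeline and the particular processor list are illustrative choices made here, not something the walkthrough prescribes; stanza.download and stanza.Pipeline are the library’s own entry points.

```python
# Processor list chosen for this article's examples (an assumption, adjust as needed).
DEFAULT_PROCESSORS = "tokenize,pos,lemma,ner,sentiment"


def build_pipeline(lang="en", processors=DEFAULT_PROCESSORS, use_gpu=True):
    """Fetch the model for `lang` and return a configured Stanza Pipeline.

    `build_pipeline` is a hypothetical convenience wrapper, not part of Stanza.
    """
    import stanza  # imported lazily so that defining the helper stays cheap

    stanza.download(lang)  # the model must be available before the pipeline is built
    return stanza.Pipeline(lang=lang, processors=processors, use_gpu=use_gpu)
```

Calling nlp = build_pipeline() and then doc = nlp(text) returns an annotated Document for the given text.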
For our exploratory project, let’s use the first paragraph from Moby Dick.
Document objects include, among others, sentences and entities attributes. Both are lists, and individual items can be accessed by index.
A Sentence object contains, among others, tokens, words, and entities attributes. Individual items can be accessed by indexing the appropriate list.
A Token object includes text, start_char, and end_char attributes, among others. When a token is a multi-word token, the words attribute contains each of the underlying words.
A Word object includes various word-level annotations, as defined by the various processors (part-of-speech tagging, lemmatization, etc.), including lemma, upos, feats, and others.
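To make the Document, Sentence, Token, and Word nesting concrete, here is a minimal sketch that walks the hierarchy and collects one row per word. The function name word_rows is made up for this example, and the toy document is a hand-built stand-in with the same attribute layout (not real Stanza output), so the snippet runs without downloading a model.

```python
from types import SimpleNamespace as NS


def word_rows(doc):
    """Return one (token text, word text, lemma, upos) tuple per word in doc."""
    rows = []
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # A multi-word token lists each underlying word in token.words;
            # an ordinary token lists exactly one word.
            for word in token.words:
                rows.append((token.text, word.text, word.lemma, word.upos))
    return rows


# Hand-built stand-in for a parsed document (illustrative values only).
call = NS(text="Call", lemma="call", upos="VERB")
me = NS(text="me", lemma="I", upos="PRON")
toy_doc = NS(sentences=[NS(tokens=[NS(text="Call", words=[call]),
                                   NS(text="me", words=[me])])])

for row in word_rows(toy_doc):
    print(row)
```

Passing a real Document returned by a pipeline in place of toy_doc should work the same way, since the function only reads the attributes described above.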
Stanza includes a built-in named entity recognition (NER) module, with options for extension and customization. The default pipeline includes the built-in NERProcessor, which recognizes named entities for all token spans. Each Entity object includes attributes for text, type, start_char, end_char, and others.
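A short sketch of reading those entity attributes follows. The helper entity_rows is a name invented for this example, and the toy document is a hand-built stand-in rather than real pipeline output; start_char and end_char are treated here as character offsets into the document text.

```python
from types import SimpleNamespace as NS


def entity_rows(doc):
    """Return (text, type, start_char, end_char) for every entity in doc."""
    return [(ent.text, ent.type, ent.start_char, ent.end_char)
            for ent in doc.entities]


# Hand-built stand-in for an annotated document (illustrative values only).
text = "Call me Ishmael."
toy_doc = NS(text=text,
             entities=[NS(text="Ishmael", type="PERSON",
                          start_char=8, end_char=15)])

for row in entity_rows(toy_doc):
    print(row)
```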
Stanza includes a built-in sentiment analysis processor, which can be customized as needed. Each Sentence object in a Document includes a sentiment score, where 0 represents negative, 1 represents neutral, and 2 represents positive. To make this a bit more human-readable, we’ll convert the scores to string descriptors.
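One simple way to do that conversion is a lookup table, sketched below. The names SENTIMENT_LABELS, describe_sentiment, and sentence_sentiments are invented for this example, and the toy document is a hand-built stand-in with made-up scores.

```python
from types import SimpleNamespace as NS

# Mapping from the numeric sentence score to a readable descriptor.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def describe_sentiment(score):
    """Translate a 0/1/2 sentiment score into its string descriptor."""
    return SENTIMENT_LABELS[score]


def sentence_sentiments(doc):
    """Return (sentence text, descriptor) pairs for every sentence in doc."""
    return [(s.text, describe_sentiment(s.sentiment)) for s in doc.sentences]


# Hand-built stand-in; the scores are illustrative, not model output.
toy_doc = NS(sentences=[NS(text="Call me Ishmael.", sentiment=1),
                        NS(text="It is a damp, drizzly November in my soul.",
                           sentiment=0)])

for text, label in sentence_sentiments(toy_doc):
    print(label, "|", text)
```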
Now that we know a little bit about how to use Stanza, let’s use it to see if we can learn anything about the characters in Moby Dick. First, we’ll have to load the full text. As many of you will remember, Moby Dick is a long novel, so putting it through the Stanza pipeline can take a while. If you happen to have access to a GPU, though, Stanza is GPU-aware and the process will go much faster.
Moby Dick Characters
Let’s use Stanza’s entity recognition function to identify all the characters in Moby Dick. We’ll do this by selecting only those entities that have the type PERSON. Since each entity points back to its containing sentence, we’ll go ahead and save the sentiment of that sentence for future use.
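The selection step could look something like the sketch below. Note one swap: rather than following each entity’s pointer back to its sentence, this version loops over the sentences and reads each sentence’s ents list, which yields the same pairing of entity and sentence sentiment. The function name collect_characters and the dictionary keys are assumptions made for this example, and the toy document is a hand-built stand-in.

```python
from types import SimpleNamespace as NS

SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def collect_characters(doc):
    """Return one record per PERSON mention, with its sentence's sentiment."""
    mentions = []
    for sentence in doc.sentences:
        for ent in sentence.ents:
            if ent.type == "PERSON":  # keep people, skip places, dates, etc.
                mentions.append({
                    "name": ent.text,
                    "sentiment": SENTIMENT_LABELS[sentence.sentiment],
                })
    return mentions


# Hand-built stand-in; entity types and scores are illustrative only.
toy_doc = NS(sentences=[
    NS(sentiment=1, ents=[NS(text="Ishmael", type="PERSON")]),
    NS(sentiment=0, ents=[NS(text="Ahab", type="PERSON"),
                          NS(text="Nantucket", type="GPE")]),
])

print(collect_characters(toy_doc))
```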
Now that we have all of the characters from Moby Dick, we can start to analyze the data to see what we can learn about them. First, how many characters are there?
Wow! 699 characters (or at least unique PERSON entities) is a lot. Most of those are mentioned just a single time, so perhaps we should take a look at just the major characters.
With our character dataframe in hand, we can now check which characters appear in the text most often. This will give us some idea about which characters are the most important.
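For readers who want to try the counting step without a dataframe, here is a minimal sketch that uses collections.Counter from the standard library as a stand-in for the dataframe described above. The mention records are made-up toy data in an assumed name/sentiment format, not results from the novel.

```python
from collections import Counter

# Toy mention records (invented for illustration, not real results).
mentions = [
    {"name": "Ahab", "sentiment": "negative"},
    {"name": "Stubb", "sentiment": "neutral"},
    {"name": "Ahab", "sentiment": "neutral"},
    {"name": "Queequeg", "sentiment": "positive"},
    {"name": "Stubb", "sentiment": "negative"},
    {"name": "Ahab", "sentiment": "negative"},
]

# Count how many times each distinct name is mentioned.
counts = Counter(m["name"] for m in mentions)

unique_characters = len(counts)        # number of distinct PERSON strings
most_mentioned = counts.most_common(2)  # the two most frequent names

print(unique_characters)
print(most_mentioned)
```

Because the count is keyed on the entity text, spelling variants of the same character are tallied separately, which is why the total is better read as unique PERSON entities than as true characters.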
Unsurprisingly for anyone who has read Moby Dick, Captain Ahab is the most-mentioned character in the book. Other members of his crew like Stubb, Queequeg, and Starbuck make appearances in the most-frequent list as well. And of course, Moby Dick himself is in the top 10.
Since each Entity also includes a pointer to its parent sentence, we can now use the sentence sentiment rating that we saved earlier to make a judgement about the overall character sentiment. We’ll do this by converting our sentiment descriptors to a value of -1 for “negative”, 0 for “neutral”, and 1 for “positive”. After that, we can group the various appearances of each character and sum the sentiment value for each sentence the character appears in. A negative sum indicates a negative overall character sentiment, and a positive sum the opposite. And the farther from 0 the sum is, the stronger the sentiment.
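The group-and-sum step can be sketched in a few lines of plain Python. The function name character_sentiment and the name/sentiment record format are assumptions made for this example, and the mentions are toy data rather than results from the book.

```python
from collections import defaultdict

# Descriptor-to-value mapping described in the text.
SENTIMENT_VALUES = {"negative": -1, "neutral": 0, "positive": 1}


def character_sentiment(mentions):
    """Sum the sentence sentiment values over every mention of each name."""
    totals = defaultdict(int)
    for mention in mentions:
        totals[mention["name"]] += SENTIMENT_VALUES[mention["sentiment"]]
    return dict(totals)


# Toy mention records (invented for illustration, not real results).
mentions = [
    {"name": "Ahab", "sentiment": "negative"},
    {"name": "Ahab", "sentiment": "negative"},
    {"name": "Ahab", "sentiment": "neutral"},
    {"name": "Stubb", "sentiment": "positive"},
    {"name": "Stubb", "sentiment": "negative"},
]

print(character_sentiment(mentions))
```

A raw sum grows with the number of mentions, so frequently mentioned characters will tend to sit farther from 0 than minor ones with the same mix of sentences.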
Phew. It would seem that Moby Dick is pretty grim! Almost no characters appear in majority positive sentences — and for those who do, the positivity is quite weak. As for Captain Ahab, his overall sentence sentiment sum is -42! Of course, we haven’t checked to see whether the sentiment is about Ahab, but merely the sentiment of sentences in which Ahab appears. Perhaps this is an indicator that Ahab lives a tortured and unhappy life — it would seem that he isn’t in a lot of happy sentences.
And that’s it for our quick look at Stanza! If you think Stanza could be a good fit for your needs, I highly encourage you to check it out — the documentation is excellent and has a good overview on usage. Perhaps you too can use it to explore your favorite novel. (And if you do, be sure to let us know the results!)