Are You the Keymaster?

Severin Perez
31 min read · Feb 15, 2025

OK, you may not be Vinz Clortho, Keymaster of Gozer, but you can still master the art of extracting keywords from texts! (Not nearly as exciting, I know, but still useful!) In this article we’ll talk about statistical, graph-based, algorithmic, and machine learning techniques for identifying keywords and keyphrases in texts.

A keyword is a word that captures one of the most important ideas in a text. Keywords are useful for information retrieval systems, extractive summarization, classification tasks, and more. They also provide a quick way to compare texts, on the assumption that texts with very similar keywords are likely to be similar overall. Identifying these words is by no means easy, though, and different techniques can produce very different results. Further, what makes a “good keyword” is subjective, and the type of text you’re working on makes a difference. For that reason, it’s useful to understand the various techniques and experiment to find which ones are best suited to your needs.
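To make the comparison idea concrete, here is a minimal sketch that scores two texts by the overlap of their keyword sets using Jaccard similarity. The choice of Jaccard, the function name `jaccard_similarity`, and the keyword lists are all assumptions made up for illustration; they are not taken from the datasets used in this article.

```python
def jaccard_similarity(keywords_a, keywords_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of keywords."""
    # Lowercase so that "Election" and "election" count as the same keyword.
    set_a = {k.lower() for k in keywords_a}
    set_b = {k.lower() for k in keywords_b}
    if not set_a and not set_b:
        return 0.0  # two empty keyword sets: treat as no evidence of similarity
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical keyword lists, invented for this example.
doc_1 = ["election", "senate", "vote", "campaign"]
doc_2 = ["election", "vote", "ballot", "campaign"]
doc_3 = ["protein", "enzyme", "genome", "cell"]

sim_related = jaccard_similarity(doc_1, doc_2)    # share 3 of 5 distinct keywords
sim_unrelated = jaccard_similarity(doc_1, doc_3)  # share none

print(f"doc_1 vs doc_2: {sim_related:.2f}")
print(f"doc_1 vs doc_3: {sim_unrelated:.2f}")
```

A score near 1 suggests the two texts cover much the same ground, while a score near 0 suggests they don’t. The result is only as good as the keywords going in, which is why the choice of extraction technique matters.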

In this article we’ll use three categories of texts: news articles, scientific articles, and books. As you’ll see, the keywords we extract from each will vary in usefulness, particularly as texts get longer. For the sake of reproducibility, we’ll use datasets from Hugging Face for the news and scientific articles, and books from Project Gutenberg. To keep…
