Using Gensim for Topic Models in Social Science Research
Gensim is a fantastic Python module that handles large corpora of text data more easily and faster than most of the existing social science toolkit. In particular, Gensim can parallelize model fitting, which the R packages I have used cannot. This leads to R breaking when confronted with even moderately large amounts of text data. This is how I came across Gensim personally - a corpus with ~5 million documents was simply impossible to fit a topic model to using R’s topicmodels package. Gensim is perfectly capable of handling corpora this large, and even doing so in constant memory!
Gensim seems to have been primarily written for enterprise use, and topic modeling is only one of an impressive array of different NLP models in the package. As such, I found it occasionally difficult to extract the quantities of interest I needed for political science research.
The purpose of this tutorial is to show other social scientists how to set up Gensim, run a parallelized LDA model, and extract some common quantities of interest from the models. I don’t claim that the solutions I’ve found are optimal, and would love to hear from you if you improve on my code.
Setting up your environment
Gensim runs best in a virtual environment. In particular, I ran into issues with parallelization when not isolated to a Conda environment; it seems to conflict with something in scikit-learn. Regardless, I confine most of my Python projects to Conda environments - the little bit of setup is worth not being caught in dependency hell. If you aren’t already using Conda environments with Python, I strongly recommend you start.
We’re going to start with the necessary imports. There’s nothing complicated here, but for more involved workflows you may need to import additional pre-processors from nltk, stemmers for different workflows, etc. We’re going to use the spaCy model here because it’s awesome and, since I mostly work with Russian data, I rarely get a chance to use it.
    import re  # regular expressions
    import numpy as np
    import pandas as pd
    from pprint import pprint  # pretty printing

    # spaCy for lemmatization
    import spacy

    # gensim imports
    import gensim  # whole module
    import gensim.corpora as corpora  # convenience rename
    from gensim.utils import simple_preprocess  # import preprocessor
    from gensim.models import CoherenceModel  # model for coherence

    # enable logging for gensim
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
    logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup

    import warnings
    warnings.filterwarnings("ignore", category=DeprecationWarning)
Now let’s download the toy dataset for this tutorial. Following the tutorial from which I learned the Gensim basics, we’re going to use the 20 Newsgroups dataset.
    # read json file in
    df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
    # print names of newsgroups
    print(df.target_names.unique())
    # inspect
    df.head()
['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space' 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware' 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc' 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x' 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian' 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']
| |content|target|target_names|
|---|---|---|---|
|0|From: email@example.com (where's my thing)\nS...|7|rec.autos|
|1|From: firstname.lastname@example.org (Guy Kuo)...|4|comp.sys.mac.hardware|
|2|From: email@example.com (Thomas E Will...|4|comp.sys.mac.hardware|
|3|From: jgreen@amber (Joe Green)\nSubject: Re: W...|1|comp.graphics|
|4|From: firstname.lastname@example.org (Jonathan McDow...|14|sci.space|
While we’re at it, we need to download and load the basic English stop words from the nltk package. You’ll also need to download the spaCy model if you don’t already have it. Uncomment the code below and use it if you need to.
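    # download nltk stop words
    # import nltk
    # nltk.download('stopwords')

    # download spaCy model - run in terminal
    # (the 'en' shortcut was removed in spaCy 3; there, use en_core_web_sm instead)
    # python3 -m spacy download en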
We’ll now need to clean the data using regular expressions. I borrow the code from the tutorial for this purpose. These regexes remove the emails, the new line characters, and single quotes. We’ll then use Gensim’s simple_preprocess() function to get rid of punctuation and tokenize the text.
    # convert data to list - required to run regexes on it using list comprehensions
    data = df.content.values.tolist()
    # remove emails (raw strings keep Python from treating \S and \s as string escapes)
    data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
    # remove new line characters
    data = [re.sub(r'\s+', ' ', sent) for sent in data]
    # get rid of single quotes
    data = [re.sub(r"'", "", sent) for sent in data]

    # simple preprocess
    def sent_to_words(sentences):
        for sentence in sentences:
            # simple_preprocess lowercases, tokenizes, and drops punctuation;
            # deacc=True additionally strips accent marks
            yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

    data_tok = list(sent_to_words(data))
    pprint(data[:1])  # cleaned text of the first document
    pprint(data_tok[:1])  # tokens of the first document
['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: ' 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: ' '15 I was wondering if anyone out there could enlighten me on this car I saw ' 'the other day. It was a 2-door sports car, looked to be from the late 60s/ ' 'early 70s. It was called a Bricklin. The doors were really small. In ' 'addition, the front bumper was separate from the rest of the body. This is ' 'all I know. If anyone can tellme a model name, engine specs, years of ' 'production, where this car is made, history, or whatever info you have on ' 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by ' 'your neighborhood Lerxst ---- '] [['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
Now we’re going to remove stopwords, and then lemmatize/stem using spaCy and restrict the data to only nouns, adjectives, verbs, and adverbs.
    # define functions
    def remove_stopwords(texts):
        '''Preprocesses each document, then drops every word found in stop_words
        using a list comprehension'''
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
                for doc in texts]

    def stemmer(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        '''Uses spaCy EN model to lemmatize words, keeping only certain parts of speech'''
        texts_out = []
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return texts_out

    # load in stop words
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')

    # instantiate spaCy model, keeping only tagger
    # (on spaCy 3+, load 'en_core_web_sm' instead of the 'en' shortcut)
    nlp = spacy.load('en', disable=['parser', 'ner'])

    data_tok = remove_stopwords(data_tok)
    data_stem = stemmer(data_tok)
    pprint(data_stem[:1])
[['where', 's', 'thing', 'subject', 'car', 'nntp', 'post', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'maryland', 'college', 'park', 'line', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'whatev', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]
We’re almost ready to run the model. At this point, we just need the dictionary and corpus, which will be fed to the LDA model. We use functions within Gensim to do this.
corpora.Dictionary stores each unique word and assigns it a numeric ID. The .doc2bow method converts a document to its bag-of-words representation: a list of (word ID, count) tuples. Applied to every document, this gives what is also known as the document-term matrix, although Gensim doesn’t store it as a matrix, but rather as a list of tuples per document.
    # create dictionary
    dictionary = corpora.Dictionary(data_stem)
    # full texts - for later reference
    texts = data_stem
    # bag of words
    corpus = [dictionary.doc2bow(text) for text in texts]
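To make the bag-of-words idea concrete, here is a minimal sketch in plain Python of what a dictionary and a bag-of-words corpus contain. The helper names build_vocab and to_bow are mine, not Gensim’s, and the way IDs are assigned here (order of first appearance) is an assumption for illustration; Gensim’s own ID assignment may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each unique token an integer ID in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) tuples for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Two tiny tokenized "documents" (hypothetical toy data)
toy_docs = [["car", "engine", "car"], ["engine", "space"]]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)       # {'car': 0, 'engine': 1, 'space': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document only lists the words it actually contains, which is why this representation stays small even when the vocabulary is large.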
Now let’s move on to modeling. This corpus isn’t actually big enough for parallelization to be worthwhile, but we’ll do it for the sake of demonstration.
I won’t get into too much detail on what the arguments to LdaMulticore mean, since most of them are self-explanatory. Three of them matter most for parallelization and working with big corpora, however:
chunksize: Controls how large the training “chunks” of the data are. Part of the reason Gensim works so well with large corpora is that it uses online variational inference, allowing the model to run in constant memory. The larger the chunks, the quicker the model runs, but the more memory is needed (in order to hold the chunk in memory).
workers: The number of worker processes used to train the model, which depends on how many cores you have available. More cores, faster run.
passes: Since we’re using variational inference, we need a stopping point for convergence. Gensim has a tolerance already set, but passes controls how many full sweeps over the corpus the algorithm makes while trying to reach it. If you’re not reaching convergence on all or nearly all documents, you’ll want to increase that number. See the Gensim docs for more detail on this. Naturally, more passes increase the run time.
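As a rough illustration of how chunksize and passes interact, the sketch below splits a corpus into chunks and counts how many chunks get processed in total. It only shows the bookkeeping under the assumption that every pass visits every chunk once; it is not how Gensim implements training internally, and iter_chunks and count_chunks are hypothetical helpers.

```python
from math import ceil

def iter_chunks(corpus, chunksize):
    """Yield successive slices of at most `chunksize` documents."""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]

def count_chunks(n_docs, chunksize, passes):
    """Total number of chunks processed across all passes."""
    return passes * ceil(n_docs / chunksize)

toy = list(range(250))  # stand-in for a corpus of 250 documents
sizes = [len(chunk) for chunk in iter_chunks(toy, chunksize=100)]

print(sizes)                                               # [100, 100, 50]
print(count_chunks(n_docs=250, chunksize=100, passes=10))  # 30
```

Only one chunk needs to be held in memory at a time, which is the intuition behind the constant-memory claim; larger chunks mean fewer, bigger steps.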
We’ll set up and run the model. If you want, you can also try the generic single core LDA model. It’ll be faster in this case, since the overhead of setting up parallel processing isn’t worth it on such a small corpus.
Make sure to set the random_state for replicability, and set minimum_probability to 0 to prevent topics with low responsibility from being filtered out of the results.
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=20,
                                           random_state=1017,
                                           chunksize=100,
                                           passes=10,
                                           per_word_topics=True,
                                           workers=2,
                                           minimum_probability=0)
Let’s take a look at the topics. The print_topics() method gives us the top tokens for each topic, plus each token’s associated weight.
If you are tuning $k$ for a topic model, you’ll need some measure to help you select the best $k$. Gensim has a built-in coherence model for exactly that purpose, which lets you choose between several different measures of coherence. This can also be parallelized using the processes argument, though I omit it here.
    # get c_v coherence score
    # instantiate model (processes=2 would use 2 CPUs)
    cm = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    # run model
    co = cm.get_coherence()
    print(co)
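If you do want to tune $k$, the selection loop itself is simple. The sketch below is one way it might look, written so that it runs on its own: choose_k is a hypothetical helper, and fake_coherence is a stand-in scorer with made-up values. In practice, the scoring function would fit an LDA model with num_topics set to the candidate value and return the coherence from CoherenceModel.

```python
def choose_k(candidate_ks, score_fn):
    """Score each candidate k with score_fn; return (best_k, all_scores)."""
    scores = {k: score_fn(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Stand-in scorer so the sketch is runnable without fitting any models.
# A real score_fn would fit a model with k topics and return its coherence.
def fake_coherence(k):
    return 0.5 - abs(k - 20) / 100

best_k, scores = choose_k([10, 20, 30], fake_coherence)
print(best_k)  # 20
```

Keeping all the scores, rather than only the winner, makes it easy to plot coherence against $k$ and check whether the best value is a clear peak or just noise.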
And there you have the basics of topic modeling in Gensim.
Harder Quantities of Interest
Everything up until this point has been remarkably easy. Now we’re going to retrieve two quantities of interest for which there is not a simple, pre-existing method: the most representative documents for each topic and topic assignments.
Here, I’m defining the most representative documents as the $n$ documents with the highest cluster responsibility for a given topic (these document-level topic proportions are usually written as $\theta$ in the LDA literature). Topic assignments are a bit trickier to define, since LDA is a mixed-membership model - documents can belong to more than one topic. Nevertheless, it’s often helpful to have clear-cut topic assignments in order to summarize your corpus or perform other secondary analyses. I assign topics here by simply choosing the topic that best explains the document (highest responsibility), but there are other ways this could be done. Another method might be to set some baseline responsibility, say $1/k$, that a topic’s responsibility must exceed in order for the document to be assigned to it. This prevents assigning topics to documents the model was “unsure” about ($1/k$ being the probability of getting it right by random guessing).
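The baseline idea can be sketched in a few lines of plain Python. This is an illustration rather than code I used for the analysis: both function names are hypothetical, and the input is assumed to be a list of (topic, responsibility) tuples for a single document, which is the shape get_document_topics() returns for one document.

```python
def assign_with_baseline(doc_topics, k, baseline=None):
    """Return the top topic if its responsibility beats the baseline, else None."""
    if baseline is None:
        baseline = 1.0 / k  # chance level for k topics
    if not doc_topics:
        return None
    topic, resp = max(doc_topics, key=lambda pair: pair[1])
    return topic if resp > baseline else None

def topics_above_baseline(doc_topics, k, baseline=None):
    """Return every topic whose responsibility beats the baseline (mixed membership)."""
    if baseline is None:
        baseline = 1.0 / k
    return [topic for topic, resp in doc_topics if resp > baseline]

confident = [(0, 0.7), (1, 0.1), (2, 0.1), (3, 0.1)]
unsure = [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]
split = [(0, 0.45), (1, 0.45), (2, 0.05), (3, 0.05)]

print(assign_with_baseline(confident, k=4))  # 0
print(assign_with_baseline(unsure, k=4))     # None
print(topics_above_baseline(split, k=4))     # [0, 1]
```

With $k = 20$, as in the model above, the default baseline would be 0.05, which is a fairly low bar; a stricter cutoff can be passed through the baseline argument.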
Most Representative Documents
We’ll start with retrieving the 25 most representative documents for each topic. I’ll show two methods here: one for a small model (or if you have a lot of memory), and another where you dump the responsibilities to CSV to save memory. By dumping the responsibilities to CSV, I was able to iterate over many models’ outputs in a bash script without holding them all in memory. Both methods work even if you forgot to specify the minimum probability argument, because any topic missing from a document’s output is filled in with 0.
I do this using two functions. lda_to_df extracts cluster responsibilities to a pandas dataframe for ease of use - this probably isn’t the most efficient way to do this, but it is simple. get_best_docs returns the $n$ most representative docs for each topic, given the output of lda_to_df as input.
lda_to_df takes the corpus as input, whereas get_best_docs takes the full (unstemmed/uncleaned) texts as input. These could easily be combined into one function, but when working with large corpora or many models, it’s convenient to save the intermediate output to file (as I do in the second example).
    # no csv method
    # define functions
    def lda_to_df(model, corpus):
        '''This function takes a gensim lda model as input, and outputs a df with topic probs by document'''
        topic_probs = model.get_document_topics(corpus)  # get the list of topic probabilities by doc
        topic_dict = [dict(x) for x in topic_probs]  # convert to dictionary to convert to data frame
        df = pd.DataFrame(topic_dict).fillna(0)  # convert to data frame; missing topics become 0
        df['docs'] = df.index.values  # create column with document indices (correspond to indices of dataframe)
        df.columns = df.columns.astype(str)  # convert to string to make indexing easier
        return df

    def get_best_docs(df, n, k, texts):
        '''Return the n most representative documents for each topic, given a df of topic responsibilities.
        n is the number of documents you want, k is the number of topics in the model,
        the texts are the FULL texts used to fit the model'''
        # create column list to iterate over
        k_cols = range(0, k)
        # initialize empty list to hold results
        n_rep_docs = []
        # loop to extract documents for each topic
        for i in k_cols:
            inds = df.nlargest(n=n, columns=str(i))['docs'].astype(int).tolist()
            # use list comprehension to extract documents
            n_rep_docs.append([texts[ind] for ind in inds])
        return n_rep_docs

    # run functions
    resp = lda_to_df(lda_model, corpus)
    best_docs = get_best_docs(resp, 25, 20, data)
Let’s take a look at the output.
Here’s the function that outputs the responsibilities to a CSV file.
    def lda_to_csv(model, outfile, corpus):
        '''This function takes a gensim lda model as input, and outputs a csv with topic probs by document'''
        topic_probs = model.get_document_topics(corpus)  # get the list of topic probabilities by doc
        topic_dict = [dict(x) for x in topic_probs]  # convert to dictionary to convert to data frame
        df = pd.DataFrame(topic_dict).fillna(0)  # convert to data frame; topics missing from a doc become 0
        df.to_csv(outfile)
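Once the responsibilities are on disk, the most representative documents can be pulled out without loading the whole file. The sketch below is one possible approach using only the standard library, not the script I actually ran: top_docs_from_csv is a hypothetical helper, and it assumes the CSV layout that df.to_csv() writes by default, with the document index in the first column and one column per topic.

```python
import csv
import heapq
import io

def top_docs_from_csv(handle, n):
    """Stream a responsibilities CSV; return {topic_column: [doc indices of the n largest]}."""
    reader = csv.DictReader(handle)
    index_col = reader.fieldnames[0]     # document index column written by pandas
    topic_cols = reader.fieldnames[1:]   # one column per topic
    heaps = {col: [] for col in topic_cols}  # min-heaps of (responsibility, doc index)
    for row in reader:
        doc = int(row[index_col])
        for col in topic_cols:
            item = (float(row[col]), doc)
            if len(heaps[col]) < n:
                heapq.heappush(heaps[col], item)
            else:
                heapq.heappushpop(heaps[col], item)  # keep only the n largest so far
    return {col: [doc for _, doc in sorted(heap, reverse=True)]
            for col, heap in heaps.items()}

# Toy CSV with three documents and two topics (made-up numbers)
toy_csv = io.StringIO(",0,1\n0,0.9,0.1\n1,0.2,0.8\n2,0.6,0.4\n")
top = top_docs_from_csv(toy_csv, n=2)
print(top)  # {'0': [0, 2], '1': [1, 2]}
```

For a real file, replace the StringIO object with an open file handle. Memory use then depends on n and the number of topics rather than on the number of documents.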
Topic Assignments
This is a basic function that assigns topics based on the maximum cluster responsibility. Again, I’d like to emphasize that LDA models are mixed membership, and this isn’t the only way to assign topics.
This function operates directly on the output of the get_document_topics() method, so it’s more efficient than the function to retrieve the best documents.
    def assign_topic(model, corpus):
        doc_topics = model.get_document_topics(corpus)  # get by-topic probability for each document
        topic_assignments = []  # initialize empty list for assignments
        for i in range(len(doc_topics)):  # loop over every document
            doc = doc_topics[i]  # extract relevant document: a list of (topic, probability) tuples
            list_length = len(doc)
            probs = []
            for r in range(list_length):
                probs.append(doc[r][1])  # get topic probs
            max_val = max(probs)  # get max value
            max_ind = probs.index(max_val)  # retrieve index
            topic = doc[max_ind]  # retrieve (topic number, responsibility) tuple
            topic_assignments.append(topic[0])  # append only topic number (not also responsibility) to results
        return topic_assignments

    assignments = assign_topic(lda_model, corpus)
Let’s take a peek at the output.
[0, 19, 11, 0, 0, 15, 0, 19, 0, 19]
The topic assignments are a simple list that you are free to use as you choose. If you also want the responsibilities, you can simply drop the indexing in the topic_assignments.append() line of the function, so that the whole (topic, responsibility) tuple is appended.
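As one example of what you might do with the list, the snippet below tallies how many documents fall under each topic, using the ten assignments printed above as input. This is just a suggestion for summarizing a corpus, built on the standard library’s Counter.

```python
from collections import Counter

# The first ten assignments shown in the output above
sample_assignments = [0, 19, 11, 0, 0, 15, 0, 19, 0, 19]

topic_counts = Counter(sample_assignments)
print(topic_counts.most_common())  # [(0, 5), (19, 3), (11, 1), (15, 1)]

# Share of documents assigned to each topic
shares = {topic: count / len(sample_assignments) for topic, count in topic_counts.items()}
print(shares[0])  # 0.5
```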
There you have it - I hope that these functions are useful for you! Please get in touch if you have any questions or improve on my code!