This repository helps to analyse the results of a mallet topic model. To do so, it reads certain mallet export files, such as the diagnostics file or the output-doc-topics file.
The script topic_cloud.py uses the Python libraries pandas, matplotlib, BeautifulSoup and wordcloud to create black-and-white word clouds for the individual topics of the model.
Before running the script, you might need to install any missing packages with:
python3 -m pip install --upgrade pip
python3 -m pip install pandas matplotlib wordcloud bs4
Then navigate in your shell or terminal to the directory that contains the topic_cloud.py script and run python3 topic_cloud.py. After that, just follow the instructions given.
It is important to know that this script generates the word clouds from the diagnostics file (mallet parameter: diagnostics-file) of the mallet topic modelling, so you will need to enter the full path to this file (this is easier if the file is in the same directory).
If you want to change variables such as the maximum number of words displayed or the size of the plot, you can do so in the script from line 77 onwards.
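As a rough illustration of what reading the diagnostics file involves, here is a minimal sketch that collects the top words of each topic together with their probabilities. It is not the code of topic_cloud.py: it uses the standard-library xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and the element and attribute names (topic, word, id, prob) as well as the helper topic_word_weights are assumptions about the export format, not taken from the script.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a mallet diagnostics file. The element and attribute
# names used here are assumptions about the export format.
SAMPLE_XML = """<model>
  <topic id="0" tokens="120">
    <word rank="1" count="40" prob="0.33">king</word>
    <word rank="2" count="25" prob="0.21">court</word>
  </topic>
  <topic id="1" tokens="80">
    <word rank="1" count="30" prob="0.38">ship</word>
    <word rank="2" count="10" prob="0.13">harbour</word>
  </topic>
</model>
"""


def topic_word_weights(xml_text, max_words=50):
    """Return {topic_id: {word: probability}}, keeping at most max_words
    words per topic in the order they appear in the file (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    weights = {}
    for topic in root.iter("topic"):
        words = {}
        for word in topic.findall("word")[:max_words]:
            words[word.text] = float(word.get("prob"))
        weights[int(topic.get("id"))] = words
    return weights


weights = topic_word_weights(SAMPLE_XML, max_words=2)
for topic_id, words in weights.items():
    print(topic_id, words)
```

A dictionary like weights[0] has the shape that WordCloud.generate_from_frequencies from the wordcloud package accepts, so it could serve as input for drawing one cloud per topic.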
The script topic_over_time.py can be used to visualize one or more topics over time. For this to work, the document names must include some kind of time information. This can be either a two-digit volume number or a four-digit year.
This script uses the libraries seaborn, matplotlib, pandas and re. While re is part of the Python standard library, the other libraries might still have to be installed:
python3 -m pip install --upgrade pip
python3 -m pip install pandas matplotlib seaborn
You can run the script the same way as topic_cloud.py, by typing python3 topic_over_time.py (your shell or terminal must be in the same directory as the script).
The script topic_over_time.py uses the doc-topic distribution file of mallet, which can be exported by adding the parameter output-doc-topics, followed by a file name, to the mallet call. You will need to enter the full path to this file (this is easier if the file is in the same directory).
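To make the idea of "topics over time" more concrete, the following sketch averages the topic shares of all documents that belong to the same volume. It is only an illustration under stated assumptions, not the script's actual code: the column layout (document index, document name, then one proportion per topic, tab-separated), the sample file names and the helper mean_topic_share_per_volume are all hypothetical, and the standard-library csv module stands in for pandas.

```python
import csv
import io
import re
from collections import defaultdict

vol_pattern = r'[0-9][0-9]'  # two-digit volume number in the file name

# Stand-in for a doc-topics file; layout and file names are assumptions.
SAMPLE_DOC_TOPICS = (
    "0\tfile:/corpus/vol_01_a.txt\t0.70\t0.30\n"
    "1\tfile:/corpus/vol_01_b.txt\t0.50\t0.50\n"
    "2\tfile:/corpus/vol_02_a.txt\t0.20\t0.80\n"
)


def mean_topic_share_per_volume(handle, pattern=vol_pattern):
    """Return {volume: [mean share of topic 0, mean share of topic 1, ...]}."""
    sums = {}
    counts = defaultdict(int)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip empty lines and a possible header line
        file_name = row[1].rsplit("/", 1)[-1]  # match on the file name only
        match = re.search(pattern, file_name)
        if match is None:
            continue  # no time information found in this document name
        volume = match.group()
        shares = [float(value) for value in row[2:]]
        if volume not in sums:
            sums[volume] = [0.0] * len(shares)
        for topic, share in enumerate(shares):
            sums[volume][topic] += share
        counts[volume] += 1
    return {
        volume: [total / counts[volume] for total in totals]
        for volume, totals in sums.items()
    }


per_volume = mean_topic_share_per_volume(io.StringIO(SAMPLE_DOC_TOPICS))
for volume in sorted(per_volume):
    print(volume, per_volume[volume])
```

For a real export, the io.StringIO object would be replaced by an opened file; the per-volume means could then be plotted as one line per topic.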
Sometimes it can be useful to determine how certain topics correlate in their importance over time, or simply to know whether there are topics that correlate with each other. To show this, the script topic_correlation.py plots all topic correlations in a heatmap. After that, it offers the option to visualize the correlation between pairs of topics that the user finds interesting.
Like the script above (topic_over_time.py), it uses the doc-topic distribution file from mallet, which should therefore be created during the topic modelling.
Furthermore, it uses the Python libraries numpy, pandas, matplotlib, seaborn and re. You can install them with:
python3 -m pip install --upgrade pip
python3 -m pip install numpy pandas matplotlib seaborn
You can run the script by typing python3 topic_correlation.py. Concerning the time information, please check the note on the topic_over_time.py script.
This script is still under development, but it can already be used. Only the plot between pairs of topics might sometimes still look a bit odd.
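The numbers behind such a heatmap can be sketched as a matrix of pairwise correlations between the topics' shares per volume. The example below assumes the Pearson coefficient, which may or may not be what topic_correlation.py computes; the sample values and the helpers pearson and correlation_matrix are hypothetical, and plain Python is used here in place of numpy and pandas.

```python
from math import sqrt

# Hypothetical mean share of three topics over four volumes.
topic_series = {
    0: [0.10, 0.20, 0.30, 0.40],
    1: [0.40, 0.30, 0.20, 0.10],
    2: [0.20, 0.10, 0.40, 0.30],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Return {topic_a: {topic_b: correlation}} for all pairs of topics."""
    topics = sorted(series)
    return {
        a: {b: pearson(series[a], series[b]) for b in topics}
        for a in topics
    }


matrix = correlation_matrix(topic_series)
for topic, row in matrix.items():
    print(topic, [round(value, 2) for value in row.values()])
```

A nested dictionary like this can be turned into a pandas DataFrame and passed to seaborn.heatmap to draw the overview plot.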
I wrote these scripts for our own topic-modelling project and thought they might be useful to others. However, they may need further customization. If you find errors or have ideas for further generalization, don't hesitate to contact me or to create an issue.
The scripts that use the doc-topic distribution file can be further customized by changing the RegEx pattern that determines the volume number from the file name. To do so, adjust the variable vol_pattern:
##############################################
vol_pattern = r'[0-9][0-9]'
"""
The pattern to determine the volume of a document.
"""
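The effect of changing the pattern can be tried out in isolation before editing the scripts. The sketch below assumes that the pattern is applied with re.search, so that the first match in the file name wins; the file names, the alternative year_pattern and the helper extract_time are made up for illustration.

```python
import re

vol_pattern = r'[0-9][0-9]'  # default: two consecutive digits
# Hypothetical alternative: exactly four digits that are not part of a longer number.
year_pattern = r'(?<![0-9])[0-9]{4}(?![0-9])'


def extract_time(file_name, pattern):
    """Return the first match of pattern in file_name, or None if there is none."""
    match = re.search(pattern, file_name)
    return match.group() if match else None


print(extract_time("journal_vol07_article3.txt", vol_pattern))   # 07
print(extract_time("report_1923_07.txt", vol_pattern))           # 19
print(extract_time("report_1923_07.txt", year_pattern))          # 1923
print(extract_time("journal_vol07_article3.txt", year_pattern))  # None
```

The second call shows why the pattern matters: with the two-digit default, a file name that contains a year yields the first two digits of that year rather than the volume.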