This repository contains scripts for analysing the results of a mallet topic-modelling

Leano1998/Analysing_Mallet_Results

Analysing the results of a mallet topic-modelling

This repository helps with analysing the results of a mallet topic-modelling run. To do so, it reads certain mallet export files, such as the diagnostics file or the output-doc-topics file.

Word-Clouds

The script topic_cloud.py uses the Python libraries pandas, matplotlib, BeautifulSoup and wordcloud to create black-and-white word clouds for the topics of the model. You may need to install missing packages before running the script:

python3 -m pip install --upgrade pip
python3 -m pip install pandas matplotlib wordcloud bs4

Then navigate in your shell or terminal to the directory that contains topic_cloud.py and run python3 topic_cloud.py. After that, just follow the on-screen instructions.

Note that this script generates the word clouds from the diagnostics file of the mallet topic-modelling (mallet parameter: diagnostics-file), so you will need to enter the full path to this file. This is easier if the file is in the same directory as the script.

If you want to change settings such as the maximum number of words displayed or the size of the plot, you can do so from line 77 of topic_cloud.py onwards.
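To illustrate what the diagnostics file contributes, here is a minimal sketch that collects per-topic word counts from diagnostics XML. It is not taken from topic_cloud.py. The sample XML is a shortened, made-up stand-in, and the assumption that each topic element holds word elements with a count attribute should be checked against your own file. The name topic_word_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a mallet diagnostics file.
# Assumption: <topic id="..."> elements contain <word count="..."> elements.
SAMPLE_DIAGNOSTICS = """<model>
  <topic id="0" tokens="120">
    <word rank="1" count="40" prob="0.33">church</word>
    <word rank="2" count="25" prob="0.21">parish</word>
  </topic>
  <topic id="1" tokens="90">
    <word rank="1" count="30" prob="0.33">trade</word>
    <word rank="2" count="12" prob="0.13">harbour</word>
  </topic>
</model>
"""


def topic_word_counts(xml_text):
    """Return {topic_id: {word: count}} parsed from diagnostics XML text."""
    root = ET.fromstring(xml_text)
    counts = {}
    for topic in root.iter("topic"):
        topic_id = int(topic.get("id"))
        # Each <word> element carries the word as text and its count as an attribute.
        counts[topic_id] = {
            word.text.strip(): float(word.get("count"))
            for word in topic.iter("word")
        }
    return counts


counts = topic_word_counts(SAMPLE_DIAGNOSTICS)
for topic_id, words in counts.items():
    print(topic_id, words)
```

A dictionary like counts[0] could then be passed to WordCloud.generate_from_frequencies from the wordcloud package. Whether topic_cloud.py works this way internally is not confirmed here.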

Topics over time

The script topic_over_time.py can be used to visualize one or more topics over time. For this to work, the document names must include some kind of time information. This can be either a two-digit volume number or a four-digit year.

This script uses the libraries seaborn, matplotlib, pandas and re. While re is part of the Python standard library, the other libraries may still have to be installed:

python3 -m pip install --upgrade pip
python3 -m pip install pandas matplotlib seaborn

You can run the script the same way as topic_cloud.py, by typing python3 topic_over_time.py (your shell or terminal must be in the directory that contains the script).

The script topic_over_time.py uses the doc-topic-distribution file of mallet, which can be exported by adding the parameter output-doc-topics, followed by a file name, to the mallet call. You will need to enter the full path to this file. This is easier if the file is in the same directory as the script.
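The following sketch shows one way a doc-topic-distribution file could be reduced to a mean topic share per volume. It is an illustration, not the code of topic_over_time.py. It assumes the tab-separated layout written by recent mallet versions (document index, document name, then one proportion per topic). The sample rows and the function name mean_topic_share_by_volume are made up.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical sample in the assumed layout: index, name, one share per topic.
SAMPLE_DOC_TOPICS = (
    "0\tfile:/corpus/vol01_a.txt\t0.70\t0.30\n"
    "1\tfile:/corpus/vol01_b.txt\t0.50\t0.50\n"
    "2\tfile:/corpus/vol02_a.txt\t0.20\t0.80\n"
)


def mean_topic_share_by_volume(lines, vol_pattern=r"[0-9][0-9]"):
    """Return {volume: {topic: mean share}} for tab-separated doc-topic rows."""
    sums = defaultdict(lambda: defaultdict(float))
    doc_counts = defaultdict(int)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip comment/header lines and rows without any topic columns.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        # Only look at the file name, not at digits in the directory path.
        file_name = row[1].rsplit("/", 1)[-1]
        match = re.search(vol_pattern, file_name)
        if match is None:
            continue
        volume = int(match.group())
        doc_counts[volume] += 1
        shares = [float(value) for value in row[2:] if value.strip()]
        for topic, share in enumerate(shares):
            sums[volume][topic] += share
    return {
        volume: {topic: total / doc_counts[volume]
                 for topic, total in topic_sums.items()}
        for volume, topic_sums in sums.items()
    }


means = mean_topic_share_by_volume(io.StringIO(SAMPLE_DOC_TOPICS))
for volume in sorted(means):
    print(volume, means[volume])
```

For a real file, an open file handle can be passed instead of the io.StringIO object. Older mallet versions wrote topic/proportion pairs per row, which this sketch does not handle.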

Topic correlation over time

Sometimes it is useful to determine how the importance of certain topics correlates over time, or simply to know whether there are topics that correlate with each other. To determine this, the script topic_correlation.py plots all topic correlations in a heatmap. After that, it offers the option to visualize the correlation between pairs of topics that the user finds interesting. Like the script above (topic_over_time.py), it uses the doc-topic-distribution file from mallet, which should therefore be created during the topic-modelling.

Furthermore, it uses the Python libraries numpy, pandas, matplotlib, seaborn and re. You can install them with:

python3 -m pip install --upgrade pip
python3 -m pip install numpy pandas matplotlib seaborn

You can run the script by typing python3 topic_correlation.py. Concerning the time information, please see the note on the topic_over_time.py script above.

This script is still under development but can already be used. Only the plot for pairs of topics may sometimes still look a bit odd.
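As a rough illustration of what such a heatmap contains, the sketch below computes Pearson correlation coefficients between the time series of several topics. It assumes that Pearson correlation is the measure of interest, which is not confirmed for topic_correlation.py. The topic shares and the names pearson and correlation_matrix are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """series maps topic id -> list of shares, one per time step."""
    topics = sorted(series)
    return {a: {b: pearson(series[a], series[b]) for b in topics}
            for a in topics}


# Hypothetical mean topic shares for four consecutive volumes.
shares = {
    0: [0.1, 0.2, 0.3, 0.4],
    1: [0.4, 0.3, 0.2, 0.1],
    2: [0.2, 0.4, 0.6, 0.8],
}
matrix = correlation_matrix(shares)
for topic, row in matrix.items():
    print(topic, {other: round(value, 2) for other, value in row.items()})
```

With pandas, DataFrame.corr() computes the same coefficients by default, and seaborn.heatmap can draw the resulting matrix.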

Further Information

I wrote these scripts for our own topic-modelling project and thought they might be useful to others. They may, however, need further customization. If you find errors or have ideas for further generalization, don't hesitate to contact me or to create an issue.

The scripts that use the doc-topic-distribution file can be customized further by changing the RegEx pattern that determines the volume number from the file name. To do so, adjust the variable vol_pattern:

##############################################

vol_pattern = r'[0-9][0-9]'
"""
    The pattern to determine the volume of a document.
"""
