ScienceNOW

ScienceNOW: Topic Modelling with Arxiv e-prints for Trend detection.

Installation: The easiest way to get started with ScienceNOW is to install the package locally. Set up a virtual environment in which you want to install the package and all dependencies; Python 3.10 is recommended.

git clone https://github.com/benearnthof/ScienceNOW.git
cd ScienceNOW
pip install -e .

This installs everything as an editable package.

To train and evaluate topic models and extract trends from Arxiv preprints, you will first need to download the data from the Open Archives Initiative (OAI). This can be done with
https://github.com/mattbierbaum/arxiv-public-datasets/tree/master
but it is very slow, as the OAI limits metadata downloads to 1000 entries every 10 seconds. Downloading the entire Arxiv preprint metadata from scratch takes over 10 hours, so the data needed to reproduce our experiments is provided via Google Drive:
https://drive.google.com/drive/folders/1xhLDDFwJauVH5ijRY94xjVRaChc5g5EO?usp=drive_link

There you will find most of the files needed to begin topic modelling:

  • ARXIV_SNAPSHOT.json A file containing the metadata of all Arxiv preprints up to and including the 13th of November 2023.
  • arxiv-metadata-oai-2023-11-13.json.gz The OAI snapshot generated by downloading the Arxiv Preprint metadata. (Only needed if you wish to update your data at a later point in time.)
  • A .txt file containing the resumptionToken necessary to do so.
  • embeddings.npy A numpy array obtained by encoding the preprocessed preprint abstracts with a sentence transformer.
  • reduced_embeddings.npy A numpy array obtained by using cuML.manifold.umap to compress the sentence transformer embeddings down to 5 dimensional vectors.
  • arxiv_df.feather A pandas DataFrame which contains the data used for topic modelling after all preprocessing steps like filtering and dimensionality reduction are done, stored as .feather to drastically speed up loading of 2.3 million preprints into memory.
  • taxonomy.txt A text file containing the mapping from the Arxiv labeling taxonomy to plaintext labels. (cs.CL: Computation and Language, etc.)

Download the files and adjust the respective paths in secrets.yaml. It is recommended to leave all these files in a single folder since they will only be written to if the dataset is updated with sciencenow.data.update_arxiv_snapshot.py.

Updating the OAI Snapshot can be done by running:

python sciencenow/data/update_arxiv_snapshot.py

Run this after you've added the necessary paths to secrets.yaml; the update script will then find the pre-downloaded snapshot and continue requesting papers added to the Arxiv after the 13th of November 2023 instead of building a new snapshot from scratch.

Setting up the project config:
As mentioned above, to update the OAI snapshot and train models with this package you will need to adjust some parameters in sciencenow/config/secrets.yaml. Navigate from the root of the project directory to the corresponding folder and adjust the following lines in the file:

ROOT: "absolute path to the Project Root e.g.: ./ScienceNOW/"
ARXIV_SNAPSHOT: "absolute path to the metadata snapshot e.g.: /arxiv-public-datasets/arxiv-data/arxiv-metadata-oai-2023-11-13.json"
EMBEDDINGS: "absolute path to the precomputed abstract sentence embeddings e.g.: /arxiv-public-datasets/arxiv-data/embeddings.npy"
REDUCED_EMBEDDINGS: "absolute path to the precomputed reduced embeddings e.g.: /arxiv-public-datasets/arxiv-data/reduced_embeddings.npy"
FEATHER_PATH: "absolute path to the preprocessed metadata in .feather format e.g.: /arxiv-public-datasets/arxiv-data/arxiv_df.feather"
TAXONOMY_PATH: "absolute path to the label taxonomy for semisupervised models e.g.: /taxonomy.txt"
EVAL_ROOT: "absolute path to the directory in which evaluation results should be stored e.g.: /tm_evaluation/"
VOCAB_PATH: "absolute path to the file where the vocabulary for evaluation will be stored e.g.: /tm_evaluation/vocab.txt"
CORPUS_PATH: "absolute path to the file where the corpus used for evaluation will be stored e.g.: /tm_evaluation/corpus.tsv"
SENTENCE_MODEL: "sentence transformer used to generate EMBEDDINGS e.g.: all-MiniLM-L6-v2" 
UMAP_NEIGHBORS: 15 # UMAP parameters
UMAP_COMPONENTS: 5
UMAP_METRIC: "cosine"
VOCAB_THRESHOLD: 15 # Threshold used to avoid out of memory errors during evaluation
TM_VOCAB_PATH: "absolute path to the topic model vocab (distinct from evaluation vocab) e.g.: /tm_evaluation/tm_vocab.txt"
TM_TARGET_ROOT: "absolute path to directory where trained topic model should be written to disk e.g.: /tm_evaluation/"

We will go over each of these parameters (and other training hyperparameters) in detail below; for now, let's set up the data we wish to analyze.

Setting up the Arxiv Snapshot:

After downloading the Arxiv Snapshot from the OAI or from the provided Google Drive, make sure to set the respective paths in secrets.yaml to the correct locations.
If you wish to perform topic modelling on large subsets of the Arxiv metadata, it is recommended to install cuML (https://github.com/rapidsai/cuml), as its parallel implementation of UMAP runs orders of magnitude faster than the default CPU implementation available in the umap-learn python library.

After updating the OAI snapshot, you need to rerun the preprocessing steps to convert the data into a dataframe and save it as a .feather file for faster loading.
An example experiment where synthetic trends are added to a target dataset is given in demo.py.

To set the preprocessing pipeline up and load data from disk, we define the Pipeline we wish to run and then execute it like so:

from pathlib import Path
from tempfile import TemporaryFile

from sciencenow.core.pipelines import (
    ArxivPipeline,
)

from sciencenow.core.steps import (
    ArxivDateTimeFilterStep,
    ArxivTaxonomyFilterStep,
    ArxivPlaintextLabelStep,
    ArxivReduceSubsetStep,
    ArxivGetNumericLabelsStep,
    ArxivLoadFeatherStep,
)

from sciencenow.core.dataset import (
    ArxivDataset,
)

from sciencenow.config import (
    setup_params,
)

setup_params["target"] = "cs.LG"
setup_params["cluster_size"] = 6
setup_params["secondary_target"] = "q-bio"
setup_params["secondary_proportion"] = 0.2
setup_params["trend_deviation"] = 1.67
setup_params["recompute"] = True

SETUP_PARAMS = setup_params

path = "C:\\Users\\Bene\\Desktop\\testfolder\\Experiments\\all-distilroberta-v1\\taxonomy.txt"
ds = ArxivDataset(path=path, pipeline=None)
ds.load_taxonomy(path=path)

target_pipe = ArxivPipeline(
    steps=[
        ArxivLoadFeatherStep(),
        ArxivDateTimeFilterStep(
            interval={
                "startdate": "01 01 2020",
                "enddate": "31 01 2020"}),
        ArxivTaxonomyFilterStep(target="cs"),
        ArxivPlaintextLabelStep(taxonomy=ds.taxonomy, threshold=25, target="cs"),
        ArxivReduceSubsetStep(limit=400),
        ArxivGetNumericLabelsStep(mask_probability=0),
    ]
)

targetds = ArxivDataset(path=TemporaryFile().file.name, pipeline=target_pipe)
targetds.execute_pipeline(input=FEATHER_PATH)

The ArxivLoadFeatherStep loads the preprocessed data frame found at the FEATHER_PATH location specified in secrets.yaml.

This cuts preprocessing time for 2.3 million abstracts down to the couple of seconds it takes to load the data frame from disk.
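
For reference, loading the preprocessed frame directly looks roughly like this (a sketch; the path shown is the FEATHER_PATH example from secrets.yaml above):

import pandas as pd

FEATHER_PATH = "/arxiv-public-datasets/arxiv-data/arxiv_df.feather"  # as configured in secrets.yaml
df = pd.read_feather(FEATHER_PATH)  # millions of rows load in a few seconds
print(df.shape)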

Training and Evaluating Topic Models:

The main goal of this project is the detection of trends or emerging topics in Arxiv preprints. For this purpose we utilize the time stamps present in the metadata of each preprint. While every preprint potentially comes equipped with multiple different time stamps, the v1_datetime corresponds to the first time the preprint was added to the Arxiv, which, as we will discuss below, allows us to order all preprints by their distinct timestamps and bin them by day, week, month, or year. The v1_datetime does not match the actual publication date of the respective articles, but this does not impact our topic models much: on the one hand, most preprints remain preprints forever, and on the other, the few preprints that do end up getting accepted at journals may be published multiple times in multiple venues, which would further muddy the waters with respect to picking the "correct" publication date. Despite these slight discrepancies in the way articles are published and updated on the Arxiv, we are still able to find trends and patterns in the way preprints of certain domains are added to the Arxiv, corresponding to publication deadlines of various journals. Some examples are discussed later on.
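
To make the binning concrete, here is a hedged sketch in plain pandas, reusing the data frame loaded above; the v1_datetime column name follows the discussion here, and the exact schema of the preprocessed frame may differ:

df["v1_datetime"] = pd.to_datetime(df["v1_datetime"])
# order preprints by their first submission time and count them per weekly bin
weekly_counts = df.sort_values("v1_datetime").set_index("v1_datetime").resample("W").size()
print(weekly_counts.head())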

For the purpose of modeling trends that may evolve over time or only emerge in a later batch of data, many different approaches exist, such as ANTM (https://arxiv.org/abs/2302.01501), but BERTopic (https://maartengr.github.io/BERTopic/) is, as of today, both the most mature software package available for topic modeling and the one best suited for large datasets featuring millions of documents.
BERTopic offers three different ways of modeling time-dependent documents:

  • Fitting multiple topic models on separate batches of data and then merging them to detect topics that emerged only in the most recent batch.
  • Fitting a Dynamic Topic Model (DTM) that calculates both a global and a local representation for each topic respectively, making it possible to track topics over time.
  • Fitting an Online Topic Model (OTM) that makes use of online clustering algorithms to extract microclusters from each batch of data or assign documents to microclusters already extracted in a previous batch.

All of these approaches have different advantages and drawbacks, chiefly the computational cost of evaluating their performance. Merging multiple models comes with an additional disadvantage: since the technique was added to BERTopic to allow a degree of federated learning, the c-TF-IDF representations of the individual models are not merged, because the tokenizers of the models will have been obtained from different corpora. This makes interpreting the results of merged models rather difficult, since obtaining local topic representations for smaller time bins is then no longer possible. Topics may be similar enough to be merged depending on the hyperparameters chosen, but this gets messy rather quickly when, for example, a fine grained analysis of a full year's worth of data should be performed over 52 different time stamps. The other two methods do not suffer from this problem; they are only limited by computational cost, which can be alleviated by doing the heavy lifting up front and caching intermediary results to disk.
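
As a point of reference, here is a minimal sketch of the DTM route using the plain BERTopic API, outside of this package's wrappers; abstracts, timestamps, and embeddings are assumed to be the document texts, their v1 timestamps, and the sentence embeddings described in the next section:

from bertopic import BERTopic

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(abstracts, embeddings=embeddings)

# local topic representations per time bin; 52 bins roughly correspond to one week each for a full year
topics_over_time = topic_model.topics_over_time(
    abstracts, timestamps, nr_bins=52, evolution_tuning=True, global_tuning=True
)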

Doing the heavy lifting

As outlined during the setup steps above, a CUDA-capable GPU is recommended for some of the preprocessing steps required to use BERTopic.

First of all, the abstracts of each preprint need to be converted into a format that is digestible by computers; for this they can be embedded into numeric vectors with Sentence Transformers (https://www.sbert.net/docs/pretrained_models.html). The embeddings provided at the link above have been obtained with the all-MiniLM-L6-v2 model, since it offers robust quality while still processing all 2.7 million abstracts in about 41 minutes on a 40GB A100. Should you wish to re-encode the preprint abstracts with another model, swap the respective line in secrets.yaml to one of the other pretrained models available on sbert.net, and make sure to point the path for the precalculated embeddings in secrets.yaml to an empty directory so the ArxivPreprocessor will actually calculate a set of new embeddings.
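
A hedged sketch of how such embeddings could be (re)generated with sentence-transformers; abstracts is assumed to be the list of preprocessed abstract strings:

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # model name as set via SENTENCE_MODEL in secrets.yaml
embeddings = encoder.encode(abstracts, batch_size=128, show_progress_bar=True)
np.save("embeddings.npy", embeddings)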

After embedding, each document is represented by a vector with more than 700 entries. To avoid the curse of dimensionality while clustering, BERTopic employs UMAP (https://umap-learn.readthedocs.io/en/latest/), a robust dimensionality reduction algorithm that works well for a broad spectrum of data modalities, is non-linear, preserves local structure better than linear counterparts, and does not suffer from the same time complexity issues as t-SNE. Running vanilla UMAP on a CPU with all 2.7 million documents at once will fail, however, since UMAP is very memory demanding. cuML comes equipped with a CUDA implementation of UMAP that drastically cuts down computation time and allows users to fit a topic model on thousands of documents in seconds. The precalculated reduced embeddings provided for download above have been reduced to 5 dimensions, which took about 1 minute on a 40GB A100 card. For specific use cases, such as extracting trends and topic representations that may change over time, it is recommended to recalculate the reduced embeddings every time a new model is fit, since the reduced representation of each document depends on every other document in its neighborhood. Reusing embeddings reduced on the full corpus would allow downstream tasks to "peek into the future" (or past), potentially warping the representations of some documents more than others and leading to biased results.
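
A sketch of the reduction step with cuML, assuming the UMAP parameters from secrets.yaml (15 neighbours, 5 components, cosine metric):

import numpy as np
from cuml.manifold import UMAP

embeddings = np.load("embeddings.npy")
reducer = UMAP(n_neighbors=15, n_components=5, metric="cosine")
reduced_embeddings = reducer.fit_transform(embeddings)
np.save("reduced_embeddings.npy", reduced_embeddings)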

After dimensionality reduction, BERTopic fits a topic model to the data according to the specified hyperparameters. To obtain actual topic clusters from the individual document representations of each abstract, HDBSCAN is used as the clustering step. The GPU implementation of HDBSCAN is faster than its CPU counterpart but not crucial; since it is already part of the cuML package necessary for UMAP, we do recommend using it.
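
A minimal sketch of such a GPU cluster model (the cluster_model used in the training example further below); the minimum cluster size shown is a placeholder and should be tuned as discussed in the evaluation section:

from cuml.cluster import HDBSCAN

cluster_model = HDBSCAN(
    min_cluster_size=25,    # placeholder, see the hyperparameter discussion below
    min_samples=1,
    prediction_data=True,   # lets BERTopic compute document-topic probabilities later
)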

Fitting Topic Models

To fit topic models to the Arxiv preprints, the ModelWrapper class is used. This is the main workhorse for both training and evaluating BERTopic according to the hyperparameters specified. All hyperparameters are passed to ModelWrapper as dictionaries; a quick summary of each parameter is given below:

Setup Parameters

The Setup Parameters dictionary is required to train a basic topic model as well as Dynamic Topic Models; a sketch of such a dictionary follows the parameter list below.

  • samples: Number of HDBSCAN samples; only has a marginal impact on model performance. 1 by default.
  • cluster_size: The minimum size of clusters generated by HDBSCAN. Does impact model performance to a certain degree and should be set by performing evaluation for different values and examining the topic representations generated after fitting models with these values. No default, since the user should pick a suitable value for their use case.
  • startdate: Start date of the time span of interest. Format: "DD MM YYYY" e.g.: "01 01 2021". Defaults to None.
  • enddate: End date of the time span of interest. Format "DD MM YYYY" e.g. "31 01 2021". Defaults to None.
  • target: A label from the arXiv Category Taxonomy (https://arxiv.org/category_taxonomy). Allows the user to filter documents to a subset of interest. Defaults to "cs" for Computer Science adjacent articles.
  • secondary_target: A label from the arXiv Category Taxonomy for a different document subset used to add synthetic data to examine the ability of DTMs and OTMs to detect the influx of new trends in the data. Defaults to "q-bio" for Quantitative Biology.
  • secondary_startdate: Start date of the time span from which "synthetic" data should be extracted. It is recommended to select a time frame before the start date selected for the primary data set to avoid the possibility of secondary papers being related to primary papers.
  • secondary_enddate: End date of the time span from which "synthetic" data should be extracted.
  • secondary_proportion: Proportion of "synthetic" data that will be added to the primary data set. Defaults to 0.1.
  • trend_deviation: Value between 1 and 2 that determines how many more papers will be in "trending" time bins when compared to non-trending time bins. Defaults to 1.5.
  • n_trends: Integer value >= 1 and <= nr_bins that specifies how many bins should exhibit an influx of papers from the secondary set. Defaults to 1.
  • threshold: Threshold value relevant for semi-supervised model. If threshold is > 0, only papers that fall into a label class with at least threshold papers in it will be kept for downstream modeling. Defaults to 0.
  • labelmatch_subset: (Deprecated) Subset of data the model shall be compared to. May potentially be labelled with a different set of labels.
  • mask_probability: For the semi-supervised model. Value between 0 and 1 that specifies what share of preprint labels should be masked as 0. Defaults to 0, the fully semi-supervised case; 0.1 will result in 10% of papers being unlabeled, and a mask_probability of 1 corresponds to a fully unsupervised model.
  • recompute: Bool that specifies if the precomputed UMAP reduced embeddings should be used for the subset of data of interest or if UMAP should be rerun to recompute the dimensionality reduced document embeddings. It is recommended to recompute the UMAP step based on the chosen subset since running UMAP based on all 2.7 million preprints prior to fitting the topic model may lead to different results.
  • nr_topics: Parameter that specifies if the number of topics found by BERTopic should be reduced in case more topics are found than the user desires. Defaults to None. If nr_topics is an integer, BERTopic will attempt to bundle topic clusters using a hierarchical approach. Usually adjusting cluster_size should be sufficient to cut down on the number of clusters generated in the first place.
  • nr_bins: Number of bins for DTM. The time span specified with startdate and enddate will be split into nr_bins bins of equal size. Can be used to split a full year of data into 52 bins corresponding to one week each. Usually works out of the box, but for leap years a bit of care is needed if the beginnings and ends of each bin should fall on the same day of every week.
  • nr_chunks: Number of chunks the documents should be split up into for online learning. Not equivalent to binning, since the number of documents in each chunk will be kept equal. If time binning should be performed instead for an online model, set nr_chunks to None and adjust nr_bins accordingly.
  • evolution_tuning: Hyperparameter that specifies whether evolution tuning should be performed to adjust the representation of every topic over time. Relevant only for the dynamic model, defaults to False.
  • global_tuning: Hyperparameter that specifies whether global tuning should be performed to adjust the global topic representation of the DTM over time. Relevant only for the dynamic model, defaults to False.
  • limit: Only relevant for evaluation. Calculating topic coherence and diversity measures for large corpora (more than 10000 documents) may run into memory problems, so it is recommended to limit the number of documents considered in evaluation runs to less than 10000. A value of 7500 has proven stable. Defaults to None.
  • subset_cache: A string that specifies a location where subsets of filtered data should be cached in. This speeds up evaluation by a lot since all filtering and preprocessing steps need only be run once. Defaults to None.
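
As a reference point, here is a sketch of such a dictionary; the values are illustrative and mirror the defaults and examples listed above, and in practice the dictionary is imported from sciencenow.config and modified, as in the preprocessing example:

from sciencenow.config import setup_params

setup_params.update({
    "samples": 1,
    "cluster_size": 25,
    "startdate": "01 01 2021",
    "enddate": "31 12 2021",
    "target": "cs",
    "secondary_target": "q-bio",
    "secondary_startdate": "01 01 2020",
    "secondary_enddate": "31 12 2020",
    "secondary_proportion": 0.1,
    "trend_deviation": 1.5,
    "n_trends": 1,
    "threshold": 0,
    "mask_probability": 0,
    "recompute": True,
    "nr_topics": None,
    "nr_bins": 52,
    "nr_chunks": None,      # set for online models instead of nr_bins
    "evolution_tuning": False,
    "global_tuning": False,
    "limit": 7500,          # only relevant for evaluation
    "subset_cache": None,
})
SETUP_PARAMS = setup_params
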
Online Parameters

Parameters needed to fit Online Models.

  • clustering_threshold: Radius around a cluster center that represents a cluster. Adjusting this parameter has a similar effect on the number of generated clusters as the cluster_size parameter in DTMs. It is a bit less intuitive to tune, since the impact of the clustering threshold directly depends on the intervals the UMAP vectors fall into. In most cases values between 1.0 and 1.5 seem to work reasonably well, with smaller values resulting in a larger micro cluster count. Should be tuned by evaluating a range of models on a suitable subset of data first.
  • fading_factor: Parameter > 0 that controls importance of historical data to micro clusters in current batch of data. Defaults to 0.01.
  • cleanup_interval: Time interval between two consecutive time periods when the cleanup process is conducted.
  • intersection_factor: Area of the overlap of micro clusters relative to the area covered by micro clusters. Defaults to 0.3.
  • minimum_weight: Minimum weight for a cluster to be considered not "noisy".

For more information about each of these OTM hyperparameters, please refer to the DBSTREAM manual at https://riverml.xyz/latest/api/cluster/DBSTREAM/. In practice only the clustering threshold has a strong impact on model performance; it is recommended to perform evaluation runs on a subset of data to arrive at a reasonable set of hyperparameters for the task at hand.
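
A hedged sketch of how these parameters map onto river's DBSTREAM clusterer, the online clustering algorithm referenced above; the values shown are river's defaults, not recommendations:

from river import cluster

online_cluster_model = cluster.DBSTREAM(
    clustering_threshold=1.0,
    fading_factor=0.01,
    cleanup_interval=2,
    intersection_factor=0.3,
    minimum_weight=1.0,
)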

Topic Model Evaluation

Assessing the performance of topic models is not trivial, as collections of relevant documents are usually unlabeled or, as is the case with Arxiv preprints, only fall into broad categories that may overlap or not reflect the contents of papers perfectly. This problem is doubly relevant for the approach presented here, as the topic clusters obtained by dynamic or online topic modeling grow and change over the time period of interest. This leads us to use Normalized Pointwise Mutual Information (NPMI/Coherence) and Topic Diversity to assess the alignment of the obtained clusters with human judgment, as presented in the original BERTopic paper. We additionally use Cluster Purity and "Acuity" as proxy measures for how many of the documents in each cluster fall into the same prior "soft label" from the arxiv taxonomy, and for how well the model picks up artificially introduced clusters of "fake" data given the specified hyperparameters.
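
For reference, the two corpus-based metrics are commonly defined as follows, where P(w_i, w_j) denotes word co-occurrence probabilities estimated from the reference corpus, K is the number of topics, and top_n(k) is the set of top-n words of topic k:

\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}, \qquad \mathrm{Diversity} = \frac{\left|\bigcup_{k=1}^{K} \mathrm{top}_n(k)\right|}{K \cdot n}

The reported coherence is the NPMI averaged over the top word pairs of each topic; diversity is the fraction of unique words across all topic representations.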

A full example of this is presented in demo.py, but the gist is setting up a model and then executing a postprocessing pipeline along the lines of:

from sciencenow.core.model import BERTopicDynamic
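# The evaluation pipeline below reuses ArxivPipeline from the preprocessing example;
# the evaluation steps are assumed to live in sciencenow.core.steps as well.
from sciencenow.core.pipelines import ArxivPipeline
from sciencenow.core.steps import (
    GetOctisDatasetStep,
    GetDynamicTopicsStep,
    GetMetricsStep,
    CalculateDynamicDiversityStep,
    CalculateCoherenceStep,
    ExtractEvaluationResultsStep,
)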

# Set up the dynamic topic model (cluster_model, ctfidf_model, merger, and reducer come from the full example in demo.py)
model = BERTopicDynamic(
    data = merger.data,
    embeddings = reducer.reduced_embeddings,
    cluster_model=cluster_model,
    ctfidf_model=ctfidf_model,
)

# train model

model.train(setup_params=SETUP_PARAMS)

eval_pipe = ArxivPipeline(
    steps = [
        GetOctisDatasetStep(root = str(Path(TemporaryFile().file.name).parent)),
        GetDynamicTopicsStep(),
        GetMetricsStep(),
        CalculateDynamicDiversityStep(),
        CalculateCoherenceStep(),
        ExtractEvaluationResultsStep(
            id=merger.get_id(SETUP_PARAMS, primary=False),
            setup_params=SETUP_PARAMS
        ),
    ]
)

eval_pipe.execute(input=model)

After the fitting has been completed, the evaluation pipeline will then compute all evaluation metrics and add them as a dictionary to the model object, coupled with the hyperparameters specified for the particular model and the topic labels for each document that was present in the training data.
For large corpora it is recommended to limit the dataset to <10000 documents during evaluation, since calculating the NPMI is extremely memory hungry and may fail for very large numbers of documents. We tried mitigating this issue by using matrix market format corpora (https://radimrehurek.com/gensim/corpora/mmcorpus.html) for evaluation, but found this solution insufficient for corpora beyond 20000 documents. All other metrics are not dependent on corpus size.

Impact of Hyperparameters on Training

Perhaps the most intuitive hyperparameter to showcase is the HDBSCAN minimum cluster size, specified as one of the setup parameters and roughly analogous in its impact on model performance to the clustering threshold, its online clustering counterpart. Keeping all other hyperparameters constant and naively training a model with a minimum cluster size of 1 will result in HDBSCAN generating numerous very small but separate clusters. At least for relatively homogeneous datasets, this separates documents from one another despite very similar contents, even if they fall into the same prior taxonomy label class. This is reflected in lower NPMI and diversity scores, as well as a very high cluster purity. This makes intuitive sense: the more clusters are obtained, the more similar some of them will be to other clusters that a human expert would group under a larger, less specific umbrella category that better reflects the way humans group topics.
Increasing the minimum cluster size, one can observe that coherence and diversity scores increase up to a threshold and then begin to decrease again as the clusters obtained by the model become more and more general and less nuanced. For any given month in the years 2020-2023, the models seem to perform best with cluster sizes between 20 and 30.
A similar dip can be observed in cluster purity, with very small minimum cluster sizes yielding very pure clusters and very large minimum cluster sizes resulting in very general, "impure" clusters. For everyday use of this package, the user should always inspect the topic labels obtained by the model, as textual descriptions allow users to gauge the impact of any chosen hyperparameter beyond obtuse performance measures.
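
A hedged sketch of a minimum cluster size sweep, reusing the model and evaluation pipeline objects from the examples above; the exact place where the metrics end up is determined by ExtractEvaluationResultsStep, so here we simply keep the trained model objects around for inspection:

from cuml.cluster import HDBSCAN

results = {}
for size in [5, 15, 25, 50]:
    SETUP_PARAMS["cluster_size"] = size
    cluster_model = HDBSCAN(min_cluster_size=size, min_samples=1, prediction_data=True)
    model = BERTopicDynamic(
        data=merger.data,
        embeddings=reducer.reduced_embeddings,
        cluster_model=cluster_model,
        ctfidf_model=ctfidf_model,
    )
    model.train(setup_params=SETUP_PARAMS)
    eval_pipe.execute(input=model)      # attaches the evaluation metrics to the model object
    results[size] = model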

Trend Extraction

To extract trending papers from a trained model object, we use the TrendPostprocessor class like so:

trend_postprocessor = TrendPostprocessor(
    model=model,
    merger=merger,
    reducer=reducer,
    setup_params=SETUP_PARAMS,
    embedder=embedder,
)
# Performance calculations can be done since we added a synthetic subset
trend_postprocessor.calculate_performance(threshold=1.5)

trend_df, trend_embeddings = trend_postprocessor.get_trend_info(threshold=1.5)

# Inspect trend_df to see which papers were identified as trending
# We can now visualize the results
fig = plot_trending_papers(trend_df, trend_embeddings)
fig.write_html("C:\\Users\\Bene\\Desktop\\trends2.html")

Again, a full example of this is presented in demo.py.
