This is a Natural Language Processing (NLP) study of the values expressed in academic papers¹ published over a couple of decades. By values we mean terms such as 'objectivity', 'accuracy', and 'truth', as well as 'fairness', 'equality', 'freedom', 'utility', etc. We are interested in how the values expressed have changed over time.

Note: this project is still in an exploratory phase; there are no tangible results yet.

Project Plan

Phase 1: Data Preparation and Preprocessing

Data Collection Verification: Validate dataset completeness and relevance.
Text Cleaning: Utilize stemming, lemmatization, and stopword removal.
Data Splitting: Partition dataset into training, validation, and test sets.
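
The cleaning and splitting steps above can be sketched without any dependencies. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical one, stemming and lemmatization are left out (the plan would use SpaCy for those), and the 80/10/10 split ratio is an assumed default, not a project decision.

```python
import random
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "that", "we"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )


docs = [f"paper {i}" for i in range(10)]
train_set, val_set, test_set = split_dataset(docs)
tokens = clean_text("We argue that the Objectivity of a test is essential.")
```

Fixing the seed keeps the partition stable between runs, which matters when later phases are compared against each other.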

Phase 2: Exploratory Data Analysis (EDA)

Initial Insights: Analyze basic statistics and metrics.
Value-Term Frequency: Investigate term frequencies for key value-terms.
Temporal Trends: Study term frequency over time.
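
A minimal sketch of the value-term frequency and temporal trend analysis, assuming each paper is available as a (year, text) pair. The input format, the function name `value_term_rates`, and the per-1,000-token normalization are assumptions made for illustration; the term list is taken from the examples in the introduction.

```python
import re
from collections import Counter, defaultdict

VALUE_TERMS = {"objectivity", "accuracy", "truth", "fairness",
               "equality", "freedom", "utility"}


def value_term_rates(papers):
    """Return {year: {term: occurrences per 1,000 tokens}} for the value-terms."""
    counts = defaultdict(Counter)  # year -> value-term counts
    totals = Counter()             # year -> total token count
    for year, text in papers:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        counts[year].update(tok for tok in tokens if tok in VALUE_TERMS)
    return {
        year: {term: 1000 * n / totals[year] for term, n in counts[year].items()}
        for year in sorted(counts)
    }


# Made-up sentences, only to show the call.
papers = [
    (1995, "Accuracy and objectivity guide the accuracy of measurement."),
    (2015, "Fairness matters; fairness and accuracy can conflict here."),
]
rates = value_term_rates(papers)
```

Normalizing by token count keeps years with more or longer papers from dominating the trend.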

Phase 3: Feature Engineering

Bag of Words (BoW): Implement as a baseline.
TF-IDF: Use for term importance.
Word Embeddings: Capture semantic meanings.
3.1 GloVe Vectors: Utilize pretrained GloVe vectors for semantic similarity.
3.2 Word2Vec: Use pretrained word2vec vectors.
Semantic Similarity: Integrate GloVe vectors or other embeddings into similarity computations.
4.1 Cosine Similarity with Embeddings: Compute cosine similarity scores using word embeddings.
4.2 Document Embedding: Create document-level embeddings by averaging or weighting word embeddings.
4.3 Topic Modeling: Identify key topics.
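
The document-embedding and cosine-similarity items can be illustrated with a small, dependency-free sketch. The three-dimensional vectors in `TOY_VECTORS` are invented stand-ins for pretrained GloVe or word2vec rows, and unweighted averaging is only one of the options the plan mentions.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pretrained embeddings.
TOY_VECTORS = {
    "fairness": [0.9, 0.1, 0.0],
    "equality": [0.8, 0.2, 0.1],
    "accuracy": [0.1, 0.9, 0.2],
    "precision": [0.0, 0.8, 0.3],
}


def document_embedding(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens; None if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = document_embedding(["fairness", "equality"], TOY_VECTORS)
doc_b = document_embedding(["accuracy", "precision"], TOY_VECTORS)
same_theme = cosine_similarity(doc_a, TOY_VECTORS["fairness"])
cross_theme = cosine_similarity(doc_a, doc_b)
```

With real pretrained vectors, the lookup table would likely come from Gensim rather than a hand-written dictionary; the averaging and similarity logic would stay the same.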

Phase 4: Model Building

Rule-Based Methods: Implement sentiment capture for value-terms.
Machine Learning Models: Use Random Forests, XGBoost, Naive Bayes, or SVMs.
Deep Learning Models: Explore RNNs or Transformers.
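
As one possible reading of the rule-based item above, the sketch below sums the polarity of words that occur within a fixed window around each value-term. The polarity lexicon, the window size, and the function name `value_term_sentiment` are all hypothetical; the project may settle on a different rule set.

```python
import re

VALUE_TERMS = {"objectivity", "fairness", "accuracy"}

# Hypothetical mini-lexicon: +1 for approving words, -1 for disapproving ones.
POLARITY = {
    "essential": 1, "important": 1, "improves": 1,
    "lacks": -1, "undermines": -1, "poor": -1,
}


def value_term_sentiment(text, window=3):
    """Sum polarity words within `window` tokens of each value-term occurrence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {}
    for i, tok in enumerate(tokens):
        if tok in VALUE_TERMS:
            nearby = tokens[max(0, i - window): i + window + 1]
            context_score = sum(POLARITY.get(word, 0) for word in nearby)
            scores[tok] = scores.get(tok, 0) + context_score
    return scores


scores = value_term_sentiment(
    "Objectivity is essential to science, but the procedure undermines fairness."
)
```

A baseline like this is cheap to inspect by hand, which makes it a useful reference point before the machine learning models are trained.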

Phase 5: Evaluation Metrics

Standard Metrics: Evaluate via accuracy, precision, recall, and F1-score.
Inter-annotator Agreement: Manually annotate and compare.
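
The metrics above can be written out directly for the binary case. The plan does not name an agreement statistic, so the use of Cohen's kappa for inter-annotator agreement is an assumption here, as are the toy label lists.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the label `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels: 1 = sentence expresses a value, 0 = it does not.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
kappa = cohens_kappa(gold, pred)
```

In practice, scikit-learn (already in the tech stack) provides equivalent functions; the hand-written versions only make the definitions explicit.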

Phase 6: Deployment and Monitoring

API Creation: Use Django for deployment.
Monitoring: Implement logging and performance tracking.
Continuous Updating: Update model with new academic papers.
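
For the monitoring item, one lightweight option is a decorator that logs how long each call takes. This sketch uses only the standard library; the logger name `values_nlp` and the function `score_paper` are hypothetical placeholders, not part of the project.

```python
import functools
import logging
import time

logger = logging.getLogger("values_nlp")  # hypothetical logger name


def track_performance(func):
    """Log the wall-clock duration of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failed calls are timed too.
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s finished in %.2f ms", func.__name__, elapsed_ms)
    return wrapper


@track_performance
def score_paper(text):
    """Hypothetical stand-in for a model call: counts whitespace-separated tokens."""
    return len(text.split())


logging.basicConfig(level=logging.INFO)
result = score_paper("fairness and accuracy")
```

The same decorator could wrap whichever function ends up serving predictions in the deployed API.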

Optional:

Advanced NLP: Explore sentiment models or transfer learning.
Explainability: Use techniques like LIME or SHAP.
Testing and Logging: Ensure codebase stability and scalability.

Tech Stack

Data Preparation: Pandas, SpaCy
EDA: Matplotlib, Seaborn
Feature Engineering: Scikit-learn, Gensim, SpaCy
Model Building: Scikit-learn, PyTorch
Deployment: Django

Activate virtual environment

conda activate nlpenv

If the environment does not exist yet, create it, together with its dependencies, from env.yml:

conda env create -f env.yml

While the environment is active, you can install a package:

conda install <package-name> -c conda-forge

Footnotes

¹ We are primarily interested in more formal/mathematical topics, and start our study by looking at Psychometrics articles. The research can then be extended to other disciplines and texts.
