This is a Natural Language Processing (NLP) study of the values expressed in academic papers[^1] published over the past couple of decades. By values we mean terms such as 'objectivity', 'accuracy', and 'truth', but also 'fairness', 'equality', 'freedom', 'utility', etc. We are interested in how the values expressed have changed over time.
Note: this project is still in an exploratory phase; there are no tangible results yet.
- Data Collection Verification: Validate dataset completeness and relevance.
- Text Cleaning: Utilize stemming, lemmatization, and stopword removal.
- Data Splitting: Partition dataset into training, validation, and test sets.
- Initial Insights: Analyze basic statistics and metrics.
- Value-Term Frequency: Investigate term frequencies for key value-terms.
- Temporal Trends: Study term frequency over time.
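The cleaning and frequency-counting steps above can be sketched with the standard library alone. The corpus, stopword list, and value-term set below are toy stand-ins for the real dataset, which would be cleaned with SpaCy:

```python
import re
from collections import Counter, defaultdict

# Toy (year, abstract) pairs standing in for the real corpus -- hypothetical data.
papers = [
    (2001, "We value objectivity and accuracy in measurement."),
    (2001, "Accuracy of the scale is assessed."),
    (2019, "Fairness and equality in testing are discussed."),
    (2019, "We study fairness alongside accuracy."),
]

VALUE_TERMS = {"objectivity", "accuracy", "truth", "fairness", "equality", "freedom", "utility"}
STOPWORDS = {"we", "and", "in", "the", "of", "is", "are", "a", "alongside"}  # minimal stand-in list

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Value-term frequency per year (the temporal-trends step).
counts = defaultdict(Counter)
for year, text in papers:
    for tok in tokenize(text):
        if tok in VALUE_TERMS:
            counts[year][tok] += 1

for year in sorted(counts):
    print(year, dict(counts[year]))
```

On real data, lemmatization would fold 'values'/'value' and similar variants together before counting.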
- Bag of Words (BoW): Implement as a baseline.
- TF-IDF: Use for term importance.
- Word Embeddings: Capture semantic meanings.
  - GloVe Vectors: Utilize pretrained GloVe vectors for semantic similarity.
  - Word2Vec: Use pretrained word2vec vectors.
- Semantic Similarity: Integrate GloVe vectors or other embeddings into similarity computations.
  - Cosine Similarity with Embeddings: Compute cosine similarity scores using word embeddings.
  - Document Embedding: Create document-level embeddings by averaging or weighting word embeddings.
  - Topic Modeling: Identify key topics.
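The TF-IDF and cosine-similarity steps can be sketched with scikit-learn; the three documents below are hypothetical abstracts, and the built-in English stopword list stands in for the project's cleaning pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents: two "epistemic" abstracts and one "social" abstract.
docs = [
    "objectivity and accuracy are central to psychometric measurement",
    "accuracy and objectivity guide psychometric practice",
    "fairness and equality matter in education",
]

# TF-IDF weights terms by importance within and across documents.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the TF-IDF document vectors.
sim = cosine_similarity(X)
print(sim)
```

The same `cosine_similarity` call works on averaged GloVe or word2vec document vectors once those are loaded.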
- Rule-Based Methods: Implement sentiment capture for value-terms.
- Machine Learning Models: Use Random Forests, XGBoost, Naive Bayes, or SVMs.
- Deep Learning Models: Explore RNNs or Transformers.
- Standard Metrics: Evaluate via accuracy, precision, recall, and F1-score.
- Inter-annotator Agreement: Manually annotate and compare.
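A minimal sketch of the classical-ML route with the standard metrics above, using scikit-learn's Naive Bayes on toy labeled sentences (the texts and the epistemic/social labels are hypothetical). For brevity it scores on the training set; real evaluation would use the held-out test split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data: which kind of value does a sentence express?
texts = [
    "objectivity and accuracy drive this analysis",
    "truth and accuracy above all",
    "fairness and equality for participants",
    "freedom and equality in assessment",
]
labels = ["epistemic", "epistemic", "social", "social"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(texts)  # sketch only: predicting on the training set
acc = accuracy_score(labels, pred)
prec, rec, f1, _ = precision_recall_fscore_support(labels, pred, average="macro")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Swapping `MultinomialNB` for a Random Forest, XGBoost, or SVM only changes the final pipeline step.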
- API Creation: Use Django for deployment.
- Monitoring: Implement logging and performance tracking.
- Continuous Updating: Update model with new academic papers.
- Advanced NLP: Explore sentiment models or transfer learning.
- Explainability: Use techniques like LIME or SHAP.
- Testing and Logging: Ensure codebase stability and scalability.
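For the monitoring step, a minimal logging sketch using only the standard library; the `timed_predict` wrapper and the dummy model are hypothetical placeholders for the deployed classifier:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("value-trends")

def timed_predict(model_fn, text):
    """Call the model and log latency and input size for each request."""
    start = time.perf_counter()
    result = model_fn(text)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    log.info("predicted %r in %.1f ms (input length %d)", result, elapsed_ms, len(text))
    return result

# Dummy rule-based model standing in for the real one.
label = timed_predict(
    lambda t: "epistemic" if "accuracy" in t else "social",
    "accuracy matters in measurement",
)
```

In a Django deployment, the same wrapper would sit inside the view handling prediction requests.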
- Data Preparation: Pandas, SpaCy
- EDA: Matplotlib, Seaborn
- Feature Engineering: Scikit-learn, Gensim, SpaCy
- Model Building: Scikit-learn, PyTorch
- Deployment: Django
Create the environment with its dependencies from env.yml:

conda env create -f env.yml

Then activate it:

conda activate nlpenv

While the environment is active, you can install additional packages:

conda install -c conda-forge <package-name>
[^1]: We are primarily interested in more formal/mathematical topics, and start our study by looking at Psychometrics articles. The research can then be extended to other disciplines and texts.