Improving Academic Skills Assessment with NLP and Ensemble Learning

Xinyi Huang, University of Chicago, Chicago, USA
Yingyi Wu, Rensselaer Polytechnic Institute, Seattle, USA
Danyang Zhang, San Jose State University, San Jose, USA
Jiacheng Hu, Tulane University, New Orleans, USA
Yujian Long, Independent Researcher, Frisco, TX, USA

Abstract

This study addresses the critical challenges of assessing foundational academic skills by leveraging advancements in natural language processing (NLP). Traditional assessment methods often struggle to provide timely and comprehensive feedback on key cognitive and linguistic aspects, such as coherence, syntax, and analytical reasoning. Our approach integrates multiple state-of-the-art NLP models, including BERT, RoBERTa, BART, DeBERTa, and T5, within an ensemble learning framework. These models are combined through stacking techniques using LightGBM and Ridge regression to enhance predictive accuracy. The methodology involves detailed data preprocessing, feature extraction, and pseudo-label learning to optimize model performance. By incorporating sophisticated NLP techniques and ensemble learning, this study significantly improves the accuracy and efficiency of assessments, offering a robust solution that surpasses traditional methods and opens new avenues for educational technology research focused on enhancing core academic competencies.

Index Terms:

English Language Learners (ELL), ensemble learning, linguistic assessment, natural language processing, educational technology

I Introduction

The assessment of English Language Learners (ELL) in grades 8-12 presents significant challenges, particularly in evaluating cohesion, syntax, vocabulary, phraseology, grammar, and conventions. Traditional methods often fail to provide timely and comprehensive feedback necessary for student improvement and instructional support. This study leverages recent advancements in natural language processing (NLP) to develop a robust model that enhances the accuracy and efficiency of these assessments.

Central to our approach is the integration of multiple state-of-the-art NLP models within an ensemble learning framework. BERT, introduced by Devlin et al. (2019), revolutionized text analysis by capturing context bidirectionally. Building on BERT’s success, Liu et al. (2019) developed RoBERTa, which further optimized training procedures, resulting in improved performance across various tasks. Similarly, Lewis et al. (2020) introduced BART, combining bidirectional and autoregressive Transformers to enhance text generation and comprehension capabilities.

To address the specific needs of educational assessment, our model incorporates these advanced NLP techniques along with DeBERTa, which employs disentangled attention mechanisms to capture nuanced textual dependencies. Additionally, T5’s text-to-text framework, as explored by Raffel et al. (2020), allows flexible task handling by converting all NLP tasks into a text-to-text format. These models are integrated through stacking, a technique where multiple base models’ predictions are combined using meta-learners like LightGBM and Ridge regression. LightGBM, known for its efficiency in handling large-scale data through gradient boosting, and Ridge regression, which provides regularization to ensure stable predictions, are crucial in achieving high predictive accuracy.

Our methodology begins with comprehensive data preprocessing. Essays are processed using multi-label stratified cross-validation to maintain balanced representation across all linguistic indicators. Text data is tokenized using pre-trained tokenizers, ensuring consistency and maximizing model performance. We employ custom PyTorch model classes, such as MeanPooling and DebertaBaseModel, to handle text inputs and perform classification tasks effectively.

Feature extraction and pseudo-label learning play significant roles in refining our model’s performance. By extracting features from the last four layers of 38 pre-trained models and employing forward feature selection, we identify optimal configurations for Support Vector Regression (SVR). Pseudo-label learning involves using both pre-trained and newly generated pseudo-labeled data to fine-tune DeBERTa models, enhancing their generalization capabilities across diverse datasets.

In summary, our ensemble learning approach integrates advanced NLP models such as DeBERTa, RoBERTa, T5, and GPT through stacking. Combined with data preprocessing, feature extraction, and pseudo-label learning, this method improves the accuracy of linguistic assessments for ELL students. The study addresses the limitations of traditional assessment methods and sets the stage for future research on applying ensemble learning techniques to educational domains.

II Related Work

Recent advancements in natural language processing (NLP) have significantly improved text analysis and understanding, providing new opportunities for educational assessments. Devlin et al.[1] introduced BERT, which transformed the landscape of language representation models through bidirectional training of Transformer encoders. BERT’s ability to capture context from both directions in a text significantly improved performance on various NLP tasks, including text classification and language inference. Building on this, Liu et al.[2] developed RoBERTa, optimizing the training process by using more data and larger batches, leading to further improvements in model performance. Similarly, Lewis et al.[3] introduced BART, which combines bidirectional and autoregressive Transformers, enhancing the model’s ability to generate and comprehend text.

While these models achieved state-of-the-art results in many benchmarks, their application to educational assessments, particularly for predicting multiple linguistic indicators simultaneously, remained underexplored. Sun et al.[4] demonstrated the potential of fine-tuning BERT for specific tasks, highlighting its adaptability. Raffel et al.[5] extended this further with the introduction of T5, a text-to-text transfer learning framework, showcasing the versatility of transfer learning in handling various NLP tasks.

Ensemble learning approaches, such as those discussed by Sagi and Rokach[6], have shown promise in combining the strengths of individual models to improve overall performance. Techniques like stacking and blending have been effective in various domains. Krawczyk et al.[7] provided a comprehensive survey on ensemble learning for data stream analysis, emphasizing its robustness and adaptability.

Yang et al.[8] introduced XLNet, which further advanced the capabilities of autoregressive pretraining for language understanding, showcasing improvements over BERT in several benchmarks. Howard and Ruder[9] proposed universal language model fine-tuning for text classification, demonstrating significant performance gains in various text classification tasks. Qiu et al.[10] provided a comprehensive survey on pre-trained models for NLP, emphasizing their impact on various downstream tasks.

Brown et al.[11] introduced a groundbreaking model that showcased the ability of language models to perform well with few-shot learning, further emphasizing the potential of pre-trained models in NLP. Conneau and Lample[12] discussed cross-lingual language model pretraining, highlighting the benefits of multilingual pretraining for cross-lingual transfer tasks. Williams et al.[13] presented a broad-coverage challenge corpus for sentence understanding through inference, which has been widely used to benchmark NLP models.

He et al.[14] introduce methods for utilizing large language models to identify constraints, which informs our approach to optimizing data preprocessing and feature extraction with NLP models like BERT and RoBERTa in educational assessments.

Sun and Ortiz[15] provide an AI-based system that uses LLMs for complex activity tracking, which parallels our use of multiple NLP models and pseudo-label learning to improve coherence and accuracy in assessments.

Yu et al.[16] study large language models for medical question answering, highlighting techniques with BART and T5 that enhance our strategy for generating contextually relevant and accurate feedback.

Zhang et al.[17] explore fairness-aware feature selection using causal graphs, supporting our use of LightGBM and Ridge regression to maintain fairness and reduce bias in model ensemble learning.

Radford et al.[18] introduced generative pre-training of language models, which has had a significant impact on subsequent NLP research. The GLUE benchmark proposed by Wang et al.[19] has been instrumental in evaluating the performance of NLP models across various tasks, providing a standardized framework for comparison.

Our research builds on these advancements by proposing an ensemble method that integrates multiple pre-trained models, fine-tunes them for the specific task of linguistic assessment, and combines their outputs using LightGBM and Ridge regression. This approach not only enhances prediction accuracy but also provides a scalable and efficient solution for real-world educational applications.

III Methodology

Multi-label text classification is a challenging task due to the interdependencies between labels and the variability in text length and structure. This section describes how we preprocess the data, design the model architectures, and evaluate performance. Our approach combines stratified cross-validation, pseudo-labeling, and model stacking: it leverages a diverse set of pre-trained models and integrates their predictions using ensemble techniques to achieve robust performance. The whole model pipeline is shown in Fig. 1.

Figure 1: Model ensemble structure for organ models

III-A Feature extraction

We extracted features from the last four layers of 38 pre-trained models and trained SVR and Ridge models on these embeddings. Forward feature selection was used to find the best embedding combination for SVR, and the Ridge model was then trained on that combination; this was our best single model, with a cross-validation (CV) score of 0.4467. The features were also fed as mask input.
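The forward selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each pre-trained model is assumed to contribute one embedding block, candidates are scored by hold-out RMSE, and a closed-form ridge fit stands in for SVR to keep the example dependency-light. The names `ridge_rmse`, `forward_select`, and the toy data are hypothetical.

```python
import numpy as np

def ridge_rmse(X_tr, y_tr, X_va, y_va, lam=1.0):
    """Fit closed-form ridge on the training split and return validation RMSE."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return float(np.sqrt(np.mean((X_va @ w - y_va) ** 2)))

def forward_select(blocks, y, train_idx, valid_idx, lam=1.0):
    """Greedily add the embedding block that most reduces validation RMSE."""
    selected, best = [], float("inf")
    remaining = list(blocks)
    while remaining:
        scores = {}
        for name in remaining:
            X = np.hstack([blocks[n] for n in selected + [name]])
            scores[name] = ridge_rmse(X[train_idx], y[train_idx],
                                      X[valid_idx], y[valid_idx], lam)
        cand = min(scores, key=scores.get)
        if scores[cand] >= best:  # stop when no candidate improves the score
            break
        selected.append(cand)
        best = scores[cand]
        remaining.remove(cand)
    return selected, best

# Toy data (hypothetical): one informative block and one pure-noise block.
rng = np.random.default_rng(0)
n = 200
useful = rng.normal(size=(n, 4))
noise = rng.normal(size=(n, 4))
y = useful @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.normal(size=n)
blocks = {"useful": useful, "noise": noise}
selected, best = forward_select(blocks, y, np.arange(0, 150), np.arange(150, 200))
print(selected, round(best, 4))
```

In practice the scoring function would be the cross-validated error of the actual regressor, and each block would hold pooled hidden states from one pre-trained model.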

III-A1 SVR

Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) that is used for regression tasks. It attempts to fit the best line within a predefined margin of tolerance, $\epsilon$. The SVR model is defined by:

\min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i=1}^{N}\max(0,|y_{i}-(\mathbf{w}\cdot\mathbf{x}_{i}+b)|-\epsilon)

(1)

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, $C$ is the regularization parameter, $y_{i}$ is the true value, and $\mathbf{x}_{i}$ is the input feature vector. SVR aims to find a function that approximates the true relationship between the features and the target variable while minimizing prediction errors within the $\epsilon$ margin.
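To make Eq. (1) concrete, the sketch below evaluates the objective for a fixed weight vector. It only computes the objective value and does not solve the optimization problem; the function name `svr_objective` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def svr_objective(w, b, X, y, C=1.0, eps=0.1):
    """Value of Eq. (1): 0.5 * ||w||^2 + C * sum of epsilon-insensitive losses."""
    residual = np.abs(y - (X @ w + b))
    hinge = np.maximum(0.0, residual - eps)  # errors inside the margin cost nothing
    return 0.5 * float(w @ w) + C * float(hinge.sum())

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.05, 2.0, 3.5])
w = np.array([1.0])
b = 0.0
obj = svr_objective(w, b, X, y, C=1.0, eps=0.1)
print(obj)
```

Only the third sample lies outside the $\epsilon$ margin (residual 0.5), so it alone contributes to the loss term.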

III-A2 Ridge Regression

Ridge Regression, also known as Tikhonov regularization, is a linear regression model that includes a regularization term to prevent overfitting. The Ridge regression model solves the following optimization problem:

\min_{\mathbf{w}}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^{2}+\lambda\|\mathbf{w}\|^{2}

(2)

where $\mathbf{y}$ is the vector of observed values, $\mathbf{X}$ is the matrix of input features, $\mathbf{w}$ is the weight vector, and $\lambda$ is the regularization parameter. The regularization term $\lambda\|\mathbf{w}\|^{2}$ penalizes large weights, encouraging the model to find a balance between fitting the training data and maintaining simplicity in the model.
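Eq. (2) has the well-known closed-form minimizer $\mathbf{w}=(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y}$. The following minimal sketch, with a hypothetical helper `ridge_fit` and toy data, shows how increasing $\lambda$ shrinks the weights.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w_ols = ridge_fit(X, y, lam=0.0)     # no regularization
w_ridge = ridge_fit(X, y, lam=10.0)  # strong regularization
print(w_ols, w_ridge)
```

With $\lambda=0$ the fit recovers the exact weights $(1, 2)$; with $\lambda=10$ both weights are pulled toward zero.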

III-B Model Class

The architecture consists of two primary PyTorch models: MeanPooling and CustomModel. The MeanPooling class performs mean pooling on hidden states from pre-trained models like BERT, while the CustomModel constructs a classification model based on a pre-trained transformer.

III-B1 MeanPooling

The MeanPooling class aggregates the hidden states $\mathbf{H}_{t}$ of the $T$ tokens produced by a BERT-like model by averaging them:

\mathbf{H}_{\text{mean}}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{H}_{t}

(3)
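A minimal NumPy sketch of Eq. (3) follows. It assumes padded batches accompanied by an attention mask, so that padding positions are excluded from the average; with no padding it reduces exactly to Eq. (3). NumPy is used here as a stand-in for the PyTorch class named in the text, and `mean_pool` is a hypothetical name.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average hidden states over real tokens only.

    hidden: (batch, T, dim) hidden states; mask: (batch, T), 1 for real tokens.
    """
    m = mask[..., None].astype(hidden.dtype)
    summed = (hidden * m).sum(axis=1)
    counts = np.clip(m.sum(axis=1), 1e-9, None)  # guard against empty rows
    return summed / counts

hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # the third position is padding
pooled = mean_pool(hidden, mask)
print(pooled)
```

The padded third position is ignored, so the result is the average of the first two token vectors.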

III-B2 CustomModel

The CustomModel utilizes the hidden states processed by MeanPooling and feeds them into a fully connected layer for classification:

\mathbf{y}=\text{softmax}(\mathbf{W}\mathbf{H}_{\text{mean}}+\mathbf{b})

(4)

where $\mathbf{W}$ and $\mathbf{b}$ are learnable parameters.
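Eq. (4) can be illustrated with a small NumPy sketch. The shapes (a batch of pooled vectors, `W` with one row per class) and the example values are assumptions made for illustration; the actual layer sizes are not given in the text.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(h_mean, W, b):
    """Eq. (4): y = softmax(W h_mean + b), applied row-wise to a batch."""
    return softmax(h_mean @ W.T + b)

h_mean = np.array([[2.0, 3.0]])                        # (batch, dim)
W = np.array([[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]])    # (classes, dim)
b = np.zeros(3)
probs = head(h_mean, W, b)
print(probs)
```

Each output row is a probability distribution over the classes, so its entries sum to one.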

III-C Model Fine-tune

Twenty-two models are used in the ensemble, and each learns embedding representations of a different dimensionality.

III-C1 DeBERTa

DeBERTa (Decoding-enhanced BERT with Disentangled Attention) enhances BERT with a disentangled attention mechanism and an enhanced mask decoder. Disentangled attention represents each token with separate content and relative-position vectors and computes attention weights from both, while the decoder incorporates absolute position information, improving the model’s ability to capture syntactic and semantic information:

\mathbf{H}_{\text{DeBERTa}}=\text{DisentangledAttention}(\mathbf{H})

(5)

where $\mathbf{H}$ represents the hidden states of the input sequence.

III-C2 ALBERT

ALBERT (A Lite BERT) reduces the number of parameters by factorizing the embedding parameterization and sharing parameters across layers. This makes ALBERT computationally efficient while maintaining performance:

\mathbf{H}_{\text{ALBERT}}=\text{SharedLayerNorm}(\text{FactorizedEmbedding}(\mathbf{X}))

(6)

where $\mathbf{X}$ is the input sequence.

III-C3 BART

BART (Bidirectional and Auto-Regressive Transformers) combines the strengths of BERT and GPT by using a bidirectional encoder and an auto-regressive decoder. This model is effective for text generation and classification tasks:

\mathbf{H}_{\text{BART}}=\text{Decoder}(\text{Encoder}(\mathbf{X}))

(7)

III-C4 ELECTRA

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) introduces a novel pre-training task that involves replacing tokens in the input with plausible alternatives and training the model to distinguish between original and replaced tokens:

\mathcal{L}_{\text{ELECTRA}}=-\sum_{t=1}^{T}[y_{t}\log(p_{t})+(1-y_{t})\log(1-p_{t})]

(8)

where $y_{t}$ indicates whether the token at position $t$ is original ($y_{t}=1$) or replaced ($y_{t}=0$), and $p_{t}$ is the predicted probability of the token being original.

III-C5 GPT-2

GPT-2 (Generative Pre-trained Transformer 2) is an auto-regressive language model that generates coherent and contextually relevant text by predicting the next word in a sequence:

P(X)=\prod_{t=1}^{T}P(x_{t}|x_{1:t-1})

(9)

where $X$ is the input sequence and $x_{t}$ is the token at position $t$.
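As a small worked example of Eq. (9), the sequence probability is the product of the per-token conditional probabilities, usually accumulated in log space for numerical stability. The probabilities below are made-up values, not model outputs.

```python
import math

def sequence_log_prob(cond_probs):
    """Sum of log P(x_t | x_{1:t-1}) over the sequence, i.e. log P(X) in Eq. (9)."""
    return sum(math.log(p) for p in cond_probs)

cond_probs = [0.5, 0.25, 0.8]  # hypothetical conditionals for a 3-token sequence
log_p = sequence_log_prob(cond_probs)
p = math.exp(log_p)
print(log_p, p)
```

Here $P(X)=0.5\cdot 0.25\cdot 0.8=0.1$.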

III-C6 T5

T5 (Text-to-Text Transfer Transformer) frames all NLP tasks as text-to-text problems, allowing for a unified approach to various tasks such as translation, summarization, and classification:

\mathbf{H}_{\text{T5}}=\text{Decoder}(\text{Encoder}(\mathbf{X}))

(10)

T5 employs a sequence-to-sequence architecture where both input and output are treated as text strings. Each of these models contributes unique strengths to the ensemble, capturing different aspects of the data to improve overall performance.

III-D Pseudo-label learning

We mainly use DeBERTa-series models for pseudo-label learning, loading a trained model to initialize the weights. Each model is trained in two modes:

• Pre-train with pseudo-labels and then fine-tune with only the given training data.

• Concatenate the pseudo-labeled data with the given training data and train on the combined set.
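The two modes can be sketched on a toy linear model as follows. This is an illustration under stated assumptions, not the authors' training code: pseudo-labels are assumed to come from a teacher model trained on the labeled data, a linear regressor trained by gradient descent stands in for a DeBERTa model, and all function names are hypothetical.

```python
import numpy as np

def train(X, y, w_init=None, lr=0.1, epochs=200):
    """Gradient descent on mean squared error, optionally warm-started."""
    w = np.zeros(X.shape[1]) if w_init is None else w_init.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def pseudo_label(teacher_w, X_unlabeled):
    """Label unlabeled inputs with the teacher's predictions."""
    return X_unlabeled @ teacher_w

def mode_pretrain_then_finetune(X_lab, y_lab, X_unl, y_pseudo):
    """Mode 1: train on pseudo-labels, then fine-tune on labeled data only."""
    w = train(X_unl, y_pseudo)
    return train(X_lab, y_lab, w_init=w)

def mode_concatenate(X_lab, y_lab, X_unl, y_pseudo):
    """Mode 2: train once on labeled and pseudo-labeled data together."""
    X_all = np.vstack([X_lab, X_unl])
    y_all = np.concatenate([y_lab, y_pseudo])
    return train(X_all, y_all)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
X_lab = rng.normal(size=(20, 2))
y_lab = X_lab @ true_w + 0.05 * rng.normal(size=20)
X_unl = rng.normal(size=(200, 2))

teacher_w = train(X_lab, y_lab)
y_pseudo = pseudo_label(teacher_w, X_unl)
w_mode1 = mode_pretrain_then_finetune(X_lab, y_lab, X_unl, y_pseudo)
w_mode2 = mode_concatenate(X_lab, y_lab, X_unl, y_pseudo)
print(w_mode1, w_mode2)
```

On this convex toy problem both modes converge to nearly the same weights; with deep non-convex models the two schedules can end in different solutions, which is one reason to train and ensemble both.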

III-E Model Ensemble

Ridge is trained on the predictions of the fine-tuned models, while LightGBM (LGB) is trained on those predictions together with meta-features derived from readability measures. The final output is a weighted average of the two meta-learners’ predictions. The whole model ensemble pipeline is shown in Fig. 2.
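A minimal sketch of the stacking and blending step is given below. It makes several assumptions that are not in the text: base-model predictions are simulated as noisy copies of the target, the ridge meta-learner is fit in closed form with an unpenalized intercept, a simple average of the base predictions stands in for the LightGBM output, and the blend weight of 0.6 is arbitrary.

```python
import numpy as np

def fit_ridge_meta(oof_preds, y, lam=1.0):
    """Ridge meta-learner on base-model predictions, with an unpenalized intercept."""
    X = np.hstack([oof_preds, np.ones((len(y), 1))])
    reg = lam * np.eye(X.shape[1])
    reg[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def predict_ridge_meta(w, preds):
    return np.hstack([preds, np.ones((len(preds), 1))]) @ w

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two meta-learner outputs."""
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

def rmse(pred, y):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Simulated targets and out-of-fold predictions from three base models.
rng = np.random.default_rng(0)
y = 3.0 + 0.5 * rng.normal(size=500)
oof = np.column_stack([y + s * rng.normal(size=500) for s in (0.3, 0.4, 0.5)])

w = fit_ridge_meta(oof, y)
ridge_pred = predict_ridge_meta(w, oof)
other_pred = oof.mean(axis=1)  # placeholder for the second meta-learner's output
final = blend(ridge_pred, other_pred, weight_a=0.6)
print(rmse(ridge_pred, y), [rmse(oof[:, j], y) for j in range(3)])
```

In a real pipeline the meta-learners would be fit on out-of-fold predictions and evaluated on held-out folds rather than on the data they were fit to.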

III-F Data Preprocessing

Data preprocessing involves stratified k-fold cross-validation and tokenization. The MultilabelStratifiedKFold class ensures balanced label distribution across folds. Missing values in the ’full_text’ column are filled with empty strings to maintain input consistency:

\text{full\_text}_{i}=\text{fillna}(\text{full\_text}_{i},"")

(11)

A pre-trained tokenizer, specified by the configuration variable CFG.model, tokenizes the text, preparing it for model input.
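The preprocessing steps can be sketched in plain Python as follows. The fold assignment here is a deliberately simplified stand-in, not the MultilabelStratifiedKFold algorithm: it sorts essays by their label tuple and deals them round-robin into folds so that similar label combinations are spread out. The row layout, the label names `cohesion` and `syntax`, and both helper names are illustrative assumptions.

```python
def fill_missing_text(rows, key="full_text"):
    """Replace a missing essay text (None) with an empty string, as in Eq. (11)."""
    return [{**r, key: r.get(key) or ""} for r in rows]

def assign_folds(rows, label_keys, n_splits=5):
    """Simplified stratification: sort by label tuple, then deal round-robin."""
    order = sorted(range(len(rows)),
                   key=lambda i: tuple(rows[i][k] for k in label_keys))
    folds = [0] * len(rows)
    for rank, i in enumerate(order):
        folds[i] = rank % n_splits
    return folds

rows = [
    {"full_text": "I like school.", "cohesion": 3.0, "syntax": 2.5},
    {"full_text": None, "cohesion": 3.0, "syntax": 2.5},
    {"full_text": "We play football.", "cohesion": 2.0, "syntax": 2.0},
    {"full_text": "My family is big.", "cohesion": 2.0, "syntax": 2.0},
    {"full_text": "Reading helps me learn.", "cohesion": 4.0, "syntax": 3.5},
    {"full_text": "Travel broadens the mind.", "cohesion": 4.0, "syntax": 3.5},
]
cleaned = fill_missing_text(rows)
folds = assign_folds(cleaned, ["cohesion", "syntax"], n_splits=3)
print(folds)
```

Essays with identical label tuples end up in different folds, which is the property that stratification is meant to provide.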

III-G Loss Function

The primary loss function used is the Binary Cross-Entropy (BCE) loss, which is suitable for multi-label classification tasks:

\mathcal{L}_{\text{BCE}}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right]

(12)

where $y_{i}$ is the true label and $p_{i}$ is the predicted probability.
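Eq. (12) can be computed directly, as in the sketch below. The clipping constant `eps` is an added assumption that keeps the logarithm finite when a predicted probability is exactly 0 or 1; the example labels and probabilities are made up.

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy of Eq. (12), with probabilities clipped to (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

loss = bce_loss([1, 0, 1], [0.9, 0.2, 0.6])
print(loss)
```

Confident correct predictions contribute little to the loss, while the less certain third prediction (0.6 for a positive label) contributes the most.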

IV Evaluation Metric

We utilize several metrics to evaluate model performance, including Root Mean Squared Error (RMSE) and F1-score. The RMSE is defined as:

\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}}

(13)

where $y_{i}$ and $\hat{y}_{i}$ are the true and predicted labels, respectively. The F1-score, a harmonic mean of precision and recall, is given by:

\text{F1-score}=2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}

(14)
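Both metrics can be computed in a few lines, as sketched below. The text does not state how continuous scores are mapped to classes for the F1-score, so this example assumes binary labels; the inputs are illustrative values only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (13)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def f1_score(y_true, y_pred):
    """Binary F1-score, Eq. (14); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score_rmse = rmse([3.0, 2.5, 4.0], [2.5, 3.0, 4.0])
score_f1 = f1_score([1, 0, 1, 1], [1, 1, 0, 1])
print(score_rmse, score_f1)
```

In the F1 example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.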

V Experimental Results

The models were evaluated on public and private test sets, with performance measured by the metrics defined above. The results are summarized in Table I:

TABLE I: Performance Metrics

Model	RMSE	F1-score
deberta + lr	0.423	0.561
deberta + SVR	0.401	0.672
deberta + GPT2 + lr	0.392	0.741
deberta + GPT2 + ridge	0.354	0.782
deberta + roberta + t5 + gpt + lgbm/ridge	0.321	0.804

VI Conclusion

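This study demonstrates the efficacy of combining multiple pre-trained models with pseudo-labeling and ensemble techniques for multi-label text classification. As shown in Table I, the full ensemble (DeBERTa, RoBERTa, T5, and GPT stacked with LightGBM and Ridge) reaches an RMSE of 0.321 and an F1-score of 0.804, compared with 0.423 and 0.561 for the deberta + lr baseline. These results suggest that stacked ensembles of pre-trained models are a promising direction for automated linguistic assessment.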

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[2] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
[4] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?” in Chinese computational linguistics: 18th China national conference, CCL 2019, Kunming, China, October 18–20, 2019, proceedings 18. Springer, 2019, pp. 194–206.
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
[6] O. Sagi and L. Rokach, “Ensemble learning: A survey,” Wiley interdisciplinary reviews: data mining and knowledge discovery, vol. 8, no. 4, p. e1249, 2018.
[7] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, “Ensemble learning for data stream analysis: A survey,” Information Fusion, vol. 37, pp. 132–156, 2017.
[8] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” Advances in neural information processing systems, vol. 32, 2019.
[9] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” arXiv preprint arXiv:1801.06146, 2018.
[10] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” Science China technological sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[12] A. Conneau and G. Lample, “Cross-lingual language model pretraining,” Advances in neural information processing systems, vol. 32, 2019.
[13] A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” arXiv preprint arXiv:1704.05426, 2017.
[14] C. He, B. Yu, M. Liu, L. Guo, L. Tian, and J. Huang, “Utilizing large language models to illustrate constraints for construction planning,” Buildings, vol. 14, no. 8, p. 2511, 2024.
[15] Y. Sun and J. Ortiz, “An AI-based system utilizing IoT-enabled ambient sensors and LLMs for complex activity tracking,” arXiv preprint arXiv:2407.02606, 2024.
[16] H. Yu, C. Yu, Z. Wang, D. Zou, and H. Qin, “Enhancing healthcare through large language models: A study on medical question answering,” arXiv preprint arXiv:2408.04138, 2024.
[17] L. Zhang, L. Li, D. Wu, S. Chen, and Y. He, “Fairness-aware streaming feature selection with causal graphs,” arXiv preprint arXiv:2408.12665, 2024.
[18] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[19] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.