"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Michael Hardy
Stanford University
[email protected]
Please see 8 for additional information about the author
Abstract

"Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families–encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of human ratings of classroom instruction.

Keywords: NLP · LLM · evaluation · bias · education · teacher development · Generalizability Theory · IRT · hierarchical rater models · reliability · classroom observation · classroom instruction · AI · fairness · racial bias · equity · annotations

1 Introduction

Human-mediated labels always have an unknown amount of error. In machine learning practice, this error is often quantified using inter-rater reliability metrics and correlations. However, this annotation uncertainty is often ignored during standard supervised learning and model evaluation, leading to poorer models (Belz et al., 2023). Thus, imperfect labels are treated as "gold" or "ground truth" (Belz et al., 2020; Hosking et al., 2024). This may be due in part to measures of accuracy being the most preferred methods of assessing and benchmarking model performance (Birhane et al., 2022; Ribeiro et al., 2020; Kiela et al., 2021), but common practice might also arise from not using tools expressive enough to interpret labels with low reliability. To that end, this work demonstrates methods for working with annotations of low or unknown reliability, which are often found in tasks requiring complex expert judgment.

The field of education has many complex tasks that often yield low reliabilities in labels (Jurenka et al., 2024; Kane and Staiger, 2012) which make edtech NLP models and research particularly vulnerable to the effects of inexpert annotations Belz et al. (2020); van der Lee et al. (2019); Zhou et al. (2023). The case study used to illustrate more expressive methods for working with unreliable labels will be from K12 education. Specifically, this study examines a use case where expert annotations are highly unreliable and yet used in high-stakes decisions: automated rating of the quality of classroom teaching. Methods used in this paper answer the call from others to evaluate the psychometric properties of models that perform this task (Casabianca et al., 2013; Liu and Cohen, 2021), and do so by comparing metrics across six dimensions of interest: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness (full results across these metrics against human baselines are in Table 4). Novel contributions of this work to NLP include:

  1. measurements of the generalizability and dependability of labels used with NLP tasks (Section 5.2),

  2. methods for detection of spurious correlations in model outputs via disattenuating low human-model correlations (Section 5.3),

  3. methods for measuring model biases by disentangling human rater-specific contributions to unknown bias for unknown data sets (Section 5.4),

  4. measurement of model fairness and racial bias in the presence of low label reliabilities (Section 5.5), and

  5. application of Design Studies (d-studies) from Generalizability Theory (g-theory) for estimating impacts of human-in-the-loop (HIL) model use on human label quality (Section 5.6).

This work strengthens the argument that only using simple inter-rater reliability metrics to understand the quality of labels may be masking the limitations of the labeling criteria (Hill et al., 2012b; Hosking et al., 2024; Belz et al., 2020). It also illustrates how more robust evaluation techniques can yield information in the presence of noisy labels and seemingly inconclusive results. The analyses presented in this study are motivated by issues of model interpretability, fairness, and usefulness. Brief introductions to various techniques will be provided and illustrated via the study task, with interpretation of limitations and recommendations for future research.

Figure 1: Data Processes and Sources for Studying Teaching and Annotation Quality

1.1 Study Task: Annotating Teaching Quality

The classification task of rating teaching may seem deceptively simple: using a rubric, provide a rating for the quality of instruction of an elementary school math classroom. Such ratings are given to all US K12 public education teachers for both formative educator development feedback and as high-stakes teacher evaluations. Despite their ubiquity, these ratings, even when conducted by experts, are unreliable (Ho and Kane, 2013; Kane et al., 2015; Kane and Staiger, 2012; Glaese et al., 2022; Whitehill and LoCasale-Crouch, 2024), similar to the poor reliability of other K12 education labels (Jurenka et al., 2024; Tack et al., 2023) that have limited the rigor of education research (Slavin, 2002; Klahr, 2013; Jurenka et al., 2024). Studies about ratings of instruction are also extremely expensive to conduct relative to other annotation tasks (Grissom et al., 2013; Liu and Cohen, 2021; Jurenka et al., 2024), with only two major studies across hundreds of public school teachers that use authentic instructional metrics to support development: the MET study (Kane et al., 2013; Kane and Staiger, 2012) and the NCTE Main Study (Kane et al., 2015), the latter of which is the source of data for this study.

From the first study, Ho and Kane estimated that increasing the number of human classroom observers can improve the reliability of ratings assigned. In their major work on the topic, they use methods similar to those in this paper to measure conditions under which the use of additional human raters can increase the reliability of this resource- and time-intensive task (Kane and Staiger, 2012; Whitehurst et al., 2014). Considering the expense, importance, complexity, and lack of reliability in ratings of classroom teaching and also the advances in natural language processing, automated ratings based on classroom discourse offer one potential solution.

Study Research Question:

How can we know when the behaviors of models are good enough to be used in lieu of humans as estimated by Ho and Kane?

Answering whether automated ratings can similarly improve human annotations requires understanding the extent to which models’ added contributions would yield benefits similar to those expected from additional humans. Thus, this study illustrates methods for working with unreliable labels in NLP tasks by investigating and disentangling the variation found in human and model raters from the variation found within the observations and the instrument used for the annotation task. The model raters comprise two families: the "GPT" family of autoregressive in-context learners from Wang and Demszky (2023) (using ChatGPT), with three models whose siblings differ by prompt engineering strategies, and an "Encoder" family built for this study, whose five siblings differ in embeddings and a few adjustments to training hyperparameters. Quality of ratings will be examined between and within families and individual raters.

2 Related Work

2.1 Annotation Quality and Bias

Better understanding human label behaviors is key to training and evaluating models (Webson et al., 2023; Webson and Pavlick, 2022; Gordon et al., 2022). Accuracy, based on "gold" or "ground truth" labels, is the primary and most valued performance metric by which LLMs are evaluated Birhane et al. (2022); Ribeiro et al. (2020); Kiela et al. (2021). For expediency of development, data scientists often choose to assume data labels are reliable, accurate, and end-task aligned for intended real-world use cases, Hosking et al. (2024); Bejar et al. (2006); Messick (1998), even in scenarios where these assumptions could be detrimental (e.g., performing complex high-stakes tasks, reducing discriminatory biases found in data (Field et al., 2021) that are immutably historical by definition of their creation, etc.), which is especially true of autoregressive models, whose labels are Internet text and which contain harmful biases (Hofmann et al., 2024a, b). Assessing the accuracy and reliability of idiosyncratically human annotated "ground truth" can be difficult Eckes and Jin ; Wind and Guo (2019); Wind (2019); Abercrombie et al. (2023); Baan et al. (2024, 2022); Waseem (2016); Kazai et al. (2013); Hosseiny Marani et al. (2022); Tack et al. (2023); Hosking et al. (2024), a challenge that is exacerbated when label uncertainty is underexamined or underreported. Limited transparency around label quality makes it more challenging to measure biases, interpret model findings, assess individual fairness, and establish real-world validity (Hill et al., 2012b; Jurenka et al., 2024).

Powerful and provocative research has begun to address the limitations of accuracy-only evaluations and propose more fair and responsible solutions under assumptions of uncertainty (Hardt et al., 2016; Dwork et al., 2012; Kasy and Abebe, 2021; Song et al., 2020; Zhao and Ermon, 2021; Corbett-Davies et al., 2023; Pleiss et al., 2017; Zemel et al., 2013), including techniques for addressing when labels lead to undesirable model behaviors Ding et al. (2022); Hebert-Johnson et al. (2018); Qi et al. (2023). This paper offers several ways to quantify these issues and improve interpretability and explainability Adebayo et al. (2020); Lundberg and Lee (2017); Rudin (2019); Kim et al. (2018).

2.2 Teacher Development and Evaluation

School leaders working with teachers to improve the quality of instruction typically evaluate the teacher’s proficiency in a range of competencies (typically measured during in-class observation and evaluation on a teaching rubric; Aguilar (2013); Bambrick-Santoyo (2016, 2018)), then determine which competencies are most important to improve first (i.e., which change will have the biggest impact on student learning), and then provide supportive feedback and coaching. This paper focuses on the first step of evaluating teacher proficiency, which is often time-consuming and produces ratings (labels) that are unreliable Kane and Staiger (2012); Blazar (2018); Kane et al. (2013); Casabianca et al. (2013). Without accurate classifications, it is challenging for practitioners to prioritize instructional needs and aligned practices from among the many elements of good teaching (Saphier et al., 2008; Darling-Hammond, 2014; Hammond, 2015; Lemov and Atkins, 2015; Lemov, 2021; Liljedahl et al., 2021; Darling-Hammond et al., 2020; Schwartz et al., 2016) and for researchers to empirically quantify the impact of good teaching practices Pianta and Hamre (2009); Charalambous and Delaney (2019); Blazar and Pollard (2022); Jurenka et al. (2024).

Thus, this work provides a bridge to research seeking to improve teaching quality by providing feedback to teachers on various instructional techniques (Samei et al., 2014; Donnelly et al., 2017; Kelly et al., 2018; Demszky et al., 2021; Suresh et al., 2022; Jacobs et al., 2022; Alic et al., 2022; Demszky and Liu, 2023; Demszky et al., 2024, 2023). These feedback studies identify linguistic features correlated with an aspect of good teaching, but may optimistically overgeneralize the usefulness, efficacy, and universality of identifiable features, providing specific prescriptions without diagnosis. Matching these models with the specific needs of teachers will help provide a more individualized approach to teacher development, one based on understanding instructional needs and then providing corresponding supports.

Only three recent studies have sought to use LLMs to provide ratings of classroom instruction (via classroom transcripts) using authentic rating rubrics. Whitehill and LoCasale-Crouch (2024) use a mix of zero-shot and bag-of-words model configurations to provide scores to instructional domains for Pre-Kindergarten classrooms using a private dataset, commenting on their highest Pearson $r$ correlation statistic of 72 experiments ($r = 0.48$) that it "approaches human inter-rater reliability". Wang and Demszky (2023) and Xu et al. (2024) both use the same publicly available datasets as the present study, and the approach of the former will be discussed further. Xu et al. use a by-item "best of" modeling approach which included experiments with BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2020), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2020), Llama 2 (Touvron et al., 2023), and ChatGPT, using models in two stages, where the first-stage LLM provides the best text to the second stage, which generates the rating. Unfortunately, the LLM-facilitated preprocessing of text and the by-item model training and selection processes limit the generalizability and transferability of their methods. While Xu et al. did not publicly release model ratings or the combinations of ensembles used, they did report Spearman correlation values for each of the best of several item-specific model constructions. In Figure 2, the results from their reported held-out test set are displayed alongside those from the present study for a comprehensive comparison across all studies reporting performance of automated ratings which use the MQI rubric or which use publicly accessible data.

3 Data

The data used in this study and in Wang and Demszky (2023) are from the National Center for Teacher Effectiveness (NCTE) Main Study (Kane et al., 2015), which contains three years of data collection and observations of math instruction in approximately fifty schools and three-hundred (4th and 5th grade) mathematics classrooms across four school districts in the United States, including expert human ratings of individual video-captured classroom lessons across two observation instruments Bacher-Hicks et al. (2017, 2019): the CLASS framework (12 items) (Pianta et al., 2008) for general instructional practice and the content-specific Mathematical Quality of Instruction (MQI; 13 items) (Hill et al., 2008), together yielding over 400,000 distinct human rating labels assigned, the distributions of which are in Figure 6. Each instrument item is intended to measure a different aspect of teaching quality.

Like all human-mediated labels,[1] an individual classroom observation rating requires at a minimum three facets: (1) a task with rating criteria (Section 3.1), (2) raters/labelers (Section 3.2), and (3) observations to be classified (sections of transcripts of classroom discourse, Section 3.3). As tasks increase in complexity, these three facets contribute more error to estimates. This dataset has the additional real-world challenges of having very long and noisy transcripts and having large imbalances (Figure 4 panel (a), Figure 6) in human labels that have hindered previous research (Xu et al., 2024; Wang and Demszky, 2023), but which provide extra opportunity to demonstrate the importance of robust methods of evaluation.

[1] Label(er), rate(r), annotat(ion/or), and score(r) will be used interchangeably for these classification tasks, as terminology varies across disciplines.

3.1 Rating Criteria: MQI Rubric

Just as all raters contribute uncertainty to a system, so too do the measurement instruments. Ambiguity uncertainty is introduced when an instrument, instruction, or criterion for a task has language that could lead two equally expert raters to different results, ceteris paribus. The 13 MQI items within the dataset have at least two raters per classroom observation. While both humans and Encoders evaluated all items, this paper will focus on the 4 of the 13 MQI items evaluated in Wang and Demszky (2023) to support comparability across humans and models.[2] These four ternary items are teacher explanations (EXPL), remediation of student errors (REMED), student questioning and reasoning (SMQR), and imprecision in mathematical language (LANGIMP).[3] Analyses for all other items are in the appendices. Prior studies have explored the reliability of MQI instrument ratings generally (Kane and Staiger, 2012; Mantzicopoulos et al., 2018; Hill et al., 2012b; Kane et al., 2015; Ji, 2023); this study confirms previous findings via reproduced reliability metrics in Section 3.2, which correspond to those in the NCTE Study (Appendix Section 2).

[2] Xu et al. provided results for 11 of the 13 MQI items. No explanation is provided for the exclusion of MGEN and USEPROD.

[3] LANGIMP is reverse-coded so higher scores are better and has interesting self-referentiality vis-à-vis instrument uncertainty that is worth noting, but out of scope for the current study. See Appendix C.2 for more on this and other negatively worded items.

3.2 Human Expert Raters

Human rater information for both the MQI and CLASS instruments can be found in the Appendix of the DS0 Study-Level Files from the NCTE Main study. MQI raters in particular were recruited from a separate pool of applicants based on their background in mathematics and through contacting colleagues in mathematics departments (Hill et al., 2012a; Blazar et al., 2017) and then passed certification exams to score the MQI, and attended biweekly calibration meetings to ensure standardization of scoring procedures.

3.3 Classroom Observations

63 human raters watched videos and provided ratings at regular intervals across all items in the MQI. Transcripts of these same videos (Demszky and Hill, 2022) are used by LLMs for the same task, where the class discourse is equipartitioned across utterances (GPT family models) or words (Encoder family models) by the total number of classroom segments to align the text to the human labels in the absence of timestamps. Data from the NCTE Main study (Kane et al., 2015)[4] and the associated transcripts (Demszky and Hill, 2023)[5] are available online.

[4] https://www.icpsr.umich.edu/web/ICPSR/studies/36095/datadocumentation

[5] https://github.com/ddemszky/classroom-transcript-analysis
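The equipartitioning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's preprocessing code: the function name `equipartition` and the choice to give any leftover units to the earliest segments are assumptions made for the example.

```python
def equipartition(units, n_segments):
    """Split a sequence of units (words or utterances) into n_segments
    contiguous chunks whose sizes differ by at most one.

    Assumption for this sketch: leftover units go to the earliest chunks.
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    base, extra = divmod(len(units), n_segments)
    chunks, start = [], 0
    for k in range(n_segments):
        size = base + (1 if k < extra else 0)
        chunks.append(units[start:start + size])
        start += size
    return chunks


# Hypothetical example: seven words aligned to three rated segments.
words = "a b c d e f g".split()
segments = equipartition(words, 3)
```

Each chunk can then be paired with the human label for the segment in the same position, which is the alignment the paragraph describes for transcripts without timestamps.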

4 Model Families and Model Rater Data

GPT Models

The GPT model family from Wang and Demszky (2023)[6] has 7,660 ratings for 223 different teachers. The family consists of three models differing in prompt engineering methods (herein called N, NR, and ND), and a brief summary of those differences is in Table 8. GPT models were evaluated on curated selections of classroom text with the least transcriptorial noise (i.e., minimizing instances of [inaudible]), which were edited to indicate whether the speakers were teachers or students.

[6] https://github.com/rosewang2008/zero-shot-teacher-feedback/

Encoder Models

Encoder family models are custom transformer encoders trained on the NCTE classroom transcripts. The five models (un1, un2, un3, gte, and e5) use fixed-parameter pretrained sentence embeddings, differing in these and in training hyperparameters, thereby exploiting LLM sensitivities to pretraining regimes (D’Amour et al., 2020; McCoy et al., 2023). A summary of differences is in Table 7 and more training details can be found in Appendix D. In contrast to the model experiments of Xu et al., who used different combinations of models by item, each Encoder model produces labels for all 13 MQI (and 12 CLASS) items. In contrast to the GPT models, the only text preprocessing used with the Encoders replaced all transcription notes with [inaudible] to mimic the uncertainty in live audio transcription, and no edits to indicate speakership were included. For the Encoder models, all model outputs[7] in this study were produced on a lesson-level-stratified held-out test set (see Figure 8) that was not used during model development. Encoder models were trained on a single GPU in Google Colab, with training detailed in Appendix D.3.

[7] https://github.com/hardy-education/LLM-Psychometrics

5 Evaluation Methods

Raters     ETCA  EXPL  LANGIMP  LCP   LINK  MAJERR  MGEN  MLANG  MMETH  REMED  SMQR  STEXPL  USEPROD
Humans     0.3   0.27  0.28     0.21  0.41  0.28    0.19  0.32   0.47   0.32   0.29  0.39    0.31
Encoders   0.51  0.46  0.41     0.39  0.57  0.35    0.33  0.52   0.52   0.46   0.39  0.47    0.46
GPTs       -     0.04  0.04     -     -     -       -     -      -      0.04   0.12  -       -
Xu et al.  0.3   0.31  0.19     0.13  0.41  0.13    -     0.4    0.36   0.27   0.26  0.37    -
Figure 2: Spearman correlation coefficients and confidence intervals by MQI Item for all rater families and studies. Human (Kane et al., 2015), Encoder (current study, Section 4), and GPT (Wang and Demszky, 2023) family correlations are between each rater and one randomly sampled human rater for each observation, following the processes used in the original human study, repeated 1,000 times for bootstrapped confidence intervals. Xu et al. coefficients are reported from Tables 5 and 9 of that paper, where each number represents the best of several ensemble models fit for each individual item. Bold in the table indicates highest performing label family.
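The resampling procedure described in the caption can be sketched as follows. This is a minimal illustration in Python rather than the study's code: the function names, the data layout (one list of human labels per observation), and the use of a simple percentile interval are all assumptions made for the example.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one rater never varies
    return cov / (var_x * var_y) ** 0.5


def resampled_spearman(rater_scores, human_scores, n_reps=1000, seed=0):
    """rater_scores[k]: the evaluated rater's label for observation k.
    human_scores[k]: list of human labels for the same observation.
    Each repetition draws one human label per observation at random."""
    rng = random.Random(seed)
    rhos = []
    for _ in range(n_reps):
        sampled = [rng.choice(labels) for labels in human_scores]
        rho = spearman(rater_scores, sampled)
        if rho == rho:  # drop undefined (NaN) repetitions
            rhos.append(rho)
    if not rhos:
        raise ValueError("correlation undefined in every repetition")
    rhos.sort()
    lower = rhos[int(0.025 * (len(rhos) - 1))]
    upper = rhos[int(0.975 * (len(rhos) - 1))]
    return mean(rhos), (lower, upper)
```

Ranking with averaged ties matters here because the labels are ordinal with few categories, so most observations share a rank with many others.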

Typical reliability metrics (see Section 5.1) provide a backdrop of descriptives that can flag issues of low quality labels. Measures of statistical dependability can be used for generalizing label conclusions and identifying spurious correlations (see Section 5.3), a part of improving accuracy. Methods for disentangling human and model label biases (see Section 5.4) are first demonstrated and then extended to estimate fairness across racial lines in Section 5.5. Usefulness, measured by the amount of rating reliability improvement a model can provide to a human rater in human-in-the-loop contexts, is addressed in Section 5.6, along with associated cost savings in human time (for Encoder models).

5.1 Concordance: Agreement and Reliability Metrics

RQ 1:

How do automated models perform relative to humans in the presence of low label reliability?

RQ 1 Case Study Reframing: How well do automated models perform relative to humans when evaluating instruction?

5.1.1 Baseline Human Metrics

RQ1: Concordance Metrics
Correlation: $r$, $\rho$, $\tau$
Inter-rater Agreement: % Agree, % Agree ± 1, Cohen’s $\kappa$, QWK
Intuition (QWK): QWK is the extent to which raters agree on ratings beyond what chance would produce. Bigger differences in ratings show less agreement, scaled quadratically.

Full reproductions[8] of all reliability metrics and calculation processes, exactly as described in the NCTE Main Study Appendix Section 2 (Kane et al., 2015), were conducted. Following the same procedures, replicated calculations were extended to the model families, replacing a human rater score with a specified or random model for evaluations of individual models and model families, respectively. Intra-class correlations (ICCs) are reported with the calculation methods in Appendix F. Reproduced human results and model results, including additional metrics in this section, are fully reported in Appendix F.1, and all item results can be found in the online supplement.

[8] Small differences in the reported values here compared to the original study arise from the random human rater selection required in the procedure, which was done at the segment level. All families and model evaluations used the same random sample of human raters for comparison.

5.1.2 Commonly Used Metrics

The results also include additional correlation and reliability metrics: Quadratic Weighted Kappa (QWK), typically used in ordinal classification tasks to penalize distance quadratically (squared error) while accounting for categorical agreement by chance (e.g., Shermis (2014); Hardy (2021); Wang and Demszky (2023)); Pearson correlation $r$ (e.g., Whitehill and LoCasale-Crouch (2024)); Spearman correlation $\rho$ (e.g., Wang and Demszky (2023); Xu et al. (2024)); and Kendall correlation $\tau$ (e.g., Liu et al. (2023b)). Figure 2 shows Spearman correlations ($\rho$) and confidence intervals for all model families and for models from Xu et al. (2024). The table in Figure 2 contains the $\rho$ estimates.
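To make the quadratic penalty concrete, the following is a minimal Python sketch of QWK for two raters. It uses the standard definition (one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement); the function name and the three-category example labels are assumptions for illustration and are not taken from the study's code.

```python
def quadratic_weighted_kappa(a, b, categories):
    """QWK between two raters' labels over an ordered list of categories."""
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed confusion matrix between rater a (rows) and rater b (columns).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2   # quadratic distance penalty
            num += weight * observed[i][j]          # observed disagreement
            den += weight * row[i] * col[j] / n     # disagreement expected by chance
    if den == 0:
        return float("nan")  # both raters used one identical category throughout
    return 1.0 - num / den


# Hypothetical ternary labels, as on a three-point rubric item.
rater_1 = [1, 2, 3, 1, 2, 3]
rater_2 = [1, 2, 3, 1, 3, 2]
qwk = quadratic_weighted_kappa(rater_1, rater_2, categories=[1, 2, 3])
```

A disagreement of two scale points is weighted four times as heavily as a disagreement of one point, which is why QWK is usually preferred over unweighted Cohen’s $\kappa$ for ordinal rubric scores.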

5.1.3 Results

Using nearly any standardized combination of metrics across all items from Section 5.1, Encoder models perform better than the single highest performing expert human rater. The human labels assigned for the four focus MQI items have very low reliabilities, despite the significant training and calibration for human raters described in Section 3.2. Overall, the human labels are highly unreliable, but a researcher comparing model performance to human performance could display the results as they are in Table 1. For metrics of agreement and reliability, each Encoder model outperformed humans on average, whilst each GPT model underperformed humans on every metric and every item. Table 1 has a summary of the full panel of lesson segment-level inter-rater reliability metrics for each MQI item. Specific metrics for the four focus MQI items in this study are in Panel (b) in Figure 4, and the full individual model-item comparisons for all MQI items and metrics in this section are in Table LABEL:tab:tab:full. Additionally, the detailed full results for all models and metrics, MQI, and CLASS rubrics can be found in the supplementary materials online.

Using only these metrics and without further testing, one might assume that the encoder models are therefore ready to help with the task of automated annotations of teaching quality or that GPT models show improvement to ICC measures and could be helpful.

Implications: Basic statistics in the presence of unreliable labels can mislead interpretations of model performance. Researchers should be wary of studies reporting few metrics in the presence of low reliabilities.

Metric      Encoders  un1   un2   un3   gte   e5    GPTs  N     NR    ND
%Agr        0.54      0.69  0.77  0.69  0.39  0.39  0.00  0.00  0.00  0.00
C’s $\kappa$  0.69      0.85  0.77  0.62  0.62  0.62  0.00  0.00  0.00  0.00
QWK         1.00      1.00  1.00  1.00  0.92  0.92  0.00  0.00  0.00  0.00
$r$         1.00      1.00  1.00  1.00  1.00  1.00  0.00  0.00  0.00  0.00
$\rho$      1.00      1.00  1.00  1.00  0.77  0.77  0.00  0.00  0.00  0.00
$\tau$      1.00      1.00  1.00  1.00  0.77  0.77  0.00  0.00  0.00  0.00
Table 1: Concordance: Performance above Human Reliability and Agreement Metrics. Proportion of MQI items where the model or model family listed had better results than human baselines. Bold indicates where performance was better on more than half of items rated. Inter-rater reliability metrics introduced in Section 5.1. C’s $\kappa$: Cohen’s $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; $r$: Pearson’s correlation; $\rho$: Spearman’s rank correlation; $\tau$: Kendall’s concordance correlation. Full data can be found in the supplementary material online.

5.2 Confidence: Generalizable Reliability

RQ2: Confidence Metrics
Generalizability: $\mathbf{E}\rho^2$
Dependability: $\Phi$
Intuition ($\mathbf{E}\rho^2$ and $\Phi$): By accounting for the different facets of variation, we can estimate how much of the relative ($\mathbf{E}\rho^2$) and absolute ($\Phi$) label variation associated with the teacher is attributable to the teacher only.

RQ 2:

How generalizable are findings from unreliable labels?

RQ 2 Case Study Reframing: To what extent would the ratings of a teacher’s instructional quality persist across lessons or contexts?

5.2.1 Generalizability and Dependability

Generalizability Study (g-study) (Brennan, 2001a, 2013, b; Hill et al., 2012b) designs utilize random effect estimates across possible configurations of different sources of variance to quantify how generalizable labels are. This is done by estimating the extent to which given labels would persist if sources of variation changed (e.g., same teacher, different day; same lesson, different rater; human rater vs model rater; etc.). $\mathbf{E}\rho^2$ is a measure of the relative generalizability of a rating (i.e., is rating order preserved), and $\Phi$, accounting for absolute error, is a measure of label dependability: how likely specific ratings would be numerically the same with different sources of variation. These two reliability-like estimates can help quantify how "golden" labels are.
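As a rough illustration of the difference between the two coefficients, the Python sketch below computes both from a set of variance components, following the ratio structure that Equations 2 and 3 formalize. The dictionary keys mirror the subscripts of the random effects, but the function name and the numeric values are hypothetical and chosen only to show the arithmetic; they are not estimates from this study.

```python
def g_coefficients(vc):
    """Compute (E rho^2, Phi) from variance components for one item.

    vc keys: 'i' (teacher), 'o:i' (observation within teacher),
    's:o:i' (segment within observation within teacher),
    'ir' (teacher-by-rater), 'r' (rater), 's:o:ir' (residual).
    """
    # Relative error: everything that can reorder teachers.
    relative_error = vc["o:i"] + vc["s:o:i"] + vc["ir"] + vc["s:o:ir"]
    # Absolute error additionally counts rater severity (the rater main effect).
    absolute_error = relative_error + vc["r"]
    e_rho2 = vc["i"] / (vc["i"] + relative_error)
    phi = vc["i"] / (vc["i"] + absolute_error)
    return e_rho2, phi


# Hypothetical variance components, for illustration only.
components = {"i": 0.2, "o:i": 0.1, "s:o:i": 0.3, "ir": 0.1, "r": 0.2, "s:o:ir": 0.3}
e_rho2, phi = g_coefficients(components)
```

Because the rater main effect appears only in the denominator of $\Phi$, $\Phi$ can never exceed $\mathbf{E}\rho^2$; the gap between them indicates how much systematic rater severity or leniency affects absolute scores without changing rank order.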

The multifaceted g-study design used to estimate how much variation ($\nu$) in individual teachers’ instructional quality, $i$, contributed to a rating label, $X$, annotated for a section of a lesson, $s$, during an observation, $o$, on rubric item $j$ by rater $r$ is known as an Item-by-Rater-by-Segment-within-Observation-within-Individual Teacher design: $J \times R \times (S:O:I)$. Overall estimates across all MQI items for a given rater family, $\mathbb{F}$, are in Table 2. For item-level reliabilities, we simplify the expression by holding the item fixed, resulting in an $R \times (S:O:I)$ design. Using nested random effects notation, the estimation model is:

$$X_{s:o:ir}^{(j)} = \mu + \nu_{i} + \nu_{o:i} + \nu_{s:o:i} + \nu_{ir} + \nu_{r} + \nu_{s:o:ir}, \quad \forall j \in \mathbf{J} \tag{1}$$

where $j$ indicates the item index.[9]

[9] For the estimates in Fig. 4 (c), for dependability metrics of Section 5.3, and for comparability with human baselines (Hill et al., 2012b; Kane et al., 2015; Ho and Kane, 2013; Kane and Staiger, 2012), a simplified model, a by-item $R \times (O:I)$ design, was conducted for the human expert rater family, with results in Appendix H.1. The simplified model is $X_{o:ir}^{(j)} = \mu + \nu_{i} + \nu_{o:i} + \nu_{ir} + \nu_{r} + \nu_{o:ir}$. The full model structures of Eq. 1, 2 and 3 are used for Section 5.6. Code for the model specification is in Appendix H.3.

Then, $\mathbf{E}\rho^2$ (Equation 2) and $\Phi$ (Equation 3) are easily estimated from the random effects for raters in rater family $\mathbb{F}$:

$${\mathbf{E}\rho^{2}_{\mathbb{F}}}^{(j)}=\frac{\nu_{ij}}{\nu_{ij}+\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{s:o:irj}}, \tag{2}$$

$$\Phi_{\mathbb{F}}^{(j)}=\frac{\nu_{ij}}{\nu_{ij}+\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{rj}+\nu_{s:o:irj}}, \tag{3}$$

$\forall r\in\mathbb{F}$, where the individual item-rating-segment variation, $\nu_{s:o:irj}$, is confounded with error variation. These results are found in Table 2. A figure comparing the item-level $\mathbf{E}\hat{\rho}^{2}_{j}$ values to item-level reliability estimates related to Guttman's $\lambda_{6}$, $\rho^{\lambda_{6}}_{jj'}$, from Classical Test Theory (Zijlmans et al., 2018a, b), can be found in Appendix H.2. Additionally, an illustration of the sources of variance, with descriptions, can be found in Appendix H; it is color-coded to support interpretation of the sources of variance alongside the table of results.
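To make the difference between Equations 2 and 3 concrete, the following is a minimal sketch that computes both coefficients from a set of already-estimated variance components for one item. The class and field names are hypothetical, and the numbers in the example are made up for illustration; they are not estimates from this study. Returning NaN when every component is zero is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Variance components for one item j (field names are illustrative)."""
    teacher: float           # nu_{ij}: teacher, the object of measurement
    lesson: float            # nu_{o:ij}: lesson nested in teacher
    segment: float           # nu_{s:o:ij}: segment nested in lesson
    teacher_by_rater: float  # nu_{irj}: teacher-by-rater interaction
    rater: float             # nu_{rj}: rater main effect
    residual: float          # nu_{s:o:irj}: confounded with error


def generalizability(vc: VarianceComponents) -> float:
    """E rho^2 as in Eq. 2: the rater main effect is not in the denominator."""
    denom = (vc.teacher + vc.lesson + vc.segment
             + vc.teacher_by_rater + vc.residual)
    return vc.teacher / denom if denom > 0 else float("nan")


def dependability(vc: VarianceComponents) -> float:
    """Phi as in Eq. 3: Eq. 2 with the rater main effect added to the denominator."""
    denom = (vc.teacher + vc.lesson + vc.segment
             + vc.teacher_by_rater + vc.rater + vc.residual)
    return vc.teacher / denom if denom > 0 else float("nan")


# Made-up components for illustration only.
example = VarianceComponents(teacher=0.10, lesson=0.05, segment=0.15,
                             teacher_by_rater=0.05, rater=0.15, residual=0.50)
print(generalizability(example), dependability(example))
```

Because the only difference between the two denominators is the non-negative rater term $\nu_{rj}$, $\Phi$ can never exceed $\mathbf{E}\rho^{2}$ for the same components.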

5.2.2 Results

Humans, on average, produce labels that are both more reliable and more generalizable. The full results for human rater labels, decomposed into variance components, can be found in H.3,[10] and estimates for $\mathbf{E}\rho^{2}$ and $\Phi$ can also be found in panel (c) of Figure 4. The Encoder models outperform humans on nearly every item in terms of inter-rater reliability metrics (Table 1), but not in generalizable reliability metrics, as seen in the panel (c) tables of Figure 4. Importantly, the large difference between $\mathbf{E}\hat{\rho}^{2}$ and $\hat{\Phi}$ for Humans and Encoders is due to properties of individual items, which accounted for over 75% of the variation in those families. GPT models, on the other hand, did not change ratings very much across items, consistent with literature suggesting that these models do not understand such prompts (Liu et al., 2023a; Webson and Pavlick, 2022; Heo et al., 2024).

[10] Appendix 2.c of Kane et al. (2015) provided a g-study, but, surprisingly, not using the data from the study.

Table 2 shows that the Encoder models still perform better than humans on the majority of items, but the advantage is no longer as clear. Notably, as mentioned in Section 4, the encoder models did not receive any annotations beyond the transcript, including speaker. This means that the models would struggle to distinguish teacher explanations (EXPL) from student explanations (STEXPL). The shift in encoder family performance on these items from superhuman to zero reliability supports the argument that these metrics provide valuable insight: the relationships found for some of the variables could be explained by variance unrelated to the label construct. Implications: Measures of generalizability and dependability derived from structured variance decomposition can meaningfully quantify label quality.

| ITEM | $\mathbf{E}\hat{\rho}^{2}$ Human | $\mathbf{E}\hat{\rho}^{2}$ Encoders | $\mathbf{E}\hat{\rho}^{2}$ GPTs | $\hat{\Phi}$ Human | $\hat{\Phi}$ Encoders | $\hat{\Phi}$ GPTs |
|---|---|---|---|---|---|---|
| ETCA | 0.17 | 0.20 | | 0.15 | 0.19 | |
| EXPL | 0.15 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 |
| LANGIMP | 0.09 | 0.15 | 0.08 | 0.08 | 0.14 | 0.08 |
| LCP | 0.11 | 0.27 | | 0.09 | 0.26 | |
| LINK | 0.13 | 0.19 | | 0.12 | 0.19 | |
| MAJERR | 0.08 | 0.00 | | 0.07 | 0.00 | |
| MGEN | 0.03 | 0.08 | | 0.02 | 0.08 | |
| MLANG | 0.07 | 0.18 | | 0.06 | 0.17 | |
| MMETH | 0.13 | 0.37 | | 0.13 | 0.36 | |
| REMED | 0.13 | 0.10 | 0.05 | 0.11 | 0.09 | 0.04 |
| SMQR | 0.14 | 0.09 | 0.00 | 0.13 | 0.09 | 0.00 |
| STEXPL | 0.25 | 0.00 | | 0.23 | 0.00 | |
| USEPROD | 0.19 | 0.25 | | 0.17 | 0.25 | |
| All Items | 0.114 | 0.106 | 0.007 | 0.010 | 0.014 | 0.004 |
Table 2: Generalizability and Dependability metrics by model families for each MQI Item. Bold represents the best rater family for each of $\mathbf{E}\rho^{2}$ and $\Phi$, respectively. Underlined items are focus MQI items, because they were evaluated by Wang and Demszky (2023). For the overall "All Items" calculation, a $J\times R\times(O:I)$ model was used for comparability with other similar research.

5.3 Validity: Convergent and Spurious Correlations

RQ 3:

To what extent can accuracy and validity be estimated with unreliable labels? RQ 3 Case Study Reframing: To what extent do models and humans rate the same underlying construct similarly?

5.3.1 Disattenuating High Noise Correlations

Dependability and generalizability do not guarantee accuracy, but even at these very low levels, they can be used in indirect tests of convergent validity to see whether correlations between humans and models are low because of measurement error, such as poor rubric item construction, or because the two sets of ratings are truly uncorrelated. If an individual teacher's latent instructional ability $\theta_{i}$ is about the same from lesson to lesson with the same students, we can correlate $\hat{\theta}_{i}$ for human ($\mathbb{h}$) and model ($\mathbb{m}$) family ratings of different lessons from the same teacher, and correct for measurement error by disattenuating the correlations using each rater family $\mathbb{F}$'s label generalizability, ${\mathbf{E}\hat{\rho}^{2}_{\mathbb{F}}}^{(j)}$, for a given item $j$. The disattenuated correlation, $\varrho_{\mathbb{hm}}^{(j)}$, between humans and a family of models for item $j$ can be estimated as:

$$\varrho_{\mathbb{hm}}^{(j)}=\frac{\operatorname{Corr}\left[\tilde{\mathcal{X}}_{\mathbb{h}}(i,\mathfrak{L},j,r_{\mathbb{h}}),\ \tilde{\mathcal{X}}_{\mathbb{m}}(i,\neg\mathfrak{L},j,r_{\mathbb{m}})\right]}{\sqrt{{\mathbf{E}\hat{\rho}^{2}_{\mathbb{h}}}^{(j)}\,{\mathbf{E}\hat{\rho}^{2}_{\mathbb{m}}}^{(j)}}} \tag{4}$$
RQ3: Validity Metric. Disattenuated Convergent Correlation: $\varrho_{\mathbb{hm}}^{(j)}$. Intuition: Teaching abilities on item $j$ do not change dramatically each lesson, so if human $r_{\mathbb{h}}$ and model $r_{\mathbb{m}}$ observers rate teacher $i$ similarly on different lessons ($\mathfrak{L}$ and $\neg\mathfrak{L}$), they are responding to similar observable indicators of the teacher.

where $\tilde{\mathcal{X}}_{\mathbb{F}}$ is a score retrieval function for individual teacher $i$ on item $j$ by a random member $r$ of rater family $\mathbb{F}$ in relation to some observed lesson $\mathfrak{L}$, with family label generalizability ${\mathbf{E}\hat{\rho}^{2}_{\mathbb{F}}}^{(j)}$ defined in Equation 2. In other words, the numerator (represented in red in Figure 3) is the correlation in scores whenever two different lessons from the same teacher were scored by raters from different families (human and model). The denominator then adjusts for the reliabilities of raters from each family, to account for the known tendency of low reliability to diminish observed correlations.
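The correction in Equation 4 can be sketched in a few lines. This is an illustrative implementation only, assuming the paired scores (human rating of one lesson, model rating of a different lesson by the same teacher) have already been assembled into two equal-length lists; the function names are hypothetical and the pairing step is not shown. Returning NaN when a reliability is zero or negative is a choice made for this sketch, since the ratio is undefined in that case.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # no variation, so the correlation is undefined
    return sxy / math.sqrt(sxx * syy)


def disattenuated_correlation(human_scores: Sequence[float],
                              model_scores: Sequence[float],
                              rel_human: float,
                              rel_model: float) -> float:
    """Observed correlation divided by the square root of the product of the
    two families' generalizability coefficients, following Eq. 4."""
    if rel_human <= 0 or rel_model <= 0:
        return float("nan")  # incalculable when either reliability is zero
    return pearson(human_scores, model_scores) / math.sqrt(rel_human * rel_model)
```

Note that the raw ratio is not bounded by 1 in magnitude: with low reliabilities in the denominator, even a moderate observed correlation can produce a value above 1, which is one reason the result should be read as indirect evidence rather than as an ordinary correlation.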

Figure 4 panel (b) has the disattenuated correlations and their respective 95% confidence intervals, calculated at $\alpha=0.05$ using the empirical confidence scaling methods defined by Charles (2005). These methods produce more conservative confidence intervals on this data than traditional Fisher normalization (Kromrey et al., 2008), which is preferable because the low levels of reliability in Section 5.2 can lead to overcorrection. Reported disattenuated correlations of 1.0 do not mean perfect correlation: they generally mean that measurement error is not randomly distributed.
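For reference, the traditional Fisher interval that the empirical method is contrasted with can be sketched as follows. This is only the textbook Fisher z-transformation interval for an ordinary (attenuated) correlation; it is not the Charles (2005) procedure, which is not reproduced here. The function name and arguments are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def fisher_ci(r: float, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Confidence interval for a Pearson correlation r from n pairs,
    using the Fisher z-transformation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                                   # Fisher z of r
    se = 1.0 / math.sqrt(n - 3)                         # standard error of z
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # two-sided critical value
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

The interval is computed on the z scale and mapped back with `tanh`, so it always stays inside (-1, 1), which is part of why it cannot be applied unchanged to disattenuated values that may exceed 1.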

Figure 3: Correlations (fainter color hues, numerator of Eq. 4), disattenuated correlations (darker color hues, Eq. 4), and their respective 95% confidence intervals between human raters and model raters by MQI item. Item-level rater-label generalizability, $\mathbf{E}\rho^{2}$, is shown for both human and model raters, along with the attenuated and disattenuated correlations between humans and models, $\varrho_{hm}$. The attenuated correlation confidence intervals were calculated with the standard Fisher Transformation and $\alpha=0.05$. Disattenuated correlation confidence intervals used the empirical method recommended in Charles (2005).

Disattenuated correlations are not directly comparable[11] to the measures of correlation in Section 5.1 (Muchinsky, 1996). However, failure of disattenuation to identify viable human-model correlations for items that previously showed such correlated relationships in Section 5.1 suggests the prior correlations may be spurious. Disattenuation does not change the low reliability across items nor the quality of the measurement, but it offers indirect evidence for discerning model predictive validity by quantifying how changes in the underlying construct result in changes in the same direction for both human and model.

[11] For example, reported disattenuated correlations of 1.0 do not mean perfect correlation: they generally mean that measurement error is not randomly distributed.

Results for the disattenuated correlations described in Section 5.3 and their confidence intervals are in Figure 3. Most items show correlated relationships after disattenuation, most with confidence intervals above 0.5, suggesting that the encoder models and the humans are likely identifying similar sources of underlying teacher variation for those items.

5.3.2 Results

Disattenuation analyses and Section 5.2 suggest that the Encoder model family's SOTA-level correlations on the EXPL and STEXPL items may have been spurious (likely identifying speech patterns associated with higher teacher performance, and not necessarily specific to explanations), a direct result of the low generalizabilities found in Section 5.2. Additionally, we see very large confidence intervals for the encoders on items where score distributions are most imbalanced (MGEN, MAJERR), suggesting that the correlations found are not justified in the presence of low reliabilities. Lower disattenuated correlations on some items (e.g., LCP, MMETH) suggest that models and humans interpreted observational features differently. Implications: when measurement error is high, disattenuating model and human correlations can help identify whether items with high or similar correlations are spurious or are responding to similar features.

This method only minimally provides evidence for investigating accuracy and validity, but, for the Encoder models, evidence can be built upon by comparing how the more continuous ratings of the models and humans change and correlate over the course of a given observation. While not explicitly part of this study, an example of how Encoders’ and humans’ ratings change from the start to the end of a class for a randomly chosen lesson observation is illustrated in Figure 15. Investigating the validity of a construct would require more robust qualitative review of the content.


Figure 4: Section 5 Study Method Results for four focus MQI Items across Human (Kane et al., 2015), Encoder (this study), and GPT (Wang and Demszky, 2023) rater families. (a) Distributions. Score distributions by rater type. (b) Reliabilities. Inter-rater reliability metrics introduced in Section 5.1. C's $\kappa$: Cohen's $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; %Agr±1: percent agreement within 1 category; ICC: intraclass correlation; AICC: adjusted intraclass correlation; $r$: Pearson's correlation; $\rho$: Spearman's rank correlation. Bold format is the highest value for a given metric. (c) Generalizability Measures and Spurious Correlation Detection. Section 5.2: generalizability coefficient $\mathbf{E}\rho^{2}$ and dependability measure $\Phi$. Section 5.3: $\varrho_{\mathbb{hm}}$ is the disattenuated correlation. Red font indicates the correlation was spurious or incalculable due to low reliabilities. (d) Disentangled Rater Bias. Section 5.4: standardized rater bias $\phi_{jr}$ (x axis) and rater variability/consistency $\psi_{jr}$ (y axis) from Equation 7, $\eta_{j}$-centered. Each point represents an individual human or model rater. More severe raters are to the left, more lenient to the right. (e) Fairness across Racial Lines. Section 5.5: Standardized difference in rater bias $\phi_{r}$ (x axis) and rater combined variability/consistency $\psi_{r}$ (y axis) across Black teachers and White teachers. Leftward values are more severe towards Black teachers, rightward are more lenient. Any horizontal bar present with a marker represents the 95% CI for bias. (f) Estimated Improvements to Reliability. Section 5.6: Expected changes to rating reliability are estimated improvements to the quality (via reliability) of classroom ratings for various contexts. For the single individual human baseline (black), which estimates reliability improvements from visiting the same class, the x axis represents the number of different 15 min. classroom observations of the same teacher. The red line is the estimate for having a different human observer conduct the observations as described. By contrast, for the model raters, namely a single Encoder (green), an Encoder ensemble (average of 3 encoders) (red), and a GPT ensemble (average of 3 GPT prompt-engineered models), the x axis is the number of full classroom observations conducted where the human (black) observes at least 15 minutes (in-the-loop) of the same classroom (models observe the entire class period). A summary of these results can be found in Table 3.

5.4 Bias: Disentangling Individual Rater Behaviors

RQ 4:

Can bias contributed by individual rater behaviors be identified and disentangled from labels? RQ 4: Case Study Reframe: How do individual rater effects contribute to ratings bias?

5.4.1 Hierarchical Rater Models

RQ4: Annotation Bias Method. Hierarchical Rater Model: three layers of estimation, parameters solved simultaneously (MCMC). Top Stage Intuition: The latent teacher abilities $\boldsymbol{\theta}$ are assumed to be normally distributed. IRT Stage Intuition: Eq. 6 estimates the probability of a teacher $i$ receiving an ideal rating $\xi_{i}$ given the teacher's ability $\boldsymbol{\theta}_{i}$ and item characteristics ($\boldsymbol{\alpha}$, $\boldsymbol{\gamma}$). SDT Stage Intuition: Eq. 7 estimates the probability that a rater gave teacher $i$ a rating $X_{i}$ given the ideal rating $\xi_{i}$ and rater tendencies (bias $\phi$ and variability $\psi^{2}$).

Rater biases in complex tasks are usually not directly measurable, but we can estimate latent constructs that quantify the effects of individual raters' behaviors using methods commonly used throughout Item Response Theory (IRT) to estimate latent attributes of rubric items (e.g., item difficulty) and latent attributes of individuals (e.g., ability). If the data had no variation due to raters, various polytomous IRT methods could help estimate "true scores"/"gold" labels ($\xi_{ij}$) during classroom observations, teacher instructional abilities ($\theta_{i}$), and the various individual item effects. For tasks with human-mediated labels, human raters introduce additional sources of measurement error for each classification, and the data may include multiple measures from multiple raters for a single observation (leading to an accumulation of information at overlapping observation points). To address this, hierarchical rater modeling (HRM) (Patz et al., 2002; Decarlo, 2003; DeCarlo et al., 2011) combines an IRT model with a first-stage estimation defined by a signal detection theory (SDT) relationship. The latter asks the question, "given the presence of the 'true' score, can a rater detect it?" while the former asks, "given the inputs, can we estimate the 'true' score accounting for differences in the tasks used to measure it?". The hierarchical structure addresses the problem of accumulation of information in the estimates. HRMs consist of three components:

$$\text{HRM}\begin{cases}\boldsymbol{\theta}_{i}\sim\text{MVN}(\mathbf{0}_{M\times 1},\mathbf{I}_{M\times M}),\\ \xi_{oij}\sim\text{IRT model: Equation 6}\\ X_{soijr}\sim\text{SDT model: Equation 7}\end{cases} \tag{5}$$

where an IRT model estimates the "gold" label score $\xi_{soij}$ for a given item for some time segment $s$ in teacher $i$'s $o$-th observed lesson for item $j$, which arises from $i$'s $M$-dimensionally distributed latent instructional ability/needs ($\boldsymbol{\theta}_{i}$), and a Signal Detection Theory (SDT) model component disentangles individual rater biases from each recorded score, $X_{soijr}$, by quantifying the latent attributes that mediate whether rater $r$ correctly detects the true score, i.e., $p_{\xi kr}=P\left[X_{soijr}=k\mid\xi_{oij}=\xi\right]$.

The IRT component of Equation 5, which estimates the true scores based on rubric item- and teacher-specific parameters, is a $K_{j}$-category multidimensional generalized partial credit model (MGPCM) (Muraki, 1992; Adams et al., 1997; Cui et al., 2024; Casabianca, 2021). Distributional challenges of negatively worded items can be addressed through a multidimensional parameterization of the underlying latent teacher instructional abilities, with between-item dimensionality confirmatorily defined by the factors in Blazar et al. (2017). The MGPCM item discrimination parameters, $\boldsymbol{\alpha}_{j}=\alpha_{jm}$, and the vector of dimension-specific traits, $\boldsymbol{\theta}_{i}=\theta_{im}$, are separated for $m\in M$ latent dimensions, and item difficulty parameters $\gamma_{jk}$ exist for each possible score category $k$ in item $j$:

$$P\left[\xi_{oij}=\xi\mid\boldsymbol{\theta}'_{i},\boldsymbol{\alpha}_{j},\gamma_{j\xi},o\right]=\frac{\exp\left\{(k-1)\boldsymbol{\alpha}_{j}\boldsymbol{\theta}'_{i}-\sum_{k=1}^{k}\gamma_{jk}\right\}}{\sum_{h=1}^{K_{j}}\exp\left\{(k-1)\boldsymbol{\alpha}_{j}\boldsymbol{\theta}'_{i}-\sum_{k=1}^{h}\gamma_{jk}\right\}},\quad\forall s\in o \tag{6}$$

where $oi=1,\dots,N$ lessons observed for teacher $i$, $j=1,\dots,J$ items, $r=1,\dots,R$ raters, and $k=1,\dots,K$ possible scores.
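Equation 6 can be read as a softmax over the score categories of one item. The sketch below assumes the standard generalized partial credit form, in which both the multiplier on $\boldsymbol{\alpha}_{j}\boldsymbol{\theta}'_{i}$ and the cumulative sum of step difficulties follow the category being evaluated; this is an interpretive assumption, because Eq. 6 as printed reuses the index $k$. The function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence


def gpcm_probs(theta: Sequence[float],
               alpha: Sequence[float],
               gamma: Sequence[float]) -> List[float]:
    """Category probabilities for one item under a (multidimensional) GPCM.

    theta: latent traits of one teacher, one value per dimension m
    alpha: item discriminations, one value per dimension m
    gamma: step difficulties gamma_{j1..jK}, one value per score category
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same number of dimensions")
    a_theta = sum(a * t for a, t in zip(alpha, theta))  # inner product alpha_j . theta_i
    logits = []
    cumulative = 0.0
    for category, step in enumerate(gamma, start=1):
        cumulative += step                              # sum of gamma_{jk} up to this category
        logits.append((category - 1) * a_theta - cumulative)
    top = max(logits)                                   # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With all step difficulties equal to zero and $\boldsymbol{\theta}_{i}=\mathbf{0}$, every category is equally likely; raising $\boldsymbol{\theta}_{i}$ along a dimension with positive discrimination moves probability toward the higher score categories.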

As parameterized by Patz et al. (2002), the base-level SDT model of the HRM represents the measurement error induced by rater $r$, whose ability to "detect" the true score changes according to the individual rater's item-specific biases, $\phi_{jr}$, and variabilities, $\psi_{jr}$, shown on the x and y axes of Figure 4:

$$p_{\xi kr}\propto\exp\left\{-\frac{1}{2\psi_{jr}^{2}}\left[k-\left(\xi+\phi_{jr}\right)\right]^{2}\right\} \tag{7}$$

where $\boldsymbol{\phi}_{jr}=\mathbf{Y}_{jr}\eta$ is a linear model for rating bias across raters and items, with design matrix $\mathbf{Y}_{jr}$ of dimensions $(RJ)\times(R+J)$ and $\eta=(\phi_{1},\dots,\phi_{R},\eta_{1},\dots,\eta_{J})^{T}$ for $R$ raters and $J$ items, as parameterized in Mariano and Junker (2007). Correspondingly, we update $\ln\psi_{jr}^{2}=\mathbf{Y}_{jr}(\ln\tau^{2})$, where $\ln\boldsymbol{\tau}^{2}=(\ln\psi_{1}^{2},\dots,\ln\psi_{R}^{2},\ln\tau_{1}^{2},\dots,\ln\tau_{J}^{2})^{T}$. The complete rater estimates from these models are displayed in Figure 10. The Bayesian estimates were calculated via Markov chain Monte Carlo (MCMC) simulation using Gibbs sampling across four chains with JAGS (Plummer, 2003) in R, using very weakly informative priors and converging with $\hat{R}<1.1$ for each parameter. A structural plate diagram and JAGS code for the full extended model can be found in Appendix G.
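The SDT stage in Equation 7 and the additive structure of the bias model can be illustrated with a short sketch. It assumes that each row of the design matrix selects exactly one rater effect and one item effect, so that $\phi_{jr}=\phi_{r}+\eta_{j}$ (and likewise for the log variances); that reading follows from the stated dimensions but is an assumption here. This is not the authors' JAGS model, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def additive_effects(rater_effects: Sequence[float],
                     item_effects: Sequence[float]) -> List[List[float]]:
    """Return a J x R grid whose [j][r] entry is rater_effects[r] + item_effects[j].

    Equivalent to multiplying an (RJ) x (R+J) indicator design matrix by the
    stacked vector of rater and item effects, assuming one rater indicator
    and one item indicator per row.
    """
    return [[rater + item for rater in rater_effects] for item in item_effects]


def sdt_probs(xi: int, phi_jr: float, psi_jr: float, n_categories: int) -> List[float]:
    """Probability of each observed rating k = 1..K given ideal rating xi (Eq. 7),
    normalized over the K categories."""
    if psi_jr <= 0:
        raise ValueError("psi_jr must be positive")
    logits = [-((k - (xi + phi_jr)) ** 2) / (2.0 * psi_jr ** 2)
              for k in range(1, n_categories + 1)]
    top = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only: two raters, two items.
bias = additive_effects([0.5, -0.5], [0.0, 0.2])            # phi_{jr}
log_var = additive_effects([math.log(0.25), math.log(1.0)],  # ln psi_r^2
                           [0.0, 0.0])                       # ln tau_j^2
psi = math.sqrt(math.exp(log_var[0][0]))                     # psi_{jr} for item 1, rater 1
print(sdt_probs(xi=2, phi_jr=bias[0][0], psi_jr=psi, n_categories=3))
```

In this parameterization a positive $\phi_{jr}$ shifts the most likely observed rating above the ideal rating (a lenient rater), a negative value shifts it below (a severe rater), and a larger $\psi_{jr}$ spreads probability across more categories.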

5.4.2 Results

Individual annotator tendencies and behaviors can be measured, and they indicate significant differences between raters. The vertical dashed lines on the graphs in panels (d) and (e) in Figure 4 represent 0.5 standard deviations of difference for individual raters from the mean. GPT models show significantly different rater behavior. Implications: even for tasks where there is minimal overlap of observations among individual raters, rater behaviors can still be modeled and removed. This allows for improved curation of datasets and model selection.

5.5 Fairness: Estimation of Ratings along Racial Lines

RQ 5:

With unreliable labels and complex tasks, can rater contributions to biased labeling across groups be estimated? RQ 5 Case Study Reframe: Can issues of racial fairness in ratings be disentangled from individual rater behaviors?

5.5.1 Measuring Racial Discrimination as Rater Covariates

Disentangling individual rater biases further, across sensitive attributes, can provide a measure of fairness for labels and identify raters (human or model) that display discriminatory biases. Variables representing a sensitive attribute, $\varsigma$ (e.g., race/ethnicity, gender, age, etc.), should be independent of the observed score $X_{soijr}$ given the true score $\xi_{soij}$ if ratings are fair: $X\perp\varsigma\Rightarrow P_{\varsigma=a}(X_{jr}|\xi_{j})=P_{\varsigma=b}(X_{jr}|\xi_{j}),\ \forall a,b$. In the notation used for disentangling rater effects, scoring from rater $r$ on item $j$ is fair with respect to attribute $\varsigma$, given $\varsigma\perp\xi$, when conditioning on the attribute makes no difference to the distribution of scores:

$$P[X_{soijr}|\xi_{soij},r,j,\varsigma_{i}]=P[X_{soijr}|\xi_{soij},r,j] \tag{8}$$

To measure a rater’s item-level fairness with respect to some sensitive teacher attribute, $\varsigma$, the rater parameter vectors are easily updated: $\phi_{jr\varsigma}=\textbf{Y}_{jr\varsigma}\eta$ is now a linear model for rating bias for items, where $\textbf{Y}_{jr\varsigma}$ is a design matrix of dimensions $(RJ\Sigma)\times(R+J+\Sigma)$ and $\Sigma=\{B,W\}$ for Black and White self-identified teachers, respectively. In this case, where $\varsigma_{i}\in\{B,W\}$, we can write the vector explicitly to illustrate those values: $\eta=(\phi_{1_{B}},\dots,\phi_{R_{B}},\phi_{1_{W}},\dots,\phi_{R_{W}},\eta_{1_{B}},\dots,\eta_{J_{B}},\eta_{1_{W}},\dots,\eta_{J_{W}})^{T}$ for $R$ raters and $J$ items. The variance model $\ln{\psi_{jr\varsigma}^{2}}=\textbf{Y}_{jr\varsigma}(\ln{\tau^{2}})$ is similarly updated such that $\ln{\mathbf{\tau}^{2}}=(\ln{\psi_{1}^{2}},\dots,\ln{\psi_{R}^{2}},\ln{\tau_{1}^{2}},\dots,\ln{\tau_{J}^{2}},\tau_{B}^{2},\tau_{W}^{2})^{T}$.

RQ5: Fairness Metric
Group $\varsigma$ Independence: $X\perp\varsigma \measeq \phi_{B}-\phi_{W}$
Intuition ($X\perp\varsigma$): Holding a teacher’s ideal rating $\xi$ constant for a given rater $r$, a teacher’s race ($\varsigma_{i}\in\{B,W\}$) should be independent of the assigned score $X$. Estimating rater biases directly, $\phi_{r,\varsigma=B}-\phi_{r,\varsigma=W}\approxeq 0$.

By approaching the estimation this way, where $\phi_{jr\varsigma}$ is estimated as a parameter, we disentangle contributions to rater scores based on teacher race. This simplifies the task of evaluating for fairness using the metric of group independence, $X\perp\varsigma$, where we can directly calculate $P[X_{soijr}|\xi_{oij},\phi_{jr\varsigma},\varsigma_{i}]=P[X_{soijr}|\xi_{oij},\phi_{jr\varsigma}]$. Thus, $X\perp\varsigma \measeq \phi_{B}-\phi_{W}\approxeq 0$.
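As a minimal sketch of how the contrast $\phi_{B}-\phi_{W}$ might be summarized once posterior draws are available, the snippet below computes the posterior mean and an equal-tailed credible interval for one rater on one item. The function name `fairness_contrast` and the simulated draws are hypothetical stand-ins; they are not the study's estimates, and real draws would come from the fitted MCMC chains.

```python
import numpy as np

def fairness_contrast(phi_b_draws, phi_w_draws, level=0.95):
    """Summarize phi_B - phi_W from paired posterior draws for one rater-item."""
    # Draws are subtracted pairwise, assuming both arrays come from the
    # same MCMC iterations of one joint posterior.
    diff = np.asarray(phi_b_draws) - np.asarray(phi_w_draws)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(diff, [tail, 1.0 - tail])
    return {
        "mean": float(diff.mean()),
        "lower": float(lower),
        "upper": float(upper),
        # Flag the rater when the interval does not cover zero.
        "excludes_zero": bool(lower > 0 or upper < 0),
    }

# Hypothetical draws standing in for MCMC output (not study results).
rng = np.random.default_rng(0)
phi_b = rng.normal(-0.30, 0.05, size=4000)
phi_w = rng.normal(0.00, 0.05, size=4000)
result = fairness_contrast(phi_b, phi_w)
```

An interval that excludes zero is one possible operationalization of a departure from $\phi_{B}-\phi_{W}\approxeq 0$; other decision rules, such as a region of practical equivalence, would be equally defensible.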

When estimated, fewer than 1% of parameter estimates had $\hat{R}\geq 1.1$, and the differences in their posterior distributions have no material effect on results or discussion; all rater-item-specific 95% credible intervals for biases are represented as horizontal lines in Figure 4, in panel (e). Appendix G has the full JAGS code used for the formula specification for all items and dimensions, including initial value parameters. Additionally, a plate diagram for MCMC modeling can be found in Figure 9.
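For readers unfamiliar with the $\hat{R}$ threshold, the following sketch implements the classic Gelman-Rubin potential scale reduction factor for a single parameter. It is an illustration under stated assumptions: the function name `gelman_rubin` and the simulated chains are hypothetical, and the diagnostic reported by the R tooling used in the study may apply additional corrections.

```python
import numpy as np

def gelman_rubin(chains) -> float:
    """Classic potential scale reduction factor for one parameter.

    chains: array of shape (n_chains, n_draws), post burn-in.
    """
    chains = np.asarray(chains, dtype=float)
    _, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n     # pooled variance estimate
    return float(np.sqrt(var_plus / within))

# Hypothetical chains: four well-mixed chains, then one chain shifted away.
rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(4, 2000))
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Values near 1 indicate that the chains agree; a chain exploring a different region inflates the between-chain variance and pushes the statistic above the 1.1 threshold.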

5.5.2 Results

Racial bias at the individual rater level is significantly measurable. The GPT model families show a negative bias trend against Black teachers relative to White teachers on most items, as seen in the comparison of those models across panels (d) and (e) in Figure 4. Potentially more precisely, GPT models’ rating centrality seemed to diminish when rating Black teachers, especially with the "reasoning" model, adding evidence that these foundation models may be sensitive to linguistic differences found in African-American English (AAE) (Hofmann et al., 2024b; Fleisig et al., 2024), possibly due to historical data or models’ relative unfamiliarity with AAE (Rickford and King, 2016). These results alone should give pause to edtech developers relying on prompt-engineering of foundation LLMs, as subtle biases exist in very complex tasks. Additionally, it is not just GPT models that show biases. For some types of items, such as negatively worded items, individual human rater effects could be detected in the form of abnormal rater biases, either positive or negative, towards teachers with some sensitive attribute.

Overall, encoders displayed much less bias than humans. However, while their biases were not as severe as the GPT or human biases, the encoder models did not avoid issues of racial bias. On the worst performing item for both human and encoder models, MGEN, all of the encoder models found spurious relationships in some language feature while overfitting with a negative bias against Black teachers. The reasons likely have to do with label sparsity and underrepresentation across label categories: with so few examples of ratings in the higher categories in the training dataset, overfitting on a biased sample was not adequately controlled for, showing in microcosm the alignment to poor data that GPT exhibits in macrocosm. Fortunately for the encoders, earlier data had already suggested that neither the models nor humans (see Appendix F.1 and Hill et al. (2012b)) could sufficiently distinguish between the item’s categories.

Implications: even for tasks where there is minimal overlap of observations among individual raters, bias can still be modeled and removed. This allows for improved curation of datasets and model selection. The techniques can also be used to evaluate biases from given populations.

5.6 Helpfulness: Estimating Real-world Effects

RQ 6:

Can we estimate the effects on rating quality and changes in real-world cost if a model were to be used with a human-in-the-loop? RQ 6 Case Study Reframe: For a teacher, how would automated ratings of instruction affect human rating quality?

RQ6: Helpfulness Metric
Human-in-the-Loop Dependability: $\widetilde{\Phi}_{j,\mathbb{F}^{\prime}_{\text{HIL}}\sim\mathbf{K}}$
Intuition ($\widetilde{\Phi}_{\mathbb{F}^{\prime}_{\text{HIL}}\sim\mathbf{K}}$): By controlling variance contributions by source, we can estimate how changes (e.g., observing another lesson, using a different rater) would affect the dependability of a rating given to a teacher.

5.6.1 Mixed Decision Studies

A Decision Study (D-study) estimates how reliabilities of ratings could improve by adjusting measured facets of variation, much like Ho and Kane did to motivate the case study. To estimate the reliability in a human-in-the-loop scenario, multiple G-studies and D-studies would need to be constructed to combine the variance contributions across a set of rater families, $\mathbb{F}$. For this work, only two different types of families are considered in each D-study, and one of them will always be human, as automated rating models, even high-performing Encoders, are not yet ready to produce ratings independent from human confirmation. For a human-in-the-loop decision study, $\mathbb{F}$ would consist of families $\mathbb{f}$ that have humans only and models only, and a combined human-model family. For a $(S:O:i)\times R$ study, the estimated dependability of ratings provided to teachers $i$ on item $j$, $\tilde{\Phi}_{j}$, in the joined "universe" $\mathbb{F}^{\prime}$, where estimations are represented by $\mathbf{K}$, the collection of unique parameterizations and estimates, $\varkappa$, for the facets of variance in each D-study, is:

$$\widetilde{\Phi}_{j,\mathbb{F}^{\prime}_{\varkappa}\sim\mathbf{K}}=\frac{\sum_{\mathbb{f}}^{\mathbb{F}}{\sigma^{2}(i_{\varkappa})}_{j\mathbb{f}}}{\sum_{\mathbb{f}}^{\mathbb{F}}{\sigma^{2}(i_{\varkappa})}_{j\mathbb{f}}+{\sigma^{2}(\Delta_{\varkappa})}_{j\mathbb{f}}} \tag{9}$$

where the summations in Equation 9 combine the variation across the familial "universes", indexed by $\varkappa$, of different rater families in $\mathbb{F}$, and ${\sigma^{2}(i_{\varkappa})}_{j}$ and ${\sigma^{2}(\Delta_{\varkappa})}_{j}$ represent the "universe" variability for teacher $i$ and the absolute error for dependability, respectively, at the teacher-year-level ($i$) across the combined parameterization set $\mathbf{K}$. Structurally, Equation 9 shares similarities with the two-stage ICC calculation of Eq. 12. These values are represented in the ratio for calculating dependability, $\Phi_{j}$, as found in Equation 3: ${\sigma^{2}(\Delta)}_{j}\equiv\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{rj}+\nu_{s:o:irj}$.
The absolute error for a rater family ($\mathbb{f}$) indexed by $\varkappa$ across any permutation of decision values in this study is:

$$\sigma^{2}(\Delta_{\varkappa})=\frac{\sigma^{2}(r_{\varkappa})}{n_{r_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(o_{\varkappa}:i)}{n_{o_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(r_{\varkappa}i)}{n_{r_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(s_{\varkappa}:o_{\varkappa}:i)}{n_{s_{\varkappa}}^{\prime}n_{o_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(s_{\varkappa}:o_{\varkappa}:ir_{\varkappa})}{n_{s_{\varkappa}}^{\prime}n_{o_{\varkappa}}^{\prime}n_{r_{\varkappa}}^{\prime}} \tag{10}$$

where the decision values vary across design facets and each contribution is weighted by the combined count $n_{k}^{\prime}$ of a given facet $k$: the count for ratings generated only by the family indexed by $\varkappa$, $n_{k_{\varkappa}}$, plus the count for those facets, if any, shared between families, $n_{k_{\mathbb{F}^{\prime}}}$. That is, $n_{k_{\varkappa}}^{\prime}=n_{k_{\varkappa}}+n_{k_{\mathbb{F}^{\prime}}}\ \forall k\in\{s,o,r\}$, with $n_{r_{\mathbb{F}^{\prime}}}=0$. These distinct sets of parameter values for each design study are represented in Equation 9.
For human-in-the-loop only use cases, $\varkappa_{\text{HIL}}$, the value $n_{k_{\mathbb{F}^{\prime}}}$ represents those sources of variation that are shared between rater families. For a model family $\mathbb{f}=\mathbb{m}$, where there would be no observations made by a model without a human, the model would not have any independent observations: $n_{o_{\mathbb{m}}}=0$. To represent these $n$ values where a human $\mathbb{h}$ observes a classroom for 15 minutes with a model (footnote 12: for the MQI instrument, observation segments are 7.5 minutes long) and where a single model $\mathbb{m}$ continues to observe for the remainder of the class (an additional 45 minutes), $\mathbf{K}_{n\in\varkappa_{\text{HIL}}}=\{n_{o_{\mathbb{m}}}=0,\ n_{o_{\mathbb{h}}}=0,\ n_{o_{\mathbb{F}^{\prime}}}=1,\ n_{s_{\mathbb{m}}}=6,\ n_{s_{\mathbb{h}}}=0,\ n_{s_{\mathbb{F}^{\prime}}}=2,\ n_{r_{\mathbb{m}}}=1,\ n_{r_{\mathbb{h}}}=1,\ n_{r_{\mathbb{F}^{\prime}}}=0\}$, and the variance components are solved similarly to the coefficients of Eq. 1.
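The bookkeeping of Equation 10 can be sketched as follows. The facet counts are those given in the text for the human-in-the-loop scenario (two shared 7.5-minute segments in one shared observation, six further model-only segments, one rater per family); the variance components, the class `VarianceComponents`, and the helper names are hypothetical placeholders rather than estimates from the study.

```python
from dataclasses import dataclass

@dataclass
class VarianceComponents:
    r: float       # sigma^2(r): rater main effect
    o_i: float     # sigma^2(o:i): observation nested in teacher
    ri: float      # sigma^2(ri): rater-by-teacher interaction
    s_o_i: float   # sigma^2(s:o:i): segment nested in observation
    s_o_ir: float  # sigma^2(s:o:ir): residual

def combined_counts(own: dict, shared: dict) -> dict:
    """n'_k = n_k (family-only) + n_k (shared) for each facet k in {s, o, r}."""
    return {k: own[k] + shared[k] for k in ("s", "o", "r")}

def absolute_error(v: VarianceComponents, n: dict) -> float:
    """Equation 10: absolute error variance for one rater family."""
    return (v.r / n["r"] + v.o_i / n["o"] + v.ri / n["r"]
            + v.s_o_i / (n["s"] * n["o"])
            + v.s_o_ir / (n["s"] * n["o"] * n["r"]))

# Facet counts from the human-in-the-loop scenario described in the text.
shared = {"s": 2, "o": 1, "r": 0}
human_own = {"s": 0, "o": 0, "r": 1}
model_own = {"s": 6, "o": 0, "r": 1}
n_h = combined_counts(human_own, shared)
n_m = combined_counts(model_own, shared)

# Hypothetical variance components, identical for both families for clarity.
v = VarianceComponents(r=0.02, o_i=0.04, ri=0.02, s_o_i=0.08, s_o_ir=0.16)
err_h = absolute_error(v, n_h)
err_m = absolute_error(v, n_m)
```

With identical components, the family that contributes more segments has the smaller absolute error, because only the segment-level terms are divided by the larger $n_{s}^{\prime}$.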

5.6.2 Results

Estimates of impacts of model use can be reconstructed from measurable variances. The estimates for $\widetilde{\Phi}_{j,\mathbb{F}^{\prime}}$ are in Figure 4 panel (f), with complete results for all items in Figure 14. As conducting actual human-annotated classroom observation ratings is immensely expensive, the decision study analyses of Section 5.6 offer methods for estimating the improvement gained by using a model or model family. Parameterizing the decision conditions to reflect "human-in-the-loop" scenarios can even offer insight into whether the variation offered from automated ratings adds to or detracts from human rating quality, offering a means of estimating answers to research questions before more expensive trials.
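To make the "adds or detracts" comparison concrete, the sketch below evaluates the combined dependability of Equation 9 for a human family alone and for a human family joined with a model family. It assumes the denominator sum runs over both the universe-score variance and the absolute error of every family; the function name `combined_dependability` and all variance values are hypothetical and do not reproduce any result in Figure 4 or Figure 14.

```python
def combined_dependability(families):
    """Equation 9 for one item.

    families: iterable of (universe_variance, absolute_error) pairs,
    one pair per rater family in the joined universe.
    """
    families = list(families)
    signal = sum(u for u, _ in families)  # summed sigma^2(i) across families
    error = sum(e for _, e in families)   # summed sigma^2(Delta) across families
    return signal / (signal + error)

# Hypothetical (universe variance, absolute error) pairs.
human = (0.10, 0.20)
good_model = (0.10, 0.11)   # model ratings that track teacher differences
noisy_model = (0.02, 0.30)  # model ratings dominated by error variance

human_only = combined_dependability([human])
with_good_model = combined_dependability([human, good_model])
with_noisy_model = combined_dependability([human, noisy_model])
```

Under these made-up values, the first model raises dependability above the human-only baseline while the second lowers it, which is the kind of contrast the decision study is designed to surface before a costly trial.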

Constructs that are relatively infrequent, such as LANGIMP, could greatly benefit from automated ratings, since sufficient human observations for identifying that construct would be expensive. Having encoder models listen in for three entire classes yields reliabilities for that construct that are twice those of the combined efforts of multiple human raters stopping by a teacher’s classroom 10 times for fifteen minutes each time: a net savings of two hours for the principal, and a potential savings of over 10 hours if such a level of reliability were desired and these trends were to continue. Implications: Not all variance contributes equally, and its careful deconstruction and reconstruction can anticipate future effects before setting up more expensive studies.

| Category | Metric | GPTs: EXPL | GPTs: LANGIMP | GPTs: REMED | GPTs: SMQR | Encoders: EXPL | Encoders: LANGIMP | Encoders: REMED | Encoders: SMQR |
|---|---|---|---|---|---|---|---|---|---|
| RQ1 Concordance | IRRs | ✗ | ✗ | ✗ | ✗ | ✓ | ? | ✓ | ✓ |
| | $r,\rho,\tau$ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| RQ2 Confidence | $\mathbf{E}\rho^{2}$ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| | $\Phi$ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| RQ3 Validity | $\varrho_{\mathbb{hm}}^{(j)}$ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ? |
| RQ4 Bias | $\phi_{r}$ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ? |
| RQ5 Fairness | $X\perp\varsigma$ | ? | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| RQ6 Helpfulness | $\widetilde{\Phi}_{\mathbb{F}^{\prime}_{\text{HIL}}\sim\mathbf{K}}$ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ? |

Table 3: Summary Table for Item-level Metrics and Relative Performance for Model Families on four focus items. GPTs are from Wang and Demszky and Encoders are from the present study. For each metric, symbols represent whether the model family generally performs as well as or better than humans (✓), worse than humans (✗), or if performance relative to humans is unclear (?). The results for all MQI items can be found in Table 4. IRRs refers to the Inter-rater Agreement metrics from Section 5.1.

6 Overall Results and Discussion

At the outset we asked: How can we know when the behaviors of models are good enough to be used in lieu of the humans estimated by Ho and Kane? This question, which is a question of validity, is unanswerable by purely empirical means. While reliability (and accuracy) are measurable, validity is a case made from argument. Thus, the answer to that question is not a binary, but one of quality; it is about knowing when the behaviors of models are "good enough" on some item on some instrument for some population of classrooms against some standard of performance. Even though the Encoder family in this study outperforms humans, we need to be wary of the validity of the construct being measured, as humans have exhibited the tendency to collaborate poorly with LLM/AI models in their current state (Vaccaro et al., 2024; Agarwal et al., 2023; Zhou et al., 2024; Azaria et al., 2024; UpLevel, 2024). The constraints of human uses demand arguments to validity that are beyond the scope of this work, despite the intentional wording of the primary research question.

The overall results relative to human performance, corresponding to each of the research questions and their respective metrics, can be found in Table 3 for the four focus MQI items and in Table 4 for all MQI items.

For the four focus MQI items, contrasting panel (b) with panel (c) in Figure 4 reveals that commonly used evaluation metrics can obscure important aspects of model performance. However, as demonstrated in panels (c)-(f), there are methods that can be used to improve evaluation under label uncertainty. Many of these methods could be applied to annotated data prior to model training to improve data quality and support training (Gordon et al., 2022).

Encoder models, on most items and in general, outperformed human raters in terms of reduced biases, improved performance metrics, and anticipated cost savings. To our knowledge at the time of writing, they are the best performing models for automated rating of classroom instruction using an authentic measurement instrument, showing large gains over human performance, and even larger gains over other models, across the metrics discussed herein. While not the focus of this study, the best reported single metric by Whitehill and LoCasale-Crouch on the CLASS rubric across all items and models, $R=0.48$, is contrasted with the average CLASS item performance of the encoders, $\bar{R}=0.60$, and the single worst item for any Encoder model, $\min R=0.50$, as reported in the online materials. Thus, the Encoder family models offer a pathway forward for supporting the expensive research task of instructional annotation, regardless of whether they are ready for actual deployment with teachers.

This is in stark contrast to the GPT models, which perform much worse than human raters. GPT models likely performed poorly in part due to the prompt length (Liu et al., 2023a) and in part due to the out-of-distribution inputs of elementary school classroom discourse and the out-of-distribution task of instructional assessment (McCoy et al., 2023); these hypotheses could be investigated in future research. As GPT-style models increase in popularity, in use, and in sophistication, these methods can help identify sophistry and speciousness in third-party models even in the presence of low reliability. Like humans, models tended to choose a preferred rating value, and their deviations, conditionally informed by billions of fixed parameters at inference, are non-random. [Footnote 13: Variables like `temperature` can increase the stochasticity of model outputs.]

Being able to identify biases in cases of unreliable annotations is important, and researchers should resist the urge to withhold evaluable results from foundation models even if the data fail to reject a null hypothesis. By performing more rigorous evaluations, researchers could crowdsource measuring model biases and behavior tendencies to help all users be more discerning of speciousness, especially as these models’ poor behaviors get harder to detect (Azaria et al., 2024; Hosking et al., 2024; Zhou et al., 2024) and as researchers make bolder claims about their abilities (see Binz et al. 2024, inter alia).

The Encoder models’ designs, by contrast, were constructed to allow for multiple methods of interpretability and use by evaluating continuous windows of classroom discourse. This could be used for real-time diagnosis, interpretation, and supporting common understanding between teacher and coach. An example of such use can be found in Figure 15, where the continuous predictions for all encoder models are displayed next to average human rating scores. Improvements to this process, combined with successful feature attribution, could boost validity and trust in model use for these high-stakes scenarios. If various performance measures continue to display strong performance, feature attribution (see Appendix I.1) could then be used in the future for augmenting transcripts of classroom instruction to support model training and inference.

Automated encoder LLMs could reduce the high costs of improving classroom observers’ annotations and serve as a stepping stone to quality teacher development. [Footnote 14: Code for statistical models is available in the appendix and free for use.] Education technologists and EdTech enthusiasts should be wary of foundation models’ abilities to do out-of-distribution tasks. These "stochastic parrots" (Bender et al., 2021) might start fires with their "embers of autoregression" (McCoy et al., 2023) when trying to perform tasks for data so far from their training distribution, which is certainly the case with authentic fourth and fifth grade mathematics classroom discourse.

7 Limitations

The methods serve as a proof of concept for enhancing reliability in widespread and costly classroom evaluation tasks. Even though these models can perform better than a human given many accepted metrics, much more analysis and technological development is needed. Despite being best in class, these models should not be used in production in their current state. Even with a human in the loop, much more work must be done to ensure their readiness for possible assumed capabilities by end users. Far more important is that GPT style models are not used similarly, and this paper does not endorse their use for this or similar tasks.

Demonstrating multiple methods in a paper while suggesting their flexibility evokes the Garden of Forking Paths Problem. This study chose to follow the same parameterizations in Section 5.1 and the same data aggregations as the original study (Kane et al., 2015) in order to preserve comparability with the original data and human raters by using methods more familiar in this context. However, this parameterization has its limitations. An example of an alternative to aggregating and calculating reliabilities at the segment level (as was demonstrated in Section 5.6) would be to look at reliability and validity issues at the utterance level, something uniquely available to the Encoder model family herein that is not available to other raters or models. Figure 15 illustrates this capability, which is underexplored in this paper. Such analyses could be bolstered further by authentic feature attribution for improving interpretability. (See Appendix I.1 for more on directions for future work implied here.)

While they do demonstrate the claims, the methods of this paper might not be the best implementation of available methods. Rather, they are intended to illustrate the potential for better quantifying behaviors in both labelers and models when we have uncertainty in labels. For example, if more understanding of rater perceptions and behaviors of labeling tasks is needed, using a more expressive substitution of Equation 7 (DeCarlo et al., 2011; DeCarlo, 2023, 2008) could give greater insight, especially in the case where models may perceive label category thresholds differently.

Psychometric models generally assume that the underlying latent variables are distributed normally across a population, which is usually a reasonable assumption with humans. But this assumption need not be true for models nor for all tasks. In this study, a few models were estimated alongside humans to demonstrate how differently they behave under this assumption, but this paper provides no evidence that model abilities would be normally distributed for LLMs (e.g., latent constructs could follow multimodal distributions, depending on model family and pretraining, or follow a Normal-exponential-gamma distribution for shifts in metric-specific emergent behaviors). Were researchers interested in modeling learning in a larger population of models, other methods, such as unipolar IRT models (Huang and Bolt, 2023), could potentially help for understanding between-model behaviors for the case where the rating instrument is purely an issue of detection and then magnitude. The usefulness of the basic psychometric models presented rests on the usefulness of the anthropomorphic distributional comparisons we can reasonably make in the presence of uncertain labels.

The parameters and variables selected for reporting the decision study results do not represent all use cases and algorithms. While the assumption that models like GPT would have their labels treated as if they were human is reasonable, it is still an assumption. For example, the decision study of Section 5.6 does not have a within-observation-longitudinal parameterization and thus assumes that humans observing multiple segments of a class period do not necessarily need to observe the segments consecutively. While the MQI rubric is worded so as to be robust to within-lesson autocorrelation, actual lessons are obviously autocorrelated. Longitudinality could likewise support more accurate versions of Equation 6.

While many studies cited herein seek to generalize similar research across all classrooms, we acknowledge that this cannot be done with the transcript data we use for this work, as it consists only of fourth and fifth-grade mathematics classrooms from the United States. While the methods potentially possess broad applicability across all grades and subject areas, the current models lack generalizability beyond elementary mathematics classrooms in U.S. public schools, highlighting the need for more publicly available data in this area. Furthermore, the associated ratings and reliability metrics pertain solely to a subset of rating items on the MQI rubric, [Footnote 15: The full set of items from MQI and CLASS rubrics are available in Appendices and in the online materials.] which may introduce limitations when addressing the more universal task of automated instruction ratings. This is associated with the limitations of the instruments themselves, as imperfect tools for even calibrated and trained raters.

Similarly, as the focus of this paper is to demonstrate evaluation techniques in the presence of unreliable labels, the generalizability of models is low. Encoder models, while each is powerful and individually able to produce automated scores for 25 different authentic measures of classroom instruction (in contrast to the models of Xu et al., which used 11 separate fine-tuned models for the MQI items evaluated), were built specifically for this task and would not generalize further without data or architecture changes. GPT models represent available autoregressive decoder in-context learning via prompt engineering in 2023. Models have scaled and improved since then and it is possible that performance would improve, but issues of underlying racial biases (Section 5.5) continue to exist, even with more current models (Hofmann et al., 2024b, a; Warr et al., 2024; Shieh et al., 2024; Nghiem et al., 2024; Henderson et al., 2024).

The Encoder models were trained under the assumptions that the actual expert human ratings are not very reliable, that the alignment of the coordination of timing across rubrics and across transcripts is imperfect, that the discourse transcripts are imperfect, and that information is lost by keeping fixed sentence-level embeddings. While the methods outlined worked to extract a meaningful signal despite these challenges, it should be noted that the signal is still trained on noisy human ratings. If, on average, the raters had a particular bias, the model would carry that bias. This is particularly true of the CLASS item ratings, which were produced by only 19 different raters (compared to the 63 used for the MQI rubric items) with only one rater per classroom observation. Results are included for comparability and generalizability, but they likely carry more of the human raters’ idiosyncrasies.

The encoder models removed transcription notes and intentionally did not use transcription information (such as identification of speaker) to best emulate what the functionality would be in an audio-input-only setup. While this is an authentic interpretation of the task, the transcription process was still done by humans. While direct input from audio would capture even more information (such as tone or long breaks in speaking for independent work), these models have not been trained to work with automated transcription.

The encoder models could be improved through metalearning training, so they could be more adaptive to new instructional rubrics and classrooms. Without metalearning across tasks, transferability is limited by the training regime and architecture as well as the data. Future work will include metalearning, allowing the model to take advantage of 72% more observations.

Finally, while the paper reported on "GPT" family performance, it only used the performance corresponding to a single study, which used only prompt engineering and which used ChatGPT 3.5. Perhaps with fine-tuning, multi-agent prompting, and other enhanced uses of such models, performance might improve. However, it is not clear that, even as models continue to improve on general use tasks, they will improve in their ability to understand and respond to text that is outside of their training distribution (i.e., classroom discourse). Even if the text were within the training distribution, this study has demonstrated that evaluation of such text is non-trivial and, thus, the task would still be more challenging for such models (McCoy et al., 2023).

8 Authorship and Positionality Statement

Michael Hardy is the sole author of this work. Prior to his research work, he worked in public education as a teacher, principal, superintendent, and a state chief, where he evaluated and improved instructional materials and practices across many contexts. With more than a decade of successful coaching instruction and as a former Educator of the Year for Texas, he is compelled by his passion and expertise to improve and support classroom teachers so that all students can have access to an excellent education. Third-party generative language models, such as ChatGPT, were not used for any aspect of the study, except where explicitly stated.

| Category | Metric | GPTs EXPL | GPTs LANGIMP | GPTs REMED | GPTs SMQR | Encoders EXPL | Encoders LANGIMP | Encoders REMED | Encoders SMQR | Encoders ETCA | Encoders LCP | Encoders LINK | Encoders MAJERR | Encoders MGEN | Encoders MLANG | Encoders MMETH | Encoders STEXPL | Encoders USEPROD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RQ1 Concordance | IRRs | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faQuestionCircle[regular] | \faCheckCircle |
| RQ1 Concordance | $r,\rho,\tau$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle |
| RQ2 Confidence | $\mathbf{E}\rho^{2}$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle |
| RQ2 Confidence | $\Phi$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle |
| RQ3 Validity | $\varrho_{\mathbb{hm}}^{(j)}$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle |
| RQ4 Bias | $\phi_{r}$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faQuestionCircle[regular] |
| RQ5 Fairness | $X\perp\varsigma$ | \faQuestionCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faCheckCircle |
| RQ6 Helpfulness | $\widetilde{\Phi}_{{\mathbb{F^{\prime}_{\text{HIL}}}}\sim\mathbf{K}}$ | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faCheckCircle | \faCheckCircle | \faCheckCircle | \faQuestionCircle[regular] | \faTimesCircle[regular] | \faCheckCircle | \faCheckCircle | \faTimesCircle[regular] | \faCheckCircle |
Table 4: Summary Table for All MQI Item-level Metrics and Relative Performance for Model Families. GPTs are from Wang and Demszky and Encoders are from the present study. For each metric, symbols represent whether the model family generally performs as well as or better than humans \faCheckCircle, worse than humans \faTimesCircle[regular], or if performance relative to humans is unclear \faQuestionCircle[regular]. The four MQI items tested by Wang and Demszky are EXPL, LANGIMP, REMED, and SMQR.

References

Appendix A NCTE Population Descriptive Statistics

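| Group | Characteristic | NCTE sample mean |
|---|---|---|
| Teachers (N=309) | Female | 0.85 |
| Teachers (N=309) | African-American | 0.22 |
| Teachers (N=309) | Asian | 0.03 |
| Teachers (N=309) | Hispanic | 0.03 |
| Teachers (N=309) | White | 0.65 |
| Teachers (N=309) | Teaching Experience (Years) | 10.59 |
| Students (N=9,141) | Female | 0.50 |
| Students (N=9,141) | African-American | 0.41 |
| Students (N=9,141) | Asian | 0.08 |
| Students (N=9,141) | Hispanic | 0.24 |
| Students (N=9,141) | White | 0.24 |
| Students (N=9,141) | Free or Reduced Price Lunch | 0.65 |
| Students (N=9,141) | Special Education | 0.11 |
| Students (N=9,141) | English Language Learners | 0.21 |
| Students (N=9,141) | Prior Year State Math Test (Standardized) | 0.08 |
| Students (N=9,141) | Prior Year State ELA Test (Standardized) | 0.07 |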
Table 5: Teacher and student descriptive statistics.

Appendix B Observation Instrument Item Descriptions and Distributions

For each of the observation instruments, the abbreviation codes used in this study are listed with the expanded names in Table 6. The distributions of scores across all items for all rater families are in Figure 6. The CLASS rubric has 12 items on a scale from 1 to 7, rated at 15-minute intervals. The MQI rubric has 13 items on a scale from 1 to 3, rated at 7.5-minute intervals.

Figure 5: Overview of technical details of the two instructional frameworks used for evaluating instruction.
| Abbreviation | Item | Item Description |
|---|---|---|
| MQI Instrument | | |
| ETCA | Enacted Task Cognitive Activation | Task cognitive demand, such as drawing connections among different representations, concepts, or solution methods; identifying and explaining patterns. |
| EXPL | Teacher Explanations | Teacher explanations that give meaning to ideas, procedures, steps, or solution methods. |
| LANGIMP | Imprecision in Language or Notation | Imprecision in language or notation, with regard to mathematical symbols and technical or general mathematical language. |
| LCP† | Lack of Clarity in Presentation of Mathematical Content | Lack of clarity in teachers’ launching of tasks or presentation of the content. |
| LINK | Linking and Connections | Linking and connections of mathematical representations, ideas, and procedures. |
| MAJERR† | Major Mathematical Errors | Major mathematical errors, such as solving problems incorrectly, defining terms incorrectly, forgetting a key condition in a definition, equating two non-identical mathematical terms. |
| MGEN | Developing Mathematical Generalizations | Developing generalizations based on multiple examples. |
| MLANG | Mathematical Language | Mathematical language is dense and precise and is used fluently and consistently. |
| MMETH | Multiple Procedures or Solution Methods | Multiple procedures or solution methods for a single problem. |
| REMED | Remediation of Student Errors and Difficulties | Remediation of student errors and difficulties addressed in a substantive manner. |
| SMQR | Student Mathematical Questioning and Reasoning | Student mathematical questioning and reasoning, such as posing mathematically motivated questions, offering mathematical claims or counterclaims. |
| STEXPL | Students Provide Explanations | Student explanations that give meaning to ideas, procedures, steps, or solution methods. |
| USEPROD | Responding to Student Mathematical Productions | Responding to student mathematical productions in instruction, such as appropriately identifying mathematical insight in specific student questions, comments, or work; building instruction on student ideas or methods. |
| CLASS Instrument | | |
| CLPC | Classroom Positive Climate | |
| CLNC† | Classroom Negative Climate | |
| CLTS | Teacher Sensitivity | |
| CLRSP | Regard for Student Perspective | |
| CLBM | Behavior Management | |
| CLPRDT | Productivity | |
| CLILF | Instructional Learning Formats | |
| CLCU | Content Understanding | |
| CLAPS | Applied Problem Solving | |
| CLQF | Quality of Feedback | |
| CLINSTD | Instructional Dialogue | |
| CLSTENG | Student Engagement | |
Table 6: CLASS and MQI item descriptions and corresponding abbreviations. † denotes items that are reverse coded due to being negatively worded with respect to the construct of teacher ability. The items evaluated by the GPT family of raters and reported by Wang and Demszky are EXPL, LANGIMP, REMED, and SMQR. Each member of the Human and Encoder families of raters evaluated all 25 items.
Figure 6: Distribution of rater scores for each of the 25 instrument items for all rater families.

Appendix C MQI Instrument

C.1 MQI Instrument Properties

For our purposes, the MQI instrument has a few unique properties that warrant further analysis, as the instrument may have some qualitative attributes that may influence human raters.

The MQI ratings are written to identify the presence of a behavior and then, if present, report the magnitude or quality of its presence, doing so repeatedly at regular intervals throughout the lesson (in this case, 7.5 minutes). This shortened window with simpler targets provides an opportunity for training a model for real-time use (rather than an arbitrary interval) to find different features across a single lesson, as shown in Figure 15.

The version of the MQI for which data are available in the NCTE dataset is ternary, in contrast to the current MQI version, which is quaternary. The lowest rating on the ternary MQI scale is a combination of the lowest two ratings on the quaternary, meaning the present data cannot distinguish between whether the attribute described in each item is “Not present” or “Low”. [Footnote 16: There is one exception, which the original authors of the Appendix adjusted for: the USEPROD item is replaced by the MATCON item, with the correction of combining the lowest two categories.] This ternary classification scheme creates non-normal distributions as seen in Figure 6, which will need to inform models and methods during quantitative analysis.
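The collapsing described above can be sketched as a simple mapping. This is only an illustration: the label strings `"mid"` and `"high"` and the integer codes 1 to 3 are assumptions made for the example, not names taken from the MQI documentation.

```python
from collections import Counter

# Hypothetical label names for the four-point (quaternary) scale. Only
# "Not present" and "Low" are named in the text; "mid" and "high" are
# placeholders assumed for this sketch.
QUATERNARY_TO_TERNARY = {
    "not_present": 1,  # merged into the lowest ternary category
    "low": 1,          # merged into the lowest ternary category
    "mid": 2,
    "high": 3,
}


def collapse(labels):
    """Map quaternary labels onto the three-point scale used in the data."""
    return [QUATERNARY_TO_TERNARY[label] for label in labels]


ratings = ["not_present", "low", "low", "mid", "high"]
ternary = collapse(ratings)
# After collapsing, "not_present" and "low" are indistinguishable.
print(Counter(ternary))
```

Because the mapping is many-to-one, it cannot be inverted, which is the information loss the paragraph describes.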

This is unfortunate because these two categories are described as “None” and “Brief content error, instance of imprecision, lack of clarity. Does not obscure the mathematics of the segment,” respectively (for the Errors and Imprecision domain in Hill et al. and the second MQI-only factor in Blazar et al.: MAJERR, LANGIMP, LCP).

C.2 Possible Effects of Negative-worded Items

The MQI is unique in having a separate domain of items that try to capture aspects of poor mathematical instruction. Unlike most items in observation rubrics, the MQI has three items that are worded in the negative direction: higher scores on the MAJERR, LANGIMP, and LCP items indicate worse performance. [Footnote 17: In the analyses of this paper, these will be reverse coded, as will the one negative CLASS item, CLNC.] It is possible that looking for negative attributes may make these items more susceptible to different rater biases. A partial description of the potential impact of this rubric attribute on the LCP item is given in the remainder of this section (Appendix C.2).
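Reverse coding a bounded rating scale is typically done by reflecting each score about the scale midpoint. The sketch below assumes that convention and the scale ranges given in Appendix B (1 to 3 for MQI, 1 to 7 for CLASS); the function and dictionary names are hypothetical and not taken from the study's code.

```python
# Scale bounds for the negatively worded items named in the text.
# The dictionary name and structure are assumptions for this sketch.
NEGATIVE_ITEMS = {
    "MAJERR": (1, 3),
    "LANGIMP": (1, 3),
    "LCP": (1, 3),
    "CLNC": (1, 7),
}


def reverse_code(score, low, high):
    """Reflect a score so that the highest rating becomes the lowest."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside scale [{low}, {high}]")
    return low + high - score


def reverse_item(item, score):
    """Reverse code a score only if the item is negatively worded."""
    if item in NEGATIVE_ITEMS:
        low, high = NEGATIVE_ITEMS[item]
        return reverse_code(score, low, high)
    return score


print(reverse_item("LCP", 3))   # worst clarity becomes the lowest score
print(reverse_item("CLNC", 2))
print(reverse_item("EXPL", 2))  # positively worded items are unchanged
```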

Of note, the LCP item is particularly subjective. The documentation and training provided for the MQI instruct raters: “You have to ask: ‘What, mathematically, was the teacher trying to say?’” This is already problematic, as it asks observers to use their judgment to determine what the teacher was “trying to say.” The subjectivity increases further for observers who may not be as familiar with African-American Vernacular English (AAVE). The item further mixes lack of content clarity (lack of clarity explaining math) with lack of directional clarity (unclear instructions for an activity, which is typically associated with items addressing classroom management), as stated in the MQI rubric:

Teacher’s launch of a task/activity lacks clarity (the “launch” is the teacher’s effort to get the mathematical tasks/activities into play). If the launch is problematic, score for the launch plus amount of time students are confused/off-task/engaging in non-productive explorations…[Example:] Garbling a task launch, e.g., by asking initially “How much TV is watched in the US?” when students really must draw a graph to show “How many TVs in US vs. Europe vs. rest of the world?

Instructing observers to score based on the “amount of time students are confused/off-task/engaging in non-productive explorations” is more likely to capture problems with classroom management and directional lack of clarity than mathematical lack of clarity. This is compounded by the request for raters to guess what the teachers were trying to say and by training instructions that let raters "code Lack of Clarity even with correction". This mix of observational cues and overlapping constructs makes this item particularly susceptible to individual rater biases. [Footnote 18: As a note, the skill of providing clear directions, foundational to establishing a well-managed classroom, is also not included in the CLASS instrument’s "Behavior Management" item, suggesting that neither of these instruments is perfectly designed to address root causes of instructional shortcomings and thus may be inadequate as tools for coaching and developing skills in teachers.]

Indeed, while not reported in this paper explicitly, we identified that one rater in particular rated Black teachers much more harshly on these items, especially on LCP, providing some evidence that some items can be more prone to rater biases, even with research-quality observers and calibration.

C.3 Prior work on Rater Fairness with MQI

Recent work has begun to look at rater biases, including racial bias, in these data and with the MQI instrument. Ji (2023) uses cross-classified mixed effects models for analysis and evaluation, which seek to answer similar questions by combining G-theory and IRT estimations (Briggs and Wilson, 2007). However, the helpfulness of this study is limited by data selection decisions: it eliminates 23% of MQI items (all of the second MQI factor in Blazar et al. (2017)) without explanation; it uses only 21% of available classroom observations (from a single year) and by so doing also eliminates 43% of the study’s raters; it then truncates the class lengths to 45 minutes, thus removing another 20% of the remaining observations; and, when evaluating for differences in teacher race, it combines all non-white races/ethnicities into a single category, removing meaningful inference from the contrast. These decisions to use only 13% of available data would lead to a model with better fit, as all of those removals simplify trends in the data, indirectly suggesting that the mixed effects model constructions used are not robust to the complete set of observations (Murphy and Beretvas, 2015) and are therefore inadequate for our purposes here.
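The source does not spell out how the 13% figure is computed. One reading that reproduces it, offered here only as an assumption, is that the three stated reductions (items, observations, and lesson length) compound multiplicatively:

```python
# Shares of the data retained after each selection decision, taken from
# the percentages stated in the paragraph above.
item_share = 1 - 0.23         # 23% of MQI items eliminated
observation_share = 0.21      # only 21% of observations used
length_share = 1 - 0.20       # truncation removes another 20%

# Assumption: the reductions are independent and multiply together.
retained = item_share * observation_share * length_share
print(f"{retained:.1%}")  # prints 12.9%
```

Under this reading the retained share rounds to the 13% quoted in the text.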

Figure 7: Model Pipeline: General sentence-encoder model architecture.

Appendix D Encoder Family Construction

Pretraining and training/fine-tuning regimes can have significant effects on model performance (D’Amour et al., 2020), so our family of models sought to exploit this by using three different pretrainings for sentence-level embeddings and including variations on training regimes (e.g., different checkpoints); a summary of these variations can be found in Table 7. Thus, the encoder family of models designed for this study share the same architecture, training, and held-out test sets, differing only as outlined in Table 7. [Footnote 19: One model, "un2", has a slightly different architecture, differing in the number of attention heads.]

[Another forthcoming paper to be under review] explores this protocol in greater depth, showing that the extreme training and treatment of data noise can achieve SOTA and "super-human" results on a variety of sentence embedding pretrainings, with a more complete set of training

| Model | Pretrained Embedding Layer | Attn. Heads | Train Epochs | Dropout |
|---|---|---|---|---|
| un1 | Unsupervised SimCSE (Gao, 2022) | 32 | 3 | 75 |
| un2 | Unsupervised SimCSE (Gao, 2022) | 16 | 4 | 75 |
| un3 | Unsupervised SimCSE (Gao, 2022) | 32 | 8 | 75 |
| e5 | E5 (Wang et al., 2022) | 32 | 2 | 15 |
| gte | GTE (Li et al., 2023) | 32 | 4 | 65 |
Table 7: Encoder Within-family differences: Summary of basic differences within the Encoder family of models. Detailed information about training and architecture can be found in Appendix D.3.

All results were run on a completely held out test set (Figure 8) of entire classroom transcripts. No analyses were conducted using the held-out test set until after all models in the model family were trained, thus preserving the integrity of the study.

| GPT Model | Name | Prompt Info | Output |
|---|---|---|---|
| N | Numeric | Item Overview | Single Number |
| ND | Numeric w/ Description | Rubric Descriptions of Score Categories | Single Number |
| NR | Numeric after Reasoning | Item Overview and CoT instructions | Reasoning and Number |
Table 8: GPT Within-family model differences: Details for the GPT/Decoder models can be found in the original paper (Wang and Demszky, 2023).

D.0.1 Encoder Model Preprocessing

As mentioned in Section 4, preprocessing of the transcript data was intentionally minimal, replacing bracketed transcription notes (e.g., [cross-talk]) with [inaudible]. For this study, the transcript was not annotated to denote whether a teacher or a student is speaking, to reflect the broadest future use case of general classroom microphones. In other words, this family of models does not know who is speaking, and the results of this decision are evident in the models’ relative underperformance on two MQI items that distinguish between teacher explanations (EXPL) and student explanations (STEXPL). This trend might also be evident in the validity demonstration in Figure 15, where models may be responding nearly identically to, or failing to distinguish between, these two items.

To align transcribed class segments to human observation ratings, transcripts were equipartitioned at the word level across the maximum number of lesson segments for which there were human annotations available, and estimated timestamps were assigned to sentences by linear interpolation weighted by word count.
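The two preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the study's code: the function names are hypothetical, and it assumes that lesson length equals the number of rated segments times the 7.5-minute MQI interval and that whitespace splitting is an adequate word count.

```python
import re

# Matches any bracketed transcription note, e.g. "[cross-talk]".
BRACKETED_NOTE = re.compile(r"\[[^\]]*\]")


def normalize_notes(text):
    """Replace every bracketed transcription note with [inaudible]."""
    return BRACKETED_NOTE.sub("[inaudible]", text)


def align_sentences(sentences, n_segments, segment_minutes=7.5):
    """Estimate a start time and a segment index for each sentence.

    Words are spread evenly over the lesson, so a sentence's start time is
    interpolated linearly from the number of words spoken before it, and
    each segment receives an equal share of the words.
    """
    counts = [len(sentence.split()) for sentence in sentences]
    total = sum(counts)
    if total == 0 or n_segments < 1:
        raise ValueError("need at least one word and one segment")
    lesson_minutes = n_segments * segment_minutes  # assumed lesson length
    aligned = []
    words_before = 0
    for sentence, count in zip(sentences, counts):
        start = lesson_minutes * words_before / total
        segment = min(words_before * n_segments // total, n_segments - 1)
        aligned.append((sentence, start, segment))
        words_before += count
    return aligned


cleaned = normalize_notes("Okay [cross-talk] so what is one half?")
rows = align_sentences(["a b", "c d", "e f", "g h"], n_segments=2)
```

Longer sentences advance the estimated clock further, which is what weighting the interpolation by word count amounts to in this sketch.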

D.1 Sentence-level Embeddings

One key difference from other studies using these same transcripts is the choice to parse the utterances at the sentence level. Sentences, rather than individual words or long, uninterrupted utterances, are the key unit of meaning for interpretability of models for classroom discourse. The downstream tasks are a key reason for this choice: sentence-level parsing anticipates meaningful feature attribution studies (Sundararajan et al., 2017) to further investigate construct validity.

Parsing at the sentence level both augments the total number of unique observations in the data and, by creating more standardization in sequence lengths prior to sentence-embedding, the variation in the density of semantic information is reduced.

The model takes as input an approximately 12-minute rolling window of class text (stepping at each sentence) and simultaneously predicts, for that time window, rounded rolling-average scores for each of the 12 CLASS dimensions and 13 of the MQI dimensions. Each model is therefore multi-task, predicting all 25 scores simultaneously. This multi-task training takes advantage of the interrelated skills of teaching that may be implicit in human ratings. Over one million unique observations were generated from fewer than 1,600 unique classroom transcripts, with each rolling window representing one observation. Training/validation/test splits of this data were 75/15/10, stratified at the classroom level.
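The construction of rolling-window targets can be sketched as follows. This is an illustration under assumptions rather than the study's pipeline: the window is taken to be trailing (ending at the current sentence), scores are for a single rubric item, rounding is half-up, and the name `rolling_window_targets` is hypothetical.

```python
from typing import List


def rolling_window_targets(
    times_min: List[float], scores: List[int], window_min: float = 12.0
) -> List[int]:
    """One target per sentence: the rounded mean score over the trailing window.

    times_min must be sorted ascending; scores[i] is the human score of the
    lesson segment that sentence i falls in.
    """
    targets = []
    start = 0
    for end, t_end in enumerate(times_min):
        # Drop sentences that fall out of the trailing window.
        while times_min[start] < t_end - window_min:
            start += 1
        window = scores[start : end + 1]
        mean = sum(window) / len(window)
        targets.append(int(mean + 0.5))  # round half up (scores are positive)
    return targets


if __name__ == "__main__":
    print(rolling_window_targets([0, 5, 10, 15, 20], [1, 1, 3, 3, 3]))
```

Because the window advances one sentence at a time, neighbouring observations overlap heavily, which is consistent with the large observation count and with the later emphasis on strong regularization.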

Classroom transcripts are extremely long, with thousands of sentences, and with classes having tokens in the hundreds of thousands. Sentence-level inputs could capture the relationship between something a teacher says and something a student says five minutes later without incurring large costs associated with sequence length. These long-range dependencies are needed to identify some of the instructional constructs being measured.

Raw class transcripts also have a lot of noise: content that is unrelated to any of the tasks, including fillers, self-corrections, interruptions and self-interruptions, sentences that are partially repeated or emphasized, text that requires being able to refer to a visual cue in the classroom, etc. While sentence level embeddings lose information relative to subword tokenizations, this loss of information may mitigate disproportionate effects of idiosyncratic speaking styles.

D.1.1 Embedding Model Selection

To save on compute, static embeddings were pre-computed. Because the transcript data are very noisy, sentence embeddings must be used with care, as they decrease the completeness of the information captured. We tested sentence-level embeddings from several pretrained embedding models accessed through Huggingface, on a subset of the training data and for a small random selection of target measures:

  • unsup-simcse-roberta-large: from princeton-nlp (Gao, 2022), pretrained using unsupervised contrastive sentence representations.

  • sup-simcse-roberta-large: from princeton-nlp (Gao, 2022), pretrained using supervised training. At the writing of this paper, we did not yet have a converged model with reportable results.

  • e5-large-v2: from intfloat (Wang et al., 2022), pretrained using weakly supervised contrastive sentence representations with sentence-pair training.

  • gte-large: from thenlper (Li et al., 2023), pretrained using multistage contrastive sentence representations.

The other three models had significantly reduced performance compared to our sentence embedding model of choice, unsupervised SimCSE (Gao, 2022), which uses unsupervised contrastive learning to improve sentence-level representations.

D.2 Model Architecture

D.3 Encoder Model Training and Description

Models were built and trained in PyTorch, largely based on the available Encoder modules (https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html). Each model was trained on a single L4 GPU in Google Colab, and each epoch took about 4.25 hours. The architecture was as follows:

  • 8 transformer encoder layers

  • 25 classifier heads in total, one per task, each with a single dense layer (with two objective functions per head, this results in 50 loss calculations being backpropagated).

  • All encoder layer parameters are shared by objectives, but the trainable parameters of the single dense layer classification heads are specific to each item.

  • Attention heads: 32. A large amount of semantic information needed to be extracted from within each embedding and its neighbors, which supported an increase in the number of heads in the multi-head self-attention mechanism.

  • Hidden dimension: 2048

D.3.1 Preventing Overfit within the Model

An abnormally high dropout rate of 0.75 was the primary regularization technique to avoid overfitting in a noisy, repetitively augmented dataset with non-gold labels. Other training settings were as follows:

  • Optimizer: Adamax. Defined in the original paper by Kingma and Ba (2017), this is a variant of Adam that replaces the L2 norm of the gradients with the L-infinity norm, which provides stability with the sparse gradients resulting from the dropout. Additionally, its initial momentum and second-derivative momentum were limited slightly, to 0.78 and 0.9 respectively, to prevent overfitting at the cost of increased training time; weight decay was similarly increased to 0.0003.

  • Learning Rate: initial learning rate was set to 2.5e-5, within the learning rate schedule seen below.

  • Gradient clipping: set to 4 (instead of the typical 1). We did not want an unusual batch to explode, but we also recognized the need to capture as much information as possible from the optimizer, given that dropout was the primary regularization accounting for the high level of repetition in the augmented transcript windows.

  • Learning rate schedule: Using chaining, the schedule began with a 1,000-step linear warmup ramp from zero, followed by exponential decay (with gamma = 0.9995), multiplied with CosineAnnealingWarmRestarts from PyTorch (https://pytorch.org/), scheduled so that annealing cycles cut frequency by a third each time. We have initial data to suggest that using a cyclic learning rate improves model performance, but we did not sufficiently ablate this additional level of complexity to claim whether the models would still learn effectively without it.

  • Loss functions: In addition to cross-entropy loss, we use a custom loss function implementing quadratic weighted kappa loss with fuzzy labels/label smoothing set at 0.2, to increase noise around the unreliable human ratings.
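The shape of the learning rate schedule in the list above can be sketched in plain Python. This is an illustrative reading, not the training code: the base rate, warmup length, and gamma come from the text, while `FIRST_CYCLE`, `CYCLE_MULT`, the choice to start decay after warmup, and the interpretation of "cutting frequency by a third" as tripling the cycle length are all assumptions.

```python
import math

BASE_LR = 2.5e-5     # initial learning rate (from the text)
WARMUP_STEPS = 1000  # linear ramp length (from the text)
GAMMA = 0.9995       # exponential decay factor (from the text)
FIRST_CYCLE = 4000   # hypothetical: length of the first cosine cycle, in steps
CYCLE_MULT = 3       # hypothetical: each cycle is three times longer than the last


def warmup_factor(step: int) -> float:
    """Linear ramp from 0 to 1 over the warmup steps."""
    return min(1.0, step / WARMUP_STEPS)


def decay_factor(step: int) -> float:
    """Exponential decay, assumed here to begin once warmup has finished."""
    return GAMMA ** max(0, step - WARMUP_STEPS)


def cosine_restart_factor(step: int) -> float:
    """Cosine annealing from 1 to 0 within a cycle, restarting at 1 each cycle."""
    t_cur, t_i = step, FIRST_CYCLE
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= CYCLE_MULT
    return 0.5 * (1.0 + math.cos(math.pi * t_cur / t_i))


def learning_rate(step: int) -> float:
    """Product of the three factors, mirroring multiplicative chaining."""
    return BASE_LR * warmup_factor(step) * decay_factor(step) * cosine_restart_factor(step)


if __name__ == "__main__":
    for s in (0, 500, 1000, 3999, 4000, 8000):
        print(s, f"{learning_rate(s):.3e}")
```

Under these assumptions the rate peaks shortly after warmup, falls toward zero at the end of each cosine cycle, and restarts at a level that the exponential factor keeps lowering over time.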

D.4 Encoder Model Test Set

The label distributions for the held-out test set for the Encoder models, compared to the training/development data, can be found in Figure 8.

Figure 8: Test set label distributions compared to training and development sets, based on all human rater labels.

Appendix E GPT Model Family

E.1 Model construction

Detailed descriptions of the three models and the data generated by them can be found in the original paper and accompanying websites (Wang and Demszky, 2023), which provide examples of how the three models differ. The automated rating data was retrieved from https://github.com/rosewang2008/zero-shot-teacher-feedback/tree/main. A brief summary of those differences can be found in Table 8.

E.1.1 GPT Model Preprocessing

In contrast to the Encoder model preprocessing, a preliminary analysis was conducted by Wang and Demszky to identify the highest-quality 7.5-minute segments available in the dataset, as defined by the fewest transcriber notes. The models are provided the discourse from these selections, along with information about the subset of items for which they provide ratings, including four items from the MQI (EXPL, LANGIMP, REMED, SMQR).

Appendix F Reliability Metrics

ICC calculations were reproduced using the following multilevel model, where lesson $l$ scores for each rubric item are nested within teachers $k$:

$$ITEM_{lk} = \beta_0 + \mu_k + \varepsilon_{lk}, \tag{11}$$

and then calculate the ICC and Adjusted ICC

$$ICC = \frac{\operatorname{var}(\mu_k)}{\operatorname{var}(\mu_k) + \frac{\operatorname{var}(\varepsilon_{lk})}{n_l}}, \tag{12}$$

where $n_l = 1$ for ICC and $n_l = 6$ for Adjusted ICC, following the original study. Full results of human baselines and comparisons against the various models can be found in Appendix F.1.
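As a worked check of Eq. 12, the sketch below computes the ICC from two variance components and converts a single-lesson ICC into its adjusted form. The function names are hypothetical, and the variance components in the example are normalized values implied by a reported ICC rather than estimates taken from the study.

```python
def icc(var_teacher: float, var_residual: float, n_lessons: int = 1) -> float:
    """Eq. 12: teacher variance over teacher variance plus residual variance / n_l."""
    return var_teacher / (var_teacher + var_residual / n_lessons)


def adjusted_from_single(icc_single: float, n_lessons: int = 6) -> float:
    """Same quantity written in terms of the single-lesson ICC.

    Dividing the numerator and denominator of Eq. 12 by the total variance
    gives n * ICC / (1 + (n - 1) * ICC).
    """
    return n_lessons * icc_single / (1.0 + (n_lessons - 1) * icc_single)


if __name__ == "__main__":
    # A single-lesson ICC of 0.15 implies normalized components 0.15 and 0.85.
    print(round(icc(0.15, 0.85, n_lessons=6), 2))
    print(round(adjusted_from_single(0.17), 2))
```

With a single-lesson ICC of 0.15 the adjusted value is about 0.51, and with 0.17 it is about 0.55, which agrees with the human ICC and AdjICC entries for LINK and EXPL in the full results table up to rounding of the reported values.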

F.1 Full Results

Table 9 contains the full results referenced in Section 5.1. The metric symbols in the table are as follows: C’s $\kappa$: Cohen’s $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; Agr±1: percent agreement within one category; ICC and AdjICC: intraclass correlation and adjusted intraclass correlation (with nested staging as in Eq. 12); $r$: Pearson’s correlation; $\rho$: Spearman’s rank correlation; $\tau$: Kendall’s rank correlation. The suffixes .low and .hi denote the lower and upper bounds of the 95% confidence interval ($\alpha = 0.05$), respectively. These results and full results for CLASS items can be found online (https://github.com/hardy-education/LLM-Psychometrics).
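For readers less familiar with the agreement metrics, the following sketch computes exact agreement, agreement within one category, Cohen’s kappa, and quadratic weighted kappa for two raters. It is a textbook formulation written for illustration, not the code used to produce the table; the function names and the toy ratings are assumptions.

```python
from typing import List, Sequence


def agreement_rates(a: Sequence[int], b: Sequence[int]) -> tuple:
    """Return (exact agreement, agreement within one category)."""
    n = len(a)
    exact = sum(x == y for x, y in zip(a, b)) / n
    adjacent = sum(abs(x - y) <= 1 for x, y in zip(a, b)) / n
    return exact, adjacent


def weighted_kappa(
    a: Sequence[int], b: Sequence[int], categories: List[int], quadratic: bool
) -> float:
    """Cohen's kappa (quadratic=False) or quadratic weighted kappa (quadratic=True).

    Computed as 1 - (weighted observed disagreement / weighted expected
    disagreement). Undefined when both raters use a single identical category.
    """
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = float((i - j) ** 2) if quadratic else float(i != j)
            num += weight * observed[i][j]
            den += weight * row[i] * col[j]
    return 1.0 - num / den


if __name__ == "__main__":
    rater_a = [1, 2, 3, 1, 2, 3]
    rater_b = [1, 2, 3, 1, 2, 2]
    print(agreement_rates(rater_a, rater_b))
    print(weighted_kappa(rater_a, rater_b, [1, 2, 3], quadratic=False))
    print(weighted_kappa(rater_a, rater_b, [1, 2, 3], quadratic=True))
```

The toy example shows why the metrics can diverge: a single near-miss lowers Cohen’s kappa more than the quadratic weighted version, because the latter penalizes disagreements by squared distance between categories.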

Table 9: Full Agreement Metrics

| Instrument | Item | Metric | Human | Encoders | un1 | un2 | un3 | gte | e5 | GPTs | N | ND | NR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MQI | LINK | C’s $\kappa$ | 0.31 | 0.39 | 0.41 | 0.33 | 0.44 | 0.39 | 0.39 | | | | |
| MQI | LINK | QWK | 0.41 | 0.58 | 0.6 | 0.55 | 0.62 | 0.56 | 0.56 | | | | |
| MQI | LINK | %Agr | 0.7 | 0.73 | 0.74 | 0.71 | 0.75 | 0.71 | 0.71 | | | | |
| MQI | LINK | Agr±1 | 0.97 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | | | | |
| MQI | LINK | $r$ | 0.41 | 0.58 | 0.61 | 0.56 | 0.63 | 0.56 | 0.56 | | | | |
| MQI | LINK | $r$.low | 0.39 | 0.57 | 0.57 | 0.51 | 0.59 | 0.52 | 0.52 | | | | |
| MQI | LINK | $r$.hi | 0.42 | 0.6 | 0.64 | 0.6 | 0.66 | 0.6 | 0.6 | | | | |
| MQI | LINK | $\rho$ | 0.41 | 0.57 | 0.6 | 0.53 | 0.61 | 0.54 | 0.54 | | | | |
| MQI | LINK | $\rho$.low | 0.4 | 0.55 | 0.56 | 0.48 | 0.58 | 0.5 | 0.5 | | | | |
| MQI | LINK | $\rho$.hi | 0.43 | 0.58 | 0.64 | 0.58 | 0.65 | 0.58 | 0.58 | | | | |
| MQI | LINK | $\tau$ | 0.4 | 0.54 | 0.57 | 0.51 | 0.59 | 0.51 | 0.51 | | | | |
| MQI | LINK | $\tau$.low | 0.38 | 0.52 | 0.53 | 0.46 | 0.55 | 0.47 | 0.47 | | | | |
| MQI | LINK | $\tau$.hi | 0.41 | 0.56 | 0.61 | 0.56 | 0.62 | 0.56 | 0.56 | | | | |
| MQI | LINK | ICC | 0.15 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | | | | |
| MQI | LINK | AdjICC | 0.51 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | | | | |
| MQI | EXPL | C’s $\kappa$ | 0.23 | 0.25 | 0.25 | 0.28 | 0.23 | 0.24 | 0.24 | 0.03 | 0.01 | 0.07 | 0.01 |
| MQI | EXPL | QWK | 0.28 | 0.43 | 0.46 | 0.42 | 0.44 | 0.4 | 0.4 | 0.01 | 0.01 | 0.06 | -0.01 |
| MQI | EXPL | %Agr | 0.7 | 0.72 | 0.72 | 0.69 | 0.72 | 0.72 | 0.72 | 0.31 | 0.31 | 0.42 | 0.15 |
| MQI | EXPL | Agr±1 | 0.98 | 0.97 | 0.97 | 0.97 | 0.96 | 0.97 | 0.97 | 0.86 | 0.95 | 0.9 | 0.67 |
| MQI | EXPL | $r$ | 0.28 | 0.44 | 0.48 | 0.43 | 0.47 | 0.41 | 0.41 | 0.03 | 0.03 | 0.09 | -0.03 |
| MQI | EXPL | $r$.low | 0.26 | 0.42 | 0.44 | 0.37 | 0.42 | 0.36 | 0.36 | -0.03 | -0.07 | -0.01 | -0.14 |
| MQI | EXPL | $r$.hi | 0.29 | 0.46 | 0.52 | 0.48 | 0.51 | 0.46 | 0.46 | 0.08 | 0.13 | 0.19 | 0.09 |
| MQI | EXPL | $\rho$ | 0.27 | 0.42 | 0.46 | 0.41 | 0.46 | 0.39 | 0.39 | 0.03 | 0.03 | 0.1 | -0.03 |
| MQI | EXPL | $\rho$.low | 0.25 | 0.4 | 0.42 | 0.35 | 0.41 | 0.34 | 0.34 | -0.03 | -0.07 | 0 | -0.14 |
| MQI | EXPL | $\rho$.hi | 0.29 | 0.44 | 0.51 | 0.46 | 0.5 | 0.43 | 0.43 | 0.09 | 0.13 | 0.19 | 0.08 |
| MQI | EXPL | $\tau$ | 0.26 | 0.41 | 0.45 | 0.39 | 0.44 | 0.38 | 0.38 | 0.03 | 0.03 | 0.09 | -0.03 |
| MQI | EXPL | $\tau$.low | 0.25 | 0.39 | 0.4 | 0.33 | 0.4 | 0.32 | 0.32 | -0.03 | -0.07 | -0.01 | -0.14 |
| MQI | EXPL | $\tau$.hi | 0.28 | 0.43 | 0.49 | 0.45 | 0.49 | 0.42 | 0.42 | 0.08 | 0.12 | 0.19 | 0.08 |
| MQI | EXPL | ICC | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
| MQI | EXPL | AdjICC | 0.55 | 0.56 | 0.56 | 0.56 | 0.56 | 0.56 | 0.56 | 0.55 | 0.55 | 0.55 | 0.55 |
| MQI | MMETH | C’s $\kappa$ | 0.42 | 0.33 | 0.46 | 0.39 | 0.33 | 0.27 | 0.27 | | | | |
| MQI | MMETH | QWK | 0.47 | 0.49 | 0.48 | 0.53 | 0.54 | 0.46 | 0.46 | | | | |
| MQI | MMETH | %Agr | 0.85 | 0.82 | 0.88 | 0.86 | 0.84 | 0.78 | 0.78 | | | | |
| MQI | MMETH | Agr±1 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.97 | 0.97 | | | | |
| MQI | MMETH | $r$ | 0.47 | 0.52 | 0.51 | 0.58 | 0.57 | 0.51 | 0.51 | | | | |
| MQI | MMETH | $r$.low | 0.46 | 0.5 | 0.46 | 0.53 | 0.53 | 0.47 | 0.47 | | | | |
| MQI | MMETH | $r$.hi | 0.49 | 0.54 | 0.55 | 0.62 | 0.61 | 0.56 | 0.56 | | | | |
| MQI | MMETH | $\rho$ | 0.47 | 0.5 | 0.51 | 0.57 | 0.57 | 0.48 | 0.48 | | | | |
| MQI | MMETH | $\rho$.low | 0.45 | 0.49 | 0.47 | 0.52 | 0.53 | 0.43 | 0.43 | | | | |
| MQI | MMETH | $\rho$.hi | 0.48 | 0.52 | 0.55 | 0.61 | 0.61 | 0.52 | 0.52 | | | | |
| MQI | MMETH | $\tau$ | 0.46 | 0.49 | 0.51 | 0.56 | 0.56 | 0.46 | 0.46 | | | | |
| MQI | MMETH | $\tau$.low | 0.45 | 0.47 | 0.46 | 0.52 | 0.52 | 0.42 | 0.42 | | | | |
| MQI | MMETH | $\tau$.hi | 0.48 | 0.51 | 0.55 | 0.61 | 0.6 | 0.51 | 0.51 | | | | |
| MQI | MMETH | ICC | 0.15 | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | | | | |
| MQI | MMETH | AdjICC | 0.52 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 | | | | |
| MQI | MGEN | C’s $\kappa$ | 0.15 | 0.26 | 0.27 | 0.32 | 0.27 | 0.24 | 0.24 | | | | |
| MQI | MGEN | QWK | 0.19 | 0.34 | 0.34 | 0.48 | 0.34 | 0.29 | 0.29 | | | | |
| MQI | MGEN | %Agr | 0.95 | 0.95 | 0.96 | 0.96 | 0.96 | 0.94 | 0.94 | | | | |
| MQI | MGEN | Agr±1 | 0.99 | 1 | 1 | 1 | 1 | 0.99 | 0.99 | | | | |
| MQI | MGEN | $r$ | 0.19 | 0.34 | 0.37 | 0.48 | 0.37 | 0.29 | 0.29 | | | | |
| MQI | MGEN | $r$.low | 0.18 | 0.32 | 0.32 | 0.43 | 0.32 | 0.24 | 0.24 | | | | |
| MQI | MGEN | $r$.hi | 0.21 | 0.37 | 0.42 | 0.53 | 0.42 | 0.34 | 0.34 | | | | |
| MQI | MGEN | $\rho$ | 0.19 | 0.32 | 0.34 | 0.42 | 0.34 | 0.28 | 0.28 | | | | |
| MQI | MGEN | $\rho$.low | 0.17 | 0.29 | 0.29 | 0.37 | 0.29 | 0.22 | 0.22 | | | | |
| MQI | MGEN | $\rho$.hi | 0.2 | 0.34 | 0.39 | 0.48 | 0.39 | 0.33 | 0.33 | | | | |
| MQI | MGEN | $\tau$ | 0.18 | 0.32 | 0.34 | 0.42 | 0.34 | 0.27 | 0.27 | | | | |
| MQI | MGEN | $\tau$.low | 0.17 | 0.29 | 0.29 | 0.37 | 0.29 | 0.22 | 0.22 | | | | |
| MQI | MGEN | $\tau$.hi | 0.2 | 0.34 | 0.39 | 0.48 | 0.39 | 0.33 | 0.33 | | | | |
| MQI | MGEN | ICC | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | | | | |
| MQI | MGEN | AdjICC | 0.19 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | | | | |
| MQI | MLANG | C’s $\kappa$ | 0.23 | 0.37 | 0.4 | 0.44 | 0.43 | 0.31 | 0.31 | | | | |
| MQI | MLANG | QWK | 0.33 | 0.48 | 0.49 | 0.55 | 0.51 | 0.44 | 0.44 | | | | |
| MQI | MLANG | %Agr | 0.59 | 0.65 | 0.68 | 0.69 | 0.7 | 0.6 | 0.6 | | | | |
| MQI | MLANG | Agr±1 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | | | | |
| MQI | MLANG | $r$ | 0.33 | 0.48 | 0.5 | 0.55 | 0.52 | 0.46 | 0.46 | | | | |
| MQI | MLANG | $r$.low | 0.32 | 0.46 | 0.45 | 0.5 | 0.47 | 0.41 | 0.41 | | | | |
| MQI | MLANG | $r$.hi | 0.35 | 0.5 | 0.54 | 0.59 | 0.56 | 0.5 | 0.5 | | | | |
| MQI | MLANG | $\rho$ | 0.32 | 0.47 | 0.48 | 0.54 | 0.5 | 0.43 | 0.43 | | | | |
| MQI | MLANG | $\rho$.low | 0.31 | 0.45 | 0.43 | 0.49 | 0.46 | 0.39 | 0.39 | | | | |
| MQI | MLANG | $\rho$.hi | 0.34 | 0.49 | 0.52 | 0.59 | 0.55 | 0.48 | 0.48 | | | | |
| MQI | MLANG | $\tau$ | 0.31 | 0.45 | 0.46 | 0.52 | 0.49 | 0.41 | 0.41 | | | | |
| MQI | MLANG | $\tau$.low | 0.29 | 0.43 | 0.42 | 0.47 | 0.45 | 0.36 | 0.36 | | | | |
| MQI | MLANG | $\tau$.hi | 0.33 | 0.47 | 0.51 | 0.57 | 0.53 | 0.46 | 0.46 | | | | |
| MQI | MLANG | ICC | 0.08 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | | | | |
| MQI | MLANG | AdjICC | 0.34 | 0.36 | 0.36 | 0.36 | 0.36 | 0.36 | 0.36 | | | | |
| MQI | REMED | C’s $\kappa$ | 0.27 | 0.3 | 0.27 | 0.34 | 0.35 | 0.27 | 0.27 | -0.01 | -0.01 | 0 | 0 |
| MQI | REMED | QWK | 0.32 | 0.44 | 0.44 | 0.52 | 0.42 | 0.42 | 0.42 | 0.02 | 0 | 0.06 | 0.02 |
| MQI | REMED | %Agr | 0.66 | 0.69 | 0.68 | 0.68 | 0.74 | 0.67 | 0.67 | 0.16 | 0.1 | 0.27 | 0.08 |
| MQI | REMED | Agr±1 | 0.96 | 0.96 | 0.94 | 0.96 | 0.99 | 0.96 | 0.96 | 0.62 | 0.54 | 0.81 | 0.48 |
| MQI | REMED | $r$ | 0.32 | 0.44 | 0.46 | 0.52 | 0.45 | 0.42 | 0.42 | 0.05 | -0.01 | 0.11 | 0.11 |
| MQI | REMED | $r$.low | 0.31 | 0.42 | 0.41 | 0.47 | 0.4 | 0.37 | 0.37 | 0 | -0.11 | 0.01 | -0.01 |
| MQI | REMED | $r$.hi | 0.34 | 0.47 | 0.5 | 0.57 | 0.49 | 0.47 | 0.47 | 0.11 | 0.09 | 0.21 | 0.22 |
| MQI | REMED | $\rho$ | 0.32 | 0.42 | 0.44 | 0.49 | 0.44 | 0.38 | 0.38 | 0.06 | 0 | 0.12 | 0.09 |
| MQI | REMED | $\rho$.low | 0.31 | 0.4 | 0.39 | 0.43 | 0.4 | 0.33 | 0.33 | 0 | -0.1 | 0.02 | -0.02 |
| MQI | REMED | $\rho$.hi | 0.34 | 0.44 | 0.48 | 0.54 | 0.49 | 0.43 | 0.43 | 0.12 | 0.1 | 0.22 | 0.2 |
| MQI | REMED | $\tau$ | 0.31 | 0.4 | 0.42 | 0.46 | 0.43 | 0.37 | 0.37 | 0.06 | 0 | 0.11 | 0.09 |
| MQI | REMED | $\tau$.low | 0.3 | 0.38 | 0.37 | 0.41 | 0.39 | 0.32 | 0.32 | 0 | -0.1 | 0.02 | -0.02 |
| MQI | REMED | $\tau$.hi | 0.33 | 0.42 | 0.46 | 0.51 | 0.48 | 0.41 | 0.41 | 0.11 | 0.1 | 0.21 | 0.2 |
| MQI | REMED | ICC | 0.16 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.14 | 0.14 | 0.14 | 0.14 |
| MQI | REMED | AdjICC | 0.53 | 0.55 | 0.55 | 0.55 | 0.55 | 0.55 | 0.55 | 0.5 | 0.5 | 0.5 | 0.5 |
| MQI | USEPROD | C’s $\kappa$ | 0.25 | 0.3 | 0.28 | 0.32 | 0.3 | 0.31 | 0.31 | | | | |
| MQI | USEPROD | QWK | 0.33 | 0.46 | 0.44 | 0.5 | 0.46 | 0.46 | 0.46 | | | | |
| MQI | USEPROD | %Agr | 0.76 | 0.75 | 0.74 | 0.8 | 0.75 | 0.74 | 0.74 | | | | |
| MQI | USEPROD | Agr±1 | 0.98 | 0.95 | 0.93 | 0.97 | 0.94 | 0.95 | 0.95 | | | | |
| MQI | USEPROD | $r$ | 0.33 | 0.49 | 0.48 | 0.5 | 0.5 | 0.49 | 0.49 | | | | |
| MQI | USEPROD | $r$.low | 0.32 | 0.47 | 0.43 | 0.45 | 0.46 | 0.45 | 0.45 | | | | |
| MQI | USEPROD | $r$.hi | 0.35 | 0.51 | 0.52 | 0.55 | 0.54 | 0.53 | 0.53 | | | | |
| MQI | USEPROD | $\rho$ | 0.31 | 0.47 | 0.47 | 0.45 | 0.49 | 0.46 | 0.46 | | | | |
| MQI | USEPROD | $\rho$.low | 0.29 | 0.45 | 0.42 | 0.39 | 0.45 | 0.42 | 0.42 | | | | |
| MQI | USEPROD | $\rho$.hi | 0.32 | 0.49 | 0.51 | 0.5 | 0.53 | 0.51 | 0.51 | | | | |
| MQI | USEPROD | $\tau$ | 0.3 | 0.45 | 0.45 | 0.44 | 0.48 | 0.45 | 0.45 | | | | |
| MQI | USEPROD | $\tau$.low | 0.29 | 0.43 | 0.41 | 0.38 | 0.43 | 0.4 | 0.4 | | | | |
| MQI | USEPROD | $\tau$.hi | 0.32 | 0.47 | 0.5 | 0.49 | 0.52 | 0.49 | 0.49 | | | | |
| MQI | USEPROD | ICC | 0.24 | 0.23 | 0.23 | 0.23 | 0.23 | 0.23 | 0.23 | | | | |
| MQI | USEPROD | AdjICC | 0.65 | 0.64 | 0.64 | 0.64 | 0.64 | 0.64 | 0.64 | | | | |
| MQI | MAJERR | C’s $\kappa$ | 0.24 | 0.22 | 0.27 | 0.26 | 0.22 | 0.19 | 0.19 | | | | |
| MQI | MAJERR | QWK | 0.28 | 0.35 | 0.35 | 0.45 | 0.43 | 0.29 | 0.29 | | | | |
| MQI | MAJERR | %Agr | 0.91 | 0.9 | 0.92 | 0.91 | 0.92 | 0.87 | 0.87 | | | | |
| MQI | MAJERR | Agr±1 | 0.99 | 0.99 | 1 | 0.99 | 0.99 | 0.98 | 0.98 | | | | |
| MQI | MAJERR | $r$ | 0.28 | 0.36 | 0.38 | 0.45 | 0.44 | 0.31 | 0.31 | | | | |
| MQI | MAJERR | $r$.low | 0.26 | 0.34 | 0.33 | 0.4 | 0.39 | 0.26 | 0.26 | | | | |
| MQI | MAJERR | $r$.hi | 0.29 | 0.38 | 0.43 | 0.5 | 0.49 | 0.36 | 0.36 | | | | |
| MQI | MAJERR | $\rho$ | 0.28 | 0.31 | 0.34 | 0.43 | 0.38 | 0.27 | 0.27 | | | | |
| MQI | MAJERR | $\rho$.low | 0.26 | 0.29 | 0.28 | 0.37 | 0.33 | 0.21 | 0.21 | | | | |
| MQI | MAJERR | $\rho$.hi | 0.29 | 0.33 | 0.39 | 0.48 | 0.43 | 0.32 | 0.32 | | | | |
| MQI | MAJERR | $\tau$ | 0.28 | 0.31 | 0.33 | 0.42 | 0.37 | 0.26 | 0.26 | | | | |
| MQI | MAJERR | $\tau$.low | 0.26 | 0.28 | 0.28 | 0.36 | 0.32 | 0.21 | 0.21 | | | | |
| MQI | MAJERR | $\tau$.hi | 0.29 | 0.33 | 0.38 | 0.47 | 0.42 | 0.32 | 0.32 | | | | |
| MQI | MAJERR | ICC | 0.1 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | | | | |
| MQI | MAJERR | AdjICC | 0.39 | 0.29 | 0.29 | 0.29 | 0.29 | 0.29 | 0.29 | | | | |
| MQI | LANGIMP | C’s $\kappa$ | 0.25 | 0.2 | 0.32 | 0.21 | 0.21 | 0.15 | 0.15 | 0 | 0 | -0.03 | 0.03 |
| MQI | LANGIMP | QWK | 0.29 | 0.34 | 0.36 | 0.43 | 0.39 | 0.29 | 0.29 | -0.01 | -0.01 | -0.05 | 0.03 |
| MQI | LANGIMP | %Agr | 0.8 | 0.8 | 0.86 | 0.81 | 0.83 | 0.75 | 0.75 | 0.32 | 0.25 | 0.38 | 0.33 |
| MQI | LANGIMP | Agr±1 | 0.99 | 0.98 | 1 | 0.99 | 0.98 | 0.97 | 0.97 | 0.98 | 0.97 | 0.98 | 0.99 |
| MQI | LANGIMP | $r$ | 0.29 | 0.35 | 0.4 | 0.44 | 0.4 | 0.31 | 0.31 | -0.02 | -0.02 | -0.08 | 0.06 |
| MQI | LANGIMP | $r$.low | 0.27 | 0.33 | 0.35 | 0.38 | 0.35 | 0.26 | 0.26 | -0.08 | -0.12 | -0.17 | -0.05 |
| MQI | LANGIMP | $r$.hi | 0.3 | 0.37 | 0.45 | 0.49 | 0.45 | 0.36 | 0.36 | 0.04 | 0.07 | 0.02 | 0.17 |
| MQI | LANGIMP | $\rho$ | 0.28 | 0.31 | 0.38 | 0.4 | 0.37 | 0.26 | 0.26 | -0.02 | -0.03 | -0.08 | 0.05 |
| MQI | LANGIMP | $\rho$.low | 0.26 | 0.29 | 0.33 | 0.34 | 0.32 | 0.21 | 0.21 | -0.08 | -0.13 | -0.17 | -0.06 |
| MQI | LANGIMP | $\rho$.hi | 0.29 | 0.34 | 0.43 | 0.45 | 0.42 | 0.32 | 0.32 | 0.03 | 0.07 | 0.02 | 0.17 |
| MQI | LANGIMP | $\tau$ | 0.28 | 0.31 | 0.38 | 0.39 | 0.37 | 0.26 | 0.26 | -0.02 | -0.03 | -0.07 | 0.05 |
| MQI | LANGIMP | $\tau$.low | 0.26 | 0.28 | 0.33 | 0.33 | 0.31 | 0.2 | 0.2 | -0.08 | -0.13 | -0.17 | -0.06 |
| MQI | LANGIMP | $\tau$.hi | 0.29 | 0.33 | 0.43 | 0.45 | 0.41 | 0.31 | 0.31 | 0.03 | 0.07 | 0.02 | 0.16 |
| MQI | LANGIMP | ICC | 0.12 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.12 | 0.12 | 0.12 | 0.12 |
| MQI | LANGIMP | AdjICC | 0.44 | 0.47 | 0.47 | 0.47 | 0.47 | 0.47 | 0.47 | 0.45 | 0.45 | 0.45 | 0.45 |
| MQI | LCP | C’s $\kappa$ | 0.18 | 0.2 | 0.26 | 0.25 | 0.18 | 0.17 | 0.17 | | | | |
| MQI | LCP | QWK | 0.23 | 0.32 | 0.32 | 0.44 | 0.36 | 0.25 | 0.25 | | | | |
| MQI | LCP | %Agr | 0.86 | 0.86 | 0.89 | 0.87 | 0.89 | 0.83 | 0.83 | | | | |
| MQI | LCP | Agr±1 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | | | | |
| MQI | LCP | $r$ | 0.23 | 0.32 | 0.36 | 0.45 | 0.37 | 0.25 | 0.25 | | | | |
| MQI | LCP | $r$.low | 0.22 | 0.3 | 0.31 | 0.39 | 0.32 | 0.2 | 0.2 | | | | |
| MQI | LCP | $r$.hi | 0.25 | 0.34 | 0.41 | 0.5 | 0.42 | 0.31 | 0.31 | | | | |
| MQI | LCP | $\rho$ | 0.22 | 0.28 | 0.33 | 0.41 | 0.34 | 0.21 | 0.21 | | | | |
| MQI | LCP | $\rho$.low | 0.2 | 0.25 | 0.28 | 0.35 | 0.29 | 0.15 | 0.15 | | | | |
| MQI | LCP | $\rho$.hi | 0.23 | 0.3 | 0.38 | 0.46 | 0.39 | 0.26 | 0.26 | | | | |
| MQI | LCP | $\tau$ | 0.21 | 0.27 | 0.33 | 0.41 | 0.34 | 0.21 | 0.21 | | | | |
| MQI | LCP | $\tau$.low | 0.2 | 0.25 | 0.27 | 0.35 | 0.29 | 0.15 | 0.15 | | | | |
| MQI | LCP | $\tau$.hi | 0.23 | 0.3 | 0.38 | 0.46 | 0.39 | 0.26 | 0.26 | | | | |
| MQI | LCP | ICC | 0.14 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | | | | |
| MQI | LCP | AdjICC | 0.5 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | | | | |
| MQI | STEXPL | C’s $\kappa$ | 0.36 | 0.29 | 0.26 | 0.3 | 0.26 | 0.31 | 0.31 | | | | |
| MQI | STEXPL | QWK | 0.4 | 0.45 | 0.45 | 0.45 | 0.48 | 0.45 | 0.45 | | | | |
| MQI | STEXPL | %Agr | 0.8 | 0.77 | 0.76 | 0.79 | 0.77 | 0.77 | 0.77 | | | | |
| MQI | STEXPL | Agr±1 | 0.99 | 0.97 | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 | | | | |
| MQI | STEXPL | $r$ | 0.4 | 0.48 | 0.48 | 0.46 | 0.51 | 0.48 | 0.48 | | | | |
| MQI | STEXPL | $r$.low | 0.38 | 0.46 | 0.44 | 0.4 | 0.47 | 0.43 | 0.43 | | | | |
| MQI | STEXPL | $r$.hi | 0.41 | 0.5 | 0.53 | 0.51 | 0.56 | 0.52 | 0.52 | | | | |
| MQI | STEXPL | $\rho$ | 0.39 | 0.47 | 0.47 | 0.46 | 0.5 | 0.47 | 0.47 | | | | |
| MQI | STEXPL | $\rho$.low | 0.38 | 0.45 | 0.42 | 0.4 | 0.46 | 0.43 | 0.43 | | | | |
| MQI | STEXPL | $\rho$.hi | 0.41 | 0.49 | 0.51 | 0.51 | 0.54 | 0.52 | 0.52 | | | | |
| MQI | STEXPL | $\tau$ | 0.39 | 0.46 | 0.46 | 0.45 | 0.49 | 0.46 | 0.46 | | | | |
| MQI | STEXPL | $\tau$.low | 0.37 | 0.44 | 0.41 | 0.39 | 0.45 | 0.41 | 0.41 | | | | |
| MQI | STEXPL | $\tau$.hi | 0.4 | 0.48 | 0.5 | 0.5 | 0.53 | 0.51 | 0.51 | | | | |
| MQI | STEXPL | ICC | 0.3 | 0.27 | 0.27 | 0.27 | 0.27 | 0.27 | 0.27 | | | | |
| MQI | STEXPL | AdjICC | 0.72 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | | | | |
| MQI | SMQR | C’s $\kappa$ | 0.25 | 0.3 | 0.23 | 0.29 | 0.35 | 0.32 | 0.32 | 0.07 | 0.1 | 0.09 | 0 |
| MQI | SMQR | QWK | 0.3 | 0.41 | 0.45 | 0.41 | 0.41 | 0.37 | 0.37 | 0.08 | 0.09 | 0.07 | 0.06 |
| MQI | SMQR | %Agr | 0.76 | 0.76 | 0.75 | 0.77 | 0.78 | 0.75 | 0.75 | 0.4 | 0.42 | 0.48 | 0.25 |
| MQI | SMQR | Agr±1 | 0.98 | 0.99 | 0.97 | 0.97 | 0.99 | 0.99 | 0.99 | 0.9 | 0.91 | 0.88 | 0.93 |
| MQI | SMQR | $r$ | 0.3 | 0.41 | 0.46 | 0.41 | 0.43 | 0.38 | 0.38 | 0.13 | 0.16 | 0.11 | 0.13 |
| MQI | SMQR | $r$.low | 0.29 | 0.39 | 0.42 | 0.36 | 0.38 | 0.33 | 0.33 | 0.07 | 0.06 | 0.01 | 0.02 |
| MQI | SMQR | $r$.hi | 0.32 | 0.43 | 0.51 | 0.47 | 0.47 | 0.43 | 0.43 | 0.19 | 0.25 | 0.2 | 0.24 |
| MQI | SMQR | $\rho$ | 0.29 | 0.39 | 0.41 | 0.4 | 0.42 | 0.37 | 0.37 | 0.12 | 0.16 | 0.11 | 0.12 |
| MQI | SMQR | $\rho$.low | 0.28 | 0.37 | 0.36 | 0.34 | 0.37 | 0.32 | 0.32 | 0.06 | 0.06 | 0.01 | 0.01 |
| MQI | SMQR | $\rho$.hi | 0.31 | 0.42 | 0.46 | 0.46 | 0.46 | 0.42 | 0.42 | 0.18 | 0.25 | 0.2 | 0.23 |
| MQI | SMQR | $\tau$ | 0.29 | 0.38 | 0.4 | 0.39 | 0.41 | 0.37 | 0.37 | 0.12 | 0.15 | 0.1 | 0.11 |
| MQI | SMQR | $\tau$.low | 0.27 | 0.36 | 0.35 | 0.33 | 0.36 | 0.32 | 0.32 | 0.06 | 0.05 | 0 | 0 |
| MQI | SMQR | $\tau$.hi | 0.3 | 0.41 | 0.45 | 0.45 | 0.46 | 0.42 | 0.42 | 0.17 | 0.24 | 0.2 | 0.23 |
| MQI | SMQR | ICC | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 | 0.19 |
| MQI | SMQR | AdjICC | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 | 0.59 |
| MQI | ETCA | C’s $\kappa$ | 0.24 | 0.3 | 0.28 | 0.39 | 0.27 | 0.31 | 0.31 | | | | |
| MQI | ETCA | QWK | 0.32 | 0.5 | 0.5 | 0.55 | 0.51 | 0.48 | 0.48 | | | | |
| MQI | ETCA | %Agr | 0.67 | 0.68 | 0.66 | 0.74 | 0.65 | 0.69 | 0.69 | | | | |
| MQI | ETCA | Agr±1 | 0.98 | 0.97 | 0.96 | 0.98 | 0.96 | 0.98 | 0.98 | | | | |
| MQI | ETCA | $r$ | 0.32 | 0.52 | 0.52 | 0.56 | 0.55 | 0.48 | 0.48 | | | | |
| MQI | ETCA | $r$.low | 0.3 | 0.5 | 0.48 | 0.51 | 0.51 | 0.44 | 0.44 | | | | |
| MQI | ETCA | $r$.hi | 0.33 | 0.54 | 0.56 | 0.6 | 0.59 | 0.53 | 0.53 | | | | |
| MQI | ETCA | $\rho$ | 0.3 | 0.5 | 0.51 | 0.54 | 0.55 | 0.46 | 0.46 | | | | |
| MQI | ETCA | $\rho$.low | 0.28 | 0.48 | 0.47 | 0.49 | 0.51 | 0.42 | 0.42 | | | | |
| MQI | ETCA | $\rho$.hi | 0.31 | 0.52 | 0.55 | 0.59 | 0.59 | 0.51 | 0.51 | | | | |
| MQI | ETCA | $\tau$ | 0.29 | 0.48 | 0.49 | 0.52 | 0.53 | 0.44 | 0.44 | | | | |
| MQI | ETCA | $\tau$.low | 0.27 | 0.46 | 0.45 | 0.47 | 0.48 | 0.4 | 0.4 | | | | |
| MQI | ETCA | $\tau$.hi | 0.31 | 0.5 | 0.54 | 0.57 | 0.57 | 0.49 | 0.49 | | | | |
| MQI | ETCA | ICC | 0.21 | 0.22 | 0.22 | 0.22 | 0.22 | 0.22 | 0.22 | | | | |
| MQI | ETCA | AdjICC | 0.61 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | | | | |

Appendix G Disentangling Bias and Measuring Fairness

Conducting a full fairness analysis across both CLASS and MQI items and raters is considerably more complicated when accounting for all four construct dimensions in Blazar et al. (2017). If only MQI items are modeled, as was the case in the plots of Figure 4, the model can be simplified to two dimensions. Full item-level MQI results for those models for disentangling biases from Section 5.4 are in Figure 10. The item-level results for the corresponding racial bias difference models from Section 5.5 are in Figure 11. JAGS code for MCMC in R is available online (https://github.com/hardy-education/LLM-Psychometrics). A structural plate diagram for the model in Section 5.5 is in Figure 9.

Code Listing G gives JAGS code for a full model representing Section 5.5, including code for the additional estimation of CLASS items and simultaneous estimation of human and model parameters, as seen in Figure 9. To reduce the total length of code, the listing encapsulates all code for the various MCMC estimations used in this paper. For the creation of Panels (d) and (e) of Figure 4 and Figures 10 and 11, model parameters were estimated after human rater and teacher parameters were estimated, using only MQI items (i.e., xi[i,j] is held fixed when estimating parameters for Encoders and GPTs). The listing also includes an additional hierarchical structure in latent abilities to allow for estimation of ideal scores at the lesson observation level, $\xi_{oij}$, so that teacher latent abilities, $\theta_{oi}$, can vary across lessons during the year and jointly be informed by the teacher’s true year-level latent abilities, $\Theta_i$. This would update the top-level latent ability estimation in Equation 5 to the following.

$$\text{HRM}\begin{cases}\boldsymbol{\theta}_{oi}\sim\text{MVN}(\boldsymbol{\Theta}_{M\times 1},\mathbf{I}_{M\times M})\text{; }\Theta_{im}\sim\mathcal{N}(0,1)\text{,}\\ \xi_{oij}\sim\text{IRT model}\\ X_{soijr}\sim\text{SDT model}\end{cases} \tag{13}$$
    model {
      ## Signal detection theory model with rater covariates
      for (i in 1:NN) {
        x[i] ~ dcat(prob.sdt[i, ])
        for (k in 1:K) {
          d[i, k] <- k - xi[subject[i], item[i]] - rhocov.r[rater[i], item[i], race[i]]
          z[i, k] <- exp(-d[i, k] * d[i, k]/2 * exp(zeta.r[rater[i], item[i], race[i]]))
          prob.sdt[i, k] <- ifelse((K - maxscore.by.item[item[i]]),
            ifelse(k < (maxscore.by.item[item[i]] + 1), z[i, k]/sum(z[i, ]), 0.00000E+00),
            z[i, k]/sum(z[i, ]))
        }
      }

      ## Multidimensional Generalized Partial Credit Model
      for (i in 1:N) {
        for (j in 1:J) {
          xi[i, j] ~ dcat(prob.irt[i, j, ])
          for (m in 1:M) {
            kern[i, j, m] <- alpha[j, m] * (theta[i, m])
          }
          for (k in 1:K) {
            dotprod[i, j, k] <- (k - 1) * sum(kern[i, j, ])
            eta[i, j, k] <- dotprod[i, j, k] - sum(gamma[j, 1:k])
            exp.eta[i, j, k] <- exp(eta[i, j, k])
            prob.irt[i, j, k] <- ifelse(K - maxscore.by.item[j],
              ifelse(k <= (maxscore.by.item[j]), exp.eta[i, j, k]/sum(exp.eta[i, j, 1:maxscore.by.item[j]]), 0),
              exp.eta[i, j, k]/sum(exp.eta[i, j, 1:maxscore.by.item[j]]))
          }
        }
      }

      ## Rater Parameters
      for (nu in r1.raters) {
        for (s in r.1.in) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
            zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
            omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
          }
        }
        for (s in r.1.out) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] <- 0
            zeta.r[nu, s, ra] <- 0
            omega.r[nu, s, ra] <- 1
          }
        }
      }
      for (nu in r2.raters) {
        for (s in r.2.in) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
            zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
            omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
          }
        }
        for (s in r.2.out) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] <- 0
            zeta.r[nu, s, ra] <- 0
            omega.r[nu, s, ra] <- 1
          }
        }
      }
      for (nu in r3.raters) {
        for (s in r.3.in) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
            zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
            omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
          }
        }
      }
      for (nu in r4.raters) {
        for (s in r.4.in) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
            zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
            omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
          }
        }
        for (s in r.4.out) {
          for (ra in 1:RA) {
            rhocov.r[nu, s, ra] <- 0
            zeta.r[nu, s, ra] <- 0
            omega.r[nu, s, ra] <- 1
          }
        }
      }

      ## Multidimension parameters
      for (m in 1:M) {
        pi.rt[m] <- 0
        delta.rt[m] <- 0
        sigma.rt[m] <- 1
      }

      ## Item Parameters
      for (s in 1:S) {
        for (ra in 1:RA) {
          eta.rt[s, ra] ~ dnorm(pi.rt[factors.by.item[s]], prec.eta)
          kappa.rt[s, ra] ~ dnorm(delta.rt[factors.by.item[s]], prec.kappa)
          tau.rt[s, ra] <- sqrt(1/exp(kappa.rt[s, ra]))
        }
      }

      ## Initializations for rater and item parameters
      prec.pi ~ dgamma(a.precpi, b.precpi)
      prec.delta ~ dgamma(a.precdelta, b.precdelta)
      prec.eta ~ dgamma(a.preceta, b.preceta)
      prec.kappa ~ dgamma(a.preckappa, b.preckappa)
      prec.rhocov ~ dgamma(a.precrhocov, b.precrhocov)
      prec.zeta ~ dgamma(a.preczeta, b.preczeta)
      sd.rhocov <- sqrt(1/prec.rhocov)
      sd.zeta <- sqrt(1/prec.zeta)
      sd.pi <- sqrt(1/prec.pi)
      sd.delta <- sqrt(1/prec.delta)
      sd.eta <- sqrt(1/prec.eta)
      sd.kappa <- sqrt(1/prec.kappa)
      for (m in 1:M) {
        alpha[d2[1], m] <- ifelse(m == 2, 1, 0)
        alpha[d1[1], m] <- ifelse(m == 1, 1, 0)
        alpha[d3[1], m] <- ifelse(m == 3, 1, 0)
        alpha[d4[1], m] <- ifelse(m == 4, 1, 0)
      }
      for (j in d1[2:D1]) {
        alpha[j, 1] ~ dlnorm(0, prec.alpha)
        alpha[j, 2] <- 0
        alpha[j, 3] <- 0
        alpha[j, 4] <- 0
      }
      for (j in d2[2:D2]) {
        alpha[j, 2] ~ dlnorm(0, prec.alpha)
        alpha[j, 1] <- 0
        alpha[j, 3] <- 0
        alpha[j, 4] <- 0
      }
      for (j in d3[2:D3]) {
        alpha[j, 3] ~ dlnorm(0, prec.alpha)
        alpha[j, 2] <- 0
        alpha[j, 1] <- 0
        alpha[j, 4] <- 0
      }
      for (j in d4[2:D4]) {
        alpha[j, 4] ~ dlnorm(0, prec.alpha)
        alpha[j, 1] <- 0
        alpha[j, 2] <- 0
        alpha[j, 3] <- 0
      }
      for (j in 1:J) {
        gamma[j, 1] <- 0
        for (k in 2:maxscore.by.item[j]) {
          gamma[j, k] ~ dnorm(0, prec.gamma)
        }
        for (k in (maxscore.by.item[j] + 1):(K + 1)) {
          gamma[j, k] <- 0
        }
      }

      ## Theta estimations
      for (i in 1:TY) {
        for (m in 1:M) {
          ty[i, m] ~ dnorm(0, prec.ty)
        }
      }
      for (i in 1:N) {
        theta[i, 1:M] ~ dmnorm(ty[tyr.by.obs[i], ], Tau[, ])
      }
      Tau[1:M, 1:M] ~ dwish(W[, ], DF)
      Sigma <- inverse(Tau[, ])
      sd.th1 <- sqrt(Sigma[1, 1])
      sd.th2 <- sqrt(Sigma[2, 2])
      rho12 <- Sigma[1, 2]/sqrt(Sigma[1, 1] * Sigma[2, 2])
      prec.ty ~ dgamma(a.precty, b.precty)
      sd.ty <- 1/sqrt(prec.ty)
      prec.b <- pow(var.b, -1)
      prec.g <- pow(var.g, -1)
      prec.alpha <- pow(var.alpha, -1)
      prec.gamma <- pow(var.gamma, -1)
      prec.phi <- pow(var.phi, -1)
    }

    ## initial values
    inits <- function() {
      list(
        alpha = item.dims * runif(J*M,0.1,1.5),
        gamma = item.cats.by.score * rnorm(J*(K+1),0,0.5),
        # ty= matrix(rep(rnorm(TY, 0, 1),M),nrow=TY,ncol=M),
        theta = matrix(rnorm(N*M,0,1),ncol=M),
        phi = rnorm(R, 0, 1),
        tau = runif(R, 0.1, 8),
        rhocov = array(rnorm(R*S*RA),dim = c(R,S,RA)) * rnorm(R*S*RA,0,.5),
        zeta = array(rnorm(R*S*RA),dim = c(R,S,RA)),
        pi = rnorm(R,0,.5),
        delta = rnorm(R,0,.5),
        kappa = rnorm(R,0,.5),
        theta.prec = rgamma(1,100,100))
    }

Code Listing G: JAGS code of a full model representing Section 5.5, including code for the additional estimation of CLASS items and simultaneous estimation of human and model parameters, as seen in Figure 9. For brevity, this includes all code, which can be reduced for the various methods herein. For the creation of Panels (d) and (e) of Figure 4 and Figures 10 and 11, model parameters were estimated after human rater and teacher parameters were estimated, using only MQI items (i.e., xi[i,j] is held fixed when estimating parameters for Encoders and GPTs).
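To make the signal detection layer of the listing easier to trace by hand, the sketch below reproduces its category probabilities for a single rating in Python. It follows the listing's kernel, exp(-d²/2 · exp(zeta)) with d = k - xi - rho, but the function name `sdt_category_probs` is hypothetical, and renormalizing over only the allowed categories when an item has fewer than K score points is an assumption of this sketch rather than a statement about the authors' estimation.

```python
import math
from typing import List, Optional


def sdt_category_probs(
    xi: float,
    rho: float,
    zeta: float,
    n_categories: int,
    max_score: Optional[int] = None,
) -> List[float]:
    """Probability that a rater assigns each category k = 1..K.

    xi   : ideal (latent) score for the observation
    rho  : rater bias (positive values shift ratings upward, i.e. leniency)
    zeta : log precision of the rater; omega = sqrt(1 / exp(zeta)) is the rater SD
    """
    allowed = n_categories if max_score is None else max_score
    kernels = []
    for k in range(1, n_categories + 1):
        d = k - xi - rho
        value = math.exp(-d * d / 2.0 * math.exp(zeta))
        kernels.append(value if k <= allowed else 0.0)
    total = sum(kernels)
    return [v / total for v in kernels]


if __name__ == "__main__":
    print(sdt_category_probs(xi=2.0, rho=0.0, zeta=0.0, n_categories=3))
    print(sdt_category_probs(xi=2.0, rho=1.0, zeta=0.0, n_categories=3))
    print(sdt_category_probs(xi=2.0, rho=0.0, zeta=2.0, n_categories=3))
```

An unbiased rater concentrates probability on the category nearest the ideal score, a positive bias moves the mode to a higher category, and a larger zeta sharpens the distribution, which corresponds to lower rater variability in Figures 10 and 11.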

Figure 9: Structural plate diagram for model described in Section 5.5.
Figure 10: Rater biases, $\rho_{jr}$, for each item $j \in \text{MQI}$, centered at an item-level detection effect, $\eta_j$, and variabilities, $\omega^2$, by MQI item, visually grouped by dimension, $m$, and marked by severity/leniency. Each point is an individual rater: a “+” marker is a single human rater; “$\bullet$” and “$\bigtriangledown$” are specific encoder and GPT models, respectively. The x-axis is rater bias: right is more lenient, left more severe. Colors along the x-axis mark bias categories. The y-axis is rater variability (lower is more consistent). Horizontal lines are 95% CIs for bias via MCMC Bayesian estimation.
Figure 11: Fairness across Racial Lines (Section 5.5). Standardized difference in rater bias, $\phi_r$ (x-axis), and rater combined variability/consistency, $\psi_r$ (y-axis), across Black teachers and White teachers. Leftward values are more severe towards Black teachers; rightward values are more lenient. Any horizontal bar present with a marker represents the 95% CI for bias. Differences in rater biases, $\Delta\rho_{jr}$, for each item $j \in \text{MQI}$ are centered at an item-level detection effect, $\eta_j$, and shown with variabilities, $\omega^2$. Each point is an individual rater: a “+” marker is a single human rater; “$\bullet$” and “$\bigtriangledown$” are specific encoder and GPT models, respectively. The x-axis is rater bias: right is more lenient, left more severe. Colors along the x-axis mark bias categories. The y-axis is rater variability (lower is more consistent). Horizontal lines are 95% CIs for bias via MCMC Bayesian estimation.

Appendix H Generalizability and Decision Studies

H.1 Generalizability Study Human Results (for NCTE Main Study)

This section reports the results of the item-level G-study for human expert ratings, consisting only of the estimates for individual items, using the NCTE Main Study data (Kane et al., 2015) to replicate Section 2.d of that study's Appendix. All calculations and representations follow the design details listed therein.

In the Appendix of the NCTE study, the authors report a G-study of the MQI instrument, but not one based on the data of the study itself: they provide a separate G-study of only eight (8) middle school teachers teaching three (3) lessons each, scored by only nine (9) raters, rather than the corresponding 317 NCTE Study teachers, with an average of 5.34 lessons each, and 63 raters. For completeness, this paper conducts the G-study for the NCTE main study Appendix, Section 3, using the NCTE dataset. The full results of the human label G-study are in Table 11.

Figure 12: Variance components for Generalizability Calculations
Table 10: By item, the percentage contribution, excluding the residual (which accounts for the remainder of the variance), of each variance component in the given MQI Item’s R x (O:T) Generalizability Study
Table 11: By item, the percentage contribution, excluding the residual (which accounts for the remainder of the variance), of each variance component in the given MQI Item’s R x (O:T) Generalizability Study

H.2 Item Generalizability and Item-score Reliability

As a complement to, and context for, Sections 5.1 and 5.2, we compare the $\mathbf{E}\hat{\rho}^{2}_{j}$ item values to item-level reliability estimates related to Guttman’s $\lambda_{6}$ (Guttman, 1945), $\hat{\rho}^{\lambda_{6}}_{jj'}$ (Zijlmans et al., 2018a, b). $\hat{\rho}^{\lambda_{6}}_{jj'}$ represents the proportion of an item’s variance that is shared with the variance captured by the other items. This estimate from Classical Test Theory (naïvely, in this case) assumes that all items measure the same latent construct, i.e., the Mathematical Quality of Instruction (Hill et al., 2008).
$\hat{\rho}^{\lambda_{6}}_{jj'}$ removes the variance in the residual error, $\sigma^{2}_{\varepsilon_{j}}$, from a multiple regression of item $j$ on the scores from the remaining $J-1$ items, to estimate the proportion of total item variance, $\sigma^{2}_{X_{j}}$, consistent with the unidimensional construct shared with the other items. Figure 13 highlights the large difference between the measurement used in Section 5.2 and item reliabilities from classical test theory. The latter describes item reliability based on all scores, while the former is used in this study because it relates more closely to the reliability of individual scores for a given item.
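The regression-based estimate described above can be sketched in base R. This is a minimal illustration under assumptions, not the code used in this study: the helper name `item_lambda6`, the data frame `toy`, and the simulated scores are all hypothetical, and the estimate is computed as one minus the ratio of residual variance to total item variance.

```r
# Hypothetical helper (not from the paper): lambda-6-style item-score reliability.
# scores: data frame with one row per scored unit and one column per item.
item_lambda6 <- function(scores) {
  scores <- as.data.frame(scores)
  sapply(names(scores), function(j) {
    # Regress item j on the remaining J - 1 items.
    others <- setdiff(names(scores), j)
    fit <- lm(reformulate(others, response = j), data = scores)
    # 1 - (residual variance / total item variance).
    1 - var(residuals(fit)) / var(scores[[j]])
  })
}

# Simulated data for illustration only (NOT NCTE data):
# items 1 and 2 share a latent trait, item 3 is unrelated noise.
set.seed(1)
theta <- rnorm(500)
toy <- data.frame(
  item1 = theta + rnorm(500),
  item2 = theta + rnorm(500),
  item3 = rnorm(500)
)
round(item_lambda6(toy), 2)
```

In this toy setting the two items that share a trait receive clearly higher estimates than the unrelated item, which is the sense in which the estimate depends on all items measuring one construct.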

Figure 13: Estimates for Family-wise Item-level Generalizability, $\mathbf{E}\hat{\rho}^{2}_{j}$, and Reliability, $\hat{\rho}^{\lambda_{6}}_{jj'}$.

H.3 Generalizability Theory Parameters and Code

A helpful heuristic for understanding the mathematics of G-theory is that its calculations are computationally very similar to hierarchical mixed-effects models, where the estimates of interest are found in the variances of the random effects. The two item-level code blocks below represent the by-item $(O:I)\times R$ and $(S:O:I)\times R$ parameterizations, respectively, using variable names from the original dataset. The former replicates the methods used in Hill et al. (2012b) and the Appendix Section 2.d of Kane et al. (2015) to create Table 11 in Appendix section H.1, and was used in this study to calculate the family generalizability metrics in Section 5.2, including those used in Section 5.3. The latter is used for the decision studies described in Section 5.6. Studies were conducted using lme4 (Bates et al., 2015) in R (Team, ).

Full results for item-level d-studies as defined in Section 5.6 are in Figure 14.

```r
library(lme4)   # lmer()
library(dplyr)  # filter()

# Family-wise model: all items are pooled, so ITEM enters as a random effect.
m <- list()
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | RATERID) + (1 | NCTETID/OBSID) + (1 | ITEM) +
      (1 | RATERID:NCTETID) + (1 | RATERID:OBSID) +
      (1 | ITEM:NCTETID) + (1 | ITEM:OBSID) + (1 | RATERID:ITEM) +
      (1 | ITEM:RATERID:NCTETID)
  )
}
```

Listing: lme4 code for family-wise all-item estimations in Table 2

```r
library(lme4)   # lmer()
library(dplyr)  # filter()

# One model per item: observations nested in teachers, crossed with raters.
m <- list()
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(ITEM == item) |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | NCTETID/OBSID) + (1 | RATERID) +
      (1 | RATERID:NCTETID)
  )
}
```

Listing: lme4 code for item-level estimations of $\mathbf{E}\hat{\rho}^{2}_{j}$ in Equation 2

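```r
library(lme4)   # lmer()
library(dplyr)  # filter()

# One model per item: segments nested in observations nested in teacher-years,
# crossed with raters.
m <- list()
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(ITEM == item) |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | NCTETID_SCHOOLYEAR_SP/OBSID/CHAPNUM) + (1 | RATERID) +
      (1 | RATERID:NCTETID_SCHOOLYEAR_SP)
  )
}
```

Listing: lme4 code for item-level estimations used in Equation 9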
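As a rough illustration of how variance components from such mixed-effects fits feed a decision study, the base R sketch below computes a generalizability coefficient for one item under an observations-nested-in-teachers, crossed-with-raters design. It uses the textbook relative-error expression for that design; Equations 2 and 9 of this paper are not reproduced in this appendix and may be defined differently. The function `g_coef`, the component names, and the numeric values are hypothetical and are not estimates from the NCTE data.

```r
# Hypothetical helper: generalizability coefficient for one item, given
# variance components for teacher, observation-within-teacher,
# teacher-by-rater, and residual.
# With lme4, components can be read from as.data.frame(VarCorr(fit))$vcov.
g_coef <- function(vc, n_obs = 1, n_raters = 1) {
  # Relative error variance: each component is divided by the number of
  # conditions it is averaged over in the decision study.
  rel_error <- vc[["obs_in_teacher"]] / n_obs +
    vc[["teacher_by_rater"]] / n_raters +
    vc[["residual"]] / (n_obs * n_raters)
  vc[["teacher"]] / (vc[["teacher"]] + rel_error)
}

# Illustrative variance components only (assumed values, not study results).
vc <- c(teacher = 0.10, obs_in_teacher = 0.15,
        teacher_by_rater = 0.05, residual = 0.60)

# Projected coefficient as the number of observations per teacher grows,
# holding the number of raters at one.
d_study <- sapply(1:6, function(n) g_coef(vc, n_obs = n, n_raters = 1))
round(d_study, 3)
```

Varying `n_obs` and `n_raters` in this way is the general pattern behind curves such as those in Figure 14, although the exact components entering each curve depend on the design details in Section 5.6.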

Figure 14: Expected changes to rating reliability: estimated improvements to the quality (via reliability) of classroom ratings in various contexts. For the single individual human baseline (black), which estimates the reliability gained by revisiting the same class, the x-axis represents the number of different 15-minute classroom observations of the same teacher. The red line is the estimate for having a different human observer conduct the observations as described. By contrast, for the model raters, namely the single Encoder (green), the Encoder ensemble (average of 3 encoders; red), and the GPT ensemble (average of 3 prompt-engineered GPT models), the x-axis is the number of full classroom observations conducted, where the human (black) observes at least 15 minutes (in-the-loop) of the same classroom and the models observe the entire class period.
Figure 15: Real-time Evaluation. The x-axis represents time in class (where 0 minutes is the start of class); each chart is one of the 25 items in the rubrics; the black lines are human evaluations (averaged, if multiple raters). The other lines are continuous model predictions for that item, using Loess smoothing where local fitting uses tricubic weighting of neighborhood points that span $\alpha = 0.1$.

Appendix I Interpretability of Encoder Labels

I.1 Feature Attribution Models and Tools

The Explainable Artificial Intelligence (XAI) community has proposed various methods to improve the explainability of deep learning models. A popular strategy is feature attribution: for a given neural network model $f$, an attribution method $E$ quantifies the importance of each feature of the input $x$ to the prediction $y = f(x)$. Strategies for determining feature importance include gradient-based methods, surrogate methods, and perturbation-based methods. Our study employs Integrated Gradients, a gradient-based approach developed by Sundararajan et al. (2017), to identify pivotal sentences for classroom quality assessment. Integrated Gradients is designed to satisfy two axioms that attribution methods ought to satisfy, Sensitivity and Implementation Invariance, and is computed with the following approximation:

$$\text{IntegratedGrads}^{approx}_{i}(x) ::= (x_{i}-x'_{i}) \times \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}\times(x-x')\right)}{\partial x_{i}} \times \frac{1}{m}.$$

In the above, $(x_{i}-x'_{i})$ is the difference between the input, $x_{i}$, and the baseline, $x'_{i}$, and $m$ is the number of steps in the Riemann approximation of the exact integral, as presented by Sundararajan et al. (2017). Integrated Gradients computes the average gradient along the straight-line path interpolating between a chosen baseline and the input. The resulting attributions are then obtained as the element-wise product of this path-averaged gradient vector and the difference vector between the input and the baseline.
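The Riemann approximation above can be made concrete with a small base R sketch. It is an illustration under assumptions, not the attribution pipeline used in this study: the toy model `F_model` (a logistic function of a linear score), its weights `w`, the analytic gradient `F_grad`, and the all-zero baseline are hypothetical stand-ins for an encoder and its inputs.

```r
# Hypothetical toy model: logistic function of a linear score.
w <- c(1.5, -2.0, 0.5)
F_model <- function(x) 1 / (1 + exp(-sum(w * x)))

# Analytic gradient of F_model with respect to each input feature.
F_grad <- function(x) {
  p <- F_model(x)
  p * (1 - p) * w
}

# Riemann approximation of Integrated Gradients with m steps.
integrated_grads <- function(x, baseline, grad_fn, m = 50) {
  grad_sum <- numeric(length(x))
  for (k in seq_len(m)) {
    # Gradient at the k-th point on the straight path from baseline to x.
    grad_sum <- grad_sum + grad_fn(baseline + (k / m) * (x - baseline))
  }
  # (x_i - x'_i) times the path-averaged gradient.
  (x - baseline) * grad_sum / m
}

x <- c(1, -0.5, 1)
baseline <- c(0, 0, 0)
attr_ig <- integrated_grads(x, baseline, F_grad, m = 200)

# The attributions should sum to approximately F(x) - F(baseline).
c(sum_attr = sum(attr_ig), delta_F = F_model(x) - F_model(baseline))
```

Comparing the sum of attributions with the change in model output is a common sanity check on the choice of $m$: if the two differ noticeably, more steps are usually needed.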