"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Michael Hardy
Stanford University
[email protected] Please see 8 for additional information about the author

Abstract

"Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently human-only task. We ask whether such a task can be automated using two Large Language Model (LLM) architecture families, encoders and GPT decoders, applying novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieves state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all of these positive results remain after applying more rigorous evaluation measures, which reveal spurious correlations and nonrandom racial biases across models and humans. This study then extends these methods to estimate how model use would change human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of human ratings of classroom instruction.

Keywords NLP $\cdot$ LLM $\cdot$ evaluation $\cdot$ bias $\cdot$ education $\cdot$ teacher development $\cdot$ Generalizability Theory $\cdot$ IRT $\cdot$ hierarchical rater models $\cdot$ reliability $\cdot$ classroom observation $\cdot$ classroom instruction $\cdot$ AI $\cdot$ fairness $\cdot$ racial bias $\cdot$ equity $\cdot$ annotations

1 Introduction

Human-mediated labels always have an unknown amount of error. In machine learning practice, this error is often quantified using inter-rater reliability metrics and correlations. However, this annotation uncertainty is often ignored during standard supervised learning and model evaluation, leading to poorer models (Belz et al., 2023). Thus, imperfect labels are treated as "gold" or "ground truth" (Belz et al., 2020; Hosking et al., 2024). This may be due in part to measures of accuracy being the most preferred methods of assessing and benchmarking model performance (Birhane et al., 2022; Ribeiro et al., 2020; Kiela et al., 2021), but common practice might also arise from not using tools expressive enough to interpret labels with low reliability. To that end, this work demonstrates methods for working with annotations of low or unknown reliability, which are often found in tasks requiring complex expert judgment.

The field of education has many complex tasks that often yield low reliabilities in labels (Jurenka et al., 2024; Kane and Staiger, 2012), which makes edtech NLP models and research particularly vulnerable to the effects of inexpert annotations (Belz et al., 2020; van der Lee et al., 2019; Zhou et al., 2023). The case study used to illustrate more expressive methods for working with unreliable labels comes from K12 education. Specifically, this study examines a use case where expert annotations are highly unreliable and yet used in high-stakes decisions: automated rating of the quality of classroom teaching. Methods used in this paper answer the call from others to evaluate the psychometric properties of models that perform this task (Casabianca et al., 2013; Liu and Cohen, 2021), and do so by comparing metrics across six dimensions of interest: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness (full results across these metrics against human baselines are in Table 4). Novel contributions of this work to NLP include:

1. measurements of the generalizability and dependability of labels used with NLP tasks (Section 5.2),
2. methods for detection of spurious correlations in model outputs via disattenuating low human-model correlations (Section 5.3),
3. methods for measuring model biases by disentangling human rater-specific contributions to unknown bias for unknown data sets (Section 5.4),
4. measurement of model fairness and racial bias in the presence of low label reliabilities (Section 5.5), and
5. application of Design Studies (d-studies) from Generalizability Theory (g-theory) for estimating impacts of human-in-the-loop (HIL) model use on human label quality (Section 5.6).

This work strengthens the argument that only using simple inter-rater reliability metrics to understand the quality of labels may be masking the limitations of the labeling criteria (Hill et al., 2012b; Hosking et al., 2024; Belz et al., 2020). It also illustrates how more robust evaluation techniques can yield information in the presence of noisy labels and seemingly inconclusive results. The analyses presented in this study are motivated by issues of model interpretability, fairness, and usefulness. Brief introductions to various techniques will be provided and illustrated via the study task, with interpretation of limitations and recommendations for future research.

Figure 1: Data Processes and Sources for Studying Teaching and Annotation Quality

1.1 Study Task: Annotating Teaching Quality

The classification task of rating teaching may seem deceptively simple: using a rubric, provide a rating for the quality of instruction of an elementary school math classroom. Such ratings are given to all US K12 public education teachers for both formative educator development feedback and as high-stakes teacher evaluations. Despite their ubiquity, these ratings, even when conducted by experts, are unreliable (Ho and Kane, 2013; Kane et al., 2015; Kane and Staiger, 2012; Glaese et al., 2022; Whitehill and LoCasale-Crouch, 2024), similar to the poor reliability of other K12 education labels (Jurenka et al., 2024; Tack et al., 2023) that have limited the rigor of education research (Slavin, 2002; Klahr, 2013; Jurenka et al., 2024). Studies about ratings of instruction are also extremely expensive to conduct relative to other annotation tasks (Grissom et al., 2013; Liu and Cohen, 2021; Jurenka et al., 2024), with only two major studies across hundreds of public school teachers that use authentic instructional metrics to support development: the MET study (Kane et al., 2013; Kane and Staiger, 2012) and the NCTE Main Study (Kane et al., 2015), the latter of which is the source of data for this study.

From the first study, Ho and Kane estimated that increasing the number of human classroom observers can improve the reliability of ratings assigned. In their major work on the topic, they use methods similar to those in this paper to measure conditions under which the use of additional human raters can increase the reliability of this resource- and time-intensive task (Kane and Staiger, 2012; Whitehurst et al., 2014). Considering the expense, importance, complexity, and lack of reliability in ratings of classroom teaching and also the advances in natural language processing, automated ratings based on classroom discourse offer one potential solution.

Study Research Question:

How can we know when the behaviors of models are good enough to be used in lieu of humans as estimated by Ho and Kane?

Answering whether automated ratings can similarly improve human annotations requires understanding the extent to which models’ added contributions would yield benefits similar to those expected from additional humans. Thus, this study illustrates methods for working with unreliable labels in NLP tasks by investigating and disentangling the variation found in human and model raters from the variation found within the observations and the instrument used for the annotation task. The model raters comprise two families: the "GPT" family of autoregressive in-context learners from Wang and Demszky (2023) (using ChatGPT), whose three sibling models differ in prompt engineering strategy, and an "Encoder" family built for this study, whose five siblings differ in embeddings and in a few adjustments to training hyperparameters. Quality of ratings will be examined between and within families and individual raters.

2 Related Work

2.1 Annotation Quality and Bias

Better understanding human label behaviors is key to training and evaluating models (Webson et al., 2023; Webson and Pavlick, 2022; Gordon et al., 2022). Accuracy, based on "gold" or "ground truth" labels, is the primary and most valued performance metric by which LLMs are evaluated (Birhane et al., 2022; Ribeiro et al., 2020; Kiela et al., 2021). For expediency of development, data scientists often choose to assume data labels are reliable, accurate, and end-task aligned for intended real-world use cases (Hosking et al., 2024; Bejar et al., 2006; Messick, 1998), even in scenarios where these assumptions could be detrimental (e.g., performing complex high-stakes tasks, or reducing discriminatory biases found in data (Field et al., 2021) that are immutably historical by definition of their creation). This is especially true of autoregressive models, whose labels are Internet text containing harmful biases (Hofmann et al., 2024a, b). Assessing the accuracy and reliability of idiosyncratically human-annotated "ground truth" can be difficult (Eckes and Jin; Wind and Guo, 2019; Wind, 2019; Abercrombie et al., 2023; Baan et al., 2024, 2022; Waseem, 2016; Kazai et al., 2013; Hosseiny Marani et al., 2022; Tack et al., 2023; Hosking et al., 2024), a challenge that is exacerbated when label uncertainty is underexamined or underreported. Limited transparency around label quality makes it more challenging to measure biases, interpret model findings, assess individual fairness, and establish real-world validity (Hill et al., 2012b; Jurenka et al., 2024).

Powerful and provocative research has begun to address the limitations of accuracy-only evaluations and to propose fairer and more responsible solutions under assumptions of uncertainty (Hardt et al., 2016; Dwork et al., 2012; Kasy and Abebe, 2021; Song et al., 2020; Zhao and Ermon, 2021; Corbett-Davies et al., 2023; Pleiss et al., 2017; Zemel et al., 2013), including techniques for addressing cases where labels lead to undesirable model behaviors (Ding et al., 2022; Hebert-Johnson et al., 2018; Qi et al., 2023). This paper offers several ways to quantify these issues and improve interpretability and explainability (Adebayo et al., 2020; Lundberg and Lee, 2017; Rudin, 2019; Kim et al., 2018).

2.2 Teacher Development and Evaluation

School leaders working with teachers to improve the quality of instruction typically evaluate the teacher’s proficiency in a range of competencies (usually measured during in-class observation and evaluation on a teaching rubric; Aguilar, 2013; Bambrick-Santoyo, 2016, 2018), then determine which competencies are most important to improve first (i.e., which change will have the biggest impact on student learning), and then provide supportive feedback and coaching. This paper focuses on the first step of evaluating teacher proficiency, which is often time-consuming and produces ratings (labels) that are unreliable (Kane and Staiger, 2012; Blazar, 2018; Kane et al., 2013; Casabianca et al., 2013). Without accurate classifications, it is challenging for practitioners to prioritize instructional needs and aligned practices from among the many elements of good teaching (Saphier et al., 2008; Darling-Hammond, 2014; Hammond, 2015; Lemov and Atkins, 2015; Lemov, 2021; Liljedahl et al., 2021; Darling-Hammond et al., 2020; Schwartz et al., 2016) and for researchers to empirically quantify the impact of good teaching practices (Pianta and Hamre, 2009; Charalambous and Delaney, 2019; Blazar and Pollard, 2022; Jurenka et al., 2024).

Thus, this work provides a bridge to research seeking to improve teaching quality by providing feedback to teachers on various instructional techniques (Samei et al., 2014; Donnelly et al., 2017; Kelly et al., 2018; Demszky et al., 2021; Suresh et al., 2022; Jacobs et al., 2022; Alic et al., 2022; Demszky and Liu, 2023; Demszky et al., 2024, 2023). These feedback studies identify linguistic features correlated with an aspect of good teaching, but may optimistically overgeneralize the usefulness, efficacy, and universality of identifiable features, providing specific prescriptions without diagnosis. Matching these models with the specific needs of teachers will help provide a more individualized approach to teacher development, one based on understanding instructional needs and then providing corresponding supports.

Only three recent studies have sought to use LLMs to provide ratings of classroom instruction (via classroom transcripts) using authentic rating rubrics. Whitehill and LoCasale-Crouch (2024) use a mix of zero-shot and bag-of-words model configurations to provide scores on instructional domains for Pre-Kindergarten classrooms using a private dataset, commenting that their highest Pearson correlation across 72 experiments ($r=0.48$) "approaches human inter-rater reliability". Wang and Demszky (2023) and Xu et al. (2024) both use the same publicly available datasets as the present study, and the approach of the former will be discussed further. Xu et al. use a by-item "best of" modeling approach which included experiments with BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2020), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2020), Llama 2 (Touvron et al., 2023), and ChatGPT, using models in two stages, where the first-stage LLM provides the best text to the second stage, which generates the rating. Unfortunately, the LLM-facilitated preprocessing of text and the by-item model training and selection processes limit the generalizability and transferability of their methods. While Xu et al. did not publicly release model ratings or the combinations of ensembles used, they did report Spearman correlation values for each of the best of several item-specific model constructions. In Figure 2, the results from their reported held-out test set are displayed alongside those from the present study for a comprehensive comparison across all studies reporting performance of automated ratings which use the MQI rubric or which use publicly accessible data.

3 Data

The data used in this study and in Wang and Demszky (2023) are from the National Center for Teacher Effectiveness (NCTE) Main Study (Kane et al., 2015), which contains three years of data collection and observations of math instruction in approximately fifty schools and three hundred 4th and 5th grade mathematics classrooms across four school districts in the United States. The data include expert human ratings of individual video-captured classroom lessons across two observation instruments (Bacher-Hicks et al., 2017, 2019): the CLASS framework (12 items; Pianta et al., 2008) for general instructional practice and the content-specific Mathematical Quality of Instruction (MQI; 13 items; Hill et al., 2008). Together these yield over 400,000 distinct human rating labels, the distributions of which are in Figure 6. Each instrument item is intended to measure a different aspect of teaching quality.

Like all human-mediated labels,¹ an individual classroom observation rating requires at a minimum three facets: (1) a task with rating criteria (Section 3.1), (2) raters/labelers (Section 3.2), and (3) observations to be classified (sections of transcripts of classroom discourse, Section 3.3). As tasks increase in complexity, these three facets contribute more error to estimates. This dataset has the additional real-world challenges of very long and noisy transcripts and large imbalances in human labels (Figure 4 panel (a), Figure 6) that have hindered previous research (Xu et al., 2024; Wang and Demszky, 2023), but which provide extra opportunity to demonstrate the importance of robust methods of evaluation.

¹ Label(er), rate(r), annotat(ion/or), and score(r) will be used interchangeably for these classification tasks, as terminology varies across disciplines.

3.1 Rating Criteria: MQI Rubric

Just as all raters contribute uncertainty to a system, so too do the measurement instruments. Ambiguity uncertainty is introduced when an instrument, instruction, or criterion for a task has language that could lead two equally expert raters to different results, ceteris paribus. The 13 MQI items within the dataset have at least two raters per classroom observation. While both humans and Encoders evaluated all items, this paper will focus on the 4 of the 13 MQI items evaluated in Wang and Demszky (2023) to support comparability across humans and models.² These four ternary items are teacher explanations (EXPL), remediation of student errors (REMED), student questioning and reasoning (SMQR), and imprecision in mathematical language (LANGIMP).³ Analyses for all other items are in the appendices. Prior studies have explored the reliability of MQI instrument ratings generally (Kane and Staiger, 2012; Mantzicopoulos et al., 2018; Hill et al., 2012b; Kane et al., 2015; Ji, 2023); this study confirms previous findings via reproduced reliability metrics in Section 3.2, which correspond to those in Appendix Section 2 of the NCTE Study.

² Xu et al. provided results for 11 of the 13 MQI items. No explanation is provided for the exclusion of MGEN and USEPROD.

³ LANGIMP is reverse-coded so higher scores are better and has interesting self-referentiality vis-à-vis instrument uncertainty that is worth noting, but out of scope for the current study. See Appendix C.2 for more on this and other negatively worded items.

3.2 Human Expert Raters

Human rater information for both the MQI and CLASS instruments can be found in the Appendix of the DS0 Study-Level Files from the NCTE Main study. MQI raters in particular were recruited from a separate pool of applicants based on their background in mathematics and through contacting colleagues in mathematics departments (Hill et al., 2012a; Blazar et al., 2017) and then passed certification exams to score the MQI, and attended biweekly calibration meetings to ensure standardization of scoring procedures.

3.3 Classroom Observations

A total of 63 human raters watched videos and provided ratings at regular intervals across all items in the MQI. Transcripts of these same videos (Demszky and Hill, 2022) are used by LLMs for the same task, where the class discourse is equipartitioned across utterances (GPT family models) or words (Encoder family models) by the total number of classroom segments to align the text to the human labels in the absence of timestamps. Data from the NCTE Main study (Kane et al., 2015)⁴ and the associated transcripts (Demszky and Hill, 2023)⁵ are available online.

⁴ https://www.icpsr.umich.edu/web/ICPSR/studies/36095/datadocumentation

⁵ https://github.com/ddemszky/classroom-transcript-analysis
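The alignment step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `equipartition`, the example sentence, and the rule for distributing leftover units (earlier segments receive one extra unit) are assumptions made for the sketch, not details taken from the study's code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def equipartition(units: Sequence[T], n_segments: int) -> List[List[T]]:
    """Split `units` (words or utterances) into `n_segments` contiguous chunks.

    Chunk sizes differ by at most one. Assumption for this sketch: any
    remainder is given to the earliest chunks.
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    base, extra = divmod(len(units), n_segments)
    chunks: List[List[T]] = []
    start = 0
    for k in range(n_segments):
        size = base + (1 if k < extra else 0)
        chunks.append(list(units[start:start + size]))
        start += size
    return chunks


# Hypothetical transcript of 10 words aligned to 3 rated segments.
words = "okay class today we will compare fractions with unlike denominators".split()
segments = equipartition(words, 3)
segment_sizes = [len(s) for s in segments]
```

Passing a list of utterances instead of words gives the utterance-level variant; each resulting chunk would then be paired with the human labels for the corresponding segment.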

4 Model Families and Model Rater Data

GPT Models

The GPT model family from Wang and Demszky (2023)⁶ has 7,660 ratings for 223 different teachers. The family consists of three models differing in prompt engineering methods (herein called N, NR, and ND); a brief summary of those differences is in Table 8. GPT models were evaluated on curated selections of classroom text with the least transcriptorial noise (i.e., minimizing instances of [inaudible]), which were edited to indicate whether the speakers were teachers or students.

⁶ https://github.com/rosewang2008/zero-shot-teacher-feedback/

Encoder Models

Encoder family models are custom transformer encoders trained on the NCTE classroom transcripts. The five models (un1, un2, un3, gte, and e5) use fixed-parameter pretrained sentence embeddings, differing in these embeddings and in training hyperparameters, thereby exploiting LLM sensitivities to pretraining regimes (D’Amour et al., 2020; McCoy et al., 2023). A summary of differences is in Table 7 and more training details can be found in Appendix D. In contrast to the model experiments of Xu et al., who used different combinations of models by item, each encoder model produces labels for all 13 MQI (and 12 CLASS) items. In contrast to the GPT models, the only text preprocessing used with the Encoders replaced all transcription notes with [inaudible] to mimic the uncertainty in live audio transcription, and no edits to indicate speakership were included. For the Encoder models, all model outputs⁷ in this study were produced on a lesson-level-stratified held-out test set (see Figure 8) that was not used during model development. Encoder models were trained on a single GPU in Google Colab, with training detailed in Appendix D.3.

⁷ https://github.com/hardy-education/LLM-Psychometrics

5 Evaluation Methods

Typical reliability metrics (see Section 5.1) provide a backdrop of descriptives that can flag issues of low quality labels. Measures of statistical dependability can be used for generalizing label conclusions and identifying spurious correlations (see Section 5.3), a part of improving accuracy. Methods for disentangling human and model label biases (see Section 5.4) are first demonstrated and then extended to estimate fairness across racial lines in Section 5.5. Usefulness, measured by the amount of rating reliability improvement a model can provide to a human rater in human-in-the-loop contexts, together with associated cost savings in human time (for encoder models), is addressed in Section 5.6.

5.1 Concordance: Agreement and Reliability Metrics

RQ 1:

How do automated models perform relative to humans in the presence of low label reliability? RQ 1: Case Study Reframing: How well do automated models perform relative to humans when evaluating instruction?

5.1.1 Baseline Human Metrics

Full reproductions⁸ of all reliability metrics and calculation processes, exactly as described in the NCTE Main Study Appendix Section 2 (Kane et al., 2015), were conducted. Following the same procedures, the replicated calculations were extended to the model families, replacing a human rater score with that of a specified or random model for evaluations of individual models and model families, respectively. Intra-class correlations (ICCs) are reported using the calculation methods in Appendix F. Reproduced human results and model results, including additional metrics in this section, are fully reported in Appendix F.1 and all item results can be found in the online supplement.

⁸ Small differences in the reported values here compared to the original study arise from the random human rater selection required in the procedure, which was done at the segment level. All families and model evaluations used the same random sample of human raters for comparison.

5.1.2 Commonly Used Metrics

The results also include four additional correlation and reliability metrics: Quadratic Weighted Kappa (QWK), typically used in ordinal classification tasks to penalize distance quadratically (squared error) while accounting for categorical agreement by chance (e.g., Shermis (2014); Hardy (2021); Wang and Demszky (2023)), Pearson correlation $r$ (e.g., Whitehill and LoCasale-Crouch (2024)), Spearman correlation $\rho$ (e.g., Wang and Demszky (2023); Xu et al. (2024)), and Kendall correlation $\tau$ (e.g., Liu et al. (2023b)). Figure 2 shows Spearman correlations ($\rho$) and confidence intervals for all model families and for models from Xu et al. (2024). The table in Figure 2 contains the $\rho$ estimates.
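To make the quadratic penalty concrete, the following is a minimal pure-Python sketch of QWK for two raters. It is an illustration under stated assumptions, not the implementation used in this study: the function name, the integer coding of labels as 0 to K-1, and the toy label vectors (`human`, `model_near`, `model_far`) are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], n_categories: int) -> float:
    """QWK for two raters whose labels are coded as integers 0..n_categories-1."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    observed = Counter(zip(a, b))        # joint counts of (rater a, rater b) labels
    hist_a, hist_b = Counter(a), Counter(b)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            # Disagreement weight grows with the squared distance between categories.
            w = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += w * observed[(i, j)] / n
            # Chance disagreement from the product of the two marginal distributions.
            weighted_expected += w * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - weighted_observed / weighted_expected


# Toy ternary labels (hypothetical): each model rater disagrees with the human once,
# but `model_far` misses by two categories instead of one.
human = [0, 1, 2, 1, 0, 2]
model_near = [0, 1, 1, 1, 0, 2]
model_far = [0, 1, 0, 1, 0, 2]

qwk_near = quadratic_weighted_kappa(human, model_near, 3)
qwk_far = quadratic_weighted_kappa(human, model_far, 3)
```

Both toy model raters have the same percent exact agreement with the human (5 of 6), yet the rater whose single error is two categories away receives a lower QWK, which is the behavior that distinguishes QWK from unweighted agreement.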

5.1.3 Results

Using nearly any standardized combination of metrics across all items from Section 5.1, Encoder models perform better than the single highest performing expert human rater. The human labels assigned for the four focus MQI items have very low reliabilities, despite the significant training and calibration for human raters described in Section 3.2. Overall, the human labels are highly unreliable, but a researcher trying to compare model performance to human performance could display the results as they are in Table 1. For metrics of agreement and reliability, each encoder model outperformed humans on average, whilst each GPT model underperformed humans on every metric and every item. Table 1 has a summary of the full panel of lesson segment-level inter-rater reliability metrics for each MQI item. Specific metrics for the four focus MQI items in this study are in Panel (b) in Figure 4, and the full individual model-item comparisons for all MQI items and metrics in this section are in Table LABEL:tab:tab:full. Additionally, the detailed full results for all models and metrics, MQI, and CLASS rubrics can be found in the supplementary materials online.

Using only these metrics and without further testing, one might assume that the encoder models are therefore ready to help with the task of automated annotations of teaching quality or that GPT models show improvement to ICC measures and could be helpful. Implications: Basic statistics in the presence of unreliable labels can mislead interpretations of model performance. Researchers should be wary of studies reporting few metrics in the presence of low reliabilities.

Metric	Encoders	un1	un2	un3	gte	e5	GPTs	N	NR	ND
%Agr	0.54	0.69	0.77	0.69	0.39	0.39	0.00	0.00	0.00	0.00
C’s $\kappa$	0.69	0.85	0.77	0.62	0.62	0.62	0.00	0.00	0.00	0.00
QWK	1.00	1.00	1.00	1.00	0.92	0.92	0.00	0.00	0.00	0.00
$r$	1.00	1.00	1.00	1.00	1.00	1.00	0.00	0.00	0.00	0.00
$\rho$	1.00	1.00	1.00	1.00	0.77	0.77	0.00	0.00	0.00	0.00
$\tau$	1.00	1.00	1.00	1.00	0.77	0.77	0.00	0.00	0.00	0.00

Table 1: Concordance: Performance above Human Reliability and Agreement Metrics. Proportion of MQI items where the model or model family listed had better results than human baselines. Bold indicates where performance was better on more than half of items rated. Inter-rater reliability metrics introduced in Section 5.1. C’s $\kappa$: Cohen’s $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; $r$: Pearson’s correlation; $\rho$: Spearman’s rank correlation; $\tau$: Kendall’s concordance correlation. Full data can be found in the supplementary material online.

5.2 Confidence: Generalizable Reliability

RQ 2:

How generalizable are findings from unreliable labels? RQ 2 Case Study Reframing: To what extent would the ratings of a teacher’s instructional quality persist across lessons or contexts?

5.2.1 Generalizability and Dependability

Generalizability Study (g-study) (Brennan, 2001a, 2013, b; Hill et al., 2012b) designs utilize random effect estimates across possible configurations of different sources of variance to quantify how generalizable labels are. This is done by estimating the extent to which given labels would persist if sources of variation changed (e.g., same teacher, different day; same lesson, different rater; human rater vs model rater; etc.). $\mathbf{E}\rho^{2}$ is a measure of the relative generalizability of a rating (i.e., is rating order preserved), and $\mathit{\Phi}$, accounting for absolute error, is a measure of label dependability: how likely specific ratings would be numerically the same with different sources of variation. These two reliability-like estimates can help quantify how "golden" labels are.

The multifaceted g-study design used to estimate how much variation ($\nu$) in individual teachers’ instructional quality, $i$, contributed to a rating label, $X$, annotated for a section of a lesson, $s$, during an observation, $o$, on rubric item $j$ by rater $r$ is known as an Item-by-Rater-by-Segment-within-Observation-within-Individual Teacher design: $J\times R\times(S:O:I)$. Overall estimates across all MQI items for a given rater family, $\mathbb{F}$, are in Table 2. For item-level reliabilities, we simplify the expression by holding the item fixed, resulting in an $R\times(S:O:I)$ design. Using nested random effects notation, the estimation model is:

$$X_{s:o:ir}^{(j)}=\mu+\nu_{i}+\nu_{o:i}+\nu_{s:o:i}+\nu_{ir}+\nu_{r}+\nu_{s:o:ir},\quad\forall j\in\mathbf{J}\tag{1}$$

where $j$ indicates the item index.⁹ Code for the model specification is in Appendix H.3.

⁹ For the estimates in Fig. 4 (c), for dependability metrics of Section 5.3, and for comparability with human baselines (Hill et al., 2012b; Kane et al., 2015; Ho and Kane, 2013; Kane and Staiger, 2012), a simplified model, a by-item $R\times(O:I)$ design, was conducted for the human expert rater family with results in Appendix H.1. The simplified model is $X_{o:ir}^{(j)}=\mu+\nu_{i}+\nu_{o:i}+\nu_{ir}+\nu_{r}+\nu_{o:ir}$. The full model structures of Eq. 1, 2 and 3 are used for Section 5.6.

Then, $\mathbf{E}\rho^{2}$ (Equation 2) and $\mathit{\Phi}$ (Equation 3) are easily estimated from the random effects for raters in rater family $\mathbb{F}$:

$${\mathbf{E}\rho^{2}_{\mathbb{F}}}^{(j)}=\frac{\nu_{ij}}{\nu_{ij}+\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{s:o:irj}},\tag{2}$$

$$\mathit{\Phi}_{\mathbb{F}}^{(j)}=\frac{\nu_{ij}}{\nu_{ij}+\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{rj}+\nu_{s:o:irj}},\tag{3}$$

$\forall r\in\mathbb{F}$, where the individual item-rating-segment variation, $\nu_{s:o:irj}$, is confounded with error variation. These results are found in Table 2. A figure comparing the $\mathbf{E}\hat{\rho}^{2}_{j}$ item values to item-level reliability estimates related to Guttman’s $\lambda_{6}$, $\rho^{\lambda_{6}}_{jj\prime}$, from Classical Test Theory (Zijlmans et al., 2018a, b), can be found in Appendix H.2. Additionally, an illustration of sources of variance, including descriptions, can be found in Appendix H, color-coded to support interpretation of sources of variance with the table of results.
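As a numerical illustration of Equations 2 and 3, the short Python sketch below computes both coefficients from a set of variance components for a single item. The function name, the dictionary keys (which mirror the subscripts of Equation 1), and the component values are hypothetical and chosen only for arithmetic convenience; they are not estimates from this study, whose model specification is in Appendix H.3.

```python
from typing import Mapping, Tuple

# Variance components in the denominator of Eq. 2, besides the teacher term nu_i.
RELATIVE_ERROR_TERMS = ("o:i", "s:o:i", "ir", "s:o:ir")
# Eq. 3 additionally counts the rater main effect nu_r as error.
ABSOLUTE_ERROR_TERMS = RELATIVE_ERROR_TERMS + ("r",)


def g_coefficients(nu: Mapping[str, float]) -> Tuple[float, float]:
    """Return (E rho^2, Phi) for one item, following Eq. 2 and Eq. 3 as written."""
    teacher = nu["i"]
    relative_total = teacher + sum(nu[k] for k in RELATIVE_ERROR_TERMS)
    absolute_total = teacher + sum(nu[k] for k in ABSOLUTE_ERROR_TERMS)
    if relative_total <= 0.0 or absolute_total <= 0.0:
        raise ValueError("total variance must be positive")
    return teacher / relative_total, teacher / absolute_total


# Hypothetical variance components for one item and one rater family.
example_nu = {
    "i": 0.03,       # teacher (object of measurement)
    "o:i": 0.02,     # observation within teacher
    "s:o:i": 0.05,   # segment within observation within teacher
    "ir": 0.01,      # teacher-by-rater interaction
    "r": 0.04,       # rater main effect (severity or leniency)
    "s:o:ir": 0.09,  # residual, confounded with error
}

e_rho2, phi = g_coefficients(example_nu)
```

With these made-up values the relative denominator is 0.20 and the absolute denominator is 0.24, so $\mathbf{E}\rho^{2}=0.15$ and $\mathit{\Phi}=0.125$. Because Equation 3 only adds the non-negative rater variance to the denominator, $\mathit{\Phi}$ can never exceed $\mathbf{E}\rho^{2}$ for the same components.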

5.2.2 Results

Humans, on average, produce labels that are both more reliable and generalizable. The full results for human rater labels, decomposed into variance components, can be found in Appendix H.3,¹⁰ and estimates for $\mathbf{E}\rho^{2}$ and $\mathit{\Phi}$ can also be found in panel (c) of Figure 4. The Encoder models outperform humans on nearly every item in terms of inter-rater reliability metrics (Table 1), but not in generalizable reliability metrics, as seen in the panel (c) tables of Figure 4. Importantly, the large difference between $\mathbf{E}\hat{\rho}^{2}$ and $\mathit{\hat{\Phi}}$ for Humans and Encoders is due to properties of individual items, which accounted for over 75% of the variation in those families. GPT models, on the other hand, did not change ratings very much on different items, consistent with literature on these models not understanding such prompts (Liu et al., 2023a; Webson and Pavlick, 2022; Heo et al., 2024).

¹⁰ Appendix 2.c of Kane et al. (2015) provided a g-study but, surprisingly, did not use the data from the study.

Table 2 shows that the Encoder family still performs better than humans on the majority of items, but the advantage is no longer as clear. Interestingly, as mentioned in Section 4, the encoder models did not receive any annotations outside of the transcript, including speaker. This means that the models would struggle to distinguish teacher explanations (EXPL) from student explanations (STEXPL). This shift in interpreting encoder family performance from superhuman to zero reliability adds validity to the argument that these metrics provide valuable insight, showing that the relationships found in some of the variables could be explained by variance unrelated to the label construct. Implications: Measures of generalizability and dependability derived from structured variance decomposition can meaningfully quantify label quality.

		$\mathbf{E}\hat{\rho}^{2}$			$\mathit{\hat{\Phi}}$
ITEM	Human	Encoders	GPTs	Human	Encoders	GPTs
ETCA	0.17	0.20		0.15	0.19
EXPL	0.15	0.00	0.00	0.12	0.00	0.00
LANGIMP	0.09	0.15	0.08	0.08	0.14	0.08
LCP	0.11	0.27		0.09	0.26
LINK	0.13	0.19		0.12	0.19
MAJERR	0.08	0.00		0.07	0.00
MGEN	0.03	0.08		0.02	0.08
MLANG	0.07	0.18		0.06	0.17
MMETH	0.13	0.37		0.13	0.36
REMED	0.13	0.10	0.05	0.11	0.09	0.04
SMQR	0.14	0.09	0.00	0.13	0.09	0.00
STEXPL	0.25	0.00		0.23	0.00
USEPROD	0.19	0.25		0.17	0.25
All Items	0.114	0.106	0.007	0.010	0.014	0.004

Table 2: Generalizability and Dependability metrics by model families for each MQI Item. Bold represents the best rater family for each of $\mathbf{E}\rho^{2}$ and $\mathit{\Phi}$ , respectively. Underlined items are focus MQI items, because they were evaluated by Wang and Demszky (2023). For the overall "All Items" calculation, a $J\times R\times(O:I)$ model was used for comparability with other similar research.

5.3 Validity: Convergent and Spurious Correlations

RQ 3:

To what extent can accuracy and validity be estimated with unreliable labels? RQ 3 Case Study Reframing: To what extent do models and humans rate the same underlying construct similarly?

5.3.1 Disattenuating High Noise Correlations

Dependability and generalizability do not guarantee accuracy, but even at these very low levels they can be used in indirect tests of convergent validity: they help determine whether correlations between humans and models are low because of measurement error, such as poor rubric item construction, or because the two sets of ratings are truly uncorrelated. If an individual teacher’s latent instructional ability $\theta_{i}$ is about the same from lesson to lesson with the same students, we can correlate $\hat{\theta}_{i}$ for human ( $\mathbb{h}$ ) and model ( $\mathbb{m}$ ) family ratings of different lessons from the same teacher, and then correct for measurement error by disattenuating the correlations by each rater family’s ( $\mathbb{F}$ ) label generalizability, ${\mathbf{E}\hat{\mathit{\rho}}^{2}_{\mathbb{F}}}^{(j)}$ , for a given item $j$ . The disattenuated correlation, $\mathbf{\varrho}_{\mathbb{hm}}^{(j)}$ , between humans and a family of models for item $j$ can be estimated as:

\displaystyle\mathbf{\varrho}_{\mathbb{hm}}^{(j)}=\frac{\operatorname{Corr}[\operatorname{\tilde{\mathcal{X}}_{\mathbb{h}}}(i,\mathfrak{L},j,r_{\mathbb{h}}),\operatorname{\tilde{\mathcal{X}}_{\mathbb{m}}}(i,\neg\mathfrak{L},j,r_{\mathbb{m}})]}{\sqrt{{\mathbf{E}\hat{\mathit{\rho}}^{2}_{\mathbb{h}}}^{(j)}{\mathbf{E}\hat{\mathit{\rho}}^{2}_{\mathbb{m}}}^{(j)}}}

(4)

where $\tilde{\mathcal{X}}_{\mathbb{F}}$ is a score retrieval function for individual teacher $i$ on item $j$ by a random member $r$ of rater family $\mathbb{F}$ in relation to some observed lesson $\mathfrak{L}$ , with family label generalizability ${\mathbf{E}\hat{\mathit{\rho}}^{2}_{\mathbb{F}}}^{(j)}$ defined in Equation 2. In other words, the numerator (represented in red in Figure 3) is the correlation in scores whenever two different lessons from the same teacher were scored by raters from different families (human and model). The denominator then adjusts for the reliabilities of raters from each family to account for the known tendency of low reliability to diminish observed correlations.
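As a minimal sketch of the correction in Equation 4 (an illustration under stated assumptions, not the study's implementation), the calculation can be written in a few lines of Python. The function name, argument names, and toy scores are hypothetical; the two reliability arguments stand in for each family's $\mathbf{E}\hat{\rho}^{2}$ on a given item.

```python
import numpy as np


def disattenuated_correlation(human_scores, model_scores, rel_human, rel_model):
    """Divide an observed correlation by the geometric mean of two reliabilities.

    human_scores / model_scores: paired teacher-level scores, where each pair
    comes from two different lessons taught by the same teacher (assumption).
    rel_human / rel_model: generalizability estimates for the item.
    """
    human_scores = np.asarray(human_scores, dtype=float)
    model_scores = np.asarray(model_scores, dtype=float)
    if rel_human <= 0 or rel_model <= 0:
        # With zero generalizability the corrected value is undefined.
        return float("nan")
    observed = np.corrcoef(human_scores, model_scores)[0, 1]
    return float(observed / np.sqrt(rel_human * rel_model))


# Hypothetical toy data: five teachers, one score per rater family.
toy_human = [1, 2, 3, 4, 5]
toy_model = [2, 1, 4, 3, 5]
print(disattenuated_correlation(toy_human, toy_model, rel_human=0.64, rel_model=1.0))
```

Because the denominator shrinks as reliabilities fall, the corrected value can exceed 1 when reliabilities are very low, which is one reason conservative confidence intervals matter in this setting.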

Figure 4 panel (b) shows the disattenuated correlations and their respective 95% confidence intervals, calculated at $\alpha=0.05$ using the empirical confidence scaling methods defined by Charles (2005). On these data, this approach produces more conservative confidence intervals than traditional Fisher normalization (Kromrey et al., 2008), which is preferable given that the low levels of reliability in Section 5.2 can lead to overcorrection. Reported disattenuated correlations of 1.0 do not mean perfect correlation: they generally mean that measurement error is not randomly distributed.

Disattenuated correlations are not directly comparable to the measures of correlation in Section 5.1 (Muchinsky, 1996). However, failure of disattenuation to identify viable human-model correlations for items that previously showed correlated relationships in Section 5.1 suggests the prior correlations may be spurious. Disattenuation does not change the low reliability across items nor the quality of the measurement, but it offers indirect evidence for discerning model predictive validity by quantifying how changes in the underlying construct result in changes in the same direction for both human and model.

Results for disattenuated correlations described in Section 5.3 and their confidence intervals are in Figure 3. Most items show correlated relationships after disattenuation, and most with confidence intervals above 0.5, suggesting that the encoder models and the humans are likely identifying similar sources of underlying teacher variation for those items.

5.3.2 Results

Disattenuation analyses and Section 5.2 suggest that the Encoder model family’s SOTA-level correlations on the EXPL and STEXPL items may have been spurious (likely identifying speech patterns associated with higher teacher performance, and not necessarily specific to explanations), a direct result of the low generalizabilities found in Section 5.2. Additionally, we see very large confidence intervals for the encoders on items where item score distributions are most imbalanced (MGEN, MAJERR), suggesting that the correlations found are not justified in the presence of low reliabilities. Items where the disattenuated correlations are lower (e.g., LCP, MMETH) suggest that models and humans interpreted observational features differently. Implications: when measurement error is high, disattenuating model and human correlations can help identify whether items with high or similar correlations are spurious or are responding to similar features.

This method only minimally provides evidence for investigating accuracy and validity, but, for the Encoder models, evidence can be built upon by comparing how the more continuous ratings of the models and humans change and correlate over the course of a given observation. While not explicitly part of this study, an example of how Encoders’ and humans’ ratings change from the start to the end of a class for a randomly chosen lesson observation is illustrated in Figure 15. Investigating the validity of a construct would require more robust qualitative review of the content.

5.4 Bias: Disentangling Individual Rater Behaviors

RQ 4:

Can bias contributed by individual rater behaviors be identified and disentangled from labels? RQ 4: Case Study Reframe: How do individual rater effects contribute to ratings bias?

5.4.1 Hierarchical Rater Models

Rater biases in complex tasks are usually not directly measurable, but we can estimate latent constructs that quantify the effects of individual raters’ behaviors using methods commonly used throughout Item Response Theory (IRT) to estimate latent attributes of rubric items (e.g., item difficulty) and latent attributes of individuals (e.g., ability). If the data had no variation due to raters, various polytomous IRT methods could help estimate "true scores"/"gold" labels ( $\xi_{ij}$ ) during classroom observations, teacher instructional abilities ( $\theta_{i}$ ), and the various individual item effects. For tasks with human-mediated labels, human raters introduce additional sources of measurement error for each classification, and the data may include multiple measures from multiple raters for a single observation (leading to an accumulation of information at overlapping observation points). To address this, hierarchical rater modeling (HRM) (Patz et al., 2002; DeCarlo, 2003; DeCarlo et al., 2011) combines an IRT model with a first-stage estimation defined by a signal detection theory (SDT) relationship. The latter asks the question, "given the presence of the ’true’ score, can a rater detect it?" while the former asks, "given the inputs, can we estimate the ’true’ score accounting for differences in the tasks used to measure it?". The hierarchical structure addresses the problem of accumulation of information in the estimates. HRMs consist of three components:

\displaystyle\text{HRM}\begin{cases}\boldsymbol{\theta}_{i}\sim\text{MVN}(\textbf{0}_{M\times 1},\textbf{I}_{M\times M})\text{,}\\ \xi_{oij}\sim\text{IRT model: Equation 6}\\ X_{soijr}\sim\text{SDT model: Equation 7}\end{cases}

(5)

where an IRT model estimates the "gold" label score $\xi_{soij}$ for some time segment $s$ in teacher $i$ ’s $o$ -th observed lesson for item $j$ , which arises from $i$ ’s $M$ -dimensionally distributed latent instructional ability/needs ( $\boldsymbol{\theta}_{i}$ ). A Signal Detection Theory (SDT) model component then disentangles individual rater biases from each recorded score, $X_{soijr}$ , by quantifying the latent attributes that mediate whether rater $r$ correctly detects the true score, i.e., $p_{\xi kr}=P\left[X_{soijr}=k\ |\ \xi_{oij}=\xi\right]$ .

The IRT component of Equation 5, which estimates the true scores based on rubric item- and teacher-specific parameters, is a $K_{j}$ -category multidimensional generalized partial credit model (MGPCM) (Muraki, 1992; Adams et al., 1997; Cui et al., 2024; Casabianca, 2021). Distributional challenges of negatively worded items can be addressed through a multidimensional parameterization of the underlying latent teacher instructional abilities, with between-item dimensionality confirmatorily defined by the factors in Blazar et al. (2017). The MGPCM item discrimination parameters, $\boldsymbol{\alpha}_{j}=\alpha_{jm}$ , and the vector of dimension-specific traits, $\boldsymbol{\theta}_{i}=\theta_{im}$ , are separated for $m\in M$ latent dimensions, and parameters for item difficulties $\gamma_{jk}$ exist for each possible score category $k$ in item $j$ :

\displaystyle P\left[\xi_{oij}=\xi\ |\ \boldsymbol{\theta^{\prime}}_{i},\ \boldsymbol{\alpha}_{j},\ \gamma_{j\xi},o\right]=\frac{\exp\left\{(\xi-1)\boldsymbol{\alpha}_{j}\boldsymbol{\theta^{\prime}}_{i}-\sum_{k=1}^{\xi}\gamma_{jk}\right\}}{\sum_{h=1}^{K_{j}}\exp\left\{(h-1)\boldsymbol{\alpha}_{j}\boldsymbol{\theta^{\prime}}_{i}-\sum_{k=1}^{h}\gamma_{jk}\right\}},\forall s\in o

(6)

where $oi=1,...,N$ lessons observed for teacher $i$ , $j=1,...,J$ items, $r=1,...,R$ raters, and $k=1,...,K$ possible scores.
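To make the category probabilities of the generalized partial credit model concrete, the following Python sketch evaluates them for a single item. It is only an illustration under stated assumptions: the function name and example values are hypothetical, and it uses the standard GPCM normalization in which each category's numerator term appears once in the denominator.

```python
import numpy as np


def gpcm_category_probs(theta, alpha, gamma):
    """Probabilities of true-score categories 1..K for one item.

    theta: (M,) latent teacher traits; alpha: (M,) item discriminations;
    gamma: (K,) step difficulties gamma_{j1}, ..., gamma_{jK}.
    """
    theta = np.asarray(theta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    categories = np.arange(1, gamma.shape[0] + 1)
    # Exponent for category k: (k - 1) * alpha.theta - sum_{v <= k} gamma_v
    logits = (categories - 1) * float(alpha @ theta) - np.cumsum(gamma)
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical two-dimensional teacher and a three-category item.
print(gpcm_category_probs(theta=[0.5, -0.2], alpha=[1.2, 0.0], gamma=[0.0, -0.3, 0.8]))
```

Setting a discrimination entry to zero, as in the example, is one way to express between-item dimensionality: the item then loads on a single latent dimension.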

As parameterized by Patz et al. (2002), the base-level SDT model of the HRM represents the measurement error induced by rater $r$ whose ability to "detect" the true score changes according to an individual rater’s item-specific biases, $\phi_{jr}$ and variabilities, $\psi_{jr}$ , on the x and y axes of Figure 4:

\displaystyle p_{\xi kr}\propto\exp\left\{-\frac{1}{2\psi_{jr}^{2}}\left[k-\left(\xi+\phi_{jr}\right)\right]^{2}\right\}

(7)

where $\boldsymbol{\phi}_{jr}=\textbf{Y}_{jr}\eta$ is a linear model for rating bias across raters and items, with design matrix $\textbf{Y}_{jr}$ of dimensions $(RJ)\times(R+J)$ and $\eta=(\phi_{1},...,\phi_{R},\eta_{1},...,\eta_{J})^{T}$ for $R$ raters and $J$ items, as parameterized in Mariano and Junker (2007). Correspondingly, we update $\ln{\psi_{jr}^{2}}=\textbf{Y}_{jr}(\ln{\tau^{2}})$ where $\ln{\mathbf{\tau}^{2}}=(\ln{\psi_{1}^{2}},...,\ln{\psi_{R}^{2}},\ln{\tau_{1}^{2}},...,\ln{\tau_{J}^{2}})^{T}$ . The complete rater estimates from these models are displayed in Figure 10. The Bayesian estimates were calculated via Markov chain Monte Carlo (MCMC) simulation using Gibbs sampling across four chains with JAGS (Plummer, 2003) in R, using very weakly informative priors, and converged with $\hat{R}<1.1$ for each parameter. A structural plate diagram and JAGS code for the full extended model can be found in Appendix G.
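The signal detection component in Equation 7 can be illustrated with a short Python sketch that normalizes the kernel over the score categories. This is a hedged illustration rather than the JAGS model of Appendix G; the function name and parameter values are hypothetical.

```python
import numpy as np


def rater_score_probs(xi, phi, psi, n_categories):
    """P[X = k | true score xi] for k = 1..K under the SDT rater kernel.

    phi: rater bias (shifts where the rater's scores center);
    psi: rater variability (spreads scores around xi + phi).
    """
    k = np.arange(1, n_categories + 1)
    log_kernel = -((k - (xi + phi)) ** 2) / (2.0 * psi ** 2)
    weights = np.exp(log_kernel - log_kernel.max())  # stable normalization
    return weights / weights.sum()


# Hypothetical raters scoring a segment whose true score is category 2 of 3.
print(rater_score_probs(xi=2, phi=0.0, psi=0.5, n_categories=3))  # unbiased, consistent
print(rater_score_probs(xi=2, phi=1.0, psi=0.5, n_categories=3))  # lenient by one category
print(rater_score_probs(xi=2, phi=0.0, psi=3.0, n_categories=3))  # unbiased but noisy
```

A bias near zero with small variability concentrates probability on the true category, a nonzero bias moves the mode away from it, and a large variability flattens the distribution toward uniform.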

5.4.2 Results

Individual annotator tendencies and behaviors can be measured and indicate significant differences. The vertical dashed lines on the graphs in panels (d) and (e) in Figure 4 represent 0.5 standard deviations of difference for individual raters from the mean. GPT models show significantly different rater behavior. Implications: even for tasks where there is minimal overlap of observations across individual raters, rater behaviors can still be modeled and removed. This allows for improved curation of datasets and model selection.

5.5 Fairness: Estimation of Ratings Along Racial Lines

RQ 5:

With unreliable labels and complex tasks, can rater contributions to biased labeling across groups be estimated? RQ 5 Case Study Reframe: Can issues of racial fairness in ratings be disentangled from individual rater behaviors?

5.5.1 Measuring Racial Discrimination as Rater Covariates

Disentangling individual rater biases further, across sensitive attributes, can provide a measure of fairness for labels and identify raters (human or model) that display discriminatory biases. If ratings are fair, variables representing a sensitive attribute, $\varsigma$ (e.g., race/ethnicity, gender, age, etc.), should be independent of the observed score $X_{soijr}$ given the true score $\xi_{soij}$ : $X\perp\varsigma\Rightarrow P_{\varsigma=a}(X_{jr}|\xi_{j})=P_{\varsigma=b}(X_{jr}|\xi_{j}),\forall a,b$ . In the notation used for disentangling rater effects, scoring from rater $r$ on item $j$ is fair with respect to attribute $\varsigma$ , given $\varsigma\perp\xi$ , when additionally conditioning on the attribute leaves the distribution of scores unchanged:

\displaystyle P[X_{soijr}|\xi_{soij},r,j,\varsigma_{i}]=P[X_{soijr}|\xi_{soij},r,j]

(8)

To measure a rater’s item-level fairness with respect to some sensitive teacher attribute, $\varsigma$ , the rater parameter vectors are easily updated: $\phi_{jr\varsigma}=\textbf{Y}_{jr\varsigma}\eta$ is now a linear model for rating bias across raters, items, and groups, where $\textbf{Y}_{jr\varsigma}$ is a design matrix of dimensions $(RJ\Sigma)\times(R+J+\Sigma)$ and $\Sigma=\{B,W\}$ for Black and White self-identified teachers, respectively. In this case, where $\varsigma_{i}\in\{B,W\}$ , we can write the vector explicitly to illustrate those values: $\eta=(\phi_{1_{B}},\dots,\phi_{R_{B}},\phi_{1_{W}},\dots,\phi_{R_{W}},\eta_{1_{B}},\dots,\eta_{J_{B}},\eta_{1_{W}},\dots,\eta_{J_{W}})^{T}$ for $R$ raters and $J$ items. Similarly, $\ln{\psi_{jr\varsigma}^{2}}=\textbf{Y}_{jr\varsigma}(\ln{\tau^{2}})$ is updated such that $\ln{\mathbf{\tau}^{2}}=(\ln{\psi_{1}^{2}},\dots,\ln{\psi_{R}^{2}},\ln{\tau_{1}^{2}},\dots,\ln{\tau_{J}^{2}},\ln{\tau_{B}^{2}},\ln{\tau_{W}^{2}})^{T}$ .

By approaching the estimation this way, where $\phi_{jr\varsigma}$ is estimated as a parameter, we disentangle contributions to rater scores based on teacher race. This simplifies the task of evaluating for fairness using the metric of group independence, $X\perp\varsigma$ , where we can directly calculate $P[X_{soijr}|\xi_{oij},\phi_{jr\varsigma},\varsigma_{i}]=P[X_{soijr}|\xi_{oij},\phi_{jr\varsigma}]$ . Thus, $X\perp\varsigma\measeq\phi_{B}-\phi_{W}\approxeq 0$ .

When estimated, less than 1% of parameter estimates had $\hat{R}\geq 1.1$ , whose differences in posterior distributions have no material effect on results or discussion; all rater-item-specific 95% credible intervals for biases are represented as horizontal lines in Figure 4, in panel (e). Appendix G has full JAGS code used for the formula specification for all items and dimensions, including initial value parameters. Additionally, a plate diagram for MCMC modeling can be found in Figure 9.
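Given posterior draws of a rater's group-specific biases, the contrast $\phi_{B}-\phi_{W}$ and its credible interval can be summarized as in the Python sketch below. This is an illustrative assumption about post-processing, not the paper's code: the function name is hypothetical, the draws are synthetic, the draws are assumed to be paired by MCMC iteration, and an equal-tailed interval is assumed.

```python
import numpy as np


def bias_gap_summary(phi_black_draws, phi_white_draws, level=0.95):
    """Summarize the posterior of phi_B - phi_W for one rater-item pair."""
    diff = np.asarray(phi_black_draws, dtype=float) - np.asarray(phi_white_draws, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(diff, [tail, 1.0 - tail])
    return {
        "mean": float(diff.mean()),
        "lower": float(lower),
        "upper": float(upper),
        # An interval that excludes zero flags a group-dependent rating bias.
        "excludes_zero": bool(lower > 0 or upper < 0),
    }


# Synthetic draws standing in for MCMC output (hypothetical values).
rng = np.random.default_rng(0)
phi_w = rng.normal(loc=0.05, scale=0.10, size=4000)
phi_b = rng.normal(loc=-0.30, scale=0.10, size=4000)
print(bias_gap_summary(phi_b, phi_w))
```

Working with the draw-wise difference, rather than comparing two separate intervals, keeps any posterior correlation between the two bias parameters in the summary.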

5.5.2 Results

Racial bias at the individual rater level is significantly measurable. The GPT model families show a negative bias trend against Black teachers relative to White teachers on most items, as seen in the comparison of those models across panels (d) and (e) in Figure 4. Potentially more precisely, GPT models’ rating centrality seemed to diminish when rating Black teachers, especially with the "reasoning" model, adding evidence that these foundation models may be sensitive to linguistic differences found in African-American English (AAE) (Hofmann et al., 2024b; Fleisig et al., 2024), possibly due to historical data or models’ relative unfamiliarity with AAE (Rickford and King, 2016). These results alone should give pause to edtech developers relying on prompt engineering of foundation LLMs, as subtle biases exist in very complex tasks. Additionally, it is not just GPT models that show biases. For some types of items, such as negatively worded items, individual human rater effects could be detected in the form of abnormal rater biases, either positive or negative, towards teachers with some sensitive attribute.

Overall, encoders displayed much less bias than humans. However, while not as severe as the GPT or human biases, the encoder models did not avoid issues of racial bias. On the worst-performing item for both human and encoder models, MGEN, all of the encoder models found spurious relationships in some language feature while overfitting with a negative bias against Black teachers. The reasons likely have to do with label sparsity and underrepresentation across label categories: with so few examples of ratings in the higher categories in the training dataset, overfitting on a biased sample was not adequately controlled for, showing a microcosm of the alignment to poor data that GPT exhibits in macrocosm. Fortunately for the encoders, earlier evidence had already suggested that neither the models nor humans (see Appendix F.1 and Hill et al. (2012b)) could sufficiently distinguish between the item’s categories.

Implications: even for tasks where there is minimal overlap of observations across individual raters, bias can still be modeled and removed. This allows for improved curation of datasets and model selection. The techniques can be used for evaluation of biases from given populations.

5.6 Helpfulness: Estimating Real-world Effects

RQ 6:

Can we estimate the effects on rating quality and changes in real-world cost if a model were to be used with a human-in-the-loop? RQ 6 Case Study Reframe: For a teacher, how would automated ratings of instruction affect human rating quality?

5.6.1 Mixed Decision Studies

A Decision Study (D-study) estimates how reliabilities of ratings could improve by adjusting measured facets of variation, much like Ho and Kane did to motivate the case study. To estimate the reliability in a human-in-the-loop scenario, multiple g-studies and d-studies need to be constructed to combine the variance contributions across a set of rater families, $\mathbb{F}$ . For this work, only two different types of families are considered in each d-study, and one of them is always human, as automated rating models, even high-performing Encoders, are not yet ready to produce ratings independent of human confirmation. For a human-in-the-loop decision study, $\mathbb{F}$ would consist of families $\mathbb{f}$ that have humans only and models only, and a combined human-model family. For a $(S:O:i)\times R$ study, the estimated dependability of ratings provided to teachers $i$ on item $j$ , $\tilde{\Phi}_{j}$ , is computed in the joined "universe" $\mathbb{F}^{\prime}$ , where estimations are represented by $\mathbf{K}$ , the collection of unique parameterizations and estimates, $\varkappa$ , for the facets of variance in each D-study:

\displaystyle\widetilde{\Phi}_{j,{\mathbb{F^{\prime}_{\varkappa}}}\sim\mathbf{K}}=\frac{\sum_{\mathbb{f}}^{\mathbb{F}}{\sigma^{2}(i_{\varkappa})}_{j\mathbb{f}}}{\sum_{\mathbb{f}}^{\mathbb{F}}{\sigma^{2}(i_{\varkappa})}_{j\mathbb{f}}+{\sigma^{2}(\Delta_{\varkappa})}_{j\mathbb{f}}}

(9)

where the summations in Equation 9 combine the variation across the familial "universes", indexed by $\varkappa$ , of the different rater families in $\mathbb{F}$ , and ${\sigma^{2}(i_{\varkappa})}_{j}$ and ${\sigma^{2}(\Delta_{\varkappa})}_{j}$ represent the "universe" variability for teacher $i$ and the absolute error for dependability, respectively, at the teacher-year level ( $i$ ) across the combined parameterization set $\mathbf{K}$ . Structurally, Equation 9 shares similarities with the two-stage ICC calculation of Eq. 12. These values are represented in the ratio for calculating dependability, $\Phi_{j}$ , as found in Equation 3, where ${\sigma^{2}(\Delta)}_{j}\equiv\nu_{o:ij}+\nu_{s:o:ij}+\nu_{irj}+\nu_{rj}+\nu_{s:o:irj}$ . The absolute error for a rater family ( $\mathbb{f}$ ) indexed by $\varkappa$ , across any permutation of decision values in this study, is:

\displaystyle\sigma^{2}(\Delta_{\varkappa})=\frac{\sigma^{2}(r_{\varkappa})}{n_{r_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(o_{\varkappa}:i)}{n_{o_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(r_{\varkappa}i)}{n_{r_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(s_{\varkappa}:o_{\varkappa}:i)}{n_{s_{\varkappa}}^{\prime}n_{o_{\varkappa}}^{\prime}}+\frac{\sigma^{2}(s_{\varkappa}:o_{\varkappa}:ir_{\varkappa})}{n_{s_{\varkappa}}^{\prime}n_{o_{\varkappa}}^{\prime}n_{r_{\varkappa}}^{\prime}}

(10)

where the decision values vary across design facets and each facet’s contribution is weighted by the combined count $n_{k}^{\prime}$ of a given facet $k$ : the count for ratings generated only by the family indexed by $\varkappa$ , $n_{k_{\varkappa}}$ , plus the count for those facets, if any, shared between families, $n_{k_{\mathbb{F}^{\prime}}}$ , so that $n_{k_{\varkappa}}^{\prime}=n_{k_{\varkappa}}+n_{k_{\mathbb{F}^{\prime}}}\ \forall k\in\{s,o,r\}$ , with $n_{r_{\mathbb{F}^{\prime}}}=0$ . These distinct sets of parameter values for each design study are represented in Equation 9. For human-in-the-loop only use cases, $\varkappa_{\text{HIL}}$ , the value $n_{k_{\mathbb{F}^{\prime}}}$ represents those sources of variation that are shared between rater families, and for a model family $\mathbb{f}=\mathbb{m}$ , where there would be no observations made by a model without a human, the model would not have any independent observations: $n_{o_{\mathbb{m}}}=0$ . To represent these $n$ values where a human $\mathbb{h}$ observes a classroom for 15 minutes with a model (for the MQI instrument, observation segments are 7.5 minutes long) and where a single model $\mathbb{m}$ continues to observe for the remainder of the class (an additional 45 minutes), $\mathbf{K}_{n\in\varkappa_{\text{HIL}}}=\{n_{o_{\mathbb{m}}}=0,n_{o_{\mathbb{h}}}=0,n_{o_{\mathbb{F}^{\prime}}}=1,n_{s_{\mathbb{m}}}=6,n_{s_{\mathbb{h}}}=0,n_{s_{\mathbb{F}^{\prime}}}=2,n_{r_{\mathbb{m}}}=1,n_{r_{\mathbb{h}}}=1,n_{r_{\mathbb{F}^{\prime}}}=0\}$ , and the variance components are solved similarly to the coefficients of Eq. 1.
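The mixed decision study of Equations 9 and 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the variance components are made-up placeholders rather than estimates from this study, and the summation in the denominator of Equation 9 is read as covering both the teacher variance and the absolute error of each family. The facet counts mirror the human-in-the-loop example above.

```python
from dataclasses import dataclass


@dataclass
class FamilyComponents:
    """Variance components for one rater family (placeholder values only)."""
    var_i: float    # sigma^2(i): teacher "universe" variance
    var_r: float    # sigma^2(r): rater
    var_o: float    # sigma^2(o:i): observation within teacher
    var_ri: float   # sigma^2(ri): rater-by-teacher
    var_so: float   # sigma^2(s:o:i): segment within observation
    var_sor: float  # sigma^2(s:o:ir): residual


def absolute_error(c, n_r, n_o, n_s):
    """Equation 10 for one family, given combined (primed) facet counts."""
    if min(n_r, n_o, n_s) <= 0:
        raise ValueError("combined facet counts must be positive")
    return (c.var_r / n_r + c.var_o / n_o + c.var_ri / n_r
            + c.var_so / (n_s * n_o) + c.var_sor / (n_s * n_o * n_r))


def mixed_dependability(families):
    """Equation 9: families is a list of (components, own_counts, shared_counts)."""
    numerator, denominator = 0.0, 0.0
    for comp, own, shared in families:
        # Combined count n' = family-only count + count shared between families.
        n = {k: own[k] + shared[k] for k in ("r", "o", "s")}
        numerator += comp.var_i
        denominator += comp.var_i + absolute_error(comp, n["r"], n["o"], n["s"])
    return numerator / denominator


shared = {"r": 0, "o": 1, "s": 2}  # one joint lesson, two joint segments
human = (FamilyComponents(0.10, 0.05, 0.10, 0.05, 0.20, 0.50), {"r": 1, "o": 0, "s": 0}, shared)
model = (FamilyComponents(0.12, 0.01, 0.10, 0.02, 0.20, 0.30), {"r": 1, "o": 0, "s": 6}, shared)
print(mixed_dependability([human, model]))
```

Varying the count dictionaries while holding the components fixed is what lets a D-study compare, for example, more model-only segments against more human visits before any new data are collected.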

5.6.2 Results

Estimates of impacts of model use can be reconstructed from measurable variances. The estimates for $\widetilde{\Phi}_{j,{\mathbb{F^{\prime}}}}$ are in Figure 4 panel (f) with complete results for all items in Figure 14. As conducting actual human-annotated classroom observation ratings is immensely expensive, the decision study analyses of Section 5.6 offer methods for estimating the improvement gained by using a model or model family. Parameterizing the decision conditions to reflect "human-in-the-loop" scenarios can even offer insight into whether the variation offered from automated ratings adds to or detracts from human rating quality, offering a means of estimating answers to research questions before running more expensive trials.

Constructs that are relatively infrequent, such as LANGIMP, could greatly benefit from automated ratings, since sufficient human observations for identifying that construct would be expensive. Having encoder models listen in for three entire classes yields reliabilities for that construct that are twice that of the combined efforts of multiple human raters stopping by a teacher’s classroom 10 times, fifteen minutes each time: a net savings of two hours for the principal and a potential savings of over 10 hours if such a level of reliability were desired and these trends were to continue. Implications: Not all variance contributes equally, and its careful deconstruction and reconstruction can anticipate future effects before setting up more expensive studies.

	Category	Metric	GPTs				Encoders
	Category	Metric	EXPL	LANGIMP	REMED	SMQR	EXPL	LANGIMP	REMED	SMQR
RQ1	Concordance	IRRs	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faQuestionCircle[regular]	\faCheckCircle	\faCheckCircle
		$r,\rho,\tau$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faCheckCircle	\faCheckCircle	\faCheckCircle
RQ2	Confidence	$\mathbf{E}\rho^{2}$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faTimesCircle[regular]	\faTimesCircle[regular]
		$\Phi$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faTimesCircle[regular]	\faTimesCircle[regular]
RQ3	Validity	$\varrho_{\mathbb{hm}}^{(j)}$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faCheckCircle	\faQuestionCircle[regular]
RQ4	Bias	$\phi_{r}$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faCheckCircle	\faCheckCircle	\faQuestionCircle[regular]
RQ5	Fairness	$X\perp\varsigma$	\faQuestionCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faCheckCircle	\faCheckCircle	\faCheckCircle
RQ6	Helpfulness	$\widetilde{\Phi}_{{\mathbb{F^{\prime}_{\text{HIL}}}}\sim\mathbf{K}}$	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faTimesCircle[regular]	\faCheckCircle	\faCheckCircle	\faQuestionCircle[regular]

Table 3: Summary Table for Item-level Metrics and Relative Performance for Model Families on four focus items. GPTs are from Wang and Demszky and Encoders are from the present study. For each metric, symbols represent whether the model family generally performs as well as or better than humans \faCheckCircle, worse than humans \faTimesCircle[regular], or if performance relative to humans is unclear \faQuestionCircle[regular]. The results for all MQI items can be found in Table 4. IRRs refers to the Inter-rater Agreement metrics from Section 5.1.

6 Overall Results and Discussion

At the outset we asked: How can we know when the behaviors of models are good enough to be used in lieu of the humans estimated by Ho and Kane? This question, which is a question of validity, is unanswerable by purely empirical means. While reliability (and accuracy) are measurable, validity is a case made from argument. Thus, the answer to that question is not binary, but one of quality; it is about knowing when the behaviors of models are "good enough" on some item on some instrument for some population of classrooms against some standard of performance. Even though the Encoder family in this study outperforms humans, we need to be wary of the validity of the construct being measured, as humans have exhibited the tendency to collaborate poorly with LLM/AI models in their current state (Vaccaro et al., 2024; Agarwal et al., 2023; Zhou et al., 2024; Azaria et al., 2024; UpLevel, 2024). The constraints of human uses demand arguments for validity that are beyond the scope of this work, despite the intentional wording of the primary research question.

The overall results relative to human performance corresponding to each of the research questions and their respective metrics for the four focus MQI items can be found in Table 3; Table 4 covers all MQI items.

For the four focus MQI items, contrasting panel (b) with panel (c) in Figure 4 reveals that commonly used evaluation metrics can obscure important aspects of model performance. However, as demonstrated in panels (c)-(f), there are methods that can be used to improve evaluation under label uncertainty. Many of these methods could be applied to annotated data prior to model training to improve data quality and support training (Gordon et al., 2022).

Encoder models, on most items and in general, outperformed human raters in terms of reduced biases, improved performance metrics, and anticipated cost savings. They represent the best-performing models for automated rating of classroom instruction using an authentic measurement instrument of which we are aware at the time of writing, showing large gains over human performance, and even larger gains compared to other models, across the metrics discussed herein. While not the focus of this study, the best reported single metric by Whitehill and LoCasale-Crouch on the CLASS rubric across all items and models, $R=0.48$ , contrasts with the average CLASS item performance of the encoders, $\bar{R}=0.60$ , and the single worst item for any Encoder model, $\min R=0.50$ , as reported in the online materials. Thus, the Encoder family models offer a pathway forward for supporting the expensive research task of instructional annotation, regardless of whether they are ready for actual deployment with teachers.

This is in stark contrast to the GPT models, which perform much worse than human raters. GPT models likely performed poorly in part due to the prompt length (Liu et al., 2023a) and in part due to the out-of-distribution inputs of elementary school classroom discourse and the task of instructional assessment (McCoy et al., 2023); these hypotheses could be investigated in future research. As GPT-style models increase in popularity, in use, and in sophistication, these methods can help identify sophistry and speciousness in third-party models even in the presence of low reliability. Like humans, models tended to choose a preferred rating value, and their deviations, conditionally informed by billions of fixed parameters at inference, are non-random. (Variables like ‘temperature’ can increase the stochasticity of model outputs.)

Being able to identify biases in cases of unreliable annotations is important, and researchers should resist the urge to withhold evaluable results from foundation models even if the data fail to reject a null hypothesis. By performing more rigorous evaluations, researchers could crowdsource measuring model biases and behavior tendencies to help all users be more discerning of speciousness, especially as these models’ poor behaviors get harder to detect (Azaria et al., 2024; Hosking et al., 2024; Zhou et al., 2024) and as researchers make bolder claims about their abilities (see Binz et al. 2024, inter alia).

The Encoder models’ designs, by contrast, were constructed to allow for multiple methods of interpretability and use by evaluating continuous windows of classroom discourse. This could be used for real-time diagnosis, interpretation, and supporting common understanding between teacher and coach. An example of such use can be found in Figure 15, where the continuous predictions for all encoder models are displayed next to average human rating scores. Improvements to this process, combined with successful feature attribution, could boost validity and trust in model use for these high-stakes scenarios. If various performance measures continue to display such performance, feature attribution (see Appendix I.1) could then be used in the future for augmenting transcripts of classroom instruction to support model training and inference.

Automated encoder LLMs could reduce the high costs of improving classroom observers’ annotations and serve as a stepping stone to quality teacher development. (Code for statistical models is available in the appendix and free for use.) Education technologists and EdTech enthusiasts should be wary of foundation models’ abilities to do out-of-distribution tasks. These "stochastic parrots" (Bender et al., 2021) might start fires with their "embers of autoregression" (McCoy et al., 2023) when trying to perform tasks for data so far from their training distribution, which is certainly the case with authentic fourth and fifth grade mathematics classroom discourse.

7 Limitations

The methods serve as a proof of concept for enhancing reliability in widespread and costly classroom evaluation tasks. Even though these models can perform better than a human on many accepted metrics, much more analysis and technological development are needed. Despite being best in class, these models should not be used in production in their current state. Even with a human in the loop, much more work must be done to ensure that they are ready for the capabilities end users may assume they have. Far more important is that GPT-style models not be used similarly, and this paper does not endorse their use for this or similar tasks.

Demonstrating multiple methods in a paper while suggesting their flexibility evokes the Garden of Forking Paths Problem. This study chose to follow the same parameterizations in Section 5.1 and data aggregations as the original study (Kane et al., 2015) in order to preserve comparability with the original data and human raters by using methods more familiar in this context. However, this parameterization has its limitations. An alternative to aggregating and calculating reliabilities at the segment level (as was demonstrated in Section 5.6) would be to look at reliability and validity issues at the utterance level, something uniquely available to the Encoder model family herein that is not available to other raters or models. Figure 15 illustrates this capability, which is underexplored in this paper. Such analyses could be bolstered further by authentic feature attribution for improving interpretability. (See Appendix I.1 for more on directions for future work implied here.)

While they do demonstrate the claims, the methods of this paper might not be the best implementation of available methods. Rather, they are intended to illustrate the potential for better quantifying behaviors in both labelers and models when we have uncertainty in labels. For example, if more understanding of rater perceptions and behaviors of labeling tasks is needed, using a more expressive substitution of Equation 7 (DeCarlo et al., 2011; DeCarlo, 2023, 2008) could give greater insight, especially in the case where models may perceive label category thresholds differently.

Psychometric models generally assume that the underlying latent variables are distributed normally across a population, which is usually a reasonable assumption with humans. But this assumption need not be true for models nor for all tasks. In this study, a few models were estimated alongside humans to demonstrate how differently they behave under this assumption, but this paper provides no evidence that abilities would be normally distributed for LLMs (e.g., latent constructs could follow multimodal distributions, depending on model family and pretraining, or follow a Normal-exponential-gamma distribution for shifts in metric-specific emergent behaviors). Were researchers interested in modeling learning in a larger population of models, other methods, such as unipolar IRT models (Huang and Bolt, 2023), could potentially help in understanding between-model behaviors for the case where the rating instrument is purely an issue of detection and then magnitude. The usefulness of the basic psychometric models presented rests on the usefulness of the anthropomorphic distributional comparisons we can reasonably make in the presence of uncertain labels.

The parameters and variables selected for reporting the decision study results do not represent all use cases and algorithms. While it is reasonable to assume that labels from models like GPT would be treated as if they came from humans, this is still an assumption. For example, the decision study of Section 5.6 does not have a within-observation-longitudinal parameterization and thus assumes that humans observing multiple segments of a class period do not necessarily need to observe the segments consecutively. While the MQI rubric is worded so as to be robust to within-lesson autocorrelation, actual lessons are obviously autocorrelated. Longitudinality could likewise support more accurate versions of Equation 6.
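To make the autocorrelation concern concrete, the following is a minimal sketch and not the decision study of Section 5.6. It assumes a simple one-facet design in which a teacher-level variance component is averaged over segments, and it assumes (hypothetically) that segment-level residuals follow an AR(1) process with lag-1 correlation `rho`. The function name `projected_reliability` and the variance values are illustrative only and do not come from the paper.

```python
def projected_reliability(var_teacher, var_residual, n_segments, rho=0.0):
    """Project the reliability of a mean rating over n_segments segments.

    Assumption (not from the paper): residuals across consecutive segments
    are AR(1) with lag-1 correlation rho. With rho = 0 this reduces to the
    usual decision-study form var_teacher / (var_teacher + var_residual / n).
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    # Variance of a mean of n AR(1)-correlated errors, relative to the
    # independent case: 1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho**k
    inflation = 1.0 + 2.0 * sum(
        (1.0 - k / n_segments) * rho ** k for k in range(1, n_segments)
    )
    error_var = var_residual * inflation / n_segments
    return var_teacher / (var_teacher + error_var)


if __name__ == "__main__":
    # Hypothetical variance components chosen only for illustration.
    for n in (1, 2, 4, 8):
        independent = projected_reliability(1.0, 3.0, n)
        correlated = projected_reliability(1.0, 3.0, n, rho=0.5)
        print(f"n={n}: independent={independent:.3f}, AR(1)={correlated:.3f}")
```

Under these assumptions, positive autocorrelation means that additional consecutive segments add less information than independent segments would, so a projection that ignores it would overstate the gain from observing more segments.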

While many studies cited herein seek to generalize similar research across all classrooms, we acknowledge that this cannot be done with the transcript data used in this work, as it only consists of fourth- and fifth-grade mathematics classrooms from the United States. While the methods potentially possess broad applicability across all grades and subject areas, the current models lack generalizability beyond elementary mathematics classrooms in U.S. public schools, highlighting the need for more publicly available data in this area. Furthermore, the associated ratings and reliability metrics pertain solely to a subset of rating items on the MQI rubric (footnote 15: the full set of items from the MQI and CLASS rubrics is available in the Appendices and in the online materials), which may introduce limitations when addressing the more universal task of automated instruction ratings. This is associated with the limitations of the instruments themselves, as imperfect tools for even calibrated and trained raters.

Similarly, as the focus of this paper is to demonstrate evaluation techniques in the presence of unreliable labels, the generalizability of models is low. Encoder models, while each is powerful and individually able to produce automated scores for 25 different authentic measures of classroom instruction (in contrast to the models of Xu et al., which used 11 separate fine-tuned models for the MQI items evaluated), were built specifically for this task and would not generalize further without data or architecture changes. GPT models represent the autoregressive decoder in-context learning that was available via prompt engineering in 2023. Models have scaled and improved since then, and it is possible that performance would improve, but issues of underlying racial biases (Section 5.5) continue to exist, even with more current models (Hofmann et al., 2024b, a; Warr et al., 2024; Shieh et al., 2024; Nghiem et al., 2024; Henderson et al., 2024).

The Encoder models were trained under the assumptions that the actual expert human ratings are not very reliable, that the alignment of timing across rubrics and across transcripts is imperfect, that the discourse transcripts are imperfect, and that information is lost by keeping fixed sentence-level embeddings. While the methods outlined worked to extract a meaningful signal despite these challenges, it should be noted that the signal is still trained on noisy human ratings. If, on average, the raters had a particular bias, the model would carry that bias. This is particularly true of the CLASS item ratings, which came from only 19 different raters (compared to the 63 used for the MQI rubric items) and had only one rater per classroom observation. Results are included for comparability and generalizability, but they likely carry more of the human raters’ idiosyncrasies.

The encoder models removed transcription notes and intentionally did not use transcription information (such as identification of speaker) to best emulate what the functionality would be in an audio-input-only setup. While this is an authentic interpretation of the task, the transcription process was still done by humans. While direct input from audio would capture even more information (such as tone or long breaks in speaking for independent work), these models have not been trained to work with automated transcription.

The encoder models could be improved through metalearning training, so they could be more adaptive to new instructional rubrics and classrooms. Without metalearning across tasks, transferability is limited by the training regime and architecture as well as the data. Future work will include metalearning, allowing the model to take advantage of 72% more observations.

Finally, while the paper reported on "GPT" family performance, it only used the performance corresponding to a single study, which used only prompt engineering and ChatGPT 3.5. Perhaps with fine-tuning, multi-agent prompting, and other enhanced uses of such models, performance might improve. However, it is not clear that, even as models continue to improve on general use tasks, they will improve in their ability to understand and respond to text that is outside of their training distribution (i.e., classroom discourse). Even if the text were within the training distribution, this study has demonstrated that evaluation of such text is non-trivial and, thus, the task would still be more challenging for such models (McCoy et al., 2023).

8 Authorship and Positionality Statement

Michael Hardy is the sole author of this work. Prior to his research work, he worked in public education as a teacher, principal, superintendent, and a state chief, where he evaluated and improved instructional materials and practices across many contexts. With more than a decade of successfully coaching instruction and as a former Educator of the Year for Texas, he is compelled by his passion and expertise to improve and support classroom teachers so that all students can have access to an excellent education. Third-party generative language models, such as ChatGPT, were not used for any aspect of the study, except where explicitly stated.

References

Abercrombie et al. (2023) Gavin Abercrombie, Verena Rieser, and Dirk Hovy. 2023. Consistency is Key: Disentangling Label Variation in Natural Language Processing with Intra-Annotator Agreement. arXiv preprint. ArXiv:2301.10684 [cs].
Adams et al. (1997) Raymond J. Adams, Mark Wilson, and Wen-chung Wang. 1997. The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1):1–23. Place: US Publisher: Sage Publications.
Adebayo et al. (2020) Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2020. Sanity Checks for Saliency Maps. arXiv preprint. ArXiv:1810.03292 [cs, stat].
Agarwal et al. (2023) Nikhil Agarwal, Alex Moehring, Pranav Rajpurkar, and Tobias Salz. 2023. Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology.
Aguilar (2013) Elena Aguilar. 2013. Developing a Work Plan: How Do I Determine What to Do? In The art of coaching: effective strategies for school transformation, pages 119–144. Jossey-Bass, A Wiley Brand, San Francisco.
Alic et al. (2022) Sterling Alic, Dorottya Demszky, Zid Mancenido, Jing Liu, Heather Hill, and Dan Jurafsky. 2022. Computationally identifying funneling and focusing questions in classroom discourse. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pages 224–233, Seattle, Washington. Association for Computational Linguistics.
Azaria et al. (2024) Amos Azaria, Rina Azoulay, and Shulamit Reches. 2024. ChatGPT is a Remarkable Tool—For Experts. Data Intelligence, 6(1):240–296.
Baan et al. (2022) Joris Baan, Wilker Aziz, Barbara Plank, and Raquel Fernández. 2022. Stop Measuring Calibration When Humans Disagree. arXiv preprint. ArXiv:2210.16133 [cs].
Baan et al. (2024) Joris Baan, Raquel Fernández, Barbara Plank, and Wilker Aziz. 2024. Interpreting Predictive Probabilities: Model Confidence or Human Label Variation? arXiv preprint. ArXiv:2402.16102 [cs] version: 1.
Bacher-Hicks et al. (2017) Andrew Bacher-Hicks, Mark J. Chin, Thomas J. Kane, and Douglas O. Staiger. 2017. An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.
Bacher-Hicks et al. (2019) Andrew Bacher-Hicks, Mark J. Chin, Thomas J. Kane, and Douglas O. Staiger. 2019. An experimental evaluation of three teacher quality measures: Value-added, classroom observations, and student surveys. Economics of Education Review, 73:101919.
Bambrick-Santoyo (2016) Paul Bambrick-Santoyo. 2016. Get better faster: a 90-day plan for coaching new teachers. Jossey-Bass, A Wiley Brand, San Francisco, CA.
Bambrick-Santoyo (2018) Paul Bambrick-Santoyo. 2018. Leverage leadership 2.0: a practical guide to building exceptional schools. Jossey-Bass, San Francisco, CA.
Bates et al. (2015) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67:1–48.
Bejar et al. (2006) Isaac I. Bejar, David M. Williamson, and Robert J. Mislevy. 2006. Human Scoring. In Automated Scoring of Complex Tasks in Computer-Based Testing. Routledge. Num Pages: 34.
Belz et al. (2020) Anya Belz, Simon Mille, and David M. Howcroft. 2020. Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing. In Proceedings of the 13th International Conference on Natural Language Generation, pages 183–194, Dublin, Ireland. Association for Computational Linguistics.
Belz et al. (2023) Anya Belz, Craig Thomson, Ehud Reiter, and Simon Mille. 2023. Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3676–3687, Toronto, Canada. Association for Computational Linguistics.
Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA. Association for Computing Machinery.
Binz et al. (2024) Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda-Forno, Peter Dayan, Can Demircan, Maria K. Eckstein, Noémi Éltető, Thomas L. Griffiths, Susanne Haridi, Akshay K. Jagadish, Li Ji-An, Alexander Kipnis, Sreejan Kumar, Tobias Ludwig, Marvin Mathony, Marcelo Mattar, Alireza Modirshanechi, Surabhi S. Nath, Joshua C. Peterson, Milena Rmus, Evan M. Russek, Tankred Saanum, Natalia Scharfenberg, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Xin Sui, Mirko Thalmann, Fabian Theis, Vuong Truong, Vishaal Udandarao, Konstantinos Voudouris, Robert Wilson, Kristin Witte, Shuchen Wu, Dirk Wulff, Huadong Xiong, and Eric Schulz. 2024. Centaur: a foundation model of human cognition. arXiv preprint. ArXiv:2410.20268.
Birhane et al. (2022) Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. 2022. The Values Encoded in Machine Learning Research. arXiv preprint. ArXiv:2106.15590.
Blazar (2018) David Blazar. 2018. Validating Teacher Effects on Students’ Attitudes and Behaviors: Evidence from Random Assignment of Teachers to Students. Education Finance and Policy, 13(3):281–309.
Blazar et al. (2017) David Blazar, David Braslow, Charalambos Y. Charalambous, and Heather C. Hill. 2017. Attending to General and Mathematics-Specific Dimensions of Teaching: Exploring Factors Across Two Observation Instruments. Educational Assessment, 22(2):71–94. Publisher: Routledge _eprint: https://doi.org/10.1080/10627197.2017.1309274.
Blazar and Pollard (2022) David Blazar and Cynthia Pollard. 2022. Challenges and Tradeoffs of “Good” Teaching: The Pursuit of Multiple Educational Outcomes. Technical report, Annenberg Institute at Brown University. Publication Title: EdWorkingPapers.com.
Brennan (2001a) Robert L. Brennan. 2001a. Generalizability Theory. Springer, New York, NY.
Brennan (2001b) Robert L. Brennan. 2001b. Variability of Statistics in Generalizability Theory. In Robert L. Brennan, editor, Generalizability Theory, Statistics for Social Sciences and Public Policy, pages 179–213. Springer, New York, NY.
Brennan (2013) Robert L. Brennan. 2013. Generalizability Theory. Springer Science & Business Media. Google-Books-ID: nbHbBwAAQBAJ.
Briggs and Wilson (2007) Derek C. Briggs and Mark Wilson. 2007. Generalizability in item response modeling. Journal of Educational Measurement, 44(2):131–155. Place: United Kingdom Publisher: Blackwell Publishing.
Casabianca (2021) Jodi M. Casabianca. 2021. Digital Module 27: Hierarchical Rater Models. Educational Measurement: Issues and Practice, 40(4):103–104. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/emip.12478.
Casabianca et al. (2013) Jodi M. Casabianca, Daniel F. McCaffrey, Drew H. Gitomer, Courtney A. Bell, Bridget K. Hamre, and Robert C. Pianta. 2013. Effect of Observation Mode on Measures of Secondary Mathematics Teaching. Educational and Psychological Measurement, 73(5):757–783. Publisher: SAGE Publications Inc.
Charalambous and Delaney (2019) Charalambos Y. Charalambous and Seán Delaney. 2019. 13 Mathematics Teaching Practices and Practice-Based Pedagogies. Brill. Section: International Handbook of Mathematics Teacher Education: Volume 1.
Charles (2005) Eric P. Charles. 2005. The Correction for Attenuation Due to Measurement Error: Clarifying Concepts and Creating Confidence Sets. Psychological Methods, 10(2):206–226. Place: US Publisher: American Psychological Association.
Corbett-Davies et al. (2023) Sam Corbett-Davies, Johann D. Gaebler, Hamed Nilforoshan, Ravi Shroff, and Sharad Goel. 2023. The Measure and Mismeasure of Fairness. arXiv preprint. ArXiv:1808.00023 [cs].
Cui et al. (2024) Chengyu Cui, Chun Wang, and Gongjun Xu. 2024. Variational Estimation for Multidimensional Generalized Partial Credit Model. Psychometrika.
D’Amour et al. (2020) Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, and D. Sculley. 2020. Underspecification Presents Challenges for Credibility in Modern Machine Learning. arXiv preprint. ArXiv:2011.03395 [cs, stat].
Darling-Hammond (2014) Linda Darling-Hammond. 2014. What Can PISA Tell Us about U.S. Education Policy? New England Journal of Public Policy, 26(1).
Darling-Hammond et al. (2020) Linda Darling-Hammond, Lisa Flook, Channa Cook-Harvey, Brigid Barron, and David Osher. 2020. Implications for educational practice of the science of learning and development. Applied Developmental Science, 24(2):97–140. Publisher: Routledge _eprint: https://doi.org/10.1080/10888691.2018.1537791.
Decarlo (2003) Lawrence T. Decarlo. 2003. Using the PLUM procedure of SPSS to fit unequal variance and generalized signal detection models. Behavior Research Methods, Instruments, & Computers, 35(1):49–56.
DeCarlo (2008) Lawrence T. DeCarlo. 2008. Studies of a Latent-Class Signal-Detection Model for Constructed-Response Scoring. ETS Research Report Series, 2008(2):i–55. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/j.2333-8504.2008.tb02149.x.
DeCarlo (2023) Lawrence T. DeCarlo. 2023. Classical Item Analysis from a Signal Detection Perspective. Journal of Educational Measurement, 60(3):520–547. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/jedm.12358.
DeCarlo et al. (2011) Lawrence T. DeCarlo, YoungKoung Kim, and Matthew S. Johnson. 2011. A Hierarchical Rater Model for Constructed Responses, with a Signal Detection Rater Model. Journal of Educational Measurement, 48(3):333–356. Publisher: National Council on Measurement in Education.
Demszky and Hill (2022) Dorottya Demszky and Heather Hill. 2022. The NCTE Transcripts: A Dataset of Elementary Math Classroom Transcripts. Publisher: arXiv Version Number: 1.
Demszky and Hill (2023) Dorottya Demszky and Heather Hill. 2023. The NCTE Transcripts: A Dataset of Elementary Math Classroom Transcripts. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 528–538, Toronto, Canada. Association for Computational Linguistics.
Demszky and Liu (2023) Dorottya Demszky and Jing Liu. 2023. M-Powering Teachers: Natural Language Processing Powered Feedback Improves 1:1 Instruction and Student Outcomes. In Proceedings of the Tenth ACM Conference on Learning @ Scale, L@S ’23, pages 59–69, New York, NY, USA. Association for Computing Machinery. Event-place: Copenhagen, Denmark.
Demszky et al. (2023) Dorottya Demszky, Jing Liu, Heather C. Hill, Shyamoli Sanghi, and Ariel Chung. 2023. Improving Teachers’ Questioning Quality through Automated Feedback: A Mixed-Methods Randomized Controlled Trial in Brick-and-Mortar Classrooms. Technical report, Annenberg Institute at Brown University. Publication Title: EdWorkingPapers.com.
Demszky et al. (2021) Dorottya Demszky, Jing Liu, Zid Mancenido, Julie Cohen, Heather Hill, Dan Jurafsky, and Tatsunori Hashimoto. 2021. Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions. Publisher: arXiv Version Number: 1.
Demszky et al. (2024) Dorottya Demszky, Rose Wang, Sean Geraghty, and Carol Yu. 2024. Does Feedback on Talk Time Increase Student Engagement? Evidence from a Randomized Controlled Trial on a Math Tutoring Platform. In Proceedings of the 14th Learning Analytics and Knowledge Conference, LAK ’24, pages 632–644, New York, NY, USA. Association for Computing Machinery.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ding et al. (2022) Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. 2022. Retiring Adult: New Datasets for Fair Machine Learning. arXiv preprint. ArXiv:2108.04884 [cs, stat].
Donnelly et al. (2017) Patrick J. Donnelly, Nathaniel Blanchard, Andrew M. Olney, Sean Kelly, Martin Nystrand, and Sidney K. D’Mello. 2017. Words matter: automatic detection of teacher questions in live classroom discourse using linguistics, acoustics, and context. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, LAK ’17, pages 218–227, New York, NY, USA. Association for Computing Machinery.
Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, pages 214–226, New York, NY, USA. Association for Computing Machinery.
(51) Thomas Eckes and Kuan-Yu Jin. Detecting Illusory Halo Effects in Rater-Mediated Assessment: A Mixture Rasch Facets Modeling Approach.
Field et al. (2021) Anjalie Field, Su Lin Blodgett, Zeerak Waseem, and Yulia Tsvetkov. 2021. A Survey of Race, Racism, and Anti-Racism in NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1905–1925, Online. Association for Computational Linguistics.
Fleisig et al. (2024) Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, and Dan Klein. 2024. Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination. arXiv preprint. ArXiv:2406.08818 [cs] version: 1.
Gao (2022) Shuai Gao. 2022. Système de traduction automatique neuronale français-mongol (historique, mise en place et évaluations) (French-Mongolian neural machine translation system (history, implementation, and evaluations)). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 : 24e Rencontres Etudiants Chercheurs en Informatique pour le TAL (RECITAL), pages 97–110, Avignon, France. ATALA.
Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint. ArXiv:2209.14375.
Gordon et al. (2022) Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI ’22, pages 1–19, New York, NY, USA. Association for Computing Machinery.
Grissom et al. (2013) Jason Grissom, Susanna Loeb, and Benjamin Master. 2013. Effective Instructional Time Use for School Leaders: Longitudinal Evidence from Observations of Principals. Educational Researcher, 42(8):433.
Guttman (1945) Louis Guttman. 1945. A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.
Hammond (2015) Zaretta Hammond. 2015. Culturally responsive teaching and the brain: promoting authentic engagement and rigor among culturally and linguistically diverse students. Corwin, a SAGE company, Thousand Oaks, California. OCLC: ocn889185083.
Hardt et al. (2016) Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. arXiv preprint. ArXiv:1610.02413 [cs].
Hardy (2021) Mike Hardy. 2021. Toward Educator-focused Automated Scoring Systems for Reading and Writing. arXiv preprint. ArXiv:2112.11973 [cs].
Hebert-Johnson et al. (2018) Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. 2018. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In Proceedings of the 35th International Conference on Machine Learning, pages 1939–1948. PMLR. ISSN: 2640-3498.
Henderson et al. (2024) Peter Henderson, Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, and Prateek Mittal. 2024. Safety Risks from Customizing Foundation Models via Fine-tuning.
Heo et al. (2024) Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Shirley Ren, Udhay Nallasamy, Andy Miller, Kwan Ho Ryan Chan, and Jaya Narain. 2024. Do LLMs "know" internally when they follow instructions?
Hill et al. (2008) Heather C. Hill, Merrie L. Blunk, Charalambos Y. Charalambous, Jennifer M. Lewis, Geoffrey C. Phelps, Laurie Sleep, and Deborah Loewenberg Ball. 2008. Mathematical Knowledge for Teaching and the Mathematical Quality of Instruction: An Exploratory Study. Cognition and Instruction, 26(4):430–511. Publisher: Taylor & Francis, Ltd.
Hill et al. (2012a) Heather C. Hill, Charalambos Y. Charalambous, David Blazar, Daniel McGinn, Matthew A. Kraft, Mary Beisiegel, Andrea Humez, Erica Litke, and Kathleen Lynch. 2012a. Validating Arguments for Observational Instruments: Attending to Multiple Sources of Variation. Educational Assessment, 17(2-3):88–106. Publisher: Routledge _eprint: https://doi.org/10.1080/10627197.2012.715019.
Hill et al. (2012b) Heather C. Hill, Charalambos Y. Charalambous, and Matthew A. Kraft. 2012b. When Rater Reliability Is Not Enough: Teacher Observation Systems and a Case for the Generalizability Study. Educational Researcher, 41(2):56–64. Publisher: American Educational Research Association.
Ho and Kane (2013) Andrew D. Ho and Thomas J. Kane. 2013. The Reliability of Classroom Observations by School Personnel. Research Paper. MET Project. Technical report, Bill & Melinda Gates Foundation. Publication Title: Bill & Melinda Gates Foundation ERIC Number: ED540957.
Hofmann et al. (2024a) Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024a. AI generates covertly racist decisions about people based on their dialect. Nature, pages 1–8. Publisher: Nature Publishing Group.
Hofmann et al. (2024b) Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024b. Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. arXiv preprint. ArXiv:2403.00742 [cs].
Hosking et al. (2024) Tom Hosking, Phil Blunsom, and Max Bartolo. 2024. Human Feedback is not Gold Standard. arXiv preprint. ArXiv:2309.16349.
Hosseiny Marani et al. (2022) Amin Hosseiny Marani, Joshua Levine, and Eric P.S. Baumer. 2022. One Rating to Rule Them All? Evidence of Multidimensionality in Human Assessment of Topic Labeling Quality. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, pages 768–779, New York, NY, USA. Association for Computing Machinery.
Huang and Bolt (2023) Qi (Helen) Huang and Daniel M. Bolt. 2023. Unipolar IRT and the Author Recognition Test (ART). Behavior Research Methods.
Jacobs et al. (2022) Cassandra L. Jacobs, Ryan J. Hubbard, and Kara D. Federmeier. 2022. Masked language models directly encode linguistic uncertainty. In Proceedings of the Society for Computation in Linguistics 2022, pages 225–228, online. Association for Computational Linguistics.
Ji (2023) Xuejun (Ryan) Ji. 2023. Using cross-classified mixed effects model for validation studies : a flexible and pragmatic validation method. Ph.D. thesis, University of British Columbia.
Jurenka et al. (2024) Irina Jurenka, Markus Kunesch, Kevin R McKee, Daniel Gillick, Shaojian Zhu, Shubham Milind Phal, Katherine Hermann, Daniel Kasenberg, Avishkar Bhoopchand, Ankit Anand, Miruna Pîslar, Stephanie Chan, Lisa Wang, Jennifer She, Parsa Mahmoudieh, Wei-Jen Ko, Andrea Huber, Brett Wiltshire, Gal Elidan, Roni Rabin, Jasmin Rubinovitz, Mac McAllister, Julia Wilkowski, David Choi, Roee Engelberg, Lidan Hackmon, Adva Levin, Rachel Griffin, Michael Sears, Filip Bar, Mia Mesar, Mana Jabbour, Arslan Chaudhry, James Cohan, Sridhar Thiagarajan, Nir Levine, Ben Brown, Dilan Gorur, Svetlana Grant, Rachel Hashimshoni, Jieru Hu, Dawn Chen, Kuba Dolecki, Canfer Akbulut, Maxwell Bileschi, Laura Culp, Wen-Xin Dong, Nahema Marchal, Kelsie Van Deman, Hema Bajaj Misra, Michael Duah, Moran Ambar, Avi Caciularu, Sandra Lefdal, Chris Summerfield, James An, Pierre-Alexandre Kamienny, Abhinit Mohdi, Theofilos Strinopoulous, Annie Hale, Wayne Anderson, Luis C Cobo, Niv Efron, Muktha Ananda, Shakir Mohamed, Maureen Heymans, Zoubin Ghahramani, Yossi Matias, Ben Gomes, and Lila Ibrahim. 2024. Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach.
Kane et al. (2015) Thomas Kane, Heather Hill, and Douglas Staiger. 2015. National Center for Teacher Effectiveness Main Study: Version 4.
Kane et al. (2013) Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. 2013. Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment. Research Paper. MET Project. Technical report, Bill & Melinda Gates Foundation. Publication Title: Bill & Melinda Gates Foundation ERIC Number: ED540959.
Kane and Staiger (2012) Thomas J. Kane and Douglas O. Staiger. 2012. Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains. Research Paper. MET Project. Technical report, Bill & Melinda Gates Foundation. Publication Title: Bill & Melinda Gates Foundation ERIC Number: ED540960.
Kasy and Abebe (2021) Maximilian Kasy and Rediet Abebe. 2021. Fairness, Equality, and Power in Algorithmic Decision-Making. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 576–586, Virtual Event Canada. ACM.
Kazai et al. (2013) Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. 2013. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Information Retrieval, 16(2):138–178.
Kelly et al. (2018) Sean Kelly, Andrew M. Olney, Patrick Donnelly, Martin Nystrand, and Sidney K. D’Mello. 2018. Automatically Measuring Question Authenticity in Real-World Classrooms. Educational Researcher, 47(7):451–464. Publisher: American Educational Research Association.
Kiela et al. (2021) Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking Benchmarking in NLP. arXiv preprint. ArXiv:2104.14337 [cs].
Kim et al. (2018) Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). arXiv preprint. ArXiv:1711.11279 [stat].
Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv preprint. ArXiv:1412.6980 [cs].
Klahr (2013) David Klahr. 2013. What do we mean? On the importance of not abandoning scientific rigor when talking about science education. Proceedings of the National Academy of Sciences, 110(supplement_3):14075–14080. Publisher: Proceedings of the National Academy of Sciences.
Kromrey et al. (2008) J. Kromrey, Robert H. Fay, and Aarti P. Bellara. 2008. Macro for Computing Confidence Intervals for Disattenuated Correlation Coefficients.
Lemov (2021) Doug Lemov. 2021. Teach like a champion 3.0: 63 techniques that put students on the path to college, third edition. Jossey-Bass, a Wiley imprint, Hoboken, NJ.
Lemov and Atkins (2015) Doug Lemov and Norman Atkins. 2015. Teach like a champion 2.0: 62 techniques that put students on the path to college, second edition. Jossey-Bass, San Francisco, CA.
Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv preprint. ArXiv:2308.03281 [cs].
Liljedahl et al. (2021) Peter Liljedahl, Tracy Johnston Zager, and Laura Wheeler. 2021. Building thinking classrooms in mathematics: 14 teaching practices for enhancing learning: Grades K-12. Corwin Mathematics. Corwin, Thousand Oaks, California London New Delhi Singapore.
Liu and Cohen (2021) Jing Liu and Julie Cohen. 2021. Measuring Teaching Practices at Scale: A Novel Application of Text-as-Data Methods. Educational Evaluation and Policy Analysis, 43(4):587–614. Publisher: American Educational Research Association.
Liu et al. (2023a) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023a. Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint. ArXiv:2307.03172 [cs].
Liu et al. (2023b) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023b. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv preprint. ArXiv:2303.16634 [cs].
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint. ArXiv:1907.11692 [cs].
Lundberg and Lee (2017) Scott Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. arXiv preprint. ArXiv:1705.07874 [cs, stat].
Mantzicopoulos et al. (2018) Panayota Mantzicopoulos, Brian F. French, and Helen Patrick. 2018. The Mathematical Quality of Instruction (MQI) in Kindergarten: An Evaluation of the Stability of the MQI Using Generalizability Theory. Early Education and Development, 29(6):893–908. Publisher: Routledge _eprint: https://doi.org/10.1080/10409289.2018.1477903.
Mariano and Junker (2007) Louis T. Mariano and Brian W. Junker. 2007. Covariates of the Rating Process in Hierarchical Models for Multiple Ratings of Test Items. Journal of Educational and Behavioral Statistics, 32(3):287–314.
McCoy et al. (2023) R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. 2023. Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve.
Messick (1998) Samuel Messick. 1998. Test Validity: A Matter of Consequence. Social Indicators Research, 45(1/3):35–44. Publisher: Springer.
Muchinsky (1996) Paul M. Muchinsky. 1996. The Correction for Attenuation. Educational and Psychological Measurement, 56(1):63–75. Publisher: SAGE Publications Inc.
Muraki (1992) Eiji Muraki. 1992. A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16(2):159–176. Publisher: SAGE Publications Inc.
Murphy and Beretvas (2015) Daniel L. Murphy and S. Natasha Beretvas. 2015. A Comparison of Teacher Effectiveness Measures Calculated Using Three Multilevel Models for Raters Effects. Applied Measurement in Education, 28(3):219–236. Publisher: Routledge _eprint: https://doi.org/10.1080/08957347.2015.1042158.
Nghiem et al. (2024) Huy Nghiem, John Prindle, Jieyu Zhao, and Hal Daumé III. 2024. "You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations. arXiv preprint. ArXiv:2406.12232.
Patz et al. (2002) Richard J. Patz, Brian W. Junker, Matthew S. Johnson, and Louis T. Mariano. 2002. The Hierarchical Rater Model for Rated Test Items and Its Application to Large-Scale Educational Assessment Data. Journal of Educational and Behavioral Statistics, 27(4):341–384. Publisher: [American Educational Research Association, Sage Publications, Inc., American Statistical Association].
Pianta and Hamre (2009) Robert C. Pianta and Bridget K. Hamre. 2009. Conceptualization, Measurement, and Improvement of Classroom Processes: Standardized Observation Can Leverage Capacity. Educational Researcher, 38(2):109–119. Publisher: American Educational Research Association.
Pianta et al. (2008) Robert C. Pianta, Karen M. La Paro, and Bridget K. Hamre. 2008. Classroom Assessment Scoring System (CLASS) Manual, K-3. Paul H. Brookes Publishing Company. Google-Books-ID: NBeaGgAACAAJ.
Pleiss et al. (2017) Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. 2017. On Fairness and Calibration. arXiv preprint. ArXiv:1709.02012 [cs, stat].
Plummer (2003) Martyn Plummer. 2003. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Working Papers.
Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv preprint. ArXiv:2310.03693.
Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
Rickford and King (2016) John R. Rickford and Sharese King. 2016. Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular speakers) in the courtroom and beyond. Language, 92(4):948–988.
Rudin (2019) Cynthia Rudin. 2019. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv preprint. ArXiv:1811.10154 [cs, stat].
Samei et al. (2014) Borhan Samei, Andrew M. Olney, Sean Kelly, Martin Nystrand, Sidney D’Mello, Nathan Blanchard, Xiaoyi Sun, Marcy Glaus, and Art Graesser. 2014. Domain Independent Assessment of Dialogic Properties of Classroom Discourse. Technical report. Publication Title: Grantee Submission ERIC Number: ED566380.
Sanh et al. (2020) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint. ArXiv:1910.01108.
Saphier et al. (2008) Jon Saphier, Mary Ann Haley-Speca, and Robert Gower. 2008. The skillful teacher: building your teaching skills, 6th edition. Research for Better Teaching, Acton, Mass.
Schwartz et al. (2016) Daniel L. Schwartz, Jessica M. Tsang, and Kristen P. Blair. 2016. The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them, first edition. Norton books in education. W.W. Norton & Company, New York.
Shermis (2014) Mark D. Shermis. 2014. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20:53–76.
Shieh et al. (2024) Evan Shieh, Faye-Marie Vassel, Cassidy Sugimoto, and Thema Monroe-White. 2024. Laissez-Faire Harms: Algorithmic Biases in Generative Language Models. arXiv preprint. ArXiv:2404.07475.
Slavin (2002) Robert E. Slavin. 2002. Evidence-Based Education Policies: Transforming Educational Practice and Research. Educational Researcher, 31(7):15–21. Publisher: American Educational Research Association.
Song et al. (2020) Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. 2020. Learning Controllable Fair Representations. arXiv preprint. ArXiv:1812.04218 [cs, stat].
Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328. PMLR. ISSN: 2640-3498.
Suresh et al. (2022) Abhijit Suresh, Jennifer Jacobs, Charis Harty, Margaret Perkoff, James H. Martin, and Tamara Sumner. 2022. The TalkMoves dataset: K-12 mathematics lesson transcripts annotated for teacher and student discursive moves. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4654–4662, Marseille, France. European Language Resources Association.
Tack et al. (2023) Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. 2023. The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. arXiv preprint. ArXiv:2306.06941.
R Core Team. R: A Language and Environment for Statistical Computing.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint. ArXiv:2307.09288 [cs].
UpLevel (2024) UpLevel. 2024. Gen AI for Coding Research Report. Technical report, Uplevel Data Labs.
Vaccaro et al. (2024) Michelle Vaccaro, Abdullah Almaatouq, and Thomas Malone. 2024. When Are Combinations of Humans and AI Useful? arXiv preprint. ArXiv:2405.06087 [cs].
van der Lee et al. (2019) Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.
Wang et al. (2022) Jiarui Wang, Richong Zhang, Junfan Chen, Jaein Kim, and Yongyi Mao. 2022. Text style transferring via adversarial masking and styled filling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7654–7663, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Wang and Demszky (2023) Rose Wang and Dorottya Demszky. 2023. Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 626–667, Toronto, Canada. Association for Computational Linguistics.
Warr et al. (2024) Melissa Warr, Nicole Jakubczyk Oster, and Roger Isaac. 2024. Implicit bias in large language models: Experimental proof and implications for education. Journal of Research on Technology in Education, 0(0):1–24. Publisher: Routledge _eprint: https://doi.org/10.1080/15391523.2024.2395295.
Waseem (2016) Zeerak Waseem. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics.
Webson et al. (2023) Albert Webson, Alyssa Loo, Qinan Yu, and Ellie Pavlick. 2023. Are Language Models Worse than Humans at Following Prompts? It’s Complicated. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7662–7686, Singapore. Association for Computational Linguistics.
Webson and Pavlick (2022) Albert Webson and Ellie Pavlick. 2022. Do Prompt-Based Models Really Understand the Meaning of Their Prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States. Association for Computational Linguistics.
Whitehill and LoCasale-Crouch (2024) Jacob Whitehill and Jennifer LoCasale-Crouch. 2024. Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback. arXiv preprint. ArXiv:2310.01132 [cs].
Whitehurst et al. (2014) Grover J. Whitehurst, Matthew M. Chingos, and Katharine M. Lindquist. 2014. Evaluating Teachers with Classroom Observations: Lessons Learned in Four Districts. Technical report, Brookings Institution. Publication Title: Brookings Institution ERIC Number: ED553815.
Wind (2019) Stefanie A. Wind. 2019. Nonparametric Evidence of Validity, Reliability, and Fairness for Rater-Mediated Assessments: An Illustration Using Mokken Scale Analysis. Journal of Educational Measurement, 56(3):478–504. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/jedm.12222.
Wind and Guo (2019) Stefanie A. Wind and Wenjing Guo. 2019. Exploring the Combined Effects of Rater Misfit and Differential Rater Functioning in Performance Assessments. Educational and Psychological Measurement, 79(5):962–987. Publisher: SAGE Publications Inc.
Xu et al. (2024) Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, and Wei Ai. 2024. The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education. arXiv preprint. ArXiv:2404.02444.
Yang et al. (2020) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint. ArXiv:1906.08237.
Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325–333. PMLR. ISSN: 1938-7228.
Zhao and Ermon (2021) Shengjia Zhao and Stefano Ermon. 2021. Right Decisions from Wrong Predictions: A Mechanism Design Alternative to Individual Calibration. arXiv preprint. ArXiv:2011.07476 [cs, math, stat].
Zhou et al. (2024) Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty. arXiv preprint. ArXiv:2401.06730 [cs].
Zhou et al. (2023) Xiaofei Zhou, Christopher Kok, Rebecca M. Quintana, Anita Delahay, and Xu Wang. 2023. How Learning Experience Designers Make Design Decisions: The Role of Data, the Reliance on Subject Matter Expertise, and the Opportunities for Data-Driven Support. In Proceedings of the Tenth ACM Conference on Learning @ Scale, L@S ’23, pages 132–143, New York, NY, USA. Association for Computing Machinery. Event-place: Copenhagen, Denmark.
Zijlmans et al. (2018a) Eva A. O. Zijlmans, Jesper Tijmstra, L. Andries van der Ark, and Klaas Sijtsma. 2018a. Item-Score Reliability in Empirical-Data Sets and Its Relationship With Other Item Indices. Educational and Psychological Measurement, 78(6):998–1020. Publisher: SAGE Publications Inc.
Zijlmans et al. (2018b) Eva A. O. Zijlmans, L. Andries van der Ark, Jesper Tijmstra, and Klaas Sijtsma. 2018b. Methods for Estimating Item-Score Reliability. Applied Psychological Measurement, 42(7):553–570. Publisher: SAGE Publications Inc.

Appendix A NCTE Population Descriptive Statistics

	NCTE sample means
Teachers (N=309)
Female	0.85
African-American	0.22
Asian	0.03
Hispanic	0.03
White	0.65
Teaching Experience (Years)	10.59
Students (N=9,141)
Female	0.50
African-American	0.41
Asian	0.08
Hispanic	0.24
White	0.24
Free or Reduced Price Lunch	0.65
Special Education	0.11
English Language Learners	0.21
Prior Year State Math Test (Standardized)	0.08
Prior Year State ELA Test (Standardized)	0.07

Table 5: Teacher and student descriptive statistics.

Appendix B Observation Instrument Item Descriptions and Distributions

For each of the observation instruments, the abbreviation codes used in this study are listed with their expanded names in Table 6. The distributions of scores across all items for all rater families are in Figure 6. The CLASS rubric has 12 items on a scale from 1 to 7, rated at 15-minute intervals. The MQI rubric has 13 items on a scale from 1 to 3, rated at 7.5-minute intervals.

Abbreviation	Item	Item Description
MQI Instrument
ETCA	Enacted Task Cognitive Activation	Task cognitive demand, such as drawing connections among different representations, concepts, or solution methods; identifying and explaining patterns.
EXPL	Teacher Explanations	Teacher explanations that give meaning to ideas, procedures, steps, or solution methods.
LANGIMP†	Imprecision in Language or Notation	Imprecision in language or notation, with regard to mathematical symbols and technical or general mathematical language.
LCP†	Lack of Clarity in Presentation of Mathematical Content	Lack of clarity in teachers’ launching of tasks or presentation of the content.
LINK	Linking and Connections	Linking and connections of mathematical representations, ideas, and procedures.
MAJERR†	Major Mathematical Errors	Major mathematical errors, such as solving problems incorrectly, defining terms incorrectly, forgetting a key condition in a definition, equating two non-identical mathematical terms.
MGEN	Developing Mathematical Generalizations	Developing generalizations based on multiple examples.
MLANG	Mathematical Language	Mathematical language is dense and precise and is used fluently and consistently.
MMETH	Multiple Procedures or Solution Methods	Multiple procedures or solution methods for a single problem.
REMED	Remediation of Student Errors and Difficulties	Remediation of student errors and difficulties addressed in a substantive manner.
SMQR	Student Mathematical Questioning and Reasoning	Student mathematical questioning and reasoning, such as posing mathematically motivated questions, offering mathematical claims or counterclaims.
STEXPL	Students Provide Explanations	Student explanations that give meaning to ideas, procedures, steps, or solution methods.
USEPROD	Responding to Student Mathematical Productions	Responding to student mathematical productions in instruction, such as appropriately identifying mathematical insight in specific student questions, comments, or work; building instruction on student ideas or methods.
CLASS Instrument
CLPC	Classroom Positive Climate
CLNC†	Classroom Negative Climate
CLTS	Teacher Sensitivity
CLRSP	Regard for Student Perspective
CLBM	Behavior Management
CLPRDT	Productivity
CLILF	Instructional Learning Formats
CLCU	Content Understanding
CLAPS	Applied Problem Solving
CLQF	Quality of Feedback
CLINSTD	Instructional Dialogue
CLSTENG	Student Engagement

Table 6: CLASS and MQI item descriptions and corresponding abbreviations. † denotes items that are reverse coded because they are negatively worded with respect to the construct of teacher ability. Bolded items are those evaluated by the GPT family of raters and reported by Wang and Demszky (2023). Each member of the Human and Encoder families of raters evaluated all 25 items.
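The reverse coding marked with † in Table 6 can be sketched as follows. This is a minimal illustration, not the study's code: it assumes the usual reflection about the scale midpoint (lowest + highest - score), uses the scale bounds stated above (1 to 3 for MQI, 1 to 7 for CLASS), and the function and constant names are hypothetical.

```python
# Scale bounds as stated in Appendix B; names below are illustrative only.
SCALE_BOUNDS = {"MQI": (1, 3), "CLASS": (1, 7)}

# Items marked with a dagger in Table 6 (negatively worded).
REVERSED_ITEMS = {"LANGIMP", "LCP", "MAJERR", "CLNC"}


def reverse_code(score: int, instrument: str) -> int:
    """Reflect a score about the scale midpoint (assumed convention)."""
    low, high = SCALE_BOUNDS[instrument]
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {instrument} scale {low}-{high}")
    return low + high - score


def recode(item: str, score: int, instrument: str) -> int:
    """Reverse code negatively worded items; leave all other items unchanged."""
    if item in REVERSED_ITEMS:
        return reverse_code(score, instrument)
    return score
```

Under this convention the highest raw LCP score (most lack of clarity) becomes the lowest recoded score, so that higher values consistently indicate better instruction across all 25 items.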

Appendix C MQI Instrument

C.1 MQI Instrument Properties

For our purposes, the MQI instrument has a few distinctive properties that warrant further analysis, since some of its qualitative attributes may influence human raters.

The MQI ratings are written to identify the presence of a behavior and then, if present, report the magnitude or quality of its presence, doing so repeatedly at regular intervals throughout the lesson (in this case, 7.5 minutes). This shortened window with simpler targets provides an opportunity for training a model for real-time use (rather than an arbitrary interval) to find different features across a single lesson, as shown in Figure 15.

The version of the MQI for which data is available in the NCTE dataset is ternary, in contrast to the current MQI version, which is quaternary. The lowest rating on the ternary MQI scale is a combination of the lowest two ratings on the quaternary, meaning the present data cannot distinguish between whether the attribute described in each item is “Not present” or “Low”. (Footnote 16: There is one exception, which the original authors of the Appendix adjusted for: the USEPROD item is replaced by the MATCON item, with the correction of combining the lowest two categories.) This ternary classification scheme creates non-normal distributions as seen in Figure 6, which will need to inform models and methods during quantitative analysis.

This is unfortunate because the two collapsed categories are described as “None” and “Brief content error, instance of imprecision, lack of clarity. Does not obscure the mathematics of the segment,” respectively (for the Errors and Imprecision domain in Hill et al. and the second MQI-only factor in Blazar et al.: MAJERR, LANGIMP, LCP).

C.2 Possible Effects of Negative-worded Items

The MQI is unique in having a separate domain of items that try to capture aspects of poor mathematical instruction. Unlike most items in observation rubrics, three MQI items are worded in the negative direction: higher scores on the MAJERR, LANGIMP, and LCP items indicate worse performance. (Footnote 17: In the analyses of this paper, these will be reverse coded, as will the one negative CLASS item, CLNC.) It is possible that looking for negative attributes may make these items more susceptible to different rater biases. A partial description of the potential impact of this rubric attribute for the LCP item is given in the remainder of this section (Appendix C.2).

Of note, the LCP item is particularly subjective. The documentation and training provided for the MQI tell raters that they have to ask: “What, mathematically, was the teacher trying to say?” This is already problematic, as it asks observers to use their judgment to determine what the teacher was “trying to say.” The subjectivity increases further for observers who may not be as familiar with African-American Vernacular English (AAVE). The item further mixes lack of content clarity (lack of clarity in explaining math) with lack of directional clarity (unclear instructions for an activity, which is typically associated with items addressing classroom management), as stated in the MQI rubric:

Teacher’s launch of a task/activity lacks clarity (the “launch” is the teacher’s effort to get the mathematical tasks/activities into play). If the launch is problematic, score for the launch plus amount of time students are confused/off-task/engaging in non-productive explorations…[Example:] Garbling a task launch, e.g., by asking initially “How much TV is watched in the US?” when students really must draw a graph to show “How many TVs in US vs. Europe vs. rest of the world?”

Instructing observers to score based on the “amount of time students are confused/off-task/engaging in non-productive explorations” is more likely to capture problems with classroom management and directional lack of clarity than mathematical lack of clarity. This is compounded by the request for raters to guess what the teachers were trying to say and by training instructions that let raters “code Lack of Clarity even with correction”. This mix of observational cues and overlapping constructs makes this item particularly susceptible to individual rater biases. (Footnote 18: As a note, the skill of providing clear directions, foundational to establishing a well-managed classroom, is also not included in the CLASS instrument’s “Behavior Management” item, suggesting that neither of these instruments is perfectly designed to address root causes of instructional shortcomings and thus may be inadequate as tools for coaching and developing skills in teachers.)

Indeed, while not reported in this paper explicitly, we identified that one rater in particular rated Black teachers much more harshly on these items, especially on LCP, providing some evidence that some items can be more prone to rater biases, even with research-quality observers and calibration.

C.3 Prior work on Rater Fairness with MQI

Recent work has begun to look at rater biases, including racial bias, in these data and with the MQI instrument. Ji (2023) uses cross-classified mixed effects models for analysis and evaluation, seeking to answer similar questions by combining G-theory and IRT estimation (Briggs and Wilson, 2007). However, the helpfulness of this study is limited by its data selection decisions: it eliminates 23% of MQI items (all of the second MQI factor in Blazar et al. (2017)) without explanation; it uses only 21% of available classroom observations (from a single year), thereby also eliminating 43% of the study’s raters; it truncates class lengths to 45 minutes, removing another 20% of the remaining observations; and, when evaluating differences by teacher race, it combines all non-white races/ethnicities into a single category, removing meaningful inference from the contrast. These decisions to use only 13% of available data would lead to a model with better fit, as all of those removals simplify trends in the data, indirectly suggesting that the mixed effects model constructions used are not robust to the complete set of observations (Murphy and Beretvas, 2015) and are therefore inadequate for our purposes here.
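One way to arrive at the 13% figure, under the assumption (ours, for illustration) that the three reductions reported above compound multiplicatively across items, observations, and lesson length:

$$(1 - 0.23) \times 0.21 \times (1 - 0.20) = 0.77 \times 0.21 \times 0.80 \approx 0.129 \approx 13\%$$

Here 0.77 is the share of MQI items retained, 0.21 the share of classroom observations used, and 0.80 the share of those observations remaining after truncation to 45 minutes.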

Appendix D Encoder Family Construction

Pretraining and training/fine-tuning regimes can have significant effects on model performance (D’Amour et al., 2020), so our family of models sought to exploit this by using three different pretrainings for sentence-level embeddings and by varying training regimes (e.g., different checkpoints); these variations are summarized in Table 7. Thus, the models in the encoder family designed for this study share the same architecture (footnote 19: one model, “un2”, has a slightly different architecture, differing in the number of attention heads), training sets, and held-out test sets, differing only as outlined in Table 7.

[Another forthcoming paper to be under review] explores this protocol in greater depth, showing that the extreme training and treatment of data noise can achieve SOTA and "super-human" results on a variety of sentence embedding pretrainings, with a more complete set of training

Model	Pretrained Embedding	Layer Attn. Heads	Train Epochs	Dropout
un1	Unsupervised SimCSE (Gao, 2022)	32	3	75
un2	Unsupervised SimCSE (Gao, 2022)	16	4	75
un3	Unsupervised SimCSE (Gao, 2022)	32	8	75
e5	E5 (Wang et al., 2022)	32	2	15
gte	GTE (Li et al., 2023)	32	4	65

Table 7: Encoder Within-family differences: Summary of basic differences within the Encoder family of models. Detailed information about training and architecture can be found in Appendix D.3.

All results were run on a completely held out test set (Figure 8) of entire classroom transcripts. No analyses were conducted using the held-out test set until after all models in the model family were trained, thus preserving the integrity of the study.

GPT Model	Name	Prompt Info	Output
N	Numeric	Item Overview	Single Number
ND	Numeric w/ Description	Rubric Descriptions of Score Categories	Single Number
NR	Numeric after Reasoning	Item Overview and CoT instructions	Reasoning and Number

Table 8: GPT Within-family model differences: Details for the GPT/Decoder models can be found in the original paper (Wang and Demszky, 2023).

D.0.1 Encoder Model Preprocessing

As mentioned in Section 4, preprocessing of the transcript data was intentionally minimal, replacing bracketed transcription notes (e.g., [cross-talk]) with [inaudible]. For this study, the transcript was not annotated to denote whether a teacher or a student is speaking, reflecting the broadest future use case of general classroom microphones. In other words, this family of models does not know who is speaking. The results of this decision are evident in the models’ relative underperformance on the two MQI items that distinguish between teacher explanations (EXPL) and student explanations (STEXPL). This trend might also be evident in the validity demonstration in Figure 15, where models may be responding nearly identically to, or failing to distinguish between, these two items.

To align transcribed class segments to human observation ratings, transcripts were equipartitioned at the word level across the maximum number of lesson segments for which human annotations were available, and estimated timestamps were assigned to sentences by linear interpolation weighted by word count.
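The alignment step above can be sketched as follows. This is an illustrative reading rather than the study's implementation: it assumes whitespace tokenization, assigns each sentence to the segment containing its first word, and treats the lesson length as the number of rated segments times the segment duration. The function name and its parameters are hypothetical.

```python
from typing import List, Tuple


def align_sentences(
    sentences: List[str], n_segments: int, segment_minutes: float
) -> List[Tuple[int, float]]:
    """Return (segment index, estimated start minute) for each sentence.

    Words are spread evenly over the lesson, so a sentence's start time is
    the fraction of words preceding it times the assumed lesson length.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    word_counts = [len(s.split()) for s in sentences]
    total_words = sum(word_counts)
    if total_words == 0:
        raise ValueError("transcript contains no words")

    lesson_minutes = n_segments * segment_minutes
    aligned = []
    words_before = 0
    for count in word_counts:
        fraction = words_before / total_words  # share of lesson elapsed
        start_minute = fraction * lesson_minutes
        # Equal-sized word bins: the bin holding the sentence's first word.
        segment = min(int(fraction * n_segments), n_segments - 1)
        aligned.append((segment, start_minute))
        words_before += count
    return aligned
```

For example, four two-word sentences aligned to two 7.5-minute MQI segments would be placed two per segment, with estimated start times spaced 3.75 minutes apart.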

D.1 Sentence-level Embeddings

One key difference from other studies using these same transcripts is the choice to parse the utterances at the sentence level. Sentences, rather than individual words or long, uninterrupted utterances, are the key unit of meaning for interpretability of models for classroom discourse. Downstream tasks are a key consideration in this choice: sentence-level parsing anticipates meaningful feature attribution studies (Sundararajan et al., 2017) to further investigate construct validity.

Parsing at the sentence level both augments the total number of unique observations in the data and, by standardizing sequence lengths prior to sentence embedding, reduces the variation in the density of semantic information.

The model takes as input an approximately 12-minute rolling window of class text (stepping at each sentence) and simultaneously predicts, for that time window, the rounded rolling-average score for each of the 12 CLASS items and 13 MQI items. Each model is thus multi-task, predicting all 25 scores at once. This multi-task training takes advantage of the interrelated skills of teaching that may be implicit in human ratings. Over one million unique observations were generated from fewer than 1,600 unique classroom transcripts, with each rolling window representing one observation. Train/validation/test splits of these data were 75/15/10, stratified at the classroom level.
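A minimal sketch of how such observations and splits might be generated is given below. It rests on assumptions not stated in the text: each window ends at the current sentence and reaches back 12 minutes using estimated sentence start times, and "stratified at the classroom level" is read as assigning whole classrooms to a single split. All names are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple


def rolling_windows(
    start_minutes: Sequence[float], window_minutes: float = 12.0
) -> List[Tuple[int, int]]:
    """For each sentence, return the inclusive (first, last) sentence indices
    of the window that ends at that sentence and spans window_minutes."""
    windows = []
    first = 0
    for last, t in enumerate(start_minutes):
        # Drop sentences that started before the window opened.
        while start_minutes[first] < t - window_minutes:
            first += 1
        windows.append((first, last))
    return windows


def split_by_classroom(
    classroom_ids: Iterable[Hashable],
    fractions: Tuple[float, float, float] = (0.75, 0.15, 0.10),
    seed: int = 0,
) -> Dict[str, Set[Hashable]]:
    """Assign each classroom, with all of its windows, to exactly one split."""
    ids = sorted(set(classroom_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }
```

Because every sentence yields its own window, the number of observations grows with the number of sentences rather than the number of lessons, which is consistent with the large observation count reported above. Splitting by classroom keeps overlapping windows from the same lesson out of different splits.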

Classroom transcripts are extremely long, with thousands of sentences, and with classes having tokens in the hundreds of thousands. Sentence-level inputs could capture the relationship between something a teacher says and something a student says five minutes later without incurring large costs associated with sequence length. These long-range dependencies are needed to identify some of the instructional constructs being measured.

Raw class transcripts also have a lot of noise: content that is unrelated to any of the tasks, including fillers, self-corrections, interruptions and self-interruptions, sentences that are partially repeated or emphasized, text that requires being able to refer to a visual cue in the classroom, etc. While sentence level embeddings lose information relative to subword tokenizations, this loss of information may mitigate disproportionate effects of idiosyncratic speaking styles.

D.1.1 Embedding Model Selection

To save on compute, static embeddings were pre-computed. Because sentence embeddings decrease the completeness of the information captured, they must be used with care when representing the very noisy transcript data. We tested sentence-level embeddings from different pretrained embedding models, accessed through Hugging Face, on a subset of the training data for a small random selection of target measures:

• unsup-simcse-roberta-large: from princeton-nlp (Gao, 2022), pretrained using unsupervised contrastive sentence representations.
• sup-simcse-roberta-large: from princeton-nlp (Gao, 2022), pretrained using supervised training. At the writing of this paper, we did not yet have a converged model with reportable results.
• e5-large-v2: from intfloat (Wang et al., 2022), pretrained using weakly supervised contrastive sentence representations with sentence pair training.
• gte-large: from thenlper (Li et al., 2023), pretrained using multistage contrastive sentence representations.

The first three models had significantly reduced performance compared to our sentence embedding model of choice, SimCSE (Gao, 2022), which uses unsupervised contrastive learning to improve sentence-level representations.

D.2 Model Architecture

D.3 Encoder Model Training and Description

Models were built and trained in PyTorch, largely based on the available Encoder modules (https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html). Each model was trained on a single L4 GPU in Google Colab, and each epoch took about 4.25 hours. The main architectural settings were:

• 8 transformer encoder layers
• 25 total classifier heads (with a single dense layer each), one for each task. With two objective functions per head, this results in 50 total loss calculations being backpropagated.
• All encoder layer parameters are shared across objectives, but the trainable parameters of the single-dense-layer classification heads are specific to each item.
• Attention heads: 32. A large amount of semantic information needed to be extracted from within each embedding and its neighbors, which supported increasing the number of self-attention heads.
• Hidden dimension: 2048

D.3.1 Preventing Overfit within the Model

An abnormally high dropout rate of 0.75 was the primary regularization technique used to avoid overfitting to a noisy, repetitively augmented dataset with non-gold labels. The remaining training settings were:

• Optimizer: Adamax. Defined in the original Adam paper by Kingma and Ba (2017), this variant of Adam replaces the L2 norm of the gradients with the L-infinity norm, which provides stability under the sparse gradients resulting from the dropout. Additionally, its two momentum coefficients (betas) were reduced slightly to 0.78 and 0.9, respectively, and the weight decay was increased to 0.0003, to prevent overfitting at the cost of increased training time.
• Learning rate: the initial learning rate was set to 2.5e-5, within the learning rate schedule described below.
• Gradient clipping: set to 4 (instead of the typical 1). We did not want an unusual batch to produce exploding gradients, but recognized the need to capture as much information as possible from the optimizer, given that dropout was the primary regularization accounting for the high level of repetition in the augmented transcript windows.
• Learning rate schedule: Using chaining, the schedule began with a 1,000-step linear warmup from zero, followed by exponential decay (gamma = 0.9995) multiplied with CosineAnnealingWarmRestarts from PyTorch (https://pytorch.org/), with annealing cycles cutting frequency by a third each time. We have initial data to suggest that using a cyclic learning rate improves model performance, but did not sufficiently ablate this additional level of complexity to claim whether the models would still learn effectively without it.
• Loss functions: In addition to cross-entropy loss, we use a custom loss function implementing quadratic weighted kappa loss with fuzzy labels/label smoothing set at 0.2, to increase noise around the unreliable human ratings.
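The shape of the learning rate schedule described in the list above can be sketched without any deep learning library, as a product of three factors: linear warmup, exponential decay, and cosine annealing with warm restarts. The first cycle length `t0`, the cycle multiplier `t_mult`, and the choice to apply the decay per optimizer step are hypothetical values chosen only for illustration; the paper does not report them, and this is not the training code.

```python
import math


def lr_multiplier(step, warmup_steps=1000, gamma=0.9995, t0=2000, t_mult=2):
    """Multiplicative factor applied to the base learning rate at a given step."""
    warmup = min(1.0, step / warmup_steps)  # linear ramp from zero
    decay = gamma ** step                   # exponential decay
    # Find the position inside the current cosine cycle; each restart
    # lengthens the next cycle by a factor of t_mult (assumed value).
    t_cur, t_i = step, t0
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / t_i))
    return warmup * decay * cosine


def learning_rate(step, base_lr=2.5e-5, **schedule_kwargs):
    """Learning rate at a step, using the initial rate reported above."""
    return base_lr * lr_multiplier(step, **schedule_kwargs)
```

Under these assumed settings the rate starts at zero, rises during warmup, falls toward zero at the end of each cosine cycle, and jumps back up at each restart to a peak that the exponential decay makes progressively lower.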

D.4 Encoder Model Test Set

The distributions for the held-out test set for the Encoder models, compared to the training/development data, can be found in Figure 8.

Appendix E GPT Model Family

E.1 Model construction

Detailed descriptions of the three models and the data generated by them can be found in the original paper and accompanying website of Wang and Demszky, which include examples of how the three models differ. The automated rating data was retrieved from https://github.com/rosewang2008/zero-shot-teacher-feedback/tree/main. A brief summary of those differences can be found in Table 8.

E.1.1 GPT Model Preprocessing

In contrast to the Encoder model preprocessing, a preliminary analysis was conducted by Wang and Demszky to identify the highest-quality 7.5-minute segments available in the dataset, defined as those with the fewest transcriber notes. The models are provided the discourse from these selections, along with information about the subset of items for which they provide ratings, including four items from the MQI (EXPL, LANGIMP, REMED, SMQR).

Appendix F Reliability Metrics

ICC calculations were reproduced using the following multilevel model, where lesson $l$ scores for each rubric item are nested within teachers $k$:

$$\text{ITEM}_{lk}=\beta_{0}+\mu_{k}+\varepsilon_{lk}, \tag{11}$$

from which the ICC and Adjusted ICC are calculated as

$$ICC=\frac{\operatorname{var}\left(\mu_{k}\right)}{\operatorname{var}\left(\mu_{k}\right)+\frac{\operatorname{var}\left(\varepsilon_{lk}\right)}{n_{l}}}, \tag{12}$$

where $n_{l}=1$ for ICC and where $n_{l}=6$ for Adjusted ICC following the original study. Full results of human baselines and comparisons against the various models can be found in Appendix F.1.
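Equation 12 can be checked numerically with a short sketch. The function name and the normalized variance components in the example (0.15 and 0.85) are illustrative assumptions, not estimates from the fitted multilevel model.

```python
def icc(var_teacher, var_residual, n_lessons=1):
    """Equation 12: teacher variance over teacher variance plus the
    residual variance averaged across n_lessons lessons."""
    return var_teacher / (var_teacher + var_residual / n_lessons)


# Illustrative components that sum to 1, so the unadjusted ICC is 0.15.
unadjusted = icc(0.15, 0.85, n_lessons=1)
adjusted = icc(0.15, 0.85, n_lessons=6)
```

With these components the adjusted value is about 0.51, which agrees, up to rounding, with the pairing of ICC = 0.15 and AdjICC = 0.51 reported for the human LINK ratings in Table 9.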

F.1 Full Results

Table 9 contains the full results referenced in Section 5.1. The metric symbols in the table are as follows: C's $\kappa$: Cohen's $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; Agr±1: percent agreement within 1 category; ICC and AdjICC: intraclass correlation and adjusted intraclass correlation (with nested staging in Eq. 12); $r$: Pearson's correlation; $\rho$: Spearman's rank correlation; $\tau$: Kendall's rank correlation. *.low and *.hi are the lower and upper bounds of the 95% confidence interval ($\alpha=0.05$), respectively. These results and full results for CLASS items can be found online (https://github.com/hardy-education/LLM-Psychometrics).
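For readers less familiar with the first two agreement metrics, the following is a minimal sketch of Cohen's $\kappa$ and Quadratic Weighted Kappa computed from two raters' scores. It uses the standard textbook definitions with categories coded as integers from 0; the function name is hypothetical and this is not the code used to produce Table 9.

```python
def weighted_kappa(rater_a, rater_b, n_categories, weight="quadratic"):
    """Cohen's kappa (weight="none") or quadratic weighted kappa.

    Scores must be integers in 0..n_categories-1.
    """
    n = len(rater_a)
    observed = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_categories))
                  for j in range(n_categories)]
    disagreement, chance = 0.0, 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            if weight == "quadratic":
                w = (i - j) ** 2 / (n_categories - 1) ** 2
            else:
                w = 0.0 if i == j else 1.0
            disagreement += w * observed[i][j]
            chance += w * row_totals[i] * col_totals[j] / n
    if chance == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1.0 - disagreement / chance
```

The quadratic weights penalize a two-category disagreement four times as heavily as an adjacent-category disagreement, which is why QWK and C's $\kappa$ can diverge for the same pair of raters.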

Table 9: Full Agreement Metrics

Instrument	Item	Metric	Human	Encoders	un1	un2	un3	gte	e5	GPTs	N	ND	NR
MQI	LINK	C’s $\kappa$	0.31	0.39	0.41	0.33	0.44	0.39	0.39
MQI	LINK	QWK	0.41	0.58	0.6	0.55	0.62	0.56	0.56
MQI	LINK	%Agr	0.7	0.73	0.74	0.71	0.75	0.71	0.71
MQI	LINK	Agr±1	0.97	0.98	0.97	0.98	0.98	0.98	0.98
MQI	LINK	$r$	0.41	0.58	0.61	0.56	0.63	0.56	0.56
MQI	LINK	$r$ .low	0.39	0.57	0.57	0.51	0.59	0.52	0.52
MQI	LINK	$r$ .hi	0.42	0.6	0.64	0.6	0.66	0.6	0.6
MQI	LINK	$\rho$	0.41	0.57	0.6	0.53	0.61	0.54	0.54
MQI	LINK	$\rho$ .low	0.4	0.55	0.56	0.48	0.58	0.5	0.5
MQI	LINK	$\rho$ .hi	0.43	0.58	0.64	0.58	0.65	0.58	0.58
MQI	LINK	$\tau$	0.4	0.54	0.57	0.51	0.59	0.51	0.51
MQI	LINK	$\tau$ .low	0.38	0.52	0.53	0.46	0.55	0.47	0.47
MQI	LINK	$\tau$ .hi	0.41	0.56	0.61	0.56	0.62	0.56	0.56
MQI	LINK	ICC	0.15	0.14	0.14	0.14	0.14	0.14	0.14
MQI	LINK	AdjICC	0.51	0.5	0.5	0.5	0.5	0.5	0.5
MQI	EXPL	C’s $\kappa$	0.23	0.25	0.25	0.28	0.23	0.24	0.24	0.03	0.01	0.07	0.01
MQI	EXPL	QWK	0.28	0.43	0.46	0.42	0.44	0.4	0.4	0.01	0.01	0.06	-0.01
MQI	EXPL	%Agr	0.7	0.72	0.72	0.69	0.72	0.72	0.72	0.31	0.31	0.42	0.15
MQI	EXPL	Agr±1	0.98	0.97	0.97	0.97	0.96	0.97	0.97	0.86	0.95	0.9	0.67
MQI	EXPL	$r$	0.28	0.44	0.48	0.43	0.47	0.41	0.41	0.03	0.03	0.09	-0.03
MQI	EXPL	$r$ .low	0.26	0.42	0.44	0.37	0.42	0.36	0.36	-0.03	-0.07	-0.01	-0.14
MQI	EXPL	$r$ .hi	0.29	0.46	0.52	0.48	0.51	0.46	0.46	0.08	0.13	0.19	0.09
MQI	EXPL	$\rho$	0.27	0.42	0.46	0.41	0.46	0.39	0.39	0.03	0.03	0.1	-0.03
MQI	EXPL	$\rho$ .low	0.25	0.4	0.42	0.35	0.41	0.34	0.34	-0.03	-0.07	0	-0.14
MQI	EXPL	$\rho$ .hi	0.29	0.44	0.51	0.46	0.5	0.43	0.43	0.09	0.13	0.19	0.08
MQI	EXPL	$\tau$	0.26	0.41	0.45	0.39	0.44	0.38	0.38	0.03	0.03	0.09	-0.03
MQI	EXPL	$\tau$ .low	0.25	0.39	0.4	0.33	0.4	0.32	0.32	-0.03	-0.07	-0.01	-0.14
MQI	EXPL	$\tau$ .hi	0.28	0.43	0.49	0.45	0.49	0.42	0.42	0.08	0.12	0.19	0.08
MQI	EXPL	ICC	0.17	0.17	0.17	0.17	0.17	0.17	0.17	0.17	0.17	0.17	0.17
MQI	EXPL	AdjICC	0.55	0.56	0.56	0.56	0.56	0.56	0.56	0.55	0.55	0.55	0.55
MQI	MMETH	C’s $\kappa$	0.42	0.33	0.46	0.39	0.33	0.27	0.27
MQI	MMETH	QWK	0.47	0.49	0.48	0.53	0.54	0.46	0.46
MQI	MMETH	%Agr	0.85	0.82	0.88	0.86	0.84	0.78	0.78
MQI	MMETH	Agr±1	0.99	0.98	0.99	0.98	0.98	0.97	0.97
MQI	MMETH	$r$	0.47	0.52	0.51	0.58	0.57	0.51	0.51
MQI	MMETH	$r$ .low	0.46	0.5	0.46	0.53	0.53	0.47	0.47
MQI	MMETH	$r$ .hi	0.49	0.54	0.55	0.62	0.61	0.56	0.56
MQI	MMETH	$\rho$	0.47	0.5	0.51	0.57	0.57	0.48	0.48
MQI	MMETH	$\rho$ .low	0.45	0.49	0.47	0.52	0.53	0.43	0.43
MQI	MMETH	$\rho$ .hi	0.48	0.52	0.55	0.61	0.61	0.52	0.52
MQI	MMETH	$\tau$	0.46	0.49	0.51	0.56	0.56	0.46	0.46
MQI	MMETH	$\tau$ .low	0.45	0.47	0.46	0.52	0.52	0.42	0.42
MQI	MMETH	$\tau$ .hi	0.48	0.51	0.55	0.61	0.6	0.51	0.51
MQI	MMETH	ICC	0.15	0.18	0.18	0.18	0.18	0.18	0.18
MQI	MMETH	AdjICC	0.52	0.57	0.57	0.57	0.57	0.57	0.57
MQI	MGEN	C’s $\kappa$	0.15	0.26	0.27	0.32	0.27	0.24	0.24
MQI	MGEN	QWK	0.19	0.34	0.34	0.48	0.34	0.29	0.29
MQI	MGEN	%Agr	0.95	0.95	0.96	0.96	0.96	0.94	0.94
MQI	MGEN	Agr±1	0.99	1	1	1	1	0.99	0.99
MQI	MGEN	$r$	0.19	0.34	0.37	0.48	0.37	0.29	0.29
MQI	MGEN	$r$ .low	0.18	0.32	0.32	0.43	0.32	0.24	0.24
MQI	MGEN	$r$ .hi	0.21	0.37	0.42	0.53	0.42	0.34	0.34
MQI	MGEN	$\rho$	0.19	0.32	0.34	0.42	0.34	0.28	0.28
MQI	MGEN	$\rho$ .low	0.17	0.29	0.29	0.37	0.29	0.22	0.22
MQI	MGEN	$\rho$ .hi	0.2	0.34	0.39	0.48	0.39	0.33	0.33
MQI	MGEN	$\tau$	0.18	0.32	0.34	0.42	0.34	0.27	0.27
MQI	MGEN	$\tau$ .low	0.17	0.29	0.29	0.37	0.29	0.22	0.22
MQI	MGEN	$\tau$ .hi	0.2	0.34	0.39	0.48	0.39	0.33	0.33
MQI	MGEN	ICC	0.04	0.03	0.03	0.03	0.03	0.03	0.03
MQI	MGEN	AdjICC	0.19	0.16	0.16	0.16	0.16	0.16	0.16
MQI	MLANG	C’s $\kappa$	0.23	0.37	0.4	0.44	0.43	0.31	0.31
MQI	MLANG	QWK	0.33	0.48	0.49	0.55	0.51	0.44	0.44
MQI	MLANG	%Agr	0.59	0.65	0.68	0.69	0.7	0.6	0.6
MQI	MLANG	Agr±1	0.98	0.98	0.99	0.99	0.99	0.98	0.98
MQI	MLANG	$r$	0.33	0.48	0.5	0.55	0.52	0.46	0.46
MQI	MLANG	$r$ .low	0.32	0.46	0.45	0.5	0.47	0.41	0.41
MQI	MLANG	$r$ .hi	0.35	0.5	0.54	0.59	0.56	0.5	0.5
MQI	MLANG	$\rho$	0.32	0.47	0.48	0.54	0.5	0.43	0.43
MQI	MLANG	$\rho$ .low	0.31	0.45	0.43	0.49	0.46	0.39	0.39
MQI	MLANG	$\rho$ .hi	0.34	0.49	0.52	0.59	0.55	0.48	0.48
MQI	MLANG	$\tau$	0.31	0.45	0.46	0.52	0.49	0.41	0.41
MQI	MLANG	$\tau$ .low	0.29	0.43	0.42	0.47	0.45	0.36	0.36
MQI	MLANG	$\tau$ .hi	0.33	0.47	0.51	0.57	0.53	0.46	0.46
MQI	MLANG	ICC	0.08	0.09	0.09	0.09	0.09	0.09	0.09
MQI	MLANG	AdjICC	0.34	0.36	0.36	0.36	0.36	0.36	0.36
MQI	REMED	C’s $\kappa$	0.27	0.3	0.27	0.34	0.35	0.27	0.27	-0.01	-0.01	0	0
MQI	REMED	QWK	0.32	0.44	0.44	0.52	0.42	0.42	0.42	0.02	0	0.06	0.02
MQI	REMED	%Agr	0.66	0.69	0.68	0.68	0.74	0.67	0.67	0.16	0.1	0.27	0.08
MQI	REMED	Agr±1	0.96	0.96	0.94	0.96	0.99	0.96	0.96	0.62	0.54	0.81	0.48
MQI	REMED	$r$	0.32	0.44	0.46	0.52	0.45	0.42	0.42	0.05	-0.01	0.11	0.11
MQI	REMED	$r$ .low	0.31	0.42	0.41	0.47	0.4	0.37	0.37	0	-0.11	0.01	-0.01
MQI	REMED	$r$ .hi	0.34	0.47	0.5	0.57	0.49	0.47	0.47	0.11	0.09	0.21	0.22
MQI	REMED	$\rho$	0.32	0.42	0.44	0.49	0.44	0.38	0.38	0.06	0	0.12	0.09
MQI	REMED	$\rho$ .low	0.31	0.4	0.39	0.43	0.4	0.33	0.33	0	-0.1	0.02	-0.02
MQI	REMED	$\rho$ .hi	0.34	0.44	0.48	0.54	0.49	0.43	0.43	0.12	0.1	0.22	0.2
MQI	REMED	$\tau$	0.31	0.4	0.42	0.46	0.43	0.37	0.37	0.06	0	0.11	0.09
MQI	REMED	$\tau$ .low	0.3	0.38	0.37	0.41	0.39	0.32	0.32	0	-0.1	0.02	-0.02
MQI	REMED	$\tau$ .hi	0.33	0.42	0.46	0.51	0.48	0.41	0.41	0.11	0.1	0.21	0.2
MQI	REMED	ICC	0.16	0.17	0.17	0.17	0.17	0.17	0.17	0.14	0.14	0.14	0.14
MQI	REMED	AdjICC	0.53	0.55	0.55	0.55	0.55	0.55	0.55	0.5	0.5	0.5	0.5
MQI	USEPROD	C’s $\kappa$	0.25	0.3	0.28	0.32	0.3	0.31	0.31
MQI	USEPROD	QWK	0.33	0.46	0.44	0.5	0.46	0.46	0.46
MQI	USEPROD	%Agr	0.76	0.75	0.74	0.8	0.75	0.74	0.74
MQI	USEPROD	Agr±1	0.98	0.95	0.93	0.97	0.94	0.95	0.95
MQI	USEPROD	$r$	0.33	0.49	0.48	0.5	0.5	0.49	0.49
MQI	USEPROD	$r$ .low	0.32	0.47	0.43	0.45	0.46	0.45	0.45
MQI	USEPROD	$r$ .hi	0.35	0.51	0.52	0.55	0.54	0.53	0.53
MQI	USEPROD	$\rho$	0.31	0.47	0.47	0.45	0.49	0.46	0.46
MQI	USEPROD	$\rho$ .low	0.29	0.45	0.42	0.39	0.45	0.42	0.42
MQI	USEPROD	$\rho$ .hi	0.32	0.49	0.51	0.5	0.53	0.51	0.51
MQI	USEPROD	$\tau$	0.3	0.45	0.45	0.44	0.48	0.45	0.45
MQI	USEPROD	$\tau$ .low	0.29	0.43	0.41	0.38	0.43	0.4	0.4
MQI	USEPROD	$\tau$ .hi	0.32	0.47	0.5	0.49	0.52	0.49	0.49
MQI	USEPROD	ICC	0.24	0.23	0.23	0.23	0.23	0.23	0.23
MQI	USEPROD	AdjICC	0.65	0.64	0.64	0.64	0.64	0.64	0.64
MQI	MAJERR	C’s $\kappa$	0.24	0.22	0.27	0.26	0.22	0.19	0.19
MQI	MAJERR	QWK	0.28	0.35	0.35	0.45	0.43	0.29	0.29
MQI	MAJERR	%Agr	0.91	0.9	0.92	0.91	0.92	0.87	0.87
MQI	MAJERR	Agr±1	0.99	0.99	1	0.99	0.99	0.98	0.98
MQI	MAJERR	$r$	0.28	0.36	0.38	0.45	0.44	0.31	0.31
MQI	MAJERR	$r$ .low	0.26	0.34	0.33	0.4	0.39	0.26	0.26
MQI	MAJERR	$r$ .hi	0.29	0.38	0.43	0.5	0.49	0.36	0.36
MQI	MAJERR	$\rho$	0.28	0.31	0.34	0.43	0.38	0.27	0.27
MQI	MAJERR	$\rho$ .low	0.26	0.29	0.28	0.37	0.33	0.21	0.21
MQI	MAJERR	$\rho$ .hi	0.29	0.33	0.39	0.48	0.43	0.32	0.32
MQI	MAJERR	$\tau$	0.28	0.31	0.33	0.42	0.37	0.26	0.26
MQI	MAJERR	$\tau$ .low	0.26	0.28	0.28	0.36	0.32	0.21	0.21
MQI	MAJERR	$\tau$ .hi	0.29	0.33	0.38	0.47	0.42	0.32	0.32
MQI	MAJERR	ICC	0.1	0.06	0.06	0.06	0.06	0.06	0.06
MQI	MAJERR	AdjICC	0.39	0.29	0.29	0.29	0.29	0.29	0.29
MQI	LANGIMP	C’s $\kappa$	0.25	0.2	0.32	0.21	0.21	0.15	0.15	0	0	-0.03	0.03
MQI	LANGIMP	QWK	0.29	0.34	0.36	0.43	0.39	0.29	0.29	-0.01	-0.01	-0.05	0.03
MQI	LANGIMP	%Agr	0.8	0.8	0.86	0.81	0.83	0.75	0.75	0.32	0.25	0.38	0.33
MQI	LANGIMP	Agr±1	0.99	0.98	1	0.99	0.98	0.97	0.97	0.98	0.97	0.98	0.99
MQI	LANGIMP	$r$	0.29	0.35	0.4	0.44	0.4	0.31	0.31	-0.02	-0.02	-0.08	0.06
MQI	LANGIMP	$r$ .low	0.27	0.33	0.35	0.38	0.35	0.26	0.26	-0.08	-0.12	-0.17	-0.05
MQI	LANGIMP	$r$ .hi	0.3	0.37	0.45	0.49	0.45	0.36	0.36	0.04	0.07	0.02	0.17
MQI	LANGIMP	$\rho$	0.28	0.31	0.38	0.4	0.37	0.26	0.26	-0.02	-0.03	-0.08	0.05
MQI	LANGIMP	$\rho$ .low	0.26	0.29	0.33	0.34	0.32	0.21	0.21	-0.08	-0.13	-0.17	-0.06
MQI	LANGIMP	$\rho$ .hi	0.29	0.34	0.43	0.45	0.42	0.32	0.32	0.03	0.07	0.02	0.17
MQI	LANGIMP	$\tau$	0.28	0.31	0.38	0.39	0.37	0.26	0.26	-0.02	-0.03	-0.07	0.05
MQI	LANGIMP	$\tau$ .low	0.26	0.28	0.33	0.33	0.31	0.2	0.2	-0.08	-0.13	-0.17	-0.06
MQI	LANGIMP	$\tau$ .hi	0.29	0.33	0.43	0.45	0.41	0.31	0.31	0.03	0.07	0.02	0.16
MQI	LANGIMP	ICC	0.12	0.13	0.13	0.13	0.13	0.13	0.13	0.12	0.12	0.12	0.12
MQI	LANGIMP	AdjICC	0.44	0.47	0.47	0.47	0.47	0.47	0.47	0.45	0.45	0.45	0.45
MQI	LCP	C’s $\kappa$	0.18	0.2	0.26	0.25	0.18	0.17	0.17
MQI	LCP	QWK	0.23	0.32	0.32	0.44	0.36	0.25	0.25
MQI	LCP	%Agr	0.86	0.86	0.89	0.87	0.89	0.83	0.83
MQI	LCP	Agr±1	0.99	0.98	0.99	0.98	0.98	0.98	0.98
MQI	LCP	$r$	0.23	0.32	0.36	0.45	0.37	0.25	0.25
MQI	LCP	$r$ .low	0.22	0.3	0.31	0.39	0.32	0.2	0.2
MQI	LCP	$r$ .hi	0.25	0.34	0.41	0.5	0.42	0.31	0.31
MQI	LCP	$\rho$	0.22	0.28	0.33	0.41	0.34	0.21	0.21
MQI	LCP	$\rho$ .low	0.2	0.25	0.28	0.35	0.29	0.15	0.15
MQI	LCP	$\rho$ .hi	0.23	0.3	0.38	0.46	0.39	0.26	0.26
MQI	LCP	$\tau$	0.21	0.27	0.33	0.41	0.34	0.21	0.21
MQI	LCP	$\tau$ .low	0.2	0.25	0.27	0.35	0.29	0.15	0.15
MQI	LCP	$\tau$ .hi	0.23	0.3	0.38	0.46	0.39	0.26	0.26
MQI	LCP	ICC	0.14	0.15	0.15	0.15	0.15	0.15	0.15
MQI	LCP	AdjICC	0.5	0.51	0.51	0.51	0.51	0.51	0.51
MQI	STEXPL	C’s $\kappa$	0.36	0.29	0.26	0.3	0.26	0.31	0.31
MQI	STEXPL	QWK	0.4	0.45	0.45	0.45	0.48	0.45	0.45
MQI	STEXPL	%Agr	0.8	0.77	0.76	0.79	0.77	0.77	0.77
MQI	STEXPL	Agr±1	0.99	0.97	0.97	0.97	0.98	0.97	0.97
MQI	STEXPL	$r$	0.4	0.48	0.48	0.46	0.51	0.48	0.48
MQI	STEXPL	$r$ .low	0.38	0.46	0.44	0.4	0.47	0.43	0.43
MQI	STEXPL	$r$ .hi	0.41	0.5	0.53	0.51	0.56	0.52	0.52
MQI	STEXPL	$\rho$	0.39	0.47	0.47	0.46	0.5	0.47	0.47
MQI	STEXPL	$\rho$ .low	0.38	0.45	0.42	0.4	0.46	0.43	0.43
MQI	STEXPL	$\rho$ .hi	0.41	0.49	0.51	0.51	0.54	0.52	0.52
MQI	STEXPL	$\tau$	0.39	0.46	0.46	0.45	0.49	0.46	0.46
MQI	STEXPL	$\tau$ .low	0.37	0.44	0.41	0.39	0.45	0.41	0.41
MQI	STEXPL	$\tau$ .hi	0.4	0.48	0.5	0.5	0.53	0.51	0.51
MQI	STEXPL	ICC	0.3	0.27	0.27	0.27	0.27	0.27	0.27
MQI	STEXPL	AdjICC	0.72	0.69	0.69	0.69	0.69	0.69	0.69
MQI	SMQR	C’s $\kappa$	0.25	0.3	0.23	0.29	0.35	0.32	0.32	0.07	0.1	0.09	0
MQI	SMQR	QWK	0.3	0.41	0.45	0.41	0.41	0.37	0.37	0.08	0.09	0.07	0.06
MQI	SMQR	%Agr	0.76	0.76	0.75	0.77	0.78	0.75	0.75	0.4	0.42	0.48	0.25
MQI	SMQR	Agr±1	0.98	0.99	0.97	0.97	0.99	0.99	0.99	0.9	0.91	0.88	0.93
MQI	SMQR	$r$	0.3	0.41	0.46	0.41	0.43	0.38	0.38	0.13	0.16	0.11	0.13
MQI	SMQR	$r$ .low	0.29	0.39	0.42	0.36	0.38	0.33	0.33	0.07	0.06	0.01	0.02
MQI	SMQR	$r$ .hi	0.32	0.43	0.51	0.47	0.47	0.43	0.43	0.19	0.25	0.2	0.24
MQI	SMQR	$\rho$	0.29	0.39	0.41	0.4	0.42	0.37	0.37	0.12	0.16	0.11	0.12
MQI	SMQR	$\rho$ .low	0.28	0.37	0.36	0.34	0.37	0.32	0.32	0.06	0.06	0.01	0.01
MQI	SMQR	$\rho$ .hi	0.31	0.42	0.46	0.46	0.46	0.42	0.42	0.18	0.25	0.2	0.23
MQI	SMQR	$\tau$	0.29	0.38	0.4	0.39	0.41	0.37	0.37	0.12	0.15	0.1	0.11
MQI	SMQR	$\tau$ .low	0.27	0.36	0.35	0.33	0.36	0.32	0.32	0.06	0.05	0	0
MQI	SMQR	$\tau$ .hi	0.3	0.41	0.45	0.45	0.46	0.42	0.42	0.17	0.24	0.2	0.23
MQI	SMQR	ICC	0.19	0.19	0.19	0.19	0.19	0.19	0.19	0.19	0.19	0.19	0.19
MQI	SMQR	AdjICC	0.59	0.59	0.59	0.59	0.59	0.59	0.59	0.59	0.59	0.59	0.59
MQI	ETCA	C’s $\kappa$	0.24	0.3	0.28	0.39	0.27	0.31	0.31
MQI	ETCA	QWK	0.32	0.5	0.5	0.55	0.51	0.48	0.48
MQI	ETCA	%Agr	0.67	0.68	0.66	0.74	0.65	0.69	0.69
MQI	ETCA	Agr±1	0.98	0.97	0.96	0.98	0.96	0.98	0.98
MQI	ETCA	$r$	0.32	0.52	0.52	0.56	0.55	0.48	0.48
MQI	ETCA	$r$ .low	0.3	0.5	0.48	0.51	0.51	0.44	0.44
MQI	ETCA	$r$ .hi	0.33	0.54	0.56	0.6	0.59	0.53	0.53
MQI	ETCA	$\rho$	0.3	0.5	0.51	0.54	0.55	0.46	0.46
MQI	ETCA	$\rho$ .low	0.28	0.48	0.47	0.49	0.51	0.42	0.42
MQI	ETCA	$\rho$ .hi	0.31	0.52	0.55	0.59	0.59	0.51	0.51
MQI	ETCA	$\tau$	0.29	0.48	0.49	0.52	0.53	0.44	0.44
MQI	ETCA	$\tau$ .low	0.27	0.46	0.45	0.47	0.48	0.4	0.4
MQI	ETCA	$\tau$ .hi	0.31	0.5	0.54	0.57	0.57	0.49	0.49
MQI	ETCA	ICC	0.21	0.22	0.22	0.22	0.22	0.22	0.22
MQI	ETCA	AdjICC	0.61	0.63	0.63	0.63	0.63	0.63	0.63


Appendix G Disentangling Bias and Measuring Fairness

Conducting a full fairness analysis across both CLASS and MQI items and raters is considerably more complicated when accounting for all four construct dimensions in Blazar et al. (2017). If only MQI items are modeled, as was the case in the plots of Figure 4, the model can be simplified to two dimensions. Full item-level MQI results for the models disentangling biases in Section 5.4 are in Figure 10. The item-level results for the corresponding racial bias difference models from Section 5.5 are in Figure 11. JAGS code for MCMC in R is available online (https://github.com/hardy-education/LLM-Psychometrics). A structural plate diagram for the model in Section 5.5 is in Figure 9.

Code Listing G gives JAGS code for a full model representing Section 5.5, including code for the additional estimation of CLASS items and simultaneous estimation of human and model parameters, as seen in Figure 9. To reduce the total length of code, the listing encapsulates all code for the various MCMC estimations used in this paper. For the creation of Panels (d) and (e) of Figure 4 and Figures 10 and 11, model parameters were estimated after human raters and teacher parameters were estimated and only using MQI items (i.e., xi[i,j] is held as fixed when estimating parameters for Encoders and GPTs). The listing also includes an additional hierarchical structure in latent abilities to allow for estimation of ideal scores at the lesson observation level, $\xi_{oij}$, so that teacher latent abilities, $\theta_{oi}$, can vary across lessons during the year and jointly be informed by the teacher's true year-level latent abilities, $\Theta_{i}$. This updates the top latent ability estimation in Equation 5 to the following:

$$\text{HRM}\begin{cases}\boldsymbol{\theta}_{oi}\sim\text{MVN}(\boldsymbol{\Theta}_{M\times 1},\mathbf{I}_{M\times M})\text{; }\Theta_{im}\sim\mathcal{N}(0,1)\text{,}\\ \xi_{oij}\sim\text{IRT model}\\ X_{soijr}\sim\text{SDT model}\end{cases} \tag{13}$$

```r
model {
  ## Signal detection theory model with rater covariates
  for (i in 1:NN) {
    x[i] ~ dcat(prob.sdt[i, ])
    for (k in 1:K) {
      d[i, k] <- k - xi[subject[i], item[i]] - rhocov.r[rater[i], item[i], race[i]]
      z[i, k] <- exp(-d[i, k] * d[i, k]/2 * exp(zeta.r[rater[i], item[i], race[i]]))
      prob.sdt[i, k] <- ifelse((K - maxscore.by.item[item[i]]),
        ifelse(k < (maxscore.by.item[item[i]] + 1), z[i, k]/sum(z[i, ]), 0.00000E+00),
        z[i, k]/sum(z[i, ]))
    }
  }

  ## Multidimensional Generalized Partial Credit Model
  for (i in 1:N) {
    for (j in 1:J) {
      xi[i, j] ~ dcat(prob.irt[i, j, ])
      for (m in 1:M) {
        kern[i, j, m] <- alpha[j, m] * (theta[i, m])
      }
      for (k in 1:K) {
        dotprod[i, j, k] <- (k - 1) * sum(kern[i, j, ])
        eta[i, j, k] <- dotprod[i, j, k] - sum(gamma[j, 1:k])
        exp.eta[i, j, k] <- exp(eta[i, j, k])
        prob.irt[i, j, k] <- ifelse(K - maxscore.by.item[j],
          ifelse(k <= (maxscore.by.item[j]),
            exp.eta[i, j, k]/sum(exp.eta[i, j, 1:maxscore.by.item[j]]), 0),
          exp.eta[i, j, k]/sum(exp.eta[i, j, 1:maxscore.by.item[j]]))
      }
    }
  }

  ## Rater Parameters
  for (nu in r1.raters) {
    for (s in r.1.in) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
        zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
        omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
      }
    }
    for (s in r.1.out) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] <- 0
        zeta.r[nu, s, ra] <- 0
        omega.r[nu, s, ra] <- 1
      }
    }
  }
  for (nu in r2.raters) {
    for (s in r.2.in) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
        zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
        omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
      }
    }
    for (s in r.2.out) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] <- 0
        zeta.r[nu, s, ra] <- 0
        omega.r[nu, s, ra] <- 1
      }
    }
  }
  for (nu in r3.raters) {
    for (s in r.3.in) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
        zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
        omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
      }
    }
  }
  for (nu in r4.raters) {
    for (s in r.4.in) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] ~ dnorm(eta.rt[s, ra], prec.rhocov)
        zeta.r[nu, s, ra] ~ dnorm(kappa.rt[s, ra], prec.zeta)
        omega.r[nu, s, ra] <- sqrt(1/exp(zeta.r[nu, s, ra]))
      }
    }
    for (s in r.4.out) {
      for (ra in 1:RA) {
        rhocov.r[nu, s, ra] <- 0
        zeta.r[nu, s, ra] <- 0
        omega.r[nu, s, ra] <- 1
      }
    }
  }

  ## Multidimension parameters
  for (m in 1:M) {
    pi.rt[m] <- 0
    delta.rt[m] <- 0
    sigma.rt[m] <- 1
  }

  ## Item Parameters
  for (s in 1:S) {
    for (ra in 1:RA) {
      eta.rt[s, ra] ~ dnorm(pi.rt[factors.by.item[s]], prec.eta)
      kappa.rt[s, ra] ~ dnorm(delta.rt[factors.by.item[s]], prec.kappa)
      tau.rt[s, ra] <- sqrt(1/exp(kappa.rt[s, ra]))
    }
  }

  ## Initializations for rater and item parameters
  prec.pi ~ dgamma(a.precpi, b.precpi)
  prec.delta ~ dgamma(a.precdelta, b.precdelta)
  prec.eta ~ dgamma(a.preceta, b.preceta)
  prec.kappa ~ dgamma(a.preckappa, b.preckappa)
  prec.rhocov ~ dgamma(a.precrhocov, b.precrhocov)
  prec.zeta ~ dgamma(a.preczeta, b.preczeta)
  sd.rhocov <- sqrt(1/prec.rhocov)
  sd.zeta <- sqrt(1/prec.zeta)
  sd.pi <- sqrt(1/prec.pi)
  sd.delta <- sqrt(1/prec.delta)
  sd.eta <- sqrt(1/prec.eta)
  sd.kappa <- sqrt(1/prec.kappa)
  for (m in 1:M) {
    alpha[d2[1], m] <- ifelse(m == 2, 1, 0)
    alpha[d1[1], m] <- ifelse(m == 1, 1, 0)
    alpha[d3[1], m] <- ifelse(m == 3, 1, 0)
    alpha[d4[1], m] <- ifelse(m == 4, 1, 0)
  }
  for (j in d1[2:D1]) {
    alpha[j, 1] ~ dlnorm(0, prec.alpha)
    alpha[j, 2] <- 0
    alpha[j, 3] <- 0
    alpha[j, 4] <- 0
  }
  for (j in d2[2:D2]) {
    alpha[j, 2] ~ dlnorm(0, prec.alpha)
    alpha[j, 1] <- 0
    alpha[j, 3] <- 0
    alpha[j, 4] <- 0
  }
  for (j in d3[2:D3]) {
    alpha[j, 3] ~ dlnorm(0, prec.alpha)
    alpha[j, 2] <- 0
    alpha[j, 1] <- 0
    alpha[j, 4] <- 0
  }
  for (j in d4[2:D4]) {
    alpha[j, 4] ~ dlnorm(0, prec.alpha)
    alpha[j, 1] <- 0
    alpha[j, 2] <- 0
    alpha[j, 3] <- 0
  }
  for (j in 1:J) {
    gamma[j, 1] <- 0
    for (k in 2:maxscore.by.item[j]) {
      gamma[j, k] ~ dnorm(0, prec.gamma)
    }
    for (k in (maxscore.by.item[j] + 1):(K + 1)) {
      gamma[j, k] <- 0
    }
  }

  ## Theta estimations
  for (i in 1:TY) {
    for (m in 1:M) {
      ty[i, m] ~ dnorm(0, prec.ty)
    }
  }
  for (i in 1:N) {
    theta[i, 1:M] ~ dmnorm(ty[tyr.by.obs[i], ], Tau[, ])
  }
  Tau[1:M, 1:M] ~ dwish(W[, ], DF)
  Sigma <- inverse(Tau[, ])
  sd.th1 <- sqrt(Sigma[1, 1])
  sd.th2 <- sqrt(Sigma[2, 2])
  rho12 <- Sigma[1, 2]/sqrt(Sigma[1, 1] * Sigma[2, 2])
  prec.ty ~ dgamma(a.precty, b.precty)
  sd.ty <- 1/sqrt(prec.ty)
  prec.b <- pow(var.b, -1)
  prec.g <- pow(var.g, -1)
  prec.alpha <- pow(var.alpha, -1)
  prec.gamma <- pow(var.gamma, -1)
  prec.phi <- pow(var.phi, -1)
}

## initial values (R code, passed to the sampler)
inits <- function() {
  list(
    alpha = item.dims * runif(J*M, 0.1, 1.5),
    gamma = item.cats.by.score * rnorm(J*(K+1), 0, 0.5),
    # ty = matrix(rep(rnorm(TY, 0, 1), M), nrow = TY, ncol = M),
    theta = matrix(rnorm(N*M, 0, 1), ncol = M),
    phi = rnorm(R, 0, 1),
    tau = runif(R, 0.1, 8),
    rhocov = array(rnorm(R*S*RA), dim = c(R, S, RA)) * rnorm(R*S*RA, 0, .5),
    zeta = array(rnorm(R*S*RA), dim = c(R, S, RA)),
    pi = rnorm(R, 0, .5),
    delta = rnorm(R, 0, .5),
    kappa = rnorm(R, 0, .5),
    theta.prec = rgamma(1, 100, 100)
  )
}
```

Listing G: JAGS code of a full model representing Section 5.5, including code for the additional estimation of CLASS items and simultaneous estimation of human and model parameters, as seen in Figure 9. For brevity, this includes all code, which can be reduced for the various methods herein. For the creation of Panels (d) and (e) of Figure 4 and Figures 10 and 11, model parameters were estimated after human raters and teacher parameters were estimated and only using MQI items (i.e., xi[i,j] is held as fixed when estimating parameters for Encoders and GPTs).
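To make the two likelihood pieces of the listing easier to inspect, the sketch below evaluates their category probabilities outside of JAGS for a single observation. It mirrors the expressions for `z[i, k]` and `eta[i, j, k]` in the listing, but omits the `maxscore.by.item` handling for items with fewer categories; the Python function names and the example values are assumptions made for illustration.

```python
import math


def sdt_probs(xi, rho, zeta, n_categories):
    """Signal detection rater model: probability of each observed score 1..K.

    xi is the ideal score, rho the rater's shift (bias), and zeta the
    log-precision of the rater around the shifted ideal score.
    """
    z = [math.exp(-((k - xi - rho) ** 2) / 2 * math.exp(zeta))
         for k in range(1, n_categories + 1)]
    total = sum(z)
    return [value / total for value in z]


def gpcm_probs(theta, alpha, gamma):
    """Multidimensional generalized partial credit model for one item.

    theta and alpha are per-dimension abilities and discriminations;
    gamma holds the step parameters with gamma[0] fixed at 0.
    """
    kern = sum(a * t for a, t in zip(alpha, theta))
    # index k here is the listing's (k - 1); sum(gamma[:k + 1]) is sum(gamma[j, 1:k])
    eta = [k * kern - sum(gamma[:k + 1]) for k in range(len(gamma))]
    exp_eta = [math.exp(e) for e in eta]
    total = sum(exp_eta)
    return [value / total for value in exp_eta]
```

For instance, with an ideal score of 3, `sdt_probs(3, 0, 0, 5)` peaks at the third category, while a rater shift of `rho = 1` moves the peak to the fourth, which is how the model represents a rater who systematically scores one category away from the ideal score.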

Appendix H Generalizability and Decision Studies

H.1 Generalizability Study Human Results (for NCTE Main Study)

This section reports the results of the item-level G-study for human expert ratings, consisting only of the estimates for individual items, using the NCTE Main Study data (Kane et al., 2015) to replicate Section 2.d of that study's Appendix. All calculations and representations follow the design details listed therein.

In the Appendix of the NCTE study, the authors reported a G-study on the MQI instrument, but not for the data of the study itself: they provide a separate G-study of only eight (8) middle school teachers teaching three (3) lessons each with only nine (9) raters, instead of the corresponding 317 NCTE Study teachers with an average of 5.34 lessons each and 63 raters. For completeness, this paper conducts the G-study for the NCTE main study Appendix, Section 3, using the NCTE dataset. The full results of the human label G-study are in Table 11.

Table 11: By item, the percentage contribution, excluding the residual (which accounts for the remainder of the variance), of each variance component in the given MQI Item’s R x (O:T) Generalizability Study

H.2 Item Generalizability and Item-score Reliability

As a complement to, and context for, Sections 5.1 and 5.2, we compare the item values of $\mathbf{E}\hat{\rho}^{2}_{j}$ to item-level reliability estimates related to Guttman's $\lambda_{6}$ (Guttman, 1945), $\hat{\rho}^{\lambda_{6}}_{jj\prime}$ (Zijlmans et al., 2018a, b). $\hat{\rho}^{\lambda_{6}}_{jj\prime}$ represents the proportion of an item's variance that is shared with the variance captured by the other items. This estimate from Classical Test Theory (naïvely, in this case) assumes that all items measure the same latent construct, i.e., the Mathematical Quality of Instruction (Hill et al., 2008). $\hat{\rho}^{\lambda_{6}}_{jj\prime}$ removes the residual error variance, $\sigma^{2}_{\varepsilon_{j}}$, from a multiple regression of item $j$ on the scores from the remaining $J-1$ items to estimate the proportion of total item variance, $\sigma^{2}_{X_{j}}$, consistent with the unidimensional construct shared with the other items. Figure 13 highlights the large difference between the measurement used in Section 5.2 and item reliabilities from classical test theory. The latter describes item reliability based on all scores, while the former is used in this study because it relates more closely to the reliability of individual scores for a given item.

H.3 Generalizability Theory Parameters and Code

A helpful heuristic for understanding the mathematics of G-theory is that it is computationally very similar to hierarchical mixed-effects models, where the estimates of interest are found in the variation of the random effects. The code listings in this section include the by-item $(O:I)\times R$ and $(S:O:I)\times R$ parameterizations, using variable names from the original dataset. The former replicates the methods used in Hill et al. (2012b) and Appendix Section 2.d of Kane et al. (2015) to create Table 11 in Appendix Section H.1, and was used in this study to calculate the family generalizability metrics in Section 5.2, including those used in Section 5.3. The latter is used for the decision studies described in Section 5.6. Studies were conducted using lme4 (Bates et al., 2015) in R (R Core Team).

Full results for item-level d-studies as defined in Section 5.6 are in Figure 14.

```r
library(lme4)   # lmer()
library(dplyr)  # filter()

m <- list()  # fitted models, indexed by item
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | RATERID) + (1 | NCTETID/OBSID) + (1 | ITEM) +
      (1 | RATERID:NCTETID) + (1 | RATERID:OBSID) +
      (1 | ITEM:NCTETID) + (1 | ITEM:OBSID) + (1 | RATERID:ITEM) +
      (1 | ITEM:RATERID:NCTETID)
  )
}
```

Listing: lme4 code for family-wise all-item estimations in Table 2

```r
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(ITEM == item) |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | NCTETID/OBSID) + (1 | RATERID) + (1 | RATERID:NCTETID)
  )
}
```

Listing: lme4 code for item-level estimations of $\mathbf{E}\hat{\rho}^{2}_{j}$ in Equation 2

```r
for (item in ITEMS) {
  m[[item]] <- lmer(
    data = df |> filter(ITEM == item) |> filter(R_TYPE == rater.type),
    formula = SCORE ~ (1 | NCTETID_SCHOOLYEAR_SP/OBSID/CHAPNUM) + (1 | RATERID) +
      (1 | RATERID:NCTETID_SCHOOLYEAR_SP)
  )
}
```

Listing: lme4 code for item-level estimations used in Equation 9

Appendix I Interpretability of Encoder Labels

I.1 Feature Attribution Models and Tools

The Explainable Artificial Intelligence (XAI) community has proposed various methodologies to enhance the explainability of deep learning models. A popular strategy is feature attribution, wherein, for a given neural network model f, an attribution method E delineates the significance of each input feature of x to the prediction y = f(x). Various strategies to ascertain feature importance have been introduced, encompassing gradient-based methods, surrogate methods, and perturbation-based methods. Our study employs Integrated Gradients, a gradient-based approach developed by Sundararajan et al. (2017), to identify pivotal sentences for classroom quality assessment. Integrated Gradients is designed to satisfy two axioms that attribution methods ought to adhere to, Sensitivity and Implementation Invariance. In practice, it is computed with the following approximation:

$$\text{IntegratedGrads}^{approx}_{i}(x)::=(x_{i}-x^{\prime}_{i})\times\sum_{k=1}^{m}\frac{\partial F\left(x^{\prime}+\frac{k}{m}\times(x-x^{\prime})\right)}{\partial x_{i}}\times\frac{1}{m}.$$

In the above, $(x_{i}-x^{\prime}_{i})$ is the difference between the input $x_{i}$ and the baseline $x^{\prime}_{i}$, and $m$ is the number of steps in the Riemann approximation of the exact integral, as presented by Sundararajan et al. (2017). Integrated Gradients computes the average gradient along a path interpolating between a chosen baseline and the input. The resulting attributions are then obtained as the element-wise product of this path-averaged gradient vector and the difference vector between the input and the baseline.
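The Riemann approximation can be sketched for a toy function whose gradient is known in closed form. The function `integrated_gradients`, the toy model $F(x) = x_0^2 + 3x_1$, and the all-zeros baseline are illustrative assumptions; the encoder analysis would instead obtain gradients by automatic differentiation with respect to sentence embeddings.

```python
def integrated_gradients(grad_fn, x, baseline, m=50):
    """Approximate Integrated Gradients with an m-step Riemann sum.

    grad_fn maps an input point (list of floats) to the gradient of F there.
    """
    n = len(x)
    gradient_sum = [0.0] * n
    for k in range(1, m + 1):
        # point on the straight line from the baseline to the input
        point = [b + (k / m) * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            gradient_sum[i] += grad[i]
    # (x_i - x'_i) times the path-averaged gradient
    return [(xi - b) * g / m for xi, b, g in zip(x, baseline, gradient_sum)]


def toy_gradient(point):
    """Gradient of the toy model F(x) = x0**2 + 3 * x1."""
    return [2.0 * point[0], 3.0]


attributions = integrated_gradients(toy_gradient, [2.0, 1.0], [0.0, 0.0], m=1000)
```

As the number of steps grows, the attributions sum to approximately $F(x) - F(x^{\prime})$, which is 7 in this toy case; this completeness property offers a simple sanity check on an implementation.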

Raters	ETCA	EXPL	LANGIMP	LCP	LINK	MAJERR	MGEN	MLANG	MMETH	REMED	SMQR	STEXPL	USEPROD
Humans	0.3	0.27	0.28	0.21	0.41	0.28	0.19	0.32	0.47	0.32	0.29	0.39	0.31
Encoders	0.51	0.46	0.41	0.39	0.57	0.35	0.33	0.52	0.52	0.46	0.39	0.47	0.46
GPTs		0.04	0.04							0.04	0.12
Xu et al.	0.3	0.31	0.19	0.13	0.41	0.13		0.4	0.36	0.27	0.26	0.37