Modeling statistics ITAs’ speaking performances in a certification test

Author: Zhou, Ziwei
Major Professor: Ockey, Gary
Committee Members: Froelich, Amy; Vajjala, Sowmya; Cotos, Elena; Levis, John
Given the ever-increasing capability of computer technology and advances in speech and natural language processing techniques, automated speech scoring of constructed responses is gaining popularity in many high-stakes assessment and low-stakes educational settings. Automated scoring is a highly interdisciplinary and complex subject, and much remains unknown about the strengths and weaknesses of automated speech scoring systems (Evanini & Zechner, 2020). Research in automated speech scoring has centered on a few proprietary systems owned by large testing companies. Consequently, existing systems serve only large-scale standardized assessment purposes. Applying automated scoring technologies in local assessment contexts is much desired but rarely realized because the systems' inner workings have remained unfamiliar to many language assessment professionals. Moreover, assumptions about the reliability of human scores, on which automated scoring systems are trained, are untenable in many local assessment situations, where a myriad of factors jointly determine the human scores. These factors may include the rating design, the test takers' abilities, and the raters' specific rating behaviors (e.g., severity/leniency, internal consistency, and application of the rating scale). In an attempt to apply automated scoring procedures to a local context, the primary purpose of this study is to develop and evaluate an appropriate automated speech scoring model for a local certification test of international teaching assistants (ITAs). To meet this goal, this study first implemented feature extraction and selection based on existing automated speech scoring technologies and the scoring rubric of the local speaking test.
Then, the reliability of the human ratings was investigated under both Classical Test Theory (CTT) and Item Response Theory (IRT) frameworks, focusing on detecting potential rater effects that could negatively impact the quality of the human scores. Finally, by experimenting with and comparing a series of statistical modeling options, this study investigated the extent to which the association between the automatically extracted features and the human scores could be statistically modeled to offer a mechanism that reflects the multifaceted nature of the performance assessment in a unified statistical framework. The extensive search for speech and linguistic features, covering the sub-domains of fluency, pronunciation, rhythm, vocabulary, grammar, content, and discourse cohesion, revealed that a small set of useful variables could be identified. A large number of features could be effectively summarized as single latent factors that showed reasonably high associations with the human scores. Reliability analysis of human scoring indicated that both inter-rater reliability and intra-rater reliability were acceptable, and a fine-grained IRT analysis identified several raters who were prone to central tendency or randomness effects. Model fit indices, predictive performance, and model diagnostics indicated that the most appropriate approach to modeling the relationship between the features and the final human scores was a cumulative link model (CLM). In contrast, the most appropriate approach to modeling the relationship between the features and the ratings from the multiple raters was a cumulative link mixed model (CLMM). These models suggested that higher ability levels were significantly related to the lapse of time, faster speech with fewer disfluencies, more varied and sophisticated vocabulary, more complex syntactic structures, and fewer rater effects.
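For readers unfamiliar with these model families, the two models named above can be written in their textbook form. This is a generic sketch, not the study's exact specification: the logit link, the single rater-level random intercept, and all symbols (thresholds \theta_j, feature vector \mathbf{x}_i, coefficients \boldsymbol{\beta}, rater effect u_r) are assumptions introduced here for illustration, since the abstract does not state them.

```latex
% Assumed notation: Y is the ordinal score with J categories (J = 3 on the
% 3-point scale), i indexes test takers, r indexes raters.
\begin{align}
\text{CLM:}\quad
  & \operatorname{logit} P(Y_i \le j) = \theta_j - \mathbf{x}_i^{\top}\boldsymbol{\beta},
  \qquad j = 1, \dots, J-1 \\
\text{CLMM:}\quad
  & \operatorname{logit} P(Y_{ir} \le j) = \theta_j - \mathbf{x}_i^{\top}\boldsymbol{\beta} - u_r,
  \qquad u_r \sim N(0, \sigma_u^2)
\end{align}
```

Under this sign convention, a positive coefficient in \boldsymbol{\beta} shifts probability toward higher score categories, and a positive u_r corresponds to a rater who tends to award higher ratings.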
Based on the model's predictions on unseen data, the rating-level CLMM achieved an accuracy of 0.64, a Pearson correlation of 0.58, and a quadratically weighted kappa of 0.57 when compared with the human ratings on the 3-point scale. Results from this study could inform the development, design, and implementation of a prototypical automated scoring system for prospective ITAs. They could also provide empirical evidence for future scale development and rater training, and support assessment-related instruction for the testing program and diagnostic feedback for ITA test takers.
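The three agreement statistics reported above can be computed as in the following minimal Python sketch. It is illustrative only: the function names and the example score lists are hypothetical, the data are made up, and nothing here reproduces the study's own code or results.

```python
from math import sqrt


def accuracy(human, predicted):
    """Proportion of cases where the predicted score equals the human score."""
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def quadratic_weighted_kappa(human, predicted, categories=(1, 2, 3)):
    """Cohen's kappa with quadratic disagreement weights on an ordinal scale."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for h, p in zip(human, predicted):
        observed[index[h]][index[p]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two score distributions.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 3-point scale, for illustration only.
example_human = [1, 2, 3, 2, 1, 3, 2, 2]
example_predicted = [1, 2, 2, 2, 1, 3, 3, 2]

print("accuracy:", accuracy(example_human, example_predicted))
print("pearson:", pearson(example_human, example_predicted))
print("qwk:", quadratic_weighted_kappa(example_human, example_predicted))
```

Accuracy counts only exact matches, whereas the quadratically weighted kappa penalizes a two-point disagreement four times as heavily as a one-point disagreement and corrects for chance agreement. That is why the two statistics can diverge on the same set of predictions.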