Detecting Rater Effects Using Many-Facet Rasch Models and Bootstrap Techniques
The quality of ratings provided by expert raters in evaluating language learners’ constructed responses in performance assessment is typically investigated by means of statistical modeling. Several rater effects, including severity/leniency, central tendency, and randomness, have been well documented in the psychometrics literature (Myford & Wolfe, 2003). This study applies Many-Facet Rasch Models to detect these rater effects in an in-house speaking assessment for international teaching assistants (ITAs) at a U.S. university. The goal of this study is to evaluate the extent to which the models, estimation procedures, and statistics/numerical indices adopted here work as intended in this context. Two simulation studies are conducted in which model parameters are drawn from different distributions, and a parametric bootstrap procedure is applied to assess the statistical properties (i.e., consistency, variability, and mean squared error) of the parameter estimates and fit statistics. Then, the model parameters are estimated from the actual data, and the estimates are compared across estimation procedures (Joint Maximum Likelihood (JML) vs. Marginal Maximum Likelihood (MML)) and computational implementations (R vs. Facets). The parametric bootstrap procedure is also applied to estimate the sampling distributions of the parameters and fit statistics through replications. Finally, the indices for rater effects detection are compared using both numerical summaries and plotting techniques.
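To make the parametric bootstrap step concrete, the following is a minimal sketch in Python rather than the study's own R or Facets code. It assumes a simplified dichotomous model with only two facets, logit P(x = 1) = ability minus rater severity, whereas the actual assessment presumably uses polytomous ratings. The functions `simulate` and `fit_jml`, the sample sizes, and all parameter values are hypothetical choices made for illustration, and the JML routine is a crude alternating Newton scheme, not the algorithm implemented in Facets.

```python
import numpy as np

def simulate(theta, severity, rng):
    """Draw 0/1 ratings with logit P(x=1) = theta_n - severity_j (illustrative model)."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - severity[None, :])))
    return (rng.random(p.shape) < p).astype(float)

def fit_jml(x, n_iter=200):
    """Crude joint maximum likelihood: alternating damped Newton steps.

    Rater severities are centered at zero for identification, and abilities
    are clipped to [-6, 6] so that extreme scores stay finite.
    """
    n, j = x.shape
    theta, sev = np.zeros(n), np.zeros(j)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - sev[None, :])))
        w = p * (1.0 - p)
        theta += np.clip((x - p).sum(axis=1) / w.sum(axis=1), -1.0, 1.0)
        theta = np.clip(theta, -6.0, 6.0)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - sev[None, :])))
        w = p * (1.0 - p)
        sev -= np.clip((x - p).sum(axis=0) / w.sum(axis=0), -1.0, 1.0)
        sev -= sev.mean()
    return theta, sev

rng = np.random.default_rng(0)

# Hypothetical "observed" data: 200 examinees rated by 8 raters.
true_theta = rng.normal(0.0, 1.0, 200)
true_sev = np.linspace(-1.0, 1.0, 8)
observed = simulate(true_theta, true_sev, rng)
theta_hat, sev_hat = fit_jml(observed)

# Parametric bootstrap: treat the estimates as the truth, simulate new data
# from them, and re-estimate on every replicate.
B = 100
boot = np.empty((B, sev_hat.size))
for b in range(B):
    x_star = simulate(theta_hat, sev_hat, rng)
    boot[b] = fit_jml(x_star)[1]

bias = boot.mean(axis=0) - sev_hat             # bootstrap estimate of bias
variance = boot.var(axis=0, ddof=1)            # bootstrap estimate of variability
mse = ((boot - sev_hat) ** 2).mean(axis=0)     # bootstrap mean squared error

print(np.round(bias, 3), np.round(variance, 3), np.round(mse, 3))
```

The same loop could wrap any estimator, so comparing JML with MML under this design would amount to swapping `fit_jml` for a marginal routine and comparing the resulting bias, variance, and MSE summaries.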
Results indicated that, when the model parameters and rater effects were simulated, the estimated severity parameters and the fit statistics were sensitive in detecting the intended effects. By comparison, the MML estimation method showed some advantage over the JML estimation method in terms of statistical consistency and variability, but neither estimation method was free of bias. This also held when the actual data were analyzed. Moreover, in detecting centrality or randomness effects in the actual data, evidence from the fit statistics could be used in conjunction with other indices from Facets and with visualization techniques. However, when the empirical distributions of the fit statistics from the bootstrap were considered, disagreements between MML and JML were relatively large, and the rule-of-thumb critical ranges for the fit statistics may be questionable.
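The point about rule-of-thumb critical ranges can be illustrated by simulating the distribution of rater fit statistics under the model and reading off empirical percentiles. The sketch below is an assumption-laden simplification: it again uses a dichotomous two-facet model with hypothetical parameter values, it computes infit and outfit mean squares from the generating parameters instead of re-estimating them on each replicate, and the 0.5 to 1.5 range is only one commonly cited convention, not necessarily the range examined in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 200)        # hypothetical examinee abilities
severity = np.linspace(-1.0, 1.0, 6)     # hypothetical rater severities

# Model-implied probabilities and variances for every examinee-rater pair.
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - severity[None, :])))
w = p * (1.0 - p)

def fit_stats(x):
    """Return per-rater (infit, outfit) mean squares for a 0/1 rating matrix."""
    resid2 = (x - p) ** 2
    outfit = (resid2 / w).mean(axis=0)              # unweighted mean of squared z-scores
    infit = resid2.sum(axis=0) / w.sum(axis=0)      # information-weighted mean square
    return infit, outfit

# Simulate replicate data sets from the model and collect the fit statistics.
B = 500
infit_boot = np.empty((B, severity.size))
outfit_boot = np.empty((B, severity.size))
for b in range(B):
    x_star = (rng.random(p.shape) < p).astype(float)
    infit_boot[b], outfit_boot[b] = fit_stats(x_star)

# Empirical 95% ranges per rater, to set against a fixed rule of thumb.
infit_lo, infit_hi = np.percentile(infit_boot, [2.5, 97.5], axis=0)
outfit_lo, outfit_hi = np.percentile(outfit_boot, [2.5, 97.5], axis=0)
rule_of_thumb = (0.5, 1.5)  # assumed convention, for comparison only

for j in range(severity.size):
    print(f"rater {j}: infit [{infit_lo[j]:.2f}, {infit_hi[j]:.2f}]  "
          f"outfit [{outfit_lo[j]:.2f}, {outfit_hi[j]:.2f}]  rule {rule_of_thumb}")
```

If the empirical ranges differ noticeably from the fixed interval, or differ between infit and outfit or across raters, a single cutoff applied to all raters would flag misfit at a different rate than intended.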