Investigating reliability and construct validity of a source-based academic writing test for placement purposes
Source-based writing, in which writers read or listen to academic content before writing, is considered to assess academic writing skills better than independent writing tasks (Read, 1990; Weigle, 2004). Because scores resulting from ratings of test takers’ source-based writing task responses are treated as indicators of their academic writing ability, researchers have begun to investigate the meaning of scores on source-based academic writing tests in an attempt to define the construct measured by such tests. Although this research has yielded insights into source-based writing constructs and the rating reliability of such tests, it has been limited in its research perspective, its methods for collecting data about the rating process, and the clarity of the connection it draws between reliability and construct validity. This study aimed to collect and analyze evidence regarding the reliability and construct validity of a source-based academic English test for placement purposes, called the EPT Writing, and to show the relationship between these two parts of the study by presenting the evidence in a validity argument (Kane, 1992, 2006, 2013). Specifically, the study examined important aspects of reliability: the appropriateness of the rating rubric, based on raters’ opinions and statistical evidence; the performance of the raters in terms of severity, consistency, and bias; and test score reliability. In addition, the construct of academic source-based writing assessed by the EPT Writing was explored by analyzing the writing features that raters attended to while rating test takers’ responses. The study employed a mixed-methods multiphase research design (Creswell & Plano Clark, 2012) in which both quantitative and qualitative data were collected and analyzed in two sequential phases to address the research questions.
In Phase 1, quantitative data, consisting of 1,300 operational ratings provided by the EPT Office, were analyzed using Many-Facets Rasch Measurement (MFRM) and Generalizability theory to address the research questions related to the rubric’s functionality, raters’ performance, and score reliability. In Phase 2, 630 experimental ratings, 90 stimulated recalls supported by eye-tracking recordings, and nine interviews with nine raters were analyzed to address the research questions pertaining to raters’ opinions of the rubric and the writing features that attracted raters’ attention during rating. The findings were presented in a validity argument to show the connection between the reliability of the ratings and construct validity, a connection that needs to be taken into account in research on rating processes. Overall, the raters’ interviews and the MFRM analysis of the operational ratings showed that the rubric was mostly appropriate for providing evidence of variation in source-based academic writing ability. Regarding raters’ performance, the MFRM analysis revealed that while most raters were comparable and consistent in severity and impartial towards the writing tasks, some were significantly more generous, inconsistent, and biased against task types. The score reliability estimate for a 2-task x 2-rater design was found to be below the desired level, suggesting that more tasks and raters are needed to increase reliability. Additionally, analysis of the verbal reports indicated that the raters attended to writing features aligned with the source-based academic writing construct that the test aims to measure. The conclusion presents a partial validity framework for the EPT Writing, in addition to implications for construct definition of source-based academic writing tests, cognition research methods, and language assessment validation research.
Recommendations for the EPT Writing include a clearer definition of the test construct, revision of the rubric, and more rigorous rater training. Suggested directions for future research include further investigation of raters’ cognition in source-based writing assessment and additional validation studies for other inferences of the validity framework for the EPT Writing.