Diagnostic Testing of Introductory Geology Students

A diagnostic test for assessing the general science and Earth science knowledge of entry-level college students was administered to 451 students in 2002 and 401 students in 2003 enrolled in an introductory geology course at Iowa State University. The study shows that male students, seniors, and science-technology-math majors score higher than female students, freshmen, and non-science-technology-math majors, and that these differences are statistically significant. Students who scored higher on the diagnostic test were also more likely to pass the course. The results support the feasibility of a standardized diagnostic test as a tool for geoscience instructors in curriculum planning, student advising, and curriculum assessment, similar to the standardized diagnostic and pre-post testing used in chemistry and physics courses. Standardized national tests would enhance college geoscience education.


INTRODUCTION
In the last decade there has been a dramatic increase in research aimed at studying and improving geoscience education. The ever-increasing number of manuscripts submitted to the Journal of Geoscience Education (JGE) for publication unequivocally signifies this change (Drummond, 2003). Most of the articles in JGE describe innovative techniques devised to improve student learning or to engage students in the study of Earth sciences. Compared to geoscience education, the production of education literature in general is monumental in scope and size: the Education Resources Information Center (ERIC), the leading educational database, contains more than one million citations and abstracts from over 700 educational journals and thousands of reports. The overwhelming volume of literature published in this field has one common goal: to enhance student learning.
When developing a new teaching technique, revising a syllabus to incorporate innovative activities, or designing a new curriculum, the question that each instructor naturally would ask is, "Will it improve learning of the subject matter?" In other words, how much more or better will students learn with the new approach? Because curricular innovations require time and effort, instructors must find them to be worthwhile. But how do we measure learning? Assessment is a critical part of planning educational research, and a very strong emphasis is currently placed on the development of successful assessment techniques. The recommended method is to give "before" and "after" exams to both an experimental and a control group.
The problems with this approach are multiple and well known: is the grading scale the same? Is the new technique the only part that changed in the course, or has the instructor revised other aspects of the course as well? How much time did the students spend on the activity? Is the demographic make-up of the courses the same? Are the tests comparable in length, type, and difficulty? On a larger scale, instructors may wonder what and how much their students are learning compared to students in other schools or other states. One way, and possibly the only way, to find an answer to all of these questions would be to create standardized national tests for the geosciences. Chemistry and physics instructors have actively used such tests for decades. For introductory college physics and chemistry courses, diagnostic tests have been developed in the last 10 years, and they are beginning to be used as statewide examinations (e.g., Krishnan and Howe, 1994; Russell, 1994; Steinberg and Sabella, 1997; McFate and Olmsted, 1999; Legg et al., 2001). Recently, the California Chemistry Diagnostic Test (CCDT) (Russell, 1994) was used with logistic regression analysis to model the probability of success of students in general chemistry (Legg et al., 2001), so the CCDT can be used to predict student success and to advise students about their readiness when they start the course. In an effort to establish national standards in the understanding of chemistry, the Division of Chemical Education of the American Chemical Society has been providing K-16 instructors with standardized tests since 1934 (http://www.uwm.edu/Dept/chemexams/INTRO/index.html). Similar tests are available for physics instructors (http://www.psrc-online.org/ under "Evaluation instruments"). These tests allow instructors to assess student knowledge, to conduct pre-post testing of curriculum effectiveness, to compare local results to the national level, and to compare against the national science standards.
By giving a standardized test at the beginning and end of the course, the instructor can assess individual student learning and how the students in the course compare to students at peer institutions, across the state, or nationwide.
With a similar goal, the American Geological Institute initiated the Earth Science Curriculum Project in the 1960s, which attempted to establish standardized tests for high school students. These tests, however, are no longer used, and there is no national exam on Earth science knowledge. The National Science Education Standards (National Research Council, 1996) and the AAAS Benchmarks for Science Literacy have again emphasized the critical role of the Earth sciences in science education and the need for content standards in Earth sciences. After this study was conducted, Libarkin and Anderson (2005, 2006) published their Geoscience Concept Inventory (GCI), a pool of 73 multiple-choice questions covering various aspects of physical geology and fundamental physics and chemistry concepts (http://newton.bhsu.edu/eps/gci.html).
Establishing a national level of Earth science education would require the creation of standardized tests similar to those that exist for chemistry. The Geoscience Concept Inventory and the geoscience ConcepTests are excellent starting points for this community initiative. This paper presents the results of our attempt to develop and implement a diagnostic test for an introductory geology course for the purposes of (1) measuring incoming student knowledge of geology and science from high school or previous science courses, and (2) testing the feasibility of a standardized diagnostic test for introductory geology courses. The New York State Regents Earth Science Exams have been administered by the New York State Education Department to high-school seniors for many years, and old exams are available online. Some challenge the validity of these statewide standardized tests (e.g., Olson, 2006), and we could not find any information on how the questions are selected or whether they go through a validation process as rigorous as that of the Geoscience Concept Inventory (Libarkin and Anderson, 2005). However, these were the only independently developed tests of high-school-level Earth science knowledge based on the National Science Education Standards that were available at the time of the study. Because these tests were designed for high school students, they were deemed appropriate and valid for measuring Earth science knowledge at a high school level.

METHODS
Questions were selected by two colleagues of the senior author to avoid potential bias. The 41 questions used on the test in 2002 and the 40 questions used in 2003 were grouped into four types: general science, geology, mathematics, and logic questions (see Electronic Appendix for the complete text of the exams). Some questions may be considered to be more than one type, so we asked a science colleague not involved in the study to assign each question to one of the four types, again to avoid researcher bias. The questions were mainly on geology and general science topics (2002: 17 geology, 15 general science, 4 math, 5 logic; 2003: 13 geology, 19 general science, 5 math, 3 logic). Nineteen questions from the 2002 exam were re-used on the 2003 exam (9 general science, 3 geology, 4 math, 3 logic). Since the goal of the study was to test the students' incoming knowledge of geology and general science, most of which was acquired in high school, we chose to use a high-school-level test even though the questions are based mainly on memorization of facts rather than critical thinking. In fact, most of the questions can be categorized as testing skills in the lower half of Bloom's Taxonomy, i.e., knowledge, comprehension, and application, rather than the upper half of analysis, synthesis, and evaluation. For the same reason, we chose not to administer the same test at the end of the semester and instead evaluated the students' progress during the semester using the combination of assessment tools used in the class (homework, in-class assignments, tests), which are based on deeper conceptual understanding.
Study Sample - To put student performance on the diagnostic test in context: Iowa State University (ISU) is a large, Midwestern, land-grant research institution with approximately 21,000 undergraduate and 5,000 graduate students. Most Iowa high school graduates who rank above the 49th percentile in their graduating class and have completed the required courses are automatically admitted to ISU. Students with lower rankings (20th-49th percentile) must additionally achieve minimum ACT and SAT I scores to receive probationary admission. The high school science preparation required for ISU admission is a total of three years distributed across at least two subjects from among biology, chemistry, and physics. Earth science is not required, and as a result, many high school students do not take Earth science or take it only in 9th grade. These requirements are typical of many states, but various organizations, including the American Geological Institute, are attempting to change the requirements to include the Earth sciences (Robert Ridky, personal communication, 2003).
The diagnostic test was administered during the second class-meeting of Geology 100 (a three-credit, lecture-only class) in 2002 and 2003.

RESULTS
Table 1 shows the combined 2002 and 2003 data by gender. A t-test comparing the mean scores by gender yielded statistical evidence that male students score higher on the diagnostic test than female students (t=7.086, df=850, p<0.001). The results from a Pearson chi-square test comparing the pass-fail proportions by gender were not statistically significant (χ²=0.777, df=1, p=0.378). Table 2 shows the combined 2002 and 2003 data by major and for freshmen and seniors. A t-test comparing the mean scores by major yielded statistical evidence that science-technology-math (SMT) majors score higher on the diagnostic test than non-SMT majors (t=5.908, df=819, p<0.001). The results from a Pearson chi-square test comparing the pass-fail proportions by major were statistically significant (χ²=8.010, df=1, p=0.005), meaning that SMT majors pass the geology course more frequently than non-SMT majors. A t-test comparing the mean scores of freshmen vs. seniors yielded statistical evidence that seniors score higher on the diagnostic test than freshmen (t=2.423, df=426, p=0.016). The results from a Pearson chi-square test comparing the pass-fail proportions of freshmen vs. seniors were not statistically significant (χ²=1.050, df=1, p=0.305). Table 3 shows the 2002 and 2003 data for passing and failing students; a t-test comparing the mean scores of passing vs. failing students yielded statistical evidence that passing students score higher on the diagnostic test than failing students (t=3.394, df=850, p=0.001), evidence for a relationship between diagnostic test score and probability of passing the course. Tables 1 and 2 show, in each year, the same patterns of males scoring higher than females, SMT majors scoring higher than non-SMT majors, and seniors scoring higher than freshmen.
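For readers who wish to reproduce this style of analysis, both tests reported above can be computed from summary statistics and a 2x2 pass-fail table. The sketch below is a minimal, standard-library Python illustration using hypothetical numbers (the per-student data behind Tables 1-3 are not reproduced here). The pooled-variance t statistic matches the df = n1 + n2 - 2 convention reported in the text, and the chi-square p-value uses the exact df = 1 tail.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance (df = n1 + n2 - 2)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df  # pooled variance
    t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 contingency table, no continuity correction.

    For df = 1, the tail probability is exactly erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical mean scores (percent correct) for two groups of students.
t, df = pooled_t(mean1=78.0, sd1=12.0, n1=451, mean2=72.0, sd2=13.0, n2=401)

# Hypothetical 2x2 pass-fail table: rows = groups, columns = (pass, fail).
chi2, p = chi_square_2x2([[400, 51], [340, 61]])
```

Converting t to a p-value requires the t-distribution CDF, which the standard library lacks; when SciPy is available, scipy.stats.ttest_ind_from_stats performs both steps in one call.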
Student results from the 19 questions that appeared on both tests were analyzed to examine more closely the year-to-year consistency of student performance. Similar to the patterns in performance on the entire diagnostic test, males scored higher than females, SMT majors scored higher than non-SMT majors, and seniors scored higher than freshmen on the 19 questions; however, as a group, the 2002 students performed almost identically to the 2003 students (Tables 4 and 5). A t-test comparing the 2002 and 2003 mean performances on the 19 questions for all students was not statistically significant (t=0.312, df=850, p=0.755). Also, students in particular subgroups performed nearly the same as students in the same subgroup from a different year, e.g., the mean score for 2002 males was almost the same as the mean score for 2003 males (Tables 4 and 5).
Item analysis statistics were calculated for the 19 common questions (Table 6). For a particular question, the Item Difficulty (DIFF) equals the percent of students choosing the correct answer, and a high DIFF indicates a higher percent of correct answers; i.e., Question 24 in Table 6 was the "easiest" question for students to answer. The Item Discrimination Index (DISC) is a measure of how effectively a particular question distinguishes between high- and low-performing students on the test as a whole. Positive values for DISC mean that students who scored above the test average had more success answering the question than students who scored below the test average. DISC values close to zero indicate that above-average and below-average students had approximately equal success on the question. All 19 questions had DISC values indicating that the questions successfully differentiated student performance, although the easiest questions have values approaching zero, which is common for questions that nearly all students answer correctly.

Areas of Student Strengths and Weaknesses
The test questions with DIFF values above the test average by one standard deviation were considered areas of student strength. This cut-off value was 76.9+15.1% = 92.0% (Table 4), identifying Questions 24 and 29 as specific strengths of the students (Table 6). Question 24 was a math question on converting from scientific notation into standard notation, and Question 29 was a general science question about a brief definition of evolution.
The test questions with DIFF values below the test average by one standard deviation were considered areas of student weakness. This cut-off value was 76.9-15.1% = 61.8% (Table 4), identifying Question 37 as a specific weakness of the students (Table 6). Question 37 was a general science question about the most abundant atmospheric gas; oxygen was the most selected response.
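The DIFF and DISC statistics, and the mean ± 1 standard deviation cut-offs, are straightforward to compute from a binary matrix of item responses. The Python sketch below uses a tiny hypothetical response matrix and implements DISC as the difference in success rates between above-average and below-average scorers, one common definition consistent with the description in the text.

```python
def item_statistics(responses):
    """responses: list of per-student lists of 0/1 item scores.

    Returns (diff, disc): DIFF = percent correct per item; DISC = success
    rate of above-average scorers minus that of below-average scorers.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    upper = [r for r, t in zip(responses, totals) if t > mean_total]
    lower = [r for r, t in zip(responses, totals) if t < mean_total]
    diff, disc = [], []
    for i in range(n_items):
        correct = sum(row[i] for row in responses)
        diff.append(100.0 * correct / n_students)
        p_upper = sum(row[i] for row in upper) / len(upper) if upper else 0.0
        p_lower = sum(row[i] for row in lower) / len(lower) if lower else 0.0
        disc.append(p_upper - p_lower)
    return diff, disc

def strength_weakness_cutoffs(diff):
    """Mean +/- one population standard deviation of the DIFF values.

    (The paper's Table 4 cut-offs, 92.0% and 61.8%, were derived the same
    way from its 19 DIFF values.)
    """
    k = len(diff)
    mean = sum(diff) / k
    sd = (sum((d - mean) ** 2 for d in diff) / k) ** 0.5
    return mean + sd, mean - sd  # (strength cut-off, weakness cut-off)

# Hypothetical matrix: 4 students x 3 items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
diff, disc = item_statistics(responses)  # diff = [75.0, 50.0, 25.0]
```

A near-universally-answered item gets a DIFF near 100 and, with few below-average scorers missing it, a DISC near zero, which matches the behavior of the easiest questions noted above.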
Using the same cut-off values to analyze for areas of strength and weakness for particular subgroups yielded similar results with some minor differences (Tables 7 and  8). Male students scored above 92.0% and below 61.8% on the same questions as all 852 students, with the exception that males also scored above 92.0% on Question 28, which was a math question about estimating the meter unit to the nearest English unit.
Female students scored above 92.0% on Question 24 only and scored below 61.8% on Questions 37, 40, 41, 30, 36, and 2 (Tables 7 and 8). Questions 40 and 41 were paired logic questions regarding conceptual understanding of the displacement produced by equal-volume spheres of unequal mass. The correct answer was most selected, and the next most selected choice was based on greater mass leading to greater displacement. Question 30 was a general science question on remembering latitude vs. longitude. The correct answer was most selected, and the other answer was the next most selected. Question 36 was a math question on the definition of the kilogram, with the correct answer being most selected and the next most selected choice being off by a factor of 10. Question 2 was a geology question about surface runoff, with the correct answer being most selected and the next most selected choice reflecting an incorrect relationship between surface runoff and gradient. SMT majors scored above 92.0% and below 61.8% on the same questions as all 852 students, with the exception that SMT majors also scored below 61.8% on Questions 2 and 30, two questions on which female students also scored below 61.8%. Non-SMT majors scored above 92.0% and below 61.8% on the same questions as all 852 students, an expected result because of the large number of non-SMT majors in the course. Freshmen scored above 92.0% on Question 24 only and below 61.8% on Questions 37 and 15. Question 15 was a logic question about the direct relationship between pressure and the melting point of a substance, with the correct answer being most selected and the next most selected choice based on confusing ambient temperature with melting point. Seniors scored above 92.0% and below 61.8% on the same questions as all 852 students.

Analysis of SMT Majors by Gender - The responses by SMT majors on the 19 common questions were analyzed by gender (Table 9). A t-test comparing the mean scores by gender yielded statistical evidence that male SMT majors score higher on the diagnostic test than female SMT majors (t=2.127, df=151, p=0.035); however, an ANOVA showed no evidence for interaction effects between the factors of gender, major, and pass-fail status. The results from a Pearson chi-square test comparing the pass-fail proportions by gender were not statistically significant (χ²=2.441, df=1, p=0.118).
The same cut-off values of 92.0% for areas of strength and 61.8% for areas of weakness were used to study student performance by major and gender on the 19 common questions (Tables 10 and 11). Although male SMT majors scored above 92.0% on the same questions as all 852 students, they additionally showed Question 34 as an area of strength. Question 34 was a general science question on identifying the unit of volume within a list of various units. Female SMT majors showed areas of strength (Questions 28 and 25) much different from those of male SMT majors and of female non-SMT majors. Question 28 was the math question on estimating the meter unit to the nearest English unit, an area of strength identified for male students in general. Question 25 was a general science question on identifying the element within a list of metal alloys and one metal. Examining the data for non-SMT majors by gender yielded the same areas of strength identified previously for students of that gender, as expected due to the large number of non-SMT majors.
Male SMT majors scored below 61.8% on the same questions as all 852 students, all male students, and male non-SMT majors. Female SMT majors scored below 61.8% on the same questions as all female students. Female non-SMT majors scored below 61.8% on the same questions as all female students, except Question 15 replaced Question 2 as an area of weakness. Question 15 was the logic question on the direct relationship between pressure and melting point of a substance, which was identified as an area of weakness for freshmen, and female non-SMT majors had similar difficulty.

DISCUSSION
Overall Performance and Relationship to Pass-Fail - On each diagnostic test and on the 19 common questions, males scored higher than females, SMT majors scored higher than non-SMT majors, and seniors scored higher than freshmen. These results are reasonably expected.
The well-known gender difference in performance on standardized tests based on math and spatial-visualization skills (Gallagher, 2004, p. 129) leads to male students scoring higher than female students. SMT majors have likely taken more science and math courses than non-SMT majors, and therefore score higher. Seniors are more prepared than freshmen, whether through completion of more science and math courses or more coursework in general, and so score higher.
The results also show that students who eventually pass the course score higher than students who fail the course. The passing students had more incoming knowledge, more natural ability, or some combination of both that enabled greater success on the diagnostic test and in the course. The diagnostic test therefore had some ability to predict which students will pass the course and, with further development, could become an advisement tool similar to the CCDT.
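Legg et al. (2001) used logistic regression to turn CCDT scores into estimated pass probabilities, and the same approach could be applied to a geology diagnostic test. The sketch below is a minimal, standard-library Python illustration trained by gradient ascent on hypothetical (score, passed) pairs; it is not the model from that study, and the data are invented for demonstration.

```python
import math

def fit_logistic(data, lr=0.5, iterations=2000):
    """Fit P(pass) = 1 / (1 + exp(-(w * score + b))) by gradient ascent
    on the log-likelihood of the observed pass/fail outcomes."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        gw = gb = 0.0
        for score, passed in data:
            p = 1.0 / (1.0 + math.exp(-(w * score + b)))
            gw += (passed - p) * score
            gb += (passed - p)
        w += lr * gw / len(data)
        b += lr * gb / len(data)
    return w, b

def pass_probability(w, b, score):
    """Predicted probability of passing for a given diagnostic score."""
    return 1.0 / (1.0 + math.exp(-(w * score + b)))

# Hypothetical diagnostic scores (fraction correct) and pass (1) / fail (0).
data = [(0.25, 0), (0.35, 0), (0.45, 0), (0.50, 1), (0.40, 1),
        (0.60, 1), (0.70, 1), (0.85, 1), (0.55, 0), (0.90, 1)]
w, b = fit_logistic(data)
```

An advisor could flag students whose predicted pass probability falls below some threshold as candidates for additional preparation or tutoring.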
Reliability - As a whole and within subgroups, students' diagnostic test averages were statistically the same on the common questions used in 2002 and 2003. The questions were apparently reliable from year to year, supporting the reliability of the diagnostic test from year to year, and this result makes sense because the questions were from past New York State Regents Earth Science Exams that were available online. Presumably, these exams were well developed and well tested over many years, and thus they successfully served as the basis for the diagnostic test in this study. The 2002 diagnostic test had a KR-20 reliability of 0.69, as did the 2003 test, and the combined data also yielded a KR-20 reliability of 0.69. Instruments with KR-20 reliabilities greater than 0.70 are generally considered statistically reliable, so these diagnostic tests are on the borderline. Given the wide variety of topics and levels of Bloom's Taxonomy covered by the selected questions, a borderline KR-20 reliability is not unexpected. Still, the New York Regents exams may serve as a possible prototype for a national standardized exam, at least for introductory physical geology courses like the one used for this study. However, the recently developed questions in the Geoscience Concept Inventory (Libarkin and Anderson, 2005) potentially represent a better starting point for such national testing, since they provide a broader selection of questions suitable for a range of introductory geology courses.
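KR-20 is computed from the per-item pass rates and the variance of students' total scores. The sketch below is a minimal Python implementation (using population variance; conventions vary), applied to a hypothetical 4-student, 3-item response matrix rather than the study's data.

```python
def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomous (0/1) item scores.

    responses: list of per-student lists of 0/1 item scores.
    KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(total scores)),
    where p_i is the proportion answering item i correctly, q_i = 1 - p_i.
    """
    k = len(responses[0])              # number of items
    n = len(responses)                 # number of students
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n   # proportion correct on item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical matrix: 4 students x 3 items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(responses))  # → 0.75 for this matrix
```

Higher values indicate that the items hang together as a measure of a single construct, which is why a broad mix of topics and Bloom's levels, as in this study's test, tends to pull KR-20 down.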
Strengths and Weaknesses - The analysis of specific student strengths and weaknesses showed only a few areas that students found to be extremely easy or truly difficult. Students were able to convert from scientific notation into standard notation and were familiar with a definition of evolution. Students most likely found these questions easy because the questions were at the simple-recall level of Bloom's Taxonomy (Table 6) and the topics were most likely addressed thoroughly at the high school level. Students generally held the common misconception that oxygen is the most abundant gas in the Earth's atmosphere; although the test question was also at the simple-recall level, this misconception is pervasive and resistant to change. Female students had more difficulty with the relationship between water displacement and an object's mass and volume, a difficulty that probably reflects the general gender difference in spatial and math ability rather than the level of the question. Female students and SMT majors had difficulty remembering the difference between latitude and longitude, and both groups also had difficulty with factors affecting surface runoff. Both of these questions are at the lower level of Bloom's Taxonomy, but most likely students simply confused the terms latitude and longitude and/or had not encountered these topics. The possibility of not being exposed to the topics, particularly in the case of SMT majors, is high because of the lack of a high school Earth science requirement. Although this initial evidence is limited, the results indicate the potential of using a standardized test to aid in curriculum planning and curriculum assessment, including testing for student competency with respect to national Earth science learning standards.
Examining SMT majors by gender uncovered few new findings, except that female SMT majors had areas of strength much different from those of other students, both as a whole and as subgroups. This subgroup had strengths in estimating the meter unit to the nearest English unit, an area of strength identified for male students, and in identifying an element from a list of substances. With so few questions identified as strengths, it may be better in the future to use results from a test of SMT ability, rather than a student's stated major, to better understand how gender and SMT ability interact to affect diagnostic test performance.

CONCLUSIONS
Our initial attempt to develop and implement a diagnostic test for an introductory geology course produced a test that matches known and expected trends in student performance on standardized tests with respect to gender and background preparation. The test has some ability to predict student success in passing the course, appears to be reliable from year to year, and uncovers specific student strengths and weaknesses. This study provides evidence for the feasibility of developing one or more national standardized tests in the Earth sciences. Such tests would be valuable for Earth science education for many reasons. A diagnostic test could be used for student advising, particularly if geoscience educators produce a test with the greater predictive power of the CCDT as shown by Legg et al. (2001). This power should be achievable through optimization of the test and restriction of the questions to a single subject area, such as geology, instead of a broad spectrum of science questions. The test results can be used to advise students about their chances of succeeding in a geology course and to suggest, where appropriate, that a student seek additional preparation and/or tutoring. A diagnostic test could be used in curriculum planning by providing instructors with data specific to the instructor and course. Such information may guide instructors toward addressing the different learning styles and backgrounds of different subgroups of students. The same or a different test could be used for curriculum assessment, especially as part of a pre-post testing approach. Instructors can use the tests as a method for monitoring teaching effectiveness and student success in a specific course. Lastly, instructors can use the tests to make national comparisons regarding their students' incoming preparation and outgoing level of achievement, which can then also be matched to national Earth science learning standards.
Taken together, these outcomes would lead to significant enhancement of geoscience education efforts at the national level. The tests can facilitate the efforts of instructors concerned with what they are teaching, how they are teaching, and how their students are learning. Based on these potential benefits and our initial encouraging results, state and national efforts to create and implement diagnostic testing for Earth science students should be initiated and pursued.