
Designing Valid Assessments

Instructional designers need to create assessments for several purposes. These may include a test-your-understanding quiz, a unit review, or a summative assessment at the end of a course to certify that a student has accomplished the expected learning outcomes. Unfortunately, not all assessments are valid measures of what they are intended to measure, and when that is the case, the results cannot be used for their intended purpose. This is why an instructional designer needs to learn how to create learning objectives and develop quality assessment instruments that align with the goals of the instruction.

Assessment Validity

The results of an assessment are valid if the assessment accurately and consistently measures what it is supposed to measure and the results are used appropriately.

Creating valid assessments goes beyond ensuring test questions focus on material covered in class or in the curriculum standards. Assessment validation involves checking that your assessment instruments produce accurate results and are used appropriately.

When we say a test is valid, we really mean the results are valid. In other words, the results are credible (i.e., they measure what they were supposed to measure) and, therefore, can be and are used for a specific intended purpose. And while we might say a test is, or the results are, valid, assessment validity might better be understood as a continuum. An assessment must be sufficiently credible and trustworthy so the results can be used confidently for making decisions (i.e., evaluations).

The validation process involves gathering evidence that allows you to confidently conclude that the results accurately represent whatever the assessment was supposed to measure. Several types of evidence can be used to support the validation process:

Face Validity. You will likely hear this term used as if it were a type of validity. In fact, face validity should not be used as evidence that an assessment is valid. This doesn't mean that face validity is not important. Face validity refers to the extent to which an assessment instrument appears to measure what it is designed to measure. It matters because if people don't think an assessment produces valid results, they won't use it. However, just because people believe an assessment is valid doesn't mean it is. For example, many people take online personality tests; some believe they are accurate, while others are skeptical. Whether the test is valid cannot be determined by what people believe; other evidence is needed.

Creating Test Items

There are many resources available that teach item-writing basics. Still, it's easy to write a lousy test. The quality of your assessment will depend on the quality of the items you use. Selecting the most appropriate type of test item to capture the expected learning is crucial, as is testing and revising the items you create. Best practice suggests writing multiple versions of an item, both to weed out faulty items and to provide similar items for equivalent forms of an assessment or for a test bank of questions.

There are a few item statistics that can help identify problematic items. However, these statistics only provide information that may be useful for reviewing and improving the test items used in an assessment. Reviewing items needs to be done by subject matter experts and assessment specialists (e.g., psychometricians).

  1. Item Difficulty. This statistic indicates the percentage of people who answered an item correctly. By itself, it says nothing about the quality of the item. You may wish to review the easy items as well as the difficult items. An easy item, one that almost everyone gets correct, may contain an unintended clue to the correct answer or be written in a way that makes the correct answer obvious. A difficult item may be unclear or contain more than one correct answer. These kinds of item-writing mistakes lead to measurement error and diminish the validity of the assessment results.
  2. Discrimination Index. This statistic, also known as discriminating power, indicates the relationship (i.e., correlation) between the overall score on a test and how well individuals answered a specific test item. Each item on a test has its own discrimination index. A high discrimination index indicates that the item effectively discriminates between high and low performers. Conversely, a low discrimination index suggests that the item is less effective in differentiating between individuals and may not contribute as much to the overall purpose of the test. These statistics are typically used for norm-referenced tests, where differentiating between students is the goal. Very easy and very hard items will have little or no discriminating power, and items with low discriminating power are typically excluded from norm-referenced tests. In a criterion-referenced test, this statistic is less important; item selection is based on the importance of the material or skills being tested. When reviewing items, questions with a negative discrimination index should be examined. A negative discrimination index indicates that students who do better on the overall test tend to get the item wrong; in other words, the more a student knows, the less likely they are to answer the question correctly. Likewise, items with little or no discriminating power should also be reviewed: a discrimination index near zero suggests that students who did well on the overall test are no more likely to answer the item correctly than students who did poorly. A minimal computational sketch of both statistics follows this list.
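To make these two statistics concrete, the sketch below shows one common way to compute them. It assumes scored responses are stored as a 0/1 matrix (one row per examinee, one column per item) and uses the corrected item-total correlation as the discrimination index; the function and variable names are illustrative, not part of this chapter.

    import numpy as np

    def item_statistics(scores):
        """Return item difficulty and discrimination for a 0/1 score matrix.

        scores: one row per examinee, one column per item (1 = correct, 0 = incorrect).
        """
        scores = np.asarray(scores, dtype=float)

        # Item difficulty: proportion of examinees who answered each item correctly.
        difficulty = scores.mean(axis=0)

        # Discrimination index: correlation between each item score and the
        # total score on the remaining items (corrected item-total correlation).
        total = scores.sum(axis=1)
        discrimination = np.empty(scores.shape[1])
        for j in range(scores.shape[1]):
            item = scores[:, j]
            rest = total - item                     # exclude the item itself
            if item.std() == 0 or rest.std() == 0:  # no variance: correlation undefined
                discrimination[j] = 0.0
            else:
                discrimination[j] = np.corrcoef(item, rest)[0, 1]
        return difficulty, discrimination

    # Example: five examinees, three items
    responses = [[1, 0, 1],
                 [1, 1, 1],
                 [0, 0, 1],
                 [1, 1, 0],
                 [0, 0, 1]]
    difficulty, discrimination = item_statistics(responses)
    print("Difficulty:", np.round(difficulty, 2))
    print("Discrimination:", np.round(discrimination, 2))

In this kind of analysis, an item with a difficulty near 1.0 (almost everyone correct) or near 0.0 (almost no one correct), or a discrimination value near or below zero, would be flagged for the expert review described above.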

A detailed discussion of developing and testing items is beyond the scope of this chapter. However, as a general rule, items should align with the intended learning objectives, and the items used should adequately cover the content, focusing on the most important information and skills. Those developing an assessment should follow best-practice guidelines for each type of item.

Assessment Challenges and Issues

Assessment specialists face many challenges when creating valid assessments. We have outlined a few here, but there are others.

Getting beyond recall and understanding. One of the biggest mistakes test creators make is focusing too heavily on the recall of basic information. This may be acceptable when a course's learning objectives intentionally focus on the ability to remember and understand facts and definitions; however, in many courses, the instructional objectives target student learning beyond the lowest levels of Bloom's Taxonomy.


Measuring affective characteristics. Most of what we measure in schools and training situations falls within the cognitive domain. However, the instructional goals of a course often include affective objectives. Unlike knowledge, skills, and abilities, the affective domain includes personal characteristics such as attitudinal dispositions, values, beliefs, and opinions (e.g., interest, caring, empathy, and appreciation) (see Davies, 2021). Simon and Binet (1916), the fathers of intelligence testing, suggested that as important as assessing cognitive ability may be, we might be well served first to teach (and assess) character. Assessing these personal characteristics requires a different kind of assessment: a scale that measures the degree to which individuals possess a certain characteristic or quality.

High-stakes testing. One particularly contentious issue in schools is the political mandate to test students using standardized, summative assessments. A few issues arise from this policy. One issue with high-stakes testing is that these tests don't assess the whole person. The "whole person issue" in assessment refers to the challenge of capturing a person's entire range of abilities, characteristics, and experiences in a comprehensive and accurate manner; using a single assessment to judge a person may be limiting. A second issue concerns balancing the need to assess with the need to teach. Some educators complain they are so focused on testing that they have little time to teach, which includes the problem of teaching to the test. One additional issue with high-stakes testing relates to the need for such testing at all. Many educators believe that the most important purpose for assessment in schools is formative, not summative.

Interpretation and inappropriate uses of assessment results. The inappropriate use of assessment results can also be a problem. Assessments are typically created for a specific purpose, and the results are not valid for other purposes. Assessment results are only valid if they are appropriately interpreted and used for the assessment's intended purpose. For example, in schools, test scores are designed to evaluate individual students' knowledge, skills, and abilities. Unfortunately, they are also inappropriately used to judge the quality of the instruction provided. While the quality of the teacher or instruction may influence the results of an assessment, many students fail to achieve despite being provided quality instruction, and students often succeed despite their teachers' failings. A better assessment of teacher quality would require instruments explicitly designed for that purpose. Another example of inappropriate use occurs when we don't have a good measure of the intended learning outcomes. This can happen, for example, when we want to develop a specific affective characteristic but don't have a valid measure of the disposition; using an achievement test as an indirect substitute indicator would not be appropriate or valid practice. The challenge for assessment developers is to create direct, valid measures of the expected learning outcomes.

Assessment Research Opportunities

If you are interested in researching the topic of assessment, there are several promising and challenging areas you might consider. 

Online test security. With the increased acceptance of online and distance learning, cheating on exams has become a prominent concern. Research on this topic has identified various vulnerabilities and proposed measures to address them. Online proctoring tools can help mitigate the risk of cheating, and using biometrics to verify students' identity and authorship has also been studied (e.g., Young et al., 2019). Security breaches can be an issue for high-stakes testing and certification exams, where keeping test items secure is crucial. Proper training and communication with students can help promote ethical behavior during online assessments; however, ongoing research and development in this area will be important to ensure the integrity and validity of online assessments.

Learning Analytics. Recent calls for data-driven decision-making have prompted considerable interest in learning analytics. Research in this area is concerned with ways to personalize instruction. This includes the topics of stealth assessment and non-intrusive assessment data collection. With learning analytics, creating and using dashboards to communicate essential learning accomplishments and areas for improvement is particularly important. This includes identifying at-risk students and monitoring student progress with real-time student achievement results and engagement updates. Additional research is also needed to address student privacy and confidentiality concerns regarding the information we collect about students.

Automated Tutoring Systems. Providing feedback is an important function of the assessment process. Results from assessments can provide the information students need to resolve misconceptions, increase their understanding, and improve their skills. Timely feedback is essential for effective learning. Automating the feedback process can improve the speed and consistency of our assessment feedback. For example, generative AI-enabled tutors have become proficient at answering user questions, providing instruction, and assessing student learning (Davies & Murff, 2024). However, critics point out the need for human interaction and that inappropriate applications and overreliance on artificial intelligence to provide instruction and feedback can lead to trained incompetence rather than increasing students' ability. Research in this area will be important to ensure that automated assessment and feedback is accurate and administered appropriately.


Chapter Summary

  • In the field of instructional design, the

Discussion Questions

  1. Consider a

References

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.

Davies, R. (2021). Establishing and developing professional evaluator dispositions. Canadian Journal of Program Evaluation, 35(3).

Davies, R., & Murff, M. (2024). Using generative AI and GPT chatbots to improve instruction. INTED2024 Proceedings, pp. xx-xx.

Gagné, R. M. (1965). The conditions of learning (1st ed.). New York: Holt, Rinehart & Winston.

Mager, R. F. (1984). Preparing instructional objectives (2nd ed.). Belmont, CA: David S. Lake.

Simon, T., & Binet, A. (1916). The development of intelligence in children (E. S. Kite, Trans.). Vineland, NJ: The Training School (Publication No. 11).

Tyler, R. W. (2013). Basic principles of curriculum and instruction. In Curriculum studies reader E2 (pp. 60-68). Routledge.

Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Alexandria, VA: Association for Supervision and Curriculum Development (ASCD).

Young, J., Davies, R., Jenkins, J., & Pfleger, I. (2019). Keystroke dynamics: Establishing keyprints to verify users in online courses. Computers in the Schools, 36(1), 1-21.
