Validity

Creating valid assessments goes beyond ensuring test questions focus on material covered in class or in the curriculum standards. Assessment validation involves checking that your assessment instruments consistently produce accurate results and are used appropriately.

Designing Valid Assessments

Instructional designers need to create assessments for several purposes. These may include a "test your understanding" quiz, a unit review, or a summative assessment at the end of a course to certify that a student has accomplished the expected learning objectives. Unfortunately, not every assessment is a valid measure of what it was intended to measure, and in those cases the results cannot be used for their intended purpose. This is why an instructional designer needs to learn how to write learning objectives and develop quality assessment instruments that align with the goals of the instruction.

Assessment Validity

The results of an assessment are valid if the assessment measures what it's supposed to measure accurately and consistently and the results are used appropriately. 

When we say a test is valid, we really mean the results are valid. In other words, the results are credible (i.e., we believe the test measures what it was supposed to measure) and, therefore, can be used for a specific intended purpose.

Validity as a Unitary Construct. While we might say a test is valid (or the results are valid), assessment validity is better understood as a continuum. Current thinking treats validity as a unitary construct: the results of an assessment are valid to some degree, and they are valid for a specific purpose. Evidence that the results of an assessment might be valid can be obtained from various sources. However, when we say, for example, that the content validity of a test is good, we are not saying the test has content validity; instead, we are saying that one piece of evidence that the results of this test are valid comes from our evaluation of the content covered by the exam. The validation process requires that we gather sufficient evidence of validity. You will note that some types of evidence will be more important than others, depending on the purpose of the assessment.

The Validation Process involves gathering evidence that allows you to confidently conclude that the results of an assessment will be valid. Several types of evidence can be used to support the validation process:

Face Validity. You will likely hear this term being used as a type of validity evidence. In fact, this should not be used as evidence that an assessment is valid. This doesn't mean that face validity is not important. Face validity refers to the extent to which the assessment instrument appears to measure what it is intended to measure. It is important because if people don't think an assessment produces valid results, they won't use it. However, just because people believe an assessment is valid doesn't mean it is valid. For example, many people may take online personality tests; some believe they are accurate, while others are skeptical. Whether the test is valid cannot be determined by what people believe – other evidence is needed.

Evidence Collection 

Content validity evidence. 

Evidence of content validity is obtained by reviewing the assessment items and assessing their relationship to and importance within the intended domain. This evidence cannot be determined statistically. Experts must review the items on the assessment and, using their expertise, agree that the items cover the content adequately. Any test you create is only a sample of all the items you might include on a test. Getting the right balance and having an adequate sample is important. There are two aspects that need to be considered. 

Content Domain Representation – The extent to which the test items as a group adequately represent the content domain.

Relative Importance – The extent to which the questions asked proportionally test the most important aspects of the content domain.

For example, suppose an assessment is designed to measure knowledge of world geography. In that case, the assessment items should adequately cover each geographical area of the world. The test should also focus on the most important ideas and concepts the individual should understand. Missing some content or skipping important ideas would diminish the validity of the assessment. Balancing the number of items from each content area while ensuring all the important concepts are tested isn't always easy. 

The table of specifications (or test blueprint) you create as part of the test plan for the assessment can be a useful tool in establishing the content domain coverage aspect of content validity. This test blueprint specifies the number of items you should include in each content area. However, you need to remember that just because a specific subset (one particular area) within the content domain is large doesn't mean you need more items in that area. You need to consider the length of the test (to avoid test-taker fatigue) and whether the content in that area is important enough to warrant more items.
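To make the allocation idea concrete, here is a minimal sketch in Python (using hypothetical content areas and importance weights, not figures from this chapter) of how a test blueprint might distribute a fixed number of items across content areas according to their judged importance rather than their raw size.

```python
# Hypothetical test blueprint: allocate a fixed number of items across
# content areas in proportion to their judged importance (weights sum to 1.0).
TOTAL_ITEMS = 40  # kept short to limit test-taker fatigue

blueprint_weights = {   # illustrative weights only
    "North America": 0.20,
    "South America": 0.15,
    "Europe": 0.20,
    "Africa": 0.20,
    "Asia": 0.15,
    "Oceania": 0.10,
}

def allocate_items(weights, total_items):
    """Round each area's share, then adjust so the counts sum to the total."""
    counts = {area: round(w * total_items) for area, w in weights.items()}
    diff = total_items - sum(counts.values())
    # Simple adjustment: add or remove single items starting with the
    # most heavily weighted areas until the counts match the total.
    for area in sorted(weights, key=weights.get, reverse=True):
        if diff == 0:
            break
        counts[area] += 1 if diff > 0 else -1
        diff += -1 if diff > 0 else 1
    return counts

print(allocate_items(blueprint_weights, TOTAL_ITEMS))
```

Note that the weights here represent expert judgments of importance; a larger content area does not automatically receive a larger weight.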

It is up to the test designer to ensure the items assess the most important content (the relative importance aspect of content validity). This may require the help of the subject matter expert (SME) or other assessment experts. 

One way to quantify this is to use a Content Validity Ratio (CVR). The CVR assesses the essentiality, or criticality, of each test item, that is, whether each item is necessary for the assessment. To calculate CVR, follow these steps (a small computational sketch follows the list):

  1. Assemble a panel of subject matter experts. Depending on the high-stakes nature of the exam, the number of experts will vary. The teacher (or a single SME) may be the only expert for a classroom quiz. You would want more for a graduation qualifying exam (GQE) or state-mandated standardized assessment (3-5 experts are typically needed). This calculation needs at least two raters to work properly.
  2. Ask each expert to review the test items to be included on an exam and rate them on how essential each item is for determining whether the student has accomplished the learning objectives. The rating may use a three-point scale (not essential, useful but not essential, or essential), but the calculation only needs to know whether each expert considers the item essential.
  3. Calculate the CVR for each item using the formula: CVR = (NE - N/2) / (N/2), where NE is the number of experts who rated the item as "essential" and N is the total number of experts.

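As a concrete illustration, the following sketch (Python, with a hypothetical five-expert panel and made-up ratings) applies the CVR formula above to a few items.

```python
# Compute the Content Validity Ratio (CVR) for each item:
# CVR = (NE - N/2) / (N/2), where NE is the number of experts who rated
# the item "essential" and N is the total number of experts on the panel.

def content_validity_ratio(essential_votes, n_experts):
    return (essential_votes - n_experts / 2) / (n_experts / 2)

# Hypothetical ratings: True means the expert rated the item "essential".
ratings = {
    "item_1": [True, True, True, True, True],     # 5 of 5: essential
    "item_2": [True, True, True, False, False],   # 3 of 5: essential
    "item_3": [False, True, False, False, False], # 1 of 5: essential
}

n_experts = 5
for item, votes in ratings.items():
    cvr = content_validity_ratio(sum(votes), n_experts)
    flag = "keep" if cvr > 0 else "review/exclude"
    print(f"{item}: CVR = {cvr:+.2f} -> {flag}")
```

In this made-up example, item_1 (CVR = +1.00) and item_2 (CVR = +0.20) would be retained, while item_3 (CVR = -0.60) would be a candidate for exclusion.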

Items with a positive CVR value (greater than zero) are considered to have some degree of importance (the higher the ratio, the more essential the item), meaning that experts agree these items are essential for measuring the required learning. Items with a CVR value of zero or below are typically considered for exclusion from the test, since experts do not consider them essential. Most tests should include only items rated "essential" or "useful but not essential."

Another way to quantify content validity evidence is the Content Validity Index (CVI). The CVI assesses the relevance or importance of test items within the content domain and provides a more detailed assessment of item relevance than the CVR. To calculate CVI, follow these steps (again, a short computational sketch follows the list):

  1. Assemble a panel of experts.
  2. Ask each expert to review the test items and rate them based on their importance or relevance within the content domain, considering the learning objectives being tested. The rating commonly uses a Likert scale (3 or 4 points). For example, 1: Not relevant/important, 2: Somewhat relevant/important, 3: Quite relevant/important, 4: Highly relevant/important.
  3. Calculate the CVI for each item using the formula: CVI = (Number of experts giving a rating of 3 or 4) / (Total number of experts).

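Similarly, this sketch (Python, with hypothetical 4-point relevance ratings from five experts) computes an item-level CVI and applies the 0.80 inclusion guideline described in the next paragraph.

```python
# Compute the item-level Content Validity Index (CVI):
# CVI = (number of experts rating the item 3 or 4) / (total number of experts)

def content_validity_index(likert_ratings, relevant_threshold=3):
    relevant = sum(1 for r in likert_ratings if r >= relevant_threshold)
    return relevant / len(likert_ratings)

# Hypothetical 4-point relevance ratings from five experts.
ratings = {
    "item_1": [4, 4, 3, 4, 3],  # all five experts rate it relevant
    "item_2": [3, 4, 2, 3, 2],  # three of five rate it relevant
}

for item, scores in ratings.items():
    cvi = content_validity_index(scores)
    flag = "include" if cvi >= 0.80 else "reconsider"
    print(f"{item}: CVI = {cvi:.2f} -> {flag}")
```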

The CVI score for each item will range from 0 to 1, where 1 indicates unanimous agreement among experts regarding the item's importance or relevance. Typically, items with a CVI of 0.80 or higher (80% agreement or more) are considered to be important enough to include, and items with lower CVI scores may be candidates for exclusion.

Using these quantitative measures (CVR and CVI), you can gauge the content validity of your test items systematically and objectively. However, both of these methods for determining the importance of items require existing items and assume that items for all essential topics and concepts were included in the sample. This may not be the case, and the subject matter experts may need to identify essential topics that were not covered by the items they rated.

The number of items needed on an assessment will be determined by the exam's purpose and the complexity of the content domain. This is a subjective decision that should be made by experts considering the attributes of the learners being tested and the importance of the test. Sometimes a shorter test can be used. For example, short-form tests consist of a limited number of items and can be effective in measuring specific aspects of learning. They are often used for formative assessments or quick progress checks (test your understanding quizzes). Well-designed shorter tests can also be effective in other situations, but they require careful planning and item selection to ensure meaningful and accurate measurement. For example, Item Response Theory (IRT) models can help optimize the precision of measurement with a smaller number of items. These models consider the difficulty of the items and the test-taker's ability level to select the most informative items. IRT is used in Computer-Adaptive Testing (CAT), which tailors the test to the test-taker's ability level. It uses a smaller set of items but selects each item from a pool based on the test-taker's responses to the previous item or group of items. Researchers and educators should balance the trade-offs between test length and measurement precision based on their specific assessment goals.
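To illustrate how an adaptive test might choose its next item, here is a minimal sketch assuming a two-parameter logistic (2PL) IRT model with made-up item parameters; a real CAT system would also re-estimate the test-taker's ability after each response.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical item pool: (discrimination a, difficulty b) for each item.
item_pool = {
    "easy_item": (1.2, -1.0),
    "medium_item": (1.5, 0.0),
    "hard_item": (1.1, 1.5),
}

current_theta = 0.3  # current estimate of the test-taker's ability
next_item = max(item_pool,
                key=lambda k: item_information(current_theta, *item_pool[k]))
print(f"Most informative next item at theta={current_theta}: {next_item}")
```

With these illustrative parameters, the item whose difficulty sits closest to the current ability estimate carries the most information and would be administered next.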

To summarize, content validity evidence requires us to balance coverage of the content domain with the most important topics to be tested. We use a sample of the potential questions because we cannot ask every question we might want to ask. The goal is one of generalizability: we want to infer that a student who gets a certain percentage of items correct on an exam would get a similar percentage correct had we selected a different set of items.

Construct validity evidence. 

Like content validity, evidence of construct validity must be judged by experts; construct validity cannot be verified statistically. Some statistics are used to provide potential evidence that an exam measures a specific construct (e.g., correlations with known validated measures of the same construct or the internal consistency of the test items). However, even if there is statistical evidence of internal consistency among the items within an exam, this does not mean the items are targeting the intended construct; it simply means the items are likely measuring the same construct. The important point is that evidence of reliability (i.e., consistently measuring something) is not sufficient evidence of validity if what the test measures is not what was intended. A test with good reliability may be measuring a construct other than the one intended.

A "construct" refers to an abstract concept or theoretical notion we wish to measure or assess but which is not directly observable. Constructs are mental models or ideas that represent key components of the learning process, cognitive abilities, attitudes, or other educational phenomena. They are often complex and multifaceted, requiring careful conceptual and operationalization definitions. In educational assessment, a construct is typically either a cognitive ability or an affective disposition. In this chapter, we will focus on constructs associated with cognitive ability. Validating scales that measure affective characteristics and socio-emotional constructs will be covered in another chapter.

Cognitive Ability. Cognitive ability refers to mental capacities such as recall, understanding, reasoning, problem-solving, and critical thinking. Bloom's Taxonomy for the Cognitive domain describes these cognitive constructs well (see Figure 1). There are low-level and high-level cognitive skills. Establishing evidence of construct validity for cognitive constructs requires carefully examining the items on the test to ensure that the item elicits the targeted cognitive skill.

Figure 1. Bloom's Taxonomy of the Cognitive Domain


It's important to note that these categories are not mutually exclusive, and many assessments target multiple cognitive constructs simultaneously. For example, the item in Figure 2 requires the student to demonstrate various cognitive abilities. Suppose the construct targeted by the learning objective involves sentence diagramming. In that case, an item must elicit the student's ability to analyze, which requires several other cognitive abilities (i.e., recall, understanding, and application). The student must remember terms and definitions, identify the components of the sentence, and apply procedural knowledge of sentence diagramming to accomplish the task.


Diagram the following sentence. In your diagram, identify the main clause, any subordinate clauses, and the parts of speech (subject, verb, etc.) for each clause. Also, point out any adjectives, adverbs, or prepositional phrases.


Sentence: "After the rain stopped, the colorful birds sang sweetly in the garden."


Figure 2. Example of an item that requires multiple cognitive abilities to answer the question.

Evidence that a test has construct validity requires experts to verify that the items on an assessment require the student to possess and demonstrate specific cognitive abilities. In other words, experts must agree that to answer the question, a student must demonstrate the cognitive ability being measured. This can be done by examining the test items or by conducting a cognitive "think-aloud" interview with students as they attempt to answer the questions. 

It is important to note that all assessments measure the ability to recall. A criticism of many tests is that they only measure recall. Using previously unencountered items in tests is essential for accurately assessing genuine understanding, critical thinking, and the ability to apply knowledge and skills in new contexts. If students are exposed to specific items and taught the correct answer before taking the test, any claim that the items measure higher-level cognitive ability is misleading. If the student was taught the correct answer, the assessment only tests the construct of memorization ability. Teaching for the test is essential; teaching the test is ill-advised if you want to maintain construct validity.

Other evidence of construct validity can also be obtained by examining the relationship between the assessment scores and other measures of the same construct. For example, if an assessment is designed to measure critical thinking skills, evidence of construct validity can be obtained by comparing assessment results with other validated measures of this construct. Additional evidence might be obtained by examining the items used on a test to verify that the items elicit the targeted skill, not some unrelated or irrelevant skill or ability. For example, when measuring the ability to converse in a second language, using the results of a vocabulary test would not be a valid measure of a person's verbal speaking skills. Likewise, if the results of a math skills or mathematical reasoning assessment are influenced by reading ability, the assessment results are less valid.

Construct validation is concerned with reducing two things in a test:

Construct Under-Representation – The extent to which the test items as a group fail to adequately elicit the construct being measured. 

 Construct Irrelevant Variance – The extent to which irrelevant factors (other than the construct of interest) influence the results of an assessment.

Criterion-Assessment Relationship Evidence. 

In the context of educational testing, this focuses on the degree to which the results of a specific assessment correlate with, predict, or are otherwise related to an external criterion. When validating a test you created, your test is the assessment, and another measure serves as the criterion. The results for the criterion come from a different test that measures the same content and constructs your test is assumed to be measuring. The criterion becomes a standard or benchmark representing the construct or skills your assessment aims to measure. The purpose of evaluating this relationship is to gather evidence regarding the validity and relevance of your assessment in relation to the desired outcomes or standards.

Data Collection for this validation process involves collecting scores from the assessment and criterion measures. The same students must take both assessments. The relationship we are testing is between student scores on the assessment and their scores on the criterion test. You cannot determine the relationship unless the same students take both tests. If the test scores are related, we have evidence that the assessment measures what it intended to measure. This assumes that the criterion test is a valid measure of the content and construct of interest. The relationship evidence you gather can be concurrent or predictive.

For a Concurrent validity study, the criterion and assessment data are collected simultaneously using the same students. For example, the scores on the assessment may be compared with other established benchmarks or standards to evaluate the relationship between the two assessments.

Predictive validity, on the other hand, involves correlating assessment scores with criterion measures taken at a future point in time. For example, longitudinal studies are often conducted where participants are followed over time to observe how earlier assessment scores relate to later criterion measures. Again, the same student must take both assessments.

The most common method is using correlation coefficients (like Pearson's r) to determine the strength and direction of the relationship between assessment scores and the criterion. Higher correlations indicate stronger evidence of validity. Subject matter experts may also review the assessment and the criterion measures to provide qualitative judgments about the relevance and appropriateness of the relationship.
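For instance, the brief sketch below (Python, using made-up scores for ten students who took both tests) computes Pearson's r between assessment and criterion scores.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired score lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)

# Hypothetical paired scores: the same ten students took both tests.
assessment_scores = [72, 85, 90, 65, 78, 88, 95, 70, 82, 76]
criterion_scores  = [70, 80, 92, 60, 75, 85, 97, 72, 80, 74]

print(f"r = {pearson_r(assessment_scores, criterion_scores):.2f}")
```

A strong positive r in a case like this would count as one piece of evidence that the assessment and the criterion are measuring similar things.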

Gathering these data can be a challenge. Choosing an appropriate and universally accepted criterion can be difficult. Over time, the standards or benchmarks that serve as criteria might evolve, affecting the validity of the assessment. The relationship might vary across different groups, making it essential to ensure that the assessment is valid across diverse populations.

When gathering evidence of predictive validity, you need to remember that correlation does not imply causation; a high correlation between assessment and criterion does not necessarily mean that one causes the other. Likewise, a low correlation may deceptively suggest the assessment is not valid when it is, because many confounding factors can affect a student's achievement over time. The effect may be positive or negative for individual students. In the time between testing, different groups of students may have different learning opportunities, their life circumstances may change, or their interest and effort may increase or decrease. These factors can distort the observed relationship and degrade the evidence of validity for the assessment.

In summary, Criterion-Assessment Relationship Evidence can play a vital role in ensuring that assessments are valid and relevant for their intended purposes. Obtaining this evidence requires careful planning, robust data collection, and meticulous analysis, keeping in mind the challenges and nuances associated with different types of validity and diverse populations.

Consequential Validity Evidence. 

Consequential validity is an integral aspect of the overall evidence of validity we might gather, one that extends beyond the traditional notions of validity. It concerns itself with the ramifications of using assessment results inappropriately, interpreting results incorrectly, and the unintended social consequences that can follow.

Consequential validity demands a critical examination of the way we use assessment data to ensure it aligns with a test's intended purposes and that the use of an assessment does not lead to unintended, adverse consequences. It scrutinizes how test results are employed in decision-making processes and examines the broader implications these uses may have on individuals, groups, and society at large. It recognizes that even assessments that are technically sound and reliable can lead to adverse outcomes if their results are misinterpreted or misused.

Several key dimensions underscore the importance of consequential validity evidence:

Incorrect Interpretation of Results: Proper interpretation of assessment results is paramount. Consequential validity questions whether test results are being understood and used as the test designers intended. Misinterpretation can lead to misguided decisions, affecting individuals' educational, professional, or personal pathways. For example, assuming that students who perform above the average (or mean) have learned what was expected of them is a misinterpretation of the results; using the mean (or some other arbitrary pass score) as a substitute for a performance standard would be improper. Likewise, failing to consider confidence intervals when comparing two students with similar but slightly different test scores would constitute a misinterpretation of assessment results. Another example of misinterpretation happens when a test designed to measure proficiency in a particular subject is used to make unrelated judgments (i.e., logical errors), such as assessing an individual's overall intelligence or predicting their performance in unrelated domains. Misinterpretations like these lower the overall validity of the assessment.
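To illustrate the confidence-interval point, here is a small sketch (Python, with hypothetical values for the test's score standard deviation and reliability) that places an approximate 95% band around each observed score using the standard error of measurement.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(score, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed score."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return (score - margin, score + margin)

# Hypothetical test: score standard deviation of 10 points, reliability of 0.85.
band_a = score_band(78, sd=10, reliability=0.85)
band_b = score_band(81, sd=10, reliability=0.85)
print(f"Student A: {band_a[0]:.1f}-{band_a[1]:.1f}")
print(f"Student B: {band_b[0]:.1f}-{band_b[1]:.1f}")
# The two bands overlap substantially, so the 3-point difference between
# these students should not be treated as a meaningful difference.
```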

Inappropriate Use of Results: Tests are often designed for specific purposes, yet, in practice, their results might be employed in contexts far removed from their original intent. The purpose, expectation, and conditions for an assessment should be transparent, meaning the test plan specifies what is expected of the test-takers, what is being measured, and how the results should be used. Explaining the purpose, expectations, and test requirements helps reduce test anxiety, allowing examinees to prepare adequately and perform to the best of their ability. Specifying how the results should be used decreases the likelihood that the results will be used inappropriately. A primary concern is the misuse of student test results for objectives beyond the test's original design.

The significance of this lies in the fact that assessment results will most likely be invalid if used for purposes and in ways the assessment was not intended. For instance, the primary function of most tests is to determine what a student knows, understands, or can do. One of the most egregious misapplications of assessment results is to judge the quality of the instructor, a school in general, or the value of a particular educational activity or resource based solely on test scores. For example, it would be inappropriate to judge a teacher's pedagogical knowledge and ability as excellent based only on the fact that all the students in a course did well on a test. Many factors that have nothing to do with the quality of instruction might explain student achievement. Likewise, the value of a particular course design or educational resource is not diminished simply because students don't perform well on the test. This is especially the case with resistant learners. Better instruments, designed specifically for that purpose, should be used to evaluate teachers and educational resources.

Unintended Social Consequences: Assessments do not exist in a vacuum and can perpetuate or exacerbate social inequalities. Issues like fairness and accessibility are central to consequential validity. It scrutinizes whether an assessment is equitable to all groups or whether it inadvertently privileges certain demographics, thereby reinforcing existing disparities in education, employment, or other critical sectors. Issues of fairness and accessibility are discussed in more detail later in the book.

Impact on Stakeholders: Consequential validity evidence also involves evaluating the effect of test use on various stakeholders, including test-takers, educators, and institutions. For instance, high-stakes testing might influence teaching methods, leading to 'teaching to the test' rather than focusing on a holistic educational approach. Such shifts can have profound implications for the quality of education and the overall learning environment.

 

In summary, evidence of consequential validity compels test designers, administrators, and policymakers to adopt a forward-thinking perspective, contemplating not just the technical qualities of an assessment but also the broader implications of its implementation. It promotes a more ethical, responsible, and informed approach to assessment, ensuring that tests contribute positively to individual growth and societal advancement. This holistic view is crucial in an era where decisions based on assessment results have far-reaching consequences for individuals and communities alike.


Chapter Summary

  • The results of an assessment are valid if the assessment measures what it's supposed to measure accurately and consistently and the results are used appropriately. 
  • Although we say a test or an assessment is valid, we really mean the results of the test are valid for a specific purpose.
  • Validity is a unitary construct. The overall validity of an assessment is determined by gathering evidence from various sources. 
  • The process of validation involves collecting various types of evidence. Types of evidence used in the validation process may include content validity, construct validity, assessment-criterion relationship (or predictive) validity, and consequential validity.
  • Face validity is not acceptable evidence of actual validity. However, it is important that people think a test is valid; otherwise, they won't use the test or trust the results.
  • Evidence of Content Validity refers to the extent to which the assessment instrument adequately covers the content domain.
  • Evidence of Construct Validity refers to the extent to which the assessment instrument focuses on the cognitive constructs it was supposed to measure as outlined by the learning objectives. 
  • Evidence of Assessment-Criterion Relationship Validity refers to the extent to which a test score (the assessment) correlates with or predicts performance on an external measure (the criterion). 
  • Evidence of Appropriate Use (consequential validity) is also an essential aspect of validity. It refers to a determination that the results are being interpreted and used appropriately. 

Discussion Questions

  1. Provide an example where test results might have been used inappropriately. What was the main reason for the error? 
  2. Describe the process you might use to validate a test you wish to use. 
  3. Consider a high-stakes testing situation and describe how consequential validity issues might affect the validity of the results. 
