Both validity and reliability are important factors in selecting or creating an assessment instrument. It is important to recognize that while an assessment can yield reliable results, these results may not be valid for their intended use. In this way, validity is of paramount importance. At the same time, reliability is a prerequisite for validity: without consistent and stable results (reliability), an assessment cannot be considered valid. Both aspects must be carefully considered.
Determining Reliability
In educational assessment, reliability refers to the consistency of scores obtained from a test or assessment instrument. Imagine a student takes the same test twice, putting in genuine effort each time. We wouldn't expect identical scores, but their performance should be similar, indicating the assessment consistently gauges their knowledge, skills, or ability.
Assessment Reliability
Reliability refers to the degree to which test or assessment scores are consistent and dependable. However, the ability to produce consistent scores does not mean the assessment scores are valid for the assessment's intended purpose.
As with validity, reliability is inherently about the results obtained, not the instrument itself. When we say a test is reliable, we mean it produces consistent results. Without an estimate of reliability, we have no defensible basis for making informed judgments about the accuracy, dependability, or generalizability of our test results. Evidence of test reliability strengthens our confidence that the scores provide an accurate reflection of students' abilities or attributes being assessed because the test produces consistent results. Educational and psychological assessments often inform critical decisions about individuals' academic paths, professional opportunities, and personal development. The trustworthiness and ethical foundation of these decisions hinge on the reliability of the underlying test scores. For assessment results to be usable in research, the consistency of the results is a fundamental requirement. Results must be stable if they are to be generalized or compared across different settings or groups.
Reliability and Validity: A Relationship Example
Test results cannot be valid if they are not reliable. However, a test can produce reliable results without being valid. You can consistently miss the mark. A test might be highly reliable, consistently producing similar results, yet still fail to measure what it is supposed to measure, leading to questions about its validity. For instance, a test designed to measure mathematical ability might consistently yield the same results under the same conditions, and the results would be considered reliable. However, if the test results were influenced by irrelevant factors such as reading comprehension skills, the test's ability to measure math skills is compromised despite having high reliability. A test may also be believed to measure a specific construct like higher level thinking, but only reliably measure the ability to recall facts. Likewise, if the results of a valid math test were used inappropriately to determine a student's ability to dance, despite being reliable, the results are not valid for that purpose.
Evidence Collection
Evidence of test reliability can be determined through several methods, each focusing on a different aspect of consistency. Each of these methods examines reliability from a different angle and provides a different understanding of how consistent the test scores are. It's important to choose the method that best aligns with the nature of the test and the specific aspect of reliability being assessed. Some methods are more appropriate than others, and multiple methods might need to be used.
Test-Retest Reliability: This is a measure of test stability. This method assesses the consistency of test scores over time. It involves administering the same test to the same group of individuals at two different points in time and then comparing the scores. A high correlation between the two sets of scores indicates high test-retest reliability, signifying that the test is consistently measuring the same attribute over time.
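To make the calculation concrete, here is a minimal sketch of a test-retest estimate in Python (using NumPy). The student scores and variable names are invented for illustration; in practice the correlation would be computed from real paired scores.

```python
# Minimal sketch: test-retest reliability as the Pearson correlation between
# two administrations of the same test. Scores below are hypothetical.
import numpy as np

first_administration = np.array([78, 85, 62, 90, 71, 55, 88, 67, 94, 73])
second_administration = np.array([80, 83, 65, 92, 69, 58, 85, 70, 96, 75])

# The correlation between the paired scores serves as the stability estimate.
r_test_retest = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: {r_test_retest:.2f}")
```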
While this method is valuable for assessing the stability of test results over time, it has several limitations and challenges:
- Memory Effects: One major challenge is the potential for students to have learned something by taking the first test. Test-takers might remember their responses from the first administration, which could influence their answers during the second administration. This recollection can artificially inflate the correlation between the two sets of scores. Students may also reflect on their answers and look up the correct answers to questions they realize they got wrong or didn’t know. This would likely decrease the measure of reliability. Both these situations would give a misleading impression of the test’s reliability.
- Test Sensitization: Closely related to memory effects, test sensitization refers to the possibility that individuals become more familiar with the test content or format after the first administration. This familiarity might improve their performance during the second administration, not because the test is reliable, but because students have learned how to take the test and what is expected.
- Time Interval: The choice of time interval between test administrations is critical. If the interval is too short, memory effects are more likely. If it's too long, there could be genuine changes in the attribute being measured, especially if the attribute is likely to change over time (like mood or certain skills), thus confounding the reliability assessment.
- Maturation and External Events: Changes in test-takers between the two administrations can impact their performance. Changes might be due to maturation or external events (like educational interventions or significant life events). Such changes can be mistakenly attributed to inconsistency in the test rather than to changes in the test-takers' knowledge, ability, or situation that might reasonably be expected to occur in a student's life.
- Sample Specificity: The reliability established through the test-retest method might be specific to the sample used to establish it. Different populations may exhibit different levels of stability in their responses, limiting the generalizability of the reliability findings.
- Practicality and Participant Burden: Administering the same test twice can be logistically challenging and time-consuming. While most students love to take tests (just kidding), having to take a test twice might place an extra burden on participants, potentially affecting their motivation and performance during the second administration.
- Contextual Factors: The characteristics of the test-takers and the context in which the tests are administered can influence the results. For instance, factors like test anxiety, motivation, and environmental conditions might differ across the two administrations, affecting the comparability of the results.
Parallel-Forms Reliability: Also known as equivalent-forms reliability, this is a measure of equivalence. This approach involves creating two different versions of the same test (parallel forms) that are designed to be equivalent in terms of content, number of questions, difficulty, and style. These versions are then administered to the same group of individuals, and the scores from the two versions are correlated. A high correlation suggests that both forms of the test are equally reliable.
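As a rough illustration, an equivalence check typically involves comparing the descriptive statistics of the two forms and correlating students' paired scores. The scores and the Form A/Form B labels below are invented for the example.

```python
# Minimal sketch: parallel-forms reliability, assuming the same students
# took two supposedly equivalent forms of a test. Scores are hypothetical.
import numpy as np

form_a = np.array([42, 37, 45, 30, 48, 39, 44, 35, 41, 46])
form_b = np.array([40, 38, 44, 32, 47, 37, 45, 33, 42, 48])

# Equivalent forms should show similar means and spreads...
print(f"Form A: mean={form_a.mean():.1f}, SD={form_a.std(ddof=1):.1f}")
print(f"Form B: mean={form_b.mean():.1f}, SD={form_b.std(ddof=1):.1f}")

# ...and a high correlation between students' scores on the two forms.
r_parallel = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability estimate: {r_parallel:.2f}")
```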
This method addresses some issues inherent in other reliability methods, but it has its own set of limitations and challenges:
- Creating Truly Equivalent Forms: The most significant challenge is ensuring that the two test forms are truly equivalent in terms of difficulty, content, and the constructs being measured. Both forms must be equally difficult, and both must equally and adequately cover the content and constructs being measured. If one version underrepresents or overrepresents certain aspects of the content, differences in performance may reflect differences between the forms rather than the trait being measured. If the two forms are not truly equivalent, the reliability estimate will be affected.
- Timing of Administration: Similar to the test-retest method, the timing between the administrations of the two forms can impact the results. Too short an interval might lead to memory effects, while too long an interval might see changes in the trait being measured. If the tests are administered at the same time, test fatigue can be an issue. If participants become tired or less motivated, this may impact their performance on the second form and the estimate of reliability.
Split-Half Reliability: This method involves splitting a test into two halves (e.g., odd and even items) and then correlating the scores obtained on each half. This approach measures the internal consistency of a test, testing whether the two parts of the test contribute equally to what is being measured. Unlike test-retest and parallel forms, this method only requires students to take the test once. So, while the split-half method provides a reliability estimate of internal consistency based on a single administration of the test, it doesn't account for factors like stability.
While this method offers certain advantages, it is not often used as there are notable limitations and challenges:
- Division of Test: The viability of this method depends on how the test is split. If the two halves are not equivalent in terms of difficulty, content, and construct coverage, it can lead to an inaccurate estimate of reliability. For instance, if one half contains more difficult questions than the other, this could skew the results. In practice, creating two halves that are truly equivalent in terms of difficulty and content coverage can be challenging, especially for complex assessments that measure multiple dimensions or skills. In tests that cover a wide range of content or skills, splitting the test into halves might mean that each half assesses different content or only assesses a subset of the intended content or skills. This partial coverage can lead to a misrepresentation of the test's actual reliability.
- Assumption of Homogeneity: This method assumes that all items on the test are measuring the same underlying construct to the same degree. If the test includes a mix of different types of questions or skills, the split-half method may not provide an accurate measure of reliability.
- Arbitrary Nature of Splitting: The arbitrary nature of splitting a test into two parts can be a problem. There are numerous pairings you could use. Different ways of splitting the test (e.g., first half vs. second half, odd vs. even items) and different pairings can yield different reliability estimates, raising questions about whether this method of determining reliability is useful or even appropriate.
- Test Length Sensitivity: The reliability estimates obtained from the split-half method are sensitive to the length of the test. Shorter tests may yield lower reliability estimates simply because each half has fewer items to capture the variability in test-takers' performances.
- Statistical Adjustments: The reliability estimate obtained from the split-half method is based only on half the length of the test. Therefore, some testing experts suggest the use of a statistical adjustment, like the Spearman-Brown prophecy formula, to estimate the reliability of the whole test. This adjustment introduces an additional layer of complexity and potential error.
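To illustrate the split-half approach and the Spearman-Brown adjustment just mentioned, here is a small sketch using an invented matrix of dichotomously scored items (each row a student, each column an item).

```python
# Minimal sketch: split-half reliability with the Spearman-Brown adjustment.
# Item scores are hypothetical: rows are students, columns are ten items
# scored 1 (correct) or 0 (incorrect).
import numpy as np

item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
])

# Split into odd- and even-numbered items and total each half per student.
odd_half = item_scores[:, 0::2].sum(axis=1)
even_half = item_scores[:, 1::2].sum(axis=1)

# The correlation between halves estimates the reliability of a half-length test.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown prophecy formula projects that estimate to the full-length test.
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test correlation: {r_half:.2f}; full-length estimate: {r_full:.2f}")
```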
Internal Consistency Reliability: As the label implies, this is a measure of internal consistency. A common approach in educational and psychological assessments is to use a Cronbach's Alpha calculation to determine the internal consistency of a measurement instrument. Cronbach's Alpha is popular because it requires only one sitting of the exam and can handle items that are scored dichotomously (i.e., items that are scored correct or incorrect, 0 or 1 point awarded) or non-dichotomously (i.e., awarding partial marks or items out of more than 1 point). Other statistical methods for calculating internal consistency, like the Kuder-Richardson formulas, only handle dichotomously scored items but will produce similar results.
Cronbach's Alpha is a measure of the extent to which all items on a test measure the same underlying construct (i.e., homogeneity). It can be used with tests of cognition but is especially well suited to assessments of affect or other psychological constructs. It assumes that each question contributes equally to the concept or construct being measured, which may not be the case.
In simple terms, the formula compares the variances of the individual items to the variance of students' total scores across all items combined. The basic idea is that if all the questions are good measures of the concept and consistently contribute to the measure, a person's responses will be relatively consistent across all the questions on the exam. The result is a reliability coefficient interpreted much like a correlation coefficient. A higher value (closer to 1) suggests that the test has high internal consistency, meaning the items are well correlated and reliably measure the same construct. A lower value (closer to 0) indicates poor internal consistency, suggesting that the items might not be well related to each other. Again, it is important to remember that while a high Cronbach's Alpha value indicates good internal consistency, it doesn't guarantee that the test is measuring the intended concept or construct accurately (validity).
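As a concrete (and deliberately simplified) sketch, the standard formula is alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance), where k is the number of items. The item-score matrix below is invented purely for illustration.

```python
# Minimal sketch: Cronbach's Alpha from an item-score matrix.
# Rows are students, columns are items; the 0/1 data here are hypothetical,
# but the same formula works with partial-credit (non-dichotomous) scores.
import numpy as np

item_scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

k = item_scores.shape[1]                              # number of items
item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores

# alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's Alpha: {alpha:.2f}")
```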
While this method has widespread use, it also presents several limitations and challenges. It can be a valuable tool for assessing the internal consistency of a test, but it is important to be aware of its limitations and to use it in conjunction with other methods of reliability as appropriate.
- Assumption of Unidimensionality: Cronbach's Alpha assumes that the test measures a single, unified construct. If a test is designed to measure multiple dimensions or constructs, Cronbach's Alpha may not be an appropriate measure of reliability. It can underestimate reliability in multidimensional tests.
- Influence of Test Length: Cronbach's Alpha is sensitive to the number of items in the test. Generally, tests with more items tend to have higher alpha values, irrespective of the actual reliability of the test items. This can lead to a false sense of reliability in longer tests.
- Item Homogeneity: While Cronbach's Alpha assesses the overall consistency of test items, it doesn't account for the possibility that some items might be redundant or overly similar. High reliability estimates can sometimes be the result of redundant items rather than an indication of a high-quality assessment tool.
- Interpreting Results: Determining what constitutes an "acceptable" alpha value can be subjective and context-dependent. While higher values indicate greater internal consistency, extremely high values might equally indicate redundancy among items. Moreover, what is considered an acceptable level of reliability can vary depending on the nature of the test and its intended use.
- Neglecting Item-Response Theory (IRT) Considerations: Cronbach's Alpha is based on Classical Test Theory and does not take into account the item characteristics that Item-Response Theory models. For example, it does not consider how item difficulty and discrimination affect test reliability. More sophisticated statistical methods may be more appropriate.
- Statistical Assumptions: The calculation of Cronbach's Alpha is based on certain statistical assumptions, such as the assumption that errors in item scores are uncorrelated, meaning any problems with the test construction are random, not systematic. Obviously, the quality of the test items makes a difference. Violations of the statistical assumptions can lead to inaccurate reliability estimates.
- Overemphasis on Internal Consistency: Relying solely on Cronbach's Alpha for a reliability assessment can lead to an overemphasis on internal consistency at the expense of other important aspects of reliability, such as stability over time (test-retest reliability) or equivalence across different forms (parallel-forms reliability). This method of obtaining reliability evidence is also misunderstood by many people, which can lead to misinterpretation of the resulting reliability estimate.
Inter-Rater Reliability: This is a measure of consistency of ratings. This form of reliability is crucial in assessments where subjective judgment plays a role, such as essay grading or performance assessments. It assesses the degree of agreement or consistency between different raters or judges. High inter-rater reliability means that different raters are consistently arriving at similar scores or conclusions. Remember, consistency in rating does not mean the ratings were valid. It could be the case that, while consistent, each of the raters scored the performance inaccurately. Reliability focuses on consistency as a pre-condition of validity. The validity of the raters' scores must be established separately.
Raters and rubrics are required for performance assessments like writing, presentations, or projects. Grading these performances requires subjective scoring. Objective scoring is used when there is one correct answer that experts agree is correct. Objective scoring is efficient and can be facilitated by technology. Subjective scoring is needed when a range of different student responses might all be considered adequate; it is used to determine the degree to which a response is correct. While in some cases technology can facilitate this process, most often trained human raters are required to accomplish this assessment task.
In a classroom, the teacher may be the only person scoring student performances. In this case, inter-rater reliability cannot be determined; it takes two or more raters to establish inter-rater reliability. Typically, we only use multiple raters for high-stakes assessment situations (e.g., state standardized testing or college entrance exams). When only one rater is used, intra-rater reliability can be used to determine the degree to which an individual's ratings are consistent with their own ratings. Intra-rater reliability determines how consistently a single rater scores performances over time. This requires the same rater to score each performance twice, with a time lapse in between. Intra-rater reliability can also be used to determine whether a single rater scores similar performances the same way. This requires an evaluation of the performance scores by the rater or another individual and can result in revising the scores so that they are consistent.
Inter-rater reliability is typically calculated using one of the following methods:
- Correlations: This method involves calculating the statistical correlation between the ratings of different raters. The most common measure used is Pearson's correlation coefficient, which ranges from -1 to +1. A coefficient close to +1 indicates a high level of agreement between raters, while a coefficient close to -1 suggests a high level of disagreement. A coefficient around 0 implies no relationship between the ratings. This method is particularly useful when ratings are quantitative (e.g., scores on a test) and when you're interested in the degree of association between raters. Inter-rater correlations are sensitive to changes in the order or ranking of the performances. Even if the two raters assign different scores, you will get a high correlation as long as they rank the performances the same; when the rankings differ, you will get a lower correlation. When more than two raters are involved, an intraclass correlation coefficient (ICC) is used. This method assesses both the degree of correlation and the agreement between the ratings. (A short computational sketch of these methods appears after this list.)
- Percentage Agreement: This is a straightforward method for determining inter-rater reliability in which you calculate the percentage of times raters agree in their assessments. For instance, if two teachers are grading essays and they give the same grade 80 out of 100 times, the percentage agreement is 80%. This method is easy to understand and compute, but it has limitations. The number of points on the rating scale can affect the reliability estimate: raters are more likely to give the same score on a ten-point scale, for example, than on a 100-point scale. One adjustment to this method is to calculate the percent agreement within a range, for example, agreement within plus or minus one point, or plus or minus 10 points in the case of a 100-point scale.
- Mastery Decision: This approach is used particularly in educational settings or in contexts where a standard or benchmark is set for "mastery" or adequate competence in a skill. Here, inter-rater reliability is assessed based on how consistently raters classify performances as meeting or not meeting the mastery criterion. For example, if two raters are evaluating whether students have mastered a certain math skill, inter-rater reliability would be calculated based on how often both raters agree on whether a student has or has not achieved the criterion.
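The three approaches above can be illustrated with a short sketch. The two raters' scores, the 10-point rubric, and the mastery cutoff of 7 are all invented assumptions for the example.

```python
# Minimal sketch: three ways to summarize agreement between two raters who
# scored the same ten performances on a hypothetical 10-point rubric.
import numpy as np

rater_1 = np.array([8, 6, 9, 5, 7, 4, 10, 6, 8, 7])
rater_2 = np.array([7, 6, 9, 4, 8, 4, 9, 7, 8, 6])

# 1. Correlation: degree of association between the two raters' scores.
r = np.corrcoef(rater_1, rater_2)[0, 1]
print(f"Correlation between raters: {r:.2f}")

# 2. Percentage agreement: exact matches, and matches within +/- 1 point.
exact = np.mean(rater_1 == rater_2) * 100
within_one = np.mean(np.abs(rater_1 - rater_2) <= 1) * 100
print(f"Exact agreement: {exact:.0f}%; within one point: {within_one:.0f}%")

# 3. Mastery decision: agreement on whether each performance meets an
#    assumed mastery cutoff (7 points here).
cutoff = 7
mastery = np.mean((rater_1 >= cutoff) == (rater_2 >= cutoff)) * 100
print(f"Agreement on mastery classification: {mastery:.0f}%")
```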
As with all the methods we might use for determining reliability, there are several limitations and challenges to consider when calculating inter-rater reliability.
- Rater Bias and Subjectivity: Different raters may have individual biases or perspectives that influence their judgments. Personal experiences, training, expectations, and even mood can affect how raters perceive and score a particular behavior or response. To help alleviate this, it is best to use experienced raters who have content expertise both pedagogically and practically. Inter-rater reliability can be significantly impacted by the level of training and expertise of the raters. Inconsistent training or varying levels of ability among raters can lead to discrepancies in ratings. Some common rater biases that can affect inter-rater reliability include:
  - Halo (or Horn) Effect: This occurs when an evaluator's overall impression of a person, often based on one positive (or negative) characteristic or first impression, influences their judgment about that person's performance. For example, the way a student dresses or looks can create an initial impression that disproportionately influences a rating and subsequent evaluations. If the rater has experience with the student, the student's behavior or an early mistake or success might color all future assessments of that person.
  - Logical Errors: This occurs when a rater makes a faulty assumption or generalization about a student they are rating. For example, if a student shows ability in writing, they must be intelligent; therefore, they will also be good at math. If the student fails to do well on a math assessment, the rater may justify giving the student the benefit of the doubt and a higher score because the rater assumes the student is generally intelligent based on their writing ability.
  - Recency Effect: Recent performance can overshadow earlier work. This bias happens when raters focus more on the most recent behavior or performance, overlooking earlier information. For instance, a student's latest assignment might disproportionately influence their overall grade for the course.
  - Similarity Bias: This happens when raters favor individuals who are similar to them in terms of interests, beliefs, background, or preferences. For example, a rater might give higher scores to individuals who share similar views, perspectives, or hobbies.
  - Leniency or Severity Bias: This occurs when a rater consistently gives higher (leniency) or lower (severity) scores than warranted, often due to a personal inclination or belief about grading standards. This can lead to unjustifiably high or low evaluations for all individuals being assessed. This will not affect inter-rater reliability calculated using a correlation, but it can affect reliability determined by percent agreement.
  - Central Tendency Bias: Raters who are reluctant to give extreme scores might rate everyone near the average, avoiding high and low ratings. This can obscure meaningful distinctions in performance or ability.
- Complexity of the Behavior or Task Being Rated: If the task or behavior being assessed is complex or involves multiple dimensions, achieving consistency among raters becomes more challenging. Raters might focus on different aspects or interpret criteria differently. This is where rubrics are needed most.
- Ambiguity in Rating Scales or Criteria: Inter-rater reliability is highly dependent on the clarity and specificity of the rating scales or criteria used. Ambiguous or subjective criteria can lead to inconsistent interpretations and ratings by different raters.
- Rater Drift: Over time, raters may unconsciously drift away from the established scoring guidelines. This drift is sometimes called floating criteria and can occur due to changes in perception, fatigue, or increased familiarity with the subject matter over time. This is really an issue of intra-rater reliability (i.e., an individual rater scoring similar responses differently) that can affect inter-rater reliability.
- Influence of Feedback and Interaction Among Raters: If raters discuss or receive feedback about their ratings, it can influence their subsequent ratings. This type of interaction might be seen as a good thing (i.e., it seems fair) but can lead to artificially inflated reliability estimates, as raters may become more similar in their evaluations due to conformity to the group.
- Cost and Feasibility: Conducting an inter-rater reliability assessment can be resource-intensive, requiring multiple trained raters and often multiple assessment sessions. This can be a significant limitation in terms of time, cost, and feasibility, especially in the assessment of large groups of students.
Chapter Summary
Reliability is a prerequisite for validity. An assessment must consistently yield stable results to be considered valid. However, just because a test produces reliable results does not mean the test scores are valid for the intended purpose.
Types of Reliability include:
- Test-Retest Reliability: Measures consistency of test scores over time.
- Parallel-Forms Reliability: Involves using equivalent forms of a test, emphasizing the creation of truly equivalent test versions.
- Split-Half Reliability: Divides a test into two parts to evaluate internal consistency. Due to issues of creating two similar parts, this method is not often used.
- Internal Consistency Reliability: Often assessed using Cronbach's Alpha, focusing on item homogeneity and unidimensionality of the assessment.
- Inter-Rater Reliability: Assesses consistency in subjective scoring by two or more raters. Three methods of doing this include correlations, percent agreement, and mastery decision agreement.
Challenges and Limitations: Each method for determining reliability faces specific challenges.
Discussion Questions
- Provide an example of how a test might be considered reliable but not valid.
- Describe a situation where you would need two equivalent forms of a test. Explain the process and challenges associated with creating the two equivalent forms of the test.
- Provide an example of a test situation and explain which form of reliability evidence would be needed. Considering the limitations of each method for determining reliability, describe what might be done to alleviate issues associated with that method.