Reliability

Both validity and reliability are important factors in selecting or creating an assessment instrument. An assessment can yield reliable results that are nonetheless not valid for their intended use; in this sense, validity is of paramount importance. Reliability, however, is a prerequisite for validity: without consistent and stable results, an assessment's scores cannot be considered valid. Both aspects must be carefully considered.


Determining Reliability 

In educational assessment, reliability refers to the consistency of scores obtained from a test or assessment instrument. Imagine a student takes the same test twice, putting in genuine effort each time. We wouldn't expect identical scores, but their performance should be similar, indicating the assessment consistently gauges their knowledge, skills, or ability.

Assessment Reliability

Reliability refers to the degree to which test or assessment scores are consistent and dependable. However, the ability to produce consistent scores does not mean the assessment scores are valid for the assessment's intended purpose. 

As with validity, reliability is inherently about the results obtained, not the instrument itself. When we say a test is reliable, we mean it produces consistent results. Without an estimate of reliability, we have no defensible basis for making informed judgments about the accuracy, dependability, or generalizability of our test results. Evidence of test reliability strengthens our confidence that the scores provide an accurate reflection of the students' abilities or attributes being assessed. Educational and psychological assessments often inform critical decisions about individuals' academic paths, professional opportunities, and personal development. The trustworthiness and ethical foundation of these decisions hinge on the reliability of the underlying test scores. For assessment results to be usable in research, the consistency of the results is a fundamental requirement: results must be stable if they are to be generalized or compared across different settings or groups.

Reliability and Validity: A Relationship Example

Test results cannot be valid if they are not reliable. However, a test can produce reliable results without being valid; you can consistently miss the mark. A test might be highly reliable, consistently producing similar results, yet still fail to measure what it is supposed to measure, leading to questions about its validity. For instance, a test designed to measure mathematical ability might consistently yield the same results under the same conditions, and the results would be considered reliable. However, if the test results were influenced by irrelevant factors such as reading comprehension skills, the test’s ability to measure math skills is compromised despite its high reliability. A test may also be believed to measure a specific construct like higher-level thinking, but only reliably measure the ability to recall facts. Likewise, if the results of a valid math test were used inappropriately to determine a student’s ability to dance, the results, despite being reliable, are not valid for that purpose.

Evidence Collection 

Evidence of test reliability can be determined through several methods, each focusing on a different aspect of consistency. Each method examines reliability from a different angle and offers a different understanding of how consistent the test scores are. It is important to choose the method that best aligns with the nature of the test and the specific aspect of reliability being assessed. Some methods are more appropriate than others, and multiple methods may need to be used.

Test-Retest Reliability: This is a measure of test stability. This method assesses the consistency of test scores over time. It involves administering the same test to the same group of individuals at two different points in time and then comparing the scores. A high correlation between the two sets of scores indicates high test-retest reliability, signifying that the test is consistently measuring the same attribute over time.
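
To make the calculation concrete, here is a minimal sketch in Python, assuming hypothetical scores from ten students who took the same test twice; the scipy library's Pearson correlation is used as the reliability estimate.

```python
# A minimal sketch (hypothetical data): test-retest reliability estimated as the
# Pearson correlation between two administrations of the same test.
from scipy.stats import pearsonr

first_administration = [78, 85, 62, 90, 71, 88, 55, 94, 67, 80]   # time 1 scores
second_administration = [75, 88, 60, 92, 70, 85, 58, 95, 65, 83]  # time 2 scores

r, _ = pearsonr(first_administration, second_administration)
print(f"Test-retest reliability estimate: r = {r:.2f}")  # closer to 1 = more stable
```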

While this method is valuable for assessing the stability of test results over time, it has several limitations and challenges.

Parallel-Forms Reliability: Also known as equivalent-forms reliability, this is a measure of equivalence. This approach involves creating two different versions of the same test (parallel forms) that are designed to be equivalent in terms of content, number of questions, difficulty, and style. These versions are then administered to the same group of individuals, and the scores from the two versions are correlated. A high correlation suggests that both forms of the test are equally reliable.

This method addresses some issues inherent in other reliability methods, but it has its own set of limitations and challenges.

Split-Half Reliability: This method involves splitting a test into two halves (e.g., odd and even items) and then correlating the scores obtained on each half. This approach measures the internal consistency of a test, testing whether the two parts of the test contribute equally to what is being measured. Unlike test-retest and parallel forms, this method only requires students to take the test once.  So, while the split-half method provides a reliability estimate of internal consistency based on a single administration of the test, it doesn't account for factors like stability.
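
A minimal sketch of this calculation follows, again with hypothetical item scores: the items from a single administration are split into odd- and even-numbered halves, the half scores are correlated, and the Spearman-Brown formula (a common adjustment, not discussed above) is applied to estimate full-test reliability from the half-test correlation.

```python
# A minimal sketch (hypothetical data): split-half reliability from one administration.
import numpy as np
from scipy.stats import pearsonr

# Rows are students, columns are items scored 0 (incorrect) or 1 (correct).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

odd_half = item_scores[:, 0::2].sum(axis=1)    # total on items 1, 3, 5, 7
even_half = item_scores[:, 1::2].sum(axis=1)   # total on items 2, 4, 6, 8

half_r, _ = pearsonr(odd_half, even_half)

# The half-test correlation understates full-test reliability, so the
# Spearman-Brown formula is commonly used to adjust it to full length.
full_r = (2 * half_r) / (1 + half_r)
print(f"Half-test r = {half_r:.2f}; Spearman-Brown adjusted = {full_r:.2f}")
```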

While this method offers certain advantages, it is not often used, as it has notable limitations and challenges.

Internal Consistency Reliability: As the label implies, this is a measure of internal consistency.  A common approach in educational and psychological assessments is to use a Cronbach's Alpha calculation to determine the internal consistency of a measurement instrument. Cronbach's Alpha is popular because it requires only one sitting of the exam and can handle items that are scored dichotomously (i.e., items that are scored correct or incorrect, 0 or 1 point awarded) or non-dichotomously (i.e., awarding partial marks or items out of more than 1 point). Other statistical methods for calculating internal consistency, like the Kuder-Richardson formulas, only handle dichotomously scored items but will produce similar results. 

Cronbach's Alpha is a measure of the extent to which all items on a test measure the same underlying construct (i.e., homogeneity). It can be used for tests of cognition but is most appropriate for assessments of affect or other psychological constructs. It assumes that each question contributes equally to the concept or construct being measured, which may not be the case. 

In simple terms, the formula compares the average variance of each item to the total variance of all items combined. The basic idea is that if all the questions are good measures of the concept and consistently contribute to the measure, a person's responses will be relatively consistent across all the questions on the exam. The result produces a reliability coefficient (r) like that produced by a correlation. A higher value (closer to 1) suggests that the test has high internal consistency, meaning the items are well correlated and reliably measure the same construct. A lower value (closer to 0) indicates poor internal consistency, suggesting that the items might not be well related to each other. Again, it is important to remember that while a high Cronbach's Alpha value indicates good internal consistency, it doesn't guarantee that the test is measuring the intended concept or construct accurately (validity).
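
A minimal sketch of that comparison, using hypothetical item scores: with k items, Cronbach's Alpha is k / (k - 1) times one minus the ratio of the summed item variances to the variance of the total scores.

```python
# A minimal sketch (hypothetical data) of the Cronbach's Alpha calculation:
# alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total scores))
import numpy as np

# Rows are students, columns are items; partial credit (0-2 points) is allowed.
item_scores = np.array([
    [2, 1, 2, 1, 2],
    [1, 1, 1, 0, 1],
    [2, 2, 2, 2, 2],
    [0, 1, 0, 0, 1],
    [1, 2, 1, 1, 2],
    [2, 2, 1, 2, 2],
], dtype=float)

k = item_scores.shape[1]                              # number of items
item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's Alpha = {alpha:.2f}")  # closer to 1 = higher internal consistency
```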

While this method has widespread use, it also presents several limitations and challenges. It can be a valuable tool for assessing the internal consistency of a test, but it is important to be aware of its limitations and to use it in conjunction with other methods of reliability as appropriate.

Inter-Rater Reliability: This is a measure of consistency of ratings. This form of reliability is crucial in assessments where subjective judgment plays a role, such as essay grading or performance assessments. It assesses the degree of agreement or consistency between different raters or judges. High inter-rater reliability means that different raters consistently arrive at similar scores or conclusions. Remember, consistency in rating does not mean the ratings were valid; it could be that, while consistent with one another, all of the raters scored the performance inaccurately. Reliability focuses on consistency as a precondition of validity. The validity of the raters' scores must be established separately. 

Raters and rubrics are required for performance assessments like writing, presentations, or projects, because grading these performances requires subjective scoring. Objective scoring is used when there is one answer that experts agree is correct; it is efficient and can be facilitated by technology. Subjective scoring is needed when several different student responses might all be considered adequate, and it is used to determine the degree to which a response is correct. While technology can sometimes facilitate this process, most often trained human raters are required to accomplish this assessment task. 

In a classroom, the teacher may be the only person scoring student performances. In this case, inter-rater reliability cannot be determined; it takes two or more raters to establish it. Typically, multiple raters are used only in high-stakes assessment situations (e.g., state standardized testing or college entrance exams). When only one rater is used, intra-rater reliability can be used instead to determine the degree to which an individual’s ratings are consistent with themselves. One form of intra-rater reliability examines how consistently a single rater scores performances over time; this requires the same rater to score each performance twice, with a time lapse in between. Intra-rater reliability can also be used to determine whether a single rater scores similar performances the same way. This requires an evaluation of the performance scores by the rater or another individual and can result in revising the scores so that they are consistent.

Inter-rater reliability is typically calculated using one of the following methods:

  1. Correlations: This method involves calculating the statistical correlation between the ratings of different raters. The most common measure used is Pearson's correlation coefficient, which ranges from -1 to +1. A coefficient close to +1 indicates a high level of agreement between raters, while a coefficient close to -1 suggests a high level of disagreement. A coefficient around 0 implies no relationship between the ratings. This method is particularly useful when ratings are quantitative (e.g., scores on a test) and when you are interested in the degree of association between raters. Inter-rater correlations are sensitive to the ranking of the performances rather than to the exact scores: even if the two raters assign different scores, you will get a high correlation as long as they rank the performances the same, and a lower correlation when the rankings differ. When more than two raters are involved, an intraclass correlation (ICC) is used; this method assesses both the degree of correlation and the agreement between the ratings. (A brief sketch of these calculations appears after this list.)
  2. Percentage Agreement: This is a straightforward method for determining inter-rater reliability in which you calculate the percentage of times raters agree in their assessments. For instance, if two teachers are grading essays and they give the same grade 80 out of 100 times, the percentage agreement is 80%. This method is easy to understand and compute, but it has limitations. The number of points on the rating scale can affect the reliability estimate: if the rating scale uses ten points, for example, the raters are more likely to give the same score than if the scale were out of 100 points. One adjustment to this method is to calculate the percent agreement within a range, for example, agreement within plus or minus one point, or plus or minus 10 points in the case of a 100-point scale.
  3. Mastery Decision: This approach is used particularly in educational settings or in contexts where a standard or benchmark is set for "mastery" or adequate competence in a skill. Here, inter-rater reliability is assessed based on how consistently raters classify performances as meeting or not meeting the mastery criterion. For example, if two raters are evaluating whether students have mastered a certain math skill, inter-rater reliability would be calculated based on how often both raters agree on whether a student has or has not achieved the criterion.
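
Here is a minimal sketch of these three approaches, assuming hypothetical scores from two raters grading the same ten essays on a 1-10 rubric and an assumed mastery cut score of 7.

```python
# A minimal sketch (hypothetical data): three inter-rater reliability estimates
# for two raters scoring the same ten essays on a 1-10 rubric.
import numpy as np
from scipy.stats import pearsonr

rater_a = np.array([8, 6, 9, 5, 7, 10, 4, 8, 6, 7])
rater_b = np.array([7, 6, 9, 4, 8, 9, 4, 8, 5, 7])

# 1. Correlation: reflects whether the raters rank the essays similarly.
r, _ = pearsonr(rater_a, rater_b)

# 2. Percentage agreement: exact matches, and matches within plus or minus one point.
exact_agreement = np.mean(rater_a == rater_b) * 100
within_one_point = np.mean(np.abs(rater_a - rater_b) <= 1) * 100

# 3. Mastery decision: agreement on whether each essay meets an assumed cut score.
cut_score = 7
mastery_agreement = np.mean((rater_a >= cut_score) == (rater_b >= cut_score)) * 100

print(f"Correlation: r = {r:.2f}")
print(f"Exact agreement: {exact_agreement:.0f}%; within one point: {within_one_point:.0f}%")
print(f"Mastery-decision agreement: {mastery_agreement:.0f}%")
```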

As with all the methods we might use for determining reliability, there are several limitations and challenges to consider when calculating inter-rater reliability. 

    1. Halo (or Horn) Effect: This occurs when an evaluator's overall impression of a person, often based on one positive (or negative) characteristic or first impression, influences their judgment about that person's performance. For example, the way a student dresses or looks can create an initial impression that disproportionately influences a rating and subsequent evaluations. If the rater has experience with the student, the student's behavior or an early mistake or success might color all future assessments of that person.
    2. Logical Errors: These occur when a rater makes a faulty assumption or generalization about a student they are rating; for example, assuming that a student who shows ability in writing must be intelligent and will therefore also be good at math. If the student fails to do well on a math assessment, the rater may give the student the benefit of the doubt and a higher score because the rater assumes the student is generally intelligent based on their writing ability. 
    3. Recency Effect: This bias happens when raters focus more on the most recent behavior or performance, overlooking earlier information. For instance, a student's latest assignment might disproportionately influence their overall grade for the course.
    4. Similarity Bias: This happens when raters favor individuals who are similar to them in terms of interests, beliefs, background, or preference. For example, a rater might give higher scores to individuals who share similar views, perspectives, or hobbies.
    5. Leniency or Severity Bias: This occurs when a rater consistently gives higher (leniency) or lower (severity) scores than warranted, often due to a personal inclination or belief about grading standards. This can lead to unjustifiably high or low evaluations for all individuals being assessed. This bias will not affect inter-rater reliability calculated using a correlation, but it can affect reliability determined by percent agreement (a brief sketch demonstrating this follows the list).
    6. Central Tendency Bias: Raters who are reluctant to give extreme scores might rate everyone near the average, avoiding high and low ratings. This can obscure meaningful distinctions in performance or ability.
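
The point made above about leniency or severity bias can be demonstrated with a short, hypothetical sketch: a rater who consistently scores two points lower than a colleague preserves the rank order of the performances, so the correlation stays perfect, yet exact percentage agreement falls to zero.

```python
# A minimal sketch (hypothetical data): a severely strict rater shifts every
# score down by two points. The correlation is unchanged, but exact agreement drops.
import numpy as np
from scipy.stats import pearsonr

rater_a = np.array([8, 6, 9, 5, 7, 10, 4, 8, 6, 7])
rater_b_strict = rater_a - 2   # same rank order, consistently two points lower

r, _ = pearsonr(rater_a, rater_b_strict)
exact_agreement = np.mean(rater_a == rater_b_strict) * 100

print(f"Correlation: r = {r:.2f}")                  # 1.00: rank order is identical
print(f"Exact agreement: {exact_agreement:.0f}%")   # 0%: no scores match
```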

Chapter Summary

  • Reliability is a prerequisite for validity. An assessment must consistently yield stable results to be considered valid. However, just because a test produces reliable results does not mean the test scores are valid for the intended purpose.

  • Types of Reliability include:

    • Test-Retest Reliability: Measures consistency of test scores over time.
    • Parallel-Forms Reliability: Involves using equivalent forms of a test, emphasizing the creation of truly equivalent test versions.
    • Split-Half Reliability: Divides a test into two parts to evaluate internal consistency. Due to issues of creating two similar parts, this method is not often used.
    • Internal Consistency Reliability: Often assessed using Cronbach's Alpha, focusing on item homogeneity and unidimensionality of the assessment.
    • Inter-Rater Reliability: Assesses consistency in subjective scoring by two or more raters. Three methods of doing this include correlations, percent agreement, and mastery decision agreement.
  • Challenges and Limitations: Each method for determining reliability faces specific challenges.

Discussion Questions

  1. Provide an example of how a test might be considered reliable but not valid.
  2. Describe a situation where you would need two equivalent forms of a test. Explain the process and challenges associated with creating the two equivalent forms of the test. 
  3. Provide an example of a test situation and explain which form of reliability evidence would be needed. Considering the limitations of each method for determining reliability, describe what might be done to alleviate issues associated with that method.  


