Item Statistics and Analysis


When reviewing items, it is best to pilot-test them with real students. This provides data we can use to evaluate each item. A minimum of 100-200 respondents is often cited as a recommended sample size for basic (classical) item analysis, which allows for preliminary calculations of item difficulty and discrimination. However, the specific requirements vary with the purpose of the assessment, the stakes involved, and the particular statistical methods being used. For high-stakes or large-scale assessments, you will need a larger sample, one that is representative of the population you are testing (sampling). You can calculate item statistics for a classroom test, but be cautious in interpreting them: small samples may yield biased results because the characteristics of those in the class do not represent other students in the larger population. If a test is reused, the combined item statistics from multiple administrations will better estimate how well individual items perform.

Item Statistics

Several metrics are used to review the effectiveness of test items. Item difficulty, the discriminating index, and a distractor analysis (for multiple-choice items) are commonly used in a basic (classical) item review process.

Item Difficulty. This statistic indicates the percentage of people who got an item correct. Each item on the test will have an item difficulty. The lower the percentage, the more difficult the item. This information is not related to the quality of the item; it simply indicates how difficult the item was for those taking the test. You may wish to review items identified as easy as well as those identified as difficult. An easy item, one that almost everyone gets correct, may contain an unintended clue to the correct answer or be written in a way that makes the correct answer obvious. A difficult item may be unclear or contain more than one correct answer. These item-writing mistakes introduce measurement error and diminish the validity of the assessment results. The calculation is represented by the letter P because it indicates the percentage of students who got the item correct.

$$ P = \frac { R}{ T} $$

Where 

R = number of individuals who got the item correct

T = total number of individuals who took the exam

This formula assumes students are awarded one point for a correct answer. If an item is worth more than one point and partial credit (part marks) is possible, use the item's average score divided by the maximum possible score instead.
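For example (hypothetical numbers), suppose an item is worth 4 points and the average score across all test-takers is 2.8. Then:

$$ P = \frac{2.8}{4} = 0.70\; or\; 70\% $$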

Example: Suppose 62 people took a test and 49 of them got question 1 on the exam correct. Then:

$$ P = \frac { R}{ T}\;= \frac { 49}{ 62} \; = \;0.79\; or\; 79\% $$

Interpreting this statistic is a simple descriptive process. Seventy-nine percent (79%) of those attempting this item got it correct. However, categorizing the difficulty of an item qualitatively is somewhat arbitrary. For example, you might use the following ranges to describe the difficulty of an item.

$$ Easy \; \to \; 80\% \;\;to\;\;100\% $$

$$ Moderately\; Easy\; \to \; 50\% \;\;to\;\;79\% $$

$$ Moderately \;Difficult\; \to \; 30\% \;\;to\;\;49\% $$

$$ Difficult\; \to \; 0\% \;\;to\;\;29\% $$
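To make the calculation and these bands concrete, here is a minimal sketch in Python (the function names and response data are illustrative, not from the original text) that computes P for a dichotomously scored item and applies the ranges above:

```python
def item_difficulty(responses):
    """Proportion of respondents who answered the item correctly.

    `responses` is a list of 0/1 scores (1 = correct), so P = R / T.
    """
    return sum(responses) / len(responses)

def difficulty_band(p):
    """Map P to the qualitative ranges described above (boundaries are arbitrary)."""
    if p >= 0.80:
        return "Easy"
    if p >= 0.50:
        return "Moderately Easy"
    if p >= 0.30:
        return "Moderately Difficult"
    return "Difficult"

# Hypothetical data: 49 of 62 test-takers answered the item correctly.
scores = [1] * 49 + [0] * 13
p = item_difficulty(scores)
print(f"P = {p:.2f} ({difficulty_band(p)})")  # P = 0.79 (Moderately Easy)
```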

It is important to remember that the result should not be understood as an indicator of the quality of the item. The item difficulty is simply an indicator of how well individuals did on that item overall. For example, an item with a P of 1.00 means that everyone attempting that item got it correct. It was an easy item, but we should not assume it was a good item. The item may have contained a hint to the correct answer, leading everyone to get the item correct even if they didn't know the material. Likewise, an item with a P of 0 means everyone got the item wrong, a difficult item. We might assume the item is problematic, but that is not a correct interpretation of the result. The item may simply measure something those taking the test were not taught or were developmentally unable to accomplish. Alternatively, a flaw in the item may have led people to get the answer wrong.

The difficulty level (P) is simply a piece of information that can be used, along with other data, to evaluate the item. For instance, if the item was part of a mastery test and measured an essential piece of knowledge, a difficulty level between 0.80 and 1.00 would be acceptable and might be expected. However, if the item was part of a norm-referenced test, you may choose to exclude it, even if it was well written, because it may not help discriminate between students.

A decision about whether the level of difficulty for an item is acceptable must take into account what the item was measuring, the purpose of the test, and why the item was included in the test. 

Discriminating Index. This statistic, also known as Discriminating Power, indicates the relationship (i.e., correlation) between the overall score on a test and how well individuals answered a specific test item. Each item on a test will have a discriminating index. A positive discriminating power indicates that students who do well on the overall test tend to do well on the item; a negative discriminating power indicates that students who do well on the overall test tend to do poorly on the item. A high discriminating index therefore indicates that the item effectively discriminates between high and low performers on the overall test, while a low discriminating index suggests the item is less effective in differentiating between individuals and, in the case of a norm-referenced test, may not contribute as much to the overall purpose of the test.

These statistics are most useful for norm-referenced tests, where differentiating between students is the goal, and less useful for criterion-referenced tests. Very easy and very hard items will have little or no discriminating power. In a criterion-referenced test, you may have several easy questions that test essential knowledge; item selection is based on the importance of the material or skills being tested. The discriminating power for easy items (and difficult items) will be close to zero, but the items might still provide valuable information about a student's overall competence.

In norm-referenced tests, discriminating power is important, and items with low discriminating power are typically excluded. When reviewing items, questions with a negative discriminating index should be examined. A discriminating index around zero suggests that a student who did well on the overall test is about as likely to get the item correct as a student who did poorly. Items with little or no discriminating power should also be reviewed.

Discriminating power can be calculated in more than one way. A common approach, shown below, compares the performance of an upper-scoring group with that of a lower-scoring group; another is to correlate item scores with total test scores.

$$ Discriminating\;Index = \frac{R_{upper} - R_{lower}}{T_{upper\;and\;lower} \times 0.5} $$

Where

R_upper = number of individuals in the upper-scoring group who got the item correct

R_lower = number of individuals in the lower-scoring group who got the item correct

T_upper and lower = total number of individuals in the upper and lower groups combined

The upper and lower groups are formed by ranking test-takers by their total test scores; the top and bottom halves, or, commonly, the top and bottom 27%, are typical choices.

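For example (hypothetical numbers), suppose the upper and lower groups contain 20 students each (40 in total), 15 students in the upper group answered the item correctly, and 8 students in the lower group did. Then:

$$ Discriminating\;Index = \frac{15 - 8}{40 \times 0.5} = \frac{7}{20} = 0.35 $$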
If the item is intended to discriminate between students, a discriminating index in the range of 0.30 to 0.70 is generally considered acceptable.
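The same calculation can be scripted. The sketch below is illustrative (the function names and data are assumptions, not from the original text): it computes the upper-lower index, and also shows the correlation-based alternative mentioned above, the point-biserial, which is simply the Pearson correlation between 0/1 item scores and total test scores.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

def discriminating_index(item_scores, total_scores, fraction=0.5):
    """Upper-lower discriminating index.

    Ranks test-takers by total score, takes the top and bottom `fraction`
    (halves here; top/bottom 27% is another common choice), and applies
    DI = (R_upper - R_lower) / (T_upper_and_lower * 0.5).
    """
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0])
    n = int(len(ranked) * fraction)
    lower = [item for _, item in ranked[:n]]   # lowest-scoring group
    upper = [item for _, item in ranked[-n:]]  # highest-scoring group
    return (sum(upper) - sum(lower)) / ((len(upper) + len(lower)) * 0.5)

# Hypothetical data: 0/1 item scores and total test scores for ten students.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
total = [95, 90, 88, 84, 80, 71, 69, 60, 55, 50]

print(round(discriminating_index(item, total), 2))  # 0.6, an acceptable value
print(round(correlation(item, total), 2))           # point-biserial alternative
```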

Distractor Analysis for Multiple-Choice Questions

Distractor analysis involves evaluating the incorrect answer choices (distractors) in multiple-choice (MC) questions to determine their effectiveness. This process examines how often each distractor is chosen by students and identifies any patterns that might indicate problems with the question or the distractors themselves. Effective distractors should be plausible enough to be chosen by students who do not know the correct answer but not so confusing that they mislead students who do understand the material. By analyzing distractor performance, educators can refine questions to improve their diagnostic value and ensure they accurately measure student knowledge and understanding.
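As a sketch of what this looks like in practice (illustrative code; the option labels and response data are assumptions, not from the original text), the following tallies how often each option is chosen, overall and separately within the upper- and lower-scoring halves. A healthy distractor draws some responses, mostly from the lower group; a distractor chosen by almost no one, or one that attracts the upper group, deserves review.

```python
from collections import Counter

def distractor_analysis(choices, total_scores, correct_option):
    """Tally option choices overall and within upper/lower halves by total score.

    `choices` holds each student's selected option (e.g., "A" through "D").
    """
    ranked = sorted(zip(total_scores, choices), key=lambda pair: pair[0])
    half = len(ranked) // 2
    lower = Counter(choice for _, choice in ranked[:half])   # lowest scorers
    upper = Counter(choice for _, choice in ranked[-half:])  # highest scorers
    overall = Counter(choices)
    for option in sorted(overall):
        flag = " (key)" if option == correct_option else ""
        print(f"{option}{flag}: overall {overall[option]}, "
              f"upper {upper[option]}, lower {lower[option]}")

# Hypothetical responses to one item from ten students.
choices = ["B", "B", "B", "A", "B", "C", "B", "A", "C", "D"]
totals  = [95, 90, 88, 84, 80, 71, 69, 60, 55, 50]
distractor_analysis(choices, totals, correct_option="B")
```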
