Assessment in the Instructional Design Process

This chapter provides an overview of assessment practices in the instructional design process.

Assessment is an essential aspect of the instructional design process because it is tied to the expected learning outcomes of an educational product. Reiser and Dempsey (2012) suggested that instructional design involves designing and developing educational products to facilitate learning and improve performance. However, how do we know we have accomplished this goal once we have created an instructional product? This is where assessment is needed!

Evaluation and Assessment

Before we explore the topic of assessment, we need to understand the difference between two terms: assessment and evaluation. These related terms are often used interchangeably, but they have distinct meanings and purposes.

Assessment refers to the process of gathering information about an individual's knowledge, skills, abilities, or other characteristics. Assessment often requires that we create instruments (e.g., tests) to measure these characteristics. However, assessment can take other forms, such as observations or interviews. The primary purpose of assessment is to gather accurate, often quantitative, information about an individual so we can communicate and compare results.

Evaluation, on the other hand, refers to the process of making judgments or decisions based on the results of an assessment. An evaluation aims to make value-based judgments about an individual's performance or cognitive ability; this often requires that we establish evaluation criteria.

The difference between these two terms is subtle. Assessment is descriptive, while evaluation involves judgment. Assessment is the process of gathering information, while evaluation is the process of making decisions based on the results of an assessment. An assessment becomes an evaluation when we make a determination about an individual based on assessment results.

Types and Purposes of Assessment

Assessment serves multiple purposes in education, including:

Measuring Student Learning: Summative assessments measure achievement, enabling teachers to determine what students have learned (accountability) and verify they have accomplished the expected learning outcomes (certification). These types of assessments are most often evaluations.

Informing Instructional Planning: Formative assessments help teachers make informed decisions about the instructional needs of their students. The results of a formative assessment can help teachers plan the scope and focus of their instruction.

Assessing Readiness and Need: Placement assessments are a form of formative assessment that helps teachers determine a student's readiness for the planned instruction or whether a student needs to participate in the proposed instruction.

Diagnosing Learning Problems: Like formative assessment, diagnostic assessment can help with lesson planning, but at an individual level rather than a group level. The results of a diagnostic assessment are used to identify specific misconceptions a student may have or to explain why they failed to accomplish a specific task. They also provide detailed feedback to students: not just that they got a question wrong, but why they may have answered a question incorrectly or unsuccessfully completed a task.

Study Guides: Research has shown that using tests can be an effective study technique (Karpicke & Blunt, 2011). For example, taking a test-your-understanding quiz can help students improve their retention and recall of information. The results can provide valuable feedback for students, helping them identify areas where they need to improve. In addition, taking practice tests can reduce test anxiety as students become more comfortable with the testing process and the types of items used in an assessment.

Evaluating Program Effectiveness: The results of assessments can be used to evaluate the effectiveness of educational programs and initiatives, helping teachers and schools make data-driven decisions about improving the education they provide. However, when evaluating a program, assessment results are but one piece of evidence that should be considered. For a complete evaluation, one might include, among other things, an implementation fidelity study, a negative case analysis, or an analysis of unintended consequences.

Background of Assessment in Instructional Design

The field of instructional design emerged in the mid-1900s. The military was the first to design instruction systematically; they needed to train soldiers quickly and efficiently to perform specific tasks. An essential aspect of the military's training was the assessment of a soldier's aptitude and ability to correctly carry out what they had learned. Over the next few decades, an Instructional Systems Design (ISD) approach was adopted by most instructional designers. The main goal of ISD was to outline key steps that should be taken to ensure that quality instruction was created.

In the 1970s, the ADDIE model for designing and developing instruction was one of the first formal ISD models developed, reportedly by the Center for Educational Technology at Florida State University for the United States Armed Forces. ADDIE stands for Analyze, Design, Develop, Implement, and Evaluate. The analysis phase of the ADDIE model required a gap or needs analysis to determine the goals and objectives of the instruction to be developed. The original purpose of the evaluation phase in the ADDIE model focused on assessing student learning to determine whether the learning objectives of the course had been met. The results of a summative assessment were used to certify that students had accomplished the intended learning objectives and were the main criteria used to determine the effectiveness of the instruction. However, the purpose of evaluation in the model was later expanded to include a more comprehensive view that included formative evaluations of the instructional approach, design, usability, and maintenance of the instructional product.

The ADDIE model is arguably the most prominent instructional design model, but many others have since been developed and promoted. The models differ, but there are three broad activities an instructional designer must accomplish:

  1. Establish the learning objectives for the instruction.
  2. Decide how to assess the expected learning outcomes.
  3. Design and develop instructional activities to facilitate the desired learning.

Wiggins and McTighe (2005) popularized this idea by coining the term Backward Design, or starting with the end in mind. Their book Understanding by Design included the following steps: identify the desired results, determine acceptable evidence that the expected learning outcomes have been met, and then plan learning experiences and instruction to facilitate the expected learning. This approach of establishing learning objectives and assessments in the instructional plan before creating learning activities was not a new concept, but Wiggins and McTighe effectively rebranded the ideas of Tyler, Gagné, Mager, and others, ideas that were the foundation of most ISD models developed in the 1950s and 1960s. As a result of Wiggins and McTighe's work, present-day educators and instructional designers have been reintroduced to these critical concepts.

Test Plans and Learning Objectives

Many instructional designers skip this step, but creating a test plan with clear learning objectives is crucial before building an assessment. A test plan helps ensure that the test is designed to measure what learners are expected to know and be able to do after completing a training or learning program.

There are many ways to plan a test. A test plan need not be complicated, but there are a few specific details the plan should address.

1. Purpose. The purpose of the assessment should be established. Who will take the test, and how will the results be used?

2. Learning Objectives. A test plan should include clear learning objectives that help focus the testing process on the specific knowledge and skills learners need to demonstrate.

3. Content. Describing the content to be covered can help guide the test creator and the instructional designer.

4. Test Specifications Table. A table of specifications helps test creators make decisions about the number of items to include. It helps them align test items with the content and the learning objectives or constructs being assessed. Using a table of specifications can also help validate an assessment by providing a visual representation of the content and construct coverage.
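
To make the idea of a specifications table concrete, the following is a minimal sketch in Python. It assumes a hypothetical geography test with three content areas and three cognitive levels; the names and item counts are invented for illustration, not a recommended blueprint.

    # A minimal sketch of a table of specifications for a hypothetical test.
    # Rows are content areas, columns are cognitive levels, and each cell holds
    # the number of items planned for that combination. All values are illustrative.
    spec = {
        "Map reading":         (4, 3, 1),
        "Physical geography":  (5, 4, 2),
        "Political geography": (3, 2, 1),
    }
    levels = ("Recall", "Application", "Analysis")

    # Print the table with row and column totals so coverage is easy to inspect.
    print(f"{'Content area':<22}" + "".join(f"{lvl:>14}" for lvl in levels) + f"{'Total':>8}")
    for area, counts in spec.items():
        print(f"{area:<22}" + "".join(f"{c:>14}" for c in counts) + f"{sum(counts):>8}")
    col_totals = [sum(row[i] for row in spec.values()) for i in range(len(levels))]
    print(f"{'Total':<22}" + "".join(f"{t:>14}" for t in col_totals) + f"{sum(col_totals):>8}")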

Performance assessments. Formal, traditional testing often involves what is referred to as paper-and-pencil tests of cognitive learning objectives, although online assessments have, to a large extent, replaced printed formats. Assessments may also be quite informal, involving a knowledgeable other asking questions and receiving responses verbally. In addition, not all tests assess an individual's cognitive ability; many assessments are best classified as performance assessments. These tests assess the abilities and skills of an individual and are designed and administered differently. Learning objectives that address skills should be assessed by observing the performance of the skill. For example, when learning a foreign language, a cognitive assessment of vocabulary is important but not a sufficient demonstration of speaking ability. A performance test must be planned differently from a cognitive assessment. The plan will include a description of how the assessment will be administered, and because no two performances will be exactly the same, a rubric for grading the quality or adequacy of the performance will often replace the table of specifications.

Designing Valid Assessments

Instructional designers need to create assessments for several purposes. This may include creating a test-your-understanding quiz, a unit review, or a summative assessment at the end of the course to certify a student has accomplished the expected learning outcomes. Unfortunately, not all assessments are valid measures of what they are intended to measure, and in those cases the results cannot be used for their intended purpose. This is why an instructional designer needs to learn how to create learning objectives and develop quality assessment instruments that align with the goals of the instruction.

Definition of Assessment Validity

The results of an assessment are valid if the assessment measures what it is supposed to measure accurately and consistently. 

Creating valid assessments goes beyond ensuring test questions focus on material covered in class or in the curriculum standards. Assessment validation involves checking that your assessment instruments produce accurate results and are used appropriately.

When we say a test is valid, we really mean the results are valid. In other words, the results are credible (i.e., they measure what they were supposed to measure) and, therefore, can be and are used for a specific intended purpose. And while we might say a test is (or the results are) valid, assessment validity might better be understood as a continuum: an assessment must be sufficiently credible and trustworthy that the results can be used confidently to make decisions (i.e., evaluations).

The validation process involves gathering evidence that allows you to confidently conclude that the results accurately represent whatever the assessment was supposed to measure. Several types of evidence can be used to support the validation process:

  • Evidence of Content Validity refers to the extent to which the assessment instrument covers the content domain it intends to measure. Evidence of content validity can be obtained by reviewing the assessment items and assessing their relevance to, and importance within, the intended domain. For example, suppose an assessment is designed to measure knowledge of world geography. In that case, the assessment items should adequately cover each geographical area of the world. The test should also focus on the most important ideas and concepts the individual should understand. Missing some content or skipping important ideas would diminish the validity of the assessment.
  • Evidence of Construct Validity refers to the extent to which the assessment instrument focuses on the construct or concept it is intended to measure. Evidence of construct validity can be obtained by examining the relationship between the assessment scores and other measures of the same construct. For example, if an assessment is designed to measure critical thinking skills, evidence of construct validity can be obtained by comparing assessment results with other validated measures of this construct. Additional evidence is obtained by examining the items used on a test to verify that the items elicit the targeted skill, not some unrelated or irrelevant skill or ability. For example, if the results of a math skills assessment are influenced by reading ability, the assessment results are less valid.
  • Evidence of Assessment-Criterion Relationship Validity refers to the extent to which a test score (the assessment) predicts future performance or success (the criterion). Predictive validation studies focus on the relationship between the assessment and future performance. For example, if we determine that individuals will need specific math skills to do a particular task (i.e., the criterion), then an assessment of the requisite math skills should correlate well with the individual's ability to complete the task. Concurrent validation studies compare the results of an instrument designed to measure a requisite skill with validated measures of the criterion.
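
As a simple illustration of gathering assessment-criterion evidence, the sketch below computes the correlation between a set of assessment scores and a criterion measure. The paired scores are hypothetical, invented only to show the calculation; a real validation study would require an adequate sample and a validated criterion measure.

    from math import sqrt

    # Hypothetical paired data: a math-skills assessment score and a later measure
    # of task performance (the criterion) for the same ten learners.
    assessment = [62, 71, 55, 88, 47, 93, 66, 78, 59, 84]
    criterion = [58, 75, 52, 91, 50, 89, 70, 74, 61, 80]

    def pearson(x, y):
        """Pearson correlation coefficient between two equal-length score lists."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    # A strong positive correlation is one piece of criterion-related validity
    # evidence; it does not by itself establish validity.
    print(f"Assessment-criterion correlation: {pearson(assessment, criterion):.2f}")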

Creating Test Items

There are many resources available that teach item-writing basics. Still, it is easy to write a lousy test. The quality of your assessment will depend on the quality of the items you use. Selecting the most appropriate type of test item to capture the expected learning is crucial, as is testing and revising the items you create. Best practice suggests you write multiple versions of an item, both to weed out faulty items and to provide similar items for equivalent forms of an assessment or for a test bank of questions.

There are a few item statistics that can help identify problematic items. However, these statistics only provide information that may be useful when reviewing and improving the test items used in an assessment. The review itself needs to be done by subject matter experts and assessment specialists (e.g., psychometricians).

1. Item Difficulty. This statistic indicates the percentage of people who got an item correct. By itself, it says little about the quality of the item, so you may wish to review the easy items as well as the difficult items. An easy item, one that almost everyone gets correct, may contain an unintended clue to the correct answer or be written in a way that makes the correct answer obvious. A difficult item may be unclear or contain more than one correct answer. These kinds of item-writing mistakes lead to measurement error and diminish the validity of the assessment results.

2. Discriminating Index. This statistic, also known as discriminating power, indicates the relationship (i.e., correlation) between the overall score on a test and how well individuals answered a specific test item. Each item on a test will have a discriminating index. A high discriminating index indicates that the item effectively discriminates between high and low performers. Conversely, a low discriminating index suggests that the item is less effective in differentiating between individuals and may not contribute as much to the overall purpose of the test.

These statistics are typically used for norm-referenced tests where differentiating between students is the goal. Very easy and very hard items will have little or no discriminating power. In norm-referenced tests, items with low discriminating power are typically excluded. In a criterion-referenced test, this statistic is less important. Item selection is based on the importance of the material or skills being tested.

When reviewing items, questions with a negative discriminating index should be reviewed. A negative discriminating index indicates that students who do better on the overall test tend to get this item wrong. In other words, the more a student knows, the less likely they are to answer this question correctly. Likewise, items with little or no discriminating power should also be reviewed. A discriminating index around zero suggests that a student who did well on the overall test is no more likely to get the item correct than a student who did poorly.
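
As a concrete illustration of these two statistics, the following is a minimal sketch that computes item difficulty and a simple upper-lower discriminating index from dichotomously scored responses. The response matrix is invented for illustration, and the upper-lower method shown here is only one common way to estimate discriminating power (item-total correlations are another).

    # Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
    # All responses are invented for illustration.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [1, 0, 0, 1],
    ]
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])

    # Item difficulty: the proportion of examinees who answered the item correctly.
    difficulty = [sum(row[i] for row in responses) / len(responses) for i in range(n_items)]

    # Discriminating index (upper-lower method): rank examinees by total score,
    # then compare the proportion correct in the top half with the bottom half.
    order = sorted(range(len(responses)), key=lambda r: totals[r], reverse=True)
    half = len(responses) // 2
    upper, lower = order[:half], order[half:]
    for i in range(n_items):
        p_upper = sum(responses[r][i] for r in upper) / len(upper)
        p_lower = sum(responses[r][i] for r in lower) / len(lower)
        print(f"Item {i + 1}: difficulty = {difficulty[i]:.2f}, discrimination = {p_upper - p_lower:+.2f}")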

A detailed discussion of the development and testing of items is beyond the scope of this chapter. However, as a general rule, items should align with the intended learning objectives, and the items used should adequately cover the content, focusing on the most important information and skills. Those developing an assessment should follow best-practice guidelines for each type of item they use.

Assessment Challenges and Issues

Assessment specialists face many challenges when creating valid assessments. We have outlined a few here, but there are others.

Getting beyond recall and understanding. One of the biggest mistakes test creators make is focusing too heavily on the recall of basic information. This may be acceptable when a course's learning objectives intentionally focus exclusively on the ability to remember and understand facts and definitions; however, in many courses, the instructional objectives target student learning beyond the initial levels of Bloom's Taxonomy, and the assessments should measure that learning.

Measuring affective characteristics. Most of what we measure in schools and training situations falls within the cognitive domain. However, the instructional goals of a course often include affective objectives. Unlike knowledge, skills, and abilities, the affective domain includes personal characteristics like attitudinal dispositions, values, beliefs, and opinions (e.g., interest, caring, empathy, and appreciation) (see Davies, 2021). Simon and Binet (1916), the fathers of intelligence testing, suggested that as important as assessing cognitive ability may be, we might be well served first to teach (and assess) character. Assessing these personal characteristics requires a different kind of assessment: a scale that measures the degree to which individuals possess a certain characteristic or quality.
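
As a brief illustration, the sketch below scores a short, hypothetical 5-point Likert-type scale intended to gauge a learner's interest in a subject. The items, the reverse-coded wording, and the responses are all invented; a real affective scale would need to be developed and validated carefully.

    # Hypothetical Likert-type items (1 = strongly disagree ... 5 = strongly agree).
    # The second element marks reverse-coded items, which must be flipped before scoring.
    items = [
        ("I look forward to lessons on this subject.", False),
        ("I read about this subject outside of class.", False),
        ("I find this subject boring.", True),  # reverse-coded
    ]

    # One respondent's raw answers, in the same order as the items above.
    raw_responses = [4, 5, 2]

    def score_item(response, reverse, scale_min=1, scale_max=5):
        """Return the item score, flipping reverse-coded items."""
        return (scale_max + scale_min - response) if reverse else response

    scores = [score_item(r, rev) for r, (_, rev) in zip(raw_responses, items)]
    print(f"Item scores: {scores}")
    print(f"Scale score (mean): {sum(scores) / len(scores):.2f}")  # higher = more interest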

[Figure: A visual of Bloom's Taxonomy for the Cognitive Domain]

High-stakes testing. One particularly contentious issue in schools is the political mandate to test students using standardized, summative assessments. A few issues arise from this policy. One issue with high-stakes testing revolves around the idea that these tests do not assess the whole person. The "whole person issue" in assessment refers to the challenge of capturing a person's entire range of abilities, characteristics, and experiences in a comprehensive and accurate manner; using a single assessment to judge a person may be limiting. A second issue focuses on balancing the need to assess with the need to teach. Some educators complain they are so focused on testing that they have little time to teach, which includes the problem of teaching to the test. One additional issue with high-stakes testing relates to the need for such testing at all; many educators believe that the most important purpose for assessment in schools is formative, not summative.

Interpretation and inappropriate uses of assessment results. The inappropriate use of assessment results can also be a problem. Assessments are typically created for a specific purpose, and the results are not valid for other purposes. Assessment results are only valid if appropriately interpreted and used for the assessment's intended purpose. For example, in schools, tests are designed to evaluate individual students' knowledge, skills, and abilities. Unfortunately, the results are also inappropriately used to judge the quality of the instruction provided. While the quality of the teacher or instruction may influence the results of an assessment, many students fail to achieve despite being provided quality instruction. Often, students succeed despite their teachers' failings. A better assessment of teacher quality would require assessments explicitly designed for that purpose.

Another example of inappropriate use of assessment results happens when we don't have a good measure of the intended learning outcomes. This can happen, for example, when we want to develop a specific affective characteristic but don't have a valid measure of the disposition; using an achievement test as an indirect substitute indicator would not be an appropriate or valid practice. The challenge for assessment developers is to create direct, valid measures of the expected learning outcomes.

Areas of Assessment Research

If you are interested in researching the topic of assessment, there are several promising and challenging areas you might consider.

Online test security. With the increased acceptance of online and distance learning, cheating on exams has become a prominent concern. Research on this topic has identified various vulnerabilities and proposed measures to address them. Online proctoring tools can help mitigate the risk of cheating. Using biometrics to verify students' identity and authorship has also been studied (for example, Young et al., 2019). Security breaches can be an issue for high-stakes testing and certification exams, where keeping test items secure is crucial. Proper training and communication with students can help promote ethical behavior during online assessments; however, ongoing research and development in this area will be important to ensure the integrity and validity of online assessments.

Learning Analytics. Recent calls for data-driven decision-making have prompted considerable interest in learning analytics. Research in this area is concerned with ways to personalize instruction. This includes the topics of stealth assessment and non-intrusive assessment data collection. With learning analytics, creating and using dashboards to communicate essential learning accomplishments and areas for improvement is particularly important. This includes identifying at-risk students and monitoring student progress with real-time student achievement results and engagement updates. Additional research is also needed to address student privacy and confidentiality concerns regarding the information we collect about students.
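
As a small illustration of the kind of logic a learning-analytics dashboard might use, the sketch below flags at-risk students from a few engagement and achievement indicators. The student records, indicators, and thresholds are all hypothetical; real systems would rely on validated indicators and institution-specific criteria.

    # Invented student records with a few hypothetical indicators.
    students = [
        {"name": "Student A", "quiz_avg": 0.82, "logins_last_week": 5, "missing_assignments": 0},
        {"name": "Student B", "quiz_avg": 0.58, "logins_last_week": 1, "missing_assignments": 3},
        {"name": "Student C", "quiz_avg": 0.71, "logins_last_week": 0, "missing_assignments": 1},
    ]

    def at_risk(s, quiz_cutoff=0.65, min_logins=2, max_missing=2):
        """Flag a student when any hypothetical indicator crosses its threshold."""
        return (s["quiz_avg"] < quiz_cutoff
                or s["logins_last_week"] < min_logins
                or s["missing_assignments"] > max_missing)

    # A dashboard would visualize these flags; here we simply print them.
    for s in students:
        status = "at risk" if at_risk(s) else "on track"
        print(f"{s['name']}: {status}")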

Automated Tutoring Systems. Providing feedback is an important function of the assessment process. Results from assessments can provide the information students need to resolve misconceptions, increase their understanding, and improve their skills. Timely feedback is essential for effective learning. Automating the feedback process can improve the speed and consistency of our assessment feedback. However, while more timely, many of these automated systems are less effective than human feedback in providing personalized, context-specific, and actionable feedback. Much of the research in this area relates to artificial intelligence and machine learning. However, critics point out that inappropriate applications and overreliance on artificial intelligence to provide feedback can lead to trained incompetence rather than increasing students' ability. Research in this area will be important to ensure that automated feedback is accurate and administered appropriately.

References

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.

Davies, R. (2021). Establishing and developing professional evaluator dispositions. Canadian Journal of Program Evaluation, 35(3).

Gagné, R. M. (1965). The conditions of learning (1st ed.). New York: Holt, Rinehart & Winston.

Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775.

Mager, R. F. (1984). Preparing instructional objectives (2nd ed.). Belmont, CA: David S. Lake.

Reiser, R. A., & Dempsey, J. V. (Eds.). (2012). Trends and issues in instructional design and technology. Boston: Pearson.

Simon, T., & Binet, A. (1916). The development of intelligence in children (E. S. Kite, Trans.). Vineland, NJ: The Training School (Publication No. 11).

Tyler, R. W. (2013). Basic principles of curriculum and instruction. In Curriculum studies reader E2 (pp. 60-68). Routledge.

Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Alexandria, VA: Association for Supervision and Curriculum Development (ASCD).

Young, J., Davies, R., Jenkins, J., & Pfleger, I. (2019). Keystroke dynamics: Establishing keyprints to verify users in online courses. Computers in the Schools, 36(1), 1–21.