Psychometric Testing and Measurement Exam Questions

Posted by Anonymous and classified in Psychology and Sociology

Written on June 1, 2026 in English with a size of 213.31 KB

Final Exam Item Pool - Part 1

Section 1: Levels of Measurement

1. A developmental psychologist records children’s ages in full years. A colleague argues this variable should be treated as ordinal because one cannot be certain children aged 6 and 7 differ by “exactly one year” in psychological maturity. Which response best evaluates this argument?

a) The colleague is correct; developmental maturity is inherently ordinal.
b) The colleague is incorrect; age in years has equal intervals and a true zero, making it ratio.
c) The colleague is correct because ratio measurement requires a psychological, not just physical, zero point.
d) The colleague is incorrect because classifying a variable depends on conceptual intent, not physical properties.

2. Which of the following operations is mathematically permissible for interval-level data but NOT for ordinal-level data?

a) Ranking scores from highest to lowest.
b) Computing the arithmetic mean.
c) Determining the mode.
d) Establishing whether one score is greater than another.

3. A clinical researcher assesses pain using a Numeric Rating Scale (NRS) where 0 = “no pain” and 10 = “worst pain imaginable.” A patient scores 8 on Day 1 and 4 on Day 10.

(a) Identify the most defensible level of measurement for the NRS and justify your answer.

Answer: Ordinal. The NRS uses numbers to rank pain intensity, but the intervals between adjacent values cannot be assumed to be equal, so it cannot be treated as an interval or ratio scale.

(b) Is it appropriate to conclude the patient’s pain was “cut in half”? Explain using measurement theory concepts.

Answer: No, we cannot conclude that the patient’s pain was cut in half. Although the pain score decreased from 8 to 4, the NRS is ordinal, meaning the numbers indicate order but do not represent equal psychological intervals. Ratio statements such as “cut in half” require equal intervals and a true zero. Because the spacing between pain ratings is not guaranteed to be uniform, the change reflects directional improvement, not a proportional reduction.

4. For each variable below, identify the level of measurement and justify your answer in one sentence.

Country of birth (e.g., USA, Canada, Mexico)
Answer: Nominal. Countries are distinct categories of birthplace with no inherent order and no numerical meaning.
Military rank (Private, Corporal, Sergeant, Lieutenant)
Answer: Ordinal. The ranks form a built-in hierarchy with a clear order, but the distances between adjacent ranks are not equal.
Number of siblings
Answer: Ratio—it is a count with equal intervals between values and has a meaningful zero point (0 = no siblings). You can also make meaningful ratio comparisons, such as saying you have twice as many siblings as someone else.
Calendar year (e.g., 2020, 2021, 2022)
Answer: Interval—calendar year has the property of equal intervals between values, but the zero point is arbitrary rather than representing an absolute absence of time.
Reaction time in milliseconds
Answer: Ratio. Time in milliseconds has equal intervals between values and a meaningful zero point representing no elapsed time.

Section 2: Cognitive Item Writing

5. A high school teacher wants to measure students’ ability to apply the laws of thermodynamics to novel real-world situations. Which item format is most appropriate for this purpose?

a) True-false, because it allows the teacher to cover all thermodynamic laws efficiently.
b) Matching, because it tests the association between laws and their definitions.
c) Multiple-choice with application-level stems, because well-constructed items can require students to reason through unfamiliar scenarios.
d) Short-answer restricted-response essay, because it requires students to produce rather than recognize an answer.

6. Which of the following statements about distractors in multiple-choice items is most accurate?

a) Distractors should be randomly generated to ensure fairness across examinees.
b) Effective distractors reflect common student misconceptions or plausible errors.
c) Distractors should be shorter than the correct answer to reduce test-taker bias.
d) The more distractors an item has, the more reliable the test will be.

7. A teacher writes the following true-false item:

“Multiple-choice tests are always better than essay tests because they are objective and easy to score.”

Identify at least two guidelines violated by this item. For each, name the guideline, explain the problem it creates, and suggest a correction.

Answer: The item violates the guideline against specific determiners. The word “always” is an absolute that test-wise students recognize as a cue that the statement is probably false, so the item can be answered without knowing the content. A correction is to remove the word.

The item also violates the guideline that true-false statements be based on fact rather than opinion. Whether one format is “better” is a judgment, so the item cannot be keyed as unambiguously true or false. To fix this, drop the comparison with essay tests and state only a factual description, for example: “Multiple-choice tests can be scored objectively.”

8. Examine the multiple-choice item below. Identify all guideline violations, explain what problem each creates for measurement quality, and rewrite the item so every flaw is corrected.

Which of the following is NOT true about validity?

a) Validity refers to whether a test measures what it claims to measure.
b) Tests are always either valid or invalid.
c) All of the above.
d) Validity is a unitary concept supported by multiple sources of evidence.

Critique:

The stem is stated negatively rather than positively; because the negative word is capitalized for emphasis, this is a minor issue.
The alternatives differ in length, which can cue test-wise examinees that the longest option is correct.
The alternatives are not grammatically parallel with each other or with the stem. Option B stands out because it does not begin with “Validity...”
The placement of “All of the above” is confusing. Because it appears as option C, test-takers may wonder whether it refers only to A and B, which are literally “above,” or to every option including D.
Option B uses the specific determiner “always,” an absolute term that can cue test-wise examinees.

Rewritten:

Which of the following is NOT true about validity?

a) Validity refers to whether a test measures what it claims to measure.
b) Validity is a property that classifies tests as either valid or invalid.
c) Validity is a unitary concept supported by multiple sources of evidence.
d) All of the above.

Section 3: Noncognitive Item Writing

9. Which of the following best describes acquiescence response bias?

a) Respondents choose extreme response options regardless of item content.
b) Respondents agree with statements regardless of their actual beliefs or attitudes.
c) Respondents choose the socially desirable response rather than their true attitude.
d) Respondents become fatigued and begin selecting responses randomly.

10. Which of the following strategies is most effective in reducing socially desirable responding on a noncognitive measure?

a) Using dichotomous (yes/no) response scales instead of Likert-type scales.
b) Ensuring confidentiality, using indirect or third-person phrasing, and including validity scales.
c) Randomizing item order so respondents cannot detect the construct being measured.
d) Including only positively worded items to encourage honest agreement.

11. A researcher develops a Likert-type workplace satisfaction scale and writes the item: “I am satisfied with my pay, my colleagues, and the overall company culture.”

(a) Identify the specific guideline violated.

Answer: This question is double-barreled (or triple-barreled).

(b) Explain the consequences of this flaw for data quality.

Answer: Participants may agree with one part of the item but not another, so a single response cannot show which aspect (pay, colleagues, or culture) it refers to. The resulting scores are ambiguous and do not accurately capture the participant's satisfaction with any one aspect.

(c) Rewrite the item correctly.

Answer: I am satisfied with the overall work environment of my company.

12. Explain the difference between faking good and faking bad in noncognitive assessment. For each:

(a) Describe one real-world context where it is most likely to occur.

(b) Explain one method a researcher could use to detect it.

(c) Describe one design decision a test developer could make to reduce its impact.

Answer for Faking Good: Faking good occurs when respondents choose socially desirable responses to present themselves more favorably than is accurate, often because they are aware of their own shortcomings and do not want to reveal them. It is most likely on employment or personality tests, where the respondent wants to make a good impression. One method to detect it is to administer an additional social desirability measure; participants with high scores can be flagged and, if necessary, removed from the dataset. To reduce its impact, a developer could assure respondents of confidentiality or use indirect or third-person phrasing. Balancing positively and negatively worded items is also common, although that mainly discourages yea- and nay-saying rather than faking.

Answer for Faking Bad: Faking bad, also known as malingering, is the tendency for respondents to portray themselves as more disturbed or pathological than they actually are. It can occur in personality assessment when a respondent is being evaluated as part of a criminal prosecution and fabricates mental illness in hopes of a reduced sentence. A researcher can detect malingering with the MMPI-2 F scale, which contains items drawn from multiple pathologies, so it is unusual for a genuine respondent to answer most of them in the pathological direction. A developer could reduce malingering by using items subtle enough that respondents cannot tell which response indicates pathology.

Section 4: Norms and Standardized Scores

13. A student scores at the 72nd percentile on a standardized reading test. Which of the following is the most accurate interpretation?

a) The student answered 72% of the items correctly.
b) The student scored higher than 72% of the students in the normative sample.
c) The student’s score is 72 points above the mean.
d) The student’s score falls within the top 28% of all possible scores on the test.

14. A psychologist reports that a child’s standard score on an intelligence test is 85, where the mean is 100 and the standard deviation is 15. Which of the following best describes this performance?

a) The child scored one standard deviation above the mean.
b) The child scored one standard deviation below the mean.
c) The child scored at the mean.
d) The child scored two standard deviations below the mean.

15. A test has a mean of 50 and a standard deviation of 10. A student receives a raw score of 65.

(a) Calculate the student’s z-score and the corresponding T-score. Show all work.

Formula: z = (x - μ) / σ

Calculation:
z = (65 - 50) / 10 = 15 / 10 = 1.5
T = z(10) + 50 = 1.5(10) + 50 = 15 + 50 = 65

(b) Interpret the T-score in plain language appropriate for a parent-teacher conference.

Answer: T = 65. On the T-score scale the average is 50, so this student scored well above average: one and a half standard deviations above the mean. If scores are roughly normally distributed, that is higher than about 93% of the comparison group.
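The arithmetic in item 15 can be checked with a few lines of Python. This is a minimal sketch: the function names `z_score` and `t_score` are illustrative, and the percentile step assumes scores are normally distributed, which the item itself does not state.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Standardize a raw score: z = (x - mean) / sd."""
    return (raw - mean) / sd

def t_score(z):
    """Re-express a z-score on the T metric (mean 50, SD 10)."""
    return 50 + 10 * z

z = z_score(65, mean=50, sd=10)
t = t_score(z)
# Assumption: normal distribution, used only to approximate the percentile.
percentile = NormalDist().cdf(z)

print(z, t)  # 1.5 65.0
print(round(percentile, 3))
```

Because this test already has a mean of 50 and a standard deviation of 10, the T-score equals the raw score; on a test with a different mean or SD the two would differ.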

16. A parent is told her child earned a Grade Equivalent (GE) score of 7.4 on a reading test administered in the spring of 4th grade. The parent concludes her child should be placed in 7th grade reading.

(a) Explain why this interpretation is incorrect by describing two specific problems with Grade Equivalent scores.

Answer: Grade Equivalent scores are based on the average performance of students at each grade level. One problem is that they assume knowledge grows linearly across grades, which is unlikely to be true. A second problem is that they are sample-dependent, because they are based only on a sample of students within each grade level. The parent's interpretation is therefore incorrect: a GE of 7.4 on a reading test given to 4th graders does not mean the child can keep up with 7th graders. The score was earned on 4th-grade material, so it shows that her child is performing above same-grade peers, not necessarily at a 7th-grade instructional level.

(b) Recommend a more appropriate standardized score to report to the parent and explain why it communicates the child’s performance more accurately.

Answer: A percentile rank would be a more appropriate standardized score here because it would allow the parent to see how the child performed in comparison to their peers of the same age and grade at the same time of year. This would be clearer because it explains that the student is being compared to peers and not other students in higher grade levels with different instructional expectations.

Section 5: Classical Test Theory

17. A researcher administers a 20-item vocabulary test and obtains a reliability coefficient of ρ = 0.84. According to CTT, what proportion of the observed score variance is attributable to true score variance?

a) 0.16
b) 0.42
c) 0.84
d) 0.92

18. Which of the following statements about random error in Classical Test Theory is correct?

a) Random errors are systematic and therefore predictable across test occasions.
b) Random errors have an expected value of zero across a large number of administrations.
c) Systematic errors cancel out when scores are averaged across many test-takers.
d) CTT assumes that random and systematic errors are positively correlated with true scores.

19. Explain the concept of the Standard Error of Measurement (SEM) in CTT.

(a) State the formula and explain what each component represents.

Formula: SEM = σ_E = σ_X * √(1 - ρ_XX)

Explanation: SEM (σ_E) is the standard error of measurement, that is, the standard deviation of the error scores; σ_X is the standard deviation of the observed test scores; and ρ_XX is the reliability coefficient of the test.

(b) Explain what SEM tells us about an individual observed score.

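Answer: SEM estimates how much an individual's observed score would vary around their true score across repeated testing, so it indicates how precise a single observed score is. SEM becomes smaller as the reliability of the test increases, so a more reliable test gives a more precise estimate of the person’s true score.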

(c) Construct a 68% confidence interval around an observed score of 72, given SEM = 4. Interpret the interval in plain language.

Answer: 72 ± 4, or (68, 76). If this person were tested many times with the same test, about 68% of their observed scores would be expected to fall within one SEM of their true score. The interval from 68 to 76 therefore gives a plausible range for the true score, given measurement error.
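The SEM formula and the interval in part (c) can be sketched in Python as follows. The standard deviation of 10 and the reliability of 0.84 in the first call are hypothetical values chosen only because they produce an SEM of about 4; item 19 gives the SEM directly.

```python
import math

def sem(sd_x, reliability):
    """Standard error of measurement: sd_x * sqrt(1 - reliability)."""
    return sd_x * math.sqrt(1 - reliability)

def confidence_interval(observed, sem_value, z=1.0):
    """Interval of +/- z SEMs around an observed score (z = 1 is about 68%)."""
    return (observed - z * sem_value, observed + z * sem_value)

# Hypothetical inputs (not from item 19): SD = 10, reliability = 0.84.
example_sem = sem(10, 0.84)

# Values from item 19(c): observed score 72, SEM = 4.
ci_68 = confidence_interval(72, 4)           # one SEM either side
ci_95 = confidence_interval(72, 4, z=1.96)   # about 95%

print(ci_68)  # (68.0, 76.0)
```

Widening the multiplier from 1 to 1.96 trades precision for confidence: the interval roughly doubles in width.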

20. A test developer wants to lengthen a 20-item test to improve its reliability. The current reliability is ρ = 0.60.

(a) Use the Spearman-Brown prophecy formula to calculate the reliability of a 60-item version of the test. Show all work.

Calculation:
K = 3 (3 times the length of the old test)
Reliability = (3 * 0.60) / [1 + (3 - 1) * 0.60]
= 1.8 / [1 + 2 * 0.60]
= 1.8 / [1 + 1.2]
= 1.8 / 2.2
≈ 0.82

(b) State one assumption that must hold for this prediction to be accurate.

Answer: One assumption that must hold for this prediction to be accurate is that the new test items are parallel to the original items. This means the added items measure the same construct and have equal true-score and error variances, ensuring that reliability increases in the way the formula predicts.

(c) A colleague suggests that simply adding more items always improves reliability. Describe one situation in which adding items would not improve reliability, and explain why.

Answer: One situation in which this would not improve reliability is if the added items do not measure the same constructs as the original items. Reliability reflects the consistency of measurement of a single construct, and adding irrelevant or off-construct items increases error variance. This violates the assumption of parallel items required for the Spearman-Brown formula, and reliability will not improve.
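The Spearman-Brown calculation from part (a) can be written as a short Python function. This is a sketch under the same parallel-items assumption as part (b); the helper `length_factor_needed` is an illustrative addition that rearranges the same formula, and the target of 0.90 is a hypothetical value.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k."""
    return (k * reliability) / (1 + (k - 1) * reliability)

def length_factor_needed(reliability, target):
    """Rearranged formula: factor k required to reach a target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))

# Item 20(a): 20 items at 0.60, lengthened to 60 items (k = 3).
predicted = spearman_brown(0.60, k=60 / 20)
print(round(predicted, 2))  # 0.82

# Hypothetical follow-up: how much longer for a reliability of 0.90?
k_needed = length_factor_needed(0.60, target=0.90)
```

With these inputs the required factor works out to about 6, that is, roughly 120 parallel items, which shows the diminishing returns of lengthening a test.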

Section 6: Interrater Agreement and Reliability

21. A researcher obtains a Cohen’s Kappa of 0.35 for two raters assessing anxiety disorders using a structured interview. Which of the following is the most appropriate interpretation?

a) Raters agree perfectly; Kappa of 0.35 corresponds to 35% agreement.
b) Agreement is fair and only slightly above what would be expected by chance.
c) Agreement is substantial and acceptable for clinical use.
d) Raters disagree more than expected by chance.

22. Which of the following scenarios calls for the use of the Intraclass Correlation Coefficient (ICC) rather than Cohen’s Kappa?

a) Two psychiatrists independently assign patients to one of five diagnostic categories (nominal scale).
b) Three raters score student essays on a continuous scale from 1 to 100.
c) Two coders classify interview responses into mutually exclusive thematic categories.
d) Two nurses independently rate patients as either “at risk” or “not at risk” for falls.

23. Two clinical psychologists independently rate six therapy sessions for empathy on a 5-point scale. The data are shown below.

Session	Rater 1	Rater 2
1	4	3
2	2	2
3	5	4
4	3	5
5	4	4
6	1	2

(a) Calculate the percent exact agreement between the two raters.

Answer: Out of 6 sessions, the raters agreed exactly on Session 2 (2, 2) and Session 5 (4, 4).
Percent exact agreement = 2 / 6 = 33.3%

Cross-tabulation of the ratings (rows = Rater 1, columns = Rater 2; cells count sessions):
R1 \ R2	2	3	4	5
1	1	0	0	0
2	1	0	0	0
3	0	0	0	1
4	0	1	1	0
5	0	0	1	0

(b) Identify one limitation of percent agreement as the sole index of interrater reliability for these data.

Answer: Percent agreement does not take chance agreement into account. For example, even with random guessing, raters will agree a certain percentage of the time by chance alone depending on the number of categories.

(c) Which index would be more appropriate given the scale type? Justify your choice.

Answer: Since the empathy scale is ordinal (a 5-point scale), a weighted Kappa or an Intraclass Correlation Coefficient (ICC) would be more appropriate. These indices account for the ordered nature of the data and penalize larger disagreements more heavily than smaller disagreements.
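To make parts (a) through (c) concrete, the sketch below computes percent exact agreement, unweighted Cohen's Kappa, and a weighted Kappa from the six sessions in the table. The choice of linear disagreement weights (|i - j|) is an assumption; the answer above does not say which weighting scheme is intended.

```python
from collections import Counter

rater1 = [4, 2, 5, 3, 4, 1]
rater2 = [3, 2, 4, 5, 4, 2]
n = len(rater1)
categories = range(1, 6)  # the 5-point empathy scale

# (a) Percent exact agreement.
p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n

# (b) Chance agreement from each rater's marginal counts, then Kappa.
m1, m2 = Counter(rater1), Counter(rater2)
p_chance = sum(m1[c] * m2[c] for c in categories) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

# (c) Weighted Kappa, assuming linear weights: bigger gaps count more.
observed_disagreement = sum(abs(a - b) for a, b in zip(rater1, rater2)) / n
expected_disagreement = sum(
    m1[i] * m2[j] * abs(i - j) for i in categories for j in categories
) / n**2
weighted_kappa = 1 - observed_disagreement / expected_disagreement

print(round(p_observed, 3), round(kappa, 3), round(weighted_kappa, 3))
```

For these data the unweighted Kappa is much lower than the raw agreement, and the weighted version is higher than the unweighted one because most disagreements are only one scale point apart.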

24. Explain the difference between interrater reliability and interrater agreement.

(a) Define each concept and explain how they differ conceptually.

Answer: Interrater agreement is the extent to which different raters assign exactly the same rating. Interrater reliability is the degree to which the ratings of different raters place individuals in the same relative rank order. They differ in what they assess: reliability concerns the consistency of the overall pattern of scores across the people or objects rated, while agreement concerns whether the specific scores assigned to each one match.

(b) Provide an original example in which two raters could have high interrater reliability but low interrater agreement.

Answer: An example would be scoring an essay with a 10-point rubric system. High interrater reliability would come from raters' scores being highly correlated in their scoring pattern. For example, Rater 1 scores three essays as 2, 9, and 4. Rater 2 scores the same essays as 3, 10, and 5. The rank order is identical, yielding perfect reliability, but they have 0% exact agreement because Rater 2 consistently scores exactly 1 point higher than Rater 1.

(c) Explain why this distinction matters for measurement practice, particularly in clinical or educational settings.

Answer: This matters in clinical and educational settings for diagnostic and grading purposes. It is, at the bare minimum, important that there is interrater reliability so that relative performance is preserved. However, if exact cut-scores are used for decisions (e.g., passing a test or receiving a clinical diagnosis), low interrater agreement can lead to inconsistent outcomes for individuals depending on which rater they get.
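The essay example from part (b) can be verified numerically. The sketch below uses the Pearson correlation as the reliability index, which is an assumption for illustration; other consistency indices would lead to the same conclusion for these data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

rater1 = [2, 9, 4]   # scores from part (b)
rater2 = [3, 10, 5]  # always exactly one point higher

consistency = pearson_r(rater1, rater2)
exact_agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

print(round(consistency, 3), exact_agreement)  # 1.0 0.0
```

A constant one-point offset leaves the correlation untouched but removes all exact agreement, which is why a cut-score decision could differ depending on the rater.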

Final Exam Item Pool - Part 2

Section 1: Validity

1. The statement “A test is valid if it measures what it purports to measure” has become controversial among validity theorists. Explain why, and describe how contemporary frameworks have expanded or revised this definition.

Answer: The traditional view of validity is considered too narrow because it treats validity as an inherent property of the test itself. Contemporary frameworks emphasize that validity is about the interpretations and uses of test scores, not the test instrument. Modern validity theory defines validity as the degree to which evidence and theory support the intended interpretations of test scores for specific purposes. This shifts the focus from what a test measures to an ongoing process of gathering evidence to support the rationale of interpretations.

2. According to Messick, there are two primary sources of invalidity. Identify each source, provide a definition, and give one example in which each source of invalidity would be a concern for test interpretation.

Answer: One is construct underrepresentation, which occurs when the assessment is too narrow and fails to include important dimensions of the construct. An example of when this would be a concern is a math test on two-step equations that does not include questions with a broad range of operations (e.g., only including addition and subtraction, but omitting multiplication and division).

The other is construct-irrelevant variance, which occurs when the assessment is too broad and contains too much variance that is not relevant to the intended construct. An example of this is when students are supposed to write a timed essay to assess their ability to analyze a passage, but the strict time limit makes it a measure of writing speed rather than analytical ability.

3. Which source of validity evidence is most relevant when evaluating how well a test predicts future performance?

a) Content
b) Response process
c) Internal structure
d) Relations to other variables

4. What is the main difference between convergent and discriminant validity?

a) Convergent validity compares tests with different constructs; discriminant validity compares tests with the same construct.
b) Convergent validity is about the correlation with unrelated constructs; discriminant validity is about correlation with related constructs.
c) Convergent validity shows high correlations with related constructs; discriminant validity shows low correlations with unrelated constructs.
d) There is no difference; both terms refer to predictive validity.

5. The administration at a university discovers a correlation of .80 between first-year graduate GPA and self-reported frequency of coffee consumption prior to admission. They propose abandoning current admissions procedures and simply asking applicants to report their coffee consumption.

(a) Identify what is problematic about this proposal.

Answer: A correlation does not imply causation; drinking coffee does not guarantee or cause a higher GPA. Additionally, there could be many confounding factors that are not accounted for. Furthermore, coffee consumption is not conceptually relevant to academic ability, and because it is self-reported, it is highly susceptible to faking once high stakes are attached.

(b) Choose either Messick’s or Kane’s validity framework and use it to explain why this evidence alone does not support the proposed interpretation or use of scores.

Answer: Under Messick’s framework, construct validity absorbs all other forms of validity evidence. Since coffee consumption is not relevant to the academic constructs being measured in the typical admissions process, it is not a valid basis for decision-making. Messick emphasizes that empirical associations that are not aligned with the construct are insufficient. This threat to validity is a clear example of construct-irrelevant variance.

6. A researcher constructs a bathroom scale and offers the following score-based interpretation: “When objects are placed on this scale, variation in observed scores reflects differences in weight.” Describe at least one type of validity evidence that could be used to investigate this interpretation and explain how you would collect it.

[Figure: photograph of a mechanical bathroom scale with a black textured platform and a round dial at the top. The dial shows weights from 0 to 280 pounds, with colored segments marking different weight ranges.]

Answer: Evidence based on relations to other variables (specifically convergent validity) could be used. To collect this, you would weigh a set of objects on the new scale and also on a gold-standard, highly calibrated laboratory scale. A high correlation between the two sets of measurements would provide strong validity evidence for the proposed interpretation.

7. A high squared correlation between true scores and observed scores indicates that a test is valid. True or False? Explain your response using relevant measurement concepts.

Answer: This is false; it actually indicates that a test is reliable. In CTT, the squared correlation between true and observed scores represents the reliability coefficient. A test is not guaranteed to be valid just because this coefficient is high. Reliability reflects the consistency of the measurement (how much of the observed score variance is attributable to true score variance rather than error), whereas validity concerns whether the test measures the intended construct.

Section 2: Exploratory Factor Analysis (EFA)

8. Which of the following is a wrong reason for performing rotations in EFA?

a) To make the factors easier to interpret by increasing the simplicity of the factor structure.
b) To ensure that each variable has equal contribution to each factor.
c) To achieve a more meaningful and interpretable factor solution.
d) To make the factor loadings more clearly related to a smaller number of variables.

9. Which of the following statements about factor loadings in EFA is incorrect?

a) Factor loadings represent the correlation between an observed variable and a latent factor.
b) A high factor loading means the variable is a strong indicator of the factor.
c) Factor loadings can never be negative because correlations are always positive.
d) Variables can have high loadings on more than one factor in an oblique rotation.

10. Which of the following statements about uniqueness in EFA is incorrect?

a) Uniqueness represents the variance in an observed variable that is not explained by the common factors.
b) A variable with a uniqueness of 0.10 means that 90% of its variance is unexplained.
c) Uniqueness includes both specific variance and error variance.
d) High uniqueness may indicate that a variable is not well represented by the factor solution.

11. Which of the following statements about eigenvalues in EFA is incorrect?

a) Eigenvalues represent the amount of variance explained by each factor in the model.
b) A factor with an eigenvalue greater than 1 is always considered a meaningful factor in the analysis.
c) Eigenvalues are used to determine how many factors to retain in EFA.
d) Eigenvalues reflect the total variance in the observed variables accounted for by the factor.

12. Use the correlation matrix below to answer this question.

	X1	X2	X3	X4	X5	X6
X1	1.0
X2	.7	1.0
X3	.75	.8	1.0
X4	.05	.02	.05	1.0
X5	.02	.01	.04	.76	1.0
X6	.03	.02	.03	.78	.80	1.0

(a) Which of the following path diagrams corresponds best to the factor solution the matrix above would yield? Explain why you chose the diagram.

Answer: Model 2 best fits the correlation matrix because it shows two clear clusters of variables: {X1, X2, X3} and {X4, X5, X6}. Those two clusters represent two distinct factors. I also chose Model 2 because it does not have an arrow correlating the two factors with each other. The correlation between the factors themselves is low because the correlations across variable clusters (e.g., X2 with X5) are very close to zero.
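The two-cluster pattern described in this answer can be checked directly from the matrix in item 12. The sketch below is not a factor analysis; it only averages the correlations within and between the two proposed clusters, which is an informal way to see why a two-factor, uncorrelated solution is plausible.

```python
from itertools import combinations, product

# Lower triangle of the correlation matrix from item 12.
r = {
    ("X1", "X2"): 0.70, ("X1", "X3"): 0.75, ("X2", "X3"): 0.80,
    ("X1", "X4"): 0.05, ("X2", "X4"): 0.02, ("X3", "X4"): 0.05,
    ("X1", "X5"): 0.02, ("X2", "X5"): 0.01, ("X3", "X5"): 0.04,
    ("X1", "X6"): 0.03, ("X2", "X6"): 0.02, ("X3", "X6"): 0.03,
    ("X4", "X5"): 0.76, ("X4", "X6"): 0.78, ("X5", "X6"): 0.80,
}

def mean_corr(pairs):
    """Average correlation over an iterable of (variable, variable) pairs."""
    pairs = list(pairs)
    return sum(r[pair] for pair in pairs) / len(pairs)

cluster_1 = ["X1", "X2", "X3"]
cluster_2 = ["X4", "X5", "X6"]

within_1 = mean_corr(combinations(cluster_1, 2))
within_2 = mean_corr(combinations(cluster_2, 2))
between = mean_corr(product(cluster_1, cluster_2))

print(round(within_1, 2), round(within_2, 2), round(between, 2))  # 0.75 0.78 0.03
```

High averages within each cluster and a near-zero average between them are what one would expect if two factors underlie the six variables and the factors are essentially uncorrelated.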

[Figure: path diagrams of the candidate models (including Model 1 and Model 2); not reproduced.]

(b) In Model 1 below, explain what is represented by each of the following:

The circles: The latent factors.
The arrow labeled “a”: The correlation present between the two factors.
The arrow labeled “b”: The factor coefficients or loadings, signifying the relation between a factor and its observed variable.

13. In a standard EFA path diagram, explain what is represented by each of the following components:

(a) The circles (or ovals): The latent factors/unobserved variables that account for the covariance among the observed items.

(b) The arrows pointing from a factor to an observed variable: Factor loadings, which represent the strength and direction of the relationship between the factor and the observed variable.

(c) The arrows pointing from a unique factor to an observed variable: Uniqueness, describing the variance in each observed variable not explained by the common factors.

Section 3: Item Analysis

14. What does a high item-total correlation suggest about an item?

a) The item has poor discrimination.
b) The item is likely redundant with others.
c) The item aligns well with the overall test score.
d) The item is too easy or too hard.

15. Which of the following would be considered a poor discrimination index for a cognitive item?

a) 0.45
b) 0.30
c) 0.10
d) 0.65

16. If an item has negative discrimination, what might this indicate?

a) The item is too easy for most test-takers.
b) High-performing students are more likely to get the item wrong than low-performing students.
c) The item has high reliability.
d) The item’s distractors are all non-functional.

17. Which of the following is NOT a common method for evaluating item quality?

a) Item-total correlation.
b) Cronbach’s alpha if item deleted.
c) Test-retest reliability.
d) Distractor analysis.

18. Describe two differences between item analysis when applied to cognitive versus noncognitive items.

Answer: The two main differences are their focus and goal. Cognitive item analysis focuses on item difficulty, discrimination, and distractor functioning, with the goal of differentiating high- and low-ability examinees. Noncognitive item analysis focuses on item-total correlations, inter-item correlations, response distributions, and internal consistency, with the goal of assessing how well items reflect an underlying trait or attitude.

19. Describe the procedures involved in distractor analysis. Explain how distractor analysis contributes to evaluating and improving cognitive test items.

Answer: Distractor analysis involves checking how often each incorrect option is selected, and whether low-scoring or high-scoring examinees were more likely to select it. This helps developers identify non-functional distractors, misleading wording, or items that are keyed incorrectly. By showing which options work and which do not, distractor analysis helps improve the quality and discrimination of cognitive test items.
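The procedure described above can be sketched for a single item. All response data here are hypothetical: ten examinees in an upper-scoring group and ten in a lower-scoring group, a four-option item, and option "b" as the keyed answer.

```python
from collections import Counter

KEY = "b"          # hypothetical keyed answer
OPTIONS = "abcd"

# Hypothetical option choices for the top and bottom scoring groups.
upper_group = ["b"] * 8 + ["a", "c"]
lower_group = ["b"] * 3 + ["a"] * 4 + ["c"] * 3

upper_counts = Counter(upper_group)
lower_counts = Counter(lower_group)

# Upper-lower discrimination index for the keyed option.
discrimination = (
    upper_counts[KEY] / len(upper_group) - lower_counts[KEY] / len(lower_group)
)

# A distractor nobody selects is non-functional and should be revised.
nonfunctional = [
    opt for opt in OPTIONS
    if opt != KEY and upper_counts[opt] + lower_counts[opt] == 0
]

for opt in OPTIONS:
    print(opt, "upper:", upper_counts[opt], "lower:", lower_counts[opt])
```

In this made-up example the working distractors ("a" and "c") attract more lower-group than upper-group examinees, which is the desired pattern, while "d" is never chosen. A distractor chosen more often by the upper group would instead suggest ambiguous wording or a miskeyed item.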

20. What does it mean if Cronbach’s alpha increases when a specific item is deleted from a scale? What action, if any, should a test developer take, and what considerations should guide that decision?

Answer: If Cronbach's alpha increases when an item is deleted, it means that the item is hurting the internal consistency of the scale. This indicates that the item does not align well with the construct measured by the other items. A test developer should review the item for issues such as poor wording, low variance, or multidimensionality. The item could be revised or removed, but the decision should be guided by content validity, theoretical importance, and whether the item captures an essential aspect of the construct.
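The "alpha if item deleted" statistic can be illustrated with a small example. The data below are hypothetical (five respondents, four items), constructed so that the fourth item does not track the other three; the function names are illustrative.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    k = len(rows[0])
    item_variances = [variance(item) for item in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def alpha_if_deleted(rows, index):
    """Alpha recomputed after dropping the item at position `index`."""
    reduced = [[v for j, v in enumerate(row) if j != index] for row in rows]
    return cronbach_alpha(reduced)

# Hypothetical responses: one row per respondent, one column per item.
data = [
    [1, 2, 1, 4],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 1],
    [5, 5, 4, 5],
]

alpha_all = cronbach_alpha(data)
alpha_without_item4 = alpha_if_deleted(data, 3)

print(round(alpha_all, 3), round(alpha_without_item4, 3))
```

Here alpha rises when the fourth item is dropped, flagging it for review. As the answer notes, the statistic alone does not settle whether to remove the item; content coverage still has to be weighed.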

Section 4: Item Response Theory

21. In the 1-parameter logistic model (1PL), what is the only parameter estimated for each item?

a) Discrimination
b) Difficulty
c) Guessing
d) Speed

22. What is the difference between the Standard Error of Measurement (SEM) in CTT and the standard error of proficiency estimation in IRT? In your response, address how each varies (or does not vary) across the score scale and what implications this has for test interpretation.

Answer: The SEM in CTT reflects the average measurement error in an observed score and is assumed to be constant across the entire score scale for all examinees. This implies that a test measures examinees of extreme ability just as precisely as those of average ability, which is often unrealistic. In contrast, the standard error of proficiency (SEP) in IRT varies across the proficiency scale because it depends on the information provided by the items at each ability level: the more information at a given level, the smaller the standard error there. This means we can identify where on the scale the test is most precise (typically where the item difficulties are targeted) and interpret scores in less precise regions more cautiously.
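The contrast can be illustrated numerically. The sketch below assumes the usual CTT formula SEM = SD * sqrt(1 - reliability) and, for IRT, a 2PL model in which item information is a^2 * P * (1 - P) and the standard error is 1 / sqrt(test information). The function names and item parameters are hypothetical.

```python
import math


def sem_ctt(sd, reliability):
    """CTT standard error of measurement: one value for the whole score scale."""
    return sd * math.sqrt(1 - reliability)


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))


def se_theta(theta, items):
    """IRT standard error at theta: 1 / sqrt(sum of item information)."""
    info = 0.0
    for a, b in items:
        p = p_2pl(theta, a, b)
        info += a * a * p * (1 - p)
    return 1 / math.sqrt(info)


# Hypothetical test: three items (discrimination a, difficulty b) targeted near theta = 0.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]

print(f"CTT SEM (SD=15, reliability=.91): {sem_ctt(15, 0.91):.2f}")
for theta in (-3, 0, 3):
    print(f"IRT SE at theta={theta:+d}: {se_theta(theta, items):.2f}")
```

The CTT value is the same for every examinee, whereas the IRT standard error is smallest near theta = 0, where these items are targeted, and grows toward the extremes.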

23. Describe what is modeled in an Item Characteristic Curve (ICC).

(a) Identify the quantity shown on each axis and explain what it represents.

Answer: The y-axis represents the probability of answering the item correctly (ranging from 0 to 1). The x-axis represents the proficiency or ability level of the examinee (theta), usually expressed on a standardized scale (mean 0, SD 1) and typically plotted from about -3 to +3, although theta is not bounded in principle.

(b) Describe the differences between the 1PL, 2PL, and 3PL models. Specify which parameter(s) each model estimates and how additional parameters change the shape of the ICC.

Answer: The 1PL model estimates only item difficulty, meaning all ICCs have the same slope and only shift horizontally. The 2PL model estimates both difficulty and discrimination, allowing the slopes of the ICCs to vary (steeper curves indicate higher discrimination). The 3PL model estimates difficulty, discrimination, and a pseudo-guessing parameter, which raises the lower asymptote of the ICC above zero to account for the probability of low-ability examinees guessing the correct answer.
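The three models can be written as one function, with the simpler models as special cases. This is a sketch under the common logistic parameterization P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))); the function name icc and the parameter values are hypothetical.

```python
import math


def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    b: difficulty, a: discrimination, c: pseudo-guessing (lower asymptote).
    1PL: a fixed at a common value and c = 0.  2PL: a free, c = 0.
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


for theta in (-3, -1, 0, 1, 3):
    p1 = icc(theta, b=0.0)                 # 1PL: difficulty only
    p2 = icc(theta, b=0.0, a=2.0)          # 2PL: steeper slope
    p3 = icc(theta, b=0.0, a=2.0, c=0.2)   # 3PL: lower asymptote at 0.2
    print(f"theta={theta:+d}  1PL={p1:.2f}  2PL={p2:.2f}  3PL={p3:.2f}")
```

With c = 0 the curve passes through .50 at theta = b. With c = 0.2 the probability at theta = b is .60, the midpoint between the lower asymptote and 1, and very low-ability examinees still succeed about 20% of the time.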

24. Use the ICC below to answer the following questions.

[Figure: item characteristic curves for the items referenced below; image not available]

(a) Which item is least discriminating? How do you know?

Answer: Item 13, because its curve is the flattest (lowest slope).

(b) Which item is most difficult? How do you know?

Answer: Item 16, because its curve is shifted furthest to the right, requiring a higher level of ability to achieve a 50% probability of a correct response.

(c) In your own words, define the term local independence and explain why it is a fundamental assumption of IRT.

Answer: Local independence assumes that, after controlling for examinee proficiency, responses to different items are statistically independent. This is a fundamental assumption because it allows the joint probability of a set of item responses to be calculated as the product of the individual item probabilities.
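The product rule that local independence makes possible can be shown with a short sketch. It assumes a 1PL model with the discrimination fixed at 1; the function names and item difficulties are hypothetical.

```python
import math
from itertools import product


def p_correct(theta, b):
    """1PL probability of a correct response (discrimination fixed at 1)."""
    return 1 / (1 + math.exp(-(theta - b)))


def pattern_likelihood(theta, difficulties, responses):
    """Probability of a whole response pattern at a given theta.

    Under local independence this is the product of the per-item
    probabilities: P for a correct response (1), 1 - P for an incorrect one (0).
    """
    likelihood = 1.0
    for b, u in zip(difficulties, responses):
        p = p_correct(theta, b)
        likelihood *= p if u == 1 else 1 - p
    return likelihood


difficulties = [-1.0, 0.0, 1.0]  # hypothetical item difficulties
print(pattern_likelihood(0.0, difficulties, [1, 1, 0]))

# The probabilities of all 2^3 possible patterns sum to 1 at any theta.
total = sum(pattern_likelihood(0.0, difficulties, list(u))
            for u in product([0, 1], repeat=3))
print(total)
```

This product is the likelihood that IRT estimation maximizes over theta, which is why violations of local independence (for example, items sharing a reading passage) can distort proficiency estimates and their standard errors.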

25. Why might the 3PL model be preferred over the 1PL or 2PL models for multiple-choice tests in some contexts? What is the trade-off in choosing the 3PL model?

Answer: The 3PL model may be preferred because it accounts for guessing, which is a real factor in multiple-choice testing: low-ability examinees still have a nonzero chance of selecting the correct option. The trade-off is that the 3PL model estimates an additional parameter per item and therefore requires much larger sample sizes to obtain stable parameter estimates than the simpler 1PL and 2PL models.

Section 5: Test Bias and Ethics

26. Which of the following best describes differential item functioning (DIF)?

a) An item that has low difficulty for most test-takers.
b) An item that measures multiple constructs simultaneously.
c) An item that disadvantages certain groups despite equal ability on the construct.
d) An item that has a high guessing parameter.

27. What statistical method is most commonly used to detect item bias?

a) Factor analysis.
b) Regression analysis.
c) Cronbach’s alpha.
d) Differential item functioning (DIF) analysis.

28. Male students obtain a higher mean score on a test than female students. Does this alone constitute test bias? Why or why not? In your response, distinguish between impact and bias, and explain what additional evidence would be needed to conclude that the test is biased.

Answer: No, this alone does not constitute bias. This difference is called impact, which refers to group differences in average scores that may reflect true differences in the underlying construct. In contrast, bias occurs when a test functions differently for groups who have the same underlying ability level. To conclude that a test is biased, we would need evidence of differential item functioning (DIF), showing that examinees of equal ability from different groups have different probabilities of answering an item correctly.

29. Suppose you are investigating differential item functioning (DIF) on a standardized test administered to two groups.

(a) Describe what specific statistical information would indicate that an item exhibits DIF.

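Answer: DIF exists if examinees from two groups who have the same underlying ability have different probabilities of answering correctly. Statistically, this is indicated by ICCs that differ between the two groups after the groups are placed on the same ability scale, by a significant Mantel-Haenszel statistic (a common odds ratio that departs from 1 across matched ability levels), or by a significant group effect (uniform DIF) or group-by-ability interaction (non-uniform DIF) in a logistic regression analysis.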
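As a rough illustration of the Mantel-Haenszel approach, the sketch below computes the common odds ratio across matched score levels and converts it to the ETS delta metric (-2.35 times the natural log of the odds ratio). The function names and the counts are hypothetical, and a full analysis would also include the accompanying chi-square significance test, which is omitted here.

```python
import math


def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    strata: one tuple per matched score level, in the order
    (reference correct, reference incorrect, focal correct, focal incorrect).
    A value of 1 means equal odds of success for matched examinees.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def mh_delta(odds_ratio):
    """ETS delta metric; negative values indicate the item favors the reference group."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical counts at two matched total-score levels.
strata = [(30, 10, 20, 20), (15, 5, 10, 10)]
alpha_mh = mh_odds_ratio(strata)
print(f"MH odds ratio: {alpha_mh:.2f}")
print(f"MH delta:      {mh_delta(alpha_mh):.2f}")
```

In these made-up counts, matched reference-group examinees have three times the odds of answering correctly, so the item would be flagged for review.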

(b) If an item is flagged for DIF, what steps should a test developer take before deciding whether to retain, revise, or remove the item?

Answer: A developer should first replicate the analysis using a different DIF detection method (e.g., Mantel-Haenszel) to ensure the finding is robust. They should evaluate the effect size of the DIF and determine if it is uniform or non-uniform. Finally, a panel of content experts should review the item to determine if the source of the DIF is construct-irrelevant or if it represents a valid aspect of the construct being measured.

	X1	X2	X3	X4	X5	X6
X1	1.0
X2	.7	1.0
X3	.75	.8	1.0
X4	.05	.02	.05	1.0
X5	.02	.01	.04	.76	1.0
X6	.03	.02	.03	.78	.80	1.0
