In the 1960s, we had Latent Trait Analysis (Birnbaum, A., in Lord & Novick, 1968); in the 1970s, Darrell Bock lobbied for a new label, ‘Item Response Theory’ (IRT), as more descriptive of what was actually going on. He was right about conventional psychometric practice in the US; he was wrong to condone it as the reasonable thing to do. I am uncertain in what sense he used the word ‘theory’. We don’t seem to be talking about “a well-confirmed explanation of nature, in a way consistent with scientific method, and fulfilling the criteria required by modern science.” Maybe a theory of measurement would be better than a theory of item responses.
The ‘item response matrix’ is a means to an end. It is nothing more or less than rows and columns of ones and zeros recording which students answered which items correctly and which incorrectly.[1] It is certainly useful in the classroom for grading students and culling items; however, it is population-dependent, and therefore scientifically rather uninteresting, like correlations and factor analysis. My interest in it is to harvest whatever the matrix can tell me about some underlying aspect of the students.
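For concreteness, here is a minimal, entirely hypothetical example of such a matrix, with rows as students and columns as items, where $x_{ni} = 1$ records a correct response by student $n$ to item $i$ and $0$ an incorrect one:

\[
X = \begin{pmatrix}
1 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]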
The focus of IRT is to estimate parameters that “best explain” the ones and zeros. “Explain” is used here in the rather sterile statistical sense of ‘minimize the residual errors’, which has long been a respectable activity in statistical practice. Once IRTists are done with their model estimation, the ‘tests of fit’ are typically likelihood ratio tests, which are relevant to the question, “Have we included enough parameters and built a big enough model?” and less relevant to “Why didn’t that item work?” or “Why did these students miss those items?”
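In that statistical sense, ‘explaining’ amounts to choosing parameter values that maximize the likelihood of the observed ones and zeros. As a sketch, for any dichotomous model in which $P_{ni}$ is the modeled probability that student $n$ answers item $i$ correctly, the quantity being maximized is

\[
L = \prod_{n}\prod_{i} P_{ni}^{\,x_{ni}} \left(1 - P_{ni}\right)^{1 - x_{ni}},
\]

and a likelihood ratio test simply compares the maximized $L$ of a smaller model against that of a larger one. It speaks to whether the extra parameters earned their keep, not to why a particular item or student misbehaved.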
‘Latent Trait Analysis’ is a more descriptive term for the harvesting of relevant information when our focus is on the aspect of the student that caused us to start this entire venture. Once all the information is extracted from the data, we will have located the students on an interval scale for a well-defined trait (the grading part). Anything left over is useful and essential for ‘controlling the model’, i.e., selecting, revising, or culling items and diagnosing students. Parameter estimation is only the first step toward understanding what happened: ‘explaining’ in the sense of ‘illuminating the underlying mechanism’.
“Grist” is grain separated from the chaff. The ones and zeros are grist for the measurement mill but several steps from sliced bread. The item response matrix is data carefully elicited from appropriate students using appropriate items to expose, isolate, and abstract the specific trait we are after. In addition to the chaff of extraneous factors, we need to deal with the nuisance parameters and insect parts as well.
Wright’s oft-misunderstood phrases “person-free item calibration” and “item-free person measurement” (Wright, B. D., 1968) notwithstanding, it is still necessary to subject students to items before we make any inferences about the underlying trait. The problem is to make inferences that are “sample-freed”. The key, the sine qua non, and Rasch’s contribution that makes this possible, was a class of models with separable parameters, leading to sufficient statistics and culminating in Specific Objectivity.
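A sketch of why separability matters. In the dichotomous Rasch model, the probability that student $n$ with ability $\beta_n$ answers item $i$ of difficulty $\delta_i$ correctly is

\[
P(x_{ni} = 1 \mid \beta_n, \delta_i) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}},
\]

so the likelihood factors in such a way that the raw score $r_n = \sum_i x_{ni}$ is a sufficient statistic for $\beta_n$ and the item score $s_i = \sum_n x_{ni}$ is sufficient for $\delta_i$. Conditioning on the raw scores removes the person parameters from the item estimation entirely, which is what allows item calibrations that do not depend on the particular sample of students observed.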
Describing Rasch latent trait models as a special case of three-parameter logistic (3PL) IRT is Algebra 101, but it fails to recognize the power and significance of sufficient statistics and the generality of his theory of measurement. Rasch’s Specific Objectivity is the culmination of the quest for Thurstone’s holy grail of ‘fundamental measurement’.
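The ‘Algebra 101’ observation amounts to this: the 3PL model,

\[
P(x_{ni} = 1) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_n - b_i)}}{1 + e^{a_i(\theta_n - b_i)}},
\]

reduces to the Rasch form when every discrimination $a_i = 1$ and every guessing parameter $c_i = 0$. What the reduction does not show is that only the constrained form retains separable parameters and sufficient statistics, and with them the claim to Specific Objectivity.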
[1] For the moment, we will stick with the dichotomous case and talk in terms of educational assessment because it’s what I know, although it doesn’t change the argument to think of polytomous responses in non-educational applications.