Lexiles: the making of a measure

PDF download: Using Lexiles Safely

A recent conversation with a former colleague (it was more of a lecture) about what psychometricians don’t understand about students and education led me to resurrect an article that I wrote for the Rasch Measurement Transactions four or five years ago. It deals specifically with Lexiles© but it is really about how one defines and uses measures in education and science.

The antagonism toward Lexiles in particular and Rasch measures in general is an opportunity to highlight some distinctions between measurement and analysis and between a measure and an assessment. Often when trying to discuss the development of reading proficiency, specialists in measurement and reading seem to be talking at cross-purposes. Reverting to argument by metaphor, measurement specialists are talking about measuring weight; and reading specialists, about providing proper nutrition.

There is a great deal involved in physical development that is not captured when we measure a child’s weight, and the process of measuring weight tells us nothing about whether the result is good, bad, or normal; whether you should continue on as you are, schedule a doctor’s appointment, or go to the emergency room without changing your underwear. Evaluation of the result is an analysis that comes after the measurement and depends on the result being a measure. No one would suggest that, because it doesn’t define health, weight is not worth measuring, or that it is too politically sensitive to talk about in front of nutritionists. A high number does not imply good nutrition, nor does a low number imply poor nutrition. Nonetheless, the measurement of weight is always a part of an assessment of well-being.

A Lexile score, applied to a person, is a measure of reading ability[i], by which I mean the capability to decode words, sentences, paragraphs, and Supreme Court decisions. A Lexile score, applied to a text, is a measure of how difficult the text is to decode. Hemingway’s “For Whom the Bell Tolls” (840 Lexile score) has been cited as an instance where Lexiles do not work: the book was written for adults, the argument goes, so something must be wrong if a 50th-percentile sixth-grade reader can engage with the text. This counter-example, if true, is an interesting case. I have two counter-counter-arguments: first, all measuring instruments have limitations to their use and, second, Lexiles may be describing Hemingway appropriately.

First, outside the context of Lexiles, there is always difficulty for either humans or computer algorithms in scoring exceptional, highly creative writing. (I would venture to guess that many publishers, who make their livings recognizing good writing[ii], would reject Hemingway, Joyce, or Faulkner-like manuscripts if they received them from unknown authors.) I don’t think it follows that we should avoid trying to evaluate exceptional writing. But we do need to know the limits of our instruments.

I rely, on a daily basis, on a bathroom scale. I rely on it even though I believe I shouldn’t use it on the moon, under water, or for elephants or measuring height. It does not undermine the validity of Lexiles in general to discover an extraordinary case for which they do not apply. We need to know the limits of our instrument: when it produces valid measures and when it does not.

Second, given that we have defined the Lexile for a text as the difficulty of decoding the words and sentences, the Lexile analyzer may be doing exactly what it should with a Hemingway text. Decoding the words and sentences in Hemingway is not that hard: the vocabulary is simple, the sentences short. That’s pretty much what a Lexile score reflects.

Understanding or appreciating Hemingway is something else again. This may be getting into the distinction between reading ability, as I defined it, and reading comprehension, as the specialists define that. You must be able to read (i.e., decode) before you can comprehend. Analogously, you have to be able to do arithmetic before you can solve math word problems[iii]. The latter requires the former; the former does not guarantee the latter.

The Lexile metric is a true developmental scale that is not related to instructional method or materials, or to grade-level content standards. The metric reflects increasing ability to read, in the narrow sense of decode, increasingly advanced text. As students advance through the reading/language arts curriculum, they should progress up the Lexile scale. Effective, even standards-based, instruction in ELA[iv] should cause them to progress on the Lexile scale; analogously, good nutrition should cause children to progress on the weight scale[v].

One could coach children to progress on the weight scale in ways counter to good nutrition[vi]. One might subvert Lexile measurements by coaching students to write with big words and long sentences. This does not invalidate either weight or reading ability as useful things to measure. There do need to be checks to ensure we are effecting what we set out to effect.

The role of standards-based assessment is to identify which constituents of reading ability and reading comprehension are present and which are absent. Understanding imagery and literary devices, locating topic sentences, identifying main ideas, recognizing sarcasm or satire, and comparing authors’ purposes in two passages are within its purview but are not considered in the Lexile score. The Lexile analyzer relies on rather simple surrogates for semantic and syntactic complexity.

The role of measurement on the Lexile scale is to provide a narrowly defined measure of the student’s status on an interval scale that extends over a broad range of reading from Dick and Jane to Scalia and Sotomayor. The Lexile scale does not define reading, recognize the breadth of the ELA curriculum, or replace grade-level content standards-based assessment, but it can help us design instruction and target assessment to be appropriate to the student. We do not expect students to say anything intelligent about text they cannot decode, nor should we attempt to assess their analytic skills using such text.

Jack Stenner (aka, Dr. Lexile) uses this as one of his parables: we don’t buy shoes for a child based on grade level, but we don’t think twice about assigning textbooks with the formula (age – 5). It’s not one-size-fits-all in physical development. Cognitive development is probably no simpler, if only we were able to measure all its components. To paraphrase Ben Wright, how we measure weight has nothing to do with how skillful you are at football, but you better have some measures before you attempt the analysis.

[i] Ability may not be the best choice of a word. As used in psychometrics, ability is a generic placeholder for whatever we are trying to measure about a person. It implies nothing about where it came from, what it is good for, or how much is enough. In this case, we are using reading ability to refer to a very specific skill that must be taught, learned, and practiced.

[ii] It may be more realistic to say they make their livings recognizing marketable writing, but my cynicism may be showing.

[iii] You also have to decode the word problem but that’s not the point of this sentence. We assume, often erroneously, that the difficulty of decoding the text is not an impediment to anyone doing the math.

[iv] Effective instruction in science, social studies, or basketball strategy should cause progress on the Lexile measure as well; perhaps not so directly. Anything that adds to the student’s repertoire of words and ideas should contribute.

[v] For weight, progress often does not equal gain.

[vi] Metaphors, like measuring instruments, have their limits and I may have exceeded one. However, one might consider the extraordinary measures amateur wrestlers or professional models employ to achieve a target weight.


Computer-Administered Tests That May Teach

PDF download: Answer until Correct

One of the political issues with computer-administered tests (CAT) is what to do about examinees who want to revisit, review, and revise earlier responses. Examinees sometimes express frustration when they are not allowed to; psychometricians don’t like the option being available because each item selection is based on previous successes and failures, so changing answers after moving on has the potential of upsetting the psychometric apple cart. One of our more diabolical thinkers has suggested that a clever examinee would intentionally miss several early items, thereby getting an easier test, and return later to fix the intentionally incorrect responses, ensuring more correct answers and presumably a higher ability estimate. While this strategy could sometimes work in the examinee’s favor (if receiving an incorrect estimate is actually in anyone’s favor), it is somewhat limited, because many right answers on an easy test are not necessarily better than fewer right answers on a difficult test, and because a good CAT engine should recover from a bad start given the opportunity. While we might trust in CAT, we should still row away from the rocks.

The core issue for educational measurement is test as contest versus a useful self-assessment. When the assessments are infrequent and high stakes with potentially dire consequences for students, schools, districts, administrators, and teachers, there is little incentive not to look for a rumored edge whenever possible[1]. Frequent, low-stakes tests with immediate feedback could actually be valued and helpful to both students and teachers. There is research, for example, suggesting that taking a quiz is more effective for improved understanding and retention than rereading the material.

The issue of revisiting can be avoided, even with high stakes, if we don’t let the examinee leave an item until the response is correct. First, present a multiple choice item (hopefully more creatively than putting a digitized image of a print item on a screen). If we get the right response, we say “Congratulations” or “Good work” and move on to the next item. If the response is incorrect, we give some kind of feedback, ranging from “Nope, what are you thinking?” to “Interesting but not what we’re looking for” or perhaps some discussion of why it isn’t what we’re looking for (recommended). Then we re-present the item with the selected, incorrect foil omitted. Repeat. The last response from the examinee will always be the correct one, which might even be retained.
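For concreteness, here is a minimal sketch of that loop in R, with sample() standing in for the examinee, who picks among the remaining choices in proportion to hypothetical attractiveness values; the score it returns is the subject of the next paragraph:

# Answer-until-correct for one item: after each wrong response, re-present
# the item with the chosen distractor removed; stop at the correct choice.
# The simulated examinee picks among the remaining foils in proportion to
# their original attractiveness (hypothetical probabilities; key is foil 1).
answer_until_correct = function(p_foil, key = 1) {
  remaining = seq_along(p_foil)
  repeat {
    pick = sample.int(length(remaining), 1, prob = p_foil[remaining])
    choice = remaining[pick]
    if (choice == key) return(length(remaining) - 1)  # distractors left
    remaining = setdiff(remaining, choice)            # omit the chosen foil
  }
}
answer_until_correct(c(0.5, 0.3, 0.1, 0.1))   # returns 3, 2, 1, or 0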

The examinee’s score on the item is the number of distractors remaining when we finally get to the correct response[2]. Calibration of the thresholds can be quick and dirty. It is convenient for me here to use the “rating scale” form for the logit, [b_v – (d_i + t_ij)]. The highest threshold, associated with giving the correct response on the first attempt, is the same as the logit difficulty of the original multiple choice item, because that is exactly the situation we are in, so t_im = 0 for an item with m distractors (i.e., m + 1 foils). The logits for the other thresholds depend on the attractiveness of the distractors. (Usually, when written in this form, the t_ij sum to zero, but that’s not helpful here.)

To make things easy for myself, I will use a hypothetical example of a four-choice item with equally popular distractors. The difficulty of the item is captured in the d_i and doesn’t come into the thresholds. Assuming an item with a p-value of 0.5 and equally attractive distractors, the incorrect responses will be spread across the three distractors, with 17% on each. After one incorrect response, we expect the typical examinee to have a [0.5 / (0.5 + 0.17 + 0.17)] = 0.6 chance of success on the second try. A 0.6 chance of success corresponds to a logit difficulty of ln [(1 – 0.6) / 0.6] = –0.4. Similarly for the third attempt, the probability of success is [0.5 / (0.5 + 0.17)] = 0.75 and the logit difficulty is ln [(1 – 0.75) / 0.75] = –1.1. All of which gives us the three thresholds t = {–1.1, –0.4, 0.0}.
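A few lines of R reproduce that arithmetic, and make it easy to rerun for other p-values or numbers of distractors (a sketch, not production calibration code):

# Thresholds for answer-until-correct with m equally attractive distractors.
p = 0.5                        # p-value of the original multiple choice item
m = 3                          # number of distractors
q = (1 - p) / m                # share of each distractor
p_try = p / (p + q * (m:1))    # chance of success on attempts 1, 2, ..., m
t = log((1 - p_try) / p_try)   # logit threshold for each attempt
round(t, 1)                    # 0.0 -0.4 -1.1: the thresholds in the text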

This was easy because I assumed distractors that are equally attractive across the ability continuum; then the order in which they are eliminated doesn’t matter in the arithmetic. With other patterns, it is more laborious but no more profound. If, for example, we have an item like:

  1. Litmus turns what color in acid?
    a. red
    b. blue
    c. black
    d. white,

we could see probabilities across the foils like (0.5, 0.4, 0.07, and 0.03) for the standard examinee. There is one way to answer correctly on the first attempt and score 3; this is the original multiple choice item and the probability of this is still 0.5. There are, assuming we didn’t succeed on the first attempt, three ways to score 2 (ba, ca, and da) that we would need to evaluate. And even more paths to scores of 1 or zero, which I’m not going to list.
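The evaluation is mechanical enough. A sketch for the score-2 paths, renormalizing the remaining foils after each incorrect choice:

# Hypothetical foil probabilities for the litmus item; a is correct.
p = c(a = 0.5, b = 0.4, c = 0.07, d = 0.03)
# P(score = 2): pick one distractor, then the correct answer from the
# renormalized remainder -- the paths ba, ca, and da.
paths = sapply(c("b", "c", "d"), function(f) p[f] * p["a"] / (1 - p[f]))
round(paths, 3)
sum(paths)    # P(score = 2), about 0.39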

Nor does it matter what p-value we start with, although the arithmetic would change. For example, reverting to equally attractive distractors, if we start with p = 0.7 instead of 0.5, the chance of success on the second attempt is 0.78 and on the third is 0.875. This leads to logit thresholds of ln [(1 – 0.78) / 0.78] = –1.25 and ln [(1 – 0.875) / 0.875] = –1.95. There is also a non-zero threshold for the first attempt of ln [(1 – 0.7) / 0.7] = –0.85. This is reverting to the “partial credit” form of the logit, [b_v – d_ij]. To compare to the earlier paragraph requires taking the –0.85 out, so that (–0.85, –1.25, –1.95) becomes –0.85 + (0.0, –0.4, –1.1), as before. I should note that this is not the partial credit or rating scale model, although a lot of the arithmetic turns out to be pretty much the same (see Linacre, 1991). It has been called “Answer until Correct” or the Failure model, because you keep going on the item until you succeed. This contrasts with the Success model[3], where you keep going until you fail. Or maybe I have the names reversed.

Because we don’t let the examinee end on a wrong answer and we provide some feedback along the way, we are running a serious risk that the examinees could learn something during this process with feedback and second chances. This would violate an ancient tenet in assessment that the agent shalt not alter the object, although I’m not sure how the Quantum Mechanics folks feel about this.

[1] Admission, certifying, and licensing tests have other cares and concerns.

[2] We could give a maximum score of one for an immediate correct response and fractional values for the later stages, but using fractional scores would require slightly different machinery and have no effect on the measures.

[3] DBA, the quiz show model.

Simplistic Statistics for Control of Polytomous Items

PDF download: Simplistic Statistics

Several issues ago, I discussed estimating the logit difficulties with a simple pair algorithm, although this is viewed with disdain in some quarters because it’s only least squares and does not involve maximum likelihood or Bayesian estimation. It begins with counting the number of times item a is correct and item b incorrect, and vice versa; then converting the counts to log odds; and finally computing the logit estimates for dichotomous items as the row averages, if the data are sufficiently well behaved. If the data aren’t sufficiently well behaved, it could involve solving some simultaneous equations instead of taking the simple average.
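For the dichotomous case with complete, well-behaved data (no zero counts in any pair), a bare-bones sketch of the row-average branch might look like this:

# Pairwise (least squares) difficulties for dichotomous items from a
# persons-by-items 0/1 matrix X, via the row-average shortcut.
pair_difficulties = function(X) {
  m = ncol(X)
  L = matrix(0, m, m)                       # pairwise log odds
  for (a in 1:(m - 1)) for (b in (a + 1):m) {
    n_ab = sum(X[, a] == 1 & X[, b] == 0)   # a right, b wrong
    n_ba = sum(X[, a] == 0 & X[, b] == 1)   # b right, a wrong
    L[a, b] = log(n_ab / n_ba)              # estimates d_b - d_a
    L[b, a] = -L[a, b]
  }
  -rowMeans(L)                              # logit difficulties, centered at zero
}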

This machinery was readily adaptable to include polytomous items by translating the item scores, 0 to m_i, into the corresponding m_i + 1 Guttman patterns. That is, a five-point item has six possible scores, 0 to 5, and six Guttman patterns (00000, 10000, 11000, 11100, 11110, and 11111). Treating these just like five more dichotomous items allows us to use exactly the same algorithm to compute the logit difficulty (aka, threshold) estimates. (The constraints on the allowable patterns mean there will always be some missing log odds and the row averages never work; polytomous items will always require solving the simultaneous equations, but the computer doesn’t much care.)
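The translation itself is one line in R (a sketch, with the function name made up):

# Expand a polytomous score (0..m) into its m Guttman-pattern dichotomies.
guttman = function(score, m) as.numeric(seq_len(m) <= score)
guttman(3, 5)   # 1 1 1 0 0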

While the pair algorithm leads to some straightforward statistics of its own for controlling the model, my focus is always on the simple person-item residuals, because the symmetry leads naturally to statistics for monitoring the person’s behavior as well as the item’s performance. For dichotomous items, the basic residual is y_ni = x_ni – p_ni, which can be interpreted as the person’s deviation from the item’s p-value. The basic residual can be manipulated and massaged any number of ways; for example, a squared standardized residual z²_ni = (x_ni – p_ni)² / [p_ni (1 – p_ni)], which works out to (1 – p) / p if x_ni = 1 or p / (1 – p) if x_ni = 0, and can be interpreted as the odds against the response.
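In R, the whole family of dichotomous residuals takes a few lines (made-up values for illustration):

b = 0.5                    # person ability, logits (made up)
d = c(-1, 0, 1)            # item difficulties (made up)
x = c(1, 0, 1)             # observed responses
p = plogis(b - d)          # expected values p_ni
y = x - p                  # basic residuals
z2 = y^2 / (p * (1 - p))   # squared standardized residuals:
z2                         # (1-p)/p when x = 1, p/(1-p) when x = 0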

A logical extension to polytomous items (Ludlow, 1982) would be, for the basic residual, y_ni = x_ni – E(x_ni) and, for the standardized residual, z_ni = (x_ni – E(x_ni)) / √Var(x_ni), where x_ni is the observed item score from 0 to m_i, with m_i greater than one. The interpretation of the basic form is now the deviation in item score (which is the same as the deviation in p-value when m_i is one). The interpretation of z² is messier. This approach has been used extensively for the past thirty-plus years, although not exploited as fully as it might be[1]. But there is an alternative that salvages much of the dichotomous machinery, and we have already made dichotomous items out of the polytomous scores.

We’re back in the world of 1’s and 0’s; or maybe we never left. All thresholds are dichotomies where you either pass, succeed, endorse, or do whatever you need to get by, or you don’t. We have an observed value x = 0 or 1, an expected value p in (0, 1), and any form of residual we like, y or z. The following table shows the residuals for the six Guttman patterns, based on a person with logit ability equal to zero and a five-point item with nicely spaced thresholds (–2, –1, 0, 1, 2). Because the thresholds are symmetric and the person is centered on them, there is a lot of repetition; only a few values change from one panel to the next.

Category        1      2      3      4      5
Threshold    -2.0   -1.0    0.0    1.0    2.0     Sum
P(r=k|b=0)   0.13   0.35   0.35   0.13   0.02     1.0*
p(x=1|b=0)   0.88   0.73   0.50   0.27   0.12

x               0      0      0      0      0       0
y            -0.9   -0.7   -0.5   -0.3   -0.1     6.3   (sum, squared)
y²            0.8    0.5    0.3    0.1    0.0     1.6   (sum of squares)
z            -2.7   -1.6   -1.0   -0.6   -0.4    40.2   (sum, squared)
z²            7.4    2.7    1.0    0.4    0.1    11.6   (sum of squares)

x               1      0      0      0      0       1
y             0.1   -0.7   -0.5   -0.3   -0.1     2.3   (sum, squared)
y²            0.0    0.5    0.3    0.1    0.0     0.9   (sum of squares)
z             0.4   -1.6   -1.0   -0.6   -0.4    10.6   (sum, squared)
z²            0.1    2.7    1.0    0.4    0.1     4.4   (sum of squares)

x               1      1      0      0      0       2
y             0.1    0.3   -0.5   -0.3   -0.1     0.3   (sum, squared)
y²            0.0    0.1    0.3    0.1    0.0     0.4   (sum of squares)
z             0.4    0.6   -1.0   -0.6   -0.4     1.0   (sum, squared)
z²            0.1    0.4    1.0    0.4    0.1     2.0   (sum of squares)

x               1      1      1      0      0       3
y             0.1    0.3    0.5   -0.3   -0.1     0.3   (sum, squared)
y²            0.0    0.1    0.3    0.1    0.0     0.4   (sum of squares)
z             0.4    0.6    1.0   -0.6   -0.4     1.0   (sum, squared)
z²            0.1    0.4    1.0    0.4    0.1     2.0   (sum of squares)

x               1      1      1      1      0       4
y             0.1    0.3    0.5    0.7   -0.1     2.3   (sum, squared)
y²            0.0    0.1    0.3    0.5    0.0     0.9   (sum of squares)
z             0.4    0.6    1.0    1.6   -0.4    10.6   (sum, squared)
z²            0.1    0.4    1.0    2.7    0.1     4.4   (sum of squares)

x               1      1      1      1      1       5
y             0.1    0.3    0.5    0.7    0.9     6.3   (sum, squared)
y²            0.0    0.1    0.3    0.5    0.8     1.6   (sum of squares)
z             0.4    0.6    1.0    1.6    2.7    40.2   (sum, squared)
z²            0.1    0.4    1.0    2.7    7.4    11.6   (sum of squares)

*Probabilities sum to one when we include category k=0.
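For anyone who wants to check the arithmetic, a few lines of R reproduce the summary column of any panel; permuting t reproduces the disordered-threshold variants discussed below:

# Residuals for the Guttman patterns of a five-point item, person at b = 0.
b = 0
t = c(-2, -1, 0, 1, 2)            # thresholds; try c(-2, -1, 1, 0, 2) too
p = plogis(b - t)                 # p(x=1|b) at each threshold
for (r in 0:5) {                  # one panel per Guttman pattern / score
  x = as.numeric(seq_along(t) <= r)
  y = x - p
  z = y / sqrt(p * (1 - p))
  cat("r =", r, " sum(y)^2 =", round(sum(y)^2, 1),
      " sum(y^2) =", round(sum(y^2), 1),
      " sum(z)^2 =", round(sum(z)^2, 1),
      " sum(z^2) =", round(sum(z^2), 1), "\n")
}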

Not surprisingly, for a person with a true expected response of 2.5, we are surprised when the person’s response is zero or five; less surprised by responses of one or four; and quite happy with responses of two or three. We would feel pretty much the same looking at [sum(z)]² or almost any other number in the sum column. Not surprisingly, when we look at the numbers for each category, we are surprised when the person is stopped by a low-valued threshold (e.g., the first panel, first column) or not stopped by a high-valued one (the last panel, last column).

That’s what happens with nicely spaced thresholds targeted on the person. If the annoying happens and some thresholds are reversed, the effects on these calculations are less dramatic than one might expect or hope. For example, with thresholds of (-2, -1, 1, 0, 2), the sums of z² for the six Guttman patterns are (11.6, 4.4, 2.0, 4.4, 4.4, and 11.6). Comparing those to the table above, only the fourth value (response r=3) is changed at all (4.4 instead of 2.0). How that would present itself in real data depends on who the people are and how they are distributed. The relevant panel is below; the others are unchanged.

Category        1      2      3      4      5
Threshold    -2.0   -1.0    1.0    0.0    2.0     Sum
P(r=k|b=0)   0.17   0.45   0.17   0.17   0.02     1.0*
p(x=1|b=0)   0.88   0.73   0.27   0.50   0.12

x               1      1      1      0      0       3
y             0.1    0.3    0.7   -0.5   -0.1     0.3   (sum, squared)
y²            0.0    0.1    0.5    0.3    0.0     0.9   (sum of squares)
z             0.4    0.6    1.6   -1.0   -0.4     1.6   (sum, squared)
z²            0.1    0.4    2.7    1.0    0.1     4.4   (sum of squares)

*Probabilities sum to one when we include category k=0.

While there is nothing in the mathematics of the model that says the thresholds must be ordered, it makes the categories, which are ordered, a little puzzling. We are somewhat surprised (z² = 2.7) that the person passed the third threshold but at the same time thought the person had a good chance (y = –0.5) of passing the fourth.

Reversing the last two thresholds (-2, -1, 0, 2, 1) gives similar results; in this case, only the calculations for response r=4 change.

Category        1      2      3      4      5
Threshold    -2.0   -1.0    0.0    2.0    1.0     Sum
P(r=k|b=0)   0.14   0.38   0.38   0.05   0.02     1.0*
p(x=1|b=0)   0.88   0.73   0.50   0.12   0.27

x               1      1      1      1      0       4
y             0.1    0.3    0.5    0.9   -0.3     2.3   (sum, squared)
y²            0.0    0.1    0.3    0.8    0.1     1.2   (sum of squares)
z             0.4    0.6    1.0    2.7   -0.6    16.7   (sum, squared)
z²            0.1    0.4    1.0    7.4    0.4     9.3   (sum of squares)

*Probabilities sum to one when we include category k=0.

This discussion has been more about the person than the item. Given estimates of the person’s logit ability and the item’s thresholds, we can say relatively intelligent things about what we think of the person’s score on the item; we are surprised if difficult thresholds are passed or easy thresholds are missed. Whether or not any of this is visible in the item statistics depends on whether or not there are sufficient numbers of people behaving oddly.

Whether or not the disordered thresholds affect the item mean squares depends on how the item is targeted and the distribution of abilities. Estimation of the threshold logits is still not affected by the ability distribution, which keeps us comfortably in the Rasch family, even if we are a little puzzled.

[1] As with dichotomous items, we tend to sum over items (occasionally people) to get Infit or Outfit and proceed merrily on our way trusting everything is fine.

Computerized Adaptive Test (CAT): the easy part

PDF download: Computerized Adaptive Testing

If you are reading this in the 21st Century and are planning to launch a testing program, you probably aren’t even considering a paper-based test as your primary strategy. And if you are computer-based, there is no reason to consider a fixed form as your primary strategy. A computer-administered and adaptive assessment will be more efficient, more informative, and generally more fun than a one-size-fits-all fixed form. With enough imagination and a strong measurement model, we can escape from the world of the basic, text-heavy, four- or five-foil, multiple-choice item. For the examinee, the test should be a challenging but winnable game. While we may say we prefer ones we can win all the time, the best games are those we win a little more than we lose.

If you live in my SimCity with infinite, calibrated item banks of equally valid and reliable items, people with known logit abilities, and responses from an unfeeling and impersonal random number generator, then Computerized Adaptive Testing (CAT) is not that hard. The challenge of CAT has very little to do with simple logistic models and much to do with logistics and validity. It has to do with how do you get the person and the computer to communicate, how do you ensure security, how do you avoid using the same items over and over, how do you cover the content mandated by the powers that be, how do you replenish and refresh the bank, how do you allow reviewing answered items, how do you use built-in tools like rulers, calculators, dictionaries, and spell checkers, how do you deal with aging hardware, computer crashes, hackers, limited bandwidth, skeptical school boards, nervous teachers, angry parents, gaming examinees, attention-seeking legislators, or investigative “journalists.” In short, how do you ensure a valid assessment for anyone and everyone?

I’m not going to help you with any of that. You should be reading van der Linden[1] and visiting the International Association for Computerized Adaptive Testing[2].

In my simulated world, an infinite item bank means I can always find exactly the item I need. Equally valid items means I can choose any item from the bank without worrying about how it fits into anybody’s test blueprint. Equally reliable items means I can pick the next item based on its logit difficulty, without worrying about maximizing any information function. Actually, in my world of Rasch measurement, picking the next item based on its logit difficulty is the same as maximizing the information function. The standard approach is to administer and score an item, calculate the person’s logit ability based on the items administered so far, retrieve and administer an item that matches the person’s logit (and satisfies any content requirements and other constraints), and repeat until some stopping rule is satisfied. The stopping rule can be that the standard error of measurement is sufficiently small, or the probability of a correct classification is sufficiently large, or you have run out of time, items, or patience.

The process works on paper. The left chart shows the running estimates of ability (red lines) for five simulated people; the black curves are the running estimates of the standard error of measurement. The red lines should be between the black lines two thirds of the time. The black dots are the means of the five examinees. The only stopping rule imposed here was 50 items. The right chart shows the same things for 100 simulated people.

[Charts: running ability estimates (red) and standard error bands (black) for 5 and 100 simulated examinees, starting on target at 0 logits]

With only five people, it’s fairly easy to follow the path of any individual. They tend to vacillate dramatically at the start, but most settle down pretty much between the standard error lines. Given the nature of the business in general, there will always be considerable variability in the estimated measures. With the 50 items that we ended on, the standard error of measurement will be roughly 0.3 logits (no lower than 2/√50), which is hardly laser accuracy, but it is a reliability approaching 0.9 if you are that old school.

We started assuming a logit ability of zero, which is exactly on target and completely general because the items are relative to the person anyway. This may not seem quite fair because we are beginning right where we want to end up. But the first item will either be right or wrong so our next guess will be something different anyway. If we hadn’t started right where we wanted to be, our first step will usually be toward where we should be. For example, if we start one logit away, we get pictures like these:

[Charts: the same displays for 5 and 100 simulated examinees, starting one logit off target]

A curious artifact of this process is that if our starting guess is right, our second guess will be wrong. If our starting guess is wrong, we have a better than 50% chance of moving in the right direction on our second guess; the further off we are, the more likely we are to move in the right direction. Maybe we should always begin off target.

All of which says to me that, when we are off by a logit in the starting location, it doesn’t much matter. On average, it took 5 or 6 items to get on target, which causes one to wonder about the value of a five-item locator test; or maybe that’s exactly what we have done. One implication of optimistically starting one logit high for a person is that there is a good chance the first four or five responses will be wrong, which may not be the best thing to do to a person’s psyche at the outset.

The basic algorithm is: choose the item for step k+1 such that d[k+1] = b[k], where b[k] is the ability estimated from the difficulties of, and responses to, the first k items. There is a start-up problem: we can’t estimate an ability unless we have a number-correct score r greater than zero and less than k. I dealt with this by adjusting the previous difficulty by ±1/√k while r(k – r) = 0. One rationale for this is that the adjustment is something a little less than half a standard error. Another rationale is that the first adjustment will be one logit, and moving one logit changes a 50% probability of the response to about 75% (actually 73%). We made a guess at the person’s location and observed a response. That response is more likely if we assume the person is one logit away from the item rather than exactly equal to it. We’re guessing anyway at this point.
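To make the recipe concrete, here is a stripped-down simulation of this engine in R, under the SimCity assumptions (an infinite bank, so the next item can sit exactly at the current estimate, and rbinom playing the examinee). It is a sketch, not a production engine, and it assumes the final response record is mixed:

# One simulated CAT: d[k+1] = b[k], with the +/- 1/sqrt(k) start-up rule.
mle = function(x, d, b = 0) {           # Newton-Raphson Rasch ability
  for (i in 1:20) {
    p = plogis(b - d)
    b = b + sum(x - p) / sum(p * (1 - p))
  }
  b
}
cat_sim = function(b_true = 0, start = 0, K = 50) {
  d = start                             # difficulty of the first item
  x = rbinom(1, 1, plogis(b_true - d))  # and the response to it
  for (k in 1:(K - 1)) {
    r = sum(x)
    b_hat = if (r == 0) d[k] - 1/sqrt(k) else
            if (r == k) d[k] + 1/sqrt(k) else mle(x, d)
    d = c(d, b_hat)                     # next item right at the estimate
    x = c(x, rbinom(1, 1, plogis(b_true - b_hat)))
  }
  b_fin = mle(x, d)
  p = plogis(b_fin - d)
  c(estimate = b_fin, sem = 1/sqrt(sum(p * (1 - p))))
}
cat_sim(b_true = 1)   # e.g., start at zero for a person at one logit

Replicating cat_sim() a few hundred times reproduces the flavor of the charts above.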

The standard logic, which we used in the simulations, seeks to maximize the information to be gained from the next item by picking the item for which we believe the person has a 50-50 chance of answering correctly. Alternatively, one might stick with the start-up strategy and look only at the most recent item, choosing the next difficulty so as to make the person’s result on the last item likely, without bothering to estimate the person’s ability. The following charts adjust the difficulty by plus or minus one standard error, so that d[k+1] = d[k] ± s[k], where s[k] is the standard error[3] of the logit ability estimate through step k.
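The change to the sketch above is one line. Using the 2/√k approximation to the standard error mentioned in the footnote, a hypothetical next_difficulty() would be:

# Step the next item's difficulty by one (approximate) standard error,
# up after a right answer, down after a wrong one; no mid-test estimation.
next_difficulty = function(d_k, x_k, k) d_k + (2 * x_k - 1) * 2 / sqrt(k)
next_difficulty(0, 1, 10)   # tenth response right: next item at +0.63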

First we tried it starting with a logit of zero:

[Charts: the difficulty-step variant for 5 and 100 simulated examinees, starting at 0 logits]

Then we tried it starting with a logit of one:

[Charts: the difficulty-step variant for 5 and 100 simulated examinees, starting one logit off target]

The pictures for the two methods give the same impression. The results are too similar to cause anyone to pick one over the other and begin rewriting any CAT engines. Or to put it another way, these analyses are too crude to pick a winner or even know if it matters.

The viability of CAT in general and Rasch CAT in particular is sometimes debated on seemingly functional grounds that you need very large item banks to make it work. I don’t buy it[4]. First, if your entire item bank consists of the items from one fixed form, the CAT version will never be worse than the fixed form and may be a little better; the worst that can happen is you administer the entire fixed form. You can do a better job of tailoring if you have the items from two or three fixed forms but we are still a long way from thousands. Second, with computer-generated items and item engineering templates coming of age, items can become far more plentiful and economical. We could even throw crowd sourcing item development into the mix.

Rasch has gotten some bad press here because it is so demanding that it is harder to build huge banks; it requires us to discard or rewrite a lot more items. This is a good thing. A large bank of marginal items isn’t going to help anyone[5]. The extra work up front should result in better measures, teach us something about the aspect we are after, and not fool us into thinking we have a bigger functional bank than we really do.

As with everything Rasch, the arithmetic is too simple to distract us for long from the bigger picture of defining better constructs and developing better items through technology. But that leaves us with plenty to do. Computer administration, in addition to helping us pick a more efficient next item, creates a whole new universe of possible item types beyond anything Terman (or Mead but maybe not Binet) could have envisioned and is much more exciting than minimizing the number of items administered.

The main barriers to the universal use of CAT have been hardware, misunderstanding, and politics. The hardware issue is fading fast or has morphed into how to manage all the hardware we have. Misunderstanding and politics are harder to dismiss or even separate. Those aren’t my purview or mission today. Well, maybe misunderstanding.


[1] van der Linden, W. J. (2007). The shadow-test approach: A universal framework for implementing adaptive testing. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing.

[2] www.iacat.org

[3] We are somewhat kidding ourselves when we say we didn’t need to bother estimating the person’s logit ability at every step of the way because we need that ability to calculate the standard error and check the stopping rule. We could approximate the standard error with 2/√k (or 1/√k or 2.5/√k; nothing here suggests it matters very much) but that doesn’t avoid the when to stop question.

[4] I will concede a very large item bank is nice and desirable if it is filled with nice items.

[5] In its favor, any self-respecting 3pl engine will try to avoid the marginal items but it would be better for everyone if they didn’t get in the bank in the first place. It has never been explained to me why you would put the third (guessing) parameter in a CAT model, where we should steer clear of the asymptotes.

Examinee Report with Scaffolding and a Few Numbers

Sample Report: This report is intended to be interactive but there is a limit to what I can do in this platform. You need to have this in hand to make sense of what I am about to say.

PDF Version of what I am about to say: New Report

The sample report is hardly the be-all and end-all of examinee reports. It would probably make any graphics designer cry, but it does have the important elements: identification, results, details, and discussion. While I have crammed it onto one 8.5×11 page, its highest and best incarnation would be interactive. The first block is the minimum, which would be enough to satisfy some certifying organizations if not the psychometricians. The remaining information could be retrieved en masse for the old pros or element by element to avoid overwhelming more timid data analysts. All (almost[1]) the examinee information needed to create the report[2] can be found in the vector of residuals: y_ni = x_ni – p_ni.

Comments in the sample are intended to be illustrative of the type of comments that should be provided: more positive than negative, supportive not critical. They should reinforce what is shown in the numbers and charts, but should also provide insights into things not necessarily obvious and suggest what one should be considering. Real educators in the real situation will without doubt have better language and more insights.

There should also be pop-ups for definitions of terms and explanations of charts and calculations, for those who choose to go that route. I have hinted at the nature of those help functions in the endnotes of the report. The complete Help list should not be what I think needs explaining but should come from the questions and problems of real users.

There are many issues that I have not addressed. Most testing agencies will want to put some promotional or branding information on the screen or sheet. That should never include the contact information for the item writers or psychometricians, but can include the name of the governor or state superintendent for education. I have also omitted any discussion of more important issues like how to present longitudinal data, which should become more valuable, or how to deal with multiple content areas. There’s a limit to what will go on one page but that should not be a restriction in the 21st Century. Nor should the use of color graphics.

Discussion for the Non-Faint of Heart

This report was generated for a person taking a rather sloppy computer adaptive test, with 50 multiple choice items and five item clusters, four with 10 or 11 items and one [E] with 8 items. One cluster [E] was manipulated to disadvantage this candidate. I call it rather sloppy because the items cover a four logit range, and I doubt I would have stopped the CAT with the person sitting right on the decision line. (Administering one more item at the 500-GRit level would have a 50% probability of dropping Ronald below the Competent line.) Nonetheless, plus and minus two logits is sufficiently narrow to make it unlikely that any individual responses will be very surprising, i.e., it is difficult to detect anomalies. Or maybe it excludes the possibility of anomalies happening. I’ll take the latter position.

For reporting a certification test, all you really need is the first block, the one with no numbers. It answers the question the candidate wants answered; in this case, the way the candidate wanted it answered. The psychometrician’s guild requires the second block, the one with the numbers, to give some sense of our confidence in the result. Neither of these blocks is very innovative.

To be at all practicable, the four paragraphs of text need to be generated by the computer, but they aren’t complicated enough to require much intelligence, artificial or otherwise. The first paragraph, one sentence long, exists in two forms: the Congratulations version and the Sorry, not good enough version. Then the computer needs to stick in the name, measure, and level variables, and it’s good to go.

The first paragraph under ‘Comments’ uses the Plausible Range and Likelihood of Levels values to determine the message, depending on whether the candidate was nervously just over a line, annoyingly almost to a line, or comfortably in between.

Paragraph two relies on the total mean square (either unweighted or weighted, outfit or infit) to decide how much the GRit scale can be used to interpret the test result. In this simulated case, the candidate is almost too well behaved (unweighted mean square = 0.79) so it is completely justifiable to describe the person’s status by what is below and what is above the person’s measure. The chart that shows the GRit scale, the keyword descriptors, the person’s location and Plausible Range, and the item residuals has everything this paragraph is trying to say without worrying about any mean squares.

Paragraph three uses the Between Cluster Mean Squares to decide if there are variations worth talking about. [In the real world, the topic labels would be more informative than ABC and should be explained in a pop-up help box.] In this case, the cluster mean squares (i.e., the sum of y, squared, divided by the sum of pq, over the items in the cluster) are 2.7 and 1.9 for clusters E and C, which are on the margin of surprise.
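A sketch of that statistic, given hypothetical vectors of residuals y, variances pq, and cluster labels:

# Between-cluster signal: (sum of residuals)^2 / (sum of variances), by cluster.
cluster_ms = function(y, pq, cluster)
  sapply(split(seq_along(y), cluster), function(i) sum(y[i])^2 / sum(pq[i]))
# e.g., cluster_ms(y, p * (1 - p), rep(c("A","B","C","D","E"), c(10,10,11,11,8)))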

With a little experience, a person could infer all of the comments from the plots of residuals without referring to any numbers; the numbers’ primary value is to organize the charts for people who want to understand the examinee and to distill the data for computers that generate the comments. Because the mean squares are functions of the number of items and distributions of ability, I am disinclined to provide any real guidelines to determine what should be flagged and commented on. Late in life, I have become a proponent of quantile regressions to help establish what is alarming rather than large simulation studies that never seem to quite match reality.

A Very Small Simulation Study

With that disclaimer, the sample candidate that we have been dissecting is a somewhat arbitrarily chosen examinee (number four) from a simulation study of a grand total of ten examinees. The data were generated using an ability of 0.0 logits (500 GRits), difficulties uniformly distributed between -2 and +2 logits (318 to 682 GRits), and a disturbance of one logit added to cluster E. A disturbance of one logit means those eight items were one logit more difficult for the simulated examinees than for the calibrating sample that produced the logit difficulties in our item bank. The table below has some basic statistics, averaged over the ten replications.

            Measure   StDev   Infit   Outfit   Cluster M.S.
Observed      -0.24    0.42    0.97     0.99           1.60
Model          0.00    0.32    1.00     1.00           1.00

                       A       B       C       D       E
Number of Items       10      10      11      11       8
M.S. by Cluster     1.05    1.07    1.57    0.71    2.03
P-value change      0.10    0.01    0.11   -0.07   -0.14
Logit change       -0.42   -0.04   -0.45    0.27    0.79

The total mean squares (Infit and Outfit) look fine. The Cluster mean square (1.60) and the mean squares by cluster begin to suggest a problem, particularly for the E cluster (2.03). This also shows, necessarily, in the change in p-value (0.14 lower for E) and the change in logit difficulty (0.79 harder for E). It would be nice if we had gotten back the one logit disturbance that we put in, but that isn’t the way things work. Because the residual analysis begins with the person’s estimated ability, the residuals have to sum to zero, which means that if one cluster becomes more difficult, the others, on average, will appear easier. Thus, even though we know the disturbance is all in cluster E, there are weaker effects bouncing around the others. The statistician has no way of knowing what the real effect is, just that there are differences.

The most disturbing, or perhaps just annoying, number in the table is the observed mean for the measure. This, for the average of 10 people, is -0.24 logits (478 GRits) when it should have been 0.0 (aka 500). (While we actually observed a measure of 500 for the examinee we used in the sample report, that didn’t happen in general.) We might want to consider leaving Cluster E out of the measure to get a better estimate of the person’s true location, or we might want to identify the people for whom it is a problem and correct the problem rather than avoid it. For a certifying exam, we probably wouldn’t consider dropping the cluster, unless it is affecting an identifiable protected class, if we think the content is important and not addressable in another way.

And as Einstein famously said, “Everything should be explained as simply as possible, and not one bit simpler.” That’s not necessarily my policy.

[1] We would also need to be provided the person’s logit ability.

[2] The non-examinee information includes the performance level criteria and the keyword descriptors. If we have the logit ability, we can deduce the logit difficulties from the residuals, which frees us even more from fixed forms. Obviously, if we know the difficulties and residuals, we can find the ability.

Probability Time Trials

It has come to my attention that I write the basic Rasch probability in half a dozen different forms; half of them are in logits (aka, log odds) and half are in the exponential metric (aka, odds). My two favorites for exposition are, in logits, exp(b – d) / [1 + exp(b – d)] and, in exponentials, B / (B + D), where B = e^b and D = e^d. The second of these I find the most intuitive: the probability in favor of the person is the person’s odds divided by the sum of the person’s and item’s odds. The first, the logit form, may be the most natural because logits are the units used for the measures and exhibit the interval scale properties, and this form emphasizes the basic relationship between the person and item.

There are variations on each of these forms, like [B / D] / [1 + B / D] and 1 / [1 + D / B], which are simple algebraic manipulations. The forms are all equivalent; the choice of which to use is simply convention, personal preference, or perhaps computing efficiency, but that has nothing to do with how we talk to each other, only how we talk to the computer. Computing efficiency means minimizing the calls to the log and exponential functions, which causes me to work internally mostly in the exponential metric and to do input and output in logits.

These deliberations led to a small time trial to provide a partial answer to the efficiency question in R. I first set up some basic parameters and defined a function to compute 100,000 probabilities. (When you consider a state-wide assessment, which can range from a few thousand to a few hundred thousand students per grade, that’s not a very big number. If I were more serious, I would use a timer with more precision than whole seconds.)

> b = 1.5; d = -1.5
> B = exp(b); D = exp(d)
> timetrial = function (b, d, N=100000, Prob) { p = numeric(N); for (k in 1:N) p[k] = Prob(b, d) }

Then I ran timetrial, which evaluates its Prob argument 100,000 times, for each of seven forms of the probability; the first three and the seventh use logits; four, five, and six use exponentials.

> date()
[1] "Tue Jan 06 11:49:00 "
> timetrial(b, d, Prob = function(b, d) 1 / (1 + exp(d - b)))          # 26 seconds
> date()
[1] "Tue Jan 06 11:49:26 "
> timetrial(b, d, Prob = function(b, d) exp(b - d) / (1 + exp(b - d))) # 27 seconds
> date()
[1] "Tue Jan 06 11:49:53 "
> timetrial(b, d, Prob = function(b, d) exp(b) / (exp(b) + exp(d)))    # 27 seconds
> date()
[1] "Tue Jan 06 11:50:20 "
> timetrial(b, d, Prob = function(b, d) 1 / (1 + D/B))                 # 26 seconds
> date()
[1] "Tue Jan 06 11:50:46 "
> timetrial(b, d, Prob = function(b, d) (B/D) / (1 + B/D))             # 27 seconds
> date()
[1] "Tue Jan 06 11:51:13 "
> timetrial(b, d, Prob = function(b, d) B / (B + D))                   # 26 seconds
> date()
[1] "Tue Jan 06 11:51:39 "
> timetrial(b, d, Prob = function(b, d) plogis(b - d))                 # 27 seconds
> date()
[1] "Tue Jan 06 11:52:06 "

The winners were the usual suspects, the ones with the fewest calls and operations, but the bottom line seems to be that, at least in this limited case using an interpreted language, it makes very little difference. That I take as good news: there is little reason to bother using the exponential metric in the computing.

The seventh form of the probability, plogis, is the built-in R function for the logistic distribution. While it was no faster, it is an R function and so can handle a variety of arguments in a call like plogis(b - d). If b and d are both scalars, the value of the expression is a scalar. If either b or d is a vector or a matrix, the value is a vector or matrix of the same size. If both b and d are vectors, then the argument (b - d) doesn’t work in general, but the argument outer(b, d, "-") will create a matrix of probabilities with dimensions matching the lengths of b and d. This will allow computing all the probabilities for, say, a class or school on a fixed form with a single call.

The related R function dlogis(b - d) has the value p(1 – p), which is useful in Newton’s method or when computing the standard errors. And it may be useful for impressing your geek friends or further mystifying your non-geek clients.
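For instance, with a made-up class and fixed form, the probabilities, the variances, and the standard errors of measurement all come from the same outer() call:

b = c(-0.3, 0.1, 1.2)            # three students, logits (made up)
d = c(-2, -1, 0, 1, 2)           # five items on the fixed form (made up)
P = plogis(outer(b, d, "-"))     # 3 x 5 matrix of response probabilities
W = dlogis(outer(b, d, "-"))     # matching matrix of p(1 - p)
sqrt(1 / rowSums(W))             # standard errors of measurement by student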

Useful and Almost Number-free Reports

If I had asked my customers what they wanted, they would have said a faster horse. – Henry Ford

Percentile ranks on student reports are tests as contests.

Raw scores on student reports are like live chickens on dinner plates.

If I were to step on my bathroom scale and see a single number like, say for example, 165 lbs (or 75 kilos) with no further explanation, I wouldn’t need an interpretation guide or course in psychometrics to know what the number means or to decide if I like it or not. Nor would I be writing to the manufacturer of the scale asking, “What’s a pound?” If I were to take a qualifying test to be a licensed dog walker and I received a postcard in the mail that said simply, “Your score is 509 GRits ± 41,” I would be a little frustrated and a lot annoyed. And I would need to ask some basic questions like, “What does that mean?” “What’s a GRit?” “Is the ‘41’ the standard error of measurement, or does it represent a confidence interval?” “If so, what level of confidence?” “What does 509 actually say about my proficiency to walk dogs?” And, of course, “Did I pass?”

If the answer to the last question is yes, then most candidates, possibly excluding the psychometricians, will quickly lose interest in the others. If the answer is no, then the tone becomes a little more strident and now includes questions like, “Who decides what passing is?” “What did I miss?” “How close was I?” and, if I was close, “Isn’t there almost a 50% chance that I actually passed?”[1] People who did pass never seem concerned about the other half of this question.

If a postcard with a Scale Score (even with some form of the standard error of measurement) isn’t useful or meaningful, what does a report need to be? Examinee reports vary depending on the audience and the purpose of the exam, but for a report going to a student, teacher, parent, or anyone else who might actually make use of the information for the betterment of the examinee, there would seem to be four basic components:

  1. Identification
  2. Measurement
  3. Control
  4. Interpretation

There needs to be enough identifying information to locate the examinee and to deliver the report to the right place. For dog walking candidates, the address on the front of the postcard did the trick. For education, it probably takes some combination of student name, teacher name, classroom /section/period, grade, school, and district. We should also mention the name of the test and the date taken. That is almost always more than enough to locate the right person; if you are still worried about it, add birth date or a parent’s name. Our original list should be adequate to locate the teacher and the teacher should know the student by name.

Measurement of the examinee to determine something about status or progress is the point of the exercise. This report section could be the simple “509 GRits” but it should also include some indication of our confidence in this measurement, which means the standard error of measurement in some guise. To make it concrete, in this example, the standard error of measurement is 41, with a 95% confidence interval of 509 ± 82, or 427 to 591. It is probably prudent to never use a phrase involving the word “error” when communicating with parents or school boards; they tend to interpret “error” as “mistake” and blame you. One often sees phrases like “probable range” to describe the interval between the measure plus and minus two standard errors (or some other arbitrary multiplier), which avoids saying ‘error’ and also ducks the squabble between the frequentists and the Bayesians about what confidence means. A picture may not be worth a thousand words in this case but here it is.

[Figure: Dog Walking scale showing the candidate’s measure and probable range]

I am inclined to leave it at that for the moment, but not everyone thinks a line with scale scores and a marker for the person’s location is all that informative. I am oft over-ridden to add other (useful and relevant) information, like a conclusion (e.g., pass/fail or performance level[2]) and sometimes even subtest measures, to the measurement section. One could also say things like: a person at 509 has a 41% likelihood of testing below the Competent level next time and a 1e-6 likelihood of testing above Skilled. These are really steps toward control and interpretation, not measurement, so the purist in me wants to put them in the next two sections, although it’s not worth falling on my sword, and what goes in which section is less rigid than I seem to be implying.
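Those likelihoods are ordinary normal-curve areas; a quick sketch of the arithmetic, treating 509 as the center and the 41-GRit standard error as the spread of a retest:

pnorm(500, mean = 509, sd = 41)                      # 0.41 below Competent (500)
pnorm(700, mean = 509, sd = 41, lower.tail = FALSE)  # about 1e-6 above Skilled (700)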

I am willing to give some meaning to the score by showing the ruler and some milestones along it. At this point, the scale score may be less meaningful than the milestones, but with experience, the scale score can become a useful shorthand for the milestones. It doesn’t take very much experience to understand what temperatures of 0°C and 37°C imply, even for US residents. This leads me to the less simple “map” below.

[Figure: Dog Walking scale “map” with keyword milestones, performance levels, and the candidate’s probable range]

Diagnosis With the Model

The vertical scale is GRits[3], which is our measure of dog walking proficiency and almost meaningless by itself; we wouldn’t lose much if we left the numbers off entirely[4]. The column of text labels is the substantive description of the scale. Topics at the low end, which are relatively easy, deal with type and use of basic equipment; topics at the high end, which are more difficult, deal with complex behaviors. The GRits bring precision; the text labels bring meaning.

The red vertical line has a tic mark for each possible raw score and a red diamond to mark the location of our hypothetical person. The red horizontal lines are the person’s location and plus/minus two standard errors. You can also add some normative information like means, standard deviations, frequency distributions, or quantiles, if you are into that sort of thing.

The gray horizontal lines mark the performance levels: 500 is Competent, 700 is Skilled, and 850 is Master. Labelling the lines rather than the regions between is not standard practice in educational assessment but it avoids the inconvenience of needing to label the region below Competent and the misinterpretation of the levels as actual developmental states or stages rather than simply more or less arbitrary criteria for addressing school accountability or dispensing certificates. So far we are just displaying the result, not interpreting it.

Control of the measurement model means, either, ensuring that we are warranted in treating the result as a valid measure, in the full sense of the word as we just did, or diagnosing what the anomalies tell us about the examinee. This is again the dichotomy of “diagnosing with the model” and “diagnosing from the model.” Determining which of these paths to follow requires a bit more than simply computing ‘infit’ or ‘outfit’ and consulting the appropriate table of big numbers. This involves looking at individual items, splitting the items into clusters, and looking for things that are funny. Maps like the following can be more useful than any fit statistic and almost number-free.

[Figure: Dog Walking report showing item-response residuals for the total test and the five item clusters]

Diagnosis From the Model

The display has the same measurement information as before and considerable detail about items and item clusters. First, the red vertical line still refers to the total test and still has a tic mark for each possible raw score and the red diamond for the person. It now has a black diamond for each item response; items to the left of the line are incorrect; those to the right are correct with the distance from the line representing the probability against the response; the greater the distance, the more improbable the response. The dotted vertical lines (blue shading) are control lines and represent probabilities of 75%. We don’t need to be much concerned about anything in the blue. There are four or five items, two difficult items passed and two or three easy items missed, outside the control lines that might warrant investigation.

Most of the same information has been added for each of five item clusters. These are very short tests so the results may seem a little erratic but in all cases, one more item right or wrong would bring them close to the total test measure. If you are so inclined, the number correct score can be found by counting the tic marks[5] up to the red diamond. You can’t necessarily find it by counting the item plotting symbols to the right of the scales because they can represent multiple items. (And it is further confused because some items were not assigned to clusters.) Overall, this is a well-behaved person.

Because I often disparage anything short of computer-administered, fully adaptive tests (CAT), I need to point out a serious issue for me and model control: in the world of CAT, there are no surprises. If we do the CAT right, everything should be in the blue bands. This puts all the load for model control on the item clusters. In our example, we have behaved as though clusters were based on content, which is appropriate for reporting. For control, we could do more by forming other clusters: difficulty, sequence, item type, item format, and item exposure or age are natural choices, but as we become more creative in developing items for computer administration, there could be others.

Interpretation of the measure means explaining what the measurement tells us about the status and progress of the examinee. Establishing ‘performance levels,’ say, Master, Skilled, or Competent dog walker, is a significant step from measurement to meaning, or from quantification to qualification. Announcing that the candidate is above, say, the ‘Competent’ performance level is a start. Diagnosis with the model would then talk about what items candidates at this level have mastery of, what items they have no clue about, and what items are at the candidate’s level. This is reporting what the candidate can do, what the candidate can’t do, and what the candidate should find challenging but possible. That suggests three obvious comments that any computer could readily generate as personalized feedback assuming a well-behaved response pattern and diagnosis with the model.

Personalizing a report takes more than generating text that restates the obvious and uses the candidate’s name in a complete sentence, like “Ron, your Dog Walking GRit is 509 ± 82, which means we think you are Competent to walk dogs”. When we have a computer generating the feedback, we should use any intelligence, artificial or otherwise, that is available. It is generally ok to start with the generic, “Ron, your total score is …” and “You did well on item clusters D and A, but were weak on cluster C,” and move on to things that are less obvious. I prefer to open with a positive, encouraging statement (clusters D and A), then mention problem areas (cluster C), and close with things to work on immediately (topics that haven’t been mastered but are close). Ideally, we would discuss the specifics of the surprising responses. This includes difficult items that were passed and easy items that were missed. This is moving into diagnosis from the model.

The more specifics, the better, even item statistics and foil analysis if anyone asks. But it would be much more valuable and much more work for both the item developers and systems analysts to provide a discussion of the type of misunderstandings or errors implied by any incorrect responses. It is work for the item developers because they would need to understand and explain why every distractor is there and what selecting it means. It is work for the system analysts because they need to keep track of and manage everything.

In today’s world, there is little reason to limit reporting to what can be squeezed onto an 8.5×11 sheet of paper or by concerns about the cost of color printing[6]. Paper copies are static, often cramped and overwhelming. Ideally, an electronic report, like an electronic test, will be interactive, dynamic, and engaging, with effective, targeted scaffolding. It should begin with the general overview and then allow the user to explore or be led through the interesting, important, and useful aspects of the responses, showing more and more detail as needed. Performance Level Descriptors and item clusters could be defined and displayed on request; item details could pop up when the plotting symbol is clicked.

This is not free; there will be resistance to giving items away because they are expensive and the item bank is sacred. Hopefully, we are moving away from once-a-year, high-stakes tests toward testing when it is helpful for the student, drawing from computer-generated and crowd-sourced item banks. And more importantly, toward immediate and informative feedback that might actually have some educational value.

 

[1] No, you didn’t pass, but if you test again with a parallel form, there is almost a 50% chance that you will.

[2] The little gray lines mark the performance levels (Competent, Skilled, and Master from left to right).

[3] Because GRits use three, rather than e, as their base, a difference of 100 GRits means 3 to 1 odds. Our hypothetical person has 3 to 1 odds of answering an item about leash tension but less than 1 to 3 odds for an item about tugging. More generally, a difference of 100k GRits means odds of 3^k to 1. That’s friendlier than base e and odds like 2.71828…^k to one.

[4] The report is (almost) number-free in the sense that the numbers are not needed to understand and use the results. It is not number-free in another sense because they are essential to provide the framework to create and arrange the display.

[5] This demonstration does not include tic marks for zero and perfect; in real life, you would probably have to account for them somehow. They tend to radically distort the scale without adding much, if any, information. They would extend the scale over 100 GRits in both directions and have probable ranges more than four times that in width.

[6] Someone will undoubtedly want a printer-friendly version to stick in the file cabinet because they have the file cabinet.