Lexiles: the making of a measure

PDF download: Using Lexiles Safely

A recent conversation with a former colleague (it was more of a lecture) about what psychometricians don’t understand about students and education led me to resurrect an article that I wrote for the Rasch Measurement Transactions four or five years ago. It deals specifically with Lexiles© but it is really about how one defines and uses measures in education and science.

The antagonism toward Lexiles in particular and Rasch measures in general is an opportunity to highlight some distinctions between measurement and analysis and between a measure and an assessment. Often when trying to discuss the development of reading proficiency, specialists in measurement and reading seem to be talking at cross-purposes. Reverting to argument by metaphor, measurement specialists are talking about measuring weight; and reading specialists, about providing proper nutrition.

There is a great deal involved in physical development that is not captured when we measure a child’s weight and the process of measuring weight tells us nothing about whether the result is good, bad, or normal; if you should continue on as you are, schedule a doctor’s appointment, or go to the emergency room without changing your underwear. Evaluation of the result is an analysis that comes after the measurement and depends on the result being a measure. No one would suggest that, because it doesn’t define health, weight is not worth measuring or that it is too politically sensitive to talk about in front of nutritionists. A high number does not imply good nutrition nor does a low number imply poor nutrition. Nonetheless, the measurement of weight is always a part of an assessment of well-being.

A Lexile score, applied to a person, is a measure of reading ability[i], which I use to mean the capability to decode words, sentences, paragraphs, and Supreme Court decisions. A Lexile score, applied to a text, is a measure of how difficult the text is to decode. Hemingway’s “For Whom the Bell Tolls” (840 Lexile score) has been cited as an instance where Lexiles do not work: because a 50th percentile sixth-grade reader could engage with this text, something must be wrong, since the book was written for adults. This counter-example, if true, is an interesting case. I have two counter-counter-arguments: first, all measuring instruments have limitations to their use and, second, Lexiles may be describing Hemingway appropriately.

First, outside the context of Lexiles, there is always difficulty for either humans or computer algorithms in scoring exceptional, highly creative writing. (I would venture to guess that many publishers, who make their livings recognizing good writing[ii], would reject Hemingway, Joyce, or Faulkner-like manuscripts if they received them from unknown authors.) I don’t think it follows that we should avoid trying to evaluate exceptional writing. But we do need to know the limits of our instruments.

I rely, on a daily basis, on a bathroom scale. I rely on it even though I believe I shouldn’t use it on the moon, under water, or for elephants or measuring height. It does not undermine the validity of Lexiles in general to discover an extraordinary case for which it does not apply. We need to know the limits of our instrument: when it produces valid measures and when it does not.

Second, given that we have defined the Lexile for a text as the difficulty of decoding the words and sentences, the Lexile analyzer may be doing exactly what it should with a Hemingway text. Decoding the words and sentences in Hemingway is not that hard: the vocabulary is simple, the sentences short. That’s pretty much what a Lexile score reflects.

Understanding or appreciating Hemingway is something else again. This may be getting into the distinction between reading ability, as I defined it, and reading comprehension, as the specialists define that. You must be able to read (i.e., decode) before you can comprehend. Analogously, you have to be able to do arithmetic before you can solve math word problems[iii]. The latter requires the former; the former does not guarantee the latter.

The Lexile metric is a true developmental scale that is not related to instructional method or materials, or to grade-level content standards. The metric reflects increasing ability to read, in the narrow sense of decode, increasingly advanced text. As students advance through the reading/language arts curriculum, they should progress up the Lexile scale. Effective, even standards-based, instruction in ELA[iv] should cause them to progress on the Lexile scale; analogously, good nutrition should cause children to progress on the weight scale[v].

One could coach children to progress on the weight scale in ways counter to good nutrition[vi]. One might subvert Lexile measurements by coaching students to write with big words and long sentences. This does not invalidate either weight or reading ability as useful things to measure. There do need to be checks to ensure we are effecting what we set out to effect.

The role of standards-based assessment is to identify which constituents of reading ability and reading comprehension are present and which absent. Understanding imagery and literary devices, locating topic sentences, identifying main ideas, recognizing sarcasm or satire, comparing authors’ purposes in two passages are within its purview but are not considered in the Lexile score. Its analyzer relies on rather simple surrogates for semantic and syntactic complexity.

The role of measurement on the Lexile scale is to provide a narrowly defined measure of the student’s status on an interval scale that extends over a broad range of reading from Dick and Jane to Scalia and Sotomayor. The Lexile scale does not define reading, recognize the breadth of the ELA curriculum, or replace grade-level content standards-based assessment, but it can help us design instruction and target assessment to be appropriate to the student. We do not expect students to say anything intelligent about text they cannot decode, nor should we attempt to assess their analytic skills using such text.

Jack Stenner (aka Dr. Lexile) uses as one of his parables: we don’t buy shoes for a child based on grade level, but we don’t think twice about assigning textbooks with the formula (age – 5). It’s not one-size-fits-all in physical development. Cognitive development is probably no simpler, if we were able to measure all its components. To paraphrase Ben Wright, how we measure weight has nothing to do with how skillful you are at football, but you better have some measures before you attempt the analysis.

[i] Ability may not be the best choice of a word. As used in psychometrics, ability is a generic placeholder for whatever we are trying to measure about a person. It implies nothing about where it came from, what it is good for, or how much is enough. In this case, we are using reading ability to refer to a very specific skill that must be taught, learned, and practiced.

[ii] It may be more realistic to say they make their livings recognizing marketable writing, but my cynicism may be showing.

[iii] You also have to decode the word problem but that’s not the point of this sentence. We assume, often erroneously, that the difficulty of decoding the text is not an impediment to anyone doing the math.

[iv] Effective instruction in science, social studies, or basketball strategy should cause progress on the Lexile measure as well; perhaps not so directly. Anything that adds to the student’s repertoire of words and ideas should contribute.

[v] For weight, progress often does not equal gain.

[vi] Metaphors, like measuring instruments, have their limits and I may have exceeded one. However, one might consider the extraordinary measures amateur wrestlers or professional models employ to achieve a target weight.


Computer-Administered Tests That May Teach

PDF download: Answer until Correct

One of the political issues with computer administered tests (CAT) is what to do about examinees who want to revisit, review, and revise earlier responses. Examinees sometimes express frustration when they are not allowed to; psychometricians don’t like the option being available because each item selection is based on previous successes and failures, so changing answers after moving on has the potential of upsetting the psychometric apple cart. One of our more diabolical thinkers has suggested that a clever examinee would intentionally miss several early items, thereby getting an easier test, and returning later to fix the intentionally incorrect responses, ensuring more correct answers and presumably a higher ability estimate. While this strategy could sometimes work in the examinee’s favor (if receiving an incorrect estimate is actually in anyone’s favor), it is somewhat limited because many right answers on an easy test are not necessarily better than fewer right answers on a difficult test and because a good CAT engine should recover from a bad start given the opportunity. While we might trust in CAT, we should still row away from the rocks.

The core issue for educational measurement is test as contest versus a useful self-assessment. When the assessments are infrequent and high stakes with potentially dire consequences for students, schools, districts, administrators, and teachers, there is little incentive not to look for a rumored edge whenever possible[1]. Frequent, low-stakes tests with immediate feedback could actually be valued and helpful to both students and teachers. There is research, for example, suggesting that taking a quiz is more effective for improved understanding and retention than rereading the material.

The issue of revisiting can be avoided, even with high stakes, if we don’t let the examinee leave an item until the response is correct. First, present a multiple choice item (hopefully more creatively than putting a digitized image of a print item on a screen). If we get the right response, we say “Congratulations” or “Good work” and move on to the next item. If the response is incorrect, we give some kind of feedback, ranging from “Nope, what are you thinking?” to “Interesting but not what we’re looking for” or perhaps some discussion of why it isn’t what we’re looking for (recommended). Then we re-present the item with the selected, incorrect foil omitted.  Repeat. The last response from the examinee will always be the correct one, which might even be retained.
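For concreteness, here is a minimal sketch in R of that administration loop, with the examinee simulated by drawing from fixed choice probabilities. The function name and its arguments are my own invention, not part of any delivery system, and the score it returns is the one defined in the next paragraph.

answer_until_correct <- function (prob, key) {
   # prob: probability of each response choice for this examinee; key: index of the correct answer
   foils <- seq_along(prob)                                          # response choices still on the screen
   repeat {
      pick <- foils[sample.int(length(foils), 1, prob=prob[foils])]  # simulated examinee response
      if (pick == key) return (length(foils) - 1)                    # score: number of distractors remaining
      # ... feedback on the incorrect choice would go here ...
      foils <- setdiff(foils, pick)                                  # drop the chosen distractor and re-present
   }
}

answer_until_correct (c(0.5, 0.17, 0.17, 0.17), key=1)               # returns 3, 2, 1, or 0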

The examinee’s score on the item is the number of distractors remaining when we finally get to the correct response[2]. Calibration of the thresholds can be quick and dirty. It is convenient for me here to use the “rating scale” form for the logit [bv – (di + tij)]. The highest threshold, associated with giving the correct response on the first attempt, is the same as the logit difficulty of the original multiple choice item, because that is exactly the situation we are in, and tim = 0 for an item with m distractors (i.e., m+1 response choices). The logits for the other thresholds depend on the attractiveness of the distractors. (Usually, when written in this form, the tij sum to zero, but that’s not helpful here.)

To make things easy for myself, I will use a hypothetical example of a four-choice item with equally popular distractors. The difficulty of the item is captured in the di and doesn’t come into the thresholds. Assuming an item with a p-value of 0.5 and equally attractive distractors, the incorrect responses will be spread across the three, with 17% on each. After one incorrect response, we expect the typical examinee to have a [0.5 / (0.5+0.17+0.17)] = 0.6 chance of success on the second try. A 0.6 chance of success corresponds to a logit difficulty ln [(1 – 0.6) / 0.6] = –0.4. Similarly for the third attempt, the probability of success is [0.5 / (0.5+0.17)] = 0.75 and the logit difficulty ln [(1 – 0.75) / 0.75] = –1.1. All of which gives us the three thresholds t = {-1.1, -0.4, 0.0}.
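The same arithmetic takes a few lines of R; this is my own quick sketch, not anything from a calibration package, and it assumes the distractors remain equally attractive as they are eliminated.

awc_thresholds <- function (p, m) {
   # p: p-value of the original item; m: number of equally attractive distractors
   q <- (1 - p) / m                          # probability of each distractor
   P <- p / (p + (m:1) * q)                  # chance of success on attempt 1, 2, ..., m
   rev (round (log((1 - P) / P), 2))         # logit thresholds, lowest to highest
}

awc_thresholds (0.5, 3)                      # -1.10 -0.41 0.00, i.e., the t = {-1.1, -0.4, 0.0} above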

This was easy because I assumed distractors that are equally attractive across the ability continuum; then the order in which they are eliminated doesn’t matter in the arithmetic. With other patterns, it is more laborious but no more profound. If, for example, we have an item like:

  1. Litmus turns what color in acid?
     a. red
     b. blue
     c. black
     d. white,

we could see probabilities across the foils like (0.5, 0.4, 0.07, and 0.03) for the standard examinee. There is one way to answer correctly on the first attempt and score 3; this is the original multiple choice item and the probability of this is still 0.5. There are, assuming we didn’t succeed on the first attempt, three ways to score 2 (ba, ca, and da) that we would need to evaluate. And even more paths to scores of 1 or zero, which I’m not going to list.

Nor does it matter what p-value we start with, although the arithmetic would change. For example, reverting to equally attractive distractors, if we start with p=0.7 instead of 0.5, the chance of success on the second attempt is 0.78 and on the third is 0.875. This leads to logit thresholds of ln [(1 – 0.78) / 0.78] = –1.25, and ln [(1 – 0.875) / 0.875] = –1.95. There is also a non-zero threshold for the first attempt of ln [(1 – 0.7) / 0.7] = –0.85. This is reverting to the “partial credit” form of the logit (bv – dij). To compare to the earlier paragraph requires taking the –0.85 out so that (–0.85, –1.25, –1.95) becomes –0.85 + (0.0, –0.4, –1.1) as before. I should note that this is not the partial credit or rating scale model, although a lot of the arithmetic turns out to be pretty much the same (see Linacre, 1991). It has been called “Answer until Correct,” or the Failure model, because you keep going on the item until you succeed. This contrasts with the Success model[3], where you keep going until you fail. Or maybe I have the names reversed.
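Using the same hypothetical helper sketched earlier, the p = 0.7 case works out the same way:

t <- awc_thresholds (0.7, 3)                 # -1.95 -1.25 -0.85, the "partial credit" form
t - t[3]                                     # -1.10 -0.40  0.00, back to the "rating scale" form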

Because we don’t let the examinee end on a wrong answer and we provide some feedback along the way, we are running a serious risk that the examinees could learn something during this process with feedback and second chances. This would violate an ancient tenet in assessment that the agent shalt not alter the object, although I’m not sure how the Quantum Mechanics folks feel about this.

[1] Admission, certifying, and licensing tests have other cares and concerns.

[2] We could give a maximum score of one for an immediate correct response and fractional values for the later stages, but using fractional scores would require slightly different machinery and have no effect on the measures.

[3] DBA, the quiz show model.

Probability Time Trials

It has come to my attention that I write the basic Rasch probability in half a dozen different forms; half of them are in logits (aka, log odds) and half are in the exponential metric (aka, odds). My two favorites for exposition are, in logits, exp (b-d) / [1 + exp (b-d)] and, in exponentials, B / (B + D), where B = e^b and D = e^d. The second of these I find the most intuitive: the probability in favor of the person is the person’s odds divided by the sum of the person and item odds. The first, the logit form, may be the most natural because logits are the units used for the measures and exhibit the interval scale properties, and this form emphasizes the basic relationship between the person and item.

There are variations on each of these forms, like [B / D] / [1 + B / D] and 1 / [1 + D / B], which are simple algebraic manipulations. The forms are all equivalent; the choice of which to use is simply convention, personal preference, or perhaps computing efficiency, but that has nothing to do with how we talk to each other, only how we talk to the computer. Computing efficiency means minimizing the calls to the log and exponential functions, which leads me to work internally mostly in the exponentials and to do input and output in logits.

These deliberations led to a small time trial to provide a partial answer to the efficiency question in R. I first set up some basic parameters and defined a function to compute 100,000 probabilities. (When you consider a state-wide assessment, which can range from a few thousand to a few hundred thousand students per grade, that’s not a very big number. If I were more serious, I would use a timer with more precision than whole seconds.)

> b = 1.5; d = -1.5

> B = exp(b); D = exp(d)

> timetrial = function (b, d, N=100000, Prob) { p = numeric(N); for (k in 1:N) p[k] = Prob(b,d) }

Then I ran timetrial for each of seven expressions for the probability, each call computing 100,000 probabilities; the first three and the seventh use logits; four, five, and six use exponentials.

> date ()
[1] "Tue Jan 06 11:49:00"

> timetrial(b, d, Prob=function(b,d) 1 / (1+exp(d-b)))          # 26 seconds
> date ()
[1] "Tue Jan 06 11:49:26"

> timetrial(b, d, Prob=function(b,d) exp(b-d) / (1+exp(b-d)))   # 27 seconds
> date ()
[1] "Tue Jan 06 11:49:53"

> timetrial(b, d, Prob=function(b,d) exp(b)/(exp(b)+exp(d)))    # 27 seconds
> date ()
[1] "Tue Jan 06 11:50:20"

> timetrial(b, d, Prob=function(b,d) 1 / (1+D/B))               # 26 seconds
> date ()
[1] "Tue Jan 06 11:50:46"

> timetrial(b, d, Prob=function(b,d) (B/D) / (1+B/D))           # 27 seconds
> date ()
[1] "Tue Jan 06 11:51:13"

> timetrial(b, d, Prob=function(b,d) B / (B+D))                 # 26 seconds
> date ()
[1] "Tue Jan 06 11:51:39"

> timetrial(b, d, Prob=function(b,d) plogis(b-d))               # 27 seconds
> date ()
[1] "Tue Jan 06 11:52:06"

The winners were the usual suspects, the ones with the fewest calls and operations, but the bottom line seems to be that, at least in this limited case using an interpreted language, it makes very little difference. That I take as good news: there is little reason to bother using the exponential metric in the computing.

The seventh form of the probability, plogis, is the built-in R function for the logistic distribution. While it was no faster, it is an R function and so can handle a variety of arguments in a call like plogis(b-d). If b and d are both scalars, the value of the expression is a scalar. If either b or d is a vector or a matrix, the value is a vector or matrix of the same size. If both b and d are vectors then the argument (b-d) doesn’t work in general, but the argument outer(b, d, "-") will create a matrix of probabilities with dimensions matching the lengths of b and d. This will allow computing all the probabilities for, say, a class or school on a fixed form with a single call.

The related R function, dlogis(b-d), has the value p(1-p), which is useful in Newton’s method or when computing the standard errors. It may also be useful for impressing your geek friends or further mystifying your non-geek clients.
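As a small sketch of both calls (the abilities and difficulties here are made up), scoring a class of three on a nine-item fixed form:

bv <- c(-0.3, 0.5, 1.2)                      # logit abilities for three students
dv <- seq(-2, 2, 0.5)                        # logit difficulties for nine items
P  <- plogis (outer(bv, dv, "-"))            # 3 x 9 matrix of probabilities of success
E  <- rowSums(P)                             # expected raw scores
SE <- 1 / sqrt (rowSums (dlogis (outer(bv, dv, "-"))))   # standard errors, since dlogis gives p(1-p)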

Useful and Almost Number-free Reports

If I had asked my customers what they wanted, they would have said a faster horse. – Henry Ford

Percentile ranks on student reports are tests as contests.

Raw scores on student reports are like live chickens on dinner plates.

If I were to step on my bathroom scale and see a single number like, say for example, 165 lbs (or 75 kilo) with no further explanation, I wouldn’t need an interpretation guide or course in psychometrics to know what the number means or to decide if I like it or not. Nor would I be writing to the manufacturer of the scale asking, “What’s a pound?” If I were to take a qualifying test to be a licensed dog walker and I received a postcard in the mail that said simply, “Your score is 509 GRits ± 41,” I would be a little frustrated and a lot annoyed. And I would need to ask some basic questions like, “What does that mean?” “What’s a GRit?” “Is the ‘41’ the standard error of measurement, or does it represent a confidence interval?” “If so, what level of confidence?” “What does 509 actually say about my proficiency to walk dogs?” And, of course, “Did I pass?”

If the answer to the last question is yes, then most candidates, possibly excluding the psychometricians, will quickly lose interest in the others. If the answer is no, then the tone becomes a little more strident and now includes questions like, “Who decides what passing is?” “What did I miss?” “How close was I?” and if I was close, “Isn’t there almost a 50% chance that I actually passed?”[1] People who did pass never seem concerned about the other half of this question.

If a postcard with a Scale Score (even with some form of the standard error of measurement) isn’t useful or meaningful, what does a report need to be? Examinee reports vary depending on the audience and the purpose of the exam, but for a report going to a student, teacher, parent, or anyone else who might actually make use of the information for the betterment of the examinee, there would seem to be four basic components:

  1. Identification
  2. Measurement
  3. Control
  4. Interpretation

There needs to be enough identifying information to locate the examinee and to deliver the report to the right place. For dog walking candidates, the address on the front of the postcard did the trick. For education, it probably takes some combination of student name, teacher name, classroom/section/period, grade, school, and district. We should also mention the name of the test and the date taken. That is almost always more than enough to locate the right person; if you are still worried about it, add birth date or a parent’s name. Our original list should be adequate to locate the teacher and the teacher should know the student by name.

Measurement of the examinee to determine something about status or progress is the point of the exercise. This report section could be the simple “509 GRits” but it should also include some indication of our confidence in this measurement, which means the standard error of measurement in some guise. To make it concrete, in this example, the standard error of measurement is 41, with a 95% confidence interval of 509 ± 82, or 427 to 591. It is probably prudent to never use a phrase involving the word “error” when communicating with parents or school boards; they tend to interpret “error” as “mistake” and blame you. One often sees phrases like “probable range” to describe the interval between the measure plus and minus two standard errors (or some other arbitrary multiplier), which avoids saying ‘error’ and also ducks the squabble between the frequentists and the Bayesians about what confidence means. A picture may not be worth a thousand words in this case but here it is.

[Figure: Dog walking scale]

I am inclined to leave it at that for the moment, but not everyone thinks a line with scale scores and a marker for the person’s location is all that informative. I am often overridden and add other (useful and relevant) information, like a conclusion (e.g., pass/fail or performance level[2]) and sometimes even subtest measures, to the measurement section. One could also say things like: a person at 509 has a 41% likelihood of testing below the Competent level next time and a 1e-6 likelihood of testing above Skilled. These are really steps toward control and interpretation, not measurement, so the purist in me wants to put them in the next two sections, although it’s not worth falling on my sword, and what goes in which section is less rigid than I seem to be implying.
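The post doesn’t say how those likelihoods were computed; a normal approximation using the standard error of measurement and the cut scores given with the map below (500 for Competent, 700 for Skilled) reproduces the numbers, e.g.:

pnorm (500, mean=509, sd=41)                 # about 0.41, the chance of testing below Competent next time
1 - pnorm (700, mean=509, sd=41)             # about 1.6e-6, the 1e-6 likelihood of testing above Skilled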

I am willing to give some meaning to the score by showing the ruler and some milestones along it. At this point, the scale score may be less meaningful than the milestones, but with experience, the scale score can become a useful shorthand for the milestones. It doesn’t take very much experience to understand what temperatures of 0°C and 37°C imply, even for US residents. This leads me to the less simple “map” below.

[Figure: Dog walking scale – Diagnosis with the Model]

The vertical scale is GRits[3], which is our measure of dog walking proficiency and almost meaningless by itself; we wouldn’t lose much if we left the numbers off entirely[4]. The column of text labels is the substantive description of the scale. Topics at the low end, which are relatively easy, deal with type and use of basic equipment; topics at the high end, which are more difficult, deal with complex behaviors. The GRits bring precision; the text labels bring meaning.

The red vertical line has a tic mark for each possible raw score and a red diamond to mark the location of our hypothetical person. The red horizontal lines are the person’s location and plus/minus two standard errors. You can also add some normative information like means, standard deviations, frequency distributions, or quantiles, if you are into that sort of thing.

The gray horizontal lines mark the performance levels: 500 is Competent, 700 is Skilled, and 850 is Master. Labelling the lines rather than the regions between is not standard practice in educational assessment but it avoids the inconvenience of needing to label the region below Competent and the misinterpretation of the levels as actual developmental states or stages rather than simply more or less arbitrary criteria for addressing school accountability or dispensing certificates. So far we are just displaying the result, not interpreting it.

Control of the measurement model means either ensuring that we are warranted in treating the result as a valid measure, in the full sense of the word, as we just did, or diagnosing what the anomalies tell us about the examinee. This is again the dichotomy of “diagnosing with the model” and “diagnosing from the model.” Determining which of these paths to follow requires a bit more than simply computing ‘infit’ or ‘outfit’ and consulting the appropriate table of big numbers. It involves looking at individual items, splitting the items into clusters, and looking for things that are funny. Maps like the following can be more useful than any fit statistic and almost number-free.

[Figure: Dog Walking item map – Diagnosis from the Model]

The display has the same measurement information as before and considerable detail about items and item clusters. First, the red vertical line still refers to the total test and still has a tic mark for each possible raw score and the red diamond for the person. It now has a black diamond for each item response; items to the left of the line are incorrect; those to the right are correct with the distance from the line representing the probability against the response; the greater the distance, the more improbable the response. The dotted vertical lines (blue shading) are control lines and represent probabilities of 75%. We don’t need to be much concerned about anything in the blue. There are four or five items, two difficult items passed and two or three easy items missed, outside the control lines that might warrant investigation.

Most of the same information has been added for each of five item clusters. These are very short tests so the results may seem a little erratic but in all cases, one more item right or wrong would bring them close to the total test measure. If you are so inclined, the number correct score can be found by counting the tic marks[5] up to the red diamond. You can’t necessarily find it by counting the item plotting symbols to the right of the scales because they can represent multiple items. (And it is further confused because some items were not assigned to clusters.) Overall, this is a well-behaved person.

Because I often disparage anything short of computer-administered, fully adaptive tests (CAT), I need to point out a serious issue for me and model control: in the world of CAT, there are no surprises. If we do the CAT right, everything should be in the blue bands. This puts all the load for model control on the item clusters. In our example, we have behaved as though clusters were based on content, which is appropriate for reporting. For control, we would do more by forming other clusters: difficulty, sequence, item type, item format, and item exposure or age are natural choices, but as we become more creative in developing items for computer administration, there could be others.

Interpretation of the measure means explaining what the measurement tells us about the status and progress of the examinee. Establishing ‘performance levels,’ say, Master, Skilled, or Competent dog walker, is a significant step from measurement to meaning, or from quantification to qualification. Announcing that the candidate is above, say, the ‘Competent’ performance level is a start. Diagnosis with the model would then talk about what items candidates at this level have mastery of, what items they have no clue about, and what items are at the candidate’s level. This is reporting what the candidate can do, what the candidate can’t do, and what the candidate should find challenging but possible. That suggests three obvious comments that any computer could readily generate as personalized feedback assuming a well-behaved response pattern and diagnosis with the model.

Personalizing a report takes more than generating text that restates the obvious and uses the candidate’s name in a complete sentence, like “Ron, your Dog Walking GRit is 509 ± 82, which means we think you are Competent to walk dogs”. When we have a computer generating the feedback, we should use any intelligence, artificial or otherwise, that is available. It is generally ok to start with the generic, “Ron, your total score is …” and “You did well on item clusters D and A, but were weak on cluster C,” and move on to things that are less obvious. I prefer to open with a positive, encouraging statement (clusters D and A), then mention problem areas (cluster C), and close with things to work on immediately (topics that haven’t been mastered but are close). Ideally, we would discuss the specifics of the surprising responses. This includes difficult items that were passed and easy items that were missed. This is moving into diagnosis from the model.

The more specifics, the better, even item statistics and foil analysis if anyone asks. But it would be much more valuable and much more work for both the item developers and systems analysts to provide a discussion of the type of misunderstandings or errors implied by any incorrect responses. It is work for the item developers because they would need to understand and explain why every distractor is there and what selecting it means. It is work for the system analysts because they need to keep track of and manage everything.

In today’s world, there is little reason to limit reporting to what can be squeezed onto an 8.5×11 sheet of paper or by concerns about the cost of color printing[6]. Paper copies are static, often cramped and overwhelming. Ideally, an electronic report, like an electronic test, will be interactive, dynamic, and engaging, with effective, targeted scaffolding. It should begin with the general overview and then allow the user to explore or be led through the interesting, important, and useful aspects of the responses, showing more and more detail as needed. Performance Level Descriptors and item clusters could be defined and displayed on request; item details could pop up when the plotting symbol is clicked.

This is not free; there will be resistance to giving items away because they are expensive and the item bank is sacred. Hopefully, we are moving away from once-a-year, high-stakes tests toward testing when it is helpful for the student, drawing from computer-generated and crowd-sourced item banks. And more importantly, toward immediate and informative feedback that might actually have some educational value.

 

[1] No, you didn’t pass, but if you test again with a parallel form, there is almost a 50% chance that you will.

[2] The little gray lines mark the performance levels (Competent, Skilled, and Master from left to right).

[3] Because GRits use three, rather than e, as their base, a difference of 100 GRits means 3 to 1 odds. Our hypothetical person has 3 to 1 odds of answering an item about leash tension but less than 1 to 3 odds for an item about tugging. More generally, a difference of 100k GRits means odds of 3^k to 1. That’s friendlier than base e and odds like 2.71828…^k to one.

[4] The report is (almost) number-free in the sense that the numbers are not needed to understand and use the results. It is not number-free in another sense because they are essential to provide the framework to create and arrange the display.

[5] This demonstration does not include tic marks for zero and perfect; in real life, you would probably have to account for them somehow. They tend to radically distort the scale without adding much, if any, information. They would extend the scale over 100 GRits in both directions and have probable ranges more than four times that in width.

[6] Someone will undoubtedly want a printer-friendly version to stick in the file cabinet because they have the file cabinet.

Ixb. R-code to make a simple model less simple and more useful

My life as a psychometrician, the ability algorithm, and some R procs to do the work

The number one job of the psychometrician, in the world of large-scale, state-wide assessments, is to produce the appropriate raw-to-scale tables on the day promised. When they are wrong or late, lawsuits ensue. When they are right and on time, everyone is happy. If we did nothing else, most wouldn’t notice; few would complain.

Once the setup is done, computers can produce the tables in a blink of an eye. It is so easy it is often better, especially in the universe beyond fixed forms, to provide the algorithm to produce scale scores on demand and not bother with lookup tables at all. Give it the item difficulties, feed in the raw score, and the scale score pops out. Novices must take care that management never finds out how easy this step really is.

With modern technology, the ability algorithm can be written in almost any computer language (there is probably an app for your phone) but some are easier than others. My native language is Fortran, so I am most at home with C, C++, R, or related dialects. I am currently using R most of the time. For me with dichotomous items, this does it:

Ability (d)          # where d is the vector of logit difficulties.

But first, I need to copy a few other things into the R window, like a procedure named Ability.

(A simple cut and paste from this post into R probably won’t work, although the code did work when I copied it from the editor. The website seems to use a font for which not all math symbols are recognized by R. In particular, the slash (/), minus (-), single and double quotes (‘ “), and ellipses (…) needed to be fixed. I’ve done that with a “replace all” in a text editor before moving the code into R: copy the offending symbol from the text into the “replace” box of the “replace all” command and type the same symbol into the “with” box. Or leave a comment and I’ll email you a simple text version.)


Ability <- function (d, M=rep(1,length(d)), first=1, last=(length(d)-1), A=500, B=91, ...) {
   # d: logit threshold difficulties; M: maximum score for each item; A, B: center and slope of the GRit scale
   b <- NULL; s <- NULL
   b[first] <- first / (length(d) - first)     # starting value, in the exponential metric
   D <- exp(d)
   for (r in first:last) {
      b[r]   <- Ablest(r, D, M, b[r], ...)     # logit ability for raw score r
      s[r]   <- SEM(exp(b[r]), D, M)           # and its standard error
      b[r+1] <- exp(b[r] + s[r]^2)             # starting value (odds) for the next raw score
   }
   return (data.frame(raw=(first:last), logit=b[first:last], sem.logit=s[first:last],
           GRit=round((A+B*b[first:last]),0),  sem.GRit=round(B*s[first:last],1)))
} ##############################################################

This procedure is just a front for functions named Ablest and SEM that actually do the work so you will need to copy them as well:

Ablest <- function (r, D, M=rep(1,length(D)), B=(r / (length(D)-r)), stop=0.01) {
# r is the raw score; D is the vector of exponential difficulties; M is the vector of m[i]; stop is the convergence criterion
   repeat {
      adjust <- (r - SumP(B,D,M)) / SumPQ(B,D,M)   # Newton step on the logit
      B <- exp(log(B) + adjust)
      if (abs(adjust) < stop) return (log(B))      # converged: return the logit ability
   }
} ##########################################################ok
SEM <- function (b, d, m=(rep(1,length(d))))  return (1 / sqrt(SumPQ(b,d,m)))
  ##############################################################

And Ablest needs some even more basic utilities copied into the window:

SumP <- function (b, d, m=NULL, P=function (B,D) (B / (B+D))) {
   if (is.null(m)) return (sum (P(b,d))) # dichotomous case; sum() is a built-in function
   k <- 1
   Sp <- 0
   for (i in 1:length(m)) {
       Sp <- Sp + EV (b, d[k:(k+m[i]-1)])   # expected score on item i
       k <- k + m[i]
   }
return (Sp)
} ##################################################################ok
EV <- function (b, d) { #  %*% is the inner product; produces a scalar
   return ((1:length(d)) %*% P.Rasch(b, d, m=length(d)))
} ##################################################################ok
SumPQ <- function (B, D, m=NULL, P=function (B,D) {B/(B+D)}, PQ=function (p) {p-p^2}) {
   if (is.null(m)) return (sum(PQ(P(B,D))))  # dichotomous case
   k <- 1
   Spq <- 0
   for (i in 1:length(m)) {
       Spq <- Spq + VAR (B, D[k:(k+m[i]-1)]) # variance of the score on item i
       k <- k + m[i]
   }
return (Spq)
} ##################################################################ok
VAR <- function (b,d) {  # the polytomous version of (p - p^2)
   return (P.Rasch(b, d, m=length(d)) %*% ((1:length(d))^2) - EV(b,d)^2)
} ##################################################################ok
P.Rasch <- function (b, d, m=NULL, P=function (B,D) (B / (B+D)) ) {
   if (is.null(m)) return (P(b,d)) # simple logistic
   return (P.poly (P(b,d),m))      # polytomous
} ##################################################################ok
P.poly <- function (p, m) { # p is a simple vector of category probabilities
   k <- 1
   for (i in 1:length(m)) {
      p[k:(k+m[i]-1)] <- P.star (p[k:(k+m[i]-1)], m[i])
      k <- k + m[i]
   }
return (p)
} ##################################################################ok
P.star <- function (pstar, m=length(pstar)) {
#
#       Converts p* to p; assumes a vector of probabilities
#       computed naively as B/(B+D).  This routine takes account
#       of the Guttman response patterns allowed with the PRM.
#
    q <- 1-pstar  # all wrong, 000...
    p <- prod(q)
    for (j in 1:m) {
        q[j] <- pstar[j] # one more right, e.g., 100..., 110..., ...
        p[j+1] <- prod(q)
    }
    return (p[-1]/sum(p)) # don't return p for category 0
} ##################################################################ok
summary.ability <- function (score, dec=5) {
   print (round(score,dec))
   plot(score[,4], score[,1], xlab="GRit", ylab="Raw Score", type='l', col='red')
      points(score[,4]+score[,5], score[,1], type='l', lty=3)
      points(score[,4]-score[,5], score[,1], type='l', lty=3)
} ##################################################################

This is very similar to the earlier version of Ablest but has been generalized to handle polytomous items, which is where the vector M of maximum scores or number of thresholds comes in.

To use more bells and whistles, the call statement can be things like:

Ability (d, M, first, last, A, B, stop)         # All the parameters it has
Ability (d, M)                                         # first, last, A, B, & stop have defaults
Ability (d,,,,,, 0.0001)                             # stop is passed through ..., so either keep the commas or name it: Ability(d, stop=0.0001)
Ability (d, ,10, 20,,, 0.001)                       # default for M assumes dichotomous items
Ability (d, M,,, 100, 10)                          # defaults for A & B are 500 and 91

To really try it out, we can define a vector of item difficulties with, say, 25 uniformly spaced dichotomous items and two polytomous items, one with three thresholds and one with five. The vector m defines the matching vector of maximum scores.

dd=c(seq(-3,3,.25), c(-1,0,1), c(-2,-1,0,1,2))
m = c(rep(1,25),3,5)
score = Ability (d=dd, M=m)
summary.ability (score, 4)

Or give it your own vectors of logit difficulties and maximum scores.

For those who speak R, the code is fairly intuitive, perhaps not optimal, and could be translated almost line by line into Fortran, although some lines would become several. Most of the routines can be called directly if you’re so inclined and get the arguments right. Most importantly, Ability expects logit difficulties and returns logit abilities. Almost everything else expects and uses exponentials. Almost all error messages are unintelligible; they usually mean either that d and M don’t match or that something is an exponential when it should be a logit, or vice versa.
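For example, using the demo vectors dd and m defined above, and remembering the exp():

Ablest (5, exp(dd), m)                           # logit ability for a raw score of 5
SEM (exp(Ablest(5, exp(dd), m)), exp(dd), m)     # and its standard error, also in logits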

I haven’t mentioned what to do about zero and perfect scores today because, first, I’m annoyed that they are still necessary, second,  these routines don’t do them, and, third, I talked about the problem a few posts ago. But, if you must, you could use b[0] = b[1] – SEM[1]^2 and b[M] = b[M-1] + SEM[M-1]^2, where M is the maximum possible score, not necessarily the number of items. Or you could appear even more scientific and use something like b[0] = Ablest(0.3, D, m) and b[M] = Ablest(M-0.3, D, m). Here D is the vector of difficulties in the exponential form and m is the vector of maximum scores for the items (and M is the sum of the m‘s.) The length of D is the total number of thresholds (aka, M) and the length of m is the number of items (sometimes called L.) Ablest doesn’t care that the score isn’t an integer but Ability would care. The value 0.3 was a somewhat arbitrary choice; you may prefer 0.25 or 0.33 instead.
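In terms of the routines above, and purely as a sketch of that second suggestion:

D <- exp(dd)                               # exponential difficulties from the demo above
M.total <- sum(m)                          # maximum possible score (33 here)
b.zero    <- Ablest (0.3, D, m)            # stand-in logit for a zero score
b.perfect <- Ablest (M.total - 0.3, D, m)  # stand-in logit for a perfect score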

To call this the “setup” is a little misleading; we normally aren’t allowed to just make up the item difficulties this way. There are a few other preliminaries that the psychometrician might weigh in on, or at least attend meetings about: for example, test design, item writing, field testing, field test analysis, item reviews, item calibration, linking, equating, standards setting, form development, item validation, and key verification. There is also the small matter of presenting the items to the students. Once those are out of the way, the psychometrician’s job of producing the raw score to scale score lookup table is simple.

Once I deal with a few more preliminaries, I’ll get back to the good stuff, like diagnosing item and person anomalies.

Ix. Doing the Arithmetic Redux with Guttman Patterns

For almost the same thing as a PDF with better formatting: Doing the Arithmetic Redux

Many posts ago, I asserted that doing the arithmetic to get estimates of item difficulties for dichotomous items is almost trivial. You don’t need to know anything about second derivatives, Newton’s method iterations, or convergence criterion. You do need to:

  1. Create an L x L matrix N = [nij], where L is the number of items.
  2. For each person, add a 1 to nij if item j is correct and i is incorrect; zero otherwise.
  3. Create an L x L matrix R = [rij] of log odds; i.e., rij = log(nij / nji)
  4. Calculate the row averages; di = ∑ rij / L.

Done; the row average for row i is the logit difficulty of item i.

That’s the idea but it’s a little too simplistic. Mathematically, step three won’t work if either nij or nji is zero; in one case, you can’t do the division and in the other, you can’t take the log. In the real world, this means everyone has to take the same set of items and every item has to be a winner and a loser in every pair. For reasonably large fixed form assessments, neither of these is an issue.

Expressing step 4 in matrix speak, Ad = S, where A is an LxL diagonal matrix with L on the diagonal, d is the Lx1 vector of logit difficulties that we are after, and S is the Lx1 vector of row sums. Or d = A⁻¹S, which is nothing more than saying that the d are the row averages.

R-code that probably works, assuming L, x, and data have been properly defined, and almost line for line what we just said:

Block 1: Estimating Difficulties from a Complete Matrix of Counts

N = matrix (0, L, L)                        # Define and zero an LxL matrix

for (x in data)                             # Loop through people

   N = N + ((1-x) %o% x)                    # Outer product of vectors creates a square matrix of paired counts

R = log (N / t(N))                          # Log odds: (i wrong, j right) over (j wrong, i right)

d = rowMeans(R)                             # Find the row averages

This probably requires some explanation. The object data contains the scored data with one row for each person. The vector x contains the zero-one scored response string for a person. The outer product, %o%, of x with its complement creates a square matrix with rij = 1 when both xj and (1 – xi) are one; zero otherwise. The log odds line we used here to define R will always generate some errors as written because the diagonal of N will always be zero. It should have an error trap in it like: R = ifelse ((t(N)*N), log (N / t(N) ), 0).

But if N and R aren’t full, we will need the coefficient matrix A. We could start with a diagonal matrix with L on the diagonal. Wherever we find a zero off-diagonal entry in R, subtract one from the diagonal and add one to the same off-diagonal entry of A. Block 2 accomplishes the same thing with slightly different logic because of what I know how to do in R; here we start with a matrix of all zeros except ones where the log odds are missing and then figure out what the diagonal should be.

Block 2: Taking Care of Cells Missing from the Matrix of Log Odds
Build_A <- function (L, R) {
   A = ifelse (R, 0, 1)                     # Mark missing cells (includes the diagonal)
   diag(A) = L - (rowSums(A) - 1)           # Fix the diagonal (now every row sums to L)
return (A)
}

We can tweak the first block of code a little to take care of empty cells. This is pretty much the heart of the pair-wise method for estimating logit difficulties. With this and an R-interpreter, you could do it. However any functional, self-respecting, self-contained package would surround this core with several hundred lines of code to take care of the housekeeping to find and interpret the data and to communicate with you.

Block 3: More General Code Taking Care of Missing Cells

N = matrix (0, L, L)                        # Define and zero an LxL matrix

for (x in data)                             # Loop through people

   {N = N + ((1-x) %o% x)}                  # Outer product of vectors creates a square matrix of paired counts

R = ifelse ((t(N)*N), log (N / t(N)), 0)    # Log odds, with zeros wherever a count is missing

A = Build_A (L, R)                          # Create the coefficient matrix, allowing for empty cells

d = solve (A, rowSums(R))                   # Solve the equations simultaneously

There is one gaping hole hidden in the rather innocuous expression, for (x in data), which will probably keeping you from actually using this code. The vector x is the scored, zero-one item responses for one person. The object data presumably holds all the response vectors for everyone in the sample. The idea is to retrieve one response vector at a time, add it into the counts matrix N in the appropriate manner, until we’ve worked our way through everyone. I’m not going to tackle how to construct data today. What I will do is skip ahead to the fourth line and show you some actual data.
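If data happened to be nothing more than a matrix of scored (and, for polytomous items, Guttman-expanded) responses with one row per person, the loop could be written as below; that is a convenient special case, not the general bookkeeping being deferred here.

for (k in 1:nrow(data)) {
   x <- data[k, ]                  # one person's zero-one response vector
   N <- N + ((1 - x) %o% x)        # accumulate the paired counts
}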

Table 1: Table of Count Matrix N for Five Multiple Choice Items

Counts     MC.1   MC.2   MC.3   MC.4   MC.5
MC.1          0     35     58     45     33
MC.2        280      0    240    196    170
MC.3        112     49      0     83     58
MC.4        171     77    155      0     99
MC.5        253    145    224    193      0

Table 1 is the actual counts for part of a real assessment. The entries in the table are the number of times the row item was missed and the column item was passed. The table is complete (i.e., all non-zeros except for the diagonal). Table 2 is the log odds computed from Table 1; e.g., log (280 / 35) = 2.079 indicating item 2 is about two logits harder than item 1. Because the table is complete, we don’t really need the A-matrix of coefficients to get difficulty estimates; just add across each row and divide by five.

Table 2: Table of Log Odds R for Five Multiple Choice Items

Log Odds     MC.1     MC.2     MC.3     MC.4     MC.5    Logit
MC.1        0.000   -2.079   -0.658   -1.335   -2.037   -1.222
MC.2        2.079    0.000    1.589    0.934    0.159    0.952
MC.3        0.658   -1.589    0.000   -0.625   -1.351   -0.581
MC.4        1.335   -0.934    0.625    0.000   -0.668    0.072
MC.5        2.037   -0.159    1.351    0.668    0.000    0.779
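As a quick check (a sketch with the counts typed in by hand), the final column of Table 2 can be reproduced directly from Table 1:

N <- matrix (c(  0,  35,  58,  45,  33,
               280,   0, 240, 196, 170,
               112,  49,   0,  83,  58,
               171,  77, 155,   0,  99,
               253, 145, 224, 193,   0), 5, 5, byrow=TRUE)
R <- ifelse (t(N)*N, log(N / t(N)), 0)     # zeros on the diagonal, with a harmless NaN warning
round (rowMeans(R), 3)                     # -1.222  0.952 -0.581  0.072  0.779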

This brings me to the true elegance of the algorithm in Block 3. When we build the response vector x correctly (a rather significant qualification,) we can use exactly the same algorithm that we have been using for dichotomous items to handle polytomous items as well. So far, with zero-one items, the response vector was a string of zeros and ones and the vector’s length was the maximum possible score, which is also the number of items. We can coerce constructed responses into the same format.

If, for example, we have a constructed response item with four categories, there are three thresholds and the maximum possible score is three. With four categories, we can parse the person’s response into three non-independent items. There are four allowable response patterns, which, not coincidentally, happen to be the four Guttman patterns: (000), (100), (110), and (111), which correspond to the four observable scores: 0, 1, 2, and 3. All we need to do to make our algorithm work is replace the observed zero-to-three polytomous score with the corresponding zero-one Guttman pattern.

Response   CR.1-1   CR.1-2   CR.1-3
0               0        0        0
1               1        0        0
2               1        1        0
3               1        1        1

If, for example, the person’s response vector for the five MC and one CR was (101102), the new vector will be (10110110). The person’s total score of five hasn’t changed, but we now have a response vector of all ones and zeros of length equal to the maximum possible score, which is the number of thresholds, which is greater than the number of items. With all dichotomous items, the length was also the maximum possible score and the number of thresholds, but that was also the number of items. With the reconstructed response vectors, we can now naively apply the same algorithm and receive in return the logit difficulty for each threshold.
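A hypothetical helper (my own, not part of the earlier code) that does this expansion:

expand_guttman <- function (x, m) {
   # x: observed item scores; m: maximum possible score for each item
   unlist (lapply (seq_along(x), function (i) c(rep(1, x[i]), rep(0, m[i] - x[i]))))
}

expand_guttman (c(1,0,1,1,0,2), c(1,1,1,1,1,3))   # 1 0 1 1 0 1 1 0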

Here are some more numbers to make it a little less obscure.

Table 3: Table of Counts for Five Multiple Choice Items and One Constructed Response

Counts     MC.1   MC.2   MC.3   MC.4   MC.5   CR.1-1   CR.1-2   CR.1-3
MC.1          0     35     58     45     33       36       70        4
MC.2        280      0    240    196    170       91      234       21
MC.3        112     49      0     83     58       52       98       14
MC.4        171     77    155      0     99       59      162       12
MC.5        253    145    224    193      0       74      225       25
CR.1-1       14      5     14     11      8        0        0        0
CR.1-2      101     46     85     78     63      137        0        0
CR.1-3      432    268    404    340    277      639      502        0

The upper left corner is the same as we had earlier but I have now added one three-threshold item. Because we are restricted to the Guttman patterns, part of the lower right is missing: e.g., you cannot pass item CR.1-2 without passing CR.1-1, or put another way, we cannot observe non-Guttman response patterns like (0, 1, 0).

Table 4: Table of Log Odds R for Five Multiple Choice Items and One Constructed Response

Log Odds     MC.1     MC.2     MC.3     MC.4     MC.5   CR.1-1   CR.1-2   CR.1-3       Sum     Mean
MC.1        0.000   -2.079   -0.658   -1.335   -2.037    0.944   -0.367   -4.682   -10.214   -1.277
MC.2        2.079    0.000    1.589    0.934    0.159    2.901    1.627   -2.546     6.743    0.843
MC.3        0.658   -1.589    0.000   -0.625   -1.351    1.312    0.142   -3.362    -4.814   -0.602
MC.4        1.335   -0.934    0.625    0.000   -0.668    1.680    0.731   -3.344    -0.576   -0.072
MC.5        2.037   -0.159    1.351    0.668    0.000    2.225    1.273   -2.405     4.989    0.624
CR.1-1     -0.944   -2.901   -1.312   -1.680   -2.225    0.000    0.000    0.000    -9.062   -1.133
CR.1-2      0.367   -1.627   -0.142   -0.731   -1.273    0.000    0.000    0.000    -3.406   -0.426
CR.1-3      4.682    2.546    3.362    3.344    2.405    0.000    0.000    0.000    16.340    2.043

Moving to the matrix of log odds, we have even more holes. The table includes the row sums, which we will need, and the row means, which are almost meaningless. The empty section of the log odds does make it obvious that the constructed response thresholds are estimated from their relationship to the multiple choice items, not from anything internal to the constructed response itself.

The A-matrix of coefficients (Table 5) is now useful. The rows define the simultaneous equations to be solved. For the multiple choice, we can still just use the row means because those rows are complete. The logit difficulties in the final column are slightly different than the row means we got when working just with the five multiple choice for two reasons: the logits are now centered on the eight thresholds rather than the five difficulties, and we have added in some more data from the constructed response.

Table 5: Coefficient Matrix A for Five Multiple Choice Items and One Constructed Response

A          MC.1   MC.2   MC.3   MC.4   MC.5   CR.1-1   CR.1-2   CR.1-3       Sum    Logit
MC.1          8      0      0      0      0        0        0        0   -10.214   -1.277
MC.2          0      8      0      0      0        0        0        0     6.743    0.843
MC.3          0      0      8      0      0        0        0        0    -4.814   -0.602
MC.4          0      0      0      8      0        0        0        0    -0.576   -0.072
MC.5          0      0      0      0      8        0        0        0     4.989    0.624
CR.1-1        0      0      0      0      0        6        1        1    -9.062   -1.909
CR.1-2        0      0      0      0      0        1        6        1    -3.406   -0.778
CR.1-3        0      0      0      0      0        1        1        6    16.340    3.171

This is not intended to be an R primer so much as an alternative way to show some algebra and do some arithmetic. I have found the R language to be a convenient tool for doing matrix operations, the R packages to be powerful tools for many perhaps most complex analyses, and the R documentation to be almost impenetrable. The language was clearly designed by and most packages written by very clever people; the examples in the documentation seemed intended to impress the other very clever people with how very clever the author is rather than illustrate something I might actually want to do.

My examples probably aren’t any better.

Viiif: Apple Pie and Disordered Thresholds Redux

A second try at disordered thresholds

It has been suggested, with some justification, that I may be a little chauvinistic depending so heavily on a baseball analogy when pondering disordered thresholds. So for my friends in Australia, Cyprus, and the Czech Republic, I’ll try one based on apple pie.

Certified pie judges for the Minnesota State Fair are trained to evaluate each entry on the criteria in Table 1 and the results for pies, at least the ones entered into competitions, are unimodal, somewhat skewed to the left.

Table 1: Minnesota State Fair Pie Judging Rubric

Aspect                  Points
Appearance                  20
Color                       10
Texture                     20
Internal appearance         15
Aroma                       10
Flavor                      25
Total                      100

We might suggest some tweaks to this process, but right now our assignment is to determine preferences of potential customers for our pie shop. All our pies would be 100s on the State Fair rubric so it won’t help. We could collect preference data from potential customers by giving away small taste samples at the fair and asking each taster to respond to a short five-category rating scale with categories suggested by our psychometric consultant.

My feeling about this pie is:

0   I’d rather have boiled liver
1   Can I have cake instead?
2   Almost as good as my mother’s
3   Among the best I’ve ever eaten
4   I could eat this right after a major feast!

The situation is hypothetical; the data are simulated from unimodal distributions with roughly equal means. On day one, thresholds 3 and 4 were reversed; on day two, thresholds 2 and 3 for some tasters were also reversed. None of that will stop me from interpreting the results. It is not shown in the summary of the data below, but the answer to our marketing question is that pies made with apples were the clear winners. (To appropriate a comment that Rasch made about correlation coefficients, this result is population-dependent and therefore scientifically rather uninteresting.) Any problems that the data might have with the thresholds did not prevent us from reaching this conclusion rather comfortably. The most preferred pies received the highest scores in spite of our problematic category labels. Or at least that’s the story I will include with my invoice.

The numbers we observed for the categories are shown in Table 2. Right now we are only concerned with the categories, so this table is summed over the pies and the tasters.

Table 2: Results of Pie Preference Survey for Categories

Category                                        Day One   Day Two
0  I’d rather have boiled liver                      10       120
1  Can I have cake instead?                         250       751
2  Almost as good as my mother’s                    785        95
3  Among the best I’ve ever eaten                    83        22
4  I could eat this right after a major feast!      321       482

In this scenario, we have created at least two problems. First, the wording of the category descriptions may be causing some confusion; I hope those distinctions survive the cultural and language differences between the US and the UK. Second, the day two group is making an even cruder distinction among the pies: almost simply “I like it” or “I don’t like it.”

Category 4 was intended to capture the idea that this pie is so good that I will eat it even if I have already eaten myself to the point of pain. For some people, that may not be different from “this pie is among the best I’ve ever eaten,” which is why relatively few chose category 3. Anything involving mothers is always problematic on a rating scale. Depending on your mother, “Almost as good as my mother’s” may be the highest possible rating; for others, it may be slightly above boiled liver. That suggests there may be a problem with the category descriptors that our psychometrician gave us, but the fit statistics would not object. And it doesn’t explain the difference between days one and two.

Day Two happened to be the day that apples were being judged in a separate arena, completely independently of the pie judging. Consequently every serious apple grower in Minnesota was at the fair. Rather than spreading across the five categories, more or less, this group tended to see pies as a dichotomy: those that were made with apples and those that weren’t. While the general population spread out reasonably well across the continuum, the apple growers were definitely bimodal in their preferences.

The day two anomaly is in the data, not the model or thresholds. The disordered thresholds, which exposed the anomaly by imposing a strong model but are not reflected in the standard fit statistics, are an indication that we should think a little more about what we are doing. Almost certainly, we could improve on the wording of the category descriptions. But we might also want to separate apple orchard owners from other respondents to our survey. The same might also be true for banana growers, but they don’t figure heavily in Minnesota horticulture. Once again, Rasch has shown us what is population-independent, i.e., the thresholds (and therefore scientifically interesting), and what is population-dependent, i.e., frequencies and preferences (and therefore only interesting to marketers).

These insights don’t tell us much about marketing pies better but I wouldn’t try to sell banana cream to apple growers and I would want to know how much of my potential market are apple growers. I am still at a loss to explain why anyone, even beef growers, would pick liver over anything involving sugar and butter.