Probability Time Trials

It has come to my attention that I write the basic Rasch probability in half a dozen different forms; half of them are in logits (aka, log odds) and half are in the exponential metric (aka, odds.) My two favorites for exposition are, in logits, exp (b-d) / [1 + exp (b-d)] and, in exponentials, B / (B + D)., where B = eb and D = ed. The second of these I find the most intuitive: the probability in favor of the person is the person’s odds divided by the sum of the person and item odds. The first, the logit form, may be the most natural because logits are the units used for the measures and exhibit the interval scale properties and this form emphasizes the basic relationship between the person and item.

There are variations on each of these forms like, [B / D]/ [1 + B / D] and 1 / [1+ D / B], which are simple algebraic manipulations. The forms are all equivalent; the choice of which to use is simply convention, personal preference, or perhaps computing efficiency, but that has nothing to do with how we talk to each other, only how we talk to the computer. The goal of computing efficiency means to minimize the calls to the log and exponential functions, which causes me to work internally mostly in the exponentials and to do input and output in logits.

These deliberations led to a small time trial to provide a partial answer to the efficiency question in R. I first step up some basic parameters and defined a function to compute 100,000 probabilities. (When you consider a state-wide assessment, which can range from a few thousand to a few hundred thousand students per grade, that’s not a very big number. If I were more serious, I would use a timer with more precision than whole seconds.)

> b = 1.5; d = -1.5

> B = exp(b); > D = exp(d)

> timetrial = function (b, d, N=100000, Prob) { for (k in 1:N) p[k] = Prob(b,d) }

Then I ran timetrial 100,000 times for each of seven expressions for the probability; the first three and the seventh use logits; four, five, and six use exponentials.

> date ()

[1] “Tue Jan 06 11:49:00 ”

> timetrial(b,d,,(1 / (1+exp(d-b))))            # 26 seconds

> date ()

[2] “Tue Jan 06 11:49:26 ”

> timetrial(b,d,,(exp(b-d) / (1+exp(b-d)))) # 27 seconds

> date ()

[3] “Tue Jan 06 11:49:53 ”

> timetrial(b,d,,(exp(b)/(exp(b)+exp(d)))) # 27 seconds

> date ()

[4] “Tue Jan 06 11:50:20 ”

> timetrial(b,d,,(1 / (1+D/B)))                  # 26 seconds

> date ()

[5] “Tue Jan 06 11:50:46 ”

> timetrial(b,d,,((B/D) / (1+B/D)))            # 27 seconds

> date ()

[6] “Tue Jan 06 11:51:13 ”

> timetrial(b,d,,(B / (B+D)))                     # 26 seconds

> date ()

[7] “Tue Jan 06 11:51:39 ”

> timetrial(b,d,,(plogis(b-d)))                  # 27 seconds

> date ()

[8] “Tue Jan 06 11:52:06 ”

The winners were the usual suspects, the ones with the fewest calls and operations but the bottom line seems to be, at least in this limited case using an interpreted language, it makes very little difference. That I take as good news: there is little reason to bother using the exponential metric in the computing.

The seventh form of the probability, plogis, is the built-in R function for the logistic distribution. While it was no faster, it is an R function and so can handle a variety of arguments in a call like “plogis (b-d).” If b and d are both scalars, the value of the expression is a scalar. If either b or d is a vector or a matrix, the value is a vector or matrix of the same size. If both b and d are vectors then the argument (b-d) doesn’t work in general, but the argument outer(b,d,“-“) will create a matrix of probabilities with dimensions matching the lengths of b and d. This will allow computing all the probabilities for, say, a class or school on a fixed form with a single call.

The related R function, dlogis (b-d) has the value of p(1-p), which is useful in Newton’s method or when computing the standard errors. And may be useful for impressing your geek friends or further mystifying your non-geek clients.

Ixb. R-code to make a simple model less simple and more useful

My life as a psychometrician, the ability algorithm, and some R procs to do the work

The number one job of the psychometrician, in the world of large-scale, state-wide assessments, is to produce the appropriate raw-to-scale tables on the day promised. When they are wrong or late, lawsuits ensue. When they are right and on time, everyone is happy. If we did nothing else, most wouldn’t notice; few would complain.

Once the setup is done, computers can produce the tables in a blink of an eye. It is so easy it is often better, especially in the universe beyond fixed forms, to provide the algorithm to produce scale scores on demand and not bother with lookup tables at all. Give it the item difficulties, feed in the raw score, and the scale score pops out. Novices must take care that management never finds out how easy this step really is.

With modern technology, the ability algorithm can be written in almost any computer language (there is probably an app for your phone) but some are easier than others. My native language is Fortran, so I am most at home with C, C++, R, or related dialects. I am currently using R most of the time. For me with dichotomous items, this does it:

Ability (d)          # where d is the vector of logit difficulties.

But first, I need to copy a few other things into the R window, like a procedure named Ability.

(A simple cut and paste from this post into R probably won’t work but the code did work when I copied it from the Editor. The website seems to use a font for which not all math symbols are recognized by R. In particular, slash (/), minus (-), single and double quotes (‘ “), and ellipses (…) needed to fixed. I’ve done it with a “replace all” in a text editor before moving it into R by copying the offending symbol from the text into the “replace” box of the “replace all” command and typing the same symbol into the “with” box.  Or leave a comment and I’ll email you a simple text version)

Ability <- function (d, M=rep(1,length(d)), first=1, last=(length(d)-1), A = 500, B = 91, …) { b <- NULL; s <- NULL
    b[first] <- first / (length(d) – first)
   D <- exp(d)
    for (r in first:last) {
      b[r] <- Ablest(r, D, M,  b[r], …)
      s[r] <- SEM (exp(b[r]), D, M)
      b[r+1] <- exp(b[r] + s[r]^2)
return (data.frame(raw=(first:last), logit=b[first:last], sem.logit=s[first:last],
          GRit=round((A+B*b[first:last]),0),  sem.GRit=round(B*s[first:last],1)))
} ##############################################################

This procedure is just a front for functions named Ablest and SEM that actually do the work so you will need to copy them as well:

Ablest <- function (r, D, M=rep(1,length(D)), B=(r / (length (D)-r)), stop=0.01) {
# r is raw score; D is vector of exponential difficulties; M is vector of m[i]; stop is the convergence
      repeat {
         adjust <- (r – SumP(B,D,M)) / SumPQ (B,D,M)
         B <- exp(log(B) + adjust)
      if (abs(adjust) < stop) return (log(B))
} ##########################################################ok
SEM <- function (b, d, m=(rep(1,length(d))))  return (1 / sqrt(SumPQ(b,d,m)))

And Ablest needs some even more basic utilities copied into the window:

SumP <- function (b, d, m=NULL, P=function (B,D) (B / (B+D))) {
   if (is.null(m)) return (sum (P(b,d))) # dichotomous case; sum() is a built-in function
   k <- 1
   Sp <- 0
   for (i in 1:length(m)) {
       Sp <- Sp + EV (b, d[k:(k+m[i]-1)])
       k <- k + m[k]
return (Sp)
} ##################################################################ok
EV <- function (b, d) { #  %*% is the inner product, produces a scalar
   return (seq(1:length(d)) %*% P.Rasch(b, d, m=length(d)))
} ##################################################################ok
SumPQ <- function (B, D, m=NULL, P=function (B,D) {B/(B+D)}, PQ=function (p) {p-p^2}) {
   if (is.null(m)) return (sum(PQ(P(B,D))))  # dichotomous case;
   k <- 1
   Spq <- 0
   for (i in 1:length(m)) {
       Spq = Spq + VAR (B,D[k:(k+m[i]-1)])
       k <- k + m[k]
return (Spq)
} ##################################################################ok
VAR <- function (b,d) {  # this is just the polytomous version of (p – p^2)
   return (P.Rasch(b, d, m=length(d)) %*% ((1:length(d))^2) – EV(b,d)^2)
} ##################################################################ok
P.Rasch <- function (b, d, m=NULL, P=function (B,D) (B / (B+D)) ) {
   if (is.null(m)) return (P(b,d)) # simple logistic
   return (P.poly (P(b,d),m))     # polytomous
} ##################################################################ok
P.poly <- function (p, m) { # p is a simple vector of category probabilities
   k <- 1
   for (i in 1:length(m)) {
      p[k:(k+m[i]-1)] = (p[k:(k+m[i]-1)], m[i])
      k <- k + m[i]
return (p)
} ##################################################################ok <- function (pstar, m=length(pstar)) {
#       Converts p* to p; assumes a vector of probabilities
#       computed naively as B/(B+D).  This routine takes account
#       of the Guttmann response patterns allowed with PRM.
    q <- 1-pstar  # all wrong, 000…
    p <- prod(q)
    for (j in 1:m) {
        q[j] <- pstar[j] # one more right, eg., 100…, or 110…, …
        p[j+1] <- prod(q)
    return (p[-1]/sum(p)) # Don’t return p for category 0
} ##################################################################ok
summary.ability <- function (score, dec=5) {
   print (round(score,dec))
   plot(score[,4],score[,1],xlab=”GRit”,ylab=”Raw Score”,type=’l’,col=’red’)
} ##################################################################

This is very similar to the earlier version of Ablest but has been generalized to handle polytomous items, which is where the vector M of maximum scores or number of thresholds comes in.

To use more bells and whistles, the call statement can be things like:

Ability (d, M, first, last, A, B, stop)         # All the parameters it has
Ability (d, M)                                         # first, last, A, B, & stop have defaults
Ability (d,,,,,, 0.0001)                             # stop is positional so the commas are needed
Ability (d, ,10, 20,,, 0.001)                       # default for M assumes dichotomous items
Ability (d, M,,, 100, 10)                          # defaults for A & B are 500 and 91

To really try it out, we can define a vector of item difficulties with, say, 25 uniformly spaced dichotomous items and two polytomous items, one with three thresholds and one with five. The vector m defines the matching vector of maximum scores.

dd=c(seq(-3,3,.25), c(-1,0,1), c(-2,-1,0,1,2))
m = c(rep(1,25),3,5)
score = Ability (d=dd, M=m)
summary.ability (score, 4)

Or give it your own vectors of logit difficulties and maximum scores.

For those who speak R, the code is fairly intuitive, perhaps not optimal, and could be translated almost line by line into Fortran, although some lines would become several. Most of the routines can be called directly if you’re so inclined and get the arguments right. Most importantly, Ability expects logit difficulties and returns logit abilities. Almost everything expects and uses exponentials. Almost all error messages are unintelligible and either because d and m don’t match or something is an exponential when it should be a logit or vice versa.

I haven’t mentioned what to do about zero and perfect scores today because, first, I’m annoyed that they are still necessary, second,  these routines don’t do them, and, third, I talked about the problem a few posts ago. But, if you must, you could use b[0] = b[1] – SEM[1]^2 and b[M] = b[M-1] + SEM[M-1]^2, where M is the maximum possible score, not necessarily the number of items. Or you could appear even more scientific and use something like b[0] = Ablest(0.3, D, m) and b[M] = Ablest(M-0.3, D, m). Here D is the vector of difficulties in the exponential form and m is the vector of maximum scores for the items (and M is the sum of the m‘s.) The length of D is the total number of thresholds (aka, M) and the length of m is the number of items (sometimes called L.) Ablest doesn’t care that the score isn’t an integer but Ability would care. The value 0.3 was a somewhat arbitrary choice; you may prefer 0.25 or 0.33 instead.

To call this the “setup” is a little misleading; we normally aren’t allowed to just make up the item difficulties this way. There are a few other preliminaries that the psychometrician might weigh in on or at least show up at meetings; for example, test design, item writing, field testing, field test analysis, item reviews, item calibration, linking, equating, standards setting, form development, item validation, and key verification. There is also the small matter of presenting the items to the students. Once those are out-of-the-way, the psychometrician’s job of producing the raw score to scale score lookup table is simple.

Once I deal with a few more preliminaries , I’ll go ahead and go back to the good stuff like diagnosing item and person anomalies.

Ix. Doing the Arithmetic Redux with Guttman Patterns

For almost the same thing as a PDF with better formatting: Doing the Arithmetic Redux

Many posts ago, I asserted that doing the arithmetic to get estimates of item difficulties for dichotomous items is almost trivial. You don’t need to know anything about second derivatives, Newton’s method iterations, or convergence criterion. You do need to:

  1. Create an L x L matrix N = [nij], where L is the number of items.
  2. For each person, add a 1 to nij if item j is correct and i is incorrect; zero otherwise.
  3. Create an L x L matrix R = [rij] of log odds; i.e., rij = log(nij / nji)
  4. Calculate the row averages; di = ∑ rij / L.

Done; the row average for row i is the logit difficulty of item i.

That’s the idea but it’s a little too simplistic. Mathematically, step three won’t work if either nij or nji is zero; in one case, you can’t do the division and in the other, you can’t take the log. In the real world, this means everyone has to take the same set of items and every item has to be a winner and a loser in every pair. For reasonably large fixed form assessments, neither of these is an issue.

Expressing step 4 in matrix speak, Ad = S, where A is an LxL diagonal matrix with L on the diagonal, d is the Lx1 vector of logit difficulties that we are after, and S is the Lx1 vector of row sums. Or d = A-1S, which is nothing more than the d are the row averages.

R-code that probably works, assuming L, x, and data have been properly defined, and almost line for line what we just said:

Block 1: Estimating Difficulties from a Complete Matrix of Counts R

N = matrix (0, L, L)                                 # Define and zero an LxL matrix

for ( x in data)                                          # Loop through people

N = N + ((1x) %o% x)                   # Outer product of vectors creates square

R = log (t(N) / N)                                      # Log Odds (ji) over (ij)

d = rowMeans(R)                                     # Find the row averages

This probably requires some explanation. The object data contains the scored data with one row for each person. The vector x contains the zero-one scored response string for a person. The outer product, %o%, of x with its complement creates a square matrix with a rij = 1 when both xj and (1 xi) are one; zero otherwise. The log odds line we used here to define R will always generate some errors as written because the diagonal of N will always be zero. It should have an error trap in it like: R = ifelse ((t(N)*N), log (N / t(N) ), 0).

But if the N and R aren’t full, we will need the coefficient matrix A. We could start with a diagonal matrix with L on the diagonal. Wherever we find a zero off-diagonal entry in Y, subtract one from the diagonal and add one to the same off-diagonal entry of A. Block 2 accomplishes the same thing with slightly different logic because of what I know how to do in R; here we start with a matrix of all zeros except ones where the log odds are missing and then figure out what the diagonal should be.

Block 2: Taking Care of Cells Missing from the Matrix of Log Odds R
Build_A <- function (L, R) {
   A = ifelse (R,0,1)                                              # Mark missing cells (includes diagonal)
   diag(A) = L – (rowSums(A) – 1)                       # Fix the diagonal (now every row sums to L)
return (A)

We can tweak the first block of code a little to take care of empty cells. This is pretty much the heart of the pair-wise method for estimating logit difficulties. With this and an R-interpreter, you could do it. However any functional, self-respecting, self-contained package would surround this core with several hundred lines of code to take care of the housekeeping to find and interpret the data and to communicate with you.

Block 3: More General Code Taking Care of Missing Cells

N = matrix (0, L, L)                                   # Define and zero an LxL matrix

for (x in data)                                            # Loop through people

{N = N + ((1x) %o% x)}                   # Outer product of vectors creates square

R = ifelse ((t(N)*N), log (N / t(N) ), 0)         # Log Odds (ji) over (ij)

A = Build_A (L, R)                             # Create coefficient matrix with empty cells

d = solve (A, rowSums(R))                       # Solve equations simultaneously

There is one gaping hole hidden in the rather innocuous expression, for (x in data), which will probably keeping you from actually using this code. The vector x is the scored, zero-one item responses for one person. The object data presumably holds all the response vectors for everyone in the sample. The idea is to retrieve one response vector at a time, add it into the counts matrix N in the appropriate manner, until we’ve worked our way through everyone. I’m not going to tackle how to construct data today. What I will do is skip ahead to the fourth line and show you some actual data.

Table 1: Table of Count Matrix N for Five Multiple Choice Items


MC.1 MC.2 MC.3 MC.4 MC.5


0 35 58 45 33


280 0 240 196 170
MC.3 112 49 0 83


MC.4 171 77 155 0


MC.5 253 145 224 193


Table 1 is the actual counts for part of a real assessment. The entries in the table are the number of times the row item was missed and the column item was passed. The table is complete (i.e., all non-zeros except for the diagonal). Table 2 is the log odds computed from Table 1; e.g., log (280 / 35) = 2.079 indicating item 2 is about two logits harder than item 1. Because the table is complete, we don’t really need the A-matrix of coefficients to get difficulty estimates; just add across each row and divide by five.

Table 2: Table of Log Odds R for Five Multiple Choice Items

Log Odds

MC.1 MC.2 MC.3 MC.4 MC.5 Logit


0 -2.079 -0.658 -1.335 -2.037 -1.222


2.079 0 1.589 0.934 0.159 0.952
MC.3 0.658 -1.589 0 -0.625 -1.351


MC.4 1.335 -0.934 0.625 0 -0.668


MC.5 2.037 -0.159 1.351 0.668 0


This brings me to the true elegance of the algorithm in Block 3. When we build the response vector x correctly (a rather significant qualification,) we can use exactly the same algorithm that we have been using for dichotomous items to handle polytomous items as well. So far, with zero-one items, the response vector was a string of zeros and ones and the vector’s length was the maximum possible score, which is also the number of items. We can coerce constructed responses into the same format.

If, for example, we have a constructed response item with four categories, there are three thresholds and the maximum possible score is three. With four categories, we can parse the person’s response into three non-independent items. There are four allowable response patterns, which not coincidentally, happen to be the four Guttmann patterns: (000), (100), (110), and (111), which correspond to the four observable scores: 0, 1, 2, and 3. All we need to do to make our algorithm work is replace the observed zero-to-three polytomous score with the corresponding zero-one Guttmann pattern.


CR.1-2 CR.1-2 CR.1-3


0 0 0
1 1 0



1 1


3 1 1


If for example, the person’s response vector for the five MC and one CR was (101102), the new vector will be (10110110). The person’s total score of five hasn’t changed but we know have a response vector of all ones and zeros of length equal to the maximum possible score, which is the number of thresholds, which is greater than the number of items. With all dichotomous items, the length was also the maximum possible score and the number of thresholds but that was also the number of items. With the reconstructed response vectors, we can now naively apply the same algorithm and receive in return the logit difficulty for each threshold.

Here are some more numbers to make it a little less obscure.

Table 3: Table of Counts for Five Multiple Choice Items and One Constructed Response


MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3
MC.1 0 35 58 45 33 36 70



280 0 240 196 170 91 234 21
MC.3 112 49 0 83 58 52 98



171 77 155 0 99 59 162 12
MC.5 253 145 224 193 0 74 225



14 5 14 11 8 0 0 0
CR.1-2 101 46 85 78 63 137 0


CR.1-3 432 268 404 340 277 639 502


The upper left corner is the same as we had earlier but I have now added one three-threshold item. Because we are restricted to the Guttman patterns, part of the lower right is missing: e.g., you cannot pass item CR.1-2 without passing CR.1-1, or put another way, we cannot observe non-Guttman response patterns like (0, 1, 0).

Table 4: Table of Log Odds R for Five Multiple Choice Items and One Constructed Response

Log Odds

MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3 Sum Mean
MC.1 0 -2.079 -0.658 -1.335 -2.037 0.944 -0.367 -4.682 -10.214



2.079 0 1.589 0.934 0.159 2.901 1.627 -2.546 6.743 0.843
MC.3 0.658 -1.589 0 -0.625 -1.351 1.312 0.142 -3.362 -4.814



1.335 -0.934 0.625 0 -0.668 1.680 0.731 -3.344 -0.576 -0.072
MC.5 2.037 -0.159 1.351 0.668 0 2.225 1.273 -2.405 4.989



-0.944 -2.901 -1.312 -1.680 -2.225 0 0 0 -9.062 -1.133


0.367 -1.627 -0.142 -0.731 -1.273 0 0 0 -3.406


CR.1-3 4.682 2.546 3.362 3.344 2.405 0 0 0 16.340


Moving to the matrix of log odds, we have even more holes. The table includes the row sums, which we will need, and the row means, which are almost meaningless. The empty section of the logs odds does make it obvious that the constructed response thresholds are estimated from their relationship to the multiple choice items, not from anything internal to the constructed response itself.

The A-matrix of coefficients (Table 5) is now useful. The rows define the simultaneous equations to be solved. For the multiple choice, we can still just use the row means because those rows are complete. The logit difficulties in the final column are slightly different than the row means we got when working just with the five multiple choice for two reasons: the logits are now centered on the eight thresholds rather than the five difficulties, and we have added in some more data from the constructed response.

Table 5: Coefficient Matrix A for Five Multiple Choice Items and One Constructed Response


MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3 Sum Logit


8 0 0 0 0 0 0 0 -10.214 -1.277


0 8 0 0 0 0 0 0 6.743 0.843


0 0 8 0 0 0 0 0 -4.814 -0.602


0 0 0 8 0 0 0 0 -0.576


MC.5 0 0 0 0 8 0 0 0



CR.1-1 0 0 0 0 0 6 1 1 -9.062


CR.1-2 0 0 0 0 0 1 6 1 -3.406


CR.1-3 0 0 0 0 0 1 1 6 16.340


This is not intended to be an R primer so much as an alternative way to show some algebra and do some arithmetic. I have found the R language to be a convenient tool for doing matrix operations, the R packages to be powerful tools for many perhaps most complex analyses, and the R documentation to be almost impenetrable. The language was clearly designed by and most packages written by very clever people; the examples in the documentation seemed intended to impress the other very clever people with how very clever the author is rather than illustrate something I might actually want to do.

My examples probably aren’t any better.

Viib. Using R to do a little work

Ability estimates, perfect scores, and standard errors

The philosophical musing of most of my postings has kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty d and the person’s ability b, the probability of person succeeding on the item is p = B / (B + D), where B = exp(b) and D = exp(d). If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the Bank by finding the value of the b that makes the observed score equal the expected score, i.e., solves the equation r = ∑p, where r is the person’s number correct score and p was just defined.

If you are more concrete than that, here is a little R-code that will do the arithmetic, although it’s not particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure r is in the range 1 to L-1 (where L = length of d,) the ds are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.

#Block 1: Routine to compute logit ability for number correct r given d
Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties
   b <- log (r / (length (d)-r))    # Initialize
   repeat {
         adjust <- (r – sum(P(b,d))) / sum(PQ (P(b,d)))
         b <- b + adjust
         if (abs(adjust) < stop) return (b)
}      }
P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability
PQ <- function (p) (p-p^2)                     # p(1-p) aka inverse of the 2nd derivative

If you would like to try it, copy the text between the lines above into an R-window and then define the ds somehow and type in, say, Able(r=1, d=ds) or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command Able(r,d) if r and d have been defined. If you don’t have R installed on your computer, following the link to LLTM in the menu on the right will take you to an R site that has a “Get R” option.

In the world of R, the hash tag marks a comment so anything that follows is ignored. This is roughly equivalent to other uses of hash tags and R had it first.

#Block 2: Test ability routines
Test.Able <- function (low, high, inc) {
#Create a vector of logit difficulties to play with,
d = seq(low, high, inc)

# The ability for a raw score of 1,
# overriding default the convergence criterion of 0.01 with 0.0001
print (“Ability r=1:”)
    print (Able(r=1, d=d, stop=0.0001))
#To get all the abilities from 1 to L-1
# first create a spot to receive results
b = NA
#Then compute the abilities; default convergence = 0.01
for (r in 1:(length(d)-1) )
     b[r] = Able (r, d)
#Show what we got
print (“Ability r=1 to L-1:”)
Test.Able (-2,2,0.25)

I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (sem); we have everything we need for them. For a quick average, of sorts, sem, useful for planning and test design, we have the Wright-Douglas approximation: sem = 2.5/√L, where L is the number of items on the test. Wright & Stone (1979, p 135) provide another semi-shortcut based on height, width, and length, where height is the percent correct, width is the  range of difficulties, and length is the number of items. Or to extricate the sem for almost any score from the logit ability table, semr = √[(br+1 – br-1)/2]. Or if you want to do it right, semr =1 / √[∑pr(1-pr)].

Of course, I have some R-code. Let me know if it doesn’t work.

#Block 3: Standard Errors and a few shortcuts
# Wright-Douglas ‘typical’ sem
wd.sem <- function (k) (2.5/sqrt(k))
# Wright-Stone from Mead-Ryan
SEMbyHWL <- function (H=0.5,W=4,L=1) {
     C2 <- NA
     W <- ifelse(W>0,W,.001)
     for (k in 1:length(H))
            C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))
return (sqrt( C2 / L))
# SEM from logit ability table
bToSem <- function (r1, r2, b) {
     s  <- NA
     for (r in r1:r2)
           s[r] <- (sqrt((b[r+1]-b[r-1])/2))
return (s)
# Full blown SEM
sem <- function (b, d) {
     s <-  NA
    for (r in 1:length(b))
          s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))
 return (s)

To get the SEM’s from all four approaches, all you really need are the four lines below after “Now we’re ready” below. The rest is start up and reporting.


#Block 4: Try out Standard Error procedures
Test.SEM <- function (d) {
# First, a little setup (assuming Able is still loaded.)
L = length (d)
        W = max(d) – min(d)
        H = seq(L-1)/L
# Then compute the abilities; default convergence = 0.01
      b = NA
      for (r in 1:(L-1))
            b[r] = Able (r, d)
# Now we’re ready
       s.wd = wd.sem (length(d))
       s.HWL = SEMbyHWL (H,W,L)
       s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment
       s = sem(b,d)
# Show what we got
     print (“Height”)
     print (“Width”)
     print (“Length”)
    print (“Wright-Douglas typical SEM:”)
        print (round(s.wd,2))
    print (“HWL SEM r=1 to L-1:”)
        print (round(s.HWL,3))
    print (“SEM r=2 to L-2 from Ability table:”)
       print (round(c(s.from.b,NA),3))
    print (“MLE SEM r=1 to L-1:”)
      print (round(s,3))
   plot(b,s,xlim=c(-4,4),ylim=c(0.0,1),col=”red”,type=”l”,xlab=”Logit Ability”,ylab=”Standard Error”)
Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with d=seq(-3,3,0.25) are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more off-target the test will be. The point of bringing this up at this point is that Wright & Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.

The Wright-Stone HWL approach, which proceeded Wright-Douglas, is also intended for test design, determining how many items were needed and how they should be distributed. This suggested the best test design is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form SEM =  C / √L, where C is a rather messy function of H and W. The real innovation in HWL was the recognition that test length L could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because there are no abilities high enough for a perfect number correct or low enough for zero to make either L = ∑p or 0 = ∑p true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are but sometimes we need to manufacture something to report.

Because the sem for 1 and L-1 are typically a little greater than one, in logits, we could adjust the ability estimates for 1 and L-1 by 1.2 or so; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to 0 and L, say, 0.25 and L-0.25. Or you can get slightly less extreme values using 0.33 or 0.5, or more extreme using 0.1.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for zero and one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. Close is defined as 0.33 or 0.25 or 0.10 logits. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

Method Shift Zero Perfect
Constant 1.20 -5.58 5.58
SE shift One -5.51 5.51
Shift 0.33 -5.57 5.57
Shift 0.25 -5.86 5.86
Shift 0.10 -6.80 6.80

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about it.
Test.0100 <- function (shift) {
      d = seq(-3,3,0.25)
      b = NA
      for (r in 1:(length(d)-1) ) b[r] = Able (r, d)
# Adjust by something a little greater than one squared
b0 = b[1]-shift[1]
      bL = b[length(d)-1]+shift[1] 
      print(c(“Constant shift”,shift[1],round(b0, 2),round(bL, 2)))
      plot(c(b0,b,bL),c(0:length(d)+1),xlim=c(-6.5,6.5),type=”b”,xlab=”Logit Ability”,ylab=”Number Correct”,col=”blue”)
# Adjust by one standard error squared
s = sem(b,d)
      b0 = b[1]-s[1]^2
      bL = b[length(d)-1]+s[1]^2
      print(c(“SE shift”,round(b0, 2),round(bL, 2)))
      points (c(b0,b,bL),c(0:length(d)+1),col=”red”,type=”b”)
#Estimate ability for something “close” to zero;
for (x in shift[-1]) {
           b0 = Able(x,d)                         # if you try Able(0,d) you will get an inscrutable error.
           bL = Able(length(d)-x,d)
           print( c(“Shift”,x,round(b0, 2),round(bL, 2)))
           points (c(b0,b,bL),c(0:length(d)+1),type=”b”)
}    }

Test.0100 (c(1.2,.33,.25,.1))

The basic issue is not statistics; it’s policy for how much the powers that be want to punish or reward zero or perfect. But, if you really want to do the right thing, don’t give tests so far off target.

Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-bases are three or four times more frequent than three-bases. This begs an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.

Read on . . .Polytomous Rasch Models

Vii: Significant Relationships in the Life of a Psychometrician

Rules of Thumb, Shortcuts, Loose Ends, and Other Off-Topic Topics:

Unless you can prove your approximation is as good as my exact solution, I am not interested in your approximation. R. Daryl Bock[1]

Unless you can show me your exact solution is better than my approximation, I am not interested in your exact solution. Benjamin D. Wright[2]

Rule of Thumb Estimates for Rasch Standard Errors

The asymptotic standard error for Marginal Maximum Likelihood estimates of the Rasch difficulty d or ability b parameters is:

Continue: Rules of Thumb, Short Cuts, Loose Ends


[1] I first applied to the University of Chicago because Prof. Bock was there.

[2] There was a reason I ended up working with Prof. Wright.

Gustavus Adolphus Nobel Conference

I have been distracted for a few days with the Gustavus Adolphus Nobel Conference. This was its 50th anniversary and focused on the question, “Where does science go from here?” The Presenters included 3 or 4 Nobel laureates and some people who could talk to them.

The videos of the two days should be archived online in a couple weeks and are well worth watching. The link is

The two most applied were Jennifer West talking about using nanotechnology to destroy cancer cells and Harry Gray talking about using high school students to work on oxidizing water, although West was one of the weaker presenters. For entertainment, try Patricia Churchland, neurophilosopher, talking about the evolution of morality, although she may make your head hurt trying to keep up.

IV. Doing the math (and a little algebra)

Estimates and Estimators: connecting model to data

The essential attributes of a Rasch model are sufficient statistics and separable parameters, which allow, but don’t guarantee, specific objectivity. Well, actually sufficient statistics come pretty close if they really are sufficient to capture all the relevant information in the data. We will come back to this in the discussion of what Rasch called control of the model and most of us call goodness of fit. The current topic is a demonstration, more intuitive than mathematical, of how to manipulate the model to estimate item difficulties.

The process begins with the basic Rasch model for how likely the person wins when one person takes one dichotomous item: . . . The Disappearing Beta Trick


Previous: IIIf. Another Aspect                               Home

III. Abstracting Some Aspects

  • Measure what is measurable, and make measurable what is not so. Galileo Galilei

Measuring rocks and the significance of being sufficient

The process of measurement begins long before any data are collected. The starting point is a notion, or even better, a theory about an aspect of a class of things we want to understand better, maybe even do some science on. Successful measurement depends on clear thinking about the aspect and clever ideas for the agents. This is much more challenging and much more rewarding than any mathematical gymnastics that might be performed to fit model to data.

All analogies are limited but some are useful. Considering aspects of things far removed from cognitive traits may help avoid some of the pitfalls encountered when working too close to home. Hardness is a property of materials that is hard to define but we all know what it is when it hits us. Color is a narrow region of a continuous spectrum that non-physicists tend to think about as discrete categories. Temperature is an intimate part of our daily lives, which we are quite adept at sensing and more recently at measuring, but the closely connected idea, heat, may actually be more real, less bound to conventions and populations. If I could scale the proficiency of professional football teams and reliably predict the outcomes of games, I wouldn’t be writing this.

Continue reading . . . Hard Headedness: the importance of being sufficient


Previous: II: Measurement in Science         Next: IIIb: The Aspect of Color

II. Measurement in Science

Measurement is the breaking up of a quantum of energy into equal units. George Herbert Mead

What does it mean “to measure” and #RaschMeasurement as a foreign language

If, in a discussion about buying a new table, your spouse were to say to you, “I measured the width of the room and …” you would not expect the conversation to degenerate immediately into a discussion about what is width, or what does measured mean, or who made your yardstick, or what units you used. But if, in a discussion with the school guidance counselor, you are told, “I measured the intelligence of your child and …” you could, and probably should, ask those same questions, although they probably won’t be any more warmly received in the guidance office than they were in the dining room.

Continue reading . . . Measurement in Science

Previous: I. Rasch’s Theory of Relativity               Next: IIIa: Abstracting Some Aspects

I. Rasch’s Theory of Relativity

The reasonable man strives to adapt himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. George Bernard Shaw, Man and Superman: Maxims to a Revolutionist

Population dependent and convention versus Rasch’s #SpecificObjectivity

While trying to understand what Rasch meant when he referred (as he often did) to tests “built by my methods and conforming to my principles,” I began this philosophical apology, which was intended to help bring sanity and civility to our approach to the practical problems of assessing status, monitoring growth, evaluating effectiveness, and, in short, doing science in the social sciences.

If you have been introduced to the Rasch Measurement as a special case of the more mathematically exotic item response theory (IRT) , then you and Rasch have not been properly introduced, and you are probably under the misimpressions that:

  • All the mathematical gymnastics associated with IRT are necessary.
  • Rasch is what you do if you don’t have the resources to do it right.

The point of this treatise is to disabuse you of those misunderstandings. If I had graduate students, this is what I would tell them to inoculate them against the standard introduction-to-measurement course. We will reserve the term IRT for the models that are not Rasch models, because that term seems to describe the intent of those models, with their focus on fitting the data. The older term latent trait theory fits better with the Rasch perspective, with its focus on the underlying aspect to be measured.

Our concern is the efficacy of Rasch measurement, how it works under controlled conditions, which can hardly be controversial. When the data conform to Rasch’s principles, i.e., the data are based on agents that are equally valid and reliable; not subject to interference from extraneous attributes of the objects, the models have the power to encompass and extend the best of Thurstone and Guttman. Guttman created a non-stochastic Rasch model, with very sufficient statistics; Thurstone defined “fundamental measurement”, which foreshadowed Rasch’s “specific objectivity.” This leads to measurement, as the layman understands the word, and sets the stage for the more vital tasks of making and analyzing measures.

Most of the mainstream debate surrounding Rasch measurement has focused on effectiveness, how the models function when confronted with real responses from real people to real tests, questionnaires, surveys, checklists, and other instruments, some put together with little or no thought for their suitability for measurement[1]. The conclusion that Rasch models are robust, i.e., do pretty well in this real world, should not be taken as justification to continue doing what we’ve been doing.

There are two commonly cited motivations for using Rasch’s models: the most popular being they are extraordinarily easy to apply, compared to IRT models. Useful results can come from relatively small samples and the estimation algorithms converge readily unless the data are pathologically bizarre. In this very data-driven world, “Rasch analysis” (or the verb form, to rasch) seems to mean running data through Rasch calibration software. This requires minimal intellectual commitment and by itself doesn’t accomplish what Rasch set out to accomplish. Rasch’s more compelling motivation takes more effort.

Rasch’s Motivation

While trying to solve practical problems in statistics and in educational and psychological testing, Georg Rasch came upon a special class of models, which led him to a general philosophy of measurement. Rasch defined measurement: if it’s not Rasch measurement, it’s not measurement! Georg was a very unreasonable man.

The phrase “Rasch Measurement” is redundant; I use it to avoid ambiguity. For its adherents, Rasch measurement is axiomatic: self-evident because this is the way it must be or it’s not measurement:

The calibration of the agents must be independent of the objects used and the measurement of the objects must be independent of the agents used, over a useful range.[2]

This is not an unchecked assertion, but a rational criterion by which one can evaluate data under the bright light of a theory. From the model follow consequences. If the observed data are consistent with the anticipated consequences, we have met the conditions of Thurstone’s fundamental measurement and can treat the results with the same respect we have for measures of, say, mass, heat, distance, or duration, which, like reading fluency or art appreciation, are not real things but aspects of things.

I come by my biases naturally. Professionally, I am a grandson, on my statistics side, and great grandson, on my measurement side, of Sir Ronald Fisher. My view of statistics was shaped by people who worked with Fisher. I was grounded in statistics at Iowa State University in a department founded by George Snedecor; the focus heavily on the design of experiments and the analysis of variance, which I learned from the likes of T.A. Bancroft, Oscar Kempthorne, David Jowett, Herbert David, and David Huntsberger, some of whom had known, worked, and undoubtedly argued (if you knew Fisher, Snedecor or Kempthorne) with Fisher on a regular basis.

My view of measurement was shaped by Georg Rasch, who worked with Fisher in England, taking away from that experience Fisher’s concepts of sufficient statistics and maximum likelihood estimation (MLE). The existence of sufficient statistics is the sine qua non of Rasch measurement. I learned Rasch measurement at Chicago (in the Department of Education founded by John Dewey) sitting at the feet of Benjamin Wright, Rasch’s most active and vocal North American disciple and evangelist. Rasch visited the University of Chicago while I was a student there, although I was too green to benefit much from that visitation.

There are strong parallels and much overlap between my two universes. Rasch measurement is to item response theory (IRT) as design of experiments is to general linear models (GLM). GLM is what you do if you can’t, or won’t, get the design right; IRT is what you do if you can’t, or won’t, get the instrument right. Both cases necessitate mathematical gymnastics that can substitute for clear thinking and mask poor planning. GLM and IRT rely on fitting models to “explain” data in a statistical sense, a venerable activity in some statistical traditions and very much in vogue in today’s Big Data world. But it’s not my tradition and it’s not measurement.

The point of experimental design is to produce observations that permit unambiguous inferences to answer specific, carefully stated questions: questions like, what level of catalyst is optimal for reforming crude oil or which feed ration is best for finishing hogs? We would really like the answer to be independent of the specific breed, gender, age, growing conditions, and intended use of the pig, but that isn’t going to happen. More likely, the answer will include a description of the specific domain to which it applies.

The point of Rasch measurement is to produce measures that unambiguously quantify a specific aspect of the object of interest; measures that are independent of any other attribute of either the object (e.g., a person) or our agents (e.g., items.) We would like the agents to be universally applicable, but more likely they will be valid for small neighborhoods in the universe, which must be described.

Rasch’s Principle and Method

Design of experiments and Rasch measurement rely fundamentally on sufficient statistics to make inferences. Sufficient statistics are the constant, overarching theme. They are what make analysis of variance[3], as described by Fisher and Snedecor, which implies more than just simply partitioning the sum of squares, work; they are what make Rasch measurement, which implies more than just running data through an appropriately named piece of software, measurement. Once you have harvested the information in the sufficient statistic, you know everything the data have to tell you about the factor that you are testing or the aspect that you are measuring. That is Rasch’s principle.

Anything left in the data should be noise and anything gleaned from the residual is to be used for control of the model. That is Rasch’s method.

It is unlikely that you will learn anything useful about Maximum Likelihood Estimation in these pages; the mathematics employed here are several steps down. The methods and derivations included are workable but you will need to look elsewhere for scholarly discussions and rigorous derivations of the most efficient and fashionable methods (e.g., Andrich; Fischer; Masters & Wright; Smith & Smith; Wilson). I rely less on calculus and proof and more on analogy and metaphor than is generally deemed proper in scholarly circles. While I will try to make the presentation non-mathematical, I will not try to make it simple.

There is little new here; the majority of the entries in the reference list are between 1960, when Rasch’s Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960) was published, and 1980, when it was republished shortly after Rasch’s death. There is as much here about rocks, darts, football, and oral reading as about multiple-choice items. I attempted, not at all successfully, to avoid mathematics, but those seeking rigorous explanations of estimation methods or fit statistics will need to look elsewhere (e.g., Smith & Smith, 2004; Fischer & Molenaar, 1995). This is not the manual for any Rasch computer package; it will not explain what WinSteps, RUMM, ConQuest, LPCM-WIN, or especially eRm.R is actually doing. For more hand-holding, try Wright and Stone (1979) or Bond and Fox (2007.)

Finally, this is not a cookbook for applying a special case of IRT models, although we do embrace the notion that Rasch models are very special indeed. Over the last forty years, I have come to understand that:

  1. Rasch Measurement is such an extraordinarily special case of IRT that the general IRT literature says almost nothing that helps us understand or achieve measurement.
  2. The mathematical complexities are a distraction and often counter-productive; our resources are better spent elsewhere.
  3. Measurement doesn’t happen by graciously accepting whatever instrument the author proudly hands you; fitting a model that explains the data isn’t progress.
  4. You need to be extraordinarily lucky to have an instrument that meets Rasch’s requirements; the harder you work on the design, the luckier you will be.[4]

I very purposefully keep saying Rasch Measurement, rather than The Rasch Model. There are a number of mathematical expressions that qualify as Rasch models and the specific expression of the appropriate form for a given situation is incidental, probably self-evident, once we understand what we want to accomplish. Our goal is measurement, as the world[5] understands the word.

Rasch’s Theory of Relativity

Saying our goal is measurement blatantly begs the question, What is measurement? For the moment, our answer will be, Measurement is the process of quantifying an aspect of an object. This then begs the question, What is an aspect? An aspect is not a thing nor the essence of the thing but simply an interesting, at least to us at the moment, property of the thing.

Rasch’s principles define measurement; Rasch’s methods are the process.

Trying to avoid the Metaphysics, as well as the calculus, it doesn’t matter for our purpose if there is some ideal, pure form for the aspect out there in hyperspace or the mind of God independent of any actual objects or, alternatively, if the aspect exists only if objects having the aspect exist. Is there an abstract idea of, for example, “reading proficiency” or are there just students who vary in their capacities to decode written text? In either event, our problem is to imagine the consequences of that capacity for students and devise tactics to make the consequences manifest. We are trying to learn something about the status of a kid in a classroom, not to deduce the nature of a Socratic form.

There are, however, basic theoretical or philosophical issues that are much more interesting, much more challenging, and that come before the arithmetic I will eventually describe. Doing the arithmetic correctly may inform the discussion and analysis but it neither starts them nor ends them.

For much of the twentieth century, mental measurement and, by inheritance, the social sciences were hamstrung by superficial thinking and logical shortcuts, illustrated by assertions like “IQ is what IQ tests measure.” The items that make up such a test are an operational definition of some aspect and, I guess, you can call it IQ if you like but there is a valid validity question looming out there. It is far better to have a sound theoretical basis[6] of the aspect and a statement of observable consequences before we commit to a bunch of test items. If we don’t know where we are going, any items’ll get us there.[7]

The purpose of this apology is to nudge, by Rasch, mental measurement toward the same level of consensus and respect that physical measurement enjoys, or that it enjoyed before Einstein. This is Rasch’s theory of relativity: the path to understand what is real and invariant and to recognize what is convention and population-dependent. That doesn’t seem so unreasonable.

Next: II. Measurement in Science

[1] Or often when confronted with data sets deliberately simulated to not match Rasch’s requirements.

[2] Paraphrasing Thurstone and Rasch.

[3] In Bancroft’s view, the analysis of variance is just the computing algorithm to do the arithmetic for extracting mean squares and components of variance from appropriately designed experiments. Rasch analysis, as differentiated from Rasch measurement and as implemented by any number of pieces of software past and present, is just the computing algorithm to do the arithmetic for producing measures from appropriately generated observations and establishing the domains for which they are applicable and appropriate, i.e., valid.

[4] Paraphrasing Thomas Jefferson and Branch Rickey.

[5] With the possible exception of that portion of the world populated by mainstream psychometricians doing educational research in the US.

[6] IQ may have been a poor choice to illustrate the point of a sound theoretical basis.

[7] Paraphrasing “If you don’t care where you are going, any road’ll get you there.” George Harrison, paraphrasing Lewis Carroll, from a conversation between Alice and the Cheshire Cat.