# Probability Time Trials

It has come to my attention that I write the basic Rasch probability in half a dozen different forms; half of them are in logits (aka, log odds) and half are in the exponential metric (aka, odds.) My two favorites for exposition are, in logits, exp (b-d) / [1 + exp (b-d)] and, in exponentials, B / (B + D)., where B = eb and D = ed. The second of these I find the most intuitive: the probability in favor of the person is the person’s odds divided by the sum of the person and item odds. The first, the logit form, may be the most natural because logits are the units used for the measures and exhibit the interval scale properties and this form emphasizes the basic relationship between the person and item.

There are variations on each of these forms like, [B / D]/ [1 + B / D] and 1 / [1+ D / B], which are simple algebraic manipulations. The forms are all equivalent; the choice of which to use is simply convention, personal preference, or perhaps computing efficiency, but that has nothing to do with how we talk to each other, only how we talk to the computer. The goal of computing efficiency means to minimize the calls to the log and exponential functions, which causes me to work internally mostly in the exponentials and to do input and output in logits.

These deliberations led to a small time trial to provide a partial answer to the efficiency question in R. I first step up some basic parameters and defined a function to compute 100,000 probabilities. (When you consider a state-wide assessment, which can range from a few thousand to a few hundred thousand students per grade, that’s not a very big number. If I were more serious, I would use a timer with more precision than whole seconds.)

> b = 1.5; d = -1.5

> B = exp(b); > D = exp(d)

> timetrial = function (b, d, N=100000, Prob) { for (k in 1:N) p[k] = Prob(b,d) }

Then I ran timetrial 100,000 times for each of seven expressions for the probability; the first three and the seventh use logits; four, five, and six use exponentials.

> date ()

 “Tue Jan 06 11:49:00 ”

> timetrial(b,d,,(1 / (1+exp(d-b))))            # 26 seconds

> date ()

 “Tue Jan 06 11:49:26 ”

> timetrial(b,d,,(exp(b-d) / (1+exp(b-d)))) # 27 seconds

> date ()

 “Tue Jan 06 11:49:53 ”

> timetrial(b,d,,(exp(b)/(exp(b)+exp(d)))) # 27 seconds

> date ()

 “Tue Jan 06 11:50:20 ”

> timetrial(b,d,,(1 / (1+D/B)))                  # 26 seconds

> date ()

 “Tue Jan 06 11:50:46 ”

> timetrial(b,d,,((B/D) / (1+B/D)))            # 27 seconds

> date ()

 “Tue Jan 06 11:51:13 ”

> timetrial(b,d,,(B / (B+D)))                     # 26 seconds

> date ()

 “Tue Jan 06 11:51:39 ”

> timetrial(b,d,,(plogis(b-d)))                  # 27 seconds

> date ()

 “Tue Jan 06 11:52:06 ”

The winners were the usual suspects, the ones with the fewest calls and operations but the bottom line seems to be, at least in this limited case using an interpreted language, it makes very little difference. That I take as good news: there is little reason to bother using the exponential metric in the computing.

The seventh form of the probability, plogis, is the built-in R function for the logistic distribution. While it was no faster, it is an R function and so can handle a variety of arguments in a call like “plogis (b-d).” If b and d are both scalars, the value of the expression is a scalar. If either b or d is a vector or a matrix, the value is a vector or matrix of the same size. If both b and d are vectors then the argument (b-d) doesn’t work in general, but the argument outer(b,d,“-“) will create a matrix of probabilities with dimensions matching the lengths of b and d. This will allow computing all the probabilities for, say, a class or school on a fixed form with a single call.

The related R function, dlogis (b-d) has the value of p(1-p), which is useful in Newton’s method or when computing the standard errors. And may be useful for impressing your geek friends or further mystifying your non-geek clients.

# Ixb. R-code to make a simple model less simple and more useful

My life as a psychometrician, the ability algorithm, and some R procs to do the work

The number one job of the psychometrician, in the world of large-scale, state-wide assessments, is to produce the appropriate raw-to-scale tables on the day promised. When they are wrong or late, lawsuits ensue. When they are right and on time, everyone is happy. If we did nothing else, most wouldn’t notice; few would complain.

Once the setup is done, computers can produce the tables in a blink of an eye. It is so easy it is often better, especially in the universe beyond fixed forms, to provide the algorithm to produce scale scores on demand and not bother with lookup tables at all. Give it the item difficulties, feed in the raw score, and the scale score pops out. Novices must take care that management never finds out how easy this step really is.

With modern technology, the ability algorithm can be written in almost any computer language (there is probably an app for your phone) but some are easier than others. My native language is Fortran, so I am most at home with C, C++, R, or related dialects. I am currently using R most of the time. For me with dichotomous items, this does it:

Ability (d)          # where d is the vector of logit difficulties.

But first, I need to copy a few other things into the R window, like a procedure named Ability.

(A simple cut and paste from this post into R probably won’t work but the code did work when I copied it from the Editor. The website seems to use a font for which not all math symbols are recognized by R. In particular, slash (/), minus (-), single and double quotes (‘ “), and ellipses (…) needed to fixed. I’ve done it with a “replace all” in a text editor before moving it into R by copying the offending symbol from the text into the “replace” box of the “replace all” command and typing the same symbol into the “with” box.  Or leave a comment and I’ll email you a simple text version)

Ability <- function (d, M=rep(1,length(d)), first=1, last=(length(d)-1), A = 500, B = 91, …) { b <- NULL; s <- NULL
b[first] <- first / (length(d) – first)
D <- exp(d)
for (r in first:last) {
b[r] <- Ablest(r, D, M,  b[r], …)
s[r] <- SEM (exp(b[r]), D, M)
b[r+1] <- exp(b[r] + s[r]^2)
}
return (data.frame(raw=(first:last), logit=b[first:last], sem.logit=s[first:last],
GRit=round((A+B*b[first:last]),0),  sem.GRit=round(B*s[first:last],1)))
} ##############################################################

This procedure is just a front for functions named Ablest and SEM that actually do the work so you will need to copy them as well:

Ablest <- function (r, D, M=rep(1,length(D)), B=(r / (length (D)-r)), stop=0.01) {
# r is raw score; D is vector of exponential difficulties; M is vector of m[i]; stop is the convergence
repeat {
adjust <- (r – SumP(B,D,M)) / SumPQ (B,D,M)
if (abs(adjust) < stop) return (log(B))
}
} ##########################################################ok
SEM <- function (b, d, m=(rep(1,length(d))))  return (1 / sqrt(SumPQ(b,d,m)))
##############################################################

And Ablest needs some even more basic utilities copied into the window:

SumP <- function (b, d, m=NULL, P=function (B,D) (B / (B+D))) {
if (is.null(m)) return (sum (P(b,d))) # dichotomous case; sum() is a built-in function
k <- 1
Sp <- 0
for (i in 1:length(m)) {
Sp <- Sp + EV (b, d[k:(k+m[i]-1)])
k <- k + m[k]
}
return (Sp)
} ##################################################################ok
EV <- function (b, d) { #  %*% is the inner product, produces a scalar
return (seq(1:length(d)) %*% P.Rasch(b, d, m=length(d)))
} ##################################################################ok
SumPQ <- function (B, D, m=NULL, P=function (B,D) {B/(B+D)}, PQ=function (p) {p-p^2}) {
if (is.null(m)) return (sum(PQ(P(B,D))))  # dichotomous case;
k <- 1
Spq <- 0
for (i in 1:length(m)) {
Spq = Spq + VAR (B,D[k:(k+m[i]-1)])
k <- k + m[k]
}
return (Spq)
} ##################################################################ok
VAR <- function (b,d) {  # this is just the polytomous version of (p – p^2)
return (P.Rasch(b, d, m=length(d)) %*% ((1:length(d))^2) – EV(b,d)^2)
} ##################################################################ok
P.Rasch <- function (b, d, m=NULL, P=function (B,D) (B / (B+D)) ) {
if (is.null(m)) return (P(b,d)) # simple logistic
return (P.poly (P(b,d),m))     # polytomous
} ##################################################################ok
P.poly <- function (p, m) { # p is a simple vector of category probabilities
k <- 1
for (i in 1:length(m)) {
p[k:(k+m[i]-1)] = P.star (p[k:(k+m[i]-1)], m[i])
k <- k + m[i]
}
return (p)
} ##################################################################ok
P.star <- function (pstar, m=length(pstar)) {
#
#       Converts p* to p; assumes a vector of probabilities
#       computed naively as B/(B+D).  This routine takes account
#       of the Guttmann response patterns allowed with PRM.
#
q <- 1-pstar  # all wrong, 000…
p <- prod(q)
for (j in 1:m) {
q[j] <- pstar[j] # one more right, eg., 100…, or 110…, …
p[j+1] <- prod(q)
}
return (p[-1]/sum(p)) # Don’t return p for category 0
} ##################################################################ok
summary.ability <- function (score, dec=5) {
print (round(score,dec))
plot(score[,4],score[,1],xlab=”GRit”,ylab=”Raw Score”,type=’l’,col=’red’)
points(score[,4]+score[,5],score[,1],type=’l’,lty=3)
points(score[,4]-score[,5],score[,1],type=’l’,lty=3)
} ##################################################################

This is very similar to the earlier version of Ablest but has been generalized to handle polytomous items, which is where the vector M of maximum scores or number of thresholds comes in.

To use more bells and whistles, the call statement can be things like:

Ability (d, M, first, last, A, B, stop)         # All the parameters it has
Ability (d, M)                                         # first, last, A, B, & stop have defaults
Ability (d,,,,,, 0.0001)                             # stop is positional so the commas are needed
Ability (d, ,10, 20,,, 0.001)                       # default for M assumes dichotomous items
Ability (d, M,,, 100, 10)                          # defaults for A & B are 500 and 91

To really try it out, we can define a vector of item difficulties with, say, 25 uniformly spaced dichotomous items and two polytomous items, one with three thresholds and one with five. The vector m defines the matching vector of maximum scores.

dd=c(seq(-3,3,.25), c(-1,0,1), c(-2,-1,0,1,2))
m = c(rep(1,25),3,5)
score = Ability (d=dd, M=m)
summary.ability (score, 4)

Or give it your own vectors of logit difficulties and maximum scores.

For those who speak R, the code is fairly intuitive, perhaps not optimal, and could be translated almost line by line into Fortran, although some lines would become several. Most of the routines can be called directly if you’re so inclined and get the arguments right. Most importantly, Ability expects logit difficulties and returns logit abilities. Almost everything expects and uses exponentials. Almost all error messages are unintelligible and either because d and m don’t match or something is an exponential when it should be a logit or vice versa.

I haven’t mentioned what to do about zero and perfect scores today because, first, I’m annoyed that they are still necessary, second,  these routines don’t do them, and, third, I talked about the problem a few posts ago. But, if you must, you could use b = b – SEM^2 and b[M] = b[M-1] + SEM[M-1]^2, where M is the maximum possible score, not necessarily the number of items. Or you could appear even more scientific and use something like b = Ablest(0.3, D, m) and b[M] = Ablest(M-0.3, D, m). Here D is the vector of difficulties in the exponential form and m is the vector of maximum scores for the items (and M is the sum of the m‘s.) The length of D is the total number of thresholds (aka, M) and the length of m is the number of items (sometimes called L.) Ablest doesn’t care that the score isn’t an integer but Ability would care. The value 0.3 was a somewhat arbitrary choice; you may prefer 0.25 or 0.33 instead.

To call this the “setup” is a little misleading; we normally aren’t allowed to just make up the item difficulties this way. There are a few other preliminaries that the psychometrician might weigh in on or at least show up at meetings; for example, test design, item writing, field testing, field test analysis, item reviews, item calibration, linking, equating, standards setting, form development, item validation, and key verification. There is also the small matter of presenting the items to the students. Once those are out-of-the-way, the psychometrician’s job of producing the raw score to scale score lookup table is simple.

Once I deal with a few more preliminaries , I’ll go ahead and go back to the good stuff like diagnosing item and person anomalies.

# Ix. Doing the Arithmetic Redux with Guttman Patterns

For almost the same thing as a PDF with better formatting: Doing the Arithmetic Redux

Many posts ago, I asserted that doing the arithmetic to get estimates of item difficulties for dichotomous items is almost trivial. You don’t need to know anything about second derivatives, Newton’s method iterations, or convergence criterion. You do need to:

1. Create an L x L matrix N = [nij], where L is the number of items.
2. For each person, add a 1 to nij if item j is correct and i is incorrect; zero otherwise.
3. Create an L x L matrix R = [rij] of log odds; i.e., rij = log(nij / nji)
4. Calculate the row averages; di = ∑ rij / L.

Done; the row average for row i is the logit difficulty of item i.

That’s the idea but it’s a little too simplistic. Mathematically, step three won’t work if either nij or nji is zero; in one case, you can’t do the division and in the other, you can’t take the log. In the real world, this means everyone has to take the same set of items and every item has to be a winner and a loser in every pair. For reasonably large fixed form assessments, neither of these is an issue.

Expressing step 4 in matrix speak, Ad = S, where A is an LxL diagonal matrix with L on the diagonal, d is the Lx1 vector of logit difficulties that we are after, and S is the Lx1 vector of row sums. Or d = A-1S, which is nothing more than the d are the row averages.

R-code that probably works, assuming L, x, and data have been properly defined, and almost line for line what we just said:

Block 1: Estimating Difficulties from a Complete Matrix of Counts R

N = matrix (0, L, L)                                 # Define and zero an LxL matrix

for ( x in data)                                          # Loop through people

N = N + ((1x) %o% x)                   # Outer product of vectors creates square

R = log (t(N) / N)                                      # Log Odds (ji) over (ij)

d = rowMeans(R)                                     # Find the row averages

This probably requires some explanation. The object data contains the scored data with one row for each person. The vector x contains the zero-one scored response string for a person. The outer product, %o%, of x with its complement creates a square matrix with a rij = 1 when both xj and (1 xi) are one; zero otherwise. The log odds line we used here to define R will always generate some errors as written because the diagonal of N will always be zero. It should have an error trap in it like: R = ifelse ((t(N)*N), log (N / t(N) ), 0).

But if the N and R aren’t full, we will need the coefficient matrix A. We could start with a diagonal matrix with L on the diagonal. Wherever we find a zero off-diagonal entry in Y, subtract one from the diagonal and add one to the same off-diagonal entry of A. Block 2 accomplishes the same thing with slightly different logic because of what I know how to do in R; here we start with a matrix of all zeros except ones where the log odds are missing and then figure out what the diagonal should be.

Block 2: Taking Care of Cells Missing from the Matrix of Log Odds R
Build_A <- function (L, R) {
A = ifelse (R,0,1)                                              # Mark missing cells (includes diagonal)
diag(A) = L – (rowSums(A) – 1)                       # Fix the diagonal (now every row sums to L)
return (A)
}

We can tweak the first block of code a little to take care of empty cells. This is pretty much the heart of the pair-wise method for estimating logit difficulties. With this and an R-interpreter, you could do it. However any functional, self-respecting, self-contained package would surround this core with several hundred lines of code to take care of the housekeeping to find and interpret the data and to communicate with you.

Block 3: More General Code Taking Care of Missing Cells

N = matrix (0, L, L)                                   # Define and zero an LxL matrix

for (x in data)                                            # Loop through people

{N = N + ((1x) %o% x)}                   # Outer product of vectors creates square

R = ifelse ((t(N)*N), log (N / t(N) ), 0)         # Log Odds (ji) over (ij)

A = Build_A (L, R)                             # Create coefficient matrix with empty cells

d = solve (A, rowSums(R))                       # Solve equations simultaneously

There is one gaping hole hidden in the rather innocuous expression, for (x in data), which will probably keeping you from actually using this code. The vector x is the scored, zero-one item responses for one person. The object data presumably holds all the response vectors for everyone in the sample. The idea is to retrieve one response vector at a time, add it into the counts matrix N in the appropriate manner, until we’ve worked our way through everyone. I’m not going to tackle how to construct data today. What I will do is skip ahead to the fourth line and show you some actual data.

Table 1: Table of Count Matrix N for Five Multiple Choice Items

 Counts MC.1 MC.2 MC.3 MC.4 MC.5 MC.1 0 35 58 45 33 MC.2 280 0 240 196 170 MC.3 112 49 0 83 58 MC.4 171 77 155 0 99 MC.5 253 145 224 193 0

Table 1 is the actual counts for part of a real assessment. The entries in the table are the number of times the row item was missed and the column item was passed. The table is complete (i.e., all non-zeros except for the diagonal). Table 2 is the log odds computed from Table 1; e.g., log (280 / 35) = 2.079 indicating item 2 is about two logits harder than item 1. Because the table is complete, we don’t really need the A-matrix of coefficients to get difficulty estimates; just add across each row and divide by five.

Table 2: Table of Log Odds R for Five Multiple Choice Items

 Log Odds MC.1 MC.2 MC.3 MC.4 MC.5 Logit MC.1 0 -2.079 -0.658 -1.335 -2.037 -1.222 MC.2 2.079 0 1.589 0.934 0.159 0.952 MC.3 0.658 -1.589 0 -0.625 -1.351 -0.581 MC.4 1.335 -0.934 0.625 0 -0.668 0.072 MC.5 2.037 -0.159 1.351 0.668 0 0.779

This brings me to the true elegance of the algorithm in Block 3. When we build the response vector x correctly (a rather significant qualification,) we can use exactly the same algorithm that we have been using for dichotomous items to handle polytomous items as well. So far, with zero-one items, the response vector was a string of zeros and ones and the vector’s length was the maximum possible score, which is also the number of items. We can coerce constructed responses into the same format.

If, for example, we have a constructed response item with four categories, there are three thresholds and the maximum possible score is three. With four categories, we can parse the person’s response into three non-independent items. There are four allowable response patterns, which not coincidentally, happen to be the four Guttmann patterns: (000), (100), (110), and (111), which correspond to the four observable scores: 0, 1, 2, and 3. All we need to do to make our algorithm work is replace the observed zero-to-three polytomous score with the corresponding zero-one Guttmann pattern.

 Response CR.1-2 CR.1-2 CR.1-3 0 0 0 0 1 1 0 0 2 1 1 0 3 1 1 1

If for example, the person’s response vector for the five MC and one CR was (101102), the new vector will be (10110110). The person’s total score of five hasn’t changed but we know have a response vector of all ones and zeros of length equal to the maximum possible score, which is the number of thresholds, which is greater than the number of items. With all dichotomous items, the length was also the maximum possible score and the number of thresholds but that was also the number of items. With the reconstructed response vectors, we can now naively apply the same algorithm and receive in return the logit difficulty for each threshold.

Here are some more numbers to make it a little less obscure.

Table 3: Table of Counts for Five Multiple Choice Items and One Constructed Response

 Counts MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3 MC.1 0 35 58 45 33 36 70 4 MC.2 280 0 240 196 170 91 234 21 MC.3 112 49 0 83 58 52 98 14 MC.4 171 77 155 0 99 59 162 12 MC.5 253 145 224 193 0 74 225 25 CR.1-1 14 5 14 11 8 0 0 0 CR.1-2 101 46 85 78 63 137 0 0 CR.1-3 432 268 404 340 277 639 502 0

The upper left corner is the same as we had earlier but I have now added one three-threshold item. Because we are restricted to the Guttman patterns, part of the lower right is missing: e.g., you cannot pass item CR.1-2 without passing CR.1-1, or put another way, we cannot observe non-Guttman response patterns like (0, 1, 0).

Table 4: Table of Log Odds R for Five Multiple Choice Items and One Constructed Response

 Log Odds MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3 Sum Mean MC.1 0 -2.079 -0.658 -1.335 -2.037 0.944 -0.367 -4.682 -10.214 -1.277 MC.2 2.079 0 1.589 0.934 0.159 2.901 1.627 -2.546 6.743 0.843 MC.3 0.658 -1.589 0 -0.625 -1.351 1.312 0.142 -3.362 -4.814 -0.602 MC.4 1.335 -0.934 0.625 0 -0.668 1.680 0.731 -3.344 -0.576 -0.072 MC.5 2.037 -0.159 1.351 0.668 0 2.225 1.273 -2.405 4.989 0.624 CR.1-1 -0.944 -2.901 -1.312 -1.680 -2.225 0 0 0 -9.062 -1.133 CR.1-2 0.367 -1.627 -0.142 -0.731 -1.273 0 0 0 -3.406 -0.426 CR.1-3 4.682 2.546 3.362 3.344 2.405 0 0 0 16.340 2.043

Moving to the matrix of log odds, we have even more holes. The table includes the row sums, which we will need, and the row means, which are almost meaningless. The empty section of the logs odds does make it obvious that the constructed response thresholds are estimated from their relationship to the multiple choice items, not from anything internal to the constructed response itself.

The A-matrix of coefficients (Table 5) is now useful. The rows define the simultaneous equations to be solved. For the multiple choice, we can still just use the row means because those rows are complete. The logit difficulties in the final column are slightly different than the row means we got when working just with the five multiple choice for two reasons: the logits are now centered on the eight thresholds rather than the five difficulties, and we have added in some more data from the constructed response.

Table 5: Coefficient Matrix A for Five Multiple Choice Items and One Constructed Response

 A MC.1 MC.2 MC.3 MC.4 MC.5 CR.1-1 CR.1-2 CR.1-3 Sum Logit MC.1 8 0 0 0 0 0 0 0 -10.214 -1.277 MC.2 0 8 0 0 0 0 0 0 6.743 0.843 MC.3 0 0 8 0 0 0 0 0 -4.814 -0.602 MC.4 0 0 0 8 0 0 0 0 -0.576 -0.072 MC.5 0 0 0 0 8 0 0 0 4.989 0.624 CR.1-1 0 0 0 0 0 6 1 1 -9.062 -1.909 CR.1-2 0 0 0 0 0 1 6 1 -3.406 -0.778 CR.1-3 0 0 0 0 0 1 1 6 16.340 3.171

This is not intended to be an R primer so much as an alternative way to show some algebra and do some arithmetic. I have found the R language to be a convenient tool for doing matrix operations, the R packages to be powerful tools for many perhaps most complex analyses, and the R documentation to be almost impenetrable. The language was clearly designed by and most packages written by very clever people; the examples in the documentation seemed intended to impress the other very clever people with how very clever the author is rather than illustrate something I might actually want to do.

My examples probably aren’t any better.

# Viib. Using R to do a little work

Ability estimates, perfect scores, and standard errors

The philosophical musing of most of my postings has kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty d and the person’s ability b, the probability of person succeeding on the item is p = B / (B + D), where B = exp(b) and D = exp(d). If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the Bank by finding the value of the b that makes the observed score equal the expected score, i.e., solves the equation r = ∑p, where r is the person’s number correct score and p was just defined.

If you are more concrete than that, here is a little R-code that will do the arithmetic, although it’s not particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure r is in the range 1 to L-1 (where L = length of d,) the ds are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.

#Block 1: Routine to compute logit ability for number correct r given d
Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties
b <- log (r / (length (d)-r))    # Initialize
repeat {
adjust <- (r – sum(P(b,d))) / sum(PQ (P(b,d)))
if (abs(adjust) < stop) return (b)
}      }
P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability
PQ <- function (p) (p-p^2)                     # p(1-p) aka inverse of the 2nd derivative

If you would like to try it, copy the text between the lines above into an R-window and then define the ds somehow and type in, say, Able(r=1, d=ds) or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command Able(r,d) if r and d have been defined. If you don’t have R installed on your computer, following the link to LLTM in the menu on the right will take you to an R site that has a “Get R” option.

In the world of R, the hash tag marks a comment so anything that follows is ignored. This is roughly equivalent to other uses of hash tags and R had it first.

#Block 2: Test ability routines
Test.Able <- function (low, high, inc) {
#Create a vector of logit difficulties to play with,
d = seq(low, high, inc)

# The ability for a raw score of 1,
# overriding default the convergence criterion of 0.01 with 0.0001
print (“Ability r=1:”)
print (Able(r=1, d=d, stop=0.0001))
#To get all the abilities from 1 to L-1
# first create a spot to receive results
b = NA
#Then compute the abilities; default convergence = 0.01
for (r in 1:(length(d)-1) )
b[r] = Able (r, d)
#Show what we got
print (“Ability r=1 to L-1:”)
print(round(b,3))
}
Test.Able (-2,2,0.25)

I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (sem); we have everything we need for them. For a quick average, of sorts, sem, useful for planning and test design, we have the Wright-Douglas approximation: sem = 2.5/√L, where L is the number of items on the test. Wright & Stone (1979, p 135) provide another semi-shortcut based on height, width, and length, where height is the percent correct, width is the  range of difficulties, and length is the number of items. Or to extricate the sem for almost any score from the logit ability table, semr = √[(br+1 – br-1)/2]. Or if you want to do it right, semr =1 / √[∑pr(1-pr)].

Of course, I have some R-code. Let me know if it doesn’t work.

#Block 3: Standard Errors and a few shortcuts
# Wright-Douglas ‘typical’ sem
wd.sem <- function (k) (2.5/sqrt(k))
#
SEMbyHWL <- function (H=0.5,W=4,L=1) {
C2 <- NA
W <- ifelse(W>0,W,.001)
for (k in 1:length(H))
C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))
return (sqrt( C2 / L))
}
# SEM from logit ability table
bToSem <- function (r1, r2, b) {
s  <- NA
for (r in r1:r2)
s[r] <- (sqrt((b[r+1]-b[r-1])/2))
return (s)
}
# Full blown SEM
sem <- function (b, d) {
s <-  NA
for (r in 1:length(b))
s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))
return (s)
}

To get the SEM’s from all four approaches, all you really need are the four lines below after “Now we’re ready” below. The rest is start up and reporting.

#Block 4: Try out Standard Error procedures
Test.SEM <- function (d) {
# First, a little setup (assuming Able is still loaded.)
L = length (d)
W = max(d) – min(d)
H = seq(L-1)/L
# Then compute the abilities; default convergence = 0.01
b = NA
for (r in 1:(L-1))
b[r] = Able (r, d)
s.wd = wd.sem (length(d))
s.HWL = SEMbyHWL (H,W,L)
s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment
s = sem(b,d)
# Show what we got
print (“Height”)
print(H)
print (“Width”)
print(W)
print (“Length”)
print(L)
print (“Wright-Douglas typical SEM:”)
print (round(s.wd,2))
print (“HWL SEM r=1 to L-1:”)
print (round(s.HWL,3))
print (“SEM r=2 to L-2 from Ability table:”)
print (round(c(s.from.b,NA),3))
print (“MLE SEM r=1 to L-1:”)
print (round(s,3))
plot(b,s,xlim=c(-4,4),ylim=c(0.0,1),col=”red”,type=”l”,xlab=”Logit Ability”,ylab=”Standard Error”)
points(b,s.HWL,col=”green”,type=”l”)
points(b[-(L-1)],s.from.b,col=”blue”,type=”l”)
abline(h=s.wd,lty=3)
}
Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with d=seq(-3,3,0.25) are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more off-target the test will be. The point of bringing this up at this point is that Wright & Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.

The Wright-Stone HWL approach, which proceeded Wright-Douglas, is also intended for test design, determining how many items were needed and how they should be distributed. This suggested the best test design is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form SEM =  C / √L, where C is a rather messy function of H and W. The real innovation in HWL was the recognition that test length L could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because there are no abilities high enough for a perfect number correct or low enough for zero to make either L = ∑p or 0 = ∑p true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are but sometimes we need to manufacture something to report.

Because the sem for 1 and L-1 are typically a little greater than one, in logits, we could adjust the ability estimates for 1 and L-1 by 1.2 or so; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to 0 and L, say, 0.25 and L-0.25. Or you can get slightly less extreme values using 0.33 or 0.5, or more extreme using 0.1.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for zero and one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. Close is defined as 0.33 or 0.25 or 0.10 logits. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

 Method Shift Zero Perfect Constant 1.20 -5.58 5.58 SE shift One -5.51 5.51 Shift 0.33 -5.57 5.57 Shift 0.25 -5.86 5.86 Shift 0.10 -6.80 6.80

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about it.
Test.0100 <- function (shift) {
d = seq(-3,3,0.25)
b = NA
for (r in 1:(length(d)-1) ) b[r] = Able (r, d)
# Adjust by something a little greater than one squared
b0 = b-shift
bL = b[length(d)-1]+shift
print(c(“Constant shift”,shift,round(b0, 2),round(bL, 2)))
plot(c(b0,b,bL),c(0:length(d)+1),xlim=c(-6.5,6.5),type=”b”,xlab=”Logit Ability”,ylab=”Number Correct”,col=”blue”)
# Adjust by one standard error squared
s = sem(b,d)
b0 = b-s^2
bL = b[length(d)-1]+s^2
print(c(“SE shift”,round(b0, 2),round(bL, 2)))
points (c(b0,b,bL),c(0:length(d)+1),col=”red”,type=”b”)
#Estimate ability for something “close” to zero;
for (x in shift[-1]) {
b0 = Able(x,d)                         # if you try Able(0,d) you will get an inscrutable error.
bL = Able(length(d)-x,d)
print( c(“Shift”,x,round(b0, 2),round(bL, 2)))
points (c(b0,b,bL),c(0:length(d)+1),type=”b”)
}    }

Test.0100 (c(1.2,.33,.25,.1))

# Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-bases are three or four times more frequent than three-bases. This begs an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.

Read on . . .Polytomous Rasch Models

# Vii: Significant Relationships in the Life of a Psychometrician

Rules of Thumb, Shortcuts, Loose Ends, and Other Off-Topic Topics:

Unless you can prove your approximation is as good as my exact solution, I am not interested in your approximation. R. Daryl Bock

Unless you can show me your exact solution is better than my approximation, I am not interested in your exact solution. Benjamin D. Wright

Rule of Thumb Estimates for Rasch Standard Errors

The asymptotic standard error for Marginal Maximum Likelihood estimates of the Rasch difficulty d or ability b parameters is:

 I first applied to the University of Chicago because Prof. Bock was there.

 There was a reason I ended up working with Prof. Wright.