Viiif: Apple Pie and Disordered Thresholds Redux

A second try at disordered thresholds

It has been suggested, with some justification, that I may be a little chauvinistic depending so heavily on a baseball analogy when pondering disordered thresholds. So for my friends in Australia, Cyprus, and the Czech Republic, I’ll try one based on apple pie.

Certified pie judges for the Minnesota State Fair are trained to evaluate each entry on the criteria in Table 1 and the results for pies, at least the ones entered into competitions, are unimodal, somewhat skewed to the left.

Table 1: Minnesota State Fair Pie Judging Rubric

Aspect

Points

Appearance

20

Color

10

Texture

20

Internal appearance

15

Aroma

10

Flavor

25

Total

100

We might suggest some tweaks to this process, but right now our assignment is to determine preferences of potential customers for our pie shop. All our pies would be 100s on the State Fair rubric so it won’t help. We could collect preference data from potential customers by giving away small taste samples at the fair and asking each taster to respond to a short five-category rating scale with categories suggested by our psychometric consultant.

My feeling about this pie is:

0

1 2 3 4
I’d rather have boiled liver Can I have cake instead? Almost as good as my mother’s Among the best I’ve ever eaten

I could eat this right after a major feast!

The situation is hypothetical; the data are simulated from unimodal distributions with roughly equal means. On day one, thresholds 3 and 4 were reversed; on day two, thresholds 2 and 3 for some tasters were also reversed. None of that will stop me from interpreting the results. It is not shown in this summary of the data shown below, but the answer to our marketing question is pies made with apples were the clear winners. (To appropriate a comment that Rasch made about correlation coefficients, this result is population-dependent and therefore scientifically rather uninteresting.) Any problems that the data might have with the thresholds did not prevent us from reaching this conclusion rather comfortably. The most preferred pies received the highest scores in spite of our problematic category labels. Or at least that’s the story I will include with my invoice.

The numbers we observed for the categories are shown in Table 2. Right now we are only concerned with the categories, so this table is summed over the pies and the tasters.

Table 2: Results of Pie Preference Survey for Categories

Day

I’d rather have boiled liver Can I have cake instead? Almost as good as my mother’s Among the best I’ve ever eaten I could eat this right after a major feast!

One

10 250 785 83

321

Two 120 751 95 22

482

In this scenario, we have created at least two problems; first, the wording of the category descriptions may be causing some confusion. I hope those distinctions survive the cultural and language differences between the US and the UK. Second, the day two group is making an even cruder distinction among the pies; almost I like it or I don’t like it.

The category 4 was intended to capture the idea that this pie is so good that I will eat it even if I have already eaten myself to the point of pain. For some people that may not be different than this pie is among the best I’ve ever eaten, which is why relatively few chose category 3. Anything involving mothers is always problematic on a rating scale. Depending on your mother, “Almost as good as my mother’s” may be the highest possible rating; for others, it may be slightly above boiled liver. That suggests there may be a problem with the category descriptors that our psychometrician gave us, but the fit statistics would not object. And it doesn’t explain the difference between days one and two.

Day Two happened to be the day that apples were being judged in a separate arena, completely independently of the pie judging. Consequently every serious apple grower in Minnesota was at the fair. Rather than spreading across the five categories, more or less, this group tended to see pies as a dichotomy: those that were made with apples and those that weren’t. While the general population spread out reasonably well across the continuum, the apple growers were definitely bimodal in their preferences.

The day two anomaly is in the data, not the model or thresholds. The disordered thresholds that exposed the anomaly by imposing a strong model, but not reflected in the standard fit statistics, are an indication that we should think a little more about what we are doing. Almost certainly, we could improve on the wording of the category descriptions. But we might also want to separate apple orchard owners from other respondents to our survey. The same might also be true for banana growers but they don’t figure heavily in Minnesota horticulture. Once again, Rasch has shown us what is population-independent, i.e., the thresholds (and therefore scientifically interesting) and what is population-dependent, i.e., frequencies and preferences, (and therefore only interesting to marketers.)

These insights don’t tell us much about marketing pies better but I wouldn’t try to sell banana cream to apple growers and I would want to know how much of my potential market are apple growers. I am still at a loss to explain why anyone, even beef growers, would pick liver over anything involving sugar and butter.

Advertisements

Viiie: Ordered Categories, Disordered Thresholds

When the experts all agree, it doesn’t necessarily follow that the converse is true. When the experts don’t agree, the average person has no business thinking about it. B. Russell

The experts don’t agree on the topic of reversed thresholds and I’ve been thinking about it anyway. But I may be even less lucid than usual.

The categories, whether rating scale or partial credit, are always ordered: 0 always implies less than 1; 1 always implies less than 2; 2 always implies less than 3 . . . The concentric circle for k on the archery target is always inside (smaller thus harder to hit) than the circle for k-1. In baseball, you can’t get to second without touching first first. The transition points, or thresholds, might or might not be ordered in the data. Perhaps the circle for k-1 is so close in diameter to k that it is almost impossible to be inside k-1 without being inside k. Category k-1 might be very rarely observed, unless you have very sharp arrows and very consistent archers. Perhaps four-base hits actually require less of the aspect than three-base.

Continue . . . Ordered categories, disordered thresholds

Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-bases are three or four times more frequent than three-bases. This begs an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.

Read on . . .Polytomous Rasch Models