A few days ago, a Twitter account representing tourism in Houston posted the following Tweet. It shows an analysis of BBQ restaurant reviews using a 5-star rating system. The results are… surprising… and some may even say laughable.

No Kansas City? No Memphis? No Austin? No Houston? The Boston Globe went so far as to characterize these results as “a joke.” So, what is going on?
The analysis seems simple enough. The website chefspencil.com took restaurant reviews on TripAdvisor and calculated averages within the largest cities in the US.
But this analysis is anything but simple. TripAdvisor scores are enumerated on a one through five star scale, where more stars equal better BBQ. However, the challenge is that the difference in BBQ quality between one star and two stars may be different than the difference between four stars and five stars.
This challenge manifests in two ways: (i) interpersonal incomparability and (ii) sensitivity to monotonic transformations.
First, let’s talk about interpersonal incomparability. Let’s say you go to New Orleans and (for some reason) you order BBQ. The food is good, and you think that any food that is good should receive five stars. So you go on TripAdvisor and give the restaurant a five star rating. Now, let’s say I am traveling with you and we eat at the same BBQ restaurant. The food is good, but not great, and I think that only great food should receive five stars. So I go on TripAdvisor and give the restaurant a four star rating.
The challenge is that ordinal scales, like the five star scale on TripAdvisor, are not comparable between people. This is because different people have different conceptualizations of what “five stars” (or “four stars,” or “three stars,” and so on) actually means.
So, if different types of people travel to and order BBQ in different US cities, then the TripAdvisor ratings may be biased due to the invalid interpersonal comparisons implicit in the analysis summarized above. This is one way that we can get results that suggest that New Orleans has the best BBQ in the US.
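To make this concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the 0 to 10 latent quality scale, the "lenient" and "strict" reviewer types, their cutoffs, and the two made-up cities. None of it comes from the TripAdvisor data. The point is only that two cities with identical BBQ can end up with different average ratings when the mix of reviewers differs.

```python
from bisect import bisect_right

# Hypothetical cutoffs on a 0-10 latent quality scale (an assumption for
# illustration). Each list holds the quality levels at which a reviewer
# moves from 1 to 2 stars, 2 to 3 stars, and so on.
CUTOFFS = {
    "lenient": [2, 4, 5, 6],  # any "good" meal (quality >= 6) earns five stars
    "strict": [3, 5, 7, 9],   # only "great" meals (quality >= 9) earn five stars
}


def stars(quality, reviewer_type):
    """Map a latent quality level to a 1-5 star rating for one reviewer type."""
    return 1 + bisect_right(CUTOFFS[reviewer_type], quality)


def mean_rating(quality, reviewers):
    """Average star rating when every reviewer eats food of the same quality."""
    ratings = [stars(quality, r) for r in reviewers]
    return sum(ratings) / len(ratings)


# Two made-up cities serving BBQ of identical quality (7 out of 10),
# visited by different mixes of reviewers.
SAME_QUALITY = 7
city_x_reviewers = ["lenient", "lenient", "lenient", "strict"]
city_y_reviewers = ["lenient", "strict", "strict", "strict"]

print("City X average:", mean_rating(SAME_QUALITY, city_x_reviewers))
print("City Y average:", mean_rating(SAME_QUALITY, city_y_reviewers))
```

In this sketch the same meal gets five stars from the lenient reviewer and four from the strict one, so the city visited mostly by lenient reviewers ranks higher even though the food is the same.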
Next, let’s talk about sensitivity to monotonic transformations. To do this, assume away the aforementioned challenges with interpersonal comparability. That is, a five star rating, a four star rating, a three star rating, and so on mean the same thing for everyone. Even if we can make this heroic assumption, the analysis is challenging because simple summary statistics, like group averages, are sensitive to monotonic transformations of the ordinal scale. In other words, relabeling the stars with any other set of increasing numbers preserves the ranking of the categories but can change which group has the higher average.
To see this, here is an example from a working paper of mine (with Andrew Oswald).
A survey asks respondents to answer the following question: “All in all how satisfied are you with your life right now?” using the following ordered response categories: “very satisfied,” “satisfied,” “dissatisfied,” and “very dissatisfied.” To simplify this example even further, suppose there are two groups of people, each with only two members. Group A includes one person who is “Very Dissatisfied” and another who is “Very Satisfied.” Group B includes one person who is “Dissatisfied” and another who is “Satisfied.” A seemingly simple question is: Which group is more satisfied? The answer to this question, however, is not simple. That is because the answer depends on the interval between the response categories. Perhaps a natural way to empirically answer this question is to assume a linear set of values for the response categories. In this case, “Very Dissatisfied” has a value of zero, “Dissatisfied” has a value of one, “Satisfied” has a value of two, and “Very Satisfied” has a value of three. The average satisfaction of the two groups is equal, with an average score of 1.5, and the groups are equally satisfied.
Similar to utility functions, however, ordinal scales only offer information about the relative rank of response categories and thus provide no information about the interval between categories. Therefore, a potentially valid alternative way to answer this question is to assume a concave set of values for the response categories. In this case, “Very Dissatisfied” again has a value of zero, “Dissatisfied” has a value of 1.75, “Satisfied” has a value of 2.5, and “Very Satisfied” again has a value of three. With this set of values, group B is more satisfied than group A. Finally, another valid alternative way to answer this question is to assume a convex set of values for the response categories. In this case, “Very Dissatisfied” again has a value of zero, “Dissatisfied” has a value of 0.5, “Satisfied” has a value of 1.25, and “Very Satisfied” again has a value of three. With this set of values, group A is more satisfied than group B. Table 1 illustrates this example and shows that the answer to the seemingly elementary question above depends on the assumed intervals between the response categories.
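The arithmetic in this example is easy to check by hand, but a short Python sketch makes the pattern easy to see. The category values and group memberships are taken directly from the example; the names `SCALES`, `GROUPS`, `group_mean`, and `more_satisfied_group` are my own and purely illustrative.

```python
# Values assigned to each response category under the three scales
# described in the example (linear, concave, convex).
SCALES = {
    "linear": {"Very Dissatisfied": 0.0, "Dissatisfied": 1.0,
               "Satisfied": 2.0, "Very Satisfied": 3.0},
    "concave": {"Very Dissatisfied": 0.0, "Dissatisfied": 1.75,
                "Satisfied": 2.5, "Very Satisfied": 3.0},
    "convex": {"Very Dissatisfied": 0.0, "Dissatisfied": 0.5,
               "Satisfied": 1.25, "Very Satisfied": 3.0},
}

# The two groups from the example, each with two members.
GROUPS = {
    "A": ["Very Dissatisfied", "Very Satisfied"],
    "B": ["Dissatisfied", "Satisfied"],
}


def group_mean(scale, responses):
    """Average of the numeric values a scale assigns to a group's responses."""
    return sum(scale[r] for r in responses) / len(responses)


def more_satisfied_group(scale):
    """Return "A", "B", or "tie" depending on which group mean is larger."""
    mean_a = group_mean(scale, GROUPS["A"])
    mean_b = group_mean(scale, GROUPS["B"])
    if mean_a > mean_b:
        return "A"
    if mean_b > mean_a:
        return "B"
    return "tie"


for name, scale in SCALES.items():
    mean_a = group_mean(scale, GROUPS["A"])
    mean_b = group_mean(scale, GROUPS["B"])
    print(f"{name}: A = {mean_a}, B = {mean_b}, "
          f"more satisfied: {more_satisfied_group(scale)}")
```

Group A's average is 1.5 under all three scales because its members sit at the two endpoints, which never move. Group B's average is 1.5 under the linear scale, 2.125 under the concave scale, and 0.875 under the convex scale, so the ranking flips even though every scale preserves the order of the categories.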

So, now that we understand the challenges with ordinal variables, what should we do? Although some may say (or suggest) that any analysis that uses ordinal variables is untrustworthy, the majority of variables that many of us care about in our world cannot be quantitatively measured on a cardinal scale.
We must learn how to analyze and interpret ordinal variables with care. There are numerous approaches available, but I will highlight two: a more computationally intensive partial identification approach (see here) and a simple robustness check (see here). With these, and hopefully future methods, we can begin to conduct robust analysis of ordinal variables and eat good BBQ too.