Chapter 17 Summarizing

We calculate summary values in order to help describe and compare groups. The summary might represent a “typical” value, a measure of center, such as a mean. It could be a measure of spread such as an interquartile range. It could be something complex like a Gini index or as pedestrian as the number of members of the group. It might even be a categorical label (e.g., rich or poor) that you calculated expressly to define a set of groups.

We usually characterize groups this way for a purpose, often to compare them to one another. For example, if we want to show that seventeen-year-olds are taller than eleven-year-olds, we could compute and compare mean heights.

It’s easier to compare two numbers than thousands. Those means stand in for all the zillions of heights that might have gone into the calculation. If we just look at the means, we are tacitly agreeing to disregard the zillions (minus two) of individual values, in the service of making the comparison easier. And although this bargain can sometimes blow up in our faces, summary values such as the mean are extremely useful and powerful.

The challenge for us data analysts, then, is three-fold:

  • What measure, what summary value, is the right one for what we’re trying to figure out?
  • What groups should we apply that measure to? and
  • How do we construct those groups?

In terms of data moves in this book, you deal with the second and third bullets by performing a grouping move, possibly filtering as well.

This chapter is about the first bullet. We will focus mostly on how to apply the measure we have chosen to the groups, and to a lesser extent on the extremely important issue of what measure to apply.

17.1 Using the “ruler” palette in a graph

The easiest way to calculate a summary is to click in the ruler palette attached to a graph. Depending on what kind of attribute (numerical, categorical) is on an axis, different measures are available.

If the attribute is numerical (like Height) you can display things like the mean or median, and additional goodies such as box plots, which implicitly show the median and the quartiles. To see the numerical value of one of these measures, hover over the line that appears in the graph. If the attribute is categorical (like Marital_status), you can display counts and percentages. And if there are two numerical axes you can display least-squares regression lines, which give you slope, intercept, and r-squared.

17.2 Writing formulas

In the chapter on grouping, you saw how to use dragging to the left to define groups, as in the following figure. We made the degree attribute (the one we used for grouping) using the technique described here. Thus, we have two groups: those with college degrees and those without.

Let’s investigate the income difference between these two groups. We need a number, a summary value, to characterize the income of each group. We’ll pick the median of the income.

Figure 17.1: Table with grouping by whether the person has a college degree.

We have done this already in the second demo lesson about height, age, and gender, Children and Teens, Part 2. There, we had made our groups (by dragging Age and Gender to the left) and made a new column, with a formula, mean(Height).

Let’s apply that same pattern to this situation.

In the chapter on calculating, you learned about writing formulas to re-express existing data in a new column. This chapter is about formulas for “stuff like averages.” These formulas will be different in some important ways. The key difference is that a function like mean(), which you use to compute the average value, applies to an entire group of cases, whereas a function like abs(), for absolute value, applies to one case at a time.
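To make that distinction concrete outside CODAP, here is a minimal Python sketch. The heights, the reference value of 160, and the column names diffFrom160 and meanHeight are all invented for illustration; this is not how CODAP works internally, only a picture of "one case at a time" versus "a whole group at once."

```python
from statistics import mean

# Made-up heights (cm) with a group label; illustration only.
cases = [
    {"Gender": "Female", "Height": 150.0},
    {"Gender": "Female", "Height": 160.0},
    {"Gender": "Male", "Height": 155.0},
    {"Gender": "Male", "Height": 175.0},
]

# Case-wise function: like abs(), it needs only the case it is working on.
for case in cases:
    case["diffFrom160"] = abs(case["Height"] - 160.0)

# Aggregate function: like mean(), it needs every case in the group.
heights_by_group = {}
for case in cases:
    heights_by_group.setdefault(case["Gender"], []).append(case["Height"])
group_mean = {g: mean(h) for g, h in heights_by_group.items()}

# In this sketch, every case in a group is given the same summary value.
for case in cases:
    case["meanHeight"] = group_mean[case["Gender"]]
```

Note that diffFrom160 differs from case to case, while meanHeight takes only one value per group.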

To calculate the median income of each group, start by making a new column on the left side of the table. Let’s call it medInc. Then give it a formula: median(Income).

If you want to try this, here is a link to the CODAP document.

Figure 17.2: Making the new attribute, medInc, and giving it a formula.

When you click Apply, CODAP fills in the medInc column with the values.

Figure 17.3: Medians in place.

That formula will keep computing the median income even if you change the grouping. The next illustration shows what you see if you drag Gender to the left as well, which makes four groups, one for each combination of males and females, with and without degrees.

You can, of course, make graphs using those new columns. They might have very few points, but they help tell a story.

Figure 17.4: Medians in place with Gender as well as degree—plus a graph.

By the way, this shows a common statistical blunder: the red lines are medians, but not the medians of the incomes for the ‘yes’-degree group and the ‘no’-degree group. They show the medians of the dots we see, that is, the points halfway between the males and females.
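A small Python sketch may help show why the median of the subgroup medians is not the median of the group. The incomes below are invented for illustration (they are not the book's data), and group_medians is a hypothetical helper, not a CODAP function.

```python
from statistics import median

# Invented rows, illustration only: (degree, Gender, Income).
people = [
    ("yes", "Female", 40000), ("yes", "Female", 50000), ("yes", "Female", 60000),
    ("yes", "Male", 90000),
    ("no", "Female", 20000),
    ("no", "Male", 30000), ("no", "Male", 35000), ("no", "Male", 40000),
]

def group_medians(rows, key):
    """Median income per group; `key` picks the grouping value(s) from a row."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[2])
    return {g: median(v) for g, v in groups.items()}

# The same summary, recomputed under two different groupings.
by_degree = group_medians(people, key=lambda r: r[0])
by_degree_gender = group_medians(people, key=lambda r: (r[0], r[1]))

# The blunder: the median of the two subgroup medians for the "yes" group.
median_of_medians_yes = median(
    [by_degree_gender[("yes", "Female")], by_degree_gender[("yes", "Male")]]
)
```

With these made-up numbers, the "yes" group has a median income of 55000, but the point halfway between the female median (50000) and the male median (90000) is 70000. The subgroups have different sizes, and the median of medians ignores that.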

Beyond mean() and median()

Okay, we’ve also talked about sum() and count(). What else have we got?

When you open up the formula editor, there is a button labeled –Insert Function–. Click it to produce a menu. It has seven categories of functions to choose from; each category has several options. The left illustration shows the statistical submenu. You can see a whole slew of functions.

Each function has an information button to the right of the name. The right-hand illustration shows what the info looks like for the (rather complicated) percentile() function.

Figure 17.5: Left: the statistical menu under –Insert Function– in the formula editor. Right: information on percentile().

17.3 Summarizing Categorical Attributes

We often think of summaries as means or medians—or more elaborate statistical quantities such as percentiles or standard deviations.

Alas, those functions don’t make sense for categorical attributes. What should we do? Let’s think about a situation where we want to compare groups; then we’ll think about a categorical attribute to compare them with.

In fact, let’s use the same dataset we used in the last section.

We divided up our sample of people into two groups: those with college degrees and those without. Back then, we compared their median incomes; and income is numeric. But what about employment? That’s categorical—let’s use it.

What does your stereotype say? Mine says that people with college degrees are more likely to have a job. That would mean that a greater proportion (or percentage if you multiply by 100) of the “degree” group would be listed as “civilian employed.”

We can do this just like before, but we’ll need a (slightly) more elaborate formula. The basic idea is:

  • Count how many people have a value of Empl equal to "Civ Employed".
  • Count how many people there are altogether.
  • Divide the first number by the second.

Here is what the formula looks like, and how it turned out.[9] Our college grads are more likely to have a job:

Figure 17.6: The formula for the proportion employed, propEmpl, and the result.
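The three counting steps can also be sketched in Python. The records below are invented for illustration; the labels "Unemployed" and "Not in Labor Force" are assumed category names, and prop_employed is a hypothetical helper. The skip_missing option shows the alternative denominator that counts only cases where Empl has a value.

```python
# Invented records, illustration only; None marks a missing Empl value.
people = [
    {"degree": "yes", "Empl": "Civ Employed"},
    {"degree": "yes", "Empl": "Civ Employed"},
    {"degree": "yes", "Empl": "Not in Labor Force"},
    {"degree": "yes", "Empl": None},
    {"degree": "no", "Empl": "Civ Employed"},
    {"degree": "no", "Empl": "Unemployed"},
    {"degree": "no", "Empl": "Not in Labor Force"},
    {"degree": "no", "Empl": "Not in Labor Force"},
]

def prop_employed(rows, skip_missing=False):
    """Share of rows whose Empl is exactly "Civ Employed"."""
    employed = sum(1 for r in rows if r["Empl"] == "Civ Employed")
    if skip_missing:
        # Denominator: only the cases where Empl has a value.
        total = sum(1 for r in rows if r["Empl"] is not None)
    else:
        # Denominator: every case in the group.
        total = len(rows)
    return employed / total

# Split the cases into groups by degree.
rows_by_degree = {}
for r in people:
    rows_by_degree.setdefault(r["degree"], []).append(r)

prop_empl = {d: prop_employed(rows) for d, rows in rows_by_degree.items()}
prop_empl_nonmissing = {
    d: prop_employed(rows, skip_missing=True) for d, rows in rows_by_degree.items()
}
```

In this toy data the two denominators give different answers for the "yes" group (2 of 4 versus 2 of 3), which is why the choice of denominator is worth a moment's thought when values are missing.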

17.4 More secrets of CODAP functions

One take-away you should have is that count() is a very powerful function.

Another is more subtle. Let’s look at the expression count(Empl = "Civ Employed").

First, notice the quotes around the string, "Civ Employed". When you want CODAP to recognize or use a specific string (of characters), you have to enclose them in double quotes.

Second, when you’re looking at a specific string, it has to be exactly correct. It’s case-sensitive; count(Empl = "civ employed") will not work.[10]

But third, and the real point here, is that the expression Empl = "Civ Employed" is not an attribute—it’s a Boolean expression that’s either true or false. Notice how different that is from a formula like mean(Height). The thing in parentheses, Height, is just an attribute.
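As a rough analogy in Python (with invented values, and not a description of CODAP's internals), a comparison produces one true-or-false value per case, and counting amounts to tallying the true ones:

```python
# Invented values, illustration only.
empl = ["Civ Employed", "Civ Employed", "Unemployed", "Civ Employed"]

# The comparison yields one True/False value per case...
is_employed = [e == "Civ Employed" for e in empl]

# ...and counting tallies the True ones (True counts as 1 in Python).
n_employed = sum(is_employed)

# The comparison is exact, so a lowercase spelling matches no case at all.
n_lowercase = sum(e == "civ employed" for e in empl)
```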

But CODAP functions are more flexible than that. You could write, for example,

mean(Weight / Height^2, Empl = "Civ Employed")

and you will get the average BMI for all of the people with jobs.[11]

That is, you can take the mean of an expression, not just a plain attribute, and CODAP will calculate that value for every case before taking the mean.

The second argument, after the comma, is a filter. Only those cases for which the expression is true will be in the calculation.
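The "expression plus filter" idea can be sketched in Python as follows. The cases are invented, and filtered_mean is a hypothetical helper written for this illustration; it only mimics the two-argument pattern described above.

```python
from statistics import mean

# Invented cases, illustration only; Weight in kilograms, Height in meters.
people = [
    {"Weight": 70.0, "Height": 1.75, "Empl": "Civ Employed"},
    {"Weight": 64.0, "Height": 1.60, "Empl": "Civ Employed"},
    {"Weight": 90.0, "Height": 1.80, "Empl": "Unemployed"},
]

def filtered_mean(rows, expression, condition=lambda row: True):
    """Evaluate `expression` for each row that passes `condition`, then average."""
    return mean(expression(row) for row in rows if condition(row))

# Roughly analogous to mean(Weight / Height^2, Empl = "Civ Employed").
mean_bmi_employed = filtered_mean(
    people,
    expression=lambda r: r["Weight"] / r["Height"] ** 2,
    condition=lambda r: r["Empl"] == "Civ Employed",
)

# Without a condition, every case is included.
mean_bmi_all = filtered_mean(people, expression=lambda r: r["Weight"] / r["Height"] ** 2)
```

The expression is evaluated case by case, the condition decides which cases take part, and only then is the average taken.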

That is, you can perform many calculations and apply filters entirely in formulas, without ever hiding or setting aside or making new columns.

This is a perilous idea. Sure, you could study income inequality and use a formula like,

median(Income/FamilySize, Gender = "Female" AND Age > 24 AND Empl = "Civ Employed")

The problem is that the formula is invisible. If you accidentally use 34 instead of 24 for the males’ formula, you might never notice. And you can’t see whether your idea of “income per person” makes sense. It’s far better to make a column for incomePerPerson so you can graph it, and play with it, and see whether it really expresses what you want.
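The same advice carries over to code. The sketch below, with invented cases and a hypothetical inSample flag, builds the intermediate values as visible columns and names the age cutoff once, so that the same cutoff applies to both groups.

```python
from statistics import median

# Invented cases, illustration only.
people = [
    {"Income": 60000, "FamilySize": 3, "Gender": "Female", "Age": 30, "Empl": "Civ Employed"},
    {"Income": 45000, "FamilySize": 1, "Gender": "Female", "Age": 41, "Empl": "Civ Employed"},
    {"Income": 80000, "FamilySize": 4, "Gender": "Female", "Age": 22, "Empl": "Civ Employed"},
    {"Income": 50000, "FamilySize": 2, "Gender": "Male", "Age": 35, "Empl": "Civ Employed"},
]

MIN_AGE = 24  # stated once, so the two groups cannot drift apart

# Step 1: a visible column that can be inspected or graphed on its own.
for p in people:
    p["incomePerPerson"] = p["Income"] / p["FamilySize"]

# Step 2: a visible flag for the filter, rather than a condition buried in a formula.
for p in people:
    p["inSample"] = p["Age"] > MIN_AGE and p["Empl"] == "Civ Employed"

# Step 3: summarize each group from the columns built above.
def median_income_per_person(gender):
    return median(
        p["incomePerPerson"]
        for p in people
        if p["inSample"] and p["Gender"] == gender
    )

female_median = median_income_per_person("Female")
male_median = median_income_per_person("Male")
```

Because incomePerPerson and inSample exist as columns, you can look at them directly and check whether they mean what you intended before trusting the medians.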

That said, those invisible filters are really useful sometimes.

17.5 Commentary: CODAP’s “atomic” case orientation

This is a good place to say that CODAP is designed to work well when the cases are “atomic” bits of data, and that the user builds summaries such as means from those atoms.

You will often use datasets that are pre-summarized, for example, COVID data where each row, each case, is a State or a country, and the attributes are date, number-of-cases, cases-per-capita, and so forth.

After date, those are all summaries—calculations that somebody else made for us using, presumably, a database where the rows were individual COVID cases.

This is fine, but you will often find that the data-analysis situation is more complicated; that you are more “awash.” You have to be more careful that your calculations make sense. For example, if you take the median of cases-per-capita, what does that really mean?
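Here is a minimal sketch of that last question, using three invented regions (the numbers are made up purely for illustration). It compares the median of the pre-summarized rates with the rate you would get from the underlying "atoms," that is, all cases divided by all people.

```python
from statistics import median

# Invented regions, illustration only: (name, cases, population).
regions = [
    ("A", 100, 1_000),
    ("B", 300, 10_000),
    ("C", 5_000, 1_000_000),
]

# Each region's pre-summarized rate.
per_capita = [cases / pop for _, cases, pop in regions]

# Median of the rates: the middle region's rate, regardless of region size.
median_rate = median(per_capita)

# Pooled rate: all cases divided by all people.
total_cases = sum(cases for _, cases, _ in regions)
total_pop = sum(pop for _, _, pop in regions)
pooled_rate = total_cases / total_pop
```

In this toy example the median rate is 0.03, while the pooled rate is roughly 0.005, because the largest region has the lowest rate. The median of cases-per-capita describes a "typical region," not a "typical person."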

Graphs can be troubling as well. A dot seems perfect for an atomic case (a person in the Census, say), but for groups we often want a bar or something else. The Map tool is an example of CODAP’s branching out: when the case is a State or a country, you can use a map to show values as colors on the map.

  9. Technically, the denominator should have been count(Empl), which is the number of cases for which Empl exists.

  10. OK, it will work, but it will always give you zero.

  11. Assuming that we have weight in kilograms and height in meters.