Chapter 16 Calculating and Recoding

Sometimes, the data you have is not in the form you want. This chapter addresses situations where you have the information in one or more attributes (columns), but it needs to be transformed somehow and placed in a new column that you can use to do what you want. This often (but not always) means that you will need a formula to compute the contents of that new column.

Sometimes you will want to compute something like the mean or median. That’s a similar but different situation which we will talk about in the next chapter, The difference is that a calculation like mean() takes input from an entire group of cases. It’s a summary, an aggregate measure, because the value, the mean, applies to the whole group rather than to each individual. It works a little differently.

Here, we will be computing values that apply to each case individually.

16.1 A simple example: unit conversion

You have a bunch of data about the heights of children and teens. The problem is that the Weight attribute is measured in kilograms. For some reason, your audience will not understand this new-fangled metric system, so you decide to make your graphs in pounds instead.

The basic plan is simple: create a new column, give it a name (e.g., WtLb), and then give the new column a formula (Weight * 2.2). Here’s the step-by-step; try it out in the live illustration below.

  1. Be sure the table is selected.
  2. Click the gray circle with the plus sign near its upper right. A new column appears for the new attribute. Its name is selected, ready for entering.
  3. Type the new name: WtLb. Then press enter to complete the edit.
  4. Click on the new name. A menu appears. Choose Edit Formula….
  5. In the formula box, enter Weight * 2.2 and press Apply. Notice that the formula you enter is only the part to the right of the equals sign.
BART data

Now you can change the graph to WtLb by dragging the new attribute name to replace the old one.

Small extra: Notice that WtLb has no units in parentheses. Let’s fix that.

  1. Click WtLb just like we did to get the formula editor.
  2. This time, choose Edit Attribute Properties… Notice what we have here. Cool stuff.
  3. Enter pounds in the unit box.
  4. Press Apply. Shazam! The units appear!

Fathom aficionados from the ancient years will justly lament the loss of units arithmetic. Good old Fathom actually knew the conversions.

16.2 Making Numeric Attributes Categorical

You can use formulas to convert numerical attributes to categorical. Suppose you’re investigating income, but graphs with income on them quickly become complicated: they stretch over a wide range and they’re badly skewed.

So for exploratory purposes, you’d like to divide the population into two groups:

  • Those making $40,000 or more, which you will call “rich,” and
  • everybody else, which you will—for the purposes of your data—call “poor.”

Here’s how to do it; use the example live document below.

  1. With the table selected, click the gray circle with the plus sign to make a new attribute.
  2. Give it a good name (e.g., richOrPoor) and press enter to complete the edit.
  3. Click the new name to get the menu, and choose Edit formula….
  4. In the formula editor, enter if(TotalIncome >= 40000, "rich", "poor"). Press Apply. The column should fill with values.

1000 Californians from 2013 Notice how that if() function works. It’s like the one in most spreadsheets, and takes three arguments. Consider this formula for an attribute called kindOfPet:

if(sound = "woof", "dog", "cat")

  • The first is a “Boolean” expression, either true or false.
  • The second is what you get if the expression is true (the pet is a dog).
  • The third is what you get if the expression is false (the pet is a cat).

Use richOrPoor as you would any attribute. You can use it to make graphs, and you can use it in other calculations.

Gotcha! (Spring 2020, this could change) However: If you drag it to the left to make groups—a perfectly reasonable thing to do— CODAP will not do what you want. Why not? CODAP tries to apply that formula to the entire group (as it would median(TotalIncome)) and it can’t because there are many results, many riches and poors.

Therefore, before you drag, click on the column header, and in the menu that appears, choose Delete Formula (Keeping Values). Now the column of rich and poor is just as if you had typed in every value separately.

Just be aware: now, if you change someone’s TotalIncome, their richOrPoor value will not change, because you deleted the formula. Similarly, if you import more data, there is no formula to fill in new values for richOrPoor.

16.3 Reducing the number of categories in a categorical attribute

Sometimes, a categorical attribute has many categories, and you want to group them together. With States, for example, you might want to group them by “region,” so you would put California, Oregon, and Washington into a region called “Pacific,” and so forth for the rest of the states.

Let’s practice this skill with something simpler, using the data we just used above. We are going to reduce the six categories in Education to two categories: collegiate, which means they went to college at all, and ’no college`, which means they never did.

We could make a formula, but it would be ugly and complicated. Instead, we will simply type the new values in.

Won’t that take a long time? We have 1000 cases! No, because we’ll type them only once. Do this using the live illustration below:

  1. Drag Education left to make (six) groups by education level. Select the table.
  2. Click the “left” gray circle (the “add attribute” circle), just above and to the right of Education. This makes a new column.
  3. Name it college.
  4. In that column, enter the appropriate values for each category. That is, double-click the cell and actually type collegiate or no college. You can use Copy and Paste if you like.
  5. You have now re-coded Education so that the “improved” values are in college. You no longer need Education to make your groups, so drag it back to the right. Now you have only two groups (the college groups) on the left.
1000 Californians from 2013

You can use this new attribute as you would any other:

  1. Make a graph with Gender on the horizontal axis.
  2. Drop college into the graph.
  3. In the “configuration” palette (below the ruler palette on the graph), choose Fuse Dots Into Bars.
  4. Hover over the bars; what do you learn from the percentages that appear?

Also,

  1. Make a graph of TotalIncome with the new attribute, college, on the other axis. A useful, telling graph! Stay in school, kids…

16.4 By the way: Always make a new column!

This is a practice you should ingrain into your data-analysis muscle memory: Avoid destroying data.

Seems obvious, but the techniques we mentioned here sometimes lead students into that trap.

So: When you’re recoding Education into anyCollege, you might be tempted to (do not do this!):

  1. Drag Education left to make six groups be education level.
  2. Edit graduate, bachelor's, and some college to be collegiate.
  3. Edit the rest to be no college.

Do you see the problem? You have overwritten the original data. Suppose, later, you decide you want your collegiate group to be only people who have finished a bachelor’s degree? You no longer have anyone with some college to move from one category to the other.

Therefore, leave the original data as it is, and put any new data in a new column. Then if you change your mind, you don’t have to start over.