Chapter 16 Calculating and Recoding
Sometimes, the data you have is not in the form you want. This chapter addresses situations where you have the information in one or more attributes (columns), but it needs to be transformed somehow and placed in a new column that you can use to do what you want. This often (but not always) means that you will need a formula to compute the contents of that new column.
Sometimes you will want to compute something like the mean or median. That’s a similar but different situation which we will talk about in the next chapter, The difference is that a calculation like
mean()takes input from an entire group of cases. It’s a summary, an aggregate measure, because the value, the mean, applies to the whole group rather than to each individual. It works a little differently.
Here, we will be computing values that apply to each case individually.
16.1 A simple example: unit conversion
You have a bunch of data about the heights of children and teens.
The problem is that the
Weight attribute is measured in kilograms.
For some reason, your audience will not understand this new-fangled metric system,
so you decide to make your graphs in pounds instead.
The basic plan is simple: create a new column, give it a name (e.g.,
and then give the new column a formula (
Weight * 2.2). Here’s the step-by-step;
try it out in the live illustration below.
- Be sure the table is selected.
- Click the gray circle with the plus sign near its upper right. A new column appears for the new attribute. Its name is selected, ready for entering.
- Type the new name:
WtLb. Then press enter to complete the edit.
- Click on the new name. A menu appears. Choose Edit Formula….
- In the formula box, enter
Weight * 2.2and press Apply. Notice that the formula you enter is only the part to the right of the equals sign.
Now you can change the graph to
WtLb by dragging the new attribute name
to replace the old one.
Small extra: Notice that
WtLb has no units in parentheses.
Let’s fix that.
WtLbjust like we did to get the formula editor.
- This time, choose Edit Attribute Properties… Notice what we have here. Cool stuff.
poundsin the unit box.
- Press Apply. Shazam! The units appear!
Fathom aficionados from the ancient years will justly lament the loss of units arithmetic. Good old Fathom actually knew the conversions.
16.2 Making Numeric Attributes Categorical
You can use formulas to convert numerical attributes to categorical. Suppose you’re investigating income, but graphs with income on them quickly become complicated: they stretch over a wide range and they’re badly skewed.
So for exploratory purposes, you’d like to divide the population into two groups:
- Those making $40,000 or more, which you will call “rich,” and
- everybody else, which you will—for the purposes of your data—call “poor.”
Here’s how to do it; use the example live document below.
- With the table selected, click the gray circle with the plus sign to make a new attribute.
- Give it a good name (e.g.,
richOrPoor) and press enter to complete the edit.
- Click the new name to get the menu, and choose Edit formula….
- In the formula editor, enter
if(TotalIncome >= 40000, "rich", "poor"). Press Apply. The column should fill with values.
if()function works. It’s like the one in most spreadsheets, and takes three arguments. Consider this formula for an attribute called
if(sound = "woof", "dog", "cat")
- The first is a “Boolean” expression, either true or false.
- The second is what you get if the expression is true (the pet is a dog).
- The third is what you get if the expression is false (the pet is a cat).
richOrPoor as you would any attribute.
You can use it to make graphs, and you can use it in other calculations.
Gotcha! (Spring 2020, this could change) However: If you drag it to the left to make groups—a perfectly reasonable thing to do— CODAP will not do what you want. Why not? CODAP tries to apply that formula to the entire group (as it would
median(TotalIncome)) and it can’t because there are many results, many
Therefore, before you drag, click on the column header, and in the menu that appears, choose Delete Formula (Keeping Values). Now the column of
pooris just as if you had typed in every value separately.
Just be aware: now, if you change someone’s
richOrPoorvalue will not change, because you deleted the formula. Similarly, if you import more data, there is no formula to fill in new values for
16.3 Reducing the number of categories in a categorical attribute
Sometimes, a categorical attribute has many categories, and you want to group them together. With States, for example, you might want to group them by “region,” so you would put California, Oregon, and Washington into a region called “Pacific,” and so forth for the rest of the states.
Let’s practice this skill with something simpler, using the data we just used above.
We are going to reduce the six categories in
to two categories:
collegiate, which means they went to college at all,
and ’no college`, which means they never did.
We could make a formula, but it would be ugly and complicated. Instead, we will simply type the new values in.
Won’t that take a long time? We have 1000 cases! No, because we’ll type them only once. Do this using the live illustration below:
Educationleft to make (six) groups by education level. Select the table.
- Click the “left” gray circle (the “add attribute” circle), just above and to the right of Education. This makes a new column.
- Name it
- In that column, enter the appropriate values for each category.
That is, double-click the cell and actually type
no college. You can use Copy and Paste if you like.
- You have now re-coded
Educationso that the “improved” values are in
college. You no longer need
Educationto make your groups, so drag it back to the right. Now you have only two groups (the
collegegroups) on the left.
You can use this new attribute as you would any other:
- Make a graph with
Genderon the horizontal axis.
collegeinto the graph.
- In the “configuration” palette (below the ruler palette on the graph), choose Fuse Dots Into Bars.
- Hover over the bars; what do you learn from the percentages that appear?
- Make a graph of
TotalIncomewith the new attribute,
college, on the other axis. A useful, telling graph! Stay in school, kids…
16.4 By the way: Always make a new column!
This is a practice you should ingrain into your data-analysis muscle memory: Avoid destroying data.
Seems obvious, but the techniques we mentioned here sometimes lead students into that trap.
So: When you’re recoding
anyCollege, you might be tempted to
(do not do this!):
Educationleft to make six groups be education level.
some collegeto be
- Edit the rest to be
Do you see the problem?
You have overwritten the original data.
Suppose, later, you decide you want your collegiate group to be only people who have finished
a bachelor’s degree?
You no longer have anyone with
some college to move from one category to the other.
Therefore, leave the original data as it is, and put any new data in a new column. Then if you change your mind, you don’t have to start over.