# 10 Calculating and Recoding

Sometimes, the data you have is not in the form you want. This chapter addresses situations where you have the information in one or more attributes (columns), but it needs to be transformed somehow and placed in a *new column* that you can use to do what you want. This often (but not always) means that you will need a *formula* to compute the contents of that new column.

Sometimes you will want to compute something like the mean or median. That’s a similar but different situation which we will talk about in the next chapter, The difference is that a calculation like

`mean()`

takes input from an entire group of cases. It’s asummary, anaggregate measure, because the value, the mean, applies to the whole group rather than to each individual. It works a little differently.

Here, we will be computing values that apply to each case individually.

## 10.1 A simple example: unit conversion

You have a bunch of data about the heights of children and teens. The problem is that the `Weight`

attribute is measured in kilograms. For some reason, your audience will not understand this new-fangled metric system, so you decide to make your graphs in pounds instead.

The basic plan is simple: create a new column, give it a name (e.g., `WtLb`

), and then give the new column a formula (`Weight * 2.2`

). Here’s the step-by-step; try it out in the live illustration below.

- Be sure the table is selected.
- Click the gray circle with the plus sign near its upper right. A new column appears for the new attribute. Its name is selected, ready for entering.
- Type the new name:
`WtLb`

. Then press**enter**to complete the edit. - Click on the new name. A menu appears. Choose
**Edit Formula…**. - In the formula box, enter
`Weight * 2.2`

and press**Apply**. Notice that the formula you enter is*only the part to the right of the equals sign*.

Now you can change the graph to `WtLb`

by dragging the new attribute name to replace the old one.

Small extra: Notice that `WtLb`

has no units in parentheses. Let’s fix that.

- Click
`WtLb`

just like we did to get the formula editor. - This time, choose
**Edit Attribute Properties…**Notice what we have here. Cool stuff. - Enter
`pounds`

in the**unit**box. - Press
**Apply**. Shazam! The units appear!

Fathom aficionados from the ancient years will justly lament the loss of units arithmetic. Good old Fathom actually knew the conversions.

## 10.2 Making Numeric Attributes Categorical

You can use formulas to convert numerical attributes to categorical. Suppose you’re investigating income, but graphs with income on them quickly become complicated: they stretch over a wide range and they’re badly skewed.

So for exploratory purposes, you’d like to divide the population into two groups:

- Those making $40,000 or more, which you will label
`"$$"`

, and - everybody else, which you will label
`"$"`

.

Here’s how to do it; use the example live document below.

- With the table selected, click the gray circle with the plus sign to make a new attribute.
- Give it a good name (e.g.,
`money`

) and press**enter**to complete the edit. - Click the new name to get the menu, and choose
**Edit formula…**. - In the formula editor, enter
`if(TotalIncome >= 40000, "$$", "$")`

. Press**Apply**. The column should fill with values. Don’t leave out the quote marks! - Make a graph that uses your attribute. For example, put
`EmplStatus`

on one axis, and then plop`money`

into the middle. Aha! Employed people are more likely to be in the`$$`

group!

Notice how that `if()`

function works. It’s like the one in most spreadsheets, and takes three arguments. Consider this formula for an attribute called `kindOfPet`

:

`if(sound = "woof", "dog", "cat")`

- The first is a “Boolean” expression, either true or false.
- The second is what you get if the expression is true (the pet is a dog).
- The third is what you get if the expression is false (the pet is a cat).

Use `money`

as you would any attribute. You can use it to make graphs, and you can use it in other calculations.

## 10.3 Reducing the number of categories in a categorical attribute

Sometimes, a categorical attribute has many categories, and you want to group them together. With States, for example, you might want to group them by “region,” so you would put California, Oregon, and Washington into a region called “Pacific,” and so forth for the rest of the states.

Let’s practice this skill with something simpler, using the data we just used above. We are going to reduce the six categories in `Education`

to two categories: `collegiate`

, which means they went to college at all, and `no college`

, which means they never did.

We *could* make a formula, but it would be ugly and complicated. Instead, we will simply type the new values in.

Won’t that take a long time? We have 1000 cases! No, because we’ll type them only once. Do this using the live illustration below:

- Drag
`Education`

left to make (six) groups by education level. Select the table. - Click the “left” gray circle (the “add attribute” circle), just above and to the right of
**Education**. This makes a new column. - Name it
`college`

. - In that column, enter the appropriate values for each category. That is, double-click the
*cell*and actually type`collegiate`

or`no college`

. You can use**Copy**and**Paste**if you like. - You have now
*re-coded*`Education`

so that the “improved” values are in`college`

. You no longer need`Education`

to make your groups, so drag it back to the right. Now you have only two groups (the`college`

groups) on the left.

You can use this new attribute as you would any other:

- Make a graph with
`Gender`

on the horizontal axis. - Plop
`college`

into the middle of the graph. - In the “configuration” palette (below the ruler palette on the graph), choose
**Fuse Dots Into Bars**. - Hover over the bars; what do you learn from the percentages that appear?

Also,

- Make a graph of
`TotalIncome`

with the new attribute,`college`

, on the other axis. A useful, telling graph! Stay in school, kids…

## 10.4 By the way: Always make a new column!

This is a practice you should ingrain into your data-analysis muscle memory: **Avoid destroying data**.

Seems obvious, but the techniques we mentioned here sometimes lead students into that trap.

So: When you’re recoding `Education`

into `anyCollege`

, you might be tempted to (**do not do this!**):

- Drag
`Education`

left to make six groups be education level. - Edit
`graduate`

,`bachelor's`

, and`some college`

to be`collegiate`

. - Edit the rest to be
`no college`

.

Do you see the problem? You have overwritten the original data. Suppose, later, you decide you want your collegiate group to be only people who have finished a bachelor’s degree? You no longer have anyone with `some college`

to move from one category to the other.

Therefore, *leave the original data as it is, and put any new data in a new column*. Then if you change your mind, you don’t have to start over.