Chapter 18 Joining
Joining is about adding new data to your datset. It can be as simple as adding more data that’s just like your existing data. For example, you might have data about shoe sizes for one classroom and you want to add the data from another classroom. Or you’ve been keeping weather data and you want to add today’s weather to the dataset. That’s straightforward and important, but there is a more complicated type of joining that requires more computational thinking to get right.
In this more complicated join, you want to combine data that are fundamentally different and live in more than one dataset. Here are a couple of scenarios:
- You have a dataset with the number of COVID cases for each State. You want to know how many cases there are per capita, or (as is traditional) per 100,000 population. You have a separate dataset that has States, capitals, and populations. Basically, you want a new calculated column in the first dataset that’s equal to the number of COVID cases (first dataset) divided by the population (second dataset)…and multiplied by 100000. What do you put in the formula to get the population of the correct State?
- You have data about passengers exiting the BART system,
including what station they entered and what station they exited.
The trouble is, these stations are coded as numbers.
A separate dataset simply holds the station numbers and their names,
for example, Downtown Berkeley is “14.”
How can you make columns for
exitthat use that second dataset to decode the numbers into names that you understand?
18.1 Just adding data
If you want to add a single case to an existing dataset, the easy solution is simply to type it into the empty row at the bottom of the table. If your datasets are medium-to-large, that’s not likely to happen very often—but it does happen, so it’s good to know.
If the table is already organized hierarchically, you will need to click on the number of a case that’s in the same category as your new data, and choose Insert Case from the menu that appears.
If you have two or more datasets that are structured identically and you want to combine them, CODAP has no easy command. One solution is to export them both as csv, combine them in a text editor, and then re-import them.
Of course, the two datsets need to have exactly the same attributes in the same order. You should also be sure to remove the attribute names that were ideally in the first line of the second file!
18.2 A true join and a simple solution: grouping and typing
This one came up in class. A student was using a Census data portal and was studying how income inequality had changed over time. They got income data from 1950, 1980, and 2010. The problem was, the incomes in 1950 were a lot smaller that in 2010, so in the graphs, it was hard to make the comparison. They wanted to know if they could somehow, you know, like, adjust it.
With a little prompting, they looked up the consumer price index in those three years. Then it was a matter of grouping by year (dragging left, which made three cases, one for each year), creating a new column for the CPI, and simply typing the CPI into those three empty cells. That is, they did not need to write a complicated formula, and they didn’t need to type thousands of entries.
Finally, they made a new column next to their income attribute and created a formula for the adjusted income. The very idea of adjusting the income like that is really powerful, and of course happens all the time in everyday statistics. See this chapter if you want to know more.
The astute reader will notice that the drag-left-to-make-a-new-column-and-just-type technique here is the same one we used for recoding some categorical variables, which we counted as a calculating data move (see this section). It’s interesting that we can use the same CODAP feature— dynamically reorganizing the hierarchical structure of a dataset— to do different data moves.
xxx do we need illustrations of this process? A live place to try out adjusting?
18.3 A true, hairy, joining example
This section’s example will show you how the sausage is made. If you just want to eat the sausage, skip to the next section!
What if it’s more than just three years, though? What if there are hundreds of things from one dataset to type into another? That’s impractical, you’re likely to make mistakes, and it would be really tedious.
For that situation, you need the computer to help. Sadly, to get it right can be hard. Computer commands to do this are really fussy, and the syntaxes are often Byzantine. Furthermore, there are different ways to join two datasets. You’ll see references to “inner joins” and “outer joins” and the like.
We will do only one kind here, using the BART scenario from the beginning of this chapter. Let’s describe the setup:
Our main dataset (
riders) contains records of individual people exiting the BART system.
It tells the time–to the second—that they tagged out of the exit gate,
plus the number of the station they entered and the
number of the station they exited.
Here is an example of what’s in it:
That’s all great, but if I’m doing the data analysis, I want the names of the stations, not their numbers.
Our second dataset (
stations) is a list of all the BART stations.
It includes the names of the stations and the numbers
BART uses to designate the stations in the person records.
Here is a snippet:
Reading the tables, we can see that all of our passengers got off at Downtown Berkeley— and all within a second of each other— but they got on at various stations: 12th Street, Fruitvale, and San Leandro.
So our plan is this: for each person record, remember the number of the station they exited from. Find that record in the station database, and remember the name of the station in that record. Finally, back in the person database, write that name in a new column.
Try this in the live illustration below ( or in this separate tab):
- Make a new column in
riders; call it
- Give it this weird formula; you can find the function
lookupByKeyin the Lookup Functions section of the formula editor:
lookupByKey("stations", "StationName", "StationCode", enter)
You should see the new column fill with station names.
Of course, you should see what this remarkable data looks like;
make a new graph and put
time on the horizontal axis.
If you like, color the points by station.
Now, about that crazy function. How in the world did we figure out what to put inside the parentheses? No normal mortal can remember, so if you want to construct such a formula yourself, do this:
- Open the formula editor for
enterNameand delete the formula.
- In the — Insert Function— button menu, choose Lookup Functions. You’ll see the left side of this next figure:
The info button—the little i in the circle—will give you detailed (albeit still confusing) help with the arguments of the function. The arguments are the four things that go inside the parentheses, separated by commas.
Then, when you actualy choose the lookupByKey item from that menu, it will insert the function into the function editor, with helpful text that you can edit… which is what you see on the right side in the figure.
Those four arguments are:
|argument||what it means|
||The name of the other dataset, in quotes because it’s a string:
||The name of the attribute in the other dataset that you want. That’s
||The name of the attibute in the other dataset you’re using as a “key”:
||The attribute in this dataset that matches that key. Note, not a string, so no quotes:
The concept of the key is, well, key. It’s what’s the same between the two datasets.
Think about it from the computer’s point of view:
for each case in the
riders table, you have to find the right case in the
How do you know which one?
You do exactly as we described above when we were thinking about it by hand:
you find the code for the station (which is in
enter) and match it with the code in the other dataset
(which is in
You also need to know what value to get from that
stations record; for us, that’s the name,
The other attribute is the name of the dataset you’re looking into, of course.
This whole dance is tortuous, I know. But once you have it all set up correctly, it can bring in thousands of data values quickly and reliably.
Here are two applications you might not immediately think of, to whet your appetite:
- Some datasets have geographic information, such as the latitude and longitude of BART stations.
If you have attributes named
longin your dataset, points will plot on a map!
- You can do a lookup into your own dataset. Suppose you have a dataset of people, including people from several generations of the same families, and you give each one an ID. Now suppose you have information about eye color, or the ability to roll your tongue. You can make attributes for mother and father, and use lookup to study the genetics of these traits.
18.4 An easier way to do a true join
Although the procedures in the previous section will help you understand the details of how to do a join in CODAP, there is an easier way: you can simply drag an attribute from one table to another. If you do this the right way, CODAP writes all those messy formulas for you. We’ll learn this one with an example: xxx