This CODAP plug-in lets you make decision or prediction trees based on the data in the document. Here are the basics:
As an example, suppose we want to determine whether you have a disease called Fbola. We have a number of tests and indications to help us.
Having Fbola is the dependent variable. We have test and disease data for a number of people who came to the clinic. Our tree represents a series of questions: Are you coughing? Is your fever over 39 degrees Celsius? Have you ever had the measles? Your design of the tree represents a sequence of questions, each depending on the answers to the previous ones.
After any question, we can decide to stop and choose a prediction (or diagnosis). We will say that our diagnosis is positive (that we think you have Fbola) or negative. The positive and negative diagnoses have black and green blocks.
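Of course, we will not be right every time. There are four possibilities: a true positive (we diagnose Fbola and you have it), a false positive (we diagnose Fbola but you don't have it), a true negative (we say you don't have Fbola and you don't), and a false negative (we say you don't have Fbola but you do).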
We want to arrange the tree to get the best outcome we can.
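The four possibilities are the combinations of what we diagnose and what is actually true. As a minimal sketch, independent of the plug-in itself, they can be tallied like this. The function name `tally_outcomes` and the sample records are hypothetical, made up for illustration.

```python
from collections import Counter

# Map (diagnosis, truth) pairs to the name of the outcome.
OUTCOME_NAMES = {
    (True, True): "true positive",
    (True, False): "false positive",
    (False, False): "true negative",
    (False, True): "false negative",
}

def tally_outcomes(records):
    """Count outcomes from (diagnosed_positive, really_has_fbola) pairs."""
    return Counter(OUTCOME_NAMES[(diagnosed, truth)] for diagnosed, truth in records)

# Hypothetical clinic data: (our diagnosis, whether the person really has Fbola)
records = [(True, True), (True, False), (False, False), (False, False), (False, True)]
counts = tally_outcomes(records)
for name in OUTCOME_NAMES.values():
    print(name, counts[name])
```

A "better" tree is one that shifts more cases into the two "true" categories, although how much each kind of error matters depends on the situation.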
The names of all the variables --- the column headings in your data table --- appear in a row below the tree, towards the bottom of the window.
To make the tree branch according to a variable (which represents asking a question in our example), drag the variable into the tree and drop it on the node you want to branch. At the moment, you may not see the variable move while you drag it. Don't worry, it's dragging nevertheless. If the drop doesn't seem to "take," try again; dropping it on a node's text may not work.
As you will see, every node is either a "branching" node (where you ask a question) or a "terminal" node (where you give a diagnosis or prediction). You can think of the terminal nodes as leaves on the tree. They are labeled + or –. (They start out labeled (?).)
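One way to picture the two kinds of node is as a small data structure. The sketch below is an illustration only, not the plug-in's internal representation. The class names `Branch` and `Terminal`, the function `diagnose`, and the field names in the sample case are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Terminal:
    label: str  # "+" or "-" (or "?" before it has been set)

@dataclass
class Branch:
    question: str
    test: Callable[[dict], bool]  # True sends the case down the left branch
    left: "Node"
    right: "Node"

Node = Union[Terminal, Branch]

def diagnose(node, case):
    """Follow one case down the tree; return the leaf label and the path taken."""
    path = []
    while isinstance(node, Branch):
        went_left = node.test(case)
        path.append((node.question, went_left))
        node = node.left if went_left else node.right
    return node.label, path

# A hypothetical two-question tree for the Fbola example.
tree = Branch(
    "coughing?", lambda c: c["cough"],
    left=Branch("fever over 39 C?", lambda c: c["temp_c"] > 39,
                left=Terminal("+"), right=Terminal("-")),
    right=Terminal("-"),
)

label, path = diagnose(tree, {"cough": True, "temp_c": 39.5})
print(label, path)
```

Each case ends at exactly one terminal node, and the list of answers along the way is its "path" through the tree.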
Reverse the sense of a terminal node by clicking on it.
Your tree will always have one dependent (outcome) variable. It's the thing you're trying to predict using the other variables. For example, if you have data on people with their sex, income, race, and education, and you want to study what affects income, make income the dependent variable.
To make a variable dependent, double-click it in the list below the tree.
The dependent variable also appears at the top of the tree, in the first ("root") node. That node also tells you what counts as "positive" in the tree.
Whenever you add an attribute (a variable) to the tree, you make two branches. One will have a "leaf" labeled "plus," the other will be "minus." The "+"s represent a "positive" choice, even if it's a negative occurrence. For example, a positive test for a disease may not be what you want, but it leads you to suspect that you actually have the disease.
All branchings in the tree are binary. You are always choosing between two options. Your variables may have more than two values, however. The plug-in makes default choices about how the values of your variable map onto the two choices. You will need to configure any variable you put in a node.
Click the button in a branching node to make its variable's configuration appear.
Each variable has a "left" and a "right" value. In general, try to put the more "positive" values (with respect to your dependent variable) on the left. You can decide what the two sides will be labeled using the text boxes. For example, you might decide that income > 50000 should be called rich, and everything else is poor.
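Configuring a variable amounts to deciding which values go left, which go right, and what each side is called. The sketch below illustrates the idea with the income example. The helper `make_split` and the sample data are hypothetical and are not part of the plug-in.

```python
def make_split(attribute, goes_left, left_label, right_label):
    """Return a function that tells you which side of the split a case falls on."""
    def side(case):
        return left_label if goes_left(case[attribute]) else right_label
    return side

# Numeric variable: income > 50000 is "rich", everything else is "poor".
income_side = make_split("income", lambda v: v > 50000, "rich", "poor")

# Categorical variable with more than two values, mapped onto two sides.
education_side = make_split(
    "education", lambda v: v in {"college", "graduate"}, "degree", "no degree"
)

people = [
    {"income": 72000, "education": "graduate"},
    {"income": 50000, "education": "high school"},
    {"income": 18000, "education": "college"},
]
income_labels = [income_side(p) for p in people]
education_labels = [education_side(p) for p in people]
print(income_labels)
print(education_labels)
```

Note that a value exactly at the cutoff (50000 here) lands on the "everything else" side, because the left condition is a strict inequality.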
In the settings tab, you will find a menu that lets you choose whether this is a Classification tree or a Regression tree.
A Classification tree is the traditional decision tree, where the result is a positive or negative prediction or diagnosis. Each node shows the proportion of its cases for which the outcome you're trying to predict is true.
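The proportion shown in a node is a count of positive cases divided by the number of cases in that node. Here is a minimal sketch of that arithmetic, using made-up data and a hypothetical helper named `proportion_positive`. The plug-in's own display may round or format the number differently.

```python
def proportion_positive(cases, dependent, positive_value):
    """Fraction of cases whose dependent variable equals the 'positive' value."""
    if not cases:
        return None  # an empty node has no meaningful proportion
    hits = sum(1 for c in cases if c[dependent] == positive_value)
    return hits / len(cases)

# Hypothetical data: eight clinic visitors.
cases = (
    [{"cough": True, "fbola": "yes"}] * 3
    + [{"cough": True, "fbola": "no"}] * 1
    + [{"cough": False, "fbola": "yes"}] * 1
    + [{"cough": False, "fbola": "no"}] * 3
)

root = proportion_positive(cases, "fbola", "yes")
coughing = proportion_positive([c for c in cases if c["cough"]], "fbola", "yes")
not_coughing = proportion_positive([c for c in cases if not c["cough"]], "fbola", "yes")
print(root, coughing, not_coughing)
```

A useful split is one where the two child nodes have proportions that differ sharply from each other and from their parent.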
In a Regression tree, you are predicting the value of the dependent variable. So it makes sense that the dependent variable should be numeric for a regression tree. Nodes will show the mean of the variable for cases in that node, and the bottom "stat" shows the sum of the squares of the deviations from the mean, added up for all terminal nodes, and divided by the original sum of squares. That is, it shows the proportion of variance still unexplained by the tree.
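The "stat" described above can be worked through by hand. The sketch below follows the description in this document: sum the squared deviations within each terminal node, add them up, and divide by the sum of squares for all the data. The function names and the sample incomes are assumptions for illustration, and returning 0.0 when the data have no variation at all is a choice made here, not something the plug-in is known to do.

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def unexplained_fraction(all_values, terminal_groups):
    """Within-node sum of squares divided by the original sum of squares."""
    total = sum_sq_dev(all_values)
    if total == 0:
        return 0.0  # assumption: nothing to explain if every value is the same
    within = sum(sum_sq_dev(group) for group in terminal_groups if group)
    return within / total

# Hypothetical incomes (in thousands) and the two terminal nodes of a one-split tree.
incomes = [10, 20, 30, 40]
terminal_groups = [[10, 20], [30, 40]]

fraction = unexplained_fraction(incomes, terminal_groups)
print(fraction)
```

With no split at all, the only terminal node is the root, so the fraction is 1. Every useful branching pushes it toward 0.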
Clicking on a node selects all of the corresponding cases in CODAP.
Pressing emit data sends information on the current tree to a separate dataset, called either Classification Tree Records or Regression Tree Records, depending on the type of tree you're making. Find those tables in the Tables menu in the toolbar.
Clicking on one of the records in one of those tables restores the tree to what it was when you created that record.
Clicking on a single case in the CODAP data table highlights the "path" to its terminal node.