Testimate Guide
Logistic regression
Logistic regression is not part of the AP Stats syllabus, and is often missing from the introductory course. We include it in testimate because it’s pretty cool, and it’s one way to explore how a numeric predictor can predict a categorical outcome.
We’ll use our “pulse” dataset again, this time trying to predict gender from height.
When you set that up in testimate, and configure it to predict Male (because the young men are taller than the young women in the sample), you see this:
What the heck does this all mean?
Logistic regression (like linear regression) fits a function to the data. But this will be a logistic function, a kind of S-shaped thing. In this case, the function (the Model in the display) is
\[P(\mathrm{gender = Male}) = \frac{1}{1 + e^{-4m(\mathrm{height}-b)}}\]
Still not clear? I don’t blame you. Let’s look at the results of this regression. First, we’ll explain the graph. Then we’ll explain how to make it:
The points are data: the heights of the people. You can see logisticGroup on the vertical axis. It’s a translation of gender, where Male appears as 1.0 and Female is 0.0. As you can see, males are taller than females on average; in logistic regression terms, you might say that the taller someone is, the more likely they are to be male.
The red curve is the logistic model. You can see, for example, that the value of the function at height = 160 is about 0.1. That means that, if a person is 160 cm tall, the probability that they are male is 0.1. Wild, huh?
Another way to say this is, “if you just look near 160 cm, only about 10% of the people are male.”
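If you’d like to check numbers like that yourself, here is a minimal Python sketch of the model function. It assumes the rounded parameter values this guide quotes for this model (m = 0.06 and b = 168); the name p_male is ours, not something testimate provides. Because the parameters are rounded, the results will differ a little from what testimate displays.

```python
import math

# Rounded parameters quoted in this guide (assumption: the real fit
# has more decimal places, so testimate's numbers differ slightly).
M = 0.06    # slope of the curve at its midpoint, per cm
B = 168.0   # height (cm) at which the probability is 0.5


def p_male(height_cm, m=M, b=B):
    """Logistic model: P(gender = Male) = 1 / (1 + e^(-4m(height - b)))."""
    return 1.0 / (1.0 + math.exp(-4.0 * m * (height_cm - b)))


for h in (160, 168, 175):
    print(h, "cm:", round(p_male(h), 2))
```

At 168 cm the function returns exactly 0.5, and at 160 cm it lands a little above 0.1, in line with what you read off the red curve.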
Making that graph
Testimate cannot ask CODAP to make the entire graph that we want, so you have to do a little of the work yourself:
- Press the show graph button. This will make a graph where the outcome attribute values are represented by 0 and 1. (Perfect for probability…)
- Press copy formula. This puts a copy of the formula for the function (the Model probability function shown in the output, but with lots more decimal places) on the clipboard.
- Click on the graph. In the graph’s “ruler” menu, check Plotted Function.
- At the top of the graph, click in the white area to the right of “f()=”. The formula editor opens.
- Paste the formula into the formula editor. Then click Apply to see your function!
Note: if the model changes, the plotted function on the graph will not update automatically. You need to re-copy and re-paste the function definition.
Try making the graph in the live demo below.
The “find” box
Just above the blue “configuration stripe” is a tool to compute probabilities at a specific value for the predictor. Just enter the value you want and press enter or tab, and testimate will compute the probability.
This one shows that, if you’re 175 cm tall, there’s an 86% chance that you’re male. You can try it out in the demo above. At what height is your chance of being male about a quarter?
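The “find” box goes from a height to a probability. Going the other way (from a probability to a height) means inverting the logistic function. Here is a hedged Python sketch of that inversion, again assuming the rounded parameters m = 0.06 and b = 168 from this guide; the function name is hypothetical and this is not a testimate feature.

```python
import math


def height_for_probability(p, m=0.06, b=168.0):
    """Solve p = 1 / (1 + e^(-4m(x - b))) for x.

    Rearranging gives x = b + ln(p / (1 - p)) / (4m).
    Only valid for 0 < p < 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return b + math.log(p / (1.0 - p)) / (4.0 * m)


# A probability of one half should give back b itself.
print(height_for_probability(0.5))
```

Calling height_for_probability(0.25) gives an answer to the question above for the rounded model; compare it with what you find by trial and error in the demo.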
What do the function parameters mean?
In the logistic function
\[y = \frac{1}{1 + e^{-4m(x-b)}}\]
there are two parameters, \(m\) and \(b\). In this way of writing the function, \(b\) is the value of the predictor where the outcome (the probability \(y\)) is 0.5. And \(m\) is the slope of the curve at that point. So when we look at a specific function, such as the one in the graph we showed earlier (now in the margin),
\[P(\mathrm{gender = Male}) = \frac{1}{1 + e^{-4 \times 0.06 (\mathrm{height}-168)}}\]
that means that at 168 cm, the chance is 50% that you’re male, and the slope at that point is 0.06; that is, near 168 cm, the probability of being male rises by about 6 percentage points for every additional centimeter in height.
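If you’re wondering why \(m\) turns out to be the slope at \(x = b\), here is a short derivation. It is ordinary calculus applied to the function above, not something taken from testimate’s output. Write \(u = 4m(x-b)\), so that \(y = 1/(1+e^{-u})\). Then
\[\frac{dy}{dx} = \frac{4m\,e^{-u}}{(1+e^{-u})^{2}} = 4m\,y\,(1-y)\]
At \(x = b\) we have \(u = 0\) and \(y = \tfrac{1}{2}\), so
\[\left.\frac{dy}{dx}\right|_{x=b} = 4m \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = m\]
That is presumably why the exponent carries a factor of 4: it makes \(m\) come out as the slope itself. Since \(y(1-y)\) is largest when \(y = \tfrac{1}{2}\), this is also the steepest point on the curve, so the per-centimeter reading only holds near \(b\); farther out, the curve flattens.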
Under the hood
Do you want to know more about how all this works? No? That’s fine. You’re done. But in case you’re curious:
That 1-0 attribute, logisticGroup. We make a new attribute based entirely on the outcome attribute (gender). In this new attribute, we put 1 if the value is the one in the button in the configuration stripe, and 0 otherwise. Why isn’t this attribute in the table? It is! It’s just hidden. Use choosy to reveal it.
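The recoding is simple enough to sketch in a few lines of Python. This is only an illustration of the idea; the function name logistic_group is ours, and testimate’s own code may look quite different.

```python
def logistic_group(values, target):
    """Recode a categorical attribute as 1 (matches target) or 0 (anything else)."""
    return [1 if v == target else 0 for v in values]


# Made-up values, not the pulse dataset.
genders = ["Male", "Female", "Female", "Male"]
print(logistic_group(genders, "Male"))    # [1, 0, 0, 1]
print(logistic_group(genders, "Female"))  # [0, 1, 1, 0]
```

Choosing the other value in the configuration stripe simply flips the 1s and 0s, which flips the S-curve as well.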
What’s this about iteration? To do linear regression, you do a bunch of calculations using data values, and calculate the slope and intercept of the least-squares line. That doesn’t work with logistic regression. Instead, we make an initial guess at the function and then iterate, nudging its parameters in the direction that reduces “cost.” This “cost” is a value that (like the sum of squares of residuals in an LSRL) is larger if the curve is farther from the points. The current value of cost actually appears in the display.
In general, when we tested our algorithm, 100 iterations resulted in a stable answer. But you can do 10 more iterations by clicking the 10 more button. If you want to start earlier in the process, change the “100” in the configuration stripe to a smaller number (such as 1 or 10) and see what function you get. Then you can add iterations and see how the function improves. Sadly, we cannot make the function update automatically; you have to make a new function yourself.
What’s rate? When we make an iteration step, we have to choose how big that step is. The rate parameter in the configuration stripe (it defaults to 0.1) is proportional to the step size. A bigger value will reach the answer faster, but too big and it can overshoot or otherwise mess up.
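To make iteration and rate concrete, here is a minimal Python sketch of this kind of fitting loop. It is not testimate’s actual code. It assumes a cross-entropy cost, plain gradient descent, a predictor that is standardized before fitting, a starting guess of our own choosing, and made-up heights (not the pulse dataset). All the function names are hypothetical.

```python
import math

# Made-up toy data (NOT the pulse dataset): heights in cm, 1 = Male, 0 = Female.
HEIGHTS = [155, 158, 160, 162, 165, 167, 168, 170, 172, 175, 178, 180]
IS_MALE = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]


def logistic(x, m, b):
    """y = 1 / (1 + e^(-4m(x - b)))"""
    return 1.0 / (1.0 + math.exp(-4.0 * m * (x - b)))


def cost(xs, ys, m, b):
    """Mean cross-entropy (an assumed cost): larger when the curve is far from the points."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(x, m, b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def step(xs, ys, m, b, rate):
    """One iteration: move m and b a little way downhill on the cost."""
    n = len(xs)
    grad_m = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = logistic(x, m, b) - y          # d(cost)/d(exponent) for this point
        grad_m += err * 4.0 * (x - b) / n    # chain rule through 4m(x - b)
        grad_b += err * -4.0 * m / n
    return m - rate * grad_m, b - rate * grad_b


def fit(heights, labels, iterations=100, rate=0.1):
    """Fit m and b, working in standardized units (our assumption) for stable steps."""
    mean = sum(heights) / len(heights)
    sd = math.sqrt(sum((h - mean) ** 2 for h in heights) / len(heights))
    us = [(h - mean) / sd for h in heights]
    m, b = 0.1, 0.0                          # initial guess (arbitrary)
    for _ in range(iterations):
        m, b = step(us, labels, m, b, rate)
    return m / sd, mean + sd * b             # convert back to per-cm and cm


for n in (1, 10, 100):
    m, b = fit(HEIGHTS, IS_MALE, iterations=n)
    print(f"{n:3d} iterations: m = {m:.3f}, b = {b:.1f}, "
          f"cost = {cost(HEIGHTS, IS_MALE, m, b):.3f}")
```

Running it with 1, 10, and 100 iterations mirrors the experiment suggested above: the cost should shrink as iterations are added. Try changing rate in the call to fit to see how the step size affects how quickly (or whether) the parameters settle down.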