Testimate Guide

Logistic regression

Logistic regression is not part of the AP Stats syllabus, and is often missing from the introductory course. We include it in testimate because it’s pretty cool, and it’s one way to explore how a numeric predictor can predict a categorical outcome.

We’ll use our “pulse” dataset again, this time trying to predict gender from height.

When you set that up in testimate, and configure it to predict Male (because the young men are taller than the young women in the sample), you see this:

What the heck does this all mean?

Logistic regression (like linear regression) fits a function to the data. But this will be a logistic function, a kind of S-shaped thing. In this case, the function (the Model in the display) is

\[P(\mathrm{gender = Male}) = \frac{1}{1 + e^{-4m(\mathrm{height}-b)}}\]

Still not clear? I don’t blame you. Let’s look at the results of this regression. First, we’ll explain the graph. Then we’ll explain how to make it:

The points are data: the heights of the people. You can see logisticGroup on the vertical axis. It’s a translation of gender, where Male appears as 1.0 and Female is 0.0. As you can see, males are taller than females on average; in logistic regression terms, you might say that the taller someone is, the more likely they are to be male.

The red curve is the logistic model. You can see, for example, that the value of the function at height = 160 is about 0.1. That means that, if a person is 160 cm tall, the probability that they are male is 0.1. Wild, huh?

Another way to say this is, “if you just look near 160 cm, only about 10% of the people are male.”

Making that graph

Testimate cannot ask CODAP to make the entire graph that we want, so you have to do a little of the work yourself:

  • Press the show graph button. This will make a graph where the outcome attribute values are represented by 0 and 1. (Perfect for probability…)
  • Press copy formula. This puts a copy of the formula for the function—the Model probability function shown in the output, but with lots more decimal places—on the clipboard.
  • Click on the graph. In the graph’s “ruler” menu, check Plotted Function.
  • At the top of the graph, click in the white area to the right of “f()=”. The formula editor opens.
  • Paste the formula into the formula editor. Then click Apply to see your function!

Formula editor with pasted function. See? Lots of decimal places.

Note: if the model changes, it will not update the function automatically. You need to re-copy and re-paste the function definition.

Try making the graph in the live demo below.

The “find” box

Just above the blue “configuration stripe” is a tool to compute probabilities at a specific value for the predictor. Just enter the value you want and press enter or tab, and testimate will compute the probability.

This one shows that, if you’re 175 cm tall, there’s an 86% chance that you’re male. You can try it out in the demo above. At what height is your chance of being male about a quarter?
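If you'd like to check the “find” box by hand, here is a minimal Python sketch of the same calculation. It assumes the rounded parameters from the model shown in this guide (m = 0.06, b = 168); the names `p_male` and `height_for` are ours, not part of testimate. Because testimate presumably works with the unrounded parameters, the rounded ones give about 0.84 at 175 cm rather than exactly the 86% shown.

```python
import math

# Rounded parameters from the model in this guide (an assumption here;
# testimate itself carries many more decimal places).
M = 0.06   # slope of the curve at its midpoint, per cm
B = 168.0  # height (cm) at which the probability is 0.5


def p_male(height, m=M, b=B):
    """Logistic model: P(gender = Male) at the given height."""
    return 1.0 / (1.0 + math.exp(-4.0 * m * (height - b)))


def height_for(p, m=M, b=B):
    """Invert the model: the height at which the probability equals p."""
    return b + math.log(p / (1.0 - p)) / (4.0 * m)


if __name__ == "__main__":
    for h in (160, 168, 175):
        print(h, "cm ->", round(p_male(h), 3))
```

Calling `height_for(0.25)` is one way to check your answer to the question above after you have tried it in the demo.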

What do the function parameters mean?

In the logistic function

\[y = \frac{1}{1 + e^{-4m(x-b)}}\]

there are two parameters, \(m\) and \(b\). In this way of writing the function, \(b\) is the value of the predictor where the outcome (the probability \(y\)) is 0.5.¹ And \(m\) is the slope of the curve at that point.² So when we look at a specific function, such as the one in the graph we showed earlier (now in the margin),

\[P(\mathrm{gender = Male}) = \frac{1}{1 + e^{-4 \times 0.06 (\mathrm{height}-168)}}\]

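that means that at 168 cm, the chance is 50% that you’re male, and the slope at that point is 0.06; that is, near 168 cm, the probability of being male rises by about 6 percentage points for every additional centimeter in height. (Farther from 168 cm the curve flattens out, so the increase per centimeter is smaller.)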
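For the curious, here is a short sketch of the calculus behind those two claims, using only the logistic function as written above. At \(x = b\) the exponent is zero, so

\[y(b) = \frac{1}{1 + e^{0}} = \frac{1}{2}\]

Differentiating with respect to \(x\) gives

\[\frac{dy}{dx} = \frac{4m\,e^{-4m(x-b)}}{\left(1 + e^{-4m(x-b)}\right)^{2}} = 4m\,y\,(1-y)\]

and evaluating at \(x = b\), where \(y = \tfrac{1}{2}\),

\[\left.\frac{dy}{dx}\right|_{x=b} = 4m \cdot \frac{1}{2} \cdot \frac{1}{2} = m\]

So the factor of 4 in the exponent is exactly what makes \(m\) come out as the slope at the midpoint.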

Under the hood

Do you want to know more about how all this works? No? That’s fine. You’re done. But in case you’re curious:

That 1-0 attribute, logisticGroup. We make a new attribute based entirely on the outcome attribute (gender). In this new attribute, we put 1 if the value is the one in the button in the configuration stripe, and 0 otherwise. Why isn’t this attribute in the table? It is! It’s just hidden. Use choosy to reveal it.
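The recoding itself is simple. Here is a minimal Python sketch of the idea; the function name `make_logistic_group` and the sample values are hypothetical, made up for illustration, and this is not testimate's actual code.

```python
def make_logistic_group(values, target):
    """Return 1 where the outcome equals the chosen target value, else 0."""
    return [1 if v == target else 0 for v in values]


if __name__ == "__main__":
    # Hypothetical outcome values, not the real pulse data.
    genders = ["Male", "Female", "Female", "Male"]
    # "Male" plays the role of the value on the configuration-stripe button.
    print(make_logistic_group(genders, "Male"))
```

Choosing "Female" as the target instead would flip every 1 and 0, which is why the direction of the prediction matters when you configure the test.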

What’s this about iteration? To do linear regression, you do a bunch of calculations using data values, and calculate the slope and intercept of the least-squares line. That doesn’t work with logistic regression. Instead, we make an initial guess at the function and then iterate, nudging its parameters a little each time in the direction that reduces “cost.” This “cost” is a value that (like the sum of squares of residuals in an LSRL) is larger if the curve is farther from the points. The current value of cost actually appears in the display.
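To make “guess, then iterate” concrete, here is a rough Python sketch of one common way to do it: plain gradient descent on a cross-entropy cost. This guide does not say which cost or update rule testimate uses, so treat all of that as our assumption. The data are hypothetical (not the pulse dataset), and standardizing the heights before iterating is our own choice to keep the steps well behaved.

```python
import math

# Hypothetical heights (cm) and 1/0 outcomes, made up for illustration.
HEIGHTS = [155, 158, 160, 163, 165, 168, 170, 172, 175, 178, 180, 185]
IS_MALE = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(m, b, xs, ys):
    """Mean cross-entropy (assumed cost): larger when the curve misses the points."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(4.0 * m * (x - b))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(xs)


def fit(xs, ys, iterations=100, rate=0.1):
    """Gradient descent on m and b for y = 1 / (1 + exp(-4 m (x - b)))."""
    # Work in standardized units u = (x - mean) / sd (our choice, see lead-in).
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    us = [(x - mean) / sd for x in xs]

    m, b = 0.25, 0.0  # initial guess, in standardized units
    for _ in range(iterations):
        grad_m = grad_b = 0.0
        for u, y in zip(us, ys):
            err = sigmoid(4.0 * m * (u - b)) - y  # prediction minus outcome
            grad_m += err * 4.0 * (u - b)
            grad_b += err * (-4.0 * m)
        # Step against the gradient; rate scales the step size.
        m -= rate * grad_m / len(us)
        b -= rate * grad_b / len(us)

    # Convert back to centimeters: slope per cm and midpoint height.
    return m / sd, mean + sd * b


if __name__ == "__main__":
    for n in (1, 10, 100):
        m, b = fit(HEIGHTS, IS_MALE, iterations=n)
        print(n, "iterations: m =", round(m, 4), "b =", round(b, 1),
              "cost =", round(cost(m, b, HEIGHTS, IS_MALE), 4))
```

Running it with 1, 10, and 100 iterations mirrors the experiment of changing the iteration count in the configuration stripe: the cost should shrink as the iterations accumulate.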

In general, when we tested our algorithm, 100 iterations resulted in a stable answer. But you can do 10 more iterations by clicking the 10 more button. If you want to start earlier in the process, change the “100” in the configuration stripe to a smaller number (such as 1 or 10) and see what function you get. Then you can add iterations and see how the function improves. Sadly, we cannot make the function update automatically; you have to make a new function yourself.

What’s rate? When we make an iteration step, we have to choose how big that step is. The rate parameter in the configuration stripe—it defaults to 0.1—is proportional to the step size. A bigger value will reach the answer faster, but too big and it can overshoot or otherwise mess up.
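You can see the overshoot effect in a much simpler setting. The sketch below is a toy stand-in, not testimate's cost function: it runs gradient descent on cost(x) = x², whose minimum is at 0. The function name `descend` and the particular rates are ours, chosen only for illustration.

```python
def descend(rate, steps=20, start=1.0):
    """Gradient descent on cost(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= rate * 2 * x  # step against the gradient, scaled by rate
    return x


if __name__ == "__main__":
    for rate in (0.1, 0.4, 1.1):
        print("rate", rate, "-> x after 20 steps:", descend(rate))
```

With the smaller rates, x creeps toward 0, and the larger of the two gets there faster. With a rate of 1.1, every step jumps past the minimum and lands farther away on the other side, so x grows instead of shrinking.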

Footnotes

  1. Right? If \(x = b\), then \(y\) =…

  2. OMG, a use for calculus! Now you see where the 4 comes from.