Guest post by data scientist Tom Levine
Variable in an equation
If you ever took an algebra class, you'll think that *x* in the following equation is a variable.
12 = x + 7
Variable in a data table
As I've explained previously, when I say "variable", I'm usually talking about a column in a data table. That is, "variable" is a synonym for "column." This meaning of the word "variable" can be confusing if you were taught a different meaning of "variable" in school. You might find it strange that we use the word "variable" in two seemingly different ways. In fact, I usually think of these as different meanings of the word "variable."
It turns out that these two senses of "variable" are quite related. In the current blog post, I hope to explain why we use the same word to describe two concepts that seem sort of different.
Once we have created a data table, we might decide to make a fancy equation that explains how some values in the table are related to each other. We might call this a *fitted model*. Here's part of a data table about the wealth and health of nations. (You may have seen this before.)
What we're calling "income per person" is technically gross-domestic product per capita, adjusted for purchasing power parity and inflation. But don't worry about that if it sounds like Greek. Here's a fancy graph of the data.
And here's an equation that describes those data.
Predicted life expectancy= 32 + 4.239 × log (Income per person)
In this equation, life expectancy and income per person just numbers that we take from one row of the table. To make that look more like what you might be used to, let's rewrite it like this.
y = 32 + 4.239 × log(x)
By distilling this relationship to a simple equation, we can express complex information as a simple summary of it is small enough to fit in our brains.
There are lots of ways we can use this equation (the fitted model). Right now, we're just going to talk about using the equation to make predictions.
Typically we write the equation such that the left side is the thing we are predicting and the right side is the stuff we know. This is how I've been writing it so far, so we are predicting life expectancy (*y*) based on income per person (*x*). Let's say that we have measured the income per person of another country but don't know the life expectancy. We can use our equation to predict it!
Here's a new row
We know China's income per person *x*, so let's plug that into our equation to predict China's life expectency.
We thus predict China's life expectancy to be 71 years. The actual value is 75, so that's pretty close.
Let's get back to why these two senses of variable are the same thing. Our fitted model is an equation with two variables, income per person and life expectancy. Each of these variables corresponds to a column in our data table.
In the first data table I presented, we knew the values of all of the cells. That is, we knew the income per person and life expectancy for each of the three countries.
In the second table, we didn't only knew the value for one of the two variables (income per person). In order to predict the value for the other variable (life expectancy), we plugged in the variable that we did know into our equation, and we solved it for the variable that we didn't know.
The word "variable" sort of has two senses, and it can be confusing to listen to data scientists when you don't realize this.
If you've taken an algebra class, you probably think of a variable as a letter in an equation.
When data scientists talk about variables, that *might* be what they mean. But they might also be talking about columns in data tables. These two meanings of the word "variable" arise because of how we model data.
We come up with equations to describe the relationships within the dataset, so a variable in the column sense is sort of like a lot of variables in the letter-in-an-equation sense. Once we have our fitted model (equation), we can plug in variables for some of the variables in order to predict other variables.
We'll often solve the same equation as many times as we have rows in our table, so it's sort of like we're plugging a whole column of numbers into our equation rather than just a single number.
What else are you wondering about data science? Tell me what I should write about next.