This post was written by Thomas Levine, data scientist and DataKind Ambassador. Tom also performs in the band CSV Soundsystem which makes music from spreadsheets.
This fall, Tom has generously lent his time to a project we're calling "Ask A Data Scientist." Got a question about your data? Ask Tom in the comments below. You might be already learning R or python or D3, but today we're starting with the basics.
In order to understand how the world works, I collect data about it and then apply a bunch of fancy quantitative methods. A lot of these methods expect data to be represented in data tables. When people like me work with tables so much, we come up with lots of fancy words to describe all of the different parts of a table. Today, I hope to teach you what all of these different words mean.
Here's a data table about the wealth and health of nations.
It has a grid with a value in each of the rectangular boxes. Being this sort of grid, it has columns and rows. These columns and rows mean special things. *What we're calling "income per person" is technically gross-domestic product per capita, adjusted for purchasing power parity and inflation. But don't worry about that if it sounds like Greek.*
The very first line in that table says "Country", "Life Expectancy" and "Income Per Person". This line is called the "header", and we'll get to that later. The rest of the lines are all *rows*. A data table represents a collection of things, and each individual thing is represented as a row. For example, the table above is about a collection of countries, so each row is a country. I've copied it down here again, this time highlighting one of its three rows.
It is quite important to know that each row is a country, so we have a name for this relationship; we say that country is the the *statistical unit* in this data table.
**Synonyms**: Rows are also called *records*, *observations*, *trials* and probably a bunch of other things.
In this data table, we have recorded a bunch of information about each the wealth and health of nations. More precisely, we have recorded the same sorts of information about each country during the year 2012. The first box in each row contains the name of that country, and the second box contains the life expectancy for that country, and the third contains the income per person. Thus, all of the values in a given column are about the same sort of thing. For example, the second column contains all of the life expectancies.
**Synonyms**: Columns are also called *variables* and *features*. These words sound very fancy, but they're not; when a data scientist says "variable" or "feature", she's just using a fancy word for "column".
We usually indicate the names of the columns in the *header*, which is highlighted below. This is how we know what each column means.
I said above that "row" is a synonym for "record", "observation" and "trial and that "column" is a synonym for "variable" and "feature". This isn't *entirely* true. In my mind, these truly are synonyms, but I'm sometimes given a data table that doesn't look like this. You could make a data table where rows don't correspond to records and where columns don't correspond to variables. For example, you could make a table where and columns are records and rows are variables. Or you could make a table that includes a few rows that are statistics about the other rows. I get confused when they I have tables like this.
There's actually a name for this sort of data table; it's called *untidy* data, and the first thing that I do when analyzing such a data table is converting it into the *tidy* format where each row is an observation/trial/record and each column is a variable.
Data science has lots of words and concepts that often sound fancier than they really are. What should I tell you about next?