I have some data—what do I do with it?

When I teach people how to go about data analysis, the first few steps are always the same. The data as you receive it from the customer is almost never in the right format for what you want to do with it. You might even want it in more than one format, such as one for plotting and a different one for a particular data analysis technique.

There isn’t one format that meets all purposes, but Hadley Wickham’s tidy format is a great start for data analysis purposes.

So, your first task when you get the data is to get it into the right shape. This is often known as data wrangling. Computer scientists use the acronym ETL (Extract, Transform, Load) for the various steps of getting the data into your preferred software. I won’t deal with data wrangling here, except to say that my favourite data wrangling tool is dplyr and you can find out how to use it here.

After the data wrangling you need to become familiar with your data before you do anything that might lead you to report wrong conclusions, such as fitting an inappropriate model to the data. This process is called Exploratory Data Analysis (EDA) and it was pioneered by John W. Tukey in his (very weird) 1977 book of the same name. I shall illustrate some EDA using the R statistical environment.

Start by creating some simple summaries of each of the variables. Let’s take as an example the Palmer penguins dataset. This is a dataset collected by US scientists at the Palmer research station on the Antarctic peninsula. Who doesn’t love penguins? (Fans of Apple TV+’s Slow Horses will know the answer.)

You will need to first install and load the palmerpenguins package using the commands

install.packages("palmerpenguins")
library(palmerpenguins)

We can get a short summary of the dataset from the summary function in R.
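Assuming the package’s penguins data frame is loaded, this is just one call:

```r
# Counts for the factor columns; six-number summaries (plus NA counts)
# for the numeric columns
summary(penguins)
```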

      species          island    bill_length_mm  bill_depth_mm 
Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10 
Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60 
Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30 
                                Mean   :43.92   Mean   :17.15 
                                3rd Qu.:48.50   3rd Qu.:18.70 
                                Max.   :59.60   Max.   :21.50 
                                NA's   :2       NA's   :2     
flipper_length_mm  body_mass_g       sex           year      strings      
Min.   :172.0     Min.   :2700   female:165   Min.   :2007   Mode:logical 
1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007   TRUE:344     
Median :197.0     Median :4050   NA's  : 11   Median :2008                
Mean   :200.9     Mean   :4202                Mean   :2008                
3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009                
Max.   :231.0     Max.   :6300                Max.   :2009                
NA's   :2         NA's   :2                                                

We can see that we have eight columns of data on 344 penguins. Three of the variables, the species, island and sex, are categorical and the other five are numeric attributes.

Missing values

When looking at a new dataset one of the first things we should be looking for is missing values. In Excel these will just be blank cells, but in R they are denoted by the special symbol NA (which stands for “not available”), which is distinct from the character string “NA”.

Missing values, along with outliers, are the sort of thing that can easily trip up analysis software, so it’s essential to know about them in advance. In spreadsheets missing values are silently ignored by calculations such as averages. In R, calculating the average of a set of numbers that contains NAs returns NA rather than a number. You can then go back and tell it to calculate the average ignoring the NAs (with na.rm = TRUE), or you can do something else, such as imputing the missing values, but they are always brought to your attention.
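A minimal sketch of that behaviour, with made-up numbers:

```r
x <- c(2700, 3550, NA, 4050)   # one body mass is missing

mean(x)                # NA -- the missing value propagates, so you notice it
mean(x, na.rm = TRUE)  # 3433.333 -- the mean of the non-missing values
```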

We note that:

  • we don’t have the sex for 11 of the penguins
  • we don’t have the bill length for 2 of the penguins
  • we don’t have the bill depth for 2 of the penguins
  • we don’t have the flipper length for 2 of the penguins
  • we don’t have the body mass for 2 of the penguins

Let’s take a look at all the cases that have at least one missing value.
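One way to pull those rows out, assuming the data frame is penguins (complete.cases() flags the rows with no missing values):

```r
# Keep only the rows where at least one column is NA
penguins[!complete.cases(penguins), ]
```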

    species    island bill_length_mm bill_depth_mm flipper_length_mm
4    Adelie Torgersen             NA            NA                NA
9    Adelie Torgersen           34.1          18.1               193
10   Adelie Torgersen           42.0          20.2               190
11   Adelie Torgersen           37.8          17.1               186
12   Adelie Torgersen           37.8          17.3               180
48   Adelie     Dream           37.5          18.9               179
179  Gentoo    Biscoe           44.5          14.3               216
219  Gentoo    Biscoe           46.2          14.4               214
257  Gentoo    Biscoe           47.3          13.8               216
269  Gentoo    Biscoe           44.5          15.7               217
272  Gentoo    Biscoe             NA            NA                NA
    body_mass_g  sex year strings
4            NA <NA> 2007    TRUE
9          3475 <NA> 2007    TRUE
10         4250 <NA> 2007    TRUE
11         3300 <NA> 2007    TRUE
12         3700 <NA> 2007    TRUE
48         2975 <NA> 2007    TRUE
179        4100 <NA> 2007    TRUE
219        4650 <NA> 2008    TRUE
257        4725 <NA> 2009    TRUE
269        4875 <NA> 2009    TRUE
272          NA <NA> 2009    TRUE

We can leave out the rows that have missing values by creating a new dataframe, penguins2, without them.
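A minimal way to do that (na.omit() drops every row containing an NA; penguins2 is the name used here):

```r
penguins2 <- na.omit(penguins)   # 344 rows minus the 11 incomplete ones = 333
```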

Distributions

Let’s now look at the distributions of the numeric variables. Don’t be tempted to just look at the mean—look at the whole distribution.

We are looking for:

  • outliers (extreme values),
  • skewed distributions and
  • multimodality (multiple humps in the distribution).

Outliers can be errors; they can be valid values indicating that the data consists of subsets we should look at separately; or they can just be extreme values that occur occasionally. It’s important to know about them because they can have strong effects on models or reported results. Averages, in particular, are very vulnerable to small numbers of very high or very low values.

Suppose you are doing a survey on wealth in a small town. By chance, Bill Gates and Elon Musk happen to be meeting for a coffee in that very town on the day of your survey and they both agree to take part. The mean wealth of the town will be enormously inflated by those two responses, while the median will barely move.

I would say never compute the mean without also computing the maximum and the minimum at the same time.

Skewed distributions indicate that there is a high proportion of low values and a low (but perhaps very influential) proportion of high values. In business contexts people often note that a high proportion of their revenue comes from a small proportion of key customers (‘the Pareto effect’).

Multimodality might be another indication that there are important subsets of the data. Here we remind ourselves that there are three different species in our dataset.
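A histogram is the simplest way to see the whole distribution; a base-R sketch for one variable:

```r
# Body mass for all penguins pooled -- look for humps, skew and stragglers
hist(penguins$body_mass_g, breaks = 30,
     main = "Body mass of all penguins", xlab = "Body mass (g)")
```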

We note signs of multimodality (i.e. multiple humps). The obvious cause is that there are differences between the species. Let’s look at some of these by species.
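Box plots are a quick way to split a distribution by a categorical variable; a sketch:

```r
# One box per species: shifts in location and spread show up immediately
boxplot(body_mass_g ~ species, data = penguins, ylab = "Body mass (g)")
```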

Right away we can see noticeable differences between the species.

Two variables at a time

For the numeric variables we need to know not just what each variable is like on its own, but how it relates to the other numeric variables, i.e. the correlations between them.

A scatterplot matrix (below) plots each numeric variable against every other one, simultaneously, in thumbnail-sized plots. Usually most of them won’t be of interest, but it does enable the reader to pick out the ones that are, either because the points fall into clusters or because the variables are correlated. If two variables are correlated, one will tend to increase with the other in a (broadly) straight-line relationship. If we see such patterns of interest we can focus in on that pair using a separate chart. Of course, we can do all this by species.
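In base R the pairs() function draws a scatterplot matrix; a sketch, assuming columns 3 to 6 hold the four numeric measurements:

```r
# Every numeric measurement against every other, coloured by species
pairs(penguins[, 3:6], col = penguins$species)
```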

Notice the variable names are on the diagonal and that the cells above the diagonal are mirrored below the diagonal. To see the relationship between any two variables look at the row that contains the name of one and the column that contains the name of the other and find the cell that intersects both of them.

Example: to look at flipper length versus body mass we note that flipper length is mentioned in row 4 and body mass is mentioned in row 5 so we look at row 4, column 5. (Or row 5, column 4. It’s the same information.)

From the chart we can see that flipper length is correlated with body mass. As one goes up, so does the other. Let’s plot just those two variables, and colour code the points by species.
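A base-R sketch of that plot:

```r
# Flipper length against body mass, one colour per species
plot(penguins$flipper_length_mm, penguins$body_mass_g,
     col = penguins$species, pch = 19,
     xlab = "Flipper length (mm)", ylab = "Body mass (g)")
legend("topleft", legend = levels(penguins$species), col = 1:3, pch = 19)
```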

We can see that the Gentoos are heavier and have longer flippers than the Adelie and Chinstraps, which are similar to one another.

For categorical variables a useful tool is the contingency table. It is a table of counts of one variable against another. (In the market research world they are known as crosstabs.) Suppose we are interested in the numbers of each species of penguin on each island.
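In R the table() function does the counting; a sketch using the NA-free penguins2 from earlier:

```r
species_island.tab <- table(penguins2$species, penguins2$island)
species_island.tab
```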

          Biscoe Dream Torgersen
Adelie        44    55        47
Chinstrap      0    68         0
Gentoo       119     0         0

Often there is a relationship between the rows and the columns of a contingency table. It might occur to us to wonder whether the different penguin species are distributed equally amongst the different islands. Obviously, penguins move around, so the numbers found on each island on a particular day will vary from day to day. So there is random variation to take into account.

We can apply a statistical test, the χ² (pronounced ‘kye-squared’) test, to the hypothesis that the penguins are equally distributed across the islands. It delivers a ‘p-value’ that tells us the probability of seeing data as divergent as this, under the assumption that the penguins actually are equally distributed. If the p-value is very small the equality assumption is unlikely, given the observed data.
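In R the whole test is one call on the table of counts (a sketch; species_island.tab holds the species-by-island counts above):

```r
chisq.test(species_island.tab)
```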


    Pearson’s Chi-squared test

data:  species_island.tab
X-squared = 284.59, df = 4, p-value < 2.2e-16

The p-value is extremely small, so the hypothesis that penguins are equally distributed across the islands is extremely unlikely. In this case the result was obvious from the outset, just by looking at the table, but often it is not at all clear and the χ² test is an easy way to clarify the situation.

The test works by taking the total number of each species of penguin and working out how many we would expect to see on each island if they were equally likely to be found on any one of the islands. This is the expected frequency. (There is no need to round it to whole numbers.) We compare the expected frequencies with the observed frequencies to compute the p-value. Here the expected frequencies are:
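The expected frequency for each cell is (row total × column total) / grand total, and the fitted test object keeps them; a sketch:

```r
# Stored on the test object...
chisq.test(species_island.tab)$expected

# ...or computed by hand from the margins
outer(rowSums(species_island.tab), colSums(species_island.tab)) /
  sum(species_island.tab)
```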

            Biscoe    Dream Torgersen
Adelie    71.46547 53.92793 20.606607
Chinstrap 33.28529 25.11712  9.597598
Gentoo    58.24925 43.95495 16.795796

Contingency tables can be visualised in several ways. One of my favourites is the mosaic plot.
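Base R can draw one directly from the table; shade = TRUE adds the residual-based colouring described below:

```r
mosaicplot(species_island.tab, shade = TRUE,
           main = "Penguin species by island")
```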

Here the table is represented by blocks whose size represents the number of penguins. The colours denote departures from the hypothesis that penguins are equally likely to be found on each island: blue represents an excess and red a deficit relative to the expected values. It’s based on the χ² test, so it presents the test in a visually obvious way that doesn’t require knowledge of the underlying statistical theory.

More than two variables at a time

Analysing more than two variables at a time involves a branch of statistics called multivariate analysis. I won’t go into details here, except to say that it involves the use of more advanced mathematical techniques. The benefits are that they can tell you:

  • if any of the variables you are using are redundant, because you are getting essentially the same information from other variables. For example, do you need to measure both flipper length and body mass? They are both measures of size.
  • if there are clusters of similar individuals.
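As a small taste, principal components analysis (one standard multivariate technique) speaks to both of those questions; a sketch:

```r
# PCA on the four numeric measurements, scaled so no one variable dominates
pca <- prcomp(penguins2[, 3:6], scale. = TRUE)
summary(pca)                    # how much information each component carries
plot(pca$x[, 1:2], col = penguins2$species)  # look for clusters of individuals
```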

Summary

Things to think about when you get a new dataset:

  • Look for simple summaries
  • Never compute an average without also computing the maximum and the minimum
  • Look for outliers and missing values
  • Plot the data in ways that are meaningful, bearing in mind what you want to do with it
  • Find a way to plot the entire distribution
  • Use contingency tables of categorical data and test for relationships between the rows and the columns