When I am teaching people about how to go about data analysis the first few steps are always the same. It’s almost always the case that the data as you receive it from the customer is not in the right format for what you want to do with it. You might even want it in more than one format, such as one for plotting and a different format for a certain kind of data analysis technique.
There isn’t one format that meets all purposes, but Hadley Wickham’s tidy format is a great start for data analysis purposes.
So, your first task when you get the data is to get it into the right shape. This is often known as data wrangling. Computer scientists use the acronym ETL (Extraction, Transformation and Loading) for the various steps of getting the data into your preferred software. I won’t deal with data wrangling here, excep to say that my favourite data wrangling tool is dplyr and you can find out how to use it here.
After the data wrangling you need to become familiar with your data before you do any that might lead you to report wrong conclusions, such as fit an inappropriate model to the data. This process is called Exploratory Data Analysis (EDA) and it was pioneered by John W. Tukey in his (very weird) 1997 book of the same name. I shall illustrate some EDA using the R statistical environment.
Start by creating some simple summaries of each of the variables. Let’s take as an example the Palmer penguins dataset. This a dataset collected by US scientists at the Palmer research station on the Antarctic peninsula. Who doesn’t love penguins? (Fans of Apple+ TV’s Slow Horses will know the answer.)
You will need to first install the palmerpenguins package using the commands
install.packages(‘palmerpengions’) library(palmerpengions)
We can get a short summary of the dataset from the summary function in R.
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA’s :2 NA’s :2
flipper_length_mm body_mass_g sex year strings
Min. :172.0 Min. :2700 female:165 Min. :2007 Mode:logical
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007 TRUE:344
Median :197.0 Median :4050 NA’s : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA’s :2 NA’s :2
We can see that we have eight columns of data on 344 penguins. Three variables, the species, island and sex are categories and the other five are numeric attributes.
Missing values
When looking at a new dataset one of the first things we should be looking for is missing values. In Excel these will just be blank cells, but in R they are denoted by the special symbol NA (which stands for “not available”), which is distinct from the character string “NA”.
Missing values, along with outliers, are the sort of thing that can easily trip up analysis software, so it’s essential to know about them in advance. In spreadsheets missing values are silently ignored by calculations such as averages. In R, calculating an average of a set of numbers that contains NAs generates an error. You can then go back and tell it to calculate the average ignoring the NAs (or you can do something else, such as imputing the missing values), but they are always brought to your attention.
We note that:
- we don’t have the sex for 11 of the penguins
- we don’t have the bill length for 2 of the penguins
- we don’t have the bill depth for 2 of the penguins
- we don’t have the flipper length for 2 of the penguins
- we don’t have the body mass for 2 of the penguins
Let’s take a look at all the cases that have at least one missing value.
species island bill_length_mm bill_depth_mm flipper_length_mm
4 Adelie Torgersen NA NA NA
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42.0 20.2 190
11 Adelie Torgersen 37.8 17.1 186
12 Adelie Torgersen 37.8 17.3 180
48 Adelie Dream 37.5 18.9 179
179 Gentoo Biscoe 44.5 14.3 216
219 Gentoo Biscoe 46.2 14.4 214
257 Gentoo Biscoe 47.3 13.8 216
269 Gentoo Biscoe 44.5 15.7 217
272 Gentoo Biscoe NA NA NA
body_mass_g sex year strings
4 NA <NA> 2007 TRUE
9 3475 <NA> 2007 TRUE
10 4250 <NA> 2007 TRUE
11 3300 <NA> 2007 TRUE
12 3700 <NA> 2007 TRUE
48 2975 <NA> 2007 TRUE
179 4100 <NA> 2007 TRUE
219 4650 <NA> 2008 TRUE
257 4725 <NA> 2009 TRUE
269 4875 <NA> 2009 TRUE
272 NA <NA> 2009 TRUE
We can leave out the rows that have missing values by creating a new dataframe, penguins2, without them.
Distributions
Let’s now look at the distributions of the numeric variables. Don’t be tempted to just look at the mean—look at the whole distribution.
We are looking for:
- outliers (extreme values),
- skewed distributions and
- multimodality (multiple humps in the distribution).
Outliers can be errors, or they can be valid values, but indicating that the data consists of subsets that we should look at separately, or they can just be extreme values that occur occasionally. It’s important to know about them because they can have strong effects on models or reported results. Averages, in particular, are very vulnerable to small numbers of very high or very low values.
Suppose you are doing a survey on wealth in a small town. By chance, Bill Gates and Elon Musk happen to be meeting for a coffee in that very town on the day of your survey and they both agree to take part.
I would say never compute the mean without also computing the maximum and the minimum at the same time.
Skewed distributions indicate that there is a high proportion of low values and a low (but perhaps very influential) proportion of high values. In business contexts people often note that a high proportion of their revenue comes from a small proportion of key customers (‘the Pareto effect’).
Multimodality might be another indication that there are important subsets of the data. Here we remind ourselves that there are three different species in our dataset.

We note signs of multimodality (i.e. multiple humps). The obvious cause is that there are differences between the species. Let’s look at some of these by species.




Right away we can see noticeable differences between the species.
Two variables at a time
For the numeric variables We need to know, not just what each variable is like on its own, but how it relates to the other numeric variables, i.e. correlations between them.
A scatterplot matrix (below) plots each numeric variable against every other one, simultaneously in thumbnail-sized plots. Usually most of them won’t be of interest but it does enable the reader to pick out the ones of interest, either because they fall into clusters or because they are correlated. If they are correlated on will tend to increase with another in a (broadly) straight line relationship. If we see such patterns of interest we can focus in on that pair by using a separate chart. Of course, We can do all this by species.

Notice the variable names are on the diagonal and that the cells above the diagonal are mirrored below the diagonal. To see the relationship between any two variables look at the row that contains the name of one and the column that contains the name of the other and find the cell that intersects both of them.
Example: to look at flipper length versus body mass we note that flipper length is mentioned in row 4 and body mass is mentioned in row 5 so we look at row 4, column 5. (Or row 5, column 4. It’s the same information.)
From the chart we can see that flipper length is correlated with body mass. As one goes up, so does the other. Let’s plot just those two variables, and colour code the points by species.

We can see that the Gentoos are heavier and have longer flippers than the Adelie and Chinstraps, which are similar to one another.
For categorical variables a useful tool is the contingency table. It is a table of counts of one variable against another. (In the market research world they are known as crosstabs.) Suppose we are interested in the numbers of each species of penguin on each island.
| Biscoe | Dream | Torgersen | |
|---|---|---|---|
| Adelie | 44 | 55 | 47 |
| Chinstrap | 0 | 68 | 0 |
| Gentoo | 119 | 0 | 0 |
Often contingency tables have a relationship between the rows and the columns. It might occur to us to wonder whether the different penguin species are distributed equally amongst the different islands. Obviously, penguins move around, so the numbers found on each island on a particular day will vary from day to day. So there is random variation to take into account.
We can apply a statistical test, the 2 (pronounced ‘kye-squared’) test, to the hypothesis that the penguins are equally distributed across the islands. It delivers a ‘p-value’ that tells us the probability of seeing data as divergent as this, under the assumption that the penguins actually are equally distributed. If the ‘p-value’ is very small the equality assumption is unlikely, given this observed data.
Pearson’s Chi-squared test
data: species_island.tab
X-squared = 284.59, df = 4, p-value < 2.2e-16
The p-value is extremely small, so the hypothesis that penguins are equally distrinuted across the islands is extremely unlikely. In this case the result was obvious from the outset, just by looking at the table, but often it is not at all clear and the 2 test is an easy way to clarify the situation.
The test works by taking the total number of each species of penguin and working out how many we would expect to see on each island if they were equally likely to be found on any one of the islands. This is the expected frequency. (There is no need to round it to whole numbers.) We compare the expected frequencies with the observed frequencies to compute the p-value. Here the expected frequencies are:
| Biscoe | Dream | Torgersen | |
|---|---|---|---|
| Adelie | 71.46547 | 53.92793 | 20.606607 |
| Chinstrap | 33.28529 | 25.11712 | 9.597598 |
| Gentoo | 58.24925 | 43.95495 | 16.795796 |
Contingency tables can be visualised in several ways. One of my favourites is the mosaic plot.

Here the table is represented by blocks, whose size represents the number of penguins. The colours denote departures from the hypothesis that penguins are equally likely to be found on each island. Blue represents an excess and red represents a deficit from the expected values. It’s based on the 2 test, so it does the test in a visually obvious way that doesn’t require knowledge of the underlying statistical theory.
More than two variables at a time
Analysing more than two variables at a time involves a branch of statistics called multivariate analysis. I won’t go into details here, except to say that it involves the use of more advanced mathematical techniques. The benefits are that they can tell you:
- if any of the variables you are using are redundant, because you are getting essentially the same information from other variables. For exaample, do you need to measure flipper length and body mass? They are both measures of size.
- If there are clusters of similar individuals.
Summary
Things to think about when you get a new dataset:
- Look for simple summaries
- Never compute an average without also computing the maximum and the minimum
- Look for outliers and missing values
- Plot the data in ways that are meaningful, bearing in mind what you want to do with it
- Find a way to plot the entire distribution
- Use contingency tables of categorical data and test for relationships between the rows and the columns