Exploratory data analysis
Exercise 1
First, work your way through this webpage, completing all the tasks. Then, use the code below to answer the questions in the exercise report file. You are encouraged to work in groups, but you should each submit an individual report. If you refer to a figure or table, copy it into your report.
You are not expected to understand the code at this stage!
Palmer penguins
Introduction
The palmerpenguins dataset contains morphological¹ measurement data from three penguin species from three islands in Antarctica. This is an example of an observational² dataset. From the data website:
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The data has already been loaded into the R environment³, which is running in this webpage. It is stored in an object called penguins.
- The data is in a tabular “dataframe” format (like a spreadsheet).
- The top row shows the names of each column (variable).
- NA means that data is missing for that particular value.
- You can arrange the data in different ways by clicking on the name of a column.
- The data is also in a “tidy” format, where each row represents one observation (a single penguin).
Have a look through and make sure you understand what each column is showing you.
Frequencies of categories
Contingency tables
A contingency table shows frequency data (how often a given case occurs) for a dataset in a table format. A simple contingency table for this dataset might show the number of each penguin species in the dataset. The count() function will do that for us:
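A minimal sketch of that call, assuming the dplyr and palmerpenguins packages are installed. On this webpage penguins is already loaded, so the library() lines are only needed if you run the code elsewhere, and the name species_counts is purely illustrative.

```r
library(dplyr)           # provides count()
library(palmerpenguins)  # provides the penguins dataframe

# Count how many rows (penguins) belong to each species
species_counts <- penguins |>
  count(species)

species_counts
```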
The n column shows the number of rows that belong to each species.
You can add more variables to a contingency table by writing the names of the variables you want to group by within the brackets () of the count() function, separated by a comma ,.
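For example, a sketch that groups by both species and island might look like this (assuming dplyr and palmerpenguins are installed; species_island_counts is an illustrative name):

```r
library(dplyr)           # provides count()
library(palmerpenguins)  # provides the penguins dataframe

# Count the penguins in each combination of species and island
species_island_counts <- penguins |>
  count(species, island)

species_island_counts
```

Only combinations that actually occur in the data get a row, so the table can be shorter than the number of species multiplied by the number of islands.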
You can keep adding more categories in the same way.
Bar plots
Bar plots are a way to visualise data from contingency tables. The height of each bar represents the number of cases in that category (equivalent to the n column). The code below is an example of a ggplot “recipe” to make such a plot.
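One possible version of such a recipe, assuming the ggplot2 and palmerpenguins packages are installed (bar_plot is an illustrative name, not something defined by the course):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_bar()
library(palmerpenguins)  # provides the penguins dataframe

# One bar per species; geom_bar() counts the rows in each category for us
bar_plot <- ggplot(penguins, aes(x = species)) +
  geom_bar()

bar_plot
```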
Just like contingency tables, you can show multiple categories on a bar plot. But since you’ve already used two dimensions of the plot (the x and y axes), you will have to use something like the colour of the bars to show more information.
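As a sketch, you could map island to the fill aesthetic so that each bar is split by colour (assuming ggplot2 and palmerpenguins are installed; the choice of island and the name bar_plot_island are illustrative):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_bar()
library(palmerpenguins)  # provides the penguins dataframe

# Species on the x axis, with each bar coloured by island
bar_plot_island <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar()

bar_plot_island
```

By default the coloured segments are stacked on top of each other, so the total height of each bar is still the number of penguins of that species.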
Quantitative variables
Histograms
A histogram groups numeric data into contiguous bins and displays the count in each bin as a bar. It is very similar to a bar chart, but you make the categories by splitting a continuous variable into equal sections (called bins⁵). For example, let’s look at a histogram of body_mass_g of all penguins.
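A minimal sketch of such a histogram, assuming ggplot2 and palmerpenguins are installed (hist_plot is an illustrative name). Penguins with a missing body mass cannot be placed in a bin, so ggplot2 may warn that it removed some rows.

```r
library(ggplot2)         # provides ggplot(), aes() and geom_histogram()
library(palmerpenguins)  # provides the penguins dataframe

# Split body_mass_g into 20 equal-width bins and count the penguins in each
hist_plot <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20)

hist_plot
```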
Just like how you used fill for the bar plot to show another category, you could do the same here. For example, you might want to fill by species.
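A sketch of the filled version, under the same assumptions as before (ggplot2 and palmerpenguins installed; hist_plot_species is an illustrative name):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_histogram()
library(palmerpenguins)  # provides the penguins dataframe

# Same histogram as before, with each bar coloured by species
hist_plot_species <- ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_histogram(bins = 20)

hist_plot_species
```

The coloured segments are stacked by default, so the height of a single colour within a bar, not its top edge, shows the count for that species.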
Some people find this sort of plot hard to read (the instructor included). The next section has an alternative approach that is often easier to grasp.
Violin plots
A violin plot shows the same information as multiple histograms. It applies a smoothing function to each histogram, and then mirrors the result to produce something that often looks like the body of a violin. Violin plots can be a good way to show how a quantitative variable varies among groups. For example, let’s look at a violin plot of body_mass_g for each species.
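A minimal sketch, assuming ggplot2 and palmerpenguins are installed (violin_plot is an illustrative name):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_violin()
library(palmerpenguins)  # provides the penguins dataframe

# One violin per species, showing the distribution of body mass
violin_plot <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_violin()

violin_plot
```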
Let’s also show sex on this plot.
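One way to do this is to map sex to the fill aesthetic, as in the sketch below (assuming ggplot2 and palmerpenguins are installed; violin_plot_sex is an illustrative name):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_violin()
library(palmerpenguins)  # provides the penguins dataframe

# Within each species, draw a separate violin for each value of sex
violin_plot_sex <- ggplot(penguins, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_violin()

violin_plot_sex
```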
You probably notice an issue here. Since some of the penguins have missing data for sex (sex is NA), they are being plotted as their own category. We might want to first remove these before making the plot.
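A sketch of one way to do that, using filter() from dplyr together with is.na() (assuming dplyr, ggplot2 and palmerpenguins are installed; penguins_sexed and violin_plot_clean are illustrative names):

```r
library(dplyr)           # provides filter()
library(ggplot2)         # provides ggplot(), aes() and geom_violin()
library(palmerpenguins)  # provides the penguins dataframe

# Keep only the penguins whose sex was recorded
penguins_sexed <- penguins |>
  filter(!is.na(sex))

# Same violin plot as before, using the filtered data
violin_plot_clean <- ggplot(penguins_sexed, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_violin()

violin_plot_clean
```

Note that a comparison such as sex == NA would not work for this, because comparing anything with NA in R gives NA rather than TRUE or FALSE; is.na() is the function that tests for missing values.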
Now the plot is much cleaner, and reading it is easier.
Scatter plots
If you want to see how two continuous variables are related, then you can use a scatter plot. For example, let’s plot the relationship between body_mass_g and flipper_length_mm.
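A minimal sketch of that scatter plot, assuming ggplot2 and palmerpenguins are installed (scatter_plot is an illustrative name). Penguins with missing measurements cannot be drawn, so ggplot2 may warn that it removed some rows.

```r
library(ggplot2)         # provides ggplot(), aes() and geom_point()
library(palmerpenguins)  # provides the penguins dataframe

# Body mass on the x axis, flipper length on the y axis, one point per penguin
scatter_plot <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

scatter_plot
```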
Each point represents one penguin. Each point’s location on the x axis represents the penguin’s body_mass_g, and its location on the y axis represents the penguin’s flipper_length_mm.
You can describe the relationship using words that imply strength, shape and direction. For example, body_mass_g and flipper_length_mm appear to have a relationship that is:
- strong (the trend is clear)
- linear (you could capture the relationship by drawing a straight line)
- positive (as body_mass_g increases, flipper_length_mm also increases)
You could also describe this plot in simpler terms, for example:
“Heavier penguins tend to have longer flippers.”
Currently you can’t tell which point belongs to which species of penguin. You could use colour to show this information.
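As a sketch, mapping species to the colour aesthetic would do this (assuming ggplot2 and palmerpenguins are installed; scatter_plot_species is an illustrative name):

```r
library(ggplot2)         # provides ggplot(), aes() and geom_point()
library(palmerpenguins)  # provides the penguins dataframe

# Same scatter plot, with each point coloured by species
scatter_plot_species <- ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm, colour = species)
) +
  geom_point()

scatter_plot_species
```

Points use colour rather than fill because the default point shape is a solid dot with no separate interior to fill.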
An additional aesthetic you could use on a scatter plot is shape. Let’s also indicate which sex each penguin is by using shape.
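A sketch that adds shape on top of colour, under the same assumptions (ggplot2 and palmerpenguins installed; scatter_plot_sex is an illustrative name). Penguins with a missing value for sex have no shape to draw, so they may be left off the plot with a warning.

```r
library(ggplot2)         # provides ggplot(), aes() and geom_point()
library(palmerpenguins)  # provides the penguins dataframe

# Colour shows species and point shape shows sex
scatter_plot_sex <- ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm, colour = species, shape = sex)
) +
  geom_point()

scatter_plot_sex
```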
Your task
Once you’ve worked through the webpage, download the assignment worksheet from Canvas assignments. You will need to edit the code above to answer the questions fully.
Footnotes
1. Morphological refers to the physical form and structure of organisms.
2. An observational study is a type of research design where researchers observe and collect data without manipulating or intervening.
3. R is a free software environment for statistical computing and graphics that you will use in this course.
4. fill specifies how each bar should be filled with colour.
5. In the plot below, the argument bins = 20 affects how many groups you slice the variable into. There’s no right or wrong way to decide how many bins to use, so you can try changing it to find a value that looks good to you.