1. Southern resident killer whale diets

Summary statistics across grouped variables

Authors

Amy Van Cise

Sarah Tanja

Published

January 20, 2026

Modified

February 17, 2026

Background

Optional background reading: Van Cise et al. (2024).

Guiding research questions

In this lab we will be using simulated data on southern resident killer whale diet to explore two guiding research questions:

Q1: Does Chinook salmon make up a majority of the southern resident killer whale (SRKW) diet?

Q2: Does southern resident killer whale (SRKW) diet change throughout the year?

Note

We will ALL first work on answering Q1 together in class and see how far we get! If you complete Q1, see if you can find someone in class who has not finished, and try to help them troubleshoot their code!

Set up your (coding) environment

  • Create a new folder for week4 in your R project directory
  • Open a new .Rmd or .qmd file in your week4 folder

Q1: Does Chinook salmon make up a majority of the southern resident killer whale (SRKW) diet?

Step 1. Hypotheses and variables

Write down your null and alternative hypotheses

  • Ho:
  • Ha:
Tip

The conventional notation for hypotheses uses subscripts. In markdown, you can make any text subscript by wrapping it in tilde symbols (~). For example, H~o~ appears as H with a subscript o when rendered or when viewed in the RStudio visual editor.

Identify your x and y variables

  • x is the independent, or predictor variable
  • y is the dependent, or response variable

Tip

Think about whether your variables are categorical (factor) or continuous (numeric). This will help you decide which statistical test to use later!
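
One way to check a variable's type is to ask R directly. The sketch below uses a small made-up data frame (toy_diet and its values are hypothetical, not the lab data) to show how to inspect a column's class and how a text column can be converted to a factor.

```r
# Hypothetical toy data, only for illustration (not the lab dataset)
toy_diet <- data.frame(
  species  = c("Chinook salmon", "chum salmon", "coho salmon"),
  propDiet = c(0.90, 0.04, 0.06),
  stringsAsFactors = FALSE
)

class(toy_diet$species)        # a categorical variable stored as text
is.numeric(toy_diet$propDiet)  # checks whether the column is continuous (numeric)

# Convert the text column to a factor so R treats it as categorical
toy_diet$species <- factor(toy_diet$species)
levels(toy_diet$species)
```

The same checks should work on any column of your own data frame once it is loaded.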

Load libraries

Code
library(tidyverse)
library(ggridges) # for making cool ridge plots with ggplot!

Load inputs

  • Use the read_csv() function from the readr package to read in the simulated killer whale diet data.
Tip

Remember to adjust the file path as necessary to point to the correct location of your data file. The file path within the parentheses should be wrapped in quotes.

your_data <- read_csv("path/to/your/datafile.csv")

  • Remember to name the data frame something meaningful and short!

  • Avoid spaces in your naming conventions! Instead use:

    • Tidyverse (Preferred): snake_case for everything.

    • Base R: Often uses period.separated names.

    • Other: camelCase is also used in some contexts. 

  • Use the assignment operator <- to assign the output of the read_csv() function to your data frame name.

  • The hotkey Alt + - (or Option + - on Mac) can be used to insert the assignment operator <-.
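
The steps above can be sketched end to end. Because the location of your data file is specific to your computer, this example first writes a tiny made-up CSV to a temporary file (tmp_file and toy_diet are hypothetical names), then reads it back in and assigns it to a short snake_case name.

```r
library(readr)

# Hypothetical example: write a tiny CSV to a temporary file so the code runs anywhere
tmp_file <- tempfile(fileext = ".csv")
write_csv(
  data.frame(species  = c("Chinook salmon", "chum salmon"),
             propDiet = c(0.9, 0.1)),
  tmp_file
)

# Read the file and assign the result to a short, meaningful snake_case name
toy_diet <- read_csv(tmp_file, show_col_types = FALSE)
toy_diet
```

In your own code, replace tmp_file with the quoted path to your data file.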

Step 2. Wrangle and visualize the data

Identify the variables needed to answer your research question

  • Use the glimpse() function from the dplyr package to view the structure of your data frame and identify the x and y variables you will need to answer your research question.
Code
glimpse(kw_diet)
Rows: 2,700
Columns: 6
$ Sample   <chr> "Sample1", "Sample1", "Sample1", "Sample1", "Sample1", "Sampl…
$ Ind      <chr> "Ind17", "Ind17", "Ind17", "Ind17", "Ind17", "Ind17", "Ind17"…
$ Month    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 9, 9…
$ species  <chr> "Chinook salmon", "chum salmon", "coho salmon", "steelhead sa…
$ propDiet <dbl> 0.93652027, 0.00000000, 0.05728241, 0.00000000, 0.00000000, 0…
$ season   <chr> "Spring", "Spring", "Spring", "Spring", "Spring", "Spring", "…

In this data frame:

  • species refers to the fish species consumed

  • propDiet refers to the proportion of DNA in a fecal (poop) sample that comes from that fish species

Tip
  • Ask yourself:
    • Are my x and y variables present in the data frame?
    • If not, what variables do I need to transform to get my x and y variables?

Wrangle your variables into columns

group_by & summarize

This week we are going to learn the functions group_by() and summarize() from the dplyr package to wrangle our data.

Use the R help function ? to learn more about these functions. For example, to learn about the group_by() function, run ?group_by in the console or a code chunk.

group_by() is commonly followed by summarize(). The two functions work together: group_by() first groups your data frame by one or more variables, and summarize() then computes the same summary statistics for each group.

See the dplyr documentation for group_by() and summarize() to gain more insight into what these important functions do.

Code
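new_dataframe <- your_data %>%
  # group by one or more categorical columns
  group_by(col_a, col_b) %>% 
  # summarize a numeric column (col_c here) within each group
  summarize(new_col = mean(col_c),
            new_col2 = sd(col_c))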

Use summarize() to create a new column for each statistic you explicitly code. summarize() is designed to work with functions that take a vector of values and return a single value. Here are some useful candidates:

Function    Returns
mean()      mean value of a vector
sd()        standard deviation of a vector
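
To make the pattern concrete, here is a minimal sketch on a made-up data frame that mimics the species and propDiet columns (toy_diet, toy_summary, and all values are hypothetical). It groups by species and then computes the mean and standard deviation of propDiet within each group.

```r
library(dplyr)

# Hypothetical toy data that mimics the species and propDiet columns
toy_diet <- tibble(
  species  = c("Chinook salmon", "Chinook salmon", "chum salmon", "chum salmon"),
  propDiet = c(0.8, 0.6, 0.2, 0.4)
)

# One row per species, one new column per summary statistic
toy_summary <- toy_diet %>%
  group_by(species) %>%
  summarize(mean_prop = mean(propDiet),
            sd_prop   = sd(propDiet))
toy_summary
```

The result has one row per group, which is why summarize() needs functions that return a single value.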

Visualize your data

Code
ggplot(your_data, aes(x = your_x_variable, y = your_y_variable)) + 
  geom_point() +
  theme_minimal()
Tip

Try layering geom_boxplot(alpha = 0.7) on top of a geom_point() layer! This shows both the individual data points and the boxplot's quartile summary at the same time.
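
As a rough sketch of that layering, the example below uses a made-up data frame (toy_diet and its values are hypothetical) with a categorical x variable and a continuous y variable.

```r
library(ggplot2)

# Hypothetical toy data, only for illustration (not the lab dataset)
toy_diet <- data.frame(
  species  = rep(c("Chinook salmon", "chum salmon"), each = 5),
  propDiet = c(0.90, 0.80, 0.70, 0.85, 0.95,
               0.10, 0.20, 0.30, 0.15, 0.05)
)

toy_plot <- ggplot(toy_diet, aes(x = species, y = propDiet)) +
  geom_point() +               # individual samples
  geom_boxplot(alpha = 0.7) +  # semi-transparent boxplot drawn on top
  theme_minimal()
toy_plot
```

Layers are drawn in the order they are added, so the transparency lets the points show through the boxes.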

Step 3. Model the relationship

Choose an appropriate model

  • Categorical predictor (x):

    • ANOVA aov()

    • Follow-up with a Tukey HSD TukeyHSD()

  • Continuous predictor (x):

    • linear regression lm()

  • Is the relationship significant?

  • Is the relationship positive or negative?
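
For a categorical predictor, the workflow might look like the sketch below. It runs on a made-up data frame (toy_diet, toy_aov, toy_tukey, and all values are hypothetical), so the numbers say nothing about the real whales.

```r
# Hypothetical toy data, only for illustration (not the lab dataset)
toy_diet <- data.frame(
  species  = factor(rep(c("Chinook salmon", "chum salmon", "coho salmon"),
                        each = 4)),
  propDiet = c(0.80, 0.85, 0.90, 0.75,
               0.10, 0.05, 0.08, 0.12,
               0.10, 0.10, 0.02, 0.13)
)

# One-way ANOVA: does mean propDiet differ among species?
toy_aov <- aov(propDiet ~ species, data = toy_diet)
summary(toy_aov)

# Tukey HSD: which pairs of species differ from each other?
toy_tukey <- TukeyHSD(toy_aov)
toy_tukey
```

The ANOVA table reports whether any group means differ, and the Tukey HSD output lists the difference and adjusted p-value for each pair of groups.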

Q2: Does southern resident killer whale (SRKW) diet change throughout the year?

Step 1. Hypotheses and variables

Write down your null and alternative hypotheses

  • Ho:

  • Ha:

Identify your x and y variables

  • x is the independent, or predictor variable

  • y is the dependent, or response variable

Step 2. Wrangle and visualize the data

Code
ggplot(kw_diet,
       aes(x = Month, y = propDiet,
           color = species, fill = species)) +
  geom_point() +
  labs(x = "Month", y = "Diet Proportion") +
  theme_minimal()

This data is pretty messy when viewed all together! Let’s pick out one species to focus on using filter() from the dplyr package.

filter() allows us to subset our data frame based on specific criteria. For example, we can filter the species column to only include rows where the species is ["pick your favorite fish"].

Code
new_dataframe <- your_data %>%
  filter(col_a == "some_value")
Note
  • Try layering geom_smooth() on top of a geom_point() layer! This shows both the individual data points and the trend simultaneously.
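
Putting filter() and geom_smooth() together could look like the sketch below. The data are made up (toy_diet, toy_chinook, and all values are hypothetical), and method = "lm" is just one possible choice of trend line.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data that mimics the Month, species, and propDiet columns
toy_diet <- tibble(
  Month    = rep(1:6, times = 2),
  species  = rep(c("Chinook salmon", "chum salmon"), each = 6),
  propDiet = c(0.50, 0.60, 0.70, 0.80, 0.85, 0.90,
               0.50, 0.40, 0.30, 0.20, 0.15, 0.10)
)

# Keep only the rows for one species
toy_chinook <- toy_diet %>%
  filter(species == "Chinook salmon")

# Points plus a straight-line trend
toy_plot <- ggplot(toy_chinook, aes(x = Month, y = propDiet)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  theme_minimal()
toy_plot
```

Note the double equals sign (==) inside filter(): it tests for equality, whereas a single = would be treated as an argument name.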

Step 3. Model the relationship

Does the proportion of [insert your fish here] in the diet change throughout the year?

Choose an appropriate model

  • Categorical predictor (x):

    • ANOVA aov()

    • Follow-up with a Tukey HSD TukeyHSD()

  • Continuous predictor (x):

    • linear regression lm()

  • Is the relationship significant?

  • Is the relationship positive or negative?
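
If you treat Month as a continuous predictor, a linear regression could be sketched as follows. The data frame is made up (toy_chinook, toy_lm, and all values are hypothetical), so the output only illustrates where to look for the slope and its p-value.

```r
# Hypothetical toy data for a single species, only for illustration
toy_chinook <- data.frame(
  Month    = 1:8,
  propDiet = c(0.50, 0.55, 0.62, 0.70, 0.74, 0.80, 0.88, 0.90)
)

# Linear regression: does propDiet change with Month?
toy_lm <- lm(propDiet ~ Month, data = toy_chinook)
summary(toy_lm)

# The sign of the slope gives the direction of the relationship
coef(toy_lm)["Month"]

# The p-value for the slope indicates whether the relationship is significant
summary(toy_lm)$coefficients["Month", "Pr(>|t|)"]
```

A positive slope means the proportion increases as the months go on, and a negative slope means it decreases.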

Bonus - facet_wrap()

Tip

After looking at one fish species, can you look at them all together? Try using facet_wrap() to create a panel of plots, one for each species!

An example dataset, penguins, is available in the palmerpenguins package. Here is an example of how to use facet_wrap() with this dataset to create separate plots for each species of penguin.

Code
library(palmerpenguins)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, 
                     color = species)) +
  geom_point() +
  # one panel per penguin species
  facet_wrap(~species) +
  labs(x = "Flipper Length (mm)", y = "Body Mass (g)") +
  theme_minimal() +
  # remove grid lines and draw a thin border around each panel
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(color = "grey60",
                                    fill = NA,
                                    linewidth = 0.4))

References

Van Cise, Amy M., M. Bradley Hanson, Candice Emmons, Dan Olsen, Craig O. Matkin, Abigail H. Wells, and Kim M. Parsons. 2024. “Spatial and Seasonal Foraging Patterns Drive Diet Differences Among North Pacific Resident Killer Whale Populations.” Royal Society Open Science 11 (9). https://doi.org/10.1098/rsos.240445.