1. Analyse calling behavior in Cook Inlet beluga whales

Data visualization with R and ggplot2

Authors

Amy Van Cise

Sarah Tanja

Published

January 15, 2026

Modified

February 17, 2026

Background

In this script, we will analyze calling behavior in Cook Inlet beluga whales using a dataset from Arial Brewer that includes visual data on group size, behavior, calf presence, and calling rates (combined calls, whistles, and pulsed calls). We will visualize the distributions of these variables and explore how calling behavior varies with group size, behavior, and calf presence. We will also explore how calling behavior changes as we approach behavioral transitions.

Arial Brewer SAFS, Whale and Dolphin Ecology Lab PhD Student

Arial earned a Bachelor of Science degree in Marine Biology from the University of California, Santa Cruz in 2010, and is a marine mammal biologist at NOAA focusing on the acoustic ecology and behavior of cetaceans. Arial has previously worked as a marine mammal trainer and field biologist and has participated in marine mammal surveys off the coasts of Mexico, California, Oregon, Washington, British Columbia, Alaska and Hawaii. At SAFS, Arial is investigating the vocal behavior, kinship, and microbiome variability of the endangered Cook Inlet beluga whale population in Alaska.

Voices of the Sea link

Cook Inlet beluga mother and calf in turbid waters. Credit: NOAA Fisheries/Paul Wade

Important

Explore the data with your guiding research question in mind (either Q1 or Q2!)

Remember the scientific method steps:

Step 1: Hypotheses and variables

Write down your null and alternative hypotheses. Identify x and y variables.

Step 2: Explore the data visually

Step 3: Choose an appropriate model

Categorical x variable: ANOVA aov()
Continuous x variable: linear regression lm()
Is the relationship significant?
Is the relationship positive or negative?

Step 1. Hypotheses and variables

Open an Rmd document in RStudio.
Write down your null and alternative hypotheses. Identify x and y variables.
Set up your environment

Load libraries

Tip

In R, you only need to install a package into your library once (until you update R, after which packages may need to be re-installed). To install a package that you do not have yet, use the install.packages("package_name") function. Run this line of code in your console, not in your script, so that the package is not re-installed every time the script runs.

For example, you may need to run install.packages("ggsci") to install the ggsci package.
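As a rough sketch of the tip above, the following lines list which of the lab packages are not yet in your library, without installing anything. The names `lab_pkgs` and `missing_pkgs` are made up for this illustration.

```r
# Packages loaded in this lab
lab_pkgs <- c("tidyverse", "PNWColors", "ggsci",
              "colorspace", "viridis", "RColorBrewer")

# Compare against the packages already installed in your library
missing_pkgs <- setdiff(lab_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All lab packages are already installed.")
} else {
  # Print the install command to run once in the console
  message("Run in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```

Printing the command rather than running it keeps installation as a deliberate, one-time step in the console.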

library(tidyverse)

## There are many cool color palettes... pick a handful to explore
library(PNWColors) # color palettes
library(ggsci) # color palettes
library(colorspace) # color palettes
library(viridis) # color palettes
library(RColorBrewer) # color palettes

Load your data

Important

Pay attention to your file paths! You may need to change the file path below depending on where your data file is located relative to your working directory. Here, we assume you are working from a .Rmd script that is saved in your code folder, and that the data file is located in a folder called data that sits next to code (one level up from code, then into data). It’s important to load your data explicitly in the script. The “import data” GUI option is not the tidyverse way because it isn’t reproducible!

Tip

As a reminder, you can insert a new code chunk with the keyboard shortcut Ctrl + Alt + I (Windows) or Cmd + Option + I (Mac).

Look at your data! Can you find your x and y variables as columns of your data?
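A minimal sketch of loading and inspecting the data, assuming the folder layout described above. The file name `CIB_calling_behavior.csv` and the variable `data_path` are assumptions here; substitute the name and location of your own file.

```r
library(tidyverse)

# Relative path: up one level from the code folder, then into data.
# The file name is an assumption; change it to match your file.
data_path <- "../data/CIB_calling_behavior.csv"

if (file.exists(data_path)) {
  data <- read_csv(data_path)
  glimpse(data)  # lists every column with its type and first few values
} else {
  # A wrong path is the most common problem, so report where R is looking
  message("No file found at ", data_path,
          " (working directory: ", getwd(), ")")
}
```

If the file is not found, compare the printed working directory with the location of your data folder and adjust the relative path.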

Initial data exploration

Q1: Do belugas use different types of calls in different behavioral or group contexts?

Write down your null and alternative hypotheses. Identify x and y variables.

When visualizing distributions with geom_histogram(), ggplot expects you to pass a single variable (one column of your data) to the x aesthetic.

If we want to visualize counts of group sizes, we pass group_size to the x aesthetic.

ggplot(data = data, aes(x=group_size)) +
  geom_histogram(binwidth=10) +
  xlab("group size")

Tip

What happens when you change the binwidth? Try changing it to 5 or 20 and see how the histogram changes.

Now it’s your turn: Create some histograms!

Create histograms like the one above for the following variables:

combined_call
whistle
pulsed_call

Important

What do you notice about the distributions? Are they normally distributed? Skewed?

Explore using other methods to visualize distribution data. Try:
- geom_boxplot()
- geom_violin()
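As an illustration of these two geoms, here is a small sketch on made-up data. The data frame `toy`, its values, and the behavior labels are hypothetical and only stand in for the lab dataset.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  behavior = rep(c("mill", "travel"), each = 5),
  whistle  = c(0, 1, 1, 3, 8, 0, 0, 2, 2, 5)
)

# Box plot: median, quartiles, and outliers for each behavior
p_box <- ggplot(toy, aes(x = behavior, y = whistle)) +
  geom_boxplot() +
  ylab("# whistles / minute")

# Violin plot: the shape of the distribution for each behavior
p_violin <- ggplot(toy, aes(x = behavior, y = whistle)) +
  geom_violin() +
  ylab("# whistles / minute")

p_box
p_violin
```

Unlike a histogram, both geoms take a y variable, so they make it easy to compare one distribution across the levels of a categorical x variable.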

Important

Create exploratory visualizations and statistical tests to answer your falsifiable hypothesis.

You will want to pivot the dataframe with pivot_longer() so that you can compare all call types at once!

Tip

Remember to use the help function ?pivot_longer to learn what the function does, which inputs it expects, and the syntax for those inputs.

?pivot_longer
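The plot that follows uses a long-format data frame with one row per observation and call type. Here is a minimal sketch of that reshaping on made-up data: `toy` and `toy_long` are hypothetical, while the new column names `call` and `num.call` match the ones used in the plot.

```r
library(tidyverse)

# Hypothetical toy data with the same call columns as the lab dataset
toy <- tibble(
  group_size    = c(5, 12, 30),
  combined_call = c(0, 1, 2),
  whistle       = c(3, 0, 7),
  pulsed_call   = c(1, 4, 0)
)

# Stack the three call columns into two columns:
# "call" holds the old column name, "num.call" holds its value
toy_long <- toy %>%
  pivot_longer(
    cols      = c(combined_call, whistle, pulsed_call),
    names_to  = "call",
    values_to = "num.call"
  )

toy_long
```

Each original row becomes three rows, one per call type, which is what lets `fill = call` separate the call types in a single plot.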

ggplot(data = data.long,
       aes(x = group_size,
           y = num.call,
           fill = call)) +
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  theme_minimal()

Q2: Does beluga vocal behavior (number or type of calls) change as we approach a change in group behavior?

When visualizing counts of calls over time approaching a behavioral transition with geom_bar(), ggplot expects you to pass a single variable to the x aesthetic (e.g., timediff_transition) and a single variable to the y aesthetic (e.g., combined_call, whistle, or pulsed_call).

ggplot(data = data, aes(x = timediff_transition, y = combined_call)) +
  geom_bar(stat = "identity") +
  theme_minimal()

Write down your null and alternative hypotheses. Identify x and y variables.

Tip

What happens when you change the geom from geom_bar() to geom_point() or geom_jitter()? Try it out and see how the plot changes.

Now it’s your turn: Create some data exploration visualizations!

Create visualizations like the one above for the following variables:

whistle
pulsed_call

Important

What do you notice about the distributions? Are they normally distributed? Skewed?

You will want to pivot the dataframe with pivot_longer() so that you can compare all call types at once!

Tip

Remember to use the help function ?pivot_longer to learn what the function does, which inputs it expects, and the syntax for those inputs.

Important

Create exploratory visualizations and statistical tests to answer your falsifiable hypothesis.

Step 2. Wrangle and visualize the data

Tip

If your x and y variables are not columns, wrangle your dataset so that they are. Plot!

geom_bar()
geom_point()
geom_smooth()

Step 3. Model the relationship

Choose an appropriate model

Categorical x variable:
- ANOVA aov()
Continuous x variable:
- linear regression lm()
Is the relationship significant?
Is the relationship positive or negative?
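The two model types above can be sketched on made-up data as follows. The data frame `toy`, its values, and the model names `fit_aov` and `fit_lm` are hypothetical; swap in your own x and y variables.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  group_size = c(2, 5, 8, 12, 20, 25, 30, 40),
  behavior   = rep(c("mill", "travel"), times = 4),
  whistle    = c(0, 1, 1, 3, 4, 6, 5, 9)
)

# Categorical x (behavior): one-way ANOVA
fit_aov <- aov(whistle ~ behavior, data = toy)
summary(fit_aov)   # Pr(>F) is the p-value for the behavior effect

# Continuous x (group_size): linear regression
fit_lm <- lm(whistle ~ group_size, data = toy)
summary(fit_lm)    # Pr(>|t|) on the group_size row is the slope's p-value

# The sign of the slope tells you the direction of the relationship
coef(fit_lm)["group_size"]
```

The p-value answers whether the relationship is significant, and for the regression the sign of the slope estimate answers whether it is positive or negative.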

Lab reports

Important

Your lab report should include the following components:

Your falsifiable hypothesis
The results of your statistical analysis
A figure visualizing your results
A determination of whether you do or do not reject your hypothesis.

Knitting a lab report

When you’re ready to knit your .Rmd or .qmd file into a final report, click the “Knit” or “Render” button at the top of the script editor in RStudio. This will generate an HTML (or PDF/Word if you choose) document that includes both your code and the output (figures, tables, etc.) from running that code.

You can control what is printed by each chunk across the script by including a setup chunk at the top of your document. A setup chunk is a special code chunk that sets global options for all code chunks in the document. A setup chunk could look like this:

# This is the setup chunk
knitr::opts_chunk$set(
  echo = TRUE, # Display code chunks
  eval = TRUE, # Evaluate code chunks
  warning = FALSE, # Hide warnings
  message = FALSE, # Hide messages
  comment = "") # Prevents appending '##' to beginning of lines in code output

In the setup chunk above, we set options to display code (echo = TRUE), evaluate code (eval = TRUE), and hide warnings and messages. Make sure you name the setup chunk by including setup after the {r, in the chunk header (e.g., {r, setup, include=FALSE}). By typing include=FALSE in the setup chunk header, we prevent the setup chunk itself from being displayed in the final document.

In the same way, you can set options for an individual chunk by adding parameters to its chunk header (e.g., {r, echo=FALSE} to hide the code for that specific chunk). You can also click the settings wheel icon in the top right corner of any code chunk in RStudio to set chunk options via a graphical user interface (GUI).

Experiment with including various chunk options and compare your knitted / rendered documents to see the differences.
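One thing to try: in Quarto, and in recent versions of knitr, chunk options can also be written as `#|` comment lines at the top of the chunk body instead of in the chunk header. The sketch below shows the body of such a chunk, using R's built-in `cars` dataset purely as a stand-in.

```r
#| echo: false
#| warning: false
# The two "#|" lines above are chunk options: they hide this chunk's code
# and any warnings in the rendered report, while the code still runs and
# its output is still shown.
summary(cars)
```

Render the document once with these lines and once without them, and compare what appears in the report.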

Data visualization with `ggplot2` tips & resources

Check out great examples of ggplot2 usage in the R graph gallery
Review ggplot2 structure with the ggplot2 cheatsheet
Try using facet_wrap() or facet_grid() to create plot multiples based on a categorical variable (write the categorical variable after the ~ inside facet_wrap() or facet_grid()). An example might be facet_wrap(~behavior) to create separate plots for each behavior type.
Make visually appealing plots by customizing colors, themes, and labels. Some useful functions include:
- scale_color_manual() and scale_fill_manual() to customize colors
- theme_minimal(), theme_classic(), theme_bw(), etc. to change overall plot theme
- xlab(), ylab(), and ggtitle() to add axis labels and titles
There are many gorgeous color palettes available in R packages like PNWColors, colorspace, RColorBrewer, and viridis. Explore these packages to find color schemes that enhance your visualizations.
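Several of these tips can be combined in one plot. The sketch below uses made-up data (`toy` and its values are hypothetical) and the "Dark2" palette from RColorBrewer as one possible choice of colors.

```r
library(ggplot2)

# Hypothetical long-format toy data, invented for illustration only
toy <- data.frame(
  behavior = rep(c("mill", "travel"), each = 6),
  call     = rep(c("combined_call", "whistle", "pulsed_call"), times = 4),
  num.call = c(1, 3, 2, 0, 5, 1, 2, 2, 4, 1, 0, 3)
)

p_facet <- ggplot(toy, aes(x = call, y = num.call, fill = call)) +
  geom_boxplot() +
  facet_wrap(~behavior) +                      # one panel per behavior
  scale_fill_manual(
    values = RColorBrewer::brewer.pal(3, "Dark2")  # three custom fill colors
  ) +
  theme_minimal() +
  xlab("Call type") +
  ylab("Number of calls") +
  ggtitle("Calls by type and behavior (toy data)")

p_facet
```

Swapping `theme_minimal()` for `theme_classic()` or `theme_bw()`, or "Dark2" for another palette, is a quick way to see how each layer changes the figure.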

