2. Population structure in Hawaiian pilot whales

Exploring genetic relatedness with R and the tidyverse

Authors

Amy Van Cise

Sarah Tanja

Published

January 13, 2026

Modified

February 17, 2026

Background

Spend some time looking at the figure below to familiarize yourselves with the concept of genetic relatedness, which is defined as the proportion of your DNA that you share with another individual. Closer relatives share more of their DNA, and more distant relatives share less and less of it.

When we average across a population, or across all living things, we find that all individuals share roughly half (50%) of their DNA with their immediate relatives - that is, their parents, offspring, and full siblings (see the darkest grey ellipse in the center of the figure above). All of these relationships are therefore considered to have a relatedness coefficient of 0.5.

If we expand out to the next grey ellipse in the figure, which represents our next closest family members, we see that all individuals share roughly 1/4 (25%) of their DNA with these extended family relatives - that is, half-siblings, grandparents and grandchildren, and aunts/uncles or nephews/nieces. All of these relationships are therefore considered to have a relatedness coefficient of 0.25.

We can expand out again to the third largest grey ellipse in the figure, and we see that all individuals share roughly 1/8 (12.5%) of their DNA with these distant family relatives - that is, great-grandparents and great-grandchildren, great aunts and uncles, and great nieces and nephews. All of these relationships are therefore considered to have a relatedness coefficient of 0.125.

Scientists can use this information to build pedigrees and determine family relationships in wild populations using a combination of genetic data, age data, and sex data. For example, if you know that two individual California sea lions - one is a 17yo female and the other is a 5yo female - have a relatedness coefficient of 0.54, meaning that just over half of their DNA is identical, you may be able to determine that these two individuals are either parent-offspring or full siblings. Given the age difference between the two, there’s a good chance that the older female is the mom and the younger female is the daughter.

What do you think is the relationship of the following individuals?

Individual 1 Individual 2 Relatedness Coefficient
20yo male 10yo female 0.23
13yo female 10yo female 0.56
38yo male 4yo male 0.26

In this exercise, we will use our newfound tidyverse skills to explore genetic relatedness in Hawaiian short-finned pilot whales (Globicephala macrorhynchus). Using genetic data collected from biopsy samples, we can estimate relatedness coefficients between pairs of individuals and explore how relatedness varies within and between population groups.

Note

This data comes from a study by Van Cise et al. (2017) that investigated population structure and social organization in Hawaiian pilot whales. You can read the full paper here.

Step 0: Set up environment

  • Create an R project (OR a new folder in an existing R project!)

  • Organize your project directory structure

  • Download data files and put them in a data folder!

Best practice is to never modify the raw data directly in excel!!! You don’t want to accidentally change a number or delete something. The data folder should just be input to your code.. any output or data wrangling done within the code should be directed to an output folder so that you don’t risk overwriting your raw data.

Here is a simple example to start off:

.
└── my_awesome_project
    ├── code
    │   └── my_code.Rmd
    ├── data
    │   ├── some_data.csv
    │   └── more_data.csv
    └── output
        └── figs
  • Open a new .rmd file in your code folder

  • Title & date your code file!

  • Insert a code chunk with Ctrl+Alt+i for Windows or Cmd+Option+i for Mac.

  • Load libraries

library("tidyverse")

Step 1: Import data

Now that you have your environment set up, the first thing you will want to do is import the data that you are working with. The most common way this is done is by reading in a comma-separated file (csv) or tab-delimited file (e.g. tsv or txt files). We are using a csv file for this exercise, so you can read in your data using read_csv()from the readr package:

related <- read_csv("../data/SFPW_relatedness_associations.csv")
haplotypes <- read_csv("../data/SFPW_mtDNA_haplotypes.csv")

Once you have imported your data, take a look at it to make sure it makes sense. You can do this in two ways: first, in the environment window (upper right corner), find the variable called “search_data” and click on the blue arrow. This will show a drop-down summary of your data. You’ll see each column listed as a vector, and it will show you the name of the column, the type or class of data, and the number of observations in each column.

You can also look at your data in the viewer. Click on the name of the data frame in the environment to open it up in the viewer window. Notice that it opens as a separate tab, and that each vector is shown as a column, with a row for each observation.

Finally, you can print a summary of your data frame in the console using the summary() function:

summary(related)
      ...1           clust1        ind1         island1              clust2  
 Min.   :    1   Min.   : 1   Min.   : 1.00   Length:90000       Min.   : 1  
 1st Qu.:22501   1st Qu.: 4   1st Qu.: 5.75   Class :character   1st Qu.: 4  
 Median :45001   Median : 8   Median :10.50   Mode  :character   Median : 8  
 Mean   :45001   Mean   : 8   Mean   :10.50                      Mean   : 8  
 3rd Qu.:67500   3rd Qu.:12   3rd Qu.:15.25                      3rd Qu.:12  
 Max.   :90000   Max.   :15   Max.   :20.00                      Max.   :15  
      ind2         island2               rel             assoc       
 Min.   : 1.00   Length:90000       Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 5.75   Class :character   1st Qu.:0.1010   1st Qu.:0.1010  
 Median :10.50   Mode  :character   Median :0.1589   Median :0.2082  
 Mean   :10.50                      Mean   :0.2004   Mean   :0.2447  
 3rd Qu.:15.25                      3rd Qu.:0.2684   3rd Qu.:0.3434  
 Max.   :20.00                      Max.   :0.9510   Max.   :1.0000  
summary(haplotypes)
      ...1           indID              cluster     individual   
 Min.   :  1.00   Length:300         Min.   : 1   Min.   : 1.00  
 1st Qu.: 75.75   Class :character   1st Qu.: 4   1st Qu.: 5.75  
 Median :150.50   Mode  :character   Median : 8   Median :10.50  
 Mean   :150.50                      Mean   : 8   Mean   :10.50  
 3rd Qu.:225.25                      3rd Qu.:12   3rd Qu.:15.25  
 Max.   :300.00                      Max.   :15   Max.   :20.00  
    island              haps          
 Length:300         Length:300        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
Tip

Also try the head() and str() functions… what do these do??

Step 2: Wrangle your data

Usually, the data that you import won’t be exactly what you want to plot or model. Sometimes it will require some tidying. In this case, the first column imported a column of row numbers that are not useful for us (they appear to be rownames of row numbers). So let’s use R to remove that first column from both dataframes.

To do this cleanly, we’re going to use one of the most powerful functions of the tidyverse: the pipe %>% function. The pipe functions tells R to take the data output from one step and send it to the next step. This allows us to write tidy code, in list formats, that is easy to read later.

Tip

You can insert a pipe operator %>% using the keyboard shortcut Ctrl+Alt+p (Windows) or Cmd+Option+p (Mac).

We can use a pipe operator %>%, followed by the command dplyr::select to select all columns except column 1 (-1) . This drops column 1 from our dataframes, and saves the output which overwrites the related and haplotypes dataframes respectively.

related <- related %>% 
  dplyr::select(-1)

haplotypes <- haplotypes %>% 
  dplyr::select(-1)

Take some time to familiarize yourself with the data. Explore the columns, what information is present, and what questions you can ask of the data.

Functions like head , summary, and str are helpful for dataframe familiarization.

head(related)
# A tibble: 6 × 8
  clust1  ind1 island1 clust2  ind2 island2   rel assoc
   <dbl> <dbl> <chr>    <dbl> <dbl> <chr>   <dbl> <dbl>
1      1     1 Hawaii       1     1 Hawaii  0.614 0.755
2      1     1 Hawaii       1    10 Hawaii  0.501 0.675
3      1     1 Hawaii       1    11 Hawaii  0.742 0.935
4      1     1 Hawaii       1    12 Hawaii  0.745 0.885
5      1     1 Hawaii       1    13 Hawaii  0.726 1    
6      1     1 Hawaii       1    14 Hawaii  0.674 0.836
summary(related)
     clust1        ind1         island1              clust2        ind2      
 Min.   : 1   Min.   : 1.00   Length:90000       Min.   : 1   Min.   : 1.00  
 1st Qu.: 4   1st Qu.: 5.75   Class :character   1st Qu.: 4   1st Qu.: 5.75  
 Median : 8   Median :10.50   Mode  :character   Median : 8   Median :10.50  
 Mean   : 8   Mean   :10.50                      Mean   : 8   Mean   :10.50  
 3rd Qu.:12   3rd Qu.:15.25                      3rd Qu.:12   3rd Qu.:15.25  
 Max.   :15   Max.   :20.00                      Max.   :15   Max.   :20.00  
   island2               rel             assoc       
 Length:90000       Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:0.1010   1st Qu.:0.1010  
 Mode  :character   Median :0.1589   Median :0.2082  
                    Mean   :0.2004   Mean   :0.2447  
                    3rd Qu.:0.2684   3rd Qu.:0.3434  
                    Max.   :0.9510   Max.   :1.0000  
str(related)
tibble [90,000 × 8] (S3: tbl_df/tbl/data.frame)
 $ clust1 : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
 $ ind1   : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
 $ island1: chr [1:90000] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
 $ clust2 : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
 $ ind2   : num [1:90000] 1 10 11 12 13 14 15 16 17 18 ...
 $ island2: chr [1:90000] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
 $ rel    : num [1:90000] 0.614 0.501 0.742 0.745 0.726 ...
 $ assoc  : num [1:90000] 0.755 0.675 0.935 0.885 1 ...
head(haplotypes)
# A tibble: 6 × 5
  indID       cluster individual island haps 
  <chr>         <dbl>      <dbl> <chr>  <chr>
1 1_1_Hawaii        1          1 Hawaii A    
2 5_1_Hawaii        5          1 Hawaii A    
3 6_1_Hawaii        6          1 Hawaii D    
4 8_1_Hawaii        8          1 Hawaii A    
5 12_1_Hawaii      12          1 Hawaii D    
6 1_2_Hawaii        1          2 Hawaii A    
summary(haplotypes)
    indID              cluster     individual       island         
 Length:300         Min.   : 1   Min.   : 1.00   Length:300        
 Class :character   1st Qu.: 4   1st Qu.: 5.75   Class :character  
 Mode  :character   Median : 8   Median :10.50   Mode  :character  
                    Mean   : 8   Mean   :10.50                     
                    3rd Qu.:12   3rd Qu.:15.25                     
                    Max.   :15   Max.   :20.00                     
     haps          
 Length:300        
 Class :character  
 Mode  :character  
                   
                   
                   
str(haplotypes)
tibble [300 × 5] (S3: tbl_df/tbl/data.frame)
 $ indID     : chr [1:300] "1_1_Hawaii" "5_1_Hawaii" "6_1_Hawaii" "8_1_Hawaii" ...
 $ cluster   : num [1:300] 1 5 6 8 12 1 5 6 8 12 ...
 $ individual: num [1:300] 1 1 1 1 1 2 2 2 2 2 ...
 $ island    : chr [1:300] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
 $ haps      : chr [1:300] "A" "A" "D" "A" ...

After exploring the data, we can see the individual pilot whales are described by cluster, island, and haplotype in the haplotypes dataframe. The related dataframe contains pairwise relatedness coefficients between individuals, along with their cluster and island designations.

Now we’re ready to ask questions about the data.. Like: “What is the average relatedness coefficients within clusters? Are some clusters more closely related than others?” These questions can be specifically rephrased as hypothesis.

  • Write your hypotheses
    • Null hypothesis (H0):
      • All pilot whale clusters are equally related.
      • There is no difference in relatedness among Hawaiian short-finned pilot whale clusters.
    • Alternative hypothesis (H1a):
      • Some pilot whale clusters are more incestuous than others.
      • There is significant difference in relatedness in Hawaiian short-finned pilot whale clusters.

Begin with the end in mind: now that we have an idea of the end, we can wrangle our data further!

Keep within cluster comparisons:

by_cluster <- related %>% 
  filter(clust1 == clust2)

Remove rows where the coefficient shows relatedness to self

by_cluster <- by_cluster %>% 
  filter(!ind1 == ind2)

Step 3: Visualize data

ggplot(by_cluster,
       aes(x = factor(clust1),
           y = rel)) +
  geom_boxplot(outlier.alpha = 0.3) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1) +
  labs(
    x = "Cluster",
    y = "Relatedness coefficient"
  ) +
  theme_minimal()

Do any of these clusters have on average, a greater relatedness coefficient compared to other clusters? What do you notice about the data? What questions does it make you ask?

Step 4: Model the data

Here we can run a One-Way ANOVA to test the null hypothesis that all clusters have the same average relatedness coefficient.

anova_results <- aov(rel ~ clust1, data= by_cluster)
summary(anova_results)
              Df Sum Sq  Mean Sq F value Pr(>F)
clust1         1   0.01 0.009786   1.206  0.272
Residuals   5698  46.23 0.008113               

What does this result tell us? What additional questions or visualizations can you make with this dataset?

Tip

Try exploring relatedness by island, or exploring between-cluster relatedness!

Step 5: Save your work

Finally, don’t forget to save your work! You can save your R script or R markdown file by clicking on the save icon in the toolbar, or by using the keyboard shortcut Ctrl+S (Windows) or Cmd+S (Mac). You can also save any plots you create using the ggsave() function from the ggplot2 package. For example, if you created a plot called my_plot, you could save it as a PNG file like this:

ggsave("my_plot.png", plot = my_plot, path = "../output/figs", width = 6, height = 4)

Make sure to specify the correct path to your output folder where you want to save the plot. You can adjust the width and height parameters to change the size of the saved plot.

References

Van Cise, Amy M., Karen. K. Martien, Sabre D. Mahaffy, Robin W. Baird, Daniel L. Webster, James H. Fowler, Erin M. Oleson, and Phillip A. Morin. 2017. “Familial Social Structure and Socially Driven Genetic Differentiation in Hawaiian Short-Finned Pilot Whales.” Molecular Ecology 26 (23): 6730–41. https://doi.org/10.1111/mec.14397.