library("tidyverse")2. Population structure in Hawaiian pilot whales
Exploring genetic relatedness with R and the tidyverse
Background
Spend some time looking at the figure below to familiarize yourselves with the concept of genetic relatedness, which is defined as the proportion of your DNA that you share with another individual. Closer relatives share more of their DNA, and more distant relatives share less and less of it.

When we average across a population, or across all living things, we find that all individuals share roughly half (50%) of their DNA with their immediate relatives - that is, their parents, offspring, and full siblings (see the darkest grey ellipse in the center of the figure above). All of these relationships are therefore considered to have a relatedness coefficient of 0.5.
If we expand out to the next grey ellipse in the figure, which represents our next closest family members, we see that all individuals share roughly 1/4 (25%) of their DNA with these extended family relatives - that is, half-siblings, grandparents and grandchildren, and aunts/uncles or nephews/nieces. All of these relationships are therefore considered to have a relatedness coefficient of 0.25.
We can expand out again to the third largest grey ellipse in the figure, and we see that all individuals share roughly 1/8 (12.5%) of their DNA with these distant family relatives - that is, great-grandparents and great-grandchildren, great aunts and uncles, and great nieces and nephews. All of these relationships are therefore considered to have a relatedness coefficient of 0.125.
Scientists can use this information to build pedigrees and determine family relationships in wild populations using a combination of genetic data, age data, and sex data. For example, if you know that two individual California sea lions - one is a 17yo female and the other is a 5yo female - have a relatedness coefficient of 0.54, meaning that just over half of their DNA is identical, you may be able to determine that these two individuals are either parent-offspring or full siblings. Given the age difference between the two, there’s a good chance that the older female is the mom and the younger female is the daughter.
What do you think is the relationship of the following individuals?
| Individual 1 | Individual 2 | Relatedness Coefficient |
| 20yo male | 10yo female | 0.23 |
| 13yo female | 10yo female | 0.56 |
| 38yo male | 4yo male | 0.26 |
In this exercise, we will use our newfound tidyverse skills to explore genetic relatedness in Hawaiian short-finned pilot whales (Globicephala macrorhynchus). Using genetic data collected from biopsy samples, we can estimate relatedness coefficients between pairs of individuals and explore how relatedness varies within and between population groups.
Step 0: Set up environment
Create an R project (OR a new folder in an existing R project!)
Organize your project directory structure
Download data files and put them in a data folder!
Best practice is to never modify the raw data directly in excel!!! You don’t want to accidentally change a number or delete something. The data folder should just be input to your code.. any output or data wrangling done within the code should be directed to an output folder so that you don’t risk overwriting your raw data.
Here is a simple example to start off:
.
└── my_awesome_project
├── code
│ └── my_code.Rmd
├── data
│ ├── some_data.csv
│ └── more_data.csv
└── output
└── figs
Open a new
.rmdfile in yourcodefolderTitle & date your code file!

Insert a code chunk with
Ctrl+Alt+ifor Windows orCmd+Option+ifor Mac.Load libraries
Step 1: Import data
Now that you have your environment set up, the first thing you will want to do is import the data that you are working with. The most common way this is done is by reading in a comma-separated file (csv) or tab-delimited file (e.g. tsv or txt files). We are using a csv file for this exercise, so you can read in your data using read_csv()from the readr package:
related <- read_csv("../data/SFPW_relatedness_associations.csv")
haplotypes <- read_csv("../data/SFPW_mtDNA_haplotypes.csv")Once you have imported your data, take a look at it to make sure it makes sense. You can do this in two ways: first, in the environment window (upper right corner), find the variable called “search_data” and click on the blue arrow. This will show a drop-down summary of your data. You’ll see each column listed as a vector, and it will show you the name of the column, the type or class of data, and the number of observations in each column.
You can also look at your data in the viewer. Click on the name of the data frame in the environment to open it up in the viewer window. Notice that it opens as a separate tab, and that each vector is shown as a column, with a row for each observation.
Finally, you can print a summary of your data frame in the console using the summary() function:
summary(related) ...1 clust1 ind1 island1 clust2
Min. : 1 Min. : 1 Min. : 1.00 Length:90000 Min. : 1
1st Qu.:22501 1st Qu.: 4 1st Qu.: 5.75 Class :character 1st Qu.: 4
Median :45001 Median : 8 Median :10.50 Mode :character Median : 8
Mean :45001 Mean : 8 Mean :10.50 Mean : 8
3rd Qu.:67500 3rd Qu.:12 3rd Qu.:15.25 3rd Qu.:12
Max. :90000 Max. :15 Max. :20.00 Max. :15
ind2 island2 rel assoc
Min. : 1.00 Length:90000 Min. :0.0000 Min. :0.0000
1st Qu.: 5.75 Class :character 1st Qu.:0.1010 1st Qu.:0.1010
Median :10.50 Mode :character Median :0.1589 Median :0.2082
Mean :10.50 Mean :0.2004 Mean :0.2447
3rd Qu.:15.25 3rd Qu.:0.2684 3rd Qu.:0.3434
Max. :20.00 Max. :0.9510 Max. :1.0000
summary(haplotypes) ...1 indID cluster individual
Min. : 1.00 Length:300 Min. : 1 Min. : 1.00
1st Qu.: 75.75 Class :character 1st Qu.: 4 1st Qu.: 5.75
Median :150.50 Mode :character Median : 8 Median :10.50
Mean :150.50 Mean : 8 Mean :10.50
3rd Qu.:225.25 3rd Qu.:12 3rd Qu.:15.25
Max. :300.00 Max. :15 Max. :20.00
island haps
Length:300 Length:300
Class :character Class :character
Mode :character Mode :character
Also try the head() and str() functions… what do these do??
Step 2: Wrangle your data
Usually, the data that you import won’t be exactly what you want to plot or model. Sometimes it will require some tidying. In this case, the first column imported a column of row numbers that are not useful for us (they appear to be rownames of row numbers). So let’s use R to remove that first column from both dataframes.
To do this cleanly, we’re going to use one of the most powerful functions of the tidyverse: the pipe %>% function. The pipe functions tells R to take the data output from one step and send it to the next step. This allows us to write tidy code, in list formats, that is easy to read later.
You can insert a pipe operator %>% using the keyboard shortcut Ctrl+Alt+p (Windows) or Cmd+Option+p (Mac).
We can use a pipe operator %>%, followed by the command dplyr::select to select all columns except column 1 (-1) . This drops column 1 from our dataframes, and saves the output which overwrites the related and haplotypes dataframes respectively.
related <- related %>%
dplyr::select(-1)
haplotypes <- haplotypes %>%
dplyr::select(-1)Take some time to familiarize yourself with the data. Explore the columns, what information is present, and what questions you can ask of the data.
Functions like head , summary, and str are helpful for dataframe familiarization.
head(related)# A tibble: 6 × 8
clust1 ind1 island1 clust2 ind2 island2 rel assoc
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 Hawaii 1 1 Hawaii 0.614 0.755
2 1 1 Hawaii 1 10 Hawaii 0.501 0.675
3 1 1 Hawaii 1 11 Hawaii 0.742 0.935
4 1 1 Hawaii 1 12 Hawaii 0.745 0.885
5 1 1 Hawaii 1 13 Hawaii 0.726 1
6 1 1 Hawaii 1 14 Hawaii 0.674 0.836
summary(related) clust1 ind1 island1 clust2 ind2
Min. : 1 Min. : 1.00 Length:90000 Min. : 1 Min. : 1.00
1st Qu.: 4 1st Qu.: 5.75 Class :character 1st Qu.: 4 1st Qu.: 5.75
Median : 8 Median :10.50 Mode :character Median : 8 Median :10.50
Mean : 8 Mean :10.50 Mean : 8 Mean :10.50
3rd Qu.:12 3rd Qu.:15.25 3rd Qu.:12 3rd Qu.:15.25
Max. :15 Max. :20.00 Max. :15 Max. :20.00
island2 rel assoc
Length:90000 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:0.1010 1st Qu.:0.1010
Mode :character Median :0.1589 Median :0.2082
Mean :0.2004 Mean :0.2447
3rd Qu.:0.2684 3rd Qu.:0.3434
Max. :0.9510 Max. :1.0000
str(related)tibble [90,000 × 8] (S3: tbl_df/tbl/data.frame)
$ clust1 : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
$ ind1 : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
$ island1: chr [1:90000] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
$ clust2 : num [1:90000] 1 1 1 1 1 1 1 1 1 1 ...
$ ind2 : num [1:90000] 1 10 11 12 13 14 15 16 17 18 ...
$ island2: chr [1:90000] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
$ rel : num [1:90000] 0.614 0.501 0.742 0.745 0.726 ...
$ assoc : num [1:90000] 0.755 0.675 0.935 0.885 1 ...
head(haplotypes)# A tibble: 6 × 5
indID cluster individual island haps
<chr> <dbl> <dbl> <chr> <chr>
1 1_1_Hawaii 1 1 Hawaii A
2 5_1_Hawaii 5 1 Hawaii A
3 6_1_Hawaii 6 1 Hawaii D
4 8_1_Hawaii 8 1 Hawaii A
5 12_1_Hawaii 12 1 Hawaii D
6 1_2_Hawaii 1 2 Hawaii A
summary(haplotypes) indID cluster individual island
Length:300 Min. : 1 Min. : 1.00 Length:300
Class :character 1st Qu.: 4 1st Qu.: 5.75 Class :character
Mode :character Median : 8 Median :10.50 Mode :character
Mean : 8 Mean :10.50
3rd Qu.:12 3rd Qu.:15.25
Max. :15 Max. :20.00
haps
Length:300
Class :character
Mode :character
str(haplotypes)tibble [300 × 5] (S3: tbl_df/tbl/data.frame)
$ indID : chr [1:300] "1_1_Hawaii" "5_1_Hawaii" "6_1_Hawaii" "8_1_Hawaii" ...
$ cluster : num [1:300] 1 5 6 8 12 1 5 6 8 12 ...
$ individual: num [1:300] 1 1 1 1 1 2 2 2 2 2 ...
$ island : chr [1:300] "Hawaii" "Hawaii" "Hawaii" "Hawaii" ...
$ haps : chr [1:300] "A" "A" "D" "A" ...
After exploring the data, we can see the individual pilot whales are described by cluster, island, and haplotype in the haplotypes dataframe. The related dataframe contains pairwise relatedness coefficients between individuals, along with their cluster and island designations.
Now we’re ready to ask questions about the data.. Like: “What is the average relatedness coefficients within clusters? Are some clusters more closely related than others?” These questions can be specifically rephrased as hypothesis.
- Write your hypotheses
- Null hypothesis (H0):
- All pilot whale clusters are equally related.
- There is no difference in relatedness among Hawaiian short-finned pilot whale clusters.
- Alternative hypothesis (H1a):
- Some pilot whale clusters are more incestuous than others.
- There is significant difference in relatedness in Hawaiian short-finned pilot whale clusters.
- Null hypothesis (H0):
Begin with the end in mind: now that we have an idea of the end, we can wrangle our data further!
Keep within cluster comparisons:
by_cluster <- related %>%
filter(clust1 == clust2)Remove rows where the coefficient shows relatedness to self
by_cluster <- by_cluster %>%
filter(!ind1 == ind2)Step 3: Visualize data
ggplot(by_cluster,
aes(x = factor(clust1),
y = rel)) +
geom_boxplot(outlier.alpha = 0.3) +
geom_jitter(width = 0.15, alpha = 0.4, size = 1) +
labs(
x = "Cluster",
y = "Relatedness coefficient"
) +
theme_minimal()
Do any of these clusters have on average, a greater relatedness coefficient compared to other clusters? What do you notice about the data? What questions does it make you ask?
Step 4: Model the data
Here we can run a One-Way ANOVA to test the null hypothesis that all clusters have the same average relatedness coefficient.
anova_results <- aov(rel ~ clust1, data= by_cluster)
summary(anova_results) Df Sum Sq Mean Sq F value Pr(>F)
clust1 1 0.01 0.009786 1.206 0.272
Residuals 5698 46.23 0.008113
What does this result tell us? What additional questions or visualizations can you make with this dataset?
Try exploring relatedness by island, or exploring between-cluster relatedness!
Step 5: Save your work
Finally, don’t forget to save your work! You can save your R script or R markdown file by clicking on the save icon in the toolbar, or by using the keyboard shortcut Ctrl+S (Windows) or Cmd+S (Mac). You can also save any plots you create using the ggsave() function from the ggplot2 package. For example, if you created a plot called my_plot, you could save it as a PNG file like this:
ggsave("my_plot.png", plot = my_plot, path = "../output/figs", width = 6, height = 4)Make sure to specify the correct path to your output folder where you want to save the plot. You can adjust the width and height parameters to change the size of the saved plot.