Using SNPRelate to Construct Ghanaian Kassena Genetic Genealogy

Residents of Paga, Ghana using genetic genealogy to identify diaspora relatives

One of the goals of The African Kinship Reunion is to construct the genetic genealogy of our participants from Africa. In this post, I discuss the genetic genealogy dendrogram (tree diagram) results produced using SNPRelate. The purpose of using SNPRelate was to obtain genetic relatedness connections among participants in the form of a tree diagram. I welcome your feedback in the comments section below.

The Set-up

I used the raw SNP DNA text files of 51 participants. Of these, 42 were from members of the Kassena ethnic group residing in the Nania community of Paga, Ghana, 7 were from my family of African Americans residing in the United States, and 2 were from people from the Congo who reside in the United States. To develop the genetic genealogy dendrogram, I used a set of tools called SNPRelate: Parallel Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. SNPRelate was developed by Xiuwen Zheng, Stephanie Gogarten, Cathy Laurie, and Bruce Weir for relatedness analysis in genome-wide association studies, and it runs in the R statistical programming language. I also used RStudio, which is an integrated development environment (IDE) for R that includes a console and supports plotting.

The Code

To run SNPRelate on our participant data files, I largely followed the R script provided by the SNPRelate authors on the Bioconductor website. I show below the exact script that I used.

In RStudio, I loaded the necessary libraries to use SNPRelate and to work with GDS formatted files.

library(gdsfmt)
library(SeqArray)
library(SNPRelate)

Then, I set the path to the dataset, which consisted of a single VCF formatted text file containing all participants. Programs such as Beagle, Refined IBD, and hap-ibd also use VCF formatted files.

vcf.fn <- "/home/lakishatdavid/inputs/Results/MergedSamples.vcf.gz"

I converted the participants’ VCF formatted file to a Genomic Data Structure (GDS) formatted file for use with SNPRelate.

gdsfile <- tempfile()
seqVCF2GDS(vcf.fn, gdsfile)

I opened the GDS formatted file to use it.

gds <- seqOpen(gdsfile)

I used the SNPRelate snpgdsDiss command to calculate the dissimilarity between each pair of individuals.

diss <- snpgdsDiss(gds)
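
To make the idea of a dissimilarity matrix concrete, here is a minimal sketch in base R. The genotype values and the individual labels are made up, and the measure (mean absolute difference in allele counts) is a simplified stand-in that I chose for illustration. It is not the formula that snpgdsDiss uses internally.

```r
# Toy genotype matrix: rows are individuals, columns are SNPs,
# entries are counts of the alternate allele (0, 1, or 2).
# All values and labels are hypothetical.
geno <- rbind(
  parent = c(0, 1, 2, 1, 0, 2),
  child  = c(0, 1, 1, 1, 0, 2),
  other  = c(2, 0, 0, 2, 1, 0)
)

# Simplified stand-in for a genetic dissimilarity: the summed absolute
# difference in allele counts, divided by the maximum possible (2 per SNP).
# This is an assumption for illustration, not the snpgdsDiss formula.
toy_diss <- as.matrix(dist(geno, method = "manhattan")) / (2 * ncol(geno))

print(round(toy_diss, 3))
```

In this toy matrix the parent and child differ at only one SNP, so their dissimilarity is much smaller than either one's dissimilarity to the unrelated individual. The real matrix has one row and one column per participant and is read the same way.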

I used the SNPRelate snpgdsHCluster command to conduct a hierarchical cluster analysis.

hc <- snpgdsHCluster(diss)
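
The following sketch shows what hierarchical clustering does with a dissimilarity matrix, using base R's hclust on four made-up individuals. The labels, the values, and the choice of average linkage are assumptions for illustration. Check the snpgdsHCluster documentation for the settings it actually applies.

```r
# Hypothetical dissimilarities among four individuals (symmetric, zero diagonal).
ids <- c("A", "B", "C", "D")
d <- matrix(c(0.0, 0.2, 0.8, 0.9,
              0.2, 0.0, 0.7, 0.8,
              0.8, 0.7, 0.0, 0.3,
              0.9, 0.8, 0.3, 0.0),
            nrow = 4, dimnames = list(ids, ids))

# Agglomerative clustering: repeatedly join the two closest groups.
# Average linkage is assumed here for the sake of the example.
hc_toy <- hclust(as.dist(d), method = "average")

print(hc_toy$merge)   # which items or groups were joined at each step
print(hc_toy$height)  # the dissimilarity at which each join happened
```

The closest pair (A and B) is joined first, then C and D, and finally the two pairs are joined to each other. The heights of those joins are what the dendrogram draws.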

I used the SNPRelate snpgdsCutTree command, which determines clusters of individuals from the dendrogram (tree diagram).

rv <- snpgdsCutTree(hc)
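
As a rough illustration of turning a tree into groups, the sketch below cuts a toy tree at a fixed height with base R's cutree. The participants P1 to P5, the dissimilarity values, and the cut height of 0.5 are all made up. A fixed-height cut is a simpler stand-in and should not be read as the procedure snpgdsCutTree follows to choose its clusters.

```r
# Hypothetical dissimilarities: P1-P3 are close to one another,
# P4-P5 are close to each other, and the two sets are far apart.
ids <- paste0("P", 1:5)
d <- matrix(0.8, nrow = 5, ncol = 5, dimnames = list(ids, ids))
d[1:3, 1:3] <- 0.15
d[4:5, 4:5] <- 0.25
diag(d) <- 0

hc_toy <- hclust(as.dist(d), method = "average")

# Cut the tree at height 0.5 (an arbitrary threshold for this example):
# every join above that height is undone, leaving separate groups.
groups <- cutree(hc_toy, h = 0.5)

print(groups)
```

Each participant receives a group number. Participants whose joins all lie below the cut height end up in the same group.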

I used the plot command to plot the dendrogram.

plot(rv$dendrogram)

The plot command produced the following dendrogram:

[Figure: dendrogram of participants with names (AncestryDNA)]

Take a look and you will notice familiar names like Jennifer Kadi Welaga and Gabriel Kugoriamo, who have shown up as relatives in several one-to-many autosomal DNA comparison results on GEDmatch. The horizontal lines show relative relatedness: participants joined by a horizontal line are more closely related to each other than to participants outside that join. For example, Jennifer Kadi Welaga is connected by horizontal lines to her children, siblings Abapuuri Welaga and Kobanyere Welaga. Another example is that Gabriel Kugoriamo is connected to George Kugoriamo, Gabriel’s father. Additionally, these participants are more closely related to each other than they are to, for example, Kwowora Apeakwo and her son Derrick Apeakwo.

It may be useful to look at the dendrogram without the names to see the genetic relatedness more clearly. For that, I used the SNPRelate snpgdsDrawTree command.

snpgdsDrawTree(rv, edgePar=list(col=rgb(0.5,0.5,0.5, 0.75), t.col="black"))

The snpgdsDrawTree command produced the following dendrogram:

[Figure: dendrogram of participants without names (AncestryDNA 2)]

Finally, I closed the GDS formatted file.

seqClose(gds)

The Usefulness of SNPRelate Dendrograms

Dendrograms can reveal possible genetic relatedness and so supplement the results of GEDmatch’s clustering and triangulation with cross-matching tools or Blaine T. Bettinger’s The Shared cM Project 4.0 tool v4, all of which are extremely useful in their own right. Among genetic matches, the dendrograms signal which participants’ shared ancestors are likely more recent than others.

As I examine this SNPRelate dendrogram, I would explore the relatedness between the Awudu family and the Awewoyeim family because they are so closely connected on the dendrogram. Following that, I would explore their relatedness with the Welaga family because it appears that the shared ancestor of Awudu and Awewoyeim has a shared ancestor with the Welaga family. I would also explore the relatedness and shared ancestor of Linda Adoa and Nabarese Pagawojem, and then the shared ancestor between that person and the Bendakeim family.

Although SNPRelate’s dendrogram could prove useful, there is one major thing that I wish I could do with SNPRelate. In the dendrogram, parents and their children are placed in such a way that the progeny appears to share an ancestor with one parent, and that ancestor then appears to share an ancestor with the other parent. I do understand that sometimes the parents of a child are related to each other. However, I do not believe that this placement reflects the actual pattern of relatedness among parents and children in the group. It is unlikely that in all or most cases one parent and the child have a common ancestor who in turn has a common ancestor with the other parent, relative to other members of the community.
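
A small base R sketch can show how this placement arises from the clustering itself, even when the parents are unrelated. The four individuals and their dissimilarities are made up, and average linkage is assumed for illustration.

```r
# Hypothetical dissimilarities. The child is close to both parents,
# while the mother and father are no closer to each other than to a neighbor.
ids <- c("mother", "father", "child", "neighbor")
d <- matrix(c(0.00, 0.80, 0.40, 0.85,
              0.80, 0.00, 0.42, 0.85,
              0.40, 0.42, 0.00, 0.85,
              0.85, 0.85, 0.85, 0.00),
            nrow = 4, dimnames = list(ids, ids))

hc_trio <- hclust(as.dist(d), method = "average")

# Row 1: the closest pair is joined (mother and child).
# Row 2: the father is joined to that pair, pulled in by the child.
print(hc_trio$merge)
print(hc_trio$height)
```

Because a tree can only join two groups at a time, the child pairs with whichever parent is marginally closer, and the other parent is then attached to that pair. The tree reads as if the parents share an ancestor, although the toy values say they are as distant from each other as from the neighbor.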

Because humans are diploid and receive one set of chromosomes from each parent, I would find it useful if the dataset could be formatted with two data entries for each participant who has at least one parent in the dataset. The concept is similar to the one behind GEDmatch’s tool that creates phased profiles from a progeny and one or both parents. The two data entries for progeny participants would be phased entries, each representing one parent. In this way, each parent-progeny entry would cluster more freely with those within the community (or participant group) to whom they are more closely related. In the meantime, I would need to remove the progeny participants from the dataset to get a picture closer to the relatedness among the participants.
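
The interim workaround of removing progeny can be sketched in base R on a toy dissimilarity matrix. The individuals, the values, and the `progeny` vector are hypothetical, and average linkage is assumed. With the real data, my understanding is that the same effect could be had by passing the retained IDs through the sample.id argument of snpgdsDiss, but treat that as an assumption to verify against the SNPRelate documentation.

```r
# Hypothetical dissimilarities: the father has a cousin in the dataset,
# and the child is close to both parents.
ids <- c("mother", "father", "child", "cousin")
d <- matrix(c(0.00, 0.80, 0.40, 0.85,
              0.80, 0.00, 0.42, 0.65,
              0.40, 0.42, 0.00, 0.70,
              0.85, 0.65, 0.70, 0.00),
            nrow = 4, dimnames = list(ids, ids))

# Hypothetical list of participants who have a parent in the dataset.
progeny <- c("child")
keep <- setdiff(ids, progeny)

# Drop the progeny rows and columns, then cluster the remaining participants.
d_adults <- d[keep, keep]
hc_adults <- hclust(as.dist(d_adults), method = "average")

print(hc_adults$labels)
print(hc_adults$merge)
print(hc_adults$height)
```

With the child removed, the father joins his cousin first and the mother attaches afterwards, which is closer to the relatedness among the adults in these toy values.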

I look forward to using SNPRelate more as I construct the genetic genealogy of the participants from Africa. Of the tools I have used, it comes closest to automatically generating a genetic genealogy tree from individuals’ SNP information. With that in mind, it would also be quite interesting to run SNPRelate on the profiles of our participants from Africa together with those of their diaspora relatives.