Population discontinuities in a data-set produce strange results in
model-based algorithms like ADMIXTURE. For example, a global data-set
that samples mostly discontinuous populations gives an unreliable
picture of global human variation. This ties in with the observation
in Witherspoon 2007:
Thus the answer to the question “How often is a pair of
individuals from one population genetically more dissimilar than two
individuals chosen from two different populations?” depends on the
number of polymorphisms used to define that dissimilarity and the
populations being compared. The answer, ω, can be read from Figure 2. Given 10 loci, three distinct populations, and the full spectrum of polymorphisms (Figure 2E), the answer is 0.3, or nearly one-third of the time. With 100 loci, the answer is ∼20% of the time and even using 1000 loci,
10%. However, if genetic similarity is measured over many thousands of
loci, the answer becomes “never” when individuals are sampled from
geographically separated populations.
On the other
hand, if the entire world population were analyzed, the inclusion of
many closely related and admixed populations would increase ω. This is illustrated by the fact that ω and the classification error rates, CC and CT, all remain greater than zero when such populations are analyzed, despite the use of >10,000 polymorphisms (Table 1, microarray data set; Figure 2D). In a similar vein, Romualdi et al. (2002) and Serre and Pääbo (2004)
have suggested that highly accurate classification of individuals from
continuously sampled (and therefore closely related) populations may be
impossible. However, those studies lacked the statistical power required
to answer that question (see Rosenberg et al. 2005).
Below is an attempt to outline a simple but effective method to gauge how continuously
the populations in a global data-set are sampled.
The first step is to perform a K=2
ADMIXTURE run on the data-set. This produces two cluster proportions
for each individual. Next, group the individuals by the population
they come from and find the mean or median of each of the two cluster
proportions per population. Then sort the populations by their mean or
median cluster proportion, from highest to lowest or lowest to highest.
Next, take the difference in cluster proportion between each pair of
consecutive populations in the sorted list, for just one cluster
(either of the two will do, since at K=2 the two proportions sum to 1);
call this the clinal differential. Finally, find the mean and standard
deviation of the clinal differential values across the data-set.
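The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact procedure used for the results in this post: the function names, the population labels and the proportions are all made up, the input is assumed to be (population, cluster-1 proportion) pairs already extracted from the K=2 ADMIXTURE output, and the sample standard deviation is assumed since the text does not say which variant is meant.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def clinal_differentials(samples, summary=median):
    """Return the clinal differentials for one K=2 cluster.

    samples: iterable of (population_label, cluster_proportion) pairs,
             one pair per individual (hypothetical input format).
    summary: per-population summary function, e.g. median or mean.
    """
    by_pop = defaultdict(list)
    for pop, proportion in samples:
        by_pop[pop].append(proportion)
    # One summary value per population, sorted from lowest to highest.
    centres = sorted(summary(values) for values in by_pop.values())
    # Difference between each pair of consecutively sorted populations.
    return [b - a for a, b in zip(centres, centres[1:])]


def summarise(diffs):
    """Mean and sample standard deviation (an assumption) of the differentials."""
    return mean(diffs), stdev(diffs)


# Made-up individuals from four hypothetical populations.
samples = [
    ("PopA", 0.98), ("PopA", 0.96),
    ("PopB", 0.70), ("PopB", 0.74),
    ("PopC", 0.40), ("PopC", 0.44),
    ("PopD", 0.05), ("PopD", 0.03),
]

diffs = clinal_differentials(samples)
diff_mean, diff_sd = summarise(diffs)
print(diffs, diff_mean, diff_sd)
```

One thing worth keeping in mind when reading the output: because the populations are sorted, the differences telescope, so the mean differential always equals (highest value minus lowest value) divided by (number of populations minus 1). The standard deviation is the part that reflects how evenly the populations are spread along the cline.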
Repeat this process for data-sets with different population
compositions and compare the means and standard deviations of
the clinal differential. The data-set with the lowest mean (in absolute value) and SD will
tend to be the most continuously sampled.
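As a rough sketch of that comparison, the snippet below computes the two statistics for two invented data-sets and ranks them. The data-set names and per-population values are hypothetical, and ordering by SD first and mean second is an assumption, since the text does not say how to weigh the two numbers against each other.

```python
from statistics import mean, stdev


def differential_stats(pop_values):
    """Mean and sample SD of differences between consecutively sorted values."""
    ordered = sorted(pop_values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return mean(diffs), stdev(diffs)


# Hypothetical per-population median proportions for one K=2 cluster.
datasets = {
    "evenly_sampled": [0.05, 0.25, 0.45, 0.65, 0.85, 0.95],
    "gappy": [0.02, 0.04, 0.05, 0.95, 0.97],
}

stats = {name: differential_stats(values) for name, values in datasets.items()}

# Assumed ordering: lowest SD first, ties broken by lowest mean.
ranked = sorted(stats, key=lambda name: (stats[name][1], stats[name][0]))

for name in ranked:
    m, sd = stats[name]
    print(f"{name}: mean={m:.4f} sd={sd:.4f}")
```

In this toy case the "gappy" set, which has a large jump between its third and fourth populations, ends up with both the higher mean and the much higher SD.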
As a practical example, I performed the
above steps on 6 different global data-sets:
- Global_V2 : This is the same data-set I used in this post.
- Global_V2b: This is the same as Global_V2, with the exception that all populations from the Americas and Oceania were removed.
- Global_V2c: This is the same as Global_V2, with the exception that all populations from the Americas, Oceania and South Asia were removed.
- Global_V2d: This is a subset of Global_V2 that only includes West Africans, Europeans and East Asians.
- Global_V2e: This is a subset of Global_V2 that only includes samples from Behar 2010.
- Global_V2f: This is a subset of Global_V2 that not only includes samples from Behar 2010 but all the remaining African samples of Global_V2 as well.
The Median K2 ADMIXTURE proportions for
each data-set can be found here.
Below are the means and standard deviations of the clinal
differentials for each data-set:
I do not understand what you mean to demonstrate here. It seems obvious that clinality is broken if you remove all populations between West and East Eurasians (V2d). Otherwise I'm amiss.
"seems obvious that clinality is broken if you remove all populations between West and East Eurasians (V2d)"
Right, this is just a simple way to measure how broken or unbroken it is, technically you can apply it to any data-set...