Population discontinuities in a data-set produce strange results in model-based algorithms like ADMIXTURE. For example, a global data-set that samples mostly discontinuous populations gives an unreliable picture of global human variation. This ties in with the observation of Witherspoon et al. (2007):
Thus the answer to the question “How often is a pair of individuals from one population genetically more dissimilar than two individuals chosen from two different populations?” depends on the number of polymorphisms used to define that dissimilarity and the populations being compared. The answer, ω, can be read from Figure 2. Given 10 loci, three distinct populations, and the full spectrum of polymorphisms (Figure 2E), the answer is 0.3, or nearly one-third of the time. With 100 loci, the answer is ∼20% of the time and even using 1000 loci, 10%. However, if genetic similarity is measured over many thousands of loci, the answer becomes “never” when individuals are sampled from geographically separated populations.
On the other hand, if the entire world population were analyzed, the inclusion of many closely related and admixed populations would increase ω. This is illustrated by the fact that ω and the classification error rates, CC and CT, all remain greater than zero when such populations are analyzed, despite the use of >10,000 polymorphisms (Table 1, microarray data set; Figure 2D). In a similar vein, Romualdi et al. (2002) and Serre and Pääbo (2004) have suggested that highly accurate classification of individuals from continuously sampled (and therefore closely related) populations may be impossible. However, those studies lacked the statistical power required to answer that question (see Rosenberg et al. 2005).
Below is an attempt to outline a simple but effective method to check how continuously the populations in a global data-set are sampled.
The first step is to perform a K=2 ADMIXTURE run on the data-set. This produces 2 cluster proportions for each individual in the data-set. Next, group the individuals by the population they come from and find the mean or median of each of the 2 cluster proportions per population. Next, sort the populations by their mean or median cluster proportion, from highest to lowest or lowest to highest. Next, for just one cluster (either of the two will do, since the two proportions sum to 1), find the difference in cluster proportion between each pair of consecutive populations in the sorted list; call this the clinal differential. Finally, find the mean and standard deviation of the clinal differential values across the data-set.
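The steps above can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used for the results in this post: the function name `clinal_differentials`, the population labels, and the proportions are all made up, and the input is assumed to be a list of (population, cluster-1 proportion) pairs already extracted from the K=2 run.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def clinal_differentials(individuals, summary=median):
    """Return (sorted population names, clinal differentials).

    individuals: iterable of (population, cluster1_proportion) pairs
    summary: function used to summarise each population (median or mean)
    """
    by_pop = defaultdict(list)
    for pop, q1 in individuals:
        by_pop[pop].append(q1)

    # One summary value per population, sorted from highest to lowest.
    ranked = sorted(((summary(v), p) for p, v in by_pop.items()), reverse=True)
    values = [v for v, _ in ranked]

    # Difference between each pair of consecutive populations in that order.
    diffs = [a - b for a, b in zip(values, values[1:])]
    return [p for _, p in ranked], diffs


# Hypothetical toy data: (population, proportion in cluster 1).
samples = [
    ("PopA", 0.98), ("PopA", 0.96), ("PopA", 1.00),
    ("PopB", 0.70), ("PopB", 0.74),
    ("PopC", 0.40), ("PopC", 0.44), ("PopC", 0.42),
    ("PopD", 0.02), ("PopD", 0.06),
]

order, diffs = clinal_differentials(samples)
print("sorted populations:", order)
print("clinal differentials:", [round(d, 3) for d in diffs])
print("mean:", round(mean(diffs), 3), "SD:", round(stdev(diffs), 3))
```

Passing `summary=mean` switches from medians to means. Because the populations are sorted from highest to lowest, every differential is non-negative, so no absolute value is needed here.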
Repeat this process for data-sets with different population compositions and compare the means and standard deviations of the clinal differential. The data-set with the smallest mean (in absolute value) and SD will tend to be the most continuously sampled.
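As a rough sketch of that comparison, the snippet below takes per-population median proportions for two data-sets and ranks them by the mean and then the SD of their clinal differentials. The data-set names and values are invented for illustration, `differential_stats` is a hypothetical helper, and sorting by mean first and SD second is just one possible way to combine the two numbers.

```python
from statistics import mean, stdev


def differential_stats(pop_values):
    """Mean and SD of the differences between consecutive sorted values.

    pop_values: one mean/median cluster proportion per population.
    Needs at least three populations, since stdev requires two differences.
    """
    ordered = sorted(pop_values, reverse=True)
    diffs = [a - b for a, b in zip(ordered, ordered[1:])]
    return mean(diffs), stdev(diffs)


# Hypothetical per-population medians for one cluster of a K=2 run.
datasets = {
    "dense": [1.00, 0.85, 0.70, 0.55, 0.40, 0.25, 0.10],
    "sparse": [1.00, 0.95, 0.50, 0.05],
}

results = {name: differential_stats(vals) for name, vals in datasets.items()}

# Smallest mean first, ties broken by SD.
for name, (m, sd) in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: mean={m:.3f} SD={sd:.3f}")
```

One thing worth keeping in mind when reading the numbers: the sorted differentials sum to the gap between the highest and lowest population, so the mean is simply that gap divided by the number of populations minus one. The SD is the part that reflects how evenly the populations are spread along the cline.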
As a practical example, I performed the above steps on 6 different global data-sets:
- Global_V2 : This is the same data-set I used in this post.
- Global_V2b: This is the same as Global_V2, with the exception that all populations from the Americas and Oceania were removed.
- Global_V2c: This is the same as Global_V2, with the exception that all populations from the Americas, Oceania and South Asia were removed.
- Global_V2d: This is a subset of Global_V2 that only includes West Africans, Europeans and East Asians.
- Global_V2e: This is a subset of Global_V2 that only includes samples from Behar 2010.
- Global_V2f: This is a subset of Global_V2 that not only includes samples from Behar 2010 but all the remaining African samples of Global_V2 as well.
The Median K2 ADMIXTURE proportions for each data-set can be found here.
Below are the means and standard deviations of the clinal differentials for each data-set: