Tuesday, February 28, 2012

Intra African Genome-Wide Analysis

The primary purpose of studying haplogroups (NRY and mtDNA) is to describe population movements, AKA phylogeography. Autosomal DNA, on the other hand, gives a rather ambiguous indication of a population's paternal and maternal history, since the chromosomes involved undergo genetic recombination and cannot be traced back to a single common ancestor. Still, there is a drawback in using only NRY or mtDNA to study the history of a given population: each constitutes just a single locus, with a reduced effective population size relative to the autosomes.

To this end, I have utilised publicly available genome-wide SNP data to get further insight into the population structure of Africa, which may not be fully understood from the uniparental marker data alone. Perhaps the best published work with respect to African autosomal genome-wide data is Tishkoff (2009). This important paper found 14 ancestral clusters in the African continent using the most diverse African dataset to date; however, it used autosomal microsatellites and a handful of SNPs.

On a publicly available dataset, I carried out two of the most popular approaches for investigating population structure using autosomal genome-wide data: (1) the non-parametric approach known as Principal Components or Multi Dimensional Scaling (MDS), which starts from a matrix whose elements quantify the genetic similarity between pairs of individuals and then performs a principal component analysis on that matrix, and (2) an explicit model-based population structure analysis using the software ADMIXTURE, where each individual's genome is modelled as a mixture of K ancestral populations and where the ancestry proportions and allele frequencies are estimated by maximum likelihood.
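
To make approach (1) a little more concrete, here is a minimal Python sketch of classical MDS on a pairwise distance matrix. It is not the code PLINK runs; the function names (ibs_distance, classical_mds) are my own, genotypes are assumed to be coded as 0/1/2 allele counts with no missing calls, and the distance is assumed to be one minus the identity-by-state similarity.

```python
import numpy as np

def ibs_distance(genotypes):
    """Pairwise distance = 1 - IBS similarity, for genotypes coded 0/1/2 (no missing data)."""
    g = np.asarray(genotypes, dtype=float)
    # Mean absolute difference in allele counts per SNP, scaled to the range [0, 1].
    return np.abs(g[:, None, :] - g[None, :, :]).mean(axis=2) / 2.0

def classical_mds(dist, n_components=3):
    """Classical (metric) MDS: coordinates from a symmetric pairwise distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product (Gram) matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # Keep the eigenvectors with the largest eigenvalues, scaled by sqrt(eigenvalue).
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[order], 0.0, None)  # guard against tiny negative eigenvalues
    return vecs[:, order] * np.sqrt(vals)
```

Calling classical_mds(ibs_distance(genotypes), 3) would give one row of (C1, C2, C3) coordinates per individual. The sign of each component is arbitrary, so only relative positions are meaningful.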

A superset of the data I used can be downloaded from here: http://dl.dropbox.com/u/23271596/ref.zip
The global dataset, compiled by this blog author, contains publicly available data from 3970 individuals from around the world typed for 27,022 autosomal SNPs, which are spread (though not uniformly) across all 22 autosomes. I then utilized PLINK to perform the following on the above dataset:
  1. Removed all Non-Continental African populations.
  2. Removed 18 Tunisians from Henn (2011) as previous analysis had shown independent cluster formation by this group, perhaps a sign of inbreeding.
  3. Removed 15 Moroccan Jews that came from Behar (2010) for the same reason as above.
  4. Kept SNPs with a genotyping success rate above 99.46%.
  5. Excluded SNPs in linkage disequilibrium (r² > 0.5) with nearby markers in a window of 50 SNPs (advanced by 5 SNPs).
  6. Added a handful of private African samples that took their genetic test with the personal genomics company 23andMe. (Unfortunately, I cannot publish their results in this post.)
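
As a rough illustration of steps 4 and 5, here is a small Python sketch of a call-rate filter followed by windowed LD pruning. This is only a simplified stand-in for what PLINK's --geno and --indep-pairwise options do, not a reproduction of my actual commands. The function names are hypothetical, the SNPs are assumed to be ordered along a single chromosome, genotypes are 0/1/2 allele counts with None for a no-call, and when two SNPs are in high LD the sketch always drops the later one.

```python
from statistics import mean

MISSING = None  # hypothetical marker for a no-call genotype

def call_rate(snp):
    """Fraction of individuals with a non-missing genotype at one SNP."""
    return sum(g is not MISSING for g in snp) / len(snp)

def r_squared(a, b):
    """Squared Pearson correlation of allele counts over individuals typed at both SNPs."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if len(pairs) < 2:
        return 0.0
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def filter_and_prune(snps, min_call_rate=0.9946, window=50, step=5, max_r2=0.5):
    """Return the indices of SNPs kept after the call-rate filter and LD pruning."""
    kept = [i for i, snp in enumerate(snps) if call_rate(snp) > min_call_rate]
    removed = set()
    # Slide a window of `window` SNPs along the chromosome, `step` SNPs at a time.
    for start in range(0, len(kept), step):
        win = [i for i in kept[start:start + window] if i not in removed]
        for pos, i in enumerate(win):
            if i in removed:
                continue
            for j in win[pos + 1:]:
                if j not in removed and r_squared(snps[i], snps[j]) > max_r2:
                    removed.add(j)  # simplification: always drop the later SNP
    return [i for i in kept if i not in removed]
```

Here snps is a list with one entry per SNP, each entry being the list of genotypes across individuals.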

The above procedures left me with a core (public) dataset of 1,065 individuals from Africa and 26,129 SNPs for analysis. The complete SNPs typed for these individuals can be retrieved from: Behar (2010), HapMap III, Henn (2011), HGDP and Xing (2010).
Geographically, 362 were from East Africa, 304 from West Africa, 158 from North Africa, 142 from Central Africa and 99 from South Africa. Linguistically, the dataset contained 536 Niger-Kordofanian speakers, 212 Nilo-Saharans, 211 Afroasiatic speakers, 89 Khoisans and 17 Hadza.

Update: Reference Populations and Key:

MDS Analysis
The data for the MDS analysis was generated using PLINK, while the plots were generated using GNU Octave. A 3-dimensional MDS plot for the dataset can be seen below; all populations are labelled according to their median coordinates.
Here, we can see that the first component, C1, separates East and North Africans from West/Central/South Africans, while the second component, C2, separates the divergent hunter-gatherers (San, !Kung, Pygmies and Hadza) from the rest. This may be clearer on the two-dimensional C1 vs C2 plot below:
The third component, C3, separates East Africans from all the rest, as seen more clearly on the C1 vs C3 plot below:
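
For anyone wanting to reproduce the labelling, the median coordinates of a population can be computed along these lines. This is a Python sketch rather than the Octave code used for the plots; the function name and the input layout (one population label and one coordinate tuple per individual) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

def median_coordinates(labels, coords):
    """Map each population label to the per-dimension median of its members' coordinates."""
    groups = defaultdict(list)
    for label, point in zip(labels, coords):
        groups[label].append(point)
    # zip(*points) regroups the points by dimension (C1, C2, C3, ...).
    return {
        label: tuple(median(dim) for dim in zip(*points))
        for label, points in groups.items()
    }
```

The median is taken separately in each dimension, so a population's label position need not coincide with any single individual.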

Model Based Analysis
The model-based analysis was carried out for K=10 using ADMIXTURE, so 10 clusters were generated from the dataset. I took the liberty of naming these clusters, some on a geographic basis, others on a linguistic basis and still others on a subsistence basis. There is obviously a lot of fluidity in naming a cluster, so the names shouldn't be taken as something written in stone.
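
For readers curious about what the model is fitting, the sketch below writes out the binomial log-likelihood that the admixture model maximises over the ancestry proportions and the cluster allele frequencies. It only evaluates the likelihood for given values and does not perform the optimisation that ADMIXTURE does; the function name and the plain nested-list inputs are my own choices for illustration.

```python
from math import log

def admixture_log_likelihood(genotypes, q, f):
    """Binomial log-likelihood of genotypes under the admixture model.

    genotypes[i][j]: copies (0, 1 or 2) of the counted allele for individual i at SNP j
    q[i][k]: fraction of individual i's ancestry assigned to cluster k (each row sums to 1)
    f[k][j]: frequency of the counted allele at SNP j in cluster k
    """
    total = 0.0
    for g_row, q_row in zip(genotypes, q):
        for j, g in enumerate(g_row):
            # Expected allele frequency for this individual: ancestry-weighted mix of clusters.
            p = sum(q_ik * f_k[j] for q_ik, f_k in zip(q_row, f))
            p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
            total += g * log(p) + (2 - g) * log(1 - p)
    return total
```

With K=10, each individual has a row of 10 proportions in q, and it is the per-population medians of those rows that are reported further down.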

A PCA plot for the FST distances generated by ADMIXTURE for the 10 clusters can be seen below,

The extreme positioning of the 'Hadza' cluster is indeed striking, followed by the 'KhoiSan' and 'Pygmy' clusters. The 'West African', 'West-Central African' and 'Eastern Bantu' clusters are quite close to each other, as can be expected. The divergence of the North African cluster from East Africa can be explained by the significant extra-African admixture North Africans have, as evidenced by the amount of their direct maternal ancestry coming from Europe and the Near East, while a majority of their paternal ancestry comes from East Africa (namely, E1b1b).

Below are the Median proportions for the 10 clusters generated by ADMIXTURE for the 45 uniquely entered African populations categorised according to their 5 respective regions.

The unclear abbreviations above, EtA, EtO and EtT, stand respectively for Ethiopian Amharas, Ethiopian Oromos and Ethiopian Tigrayans. These samples (as well as the Ethiopian Jews, AKA Beta Israel) come from Behar (2010); in addition, the EtO samples purportedly come from the southernmost tip of Ethiopia, close to the Kenyan border. The dominance of the North African cluster in Ethiopians is not much of a surprise, as it is well known that Ethiopia is a genetic conduit between East and North Africa.

Here, both the Mbuti and Biaka Pygmies form completely independent clusters, which is not unexpected as they are some of the most divergent populations even on a global basis. Also of note are the slight 'North African' affinity of the Hema and the 'West African' affinity of the Bulala and Mada.

Many of the non-Khoisan South African populations in the Dataset show affinities to both the 'Eastern Bantu' and 'Central-West African' clusters in almost equal proportions, which is interesting.

As seen in the PCA plots of the FST distances, the 'Central-West African', 'West African' and 'Eastern Bantu' clusters are close. The Dogon population, however, shows the least amount of the 'Central-West African' cluster and is almost completely dominated by the 'West African' cluster, while the reverse is true for the Igbo and Yoruba. Similarly, the Fulani show almost none of the 'Central-West African' cluster and are mostly dominated by the 'West African' cluster, the difference from the Dogon being that the Fulani have a significant affinity with the 'North African' cluster rather than the 'Central-West African' one.

In the last graphic above, we can see a geographic affinity of North West Africans with West African based clusters and of North East Africans with East African dominant clusters, as is to be expected. As stated before, however, the 'North African' cluster itself is likely a synthesis, both ancient and recent, of East African, European and Near Eastern affinities.

I learned quite a bit on the population structure of Africa from this exercise but there is a lot more room left for improvement:
  1. The SNPs typed on almost all genotyping arrays are Eurasian-biased, as they were first found in Europeans. As time goes on, more African-specific SNPs will be discovered, and their use in genome-wide analysis will change these results.
  2. More samples are needed, especially from both South and North Sudan, all along the Sahel belt, Tuaregs, different Omotic speakers from Ethiopia, populations from Mozambique and the South Eastern coast of Africa, as well as the South Western coast (Angola) and many many more. The inclusion of these samples will have an impact on these results.
  3. More dense SNPs (~200k) may also give slightly different results, although Sikora (2010) notes the following: “We can conclude that the common set of 2841 SNPs genotyped is an appropriate tool to study population structure in African populations; in general, world-wide patterns are evident and robust when using a minimum of 1000 SNPs.”
  4. Newer and more computer-intensive methods for bridging the gap between model-based and distance-based autosomal analysis have recently been published. It would be interesting to carry out an analysis of this dataset with these newer methods.


  1. This is a very interesting preliminary exercise, thanks. I say "preliminary" because it's impossible to capture the full detail in an all-Africa comparison. I hope that regional analyses follow.

    Specially I think that specific regional analysis for East and Southern Africa should be interesting on their own right (West Africans appear to cluster tightly in the PCA plots, but they surely hide some structure as well).

    "More samples are needed, especially from both South and North Sudan, all along the Sahel belt, Tuaregs, different Omotic speakers from Ethiopia, populations from Mozambique and the South Eastern coast of Africa, as well as the South Western coast (Angola) and many many more".

    Sure. I'm very intrigued about Mozambicans and their apparent distinct relation with the Twa, accidentally found in Patin 2009. It may just be an "Eastern Bantu" thing but the Chagga of Tanzania and all the other Bantu did not show that component (but at residual levels). I don't see anything like that in this analysis but maybe a regional analysis could find some such specificities.

    1. Hi Maju, yes thanks for the idea of a regional African Analysis, I plan on doing that not only on a regional basis, but also on a linguistic basis as time permits.
      I would really like to add a Southern Sudan Nilotic dataset of at least 20-30 people, as that is very important for East African analysis. As you know, Nilotes and Afrasans have been in East Africa since time immemorial. There is also an important Niger-Kordofanian component (Bantu) in East Africa, but it is already relatively well represented by the Luhya and Kenyan Bantu samples. I am relatively confident that the addition of a Southern Sudanese dataset would create its own distinct cluster, although at which cluster's expense I am not sure.
      With respect to Mozambique, Sikora (2010), which I cited in the post, also showed a unique cluster among their Mozambique samples, the text is free and you can check it out when you get a chance.

    2. I was oblivious to the Sikora paper, thanks for the mention. In a sense, with such a huge Mozambican sample, I'm not surprised that they clustered apart, but still it is consistent with the Patin paper and is something to be explored.

      "I would like to really add a Southern Sudan Nilotic dataset of at least 20-30 people, as that is very important for East African analysis".

      Indeed that would be nice but I can't help you with that. :/

    3. Numbers, i.e. quantity of samples, may indeed have an influence on cluster formation, but I'm not so sure about the degree of influence. For instance, take a look at the number of Maasai, Yoruba and Luhya samples in the above run. So it may not just be the total number of Mozambicans that is influencing the creation of an independent cluster; there could be something in their population history that we may not be clear about. I have updated this post with the locations of the reference samples for this dataset, by the way.

  2. Nice work. I know this is somewhat of a tall order, but is there any "easy" way to make distribution maps of these clusters? With many of the samples that exist it would be so interesting to see something like a heat map overlaid onto the sample plot points in a map of the continent.

    I have been wanting to do this with that Tishkoff data for a long time, as well as make some updated uniparental maps, but I just don't have the time.

    1. Greetings Beyoku, I know of software that will do that, from Golden Software: http://www.goldensoftware.com/products/mapviewer/mapviewer.shtml, they have many different products but MapViewer7 seems the best for what you are looking for ("Efficient Solution for Visually Displaying Spatial Data"). The only problem is that it's not free: 250 bucks. There may however be some free stuff in Ubuntu's software packages; I'll look for some and make a post about it if I find anything.

    2. I think someone recommended me GNU Octave. I downloaded it (for free) but I'm unable to use it meaningfully. (Note it may be something else: all these software issues are a bit of a headache to me).

    3. Yea, it was me that recommended GNU octave, it is a powerful computational software similar to MATLAB but with the difference that it is free and open source, it probably can do some numerical interpolation based on geo spatial positions/coordinates, but I would have to program something, I am sure there is something easier and already customized for this particular task out there.

  3. Interesting analysis. I would just want to add that it is generally advised to remove samples with PI_HAT > 0.15 (essentially first and second cousins - the Maasai and Luhya contain many of them). If this isn't done it may lead to clusters which are just a product of recent inbreeding and not ancestral divergence.

    1. I did use the pi-hat, but it took a while for my computer to process it so I abandoned it, I will try it again for the next run.

    2. If you do start filtering the samples I would say use a 0.2 pi-hat removal threshold for agriculturalist groups and a 0.4 pi-hat threshold for khoisan & pygmy groups, since there are fewer hunter-gatherer samples or otherwise you'll end up with too few of them. For example, the majority of the Hadza group has very high pi-hat scores with each other, there are even siblings in that group.

    3. Well, as far as the Divergence of the Hadza, this run only just confirmed what Tishkoff (2009) found amongst the Hadza, and her find was even more extreme as they were divergent even on a GLOBAL level.

      “PC 3 (3.5%) distinguishes the Hadza hunter-gatherers from others”.

      “Finally, the Hadza are the sole constituents of a sixth cluster (yellow) consistent with their distinctive genetic structure identified with PCA and STRUCTURE.”

      “These results suggest the possibility that the SAK, Hadza, Sandawe, and Pygmy populations are remnants of an historically more widespread proto-Khoesan- Pygmy population of hunter-gatherers.”

      So I don't think the behavior of the Hadza in this run or any other similar runs can be easily just explained away by just 'relatedness'.

    4. I don't doubt their distinctiveness. However, if two siblings or a parent-child pair are within a particular group it can easily 'spark' a high Fst cluster. Without this the Fst might be slightly lower.

  4. This source takes the position that the Mada people speak a Niger-Congo language rather than an Afro-Asiatic language, which would resolve one of the two outliers associated with the fact that the Hausa and Mada are the only two Afro-Asiatic-speaking populations lacking a significant North African component.

    1. Ironically the Mada have extremely high R1b frequencies.

    2. Actually, that is incorrect, the source you provided says the Language of the Mada is part of the Chadic branch of Afroasiatic : http://www.ethnologue.com/show_lang_family.asp?code=mxu

      With respect to the Mada lacking in the 'North African' cluster that is true, it is however to be noted that they carry a significant amount of the 'East-African2' cluster, in fact, they are the Westernmost population to carry that particular cluster in appreciable amounts. Thus the 'East-Africa2' cluster can be used as a proxy for Afroasiatic just as much as the 'North African' cluster, a case can actually be made that the North East African axis seen in C1 Vs. C3, that is the diagonal Axis that the EtO, EtA, EtT samples lie on and leading all the way up-to Egypt, is the 'nucleus' for Afroasiatic.

  5. What I find the most interesting of this run is that the Hema carry the 'North African' component while their close neighbors the Alur and Mbuti completely lack it. The Alur and Hema apparently even speak the same language (Lendu). This probably means that the Hema are not ethnically Nilotic, but mainly Bantu with Cushitic (likely similar to the Tutsi). I'm kinda surprised that Cushitic admixture traveled that far.

    1. Why surprised? If you believe that 'Cushitic' influence reached as far as the Tutsi who live in Rwanda (which I'm not even sure is really Cushitic or some very ancient North-East African), the Hema are found even further north than Rwanda and Burundi. The Maasai in Tanzania are even further south, and we also know that E-M293 traveled as far south as Southern Africa.

    2. I just find it fascinating, that's all. I wonder how old it is in most of these populations, this should be an interesting subject for research.

  6. Ethnologue, which is usually a very trustworthy source and regularly updated, classifies the Mada as Afroasiatic-speaking, and they also show up in the midst of Afroasiatic-speaking peoples, which in principle would be consistent with their R1b and East African relatedness (which I imagine expanding from Sudan with Chadic speakers).

  7. Weird on the Mada point. The first page said "Their native language, also called Mada, belongs to the Niger-Congo language family.", but I agree with Maju that Ethnologue is more trustworthy and a high R1b frequency is a very strong predictor of Chadic linguistic affiliation in Africa. The fact that Hausa and Mada show a similar pattern makes lots of sense in this context.

  8. A more general comment that arose as I considered your post here:

    It isn't really clear to me that it makes sense to do this analysis at K=10 rather than K=8 in ADMIXTURE. Presumably, at K=8, you would end up collapsing East Bantu, Central West Africa and West Africa into a single ancestral population, since the Fst distances between those populations are negligible relative to the other populations. This would make the other patterns in the data stand out better, and my intuition is that the extra two ancestral populations may be carrying a lot of noise relative to the signal they are producing.

    Maju and I have also discussed the possibility that the extreme outlier status of the Hadza is a product of extreme inbreeding rather than genuine genetic distance. Where would a typical Hadza individual show up on the PC charts if the rest of them were removed from the sample?

    Another analysis that would be interesting would be to compare admixture rates from the Admixture frappe charts and compare them to the inferred admixtures you would see from uniparentals. Indeed, I'd find it quite interesting to see side by side uniparental and autosomal data.

    Most of the Pygmy component in non-Pygmy populations looks like just minor components of an expanding Bantu mix. But, it seems like there is more at work than that in the Alur, which makes that population's history an interesting puzzle. Could there have been a third pygmy population in pre-historic Africa which was absorbed into the Alur? Is this just an exaggerated founder effect legacy of Bantu expansion? Or what?

  9. Andrew, you wrote a lot of interesting things both here and at your blog that I cannot address all at the moment, but I will try to as time permits. One thing that caught my immediate attention, however, was this statement that you made: "The lower level of the North African component in Oromo speakers in Ethiopia relative to other Ethiopians is consistent with the notion that there was a meaningful demic component to the transition from a prior language to Ethiosemitic languages in Ethiopia."

    The relevant Cushitic speakers that you want to compare Ethiosemitic speakers to before making such an assessment are those categorized as Central Cushitic speakers rather than Lowland East Cushitic Speakers. Since we do not have any genome-wide data from current Central-Cushitic speaking populations from the highlands of Ethiopia, we have to use the Ethiopian Jews or Beta Israel as a proxy since they were known to Historically speak Central Cushitic, or otherwise known as Agew languages, and as you can see, there is very little difference in cluster proportions between the Ethiopian Jews and the other Highland Ethiopians (EtT and EtA).

    Another point re: Hadza and haplogroups is that Henn (2011), from which the samples for this analysis came, did report the haplogroups of the Hadza samples as follows:
    L0a2*: 6% (1)
    L3h: 11% (2)
    L4g: 56% (10)
    L2a: 22% (4)
    L3b: 6% (1)

    E1b1b1: 10% (1)
    B2b: 10% (1)b
    B2b4*: 50% (5)
    E1b1a7a3a: 30% (3)

    For The Sandawe:
    L0a2*: 20% (6)
    L3x1: 13% (4)
    L4g: 37% (11)
    L2a: 10% (3)
    L3e3: 17% (5)

    A3b2*: 12% (2)
    B2b4*: 29% (5)b
    E1b1b1: 18% (3)
    E2b1: 6% (1)
    E1b1a7a3a: 24% (4)
    E1b1a8a: 12% (2)

    And for the SAN
    L0d1a: 43% (14)
    L0d1b: 50% (16)
    L0a’b’f*: 7% (2)

    A3b: 26% (5)
    A3b1: 32% (6)
    B2b4*: 5% (1)b
    E2b1: 5% (1)
    E1b1a7a3a: 10% (2)
    E1b1a8a: 10% (2)
    R1b1b2a1a: 10% (2)

    In addition, since the Sandawe don't show any exogenous (relative to Africa that is) Paternal or Maternal lineages, but at the same time show ~12.3% of the 'North African' cluster, that is part of the reason why I think that the North African cluster is a composite of European, Near Eastern and East African elements, with the composition proportions changing with geography, i.e. more of the indigenous elements in East Africa, and less so in North Africa.

    1. 'that is part of the reason why I think that the North African cluster is a composite of European, Near Eastern and East African elements, with the composition proportions changing with geography, i.e. more of the indigenous elements in East Africa, and less so in North Africa.'

      I would agree with this last bit. IMO, it looks like the E1b1b1 found in these groups was spread by proto-Cushites (or possibly Omotic people), which explains their Northern affinities. E-M293 is not that old and fits the scenario of South Cushitic expansions. Also, L3x1 in the Sandawe looks like a Horn African lineage. The Sandawe to me seem mostly like a mixture between Bantus, South Cushites, and Paleo-Africans. I would say the Hadza are similar but with a higher Paleo-African component.

    2. The North African component is the West Eurasian component: it shows up that way also in my and Henn's North African comparisons at low K levels. At deeper Ks other more local components shade this but, with the partial exceptions of the Ethiopian and Fulani components (which look like old Eurasian-African mixtures, being similarly Fst-distant from Tropical Africans and West Eurasians) and some small exotic "OoA-remnant" components (very Fst-distant from all), the North African local components are much closer (by Fst distances) to West Eurasians than to Tropical Africans.

      So the North African component should be considered the West Eurasian component in Africa without hesitation (and any European or West Asian control will show up as nearly 100% within that component in an analysis like this one).

    3. "So the North African component should be considered the West Eurasian component in Africa without hesitation"

      Actually Maju, this is what Henn (2011) said regarding the cluster that was found in the Tuscans, proxy for West Eurasians, in relation to some East African populations:

      “At k = 4, we see a western African/Bantu-speaking cluster, an eastern African cluster, a cluster representing Europeans that likely also signifies ancestral variation maintained in eastern Africa (e.g., Maasai and Sandawe populations), and finally, a cluster that links all our HG populations.”

      So there is no need to consider the 'North African' cluster as a “West Eurasian” component 'without any hesitation', as one should hesitate to contemplate on the fact that West Eurasian genetics is in essence a subset of that of Africa's, as well as the uniparental evidence of migrations out of Africa post-OOA.

      The 'North-Africa' cluster likely has a gradient of indigenousness to Africa, decreasing in the degree of its indigenousness, both temporally and spatially, in a direction going from East to North-Africa.

    4. The reason for my claim is that all those components are within normal Fst distances of West Eurasian clusters with each other (very loosely around 0.1), while their Fst distance to Tropical African is normally near 0.2.

      In the table I published in my exercise, the distinctly local Ethiopian and Fulani components already show up and have intermediate distances but, for example, before K=9 Ethiopians appeared as a mix of Arabian and Mandenka components (i.e. they appear as like 60% Arab-like, 30% Mandenka-like and 10% other). This is in agreement with the K=11 Ethiopian-specific component, once it shows up, showing Fst=0.101 for Ethiopian/Arab components but Fst=0.115 with the Mandenka one.

      Even if you don't wish to interpret it as demic/genetic backflow (what I think is very hard to question, specially for North Africa), it clearly indicates a most intense (West) Eurasian affinity.

      Do you know how this disagreement can be tested? By introducing an East Eurasian control population (like CHB or Papuans or whatever). If the affinity is only or mostly because of the East African ultimate ancestry of Eurasians in general (which I understand is what you are proposing), then West Eurasians and East Asians should be more or less similarly close to North Africans or, especially, to these East African groups like Ethiopians or Maasai.

      But by now I have a good feel for how the various components and Fst distances behave in North African (and to a lesser extent Sahelian) populations, so my prediction is what I stated above: that the component indicates genetic back-flow from Eurasia, which can be old in most cases but not older than the Aurignacoid cultures of c. 50 Ka ago (the Out of Africa migration giving origin to the large but less diverse "Eurasian" macropopulation should be at least 30 or 40 Ka older than that).

  10. Yea Eze, cosign with what you said for the most part. I'm just not sure about the possible dates of the split between Omotic and Cushitic; it could have happened >10 KYA if I recall. Do you have any info on that?

    Regarding K=8, I just ran it on the exact same dataset out of curiosity about Andrew's inquiry. In essence, the 'East-Africa 1' and 'West-Central Africa' clusters disappeared; the PCA FST plot looks pretty much like the K=10 run except that those 2 clusters are gone.

    PCA Plot for the FST Distances @ K=8

    Here is a PDF file with the Median proportions for all the pops

    1. According to this source Omotic and Cushitic roughly split around 8 kya.


      So there probably existed a Cush-Omo meta-population around 10-8 kya, a branch of this group could have spread to the Southern Rift Valley, or perhaps later around 6 kya with the South Cushites (Iraqwoids).

      By the way, what do you think of the idea that Beja and Agew are the earliest branches of Cushitic? Would this indicate that Cushitic spread from the Northwest (e.g. Northeast Sudan/Red Sea hills)? Some of the uniparental markers possibly indicate this (E-M78, E-M123, T-M70, and also many maternal lineages).

  11. I applied the Studentize >2 method to remove outliers in the All Africa dataset, which filtered out 295 individuals. When I reran the dataset, the West-Central Africa cluster disappeared and the Sandawe formed their own cluster, of which a significant amount was found in East/Horn Africans. This procedure also significantly reduced the overall stDEV of each population-to-cluster. I don't have time to detail/plot out the results, but anybody interested can take a look at the median cluster matrix and the stDEV cluster matrix below.