Ethio Helix ኢትዮ:ሒሊክስ: Intra African Genome-Wide Analysis

Tuesday, February 28, 2012

Intra African Genome-Wide Analysis

The primary purpose of studying haplogroups (NRY and mtDNA) is to describe population movements, also known as phylogeography. Autosomal DNA, on the other hand, gives a rather ambiguous indication of a population's paternal and maternal history, since the autosomes undergo genetic recombination and cannot be traced back to a single common ancestor. Still, using only NRY or mtDNA to study the history of a population has a drawback: each is inherited as a single locus, and so has a smaller effective population size than the autosomes.

To this end, I have used publicly available genome-wide SNP data to gain further insight into the population structure of Africa, which may not be fully understood from the uniparental marker data alone. Perhaps the best published work on African autosomal genome-wide data is Tishkoff (2009). This important paper found 14 ancestral clusters in the African continent using the most diverse African dataset to date; however, it used autosomal microsatellites and a handful of SNPs.

On a publicly available dataset, I applied two of the most popular approaches for investigating population structure in Africa with autosomal genome-wide data: (1) the non-parametric approach known as Principal Components or Multidimensional Scaling (MDS), which builds a matrix whose elements quantify the genetic similarity between pairs of individuals and then performs a principal component analysis on that matrix, and (2) an explicit model-based population structure analysis using the software ADMIXTURE, in which each individual's ancestry is modelled as a mixture of K discrete populations, and the ancestry proportions and allele frequencies are estimated by maximum likelihood.
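As a minimal sketch of the similarity matrix used in approach (1), assume genotypes are coded as minor-allele counts (0, 1 or 2, with None for a missing call) and that similarity is the average proportion of alleles two individuals share identical-by-state. The function name and the coding are illustrative assumptions, not necessarily the exact computation PLINK performs.

```python
def ibs_similarity(genotypes):
    """Pairwise identity-by-state similarity between individuals.

    genotypes: list of per-individual lists of allele counts (0, 1, 2),
    with None marking a missing call. Returns an n x n list of lists.
    """
    n = len(genotypes)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = 0   # alleles shared IBS, summed over SNPs
            typed = 0    # SNPs called in both individuals
            for a, b in zip(genotypes[i], genotypes[j]):
                if a is None or b is None:
                    continue
                shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared
                typed += 1
            value = shared / (2 * typed) if typed else float("nan")
            sim[i][j] = sim[j][i] = value
    return sim


example = [
    [0, 1, 2, 2],
    [0, 1, 2, 2],
    [2, 1, 0, None],
]
matrix = ibs_similarity(example)
```

One minus each entry gives a distance, which is the form an MDS routine expects.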

DATASET

A superset of the data I used can be downloaded from here: http://dl.dropbox.com/u/23271596/ref.zip

The global dataset, compiled by this blog author, contains publicly available data from 3,970 individuals from around the world typed for 27,022 autosomal SNPs, which are spread (though not uniformly) across the 22 pairs of autosomes. I then used PLINK to perform the following on this dataset:

Removed all populations from outside continental Africa.
Removed 18 Tunisians from Henn (2011), as previous analysis had shown this group forming an independent cluster, perhaps a sign of inbreeding.
Removed 15 Moroccan Jews from Behar (2010) for the same reason.
Kept SNPs with a genotyping success rate above 99.46%.
Excluded SNPs in linkage disequilibrium (r² > 0.5) with nearby markers in a window of 50 SNPs (advanced by 5 SNPs at a time).
Added a handful of private African samples genotyped by the personal genomics company 23andMe. (Unfortunately, I cannot publish their results in this post.)
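The filtering steps above map fairly directly onto standard PLINK options. The sketch below only builds the argument lists and does not run anything. The file names (ref, africa, non_african_and_outliers.txt) are hypothetical, and it assumes the data are in PLINK binary format; the exact commands used for this post are not recorded here.

```python
def plink_filter_commands(bfile="ref", out="africa",
                          remove_file="non_african_and_outliers.txt"):
    """Build PLINK argument lists for the filtering steps; nothing is executed.

    All file names are hypothetical placeholders.
    """
    # Drop the listed individuals and any SNP with more than 0.54% missing
    # calls (equivalent to a genotyping success rate above 99.46%).
    qc = ["plink", "--bfile", bfile, "--remove", remove_file,
          "--geno", "0.0054", "--make-bed", "--out", out + "_qc"]
    # LD pruning: window of 50 SNPs, step of 5 SNPs, r^2 threshold of 0.5.
    prune = ["plink", "--bfile", out + "_qc",
             "--indep-pairwise", "50", "5", "0.5", "--out", out + "_ld"]
    # Keep only the SNPs that survived pruning (listed in the .prune.in file).
    extract = ["plink", "--bfile", out + "_qc",
               "--extract", out + "_ld.prune.in",
               "--make-bed", "--out", out + "_pruned"]
    return [qc, prune, extract]


commands = plink_filter_commands()
for command in commands:
    print(" ".join(command))
```

Each list could be passed to subprocess.run once the placeholder names are replaced with real files.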

The above procedures left me with a core (public) dataset of 1,065 individuals from Africa and 26,129 SNPs for analysis. The complete sets of SNPs typed for these individuals can be retrieved from: Behar (2010), HapMap III, Henn (2011), HGDP and Xing (2010).
Geographically, 362 were from East Africa, 304 from West Africa, 158 from North Africa, 142 from Central Africa and 99 from South Africa. Linguistically, the dataset contained 536 Niger-Kordofanian speakers, 212 Nilo-Saharans, 211 Afro-Asiatic speakers, 89 Khoisans and 17 Hadza.

Update: Reference Populations and Key:

MDS Analysis

The data for the MDS analysis were generated using PLINK, while the plots were generated using GNU Octave. A three-dimensional MDS plot for the dataset can be seen below; all populations are labelled at their median coordinates.

Here, we can see that the first component, C1, separates East and North Africans from West/Central/South Africans, while the second component, C2, separates the divergent hunter-gatherers (San, !Kung, Pygmies and Hadza) from the rest. This may be clearer on the two-dimensional C1 vs C2 plot below.

The third component, C3, separates East Africans from all the rest, as seen more clearly on the C1 vs C3 plot below.

Model Based Analysis

The model-based analysis was carried out for K=10 using ADMIXTURE, so 10 clusters were generated from the dataset. I took the liberty of naming these clusters, some on a geographic basis, others on a linguistic basis and still others on a subsistence basis. There is obviously a lot of fluidity in naming a cluster, so the names shouldn't be taken as written in stone.
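To make the model a little more concrete: ADMIXTURE looks for ancestry fractions Q and cluster allele frequencies P that maximise a binomial log-likelihood of the observed genotypes. The sketch below evaluates that log-likelihood, up to a constant, for given Q and P. The function name, the toy numbers and the clipping constant are illustrative assumptions; ADMIXTURE's own optimiser is far more involved.

```python
import math


def admixture_log_likelihood(genotypes, q, p, eps=1e-9):
    """Log-likelihood (up to a constant) of genotypes under the admixture model.

    genotypes[i][j]: allele count (0, 1, 2) of individual i at SNP j, or None.
    q[i][k]: fraction of individual i's ancestry from cluster k.
    p[k][j]: frequency of the counted allele at SNP j in cluster k.
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            if g is None:
                continue
            # Expected allele frequency for this individual at this SNP.
            freq = sum(q[i][k] * p[k][j] for k in range(len(p)))
            freq = min(max(freq, eps), 1.0 - eps)  # avoid log(0)
            total += g * math.log(freq) + (2 - g) * math.log(1.0 - freq)
    return total


# Toy example with K=2 clusters, one individual and two SNPs.
p = [[0.9, 0.9], [0.1, 0.1]]
genotypes = [[2, 2]]
ll_cluster_1 = admixture_log_likelihood(genotypes, [[1.0, 0.0]], p)
ll_cluster_2 = admixture_log_likelihood(genotypes, [[0.0, 1.0]], p)
```

In the toy example, assigning the individual to the first cluster gives the higher log-likelihood, since its genotypes carry the allele that is common there.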

A PCA plot of the FST distances generated by ADMIXTURE for the 10 clusters can be seen below.
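One common way to turn a matrix of pairwise distances, such as FST values between clusters, into plot coordinates is classical MDS (principal coordinates analysis). The sketch below does this in plain Python with power iteration. It is an illustration under assumptions: the function names are hypothetical, the toy distances are made up, and it presumes the leading eigenvalues of the centred matrix are positive, which real FST matrices only approximately guarantee.

```python
import math


def _matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def classical_mds(dist, n_components=2, iterations=500):
    """Classical MDS of a symmetric distance matrix via power iteration."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_components for _ in range(n)]
    for c in range(n_components):
        v = [1.0 + i for i in range(n)]  # arbitrary non-constant start
        for _ in range(iterations):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variation left to extract
                v = [0.0] * n
                break
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(b, v)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][c] = scale * v[i]
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]  # deflate this component
    return coords


# Toy distances between three clusters lying on a line at 0, 1 and 3.
toy = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
points = classical_mds(toy)
```

For the toy matrix the first axis carries all the variation, so the recovered points reproduce the input distances.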

The extreme positioning of the 'Hadza' cluster is indeed striking, followed by the 'KhoiSan' and 'Pygmy' clusters. The 'West African', 'West-Central African' and 'Eastern Bantu' clusters are quite close to each other, as can be expected. The divergence of the 'North African' cluster from East Africa can be explained by the significant extra-African admixture in North Africans, as evidenced by the share of their direct maternal ancestry coming from Europe and the Near East, while a majority of their paternal ancestry comes from East Africa (namely, E1b1b).

Below are the median proportions of the 10 clusters generated by ADMIXTURE for the 45 uniquely entered African populations, categorised by their 5 respective regions.
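Per-population medians of this kind can be computed from the individual ancestry fractions (ADMIXTURE writes one row of K fractions per individual to its .Q output). A minimal sketch follows, with hypothetical population labels and made-up numbers.

```python
from statistics import median


def median_proportions(q_rows, populations):
    """Median ancestry fraction per cluster for each population.

    q_rows: one list of K ancestry fractions per individual.
    populations: population label of each individual, in the same order.
    """
    by_population = {}
    for row, label in zip(q_rows, populations):
        by_population.setdefault(label, []).append(row)
    # zip(*rows) regroups the fractions cluster by cluster.
    return {label: [median(column) for column in zip(*rows)]
            for label, rows in by_population.items()}


# Made-up fractions for four individuals and K=2 clusters.
q_rows = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]
labels = ["PopA", "PopA", "PopA", "PopB"]
medians = median_proportions(q_rows, labels)
```

Note that medians taken cluster by cluster need not sum to exactly 1 within a population, unlike the per-individual fractions.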

The abbreviations EtA, EtO and EtT above stand for Ethiopian Amharas, Ethiopian Oromos and Ethiopian Tigrayans respectively. These samples (as well as the Ethiopian Jews, also known as Beta Israel) come from Behar (2010); the EtO samples purportedly come from the southernmost tip of Ethiopia, close to the Kenyan border. The dominance of the 'North African' cluster in Ethiopians is not much of a surprise, as it is well known that Ethiopia is a genetic conduit between East and North Africa.

Here, both the Mbuti and Biaka Pygmies form completely independent clusters, which is not unexpected, as they are some of the most divergent populations even on a global basis. Also of note are the slight 'North African' affinity of the Hema and the 'West African' affinity of the Bulala and Mada.

Many of the non-Khoisan South African populations in the Dataset show affinities to both the 'Eastern Bantu' and 'Central-West African' clusters in almost equal proportions, which is interesting.

As seen in the PCA plots of the FST distances, the 'Central-West African', the 'West African' and the 'Eastern Bantu' clusters are close. The Dogon population, however, shows the least of the 'Central-West African' cluster and is almost completely dominated by the 'West African' cluster, while the reverse is true for the Igbo and Yoruba. Similarly, the Fulani show almost none of the 'Central-West African' cluster and are mostly dominated by the 'West African' cluster; the difference from the Dogon is that the Fulani have a significant affinity with the 'North African' cluster rather than the 'Central-West African' one.

In the last graphic above, we can see a geographic affinity of North-West Africans with West African based clusters and of North-East Africans with East African dominant clusters, as is to be expected. As stated before, however, the 'North African' cluster itself is likely a synthesis, both ancient and recent, of East African, European and Near Eastern affinities.

Conclusion

I learned quite a bit about the population structure of Africa from this exercise, but there is a lot of room left for improvement:

The SNPs typed on almost all genotyping arrays are Eurasian-biased, as they were first found in Europeans. As time goes on, more African-specific SNPs will be discovered, and their use in genome-wide analysis will change these results.
More samples are needed, especially from both South and North Sudan, all along the Sahel belt, Tuaregs, different Omotic speakers from Ethiopia, populations from Mozambique and the South Eastern coast of Africa, as well as the South Western coast (Angola) and many, many more. The inclusion of these samples will have an impact on these results.
Denser SNP sets (~200k) may also give slightly different results, although Sikora (2010) notes the following: “We can conclude that the common set of 2841 SNPs genotyped is an appropriate tool to study population structure in African populations; in general, world-wide patterns are evident and robust when using a minimum of 1000 SNPs.”
Newer and more computationally intensive methods for bridging the gap between model-based and distance-based autosomal analysis have recently been published; it would be interesting to analyse this dataset with these newer methods.

30 comments:

Maju, February 29, 2012 at 3:18 AM
This is a very interesting preliminary exercise, thanks. I say "preliminary" because it's impossible to capture the full detail in an all-Africa comparison. I hope that regional analyses follow.

In particular, I think that specific regional analyses for East and Southern Africa should be interesting in their own right (West Africans appear to cluster tightly in the PCA plots, but they surely hide some structure as well).

"More samples are needed, especially from both South and North Sudan, all along the Sahel belt, Tuaregs, different Omotic speakers from Ethiopia, populations from Mozambique and the South Eastern coast of Africa, as well as the South Western coast (Angola) and many many more".

Sure. I'm very intrigued about Mozambicans and their apparent distinct relation with the Twa, accidentally found in Patin 2009. It may just be an "Eastern Bantu" thing but the Chagga of Tanzania and all the other Bantu did not show that component (but at residual levels). I don't see anything like that in this analysis but maybe a regional analysis could find some such specificities.
Beyoku, February 29, 2012 at 12:18 PM
Nice work. I know this is somewhat of a tall order, but is there any "easy" way to make distribution maps of these clusters? With many of the samples that exist it would be so interesting to see something like a heat map overlaid onto the sample plot points in a map of the continent.

I have been wanting to do this with the Tishkoff data for a long time, as well as make some updated uniparental maps, but I just don't have the time.
jes-r, March 1, 2012 at 5:32 AM
Interesting analysis. I would just want to add that it is generally advised to remove samples with PI_HAT > 0.15 (essentially first and second cousins - the Maasai and Luhya contain many of them). If this isn't done it may lead to clusters which are just a product of recent inbreeding and not ancestral divergence.
andrew, March 1, 2012 at 5:35 PM
This source takes the position that the Mada people speak a Niger-Congo language rather than an Afro-Asiatic language, which would resolve one of the two outliers: the Hausa and Mada are the only two populations speaking Afro-Asiatic languages that lack a significant North African component.
jes-r, March 1, 2012 at 6:10 PM
What I find the most interesting of this run is that the Hema carry the 'North African' component while their close neighbors the Alur and Mbuti completely lack it. The Alur and Hema apparently even speak the same language (Lendu). This probably means that the Hema are not ethnically Nilotic, but mainly Bantu with Cushitic (likely similar to the Tutsi). I'm kinda surprised that Cushitic admixture traveled that far.
Maju, March 2, 2012 at 1:20 AM
Ethnologue, which is usually a very trustworthy source and regularly updated, classifies the Mada as Afroasiatic-speaking, and they also show up in the midst of Afroasiatic-speaking peoples, which, in principle, would be consistent with their R1b and East African relatedness (which I imagine expanding from Sudan with Chadic speakers).
andrew, March 2, 2012 at 3:14 PM
Weird on the Mada point. The first page said "Their native language, also called Mada, belongs to the Niger-Congo language family.", but I agree with Maju that Ethnologue is more trustworthy, and a high R1b frequency is a very strong predictor of Chadic linguistic affiliation in Africa. The fact that the Hausa and Mada show a similar pattern makes lots of sense in this context.
andrew, March 2, 2012 at 3:27 PM
A more general comment that arose as I considered your post here:

It isn't really clear to me that it makes sense to do this analysis at K=10 rather than K=8 in ADMIXTURE. Presumably, at K=8, you would end up collapsing East Bantu, Central West Africa and West Africa into a single ancestral population, since the Fst distances between those populations are negligible relative to the other populations. This would make the other patterns in the data stand out better, and my intuition is that the extra two ancestral populations may be carrying a lot of noise relative to the signal they are producing.

Maju and I have also discussed the possibility that the extreme outlier status of the Hadza is a product of extreme inbreeding rather than genuine genetic distance. Where would a typical Hadza individual show up on the PC charts if the rest of them were removed from the sample?

Another analysis that would be interesting would be to compare admixture rates from the Admixture frappe charts and compare them to the inferred admixtures you would see from uniparentals. Indeed, I'd find it quite interesting to see side by side uniparental and autosomal data.

Most of the Pygmy component in non-Pygmy populations looks like just minor components of an expanding Bantu mix. But, it seems like there is more at work than that in the Alur, which makes that population's history an interesting puzzle. Could there have been a third pygmy population in pre-historic Africa which was absorbed into the Alur? Is this just an exaggerated founder effect legacy of Bantu expansion? Or what?
Etyopis, March 2, 2012 at 9:14 PM
Andrew, you wrote a lot of interesting things both here and at your blog that I cannot address all at the moment, but I will try to as time permits. One thing that caught my immediate attention, however, was this statement that you made: "The lower level of the North African component in Oromo speakers in Ethiopia relative to other Ethiopians is consistent with the notion that there was a meaningful demic component to the transition from a prior language to Ethiosemitic languages in Ethiopia."

The relevant Cushitic speakers to compare Ethiosemitic speakers with before making such an assessment are the Central Cushitic speakers rather than the Lowland East Cushitic speakers. Since we do not have any genome-wide data from current Central Cushitic-speaking populations from the highlands of Ethiopia, we have to use the Ethiopian Jews, or Beta Israel, as a proxy, since they are known to have historically spoken Central Cushitic (otherwise known as Agew) languages. As you can see, there is very little difference in cluster proportions between the Ethiopian Jews and the other Highland Ethiopians (EtT and EtA).

Another point regarding the Hadza and haplogroups is that Henn (2011), from which the samples for this analysis came, did report the haplogroups of the Hadza samples as follows:
mtDNA
L0a2*: 6% (1)
L3h: 11% (2)
L4g: 56% (10)
L2a: 22% (4)
L3b: 6% (1)

YDNA
E1b1b1: 10% (1)
B2b: 10% (1)b
B2b4*: 50% (5)
E1b1a7a3a: 30% (3)

For The Sandawe:
mtDNA
L0a2*: 20% (6)
L3x1: 13% (4)
L4g: 37% (11)
L2a: 10% (3)
L3e3: 17% (5)

YDNA
A3b2*: 12% (2)
B2b4*: 29% (5)b
E1b1b1: 18% (3)
E2b1: 6% (1)
E1b1a7a3a: 24% (4)
E1b1a8a: 12% (2)

And for the SAN
mtDNA:
L0d1a: 43% (14)
L0d1b: 50% (16)
L0a’b’f*: 7% (2)

YDNA:
A3b: 26% (5)
A3b1: 32% (6)
B2b4*: 5% (1)b
E2b1: 5% (1)
E1b1a7a3a: 10% (2)
E1b1a8a: 10% (2)
R1b1b2a1a: 10% (2)

In addition, since the Sandawe don't show any exogenous (relative to Africa that is) Paternal or Maternal lineages, but at the same time show ~12.3% of the 'North African' cluster, that is part of the reason why I think that the North African cluster is a composite of European, Near Eastern and East African elements, with the composition proportions changing with geography, i.e. more of the indigenous elements in East Africa, and less so in North Africa.
Etyopis, March 3, 2012 at 1:27 AM
Yea Eze, cosign with what you said for the most part. I'm just not sure about the possible dates of the split between Omotic and Cushitic; it could have happened >10KYA if I recall. Do you have any info on that?

Regarding K=8, I just ran it for the exact same dataset out of curiosity about Andrew's inquiry. In essence, what happened was that the 'East-Africa 1' and the 'West-Central Africa' clusters disappeared; the PCA FST plot looks pretty much like the K=10 run except that those 2 clusters were gone.

PCA Plot for the FST Distances @ K=8

Here is a PDF file with the Median proportions for all the pops
Etyopis, March 19, 2012 at 3:16 AM
I applied the Studentized >2 method to remove outliers in the All Africa dataset; this filtered out 295 individuals. When I reran the dataset, the West-Central Africa cluster disappeared and the Sandawe formed their own cluster, of which a significant amount was found in East/Horn Africans. This procedure also significantly reduced the overall standard deviation of each population-to-cluster proportion. I don't have time to detail or plot the results, but anybody interested can take a look at the median cluster matrix and the standard deviation cluster matrix below:
http://dl.dropbox.com/u/42082352/Africa_Rev5.pdf