Showing posts with label Autosomal. Show all posts

Wednesday, July 31, 2013

A summary of interesting recent genetics papers.

I'm taking a break from my Summer break to post a few interesting papers that have come out within the past couple of months.


This paper supports the notion of continuous gene flow between Africans and non-Africans since the major Out of Africa event that preceded the peopling of all continents outside of Africa.
To be sure, this notion is not new; it has been highlighted before by the methods of authors such as Li and Durbin (2011). It is also sufficient to explain the intermediate genetic nature of West Eurasians, i.e. between Africans and East Asians/Native Americans, that I have blogged about and demonstrated using ADMIXTURE in the past.


A few quotes from the paper:

"In this paper, we study the length distribution of tracts of identity by state (IBS), which are the gaps between pairwise differences in an alignment of two DNA sequences. These tract lengths contain information about the amount of genetic diversity that existed at various times in the history of a species and can therefore be used to estimate past population sizes. IBS tracts shared between DNA sequences from different populations also contain information about population divergence and past gene flow. By looking at IBS tracts shared within Africans and Europeans, as well as between the two groups, we infer that the two groups diverged in a complex way over more than 40,000 years, exchanging DNA as recently as 12,000 years ago." 

"To illustrate the power of our method, we use it to infer a joint history of Europeans and Africans from the high coverage 1000 Genomes trio parents. Previous analyses agree that Europeans experienced an out-of-Africa bottleneck and recent population growth, but other aspects of the divergence are contested [47]. In one analysis, Li and Durbin separately estimate population histories of Europeans, Asians, and Africans and observe that the African and non-African histories begin to look different from each other about 100,000–120,000 years ago; at the same time, they argue that substantial migration between Africa and Eurasia occurred as recently as 20,000 years ago and that the out-of-Africa bottleneck occurred near the end of the migration period, about 20,000–40,000 years ago. In contrast, Gronau, et al. use a likelihood analysis of many short loci to infer a Eurasian-African split that is recent enough (50 kya) to coincide with the start of the out of Africa bottleneck, detecting no evidence of recent gene flow between Africans and non-Africans [14]. The older Schaffner, et al. demographic model contains no recent European-African gene flow either [48], but Gutenkunst,et al. and Gravel, et al. use SFS data to infer divergence times and gene flow levels that are intermediate between these two extremes [22][49]. We aim to contribute to this discourse by using IBS tract lengths to study the same class of complex demographic models employed by Gutenkunst, et al. and Gronau, et al., models that have only been previously used to study allele frequencies and short haplotypes that are assumed not to recombine. Our method is the first to use these models in conjunction with haplotype-sharing information similar to what is used by the PSMC and other coalescent HMMs, fitting complex, high-resolution demographic models to an equally high-resolution summary of genetic data."

"We estimate that the European-African divergence occurred 55 kya and that gene flow continued until 13 kya. About 5.8% of European genetic material is derived from a ghost population that diverged 420 kya from the ancestors of modern humans. The out-of-Africa bottleneck period, where the European effective population size is only 1,530, lasts until 5.9 kya."

"Our inferred human history mirrors several controversial features of the history inferred by Li and Durbin from whole genome sequence data: a post-divergence African population size reduction, a sustained period of gene flow between Europeans and Yorubans, and a “bump” period when the ancestral human population size increased and then decreased again. Unlike Li and Durbin, we do not infer that either population increased in size between 30 and 100 kya. Li and Durbin postulate that this size increase might reflect admixture between the two populations rather than a true increase in effective population size; since our method is able to model this gene flow directly, it makes sense that no size increase is necessary to fit the data. In contrast, it is possible that the size increase we infer between 240 kya and 480 kya is a signature of gene flow among ancestral hominids."

"Our estimated divergence time of 55 kya is very close to estimates published by Gravel, et al.and Gronau, et al., who use very different methods but similar estimated mutation rates to the  per site per generation that we use in this paper. However, recent studies of de novo mutation in trios have shown that the mutation rate may be closer to  per site per generation [51][55][56]. We would estimate older divergence and gene flow times (perhaps  times older) if we used the lower, more recently estimated mutation rate. This is because the lengths of the longest IBS tracts shared between populations should be approximately exponentially distributed with decay rate ."

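To make the definition in the first quote concrete, here is a minimal Bash/awk sketch that lists the lengths of the IBS tracts between two aligned sequences. It is only an illustration of the idea of "gaps between pairwise differences", not the paper's implementation: the function name ibs_tract_lengths and the toy sequences are made up, and tracts touching either end of the alignment are counted even though they are not bounded by a difference on both sides.

```shell
#!/usr/bin/env bash
# Print the length of every run of identical sites between two aligned,
# equal-length sequences (one length per line, in alignment order).
# Assumption of this sketch: runs at the alignment ends are reported too.
ibs_tract_lengths() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        if (length(b) != n) {
            print "sequences differ in length" > "/dev/stderr"
            exit 1
        }
        run = 0
        for (i = 1; i <= n; i++) {
            if (substr(a, i, 1) == substr(b, i, 1)) {
                run++                      # still inside an IBS tract
            } else {
                if (run > 0) print run     # a difference closes the tract
                run = 0
            }
        }
        if (run > 0) print run             # tract running to the end
    }'
}

# Toy example: differences at positions 4 and 9, so this prints 3, 4 and 1.
ibs_tract_lengths "ACGTACGTAC" "ACGAACGTTC"
```

On real data the two sequences would be phased haplotypes many megabases long, and it is the distribution of these lengths that carries the demographic signal.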



This paper discusses some points, or rather the lack of evidence, that make a pre-Toba migration of modern humans out of Africa almost impossible to reconcile with currently available evidence.

A few quotes from the paper:

"There are currently two sharply conflicting models for the earliest modern human colonization of South Asia, with radically different implications for the interpretation of the associated genetic and archaeological evidence (Fig. 1). The first is that modern humans arrived ∼50–60 ka, as part of a generalized Eurasian dispersal of anatomically modern humans, which spread (initially as a very small group) from a region of eastern Africa across the mouth of the Red Sea and expanded rapidly around the coastlines of southern and Southeast Asia, to reach Australia by ∼45–50 ka (7–10, 14–18) (Fig. 2). The second, more recently proposed view, is that there was a much earlier dispersal of modern humans from Africa sometime before 74 ka (and conceivably as early as 120–130ka), reaching southern Asia before the time of the volcanic “supereruption” of Mount Toba in Sumatra (the largest volcanic eruption of the past 2 million y) at ∼74 ka (1–6)."
"We find no evidence, either genetic or archaeological, for a very early modern human colonization of South Asia, before the Toba eruption. All of the available evidence supports a much later colonization beginning ∼50–55 ka, carrying mitochondrial L3 and Y chromosome C, D, and F lineages from eastern Africa, along with the Howiesons Poort-like microlithic technologies (see above and Genetics and Archaeology). We see no reason to believe that the initial modern human colonization of South and Southeast Asia was distinct from the process that is now well documented for effectively all of the other regions of Eurasia from ∼60 ka onward, even if the technological associations of these expanding populations differed (most probably for environmental reasons) between the eastern and northwestern ranges of the geographical dispersal routes."

"The archaeological evidence initially advanced to support an earlier (pre-Toba) dispersal of African-derived populations to southern Asia has since been withdrawn by the author responsible for the original lithic analyses, who now suggests that they are most likely “the work of an unidentified population of archaic people” (ref. 11, p. 26). Meanwhile, the genetic evidence outlined earlier indicates that any populations dispersing from Africa before 74 ka would predate the emergence of the mtDNA L3 haplogroup, the source for all known, extant maternal lineages in Eurasia (8, 28) (Fig. 5). The size of the mtDNA database is very substantial: currently there are almost 13,000 complete non-African mtDNA genomes available, not one of which is pre-L3."




This paper, written by a genealogical community member, makes an impressive effort at creating and automating a comprehensive method to phylogenetically classify Geno 2.0 YDNA SNPs. Details of the algorithm are not available:

"To illustrate this, the author has used this Y-tree clade predictor (using the latest ISOGG tree as a basis for comparison) to classify over 1650 sets of publicly accessible Geno 2.0 Y-SNP calls. This information was then used as an input into another algorithm designed by the author – an algorithm developed to automate the construction of a phylogenetic Y-tree, while overcoming the challenges identified above. The technical details of this process will remain proprietary for the time being."



Thursday, March 28, 2013

Global Contour Map for the Dual ADMIXTURE Components.

Below is a contour map representing the African ADMIXTURE component at K=2 for the Global data set (V2), which can be downloaded here; the population-specific percentages can be seen here.

The contour map was generated using Mapviewer7, with the Kriging method used for gridding. ADMIXTURE outputs for all New World, Jewish, Singapore-Chinese and Singapore-Indian populations were removed before the map was generated.

African cline from ADMIXTURE, K=2. Black dots represent locations of sampled populations.


Some things to note:

  • Since this is a K=2 run, the OOA or 'other' component has a distribution that is the exact mirror image of the African component's distribution seen above.
  • The regions where the brown color dominates (20-35% African) are the same regions that are later absorbed by the new component that arises at K=3, which peaks in West Eurasians and has an FST that is intermediate between those of the African and East Asian/Amerindian components.
  • The congruence of the above with the distribution of global genetic as well as phenotypic diversity is notable (see below and note 1).


Global phenotypic and genetic Diversity 
1. The effect of ancient population bottlenecks on human phenotypic variation

Friday, February 15, 2013

Gradient Maps for African ADMIXTURE components

Below are gradient maps for my last African ADMIXTURE run, Africa_V2b, courtesy of a demo download of Mapviewer7. The Kriging method was used for gridding and the 'Grid Z limits' mode was used for color mapping.

Sampled Population's Index

Sampled Population's Location

PCA for the FST distances generated by ADMIXTURE

West-Africa Cluster Freq.

Nilo-Saharan Cluster Freq.

East-Africa-2 Cluster Freq.

North-Africa Cluster Freq.

Khoi-San Cluster Freq.

Omotic Cluster Freq.

Mbuti-Pygmy Cluster Freq.

Biaka-Pygmy Cluster Freq.

Hadza Cluster Freq.

East-Africa-1 Cluster Freq.

Isometric view of the MDS plot for all Populations sampled


UPDATE (02/18/2013): Below are gradient maps for the first African ADMIXTURE run, Africa_V1, courtesy of a demo download of Mapviewer7. The same options as above were used for both gridding and color mapping.

Wednesday, November 14, 2012

STRUCTURE run on High/Low Altitude Ethiopians


The pdf can be downloaded here

Regarding the populations sampled, the paper notes the following:

“The high altitude (HA) Amhara are agropastoralists living in a temperate Afro-alpine ecosystem in the Simien Mountains National Park at altitudes ranging from 3500-4100 meters (m). Altitudes above 2500m on the East African Plateau have been inhabited for at least 5 thousand years (ky) and altitudes around 2300-2400m for more than 70ky [24,25].”

Plus:

“DNA was extracted from blood samples provided by 192 Amhara individuals living at 3700 m in the Simien Mountains National Park or at 1200 m in the town of Zarima.”

For the Oromo:

“The HA Oromo are pastoralists herding cattle, sheep and goats and living in a temperate Afro-alpine ecosystem in the Bale Mountains National park and reside on the Sanetti Plateau at 4000-4100m. The HA areas of the Bale Plateau have been inhabited by Oromo since the early 1500s according to historical records [22,23].”

Plus:

“79 individuals lived at 4000 m in the Bale Mountains National Park while 39 individuals lived at 1560 m in the town of Melkibuta.”

Melkibuta is probably a typo for Melkabuta, Bale, close to Goro, Bale, which I have used as a proxy town in the map below for the location of the LA Oromo samples.
Green = Low Altitude Amhara, Orange = High Altitude Amhara, Yellow = Low Altitude Oromo, Purple = High Altitude Oromo


Regarding the STRUCTURE run it says:

“This position is further supported by the Bayesian clustering analysis performed using the program STRUCTURE [85]. In this analysis, 3 different sets of 57652 SNPs were used to infer the ancestral composition of each population assuming 7 ancestral groups. The STRUCTURE plots clearly show that Ethiopian populations share ancestral components with sub-Saharan African and Middle Eastern populations falling in the middle of the ancestry gradient between these two groups of populations (Figure S2).”

and, interestingly:

“We also calculated the haplotype diversity and compared it to that observed in the worldwide populations. Interestingly, the Oromo (0.822) and Amhara (0.810) haplotype diversity values are as high as or higher than the highest values [80] observed in the HGDP, i.e. Bantu (0.818), Biaka Pygmies (0.815), Yoruba (0.815) and Mandenka (0.807); this is true regardless of altitude (0.798 for HA Amhara; 0.803 for LA Amhara, 0.813 for HA Oromo, and 0.813 for LA Oromo).”


There is also an FST-based global neighbor-joining tree in the PDF, with a familiar outcome.







Saturday, July 7, 2012

The World At K=2


The most basic autosomal genetic division of the world is between Africans and Out of Africans (OOA). This is seen not only on global PCA or MDS maps, where the first PC separates Africans from non-Africans, but also in model-based statistical (Bayesian) analysis, where the first model iteration, i.e. K=2, distinguishes Africans from non-Africans.
Here, I present (for reference) the full ADMIXTURE K=2 results for a global dataset of 2,967 individuals from around the world, typed at 16,595 SNPs with a total genotyping rate of 99.6%.

The results are arranged from the highest median African % to the lowest.
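For anyone wanting to reproduce that ordering from raw ADMIXTURE output, the following is a minimal Bash sketch. It rests on several assumptions that are not stated in the post: that the population label sits in the first column of the PLINK .fam file, that the .Q file lists individuals in the same order as the .fam, and that the caller knows which .Q column holds the African component (column order is arbitrary in an unsupervised run). The function name median_by_pop and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Print "population median" pairs, highest median first.
# $1 = PLINK .fam file, $2 = ADMIXTURE .Q file, $3 = .Q column to summarise
# (defaults to 1). Assumes column 1 of the .fam is the population label.
median_by_pop() {
    paste "$1" "$2" \
        | awk -v c="${3:-1}" '{ print $1, $(6 + c) }' \
        | LC_ALL=C sort -k1,1 -k2,2n \
        | awk '
            # Emit the median of the values collected for the current population.
            function emit() {
                if (n == 0) return
                m = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
                printf "%s %.4f\n", pop, m
            }
            $1 != pop { emit(); pop = $1; n = 0 }
            { v[++n] = $2 }
            END { emit() }' \
        | LC_ALL=C sort -k2,2nr
}
```

A call such as median_by_pop global.fam global.2.Q 1 would then list the populations from the highest to the lowest median proportion for the chosen component.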

Friday, June 22, 2012

Intra African Genome-Wide Analysis, V2

See Also : Intra African Genome-Wide Analysis, V1


Population References and First Pass K10 Analysis



K2 - K10 Analysis

Saturday, March 31, 2012

Cross Validating and K Selection


There are two ways of choosing a K value for any given dataset on which one wishes to perform an ADMIXTURE run. One is to throw a dart at a random set of numbers and hope it works out for the best; the other is to run ADMIXTURE at different K values while computing a cross-validation error for each of them using the --cv flag. I did the latter with the studentized global dataset that I discussed earlier in this post. The cross-validation error values for K=1-14 for that particular dataset can be seen in the graphs below:

Close-up:
While the CV error values start flattening out at about K=10, the curve does not reach its inflection point until K=13, meaning K=13 is the appropriate choice for this dataset.

Cross-validation can take a considerably long time to run, as each consecutive K has to be evaluated separately along with its error, unless of course one has access to a very fast machine.

As a reference, the Bash shell code to run cross-validation in ADMIXTURE for up to K=14 is:

# -j2 uses two threads; --cv=14 requests 14-fold cross-validation
for K in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; \
do ./admixture32 -j2 --cv=14 "filename.bed" $K | tee log${K}.out; done

where the CV error value for each K will be recorded in the corresponding .out file.
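As a rough sketch of how those values can be pulled back out of the logs, the Bash functions below collect the CV error reported in each log${K}.out and print the K with the smallest error. The sketch assumes that each log contains a line of the form "CV error (K=3): 0.51234"; the function names cv_table and best_k are made up for this illustration.

```shell
#!/usr/bin/env bash
# Assumption: each log<K>.out contains one line like "CV error (K=3): 0.51234".

# Print "K error" pairs, one per line, sorted by K.
# $1 = directory holding the log<K>.out files (defaults to the current one).
cv_table() {
    grep -h 'CV error' "${1:-.}"/log*.out \
        | sed -E 's/.*K=([0-9]+)\): *([0-9.eE+-]+).*/\1 \2/' \
        | LC_ALL=C sort -n -k1,1
}

# Print the K whose CV error is the smallest.
best_k() {
    cv_table "${1:-.}" | awk '
        NR == 1 || $2 + 0 < min { min = $2 + 0; k = $1 }
        END { print k }'
}
```

Running cv_table in the directory holding the logs gives one "K error" pair per line, ready for plotting. Taking the minimum is only one way of reading the curve; judging where the curve flattens or inflects, as done above, may point to a different K.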

Peaking populations for each cluster for K=2-13

K=2
Cluster1: pygmy,mbutipygmy,sotho/tswana,biakapygmy,fang

Cluster2: chinese-americans,tujia,miao,hezhen,han

East Asians and Africans split, with West Asians and Europeans coming out as roughly 1/3 African and 2/3 East Asian; the reverse is seen with Ethiopians, at 2/3 African and 1/3 East Asian.


Wednesday, March 21, 2012

A Supervised Global ADMIXTURE Run


A supervised ADMIXTURE run assumes that certain populations within a given dataset are 100% of a certain ancestry. So, for instance, if one wants to run ADMIXTURE at K=10 in supervised mode, then 10 different populations, each assumed to come from one of the 10 putative ancestral clusters that the software will infer (or rather will be forced to infer), must be manually selected.
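In practice, ADMIXTURE's supervised mode reads the reference assignments from a .pop file that sits next to the .bed file, with one line per individual in .fam order: a population name for reference individuals and "-" for everyone else. The Bash sketch below builds such a file. It assumes that the population label is stored in the first column of the .fam file, and the function name make_pop and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Write .pop lines for a supervised run: the population label for reference
# individuals, "-" for individuals whose ancestry is to be estimated.
# $1 = PLINK .fam file (column 1 assumed to hold the population label)
# $2 = text file listing the chosen reference populations, one per line
make_pop() {
    awk 'NR == FNR { ref[$1] = 1; next }
         { print (($1 in ref) ? $1 : "-") }' "$2" "$1"
}

# Hypothetical usage for a K=10 run with 10 reference populations:
#   make_pop global.fam references.txt > global.pop
#   ./admixture32 --supervised global.bed 10
```

The number of distinct labels in the reference list should match the K passed on the command line, since each reference population anchors exactly one cluster.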

I wanted to explore this type of a run on a global basis and purposefully select populations that not only may form their own clusters in an unsupervised run, but are also thought to be within the 'trunk', bifurcation 'nodes' and end 'branches' of the ancestral 'tree' of all people.
  
The basis of this run is the global dataset that can be downloaded in PLINK format from here. The dataset, a superset of the African dataset that I have been utilizing thus far, contains 3,970 individuals from around the world typed at 27,022 genome-wide SNPs.
A 3 dimensional, as well as a dim1 vs dim2, MDS plot labelled according to the median coordinates of the population groups for this dataset can be seen below:



The general structure of a global PCA/MDS plot is well known and understood: the first principal component, which describes the most variation of all the components, separates Africans from non-Africans, while the second principal component separates West Asians/Europeans from East Asians, Oceanians and Native Americans. The 3rd principal component, however, can be shaky. In the plot above it separates Native Americans from the rest, but other sources have shown the 3rd principal component in a global PCA separating divergent hunter-gatherers (like the Hadza, Sandawe, San and Pygmies) from everybody else. Perhaps a 3-D PCA generated from full genome scans will put this to rest once and for all.

Monday, March 12, 2012

TreeMix analysis on the African Dataset


Thanks to a commenter going by the moniker 'Eze', who notified me the other day of a new program called TreeMix, which infers “patterns of population splitting and mixing from genome-wide allele frequency data”, I had a chance to give it a try on the Intra-African Dataset that I have described previously.

After converting the input file into the desired format, I decided to play with several of its functionalities to become familiar with it:
 
1) Default Maximum Likelihood (ML) Tree,

  

2) Default ML graph with 4 assumed migrations,


 3) ML graph rooted with the San-nb,

  
4) ML graph with 4 migrations and rooted with the San-nb.

A remaining option of the software that I have not as yet tried is that which groups SNPs together to account for linkage disequilibrium. 
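For reference, the four runs listed above, plus the untried SNP-grouping option, map onto TreeMix's -m (number of migration edges), -root (population to place the root at) and -k (SNPs per block) flags. The Bash sketch below only prints the command lines rather than running them; the input file name, the output stems, the helper name treemix_cmd and the block size of 500 are all placeholders, not the values used for the plots in this post.

```shell
#!/usr/bin/env bash
# Build (and print, not run) a TreeMix command line.
# $1 = gzipped allele-count input, $2 = output stem, $3 = migration edges,
# $4 = root population (optional), $5 = SNPs per block (optional).
treemix_cmd() {
    local input=$1 stem=$2 migrations=${3:-0} root=${4:-} block=${5:-}
    local cmd="treemix -i $input -o $stem"
    if [ "$migrations" -gt 0 ]; then cmd="$cmd -m $migrations"; fi
    if [ -n "$root" ]; then cmd="$cmd -root $root"; fi
    if [ -n "$block" ]; then cmd="$cmd -k $block"; fi
    echo "$cmd"
}

treemix_cmd africa.treemix.gz ml_tree 0               # 1) default ML tree
treemix_cmd africa.treemix.gz ml_m4 4                 # 2) 4 assumed migrations
treemix_cmd africa.treemix.gz ml_root 0 San-nb        # 3) rooted with the San-nb
treemix_cmd africa.treemix.gz ml_m4_root 4 San-nb     # 4) rooted, 4 migrations
treemix_cmd africa.treemix.gz ml_m4_k500 4 San-nb 500 # SNPs grouped in blocks
```

Piping any of the printed lines to a shell would launch the corresponding run, provided the root label matches a population name in the input file exactly.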

Other than that, the results are much as expected. The North Africans are shown in both the default and rooted trees, but especially in the San-nb rooted tree, as a branch of East Africans, who are in turn seen as a branch of other Africans. This is consistent with evidence from uniparental markers, as well as published papers, for an East African genesis of Eurasians, for whom North Africans can be used as a proxy in this particular Dataset.

The 4 inferred migrations, in order of decreasing edge weight, were:

-(Biaka Pygmy, Ancestral Sotho/tswana) → Sandawe, Migration edge:0.457032; likely an old hunter gatherers link. This was noted by Tishkoff (2009) : “These results suggest the possibility that the SAK, Hadza, Sandawe, and Pygmy populations are remnants of an historically more widespread proto-Khoesan- Pygmy population of hunter-gatherers.”

-(!kung,Ancestral to Biaka and Mbuti Pygmies) → Hadza,
Migration edge:0.44087; potentially another early hunter gatherers link.

-Ethiopian Jews → San,
Migration edge:0.188914; this could be a relic of early hunter-gatherer connections with Ethiopia (See: Ethiopians and Khoisan share the deepest clades of the human Y-chromosome phylogeny.) Another possible connection for this could be the migration of YDNA E1b1b1b2b (E-M293) carriers from Eastern Africa to Southern Africa within the past few millennia.

-Mbuti Pygmy → Alur,
Migration edge:0.140627; this was also picked up by the ADMIXTURE analysis, where the Alur had significant amounts of Mbuti and Biaka pygmy components.

Further reading on the details behind the software featured in this post, TreeMix, can be found here: http://hdl.handle.net/10101/npre.2012.6956.1.


UPDATE: I ran another one, again rooted with the San from Namibia but with 10 migrations assumed, and got the following results (the left column is the migration edge weight):

0.586693 luhya →hema,hadza
0.508001 egyptans → EtA
0.504407 egyptans → EtT
0.442291 egyptans → Ethiopian-jews
0.432858 moroccans → fulani
0.27746 mbutipygmy,pygmy → sandawe
0.203223 mbutipygmy,pygmy → hadza
0.156929 egyptans → maasai
0.154406 moroccans → san
0.129901 pygmy → alur


Some of the results from the previous run with 4 assumed migrations disappeared; it is not clear whether migrations inferred under a lower m are more statistically significant than those inferred under a higher m. In general, this newer run more closely resembles the K10 ADMIXTURE run, but there are some obscure differences. For instance, while it picked up a North to East African migration in the EtA, EtT and EtJ samples, it skipped the EtO samples and then picked up the same migration pattern in the Maasai samples, who had a lower 'North-African' component in the K10 ADMIXTURE run than the EtO samples. My take on this is that the program is not yet sophisticated enough to accommodate bidirectional migrations that have gone on for thousands of years, like the ones that have taken place between East and North Africa. Indeed, the authors of the software list the following pertinent point among their assumptions:

"We also have modeled migration between populations as occurring at single, instantaneous time points."

and

"This model will work best when gene flow between populations is restricted to a relatively short time period. The relevance of this assumption will depend on the species and the populations considered."

UPDATE2: Residual plot for 10 migrations rooted with the San-nb.

Thursday, March 8, 2012

Afrasans in a Genome-Wide context.


A subset of the Intra-African dataset I have includes Afrasans, or Afroasiatic speakers. Afroasiatic is typically divided into 6 major categories or groups: Semitic, Berber, Egyptian, Chadic, Cushitic and Omotic. A 7th, nearly extinct group known as Ongota is contentious, but is included by some as its own branch within the Afroasiatic phylum. All of these language groups, with the exception of Semitic, are found exclusively in Africa. The 211 Afrasan samples in the dataset belong to 4 or 5 of the groups mentioned, depending on how one accounts for any language shifts (that is, shifts within the wider Afrasan phylum) that might have occurred. A rough table is shown below associating the 211 samples with the current, and in some cases previously spoken, language or language group of Afroasiatic.

 
In general, Afroasiatic is thought to have emerged somewhere in the northeastern section of Africa, anywhere from Ethiopia to southern Egypt. In the genetic (autosomal) sense, the populations inhabiting that area lie along a diagonal axis of the C1 vs C3 Intra-African MDS plot (at ~34° from the horizontal), as highlighted below:
MDS plots
After extracting the 211 AA speaking samples from the 1065 sample African Dataset, I performed an MDS Analysis on it as seen below.
Component 1 separates Berber/Semitic/Egyptian speakers from Chadic speakers, with Ethiopian Semitic/Cushitic speakers plotting somewhere in between, but closer to the former in this separation. Component 2 separates Ethiopians+Egyptians from the rest.

Component 3 separates the Mozabites from the rest, with Ethiopians again retaining an intermediate position.

Model Based Analysis
The logical value for K would be 6, i.e. the number of known Afroasiatic subgroups. However, since Omotic speakers are not present in the Dataset, I went ahead and ran a K=5 unsupervised ADMIXTURE analysis on the Afrasan Dataset.

The K=5 ADMIXTURE run produced the following FST distances,
 
The biggest separation on both axes is for the cluster I nicknamed Cushitic, while the Berber, Semitic and Mozabite clusters appear pretty close, with the Mozabites looking a bit isolated.

The Median proportions for the clusters can be seen below.
 
The fact that the Mozabites formed their own cluster is intriguing, although one would suspect that inbreeding plays a role, since the Mozabites can also be seen clustering away from other North Africans in the 3D MDS plot, almost forming their own group.

Therefore, to see what this analysis would look like without the Mozabites, I took all 27 of them out, leaving me with 184 AA speaking samples.
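The post does not say which tool was used for the filtering or the MDS, so the following is only one possible way to do it: a Bash sketch that writes a PLINK-style remove list (family ID and individual ID per line) for one population label, followed by commented PLINK 1.x commands. It assumes the population label is stored in the first column of the .fam file; the function name remove_list and all file names are hypothetical.

```shell
#!/usr/bin/env bash
# Print "FID IID" for every individual whose population label matches.
# $1 = PLINK .fam file (column 1 assumed to hold the population label)
# $2 = population label to drop
remove_list() {
    awk -v pop="$2" '$1 == pop { print $1, $2 }' "$1"
}

# Hypothetical usage (assumes PLINK 1.x is installed):
#   remove_list afrasan.fam Mozabite > mozabite.txt
#   wc -l < mozabite.txt          # sanity check: expect 27 lines here
#   plink --bfile afrasan --remove mozabite.txt --make-bed --out afrasan_nomoz
#   plink --bfile afrasan_nomoz --cluster --mds-plot 3
```

Checking the line count of the remove list before filtering is a cheap way to confirm that 211 samples really do drop to 184.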

I repeated the same analysis as above on the newer Dataset.

MDS Plots
Components 1 and 2 behaved the same way as when the Mozabites were included. Component 3, however, now separates Berber and Cushitic speakers from the rest to almost the same degree, unlike when the Mozabites were included.

Model Based Analysis
This second iteration of the Afrasan dataset, which did not include the Mozabites, produced Cushitic, Chadic, Berber and Egyptian clusters, plus a 5th cluster that looks like a relic present in trace amounts in all the Afrasan samples except the Mada and Hausa. The Egyptian cluster is also found in highland Ethiopians, and it shows a high Standard Deviation more often than any of the other clusters:

So the Egyptian cluster looks like it gives less of a linguistic signal than the other clusters; it could potentially include a Semitic signal as well as other types of non-Afroasiatic Eurasian affinities.

It would be of great interest to see where Omotic speakers would fit into this analysis.