Tuesday, June 19, 2012

Finding the TMRCA of Ethiopian YDNA lineages using an ASD method.


I have been lately working on computing TMRCAs using an ASD or average square difference method on publicly available Y-STR haplotypes. The premise for finding the TMRCA using the ASD method is quite straight forward and easy to understand, a putative ancestral haplotype is calculated for a given dataset and the repeat of each sample at each marker in the dataset is subtracted from this ancestral haplotype, this result is then cumulated and divided by the number of samples and the marker specific mutation rate, the process is repeated for every single marker in the dataset and the mean is then multiplied by an assumed years per generation length, the formula below articulates this method:
TMRCA formula (ASD method)
 
Where;
N= Total number of Samples
Z= Total number of Markers
L0= Putative Ancestral Haplotype (Median or Modal repeats)
L= Individual sample haplotype repeats
m= Marker Specific Mutation Rate
G= Years / Generation

The biggest variable here, other than the sampling strategy of a given dataset, are the several marker specific mutation rates that are available. The process of selection of a correct mutation rate is an unsettled issue, I have therefore utilized 4 sets of mutation rates that were compiled by Paul Newlin, a collaborator at the E3b Project, these rates come from several different publications and you can read about them here for more detail:
 
  1. The Chandler Mutation Rates:
  2. Stafford Bayesian Mutation Rates:
    Essentially a compilation of other mutation rates
  3. Burgarella & Navascués Mutation Rates:
  4. Ballantyne Mutation Rates:
In order to have an analogously accurate comparison of the TMRCAs between the different publications, I had to weed out and intersect the available markers from above with markers that are found in the public domain. This essentially left me with the following 46 markers that intersected with all 4 of the above sets of rates as well as the 66 markers that are widely used:
406s1 , 19 , 388 , 389-1 , 389-2 , 390 , 391 , 392 , 393 , 426 , 436 , 437 , 438 , 439 , 442 , 444 , 446 , 447 , 448 , 450 , 454 , 455 , 456 , 458 , 460 , 472 , 481 , 487 , 490 , 492 , 511 , 520 , 531 , 534 , 537 , 557 , 565 , 568 , 572 , 578 , 590 , 594 , 617 , 640 , 641 and gatah4.

In addition, since the Chandler mutation rates had a complete intersection with the 66 widely used markers, an additional 66 marker Chandler set was independently used that included the following markers in addition to the 46 listed above:
385a , 385b , 459a , 459b , 449 , 464a , 464b , 464c , 464d , ycaiia , ycaiib , 607 , 576 , 570 , cdya , cdyb , 395s1a , 395s1b , 413a and 413b.
  
Haplogroups A, E and J, cover well over 90% of the YDNA lineages found in Ethiopia. More specifically within these haplogroups, I was more interested in finding the TMRCA for A-M13, E-M35 and J1-M267, as these lineages cover over 70% but under 80% of said lineages, whereas the remaining 20-30% of lineages found in Ethiopia belong to E1b1*(x E1b1b,E1b1a1), other types of E lineages like E2 and E*, and some specific clades that belong to haplogroups B,T and J2.


E-M35
Here below are the ASD based TMRCA estimates in years before present (YBP) for the 4 sets of 46 marker mutation rates and the additional 66 marker Chandler mutation rates set. The graph indicates an average TMRCA for G=28 and G=33 when L0 = Median and Modal.

TMRCA estimates for E-M35 and subclades
Mean TMRCA estimate chart for E-M35 and subclades.
 
The raw data for the particular haplotypes used can be retrieved from here: http://www.haplozone.net/e3b/project/loadview/19.

The 180 E-M35 haplotypes are a composite of 60 E-M78 (20 V-13, 20 V-22 and 20 V-12), 60 E-M123 and 60 E-M81 haplotypes. The 128 E-M78 haplotypes are a composite of 49 V-13, 52 V-22 and 27 V-12 haplotypes. The reason for this phylogenetic balancing is because a disproportionate number of the E-M35 and M78 haplotypes in the primary data source, i.e haplozone, belong to haplogroup V13.

The high end TMRCA estimate for E-M35 using Chandler's mutation rates, 16,587 YBP, is the closest TMRCA estimate to that of Cruciani et. al of 20,900-23,900 YBP. There are also missing lineages of E-M35 in this dataset that are more prevalent in Eastern Africa (the putative homeland of the E-M35 mutation), namely, E-M293, E-V42, E-V92 and E-V6, the inclusion of these lineages would certainly bring up the TMRCA. On the low end, the Stafford rates estimate the TMRCA at a mere 8,819 YBP, which is not very realistic.

Here below are the current phylogenetic understandings of haplogroups E1b1b (EM215) and E1b1b1a (E-V68). Referring to the results above and tying them to the diagrams below, E-M78 is a subclade of E-V68 and E-V13,V22,V12 and V32 are in turn subclades of E-M78. E-M123 is a subclade of E-Z830 and in turn, E-M84 is a subclade of E-M123. E-M81 is a subclade of E-V257, where E-V257 and E-Z830 are united by the Single Nucleotide Polymorphism known as E-Z827.
Current understanding of E1b1b (M215) basic phylogeny

Current understanding of E1b1b1a (E-V68) basic phylogeny:


The most prevalent sub-lineage in Ethiopia from the unique lineages used in this experiment is E-V32, this lineage's predecessor, E-V12, is thought to have been born around Southern Egypt and back migrated to its original homeland as well as into western Sudan into places like Darfur, the haplotypes of the lineages represented in this exercise were mostly from the Arabian peninsula, an area not known to harbor diverse E-V32 lineages, and hence why the overall TMRCA across all independent mutation rate estimates seem a bit low (Range:2.98 -5.8 KYA, Mean:4.2 KYA, SD 0.831) relative to the proposed TMRCA by Cruciani '07 of 5.9-11.3 KYA. If more E-V32 lineages from Sudan and Ethiopia were added, I speculate that the TMRCA would considerably increase.


J1-M267
The primary data source for the haplotypes I used for this lineage can be found here : http://y-dna-j-m304.gen.or.at/archive/FTDNA_full-results-table/2012-06-14_Classic-Chart_67-markers_1171-members.htm

About half of the lineages in the database that belong to J1-M267 purportedly come from the Middle East and North Africa, the putative homeland of the mutation (the former more so than the latter). In addition, only one haplotype from Ethiopia was present in this dataset.

Here below are the TMRCA estimates for J1 and one of its major subclades, J1c3 (formerly known as J1e and defined by the SNP P58)

TMRCA estimates for J-M267 and subclades
Mean TMRCA estimate chart for J-M267 and subclades.
 
At the high end, the 46 marker Chandler mutation rates estimate the TMRCA for J1-M267 at 15,437 YBP, at the opposite end, Ballantyne's rates estimate it at 6,600 YBP. Again, Chandler's estimates seem more reasonable and in-line with most publications.

The Phylogenetic structure for haplogroup J1-M267 can be found here: http://www.isogg.org/tree/ISOGG_HapgrpJ.html

It is hard to come to any conclusions with respect to Ethiopian J1 lineages from this analysis, as so little are represented in this dataset, however, since we have quite a decent amount of haplotypes from the putative homeland of the lineage (unlike the previous case for E1b1b) the TMRCA, more especially using Chandler's rates, could be reliable.

Haplogroup A

A, is the closest lineage to the root of all living male lineages, it is fairly prevalent in Ethiopia at about 18%, in some cases in northern Ethiopia some populations can carry a particular sub lineage of A, currently known as A1b1b2b and formerly known as A3b2 and defined by the mutation M13, in excess of 40%. Unfortunately, due to the lack of private DNA testing participants from Africa, this haplogroup is not very well represented, nevertheless, I found 42 haplogroup A, 66 marker haplotypes from this database: http://www.familytreedna.com/public/Haplogroup_A/default.aspx?section=yresults

The TMRCA results for this dataset can be seen below,

TMRCA estimates for Haplogroup A and A1b1b2b
Mean TMRCA estimate chart for Haplogroup A and A1b1b2b

The 66 marker Chandler mutation rates set shows a high end TMRCA of 62,164 YBP, while Burgarella & Navascués rates estimate the low end at 21,212 YBP, a greater than 40 KY difference that can not be reasonably reconciled.

The 20 A-M13+ haplotypes show a TMRCA of 2889-6544 YBP, which at first glance seems ridiculous, but a closer look at the origin of these haplotypes reveals that all of them have origins outside of Africa, and may indeed have a more recent common ancestor.

Caveats and Projections
  • Chandler's rates seem to yield the most reasonable ASD based TMRCAs, this seems more especially true for older lineages.
  • A major portion of TMRCA analysis is finding confidence intervals, this analysis has not included this important segment of TMRCA calculations, but will look to do so in the future.
  • The analysis is limited due to sampling bias, and thus the results are skewed, but there is not much that can be done about that until more integral samples and haplotypes start appearing in the public databases.
  • SNP based TMRCA calculation methods have been recently used to date old lineages, a side by side comparison of STR based TMRCA calculations with these newer SNP based calculations for the exact same lineages would be useful.
  • Finally, going forward, TMRCA calculations should have some congruency with the more practical ancient DNA results, however, I have a feeling that we are far from getting any ancient DNA results from Ethiopia, or Africa, so a potential way out for now could be to peg TMRCAs (as a lower bound) with some of the ancient DNA lineages that are found outside of Africa and use the mutation rates that corresponds best with those results.
 
Special thanks to Steven Bird from the E3b forum for providing in depth knowledge in the area of YDNA STR analysis and application.

1 comment:

  1. Alas, it seems that all the distinctions that matter for providing solid evidence to inform the process of interpreting pre-history are swallowed by the disrepencies in mutation rates.

    For that purpose, it seems as if the key is finding some quality non-genetic means of calibrating the data.

    ReplyDelete