Saturday, January 5, 2013

TMRCA calculations from Plaster NRY data : Correcting an Error

Previously, I had computed TMRCAs for the YDNA STR data from the additional material that was provided along with Dr.Chris Plaster's thesis. However, after a brief communication with the author, I found out that the marker order of the STRs in the excel file was reported wrongly, the correct order for the markers are thus as follows:

DYS19 DYS388 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS437 DYS438 DYS439 DYS448 DYS456 DYS635 Y GATA H4

This changes my TMRCA calculations because I am not computing the coalescent using a generic mutation rate that is equivalent for all the markers, but rather each marker has its own mutation rate attributed to it.

When I rerun my program using the newly corrected order above I get the following:

As can be seen, using the new order of markers generally reduces the number of generations to coalescent for the Plaster data-set. The previous observation of a relatively lower TMRCA for the haplozone data of E-M123 versus that of the E-M34 Plaster data-set largely disappears. 

To check if the fact that the high number of samples (129) present in the E-M123 haplozone data-set was skewing the results, I took 23 random samples (which equals the same number of samples available in the Plaster E-M34 data-set) from the larger E-M123 Haplozone dataset and re-run the TMRCA calculations on just those samples, I repeated this process 300 times, only 28% of the runs yielded a mean TMRCA less than the E-M34 Plaster data-set, if sample size was skewing the results I would expect >50% of the runs to have a mean TMRCA less than that of the E-M34 plaster dataset.

That said, the E-M34 Plaster data-set still had a relatively higher generations to coalescent than the E-M84 Haplozone dataset, E-M84 is a subclade of E-M34 and a high majority of haplotypes that belong to E-M34 also test positive for the E-M84 SNP (at least for the non-African E-M34 haplotypes that we know of).

Other than that, the new, and corrected, ordering of the markers did not have much impact in relative TMRCA terms between the Plaster and Haplozone/FTDNA data for the other lineages I had tested.


  1. E1b1b-M35 is just ~5000 years old? I think that is totally off the mark by A LOT. We are talking of a widespread African and Mediterranean haplogroup with many variants that simply cannot have spread since such a recent date as the Chalcolithic (a convenient Iberian chronology reference) or Bronze Age (an also convenient Aegean chronology reference).

    It'd mean that a descendant of it, E1b1b1b1a-M81, had a massive founder effect in NW Africa and then expanded to all Western Iberia and even localities as far as Britain long after that date (Bronze Age? Iron? Roman era?). It'd mean that another descendant, E1b1b1a1b-V13, reached Greece and Albania, where it became very common, almost dominant, and spread through most of Europe (at low frequencies) also already in the Iron Age or so. It just doesn't seem plausible for what I know of prehistory, much less with the extremely low levels of African (even if just North African) autosomal genetics associated to them in Europe, which seem to demand particular "filters" (repeated admixture) before their European scatter.

    Being constructive, I'd say that the most recent possible calibration point would be, for arrival to Europe, in the Neolithic, where there was still room to make clear founder effects. In this sense, it must be reminded that E1b1b1a1b-V13 has been detected in Neolithic Catalans of c. 7000 years ago, along with G2a, and this must have been older in Greece and Albania (c. 9-10,000 years ago).

    So if your would be, estimated, '3000 years ago' is actually 10,000, then you should multiply all figures times three or four, if you want to get real. By "you" I mean Etyopis and Plaster alike. Producing a rough corrected estimate, everything else equal, for E1b1b of at least 15-20,000 years ago.

    Similarly for J1-M267, if you get something like 12,000 years, the real thing would be rather 48,000 years, which is coherent with the way I imagine the J1 scatter in Africa and West Asia (based on correlations with material prehistory, i.e. structured archaeology).

    Sorry but a lot of reality check is needed in these matters and with E1b-V13 we have aDNA references that simply do not allow for these short chronologies you are toying with.

    1. Using 30 years/Generation, E-M35 ~ 6-8KYA, according to the above, I'm not sure where you got 5 KYA from.

      But in any event, I have explained earlier that the purpose of the exercise is to find out the relative ages of the different data-sets and not the absolute dates. If I used more markers, the generations to coalescent would change, but I am limited with the number of markers I can use because I need overlap with the different data-sets.

      In addition the mutation rate sets are all pedigree and/or father-son, effective rates (as championed by Zhivotovsky for instance) are not utilized in my program, although I plan to do so in the future.

    2. You're right 6-7 Ka is what I get now from the data above. I probably used a 25 years per generation figure the other day, sorry.

      "But in any event, I have explained earlier that the purpose of the exercise is to find out the relative ages of the different data-sets and not the absolute dates".

      I was thinking something like that but, in any case, if a dataset is sufficiently representative of the overall haplogroup diversity, i.e. it has people from two or more early divergent branches within that haplogroup, we should get something very similar if not identical to the age of the whole haplogroup, right?

      "In addition the mutation rate sets are all pedigree and/or father-son"...

      That explains a lot.

    3. “i.e. it has people from two or more early divergent branches within that haplogroup, we should get something very similar if not identical to the age of the whole haplogroup, right?”

      Remember that I am using the diversity of the repeats (STR) for a given dataset in which all samples have already been SNP tested, and then attempt at using this diversity to come up with a TMRCA by applying custom mutation rates, since each marker has its own behavior: some erratic, some stable, some slow and others fast.

      So say for instance you have STR from a dataset that has all been SNP tested for E-M35, we can assume it is from several different subclades of E-M35 but we don't really know since no further downstream testing has been done, then for a second case, say you have another Y-STR dataset that has all been SNP tested for E-M78 and again no further downstream SNPs tested, since we know phylogenetically that M78 is downstream of M35, we would expect the M78 dataset to have less STR diversity, and hence a lower TMRCA, than the M35 dataset simply because the M78 dataset would have had less time to accumulate STR diversity than the M35 dataset, so this is one case of relative comparison of TMRCA's, the phylogenetic case, but then you have another case, the spatial case, in this case, say you have 2 datasets that have been tested for the SAME SNP but the samples come from different locations (I use locations here, but it could be other things too, like language, subsistence, or whatever) then you check the diversity and TMRCA of these 2 datasets, logically, in the simplest and most parsimonious terms, which ever dataset has the more diverse, and hence older TMRCA, would have had to harbored that SNP longer than the other dataset, and hence could potentially be the origin. In reality though things can be a lot more complex, as some of the STRs and how you combine them with other STRs can have tremendous effect on the age and observed diversity, so it could be hard to gauge WHY they are really appearing so diverse (or not so diverse).

      P.S. to come up with Zhivotvsky based TMRCA's just multiply the Pedigree TMRCA's by 2.5-3.5, roughly, I have not seen marker specific mutation rates for Zhivotovsky, but I have tried incorporating the 0.00069 mutation rates for all the markers uniformly in my program and that is what I roughly get.

  2. Zhivotovski uses a correcting parameter (a constant) to account for demographic alterations (the pedigree rate, assuming it's correctly measured) would only apply if the novel clades (as we measure them with SRY markers - not the best possible measure) would expand infinitely fast. In reality some do and some go extinct and some remain stable and sometimes they do contract, etc. In other words: not all of Chandler's observed father-son rates survive, much less in the long term. So Dr. Z. figured out a constant correction (an estimate in itself) to compensate for all that, which produces, as you say figures roughly x3 longer (which is reasonably correct in many cases, maybe still a bit too short, when compared with the archaeological calibration references).

    The problem with the pedigree rate is that it is self-referential and then also now X measures this and then Y measures that and claims that the actual mutation rate is doubly slow than Chandler's estimates... even Dienekes admitted in the recent past to that and kicked that Russian crazy guy he used to follow, which was his name?, from his blog with angry manners.

    However he recently returned to those rates in an entry about Native Americans, who according to the author of the paper would follow the obsolete model of "Clovis first", based on their estimates.

    There are deep problems with STR-based age estimates (probably caused by the "cannibal mum" issue where large pops. become "conservative" that I have outlined in other cases re. mtDNA - notice that this would not happen if the whole Y-chromosome would be sequenced instead) and the Z. correction is just a decent but week attempt (made in 2004) to establish a better approximation and that way save the possibilities of the molecular clock.

    Probably the most realistic correction rate to be applied is not linear as Z's constant but exponential or semi-exponential, so the differences with the pedigree rate increase with time - but still there would be many exceptions as populations are not homogenenous and a small population of a few dozen in Siberia will not behave the same as a large one of tens of thousands, maybe more, in the Tropics. The latter will be much more diverse but also more "conservative" overall, making almost impossible for any new clade (measured in mtDNA mutations or Y-DNA STR markers - but not in Y-DNA actual SNP mutations because the Y-crhomosome is long enough for at least one mutation to happen in be added each generation no matter what, defeating the "cannibal dad" hypothesis) to consolidate. Instead in small pops. it all depends on founder effect randomness.

    1. " Russian crazy guy he used to follow, which was his name?"
      I believe you are reffereing to Anatole Klyosov :

      I don't think he is crazy as in the bat-shit type, you can see his credentials, he just has crazy ideas, you can have crazy ideas without being crazy.

    2. Yea, Klyosov. I strongly disagree with him and his IMO misplaced enthusiasm, however when I say "crazy Russian" I did not mean as blatant disqualification but just a way of talking. "Crazy Russian" is an stereotype almost and in some cases it applies better than others. It's meant rather in the artistic "overly creative", "overwhelming" sense than the disqualification of his mind as intrinsically delusory (not more than normal)... but mostly just is another way of saying "whatshisname" - nothing else.

  3. How much work would you have to go through to create a regional analysis of E1b1a lineages?

    1. You mean compare STRs of E1b1a from different regions in Africa ? Not that much work, if the STRs are arranged in a file that looks like what is in one of these sheets,, then all I have to do is feed it into my program. My program is limited by the following 49 markers however:
      406s1 , 19 , 388 , 389-1 , 389-2 , 390 , 391 , 392 , 393 , 426 , 436 , 437 , 438 , 439 , 442 , 444 , 446 , 447 , 448 , 450 , 454 , 455 , 456 , 458 , 460 , 472 , 481 , 487 , 490 , 492 , 511 , 520 , 531 , 534 , 537 , 557 , 565 , 568 , 572 , 578 , 590 , 594 , 617 , 640 , 641, gatah4, 576, 570, 449.

      Any markers outside of these I would not be able to process as I don't have overlapping mutation rates for them from my 4 sources. There is a similar but more sophisticated program called YTIME that does what mine does and beyond, it is written in MATLAB : , I haven't played with it much but it can calculate confidence limits, i.e. upper-lower bounds for TMRCAs, which mine can't, it however calculates the TMRCA using the same method (ASD) as mine.