Thursday, April 5, 2012

A ‘‘Copernican’’ Reassessment of the Human Mitochondrial DNA Tree from its Root

Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies, misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neanderthalensis. This ‘‘Copernican’’ reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the core principles of common ancestry for extant descendants.

Source (Open Access)

Some quotes and figures from the paper: 

"Supported by a consensus of many colleagues and after a few years of hesitation, we have reached the conclusion that on the verge of the deep-sequencing revolution (47,55) when perhaps tens of thousands of additional complete mtDNA sequences are expected to be generated over the next few years, the principal change we suggest cannot be postponed any longer: an ancestral rather than a ‘‘phylogenetically peripheral’’ and modern mitogenome from Europe should serve as the epicenter of the human mtDNA reference system."

"Interestingly, the ranges of substitution counts within haplogroups M and N, which are hallmarks of the relatively recent out-of-Africa exodus of humans, are also very large. For example, within M there are two mitogenomes with 43 substitutions (in M30a and M44) and two mitogenomes with as many as 71 substitutions (in M2b1b and M7b3a). This is especially striking because the path from the RSRS to the root of M already contains 39 substitutions. Hence, the difference between the M root and its M44 descendant is only four substitutions (two in the coding region and two in the control region) as compared to 32 substitutions in the M2b1b and M7b3a mitogenomes. These observations raise the possibility that the tree in general, and haplogroup M in particular, might not adhere uniformly to the assumed molecular clock, under which substitutions occur at a fixed rate on all branches of the tree over time."

Some inferred dates of interest (from the supplemental file):
L3 : 67,262 (SD 4,434)
    M : 49,590 (SD 1,824)
        M1'20'51 : 47,641 (SD 2,851)
            M1 : 23,680 (SD 4,378)
                M1a : 19,183 (SD 3,226)
                    M1a1 : 12,910 (SD 3,341)
    N : 58,860 (SD 2,352)
        N1'5 : 56,547 (SD 4,705)
            N1 : 51,643 (SD 5,640)
                N1a : 18,118 (SD 5,247)

R0a1 :   20,766 (SD 5,754)
U6a1 : 20,133 (SD 4,941)
J1 : 26,935 (SD 5,273)
T : 25,149 (SD 4,668)
K : 26,682 (SD 4,339)



  1. One of the issues with molecular-clock-o-logy is that, while new mutations invariably appear each generation in the nuclear DNA (including Y-DNA), in the critically much shorter mtDNA mutations only appear every many generations (I estimate 1-10 Ka, averaging maybe 2.5 Ka, per coding-region mutation as a rule of thumb, while the control-region ones are very messy and hence best ignored).

    This fact creates a peculiar reality for any mtDNA molecular-clock-o-logy. For example, in (case A) a tiny population with an effective size of two (Ne=2), each time a mutation happens (very rarely, but now and then no doubt) its chances of fixation, and hence long-term survival, are high (50%).

    However (case B), as the population grows, the chances of any mutation becoming fixated decrease very fast: with Ne=10 they are only 10%, with Ne=100 only 1%, etc. And while mutations happen faster, the chance of there ever being more than one novel mutation per generation is not really larger: innovation will happen more often but, unless the population grows A LOT (well above Paleolithic levels), the chances of two mutations per generation remain effectively zero. That means that in populations that are large by Paleolithic standards the accumulation of novel mutations will become effectively frozen: they will happen, yes, but they will also be "drifted out" by the fixated dominant clade(s).

    I call this the "cannibal mum" hypothesis because, metaphorically, mum eats the daughters systematically, keeping the mtDNA evolution effectively frozen in that branch.
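    The fixation figures in cases A and B can be checked with a small simulation: under neutral drift in a haploid Wright-Fisher population, a single new mutant fixes with probability 1/Ne. This is a minimal sketch under that standard model (not anyone's published method; the population sizes, trial counts, and seed are arbitrary choices):

```python
import random

def fixation_probability(ne, trials=5000, seed=42):
    """Estimate the chance that a single new neutral mutant eventually
    fixes in a haploid Wright-Fisher population of constant size ne."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(trials):
        count = 1  # one new mutant lineage among ne
        while 0 < count < ne:
            p = count / ne
            # each of the ne offspring independently picks a parent at random
            count = sum(1 for _ in range(ne) if rng.random() < p)
        if count == ne:
            fixed += 1
    return fixed / trials

# Neutral theory predicts a fixation probability of 1/Ne,
# matching the 50% / 10% / 1% figures quoted above:
for ne in (2, 10, 100):
    print(ne, fixation_probability(ne))
```

    The estimates should come out near 0.5, 0.1, and 0.01 respectively, up to sampling noise.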

    Now, you are surely better than I am with maths: do you think that this non-numerical explanation makes sense? The difference in length between mtDNA lines can be very extreme and has so far been unexplained. I think this can be an explanation but I lack the mathematical "demonstration" so far.

    1. Well, I do think that the paragraph I quoted from the paper in this post with regard to haplogroup M does support some of your 'skeptic' arguments from the past, to an extent anyway. IMO there is really only one way to put this to rest, and that is by anchoring mtDNA found in as many old bones as possible, as lower temporal limits, into as many of the nodes used in the theoretical model of the molecular clock; but we need more bones from Africa as well as Eurasia for this, I suppose.

      As for a mathematical explanation of your 'cannibalism' hypothesis, let's first understand in a simple way how these mutations or substitutions are correlated with time. First, a method of calculating the substitution counts is outlined:
      “To calculate the substitution counts from the RSRS to every extant mitogenome (which is a tip in the mtDNA phylogeny), we summed up the number of mutations on the path leading to each noted haplogroup in the phylogeny and added to this the number of positions that differed between the tip and the root of the haplogroup.”
      ^ I think this is not too controversial, and I think you will not disagree.

      Then, you validate the molecular-clock assumption with these 'counts'. To do that they used the software PAML, which uses an ML method to infer branch lengths:
      “The generalized likelihood ratio (GLR) test for validity of the clock assumption then uses the test statistic 2 X (log-likelihood of non-clock model minus log-likelihood of clock model),”

      Where the non-clock model is fitted to the substitution counts from above. Under the null hypothesis that the clock holds, the above statistic follows a straightforward chi-squared distribution, and large deviations from it quantify how badly the clock assumption fails.
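      The test statistic itself is trivial to compute once the two maximized log-likelihoods are in hand. A toy illustration (the log-likelihood values and tip count below are invented for this example, not taken from the paper):

```python
def glr_statistic(loglik_nonclock, loglik_clock):
    """GLR test statistic for the strict-clock assumption:
    2 x (log-likelihood of non-clock model minus that of clock model).
    Under the clock hypothesis it is asymptotically chi-squared with
    df = (number of tips) - 2."""
    return 2.0 * (loglik_nonclock - loglik_clock)

# Invented log-likelihood values, purely for illustration:
stat = glr_statistic(-15034.2, -15078.9)
print(round(stat, 1))  # 89.4
# For, say, 40 tips (df = 38) the 5% chi-squared critical value is about 53.4,
# so a statistic this large would reject the strict clock.
```

      The degrees of freedom come from the parameter count: an unrooted non-clock tree with n tips has 2n-3 branch lengths, while a clocked tree has n-1 node ages, leaving n-2 constraints to test.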

      So to answer your question: in order to 'validate' your theory against the non-clock model, you have to model it (i.e. your theory) some way or another into an ML method, employing software like what they used, PAML for instance, in conjunction with Soares (2009)'s age-estimating method:
      "who suggest to estimate ages in substitutions and then transform them to years in a nonlinear manner accounting for the selection effect on nonsynonymous mutations"

      Maybe here, after validation, you can customize your 'selection effect' on the mutations to come up with some time estimate. Unfortunately I am not familiar at all with such software; it can be downloaded here, however:

      Maybe you could look into it.

  2. I can agree with Soares 2009 in principle, but I would count from the root (and not from the present), because not all extant haplotypes are 'tips' (the H(n) cases, for instance), nor do they all have the same distance, and they may have been stopped in their evolution by the "cannibal mum" process.

    (I'm not sure why an ML tree would be needed anyhow: the true tree is provided by the phylogeny itself, right?)

    But my real "problem" is to "prove" the "cannibal mum" hypothesis statistically correct with maths. Even if it is correct, it'd be interesting to know which ranges of Ne favor that behavior and which allow for a faster effective mutation rate, both in the low range (new mutations get fixated) and the high range (new mutations survive drift for long although they remain minor).

    Software is not that important here: what I need is a maths-oriented mind. But never mind, I'll ask a friend who just loves maths and physics (although his interest in genetics and prehistory is very low). He'll be glad to help me solve these doubts, I'm quite sure.

  3. Hurray for a nomenclature that makes the observation that the tree "in general, and haplogroup M in particular, might not adhere uniformly to the assumed molecular clock, under which substitutions occur at a fixed rate on all branches of the tree over time" natural and obvious instead of obscure.

  4. A "cannibal mum" hypothesis, of the kind that Maju suggests, would explain a discrepency between mutation rates observed in geneologically related groups of people (which we don't really have because mutations are too rare), and mutation rates estimated from longer term evolution compared to known benchmarks, but unless some lineages are more cannibalistic than others (possibly true in the Holocene, but implausible before then), this mechanism should have a fairly uniform effect only only impact the relationship between the true mutation rate and the adjusted one that we calibrate the mutation rates against in a fairly constant amount.

    1. The existence and rate of "lineage cannibalism" (or we could also call it "tyranny of the vast majority") depends on population size, so it does not have "a fairly uniform effect". If my understanding is correct: very small populations would have relatively fast mutation rates while larger ones would have near zero mutation rate (because of the 'cannibal mum' effect).

      The basic notion can be understood with dissident opinions (equivalent to novel mutations): if one out of three is a dissident, the status quo is actually being challenged but, if one out of 1000 is a dissident, the status quo is stable. As the populations are never large enough to host more than one mutant (mtDNA) at a time, the duality of cases is quite notable.

  5. "I'm not sure why an ML tree would be needed anyhow"
    You need that to validate/invalidate the molecular clock, by comparing ML inferences of branch length with and without the clock's assumption?
    Aren't you saying your theory invalidates the molecular clock at the end of the day?

    "Software is not that important here:"
    why reinvent the wheel? Your theory, if I understood correctly, is basically saying that mutation rates are not linear and that they may be a function of effective population sizes. This may be reasonable, but if there is software already out there that is able to tweak some of these existing variables, why not use it? The other problem is that you have to come up with an assumption for the Ne in the first place, as you don't really know what the Ne was when the mutation occurred. This could become circular if you are using the mutation information to infer Ne, unless you have an independent source for Ne inference, I'm guessing....

    1. What I'm saying is that my hypothesis (not yet a theory: it needs to be proven, at least to some extent, first) could explain why the molecular clock does not work ONLY in the mtDNA case (not re. Y-DNA or others, careful).

      It does not demonstrate the molecular clock wrong but could explain why it is actually not working properly, something mentioned in this paper and also in Pierron 2011 and other papers mentioned at the bottom of this entry. Another entry of relevance may be this one from 2009, where I mention Soares'09 but also other papers that discuss the problems of molecular-clock-o-logy in the specific case of mtDNA.

      I don't need to demonstrate that the MCH is problematic in human mtDNA; what I'm trying to figure out is why. Why are some lines longer and others shorter, as Howell already mentioned in 2002, not just within statistical probability but well beyond it? Of course one explanation could be adaptiveness: some haplogroups are so extremely well adapted that they can hardly change. Another could be the 'cannibal mum' hypothesis. Both would actually lead to the same practical effect: that at some point the molecular tic-tac stops or nearly stops for a particular lineage (but not others), which means that mutations must be counted from the root, not the tip.

      Always only for mtDNA, because nuclear DNA (incl. Y-DNA) is much, much larger in size and mutations actually happen in every single generation.

      "... why reinvent the wheel?"

      Because this is more like a complex car than a wheel, and I do not know how the entrails work; I'm not even sure if it makes any sense to use it for my purposes, the same way we would not use a tractor for a car race.

      "The other problem is that you have to come up with an assumption for the Ne in the first place, as you don't really know what the Ne was when the mutation occurred"...

      First of all I want to work out the assumptions: for each Ne value (or range of values), how does this work? How long would a novel mutation take to happen for each Ne value range, and what would be the chances of the new lineage surviving drift for those Ne value ranges? That's what I need to calculate first of all.

      So maybe your car is useless for what I want, what I need at this stage.

    2. Ok!
      So, on another note, the chart in this post shows that about 30 of the 50 substitutions likely happened in Africa (i.e. at minimum up to L3). Therefore, if the substitutions were converted into time in a linear manner, then the L3 node should be at 0.6 X 177 = ~106 KYA; however, their calculations report L3 ~ 67 KYA. Thus some sort of non-linearity is obviously built into their age-determining system (Soares (2009)?). What exactly is the difference between their non-linear model and the non-linearity your hypothesis proposes?

      this is Etyopis btw, I don't have access to my login right now...

    3. That graph seems to be wrong: you can count the mutations yourself (with patience) at PhyloTree (excepting the private "tip" lineages, but good enough). In the L3 line for example (CR stands for "coding region"):
      (1) L1-6: 5 CR (+3 HVS)
      (2) L2-6: 4 CR (+1 HVS)
      (3) L2'3'4'6: 7 CR (+5 HVS)
      (4) L3'4'6: 2 CR
      (5) L3'4: 3 CR (+2 HVS)
      (6) L3: 2 CR (+1 HVS)

      Total RSRS-L3: 23 CR (+12 HVS) = 35 (all)

      Not 30 or 31 in any case. Also, the graph should allow for mutations up to 74 (?) (in some lines) but only allows for 52 (the modal value). So it's kind of confusing.

      The lines in M and N have counts between c. 42 and c. 74 in fig. S2, while even the shortest sets (L0 and L1) have lines reaching up to >60. This graph seems to have been cut at averages (or rather modes), but you can't average the distance from the RSRS to L3, can you?
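      The tally above can be double-checked with a few lines of Python; the per-step counts are simply transcribed from the PhyloTree figures listed in this comment (if those counts are right, the totals follow):

```python
# Per-step coding-region (CR) and HVS mutation counts on the RSRS-to-L3
# path, transcribed from the PhyloTree tallies listed above:
path = [
    ("L1-6",     5, 3),
    ("L2-6",     4, 1),
    ("L2'3'4'6", 7, 5),
    ("L3'4'6",   2, 0),
    ("L3'4",     3, 2),
    ("L3",       2, 1),
]

cr = sum(c for _, c, _ in path)
hvs = sum(h for _, _, h in path)
print(cr, hvs, cr + hvs)  # 23 12 35
```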

      "what exactly is the difference in their non-linear model versus the non-linearity your hypothesis proposes?"

      Not sure. I have been using linearity (for simplicity) and counting from the root but aware that it's an imperfect and just tentative method. But I do not think it's worse than equalizing all branches and ignoring their differences.

      A clear example is apparent here (from Pierron 2011). If you average everything, you force H to be recent and U to be old, just because U has accumulated more mutations in all or most downstream lines, while HV (but not the smaller "sister" R0a) remained almost frozen since coalescence (possibly because it was large and stable since early on, at least according to my hypothesis).

      With normal MC methods U "is" older than HV but, if we consider my hypothesis and count from the root (linearly for the moment, as I lack a satisfactory adjustment method), HV could be older than U. After all, U shows three CR mutations at the root, the sign of a time when pre-U was small and heavily pruned (bottleneck), while R0 shows branches at all mutational levels (no bottleneck, continuous expansion), and both must have started their overall existence roughly at the same time (the R node).

    4. Yes, you are right here; I also counted 35 total substitutions for L3. I think the reason the figure looks off is the scaling with respect to the accumulated mutations: the further you move to the right on the X axis, the less accurate it is. For instance, for L5 the plot shows 17, where in actuality there are 20 substitutions, and for L1 it shows 12, where in actuality there are 14. So I think it is meant as a rough graphical depiction rather than something fully accurate; thanks for bringing it to my attention.
      Still, though, even at 35 substitutions for L3, and assuming the maximum is 74, that is still 0.47 X 177 = ~84 KYA, so they still seem to be doing some non-linear adjustments, even though they are not branch-focused, or branch-specific, adjustments.

    5. Actually, I'm doing it wrong, apologies,

      Let's say the total time, or time since inception =Ti
      Let's say, total accumulated substitutions = Nt
      Node Accumulated substitutions = Ns

      then linear time to node, Tn= [1-(Ns/Nt)] * Ti, when Ns is NOT equal to Nt.

      so, for the above example, to find the Tn of L3, assuming
      Ti = 177

      then, Tn= ~93Kya
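      That arithmetic can be packaged as a tiny function (purely illustrative, just the linear model defined above):

```python
def node_age(ns, nt, ti):
    """Linear model from the comment above: age of a node that has
    accumulated ns of the nt total substitutions on a lineage whose
    time since inception is ti (in Ka)."""
    return (1 - ns / nt) * ti

# L3 example: Ns = 35, Nt = 74, Ti = 177 Ka
print(round(node_age(35, 74, 177), 1))  # 93.3
```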

    6. 93 Ka could be a reasonable age estimate for L3, IMO; however, it all depends on the value of Ti, which is the point of calibration. It also depends on whether there are non-linearities, which (as far as I can tell) can only be inferred by using various calibration points, and those can't but be somewhat subjective choices.

    7. With the linear model, if we assume that Nt and Ns remain constant for L3 (i.e. 74 and 35 respectively), then to arrive at a Tn value of 60 KYA for the L3 node, Ti has to be 113.85; similarly, to arrive at Tn = 70 KYA, Ti has to be 132.82. So, basically:

      Ti = Tn/[1-(Ns/Nt)] for Ns<Nt

      So I made a graph of this linear relationship for Tn of L3 (0-210 KYA) and Ti, using those Ns and Nt constants,
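      The inverted formula reproduces both worked examples; this is just a sketch of the same linear model, nothing more:

```python
def inception_time(tn, ns, nt):
    """Invert the linear model: Ti = Tn / (1 - Ns/Nt), valid only when Ns < Nt
    (Ns = Nt divides by zero; Ns > Nt would give a negative, impossible Ti)."""
    if ns >= nt:
        raise ValueError("requires Ns < Nt")
    return tn / (1 - ns / nt)

# The two worked examples for L3 (Ns = 35, Nt = 74):
print(round(inception_time(60, 35, 74), 2))  # 113.85
print(round(inception_time(70, 35, 74), 2))  # 132.82
```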

    8. But "a Tn value for the L3 node of 60 KYA" is circular logic: this date has no basis other than molecular-clock-o-logy: it cannot be a calibration point.

      I understand that Tn(M)=80Ka (or 74 Ka if you think Toba superexplosion may have factored in) is a clear calibration point. Others are less precise or more arguable.

      Star-like structures mark moments of rapid expansions and may coincide with arrival to new niches or improved climatic conditions. Before M, in the African part of the mtDNA tree, there are (or at least were when I wrote this note) just a few star-like structures: L3, L3e1, L1b1a and L2a1c'd'e'h'i'j'k [maybe there are more now or have been renamed, I have not rechecked yet].

      The largest one is L3 (the others are actually contemporary with the Eurasian expansion), which has seven branches, five of them in Africa and very close to the main node (short stems). So I'd calibrate L3 early or at the best of the Abbassia Pluvial, i.e. 125-100 Ka. But this is not as precise.

      I'd say that the oldest West Eurasian nodes, which may be R0 or U, could be calibrated to c. 50 Ka ago (55 Ka maybe).

      These are the calibration points I can figure out, and we need such calibration points to arrive at a refined and somewhat valid mtDNA chrono-estimate. All the rest is futile mathematical exercise without anchor points.

    9. Did you even see the graph? I just gave those dates as example points in the linear-relationship model; the graph includes Tn values for L3 from 0-210 KYA. Also, if you really want to calibrate based on bones, then Ti can not be any older than 195 KYA (Omo), which means that, based on the linear model, Tn for L3 can not be older than 103 KYA. But we also know that Soares (2009), the most up-to-date mtDNA clock that geneticists are using nowadays, would clearly disagree with this date.

    10. Correction: I meant that Ti can not be any younger than 195 KYA, and L3 can not be younger than 103 KYA.

    11. I did see the graph but probably did not understand it: my impression was that the only "valid" results would be where Ns<Ti<Nt, but that was probably not what you meant, right?

      "And also, if you really want to calibrate based on bones, then Ti can not be any older than 195KYA (omo)"...

      I was about to say something similar in my previous post, but actually we do not know for sure, because the MRCA could in fact be older or younger than Omo. Technically, mtDNA Eve does not even need to have been a fully evolved Homo sapiens: she could have been a Homo rhodesiensis/ergaster/whatever (even if I can agree it's not very likely, it's not impossible either). She could have been more recent as well. So it's a very loose calibration point at best, a vague reference.

      Gotta go. A man was killed in my town and there is a homage in 10 minutes.

    12. " I did see the graph but did not understood it probably: my impression was that the only "valid" results would be where Ns < Ti < Nt but that was probably not what you meant, right? "

      I think I understand your source of confusion, when i wrote for Ns < Nt, what i mean is that the equation:

      Ti = Tn/[1-(Ns/Nt)];

      is only valid for values of Ns < Nt, because mathematically, if Ns > Nt, then Ti will be a negative number, which would be impossible; and if Ns = Nt, then Ti will involve a division by zero, which again is impossible. So Ns can only be less than Nt, which also makes sense logically, since you can not have more accumulated substitutions up to any given node than the accumulated substitutions to any given tip.

      "Technically mtDNA Eve does not even need to have been a fully evolved Homo sapiens"

      this is true, but she most likely was.

      "A man was killed in my town and there is a homage in 10 minutes."

      Sorry to hear that.

    13. The blame for my confusion about your graph is mine alone: I did not look at the matter with enough interest, and the horizontal Ns and Nt lines confused my perception.

      I can see now that, according to your equation, a Tn(L3)=125-100 Ka (as I suggested) requires a Ti of c. 190-240 Ka.

      In rounded, simplified terms: if Ti = 200 and Ns/Nt = 0.5, then Tn(L3) = 100 Ka.

      The actual values may vary quite a bit but should oscillate not really too far from these.

      But this requires two corrections:
      1. We do not know the actual value of Ti
      2. Non-linearity: not just a generic, arbitrary adjustment like some good-looking logarithmic curve, but a well-refined curve with calibration points.

      I have proposed three calibration points:

      1. L3 in the Abbassia Pluvial (125-90 Ka.), rather not too close to the end.
      2. M c. 80 Ka (or 74 Ka.)
      3. Oldest West Eurasian nodes (N1, U, R0 - I think) c. 55-50 Ka.

      For greater certainty as well, I'd only use CR mutations (or at least count CR and HVS transitions separately, as they are quite obviously not equivalent).

      · RSRS-L3: 23 CRM (coding region mutations)
      · L3-M: 3 CRM (total: 26)
      · L3-(R0/N1/U): 7-9 (total: 30-32)

      The real problem actually comes when counting towards the present: for example, N1-to-present oscillates (per PhyloTree) between 6 and 18 mutations (x3 variation), U-to-present between 6 and 17 (x3 again) and R0-to-present between 1 and 12 (x12!!!)

      Why would any line become "stuck" at a configuration just 7 CR mutations downstream of L3, when others are double or even more than triple that figure? That's the real problem IMO, and that's why I imagined the "cannibal mum" hypothesis, according to which some lines become stuck by mere drift (given certain demographic parameters). However, I have not yet qualified this hypothesis in mathematical/statistical terms (but I have asked a math-inclined friend to help me with it, and hopefully we'll have something later today), so I may well be wrong.

      If so, I'd have to think that mtDNA evolution is not neutral but conditioned by fitness values (R0, or at least the HV variants, would be better fit than other West Eurasian lineages and hence beat the random purity of statistics with their intrinsic biological fitness). After all, mitochondria are very important in cell metabolism: pure neutrality may be the baseline null hypothesis, but it may also simply be incorrect.

  6. Hello, I don't understand how the mtDNA community uses the software that Mr. Behar has put on their site in order to define the haplogroups... who has used this software?
    And what do you think about how to define haplogroups just from HVS1?
    Thank you

    1. I don't understand your first question. As to your second question, coding-region mutations are more reliable when they are available. I believe Maju once made an analogy that coding-region mutations are analogous to Y-DNA SNPs and those on the HVS are analogous to Y-DNA STRs.

      As far as this paper goes, they performed their likelihood analyses on two sets of sequences: one with only the coding region and another with the entire molecule.