Monday, December 8, 2014

SNP vs. STR YDNA TMRCA Estimation

An interesting comparison of YDNA TMRCA estimates using the SNP counting method and STRs (with both pedigree and Zhivotovsky rates as well as rho and ASD methods) can be found in a recently published study.


The Y-chromosome tree bursts into leaf: 13,000 high-confidence SNPs covering the majority of known clades

Many studies of human populations have used the male-specific region of the Y chromosome (MSY) as a marker, but MSY sequence variants have traditionally been subject to ascertainment bias. Also, dating of haplogroups has relied on Y-specific short tandem repeats (STRs), involving problems of mutation rate choice, and possible long-term mutation saturation. Next-generation sequencing can ascertain single nucleotide polymorphisms (SNPs) in an unbiased way, leading to phylogenies in which branch-lengths are proportional to time, and allowing the times-to-most-recent-common-ancestor (TMRCAs) of nodes to be estimated directly. Here we describe the sequencing of 3.7 Mb of MSY in each of 448 human males at a mean coverage of 51 ×, yielding 13,261 high-confidence SNPs, 65.9% of which are previously unreported. The resulting phylogeny covers the majority of the known clades, provides date estimates of nodes, and constitutes a robust evolutionary framework for analysing the history of other classes of mutation. Different clades within the tree show subtle but significant differences in branch lengths to the root. We also apply a set of 23 Y-STRs to the same samples, allowing SNP- and STR-based diversity and TMRCA estimates to be systematically compared. Ongoing purifying selection is suggested by our analysis of the phylogenetic distribution of non-synonymous variants in 15 MSY single-copy genes. 

Link (Open Access)

(iii) The evolutionary STR mutation rate consistently overestimates, and the pedigree rate underestimates, the TMRCAs of nodes (Figure 4a). As expected, the pedigree mutation rate performs better for young nodes (<10 KYA; Table S6), while the evolutionary rate performs better for older nodes.

Of course, "overestimation" and "underestimation" here are both relative to the mutation rate the authors chose for the SNP counting method in the first place: the Xue (2009) estimate of 1 × 10^-9 per bp per year. A slower rate (from Poznik (2013) or Francalacci (2013), for instance) would yield older SNP-based TMRCAs and so reduce the apparent "overestimation" by the evolutionary STR mutation rate. Conversely, a faster rate would yield younger SNP-based TMRCAs and reduce the apparent "underestimation" by the pedigree mutation rate. It is also important to note that there is quite a bit of variance among the pedigree rates themselves; the authors chose a mean pedigree rate from YHRD (see the YTMRCA Calculator for how pedigree rates from different sources affect TMRCA estimation). All in all, this was an interesting exercise. I hope we get to see more comparisons of this type, especially with fossil-calibrated mutation rate estimates used for the SNP counting method.
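To make the dependence on the rate choice concrete, here is a minimal sketch of the SNP counting arithmetic. It assumes the simplest form of the method (TMRCA = SNPs accumulated since the node, divided by the expected number of mutations per year over the sequenced region). The 3.7 Mb region size comes from the abstract above; the SNP count of 100 and the two "slower" and "faster" rates are purely illustrative placeholders, not values from any of the cited papers.

```python
def snp_tmrca_years(mean_snps_since_node, rate_per_bp_per_year, sequenced_bp):
    """TMRCA in years under simple SNP counting.

    mean_snps_since_node: average number of SNPs between the node and its
        descendant tips (hypothetical input).
    rate_per_bp_per_year: assumed substitution rate.
    sequenced_bp: number of base pairs over which SNPs were called.
    """
    expected_mutations_per_year = rate_per_bp_per_year * sequenced_bp
    return mean_snps_since_node / expected_mutations_per_year


SEQUENCED_BP = 3.7e6   # 3.7 Mb of MSY, as stated in the abstract
SNPS = 100             # illustrative count, not taken from the paper

# Rates: 1e-9 is the Xue (2009) value quoted above; the other two are
# made-up placeholders standing in for "a slower" and "a faster" choice.
rates = {"slower (placeholder)": 0.5e-9,
         "Xue (2009)": 1.0e-9,
         "faster (placeholder)": 2.0e-9}

for label, rate in rates.items():
    years = snp_tmrca_years(SNPS, rate, SEQUENCED_BP)
    print(f"{label:22s} -> {years:,.0f} years")
```

Because the estimate is inversely proportional to the rate, halving the rate doubles every SNP-based node age, which shifts the whole x=y reference line against which the STR estimates are judged.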

Figure 4: Relationship between SNP- and STR-based TMRCA estimates. SNP-based node estimates are plotted against STR-based estimates for (a) 21 STRs, (b) 17 STRs and (c) 13 STRs, here using ASD with the ‘ancestral haplotype’ root specification. The black dashed line in each case indicates x=y. Underlying data and correlation coefficients are given in Tables S6 and S7.
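For readers unfamiliar with the ASD approach named in the caption, the following is a rough sketch, assuming the textbook form of the method: under a symmetric single-step mutation model, the average squared difference in repeat count between sampled haplotypes and the ancestral haplotype grows as (mutation rate × generations), so dividing ASD by the per-locus rate gives a TMRCA in generations. The haplotypes, the rate of 0.002 per locus per generation, and the function name are all hypothetical; this is not the authors' pipeline.

```python
def asd_tmrca_generations(haplotypes, ancestral, mu_per_locus_per_gen):
    """Estimate TMRCA (in generations) from the average squared distance
    between each sampled STR haplotype and an assumed ancestral haplotype.

    haplotypes: list of haplotypes, each a list of repeat counts per locus.
    ancestral: repeat counts of the assumed root haplotype (same locus order).
    mu_per_locus_per_gen: assumed mean STR mutation rate per locus per generation.
    """
    n_loci = len(ancestral)
    total_squared_diff = 0
    for hap in haplotypes:
        total_squared_diff += sum((h - a) ** 2 for h, a in zip(hap, ancestral))
    asd = total_squared_diff / (len(haplotypes) * n_loci)
    return asd / mu_per_locus_per_gen


# Toy data: four samples typed at three loci (values are invented).
ancestral = [13, 24, 14]
samples = [[13, 24, 14],
           [14, 24, 14],
           [13, 23, 15],
           [13, 24, 16]]

generations = asd_tmrca_generations(samples, ancestral, 0.002)
print(f"{generations:.1f} generations")
```

Swapping in a slower (evolutionary) or faster (pedigree) value for the rate rescales the estimate in the same inverse-proportional way as in SNP counting, which is why the choice between the two rate families dominates the comparison in the figure.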

UPDATE:
For further insight into the current understanding of the substitution rates used for the SNP counting method, I direct readers to the Wang (2014) article, which enumerates the four primary methods that have been used to calculate the substitution rate:
  1. Human-Chimp Comparisons: Thomson (2000), Kuroki (2006)
  2. Deep-Rooting Pedigree: Xue (2009)
  3. Autosomal Mutation Rate Adjustment: Mendez (2013)
  4. Founding Migrations Based Inference: Poznik (2013), Francalacci (2013)
In terms of inferences based on the Y chromosome TMRCA and the Out-of-Africa migration, the authors suggest that Xue (2009) and Poznik (2013) give the most reasonable estimates.

Comparison of different Y chromosomal substitution rates in time estimation using Y chromosome dataset of 1000 Genome dataset. Time estimations are performed in BEAST. (a) TMRCA of 526 Y chromosomes (including haplogroup A1b1b2b-M219 to T). (b) Time of Out-of-Africa migration, the age of macro-haplogroup CT. HCR-Thomson and HCR-Kuroki: Y chromosome base-substitution rate measured from human-chimpanzee comparison by Thomson et al. [6] and Kuroki et al. [7], respectively. Pedigree rate: Y chromosome base-substitution rate measured in a deep-rooting pedigree by Xue et al. [8]. Autosomal Rate Adjusted: Y chromosome substitution rate adjusted from autosomal mutation rates by Mendez et al. [9]. AEFM-America and AEFM-Sardinian: Y chromosome base-substitution rate based on archaeological evidence of founding migrations using initial peopling of Americas [10] and initial Sardinian expansion [11], respectively. Different reported mutation rates are given at the log scale. Confidence intervals for some of the mutation rates are very wide, and time calculations here use only the point estimate. The times would overlap more if all the uncertainties were taken into account. Figure was drawn using boxplot in R 3.0.2.

However, a fifth method, sequencing entire Y chromosomes from verifiably dated ancient individuals, is still in its infancy but gaining momentum, and should refine the substitution rate to a level of precision that has not been available so far. It remains to be seen whether it will corroborate the rates from the front runners (Xue (2009), Poznik (2013)) or yield unforeseen results.
