Thursday, February 21, 2013

The Zhivotovsky Multiplier


It is reported that Zhivotovsky's effective mutation rate [1] increases the TMRCA of a lineage, as computed from microsatellite genetic distances [2], by a factor of 3-4 relative to TMRCAs computed with mutation rates observed in pedigree and family studies [3].

Using my TMRCA calculating program, I want to explore:
  1. What effect do different marker combinations have on this multiplier?
  2. What effect does marker size (the number of markers used) have on this multiplier?
  3. Does this multiplier vary between data-sets?

First, I needed to make sure that my program calculates the TMRCA correctly when the Zhivotovsky mutation rate of 0.00069 is applied uniformly to all the markers in my database (instead of the marker-specific pedigree mutation rates I have used thus far). To do so, I attempted to replicate the TMRCA computations of the following publication:




One reason I chose this article, in addition to its use of Zhivotovsky's effective mutation rate, is that it had the most complete STR profiles supplied along with the paper. The article states the following:

The following eight loci, DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393 and DYS439, were used to estimate expansion times using the methodology described by Zhivotovsky et al.21 as modified according to Sengupta et al. A microsatellite evolutionary effective mutation rate of 6.9 × 10^-4 was used.

All eight of these loci are available in my program's database. The modification according to Sengupta et al. that the article refers to consists of two things: (1) using the median repeats for the ancestral haplotype instead of the modal repeats, which is fine since my program computes both scenarios separately, and (2) a change in how the lower/upper bound estimates are computed, which does not matter here, as I am only interested in the central estimates of TMRCA (at least for now, although it would be interesting to see how the upper/lower bounds differ when estimated with effective vs. pedigree rates).
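The median-versus-modal choice can be made concrete with a minimal sketch. It assumes the estimator is the average squared distance (ASD) of each haplotype's repeat count from the ancestral repeat count, averaged over loci and divided by one per-generation mutation rate. That is a common reading of the Zhivotovsky/Sengupta approach, not the program's actual code, which is not shown here. The function name `asd_tmrca` and the toy haplotypes are made up for illustration.

```python
from statistics import median, mode

ZHIVOTOVSKY_RATE = 0.00069  # effective rate per marker per generation (from the text)


def asd_tmrca(haplotypes, rate=ZHIVOTOVSKY_RATE, founder="median"):
    """Estimate TMRCA in generations as ASD / rate (assumed estimator).

    haplotypes: list of equal-length lists of repeat counts, one list per sample.
    founder: "median" or "modal", the rule used to pick the ancestral repeat per locus.
    """
    pick = median if founder == "median" else mode
    n_loci = len(haplotypes[0])
    per_locus = []
    for j in range(n_loci):
        repeats = [h[j] for h in haplotypes]
        ancestral = pick(repeats)
        # Mean squared distance of every sample from the ancestral repeat at this locus.
        per_locus.append(sum((r - ancestral) ** 2 for r in repeats) / len(repeats))
    asd = sum(per_locus) / n_loci
    return asd / rate


# Toy data: 4 samples typed at 2 loci (made-up values, not from any data-set).
toy = [[14, 13], [14, 13], [15, 13], [13, 13]]
t_median = asd_tmrca(toy, founder="median")
t_modal = asd_tmrca(toy, founder="modal")
print(round(t_median, 2), round(t_modal, 2))
```

Under this estimator the two founder rules give the same answer whenever the median and modal repeats coincide at every locus, which would be consistent with the identical Generations(Median) and Generations(Modal) values in the runs reported in this post.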


The full Chiaroni et al. (2009) J-P58 STR data can be downloaded from the “Supplementary info section”.

After importing the data, I tweaked the following 17 haplotypes that contained null values for one or more markers: [xEJ_C5, J1_E1, J1_H5, st_2164, st_2149, J1_C2, Tanta100, 2, 44, 59, 82, 115, 117, 147, 148, 158, 170] by replacing the missing values with the modal repeats for the entire data-set. I ended up with 453 J-P58 haplotypes, the same number as reported in Table 1. When I ran these haplotypes in my program, I got:
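The null-value tweak described above can be sketched as follows. This is only an illustration of modal imputation, assuming missing values are represented as `None`; the function name `fill_missing_with_modal` and the toy haplotypes are hypothetical and not taken from the actual data.

```python
from statistics import mode


def fill_missing_with_modal(haplotypes):
    """Replace None entries with the modal repeat of that marker across the data-set."""
    n_markers = len(haplotypes[0])
    # Modal repeat per marker, computed from the samples that do have a value.
    modal = [
        mode([h[j] for h in haplotypes if h[j] is not None])
        for j in range(n_markers)
    ]
    return [
        [modal[j] if h[j] is None else h[j] for j in range(n_markers)]
        for h in haplotypes
    ]


# Toy data: 4 samples, 3 markers, two missing values (made-up numbers).
toy = [[14, 13, 23], [14, None, 23], [15, 13, None], [14, 12, 24]]
filled = fill_missing_with_modal(toy)
print(filled)
```

One design consequence worth keeping in mind: a value filled in with the modal repeat contributes zero distance from a modal ancestral haplotype, so under a distance-based estimator this kind of imputation would tend to nudge the TMRCA slightly downward rather than upward.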

Dataset:J1e_Chiaroni2
Marker list:8_Chiaronimarkerlist
Sample size:453

Pedigree/Familial Rates Summary
Years/Generation:28 - 33
TMRCA Range:2648 - 4057
Mean TMRCA:3382
Median TMRCA:3368
SD:444

Coalescent_Detail =

{
[1,1] = Chandler;8 Markers Generations(Median)--122.94 Generations(Modal)--122.94
[1,2] = Stafford;8 Markers Generations(Median)--108.45 Generations(Modal)--108.45
[1,3] = Burgarella_Navascues;8 Markers Generations(Median)--117.65 Generations(Modal)--117.65
[1,4] = Ballantyne;8 Markers Generations(Median)--94.601 Generations(Modal)--94.601
[1,5] = Zhivotovsky;8 Markers Generations(Median)--401.11 Generations(Modal)--401.11
}

Since Zhivotovsky uses 25 years/generation, this means 401.11 × 25 = 10,027.75 years ago for the TMRCA of the J-P58 haplotypes according to my program. The study reports a central estimate of 10,100 years ago for the same haplotypes, so my program is within 0.7%, or less than 3 generations, of the published value. Part of the difference could be due to how I tweaked the 17 haplotypes mentioned above.
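The comparison against the published estimate is simple arithmetic, spelled out here so it can be rerun for other data-sets. The numbers are the ones quoted above; the variable names are just for illustration.

```python
YEARS_PER_GENERATION = 25   # Zhivotovsky's generation length, from the text
generations = 401.11        # program output for the 453 J-P58 haplotypes
published_years = 10_100    # central estimate reported by the study

years = generations * YEARS_PER_GENERATION
diff_years = published_years - years
diff_percent = 100 * diff_years / published_years
diff_generations = diff_years / YEARS_PER_GENERATION

print(f"{years:.2f} years, off by {diff_percent:.2f}% ({diff_generations:.2f} generations)")
```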

To double check, I retested my program with the J1e-YCAII 22/22 haplotypes (a subset of the J-P58 haplotypes from above). However, I could only retrieve 202 of these haplotypes, versus the 203 reported in Table 1. In any event, for those particular haplotypes, I get the following results:

Dataset:J1e_Chiaroni2_YCAII22_22
Marker list:8_Chiaronimarkerlist
Sample size:202

Pedigree/Familial Rates Summary
Years/Generation:28 - 33
TMRCA Range:2260 - 3574
Mean TMRCA:2921
Median TMRCA:2946
SD:415

Coalescent_Detail =

{
[1,1] = Chandler;8 Markers Generations(Median)--108.31 Generations(Modal)--108.31
[1,2] = Stafford;8 Markers Generations(Median)--91.632 Generations(Modal)--91.632
[1,3] = Burgarella_Navascues;8 Markers Generations(Median)--102.45 Generations(Modal)--102.45
[1,4] = Ballantyne;8 Markers Generations(Median)--80.719 Generations(Modal)--80.719
[1,5] = Zhivotovsky;8 Markers Generations(Median)--365.91 Generations(Modal)--365.91
}

Again, for 25 years/generation, this comes out to 365.91 × 25 = 9,147.75 years, which is off by 0.6%, or slightly more than 2 generations, from the central estimate that they report for those same haplotypes in Table 1.

I also used the 8 markers from the publication above to compute TMRCAs for E1b1b and J-P58 STR data from FTDNA, with the following results:


Dataset:J1c3
Marker list:8_Chiaronimarkerlist
Sample size:256

Pedigree/Familial Rates Summary
Years/Generation:28 - 33
TMRCA Range:3636 - 5604
Mean TMRCA:4810
Median TMRCA:4752
SD:659

Coalescent_Detail =

{
[1,1] = Chandler;8 Markers Generations(Median)--169.6 Generations(Modal)--169.6
[1,2] = Stafford;8 Markers Generations(Median)--161.59 Generations(Modal)--161.59
[1,3] = Burgarella_Navascues;8 Markers Generations(Median)--169.85 Generations(Modal)--169.85
[1,4] = Ballantyne;8 Markers Generations(Median)--129.86 Generations(Modal)--129.86
[1,5] = Zhivotovsky;8 Markers Generations(Median)--467.05 Generations(Modal)--467.05
}

Dataset:EM35-Balanced
Marker list:8_Chiaronimarkerlist
Sample size:180

Pedigree/Familial Rates Summary
Years/Generation:28 - 33
TMRCA Range:5678 - 8101
Mean TMRCA:6713
Median TMRCA:6782
SD:756

Coalescent_Detail =

{
[1,1] = Chandler;8 Markers Generations(Median)--245.49 Generations(Modal)--245.49
[1,2] = Stafford;8 Markers Generations(Median)--219.22 Generations(Modal)--219.22
[1,3] = Burgarella_Navascues;8 Markers Generations(Median)--212.94 Generations(Modal)--212.94
[1,4] = Ballantyne;8 Markers Generations(Median)--202.79 Generations(Modal)--202.79
[1,5] = Zhivotovsky;8 Markers Generations(Median)--801.13 Generations(Modal)--801.13
}

Above, we can see that the FTDNA J-P58 data-set seems older than the Chiaroni (2009) data-set by ~66 generations (for the Zhivotovsky rate). It is also interesting to note that the E1b1b1 data-set's result of 801 generations, or 20.028 KYA, is quite close to the lower bound of Cruciani (2007)'s TMRCA estimate of 20.9-23.9 KYA.

Now that I have verified that the program works reasonably well with the effective mutation rate that I added to the database, I can test what effect marker combination and marker size have, not only on the absolute TMRCA estimates, but on the Zhivotovsky multiplier as well.

To do this, I used the 2 FTDNA data-sets from above, E-M35 and J-P58. Since I have a maximum of 49 markers to work with, I took a random subset of those 49 markers and computed both the Zhivotovsky and the pedigree coalescent estimates for it. Because the main purpose is to find the effect of marker combination, I repeated this process of extracting a random combination from the super-set of 49 markers and performing the computations a total of 50 times per chosen marker size.

For instance, if I choose my first marker size to be 8, I would then compute TMRCAs for 50 separate random 8 marker combinations within the dataset. Then, I double the marker size and compute TMRCAs for another 50 random combinations and so forth, until I reach my marker size limit.
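The sampling procedure described above can be sketched like this. It is a minimal illustration, assuming each combination is drawn independently and without replacement from the 49 markers; the marker names are placeholders and `random_marker_combos` is a hypothetical helper, not the program's own routine.

```python
import random


def random_marker_combos(markers, size, n_combos=50, seed=None):
    """Draw n_combos random marker combinations of the given size.

    Each combination has no repeated markers, but two combinations may
    overlap or even coincide, since they are drawn independently.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(markers, size)) for _ in range(n_combos)]


# Placeholder names standing in for the 49 markers in the database.
all_markers = [f"marker_{i:02d}" for i in range(1, 50)]

# 50 random combinations for each marker size used in this post.
runs = {
    size: random_marker_combos(all_markers, size, seed=size)
    for size in (8, 16, 32, 40)
}
print({size: len(combos) for size, combos in runs.items()})
```

Each combination in `runs` would then be fed to the TMRCA calculation once with the effective rate and once with the pedigree rates.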

Starting with a marker size of 8, and then going to 16, 32 and 40, here below are the results for the E-M35 and J-P58 datasets with 50 random combinations for each marker size.

Note that in the tables below, Z-TMRCA denotes the TMRCA in generations obtained using the effective mutation rate, while P-TMRCA denotes the average of the TMRCAs obtained from the 4 separate sources of pedigree rates in my program. X Max, X Min, X Average and X SD denote, respectively, the maximum, minimum, average and standard deviation of the Zhivotovsky multiple found across the combinations of each run. All columns of the tables are sortable.
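To make these definitions concrete, here is a small sketch that computes the multiple as Z-TMRCA divided by P-TMRCA. As input it uses the four 8-marker outputs listed earlier in this post (one fixed marker list, four data-sets) rather than the 50 random combinations behind the tables, so the summary values are only illustrative. Using the sample standard deviation (`statistics.stdev`) is an assumption; the tables may use the population form.

```python
from statistics import mean, stdev

PEDIGREE_SOURCES = ("Chandler", "Stafford", "Burgarella_Navascues", "Ballantyne")


def zhivotovsky_multiple(generations):
    """Z-TMRCA divided by P-TMRCA (the mean of the four pedigree-based estimates)."""
    p_tmrca = mean(generations[s] for s in PEDIGREE_SOURCES)
    return generations["Zhivotovsky"] / p_tmrca


# Generation counts copied from the Coalescent_Detail outputs above.
outputs = {
    "J1e_Chiaroni2": {"Chandler": 122.94, "Stafford": 108.45,
                      "Burgarella_Navascues": 117.65, "Ballantyne": 94.601,
                      "Zhivotovsky": 401.11},
    "J1e_Chiaroni2_YCAII22_22": {"Chandler": 108.31, "Stafford": 91.632,
                                 "Burgarella_Navascues": 102.45, "Ballantyne": 80.719,
                                 "Zhivotovsky": 365.91},
    "J1c3": {"Chandler": 169.6, "Stafford": 161.59,
             "Burgarella_Navascues": 169.85, "Ballantyne": 129.86,
             "Zhivotovsky": 467.05},
    "EM35-Balanced": {"Chandler": 245.49, "Stafford": 219.22,
                      "Burgarella_Navascues": 212.94, "Ballantyne": 202.79,
                      "Zhivotovsky": 801.13},
}

multiples = {name: zhivotovsky_multiple(g) for name, g in outputs.items()}
summary = {
    "X Max": max(multiples.values()),
    "X Min": min(multiples.values()),
    "X Average": mean(multiples.values()),
    "X SD": stdev(multiples.values()),
}
for name, x in multiples.items():
    print(f"{name}: {x:.2f}")
print({k: round(v, 2) for k, v in summary.items()})
```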






8 Marker – E-M35_FTDNA




16 Marker – E-M35_FTDNA




32 Marker – E-M35_FTDNA



40 Marker – E-M35_FTDNA



8 Marker – J-P58_FTDNA



16 Marker – J-P58_FTDNA



32 Marker – J-P58_FTDNA



40 Marker – J-P58_FTDNA





From the tables above, it is clear that the Zhivotovsky multiple varies widely depending on the marker combination (as well as on which pedigree rates are used), but it can also be seen that the range narrows as the marker size increases.

For instance, for the E-M35 dataset with a marker size of 8, the minimum multiple is 0.71 (below parity!), found at Combo # 6, while the maximum multiple is 9.67, found at Combo # 34, giving a multiple range of 8.96. For the same dataset with a marker size of 40, however, the minimum multiple is 2.15 (Combo # 29) and the maximum is 5.41 (Combo # 28), giving a multiple range of 3.26. That is almost a 64% reduction in the range of the multiple going from the smallest to the largest marker size. A similar reduction in the range of the multiple can also be seen with the other dataset (J-P58).
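The range reduction quoted above works out as follows, using only the minimum and maximum multiples given in the text for the E-M35 dataset.

```python
def multiple_range(x_min, x_max):
    """Spread between the largest and smallest Zhivotovsky multiple in a run."""
    return x_max - x_min


range_8 = multiple_range(0.71, 9.67)    # 8-marker run, E-M35
range_40 = multiple_range(2.15, 5.41)   # 40-marker run, E-M35
reduction_percent = 100 * (range_8 - range_40) / range_8

print(f"{range_8:.2f} -> {range_40:.2f}: {reduction_percent:.1f}% narrower")
```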

The above could be explained in one of two ways, both of which need further testing: (1) as the marker size increases, the pool of new markers available to combine randomly shrinks, since there are ultimately only 49 markers to work with, so the results become more of a repetition, or (2) increasing the marker size really does have a significant effect on the range of the multiple.
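Explanation (1) can at least be sized up without any genetic data. If two combinations of k markers are drawn independently from n markers, the expected number of shared markers is k²/n (the hypergeometric mean), so the expected shared fraction is k/n. The sketch below computes that for the marker sizes used here and checks it with a small simulation; the function names are made up for illustration.

```python
import random

N_MARKERS = 49  # size of the marker super-set, from the text


def expected_shared_fraction(size, n=N_MARKERS):
    """Expected fraction of one random combination's markers found in another (k/n)."""
    return size / n


def simulated_shared_fraction(size, n=N_MARKERS, trials=2000, seed=0):
    """Monte Carlo check of the same quantity using pairs of random combinations."""
    rng = random.Random(seed)
    pool = range(n)
    total = 0.0
    for _ in range(trials):
        a = set(rng.sample(pool, size))
        b = set(rng.sample(pool, size))
        total += len(a & b) / size
    return total / trials


overlap = {size: expected_shared_fraction(size) for size in (8, 16, 32, 40)}
for size, frac in overlap.items():
    print(f"{size} markers: {100 * frac:.0f}% expected overlap")
```

With 8 markers, two random combinations share about 16% of their markers on average, while with 40 markers they share about 82%. So some narrowing of the range would be expected from repetition alone, which does not rule out explanation (2) but does mean the two need to be separated before drawing conclusions.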


Whatever the case, it is clear that marker combination has a tremendous effect on the Zhivotovsky multiple, especially when only a few markers are used. Therefore, the 'mantra' of multiply by 3 (or divide by 3), while true on average, is a gross oversimplification, or even completely wrong, if the particular combination of marker properties, along with the particular set of pedigree mutation rates used, is not accounted for in detail.

Although more data-points than just the 8, 16, 32 and 40 marker sizes that I used would certainly be ideal, there was no consistent correlation between marker size and the absolute TMRCAs generated with either the Pedigree or the Zhivotovsky rates.

The E-M35 data-set had an overall mean Z-TMRCA (across all marker sizes and random combinations) of 1456 generations, with an SD of 70.28 and an R^2 of 0.09, while the P-TMRCA had 401, 4.73 and 0.92 respectively for the same variables. The J-P58 dataset had 726, 33.75 and 0.62 respectively for the Z-TMRCA, and 235, 16.91 and 0.36 for the P-TMRCA. (see also below)
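For readers who want to reproduce this kind of R^2 figure, here is a minimal sketch. It assumes R^2 is the squared Pearson correlation from a simple linear fit of TMRCA against marker size, which is an assumption since the exact fit is not described here. The TMRCA values in the example are made up and are not the values behind the figures above.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys (R^2 of a simple linear fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


marker_sizes = [8, 16, 32, 40]

# Made-up TMRCAs: one series rises linearly with marker size, the other does not.
linear_tmrcas = [100.0, 120.0, 160.0, 180.0]
scattered_tmrcas = [100.0, 130.0, 95.0, 125.0]

print(round(r_squared(marker_sizes, linear_tmrcas), 2))
print(round(r_squared(marker_sizes, scattered_tmrcas), 2))
```

An R^2 near 1 means the TMRCA moves almost in lockstep with marker size, while an R^2 near 0 means marker size explains very little of the variation.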



3 comments:

  1. This may be some data you would like to add if you haven't already seen it.

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2869035/

    Replies
    1. As a matter of fact, I had never seen this data before. I ran the J-M267 haplotypes of these Kuwaiti Bedouins through my calculator using the Chiaroni 8 marker list from above:

      Dataset:Bedouin_JM267
      Marker list:8_Chiaronimarkerlist
      Sample size:79

      Pedigree/Familial Rates Summary
      Years/Generation:28 - 33
      TMRCA Range:3719 - 6540
      Mean TMRCA:5085
      Median TMRCA:5106
      SD:859

      Coalescent_Detail =

      {
      [1,1] = Chandler;8 Markers; Generations(Median)--198.2 Generations(Modal)--198.2
      [1,2] = Stafford;8 Markers; Generations(Median)--161.7 Generations(Modal)--161.7
      [1,3] = Burgarella_Navascues;8 Markers; Generations(Median)--174.16 Generations(Modal)--174.16
      [1,4] = Ballantyne;8 Markers; Generations(Median)--132.84 Generations(Modal)--132.84
      [1,5] = Zhivotovsky;8 Markers; Generations(Median)--609.98 Generations(Modal)--609.98
      }


      These haplotypes definitely seem older than both the Chiaroni J-P58 haplotypes and the FTDNA J-P58 haplotypes, by up to 50%. That is strange; one would think that they'd show a rather limited age due to inbreeding...

    2. Actually, I was comparing branches (J-M267) to twigs (J-P58), when I should be comparing the Bedouin J-M267 to the FTDNA J-M267:

      Dataset:J-M267 (ftdna)
      Marker list:8_Chiaronimarkerlist
      Sample size:573

      Pedigree/Familial Rates Summary
      Years/Generation:28 - 33
      TMRCA Range:3898 - 6239
      Mean TMRCA:5271
      Median TMRCA:5249
      SD:771

      Coalescent_Detail =

      {
      [1,1] = Chandler;8 Markers; Generations(Median)--189.09 Generations(Modal)--189.09
      [1,2] = Stafford;8 Markers; Generations(Median)--177.23 Generations(Modal)--177.23
      [1,3] = Burgarella_Navascues;8 Markers; Generations(Median)--185.87 Generations(Modal)--185.87
      [1,4] = Ballantyne;8 Markers; Generations(Median)--139.21 Generations(Modal)--139.21
      [1,5] = Zhivotovsky;8 Markers; Generations(Median)--508.7 Generations(Modal)--508.7
      }

      This means that the Bedouin data is older than FTDNA's, but not as old as I had thought earlier. Still, what this tells me is that the Bedouin data likely holds more lineages within J-M267 than just J-P58.
