Tuesday, February 3, 2015

SNP based module added to the Y TMRCA calculator

The Y TMRCA calculator, previously based solely on STRs, can now also accept SNP-based input to compute the TMRCA of a node. Instructions and methodology can be found within the app at the link below:
https://ehelix.pythonanywhere.com/init/default/index

For now, it uses 7 separate mutation rates, each from a different publication, although the publications do not all use different methods to derive their rates. I will look to add more as further substitution mutation rates become available.

Below I have run some quick verifications for 3 separate mutation rate sources:

Poznik (2013) rates via Underhill (2014)

Xue (2009) rates via Cruciani (2013)

Scozzari (2013) rates for the same publication

Poznik (2013) rates via Underhill (2014)

The following is stated in Underhill (2014):

A consensus has not yet been reached on the rate at which Y-chromosome SNPs accumulate within this 9.99Mb sequence. Recent estimates include one SNP per: ~100 years,⁵⁸ 122 years,⁴ 151 years⁵ (deep sequencing reanalysis rate), and 162 years.⁵⁹ Using a rate of one SNP per 122 years, and based on an average branch length of 206 SNPs from the common ancestor of the 13 sequences, we estimate the bifurcation of R1 into R1a and R1b to have occurred ~25,100 ago (95% CI: 21,300–29,000). Using the 8 R1a lineages, with an average length of 48 SNPs accumulated since the common ancestor, we estimate the splintering of R1a-M417 to have occurred rather recently, ~5800 years ago (95% CI: 4800–6800). The slowest mutation rate estimate would inflate these time estimates by one third, and the fastest would deflate them by 17%.

Putting the variables for the R1 node from above into the calculator, we get an output of:

R1 - Underhill (2014)

For the mutation rate they used, i.e. Poznik (2013), the calculator gives 25.15 KYA, close enough to their estimate of 25.1 KYA.
Similarly, for the R1a-M417 node, we get:

R1a-M417 - Underhill (2014)

Again, looking at the calculator's Poznik TMRCA of 5.86 KYA, we can see it is close enough to their estimate of 5.8 KYA.
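The arithmetic behind both checks is just the average number of SNPs below a node multiplied by the years per SNP. Here is a minimal sketch using the rounded figure of 122 years per SNP from the Underhill (2014) quote. The function name is my own for illustration and is not taken from the calculator; the calculator may carry an unrounded rate internally, which would explain small differences in the last digit.

```python
def snp_tmrca_years(mean_snps, years_per_snp):
    """Average SNP count below a node multiplied by the years per SNP."""
    return mean_snps * years_per_snp


# Average branch lengths quoted from Underhill (2014)
UNDERHILL_NODES = [("R1", 206), ("R1a-M417", 48)]
YEARS_PER_SNP = 122  # rounded rate from the quote above

for node, snps in UNDERHILL_NODES:
    years = snp_tmrca_years(snps, YEARS_PER_SNP)
    print("%s: %.2f KYA" % (node, years / 1000.0))
```

With the rounded rate this gives 25,132 and 5,856 years for the two nodes.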

Comprehensive Ethiopian YDNA TMRCA Estimates

Find below a comprehensive list for all central TMRCA estimates calculated from the Plaster thesis for 6 UEPs (look at this post under Interactive Chart of Figure 3.2 for the frequencies of the UEPs). P*(x R1a) & Y*(x BT,A3b2) are not included due to their minimal frequency and very sporadic distribution.

There were a total of 5,756 haplotypes reported with the paper for the markers DYS19, DYS388, DYS390, DYS391, DYS392 and DYS393. 30 of those haplotypes belonged to P*(x R1a) & Y*(x BT,A3b2), leaving a total of 5,726 haplotypes. These remaining haplotypes were then categorized by the criteria Cultural ID + Generic Language Group* + UEP. Any group of haplotypes that conformed to these criteria with N > 1 and with a coalescent not equal to 0 (meaning the haplotypes were not all identical) was processed for its TMRCA and reported. These groups account for 5,668, or 98%, of the total haplotypes reported for the paper.
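The grouping and filtering step can be sketched roughly as follows. This is not the code used for the tables; the function name, the tuple layout and the example values are assumptions made for illustration only.

```python
from collections import defaultdict


def eligible_groups(samples):
    """Group by (cultural ID, language group, UEP) and keep usable groups.

    samples: iterable of (cultural_id, language_group, uep, haplotype) tuples.
    A group is kept only if it has more than one haplotype (N > 1) and the
    haplotypes are not all identical (coalescent not equal to 0).
    """
    groups = defaultdict(list)
    for cultural_id, language_group, uep, haplotype in samples:
        groups[(cultural_id, language_group, uep)].append(tuple(haplotype))
    return {key: haps for key, haps in groups.items()
            if len(haps) > 1 and len(set(haps)) > 1}


# Hypothetical samples, six markers each
EXAMPLE = [
    ("GroupX", "Semitic", "J", (14, 16, 23, 11, 11, 12)),
    ("GroupX", "Semitic", "J", (14, 17, 23, 10, 11, 12)),
    ("GroupY", "Omotic", "A3b2", (15, 11, 21, 10, 11, 13)),   # N = 1, dropped
    ("GroupZ", "Cushitic", "J", (14, 16, 23, 11, 11, 12)),
    ("GroupZ", "Cushitic", "J", (14, 16, 23, 11, 11, 12)),    # identical, dropped
]

kept = eligible_groups(EXAMPLE)
print(sorted(kept))
```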

The tables are ordered according to the frequencies of the tested UEPs in Ethiopia, i.e. E*(x E1b1a), 3985 haplotypes > J, 689 haplotypes > A3b2, 601 haplotypes > K*(xL,N1c,O2b,P), 154 haplotypes > BT*(xDE,JT), 193 haplotypes and E1b1a7, 46 haplotypes.

Note that the mean TMRCAs for both the Zhivotovsky rates (Z-TMRCA) and the pedigree rates (P-TMRCA), sometimes also known as germline rates, are in units of generations. A suitable generation length for the Z-TMRCA is 25 years, while for the P-TMRCA it may range from 28 to 33 years.
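To turn the tabulated generations into years, multiply by the generation length. A small sketch is given below; the function names are my own, and the value of 100 generations is a made-up input, not a result from the tables.

```python
Z_YEARS_PER_GENERATION = 25          # used with the Zhivotovsky rates
P_YEARS_PER_GENERATION = (28, 33)    # lower and upper bound for pedigree rates


def z_tmrca_years(generations):
    """Z-TMRCA in years."""
    return generations * Z_YEARS_PER_GENERATION


def p_tmrca_years(generations):
    """P-TMRCA in years as a (low, high) pair."""
    low, high = P_YEARS_PER_GENERATION
    return (generations * low, generations * high)


print(z_tmrca_years(100))
print(p_tmrca_years(100))
```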

If more detail of the TMRCA analysis is required for any of the populations listed below, go to the table here, upload the necessary file into the Y TMRCA calculator and filter for the specific population in question.

Y TMRCA Calculator as a Web App

The Y DNA (STR) TMRCA calculator can now be accessed as a web application with full functionality here:

http://ehelix.pythonanywhere.com/

It is also embedded in this blog in a new page (above).

UPDATE (02/11/2014)

Another series of updates for the calculator:

Users can now use the previously idle first column in the csv file to group haplotypes together and thus compute the TMRCA for a specified group (see example below).
The application now also accepts locus names in NIST format.
It now automatically deletes any haplotype with a non-integer value for any locus in the *.csv file, instead of producing an error.
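The grouping and clean-up behaviour described in this update could look something like the sketch below. It is not the web app's actual code: the function name and the example data are hypothetical, and it is written for Python 3 rather than the 2.7 the script targets.

```python
import csv
import io
from collections import defaultdict


def group_haplotypes(csv_text):
    """Return (marker names, {group label: list of haplotypes}).

    The first column is treated as the group label. Any row with a
    non-integer value at any locus is silently skipped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    markers = rows[0][1:]
    groups = defaultdict(list)
    for row in rows[1:]:
        try:
            repeats = [int(value) for value in row[1:]]
        except ValueError:
            continue  # e.g. "13.2" or an empty cell
        groups[row[0].strip()].append(repeats)
    return markers, dict(groups)


# Hypothetical input: one GroupB row carries a non-integer repeat value
EXAMPLE = ("Demo,DYS19,DYS390\n"
           "GroupA,13,24\n"
           "GroupA,14,24\n"
           "GroupB,13.2,23\n"
           "GroupB,15,23\n")

markers, groups = group_haplotypes(EXAMPLE)
print(markers)
print(groups)
```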

TMRCA calculator for Python

I have converted the TMRCA calculator so that it now runs on Python as well as Octave; see here for the Octave version.
It is written specifically for Python 2.7, and I have not had a chance to test it on other versions. No libraries are required to run the script other than the standard libraries that come with 2.7. Some of the advantages of converting to Python are fewer steps to run the program, easier (future) web app deployment, and wider user access to Python than to Octave.

The Zip file can be downloaded here: https://dl.dropboxusercontent.com/u/42082352/TMRCA.zip
--------------------------------------------------------------------------------------------------------
TMRCA Calculator Instructions - for Python 2.7

To check that the TMRCA program is working correctly on your system, first run it with the dataset
provided here before trying different datasets. To do so:

(1) Make sure you have Python 2.7 installed on your system (either Windows or Linux will work) and start the interpreter.
(2) In the interpreter, change your working directory to the directory where you saved the unzipped folder by using:
(i) import os
(ii) os.chdir('~PATH/TMRCA/')
-Where ~PATH is the full path where the TMRCA folder is placed on your system.
If you are unsure of your current working directory, type the command: os.getcwd()
(3) Import the tmrca module by typing: import tmrca
(4) Execute the script by typing: tmrca.Analysis('EM35_Example.csv','all')
(5) If this produces results with no errors in the interpreter, then the program is correctly installed and you can proceed to reading and analysing different datasets.
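The steps above can also be collected into one small helper if you prefer to run them from a script instead of typing them one by one. This is only a sketch: run_example is a name I made up, and it simply returns False when the folder or the tmrca module cannot be found. Only tmrca.Analysis and the example file name come from the instructions.

```python
import os
import sys


def run_example(tmrca_dir, data_file='EM35_Example.csv', markers='all'):
    """Run the bundled example; return True on success, False if not found."""
    if not os.path.isdir(tmrca_dir):
        return False
    previous = os.getcwd()
    os.chdir(tmrca_dir)
    # A script, unlike the interactive interpreter, does not search the
    # working directory for modules, so add the folder explicitly.
    sys.path.insert(0, tmrca_dir)
    try:
        import tmrca  # the module shipped in the TMRCA folder
        tmrca.Analysis(data_file, markers)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(tmrca_dir)
        os.chdir(previous)


# Replace with the full path to your unzipped TMRCA folder
print(run_example(os.path.join(os.getcwd(), 'no_such_TMRCA_folder')))
```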

Reading and analysing new Data

After correctly executing the above steps, read and analyse new data by using the following steps:
(1) Examine the example STR data file in the "TMRCA/" folder entitled "EM35_Example.csv".
(2) Any STR data file to be analysed should first be put into the same format as the "EM35_Example.csv" file, specifically:
(a) DYS names in the first row should have the exact same nomenclature (the order can be different, however).
(b) Each row (except the first) should represent one sample.
(c) Each column (except the first) should represent repeats for one marker/DYS#.
(d) The first column should represent sample identifiers, e.g. Kit#, sample ID, ...
(e) The cell in the first row and first column should hold the dataset's name; this name will be used throughout the analysis.
(f) No cell should contain a null value, and cells should not contain characters with spaces in between them.
(g) The file MUST be a *.csv file with commas used as field delimiters.
(3) Place the *.csv file directly in the "TMRCA/" folder (i.e. in your working directory).
(4) Start the interpreter, change the working directory to '~PATH/TMRCA/' as per the instructions above, and import tmrca.
(5) If you want to analyse a specific set of markers from your dataset go to step 6, otherwise go to step 7.
(6) Go to the file "/TMRCA/Markerlist/49markerlist.txt" and pick the markers you want to use for the analysis from there. Save your chosen
markers into a new *.txt file in the same "/TMRCA/Markerlist/" folder. Take a look at any of the other marker list text files in
the folder for an example of how a marker list should look. Note that all marker list files need to be *.txt.
(7) If you are specifying a set of markers to use for the analysis, for example "8_Chiaronimarkerlist.txt", then run the program
by typing: tmrca.Analysis('EM35_Example.csv','8_Chiaronimarkerlist.txt'); otherwise, just type: tmrca.Analysis('EM35_Example.csv','all').
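Before running the analysis it can help to check a new file against the format rules in step 2. The sketch below is a stand-alone checker of my own, not part of the tmrca module. It covers only the null-value and embedded-space rules plus a consistent column count, and the example data are made up.

```python
import csv
import io


def check_str_csv(csv_text):
    """Return a list of layout problems; an empty list means none were found."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["need a header row and at least one sample row"]
    header = rows[0]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append("line %d: expected %d cells, found %d"
                            % (line_no, len(header), len(row)))
            continue
        for name, cell in zip(header, row):
            text = cell.strip()
            if text == "":
                problems.append("line %d: empty cell under %s" % (line_no, name))
            elif " " in text:
                problems.append("line %d: space inside cell under %s"
                                % (line_no, name))
    return problems


GOOD = "Demo,DYS19,DYS390\nS1,13,24\nS2,14,24\n"
BAD = "Demo,DYS19,DYS390\nS1,13,\nS 2,14,24\n"

print(check_str_csv(GOOD))
for problem in check_str_csv(BAD):
    print(problem)
```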

Wednesday, May 1, 2013

Analyzing YDNA J lineages in Ethiopian linguistic groups

The extensive YDNA dataset found in the Plaster paper has a total of 691 YDNA lineages that belong to haplogroup J. Although no more detailed SNP resolution is reported for most of these lineages, it is safe to assume, from previous data on Ethiopia, that the vast majority of them belong to J1-M267. A limited set of STR data accompanies these lineages as well, namely for the markers 19, 388, 390, 391, 392 and 393.

According to the report, J lineages are found at proportionally higher frequencies in Semitic speakers in Ethiopia (~21%), followed by Omotic speakers (~12%) and Cushitic speakers (~8%). Of the 691 YDNA J lineages reported, 259 were from Semitic speakers, 266 were from speakers of some type of Omotic language, and most of the remainder were from speakers of Cushitic languages.

Source code for the ASD based TMRCA calculator (Octave)

The code for the TMRCA calculator of YDNA STR haplotypes that I use can be downloaded from here: https://dl.dropboxusercontent.com/u/42082352/TMRCA_ASD.zip

See also here for instances of where I have used the calculator in the past:
http://ethiohelix.blogspot.com/2012/06/finding-tmrca-of-ethiopian-ydna.html
http://ethiohelix.blogspot.com/2012/11/extensive-doctoral-thesis-on-ethiopian.html
http://ethiohelix.blogspot.com/2013/01/tmrca-calculations-from-plaster-nry.html
http://ethiohelix.blogspot.com/2013/02/the-zhivotovsky-multiplier.html
http://ethiohelix.blogspot.com/2013/03/african-sahel-ydna.html

The code is written for Octave and is also Matlab compatible. The folder linked above also contains an instruction file that explains how to run the calculator; its contents are reproduced below:
---------------------------------------------------------------------------------------------------------

To check that the TMRCA program is working correctly on your system, first run it with the dataset
provided here before trying different datasets. To do so:

(1) Make sure you have Octave installed on your system (either Windows or Linux will work) and start Octave in the command line.
(2) In the command line, change your working directory to the directory where you saved the unzipped folder by using: cd ~PATH/TMRCA_ASD/
If you are unsure of your current working directory, type the command: pwd()
(3) Type: fcompositeTMRCA("Buckova_EM78","all")
(4) If this produces results, then the program and functions are correctly installed and you can proceed to reading and analysing different datasets.

Reading and analysing new Data

After correctly executing the above steps, read and analyse new data by using the following steps:
(1) Open the example STR data file in the "TMRCA_ASD/Loaded_Data/" folder entitled "EM35_STR.xls".
(2) Any STR data file to be analysed should first be put into the same format as the "EM35_STR.xls" file, specifically:
(a) DYS names in the first row should have the exact same nomenclature (the order can be different, however).
(b) Each row (except the first) should represent one sample.
(c) Each column (except the first) should represent repeats for one marker/DYS#.
(d) The first column should represent sample identifiers, e.g. Kit#, sample ID, ...
(e) The cell in the first row and first column should hold the dataset's name; this name will be used throughout the analysis.
(f) No cell should contain a null value, and cells should not contain characters with spaces in between them.
(3) In Excel or OpenOffice, convert the "EM35_STR.xls" workbook to a ".csv" file by saving it as "YSTR.csv" in the
same "TMRCA_ASD/Loaded_Data/" folder. The program will only look for a file entitled "YSTR.csv", so make sure that this name is used for your file.
(4) Start Octave and, in the command line, change the working directory to "~PATH/TMRCA_ASD/Loaded_Data/".
(5) Type at the Octave prompt: readdata
(6) Octave will start reading the dataset and, when it is finished, create the file "EM35-Balanced" in the folder "/TMRCA_ASD/Loaded_Data/".
(7) If you want to analyse a specific set of markers from your dataset go to step 8, otherwise go to step 9.
(8) Go to the file "/TMRCA_ASD/Markerlist/49markerlist.txt" and pick the markers you want to use for the analysis. Then save your chosen
markers into a new txt file in the same "/TMRCA_ASD/Markerlist/" folder. Take a look at the file "8_Chiaronimarkerlist.txt" for
an example of how the marker list should look.
(9) In Octave, change your working directory back up one level by typing: cd ..
(10) If you are specifying a set of markers to use in the analysis, then run the program by typing: fcompositeTMRCA("EM35-Balanced","8_Chiaronimarkerlist.txt"); otherwise, just type: fcompositeTMRCA("EM35-Balanced","all").
----------------------------------------------------------------------------------------------------------
Update : Version2 - *.CSV read, + Auto path detect. (fcompositeTMRCA.m, fmarkerextract.m, readdata.m)
Update(04/25/13) : Version3 - Add option for using all available markers, print used/unused markers. (fcompositeTMRCA.m, fmarkerextract.m, fAssignmutation.m)

Thursday, March 7, 2013

African Sahel YDNA

Multiple and differentiated contributions to the male gene pool of pastoral and farmer populations of the African Sahel

ABSTRACT

The African Sahel is conducive to studies of divergence/admixture genetic events as a result of its population history being so closely related with past climatic changes. Today, it is a place of the co-existence of two differing food-producing subsistence systems, i.e., that of sedentary farmers and nomadic pastoralists, whose populations have likely been formed from several dispersed indigenous hunter-gatherer groups. Using new methodology, we show here that the male gene pool of the extant populations of the African Sahel harbors signatures of multiple and differentiated contributions from different genetic sources. We also show that even if the Fulani pastoralists and their neighboring farmers share high frequencies of four Y chromosome subhaplogroups of E, they have drawn on molecularly differentiated subgroups at different times. These findings, based on combinations of SNP and STR polymorphisms, add to our previous knowledge and highlight the role of differences in the demographic history and displacements of the Sahelian populations as a major factor in the segregation of the Y chromosome lineages in Africa. Interestingly, within the Fulani pastoralist population as a whole, a differentiation of the groups from Niger is characterized by their high presence of R1b-M343 and E1b1b1-M35. Moreover, the R1b-M343 is represented in our dataset exclusively in the Fulani group and our analyses infer a north-to-south African migration route during a recent past.

Closed Access

Y(x CF) Phylogeny, Red = SNPs Tested, Blue = Presumed Tested

CF Phylogeny, Red = SNPs Tested, Blue = Presumed Tested

TMRCA calculations from Plaster NRY data : Correcting an Error

Previously, I had computed TMRCAs for the YDNA STR data from the additional material provided along with Dr. Chris Plaster's thesis. However, after a brief communication with the author, I found out that the marker order of the STRs in the Excel file was reported wrongly. The correct order of the markers is as follows:

DYS19 DYS388 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS437 DYS438 DYS439 DYS448 DYS456 DYS635 Y GATA H4

This changes my TMRCA calculations because I am not computing the coalescent using a generic mutation rate that is equivalent for all the markers, but rather each marker has its own mutation rate attributed to it.
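Since every marker carries its own rate, the repeat values have to be tied to the right marker name before any rate is looked up. One simple way to guard against a column mix-up is to label each row by name using the corrected order above, as in this sketch. The helper name and the example row are made up; the marker order is the corrected one given above.

```python
CORRECTED_ORDER = [
    "DYS19", "DYS388", "DYS389I", "DYS389II", "DYS390", "DYS391", "DYS392",
    "DYS393", "DYS437", "DYS438", "DYS439", "DYS448", "DYS456", "DYS635",
    "Y GATA H4",
]


def label_row(values, marker_order=CORRECTED_ORDER):
    """Pair each repeat value with its marker name, in column order."""
    if len(values) != len(marker_order):
        raise ValueError("expected %d values, got %d"
                         % (len(marker_order), len(values)))
    return dict(zip(marker_order, values))


# Hypothetical 15-marker haplotype, in the column order of the file
row = [15, 12, 13, 30, 24, 10, 11, 13, 14, 10, 12, 20, 15, 21, 11]
labelled = label_row(row)
print(labelled["DYS390"])
```

Marker-specific rates can then be looked up by name rather than by column position.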

When I rerun my program using the newly corrected order above I get the following:

As can be seen, using the new order of markers generally reduces the number of generations to coalescent for the Plaster data-set. The previous observation of a relatively lower TMRCA for the haplozone data of E-M123 versus that of the E-M34 Plaster data-set largely disappears.

To check whether the high number of samples (129) in the E-M123 Haplozone data-set was skewing the results, I took 23 random samples (the same number of samples available in the Plaster E-M34 data-set) from the larger E-M123 Haplozone data-set and re-ran the TMRCA calculations on just those samples. I repeated this process 300 times. Only 28% of the runs yielded a mean TMRCA less than that of the E-M34 Plaster data-set; if sample size were skewing the results, I would expect >50% of the runs to have a mean TMRCA less than that of the E-M34 Plaster data-set.
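The resampling check described above can be sketched as follows. This is not the code I ran at the time: the function name is hypothetical, the estimator is passed in as a plain function, and the toy data at the bottom (integers with a simple mean as the "estimate") are there only to make the sketch runnable.

```python
import random


def fraction_below(samples, subsample_size, reference, estimate,
                   runs=300, seed=0):
    """Share of random subsamples whose estimate falls below the reference."""
    rng = random.Random(seed)
    below = 0
    for _ in range(runs):
        subsample = rng.sample(samples, subsample_size)
        if estimate(subsample) < reference:
            below += 1
    return below / float(runs)


def toy_mean(values):
    """Stand-in estimator; a real run would compute a mean TMRCA instead."""
    return sum(values) / float(len(values))


toy_samples = list(range(129))  # stand-in for the 129 Haplozone haplotypes
share = fraction_below(toy_samples, 23, reference=64.0, estimate=toy_mean)
print(share)
```

In the real check, samples would be the 129 E-M123 haplotypes, estimate the TMRCA function, and reference the mean TMRCA of the E-M34 Plaster data-set.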

That said, the E-M34 Plaster data-set still had a relatively higher number of generations to coalescent than the E-M84 Haplozone data-set. E-M84 is a subclade of E-M34, and a large majority of haplotypes that belong to E-M34 also test positive for the E-M84 SNP (at least for the non-African E-M34 haplotypes that we know of).

Other than that, the new, and corrected, ordering of the markers did not have much impact in relative TMRCA terms between the Plaster and Haplozone/FTDNA data for the other lineages I had tested.

Tuesday, June 19, 2012

Finding the TMRCA of Ethiopian YDNA lineages using an ASD method.

Lately I have been working on computing TMRCAs using an ASD, or average square difference, method on publicly available Y-STR haplotypes. The premise for finding the TMRCA using the ASD method is quite straightforward and easy to understand. A putative ancestral haplotype is calculated for a given dataset, and the repeat count of each sample at each marker in the dataset is subtracted from this ancestral haplotype. The result is accumulated and divided by the number of samples and the marker-specific mutation rate. The process is repeated for every single marker in the dataset, and the mean is then multiplied by an assumed number of years per generation. The formula below articulates this method:

TMRCA formula (ASD method)
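The formula image itself is not reproduced here. Reconstructed from the description above, and assuming the differences are squared as the name of the method implies, it would read roughly as follows (the indices i for samples and j for markers are added for clarity):

```latex
\mathrm{TMRCA} = \frac{G}{Z} \sum_{j=1}^{Z} \frac{1}{N \, m_{j}} \sum_{i=1}^{N} \left( L_{ij} - L^{0}_{j} \right)^{2}
```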

Where;

N= Total number of Samples

Z= Total number of Markers

L⁰= Putative Ancestral Haplotype (Median or Modal repeats)

L= Individual sample haplotype repeats

m= Marker Specific Mutation Rate

G= Years / Generation
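As a rough illustration of the method in code, here is a minimal sketch that uses the modal repeat as the ancestral value and assumes the differences are squared, as the name ASD implies. It is not the calculator's source; the function names, the example haplotypes and the two mutation rates are all made up for the example.

```python
def modal_haplotype(haplotypes):
    """Most common repeat count at each marker (ties go to the smaller value)."""
    modal = []
    for column in zip(*haplotypes):
        modal.append(max(sorted(set(column)), key=column.count))
    return modal


def asd_tmrca_years(haplotypes, rates, years_per_generation):
    """Mean over markers of ASD / rate, converted from generations to years."""
    n = float(len(haplotypes))          # N, total number of samples
    ancestral = modal_haplotype(haplotypes)   # L0
    per_marker = []
    for j, rate in enumerate(rates):    # rate is m for marker j
        squared = sum((h[j] - ancestral[j]) ** 2 for h in haplotypes)
        per_marker.append(squared / (n * rate))
    generations = sum(per_marker) / len(per_marker)   # mean over Z markers
    return years_per_generation * generations


# Hypothetical data: 4 samples, 2 markers, made-up mutation rates
HAPLOTYPES = [(13, 24), (13, 26), (13, 24), (14, 24)]
RATES = [0.002, 0.004]

print(asd_tmrca_years(HAPLOTYPES, RATES, 25))
```

For these toy values the first marker contributes about 125 generations and the second about 250, so the mean is about 187.5 generations before conversion to years.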

The biggest variable here, other than the sampling strategy of a given dataset, is the choice among the several marker-specific mutation rates that are available. Selecting a correct mutation rate is an unsettled issue. I have therefore used 4 sets of mutation rates that were compiled by Paul Newlin, a collaborator at the E3b Project. These rates come from several different publications, and you can read about them in more detail here:

The Chandler Mutation Rates:
http://www.jogg.info/22/Chandler.pdf

Stafford Bayesian Mutation Rates:
Essentially a compilation of other mutation rates

Burgarella & Navascués Mutation Rates:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3039515/

Ballantyne Mutation Rates:
http://www.cell.com/AJHG/retrieve/pii/S0002929710004192

To make the comparison of TMRCAs between the different publications like-for-like, I had to intersect the markers available in the sets above with the markers that are found in the public domain. This left me with the following 46 markers, which are shared by all 4 of the above sets of rates and by the 66 markers that are widely used:

406s1 , 19 , 388 , 389-1 , 389-2 , 390 , 391 , 392 , 393 , 426 , 436 , 437 , 438 , 439 , 442 , 444 , 446 , 447 , 448 , 450 , 454 , 455 , 456 , 458 , 460 , 472 , 481 , 487 , 490 , 492 , 511 , 520 , 531 , 534 , 537 , 557 , 565 , 568 , 572 , 578 , 590 , 594 , 617 , 640 , 641 and gatah4.

In addition, since the Chandler mutation rates had a complete intersection with the 66 widely used markers, an additional 66 marker Chandler set was independently used that included the following markers in addition to the 46 listed above:

385a , 385b , 459a , 459b , 449 , 464a , 464b , 464c , 464d , ycaiia , ycaiib , 607 , 576 , 570 , cdya , cdyb , 395s1a , 395s1b , 413a and 413b.
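As a quick consistency check on the two lists above, the sketch below confirms that the 46 shared markers and the 20 additional Chandler-only markers do not overlap and add up to 66. The shared_markers helper is a hypothetical illustration of how such an intersection could be taken; the full contents of each rate set are not listed in this post, so it is not demonstrated on them.

```python
CORE_46 = """406s1 19 388 389-1 389-2 390 391 392 393 426 436 437 438 439
442 444 446 447 448 450 454 455 456 458 460 472 481 487 490 492 511 520
531 534 537 557 565 568 572 578 590 594 617 640 641 gatah4""".split()

EXTRA_20 = """385a 385b 459a 459b 449 464a 464b 464c 464d ycaiia ycaiib
607 576 570 cdya cdyb 395s1a 395s1b 413a 413b""".split()


def shared_markers(*marker_sets):
    """Markers present in every one of the given collections, sorted."""
    common = set(marker_sets[0])
    for markers in marker_sets[1:]:
        common &= set(markers)
    return sorted(common)


print(len(CORE_46), len(EXTRA_20), len(set(CORE_46) | set(EXTRA_20)))
print(shared_markers(CORE_46, EXTRA_20))
```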

Haplogroups A, E and J cover well over 90% of the YDNA lineages found in Ethiopia. Within these haplogroups, I was most interested in finding the TMRCA for A-M13, E-M35 and J1-M267, as these lineages cover over 70% but under 80% of said lineages. The remaining 20-30% of lineages found in Ethiopia belong to E1b1*(x E1b1b,E1b1a1), other types of E lineages like E2 and E*, and some specific clades that belong to haplogroups B, T and J2.

Ethio Helix ኢትዮ:ሒሊክስ

Tuesday, February 3, 2015

SNP based module added to the Y TMRCA calculator

Friday, February 14, 2014

Comprehensive Ethiopian YDNA TMRCA Estimates

Monday, January 27, 2014

Y TMRCA Calculator as a Web App

Tuesday, October 8, 2013

TMRCA calculator for Python

Wednesday, May 1, 2013

Analyzing YDNA J lineages in Ethiopian linguistic groups

Sunday, April 21, 2013

Source code for the ASD based TMRCA calculator (Octave)

Thursday, March 7, 2013

African Sahel YDNA

Saturday, January 5, 2013

TMRCA calculations from Plaster NRY data : Correcting an Error

Tuesday, June 19, 2012

Finding the TMRCA of Ethiopian YDNA lineages using an ASD method.
