Showing posts with label Y DNA. Show all posts

Tuesday, March 17, 2015

New NGS study of the Y DNA

A new Y-DNA study using Next Generation Sequencing has appeared, in which ~9 Mb of the Y chromosome was sequenced for 456 samples (299 of which were new). Some preliminary observations are outlined below:

(1) Mutation Rate:

This is the second published study to calibrate the substitution mutation rate for the YDNA based on fossil evidence. To do this, the authors combined mutation rates derived from 2 separate fossils: the 12.6 KY old Anzick fossil from Montana belonging to haplogroup Q1b and the 4 KY old Saqqaq fossil from Greenland belonging to haplogroup Q2b. The first study, Fu (2014), used the 45 KY old Ust-Ishim fossil from Siberia belonging to haplogroup K(xLT). Interestingly, despite the big difference in the ages of these fossils of ~36 KY (on average), the derived mutation rates were quite close to each other, with the current study's central estimate only ~8% slower than the rate derived from the Ust-Ishim fossil. The 95% CI bounds for this study were, however, less tight than those of Fu (2014). I have already incorporated these new rates into the TMRCA calculator under Karmin (2015).

(2) Coalescence of Non-African YDNA chromosomes:

The authors report:
... a cluster of major non-African founder haplogroups in a narrow time interval at 47–52 kya, consistent with a rapid initial colonization model of Eurasia and Oceania after the out-of-Africa bottleneck
This aligns almost perfectly with the recent find in Manot, Israel of the 49.2 - 60.2 KY old non-African AMH fossil believed to be closely related to the ancestors of all extant non-Africans, i.e. the first OOA migrants.

(3) A "New" E1b1b (E-M215) topology:

The "new" topology of E-M215 they outline below is in fact over 3 years old. Actually, we knew more back then than what they show in this paper today (see here).
E-M215 Karmin (2015)
Compared with what we knew 3 years ago (note: CTS8288 above is equivalent to E-Z830 below):


The unanswered questions with respect to the major topology of E-M215 remain:
  • What is the relationship, if any, of E-V92 with respect to E-Z827, E-Z830 or E-V68?
  • What is the relationship, if any, of E-V6 with respect to E-Z827, E-Z830 or E-V68?

A recent bottleneck of Y chromosome diversity coincides with a global change in culture
 

Abstract

It is commonly thought that human genetic diversity in non-African populations was shaped primarily by an out-of-Africa dispersal 50–100 thousand yr ago (kya). Here, we present a study of 456 geographically diverse high-coverage Y chromosome sequences, including 299 newly reported samples. Applying ancient DNA calibration, we date the Y-chromosomal most recent common ancestor (MRCA) in Africa at 254 (95% CI 192–307) kya and detect a cluster of major non-African founder haplogroups in a narrow time interval at 47–52 kya, consistent with a rapid initial colonization model of Eurasia and Oceania after the out-of-Africa bottleneck. In contrast to demographic reconstructions based on mtDNA, we infer a second strong bottleneck in Y-chromosome lineages dating to the last 10 ky. We hypothesize that this bottleneck is caused by cultural changes affecting variance of reproductive success among males.


Link (Closed Access)

Tuesday, February 3, 2015

SNP based module added to the Y TMRCA calculator

The previously STR-only Y TMRCA calculator can now also accept SNP-based input to compute the TMRCA of a node. Instructions and methodology can be found within the app at the link below:
https://ehelix.pythonanywhere.com/init/default/index

For now, it uses 7 separate mutation rates that all come from different publications, although not all of them were necessarily derived with different methods. I will look to expand these as more substitution mutation rates become available.

Below I have run some quick verifications for 3 separate mutation rate sources:

Poznik (2013) rates via Underhill (2014)

The following is stated in Underhill (2014):
A consensus has not yet been reached on the rate at which Y-chromosome SNPs accumulate within this 9.99Mb sequence. Recent estimates include one SNP per: ~100 years,⁵⁸ 122 years,⁴ 151 years⁵ (deep sequencing reanalysis rate), and 162 years.⁵⁹ Using a rate of one SNP per 122 years, and based on an average branch length of 206 SNPs from the common ancestor of the 13 sequences, we estimate the bifurcation of R1 into R1a and R1b to have occurred ~25,100 ago (95% CI: 21,300–29,000). Using the 8 R1a lineages, with an average length of 48 SNPs accumulated since the common ancestor, we estimate the splintering of R1a-M417 to have occurred rather recently, ~5800 years ago (95% CI: 4800–6800). The slowest mutation rate estimate would inflate these time estimates by one third, and the fastest would deflate them by 17%.
Putting the variables for the R1 node from above into the calculator, we get an output of:

 R1 - Underhill (2014)
For the mutation rate they used, i.e. Poznik (2013), the calculator gives 25.15 KYA, close enough to their estimate of 25.1 KYA.
Similarly, for the R1a-M417 node, we get:

R1a-M417 - Underhill (2014)
Again, looking at the calculator's Poznik TMRCA of 5.86 KYA, we can see it is close enough to their estimate of 5.8 KYA.
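The arithmetic behind these checks can be sketched in a few lines. This is only a minimal illustration, assuming the TMRCA is simply the average SNP count below a node multiplied by the years per SNP quoted from Underhill (2014); the function and dictionary names are made up here and are not taken from the calculator itself.

```python
# Sketch only: TMRCA = (average SNPs since the common ancestor) x (years per SNP).
# The rates are the four values quoted from Underhill (2014); all names below
# are illustrative and not part of the calculator.

YEARS_PER_SNP = {
    "1 SNP per 100 years": 100,
    "1 SNP per 122 years (rate used)": 122,
    "1 SNP per 151 years": 151,
    "1 SNP per 162 years": 162,
}


def snp_tmrca_years(avg_snps, years_per_snp):
    """Return the TMRCA in years for a node with avg_snps mutations below it."""
    return avg_snps * years_per_snp


# Average branch lengths quoted above: 206 SNPs for R1, 48 SNPs for R1a-M417.
for node, avg_snps in (("R1", 206), ("R1a-M417", 48)):
    for label, rate in YEARS_PER_SNP.items():
        kya = snp_tmrca_years(avg_snps, rate) / 1000.0
        print("%-9s %-32s %6.2f KYA" % (node, label, kya))
```

With the 122 years per SNP rate, this simple product gives 25,132 years for R1 and 5,856 years for R1a-M417, in line with the figures above.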

Friday, February 21, 2014

YDNA E-M123; A closer look

E-M123 (as well as E-M34) was first discovered by Underhill (2000) and is found with a low to medium frequency distribution in East Africa and the Middle East, while it has a low frequency distribution in North Africa and Europe.

Phylogeny:
Figure 1 - Current and previous E-M215 phylogenetic structure 

Figure 1 shows a comparison of the basic phylogeny of E-M215/M35 as was known before 2011 (a) and after (b), with a 'who and when' key for the discovery of the UEPs. Notice the impact the rearrangement has on the phylogenetic placement of E-M123, specifically the fact that E-M123 is shown to share a more recent common ancestor with the East and Southern African variants of E-M35, i.e. E-V42 and E-M293, than with any of the other variants of E-M35.

Previous publications:

While it is unfortunate that all of the research previously published on E-M123 was done under the older (and rather out of date) configuration of the basic structure of E-M35, it is still worthwhile to look at articles that have tried to untangle the origins and history of this lineage. Of these, 3 come to mind:

Friday, February 14, 2014

Comprehensive Ethiopian YDNA TMRCA Estimates

Find below a comprehensive list for all central TMRCA estimates calculated from the Plaster thesis for 6 UEPs (look at this post under Interactive Chart of Figure 3.2 for the frequencies of the UEPs). P*(x R1a) & Y*(x BT,A3b2)  are not included due to their minimal frequency and very sporadic distribution. 

There were a total of 5,756 haplotypes reported with the paper for the markers DYS19, DYS388, DYS390, DYS391, DYS392 and DYS393. 30 of those haplotypes belonged to P*(x R1a) & Y*(x BT,A3b2), leaving a total of 5,726 haplotypes. These remaining haplotypes were then categorized by the criteria of Cultural ID + Generic Language Group* + UEP. Any group of haplotypes that conformed to these criteria with N > 1 and with a coalescent not equal to 0 (meaning non-identical haplotypes) was processed for its TMRCA and reported, accounting for 5,668 or 98% of the total haplotypes reported for the paper.
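The grouping and filtering rule described above could be sketched roughly as follows. This is an illustration rather than the code actually used: the record layout, the function name and the sample values are all made up, and "coalescent not equal to 0" is interpreted here as "the group contains at least two distinct haplotypes".

```python
from collections import defaultdict


def eligible_groups(records):
    """Group haplotypes by (cultural ID, language group, UEP) and keep only
    groups with N > 1 that contain at least two distinct haplotypes.

    records is a hypothetical iterable of
    (cultural_id, language_group, uep, haplotype) tuples, where haplotype
    is a sequence of repeat counts for the six markers.
    """
    groups = defaultdict(list)
    for cultural_id, language_group, uep, haplotype in records:
        groups[(cultural_id, language_group, uep)].append(tuple(haplotype))
    return {
        key: haps
        for key, haps in groups.items()
        if len(haps) > 1 and len(set(haps)) > 1
    }


# Made-up records, for illustration only.
sample = [
    ("GroupA", "Semitic", "J", (14, 16, 23, 10, 11, 12)),
    ("GroupA", "Semitic", "J", (14, 17, 23, 10, 11, 12)),  # differs, so kept
    ("GroupB", "Cushitic", "J", (14, 16, 23, 10, 11, 12)),
    ("GroupB", "Cushitic", "J", (14, 16, 23, 10, 11, 12)),  # identical pair
    ("GroupC", "Omotic", "A3b2", (15, 12, 21, 10, 11, 13)),  # N = 1
]
kept = eligible_groups(sample)
print(sorted(kept))
```

Only the first group survives in this toy input: the second consists of identical haplotypes and the third has a single sample.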

The tables are ordered according to the frequencies of the tested UEPs in Ethiopia, i.e. E*(x E1b1a), 3985 haplotypes > J, 689 haplotypes > A3b2, 601 haplotypes > K*(xL,N1c,O2b,P), 154 haplotypes > BT*(xDE,JT), 193 haplotypes and E1b1a7, 46 haplotypes.

Note that the mean TMRCAs for both the Zhivotovsky rates (Z-TMRCA) and the pedigree rates (P-TMRCA), sometimes also known as germline rates, are in units of generations. The suitable length of a generation for the Z-TMRCA is 25 years, while for the P-TMRCA it may range from 28 to 33 years.
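As a small worked example of the unit conversion, the generation lengths above can be applied as below. The function and constant names are illustrative only, and the 100-generation input is an arbitrary value chosen for the demonstration.

```python
# Sketch of the generations-to-years conversion, using the generation
# lengths stated in the text. Names are illustrative, not from the calculator.

Z_YEARS_PER_GENERATION = 25          # Zhivotovsky (Z-TMRCA)
P_YEARS_PER_GENERATION = (28, 33)    # pedigree / germline (P-TMRCA) range


def z_tmrca_years(generations):
    """Convert a Z-TMRCA in generations to years."""
    return generations * Z_YEARS_PER_GENERATION


def p_tmrca_years(generations):
    """Convert a P-TMRCA in generations to a (low, high) range in years."""
    low, high = P_YEARS_PER_GENERATION
    return generations * low, generations * high


print(z_tmrca_years(100))   # 2500
print(p_tmrca_years(100))   # (2800, 3300)
```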

If details of the TMRCA analysis for any of the populations listed below are required, go to the table here, upload the necessary file into the Y TMRCA calculator and filter for the specific population in question.

Tuesday, February 11, 2014

Ethiopian YDNA J STR Analysis - An addendum

In the past, I had carried out a TMRCA (STR) analysis of YDNA haplogroup J haplotypes from Ethiopia using the primary dataset from the Plaster thesis that was discussed here. While that particular dataset had a large number of haplotypes, it also had a low number of markers (6). However, supplementary data with Y-STR haplotypes from haplogroup J was also supplied with the paper. While it only had data for a select few of the populations found in the main paper, it had better resolution, with typing at 14 markers. Below are the TMRCA results for those haplotypes. The dataset can be found in this table in .csv format under "Ethiopian_JM267.csv".
In total, 54 haplotypes were found in the supplementary dataset. The haplotypes among the population groups above nevertheless sum to 53, because one haplotype that belonged to the Anuak dataset was not included.

The results are quite consistent with the results I got from the dataset with less resolution, even if the sample sizes are quite small. For instance, although the Afar had the J Haplogroup in excess of 25%, their haplotypes show the least amount of diversity, conversely the high diversity of Haplogroup J in the other populations is still maintained. 

While the Zhivotovsky TMRCA (Z-TMRCA) for all the 691 YDNA J haplotypes found in Ethiopia in the lower resolution dataset was previously calculated at 595 generations, the Z-TMRCA for all 54 haplotypes of the higher resolution dataset, as seen above, was calculated at 705 generations. If only the markers from the lower resolution dataset were used to compute the Z-TMRCA for these 54 haplotypes, we would get a Z-TMRCA of 631 generations. Furthermore, if we intersected the 14 markers from this dataset with the recommended Zhivotovsky markers, the resulting markers of '19', '393', '392', '391', '390', '439', '388', '389-1' and '389-2' would yield a Z-TMRCA of 920 generations (~23,000 years at 25 years per generation), implying an introduction of YDNA J-M267 in Ethiopia well into the Upper Paleolithic.

Update: With respect to the low resolution haplotypes from the Plaster thesis, I have added 5,726 YDNA STR haplotypes in *.csv format, compatible with the calculator and tabulated according to the UEPs tested, to the table at the link below as well: http://ehelix.pythonanywhere.com/init/default/Example_Files

Monday, January 27, 2014

Y TMRCA Calculator as a Web App

The Y DNA (STR) TMRCA calculator can now be accessed as a web application with full functionality here:

http://ehelix.pythonanywhere.com/

It is also embedded in this blog in a new page (above)

UPDATE (02/11/2014)

Another series of updates for the calculator:

  • Users can now use the previously idle first column in the csv file to group haplotypes together and thus compute the TMRCA for a specified group (see example below)
  • The application now also accepts locus names in NIST format.
  • It now automatically deletes any haplotype with a non-integer value given for any locus in the *.csv file (instead of producing an error for that scenario).
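The grouping and clean-up behaviour in the list above might look roughly like the sketch below. It is not the web app's actual code: the function name, the sample data and the choice to silently skip bad rows are assumptions made for illustration, and it is written for Python 3.

```python
import csv
import io
from collections import OrderedDict


def load_groups(csv_text):
    """Read STR haplotypes from CSV text, drop any haplotype that has a
    non-integer value, and group the rest by the label in the first column.

    Returns (marker_names, groups), where groups maps each label to a list
    of haplotypes (lists of integer repeat counts).
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    groups = OrderedDict()
    for row in body:
        label, values = row[0], row[1:]
        try:
            repeats = [int(value) for value in values]
        except ValueError:
            continue  # non-integer or empty value: skip this haplotype
        groups.setdefault(label, []).append(repeats)
    return header[1:], groups


# Made-up data: the first column is used as the group label.
EXAMPLE = """Dataset,DYS19,DYS390,DYS391
GroupA,13,24,10
GroupA,14,24,10
GroupB,13,23,11
GroupB,13,x,11
GroupB,13,23.5,11
"""

markers, groups = load_groups(EXAMPLE)
for label, haplotypes in groups.items():
    print(label, len(haplotypes))
```

Each group can then be passed to whatever TMRCA routine is in use; the two GroupB rows containing "x" and "23.5" are discarded rather than raising an error.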

Tuesday, October 22, 2013

New paper sheds light on the F-series YDNA SNPs

The F-series YDNA SNPs appeared at the end of last year with results from Geno 2.0. Now an electronic pre-print at arXiv.org sheds some light on the discovery of these SNPs.

The paper, entitled "Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers", is freely available for download.

Some interesting (relevant to this blog) quotations from the paper follow (in blue):

To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades.

Nearly all the Y chromosomes outside Africa are derivative at the SNP M168 and belong to any of its three descendent super-haplogroups – DE, C, and F 9,10,15, strongly supporting the out-of-Africa theory. The time of the anatomically modern human’s exodus from Africa has yielded inconsistent results ranging from 39 kya 16, 44 kya 10, 59 kya 17, 68.5 kya 18 to 57.0 – 74.6 kya 19.


The passage below explains why the F-series SNPs are for the most part found below CT-M168.

we selected 110 males, encompassing the haplogroups O, C, D, N, and Q which are common in East Eurasians, as well as haplogroups J, G, and R which are common in West Eurasians (see Table S1), and sequenced their non-repetitive segments of NRY using a pooling-and-capturing strategy.


Overall ~4,500 base substitutions were identified in all the samples from the whole Y chromosome, in which >4,300 SNPs that has not been publicly named before 2012 (ISOGG etc.). We designated each of these SNP a name beginning with ‘F’ (for Fudan University) (see Table S2). We obtained ~3.90 Mbp of sequences with appropriate quality (at least 1x coverage on >100 out of 110 samples), and identified ~3,600 SNPs in this region.


Table S2 is not available in the PDF file. The link says that all the tables are in a 'separate ancillary file', but no such file is available either, at least not at the time of publishing this post; it may become available when the paper is officially published. Without seeing the actual locations on the Y chromosome where these SNPs are found, it is hard to say how many of them are redundant relative to the PF and CTS SNPs, and how many of them are truly 'novel'.

Considering that 3.9 Mbp range constitutes only less than half of 10 Mbp non-repetitive region in Y chromosome 7, the time resolution of east Asian Y chromosome phylogeny is expected to be doubled in the near future.


To overcome the factors for uncertainty of mutation rate, a calibration with series of samples of comparable time scales might be used. For the case of mitochondrion, a recent study, in which several C-14 calibrated ancient complete sequences (4 – 40 kya) were incorporated into the tree, made the absolute dates much more convincing 41, and we expect a parallel calibration for the Y chromosome in the near future.


The authors conclude the paper with this paragraph:

Despite of the mutation rate uncertainty, we evaluate our calculation of absolute divergence time as acceptable. Firstly, our out-of-Africa date (54.1 kya) is still within the range of previous estimations (39 – 74.6 kya). Secondly, the out-of-Africa date is similar to the recent estimation of two great mitochondrial expansions outside Africa – M (49.6 kya) and N (58.9 kya) 42. Thirdly, it is not contradictory to the emergence of earliest modern human fossil out of Africa (e.g. ~ 50 kya in Australia) 43.

In the Supplementary Materials/Additional Discussions section they also mention this:

It remained mysterious that how many times the anatomically modern human migrated out of Africa, since that among the three superhaplogrous C, DE and F, Haplogroup F distributes in whole Eurasia, C in Asia and Austronesia, D exclusively in Asia, while D’s brother clade E distribute mainly in Africa 62, so there are two hypotheses, 1) haplogroups D and CF migrated out of Africa separately; 2) the single common ancestor of CF and DE migrated out of Africa followed by a back-migration of E to Africa. From this study, the short interval between CF/DE and C/F divergences weakens the possibility of multiple independent migrations (CF, D, and DE*) out of Africa, and thus supports the latter hypothesis 63 (Fig. S2 a).


Perhaps the only new material from this study that may strengthen the hypothesis of an extra-African origin of haplogroup E is, as they mention, the 'short interval' between the common ancestor of CF and DE and the C/F divergence. However, this 'short interval' is relative to which branch length? They did not compute the interval between the BT common ancestor and the CF/DE divergence. In addition, what length of time would be considered short enough to rule out the possibility of multiple independent migrations, and how would this length of time be evaluated? Next, what about the cases of DE* found in Nigeria and Guinea-Bissau that they failed to mention here, that is to say, cases that are neither D nor E but are downstream from the YAP+ insertion? How exactly are they to be explained?

Either way, putting all these questions aside, let us assume that their proposal is correct. How then would this be reconciled with the last paragraph in the actual paper, where they associate mtDNA haplogroups M and N with the out of Africa expansion? This would mean that if E back-migrated, it would have done so with lineages downstream from mtDNA haplogroups M and N. However, many areas in Africa where E dominates (except for East and North Africa) have close to zero, if not zero, mtDNA haplogroups M and N. Wouldn't we expect to see at least some traces of the mtDNA counterpart of this supposed ancient back migration in YDNA haplogroup E dominant areas of Africa other than the East and the North? In an otherwise good and all-around informative paper, I think the authors may have jumped the gun with this particular speculation. Perhaps that is why they put it in the supplementary section and not in the paper itself, as a testament to the highly speculative nature of their supposition.

Tuesday, October 8, 2013

TMRCA calculator for Python

I have ported the TMRCA calculator, which previously ran only on Octave, to Python as well; see here for the Octave version.
It is specifically made for Python 2.7, and I have not had a chance to test it on other versions. No libraries are required to run the script other than the standard libraries that come with 2.7. Some of the advantages of converting to Python are: fewer steps to run the program, easier (future) web app deployment and wider user access to Python than to Octave.

The Zip file can be downloaded here: https://dl.dropboxusercontent.com/u/42082352/TMRCA.zip
--------------------------------------------------------------------------------------------------------
TMRCA Calculator Instructions - for python 2.7

To check that the TMRCA program is working correctly on your system, first run it with the dataset
provided here before trying different datasets. To do so:

(1) Make sure you have python 2.7 loaded on your system (either Windows or Linux will work) and start running the interpreter.
(2) In the interpreter, change your working directory to the directory where you saved the unzipped folder by using:
(i) import os 
(ii) os.chdir('~PATH/TMRCA/')
-Where ~PATH is the full path where the TMRCA folder is placed on your system.
If you are unsure of your current working directory, type the command: os.getcwd()
(3) Import the tmrca module by typing: import tmrca
(4) Execute the script by typing: tmrca.Analysis('EM35_Example.csv','all')
(5) If this produces results with no errors in the interpreter, then the program is correctly installed and you can proceed to reading and analysing different datasets.

Reading and analysing new Data

After correctly executing the above steps, read and analyse new data by using the following steps:
(1) Examine the example STR data file in the "TMRCA/" folder entitled "EM35_Example.csv"
(2) Any STR data file to be analysed should first be put in the same format as the "EM35_Example.csv" file, specifically:
(a) DYS names in the first row should have the exact same nomenclature (the order can be different, however).
(b) Each row (except the first) should represent one sample.
(c) Each column (except the first) should represent repeats for one marker/DYS#.
(d) The first column should represent sample identifiers, e.g. Kit#, sample ID,...
(e) The cell found in the first row and first column should have the dataset's name; this will be the same name used throughout the analysis.
(f) No cell may contain a null value, and cells should not contain text with spaces in it.
(g) The file MUST be a *.csv file with commas used as field delimiters
(3) Place the *.csv file directly in the "TMRCA/" folder (i.e. in your working directory)
(4) Start the interpreter, change the working directory to '~PATH/TMRCA/', as per the instructions above and import tmrca.
(5) If you want to analyse a specific set of markers from your dataset go to step 6, otherwise go to step 7
(6) Go to the file "/TMRCA/Markerlist/49markerlist.txt", and pick the markers you want to use for analysis from there. Save your chosen
markers into a new *.txt file and into the same folder as "/TMRCA/Markerlist/". Take a look at  any of the other marker list text files in
the folder for an example of how a marker list should look. Note that all marker list files need to be *.txt
(7) If you are specifying a set of markers to use for the analysis, for example "8_Chiaronimarkerlist.txt", then run the program
by typing: tmrca.Analysis('EM35_Example.csv','8_Chiaronimarkerlist.txt'); otherwise, just type: tmrca.Analysis('EM35_Example.csv','all').
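Before running an analysis, the formatting rules in step 2 can be checked with a small helper along these lines. This is a sketch and not part of the TMRCA package: check_str_csv and its known_markers parameter are hypothetical names, the sample data is made up, and the code is written for Python 3 rather than 2.7.

```python
import csv
import io


def check_str_csv(csv_text, known_markers=None):
    """Return a list of formatting problems found in STR CSV text.

    An empty list means none of the checks below failed. known_markers is
    an optional set of accepted marker names (for example, the names read
    from a marker list file); when omitted, marker names are not checked.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["need a header row plus at least one sample row"]
    problems = []
    header = rows[0]
    width = len(header)
    if known_markers is not None:
        for name in header[1:]:
            if name not in known_markers:
                problems.append("unknown marker name %r" % name)
    for line_no, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append("row %d: expected %d cells, found %d"
                            % (line_no, width, len(row)))
        for cell in row:
            if cell.strip() == "":
                problems.append("row %d: empty cell" % line_no)
            elif " " in cell.strip():
                problems.append("row %d: cell %r contains a space"
                                % (line_no, cell))
    return problems


good = "EM35,DYS19,DYS390\nkit1,13,24\nkit2,14,23\n"
bad = "EM35,DYS19,DYS390\nkit 1,13,24\nkit2,,23\n"
print(check_str_csv(good))
print(check_str_csv(bad))
```

The helper only reports problems; it does not modify the file, so the data can be fixed by hand before it is placed in the "TMRCA/" folder.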

Wednesday, July 31, 2013

A summary of interesting recent genetics papers.

I'm taking a break from my Summer break to post a few interesting papers that have come out within the past couple of months.


This paper supports the notion of continuous gene flow between Africans and non-Africans since the major Out of Africa event that preceded the populating of all continents outside of Africa.
To be sure, this notion is not new and has been highlighted before by the methods of authors such as Li and Durbin (2011). It is also sufficient to explain the intermediate genetic nature of West Eurasians, i.e. between Africans and East Asians/Native Americans, that I have blogged about and demonstrated using ADMIXTURE in the past.


A few quotes from the paper:

"In this paper, we study the length distribution of tracts of identity by state (IBS), which are the gaps between pairwise differences in an alignment of two DNA sequences. These tract lengths contain information about the amount of genetic diversity that existed at various times in the history of a species and can therefore be used to estimate past population sizes. IBS tracts shared between DNA sequences from different populations also contain information about population divergence and past gene flow. By looking at IBS tracts shared within Africans and Europeans, as well as between the two groups, we infer that the two groups diverged in a complex way over more than 40,000 years, exchanging DNA as recently as 12,000 years ago." 

"To illustrate the power of our method, we use it to infer a joint history of Europeans and Africans from the high coverage 1000 Genomes trio parents. Previous analyses agree that Europeans experienced an out-of-Africa bottleneck and recent population growth, but other aspects of the divergence are contested [47]. In one analysis, Li and Durbin separately estimate population histories of Europeans, Asians, and Africans and observe that the African and non-African histories begin to look different from each other about 100,000–120,000 years ago; at the same time, they argue that substantial migration between Africa and Eurasia occurred as recently as 20,000 years ago and that the out-of-Africa bottleneck occurred near the end of the migration period, about 20,000–40,000 years ago. In contrast, Gronau, et al. use a likelihood analysis of many short loci to infer a Eurasian-African split that is recent enough (50 kya) to coincide with the start of the out of Africa bottleneck, detecting no evidence of recent gene flow between Africans and non-Africans [14]. The older Schaffner, et al. demographic model contains no recent European-African gene flow either [48], but Gutenkunst,et al. and Gravel, et al. use SFS data to infer divergence times and gene flow levels that are intermediate between these two extremes [22][49]. We aim to contribute to this discourse by using IBS tract lengths to study the same class of complex demographic models employed by Gutenkunst, et al. and Gronau, et al., models that have only been previously used to study allele frequencies and short haplotypes that are assumed not to recombine. Our method is the first to use these models in conjunction with haplotype-sharing information similar to what is used by the PSMC and other coalescent HMMs, fitting complex, high-resolution demographic models to an equally high-resolution summary of genetic data."

"We estimate that the European-African divergence occurred 55 kya and that gene flow continued until 13 kya. About 5.8% of European genetic material is derived from a ghost population that diverged 420 kya from the ancestors of modern humans. The out-of-Africa bottleneck period, where the European effective population size is only 1,530, lasts until 5.9 kya."

"Our inferred human history mirrors several controversial features of the history inferred by Li and Durbin from whole genome sequence data: a post-divergence African population size reduction, a sustained period of gene flow between Europeans and Yorubans, and a “bump” period when the ancestral human population size increased and then decreased again. Unlike Li and Durbin, we do not infer that either population increased in size between 30 and 100 kya. Li and Durbin postulate that this size increase might reflect admixture between the two populations rather than a true increase in effective population size; since our method is able to model this gene flow directly, it makes sense that no size increase is necessary to fit the data. In contrast, it is possible that the size increase we infer between 240 kya and 480 kya is a signature of gene flow among ancestral hominids."

"Our estimated divergence time of 55 kya is very close to estimates published by Gravel, et al. and Gronau, et al., who use very different methods but similar estimated mutation rates to the [...] per site per generation that we use in this paper. However, recent studies of de novo mutation in trios have shown that the mutation rate may be closer to [...] per site per generation [51][55][56]. We would estimate older divergence and gene flow times (perhaps [...] times older) if we used the lower, more recently estimated mutation rate. This is because the lengths of the longest IBS tracts shared between populations should be approximately exponentially distributed with decay rate [...]."




This paper discusses several points, or rather the lack of evidence, that make a pre-Toba migration of modern humans out of Africa almost impossible to reconcile with currently available evidence.

A few quotes from the paper:

"There are currently two sharply conflicting models for the earliest modern human colonization of South Asia, with radically different implications for the interpretation of the associated genetic and archaeological evidence (Fig. 1). The first is that modern humans arrived ∼50–60 ka, as part of a generalized Eurasian dispersal of anatomically modern humans, which spread (initially as a very small group) from a region of eastern Africa across the mouth of the Red Sea and expanded rapidly around the coastlines of southern and Southeast Asia, to reach Australia by ∼45–50 ka (7–10, 14–18) (Fig. 2). The second, more recently proposed view, is that there was a much earlier dispersal of modern humans from Africa sometime before 74 ka (and conceivably as early as 120–130ka), reaching southern Asia before the time of the volcanic “supereruption” of Mount Toba in Sumatra (the largest volcanic eruption of the past 2 million y) at ∼74 ka (1–6)."
"We find no evidence, either genetic or archaeological, for a very early modern human colonization of South Asia, before the Toba eruption. All of the available evidence supports a much later colonization beginning ∼50–55 ka, carrying mitochondrial L3 and Y chromosome C, D, and F lineages from eastern Africa, along with the Howiesons Poort-like microlithic technologies (see above and Genetics and Archaeology). We see no reason to believe that the initial modern human colonization of South and Southeast Asia was distinct from the process that is now well documented for effectively all of the other regions of Eurasia from ∼60 ka onward, even if the technological associations of these expanding populations differed (most probably for environmental reasons) between the eastern and northwestern ranges of the geographical dispersal routes."

"The archaeological evidence initially advanced to support an earlier (pre-Toba) dispersal of African-derived populations to southern Asia has since been withdrawn by the author responsible for the original lithic analyses, who now suggests that they are most likely “the work of an unidentified population of archaic people” (ref. 11, p. 26). Meanwhile, the genetic evidence outlined earlier indicates that any populations dispersing from Africa before 74 ka would predate the emergence of the mtDNA L3 haplogroup, the source for all known, extant maternal lineages in Eurasia (8, 28) (Fig. 5). The size of the mtDNA database is very substantial: currently there are almost 13,000 complete non-African mtDNA genomes available, not one of which is pre-L3."




This paper, written by a genealogical community member, makes an impressive effort at creating and automating a comprehensive method to phylogenetically classify Geno 2.0 YDNA SNPs. Details of the algorithm are not available:

"To illustrate this, the author has used this Y-tree clade predictor (using the latest ISOGG tree as a basis for comparison) to classify over 1650 sets of publicly accessible Geno 2.0 Y-SNP calls. This information was then used as an input into another algorithm designed by the author – an algorithm developed to automate the construction of a phylogenetic Y-tree, while overcoming the challenges identified above. The technical details of this process will remain proprietary for the time being."



Wednesday, May 8, 2013

Another Extensive thesis on East African DNA


It was brought to my attention last week, thanks to a comment on this blog made by the user 'Umi', that another thesis on East African DNA variation was publicly available online:

Complex Genetic History of East African Human Populations

This is also an extensive thesis with a wealth of information akin to Plaster's thesis. The primary differences are that this one is more focused on parts of East Africa further to the south of Ethiopia and that, in addition to uni-parental analysis, it also includes some autosomal model-based inference, albeit of quite low resolution by today's standards: 848 microsatellites and 479 indels (refer to Tishkoff et al. 2009 for marker details).

Due to the extensive nature of the report I haven't had a chance to cover its entire scope, instead, for starters, I have first focused on the YDNA data by creating a relative frequency chart from the results reported in Fig. 3.3.2. 

Several things to point out initially:

  • The report outlines the discovery of 4 new SNPs, TL1-4. The first two were found in haplogroup B, downstream from B-M150 and B-M112 respectively. The last two, TL3 and TL4, were found in haplogroup E, downstream from E-U174 and E-V32 respectively. Incidentally, the fourth SNP, TL4, which is under E-V32, could potentially be the same as Z808/Z809 as identified recently by the genealogical community. However, as the report does not give the Y-chromosome location of the SNP in an NCBI Build 36/37 format, this cannot be verified, at least by me, at the moment.
  • A couple of the frequency results in Fig. 3.3.2 do not add up, in particular, the frequency results for the Boni and the Baggara, but also to a lesser extent for the Kanuri and Teita.  I have labeled the missing frequency results with a “?” in the relative charts for those specific populations.
  • The Burji and Konso are labeled as being only from Kenya throughout the report, however most Burji are from Ethiopia, and the Konso are exclusively found in Ethiopia, I have reflected this in the charts.
  • STR data is not readily available to perform TMRCA estimates on, however, some TMRCA results are reported using Zhivotovsky's rates in Table 3.3.1, nevertheless, these are estimates only for different lineages found in the dataset for all the samples and not necessarily comparing TMRCAs in the different populations under study.
  • J-M62, while a subclade of J-M267, is not the main subclade of J-M267 found in East Africa, that would be J-P58, therefore, the results for J-12f2.1 (x M62, M172) reported, may after all be, or largely include, J-P58 lineages, off-course those results could also include variants of J-M267 other than J-P58 and J-M62 as well since the SNP was not directly tested. 
  • E-P2* lineages are abundantly found (> 30%) in the Konso, Burji and Mbugwe, however on closer examination and correlation with current data, these could be E-M329, E-V38* or even E-M215*, as none of these SNPs were directly tested. Genuine E-P2* lineages would be positive for E-P2 and negative for V38 and M215 (See Trombetta et al. 2011)
  • Similarly, the E-M35* lineages reported could be members of relatively newly discovered lineages of E-Z830*( See this post for details), or some of the untested variantes of E-M35, i.e.  E-V42, V92 and maybe even E-V68 (x M78)
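
The check behind the "?" labels in the charts can be sketched as follows. This is only an illustration: the function name, the tolerance and the example numbers are made up here and are not taken from the thesis or from Fig. 3.3.2.

```python
def missing_share(freqs, tol=0.005):
    """Return the unreported share of one population's haplogroup frequencies.

    freqs maps a haplogroup label to its relative frequency (0..1).
    Returns 0.0 when the values sum to 1 within the tolerance.
    """
    gap = 1.0 - sum(freqs.values())
    return 0.0 if abs(gap) <= tol else round(gap, 4)


# Hypothetical population, NOT real data from the thesis.
example = {"E-M35*": 0.40, "A-M13": 0.25, "B-M60": 0.20}
print(missing_share(example))  # the part that would be charted as "?"
```

A positive result is the share that has no haplogroup assigned; a negative result would mean the reported frequencies exceed 100%.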

Tuesday, May 7, 2013

Analyzing YDNA A-M13 lineages in Ethiopian linguistic groups

Similar to the previous analysis of J lineages found in Ethiopia from the Plaster paper, the other prevalent lineage in Ethiopia, A-M13 (formerly known also as A3b2), is also analyzed below. A total of 616 A-M13 lineages were reported in the study, of which ~32% were classified as Semitic speakers, ~40% as Cushitic speakers, ~17% as Omotic speakers and the remainder within the Nilo-Saharan speaking macro-phylum.

Wednesday, May 1, 2013

Analyzing YDNA J lineages in Ethiopian linguistic groups

The extensive YDNA dataset found in the Plaster paper has a total of 691 YDNA lineages that belong to haplogroup J. Although no more detailed SNP resolution is reported for most of these lineages, it is safe to assume, from previous data on Ethiopia, that the vast majority of them belong to J1-M267. A limited set of STR data accompanies these lineages as well, covering only the DYS markers 19, 388, 390, 391, 392 and 393.

According to the report, J lineages are found at proportionally higher levels in Semitic speakers in Ethiopia (~21%), followed by Omotic speakers (~12%) and Cushitic speakers (~8%). Out of the 691 YDNA J lineages reported, 259 were from Semitic speakers, 266 from speakers of some type of Omotic language, and most of the remainder from speakers of Cushitic languages.

Sunday, April 21, 2013

Source code for the ASD based TMRCA calculator (Octave)

The code for the TMRCA calculator of YDNA STR haplotypes that I use can be downloaded from here : https://dl.dropboxusercontent.com/u/42082352/TMRCA_ASD.zip

See also here for instances of where I have used the calculator in the past:
http://ethiohelix.blogspot.com/2012/06/finding-tmrca-of-ethiopian-ydna.html
http://ethiohelix.blogspot.com/2012/11/extensive-doctoral-thesis-on-ethiopian.html
http://ethiohelix.blogspot.com/2013/01/tmrca-calculations-from-plaster-nry.html
http://ethiohelix.blogspot.com/2013/02/the-zhivotovsky-multiplier.html
http://ethiohelix.blogspot.com/2013/03/african-sahel-ydna.html

The code is written for Octave and is also Matlab compatible. The folder linked above also contains an instruction file that explains how to run the calculator; it is reproduced below:
---------------------------------------------------------------------------------------------------------


To check that the TMRCA program is working correctly on your system, first run it with the dataset
provided here before trying different datasets. To do so:

(1) Make sure you have Octave installed on your system (either Windows or Linux will work) and start Octave from the command line.
(2) In the command line, change your working directory to the directory where you saved the unzipped folder by using: cd ~PATH/TMRCA_ASD/
If you are unsure of your current working directory, type the command: pwd()
(3) Type: fcompositeTMRCA("Buckova_EM78","all")
(4) If this produces results, then the program and functions are correctly installed and you can proceed to reading and analysing different datasets.


Reading and analysing new Data

After correctly executing the above steps, read and analyse new data by using the following steps:
(1) Open the example STR data file in the "TMRCA_ASD/Loaded_Data/" folder entitled "EM35_STR.xls".
(2) Any STR data file to be analysed should first be put in the same format as the "EM35_STR.xls" file, specifically:
(a) DYS names in the first row should have the exact same nomenclature (the order can be different, however).
(b) Each row (except the first) should represent one sample.
(c) Each column (except the first) should represent repeats for one marker/DYS#.
(d) The first column should represent sample identifiers, e.g. Kit#, sample ID,...
(e) The cell found in the first row and first column should have the dataset's name; this will be the same name used throughout the analysis.
(f) No cell shall contain null values, and avoid cells whose contents have spaces in between characters.
(3) In Excel or OpenOffice, convert the "EM35_STR.xls" workbook to a ".csv" file by saving it as "YSTR.csv" in the
same "TMRCA_ASD/Loaded_Data/" folder. The program will only look for a file entitled "YSTR.csv", so make sure that the same name is used for your file.
(4) Start Octave and, in the command line, change the working directory to "~PATH/TMRCA_ASD/Loaded_Data/"
(5) Type on the Octave prompt: readdata
(6) Octave will start reading the dataset and create the file "EM35-Balanced" in the folder "/TMRCA_ASD/Loaded_Data/" when it is finished.
(7) If you want to analyse a specific set of markers from your dataset go to step 8, otherwise go to step 9.
(8) Go to the file "/TMRCA_ASD/Markerlist/49markerlist.txt" and pick the markers you want to use for the analysis. Then save your chosen
markers into a new txt file in the same "/TMRCA_ASD/Markerlist/" folder. Take a look at the file "8_Chiaronimarkerlist.txt" for
an example of how the marker list should look.
(9) In Octave, change your working directory back up one level by typing: cd ..
(10) If you are specifying a set of markers to use in the analysis, then run the program by typing: fcompositeTMRCA("EM35-Balanced","8_Chiaronimarkerlist.txt"), otherwise, just type: fcompositeTMRCA("EM35-Balanced","all").
----------------------------------------------------------------------------------------------------------
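
Before running readdata, the layout rules (b) to (f) above can be checked with a few lines of Python. This is a hypothetical helper, not part of the TMRCA_ASD package: the function name and messages are made up here, and rule (a) is left out because the marker nomenclature of "EM35_STR.xls" is not reproduced in this post.

```python
import csv
import io


def check_ystr_csv(text):
    """Return a list of layout problems found in YSTR.csv-style text.

    An empty list means every row has the same number of cells, no cell
    is empty and no cell contains an internal space.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["need a header row and at least one sample row"]
    problems = []
    width = len(rows[0])  # the header row fixes the expected column count
    for r, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append(f"row {r}: expected {width} cells, found {len(row)}")
        for c, cell in enumerate(row, start=1):
            if cell.strip() == "":
                problems.append(f"row {r}, column {c}: empty cell")
            elif " " in cell.strip():
                problems.append(f"row {r}, column {c}: space inside {cell!r}")
    return problems


# Made-up two-marker example; the first cell holds the dataset name.
good = "EM35,DYS19,DYS388\nkit1,13,12\nkit2,14,12\n"
bad = "EM35,DYS19,DYS388\nkit 1,13,\n"
print(check_ystr_csv(good))
print(check_ystr_csv(bad))
```

To check a real file, read it first, e.g. check_ystr_csv(open("YSTR.csv").read()).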
Update : Version2 -  *.CSV read, + Auto path detect. (fcompositeTMRCA.m, fmarkerextract.m, readdata.m)
Update(04/25/13) : Version3 - Add option for using all available markers, print used/unused markers. (fcompositeTMRCA.m, fmarkerextract.m, fAssignmutation.m)
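
For readers without Octave, the general idea of an ASD (average squared distance) estimate can be sketched in Python. This is a minimal illustration under stated assumptions and not a port of the downloadable code: the modal allele stands in for the founder allele, the per-marker terms are averaged with equal weight, and the function name, haplotypes and mutation rates below are all invented for the example.

```python
from statistics import mode


def asd_tmrca(haplotypes, rates):
    """Generations to the MRCA from the average squared distance (ASD).

    haplotypes: list of dicts, marker name -> repeat count
    rates:      dict, marker name -> mutation rate per generation
    Assumption: the modal allele of each marker is the founder allele.
    """
    per_marker = []
    for marker, mu in rates.items():
        alleles = [h[marker] for h in haplotypes]
        founder = mode(alleles)
        # Mean squared repeat difference from the assumed founder allele.
        asd = sum((a - founder) ** 2 for a in alleles) / len(alleles)
        per_marker.append(asd / mu)
    return sum(per_marker) / len(per_marker)


# Invented haplotypes and invented rates, for illustration only.
samples = [
    {"DYS19": 13, "DYS390": 24},
    {"DYS19": 13, "DYS390": 24},
    {"DYS19": 14, "DYS390": 24},
    {"DYS19": 13, "DYS390": 25},
]
rates = {"DYS19": 0.002, "DYS390": 0.0025}
print(asd_tmrca(samples, rates))  # generations, roughly 112.5 here
```

Multiplying the result by an assumed generation length converts it to years.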

Thursday, March 7, 2013

African Sahel YDNA


Multiple and differentiated contributions to the male gene pool of pastoral and farmer populations of the African Sahel


ABSTRACT

The African Sahel is conducive to studies of divergence/admixture genetic events as a result of its population history being so closely related with past climatic changes. Today, it is a place of the co-existence of two differing food-producing subsistence systems, i.e., that of sedentary farmers and nomadic pastoralists, whose populations have likely been formed from several dispersed indigenous hunter-gatherer groups. Using new methodology, we show here that the male gene pool of the extant populations of the African Sahel harbors signatures of multiple and differentiated contributions from different genetic sources. We also show that even if the Fulani pastoralists and their neighboring farmers share high frequencies of four Y chromosome subhaplogroups of E, they have drawn on molecularly differentiated subgroups at different times. These findings, based on combinations of SNP and STR polymorphisms, add to our previous knowledge and highlight the role of differences in the demographic history and displacements of the Sahelian populations as a major factor in the segregation of the Y chromosome lineages in Africa. Interestingly, within the Fulani pastoralist population as a whole, a differentiation of the groups from Niger is characterized by their high presence of R1b-M343 and E1b1b1-M35. Moreover, the R1b-M343 is represented in our dataset exclusively in the Fulani group and our analyses infer a north-to-south African migration route during a recent past.

Closed Access



Y(x CF)  Phylogeny, Red = SNPs Tested, Blue =Presumed Tested 
CF Phylogeny, Red = SNPs Tested, Blue =Presumed Tested

Monday, March 4, 2013

Geno 2.0 YDNA SNP Pathways.


The Geno 2.0 chip tests some 13,000 SNPs on the Y-Chromosome, by far the largest number of any commercial DNA company. In addition, many of these SNPs do not have an assigned place in the YDNA phylogeny, and no official phylogeny has been published yet either.

However, the customers of this project get the option to transfer the SNPs to FTDNA and thereby join the numerous grouped projects under the FTDNA umbrella, which then display the SNPs they tested positive for.

Although we don't know where most of these SNPs belong on the YDNA tree, we do know where some of them belong, and by utilizing the most rudimentary operations of set mathematics (union, intersection and set difference), in addition to the positions of the known SNPs in the current YDNA phylogeny tree (ISOGG 2013) it is possible to segregate these SNPs that appear on the project pages into phylogenetic pathways.
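
A minimal sketch of this set logic in Python, assuming error-free calls. The kit contents are hypothetical: M168, M96 and M89 are used because their positions (CT, E and F respectively) are known, while X1 to X3 are placeholders for unplaced SNPs rather than real SNP names.

```python
def split_pathways(kit_a, kit_b):
    """Split the positive SNP calls of two kits into three groups.

    shared: positive in both kits, so on their common path from the root
    only_a: positive only in kit A, so on the branch leading to kit A
    only_b: positive only in kit B, so on the branch leading to kit B
    """
    shared = kit_a & kit_b   # intersection
    only_a = kit_a - kit_b   # set difference
    only_b = kit_b - kit_a
    return shared, only_a, only_b


# Hypothetical kits: one from haplogroup E, one from haplogroup F.
kit_e = {"M168", "M96", "X1", "X2"}
kit_f = {"M168", "M89", "X1", "X3"}

shared, only_e, only_f = split_pathways(kit_e, kit_f)
print(sorted(shared))   # X1 travels with M168, i.e. at CT or upstream
print(sorted(only_e))   # X2 travels with M96, i.e. on the branch towards E
print(sorted(only_f))   # X3 travels with M89, i.e. on the branch towards F

# Pooling several kits of the same clade first is a union: kit_1 | kit_2
```

Repeating the comparison across more kits narrows each unplaced SNP down to a shorter stretch of the tree.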

This posting will change frequently as more and more kits appear in the FTDNA project pages.

The first thing to realize is that the following 101 SNPs are either erroneous or erroneously reported and need to be discarded if they appear in any of the results, until FTDNA, NATGEO or whoever else is responsible fixes them:

CTS1034+ CTS10436+ CTS10713+ CTS10738+ CTS11085+ CTS11454+ CTS11844+ CTS12173+ CTS2080+ CTS2223+ CTS230+ CTS2447+ CTS295+ CTS3234+ CTS335+ CTS3647+ CTS3763+ CTS3914+ CTS4276+ CTS4623+ CTS4714+ CTS477+ CTS5458+ CTS5580+ CTS6010+ CTS6353+ CTS6384+ CTS6891+ CTS7453+ CTS7492+ CTS7859+ CTS7951+ CTS8133+ CTS8178+ CTS8244+ CTS9096+ CTS947+ CTS9512+ CTS9548+ F1173+ F1221+ F1300+ F1327+ F1369+ F1707+ F1754+ F1831+ F1833+ F1842+ F1870+ F1882+ F2000+ F2137+ F2150+ F2177+ F2223+ F2494+ F2503+ F2546+ F2631+ F2845+ F2887+ F2932+ F3035+ F3039+ F317+ F3187+ F3225+ F3394+ F3397+ F3455+ F375+ F3948+ F3965+ F4131+ F4277+ F830+ F842+ F869+ F889+ F910+ F942+ F943+ F969+ L366+ L477+ L493+ L515+ L516+ L517+ L552+ L594+ M263+ PF4208+ PF4330+ PF5061+ PF6868+ PF7392+ Z148+ Z191+ Z365+  
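
Discarding these calls from a kit's results is a simple filter. The sketch below assumes the results are a whitespace-separated string of positive calls in the "NAME+" form used above; the function name is made up, and only three names from the list are included to keep the example short.

```python
def drop_unreliable(calls, blacklist):
    """Remove blacklisted SNPs from a string of calls such as "M168+ M96+"."""
    kept = [c for c in calls.split() if c.rstrip("+") not in blacklist]
    return " ".join(kept)


# Three names taken from the list above; extend with the full 101 in practice.
blacklist = {"CTS1034", "M263", "Z148"}

print(drop_unreliable("M168+ CTS1034+ M96+ Z148+", blacklist))
# prints: M168+ M96+
```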



Notes: Unions will be listed without a symbol, e.g. Set ABC = Set ((A ∪ B) ∪ C)
            Known SNP identification is all based on ISOGG 2013 only.


Pathway from root to CT-M168 (=Set # A)



Binary Operation: Set1 Set2

Number of SNPs: 77


CTS10362+ CTS109+ CTS11358+ CTS11575+ CTS125+ CTS1996+ CTS3331+ CTS3431+ CTS3662+ CTS4364+ CTS4368+ CTS4740+ CTS5318+ CTS5457+ CTS5532+ CTS6383+ CTS6800+ CTS6907+ CTS7922+ CTS7933+ CTS8243+ CTS8980+ CTS9828+ L566+ L781+ M139+ M168+ M294+ M42+ M94+ PF1016+ PF1029+ PF1031+ PF1040+ PF1046+ PF1061+ PF1092+ PF1097+ PF110+ PF1203+ PF1269+ PF1276+ PF15+ PF192+ PF210+ PF212+ PF223+ PF234+ PF258+ PF263+ PF272+ PF278+ PF292+ PF316+ PF325+ PF342+ PF500+ PF601+ PF667+ PF719+ PF720+ PF725+ PF779+ PF796+ PF803+ PF815+ PF821+ PF840+ PF844+ PF892+ PF937+ PF951+ PF954+ PF970+ V189+ V52+ V9+

Identified as same level as BT
Identified as same level as CT-M168
Identified as same level as P <---- Looks unreliable and maybe a false positive report.


Thursday, February 21, 2013

The Zhivotovsky Multiplier


It is reported that Zhivotovsky's effective mutation rate [1] has the effect of increasing the TMRCA of a lineage, as computed by the use of Microsatellite Genetic Distances [2], by a factor of 3-4 relative to TMRCAs computed via mutation rates observed in pedigree and family studies [3].

By utilizing my TMRCA calculating program, I want to explore:
  1. What effect do different marker combinations have on this multiplier?
  2. What effect does marker size have on this multiplier?
  3. Is there a variation in this multiplier for different data-sets?
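
The way such a multiplier arises can be sketched in Python. The sketch assumes an estimator that averages per-marker ASD/rate terms with equal weight, which is an assumption made here for illustration; the marker names, ASD values and pedigree rates are invented. Only the effective rate of 0.00069 per marker per generation is Zhivotovsky's published figure.

```python
EFFECTIVE_RATE = 0.00069  # Zhivotovsky's effective rate, same for every marker


def multiplier(asd, pedigree_rates):
    """Ratio of the effective-rate TMRCA to the pedigree-rate TMRCA.

    asd:            dict, marker -> average squared distance in the data-set
    pedigree_rates: dict, marker -> pedigree mutation rate per generation
    Raises ZeroDivisionError if every ASD value is zero.
    """
    markers = list(asd)
    t_eff = sum(asd[m] / EFFECTIVE_RATE for m in markers) / len(markers)
    t_ped = sum(asd[m] / pedigree_rates[m] for m in markers) / len(markers)
    return t_eff / t_ped


# Invented values: two markers with different pedigree rates.
asd = {"marker_a": 0.2, "marker_b": 0.4}
ped = {"marker_a": 0.002, "marker_b": 0.004}
print(multiplier(asd, ped))
```

When all pedigree rates are equal the ratio reduces to that rate divided by 0.00069, whatever the data. With marker-specific rates it depends on both the marker combination and the data-set.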

First, to ensure that my program correctly calculates the TMRCA when the Zhivotovsky mutation rate of 0.00069 is applied consistently to all the markers in my database (versus the marker-specific pedigree mutation rates I have thus far been utilizing), I attempted to replicate the TMRCA computations of the following publication:


Friday, February 8, 2013

Sudan YDNA

This is from a relatively old study, but it seems that it is the most comprehensive YDNA breakdown we have of North and South Sudan to date.

Y-chromosome variation among Sudanese: restricted gene flow, concordance with language, geography, and history. Hassan (2008)

Here is a map of the populations tested from Fig.1 of the Study
Populations Studied

Below is the phylogeny (as known back in 2008) of the SNPs tested. Note that those in bold (E-M75, E-P2, G-M201 and T-M70) were NOT tested in the study.

SNPs tested (except those in bold)
The E-M78+ cases from above were also tested for Cruciani's V-Series SNPs for further resolution:


Cruciani's V-Series SNPs (2007)

Some notes:


  • The high level (38%) of E-M215 (x M78) in the Borgu is quite intriguing, I wonder what variant/s of E-M215 it is?
  • Almost all the J-12f2(x M172) should be J-M267.
  • B-M60 is found in Southern Nilo-Saharan speakers and not the North Western ones, while A-M13 is found in both.
  • The F-M89 (x M52, M170, 12f2, M9) found in the north is also interesting, although at least part of it could possibly be G-M201.
  • E-V22 has a relatively high presence in these samples, even when compared to the Egyptian samples from Cruciani '07, and most certainly higher than its presence in Ethiopia.
  • The high presence of E-V12 (x V32) is also concordant with its putative area of origin; all the E-M78 found in the Nuer and the Copts is of this variety.
  • The presence of E-M78* in the Masalit and the Nuba is notable.
  • Of course, the strangest result is the 54% R-M173 (x P25) in the Fulani. This could be some R1b* (R-M343) or some type of R1a; the latter would be very out of place for the region, while the former could be reconciled with the presence of more downstream R1b variants in Africa.


Monday, February 4, 2013

A speculative superimposition of E-M35 variants onto Afroasiatic.

Here is a speculative superimposition of the variants of YDNA E-M215/M35 (E1b1b/1) onto Lionel Bender's (1997) internal classification of Afroasiatic.


The red question marks represent a less certain fit.

Saturday, January 5, 2013

TMRCA calculations from Plaster NRY data : Correcting an Error


Previously, I had computed TMRCAs for the YDNA STR data from the additional material that was provided along with Dr. Chris Plaster's thesis. However, after a brief communication with the author, I found out that the marker order of the STRs in the Excel file was reported wrongly. The correct order for the markers is as follows:

DYS19 DYS388 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS437 DYS438 DYS439 DYS448 DYS456 DYS635 Y GATA H4

This changes my TMRCA calculations because I am not computing the coalescent using a generic mutation rate that is equivalent for all the markers, but rather each marker has its own mutation rate attributed to it.
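
A toy Python example shows why the column order matters only when rates are marker-specific. The two-column layout, the variance figures and the rates are invented for the illustration and have nothing to do with the actual Plaster data.

```python
def tmrca(asd_by_column, rates_by_column):
    """Mean of per-column ASD / rate terms, in generations."""
    terms = [a / r for a, r in zip(asd_by_column, rates_by_column)]
    return sum(terms) / len(terms)


# Invented ASD values for two data columns, in file order.
asd = [0.30, 0.10]

# Marker-specific rates attached in the wrong and in the right order.
rates_wrong = [0.001, 0.003]
rates_right = [0.003, 0.001]
print(tmrca(asd, rates_wrong))   # mislabelled columns
print(tmrca(asd, rates_right))   # corrected labels

# A single generic rate gives the same answer whatever the labels are.
print(tmrca(asd, [0.002, 0.002]))
```

With the generic rate the labels drop out of the calculation entirely, which is why a mislabelled file would go unnoticed under that approach.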

When I rerun my program using the newly corrected order above I get the following:


As can be seen, using the new order of markers generally reduces the number of generations to coalescent for the Plaster data-set. The previous observation of a relatively lower TMRCA for the haplozone data of E-M123 versus that of the E-M34 Plaster data-set largely disappears. 

To check whether the high number of samples (129) in the E-M123 Haplozone data-set was skewing the results, I took 23 random samples (the number of samples available in the Plaster E-M34 data-set) from the larger E-M123 Haplozone data-set and reran the TMRCA calculations on just those samples. I repeated this process 300 times. Only 28% of the runs yielded a mean TMRCA less than that of the E-M34 Plaster data-set; if sample size were skewing the results, I would expect >50% of the runs to have a mean TMRCA less than that of the E-M34 Plaster data-set.

That said, the E-M34 Plaster data-set still had a relatively higher number of generations to coalescent than the E-M84 Haplozone data-set. E-M84 is a subclade of E-M34, and a large majority of haplotypes that belong to E-M34 also test positive for the E-M84 SNP (at least for the non-African E-M34 haplotypes that we know of).
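
The resampling procedure can be sketched in Python as follows. It is a schematic only: a plain mean over invented per-sample numbers stands in for the real TMRCA calculation, and the function name, seed and values are assumptions made for the example.

```python
import random
from statistics import mean


def share_below(estimate, samples, k, reference, runs=300, seed=0):
    """Fraction of random k-sample subsets whose estimate is below reference.

    estimate:  function taking a list of samples and returning a number
    samples:   the full pool to draw from (129 haplotypes in the post)
    k:         subset size (23 in the post)
    reference: the value to compare against (the Plaster E-M34 TMRCA)
    """
    rng = random.Random(seed)  # fixed seed so the runs are repeatable
    below = 0
    for _ in range(runs):
        subset = rng.sample(samples, k)  # draw without replacement
        if estimate(subset) < reference:
            below += 1
    return below / runs


# Stand-in pool of 129 invented values; NOT real haplotype data.
pool = [float(g) for g in range(100, 229)]
print(share_below(mean, pool, k=23, reference=164.0))
```

In the real analysis, estimate would be the TMRCA calculation itself and each element of the pool would be one haplotype.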

Other than that, the new, and corrected, ordering of the markers did not have much impact in relative TMRCA terms between the Plaster and Haplozone/FTDNA data for the other lineages I had tested.