Saturday, March 31, 2012

Cross Validating and K Selection


There are two ways of choosing a K value for a dataset that one wishes to run through ADMIXTURE. One is to throw a dart at a random set of numbers and hope for the best. The other is to run ADMIXTURE at several values of K and have it compute a cross-validation error for each one using the --cv flag. I did this with the studentized global dataset that I discussed earlier in this post. The cross-validation error values for K=1-14 for that dataset can be seen in the graphs below.

Close-up:
The CV error does not start to flatten out until about K=10, and it reaches its lowest value at K=13 before rising again at K=14, so K=13 is the appropriate choice for this dataset.

Cross-validation can take a considerable amount of time, since each K has to be fitted and its error evaluated separately, unless of course one has access to a very fast machine.

For reference, the Bash shell code to run cross-validation in ADMIXTURE for K=1 up to K=14 is:

for K in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; \
do ./admixture32 -j2 --cv=14 "filename.bed" $K | tee log${K}.out; done

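The CV error value for each K is recorded in the corresponding log${K}.out file. Note that the number given to --cv sets the number of cross-validation folds (ADMIXTURE uses 5 by default), not the highest K; the range of K values comes from the loop itself.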
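As a minimal sketch of how the CV errors could be pulled out of those logs and compared, assuming the log files are named log<K>.out as in the loop and that each one contains a line of the form "CV error (K=3): 0.52140". The function names cv_table and best_k are my own and are not part of ADMIXTURE.

```shell
#!/usr/bin/env bash
# Sketch: summarise ADMIXTURE cross-validation errors from the tee'd logs.
# Assumption: logs are named log<K>.out and contain lines such as
#   CV error (K=3): 0.52140

cv_table() {
    # Print "K<TAB>error" for every log in directory $1 (default: current dir),
    # sorted by K.
    grep -h "CV error" "${1:-.}"/log*.out |
        awk '{ k = $3; gsub(/[^0-9]/, "", k); print k "\t" $4 }' |
        sort -n
}

best_k() {
    # Print the K whose CV error is the smallest.
    cv_table "$1" |
        awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; k = $1 } END { print k }'
}
```

Running cv_table in the folder that holds the logs lists one K per line with its error, which is enough to plot the curve; best_k prints only the K with the lowest error.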

Peaking populations for each cluster for K=2-13

K=2
Cluster1: pygmy,mbutipygmy,sotho/tswana,biakapygmy,fang

Cluster2: chinese-americans,tujia,miao,hezhen,han

East Asians and Africans split. West Asians and Europeans are assigned about 1/3 African and 2/3 East Asian; the reverse is seen with Ethiopians, at about 2/3 African and 1/3 East Asian.



K=3
Cluster1: sardinian,basque,tuscans,italian,spaniards

Cluster2: pygmy,mbutipygmy,sotho/tswana,biakapygmy,bantusouthafrica

Cluster3: she,chinese-americans,han,singapore-chinese,chinese
West Asians split off.

 
K=4
Cluster1: sardinian,basque,tuscans,italian,cypriots

Cluster2: pygmy,mbutipygmy,sotho/tswana,biakapygmy,bantusouthafrica

Cluster3: colombian,karitiana,surui,pima,totonac

Cluster4: she,han,singapore-chinese,chinese,miao
Native Americans split off.

K=5
Cluster1: she,han,chinese-americans,chinese,singapore-chinese

Cluster2: surui,karitiana,colombian,pima,totonac

Cluster3: sardinian,basque,spaniards,italian,tuscans

Cluster4: pygmy,mbutipygmy,biakapygmy,bantusouthafrica,sotho/tswana

Cluster5: papuan,irula,tn-dalit,ap-mala,malayan
Oceanians and South Asians split off together.

K=6
Cluster1: papuan,melanesian,tongan,samoan,paniya

Cluster2: pygmy,mbutipygmy,biakapygmy,bantusouthafrica,sotho/tswana

Cluster3: karitiana,colombian,surui,pima,totonac

Cluster4: she,han,chinese-americans,singapore-chinese,chinese

Cluster5: sardinian,basque,spaniards,italian,tuscans

Cluster6: irula,tn-dalit,ap-madiga,ap-mala,north-kannadi
Oceanians and South Asians split off from each other.

K=7
Cluster1: sardinian,basque,spaniards,italian,tuscans

Cluster2: dogon,yoruba,bambaran,hausa,igbo

Cluster3: irula,tn-dalit,ap-mala,ap-madiga,north-kannadi

Cluster4: san-nb,san,!kung,pygmy,mbutipygmy

Cluster5: papuan,melanesian,tongan,samoan,paniya

Cluster6: colombian,surui,karitiana,pima,totonac

Cluster7: she,han,chinese-americans,singapore-chinese,chinese
San split off from the African component.

K=8
Cluster1: dogon,yoruba,bambaran,hausa,igbo

Cluster2: irula,tn-dalit,ap-mala,ap-madiga,north-kannadi

Cluster3: papuan,melanesian,tongan,samoan,paniya

Cluster4: koryaks,nganassans,chukchis,evenkis,yakut

Cluster5: dai,vietnamese,singapore-chinese,she,han

Cluster6: sardinian,basque,spaniards,italian,tuscans

Cluster7: san-nb,san,!kung,pygmy,mbutipygmy

Cluster8: surui,karitiana,colombian,pima,totonac
Siberians split off from the East Asian component.

K=9
Cluster1: papuan,melanesian,tongan,samoan,paniya

Cluster2: iban,samoan,tongan,singapore-malay,dai

Cluster3: japanese,hezhen,han-nchina,xibo,beijing-chinese

Cluster4: sardinian,basque,spaniards,italian,tuscans

Cluster5: san-nb,san,!kung,pygmy,mbutipygmy

Cluster6: dogon,yoruba,bambaran,hausa,igbo

Cluster7: surui,karitiana,colombian,pima,totonac

Cluster8: irula,tn-dalit,ap-mala,ap-madiga,north-kannadi

Cluster9: koryaks,chukchis,nganassans,east-greenlanders,kets
A South East Asian Component forms.

K=10
Cluster1: saudis,bedouin,yemen-jews,samaritians,tunisia

Cluster2: papuan,melanesian,tongan,samoan,paniya

Cluster3: dai,vietnamese,iban,singapore-chinese,she

Cluster4: hadza,maasai,ethiopians,ethiopian-jews,bulala

Cluster5: irula,tn-dalit,ap-madiga,ap-mala,north-kannadi

Cluster6: surui,karitiana,colombian,pima,totonac

Cluster7: koryaks,nganassans,chukchis,evenkis,yakut

Cluster8: dogon,yoruba,brong,igbo,bambaran

Cluster9: san-nb,san,!kung,pygmy,mbutipygmy

Cluster10: lithuanians,belorussian,orcadian,n-european,utahn-whites
West Asian component splits into 2 components: North European and Middle East & North African (MENA). An East African component that was previously concealed by the West Asian and African components forms. The previous South East Asian component disappears.

K=11
Cluster1: dai,vietnamese,singapore-chinese,she,han

Cluster2: koryaks,nganassans,chukchis,evenkis,yakut

Cluster3: surui,karitiana,colombian,pima,totonac

Cluster4: tunisia,bedouin,saudis,sahara-occ,yemen-jews

Cluster5: dogon,yoruba,brong,igbo,bambaran

Cluster6: lithuanians,belorussian,orcadian,n-european,utahn-whites

Cluster7: papuan,melanesian,tongan,samoan,paniya

Cluster8: san-nb,san,!kung,pygmy,mbutipygmy

Cluster9: irula,malayan,tn-dalit,ap-mala,ap-madiga

Cluster10: hadza,maasai,ethiopians,sandawe,bulala

Cluster11: kalash,brahui,balochi,makrani,georgians
A Central Asian component forms.

K=12
Cluster1: surui,karitiana,colombian,pima,totonac

Cluster2: lithuanians,belorussian,orcadian,n-european,utahn-whites

Cluster3: san-nb,san,!kung,pygmy,mbutipygmy

Cluster4: iban,samoan,tongan,singapore-malay,cambodian

Cluster5: bedouin,saudis,yemen-jews,samaritians,tunisia

Cluster6: papuan,melanesian,tongan,samoan,paniya

Cluster7: japanese,beijing-chinese,han-nchina,chinese-americans,xibo

Cluster8: koryaks,chukchis,east-greenlanders,west-greenlanders,kets

Cluster9: irula,tn-dalit,ap-madiga,ap-mala,north-kannadi

Cluster10: dogon,yoruba,brong,igbo,bambaran

Cluster11: nganassans,evenkis,yakut,dolgans,kets

Cluster12: hadza,maasai,ethiopians,ethiopian-jews,bulala
Central Asian component disappears, a second Siberian component is formed, the S. East Asian component reappears.

 
K=13
Cluster1: san-nb,san,!kung,xhosa,bantusouthafrica

Cluster2: surui,karitiana,colombian,pima,totonac

Cluster3: papuan,melanesian,tongan,samoan,paniya

Cluster4: japanese,han-nchina,beijing-chinese,xibo,hezhen

Cluster5: hadza,maasai,ethiopians,sandawe,bulala

Cluster6: lithuanians,belorussian,orcadian,n-european,utahn-whites

Cluster7: koryaks,chukchis,nganassans,evenkis,east-greenlanders

Cluster8: tunisia,bedouin,saudis,yemen-jews,sahara-occ

Cluster9: kalash,brahui,balochi,makrani,georgians

Cluster10: pygmy,mbutipygmy,biakapygmy,alur,fang

Cluster11: irula,malayan,tn-dalit,ap-mala,ap-madiga

Cluster12: dogon,yoruba,brong,bambaran,igbo

Cluster13: iban,samoan,tongan,singapore-malay,dai

Central Asian Component reappears, a new Pygmy component is formed, second Siberian component disappears.

Fst for K=13.

UPDATE: Median cluster % for all populations, K13.
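ADMIXTURE, Global K13
Population | N | San | N. American | Oceanian | E. Asian | E. African | N. European | Siberian | MENA | Central Asian | Pygmy | S. Asian | W. African | S.E. Asian
!kung | 8 | 78% | 0% | 0% | 0% | 2% | 0% | 0% | 0% | 0% | 2% | 0% | 16% | 0%
adygei | 11 | 0% | 1% | 0% | 3% | 0% | 32% | 3% | 20% | 42% | 0% | 1% | 0% | 0%
african-americans | 37 | 2% | 1% | 0% | 0% | 1% | 13% | 0% | 1% | 3% | 3% | 0% | 72% | 0%
algeria | 12 | 0% | 0% | 0% | 0% | 5% | 22% | 1% | 48% | 5% | 0% | 3% | 13% | 0%
altaians | 8 | 0% | 2% | 0% | 37% | 0% | 12% | 31% | 0% | 12% | 0% | 0% | 0% | 0%
alur | 7 | 0% | 0% | 0% | 0% | 34% | 0% | 0% | 0% | 0% | 17% | 0% | 50% | 0%
ap-brahmin | 14 | 0% | 1% | 2% | 1% | 0% | 8% | 2% | 1% | 36% | 0% | 48% | 0% | 2%
ap-madiga | 5 | 0% | 0% | 2% | 2% | 0% | 0% | 0% | 0% | 24% | 0% | 66% | 0% | 5%
ap-mala | 8 | 0% | 0% | 2% | 2% | 0% | 0% | 0% | 0% | 22% | 0% | 67% | 0% | 5%
armenians | 11 | 0% | 0% | 0% | 0% | 0% | 19% | 0% | 34% | 43% | 0% | 2% | 0% | 0%
armenians-b | 3 | 0% | 0% | 1% | 0% | 0% | 48% | 4% | 17% | 26% | 0% | 1% | 0% | 0%
ashkenazy-jews | 15 | 0% | 0% | 0% | 1% | 0% | 37% | 0% | 34% | 24% | 0% | 1% | 0% | 0%
azerbaijan-jews | 6 | 0% | 1% | 0% | 0% | 0% | 15% | 0% | 37% | 44% | 0% | 0% | 0% | 1%
balochi | 18 | 0% | 1% | 0% | 1% | 0% | 7% | 1% | 13% | 53% | 0% | 20% | 0% | 0%
bambaran | 14 | 3% | 1% | 0% | 0% | 1% | 0% | 0% | 1% | 0% | 1% | 0% | 91% | 0%
bamoun | 10 | 3% | 0% | 0% | 0% | 4% | 0% | 0% | 0% | 0% | 7% | 0% | 85% | 0%
bantukenya | 5 | 3% | 0% | 0% | 0% | 20% | 0% | 0% | 2% | 0% | 5% | 0% | 67% | 0%
bantusouthafrica | 3 | 24% | 0% | 0% | 1% | 6% | 0% | 0% | 0% | 0% | 4% | 0% | 65% | 0%
basque | 24 | 0% | 0% | 1% | 0% | 0% | 75% | 0% | 16% | 6% | 0% | 1% | 0% | 0%
bedouin | 33 | 0% | 0% | 0% | 0% | 3% | 0% | 0% | 65% | 27% | 0% | 0% | 2% | 0%
beijing-chinese | 91 | 0% | 0% | 0% | 68% | 0% | 0% | 2% | 0% | 0% | 0% | 0% | 0% | 28%
belorussian | 4 | 0% | 1% | 1% | 0% | 0% | 77% | 4% | 3% | 15% | 0% | 1% | 0% | 0%
biakapygmy | 12 | 17% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 33% | 0% | 45% | 0%
bnei-menashe-jews | 4 | 0% | 0% | 2% | 1% | 0% | 7% | 0% | 16% | 34% | 0% | 34% | 0% | 3%
bolivian | 17 | 0% | 95% | 0% | 1% | 0% | 1% | 3% | 0% | 0% | 0% | 0% | 0% | 0%
brahui | 18 | 0% | 1% | 0% | 0% | 0% | 8% | 1% | 13% | 55% | 0% | 20% | 0% | 0%
brong | 4 | 4% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% | 3% | 0% | 91% | 0%
bulala | 12 | 0% | 0% | 0% | 0% | 38% | 0% | 0% | 3% | 0% | 0% | 0% | 57% | 0%
burusho | 17 | 0% | 2% | 1% | 7% | 0% | 13% | 4% | 2% | 41% | 0% | 27% | 0% | 2%
buryat | 16 | 0% | 0% | 1% | 49% | 0% | 5% | 38% | 1% | 5% | 0% | 0% | 0% | 1%
buryats | 13 | 0% | 0% | 1% | 47% | 0% | 5% | 38% | 0% | 5% | 0% | 1% | 0% | 0%
cambodian | 5 | 0% | 0% | 1% | 31% | 0% | 0% | 0% | 0% | 1% | 0% | 11% | 0% | 57%
chinese | 5 | 0% | 0% | 0% | 60% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 38%
chinese-americans | 73 | 0% | 0% | 0% | 63% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 36%
chukchis | 11 | 0% | 17% | 0% | 0% | 0% | 0% | 80% | 0% | 0% | 0% | 0% | 0% | 2%
chuvashs | 12 | 0% | 2% | 0% | 6% | 0% | 54% | 19% | 1% | 15% | 0% | 2% | 0% | 0%
cochin-jews | 4 | 0% | 2% | 2% | 0% | 1% | 5% | 2% | 8% | 34% | 0% | 46% | 0% | 1%
colombian | 6 | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
cypriots | 7 | 0% | 0% | 1% | 1% | 0% | 29% | 0% | 39% | 30% | 0% | 0% | 0% | 0%
dai | 6 | 0% | 0% | 0% | 36% | 0% | 0% | 0% | 0% | 0% | 0% | 3% | 0% | 62%
daur | 8 | 0% | 1% | 1% | 63% | 0% | 1% | 25% | 0% | 1% | 0% | 0% | 0% | 8%
dogon | 24 | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 94% | 0%
dolgans | 5 | 0% | 0% | 0% | 28% | 0% | 10% | 56% | 0% | 3% | 0% | 2% | 0% | 0%
druze | 30 | 0% | 0% | 0% | 0% | 0% | 17% | 0% | 42% | 38% | 0% | 0% | 0% | 0%
east-greenlanders | 6 | 0% | 35% | 0% | 0% | 0% | 4% | 60% | 0% | 0% | 0% | 0% | 0% | 0%
egypt | 12 | 0% | 0% | 0% | 0% | 7% | 11% | 0% | 47% | 24% | 0% | 0% | 7% | 0%
egyptans | 7 | 0% | 0% | 0% | 0% | 8% | 10% | 0% | 49% | 23% | 0% | 0% | 7% | 0%
ethiopian-jews | 12 | 1% | 0% | 1% | 0% | 37% | 0% | 0% | 38% | 8% | 0% | 0% | 11% | 0%
ethiopians | 12 | 1% | 0% | 0% | 1% | 36% | 0% | 1% | 39% | 7% | 0% | 0% | 11% | 0%
evenkis | 11 | 0% | 0% | 0% | 34% | 0% | 3% | 61% | 0% | 2% | 0% | 0% | 0% | 0%
fang | 7 | 6% | 0% | 0% | 0% | 5% | 0% | 0% | 0% | 0% | 7% | 0% | 80% | 0%
french | 22 | 0% | 1% | 0% | 0% | 0% | 70% | 0% | 14% | 12% | 0% | 1% | 0% | 0%
fulani | 7 | 2% | 0% | 0% | 1% | 5% | 7% | 1% | 25% | 0% | 0% | 2% | 58% | 0%
georgia-jews | 4 | 0% | 0% | 0% | 1% | 0% | 16% | 0% | 37% | 43% | 0% | 0% | 0% | 0%
georgians | 17 | 0% | 0% | 0% | 0% | 0% | 23% | 0% | 28% | 46% | 0% | 0% | 0% | 0%
gujaratis | 53 | 0% | 1% | 1% | 1% | 0% | 2% | 0% | 0% | 37% | 0% | 55% | 0% | 2%
gujaratis-b | 14 | 0% | 2% | 1% | 0% | 0% | 13% | 2% | 0% | 40% | 0% | 40% | 0% | 1%
hadza | 11 | 19% | 0% | 0% | 0% | 80% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
han | 24 | 0% | 0% | 0% | 60% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 39%
han-nchina | 6 | 0% | 0% | 0% | 68% | 0% | 0% | 4% | 0% | 2% | 0% | 0% | 0% | 24%
hausa | 9 | 1% | 0% | 0% | 0% | 2% | 0% | 0% | 0% | 0% | 3% | 0% | 90% | 0%
hazara | 16 | 0% | 1% | 0% | 31% | 0% | 14% | 16% | 6% | 23% | 0% | 8% | 0% | 4%
hema | 11 | 3% | 0% | 1% | 0% | 31% | 0% | 1% | 10% | 2% | 4% | 0% | 46% | 0%
hezhen | 4 | 0% | 1% | 0% | 66% | 0% | 0% | 28% | 0% | 0% | 0% | 0% | 0% | 6%
hungarians | 9 | 0% | 2% | 0% | 0% | 0% | 69% | 2% | 10% | 15% | 0% | 1% | 0% | 0%
iban | 15 | 0% | 0% | 2% | 11% | 0% | 0% | 2% | 0% | 0% | 0% | 7% | 0% | 77%
igbo | 10 | 3% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 2% | 0% | 90% | 0%
iranian-jews | 4 | 0% | 0% | 0% | 1% | 0% | 12% | 1% | 39% | 44% | 0% | 2% | 0% | 0%
iranians | 12 | 0% | 1% | 1% | 0% | 0% | 16% | 1% | 28% | 45% | 1% | 7% | 1% | 0%
iraq-jews | 8 | 0% | 0% | 1% | 0% | 0% | 14% | 0% | 41% | 40% | 1% | 1% | 0% | 1%
irula | 24 | 0% | 0% | 0% | 0% | 0% | 1% | 0% | 2% | 1% | 0% | 89% | 0% | 0%
italian | 8 | 0% | 0% | 1% | 0% | 0% | 60% | 0% | 23% | 14% | 0% | 0% | 0% | 1%
japanese | 154 | 0% | 0% | 1% | 91% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 0% | 6%
jordanians | 14 | 1% | 0% | 0% | 0% | 3% | 16% | 1% | 42% | 33% | 0% | 1% | 3% | 1%
kaba | 9 | 2% | 0% | 0% | 1% | 10% | 0% | 0% | 0% | 0% | 4% | 0% | 80% | 0%
kalash | 16 | 0% | 2% | 1% | 0% | 0% | 10% | 3% | 0% | 65% | 0% | 16% | 0% | 2%
karitiana | 14 | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
kets | 2 | 0% | 5% | 0% | 13% | 0% | 19% | 54% | 0% | 8% | 0% | 1% | 0% | 0%
khmer-cambodian | 3 | 0% | 0% | 3% | 27% | 0% | 0% | 0% | 0% | 0% | 0% | 13% | 0% | 55%
kongo | 5 | 3% | 0% | 0% | 0% | 5% | 0% | 0% | 0% | 0% | 6% | 0% | 83% | 0%
koryaks | 13 | 0% | 7% | 0% | 0% | 0% | 0% | 93% | 0% | 0% | 0% | 0% | 0% | 0%
kurd | 16 | 0% | 1% | 1% | 0% | 0% | 19% | 0% | 29% | 46% | 0% | 3% | 0% | 0%
kyrgyzstani | 15 | 0% | 1% | 0% | 40% | 0% | 13% | 24% | 3% | 12% | 0% | 2% | 0% | 3%
lahu | 5 | 0% | 0% | 1% | 42% | 0% | 0% | 1% | 0% | 0% | 0% | 3% | 0% | 52%
lebanese | 3 | 0% | 1% | 2% | 0% | 1% | 20% | 0% | 40% | 33% | 0% | 2% | 2% | 0%
lezgins | 13 | 0% | 2% | 0% | 0% | 0% | 32% | 2% | 16% | 45% | 0% | 1% | 0% | 0%
libya | 9 | 0% | 1% | 1% | 0% | 7% | 17% | 0% | 50% | 10% | 0% | 2% | 9% | 0%
lithuanians | 6 | 0% | 1% | 0% | 0% | 0% | 80% | 2% | 0% | 12% | 0% | 3% | 0% | 0%
luhya | 73 | 2% | 0% | 0% | 0% | 22% | 0% | 0% | 0% | 0% | 6% | 0% | 67% | 0%
maasai | 100 | 2% | 0% | 0% | 0% | 55% | 0% | 0% | 14% | 0% | 1% | 0% | 24% | 0%
mada | 8 | 0% | 1% | 0% | 0% | 22% | 0% | 0% | 0% | 0% | 3% | 0% | 73% | 0%
makrani | 19 | 0% | 1% | 0% | 0% | 0% | 7% | 0% | 15% | 54% | 0% | 18% | 3% | 0%
malayan | 2 | 0% | 1% | 5% | 3% | 0% | 1% | 2% | 0% | 12% | 1% | 70% | 0% | 6%
mandenka | 13 | 3% | 0% | 0% | 0% | 2% | 0% | 0% | 3% | 0% | 1% | 0% | 88% | 0%
maya | 12 | 0% | 86% | 0% | 1% | 0% | 3% | 3% | 2% | 1% | 0% | 0% | 0% | 0%
mbutipygmy | 13 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 0%
melanesian | 7 | 0% | 0% | 74% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 25%
mexicans | 38 | 0% | 44% | 0% | 1% | 0% | 27% | 2% | 12% | 6% | 0% | 1% | 3% | 0%
miao | 6 | 0% | 0% | 0% | 56% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 0% | 42%
mongola | 6 | 0% | 1% | 0% | 64% | 0% | 4% | 14% | 1% | 1% | 0% | 0% | 0% | 13%
mongolians | 8 | 0% | 2% | 1% | 46% | 0% | 10% | 30% | 2% | 7% | 0% | 0% | 0% | 2%
moroccans | 5 | 1% | 0% | 0% | 0% | 3% | 18% | 1% | 54% | 0% | 1% | 3% | 15% | 0%
morocco-jews | 7 | 0% | 0% | 0% | 0% | 1% | 32% | 0% | 39% | 23% | 0% | 1% | 2% | 1%
morocco-n | 12 | 0% | 1% | 0% | 0% | 3% | 27% | 0% | 49% | 1% | 0% | 4% | 12% | 0%
morocco-s | 13 | 0% | 0% | 0% | 0% | 5% | 18% | 0% | 50% | 0% | 1% | 3% | 16% | 0%
mozabite | 21 | 0% | 0% | 0% | 0% | 3% | 20% | 0% | 53% | 0% | 0% | 4% | 16% | 0%
n-european | 14 | 0% | 1% | 0% | 0% | 0% | 74% | 1% | 8% | 13% | 0% | 0% | 0% | 0%
naxi | 5 | 0% | 0% | 1% | 63% | 0% | 0% | 6% | 0% | 0% | 0% | 4% | 0% | 26%
nepalese | 17 | 0% | 1% | 1% | 7% | 0% | 11% | 3% | 0% | 35% | 0% | 35% | 0% | 4%
nganassans | 15 | 0% | 0% | 0% | 11% | 0% | 0% | 88% | 0% | 0% | 0% | 0% | 0% | 0%
nguni | 4 | 18% | 0% | 1% | 0% | 6% | 0% | 0% | 0% | 0% | 4% | 0% | 71% | 0%
north-kannadi | 6 | 0% | 0% | 3% | 3% | 0% | 0% | 0% | 0% | 23% | 0% | 65% | 0% | 3%
orcadian | 9 | 0% | 1% | 0% | 0% | 0% | 75% | 2% | 7% | 14% | 0% | 0% | 0% | 0%
oroqen | 7 | 0% | 0% | 0% | 52% | 0% | 0% | 40% | 0% | 0% | 0% | 0% | 0% | 5%
palestinian | 27 | 0% | 1% | 1% | 0% | 3% | 14% | 0% | 46% | 32% | 0% | 1% | 2% | 0%
paniya | 4 | 0% | 0% | 13% | 16% | 0% | 0% | 1% | 0% | 0% | 1% | 14% | 1% | 48%
papuan | 17 | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
pathan | 14 | 0% | 2% | 0% | 1% | 0% | 17% | 1% | 6% | 44% | 0% | 26% | 0% | 1%
pedi | 8 | 18% | 0% | 0% | 0% | 5% | 0% | 0% | 0% | 1% | 4% | 0% | 71% | 0%
pima | 11 | 0% | 95% | 0% | 0% | 0% | 0% | 5% | 0% | 0% | 0% | 0% | 0% | 0%
punjabi-arain | 15 | 0% | 2% | 1% | 0% | 0% | 10% | 1% | 4% | 45% | 0% | 34% | 0% | 0%
pygmy | 17 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 0%
romanians | 9 | 0% | 0% | 0% | 0% | 0% | 55% | 3% | 19% | 19% | 0% | 0% | 0% | 0%
russian | 20 | 0% | 2% | 0% | 0% | 0% | 70% | 9% | 1% | 14% | 0% | 2% | 0% | 1%
sahara-occ | 10 | 0% | 0% | 0% | 0% | 6% | 16% | 1% | 57% | 0% | 0% | 3% | 15% | 0%
sakilli | 4 | 0% | 0% | 3% | 3% | 0% | 1% | 0% | 0% | 25% | 0% | 64% | 0% | 2%
samaritians | 3 | 1% | 0% | 2% | 0% | 0% | 11% | 0% | 49% | 35% | 0% | 1% | 0% | 0%
samoan | 11 | 0% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 74%
san | 24 | 88% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
san-nb | 12 | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
sandawe | 17 | 12% | 1% | 0% | 0% | 38% | 0% | 0% | 13% | 1% | 5% | 0% | 29% | 0%
sardinian | 22 | 0% | 0% | 0% | 0% | 0% | 59% | 0% | 35% | 4% | 0% | 0% | 0% | 0%
saudis | 15 | 0% | 0% | 0% | 0% | 4% | 0% | 0% | 63% | 30% | 0% | 0% | 0% | 0%
selkups | 7 | 0% | 5% | 0% | 9% | 0% | 26% | 47% | 0% | 10% | 0% | 1% | 0% | 0%
sephardic-jews | 13 | 0% | 0% | 0% | 0% | 0% | 33% | 0% | 37% | 26% | 0% | 1% | 0% | 0%
she | 9 | 0% | 0% | 0% | 59% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 40%
sindhi | 15 | 0% | 2% | 1% | 0% | 0% | 11% | 1% | 5% | 44% | 0% | 35% | 0% | 0%
singapore-chinese | 70 | 0% | 0% | 0% | 60% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 40%
singapore-indians | 53 | 0% | 1% | 2% | 1% | 0% | 2% | 1% | 1% | 32% | 0% | 54% | 0% | 3%
singapore-malay | 59 | 0% | 1% | 4% | 15% | 0% | 0% | 1% | 0% | 1% | 0% | 10% | 0% | 65%
slovenian | 17 | 0% | 1% | 0% | 0% | 0% | 70% | 2% | 9% | 15% | 0% | 1% | 0% | 0%
sotho/tswana | 5 | 25% | 0% | 0% | 0% | 3% | 0% | 0% | 0% | 0% | 4% | 0% | 67% | 0%
spaniards | 5 | 0% | 0% | 0% | 0% | 0% | 68% | 1% | 19% | 10% | 0% | 0% | 1% | 1%
stalskoe | 5 | 0% | 2% | 0% | 2% | 0% | 34% | 3% | 16% | 39% | 0% | 2% | 0% | 0%
surui | 7 | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
syrians | 10 | 0% | 1% | 0% | 0% | 1% | 16% | 0% | 40% | 35% | 0% | 3% | 2% | 0%
thai | 17 | 0% | 1% | 2% | 15% | 0% | 1% | 2% | 1% | 3% | 0% | 16% | 0% | 57%
tn-brahmin | 9 | 0% | 2% | 2% | 0% | 0% | 8% | 2% | 0% | 36% | 0% | 48% | 0% | 1%
tn-dalit | 7 | 0% | 0% | 3% | 0% | 0% | 0% | 1% | 0% | 23% | 0% | 67% | 0% | 5%
tongan | 11 | 0% | 0% | 30% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 70%
totonac | 15 | 0% | 91% | 0% | 1% | 0% | 3% | 5% | 0% | 0% | 0% | 0% | 0% | 0%
tu | 7 | 0% | 1% | 1% | 63% | 0% | 3% | 8% | 1% | 3% | 0% | 1% | 0% | 18%
tujia | 5 | 0% | 0% | 0% | 62% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 36%
tunisia | 11 | 0% | 0% | 0% | 0% | 1% | 20% | 0% | 59% | 0% | 0% | 4% | 13% | 0%
turks | 13 | 0% | 1% | 0% | 4% | 0% | 26% | 3% | 28% | 35% | 0% | 2% | 0% | 0%
tuscans | 79 | 0% | 0% | 0% | 0% | 0% | 53% | 0% | 26% | 18% | 0% | 0% | 0% | 0%
tuvinians | 11 | 0% | 1% | 1% | 41% | 0% | 9% | 40% | 0% | 6% | 0% | 0% | 0% | 1%
urkarah | 11 | 0% | 2% | 0% | 0% | 0% | 36% | 2% | 11% | 45% | 0% | 0% | 0% | 0%
utahn-whites | 72 | 0% | 1% | 0% | 0% | 0% | 75% | 1% | 7% | 12% | 0% | 1% | 0% | 0%
uygur | 7 | 0% | 2% | 0% | 29% | 0% | 17% | 12% | 5% | 22% | 0% | 7% | 0% | 6%
uzbekistan-jews | 2 | 0% | 1% | 1% | 0% | 0% | 18% | 1% | 35% | 42% | 0% | 2% | 0% | 1%
uzbeks | 10 | 0% | 1% | 0% | 27% | 0% | 21% | 17% | 6% | 20% | 0% | 6% | 0% | 1%
vietnamese | 4 | 0% | 0% | 1% | 42% | 0% | 0% | 0% | 0% | 0% | 0% | 4% | 0% | 52%
west-greenlanders | 8 | 0% | 26% | 0% | 0% | 0% | 23% | 45% | 1% | 2% | 0% | 2% | 0% | 0%
xhosa | 3 | 27% | 0% | 0% | 0% | 7% | 0% | 0% | 1% | 0% | 2% | 0% | 61% | 0%
xibo | 6 | 0% | 0% | 1% | 67% | 0% | 1% | 15% | 0% | 2% | 0% | 0% | 0% | 13%
yakut | 18 | 0% | 0% | 1% | 37% | 0% | 3% | 53% | 1% | 4% | 0% | 0% | 0% | 0%
yemen-jews | 12 | 0% | 0% | 1% | 0% | 4% | 3% | 0% | 58% | 31% | 0% | 1% | 0% | 0%
yemenese | 7 | 1% | 0% | 1% | 1% | 5% | 3% | 1% | 42% | 28% | 1% | 3% | 7% | 1%
yi | 6 | 0% | 0% | 1% | 62% | 0% | 0% | 7% | 0% | 0% | 0% | 3% | 0% | 26%
yoruba | 92 | 2% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 2% | 0% | 93% | 0%
yukaghirs | 6 | 0% | 0% | 0% | 16% | 0% | 31% | 42% | 0% | 6% | 0% | 1% | 0% | 0%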
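
For anyone who wants to recompute per-population medians like the ones above from a .Q file, here is a rough sketch. It is not the GNU Octave code I used for the processed output; the function name pop_median is hypothetical, and it assumes you have already produced one "population value" line per sample for a single cluster column.

```shell
#!/usr/bin/env bash
# Sketch: per-population median of one ADMIXTURE cluster column.
# Assumption: stdin carries one sample per line as "<population> <proportion>".

pop_median() {
    # Sort by population, then numerically by value (-g: general numeric sort),
    # and let awk pick the middle value (or the mean of the two middle values).
    sort -k1,1 -k2,2g | awk '
        function emit(   m) {
            if (n == 0) return
            m = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
            print pop, m
        }
        $1 != pop { emit(); pop = $1; n = 0 }
        { v[++n] = $2 }
        END { emit() }
    '
}
```

The input could be built with something like paste on a one-population-per-line text file and the .Q file, followed by awk to keep the population and the cluster column of interest; those file names depend on your own setup.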

All results can be downloaded here: ADMIXTURE_K1-14.tar.gz
which contains:
  PLINK formatted *.bed, *.bim, *.fam files
  *.txt file with complete list of samples
  K folders containing:
    *.P and *.Q ADMIXTURE output files
    log file, with Fst distances and CV errors
  Processed Output folder containing:
    Median Cluster %
    Average Cluster %
    Standard Deviations
    Cluster Key: Top five populations in each cluster
    list of Unique Populations
    GNU OCTAVE variable loading file, *.mat

10 comments:

  1. Why do the Mandenka have a MENA component? Also, many of the North Africans have a significant West African component? this is not seen in other analysis of this type except maybe for some of the South Morroco.

    1. I assume you are talking about the cross-validated K13 results, correct?
      The Mandenka had the MENA component at 2.83%; other West Africans like the Dogon had it at 1.19%. The MENA component is the component that links Africans with Middle Easterners. It is neither just African nor just Middle Eastern, but both. It is, however, found at a higher frequency in indigenous Sub-Saharan East Africans like the Sandawe and Maasai (13-14%) and in Ethiopians (40%) than in West Africans.

      For the West African component, it is found in Northern Africans at a median frequency in the following order:
      mozabite 16.34%
      morocco-s 15.75%
      sahara-occ 15.32%
      moroccans 14.80%
      algeria 13.02%
      tunisia 12.88%
      morocco-n 11.91%
      libya 9.45%
      egypt 6.67%
      egyptans 6.60%

      Ethiopians had it at 10-11%, so likely this component is a kind of pan-African component that has its presence in a variety of African populations due to many millennia of inter-African migrations/interactions.

      “this is not seen in other analysis of this type except maybe for some of the South Morroco.”

      Many of the analyses out there do not give you the samples they use, since they include a large number of private samples from participants in 'projects'. This analysis is different because I provide ALL the samples and polymorphisms that I employed, since they are all public. You can therefore verify the ADMIXTURE analysis for yourself if you have the requisite software installed.

  2. Etyopsis, first I want to thank you greatly for this post. It's very informative, and the dataset will help me get started on a "New World" ancestry project, in which it is vital for the African components to be as broken down as possible, but which will also need to keep a lot of the populations for people with Native, East Asian, South Asian, Southeast Asian, Siberian, South/North European, Middle Eastern and North African ancestry. Do you have any tips or suggestions for getting the Western Bantu and Eastern Bantu clusters to form? Any suggestions at all will be very welcome. I am Lemba from ABF.

    1. Hi Lemba, from my observations so far, the Eastern and Western Bantu clusters don't form at a global level with the SNPs currently included in my dataset. They do, however, form from an intra-continental African perspective. See this post for details: Intra African Genome-Wide Analysis
      The populations you have listed are very widespread globally, so when you include all of them it becomes more of a global analysis, and it may be difficult to split the Eastern from the Western Bantu/Niger-Kordofanian.
      So, you could try the following steps:
      1) Start with the base intra-African dataset, as outlined in the post I linked above. Add the New World populations to that dataset and see if the Eastern and Western Bantu components still split. The North Africans can act as a proxy for Eurasian gene flow to start with.
      2) Then add the Eurasian populations you are interested in to that dataset one by one, starting with the ones furthest from Africa (Native Americans, East Asians, ...). Check whether the Eastern and Western Bantu clusters still split. If so, you can add more populations from Eurasia, but at some point the components will stop splitting; I am just not sure at which point that will be.

      I have uploaded the base intra-African dataset I use in PLINK format here. You still need to create your New World dataset to merge with it, if you haven't done so already. You also need to be mindful of your K selection: just because some components split at a given K doesn't necessarily mean they are statistically useful. Try to use the cross-validation error values that ADMIXTURE computes for each K run, even though it may take a while for your machine to process.

      Hope this helps for you to get started, let me know if you have other questions.

    2. Thanks for the advice! I just included Amerindians in your pan-African dataset to see how it behaves. I will run K=14 tonight, and when I have all the populations in, I'll do the K validation you did.

      How are you merging in 23andMe data files? I am using a script Razib had on his site, and I was able to generate a .tped and a .tfam file. Now my question is: after I make the .bed, .bim and .fam files, don't I have to keep only the SNPs that these datasets use?

      23andMe data suggestions would really help = )

    3. Yes, you have to restrict your raw data to the SNPs used in the dataset. To do so, after you have successfully converted your raw data to the PLINK format (.bed, .bim, .fam), run the following commands in PLINK.
      Assuming the dataset is the Africa one I posted earlier, i.e. "Africa_Rev4_public":

      First, extract the SNPs:
      plink --bfile "Africa_Rev4_public" --write-snplist --out "Extracted_SNPs"

      This will write a file called "Extracted_SNPs.snplist", in your working folder.

      Then, use that new file to extract the SNPs from your raw data, assuming your raw data is named "my_rawdata", use the following:

      plink --bfile "my_rawdata" --extract "Extracted_SNPs.snplist" --make-bed --out "my_rawdata_filtered"

      This will make new .bed,.bim and .fam files called "my_rawdata_filtered"

      Lastly, merge your filtered raw data with the main file you are using ("Africa_Rev4_public" in this case)

      plink --bfile "Africa_Rev4_public" --bmerge "my_rawdata_filtered.bed" "my_rawdata_filtered.bim" "my_rawdata_filtered.fam" --make-bed --out "New_file"

      This will make new .bed,.bim and .fam files called "New_file", which you can use to run ADMIXTURE with.
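
Before merging, it can be worth checking how many of the dataset's SNPs actually survived in your filtered file. A small sketch, assuming the snplist has one SNP ID per line and that the .bim file carries the SNP ID in its second column; the function name snp_overlap is just an illustrative helper, not a PLINK command.

```shell
#!/usr/bin/env bash
# Sketch: count how many SNP IDs from a snplist are present in a .bim file.
# Assumptions: $1 has one SNP ID per line; $2 is a PLINK .bim file whose
# second column is the SNP ID.

snp_overlap() {
    awk 'NR == FNR { want[$1] = 1; n++; next }   # first file: wanted IDs
         ($2 in want) { hit++ }                  # second file: count matches
         END { printf "%d of %d SNPs present\n", hit, n }' "$1" "$2"
}
```

For the files above this would be called as snp_overlap Extracted_SNPs.snplist my_rawdata_filtered.bim. A count far below the total suggests the raw data uses different SNP identifiers or a different chip.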

      One last note: the script that Razib had on his blog did not work well for me for converting 23andMe raw data; it had some issues with the no-calls. I'm not sure if it worked out for you. In any event, I had to write my own code in GNU Octave to convert 23andMe raw data to the appropriate .tped and .tfam PLINK-formatted files. If you want, I can make a separate post outlining how to do that.

    4. Wow, thanks! This is what I needed. Yes, with the SH script I got an error for the tfam, but with the Perl script I didn't, although maybe it's just not being verbose. If you can post your own code I would really appreciate it! Also, how do you output the current population list inside a .bed file? For example, after I add myself, how do I extract the population list .txt out of the .bed?

    5. I have created a post for converting 23andMe raw data. Check it out and let me know if it works for you.
      As for outputting the population list you are currently using, I have a separate but slightly more complicated program for correlating .fam files with a superset text file. I know exactly what you mean, though: since the .fam file only lists the family and individual IDs, it is hard to say which population a sample belongs to just by looking at it. If you have only added a few samples, I suggest identifying them manually and then adding them to your original population list by hand. The critical thing is that the samples have to be in the same order as in the .fam file, as that is the order in which ADMIXTURE reports the cluster proportions in the .Q file.
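
Since the rows of the .Q file follow the row order of the .fam file, the two can simply be put side by side. A minimal sketch of that idea (not the program I mentioned above; label_q is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Sketch: prefix every row of an ADMIXTURE .Q file with the family and
# individual IDs from the matching .fam file.
# Assumption: both files describe the same samples in the same order.

label_q() {
    local fam=$1 q=$2
    local n_fam n_q
    n_fam=$(($(wc -l < "$fam")))
    n_q=$(($(wc -l < "$q")))
    # Refuse to pair the files if the sample counts differ.
    if [ "$n_fam" -ne "$n_q" ]; then
        echo "row count mismatch: $fam has $n_fam, $q has $n_q" >&2
        return 1
    fi
    # Columns 1 and 2 of a .fam file are the family ID and individual ID.
    awk '{ print $1, $2 }' "$fam" | paste -d ' ' - "$q"
}
```

For a K=13 run on "New_file" this would be called as label_q New_file.fam New_file.13.Q, which makes it easier to spot a newly added sample by its ID.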

  3. Why doesn't the MENA cluster break up?

    1. It breaks up at K=14, even though the cross-validation error at that point is higher than at K=13, so the results may not be as reliable as those for the K with the lowest cross-validation error. I attached the K14 results anyway with the other data files in my original post.

      At K14, the MENA cluster breaks up into a North West African and a South West Asian cluster, which peak in the Tunisians and Bedouins respectively. An additional Polynesian cluster that peaks in the Samoans is also formed, while the Pygmy cluster disappears to compensate for it.

      The interesting thing about the MENA cluster breaking up at K14 into a NW African and a SW Asian cluster is that both components are significantly present in the Ethiopian dataset, at about 17% and 30% respectively. The fact that the MENA cluster breaks up into such spatially spread but distinct components, while both components simultaneously appear in the Ethiopian dataset, points to the MENA component's relative antiquity in Africa; perhaps a significant portion of it is even older in Africa than it is in the Middle East itself.
