Thursday, April 26, 2012

Converting 23andMe raw data into PLINK format


A commenter requested that I post the script I use to convert 23andMe raw data into the PLINK format required for ADMIXTURE computations. There are several ways to do this using different kinds of scripts, but since the script I use and am most familiar with is one written for GNU Octave, that is the one I will post here.

I am assuming readers will be using a Linux platform, e.g. Ubuntu.
In addition, the script requires PLINK to already be installed on your machine.

  1. Download and install GNU Octave. You can do this from Ubuntu's 'Software Centre' by simply searching for Octave; it takes less than 5 minutes to download and install.
  2. Create a new folder that you will use for converting raw data. For instance, create a folder on your desktop and name it “Convert_23andME”.
  3. Download this file, then copy and paste it into the “Convert_23andME” folder that you just created.
  4. Download your raw data from 23andMe, unzip it, and copy the .txt file into the “Convert_23andME” folder you just created. You should now have only 2 files in that folder.
  5. Start a Terminal window in Ubuntu. Change directory to the Desktop/Convert_23andME folder you created by typing the following at the terminal's command line: cd Desktop/Convert_23andME/
  6. Start Octave by typing “octave” at the command line of the terminal window.
  7. Next, type: Raw_Convert ("My_Rawdata.txt"), where the string argument passed between the quotation marks, i.e. My_Rawdata.txt, should be EXACTLY the name of the raw-data file you placed in the Convert_23andME folder in step 4.
  8. Answer the questions* the program asks (avoid any spaces in your answers), press enter, and allow the program to process your raw data; a sample session is sketched after the notes below. V2 data takes about 22 minutes on my machine, and V3 data will obviously take longer. The speed will depend on your machine.
  9. When it is done, you will see 3 additional folders created within your “Convert_23andME” folder. The first folder (_conversion) contains three files with the extensions .tped, .tfam and .nocall; these are the files converted by the script. The .tped and .tfam files are the PLINK-formatted transposed pedigree files of your raw data, while the .nocall file lists the chromosome #, assigned reference SNP ID and position of each raw-data point that was not successfully genotyped, and is just for your records. The second folder (_binaryPED) contains the files with the extensions .bed, .bim and .fam, which are created by PLINK and are the binary PED and associated files of your raw data; these can then be merged with other data-sets to perform ADMIXTURE, MDS and various other genome-wide analyses. The last folder (_misc) contains miscellaneous files created by PLINK as a result of the conversion from transposed pedigree to binary PED; these may include files containing lists of heterozygous haploid genotypes and so forth, so consult the PLINK manual for details.
  10. Exit Octave by typing 'exit'.
*Notes on the questions the conversion program asks you in Octave:

Output File Name?

This is the name you want to give your converted raw-data files. The name you enter here will have the necessary extensions automatically appended to it, so there is no need to include any extension; enter just the name.

Family ID?

This will be the family ID that PLINK identifies your raw data with.

Individual ID?

This will be the individual ID that PLINK identifies your raw data with; the combination of a family ID and individual ID should uniquely identify a person.

Paternal ID (Default=0)?

You can just leave this at 0.

Maternal ID (Default=0)?

You can also leave this at 0.

SEX (1=male; 2=female; other=unknown)?

Enter 1 for male and 2 for female.
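
Putting steps 5 through 10 together, a typical terminal session looks roughly like the following. This is a sketch only; "My_Rawdata.txt" and the example answers are placeholders for your own file name and details.

  cd Desktop/Convert_23andME/
  octave
  # the remaining lines are typed at the Octave prompt:
  Raw_Convert ("My_Rawdata.txt")
  # example answers to the prompts (no spaces):
  #   Output File Name?                        my_rawdata
  #   Family ID?                               FAM1
  #   Individual ID?                           IND1
  #   Paternal ID (Default=0)?                 0
  #   Maternal ID (Default=0)?                 0
  #   SEX (1=male; 2=female; other=unknown)?   1
  # when the conversion finishes:
  exit
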
--------------------------------------------------------
Edit_Rev2: Converted Program into function, included Chromosome # and Position fields in No call list.

Edit_Rev3: Segregated No calls between Mitochondria, X and Y, included total-passed SNPs for PLINK in summary.

Edit_Rev4: Automated binary PED file creation.  

6 comments:

  1. Hey, thanks a lot for this, I just used it!

    This guide is great, and I can confirm it worked flawlessly. I have a quad-core Xeon; if you ever need to crunch a big dataset, let me know and I can give you SSH access or run it for you.

    Took me 5.8 minutes to convert a V2 23andme file.
    Time to Process : 352.626282

    Also, out of curiosity, when you use PLINK, does PLINK ask for double-extension filenames? PLINK only works for me when I copy or rename the files to match; like, it'll say rawdata.tped.tped file missing, so I have to rename rawdata.tped to rawdata.tped.tped and rawdata.tfam to rawdata.tped.tfam.

    Are you also having this?

    Another note.
    GNU Octave is using only 1 core; if you have a multi-core machine, the time may be cut down if you can find the multi-threaded library for GNU Octave. I'm looking for it.

    The 23andMe data conversion would have taken me about 1.45 minutes if it were multi-threaded.

    1. Wow, less than 6 minutes; I really need to update my machine :)

      "Also out of curiosity, when you use plink does plink ask for double extension filenames? Plink only works for me when i copy or rename the files to match. like it'll say rawdata.tped.tped file missing, so i have to rename rawdata.tped to rawdata.tped.tped and rawdata.tfam to rawdata.tped.tfam"

      When you type the name you want to give your raw data in Octave, enter the name only, without the extension (.tped); the program will automatically create the necessary extensions. I have updated the post to clarify this. So, say the name you give your raw data in Octave is my_rawdata; the program will then create the files my_rawdata.tped, my_rawdata.tfam and my_rawdata.nocall.

      Next, with PLINK, all you need to type is:
      plink --tfile "my_rawdata" --make-bed --out "Rawdata"

      You don't even need to type the extension; actually, you don't even need the quotation marks. The following will have the same outcome as the above:
      plink --tfile my_rawdata --make-bed --out Rawdata

      There is no need to type the extension because the --tfile flag lets PLINK know which file extensions and associated files to look for, in this case a transposed pedigree fileset. Similarly, when using the --bfile flag, you don't need to type .bed because PLINK knows to look for a binary PED file and the other files associated with it.
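
      For example (hypothetical file names, and just a sketch assuming PLINK 1.07's --bmerge syntax), merging your converted binary PED with a reference data-set before running ADMIXTURE could look something like:
      plink --bfile Rawdata --bmerge Reference.bed Reference.bim Reference.fam --make-bed --out Merged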

      On your other note: yes, I have a dual-core machine but it is very old, and I'm also not sure how to set up Octave to run using both cores.

  2. Yeah, if you ever need to crunch a big project/dataset, let me know; my machine is available since I'm not doing 24/7 runs. I just ran ADMIXTURE, and I'm going to try to go by the .fam files to see which clusters formed. I was just playing around and I forgot to do the Fst distances to see which K's are "noisier", but I did a K=15; I'll let you know what happens. I have your Afro dataset + all the Amerinds. There is no Euro yet; I'm going to see which Amerinds are admixed, the Euro should be proxied by North African, and I'll prune all mixed Natives. For Euros I'm thinking of using Basques as the center point; hopefully they form their own cluster. I might just not put in other Euros so they can do this, and then add them as individuals.

    By the way, that African dataset + Amerinds + my own raw data took about 1 hour and 40 minutes to run.

    1. OK, thanks for the additional computing space offer; it may come in handy for larger data-sets.
      As for the Euros, you can use the Basques or the Lithuanians, with the caveat that the former have slightly higher 'MENA' affinities while the latter have slightly higher Siberian plus South-Central Asian affinities.

    2. So I've been doing a couple of runs; some interesting notes.

      I got to see the error at the K's, and when I combine that African dataset from K=1 to K=16, the least-error K's are K=12 and K=14.

      Interestingly, when the North African cluster is MISSING (K=12) my Native American gets inflated by like 2%. Something is similar about North Africans and Natives, and since I have substantial North African ancestry, in its absence it inflates my Native.

      I was actually able to get ALL the African clusters you have as low as K=14, including my Euro, Native, North African and Arabian clusters. But I wasn't able to reproduce the West-Central African Bantu cluster until K=16, which had a higher error rate, although not crazy high.

      I added my GF to compare, and our African ancestries differ depending on the K.

      In some K's I am all E. Bantu, in others I am all West African (Dogon).

      In some she is all E. Bantu and in some she is all West African.

      I wonder why the big difference?

      This seems to be the only thing not stable so far. But the native, euro and other scores are pretty consistent for me and my GF.

      The thing is, the Bantu scores don't change for the populations, but they change from all to nothing for me and my GF; have you experienced things like this?

      One thing too: is there a script to make finding out the K's more automated? I've been doing it manually using Excel =).

    3. Interesting run, Lemba.

      “Something is similar about North Africans and Natives, and since I have substantial North African ancestry, in its absence it inflates my Native.”

      Likely; since Natives are ultimately Eurasian, and since North Africans have substantial Eurasian as well as Eurasian-like ancestry, the program is finding similarities in the patterns of allele frequencies in individuals of these 2 groups....

      “I added my GF to compare, and our African ancestries differ depending on the K.

      In some K's I am all E. Bantu, in others I am all West African (Dogon).

      In some she is all E. Bantu and in some she is all West African.

      I wonder why the big difference?”

      Probably because Eastern Bantus are a modified version of West African (Niger-Kordofanian) groups, modified in terms of additional Rift Valley, Afroasiatic and Nilo-Saharan genes. However, the discernment is also fuzzy because the SNPs are Eurocentric; unfortunately, that is all we have to work with for now, until we get another diverse set of African samples genotyped with more Afrocentric SNPs.

      “One thing too: is there a script to make finding out the K's more automated? I've been doing it manually using Excel =).”

      I do indeed have a script in Octave that automates the output from ADMIXTURE. The issue is that it relies on a text file listing the samples, with the population names in the correct order in which ADMIXTURE reports the output in its .Q file. In turn, I have another script that correlates the .fam files with a correctly ordered .txt file, but currently I am the only one who can operate these scripts since I wrote them; they are messy and it would take some modification to make them user-friendly. I don't mind posting the scripts at all, but it may take me some time to modify them, as I'm considerably busy with other things this week.

      Perhaps in the meantime you can also take a look at the software R and its capabilities; apparently it can also meaningfully sort and graph these results from ADMIXTURE, although I am not very familiar with R. Another thing with my scripts in Octave is that I hold all the actual data (numbers) in population-specific sub-matrices, which gives me greater flexibility in analyzing each population group's data and thereby performing different statistical analyses like studentization, ANOVA and the like. I'm not sure R would provide you with that flexibility unless one custom-programs it.
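
      In the meantime, a quick and dirty alternative at the command line (just a sketch, with hypothetical file names, and assuming the rows of ADMIXTURE's .Q file are in the same order as the rows of the .fam file used for the run) would be something like:
      # pull the family and individual IDs out of the .fam file
      awk '{print $1, $2}' merged.fam > ids.txt
      # glue the IDs onto the ancestry proportions for, say, K=15
      paste -d' ' ids.txt merged.15.Q > labeled_K15.txt
      # sort by family ID so individuals from the same population sit together
      sort -k1,1 labeled_K15.txt > labeled_K15_sorted.txt
      This only labels and sorts the rows; the per-population sub-matrices and statistics my Octave script handles would still need something more elaborate.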
