If you do bioinformatics regularly, one task you will run into quite often is mapping SNP rsIDs to the genes (gene names) in which those SNPs are located.
There are many ways to do this, ranging from web interfaces that require manual work to solutions built on specific databases and APIs that let you do it programmatically.
My goal in this blog post is to cover the solution I know best and use the most. I will write follow-up posts presenting other solutions. I will start with the one I think is best, or that at least works best for me.
This solution is based on Entrez Direct by NCBI. Entrez Direct is, briefly, a set of Entrez utilities (E-utilities) for use on the Unix command line: navigation and accessory programs that give you easy access to the public genomic data available through NCBI.
For this specific case, I will be using the esearch, esummary and xtract programs that are part of the Entrez Direct utilities.
As input, I will use a simple text file with the list of SNP rsIDs that need to be assigned their corresponding HUGO Gene Nomenclature Committee (HGNC) gene names. For this use case I compiled a text file with over 60000 SNP rsIDs.
I will also use a few Unix commands: cat (to print the contents of the input file with the list of SNP rsIDs), parallel (to speed up the whole process) and uniq (to filter out duplicates from the standard output before it is saved into a separate file).
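One detail about uniq is worth a quick illustration: it only collapses duplicate lines that sit next to each other, so duplicates that end up far apart in the output file survive. The sketch below uses made-up placeholder lines (the GENE_A and GENE_B names are not real lookup results) to compare uniq with sort -u.

```shell
# Three placeholder lines; the two identical "rs1" lines are NOT adjacent.
sample="$(printf 'rs1\tGENE_A\nrs2\tGENE_B\nrs1\tGENE_A\n')"

# uniq alone keeps both copies, because it only compares neighbouring lines.
uniq_only="$(printf '%s\n' "$sample" | uniq)"

# sort -u sorts first, so every duplicate line is removed.
sorted_unique="$(printf '%s\n' "$sample" | LC_ALL=C sort -u)"

printf 'uniq:    %s lines\n' "$(printf '%s\n' "$uniq_only" | wc -l | tr -d ' ')"
printf 'sort -u: %s lines\n' "$(printf '%s\n' "$sorted_unique" | wc -l | tr -d ' ')"
```

If the final file still contains repeated lines, running sort -u on it once at the end is a simple way to clean it up, at the cost of losing the original line order.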
So, the input file looks like this:
% head test_snps.txt
rs12562034
rs3934834
rs9442398
rs6687776
rs9442373
rs2887286
rs9439462
rs2803291
rs908742
rs3753242
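With a list this long, it can help to sanity-check the input before sending tens of thousands of queries. The following is a minimal sketch, not part of my original workflow: clean_rsids is a hypothetical helper name, and it assumes every valid line is just "rs" followed by digits.

```shell
# Hypothetical helper: read rsIDs on stdin, write a cleaned list to stdout.
clean_rsids() {
  # Remove Windows carriage returns, trim surrounding whitespace,
  # keep only lines shaped like rs<digits>, then drop duplicate IDs.
  tr -d '\r' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -E '^rs[0-9]+$' |
    LC_ALL=C sort -u
}
```

It could be used as clean_rsids < test_snps.txt > test_snps.clean.txt (the output file name is only an example). Note that sort -u changes the order of the IDs, which does not matter for the lookup itself.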
The final command to map SNP rsIDs to gene names looks like this:
cat test_snps.txt | parallel -j 8 "esearch -db snp -query '{} AND human [orgn]' | esummary | xtract -pattern DocumentSummary -element SNP_ID,NAME | uniq >> genes_snps.txt"
Here I also use the parallel command. Its -j parameter sets how many jobs run at the same time, and a sensible value mainly depends on the number of CPU cores or threads available on your machine.
In essence, the command above reads the rsIDs from test_snps.txt and pipes them into a series of Entrez Direct commands, run in parallel, which return the SNP ID and HGNC gene name for each SNP.
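When something goes wrong, a slower sequential version of the same pipeline is easier to debug than the parallel one-liner. Here is a rough sketch under a few assumptions: lookup_snp and map_snps are hypothetical function names, the optional argument to map_snps exists only so the loop can be tried with a stand-in lookup, and the redirect from /dev/null is a precaution so that the lookup does not swallow the rest of the ID list from standard input.

```shell
# Look up one rsID with the same esearch | esummary | xtract chain as above.
lookup_snp() {
  esearch -db snp -query "$1 AND human [orgn]" </dev/null |
    esummary |
    xtract -pattern DocumentSummary -element SNP_ID,NAME
}

# Read rsIDs from stdin (one per line), look each one up in turn,
# and remove duplicate result lines at the very end.
# $1 (optional, hypothetical): name of the lookup command; defaults to lookup_snp.
map_snps() {
  lookup_cmd="${1:-lookup_snp}"
  while IFS= read -r rsid; do
    # Skip empty lines instead of sending an empty query.
    [ -n "$rsid" ] || continue
    "$lookup_cmd" "$rsid"
  done | LC_ALL=C sort -u
}
```

Running head test_snps.txt | map_snps on a handful of IDs first is a cheap way to confirm that Entrez Direct is installed and returns what you expect before launching the full parallel run.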
That would be it :)