This blog post gives an overview of the bcftools annotate command, along with its most common use cases. If you are in a hurry, here is a short outline so you can check whether the post covers what you are looking for:
Introduction to the bcftools annotate command: The first section explains what bcftools annotate actually does and presents its most important parameters. If you are not familiar with the command, I encourage you to read this part; if you are, or you are simply in a hurry, just jump to the next section here. Also, if you are not familiar with bcftools in general, I recommend reading a short introductory post about bcftools I wrote here.
Tutorial on bcftools annotate with concrete examples: This is the second part of the post, and I would say the most useful one, where I list examples of how I most commonly use the bcftools annotate command. You can find links to those concrete examples below👇
How to remove a field/column/annotation from a VCF file using the bcftools annotate command?
How to annotate a VCF file using the bcftools annotate command?
How to use multithreading with the bcftools annotate command?
How to create a new ID column in a VCF file based on the existing one using bcftools annotate?
Introduction to the bcftools annotate command
What is the bcftools annotate command?
In essence, bcftools annotate is the command for adding or removing annotations in a VCF/BCF file. In practice, if you want to remove a specific field/annotation from a VCF file, such as ID, INFO, or FORMAT fields, you will use this command. You will also use it to add new fields/annotations, together with an annotation file from which the annotations are fetched and assigned to your original file. That annotation file can be another bgzipped and tabix-indexed VCF or BED file, or a tab-delimited file (also bgzipped and tabix indexed) that must contain CHROM and POS columns plus any number of additional annotation columns you want to transfer into your original input file.
What are the command options for bcftools annotate?
The three bcftools annotate command parameters that I use most commonly are:
-a, --annotations: This parameter defines the annotation file from which the annotations you want to add will be fetched. As explained above, this can be a bgzip-compressed, tabix-indexed VCF or BED file, or a tab-delimited file, which must contain CHROM and POS columns/fields plus any additional columns/fields.
-c, --columns: This parameter defines the list of columns/fields to be added to your input file from the annotation file. Entries in the list can be prefixed with a + sign or a . depending on what exactly needs to be done. I will reproduce the great table provided in this tutorial:
-x, --remove: This parameter defines the list of columns/fields that need to be removed from the input file.
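To make the tab-delimited case more concrete, here is a minimal sketch of how such an annotation file could be prepared and used. The file names (annots.tab, annots.hdr), the MY_TAG tag, and its values are made up for illustration, and annotate_from_table is just a hypothetical wrapper; the tabix and bcftools options follow the tab-delimited example in the bcftools manual.

```shell
# Toy annotation table (hypothetical values): CHROM, POS, REF, ALT, MY_TAG
printf '1\t1000\tA\tG\t0.42\n1\t2000\tT\tC\t0.17\n' > annots.tab

# VCF header line describing the new INFO tag, so it can be added to the output header
printf '##INFO=<ID=MY_TAG,Number=1,Type=Float,Description="Hypothetical example score">\n' > annots.hdr

# Hypothetical helper: compress and index the table, then transfer MY_TAG.
# Usage: annotate_from_table INPUT OUTPUT
annotate_from_table() {
    bgzip -c annots.tab > annots.tab.gz
    # -s1: sequence name in column 1, -b2/-e2: start and end position in column 2
    tabix -s1 -b2 -e2 annots.tab.gz
    bcftools annotate -a annots.tab.gz -h annots.hdr \
        -c CHROM,POS,REF,ALT,MY_TAG "$1" -Oz -o "$2"
}
```

You would then call it as annotate_from_table input_file.vcf.gz input_file_my_tag.vcf.gz. The order of names given to -c has to match the order of columns in the table.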
As I wrote, these are, in my opinion, the three key and most commonly used parameters. Besides them, there is a list of additional parameters. Now let's jump to the tutorial with concrete examples of how bcftools annotate can be used.
bcftools annotate: Tutorial With Concrete Examples
In this section, which is basically a bcftools annotate tutorial, I will present some of the most common use cases of bcftools annotate and how I use it most often. I will use All_20180418.vcf.gz as an annotation file. You can download this file from here, including its index file.
How to remove a field/column/annotation from a VCF file using the bcftools annotate command?
Let’s start with the most straightforward and relatively common task: removing a field/column/annotation from a VCF file using bcftools. This can be done with the following command:
bcftools annotate -x FORMAT/SVTYPE input_file.vcf.gz -Oz -o input_file_no_svtype.vcf.gz
Or you can use
bcftools annotate --remove FORMAT/SVTYPE input_file.vcf.gz -Oz -o input_file_no_svtype.vcf.gz
If you want to remove an attribute from the INFO column, you would use the following command in the same way:
bcftools annotate -x INFO/ConversionType input_file.vcf.gz -Oz -o input_file_no_conversion_type.vcf.gz
Or you can use
bcftools annotate --remove INFO/ConversionType input_file.vcf.gz -Oz -o input_file_no_conversion_type.vcf.gz
By now, I guess you got the idea.
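The -x parameter also accepts a comma-separated list, and a ^ prefix inverts the selection so that only the listed tags are kept. The sketch below wraps this in a small hypothetical helper, strip_annotations; the DP tag in the usage note is borrowed from the bcftools manual examples and is only assumed to exist in your file.

```shell
# Hypothetical helper: remove a comma-separated list of annotations from a VCF.
# Usage: strip_annotations LIST INPUT OUTPUT
strip_annotations() {
    bcftools annotate -x "$1" "$2" -Oz -o "$3"
}
```

For example, strip_annotations 'ID,INFO/DP,FORMAT/DP' input_file.vcf.gz input_file_trimmed.vcf.gz removes three fields in one pass, while strip_annotations '^INFO/DP' input_file.vcf.gz input_file_dp_only.vcf.gz should drop every INFO tag except DP. Quoting the list keeps the shell from interpreting the ^ character.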
How to annotate a VCF file using the bcftools annotate command?
The other widespread use of the bcftools annotate command is to add new columns, or to add new fields within columns such as FORMAT or INFO that contain multiple fields. For this purpose, we need an annotation file from which bcftools annotate can fetch the annotations you need and add them to your original file.
So, I created a use case where I removed the ID column from the example file above (input_file.vcf.gz), so that instead of SNP IDs it contains only dots (.). These dots will then be replaced with IDs from a reference file, which in this case is the All_20180418.vcf.gz mentioned previously.
So, let’s see what it looks like:
Removing the ID column using the following command:
bcftools annotate --remove ID input_file.vcf.gz -Oz -o input_file_no_id.vcf.gz
This new file, which we will use as input, must also be indexed; the same obviously applies to your own file if you use one as input. So, let’s do that:
tabix -p vcf input_file_no_id.vcf.gz
Adding the ID column using the bcftools annotate command:
bcftools annotate input_file_no_id.vcf.gz -a All_20180418.vcf.gz -c ID -Oz -o input_file_added_id.vcf.gz
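Not every variant in your file will necessarily have a match in the annotation file, so it can be worth checking how many records still have a missing ID afterwards. Here is a minimal sketch using bcftools query and awk; count_missing_ids is a hypothetical helper name, not part of bcftools.

```shell
# Hypothetical helper: count records whose ID column is still "." (missing).
# Usage: count_missing_ids FILE
count_missing_ids() {
    bcftools query -f '%ID\n' "$1" |
        awk '$1 == "." { n++ } END { print n + 0 }'
}
```

Running count_missing_ids input_file_added_id.vcf.gz prints the number of records that did not receive an ID. If that number equals the total number of records, a likely cause is a chromosome naming mismatch between the two files (for example chr1 versus 1).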
How to use multithreading with the bcftools annotate command?
This process of annotating a VCF/BCF file can take some time, so if you have a large file to annotate, you can use the --threads parameter (it has only this long form). Note that, according to the bcftools manual, the extra threads are used for compressing the output stream, so they help only with compressed output types such as -Oz.
bcftools annotate input_file_no_id.vcf.gz --threads 8 -a All_20180418.vcf.gz -c ID -Oz -o input_file_added_id.vcf.gz
How to create a new ID column in a VCF file based on the existing one using bcftools annotate?
It often happens that you do not want to use the IDs in the form in which they appear in your input VCF file, such as rsIDs or some other type of ID, and you need to create a composite ID that combines, for example, the chromosome, the position, and the corresponding alleles. In those situations you can also use bcftools annotate in the following way:
bcftools annotate input_file.vcf.gz -x ID -I +'%CHROM:%POS:%REF:%ALT' -Oz -o input_file_composite_id.vcf.gz
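One thing to watch for with composite IDs is multiallelic sites, where %ALT expands to a comma-separated list of alleles (something like 1:1000:A:C,T). If you would rather have one ID per allele, one option is to split such sites with bcftools norm first. The sketch below assumes that splitting records is acceptable for your analysis; set_composite_ids is a hypothetical helper name.

```shell
# Hypothetical helper: split multiallelic sites, then assign CHROM:POS:REF:ALT IDs.
# Usage: set_composite_ids INPUT OUTPUT
set_composite_ids() {
    # -m -any splits multiallelic records into one record per ALT allele;
    # -Ou streams uncompressed BCF to the next command, which reads it from stdin (-)
    bcftools norm -m -any "$1" -Ou |
        bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' -Oz -o "$2" -
}
```

You would call it as set_composite_ids input_file.vcf.gz input_file_composite_id.vcf.gz. Because -x ID clears the existing IDs first, the + prefix (which sets only missing IDs) ends up assigning a new ID to every record.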
This is it for now. I will try to expand this post with additional examples later on.