ℹ️ In this new bcftools-related blog post, I will cover the bcftools concat command, along with the most common examples, as usual, in the form of a tutorial.
In case you do not want to go through the whole blog post, here is a short overview of the post:
1️⃣ What is the bcftools concat command? This post's first section introduces what bcftools concat does and its most essential parameters. Read it if you are unfamiliar with the bcftools concat command; otherwise, you can jump to the next section here. Also, if you are unfamiliar with bcftools in general, I recommend reading a short introductory post about bcftools I wrote here.
2️⃣ Tutorial on bcftools concat with concrete examples: In this practical segment of the post, I will list examples of how I most commonly use the bcftools concat command. Hyperlinks to concrete examples below👇
What is the bcftools concat command?
ℹ️ This bcftools command is relatively simple in what it does. It combines, or to be more precise, concatenates VCF/BCF files, and the most crucial aspect of it is that input files that you want to concatenate/combine must have two things:
input VCF/BCF files must contain the same sample columns and
the sample columns in your input VCF/BCF files must appear in the same order.
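Before concatenating, it can be worth checking both requirements up front. Below is a minimal sketch of how I would do that. Note that check_same_samples is a hypothetical helper name (not part of bcftools); it only relies on bcftools query -l, which prints the sample names of a file in column order.

```shell
# Hypothetical helper: succeeds only if every listed file has the same
# samples, in the same order, as the first file.
check_same_samples() (
    first="$1"
    ref_samples=$(bcftools query -l "$first") || exit 2
    status=0
    for f in "$@"; do
        if [ "$(bcftools query -l "$f")" != "$ref_samples" ]; then
            echo "Sample columns differ from $first: $f" >&2
            status=1
        fi
    done
    exit "$status"
)

# Example usage (file names are placeholders):
# check_same_samples input_file_1.vcf.gz input_file_2.vcf.gz && echo "OK to concat"
```

The helper returns 0 when all sample lists match and 1 when at least one file differs, so it can be chained with && in front of the actual bcftools concat call.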
ℹ️ The most common usage of this command is to:
combine VCF/BCF files that contain data from individual chromosomes into a single VCF file or
combine multiple VCF/BCF files where some input files contain only SNPs and others contain only indels. More generally, it applies whenever you have VCF/BCF files that contain different sets of genetic variants for the same sample columns.
Another critical aspect is that the input files you want to combine/concatenate must be sorted by chromosome and position.
Also, input VCF files must be provided in the correct order to produce sorted VCF output unless the --allow-overlaps (-a) option is specified.
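One thing to watch out for is that alphabetical order is not chromosome order: a plain ls lists chr10.vcf.gz before chr2.vcf.gz. Here is a small sketch for building the input list in chromosome order. It assumes per-chromosome files named chr1.vcf.gz to chr22.vcf.gz, chrX.vcf.gz and chrY.vcf.gz (a naming scheme I made up for illustration) and that 1 to 22, X, Y is the order you want. list_chrom_vcfs is a hypothetical helper name.

```shell
# Hypothetical helper: print DIR/chr1.vcf.gz ... DIR/chr22.vcf.gz,
# DIR/chrX.vcf.gz, DIR/chrY.vcf.gz in that order, skipping missing files.
list_chrom_vcfs() (
    dir="$1"
    for chrom in $(seq 1 22) X Y; do
        f="$dir/chr${chrom}.vcf.gz"
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
)

# Example usage:
# list_chrom_vcfs . > vcf_files.txt
# bcftools concat -f vcf_files.txt -Oz -o all_chromosomes.vcf.gz
```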
The bcftools concat command parameters that I most commonly use are:
-d, --rm-dups: outputs duplicate records of the specified type that are present in multiple files only once. In other words, it removes duplicate genetic variants of a specific type when multiple files are combined/concatenated. Options for this parameter are:
snps: remove SNPs with duplicated positions (chr:position) regardless of whether the ALT alleles match or not; only the first SNP will be considered and appear in the output.
indels: remove indels with duplicated positions (chr:position) regardless of whether the REF and ALT alleles match or not; only the first indel will be considered and appear in the output.
both: a combination of snps and indels, meaning remove indels with duplicated positions (chr:position) regardless of whether the REF and ALT alleles match or not, and remove SNPs with duplicated positions (chr:position) regardless of whether the ALT alleles match or not; only the first record will be considered and appear in the output.
all: remove duplicated records based on position (chr:position) regardless of whether the ALT alleles match or not; only the first record will be considered and appear in the output.
exact: remove duplicated records only when the position and the REF and ALT alleles are all identical; only the first of the identical records will appear in the output.
-a, --allow-overlaps: allows the first coordinate of the next file to precede the last record of the current file.
--threads: uses multithreading with the given number of worker threads, which can speed up concatenation when files with many records are combined.
-o, --output (defines the output file) and -O, --output-type b|u|z|v[0-9] (defines the type of output file), which do not need an introduction.
--write-index: automatically indexes the output file, which is also handy.
For additional bcftools concat command parameters, I suggest checking out the official documentation here.
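To make the --write-index parameter a bit more concrete, here is a minimal sketch that concatenates and indexes in one step. concat_with_index is a hypothetical wrapper name, and I am assuming a bcftools release that is recent enough to support --write-index; on older versions you would run bcftools index on the output as a separate step.

```shell
# Hypothetical wrapper: concatenate the given files into a compressed VCF
# and let bcftools index the output in the same run.
concat_with_index() (
    out="$1"
    shift
    bcftools concat -Oz -o "$out" --write-index "$@"
)

# Example usage (file names are placeholders):
# concat_with_index input_file_1_2_3.vcf.gz input_file_1.vcf.gz input_file_2.vcf.gz input_file_3.vcf.gz
```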
👩🏾🏫 Tutorial on bcftools concat with concrete examples
This section will cover some of the most common examples of using the bcftools concat command, and over time, I hope I will expand it with additional examples.
🧑🏻🏫How to combine/concat a list of VCF/BCF files into a single VCF file using bcftools concat?
The most common use of bcftools concat is the following, where multiple input VCF or BCF files are passed to the command and combined into a single file:
bcftools concat input_file_1.vcf.gz input_file_2.vcf.gz input_file_3.vcf.gz -Oz -o input_file_1_2_3.vcf.gz
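The same thing can be achieved by using a wildcard instead of listing all the input files. Keep in mind that the shell expands the wildcard in alphabetical order and that the pattern below would also match input_file_1_2_3.vcf.gz if that file already exists from an earlier run: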
bcftools concat input_file_*.vcf.gz -Oz -o input_file_1_2_3.vcf.gz
🧑🏻🏫 How to combine/concat by providing a file with a list of VCF/BCF files into a single VCF file using bcftools concat?
The same output as in the previous example can be achieved by providing a file that contains the list of input VCF/BCF files that you want to combine.
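First, you want to create a file with the list of VCF/BCF files you want to combine. This can be done on a Unix-based OS using the following command. The pattern input_file_[0-9].vcf.gz is used instead of input_file_*.vcf.gz so that the combined file from the previous example (input_file_1_2_3.vcf.gz) does not end up in the list:
ls input_file_[0-9].vcf.gz > vcf_files.txt
cat vcf_files.txt
input_file_1.vcf.gz
input_file_2.vcf.gz
input_file_3.vcf.gz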
Once you have created a file with the list of VCF files (vcf_files.txt), you can use the following command to combine your input VCF files:
bcftools concat -f vcf_files.txt -Oz > input_file_1_2_3.vcf.gz
🧑🏻🏫How to reduce processing time when using the bcftools concat command?
The bcftools concat command has the --threads parameter, which enables multithreading with a defined number of worker threads. However, it is essential to underline that this option is currently used only when the output is saved with --output-type b or z, meaning compressed BCF or compressed VCF output. So, using this parameter with other output types, such as uncompressed BCF (u) or uncompressed VCF (v), does not make a difference.
Examples of how to use bcftools concat to combine VCF files of individual chromosomes with the multithreading parameter are below:
Using a file with a list of input VCF/BCF files:
bcftools concat -f vcf_files.txt -Oz --threads 8 > input_file_1_2_3.vcf.gz
Simply listing input VCF/BCF files:
bcftools concat input_file_1.vcf.gz input_file_2.vcf.gz input_file_3.vcf.gz --threads 8 -Oz -o input_file_1_2_3.vcf.gz
Using a wildcard to define input VCF/BCF files:
bcftools concat input_file_*.vcf.gz --threads 8 -Oz -o input_file_1_2_3.vcf.gz
🧑🏻🏫How do you concatenate/combine VCF/BCF files with varying order of samples using the bcftools concat command?
Another thing that can happen is that the VCF files you want to combine/concatenate list their samples in different orders, and because of this the bcftools concat command cannot be used directly.
For this, we have to apply some preprocessing steps. I will provide an example below that uses only two input files, but it can be extended to multiple VCF/BCF files.
We have input_file_1.vcf and input_file_2.vcf. These files contain genotype data for the following samples:
% bcftools query -l input_file_1.vcf
sample1
sample2
sample3
% bcftools query -l input_file_2.vcf
sample1
sample3
sample2
Since the samples are not in the same order in these two files, the first step is to order them in the same way. In our case, the easiest thing is to order the samples from input_file_2.vcf in the same way as those in input_file_1.vcf.
This can be done using the bcftools query and bcftools view commands. The first step is to get the list of samples using bcftools query -l, pass the output of that command through a pipe (|) into the sort command, and save it to a file called sorted_samples.txt. Alphabetical sorting works here because the samples in input_file_1.vcf are already in alphabetical order; if they were not, the list would have to be taken from input_file_1.vcf instead:
bcftools query -l input_file_2.vcf | sort > sorted_samples.txt
The next step is to use the bcftools view command to reorder the samples in input_file_2.vcf by passing the list of sorted samples stored in the sorted_samples.txt file:
bcftools view -S sorted_samples.txt input_file_2.vcf -o input_file_2_sorted.vcf
Now, we can finally use bcftools concat to combine/concatenate these two files in the following way:
bcftools concat input_file_1.vcf input_file_2_sorted.vcf -Oz -o input_file_1_2.vcf.gz
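A slightly more general sketch, which does not depend on alphabetical order at all, is to take the sample order directly from the first file and apply it to the second one. match_sample_order is a hypothetical helper name, and the file names in the usage lines are the ones from this example.

```shell
# Hypothetical helper: rewrite INPUT so that its sample columns follow the
# sample order of REF, and save the result to OUT.
match_sample_order() (
    ref="$1"
    input="$2"
    out="$3"
    order_file=$(mktemp) || exit 1
    # Sample names of REF, one per line, in column order.
    if bcftools query -l "$ref" > "$order_file" &&
       bcftools view -S "$order_file" -o "$out" "$input"; then
        status=0
    else
        status=1
    fi
    rm -f "$order_file"
    exit "$status"
)

# Example usage:
# match_sample_order input_file_1.vcf input_file_2.vcf input_file_2_sorted.vcf
# bcftools concat input_file_1.vcf input_file_2_sorted.vcf -Oz -o input_file_1_2.vcf.gz
```

This only works if both files contain the same set of samples; bcftools view should report an error for any sample in the list that is missing from the file being reordered.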
🧑🏻🏫 How to handle duplicates when using the bcftools concat command?
Suppose the VCF/BCF files you are trying to combine/concatenate contain records/rows that are identical. In that case, the duplicates can be removed using --rm-dups, whose options are listed and explained above. The critical information is that --rm-dups must be used together with the -a (--allow-overlaps) parameter and that the input files must be compressed and indexed.
bcftools concat input_file_1.vcf.gz input_file_2.vcf.gz --rm-dups exact -a -Oz -o input_file_1_2.vcf.gz
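If your inputs are plain .vcf files, they first need to be compressed and indexed. A minimal sketch of that preparation step is below. compress_and_index is a hypothetical helper name, and I am assuming bgzip-compressed VCF output (-Oz) with a tabix index (bcftools index -t); compressed BCF with its own index should work as well.

```shell
# Hypothetical helper: write a compressed copy of INPUT to OUT and index it,
# so that it can be used with bcftools concat -a.
compress_and_index() (
    input="$1"
    out="$2"
    bcftools view -Oz -o "$out" "$input" && bcftools index -t "$out"
)

# Example usage:
# compress_and_index input_file_1.vcf input_file_1.vcf.gz
# compress_and_index input_file_2.vcf input_file_2.vcf.gz
```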
So, just to repeat, if you want to combine multiple VCF/BCF files, you can use --rm-dups to remove duplicates and, more precisely:
--rm-dups snps: remove SNPs with duplicated positions (chr:position) regardless of whether the ALT alleles match or not; only the first SNP will be considered and appear in the output.
--rm-dups indels: remove indels with duplicated positions (chr:position) regardless of whether the REF and ALT alleles match or not; only the first indel will be considered and appear in the output.
--rm-dups both: a combination of snps and indels, meaning remove indels with duplicated positions (chr:position) regardless of whether the REF and ALT alleles match or not, and remove SNPs with duplicated positions (chr:position) regardless of whether the ALT alleles match or not; only the first record will be considered and appear in the output.
--rm-dups all: remove duplicated records based on position (chr:position) regardless of whether the ALT alleles match or not; only the first record will be considered and appear in the output.
--rm-dups exact: remove duplicated records only when the position and the REF and ALT alleles are all identical; only the first of the identical records will appear in the output.
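To illustrate the difference between all and exact, here is a toy sketch in awk. It is not how bcftools implements --rm-dups, and it ignores which file a record came from, whereas --rm-dups is about records repeated across input files. It only mimics the matching rule on a simplified tab-separated table with CHROM, POS, REF and ALT columns. dedup_records is a hypothetical name.

```shell
# Toy illustration only: keep the first record per key, where the key is
# chr:position for "all" and chr:position:REF:ALT for "exact".
dedup_records() (
    mode="$1"
    awk -v mode="$mode" 'BEGIN { FS = "\t" }
    {
        if (mode == "exact") {
            key = $1 FS $2 FS $3 FS $4
        } else {
            key = $1 FS $2
        }
        if (!(key in seen)) {
            seen[key] = 1
            print
        }
    }'
)

# Example usage: the second and third rows share chr1:100 with the first,
# but only the third row has the same REF and ALT alleles as the first.
# printf 'chr1\t100\tA\tG\nchr1\t100\tA\tT\nchr1\t100\tA\tG\nchr1\t200\tC\tT\n' | dedup_records exact
```

With all, only the first chr1:100 row and the chr1:200 row survive; with exact, the A>T row at chr1:100 is kept as well, because its alleles differ from the first record.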
This would be all regarding the bcftools concat command; if I remember some other examples related to this command, I will update the post. Also, if you have any concrete questions regarding this command or others I have already covered, please message me, and I will try to get back to you.