Dr Edin Hamzić

🧬👩🏽‍💻 A Guide to Merging VCF and BCF Files: Your bcftools merge Tutorial

Updated: Oct 24, 2023

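This blog post is about the bcftools merge command. It explains what the command does, how to prepare your files, and how to merge them by listing files, using a wildcard, or providing a file list, with or without restricting the output to specific genomic regions. Jump to the section that interests you most if you are in a hurry.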



What is the bcftools merge?

First of all, if you are unfamiliar with bcftools, you can check my short introductory post about bcftools here, where you can also find the list of other blog posts I wrote about bcftools commands.


The primary purpose of bcftools merge is to combine multiple VCF or BCF files into a single VCF or BCF file. You can do this either by listing the file names directly in the bcftools merge command or by providing the path to a file that lists the VCF/BCF files to be merged. The command has additional parameters and options, which are covered in the following sections of this post.


Tutorial: A list of concrete bcftools merge examples


How to prepare VCF files before merging them with bcftools merge


Before merging files, ensure your VCF/BCF files are sorted and indexed. Here is how you can do it.


If you have some specific folder where your VCF/BCF files are stored, you can quickly apply the following series of commands:


First, create a file list (list.txt) containing all VCF/BCF files stored in your directory (for example, /Users/john_doe/bcftools/merge/). Here I use the wildcard pattern input_file*.vcf.gz; use whatever pattern fits your use case:

find /Users/john_doe/bcftools/merge -maxdepth 1 -name "input_file*.vcf.gz" > list.txt

Once you have created the file list, the next step is to sort and index those files:

# Sort each file in list.txt, then build a tabix index for the sorted copy
while IFS= read -r file
do
    sorted="${file/input_file/sorted_input_file}"
    bcftools sort "$file" -Oz -o "$sorted"
    bcftools index -t "$sorted"
done < list.txt

Recent bcftools versions also offer a --write-index option that builds the index while the output is written. With bcftools sort, this lets you fold the indexing step into the sorting step; with bcftools merge, it indexes the merged output file, so the input files still need their own indexes. Either way, once your files are sorted and indexed as in the example above, you can proceed to the next step: merging your input VCF/BCF files.
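Before merging, it can help to confirm that every input really has an index next to it. The sketch below is one way to do that in plain shell; the function name missing_index is hypothetical, and it assumes the index sits beside the data file with a .tbi or .csi suffix, which is where bcftools index writes it by default.

```shell
# Print every path from a file list that has no .tbi or .csi index beside it.
# missing_index is a hypothetical helper name, not a bcftools command.
missing_index() {
    while IFS= read -r vcf; do
        [ -n "$vcf" ] || continue                 # skip blank lines
        if [ ! -e "$vcf.tbi" ] && [ ! -e "$vcf.csi" ]; then
            printf '%s\n' "$vcf"
        fi
    done < "$1"
}

# Example: missing_index list.txt
```

An empty result means every listed file has an index; any path that is printed still needs bcftools index.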

How to merge two or more VCF files using bcftools merge?

Merging two or more files can be accomplished with the following command:

bcftools merge input_file1.vcf.gz input_file2.vcf.gz -Oz -o merged_files.vcf.gz

However, if the same sample name appears in more than one input file, the bcftools merge command will stop and raise an error. In this case, you have two options: (i) go back to the original VCF/BCF files and remove or rename the duplicated samples, or (ii) use the --force-samples parameter, which lets you proceed with the merge anyway by prepending the index of the file to the duplicated sample names.

bcftools merge input_file1.vcf.gz input_file2.vcf.gz input_file3.vcf.gz --force-samples -Oz -o merged_files.vcf.gz
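To find out in advance whether any sample names clash, you can compare the sample columns of the input files. With bcftools itself, bcftools query -l prints the sample names of a file. The sketch below does the same for compressed VCF files using only gzip and awk, so it is an illustration under that assumption and does not work on BCF files; the function names list_samples and duplicate_samples are hypothetical.

```shell
# List the sample names stored in one bgzip- or gzip-compressed VCF.
# Sample names are the columns after FORMAT (column 9) on the #CHROM line.
list_samples() {
    gzip -cd "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Print every sample name that occurs more than once across the given files.
duplicate_samples() {
    for vcf in "$@"; do
        list_samples "$vcf"
    done | sort | uniq -d
}

# Example: duplicate_samples input_file1.vcf.gz input_file2.vcf.gz input_file3.vcf.gz
```

If nothing is printed, the sample names are unique and --force-samples is not needed.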

How to merge multiple VCFs using the wild card?

If you have a large number of files to combine, listing their names one by one is not practical, and you have two other options. One is to use a wildcard if your file names share a common string. For the files I used above, the command would be:

bcftools merge input_file*.vcf.gz -Oz -o merged_files.vcf.gz
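Because the shell expands the wildcard before bcftools sees it, it is worth checking what the pattern actually matches. The following is a minimal sketch of such a check; preview_inputs is a hypothetical helper name, and the two-file minimum is an assumption about what a typical merge needs.

```shell
# Print the files a shell pattern expands to, one per line, and fail when
# fewer than two existing files match. Pass the pattern unquoted so the
# shell expands it. preview_inputs is a hypothetical helper name.
preview_inputs() {
    if [ "$#" -lt 2 ] || [ ! -e "$1" ]; then
        echo "pattern matched fewer than two files" >&2
        return 1
    fi
    printf '%s\n' "$@"
}

# Example: preview_inputs input_file*.vcf.gz
```

The printed order is also the order in which the files are handed to bcftools merge.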

How to merge multiple VCFs using a file list?

The other option, useful when you have a large number of VCF/BCF files whose names do not share a common string that a wildcard pattern could match, is to create a file that lists all the files you want to merge and pass this file to the bcftools merge command.

In my case, I have five .vcf.gz files (input_file_1.vcf.gz, input_file_2.vcf.gz, input_file_3.vcf.gz, input_file_4.vcf.gz, input_file_5.vcf.gz), and to merge them using a file list, I would do the following.


First, create a file with the list of VCF/BCF files you want to merge. One way is to use the ls command below, assuming you will run bcftools merge from the folder where your files are stored:


ls input_file*.vcf.gz > list_of_files.txt

The other way is to create a file with full paths, where /Users/john_doe/bcftools/merge is the folder where the VCF/BCF files are stored:

find /Users/john_doe/bcftools/merge -maxdepth 1 -name "input_file*.vcf.gz" > list_of_files.txt

Once you have created the file with the list of VCF/BCF files to be merged (list_of_files.txt), you can merge those files using the following command:

bcftools merge --file-list list_of_files.txt -Oz -o merged_files.vcf.gz
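A list made with ls holds relative file names, so it only works from the folder where the files live. If you want to run the merge from somewhere else, one option is to rewrite the list with absolute paths first. This is a minimal sketch; absolute_list is a hypothetical helper name, and it assumes relative entries are relative to the directory you run it from.

```shell
# Rewrite a file list so every entry is an absolute path.
# Relative entries are resolved against the current directory.
# absolute_list is a hypothetical helper name.
absolute_list() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue                # skip blank lines
        case $path in
            /*) printf '%s\n' "$path" ;;          # already absolute
            *)  printf '%s/%s\n' "$PWD" "$path" ;;
        esac
    done < "$1"
}

# Example: absolute_list list_of_files.txt > list_of_files.abs.txt
```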

How to select only specific genomic regions when using the bcftools merge command?

The command below combines VCF/BCF files and keeps only the genetic variants located on chromosome 1, using the -r parameter:

bcftools merge file1.vcf.gz file2.vcf.gz -Oz -r 1 -o merged_files.vcf.gz

You can use a similar command for multiple chromosomes, separating them with commas:

bcftools merge file1.vcf.gz file2.vcf.gz -Oz -r 1,20 -o merged_files.vcf.gz
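For a longer run of chromosomes, typing the comma-separated list by hand gets tedious. The sketch below builds it for a numeric range; chrom_list is a hypothetical helper name, and the optional prefix is there because the names passed to -r have to match the contig names used in your files (for example 1 versus chr1).

```shell
# Build a comma-separated list of chromosome names for -r, e.g. 1,2,3.
# Arguments: first number, last number, optional name prefix such as "chr".
# chrom_list is a hypothetical helper name.
chrom_list() {
    awk -v first="$1" -v last="$2" -v prefix="${3-}" 'BEGIN {
        for (i = first; i <= last; i++)
            printf "%s%s%d", (i > first ? "," : ""), prefix, i
        print ""
    }'
}

# Example: bcftools merge file1.vcf.gz file2.vcf.gz -Oz -r "$(chrom_list 1 22)" -o merged_files.vcf.gz
```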

Another handy option is the -R parameter, which takes a file listing the genomic regions you want in the output of the bcftools merge command. The following example shows a BED file with such regions (the columns, which must be tab-separated, are chromosome, start, and end):

cat regions.bed
1 10000 50000
2 20000 40000
3 10000 70000
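Region files are read as tab-delimited, and tabs are easily lost when a listing is copied from a web page. One way around that is to write the file with printf and then check the delimiters. This is a sketch only: write_regions and check_regions are hypothetical helper names, and the intervals are the ones from the listing above.

```shell
# Write a three-column, tab-separated regions file (chromosome, start, end).
# printf repeats its format for each group of three arguments.
write_regions() {
    printf '%s\t%s\t%s\n' \
        1 10000 50000 \
        2 20000 40000 \
        3 10000 70000 > "$1"
}

# Succeed only if every non-empty line has at least three tab-separated fields.
check_regions() {
    awk -F'\t' 'NF > 0 && NF < 3 { bad = 1 } END { exit bad + 0 }' "$1"
}

# Example: write_regions regions.bed && check_regions regions.bed
```

A non-zero exit status from check_regions usually means the columns are separated by spaces instead of tabs.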

Once you prepare the BED file, you can use the following command to merge the files using a wild card:

bcftools merge input_file*.vcf.gz -R regions.bed -Oz -o merged_files.vcf.gz

Or by simply listing the files you want to merge:

bcftools merge input_file1.vcf.gz input_file2.vcf.gz -R regions.bed -Oz -o merged_files.vcf.gz

Or by providing a file list:

bcftools merge --file-list list_of_files.txt -R regions.bed -Oz -o merged_files.vcf.gz
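For a handful of intervals, the same regions can also be passed inline with -r, which accepts chr:beg-end strings. The sketch below converts a BED file into such a string. It assumes standard BED coordinates (0-based start, end not included) and the 1-based, inclusive coordinates that -r expects, so the start is shifted by one; bed_to_regions is a hypothetical helper name.

```shell
# Turn BED intervals (0-based start, end-exclusive) into a comma-separated
# region string with 1-based inclusive coordinates, e.g. 1:10001-50000.
# Lines with fewer than three tab-separated fields are skipped.
bed_to_regions() {
    awk -F'\t' 'NF >= 3 { printf "%s%s:%d-%d", sep, $1, $2 + 1, $3; sep = "," }
                END { print "" }' "$1"
}

# Example: bcftools merge input_file1.vcf.gz input_file2.vcf.gz -r "$(bed_to_regions regions.bed)" -Oz -o merged_files.vcf.gz
```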

Hope you found this helpful. This is it for now, but I already have some ideas for expanding this blog post with additional examples.



© 2020 by BioComputiX
