Here is a short one. This is probably obvious to anyone with basic knowledge of bcftools; still, I have often seen people mix up what the bcftools concat and bcftools merge commands do.
True, in essence, both commands combine a set of individual VCF/BCF files into a single one, but how they do it is quite different.
So, this is a short blog post with two simple visuals explaining the core difference between bcftools concat and bcftools merge.
Let's first explain how each of those commands is defined and what it does.
What Is Bcftools Merge, and How Does It Work?
Here is what the official bcftools documentation says about bcftools merge:
bcftools merge: Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file. For example, when merging file A.vcf.gz containing samples S1, S2, and S3 and file B.vcf.gz containing samples S3 and S4, the output file will contain five samples named S1, S2, S3, 2:S3, and S4.
The core segment of the above paragraph is: "merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file."
So, if you have a set of VCF/BCF files with different (non-overlapping) samples and want to join those into a single VCF/BCF file, you will use bcftools merge.
Figure 1 shows, in a very simplified way, how the bcftools merge command works.
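To make the idea concrete, here is a minimal Python sketch of what "merging" means conceptually. It is a toy model, not how bcftools works internally: the dictionary layout and the `toy_merge` function are hypothetical names made up for this illustration, sites absent from one input are simply filled with the missing genotype `./.`, and duplicate sample names are rejected rather than renamed as in the documentation's `2:S3` example.

```python
# Toy model of a VCF: sample names plus genotype rows keyed by (chrom, pos).
# Hypothetical structures for illustration only; not bcftools' real data model.

def toy_merge(a, b, missing="./."):
    """Combine two toy VCFs with different samples: new sample COLUMNS are added."""
    overlap = set(a["samples"]) & set(b["samples"])
    if overlap:
        # Simplification: this sketch refuses duplicate sample names.
        raise ValueError(f"duplicate sample names: {sorted(overlap)}")

    samples = a["samples"] + b["samples"]
    # Every site seen in either input appears once in the output.
    sites = sorted(set(a["records"]) | set(b["records"]))

    records = {}
    for site in sites:
        # A site absent from one input gets missing genotypes for its samples.
        left = a["records"].get(site, [missing] * len(a["samples"]))
        right = b["records"].get(site, [missing] * len(b["samples"]))
        records[site] = left + right
    return {"samples": samples, "records": records}


file_a = {
    "samples": ["S1", "S2"],
    "records": {("chr1", 100): ["0/1", "0/0"], ("chr1", 200): ["1/1", "0/1"]},
}
file_b = {
    "samples": ["S3"],
    "records": {("chr1", 100): ["0/0"], ("chr1", 300): ["0/1"]},
}

merged = toy_merge(file_a, file_b)
print(merged["samples"])
for site, genotypes in merged["records"].items():
    print(site, genotypes)
```

The output has three sample columns (S1, S2, S3) and three sites. S3 has a missing genotype at chr1:200, because that site was only present in the first file.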
What Is Bcftools Concat, and How Does It Work?
Here is what the official bcftools documentation says about bcftools concat:
bcftools concat: Concatenate or combine VCF/BCF files. All source files must have the same sample columns appearing in the same order. Can be used, for example, to concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel VCF into one. The input files must be sorted by chr and position. The files must be given in the correct order to produce sorted VCF on output unless the -a, --allow-overlaps option is specified. With the --naive option, the files are concatenated without being recompressed, which is very fast.
The core segment of the above paragraph is: "all source files must have the same sample columns appearing in the same order."
So, if you have a set of VCF/BCF files with the same (fully overlapping) samples, but a different set of genetic variants in each, you will use the bcftools concat command.
This is often the case when you have, for example, VCF/BCF files with genetic variants coming from different chromosomes, but for the same set of samples.
Figure 2 shows, in a very simplified way, how the bcftools concat command works.
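The same idea as a minimal Python sketch. Again, this is a toy model under stated assumptions, not bcftools' implementation: the `toy_concat` function and the dictionary layout are hypothetical, and the variant rows are kept as an ordered list because concatenation is all about stacking rows in the order the files are given.

```python
# Toy model of a VCF: sample names plus an ordered list of (chrom, pos, genotypes) rows.
# Hypothetical structures for illustration only; not bcftools' real data model.

def toy_concat(files):
    """Stack toy VCFs that share the same samples: new variant ROWS are added."""
    samples = list(files[0]["samples"])
    records = []
    for f in files:
        # Sample columns must match exactly, including their order.
        if f["samples"] != samples:
            raise ValueError("all inputs must have the same samples in the same order")
        # Rows are appended in the order the files were given.
        records.extend(f["records"])
    return {"samples": samples, "records": records}


chr1_file = {
    "samples": ["S1", "S2"],
    "records": [("chr1", 100, ["0/1", "0/0"]), ("chr1", 200, ["1/1", "0/1"])],
}
chr2_file = {
    "samples": ["S1", "S2"],
    "records": [("chr2", 50, ["0/0", "0/1"])],
}

combined = toy_concat([chr1_file, chr2_file])
print(combined["samples"])
for chrom, pos, genotypes in combined["records"]:
    print(chrom, pos, genotypes)
```

Here the sample columns stay exactly as they were (S1, S2), while the number of variant rows grows from two to three. That is the mirror image of merge, which adds sample columns instead of rows.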
If interested, I also have two tutorial blog posts on these two commands you can check here: