This is my take on writing a starter blog post that introduces trimming of NGS data and, more specifically, one of the most commonly used trimming tools, Trimmomatic. It is intended as a starter post that I will expand over time to cover all the critical aspects of trimming NGS data and of Trimmomatic as a tool.
As usual, I provide a short table of contents so you can jump to the sections you are interested in:
What Is Trimming in NGS Bioinformatic Data Analysis?
Trimming refers to removing unwanted or low-quality sequences from high-throughput sequencing data. High-throughput sequencing technologies, such as Illumina's NGS platforms, generate large volumes of short DNA or RNA sequences; these sequences are also known as reads.
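Reads are most often stored in FASTQ format, where each record takes four lines: an identifier, the sequence, a separator, and a per-base quality string. Here is a minimal Python sketch of reading such records (the record itself is made up for illustration):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines.

    Assumes well-formed input: four lines per record, starting with '@id'.
    """
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)            # '+' separator line, ignored here
        qual = next(it)
        yield header[1:].strip(), seq.strip(), qual.strip()

# A made-up single-record FASTQ snippet:
records = """@read1
ACGTACGTACGT
+
IIIIIIIIFFFF""".splitlines()

for rid, seq, qual in parse_fastq(records):
    # quality string must be the same length as the sequence
    print(rid, seq, len(seq) == len(qual))  # read1 ACGTACGTACGT True
```

Real pipelines use dedicated parsers (and gzip-compressed files), but the four-line structure is all you need to follow the rest of this post.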
Now, these reads can contain various types of errors and artifacts that can affect downstream analyses and interpretation of the data.
Trimming helps to remove these regions of low confidence, sequencing artifacts, adapter sequences, and low-quality bases from the reads; this removal process is what we call trimming.
So, from the above, it is evident that the purpose of trimming data is to improve the quality and reliability of the NGS data before further analysis by removing regions of low confidence or sequences that are not biologically relevant.
So, trimming is a critical step in the preprocessing of NGS data. By trimming data, bioinformaticians can obtain cleaner, more accurate, and more reliable sequencing data. This is particularly important for obtaining high-quality results in various downstream bioinformatics applications, such as genome assembly, variant calling, gene expression analysis, and other biological investigations.
Trimming is typically performed as an essential step in the preprocessing pipeline before proceeding with downstream analyses. The specific trimming criteria and parameters used depend on the sequencing technology, the experimental setup, and the quality standards required for the analysis. Moreover, different bioinformatics tools are available to perform read trimming, but more about them in the sections below.
Why Is Trimming Done?
As mentioned, trimming is a crucial step in preprocessing NGS data. The main reasons trimming is performed are the following:
Trimming Is Necessary to Remove Adapters
Adapter trimming, or adapter removal, is probably one of the most important motivations. As part of the library preparation process for sequencing, specific adapter sequences are attached to the ends of DNA or RNA fragments to enable the sequencing reaction. These adapter sequences are not part of the target DNA or RNA being sequenced, but they do end up in the raw NGS data. So, after sequencing is done, those adapter sequences must be removed from the reads before the data is used for further analysis. This removal is necessary to ensure that only the actual, biologically relevant information is processed during mapping or genome assembly.
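To make adapter read-through concrete, here is a deliberately naive Python sketch that clips a read at the first exact adapter match. Real trimmers such as Trimmomatic or Cutadapt also handle mismatches and partial adapter occurrences at the read end, so treat this only as an illustration of the idea:

```python
def clip_adapter(read, adapter):
    """Remove an adapter and everything after it from a read.

    Naive exact-match version; real tools tolerate sequencing errors
    and partial adapter matches at the 3' end.
    """
    idx = read.find(adapter)
    return read[:idx] if idx != -1 else read

# Hypothetical read whose insert is shorter than the read length,
# so sequencing runs into the adapter (the first bases of a common
# Illumina adapter are used here purely as an example):
adapter = "AGATCGGAAGAGC"
read = "ACGTTGCA" + adapter + "TTT"

print(clip_adapter(read, adapter))  # ACGTTGCA
```

If the adapter is absent, the read is returned unchanged, which mirrors how adapter trimmers leave clean reads alone.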
Trimming Improves Overall Quality of NGS Data
Sequencing technologies generate a quality score for each base in a read, usually a Phred score, representing the confidence in the base call. Higher scores correspond to more reliable bases.
Bases with low Phred scores are more likely to be erroneous and can introduce noise into the data. Trimming helps to remove these low-quality bases, improving the overall quality of the data.
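A quick way to build intuition for Phred scores: Q = -10·log10(P), so Q20 means a 1-in-100 chance the base call is wrong and Q30 a 1-in-1000 chance. A small Python sketch, assuming the Phred+33 ASCII encoding used by modern Illumina FASTQ files:

```python
def phred_to_error_prob(q):
    """Phred quality Q -> probability the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) to integer scores."""
    return [ord(c) - offset for c in qual_string]

print(phred_to_error_prob(20))   # 0.01  -> 1 error in 100 calls
print(phred_to_error_prob(30))   # 0.001 -> 1 error in 1000 calls
print(decode_qualities("II##"))  # [40, 40, 2, 2]
```

Note that older Illumina data used a Phred+64 offset, which is why quality-aware tools usually let you specify the encoding.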
Trimming Can Normalize Read Lengths
Sequencing data might contain reads of varying lengths due to the limitations of the sequencing technology or other experimental factors. Setting trimming parameters can help normalize the read lengths by removing the low-quality or ambiguous regions, resulting in more consistent and comparable read lengths.
Trimming Also Helps in Removing Contaminants
Raw sequencing data can sometimes contain contaminating sequences that are not the focus of a given study. Trimming can help to remove these contaminating sequences, ensuring that the subsequent analyses are focused on the relevant biological information. There are multiple types of contaminating sequences in raw NGS data, such as:
Adapter Contamination: As mentioned earlier, adapter sequences used during library preparation can remain in the sequencing data if not properly removed. Adapter contamination can affect read alignment and downstream analyses.
Barcode/Index Contamination: In multiplexed or barcoded sequencing experiments, where different samples are pooled and sequenced together, cross-contamination between samples can occur if barcode/index sequences are not accurately assigned to the correct samples.
PCR Artifacts: Polymerase chain reaction (PCR) is commonly used to amplify DNA or cDNA before NGS library preparation. During PCR, amplification errors or chimeric sequences can lead to artifacts that are not representative of the true biological sample.
Cross-Species Contamination: In metagenomic or environmental sequencing studies, contamination from DNA/RNA of other species, including contaminants from the laboratory environment, reagents, or other samples, can occur.
Host Contamination: In studies involving host-pathogen interactions or microbiome analysis, there might be contamination from the host genome or other organisms residing in the host.
Reference Contamination: If the reference genome or transcriptome used for read alignment contains misassemblies, contaminant sequences, or incorrectly annotated regions, it can introduce biases and errors in the analysis.
Sample Mislabeling: Incorrectly labeled or mislabeled samples during sample preparation or data processing can lead to contamination in the downstream analysis.
Sample Carryover: Contamination can occur when residues of DNA/RNA from previous sequencing runs or samples are inadvertently carried over into subsequent experiments.
Trimming of NGS Data Leads to Error Correction
Some NGS platforms may introduce systematic errors, such as sequencing biases or other artifacts. Trimming can help correct or mitigate these errors by removing problematic regions from the reads.
Sequencing biases are systematic errors or preferences introduced during the process of Next-Generation Sequencing (NGS) that can lead to inaccuracies in the representation of the original DNA or RNA molecules. These biases can impact various stages of NGS data analysis and may affect downstream interpretations. Some of the most common sequencing biases are:
GC Bias: GC content refers to the proportion of guanine (G) and cytosine (C) nucleotides in a DNA or RNA sequence. Some sequencing technologies may exhibit a bias towards regions with specific GC content, leading to overrepresentation or underrepresentation of certain sequences based on their GC content.
PCR Amplification Bias: Polymerase chain reaction (PCR) is used in library preparation to amplify DNA fragments before sequencing. PCR amplification can introduce biases, with certain regions being overamplified, while others may be underrepresented.
Positional Bias: Sequencing platforms may have positional biases, where the accuracy of base calling varies along the length of the read. For example, the accuracy may decrease towards the end of the read.
Adapter Dimers: During library preparation, adapter sequences can ligate together to form adapter dimers. These short, non-target sequences can be overrepresented in the sequencing data.
Homopolymer Errors: Sequencing technologies can struggle to accurately read long stretches of the same nucleotide (homopolymer regions), leading to higher error rates in these regions.
Fragment Length Bias: In paired-end sequencing, the size selection step during library preparation can introduce biases in the distribution of fragment lengths, leading to unequal representation of different fragment sizes.
Contextual Bias: Certain sequence contexts may be difficult for sequencing technologies to accurately read, leading to errors or underrepresentation in specific sequence contexts.
Optical Duplication: During cluster generation on the flow cell, some clusters may be physically close together and result in the same cluster being sequenced multiple times, leading to an overrepresentation of those reads.
Reference Bias: The choice of a reference genome or transcriptome can introduce biases, especially if the reference does not represent the specific organism or sample accurately.
Fragmentation Bias: Fragmenting DNA or RNA during library preparation can introduce biases in the representation of different regions or genes.
It is essential to be aware of these sequencing biases and consider their potential effects during data analysis and interpretation. Several bioinformatics tools and methods are employed to correct or mitigate these biases to obtain more accurate and reliable results from NGS data.
What Is Trimmed vs Truncated?
Not a critical point, but I think it is important to distinguish two terms that are often mixed up: trimming and truncation. In the context of bioinformatics analysis of NGS data, these refer to two different ways of processing raw sequencing reads to remove unwanted or low-quality regions.
Both processes (trimming and truncation) involve removing bases from the ends of the reads, but they have different implications and are used in distinct situations:
Trimmed refers to removing bases from one or both ends of a read when the quality of those bases is below a certain threshold or when they contain unwanted sequences (e.g., adapter sequences). Trimming is typically performed to improve the overall quality and accuracy of the sequencing data by discarding low-quality or irrelevant bases.
Truncated refers to cutting or shortening the length of the read from one or both ends without considering the quality of the bases. Truncation is often performed for read length normalization, ensuring all the reads have the same length by removing bases from the longer reads. Truncation can be helpful when downstream analysis methods or tools require reads of uniform length.
In summary, trimming involves removing bases from the ends of the reads based on their quality or the presence of unwanted sequences, with the goal of improving data quality and accuracy. On the other hand, truncation involves shortening the read length uniformly from one or both ends, often for the purpose of normalization or to meet the requirements of downstream analysis tools. Both trimmed and truncated reads are commonly used in bioinformatics, depending on the specific needs and goals of the analysis.
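The difference can be shown in a few lines of Python; the read and its quality scores below are made up for illustration:

```python
def quality_trim_3prime(seq, quals, min_q=20):
    """Trim: drop bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]

def truncate(seq, length=8):
    """Truncate: cut every read to a fixed length, ignoring quality."""
    return seq[:length]

seq = "ACGTACGTACGT"
quals = [38, 38, 37, 36, 35, 30, 28, 25, 12, 10, 8, 2]  # made-up Phred scores

print(quality_trim_3prime(seq, quals))  # ACGTACGT (4 low-quality bases removed)
print(truncate(seq))                    # ACGTACGT (fixed cut at 8 bases)
```

Here the two happen to give the same result, but on a read with uniformly high quality, trimming would keep all 12 bases while truncation would still cut it to 8.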
What Is Trimmomatic?
Trimmomatic is one of the most popular bioinformatics tools for quality control (QC) and preprocessing of next-generation sequencing (NGS) data.
It is widely used due to its efficiency, flexibility, and ability to work with various sequencing data formats. I don't have exact numbers, but it is among the most popular tools used for QC of NGS data. You can check the sections below for alternatives to Trimmomatic.
In essence, Trimmomatic’s main functionality is to remove low-quality regions and sequencing artifacts from raw NGS reads, ensuring that only high-quality, reliable data is used for the downstream analysis.
Trimmomatic is a command-line tool developed in Java. Trimmomatic can be downloaded from the following link, but I cover the Trimmomatic installation in the sections below.
How Do You Install Trimmomatic?
I will cover here the installation of Trimmomatic on macOS. To repeat a critical point: Trimmomatic is developed in Java and used as a command-line tool, which makes it platform-independent. So, Trimmomatic will work on any operating system that supports Java.
How to Install Java Runtime Environment on Your System?
For instructions on how to download and install Java on your system, check the following link. An important note is that all the instructions below assume you have already installed Java on your system.
How to Install Trimmomatic on macOS?
Here is a set of instructions for installing Trimmomatic on macOS. You can do it using conda or by downloading the Trimmomatic JAR file and running it directly. I will give short walkthroughs for both options.
How to Install Trimmomatic on macOS Using conda?
Step 1: Install conda if not already installed. You can check detailed instructions for installing conda on macOS here.
Step 2: Once you install conda, you can use it to install Trimmomatic like this:
conda install -c bioconda trimmomatic
Step 3: If you installed Trimmomatic like this, you can use Trimmomatic by simply typing trimmomatic in Terminal.
For detailed instructions on how to install Trimmomatic using Conda, check this link.
How to Install Trimmomatic on macOS Using Trimmomatic Binary?
Step 1: Download the Trimmomatic binary version for macOS from here.
Step 2: Extract the downloaded archive to get the Trimmomatic JAR file.
Step 3: You can start using it by typing the following command in the command line:
java -jar trimmomatic-0.39.jar
Running this without further arguments prints Trimmomatic's usage message, which also confirms that the installation works.
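To illustrate the kind of per-read logic Trimmomatic applies, here is a rough Python sketch of a SLIDINGWINDOW:4:15-style quality trim: scan from the 5' end and cut the read where the average quality inside a 4-base window first drops below 15. This is an illustration of the idea, not Trimmomatic's actual implementation:

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15):
    """Cut the read at the start of the first window whose average
    quality drops below min_avg_q (sketch of a SLIDINGWINDOW-style step)."""
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg_q:
            return seq[:i], quals[:i]
    return seq, quals

seq = "ACGTACGTACGT"
quals = [30, 30, 30, 30, 30, 30, 30, 30, 5, 5, 5, 5]  # made-up scores

trimmed_seq, _ = sliding_window_trim(seq, quals)
print(trimmed_seq)  # ACGTACG - the first window averaging below 15 starts here
```

A real Trimmomatic run chains several such steps (e.g., ILLUMINACLIP for adapters, SLIDINGWINDOW for quality, MINLEN to drop reads that end up too short) in the order you specify them on the command line.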
What Are Alternatives to Trimmomatic?
There are other tools that can be used for trimming NGS data, and these can be considered the key alternatives to Trimmomatic:
Cutadapt: A fast and flexible tool specifically designed for adapter removal and quality trimming of Illumina and other NGS reads. Cutadapt can handle both single-end and paired-end data and supports various types of adapter sequences. It is particularly efficient in removing adapter contamination and is widely used for this purpose.
BBDuk: Part of the BBMap suite, BBDuk is a robust tool for quality filtering, adapter trimming, and error correction of NGS reads.
Sickle: A fast and efficient tool designed for quality-based trimming of reads. It focuses on removing low-quality bases from the ends of reads using a sliding-window approach, and it supports both single-end and paired-end data.
Trinity: Primarily known as a de novo transcriptome assembler, Trinity also includes built-in trimming functionality for RNA-seq data that is in essence using Trimmomatic.
PRINSEQ: A flexible tool that offers multiple options for quality filtering, trimming, and statistical summaries of NGS data.
Skewer: An adapter trimmer for paired-end NGS reads, known for its speed and efficiency. Skewer is a versatile adapter and quality trimming tool that aims to identify and remove sequencing adapters efficiently. It is particularly useful for handling data generated from Illumina platforms.
PEAR: PEAR stands for Paired-End reAd mergeR; it was mainly designed for merging overlapping paired-end reads, but it can also perform quality trimming.
SeqPurge: A tool that combines read trimming with adapter removal and can handle various sequencing platforms.
Fastp: A fast and user-friendly tool that performs quality control, adapter trimming, and read filtering. It generates comprehensive HTML reports with QC metrics, providing an easy way to assess the quality of the data. Fastp is known for its speed and memory efficiency, making it a popular choice for large datasets.
AdapterRemoval: As the name suggests, AdapterRemoval is specifically designed for adapter trimming and quality filtering. It can work with both single-end and paired-end data and includes features like support for mismatched adapters and adapter merging.
Needless to say, the selection of a tool will depend on the specific needs of your analysis, the type of sequencing data, the nature of the contaminants or artifacts present in the data, and whether you prefer a CLI or GUI solution, among other parameters.