Frequently Asked Questions
Answers to the most frequently asked questions
Our specialists for next-generation sequencing (NGS) continuously strive to offer you individual solutions for every clinical or research question. Further, the dedicated RPS project management team and bioinformatic division work closely with you to develop the best strategy for realizing your projects.
However, we recognize that sometimes a quick question benefits from a short answer. Therefore, we have compiled a set of frequently asked sequencing-technology-related questions to give you more information about technical and bioinformatics terms. If you can’t find the answer to your question, don’t hesitate to get in touch with us.
Technical Aspects
Next-generation sequencing (NGS) is a massively parallel DNA sequencing method that determines the nucleic acid sequence of a given DNA or cDNA template at high throughput. As a result, it is considerably more cost- and time-efficient than the original Sanger sequencing.
Third-generation sequencing (TGS) is a newer class of sequencing methods that enables the sequencing of nucleic acids (DNA or RNA) without the need for PCR amplification. The so-called single-molecule real-time (SMRT) technology works by reading the nucleotide sequence at the single-molecule level, allowing much longer read lengths than current NGS methods. Furthermore, SMRT technology enables the analysis of data in real time.
A flow cell serves as a crucial reaction chamber inside the sequencer. It is a thick glass slide with several lanes, which contain nanowells. Millions of oligos, which are complementary to the adapters of a library, are randomly attached to the nanowells of the glass slide. The flow cell is inserted into the sequencer, and in each nanowell a single DNA library fragment is amplified to generate thousands of copies that form a cluster. The cluster density is crucial for the success of a sequencing run. The sequencer uses special sequencing chemistry to detect these clusters as fluorescent spots and transform the signals into digital data.
The read length describes the number of base pairs (bp) sequenced from a DNA template. A DNA fragment used for sequencing has adapters on each end, and the DNA fragments can be sequenced from both sides with a distinct read length (usually 100 to 250 bp). Depending on the size of the DNA fragments, these reads can overlap or be separated by a DNA stretch, which is not sequenced.
With paired-end sequencing, the library fragments are sequenced from one end for the indicated read length and then from the opposite end. Paired-end reads improve the ability to identify the relative position of the reads in the genome and thereby significantly enhance the specificity of the analysis. Paired-end sequencing is usually denoted as, e.g., 2 x 100 bp or 2 x 150 bp, or as PE100 or PE150. In single-end sequencing (e.g., 1 x 100 bp), the sequencer only analyzes the DNA fragment from one side.
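As a rough illustration of how read length and fragment size interact in paired-end sequencing, the following sketch computes whether the two reads of a pair overlap or leave an unsequenced stretch between them. The function name and the returned keys are hypothetical and chosen for this example only; adapters are ignored and the fragment length refers to the insert alone.

```python
def paired_end_layout(fragment_len: int, read_len: int) -> dict:
    """Describe how two reads of length read_len cover one fragment.

    Assumption: fragment_len is the insert length without adapters.
    """
    # A read cannot cover more bases than the fragment contains.
    effective = min(read_len, fragment_len)
    # Positive: bases seen by both reads. Negative: unsequenced gap.
    overlap = 2 * effective - fragment_len
    return {
        "overlap_bp": max(overlap, 0),
        "gap_bp": max(-overlap, 0),
        "bases_covered": min(fragment_len, 2 * effective),
    }


if __name__ == "__main__":
    # 2 x 150 bp on a 400 bp fragment leaves the middle unsequenced.
    print(paired_end_layout(400, 150))
    # 2 x 150 bp on a 250 bp fragment produces overlapping reads.
    print(paired_end_layout(250, 150))
```

With 2 x 150 bp reads, a 400 bp fragment leaves a 100 bp gap, while a 250 bp fragment yields reads that overlap by 50 bp.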
The turnaround time is the general time estimation for our RPS products. This general estimation accounts for the standard analysis and standard processing steps without any extras. Additionally, certain conditions (such as sample size or sample quality) apply.
In contrast to the general estimation of the turnaround time, the processing time is a realistic estimation of your project, including possible additional steps and analyses. It accounts for tailored project conditions and enables capacity planning.
Bioinformatical Aspects
FASTQ data contain the reads from all clusters on a flow cell that have passed the filter. They are derived from the signals/intensities coming from each cluster.
The raw data generated by Illumina sequencers, such as the NovaSeq™ 6000, are stored in a binary base call format (.bcl). For further analysis, they need to be converted into FASTQ format. The FASTQ format (.fastq.gz) is also commonly used to store NGS data from Illumina sequencing systems and can be easily used for downstream analyses. If paired-end sequencing has been performed, you receive two files for each sample, corresponding to forward and reverse reads. FASTQ data are always included in our standard sequencing products.
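To show what a FASTQ file looks like to downstream software, here is a minimal reader, assuming the common layout of four lines per record (header starting with "@", sequence, separator starting with "+", quality string). The function names and the file path are hypothetical; established parsers handle more edge cases than this sketch does.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            next(it)  # separator line starting with '+'
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("Truncated FASTQ record") from None
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def count_reads(path: str) -> int:
    """Count the records in a gzip-compressed FASTQ file (.fastq.gz)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_fastq(handle))
```

For a paired-end sample, calling the hypothetical count_reads on the forward and the reverse file should normally return the same number.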
Unique molecular identifiers (UMIs) are short indices used to uniquely tag each molecule within a sequencing library. UMIs consist of random sequences, so that each fragment-UMI combination in the library is unique. These molecular barcodes are added to a sequencing library before PCR amplification. Therefore, UMIs enable the accurate quantification of the original nucleic acids with bioinformatics software by removing duplicate reads and PCR errors. Distinguishing PCR duplicates from real biological duplicates results in improved data quality and increased variant detection sensitivity.
Do not mix up UMIs and UDIs (unique dual indexes). UDIs allow the assignment of reads with the same barcodes to a specific sample after pooling (see “What is multiplexing and demultiplexing”) and have to be used in each library preparation. When UMIs are available for a special library preparation, they can be used in addition to the UDIs. Combining both UMIs and UDIs can improve data analysis accuracy.
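The idea behind UMI-based duplicate removal can be sketched in a few lines. The example assumes that each aligned read is reduced to a tuple of chromosome, alignment start and UMI sequence, and that PCR duplicates match exactly in all three; the type alias and function name are hypothetical. Dedicated tools additionally tolerate sequencing errors within the UMI, which this simplification does not.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed read representation: (chromosome, alignment start, UMI sequence)
Read = Tuple[str, int, str]


def count_unique_molecules(reads: Iterable[Read]) -> Counter:
    """Count original molecules per position after collapsing PCR duplicates."""
    # Reads sharing chromosome, position and UMI are treated as copies
    # of one original molecule and collapsed into a single entry.
    unique = set(reads)
    return Counter((chrom, pos) for chrom, pos, _umi in unique)


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "AACG"),
        ("chr1", 100, "AACG"),  # PCR duplicate of the first read
        ("chr1", 100, "TTGC"),  # same position, different molecule
        ("chr2", 5, "AACG"),
    ]
    print(count_unique_molecules(reads))
```

In this example, the three reads at chr1:100 collapse to two original molecules, because two of them carry the same UMI.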
Sequence alignment is the base-by-base arrangement of a read against another sequence. To find out which genomic region the reads correspond to, they are often mapped to a reference genome. The data containing information about the DNA sequence and the corresponding genomic location are given in a binary alignment/mapping file (.bam).
Coverage indicates the average number of unique reads that align to, and thus cover, a given region of a reconstructed or reference sequence. The higher the coverage, the more reliable the detection of a variant at a distinct position. Therefore, deep sequencing aims for high coverage in a certain genetic region.
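As a small worked example of coverage, the sketch below counts how many aligned reads span each position of a region and averages the result. It assumes alignments are given as 0-based, half-open (start, end) intervals relative to the region; both function names are hypothetical, and real pipelines usually derive these numbers from the BAM file with dedicated tools.

```python
from typing import Iterable, List, Tuple


def depth_per_base(alignments: Iterable[Tuple[int, int]], region_len: int) -> List[int]:
    """Return the number of reads covering each position of the region.

    Assumption: alignments are 0-based, half-open (start, end) intervals.
    """
    depth = [0] * region_len
    for start, end in alignments:
        # Clip each alignment to the region boundaries before counting.
        for pos in range(max(start, 0), min(end, region_len)):
            depth[pos] += 1
    return depth


def mean_coverage(depth: List[int]) -> float:
    """Average depth over all positions of the region."""
    return sum(depth) / len(depth)


if __name__ == "__main__":
    depth = depth_per_base([(0, 4), (2, 6)], region_len=6)
    print(depth, mean_coverage(depth))
```

Two 4 bp reads on a 6 bp region cover the two middle positions twice and the remaining positions once.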
In case there is no existing reference genome, a de novo assembly is performed. By merging the overlapping reads, longer DNA sequences (contigs) or even the whole original genome can be reconstructed.
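The merging step at the heart of a de novo assembly can be illustrated with two reads. This is a toy sketch that joins reads only on exact suffix-prefix overlaps; the function name and the minimum overlap are assumptions, and real assemblers use far more elaborate graph-based methods that also cope with sequencing errors and repeats.

```python
from typing import Optional


def merge_overlapping(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads if a suffix of left equals a prefix of right.

    Returns the merged sequence, or None if no exact overlap of at
    least min_overlap bases exists.
    """
    # Try the longest possible overlap first, then shorter ones.
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


if __name__ == "__main__":
    # "TAC" ends the first read and starts the second one.
    print(merge_overlapping("ACGTAC", "TACGGA"))
```

Repeating such merges over many reads yields progressively longer contigs.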
Variant calling is a technique to identify sequence variants by comparing the sequencing data to a reference genome. The results are given in variant call format (.vcf) and provide information about the positions in which the sample differs from the reference genome. Our customers receive two VCF files for each sample: a list of point mutations (SNVs) and a list of small insertions and deletions (indels). They can be opened with a standard text editor.
Annotation is the assignment of information available in several databases to the identified variants. Annotation provides details such as possible functions of variants or disease-causing variants. A VCF file usually accompanies annotation files. Annotation files are given in a tab-separated file (.tsv) and can be viewed with MS Excel as long as the file does not exceed Excel's limit of 1,048,576 rows (about 1.05 million lines).
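Because VCF files are tab-separated text, the distinction between SNVs and indels can be sketched directly from the REF and ALT columns (the fourth and fifth columns of a data line). The function name and the category labels are hypothetical, and the sketch deliberately ignores symbolic alleles such as "<DEL>", which a complete parser would need to handle.

```python
def classify_variant(vcf_line: str) -> str:
    """Classify one VCF data line as 'SNV', 'indel' or 'other'.

    Assumption: the line is a data line (not a '#' header line) with
    plain nucleotide alleles in the REF and ALT columns.
    """
    fields = vcf_line.rstrip("\r\n").split("\t")
    ref = fields[3]
    alts = fields[4].split(",")  # several ALT alleles are comma-separated
    if len(ref) == 1 and all(len(alt) == 1 for alt in alts):
        return "SNV"  # single base replaced by a single base
    if any(len(alt) != len(ref) for alt in alts):
        return "indel"  # allele lengths differ: insertion or deletion
    return "other"  # e.g. a multi-nucleotide substitution


if __name__ == "__main__":
    print(classify_variant("chr1\t100\t.\tA\tG\t50\tPASS\t."))
    print(classify_variant("chr1\t200\t.\tAT\tA\t50\tPASS\t."))
```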
With each data set, we send an MD5 checksum (md5sum; MD5 is a message-digest algorithm). It allows our customers to verify that the data set was transferred completely and unaltered. If the content of a file changes, its MD5 checksum changes as well.
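One way to verify a received file is to recompute its MD5 checksum locally and compare it with the value that was sent. The sketch below uses Python's standard hashlib module; the function names and the file name in the usage note are hypothetical. The file is read in chunks so that large FASTQ or BAM files do not have to fit into memory.

```python
import hashlib
from typing import Iterable


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def matches(path: str, expected_md5: str) -> bool:
    """Compare a file's checksum with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches("sample_R1.fastq.gz", expected) returns True only if the local file is byte-for-byte identical to the file for which the checksum was generated.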
Base calling is a digital assignment of the fluorescent signals emitted during sequencing to the corresponding nucleotides. The images of the signals are then processed to infer the order of the nucleotides.
Multiplexing allows the simultaneous sequencing of pooled samples during a single sequencing run. The individual samples can be differentiated from each other by using specific barcode sequences (index adapters). These barcodes are attached to both ends of the original DNA fragment during the library preparation. The combination of unique forward and reverse index sequences allows the unambiguous assignment of reads to a specific sample. Furthermore, these adapters enable an orientation-specific hybridization of the complete fragment to the flow cell. Multiplexing of samples permits higher throughput of many samples simultaneously and thus a reduction of sequencing costs.
Demultiplexing is the separation of multiplexed samples: using the indices, each read is assigned to its original sample.
The raw reads coming from the sequencer need to be processed before they can be used for further analyses. As part of this processing, adapter sequences are usually removed from the sequencing reads.
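The assignment of reads to samples can be sketched as a simple lookup. The example assumes that each read comes with its two index sequences and that a sample sheet maps every index pair to a sample name; all names and index sequences are hypothetical. Only exact index matches are assigned here, whereas production software typically tolerates a limited number of mismatches.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed read representation: (read_id, first index, second index)
IndexedRead = Tuple[str, str, str]


def demultiplex(
    reads: Iterable[IndexedRead],
    sample_sheet: Dict[Tuple[str, str], str],
) -> Dict[str, List[str]]:
    """Group read IDs by sample using the index pair of each read."""
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read_id, index1, index2 in reads:
        # Unknown index combinations are collected separately.
        sample = sample_sheet.get((index1, index2), "undetermined")
        by_sample[sample].append(read_id)
    return dict(by_sample)


if __name__ == "__main__":
    sheet = {("ACGTAC", "TTGCAA"): "sample_A", ("GGATCC", "CCTAGG"): "sample_B"}
    reads = [
        ("r1", "ACGTAC", "TTGCAA"),
        ("r2", "GGATCC", "CCTAGG"),
        ("r3", "ACGTAC", "CCTAGG"),  # index pair not on the sample sheet
    ]
    print(demultiplex(reads, sheet))
```

Because both indices must match, read r3 is not assigned to either sample, which illustrates why unique index pairs allow an unambiguous assignment.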
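Adapter removal can be illustrated with a deliberately simple sketch: it cuts the read at the first exact occurrence of the adapter, or at a partial adapter that runs off the 3' end of the read. The function name, the minimum overlap and the adapter sequence used in the example are assumptions for illustration; dedicated trimming tools additionally allow mismatches and use base qualities.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the complete adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with the first bases of the adapter.
    max_overlap = min(len(adapter) - 1, len(read))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if overlap > 0 and read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example sequence, assumed for illustration
    print(trim_adapter("ACGTACGTAGATCGGAAGAGC", adapter))
    print(trim_adapter("ACGTACGTAGATC", adapter))
```

Both example reads are trimmed back to the same 8 bp insert, once from a complete adapter and once from a partial one.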
The Phred score is a quality parameter that is generated during sequencing and depicts the accuracy of base identification. It expresses very small error probabilities on a compact scale and can be used to characterize sequence quality and thus to compare the efficacy of different sequencing methods. The score is logarithmically related to the error probability P of base calling: Q = -10 log10 P. A high Q score indicates a more reliable base call. A Phred score of 30 means the probability that the base was called incorrectly is 1 in 1000. The Phred quality score is given with the Illumina standard Phred encoding (offset +33).
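The relationship Q = -10 log10 P and the offset +33 encoding can be made concrete with a few helper functions. The function names are hypothetical; the decoding assumes that each character of a FASTQ quality string stores the score as its ASCII code minus 33.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into the Phred score Q = -10 log10 P."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (offset +33)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(decode_quality("I?5!"))
```

For instance, the characters "I", "?", "5" and "!" decode to the scores 40, 30, 20 and 0, which correspond to error probabilities of 1 in 10,000, 1 in 1000, 1 in 100 and 1 in 1.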
Contact Us
Do you have a question or are you interested in our service? Feel free to contact us. We will take care of your request as soon as possible.
Start Your Project with Us
We are happy to discuss sequencing options and to find a solution specifically tailored to your clinical study or research project.
When getting in contact, please specify sample information including starting material, number of samples, preferred library preparation option, preferred sequencing depth and required bioinformatic analysis level, if possible.