This software is intended to call fusion and exon-skipping events from raw sequencing data generated from human samples using the Twist Alliance CeGaT RNA Fusion Panel and sequenced on Illumina HiSeq/NovaSeq systems. Note that this software is not intended for installation on personal computers, and does not offer a graphical user interface. It is intended to be run as part of bioinformatics pipelines on compute infrastructure and its output files will need to be post-processed for interpretation by non-technical users.
If you are looking for cancer diagnostics as a service, please visit our diagnostic pages here. If you are interested in sequencing services, please have a look here.
Fusion detection involves multiple steps that are performed automatically by our software package. First, reads are aligned to the human reference genome hg19 using the STAR aligner (Dobin et al., Bioinformatics 2013). After the first mapping pass, newly identified splice sites are integrated into the reference and a second mapping pass is performed. The resulting read alignments are analyzed by STAR-Fusion (Haas et al., Genome Biol 2019) to detect fusion transcripts. We post-process the STAR-Fusion output to correct for varying levels of read duplication, filter fusion calls to the intended panel target regions, and scan for transcript variants (e.g., MET exon 14 skipping) that STAR-Fusion does not detect.
How to Use the Software
We distribute the software as a containerized application. To run it, you need an OCI-compatible runtime environment, such as Docker.
To set up Docker, please refer to the online documentation here.
- Download the container image from here. The download is 31 GB in size.
- Install the image by calling: docker load < cegat_starfusion.dockerimage
- Download the wrapper script.
The process of generating sequencing data should follow roughly these steps:
- RNA extraction from a suitable sample
- Sequencing library preparation using a strand-specific protocol (e.g. using the Twist RNA library prep)
- Target Enrichment using the Twist workflow and the Twist Alliance CeGaT RNA Fusion Panel
- Sequencing on an Illumina instrument generating paired-end reads of at least 100+100 bp
Running the Application
The input files to the application are fastq files generated by the sequencer. You will need two files per sample, one for the forward reads and one for the reverse reads. These must be in fastq.gz format. Adapter trimming should be performed before running the fusion detection.
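Before starting a run, it can help to confirm that the two input files really form a pair. The following is a minimal sketch, not part of the distributed software: it assumes standard four-line FASTQ records, and the function names `count_fastq_records` and `check_fastq_pair` are hypothetical helpers introduced here for illustration.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzip-compressed FASTQ file (assumes 4 lines per record)."""
    lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4


def check_fastq_pair(forward, reverse):
    """Return the shared read count, or raise if the pair looks inconsistent."""
    for path in (forward, reverse):
        if not str(path).endswith(".fastq.gz"):
            raise ValueError(f"{path}: expected a .fastq.gz file")
    n_forward = count_fastq_records(forward)
    n_reverse = count_fastq_records(reverse)
    if n_forward != n_reverse:
        raise ValueError(
            f"read count mismatch: {n_forward} forward vs {n_reverse} reverse"
        )
    return n_forward
```

Note that counting reads decompresses both files completely, which may take several minutes for large samples.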
- Prepare an empty output folder for your results, let’s say /path/results
- Select your sequencing data files, say /path/mysample.1.fastq.gz and /path/mysample.2.fastq.gz
- Decide on how many CPU cores the analysis should use. If you are unsure, you can call the command: nproc
For this example, let’s assume the command returns “64” and you want to use all available cores.
- Start the process: bash wrapper.sh /path/mysample.1.fastq.gz /path/mysample.2.fastq.gz /path/results 64
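When the wrapper is called from a pipeline rather than by hand, the steps above can be sketched roughly as follows. This is an illustrative sketch only: it assumes the wrapper script is saved as wrapper.sh in the working directory and takes its arguments in the order shown in the example command (forward reads, reverse reads, output folder, core count). The function names are hypothetical, and `os.cpu_count()` is used as an approximation of what `nproc` reports.

```python
import os
import subprocess


def build_wrapper_command(forward, reverse, outdir, cores=None, wrapper="wrapper.sh"):
    """Assemble the wrapper call in the argument order of the example command."""
    if cores is None:
        # Approximation of `nproc`; may differ if CPU affinity is restricted.
        cores = os.cpu_count() or 1
    return ["bash", wrapper, str(forward), str(reverse), str(outdir), str(cores)]


def run_sample(forward, reverse, outdir, cores=None):
    """Ensure an empty output folder exists, then run the wrapper."""
    os.makedirs(outdir, exist_ok=True)
    if os.listdir(outdir):
        raise RuntimeError(f"{outdir} is not empty")
    cmd = build_wrapper_command(forward, reverse, outdir, cores)
    # check=True raises CalledProcessError if the wrapper exits with an error.
    subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to log or inspect the exact call before a long run starts.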
All tabular output files are tab-separated text files. All coordinates are given on the hg19 reference genome.
The main output file is combined_results.tsv
It is a two-column tab-separated file:
(1) Event Name, e.g. ALK--EML4 or EGFR_vIII
(2) FFPM, the number of fragments per million supporting this event: the count of reads supporting the event, normalized by the total number of unique molecules that were sequenced.
Note: Reporting is limited to the transcript variants listed below, and to fusions involving one of the genes listed.
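For post-processing, the two-column layout of combined_results.tsv can be read with a few lines of Python. The sketch below is an assumption-laden example, not an official parser: it does not know whether the file starts with a header line, so it simply skips any row whose second column is not numeric. The `min_ffpm` cutoff is a hypothetical parameter; no recommended threshold is given here.

```python
import csv


def read_combined_results(path, min_ffpm=0.0):
    """Return {event name: FFPM} for events with FFPM >= min_ffpm."""
    events = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            name, value = row[0], row[1]
            try:
                ffpm = float(value)
            except ValueError:
                continue  # assumed header line or non-numeric entry
            if ffpm >= min_ffpm:
                events[name] = ffpm
    return events
```

A call such as `read_combined_results("/path/results/combined_results.tsv")` would return every reported event with its FFPM value.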
Additional output files are:
A tabular listing of all detected fusions. This file is produced by STAR-Fusion and is described in detail on the STAR-Fusion wiki. The most important columns are:
(1) FusionName (The detected fusion, e.g. GNB4--ETV1) and
(9) FFPM (The number of fragments per million supporting this fusion)
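The two columns highlighted above can be extracted by header name rather than by position, which is more robust if the column order differs between versions. This sketch assumes the file begins with a single header line whose first field is `#FusionName` or `FusionName` and which contains a column named `FFPM`; the function name is hypothetical.

```python
import csv


def read_fusion_table(path):
    """Return a list of (FusionName, FFPM) tuples from a STAR-Fusion style TSV."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Strip the leading '#' that marks the header line, if present.
        header = [col.lstrip("#") for col in next(reader)]
        name_idx = header.index("FusionName")
        ffpm_idx = header.index("FFPM")
        return [(row[name_idx], float(row[ffpm_idx])) for row in reader if row]
```

If either column is missing, `header.index` raises a `ValueError`, which signals that the file does not have the assumed layout.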
A tabular listing of all detected intra-gene (exon-skipping) events. This file has 6 columns:
(1) Fusion Name, e.g. EGFR_VIII
(2) HGNC symbol (gene name) of the affected gene
(3)-(5) Genomic location of the skipping event, with respect to the hg19 reference genome
(6) FFPM, the number of fragments per million supporting this event
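The six-column exon-skipping table can be loaded into typed records in a similar way. This is only a sketch under stated assumptions: columns (3) to (5) are taken to be chromosome, start, and end, which the description above does not spell out, and rows that do not parse (such as a possible header line) are skipped. `SkippingEvent` and `read_skipping_events` are hypothetical names.

```python
import csv
from typing import NamedTuple


class SkippingEvent(NamedTuple):
    name: str
    gene: str    # HGNC symbol
    chrom: str   # assumed meaning of column 3
    start: int   # assumed meaning of column 4 (hg19)
    end: int     # assumed meaning of column 5 (hg19)
    ffpm: float


def read_skipping_events(path):
    """Parse the 6-column exon-skipping table into SkippingEvent records."""
    events = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 6:
                continue  # skip blank or malformed lines
            try:
                event = SkippingEvent(
                    row[0], row[1], row[2], int(row[3]), int(row[4]), float(row[5])
                )
            except ValueError:
                continue  # assumed header line or non-numeric fields
            events.append(event)
    return events
```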
- all_reads.bam (+bai)
An alignment of all sequenced reads to the hg19 reference genome
- fusions_evidence_mapped.bam (+bai)
Alignments of only the reads supporting fusion events
A self-contained website with visualizations of the detected fusions
Intermediate files for troubleshooting can be found in the subfolder intermediate/. You can safely delete this folder once the analysis has completed.
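In an automated pipeline, the cleanup can be guarded so that troubleshooting files are kept when a run did not finish. The sketch below treats the presence of combined_results.tsv as a sign of completion; that is an assumption made for illustration, not a documented guarantee, and `remove_intermediate` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def remove_intermediate(outdir, marker="combined_results.tsv"):
    """Delete outdir/intermediate only if the main result file exists.

    Returns True if the folder was deleted, False otherwise.
    """
    outdir = Path(outdir)
    intermediate = outdir / "intermediate"
    if not (outdir / marker).is_file():
        return False  # run may have failed; keep files for troubleshooting
    if intermediate.is_dir():
        shutil.rmtree(intermediate)
        return True
    return False
```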
This software is provided as is, and for research use only. Not for use in diagnostic procedures.
CeGaT GmbH makes no representations and extends no warranties of any kind, either express or implied, including warranties of merchantability, fitness for a particular purpose, non-infringement, design, output, throughput, and the absence of latent or other defects whether or not discoverable.
Feel free to contact us if you have any questions or need further support.