Fusions

CeGaT Fusion Detection Software Package

Intended Use

This software is intended to call fusion and exon-skipping events from raw sequencing data generated from human samples using the Twist Alliance CeGaT RNA Fusion Panel and sequenced on Illumina HiSeq/NovaSeq™ systems. Note that this software is not intended for installation on personal computers, and does not offer a graphical user interface. It is intended to be run as part of bioinformatics pipelines on compute infrastructure, and its output files will need to be post-processed for interpretation by non-technical users.

If you are looking for cancer diagnostics as a service, please visit our diagnostic pages here. If you are interested in sequencing services, please have a look here.

Background

Fusion detection involves multiple steps that our software package performs automatically. First, reads are aligned to the human reference genome hg19 using the STAR aligner (Dobin et al., Bioinformatics 2013). After the first mapping pass, newly identified splice sites are integrated into the reference and a second mapping pass is performed. The resulting read alignments are analyzed by STARfusion (Haas et al., Genome Biol 2019) to detect fusion transcripts. We then post-process the STARfusion output to correct for varying levels of read duplication, filter fusion calls to the intended panel target regions, and scan for transcript variants (e.g., MET exon 14 skipping), which STARfusion does not detect.

How to Use the Software

Installation

Sequencing Data

Running the Application

Result Files

Installation

We distribute the software as a containerized application. To run it, you need an OCI-compatible runtime environment, such as Docker.

To set up Docker, please refer to the online documentation here.

  • Download the container image from here. The download is 31 GB in size.
  • Install the image by calling: docker load < cegat_starfusion.dockerimage
  • Download the wrapper script.

Sequencing Data

The process of generating sequencing data should follow roughly these steps:

  • RNA extraction from a suitable sample
  • Sequencing library preparation using a strand-specific protocol (e.g. using the Twist RNA library prep)
  • Target enrichment using the Twist workflow and the Twist Alliance CeGaT RNA Fusion Panel
  • Sequencing on an Illumina instrument generating paired-end reads of at least 100+100 bp

Running the Application

The input files to the application are the FASTQ files generated by the sequencer. You will need two files per sample, one for the forward reads and one for the reverse reads. Both must be gzip-compressed (fastq.gz format). Adapter trimming should be performed before running the fusion detection.

  • Prepare an empty output folder for your results, let’s say /path/results
  • Select your sequencing data files, say /path/mysample.1.fastq.gz and /path/mysample.2.fastq.gz
  • Decide how many CPU cores the analysis should use. If you are unsure how many cores are available, run the command: nproc
    For this example, let’s assume the command returns “64” and you want to use all available cores.
  • Start the process: bash wrapper.sh /path/mysample.1.fastq.gz /path/mysample.2.fastq.gz /path/results 64
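
When the tool is called from a pipeline, the steps above can be scripted. The following is a minimal sketch, assuming the wrapper script is saved as wrapper.sh in the working directory and takes the four positional arguments shown above (forward reads, reverse reads, output folder, number of cores). The helper names build_wrapper_command and run_sample are hypothetical and not part of the software package.

```python
import os
import subprocess
from pathlib import Path


def build_wrapper_command(fastq_r1, fastq_r2, output_dir, threads=None,
                          wrapper="wrapper.sh"):
    """Assemble the wrapper call: forward reads, reverse reads, output folder, cores."""
    for fastq in (fastq_r1, fastq_r2):
        # The application expects gzip-compressed FASTQ input.
        if not str(fastq).endswith(".fastq.gz"):
            raise ValueError(f"expected a .fastq.gz file, got: {fastq}")
    if threads is None:
        # Fall back to the core count reported by Python (roughly what nproc prints).
        threads = os.cpu_count() or 1
    return ["bash", str(wrapper), str(fastq_r1), str(fastq_r2),
            str(output_dir), str(threads)]


def run_sample(fastq_r1, fastq_r2, output_dir, threads=None):
    """Prepare an empty output folder and start the analysis."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    if any(out.iterdir()):
        raise RuntimeError(f"output folder is not empty: {out}")
    cmd = build_wrapper_command(fastq_r1, fastq_r2, out, threads)
    # check=True raises CalledProcessError if the wrapper exits with an error.
    subprocess.run(cmd, check=True)


# Show the command for the example paths used above without running it.
print(" ".join(build_wrapper_command("/path/mysample.1.fastq.gz",
                                     "/path/mysample.2.fastq.gz",
                                     "/path/results", threads=64)))
```

Calling run_sample with the same arguments would create the output folder and start the analysis. Keeping command assembly separate from execution makes it possible to log or review the command before it is run.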

Result Files

All tabular output files (.tsv) are tab-separated text files. All coordinates are given on the hg19 reference genome.

The main output file is combined_results.tsv

It is a two-column tab-separated file:
(1) Event Name, e.g. ALK–EML4 or EGFR_vIII
(2) FFPM, the number of fragments per million supporting this event. FFPM is the number of reads supporting the event, normalized by the total number of unique molecules that were sequenced.
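
As a worked example of the normalization described above, the sketch below computes an FFPM value from two counts. It assumes that "per million" means scaling the ratio of supporting reads to unique molecules by one million. The function name and the numbers are illustrative only and are not taken from the software.

```python
def ffpm(supporting_reads, total_unique_molecules):
    """Scale a supporting-read count to one million sequenced unique molecules."""
    if total_unique_molecules <= 0:
        raise ValueError("total_unique_molecules must be positive")
    # Multiply first so that simple integer inputs give an exact result.
    return supporting_reads * 1_000_000 / total_unique_molecules


# Illustrative numbers: 25 supporting reads among 5 million unique molecules.
print(ffpm(25, 5_000_000))  # 5.0
```

Because the value is normalized to the number of unique molecules, samples sequenced to different depths or with different duplication levels can be compared on the same scale.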

Note: Reporting is limited to the transcript variants listed below, and to fusions involving one of the genes listed.

ABL1, ACTB, AFAP1, AGK, AKAP4, AKAP9, AKAP12, AKT1, AKT2, AKT3, ALK, ARHGAP6, ARHGAP26, ASPL, ASPSCR1, ATF1, ATP1B1, ATRX, AVIL, AXL, BAG4, BCL2, BCOR, BCORL1, BCR, BEND2, BICC1, BRAF, BRD3, BRD4, c11orf95, CAMTA1, CCAR2, CCDC6, CCDC88A, CCDC170, CCNB3, CCND1, CD44, CD74, CEP85L, CIC, CLDN18, CLIP1, CLTC, CNTRL, COL1A1, CREB1, CREB3L1, CREB3L2, CRTC1, CTNNB1, DDIT3, DNAJB1, EGFR, EML4, EPC1, EPCAM, ERBB2, ERBB4, ERG, ESR1, ESRRA, ETV1, ETV4, ETV5, ETV6, EWSR1, EZR, FEV, FGFR1, FGFR2, FGFR3, FLI1, FN1, FOXO1, FOXO4, FOXR2, FUS, GLI1, GOPC, GPR128, HEY1, HMGA2, HTRA1, IGF1R, INSR, JAK2, JAZF1, KIAA1549, KIF5B, KIT, LEUTX, LMNA, LPP, LTK, MAGI3, MAML1, MAML2, MAML3, MAMLD1, MAP3K8, MARS1, MAST1, MAST2, MEAF6, MET, MGA, MGMT, MITF, MKL2, MN1, MSH2, MYB, MYBL1, MYC, NAB2, NCOA1, NCOA2, NCOA3, NCOA4, NFATC2, NFIB, NOTCH2, NPM1, NR4A3, NRG1, NRG2, NSD3, NTRK1, NTRK2, NTRK3, NUTM1, PAX3, PAX7, PAX8, PBX1, PDGFB, PDGFD, PDGFRA, PDGFRB, PHF1, PIK3CA, PLAG1, PML, POU5F1, PPARG, PPARGC1A, PPP1CB, PRKACA, PRKAR1A, PRKCA, PRKCB, PRKD1, PRKD2, PRKD3, PTPRZ1, QKI, RAD51B, RAF1, RANBP2, RARA, RELA, RELCH, RET, ROS1, RPS6KB1, RREB1, RSPO2, RSPO3, SDC1, SDC4, SH3PXD2A, SLC1A2, SHTN1, SLC34A2, SLC44A1, SLC45A3, SND1, SQSTM1, SS18, SSX1, SSX2, SSX4, STAT6, STRN, SUZ12, TACC1, TACC2, TACC3, TAF2N, TAF15, TCF3, TCF12, TERT, TFE3, TFEB, TFG, THADA, TMPRSS2, TPM3, TPR, TRIM24, TRIM33, TRIO, TTYH1, VGLL2, VGLL3, VMP1, WT1, WWTR1, YAP1, YWHAE, ZC3H7B, ZMYM2, ZNF703

EGFR del ex2-3, EGFR del ex2-4, EGFR del ex2-14, EGFR del ex2-22 (mLEEK), EGFR del ex5-6, EGFR del ex6-7, EGFR del ex9, EGFR del ex9-10, EGFR del ex10, EGFR del ex12, EGFR del ex25-26, EGFR del ex25-27, EGFR del ex26-27, EGFR VII, EGFR VIII, ERBB2 ex16 skipping, FGFR2IIIb, MET ex14 skipping, NFE2L2 ex2 skipping, PDGFRA del ex8-9
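
The main output file described above can be loaded with a few lines of code for downstream reporting. The following is a minimal sketch, assuming the two-column layout (event name, FFPM) given above. Whether the file starts with a header line is not stated here, so the sketch simply skips any line whose second field is not a number. The function name and the FFPM values in the example are made up for illustration.

```python
import csv
import io


def read_combined_results(handle):
    """Return {event_name: ffpm} from a two-column, tab-separated stream."""
    results = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        try:
            value = float(row[1])
        except ValueError:
            continue  # assumption: a non-numeric second field marks a header line
        results[row[0]] = value
    return results


# Made-up example content; a real file would be opened with
# open("/path/results/combined_results.tsv", newline="").
example = io.StringIO("EGFR_vIII\t12.5\nALK–EML4\t3.2\n")
events = read_combined_results(example)

# List events from highest to lowest FFPM.
for name, value in sorted(events.items(), key=lambda item: item[1], reverse=True):
    print(name, value)
```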

Additional output files are:

  • fusions.tsv
    A tabular listing of all detected fusions. This file is produced by STARfusion and is described in detail on the STARfusion wiki. The most important columns are:
    (1) FusionName (The detected fusion, e.g. GNB4–ETV1) and
    (9) FFPM (The number of fragments per million supporting this fusion)
  • intragene_events.tsv
    A tabular listing of all detected intra-gene (exon-skipping) events. This file has 6 columns:
    (1) Fusion Name, e.g. EGFR_VIII
    (2) HGNC symbol (gene name) of the affected gene
    (3)-(5) Genomic location of the skipping event, with respect to the hg19 reference genome
    (6) FFPM, the number of fragments per million supporting this event
  • all_reads.bam (+bai)
    An alignment of all sequenced reads to the hg19 reference genome
  • fusions_evidence_mapped.bam (+bai)
    Alignments of only the reads supporting fusion events
  • fusion_evidence_details.html
    A self-contained website with visualizations of the detected fusions
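
The six-column layout of intragene_events.tsv listed above can be read in a similar way. The following is a minimal sketch under several assumptions: columns (3)-(5) are kept as raw strings because their exact meaning (for example chromosome, start, end) is not specified here, lines whose last field is not a number are treated as headers, and min_ffpm is a hypothetical filter parameter, not a recommended cutoff. The example row uses placeholder coordinates, not real positions.

```python
import csv
import io
from typing import NamedTuple


class IntrageneEvent(NamedTuple):
    name: str        # column 1, e.g. EGFR_VIII
    gene: str        # column 2, HGNC symbol
    location: tuple  # columns 3-5, kept as raw strings
    ffpm: float      # column 6


def read_intragene_events(handle, min_ffpm=0.0):
    """Parse six-column, tab-separated rows and keep events with ffpm >= min_ffpm."""
    events = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 6:
            continue  # skip blank or unexpected lines
        try:
            value = float(row[5])
        except ValueError:
            continue  # assumption: a non-numeric FFPM field marks a header line
        if value >= min_ffpm:
            events.append(IntrageneEvent(row[0], row[1], tuple(row[2:5]), value))
    return events


# Placeholder row for illustration only; coordinates and FFPM are not real data.
example = io.StringIO("EGFR_VIII\tEGFR\tchr7\t1000\t2000\t8.4\n")
for event in read_intragene_events(example):
    print(event.name, event.gene, event.location, event.ffpm)
```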

Intermediate files for troubleshooting can be found in the subfolder intermediate/. You can safely delete this folder once the analysis has completed.

Disclaimer

This software is provided as is, and for research use only. Not for use in diagnostic procedures.

CeGaT GmbH makes no representations and extends no warranties of any kind, either express or implied, including warranties of merchantability, fitness for a particular purpose, non-infringement, design, output, throughput, and the absence of latent or other defects whether or not discoverable.

Downloads

Application Note: A Smarter Way to Fish For Fusions

Contact Us

Feel free to contact us if you have any questions or need further support.