In this work, noting that cancer-associated genes can undergo multiple functional changes and that multiple mutations may act collectively, we have developed a bi-clustering-based method, PreGoLoF (Predictor for Gain or Loss of Functions), which integrates genomic mutation and transcriptomic data to predict such heterogeneous gains or losses of function caused by specific mutations in a single gene or by the interactive effect of multiple mutations. The aim of this study is to provide a computational tool that comprehensively identifies the functional gains or losses caused by specific mutations, concurrent mutations, or the collective effect of multiple mutations.
The analysis pipeline of PreGoLoF is shown in Figure 1. For a given mutated gene and cancer type, we retrieve the transcriptomic data of all N samples of that cancer from the TCGA database, of which N1 samples carry mutations in the gene and N2 do not, with N1 + N2 = N. For any gene X in the genome, the expression pattern of X is considered to be affected by the mutated gene if the distribution of X’s expression levels over the N samples is significantly different from its distribution over the N2 samples without mutations. Currently, two types of significant difference are considered: (1) the N1 samples are significantly enriched among the samples with the highest (or lowest) expression of X across the N samples; or (2) the distribution of X’s expression over the N samples has a distinct peak that is significantly enriched with N1 samples. Our goal is to determine which pathways are enriched with genes whose expression distributions are altered by a specific mutated gene in some subset of the N1 samples. This problem can be formulated as a bi-clustering problem and solved using our own program QUBIC, with each bi-cluster consisting of a set of samples that share common functional losses/gains, measured in terms of pathways with altered expression patterns. Bi-clustering is followed by a pathway enrichment analysis coupled with a statistical significance assessment. The result of this analysis for each target mutated gene is a set of pathways, drawn from an integrated collection of KEGG, MSigDB, TransPath and GO pathways, with altered expression patterns over some subsets of the N1 samples due to the mutated gene.
Step I (Identify mutation-associated gene expression)
A Gaussian mixture model with a left-truncation assumption is first fitted to the RSEM-normalized gene expression levels to identify the number of peaks in each gene’s expression profile. For a given mutation, a gene is considered mutation-associated if it is significantly differentially expressed between the mutated and non-mutated samples, or if at least one peak in its expression profile is significantly associated with the mutation.
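The peak-counting idea in Step I can be illustrated with a minimal sketch. It is not the PreGoLoF implementation: it fits a plain one-dimensional Gaussian mixture by EM and omits the left-truncation assumption described above, and the use of BIC to choose the number of peaks is an assumption made for illustration. The function names (`fit_gmm_1d`, `select_n_peaks`) and the synthetic expression values are hypothetical.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by EM (no left truncation).

    Returns (weights, means, sds, log_likelihood).
    """
    xs = sorted(values)
    n = len(xs)
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in xs) / n) or 1.0
    min_sd = 0.01 * sd_all  # floor to avoid collapsing onto a single point
    # Initialise means at evenly spaced quantiles, with shared sd and equal weights.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sds = [sd_all] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in xs:
            dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, xs)) / nj
            var = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    loglik = sum(
        math.log(max(sum(w * normal_pdf(v, m, s)
                         for w, m, s in zip(weights, means, sds)), 1e-300))
        for v in xs
    )
    return weights, means, sds, loglik


def select_n_peaks(values, max_k=3):
    """Choose the number of mixture components with the lowest BIC (assumed criterion)."""
    n = len(values)
    best_k, best_bic = 1, float("inf")
    for k in range(1, max_k + 1):
        loglik = fit_gmm_1d(values, k)[3]
        bic = -2.0 * loglik + (3 * k - 1) * math.log(n)  # 3k - 1 free parameters
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


# Synthetic, deterministic expression values (not real data).
quantiles = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
bimodal = [2.0 + 0.5 * q for q in quantiles] + [8.0 + 0.5 * q for q in quantiles]
unimodal = [5.0 + q for q in quantiles]
print(select_n_peaks(bimodal), select_n_peaks(unimodal))
```

A gene whose profile yields more than one component would then be tested peak by peak for association with the mutation, while a single-component profile would go through the differential-expression test.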
Step II (Data discretization for possible GoLoF-associated expression)
The expression profile of each mutation-associated gene is then discretized into 1/0 values that indicate a specific gene expression pattern associated with the mutation. For a mutation-associated gene whose expression has a single peak, a Kolmogorov statistic-based approach is applied to identify the mutation-associated over- or under-expression. For a gene whose expression has multiple peaks, Fisher’s exact test is applied to identify the mutation-associated peak. For each mutation-associated over-/under-expression pattern or peak, samples showing the pattern are assigned 1 and all other samples are assigned 0. The discretized data are merged with the mutation profiles of other genes and passed as input to the bi-clustering step.
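To make the single-peak case concrete, the sketch below shows one way a Kolmogorov-type statistic could drive the 1/0 discretization for over-expression. The exact procedure is not specified in this section, so the details are assumptions: samples are ranked by decreasing expression, a running difference between the cumulative fractions of mutated and non-mutated samples is tracked, and every sample at or above the rank that maximizes this difference is assigned 1. Ties in expression and the significance assessment (for example, by permutation) are left out. The function name and the toy values are hypothetical.

```python
def ks_overexpression_labels(expr, mutated):
    """Discretize one gene's expression into 1/0 using a one-sided KS-type walk.

    expr    : list of expression values, one per sample
    mutated : list of 0/1 flags, 1 if the sample carries the mutation
    Returns (d_max, labels), where labels[i] is 1 for samples in the
    high-expression tail up to the rank that maximizes the statistic.
    """
    n1 = sum(mutated)
    n2 = len(expr) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both mutated and non-mutated samples")
    # Rank samples from highest to lowest expression.
    order = sorted(range(len(expr)), key=lambda i: expr[i], reverse=True)
    frac_mut = frac_wt = 0.0
    d_max, best_rank = 0.0, 0
    for rank, i in enumerate(order, start=1):
        if mutated[i]:
            frac_mut += 1.0 / n1
        else:
            frac_wt += 1.0 / n2
        d = frac_mut - frac_wt  # excess of mutated samples in the top tail
        if d > d_max:
            d_max, best_rank = d, rank
    top = set(order[:best_rank])
    labels = [1 if i in top else 0 for i in range(len(expr))]
    return d_max, labels


# Toy example (synthetic values): three of four mutated samples sit in the top tail.
expr = [9.1, 8.7, 8.5, 8.0, 5.2, 5.0, 4.8, 4.5, 4.1, 3.9]
mutated = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
d_max, labels = ks_overexpression_labels(expr, mutated)
print(d_max, labels)
```

Under-expression can be handled the same way by ranking samples in increasing order. Whether non-mutated samples that fall inside the selected tail should also receive 1 is a design choice this sketch makes for simplicity.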
Step III (Bi-clustering to identify GoLoF and corresponding samples)
We apply our in-house bi-clustering method QUBIC 1.0 to the discretized input data to identify bi-clusters. To ensure the significance of the identified bi-clusters, relatively strict parameters are used in the computation (see the Methods section for details). Bi-clusters that contain other mutations may correspond to possible interactive effects of multiple mutations; large bi-clusters that cover most of the mutated samples are considered to reflect the general effect of the mutation; and the remaining bi-clusters may correspond to possible GoLoF. The functions of the genes, together with the mutation types and positions of the samples, in each identified bi-cluster are then examined to elucidate the detailed GoLoF of each bi-cluster and to further assess the significance of the prediction.
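The three-way interpretation of bi-clusters described above can be written as a small decision rule. This sketch does not call QUBIC and does not reproduce its output format: it assumes each bi-cluster is already available as a set of row features and a set of samples. The 0.8 coverage cutoff for "most of the mutated samples", the order in which the rules are checked, and all identifiers are assumptions made for illustration.

```python
def classify_bicluster(features, samples, mutated_samples, mutation_features,
                       coverage_cutoff=0.8):
    """Label a bi-cluster following the interpretation rules of Step III.

    features          : set of row identifiers in the bi-cluster
    samples           : set of sample identifiers in the bi-cluster
    mutated_samples   : set of all samples carrying the target mutation
    mutation_features : set of row identifiers that encode mutations of other genes
    coverage_cutoff   : assumed threshold for "covers most mutated samples"
    """
    if features & mutation_features:
        # The bi-cluster includes mutation rows of other genes.
        return "interactive effect"
    coverage = len(samples & mutated_samples) / len(mutated_samples)
    if coverage >= coverage_cutoff:
        return "general effect"
    return "candidate GoLoF"


# Hypothetical identifiers, for illustration only.
mutated_samples = {f"s{i}" for i in range(1, 11)}
mutation_features = {"mut:GENE_B", "mut:GENE_C"}

print(classify_bicluster({"expr:G1", "mut:GENE_B"}, {"s1", "s2"},
                         mutated_samples, mutation_features))
print(classify_bicluster({"expr:G1", "expr:G2"}, {f"s{i}" for i in range(1, 10)},
                         mutated_samples, mutation_features))
print(classify_bicluster({"expr:G3", "expr:G4"}, {"s1", "s2", "s3"},
                         mutated_samples, mutation_features))
```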
Step IV (Assessment of prediction significance)
The significance of the identified GoLoF is assessed on three levels: 1) examining the consistency of the mutation types or of the mutation positions at the level of protein secondary or tertiary structure; 2) examining pathway enrichment of the genes in the identified bi-clusters; and 3) comparing the predicted GoLoF with the known protein-protein interaction (PPI) network of the mutated gene. For the predicted interactive effects of multiple mutations, pathway enrichment and comparison with the PPI network are used to assess the prediction significance.
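As an illustration of the pathway enrichment level of this assessment, the following sketch computes a one-sided hypergeometric p-value for the overlap between the genes of a bi-cluster and each pathway. The section does not state which enrichment statistic PreGoLoF uses, so the hypergeometric test is an assumption, and multiple-testing correction is omitted. The gene and pathway names are placeholders rather than real annotations.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, selected_size, overlap):
    """P(X >= overlap) when selected_size genes are drawn without replacement
    from a universe that contains pathway_size pathway genes."""
    upper = min(pathway_size, selected_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe_size, selected_size)


def pathway_enrichment(bicluster_genes, pathways, universe):
    """Return {pathway name: p-value} for one bi-cluster's gene set."""
    selected = bicluster_genes & universe
    p_values = {}
    for name, members in pathways.items():
        members = members & universe  # ignore genes outside the background set
        overlap = len(selected & members)
        p_values[name] = enrichment_p_value(
            len(universe), len(members), len(selected), overlap
        )
    return p_values


# Placeholder gene universe and pathways (synthetic, for illustration only).
universe = {f"g{i}" for i in range(20)}
pathways = {
    "pathway_A": {f"g{i}" for i in range(0, 5)},
    "pathway_B": {f"g{i}" for i in range(10, 15)},
}
bicluster_genes = {"g0", "g1", "g2", "g15"}

p_values = pathway_enrichment(bicluster_genes, pathways, universe)
for name in sorted(p_values):
    print(name, p_values[name])
```

In this toy case three of the four bi-cluster genes fall in `pathway_A`, which gives it a much smaller p-value than `pathway_B`, with which the bi-cluster shares no genes.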