ORENGE: ORphan ENzyme assiGnmEnt

Release 1.0

 

Homology-based mapping is currently the dominant strategy for pathway prediction, but it has an intrinsic problem: the mapped pathways generally contain unassigned enzymes, referred to as orphan enzymes. This happens because homologous pathways are often not identical; each may have its own unique enzymatic reactions and/or use unique enzyme-encoding genes. Multiple techniques have been developed to overcome this issue, but with only limited success.

Genomic Location is Information!

We present here a new method for assigning orphan enzymes in a metabolic pathway, using information fundamentally different from that used by existing methods. Specifically, we have recently discovered that the location of any gene in a bacterial genome is highly constrained by the genomic locations of the other genes encoding the pathway(s) involving this gene, referred to as its relevant genes. Thus, the gene to be identified for each orphan enzyme tends to fall within a narrow genomic range when the majority of its relevant genes are already assigned. When this information is applied in conjunction with accurate prediction of enzyme-encoding genes, it can lead to reliable assignments of orphan enzymes at a genome scale.
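To make the location constraint concrete, here is a minimal sketch in Python. It is not the ORENGE algorithm: the scoring rule (distance to the nearest already-assigned relevant gene on a circular chromosome) is a simple stand-in chosen for illustration, and the names `circular_distance` and `rank_candidates`, as well as the gene names and coordinates, are hypothetical.

```python
from typing import Dict, List, Tuple


def circular_distance(a: int, b: int, genome_length: int) -> int:
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


def rank_candidates(
    candidates: Dict[str, int],
    relevant_positions: List[int],
    genome_length: int,
) -> List[Tuple[str, int]]:
    """Rank candidate genes for one orphan enzyme.

    ASSUMPTION: a candidate is scored by its distance to the nearest
    already-assigned relevant gene. This is an illustrative proxy for
    "narrow genomic range", not the scoring used by ORENGE.
    """
    if not relevant_positions:
        raise ValueError("at least one assigned relevant gene is required")
    scored = [
        (gene, min(circular_distance(pos, p, genome_length)
                   for p in relevant_positions))
        for gene, pos in candidates.items()
    ]
    # Smaller distance first; gene name breaks ties deterministically.
    return sorted(scored, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy genome of 1000 positions.
    ranking = rank_candidates(
        candidates={"geneA": 120, "geneB": 500, "geneC": 990},
        relevant_positions=[100, 150, 900],
        genome_length=1000,
    )
    for gene, dist in ranking:
        print(gene, dist)
```

In practice such a location score would be combined with a prediction of which genes encode enzymes at all, as described above; the sketch covers only the location part.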

We have implemented this idea as a computer program and tested it on all 89 (partially) assigned E. coli K12 metabolic pathways in KEGG. Artificial orphan enzymes were created by randomly and repeatedly removing 1%, 5% and 10% of the already assigned genes across the 89 pathways; the program accurately assigns, on average, 64.4%, 63% and 61% of these orphan enzymes, respectively. On a benchmark set consisting of 17 orphan enzymes, our algorithm also shows considerably better performance than other methods. In addition, we applied our program to 14 real orphan enzymes and found that it consistently performed the best across genome-scale gene expression datasets with glucose and L-galactonate as carbon sources. Performance was judged by a defined inconsistency score, which measures the difference between the enzymatic activities estimated from the gene expression data and those required by the current metabolic network of E. coli.
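The artificial-orphan test described above can be sketched as a masking loop. This is a hedged illustration only: the function names, the dictionary layout (enzyme identifier mapped to gene identifier), the `predictor` callback and the rounding of the removal count are all assumptions, not details taken from the ORENGE program.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Assignments = Dict[str, str]  # hypothetical layout: enzyme id -> gene id


def mask_assignments(
    assignments: Assignments, fraction: float, rng: random.Random
) -> Tuple[Assignments, Assignments]:
    """Hide a random fraction of assigned enzymes to create artificial orphans."""
    enzymes = sorted(assignments)  # sorted for reproducible sampling
    # ASSUMPTION: remove at least one enzyme, rounding to the nearest count.
    k = max(1, round(fraction * len(enzymes)))
    removed = set(rng.sample(enzymes, k))
    remaining = {e: g for e, g in assignments.items() if e not in removed}
    orphans = {e: assignments[e] for e in removed}
    return remaining, orphans


def mean_recovery(
    assignments: Assignments,
    fraction: float,
    predictor: Callable[[str, Assignments], Optional[str]],
    repeats: int,
    seed: int = 0,
) -> float:
    """Average fraction of artificial orphans the predictor assigns correctly."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        remaining, orphans = mask_assignments(assignments, fraction, rng)
        hits = sum(
            1 for enzyme, gene in orphans.items()
            if predictor(enzyme, remaining) == gene
        )
        total += hits / len(orphans)
    return total / repeats
```

Here `predictor` stands for any assignment method that receives an orphan enzyme and the still-assigned genes and returns its best gene guess. Running `mean_recovery` with fractions 0.01, 0.05 and 0.10 would mirror the three removal levels reported above.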