Current works on bioinformatics

Protein threading

The tertiary (3D) structure of a protein carries the essential information about the functions of the protein. Computational techniques have played an important role in protein structure solution in the past decade, which has been complementary to the more accurate but much more expensive experimental techniques such as x-ray crystallography and NMR. It is the computational technique, particularly template-based prediction technique that has the potential to keep up with the rate at which protein sequences are being identified by world-wide sequencing and bioinformatics efforts. Protein threading, pioneered by D. Eisenberg and colleagues in early 1990's, has been the main workhorse for protein structure prediction in the past decade. Many protein structures have been predicted using this technique before the experimental solutions to their atomic structures become available, having provided highly useful information in guiding scientists in their experimental design for investigation of the target proteins. A number of algorithms have been developed for rigorously solving the threading problem. While protein threading continues to play a key role in protein structure prediction, we have witnessed in recent years that this class of prediction techniques have begun to reach its limitations evidenced by its performance in the recent CASP contests. One of the main limiting factors is that protein threading uses residue-level information only in its structure prediction, once a major advantage of this technique. While using residue-level information only has made it practically feasible to make structure predictions even for large proteins, the predicted structures are generally in low resolution, at best an accurate backbone structure virtually with no information about the side-chains, where key functional sites often reside. This is clearly an urgent need for more accurate and more powerful techniques for template-based protein structure prediction.

We present a novel computational framework for solving a generalized protein threading problem, in which backbone threading and side-chain packing are predicted simultaneously. To the best of our knowledge, there has not been any published work that attempts to solve the problem of simultaneous backbone threading and side-chain packing. The reason why side-chain information has not been used before during the backbone threading process is obviously due to the belief that the problem is computationally too intractable to be practically useful. Clearly, our generalized threading problem is NP-hard because both backbone threading and side-packing prediction are its special cases, respectively, and both have been shown to be NP-hard. Despite all this, we developed a rigorous and sufficiently fast algorithm for the generalized protein threading, by taking advantage of the special properties of the generalized threading problem. In this problem, we consider simultaneously backbone threading and side-chain packing during the process of a protein structure prediction. For a given query protein sequence and a template structure, our goal is to find a threading alignment between the query sequence and the template structure, along with a rotamer assignment for each side-chain of the query protein, which optimizes an energy function that combines a backbone threading energy and a side-chain packing energy. This highly computationally challenging problem is solved through first formulating this problem as a graph-based optimization problem. My new idea and main contribution lie in a novel application of tree decomposition technique to this new paradigm of protein threading, and various graph-theoretic techniques are employed to achieve the computational efficiency to make our algorithm practically useful, which takes advantage of a number of special properties of the graph representing this generalized threading problem. The overall framework of our algorithm is a dynamic programming algorithm implemented on an optimal tree decomposition of the graph representation of our problem. By using various additional heuristic techniques such as the dead-end elimination, we have demonstrated that our algorithm can solve a generalized threading problem within practically acceptable amount of time and space, the first of its kind. The theoretical paper was accepted to appear in Algorithmica conditionally. The implementation is on going with luminous perspective. The proposed research project is funded by NSF in which I am a major contributor.

 

Uber operon prediction

The rapidly expanding pool of sequenced microbial genomes provides a very rich source of information for deciphering the hidden information encoded in a genome and the organizational structures of the encoded information. One powerful tool for decoding such information is the so-called comparative genome analysis, which attempts to derive the encoded information through directly comparing the genomes against one another. Through such comparisons, "conserved" genomic structures at various organizational levels could possibly be detected. Then by linking these identified genomic structures to already well-established biological entities, one could possibly infer their biological meanings. For situations where such links are not clearly identifiable yet, the significance of the uncovered "conserved" genomic structures could possibly be established through statistical means. Comparative genome analyses have been used to predict operon structures, a layer of well-established genomic structure, at a whole genome scale. As more powerful comparative genome analysis tools become available, we expect that more genomic structures, previously understood or new, will be revealed.

As we understand now, there are a number of well-established higher-level genomic structures beyond operons in a microbial genome, which include regulons, modulons, and stimulons. A regulon is a group of operons which are co-regulated by the same transcriptional machinery, while a modulon is a group of regulons that are controlled by more global regulators and respond to more general physiological states of a cell. At an even higher level is a set of stimulons, each of which consists of a collection of operons, regulons and/or modulons that respond to a common environmental stimulus. Each of these genomic structures generally encodes a biological pathway or a complex network (or possibly portions of a pathway/network). Hence identification and characterization of these genomic structures has direct implications to deciphering biological pathways and networks in a systematic manner, which represents one of the key tasks in the study of an organism at a systems level.

It is known that in bacteria, genes are transcribed using operons (including single-gene operons) as the basic units, while in eukaryotes genes are transcribed individually. While the exact reason for this phenomenon requires more investigation, we suspect that one possible reason might be that as organisms evolve to become more complex, they might have the tendency to use each of their genes in more biological processes, which requires the flexibility of different gene associations to efficiently handle different needs for co-transcription. This, in turn, might have led to the breakup of the fixed gene associations enforced by the (large) operon structures in the ancient and simple organisms to possibly smaller transcriptional units in more complex organisms. We have recently carried out a systematic study on the tryptophan synthesis operons. We found that these operons are fairly larger (average operon size is 6.4 for 24 archaeabacteria genomes) in some archaeabacteria while their sizes are in general smaller (average operon size is 1.4 for 17 cyanobacteria genomes) in some cyanobacteria. This observation seems to suggest that some of these operons may have experienced the fission process during the evolution. To the extreme along this discussion, all eukaryotic genomes have each of their genes individually regulated transcriptionally, i.e., all their "operons" are singletons.

By identifying operons that used to be associated in some ancient organisms (e.g., two whole operons or parts of them belong to the same operon in an ancient organism) or other organisms in general, we may detect the footprints of operon evolution. This footprint of operon evolution might provide useful information leading to not only better understanding about genomic structures and their organization, but also possibly a new set of tools for studying biological machineries in a prokaryotic cell, just like the powerful tool that operons have provided to biological pathway prediction. In this study, we focus on the identification of the footprint of a particular class of operon evolution, uber-operons, a concept introduced by Lathe et al. in 2000. The essential idea of a uber-operon is that during evolution, larger operons might have broken into smaller operons in different ways along different evolutionary lineages. Hence by studying conservations among groups of operons ("uber-operons") rather than individual operons, it may help to uncover the "lost" association relationships among operons that used to work together constitutively. By the very nature of the uber-operon definition, it requires reference genomes to uncover the "lost" association relationship among of the uber-operons. In particular, it requires the knowledge of orthologous genes across genomes. Lathe et al. (14) proposed an iterative procedure for identification of uber-operons, assuming that orthologous gene relationships are given, which has limited the practical value of their method.

We presented a novel algorithm to simultaneously identify uber-operons in a target genome and orthologous gene relationships between the target and reference genomes. Our prediction algorithm predicts uber-operons through identifying groups of functionally or transcriptionally related operons, whose gene sets are conserved across the target and multiple reference genomes. Using this algorithm, we have predicted uber-operons for each of a group of 91 genomes, using the other 90 genomes as references. In particular, we predicted 158 uber-operons in E coli K12 covering 1,830 genes, and found that many of the uber-operons correspond to parts of known regulons or biological pathways or are involved in highly related biological processes based on their Gene Ontology assignments. For some of the predicted uber-operons that are not parts of known regulons or pathways, our analyses indicate that their genes are highly likely to work together in the same biological processes, suggesting the possibility of new regulons and pathways. We believe that our uber-operon prediction provides a highly useful capability and a rich information source for elucidation of complex biological processes such as pathways in microbes. All the prediction results are available at our Uber-Operon Database: http://csbl.bmb.uga.edu/uber , the first of its kind. The paper was published in NAR(2006).

 

Drug design without side-effect

Research effort in molecular biology, such as the human genome project, has been revealing the secret of our genetical composition, the long DNA sequences that determine almost every aspect of life. Applications that use this information have posed new challenge to design and analysis of efficient computational methods.

A frequently surfacing problem in many biological applications is to find one substring of length l that appears (with a few substitutions) at least once in each of a set of bad strings (such as bacterial sequences) and is not ``close'' to any substring of length l in each of another set of good strings (such as Human and live stock sequences). The problem has various applications in molecular biology such as universal PCR primer design, genetic drug target identification and genetic probes design. In particular, the genetic drug target identification problem searches for a sequence of genes that is close to bad genes (the target) but far from all good genes (to

avoid side-effect). Our study theoretically develops a polynomial time approximation scheme in both measures simultaneously. It ended a longstanding debate regarding if the following problem is polynomial time approximative. The paper was published in SIAM J on Computing (2003).

 

Motif finder: PROMOCO

The availability of a large number of genomic sequences and their genes, along with the availability of whole-cell microarray gene expression data, has made it possible to study a cell's transcription regulation machinery in a systematic manner. One of the key steps in deciphering the transcription regulation system is to identify the DNA binding motifs of the transcription regulators. Earlier computational methods for identification of such binding motifs are mainly based on an intuitive assumption that these binding motifs, when extracted and aligned, have high information content. Different search strategies have been employed to identify such conserved regions, including the greedy algorithm employed by CONSENSUS, unsupervised learning approach

in MEME , Gibbs sampling, and a data clustering method used in CUBIC.

A basic assumption, widely used by computational biologists, is that each set of conserved binding motifs would have evolved from a common ancestor. Mathematically, the problem is to identify a set of l -mers from a given set of promoter sequences which have at most k different positions from the putative ancestor. We used notation (l, k) -motifs to represent such a set of motifs. This formulation of the binding-motif identification allows us to solve the problem using combinatorial optimization techniques.

 

Rigorous solution to a (l, k) -motif problem in general is highly challenging - the general problem has been shown to be NP-hard. The difficulty comes mainly from the fact that the putative ancestor is unknown, and the distances between this putative ancestor and the binding motifs to be identified are also unknown. We presented a new algorithm, named PROMOCO, for the (l, k) -motif finder through introducing a new distance, called pseudo-Hamming distance, and defining a pseudo-neighborhood for each word using this distance measure. The algorithm, implemented as a computer software PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis-regulatory binding motifs. A drawback of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or desired motifs appear in only small portion of the given sequences.

 

Regulon prediction

Regulon is a group of operons which are co-regulated by the same transcriptional machinery. As described above, the identification and characterization of regulon has direct implications to deciphering biological pathways and networks in a systematic manner, which represents one of the key tasks in the study of an organism at a systems level. To the best our knowledge, there is no general regulon prediction tool in the world as of today. Due to the definition of regulon, the prediction of regulon in microbial genomes is closely related to the prediction of operons and regulatory elements. Recently, I developed three totally new approaches respectively to identify operons, regulatory elements and regulons in microbial genomes. I first presented an algorithm which can automatically determine the length of regulatory elements. Hence the prediction of regulon can be implemented automatically, i.e., we do not need to know the length of regulatory elements in advance, the first of its kind.

Bilustering

With the event of DNA chips and other techniques that measure the expression level of a large number of genes within a number of different experimental samples (conditions), a bunch of clustering/biclustering methods have been proposed to detect patterns in gene expression data. In contrast with clustering that group genes of similar behavior over all the conditions, biclustering organize genes into groups (overlap allowed) such that genes in one group behavior similarly over only a subset of conditions, and therefore to be of higher flexibility in analysis of gene expression data. But then biclustering seems to be more intractable than traditional clustering. Almost all the existing biclustering methods have been designed statistically, and to some extent demonstrated powerful for dealing with gene expression data. In this article, we developed a new approach for biclustering microarray data by using pure combinatorial technique. It will be seen that under our framework biclustering is essentially equivalent to traditional clustering according to computational complexity. The software, named BIGREEDY, developed based on this method applied to artificial and real data respectively. Not only are the results carried out unexpectedly amazing on both of the data sets, but the time consuming is also unparalleled parsimoniously. It deserves to be pointed out that the procedure proposed here is of sufficient universality to be applied to various kinds of biclustering models.