1. Introduction

2. List operons for specific organisms

3. Universal search

4. Operon information and gene information

5. How to read operon information

6. Search for similar operons

7. Genome browser

8. Operon prediction

1. Introduction

DOOR (Database of prOkaryotic OpeRons) is an operon database developed by the Computational Systems Biology Lab (CSBL) at the University of Georgia. Although the operons in the database are based on prediction, several unique features distinguish DOOR from other operon databases.

  • Our prediction algorithm is based on the paper below:

    According to the evaluation published in Briefings in Bioinformatics by Brouwer et al., this algorithm consistently performs best in all aspects, including sensitivity and specificity for both true positives and true negatives, and its overall accuracy reaches ~90%. Because of this high accuracy, we believe that making all of the predicted operons available to the public will provide valuable information to the scientific community.

  • DOOR2 is the largest operon database currently available. Several other research groups also provide relevant operon databases (predicted or collected from literature), such as OperonDB provided by Steven L. Salzberg's group at the University of Maryland, the predicted operons in MicrobesOnline at VIMSS (Virtual Institute of Microbial Stress and Survival), ODB at Kyoto University in Japan, DBTBS in Japan, and RegulonDB in Mexico.

    Among the existing operon databases, DOOR2 has the largest data set available to the public. DOOR2 currently provides operon information for more than 675 prokaryotic genomes, while MicrobesOnline covers 620 genomes, OperonDB 287 genomes, and ODB 203 genomes. In addition, RegulonDB provides operon information for E. coli only, and DBTBS for B. subtilis only.

    Although most of the operons in DOOR2 are not verified by experiments, we try to provide relevant literature information extracted from ODB along with the operons to make our database more comprehensive. In addition, we believe that the operon data provided by DOOR2 will be useful for scientific analyses of operon evolution, operon transfer, and related topics.

    We would like to emphasize that if you are strictly looking for experimentally verified operons, you should look into DBTBS and RegulonDB first.

  • We provide operons that contain RNA genes, which are rarely seen in other operon databases, especially in predicted operon databases.

  • We provide powerful query capabilities to help users find what they are looking for easily. Please see the sections below for a more detailed description.
  • We have defined similarity scores between operons based on weighted maximum matching between operons. Groups of similar operons can be used to predict orthologous genes accurately, and their upstream regions can be used to find consensus binding motifs.
  • We have integrated two motif finding programs in the database: MEME and our in-house program CUBIC. MEME is a very popular motif finding program, so we integrated it to meet public interest. In our experience, CUBIC outperforms MEME in many aspects, so we integrated it as well.
  • A convenient operon selection function makes it easy to feed your operons of interest to MEME or CUBIC.
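
The idea of scoring two operons by weighted maximum matching can be illustrated with a minimal sketch. It is not the DOOR2 implementation: the function name operon_similarity and the gene-to-gene scores are hypothetical, the scores are assumed to be non-negative, and the brute-force search is only practical for small operons.

```python
from itertools import permutations

def operon_similarity(weights):
    """Return the best total weight of a one-to-one gene matching.

    weights[i][j] is a hypothetical, non-negative similarity score between
    gene i of operon A and gene j of operon B.
    """
    n_a = len(weights)
    n_b = len(weights[0]) if weights else 0
    if n_a == 0 or n_b == 0:
        return 0.0
    # Transpose if needed so that rows index the shorter operon;
    # every row can then be matched to a distinct column.
    if n_a > n_b:
        weights = [list(col) for col in zip(*weights)]
        n_a, n_b = n_b, n_a
    best = 0.0
    # Brute force: try every assignment of rows to distinct columns.
    for cols in permutations(range(n_b), n_a):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        best = max(best, total)
    return best

# Hypothetical scores for a 2-gene operon compared with a 3-gene operon.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
]
print(operon_similarity(scores))
```

For larger operons, a polynomial-time assignment algorithm would replace the brute-force loop; the matching found is the same.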

2. List operons for specific organisms

You can obtain a list with all of the operons for a specific organism.

A. Go to DOOR2 homepage: http://csbl.bmb.uga.edu/DOOR/index.php


B. Click on "Organisms" in the menu. You will then get a list of organisms as shown below. This window shows all of the organisms that currently have operons predicted in DOOR2, arranged in alphabetical order.


C. You can adjust the number of entries displayed on each page by changing the number in the "Show entries" box.


D. You can also type in the name of your desired organism in "Search". For example, if you are looking for strains of E. coli, type in "e coli" in the search box and a list of all of the strains of E. coli will be displayed:


E. Then find your desired strain; all of the nucleotide chains of this organism are displayed next to it. For example, if you are looking for E. coli K12 MG1655, find the name "Escherichia coli str. K-12 substr. MG1655" in the list, and its only nucleotide chain, NC_000913, will be shown next to it.


F. Click on the nucleotide chain NC_000913, and then a table of all of the predicted operons in E. coli K12 MG1655 will be shown. Each operon has a 4-digit number assigned to it, and each row shows the general information for genes in the operon. You can click on the 4-digit number of a particular operon for more detailed information. Please see section 4 for more information about this part.


3. Universal search

You can also search for your desired operons or other related information using our universal search box. Our search server is based on the Sphinx Open Source Search Server. You can type in a value for any of the fields shown below, or you can use the Sphinx extended query syntax shown in the examples below to create a more sophisticated search condition. Your search words will be highlighted in yellow in the results.


We support the search of the following fields:

Field                             Syntax
Operon ID                         ID
Number of protein-coding genes    NumOfProteinGenes
Number of RNA-coding genes        NumOfRNAGenes
Species                           Species
NC number                         NC
GI number                         GeneGIs
Gene synonym                      GeneSynonyms
Gene symbols                      GeneSymbols
Gene COG                          GeneCOGs
Gene descriptions                 GeneDescriptions

Here are some examples to show you how our search box functions:

  • Number of protein-coding genes in an operon
    • If you want to find out all of the operons containing 4 protein-coding genes, type in @NumOfProteinGenes 4.
    • If you want to specify the organism in which these operons are found (such as E. coli), type in @NumOfProteinGenes 4 @Species coli to obtain a more specific result.
  • Species
    • To find all the strains of a particular organism, type in both its genus and species. For example, type in Escherichia coli to obtain a list of all of the different strains of E. coli.
    • To find a particular strain of an organism (such as the Escherichia coli str. K-12 substr. MG1655), type in coli MG1655.
  • NC number
    • Directly type in the NC number. For example, type in NC_000913 to find all the operons encoded by this nucleotide chain.
  • GI number
    • To look for a particular gene, simply type in the GI number. For example, if you are looking for gene 16127995, type in 16127995.
  • Gene synonym
    • You can also look for a particular gene using its synonym. For example, type in b0006 to look for gene 16128000.
  • Gene symbols
    • Type in the symbol of the gene that you are looking for. For example, type in thrL to obtain a list of all the operons containing thrL.
    • To specify the organism in which the symbol is present (such as E. coli), type in @GeneSymbols thrL @Species coli.
  • Gene COG
  • Gene descriptions
    • Type in the function of an operon. For example, if you are looking for operons responsible for lactate permease activity, type in lactate permease.
    • To restrict the search to a particular organism (such as E. coli), type in @GeneDescriptions lactate permease @Species coli.
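
The field-restricted queries in the examples above all follow the same pattern, @Field value, with several conditions separated by spaces. As a minimal sketch, a small helper can assemble such query strings for pasting into the search box. The helper build_query is hypothetical and is not part of DOOR2; only the field names come from the table of supported fields.

```python
# Field names taken from the table of supported search fields.
SEARCH_FIELDS = {
    "ID", "NumOfProteinGenes", "NumOfRNAGenes", "Species", "NC",
    "GeneGIs", "GeneSynonyms", "GeneSymbols", "GeneCOGs", "GeneDescriptions",
}

def build_query(**conditions):
    """Join field/value pairs into one '@Field value' query string."""
    parts = []
    for field, value in conditions.items():
        if field not in SEARCH_FIELDS:
            raise ValueError(f"unknown search field: {field}")
        parts.append(f"@{field} {value}")
    return " ".join(parts)

print(build_query(NumOfProteinGenes=4, Species="coli"))
print(build_query(GeneDescriptions="lactate permease", Species="coli"))
```

The helper only builds text; it does not contact the DOOR2 server, and it rejects field names that are not listed in the table.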

4. Operon information and gene information

A. In order to obtain more information on your desired operon, click on its operon ID. For example, if you are looking for operon 2996, click on the number and it will lead you to a page with more operon information.



B. To obtain more information on a particular gene in an operon, click on its GI number. A GI number is an integer identifier used in the NCBI prokaryotic genome releases to identify genes.



5. How to read operon information

A. Rows 1, 2, 3, 4, 7, 8, 9, and 12 are self-explanatory. Row 5 shows the number of operons similar to this specific operon. Clicking on the number leads you to the page that lists the similar operons. For more detailed instructions, please see section 6.

B. Row 10 shows the operon's corresponding information in ODB, if the operon also exists in ODB. The literature information is also shown here.

C. Row 11 shows a link to the VIMSS MicrobesOnline operon page, if the operon exists there.

D. There is also a genome browser at the bottom of this page. Please see section 7 for more information.


6. Search for similar operons

A. Click on "similar operon number" to search for operons similar to the one that you are looking for...


7. Genome browser

A. The genome browser provides a graphical view of the arrangement of genes in an operon.


B. You can also view the arrangement of different transcription units in the operon due to alternative splicing by double-clicking or dragging the "TU" boxes on the left.



8. Operon prediction

A. The flow chart below shows the major steps of our operon prediction. The cylinders represent the data used during operon prediction, while the rounded rectangles represent the calculation and classification steps used to generate the prediction.

B. A total of 6 features contribute to the final classification and prediction. Three of them are commonly used features: intergenic distance, neighborhood conservation, and phylogenetic distance.

  • Intergenic distance - the distance between each pair of adjacent genes.
  • Neighborhood conservation - the likelihood that a pair of genes are neighbors in a group of reference genomes.
  • Phylogenetic distance - the co-presence of a pair of genes across a group of reference genomes.
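
The two reference-genome features above can be illustrated with a simplified sketch. The formulas here are assumptions chosen for illustration (a plain fraction of genomes and a normalized presence/absence mismatch count), not the scoring actually used by DOOR2. The reference genomes are hypothetical lists of gene names in chromosomal order, and circular chromosomes are ignored.

```python
def neighborhood_conservation(pair, reference_genomes, max_gap=1):
    """Fraction of genomes containing both genes in which they are neighbors."""
    gene_a, gene_b = pair
    both = 0
    adjacent = 0
    for genome in reference_genomes:
        if gene_a in genome and gene_b in genome:
            both += 1
            if abs(genome.index(gene_a) - genome.index(gene_b)) <= max_gap:
                adjacent += 1
    return adjacent / both if both else 0.0

def phylogenetic_distance(pair, reference_genomes):
    """Fraction of genomes in which exactly one gene of the pair is present."""
    gene_a, gene_b = pair
    if not reference_genomes:
        return 0.0
    mismatches = sum((gene_a in g) != (gene_b in g) for g in reference_genomes)
    return mismatches / len(reference_genomes)

# Hypothetical reference genomes, each a list of genes in chromosomal order.
references = [
    ["thrA", "thrB", "thrC"],
    ["thrA", "yaaX", "thrB"],
    ["thrA", "thrC"],
    ["yaaA"],
]
print(neighborhood_conservation(("thrA", "thrB"), references))  # 0.5
print(phylogenetic_distance(("thrA", "thrB"), references))      # 0.25
```

In this sketch, a pair that is frequently adjacent and rarely present alone scores high on conservation and low on distance, which is the pattern expected of genes in the same operon.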

C. DOOR2 also incorporates 3 additional features in operon classification and prediction. These 3 unique features are binding motif occurrence, GO-based functional similarity, and intergenic length ratio.

  • Binding motif occurrence - the number of times a DNA motif occurs in the intergenic region of a pair of genes.
  • GO-based functional similarity - the GO number of a gene indicates its biological function; the more similar the GO numbers of two genes are, the more likely the genes are functionally related.
  • Intergenic length ratio - the natural log of the length ratio of the upstream and downstream genes.
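
Two of these features, together with the intergenic distance, can be computed directly from gene coordinates and sequence. The sketch below rests on several assumptions: both genes lie on the same strand, coordinates are 1-based and inclusive, the ratio is upstream length divided by downstream length, and motif occurrences are counted as exact string matches. All names and values are hypothetical.

```python
import math

def intergenic_distance(upstream, downstream):
    """Bases between the two genes; negative if the genes overlap."""
    return downstream["start"] - upstream["end"] - 1

def length_ratio(upstream, downstream):
    """Natural log of (upstream gene length / downstream gene length)."""
    up_len = upstream["end"] - upstream["start"] + 1
    down_len = downstream["end"] - downstream["start"] + 1
    return math.log(up_len / down_len)

def motif_occurrences(intergenic_seq, motif):
    """Count exact, possibly overlapping, occurrences of a motif."""
    if not motif:
        return 0
    count = 0
    pos = intergenic_seq.find(motif)
    while pos != -1:
        count += 1
        pos = intergenic_seq.find(motif, pos + 1)
    return count

# Hypothetical adjacent genes.
gene_a = {"start": 100, "end": 399}
gene_b = {"start": 450, "end": 599}

print(intergenic_distance(gene_a, gene_b))         # 50
print(length_ratio(gene_a, gene_b))                # ln(300 / 150) = ln 2
print(motif_occurrences("TATAATATAAT", "TATAAT"))  # 2
```

Real motif scanning usually relies on position weight matrices rather than exact matches; exact matching is used here only to keep the example short.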

D. The data are then classified using PRTools, a freely available Matlab toolbox. When a substantial number of operons in the current genome are known, PRTools uses the non-linear decision tree-based classifier. If most operons in the genome are unknown, PRTools employs the linear logistic function-based classifier, based on existing operon information.
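
The choice between the two classifiers can be sketched as follows. This is an illustration in Python, not a call into PRTools: the cutoff min_known is a hypothetical value (the text only says "a substantial number"), and the weights passed to logistic_score are made up for the example.

```python
import math

def choose_classifier(n_known_operons, min_known=100):
    """Pick a classifier type; min_known is a hypothetical cutoff."""
    if n_known_operons >= min_known:
        return "decision-tree"
    return "logistic"

def logistic_score(features, weights, bias):
    """Linear logistic function: probability that a gene pair is co-operonic."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

print(choose_classifier(500))  # decision-tree
print(choose_classifier(3))    # logistic
print(logistic_score([0.0, 0.0], [1.0, 1.0], 0.0))  # 0.5
```

A gene pair whose score exceeds a chosen threshold would be placed in the same operon; the threshold itself is not specified in this document.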