Simulated data
- Procedure:
To evaluate the performance of MREC systematically, we first ran it on a set of simulated data for which the ground truth is known, and compared its predictions with those of two existing tools, MEME and Cosmo. Since the performance of virtually all motif-finding tools, our own included, depends on the level of motif conservation and on the length of the motif, we considered these two factors explicitly in the design of the simulated data. We generated a number of data sets containing the motif TTATCCACAA, which was point-mutated according to a given mutation rate and then placed at an arbitrary location in each of a set of randomly generated background sequences. We considered nine mutation rates for mutating each nucleotide to another one. Each test set contains 13 sequences of 200 nucleotides, a length commonly used for prokaryotic promoter sequences. For each mutation rate, we generated 100 such sequence sets with the mutated motifs embedded in their sequences. For results, see Figure 1(a).
We also compared the performance of MREC, MEME and Cosmo in identifying motifs of different lengths, with the mutation rate kept at the same level, from 18% to 25%. For this test, we generated 100 sequence sets for each motif length from 8 to 18 nucleotides. Each set contains 13 sequences of 200 nucleotides, and each sequence has one embedded motif of the fixed length. For results, see Figure 1(b).
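The procedure above can be illustrated with a minimal Python sketch. It is not the code used to produce the simulated data. The function names (`mutate`, `make_sequence_set`), the uniform background model, the uniform choice among the three alternative nucleotides, and the decision to overwrite background positions with the motif (so that each sequence stays at 200 nucleotides) are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, rate, rng):
    """Point-mutate each nucleotide independently with probability `rate`.

    Assumption: a mutated position is replaced by one of the three
    other nucleotides, chosen uniformly.
    """
    out = []
    for base in motif:
        if rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def make_sequence_set(motif, rate, n_seqs=13, seq_len=200, rng=None):
    """Build one test set: each sequence carries one mutated motif instance.

    Assumptions: the background is uniform random over A, C, G, T, and the
    motif instance overwrites background positions at a random offset.
    Returns a list of (sequence, motif_position, motif_instance) tuples,
    so the ground truth is kept alongside each sequence.
    """
    rng = rng or random.Random()
    dataset = []
    for _ in range(n_seqs):
        background = "".join(rng.choice(ALPHABET) for _ in range(seq_len))
        instance = mutate(motif, rate, rng)
        pos = rng.randrange(seq_len - len(motif) + 1)
        seq = background[:pos] + instance + background[pos + len(motif):]
        dataset.append((seq, pos, instance))
    return dataset

if __name__ == "__main__":
    rng = random.Random(0)
    # 100 sequence sets for one (illustrative) mutation rate
    sets = [make_sequence_set("TTATCCACAA", 0.2, rng=rng) for _ in range(100)]
    print(len(sets), len(sets[0]), len(sets[0][0][0]))
```

For the motif-length experiment, the same function could be called with motifs of 8 to 18 nucleotides in place of TTATCCACAA; how those motifs were chosen is not stated here, so that choice is left open.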
- Results: (Figure 1)

Figure 1: Performance of MREC, MEME and Cosmo on simulated data generated using nine point-mutation rates. (a) The data set with different mutation rates. (b) The data set with different motif lengths.
Real biological data
- Procedure:
We also tested the performance of the three programs on real biological data. To do so, we collected sixteen promoter sequence sets from E. coli K12, containing the transcription factor binding sites of ArgR, CpxR, DnaA, Fur, GntR, LexA, NarP, NtrC, PhoB, PurR, FruR, Fnr, MetJ, CRP, TrpR and TyrR, respectively. The number of promoter sequences per dataset varies from 11 to 161, and each sequence is 200 to 300 nucleotides long, except in the TyrR dataset, where sequence lengths range from 200 to 452 nucleotides. We used the sequence profile for each set of binding motifs, retrieved from RegulonDB [19] and PRODORIC [20], as the ground truth when assessing prediction accuracy. For this test, we used t=5 and s=1 (see Seeding Step of the Methods) in MREC, and the default parameters in the other two programs. The predicted motif length is determined by the motif (or motif closure, for MREC) with the best p-value. For results, see Table 1 and Table 2.
- Results: (Table 1 and Table 2)

P-value comparison
- Procedure:
To demonstrate the effectiveness of our p-value calculation over previous methods, we also carried out a comparison with MEME and with CONSENSUS [6], which uses the traditional p-value calculation, in terms of their motif predictions on the same 16 sets of promoter sequences used above. Since MEME does not provide a p-value, we used csFFT [16], a popular p-value estimation program for motif finding, to calculate the p-value for the motif profiles given by MEME. We collected the values of -log(p-value) produced by these two programs and by MREC for different motif lengths. Figure 2 compares the predictions of the three programs on two datasets, ArgR and DnaA, which represent two different motif profile structures. Specifically, the ArgR binding motif profile is more conserved at the two ends, whereas the DnaA binding motif profile has a stable, high conservation across the full length of the motif.
- Results: (Figure 2)

Figure 2: Comparison between the p-values by MREC and the p-values by csFFT and CONSENSUS. The ArgR and DnaA datasets in E. coli are taken as examples. The pink dashed lines correspond to the correct motif length.
Page last updated: Oct. 06, 2009
