PRIME -- PaRtition of Ion types of tandem Mass spectra

Version 0.9 (February 28, 2005)

Version 1.0 (coming soon)

 

Copyright (C) 2005 CSBL/University of Georgia

All Rights Reserved


 

1. INTRODUCTION

 

PRIME is a mass spectrum data mining tool for de novo peptide sequencing and identification of protein post-translational modifications. In its early stage, PRIME 0.9 employs a novel graph-theoretic approach to separate b and y ions in a tandem mass spectrum. Unlike other graph-based approaches, PRIME considers two types of edges: a type-1 edge connects a pair of peaks suspected to be of the same ion type, and a type-2 edge connects a pair of peaks suspected to be of different ion types. This rests on two observations: the mass difference between any two ions of the same type must equal the mass of some combination of amino acids, whereas a mass difference that is not equal to the mass of any amino acid combination must arise from ions of different types. Edge weights are assigned based on the estimated probabilities that the edges truly connect ions of the same type or of different types. If one imagines type-1 edges as carrying an attractive force and type-2 edges a repulsive force, vertices of the same ion type in a spectrum graph should naturally cluster together, whereas vertices of different ion types repel each other. Noise peaks can be disconnected, or attached to b or y ions by false type-1 edges. The b and y ions can then be separated by optimally cutting all the type-2 edges and the false type-1 edges.
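The two edge types described above can be illustrated with a minimal Python sketch. It is not PRIME's implementation: it only matches a mass difference against single library entries (not combinations), and the rule that an unmatched difference below EDGE2_MASS yields a type-2 edge is an assumption made here for illustration. The function name classify_pair is hypothetical; the masses are taken from the residue.lib excerpt in this README.

```python
# Absolute monoisotopic masses of a few pseudo amino acids, copied from the
# residue.lib excerpt in this README (subset only).
RESIDUE_MASSES = {
    "U": 17.00274, "X": 18.01057, "G": 57.02146, "A": 71.03711,
    "S": 87.03203, "P": 97.05276, "V": 99.06841, "T": 101.04768,
    "C": 103.00919, "L": 113.08406, "E": 129.04259,
}


def classify_pair(mass_a, mass_b, delta_mass=0.05, edge2_mass=30.0):
    """Return 'type-1', 'type-2', or None for a pair of neutral peak masses.

    Simplification: only single library entries are tested, not sums.
    Assumption: an unmatched difference below edge2_mass gives a type-2 edge.
    """
    diff = abs(mass_a - mass_b)
    for mass in RESIDUE_MASSES.values():
        if abs(diff - mass) <= delta_mass:
            return "type-1"  # difference matches a library mass
    if diff < edge2_mass:
        return "type-2"  # too small to be any residue, and no match
    return None  # no edge in this simplified sketch


if __name__ == "__main__":
    print(classify_pair(424.2110, 553.2572))  # y3 and y4 differ by about E
    print(classify_pair(572.2670, 590.2780))  # b6-18 and b6 differ by about X
    print(classify_pair(437.6440, 446.2210))  # about 8.6 Da, no match
```

With the example peaks from section 2.1, the y3/y4 and b6-18/b6 pairs are linked by type-1 edges in this sketch, while the pair separated by about 8.6 Da is linked by a type-2 edge.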

 

PRIME attempts to separate a set of MS/MS peaks into (a) b ions and their chemical variants (i.e., loss of water or ammonia), (b) y ions and their variants, and (c) all other ion types (which are treated as noise). A dynamic programming algorithm was developed to guarantee the globally optimal solution. Its worst-case running time depends on L, the depth of a breadth-first tree (BFT) of the spectrum graph, and on |Si|, the number of vertices on the i-th level of the BFT, where i is the distance from the root. For a spectrum graph, |Si| is generally small. However, the computation can be very time consuming when the graph is densely connected or the number of mass peaks is large (>=80). In such cases, a local optimal strategy is used to speed up the calculation at the cost of separation accuracy. The resulting computing time is roughly proportional to the number of mass peaks.
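To make the quantities L and |Si| concrete, the following Python sketch computes the level sizes of a breadth-first tree for a small graph given as an adjacency dictionary. It only illustrates the definitions; the function name bft_level_sizes and the toy graph are hypothetical and are not part of PRIME.

```python
from collections import deque


def bft_level_sizes(adjacency, root):
    """Return [|S0|, |S1|, ...], the vertex count on each BFT level."""
    depth = {root: 0}
    sizes = [1]  # level 0 holds only the root
    queue = deque([root])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[vertex] + 1
                if depth[neighbor] == len(sizes):
                    sizes.append(0)  # first vertex seen on a new level
                sizes[depth[neighbor]] += 1
                queue.append(neighbor)
    return sizes


if __name__ == "__main__":
    # Toy undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    sizes = bft_level_sizes(graph, root=0)
    print("level sizes:", sizes)
    print("depth L:", len(sizes) - 1)
```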

 

 

2. USAGE

 

            prime <mass_peak_file> [<param_file> [<seed>]]

 

2.1 INPUT

-  mass_peak_file
A plain text file that contains the peptide molecular weight on the first row, followed by one row per mass peak giving its neutral mass and intensity, for example:

 

1566.735                     # peptide molecular weight

424.2110                     20.85   y3 #neutral mass, intensity, ion-type

437.6440                     17.65   u

446.2210                     12.85   b4

553.2572                     42.73   y4

572.2670                     7.031   b6-18  

590.2780                     6.199   b6

...

The ion-type annotation in the third column is used for evaluation purposes only. For an uninterpreted peptide, leave it blank. Note that the mass peaks must be listed in increasing order of ion mass, and that ions whose mass exceeds the peptide parent mass are ignored.

 

NOTE: we assume that the input data has already been preprocessed so that

(1) all the masses are neutral

(2) all the isotopic peaks have been removed and only the monoisotopic masses remain
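As an illustration of the file layout described above, here is a small Python reader for text in the mass_peak_file format. It is a sketch, not PRIME's own parser: it assumes that everything after a '#' is a comment (as in the example rows), and the function name parse_peak_text is hypothetical.

```python
def parse_peak_text(text):
    """Parse mass_peak_file text into (parent_mass, [(mass, intensity, ion)])."""
    parent_mass = None
    peaks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if parent_mass is None:
            parent_mass = float(fields[0])  # first data row: peptide mass
            continue
        mass, intensity = float(fields[0]), float(fields[1])
        ion_type = fields[2] if len(fields) > 2 else ""  # optional column
        if mass > parent_mass:
            continue  # peaks heavier than the parent mass are ignored
        peaks.append((mass, intensity, ion_type))
    masses = [peak[0] for peak in peaks]
    if masses != sorted(masses):
        raise ValueError("peaks must be listed in increasing order of mass")
    return parent_mass, peaks


if __name__ == "__main__":
    example = """1566.735   # peptide molecular weight
424.2110   20.85   y3
437.6440   17.65   u
446.2210   12.85
1600.0000  5.00
"""
    parent, peaks = parse_peak_text(example)
    print(parent, peaks)
```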

 

-  param_file
A plain text file that contains the parameter set, for example:

 

MASS_TYPE = monoisotopic            #monoisotopic mass

DELTA_MASS = 0.05                       #Dalton, mass threshold used for connecting type-1 edge

EDGE2_MASS = 30.0                       #Dalton, mass threshold used for connecting type-2 edge

EDGE2_WEIGHT = 20          #weight for type-2 edge

AA_LIB = residue.lib             #path and filename for pseudo amino acid library

MASS_FUNC = mass.func      #mass function file

MAX_COMPLEXITY = 18.0   #maximal computational complexity allowed for dynamic programming optimization; if the estimated value for a given spectrum graph exceeds it, the local optimal search method is run instead to save computing time.

LOCAL_SEARCH = 1000      #the iteration number of local search
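The KEY = value layout shown above is simple to read programmatically. The following Python sketch collects the settings into a dictionary of strings; it assumes that '#' starts a comment and that each setting sits on one line, and the function name parse_params is hypothetical.

```python
def parse_params(text):
    """Parse 'KEY = value  #comment' lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    example = """MASS_TYPE = monoisotopic   #monoisotopic mass
DELTA_MASS = 0.05          #Dalton
EDGE2_MASS = 30.0          #Dalton
AA_LIB = residue.lib       #pseudo amino acid library
"""
    params = parse_params(example)
    print(params["MASS_TYPE"], float(params["DELTA_MASS"]))
```

Values are kept as strings so that the caller decides which ones to convert to numbers.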

 

-  residue.lib
A plain text file that contains the information on the pseudo amino acids, for example:

 

:a1  a3  type monoiso_ms   av_mass   freq

U    NH3 2   -17.00274  -17.0073   10

X    H2O 2   -18.01057  -18.0153   20

G    GLY 1   57.02146   57.05192   5.07

A    ALA 1   71.03711   71.0788    5.61

S    SER 1   87.03203   87.0782    8.94

P    PRO 1   97.05276   97.11668   4.34

p    PHO 2   -97.9769   -97.9952   1

V    VAL 1   99.06841   99.13256   5.64

T    THR 1   101.04768  101.10508  5.86

C    CYS 1   103.00919  103.1448   1.29

L    L/I 1   113.08406  113.15944  16.06

N    ASN 1   114.04293  114.10384  6.08

D    ASP 1   115.02694  115.0886   5.82

Q    GLN 1   128.05858  128.13072  3.92

K    LYS 1   128.09496  128.17408  7.30

E    GLU 1   129.04259  129.11548  6.54

M    MET 1   131.04049  131.19856  2.09

H    HIS 1   137.05891  137.14108  2.13

F    PHE 1   147.06841  147.17656  4.49

R    ARG 1   156.10111  156.18748  4.43

Y    TYR 1   163.06333  163.17596  3.35

s    pS  4   166.99836  167.0581   1

t    pT  4   181.01401  181.08498  1

W    TRP 1   186.07931  186.2132   1.05

y    pY  4   243.02966  243.15586  1

...

where the first column is the short name of the pseudo amino acid; the second column is its long annotation; the third is the residue type (1: common, 2: neutral mass loss, 4: PTM); the fourth and fifth columns are its monoisotopic and average masses (a negative value indicates a neutral mass loss); and the sixth column is the frequency of amino acid occurrence, derived from the yeast genome. Entries whose frequencies are difficult to estimate (neutral mass losses or PTMs) are assigned 1.

 

NOTE: (1) Any pseudo amino acids of interest can be added to the list, including neutral mass losses (e.g., U for loss of ammonia, X for loss of water, p for loss of HPO3) and post-translational modifications (e.g., pS for phosphoserine, pY for phosphotyrosine). (2) The pseudo amino acids must be listed in increasing order of absolute mass.
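Before running PRIME with a customized library, it may help to check the ordering rule in note (2). The Python sketch below reads text in the residue.lib layout and tests whether the entries are in increasing order of absolute monoisotopic mass. It assumes that a line starting with ':' is the header; the names Residue, parse_residue_lib, and is_sorted_by_abs_mass are hypothetical.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    short: str      # one-letter name
    long: str       # long annotation
    kind: int       # 1: common, 2: neutral mass loss, 4: PTM
    mono: float     # monoisotopic mass
    average: float  # average mass
    freq: float     # occurrence frequency


def parse_residue_lib(text):
    """Parse residue.lib text into a list of Residue records."""
    residues = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(":"):  # assumption: ':' marks header
            continue
        a1, a3, kind, mono, average, freq = line.split()
        residues.append(
            Residue(a1, a3, int(kind), float(mono), float(average), float(freq))
        )
    return residues


def is_sorted_by_abs_mass(residues):
    """True if entries are in increasing order of |monoisotopic mass|."""
    masses = [abs(r.mono) for r in residues]
    return all(a <= b for a, b in zip(masses, masses[1:]))


if __name__ == "__main__":
    example = """:a1  a3  type monoiso_ms   av_mass   freq
U    NH3 2   -17.00274  -17.0073   10
X    H2O 2   -18.01057  -18.0153   20
G    GLY 1   57.02146   57.05192   5.07
p    PHO 2   -97.9769   -97.9952   1
V    VAL 1   99.06841   99.13256   5.64
"""
    print(is_sorted_by_abs_mass(parse_residue_lib(example)))
```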

 

-  mass.func
A conditional probability profile giving the probability that two ions are of the same type at a given mass difference. The values were derived from statistical analysis of simulated tandem mass spectra of tryptic peptides with a mass range of [800 Da, 4000 Da], digested from proteins in the yeast genome. The first column is the mass difference; the second is the probability that this mass difference arises from ions of the same type; the third and fourth are the counts of this mass difference occurring between ions of the same type and of different types, respectively.

 

#mass   prob  same_type   diff_type

...

56.7    0.00   0   9

56.8    0.00   0   214

56.9    0.00   0   6058

57.0    0.89   234494 28465

57.1    0.00   0   21521

57.2    0.00   0   1506

57.3    0.00   0   49

57.4    0.00   0   2

57.5    0.00   0   0

57.6    0.00   0   0

57.7    0.00   0   15

57.8    0.00   0   295

57.9    0.00   0   5901

58.0    0.00   0   28507

58.1    0.00   0   17410

58.2    0.00   0   1587

...
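The row for 57.0 Da above is consistent with the probability being the fraction of same-type counts: 234494 / (234494 + 28465) is about 0.89. The Python sketch below reads rows in this layout and recomputes that fraction. Whether PRIME derives the second column exactly this way is an assumption; the function names parse_mass_func and same_type_fraction are hypothetical.

```python
def parse_mass_func(text):
    """Parse mass.func rows into {mass_diff: (prob, same_count, diff_count)}."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "...":
            continue  # skip blanks, the header, and elision markers
        mass, prob, same, diff = line.split()
        table[round(float(mass), 1)] = (float(prob), int(same), int(diff))
    return table


def same_type_fraction(same_count, diff_count):
    """Fraction of observations in which both ions had the same type."""
    total = same_count + diff_count
    return same_count / total if total else 0.0


if __name__ == "__main__":
    example = """#mass   prob  same_type   diff_type
56.9    0.00   0   6058
57.0    0.89   234494 28465
57.1    0.00   0   21521
"""
    table = parse_mass_func(example)
    prob, same, diff = table[57.0]
    print(prob, round(same_type_fraction(same, diff), 2))
```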

 

-  seed
An unsigned long integer used as the seed for the local optimal search. The default value is the current system time.

 

2.2 OUTPUT

-  mass_peak_file.out
A plain text file that contains the partitioning results: each mass peak is assigned to B (b ion), Y (y ion), or U (any other ion type). If the file reports a large number of false type-1 edges, the result may be unreliable.

 

-  mass_peak_file.net
This file contains the spectrum graph representation, which can be viewed and plotted with the shared software Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). The b and y ions are represented by red circles and blue squares, respectively. Noise peaks are denoted by black circles. Closed symbols represent ions observed in the experimental spectrum, while open symbols denote complementary ions that were added back. Type-1 edges are plotted in red (b ion chains) and blue (y ion chains), while type-2 edges are drawn as thick black lines. Thick teal lines represent false type-1 edges, and thick dashed lines represent discarded type-2 edges.

 

-  mass_peak_file.score
If the local optimal search is executed, this file is generated to report the distribution of scores. Based on it, one may be able to judge whether the global solution has been reached.

 

3. CONTACT INFORMATION

 

Please send bug reports or suggestions to Dr. Bo Yan at byn@csbl.bmb.uga.edu, or by mail to the following address:

 

Bo Yan, Ph.D.

Dept. of Biochemistry & Molecular Biology

Davison Life Sciences Complex, Room A110

120 Green Street

University of Georgia

Athens, GA 30602-7229

 

Any feedback will be appreciated.