PRIME -- PaRtition of Ion types of tandem Mass spectra
Version 0.9 (February 28, 2005)
Version 1.0 (coming soon)
Copyright (C) 2005
CSBL/University of
All Rights Reserved
1. INTRODUCTION
PRIME is a mass spectrum data mining tool for peptide de novo sequencing and identification of protein post-translational modifications. At this early stage, PRIME 0.9 employs a novel graph-theoretic approach to separate b and y ions in a tandem mass spectrum. Unlike other graph-based approaches, PRIME considers two types of edges: a type-1 edge connects a pair of peaks suspected to be of the same ion type, and a type-2 edge connects a pair of peaks suspected to be of different ion types. This rests on two observations: the mass difference between any two ions of the same type must equal the total mass of some combination of amino acids, whereas a mass difference that matches no such combination must arise from ions of different types. Edge weights are assigned from the estimated probabilities that the edges truly connect ions of the same or of different types. If one imagines type-1 edges carrying an attractive force and type-2 edges a repulsive force, vertices of the same ion type in a spectrum graph should naturally cluster together, whereas vertices of different ion types repel each other. Noise peaks can be disconnected, or attached to b or y ions by false type-1 edges. The b and y ions can then be separated by optimally cutting all the type-2 edges and the false type-1 edges.
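The distinction between the two edge types can be illustrated with a minimal Python sketch. This is not PRIME's implementation: the function name classify_edge, the default tolerance, and the limit of two residues per combination are assumptions made for illustration, and only three residue masses from the residue.lib excerpt are used.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses copied from the residue.lib excerpt (subset only).
RESIDUE_MASSES = {"G": 57.02146, "P": 97.05276, "V": 99.06841}


def classify_edge(mass_a, mass_b, tol=0.05, max_residues=2):
    """Return 1 (type-1 edge) if the mass difference matches the summed mass
    of up to max_residues residues within tol, otherwise 2 (type-2 edge).

    tol and max_residues are illustrative assumptions, not PRIME parameters.
    """
    diff = abs(mass_a - mass_b)
    for n in range(1, max_residues + 1):
        for combo in combinations_with_replacement(RESIDUE_MASSES.values(), n):
            if abs(diff - sum(combo)) <= tol:
                return 1
    return 2
```

With these assumptions, two peaks that differ by the mass of one glycine residue would be joined by a type-1 edge, while a difference of 30.0 matches no combination and would give a type-2 edge.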
PRIME attempts to separate a set of MS/MS peaks into (a) b ions and their chemical
variants (i.e., loss of water or ammonia), (b) y ions and their variants, and
(c) all other ion types (which are treated as noise). A dynamic programming
algorithm is developed to guarantee the globally optimal solution. Its worst-case
running time is determined by L and |Si|, where i is the distance from the
root in a breadth-first tree (BFT) of the spectrum graph, L is the depth of the
BFT, and |Si| is the number of vertices on the i-th level of the BFT. For a
spectrum graph, |Si| is generally small. However, the computation can be very
time consuming when the graph is densely connected or the number of mass peaks
is large (>=80). In such cases, a local optimal search strategy is used to speed
up the calculation at the cost of separation accuracy. The resulting computing
time is roughly proportional to the number of mass peaks.
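The quantities L and |Si| can be computed for any graph with a plain breadth-first traversal. The sketch below assumes the spectrum graph is given as an adjacency dictionary; the function name bft_level_sizes and the small example graph are hypothetical and are not taken from PRIME.

```python
from collections import deque


def bft_level_sizes(adjacency, root):
    """Return [|S_0|, |S_1|, ...], the vertex count on each breadth-first level.

    adjacency maps a vertex to an iterable of its neighbours.
    Only vertices reachable from root are counted.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, ()):
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
    sizes = [0] * (max(depth.values()) + 1)
    for d in depth.values():
        sizes[d] += 1
    return sizes


# Hypothetical five-vertex graph used only to exercise the function.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
sizes = bft_level_sizes(graph, 0)
bft_depth = len(sizes) - 1  # L, the depth of the breadth-first tree
```

Small level sizes correspond to the favourable case described above; large ones indicate a densely connected graph.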
2. USAGE
prime <mass_peak_file> [<param_file> [<seed>]]
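When PRIME is driven from a script, the nesting of the optional arguments matters: a seed can only be supplied when a parameter file is also given. The helper below is a hypothetical Python sketch that builds the argument list accordingly. The name prime_command is not part of PRIME, and the sketch assumes the executable is called prime and is on the search path.

```python
def prime_command(mass_peak_file, param_file=None, seed=None):
    """Build the argument list for: prime <mass_peak_file> [<param_file> [<seed>]]."""
    if seed is not None and param_file is None:
        raise ValueError("a seed can only be given together with a param_file")
    argv = ["prime", mass_peak_file]
    if param_file is not None:
        argv.append(param_file)
        if seed is not None:
            argv.append(str(seed))
    return argv
```

The resulting list could be passed to subprocess.run(argv, check=True).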
2.1 INPUT
- mass_peak_file
It is a plain text file that contains the peptide molecular weight in the
first row and the neutral mass and intensity of each mass peak in the
following rows, for example:
1566.735 # peptide molecular weight
424.2110 20.85 y3 #neutral mass, intensity, ion-type
437.6440 17.65 u
446.2210 12.85 b4
553.2572 42.73 y4
572.2670 7.031 b6-18
590.2780 6.199 b6
...
The ion-type annotation in the third column is used for evaluation only. For an uninterpreted peptide, just leave it blank. Note that the mass peaks must be listed in increasing order of mass, and peaks whose mass exceeds the peptide parent mass are ignored.
NOTE: we assume that the input data has already been preprocessed such that
(1) all the masses are neutral
(2) all the isotopic peaks have been removed and only the monoisotopic masses remain
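The file layout described above can be read with a few lines of Python. This is a minimal sketch rather than PRIME's own reader: the function name parse_mass_peaks is hypothetical, and treating everything after "#" as a comment is an assumption based on the example.

```python
def parse_mass_peaks(text):
    """Parse mass_peak_file content into (parent_mass, peaks).

    peaks is a list of (neutral_mass, intensity, ion_type) tuples;
    ion_type is "" when the third column is left blank.
    """
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line:
            rows.append(line.split())
    parent_mass = float(rows[0][0])
    peaks = []
    for fields in rows[1:]:
        mass, intensity = float(fields[0]), float(fields[1])
        label = fields[2] if len(fields) > 2 else ""
        if mass <= parent_mass:  # peaks heavier than the parent mass are ignored
            peaks.append((mass, intensity, label))
    if any(a[0] > b[0] for a, b in zip(peaks, peaks[1:])):
        raise ValueError("mass peaks must be listed in increasing order of mass")
    return parent_mass, peaks
```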
- param_file
It is a plain text file that contains the parameter set, for example:
MASS_TYPE = monoisotopic #monoisotopic mass
DELTA_MASS = 0.05 #
EDGE2_MASS = 30.0 #
EDGE2_WEIGHT = 20 #weight for type-2 edge
AA_LIB = residue.lib #path and filename for pseudo amino acid library
MASS_FUNC = mass.func #mass function file
MAX_COMPLEXITY = 18.0 #maximum computational complexity allowed for dynamic programming optimization; if the estimated value for a given spectrum graph exceeds it, the local optimal search method is run instead to save computing time.
LOCAL_SEARCH = 1000 #number of local search iterations
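A parameter file in this KEY = value form is straightforward to load for inspection or validation before a run. The following Python sketch is an illustration only: the function name parse_params and the rule for converting values to numbers are assumptions and do not describe how PRIME itself reads the file.

```python
def parse_params(text):
    """Parse 'KEY = value #comment' lines into a dictionary.

    Numeric values become int or float; everything else stays a string.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        try:
            params[key] = float(value) if "." in value else int(value)
        except ValueError:
            params[key] = value  # e.g. file names such as residue.lib
    return params
```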
- residue.lib
It is a plain text file that contains the information on pseudo amino acids,
for example:
:a1 a3 type monoiso_ms av_mass freq
U NH3 2 -17.00274 -17.0073 10
X H2O 2 -18.01057 -18.0153 20
G GLY 1 57.02146 57.05192 5.07
A
P PRO 1 97.05276 97.11668 4.34
p PHO 2 -97.9769 -97.9952 1
V VAL 1 99.06841 99.13256 5.64
T THR 1 101.04768 101.10508 5.86
C CYS 1 103.00919 103.1448 1.29
L L/I 1 113.08406 113.15944 16.06
D ASP 1 115.02694 115.0886 5.82
Q GLN 1 128.05858 128.13072 3.92
K
M MET 1 131.04049 131.19856 2.09
H HIS 1 137.05891 137.14108 2.13
F PHE 1 147.06841 147.17656 4.49
R ARG 1 156.10111 156.18748 4.43
Y TYR 1 163.06333 163.17596 3.35
s pS 4 166.99836 167.0581 1
t pT 4 181.01401 181.08498 1
y pY 4 243.02966 243.15586 1
...
where the first column is the short name of the pseudo amino acid; the second column is its long annotation; the third is the residue type (1: common, 2: neutral mass loss, 4: PTM); the fourth and fifth columns are its monoisotopic and average masses (note: a negative value means that it is a neutral mass loss); and the sixth column is the frequency of amino acid occurrence, derived from the yeast genome. For entries whose frequencies are difficult to estimate (neutral mass losses or PTMs), 1 is assigned.
NOTE: (1) One can put any pseudo amino acids of interest in the list, including neutral mass losses, such as U for loss of ammonia, X for loss of water, and p for loss of HPO3, and post-translational modifications, such as pS for phosphoserine and pY for phosphotyrosine. (2) The pseudo amino acids should be listed in increasing order of absolute mass.
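The six-column layout and the ordering rule in note (2) can be checked with a short Python sketch. The names Residue, parse_residue_lib and is_sorted_by_abs_mass are hypothetical, and skipping the line that starts with ":" as a header is an assumption based on the example; this is not PRIME's own loader.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    short: str        # one-letter short name
    long: str         # long annotation
    kind: int         # 1: common, 2: neutral mass loss, 4: PTM
    mono_mass: float  # monoisotopic mass (negative for a neutral loss)
    avg_mass: float   # average mass
    freq: float       # frequency of occurrence


def parse_residue_lib(text):
    """Parse complete six-column rows; header and incomplete rows are skipped."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0].startswith(":"):
            continue
        short, long_name, kind, mono, avg, freq = fields
        residues.append(
            Residue(short, long_name, int(kind), float(mono), float(avg), float(freq))
        )
    return residues


def is_sorted_by_abs_mass(residues):
    """Check that entries appear in increasing order of absolute monoisotopic mass."""
    masses = [abs(r.mono_mass) for r in residues]
    return masses == sorted(masses)
```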
- mass.func
It is a conditional probability profile giving the probability that two ions
are of the same type at a given mass difference. The values were derived from a
statistical analysis of simulated tandem mass spectra of tryptic peptides with
masses in the range [800 Da, 4000 Da], digested from proteins in the yeast
genome. The first column is the mass difference; the second is the probability
that this mass difference arises from ions of the same type; the third and
fourth are the counts of this mass difference occurring between ions of the
same type and of different types, respectively.
#mass prob same_type diff_type
...
56.7 0.00 0 9
56.8 0.00 0 214
56.9 0.00 0 6058
57.0 0.89 234494 28465
57.1 0.00 0 21521
57.2 0.00 0 1506
57.3 0.00 0 49
57.4 0.00 0 2
57.5 0.00 0 0
57.6 0.00 0 0
57.7 0.00 0 15
57.8 0.00 0 295
57.9 0.00 0 5901
58.0 0.00 0 28507
58.1 0.00 0 17410
58.2 0.00 0 1587
...
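In the excerpt above, the probability in the 57.0 row agrees, to two decimals, with the fraction same_type / (same_type + diff_type). The Python sketch below parses the table and recomputes that fraction. Whether PRIME derives the prob column in exactly this way is an assumption, and the function names are hypothetical.

```python
def parse_mass_func(text):
    """Map each mass difference to (prob, same_type_count, diff_type_count)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the '#' header, and '...' elision markers.
        if not line or line.startswith("#") or line.startswith("..."):
            continue
        mass, prob, same, diff = line.split()
        table[float(mass)] = (float(prob), int(same), int(diff))
    return table


def same_type_fraction(same, diff):
    """Fraction of observations in which both ions are of the same type."""
    total = same + diff
    return same / total if total else 0.0
```

For the 57.0 row, same_type_fraction(234494, 28465) rounds to the tabulated 0.89.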
- seed
It is an unsigned long integer used as the seed for the local optimal search.
The default value is the current system time.
2.2 OUTPUT
- mass_peak_file.out
A plain text file that contains the partitioning results: each mass peak is
assigned to B (b ion), Y (y ion) or U (any other ion type). If it reports a
large number of false type-1 edges, the result may be unreliable.
- mass_peak_file.net
This file contains the spectrum graph representation, which can be viewed and
plotted with the program Pajek
(http://vlado.fmf.uni-lj.si/pub/networks/pajek/).
The b and y ions are represented by red circles and blue squares, respectively.
Noise peaks are denoted by black circles. Closed symbols represent ions
observed in the experimental spectrum, while open symbols denote complementary
ions that have been added back. Type-1 edges are plotted in red (b ion chains)
and blue (y ion chains), while type-2 edges are drawn as thick black lines.
Thick teal lines represent false type-1 edges, and thick dashed lines represent
discarded type-2 edges.
- mass_peak_file.score
If the local optimal search is executed, this file is generated to report
the distribution of scores. Based on it, one may be able to judge whether the
global optimum has been reached.
3. CONTACT INFORMATION
Please send bug reports or suggestions to Dr. Bo Yan at byn@csbl.bmb.uga.edu, or by mail to the following address:
Bo Yan, Ph.D.
Dept. of Biochemistry & Molecular Biology
Davison Life Sciences Complex, Room A110
Any feedback will be appreciated.