Sumoylation, a reversible post-translational modification of proteins by small ubiquitin-related modifiers (SUMO), is crucial to a variety of biological processes, including transcription (Gregoire S & Yang XJ, 2005; Girdwood DW et al, 2004), mRNA metabolism (Li T et al, 2004), signal transduction (Liang M et al, 2004), nuclear transport (Kretz-Remy C et al, 1999; Pichler A et al, 2002), stress response (Huang TT et al, 2003), apoptosis (Besnault-Mascard L et al, 2005) and perception of sound (Zhou F et al, 2005). SUMO proteins are highly conserved among eukaryotes and comprise four members in mammals: SUMO-1, SUMO-2, SUMO-3, and SUMO-4 (Hay RT et al, 2005). Budding yeast has only one SUMO gene, SMT3, whereas plants contain at least eight SUMO paralogs (Kurepa J et al, 2003).

    Sumoylation has been reported to play essential roles in various diseases and disorders. In Drosophila, sumoylation stabilizes Huntingtin (Htt), the pathogenic protein of Huntington's disease (HD), preventing Htt ubiquitination and degradation and exacerbating neurodegeneration (Steffan JS et al, 2004). Sumoylation has also been proposed to play a critical role in amyloid precursor protein (APP) amyloidogenesis through an unknown mechanism (Li Y et al, 2003): enhanced protein sumoylation by overexpression of SUMO-3 reduces the generation of amyloid beta peptide (Abeta) (Li Y et al, 2003). Abnormal sumoylation is also associated with the pathogenesis of several other diseases, such as type 1 diabetes (T1D) (Li M et al, 2005) and Parkinson's disease (PD) (Shinbo Y et al, 2005). In addition, SUMO proteins can modify several viral proteins, such as the herpes simplex virus (HSV) ICP0 protein and the cytomegalovirus (CMV) IE1 protein (Muller S et al, 1999), and have potential functions in some cancers (Alarcon-Vargas et al, 2002; Fu C et al, 2005).

    Recently, proteomic analysis of sumoylation substrates has emerged as an appealing topic and a great challenge. Although several proteome-scale analyses have been performed to delineate potential sumoylated substrates, the exact sumoylation sites remain to be identified. The sumoylation process is dynamic, and not all real SUMO substrates are sumoylated in vivo simultaneously: only a small fraction of a given substrate, often <1%, is sumoylated at any given time (Johnson ES et al, 2004). In silico identification of SUMO substrates and their respective sites is therefore fundamental for understanding the mechanisms of sumoylation-related regulation in eukaryotic cells, and can suggest potential candidates for further drug design.

    The majority of sumoylation sites follow the consensus motif ψ-K-X-E (Hay RT et al, 2005; Johnson ES et al, 2004) or ψ-K-X-E/D (Melchior F et al, 2003; Denison C et al, 2004), where ψ is a hydrophobic amino acid, K is the modified lysine, and X is any residue. However, accumulating experimental data show that ~23% of real sumoylation sites do not follow the standard motif. Moreover, although a nuclear localization signal (NLS) and a consensus motif confer the ability to be sumoylated, some real SUMO substrates are not localized to the nucleus. For example, the protein DRP1 (dynamin related protein) is localized to mitochondria and is sumoylated during mitochondrial fission (Harder Z et al, 2004).
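
    To make the consensus motif concrete, the following is a minimal sketch of a motif scan over a protein sequence. It is not the method used in this work, and by construction it cannot find the sites that deviate from the motif. The function name `find_consensus_lysines` and the residue set chosen for ψ (A, I, L, M, F, V, W) are illustrative assumptions; the text above only requires ψ to be hydrophobic.

```python
import re

# Assumption: an illustrative set of hydrophobic residues standing in for psi.
# The source only states that psi is "a hydrophobic amino acid".
HYDROPHOBIC = "AILMFVW"


def find_consensus_lysines(sequence, allow_aspartate=True):
    """Return 1-based positions of lysines that match psi-K-X-E (or psi-K-X-E/D).

    sequence        -- protein sequence in one-letter amino acid code
    allow_aspartate -- if True, accept D as well as E at the last position
    """
    acidic = "ED" if allow_aspartate else "E"
    # A zero-width lookahead lets overlapping motif instances all be reported.
    pattern = re.compile(rf"(?=[{HYDROPHOBIC}]K[A-Z][{acidic}])")
    # m.start() is the 0-based index of psi, so the lysine that follows it
    # sits at 1-based position m.start() + 2.
    return [m.start() + 2 for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    print(find_consensus_lysines("MALKDEVKTEPQ"))
```

    A scan of this kind reports every lysine that fits the pattern, whether or not it is modified in vivo, which is one reason a plain motif match is not sufficient as a predictor.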

    These unexpected features complicate analysis of the sumoylation proteome. To date, fewer than 200 unambiguous sumoylation sites in ~120 substrates have been reported in the scientific literature, so dissecting the full content of the SUMO proteome remains an open problem. Using mass spectrometry (MS) approaches, several large-scale experimental identifications of sumoylated proteins have been carried out in budding yeast (Denison C et al, 2004; Hannich JT et al, 2005; Zhou W et al, 2004; Wohlschlegel JA et al, 2004; Wykoff DD et al, 2005; Panse VG et al, 2004) and human (Li T et al, 2004; Zhao Y et al, 2004; Manza LL et al, 2004; Vertegaal AC et al, 2004; Gocke CB et al, 2005; Rosas-Acosta G et al, 2005), yielding ~560 and ~300 unique potential candidates, respectively. These proteome-wide studies provided great insight into the functional diversity of sumoylated proteins, although the exact sumoylation sites remain to be identified.

    In this work, we first manually curated 239 experimentally verified sumoylation sites from the scientific literature. We then applied two available strategies, GPS (Server) and MotifX (Server), to this data set. The final prediction system, SUMOsp, is based on "GPS or MotifX", and it shows satisfactory sensitivity (89.11%) and specificity (80.08%) at a cut-off score of 4 under jackknife validation. This combination of sensitivity and specificity makes SUMOsp a powerful tool for in silico prediction of sumoylation sites. For the convenience of experimentalists, we implemented an easy-to-use web server for SUMOsp, which can be freely accessed at: http://bioinformatics.lcd-ustc.org/sumosp/ .
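
    As an illustration of how the reported metrics are defined, the sketch below combines two sets of per-site calls with a logical OR, mirroring the "GPS or MotifX" rule, and computes sensitivity and specificity from the resulting confusion counts. It does not reproduce the GPS or MotifX scoring, the cut-off, or the jackknife procedure; the function names and the toy inputs are hypothetical.

```python
def combine_or(calls_a, calls_b):
    """Call a site positive if either predictor calls it positive."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both predictors must score the same sites")
    return [bool(a) or bool(b) for a, b in zip(calls_a, calls_b)]


def sensitivity_specificity(labels, calls):
    """Return (sensitivity, specificity) for boolean labels and calls.

    sensitivity = TP / (TP + FN): fraction of real sites that are predicted
    specificity = TN / (TN + FP): fraction of non-sites that are rejected
    """
    if len(labels) != len(calls):
        raise ValueError("labels and calls must have the same length")
    tp = sum(1 for y, p in zip(labels, calls) if y and p)
    fn = sum(1 for y, p in zip(labels, calls) if y and not p)
    tn = sum(1 for y, p in zip(labels, calls) if not y and not p)
    fp = sum(1 for y, p in zip(labels, calls) if not y and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative site")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical toy data: True marks a verified site or a positive call.
    labels = [True, True, True, False, False, False, False]
    calls_a = [True, False, False, False, True, False, False]
    calls_b = [False, True, False, False, False, False, False]
    combined = combine_or(calls_a, calls_b)
    print(sensitivity_specificity(labels, combined))
```

    An OR combination can only add positive calls relative to either predictor alone, so it tends to raise sensitivity at some cost in specificity.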