¡¡

Ph.D. Thesis

¡¡

FU Yan, Directed by Prof. GAO Wen

Institute of Computing Technology &
Graduate University, Chinese
Academy of Sciences

¡¡

**ABSTRACT**

In information retrieval systems such as biological sequence search engines, the
retrieval functions (also referred to as scoring functions or ranking functions
sometimes) that list the search results in the order of their relevance to the
query are one of the most important components. The design of retrieval
functions can be carried out on two levels, i.e., the domain-dependent
construction of basic relevance measures and the relatively domain-independent
construction of the final retrieval function that combines multiple basic
relevance measures into a single one. In this thesis, two important
bioinformation retrieval problems, i.e., the protein sequence identification
problem and the protein homology prediction problem, are studied on the above
two levels of retrieval function design using machine learning techniques.

Peptide and protein identification
via tandem mass spectrometry and database search is an important biological
sequence retrieval problem. A key ingredient of peptide and protein
identification software is the peptide scoring function (retrieval function)
that measures the likelihood of a candidate peptide producing the experimental
spectrum. In a peptide scoring function, the most basic operation is to match
fragment ions predicted from a candidate peptide to the mass peaks in the
experimental spectrum. Due to the imprecision of mass measurement, random
mismatches often occur. In this thesis, a more accurate mass match error model,
namely conditional normal model, is first proposed to improve the accuracy of
matching. This model is based on two important observations on the mass error
distribution, i.e. the linearity between the mean of mass error and the ion
mass, and the logarithmic linearity between the standard deviation of mass error
and the peak intensity. To the best of the author¡¯s knowledge, the latter
quantitative relationship has never been reported before. An iterative learning
algorithm is also proposed to accurately estimate the model parameters from
training data to characterize the mass error distribution of tandem mass
spectra. The thesis then presents a nonlinear peptide scoring function, namely
KSDP, which is a nonlinear extension to the commonly used peptide scoring
method, spectral dot product (SDP). The correlation among fragment ions in a
tandem mass spectrum is very helpful for reducing random mismatches. In KSDP,
localized kernel functions are used to emphasize the co- occurring matches of
correlated ions. Experiments show that KSDP can significantly improve the
peptide identification accuracy. The KSDP-based peptide and protein
identification software tool pFind considerably outperforms the SDP-based
popular commercial software SEQUEST in terms of identification accuracy on
several data sets. At the 1% false positive rate, pFind identifies 10% to 30%
more peptides than SEQUEST.

Due to the complexity of practical
retrieval problems, there are usually more than one basic relevance measures,
resulting in multiple-dimensional feature vectors. How to combine the multiple
relevance measures into a single one is the problem of retrieval function
construction. Learning a retrieval function from training data is a common and
effective strategy. In general, retrieval function learning is independent of
specific domains. In this class of machine learning problem, the feature vectors
of database items are computed based on queries and thus they are grouped into
blocks by queries. The block structure of data is a unique feature of retrieval
function learning problems. This thesis describes a series of approaches for
more accurate learning of retrieval functions based on the block structure.
These approaches range from the intra-block data normalization and block feature
expansion methods for solving the non-i.i.d. (independent and identically
distributed) problem, the block selection and support vector under-sampling
methods for reducing redundant data, and the K-nearest-block ensemble method for
designing query-adaptive retrieval functions. Experimental results with the
support vector machine (SVM) used as the benchmark learner demonstrate that all
of these block-based approaches significantly outperform the straightforward
application of SVMs. The intra-block data normalization and data reduction
methods have contributed to our original winning of the protein homology
prediction task in the ACM KDDCUP-2004 competition. The K-nearest-block ensemble
approach is even more attractive in both prediction accuracy and training speed.
It obtains the best prediction result at present on the protein homology
prediction task mentioned above.

¡¡

**Keywords:** bioinformatics;
information retrieval; machine learning; mass spectrometry; peptide
identification; protein homology prediction

¡¡