Ph.D. Thesis

Machine Learning Based Bioinformation Retrieval

FU Yan, Directed by Prof. GAO Wen

Institute of Computing Technology &
Graduate University, Chinese Academy of Sciences


        In information retrieval systems such as biological sequence search engines, the retrieval functions (also referred to as scoring functions or ranking functions sometimes) that list the search results in the order of their relevance to the query are one of the most important components. The design of retrieval functions can be carried out on two levels, i.e., the domain-dependent construction of basic relevance measures and the relatively domain-independent construction of the final retrieval function that combines multiple basic relevance measures into a single one. In this thesis, two important bioinformation retrieval problems, i.e., the protein sequence identification problem and the protein homology prediction problem, are studied on the above two levels of retrieval function design using machine learning techniques.

        Peptide and protein identification via tandem mass spectrometry and database search is an important biological sequence retrieval problem. A key ingredient of peptide and protein identification software is the peptide scoring function (retrieval function) that measures the likelihood of a candidate peptide producing the experimental spectrum. In a peptide scoring function, the most basic operation is to match fragment ions predicted from a candidate peptide to the mass peaks in the experimental spectrum. Due to the imprecision of mass measurement, random mismatches often occur. In this thesis, a more accurate mass match error model, namely conditional normal model, is first proposed to improve the accuracy of matching. This model is based on two important observations on the mass error distribution, i.e. the linearity between the mean of mass error and the ion mass, and the logarithmic linearity between the standard deviation of mass error and the peak intensity. To the best of the authors knowledge, the latter quantitative relationship has never been reported before. An iterative learning algorithm is also proposed to accurately estimate the model parameters from training data to characterize the mass error distribution of tandem mass spectra. The thesis then presents a nonlinear peptide scoring function, namely KSDP, which is a nonlinear extension to the commonly used peptide scoring method, spectral dot product (SDP). The correlation among fragment ions in a tandem mass spectrum is very helpful for reducing random mismatches. In KSDP, localized kernel functions are used to emphasize the co- occurring matches of correlated ions. Experiments show that KSDP can significantly improve the peptide identification accuracy. The KSDP-based peptide and protein identification software tool pFind considerably outperforms the SDP-based popular commercial software SEQUEST in terms of identification accuracy on several data sets. At the 1% false positive rate, pFind identifies 10% to 30% more peptides than SEQUEST.

        Due to the complexity of practical retrieval problems, there are usually more than one basic relevance measures, resulting in multiple-dimensional feature vectors. How to combine the multiple relevance measures into a single one is the problem of retrieval function construction. Learning a retrieval function from training data is a common and effective strategy. In general, retrieval function learning is independent of specific domains. In this class of machine learning problem, the feature vectors of database items are computed based on queries and thus they are grouped into blocks by queries. The block structure of data is a unique feature of retrieval function learning problems. This thesis describes a series of approaches for more accurate learning of retrieval functions based on the block structure. These approaches range from the intra-block data normalization and block feature expansion methods for solving the non-i.i.d. (independent and identically distributed) problem, the block selection and support vector under-sampling methods for reducing redundant data, and the K-nearest-block ensemble method for designing query-adaptive retrieval functions. Experimental results with the support vector machine (SVM) used as the benchmark learner demonstrate that all of these block-based approaches significantly outperform the straightforward application of SVMs. The intra-block data normalization and data reduction methods have contributed to our original winning of the protein homology prediction task in the ACM KDDCUP-2004 competition. The K-nearest-block ensemble approach is even more attractive in both prediction accuracy and training speed. It obtains the best prediction result at present on the protein homology prediction task mentioned above.

Keywords: bioinformatics; information retrieval; machine learning; mass spectrometry; peptide identification; protein homology prediction

 (Download fulltext)

Back to homepage ...