PhD Defense: Analysis of Genomic and Experimental Data for Identifying Positively Selected Genes, Early Cancer Diagnosis, and Screening for Improved Industrial Protein Production
Bioinformatics takes an interdisciplinary approach, combining techniques from statistics, computer science, and biology to solve biological problems. Recent advances in laboratory techniques, such as microarrays, and in computational capabilities permit the collection and analysis of biological data on a large scale. The analysis of such large data sets presents significant new challenges in both biology and computer science, requiring the development of specialized new techniques and tools in the field of bioinformatics.
Evolutionary genomics is a major area of bioinformatics research. As many previous studies have shown, computational analysis of the conservation of particular sequence elements under various selective pressures across multiple genomes contributes to our understanding of biological diversity and speciation. The protozoan parasites Cryptosporidium parvum and Cryptosporidium hominis infect humans and frequently cause diarrhea. Our analysis of their genome sequences shows that only a small number of genes are under positive selection, which is consistent with the fact that organisms need to keep their genetic material stable for survival and reproduction. These genes are difficult to annotate because of their low sequence similarity to genes with known functions. In addition, proteins with transmembrane domains and signal peptides are enriched among the positively selected genes. These new results help answer questions about the evolutionary history of these pathogens.
Machine learning, or data mining, is a research area concerned with designing algorithms that extract hidden patterns from large amounts of data. As more biological experimental data are assembled and researchers seek the principles governing biological processes, many machine learning techniques have been applied to biological data and have successfully modeled these processes. We applied such techniques in two separate research projects: early diagnosis of pancreatic cancer using mass spectrometry data, and prediction of protein solubility. Using a data analysis framework with three steps (preprocessing, feature selection, and ensemble classification), we achieved good performance in differentiating samples from the disease group and the control group. This work suggests a possible direction for diagnosing pancreatic cancer at an early stage. In the protein solubility project, we proposed a pair of new features and tested a variety of machine learning approaches on two newly assembled datasets. We also identified a group of important protein sequence and secondary structure features that correlate well with protein solubility status. Our results suggest that incorporating secondary structure information can improve solubility prediction.
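The three-step framework named above (preprocessing, feature selection, ensemble classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names, the total-intensity normalization, the t-like feature score, the majority vote over one-feature threshold classifiers, and the made-up intensity values.

```python
from statistics import mean, stdev

def normalize(spectrum):
    """Preprocessing (assumed): scale intensities so each spectrum sums to 1."""
    total = sum(spectrum)
    return [x / total for x in spectrum] if total else list(spectrum)

def select_features(X, y, k):
    """Feature selection (assumed): keep the k features whose class means
    differ most relative to their spread, a simple t-like score."""
    def score(j):
        disease = [row[j] for row, label in zip(X, y) if label == 1]
        control = [row[j] for row, label in zip(X, y) if label == 0]
        spread = (stdev(disease) + stdev(control)) or 1e-12  # avoid division by zero
        return abs(mean(disease) - mean(control)) / spread
    ranked = sorted(range(len(X[0])), key=score, reverse=True)
    return ranked[:k]

def fit_stumps(X, y, features):
    """Ensemble members (assumed): one threshold classifier per selected feature,
    with the threshold halfway between the two class means."""
    stumps = []
    for j in features:
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        stumps.append((j, (m1 + m0) / 2, m1 >= m0))
    return stumps

def predict(stumps, row):
    """Ensemble classification: majority vote over all stumps (1 = disease)."""
    votes = sum(1 for j, threshold, high_is_disease in stumps
                if (row[j] > threshold) == high_is_disease)
    return 1 if 2 * votes > len(stumps) else 0

# Made-up peak intensities for four hypothetical m/z bins; not real data.
raw = [
    [9, 5, 8, 3], [10, 4, 9, 2], [8, 5, 9, 3],     # disease group
    [3, 5, 2, 15], [2, 6, 3, 14], [3, 4, 2, 16],   # control group
]
labels = [1, 1, 1, 0, 0, 0]

X = [normalize(s) for s in raw]
features = select_features(X, labels, k=3)
stumps = fit_stumps(X, labels, features)
print(predict(stumps, normalize([9, 5, 9, 2])))
```

In this toy setting the uninformative second bin is dropped by the selection step, and each remaining bin contributes one vote. A real analysis would need held-out evaluation and a more careful choice of preprocessing and classifiers.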