Statistical methods to improve the analysis of biological data: Benchmarking phenotypes, protein function prediction, and spatial modelling of gene expression

Zhou, Naihui
Major Professor
Iddo Friedberg
Organizational Unit
Veterinary Microbiology and Preventive Medicine

Data collected in biological experiments come in many forms, including DNA and protein sequences, mRNA counts, spatial interactions, protein annotations, and phenotypic images. To make sense of these data, novel statistical methods are needed not only to model the biological data but also to assess the accuracy of predictions. In this thesis, I present three research studies that apply statistical analysis to the benchmarking, assessment and modelling of genetic data, demonstrating the diversity of bioinformatics research. The approach taken here is to tailor statistical methods to specific data types.

To provide quality benchmark data for phenotypic image processing and assessment, a generalized linear mixed-effects model was used to compare how well different groups of people (lay people recruited through Amazon Mechanical Turk versus experts) highlighted key elements in phenotypic images collected from corn fields. The analyzed images were then used as ground truth for training and testing automated methods. We concluded that properly managed crowdsourcing can establish large volumes of viable ground-truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping.

To assess the quality of computational protein function predictions, the third Critical Assessment of Functional Annotation (CAFA) was launched to evaluate predictions in the form of a community challenge. Each protein is associated with multiple functions represented by Gene Ontology terms (labels). These ontological terms form a hierarchical structure, and term frequencies are not uniform across proteins. Precision-recall-based assessment metrics were not sufficient to account for the non-uniform prior distribution of this multi-label problem, so semantic-distance-based methods were developed for better model assessment. We concluded that while predictions of molecular function and biological process annotations have slightly improved over time, those of cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than expectations set by baseline methods, it leaves considerable room and need for improvement. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation databases, computational function prediction, and our ability to manage big data in the era of large experimental screens.
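The abstract does not spell out the semantic-distance metric. One formulation used in the CAFA literature weights each Gene Ontology term by its information content, then sums the weights of missed terms (remaining uncertainty) and of wrongly predicted terms (misinformation). The sketch below illustrates that idea under stated assumptions: the toy ontology, the term names, and the conditional probabilities are all hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical toy ontology: term -> set of direct parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
}

# Hypothetical probabilities P(term annotated | its parents are annotated).
COND_PROB = {"root": 1.0, "binding": 0.5, "catalysis": 0.25, "dna_binding": 0.125}


def propagate(terms):
    """Close a set of terms under the ancestor relation."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def information_content(term):
    """Rare terms (low conditional probability) carry more information."""
    return -math.log2(COND_PROB[term])


def semantic_distance(true_terms, predicted_terms):
    """Return (remaining uncertainty, misinformation, Euclidean distance)."""
    truth, pred = propagate(true_terms), propagate(predicted_terms)
    ru = sum(information_content(t) for t in truth - pred)  # missed terms
    mi = sum(information_content(t) for t in pred - truth)  # wrong terms
    return ru, mi, math.hypot(ru, mi)
```

Because rare, specific terms receive larger weights, a predictor is not rewarded for returning only frequent, shallow terms, which is the weakness of plain precision and recall under a non-uniform prior.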

To model the spatial dependency of gene expression on the 3D structure of the genome, a Poisson Hierarchical Markov Random Field model (PhiMRF) was developed for gene expression data that accounts for the pairwise spatial interactions from Hi-C experiments. The quantitative expression of genes on human chromosomes 1, 4, 5, 6, 8, 9, 12, 19, 20, 21 and X all showed meaningful positive intra-chromosomal spatial dependency. Moreover, the spatial dependency is much stronger than the dependency based on linear gene neighborhoods, suggesting that 3D chromosome structures such as chromatin loops and Topologically Associating Domains (TADs) are indeed strongly correlated with gene expression levels. The results both confirm and quantify the spatial correlation in gene expression. In addition, PhiMRF improves upon the stochastic modelling of gene expression that is currently widely used in differential expression analyses. PhiMRF is available as an R package.
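The general shape of a hierarchical Poisson model with a Markov random field can be sketched as follows: each gene has a latent log-rate whose conditional mean is pulled toward the log-rates of its spatial neighbours, and the observed count is Poisson given that rate. This is an illustrative sketch only, not the PhiMRF implementation. The parameter names (`alpha`, `eta`, `tau`), the Gaussian form of the conditionals, and the neighbour graph are assumptions made for the example.

```python
import math


def conditional_mean(i, latent, neighbors, alpha, eta):
    """Conditional mean of gene i's latent log-rate given its neighbours.

    alpha is a baseline level and eta the spatial dependency strength
    (both hypothetical names); eta = 0 removes the spatial effect.
    """
    return alpha + eta * sum(latent[j] - alpha for j in neighbors[i])


def poisson_log_pmf(count, log_rate):
    """Log-probability of a count under Poisson(exp(log_rate))."""
    return count * log_rate - math.exp(log_rate) - math.lgamma(count + 1)


def pseudo_log_likelihood(counts, latent, neighbors, alpha, eta, tau):
    """Poisson data layer plus Gaussian conditional densities, summed over genes."""
    total = 0.0
    for i, y in enumerate(counts):
        mu = conditional_mean(i, latent, neighbors, alpha, eta)
        total += poisson_log_pmf(y, latent[i])
        total += -0.5 * math.log(2 * math.pi * tau ** 2)
        total -= (latent[i] - mu) ** 2 / (2 * tau ** 2)
    return total


# Hypothetical three-gene chain: gene 1 is in contact with genes 0 and 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
latent = [1.0, 2.0, 3.0]
counts = [2, 7, 20]
score = pseudo_log_likelihood(counts, latent, neighbors, alpha=2.0, eta=0.1, tau=1.0)
```

In a sketch like this, a positive `eta` means that genes in spatial contact tend to have similar expression levels, which corresponds to the positive intra-chromosomal dependency described above.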

2020-05-01