Annotating and characterizing orphan gene in Zea mays via diverse RNA-Seq data
Genetics, Development and Cell Biology
In the past, many studies have dismissed pervasively transcribed but unannotated transcripts as transcriptional "noise", which I refer to as the "dark transcriptome". However, some functional genes have been identified within this dark transcriptome. Most genes in the dark transcriptome are orphan genes: recently emerged, young genes that share no sequence similarity with proteins in any other species. In the last 20 years, thousands of orphan genes have been experimentally shown to play important roles in diverse species. However, significant limitations remain in our knowledge of orphan genes and their functions. Traditional gene model prediction pipelines rely on ab initio methods and sequence homology, and orphan genes lack such homology, which makes them hard to predict by traditional methods. Moreover, orphan genes are usually expressed only under specific conditions. Even though several studies have evaluated many de novo genes using RNA-Seq evidence, the limited range of conditions covered by the libraries may restrict the identification of orphan genes. As a result, the true number of orphan genes in a genome is currently unknown. Gene function prediction is largely based on sequence and domain similarity, with only a small set of gene functions inferred directly from experimental evidence, so orphan gene function cannot be inferred by these traditional methods. Even though some orphan genes have been experimentally characterized, most of their functions are not integrated into public databases. This dissertation presents methods and tools to evaluate potential orphan genes and to predict them and their functions efficiently. First, I comprehensively evaluated all potential ORFs in yeast for evidence of transcription and translation, using over 3,000 RNA-Seq and Ribo-Seq samples. Next, I developed a lightweight, flexible, reproducible, and well-documented pipeline, BIND, to improve orphan gene prediction.
Finally, I provide improved gene model predictions using BIND, together with comprehensive functional annotations based on co-expression analysis of over 1,000 RNA-Seq samples, for 26 inbred lines of Zea mays subsp. mays. The functional annotations were validated by enrichment analysis combined with differential expression analysis. Thousands of orphan genes showed specific expression in at least one stress condition and tissue. The annotation of pan-orphan genes, especially the inbred line-specific genes in the 26 NAM founder lines, holds potential to provide agronomists and geneticists with molecular markers for marker-assisted selection and for developing desired maize varieties.