Zeb Arendsee: Bioinformatics and Computational Biology seminar

Apr 25, 2018 - 9:00 AM

Genetics, Development and Cell Biology Ph.D. candidate: Zeb Arendsee

Major: Bioinformatics and Computational Biology

Major Professor: Eve Wurtele

Title: "Tracing the genomic origins of orphan genes"

Abstract: A surprising discovery of the genomic age is that novel protein-coding genes continually arise from non-coding precursors. Many of these so-called "orphan genes" have dramatic functions, such as the QQS gene in Arabidopsis thaliana and the antifreeze gene in Antarctic icefish. A growing body of evidence suggests that orphans are involved in the evolution of novel traits, regulation of development and metabolism, and interactions between species (pathogenesis and symbiosis). Identifying and documenting orphans can be particularly valuable for understanding the evolution of novel phenotypes, and it has potential applications in bioengineering.

However, identifying orphans and uncovering their histories is challenging, and different methodologies can yield contradictory conclusions. Orphans are typically inferred by homology, that is, by searching for similar sequences in related species. Sequence-based inference of homologs is highly sensitive to the annotations of the related species, and it cannot distinguish rapidly evolving sequences from genes of de novo origin. In particular, sequences that are short or rapidly evolving, such as orphan genes and small non-coding RNAs, may yield no significant hits. Further, sequences of low complexity or high copy number may hide in a crowd of false positives.

Searching by context bypasses these problems. We present an algorithm for tracing loci between genomes using a synteny map, and test its efficacy by mapping all Arabidopsis thaliana-specific genes to the genomes of related species in Brassicaceae. By reducing the search space and winnowing false positives, we were able to assess the origin of the individual orphan genes with unprecedented resolution and group them into subclasses. The pipeline can distinguish orphans with high-confidence data support from orphans identified due to bad assembly or missing data. We traced many orphans to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species and the amount of data support provides a launching point for future orphan studies.
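
To illustrate the idea of searching by context, the following is a minimal Python sketch of how a synteny map might narrow the search space for one query locus. It is not the algorithm presented in the seminar. It assumes a single chromosome, ignores strand and inversions beyond sorting the endpoints, and treats a synteny map as a plain list of paired intervals. The names `Block` and `search_interval` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """One syntenic block: a focal-genome interval paired with a target-genome interval.

    Hypothetical structure for illustration; real synteny maps also carry
    chromosome names, strand, and alignment scores.
    """
    focal_start: int
    focal_end: int
    target_start: int
    target_end: int


def search_interval(query: Tuple[int, int], blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Return a target-genome interval in which to look for the query locus.

    If the query overlaps syntenic blocks, return the span of their target
    intervals. Otherwise, return the target region between the nearest
    upstream and downstream blocks. Return None when the query is not
    bounded on both sides (for example, at the edge of an assembly).
    """
    q_start, q_end = query

    # Case 1: the query overlaps one or more blocks in the focal genome.
    overlapping = [b for b in blocks if b.focal_start <= q_end and b.focal_end >= q_start]
    if overlapping:
        return (min(b.target_start for b in overlapping),
                max(b.target_end for b in overlapping))

    # Case 2: the query falls between blocks; use the flanking blocks as anchors.
    upstream = [b for b in blocks if b.focal_end < q_start]
    downstream = [b for b in blocks if b.focal_start > q_end]
    if not upstream or not downstream:
        return None  # not enough context to bound the search

    left = max(upstream, key=lambda b: b.focal_end)
    right = min(downstream, key=lambda b: b.focal_start)

    # Sort the endpoints so the result is a valid interval even if the
    # flanking blocks appear in reverse order in the target genome.
    lo, hi = sorted((left.target_end, right.target_start))
    return (lo, hi)


blocks = [Block(100, 200, 1000, 1100), Block(500, 600, 1400, 1500)]
print(search_interval((250, 300), blocks))  # query lies between the two blocks
print(search_interval((150, 180), blocks))  # query lies inside the first block
print(search_interval((700, 800), blocks))  # no downstream anchor
```

In a sketch like this, a query with no bounding context returns `None`, which loosely mirrors the abstract's point about separating orphans with good data support from cases that are unresolved because of bad assembly or missing data.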