Thesis
A comparative computational approach for exploring sequence-derived properties of genomic features
Southern Cross University, Plant Science
Doctor of Philosophy (PhD), Southern Cross University
2021
DOI: https://doi.org/10.25918/thesis.160
Abstract
In eukaryotic genomes, the DNA sequence is highly compacted into several chromatin structure levels to make up the chromosomes. The dynamic status of chromatin structure plays an important role in genomic regulation, with a wide range of DNA sequence-derived properties contributing to chromatin conformation. Although large numbers of eukaryotic genomes have now been sequenced, assembled and annotated, there is no systematic or interactive platform for analysing and exploring data relating to different genomic features along with a wide range of associated sequence-derived properties.
The main purpose of this thesis was to establish a systematic analysis platform to test hypotheses related to the variation in sequence-derived property profiles that may be associated with genomic features such as exon-intron structure. A literature survey identified many studies discussing the important role that base composition and DNA physicochemical properties appear to have played during the evolution of eukaryotic genomes. However, less attention has been paid, particularly in plants, to exploring the evolutionary constraints of sequence-derived properties between closely related taxa that have diverged over different evolutionary periods. Three Brassicaceae (angiosperm, dicot plant) genomes (Arabidopsis thaliana, Brassica rapa, and Brassica oleracea) were chosen to represent genome divergence over approximately 20 million years.
Systems analysis was carried out to design and develop database tables supplemental to the Ensembl Core relational database schema in order to add new functionality for analysing genomes stored in Ensembl. Development of the Heuristic Ensembl Meta-Analysis (HEMA) platform included a pre-processing data pipeline to populate the database tables, and a data analysis pipeline containing structured queries that can explore systematic questions of any calculated sequence property associated with any subset of genic features. It also includes an interactive graphical user interface (GUI) that connects to the modified Ensembl Core database.
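As a rough illustration of the idea of supplementing a core schema with property tables, the sketch below uses an in-memory SQLite database. It is not the HEMA schema: the `exon` table is a heavily reduced stand-in for the Ensembl Core table of the same name, and the `exon_property` table, its columns, and all function names are hypothetical, introduced only for this example.

```python
import sqlite3


def build_demo_db():
    """Create a toy database: a reduced 'exon' table plus a supplementary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Reduced stand-in for a core feature table (assumption: only 3 columns kept).
        CREATE TABLE exon (
            exon_id          INTEGER PRIMARY KEY,
            seq_region_start INTEGER NOT NULL,
            seq_region_end   INTEGER NOT NULL
        );
        -- Hypothetical supplementary table: one row per (feature, property).
        CREATE TABLE exon_property (
            exon_id       INTEGER NOT NULL REFERENCES exon(exon_id),
            property_name TEXT    NOT NULL,
            value         REAL    NOT NULL,
            PRIMARY KEY (exon_id, property_name)
        );
        """
    )
    return conn


def gc_content(seq):
    """Fraction of G or C bases in a sequence."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def add_exon(conn, exon_id, start, end, seq):
    """Pre-processing step: store the feature and its calculated property."""
    conn.execute("INSERT INTO exon VALUES (?, ?, ?)", (exon_id, start, end))
    conn.execute(
        "INSERT INTO exon_property VALUES (?, 'gc_content', ?)",
        (exon_id, gc_content(seq)),
    )


def mean_property_by_length(conn, name, min_length):
    """Analysis step: mean of a property over exons at least min_length long."""
    row = conn.execute(
        """
        SELECT AVG(p.value)
        FROM exon e
        JOIN exon_property p ON p.exon_id = e.exon_id
        WHERE p.property_name = ?
          AND (e.seq_region_end - e.seq_region_start + 1) >= ?
        """,
        (name, min_length),
    ).fetchone()
    return row[0]  # None when no exon matches
```

Keeping calculated properties in a separate table keyed on the feature identifier leaves the core table untouched, so a query can join any stored property to any subset of features.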
The analysis pipeline was used to test hypotheses relating to the evolutionary constraints on intron length and base composition in three Brassicaceae genomes. Variation in GC-content between exonic and intronic regions appears to be conserved, and the distribution of intron length appears to be constrained. An analysis of the full set of different genic features revealed symmetrical profiles of mean GC-content on either side of the mid-point of Intron 1 and Exon 2. These profiles show an increase in GC-content associated with the mid-point of exons, whereas a reduction in GC-content appears at the mid-point of introns. Moreover, a negative correlation is apparent between GC-content and intron length within different gene groups such as high- and low-GC3 genes. After calculating the first derivative profiles of the absolute GC-content across the three species, the distance between the inner pair of local maxima flanking the reference mid-point for Exon 2 and Intron 1 appears to be sensitive to the length of Exon 2 and Intron 1.
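The kind of calculation described above can be sketched in a few lines. This is a minimal illustration, not the thesis pipeline: the sliding-window size, the use of a simple first difference as the derivative, and all function names are assumptions made for the example.

```python
def gc_profile(seq, window=5):
    """GC fraction of every full window, sliding one base at a time."""
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    profile = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        profile.append(sum(1 for base in chunk if base in "GC") / window)
    return profile


def first_derivative(profile):
    """Discrete first derivative: difference between neighbouring positions."""
    return [b - a for a, b in zip(profile, profile[1:])]


def local_maxima(values):
    """Indices whose value is strictly greater than both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


def inner_maxima_distance(values, mid):
    """Distance between the closest local maxima on each side of index mid.

    Returns None when either side has no local maximum.
    """
    maxima = local_maxima(values)
    left = [i for i in maxima if i < mid]
    right = [i for i in maxima if i > mid]
    if not left or not right:
        return None
    return min(right) - max(left)
```

For example, `first_derivative(gc_profile(seq))` gives the derivative profile of one sequence, and `inner_maxima_distance(derivative, mid)` measures the spacing of the two maxima nearest a chosen reference position. In a genome-wide setting the profiles would first be aligned on the feature mid-point and averaged across genes; that step is omitted here.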
Further analysis was carried out in order to test hypotheses relating to the variation in sequence-derived structural and thermodynamic (physicochemical) property profiles between exonic and intronic regions. A set of 11 properties was selected after carrying out a hierarchical clustering of over 100 physicochemical dinucleotide sets found in the DiProDB database. In most cases, the calculated profiles of these properties are symmetrical on either side of the mid-point of Intron 1 and Exon 2. Most of the calculated profiles match those observed in GC-content. However, the twist and wedge profiles are not predicted by GC-content.
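A dinucleotide property profile of this kind can be sketched as a table lookup followed by window averaging. The example below is only illustrative: no DiProDB values are reproduced, so the lookup table is a stand-in that simply counts G/C bases in each dinucleotide, and the function name and window size are assumptions. A real analysis would substitute a table of 16 values for a property such as twist or wedge.

```python
from itertools import product

# Stand-in table (assumption): number of G/C bases in each of the 16 dinucleotides.
# Replace with real property values, one per dinucleotide, for an actual analysis.
STAND_IN_TABLE = {
    a + b: sum(base in "GC" for base in a + b)
    for a, b in product("ACGT", repeat=2)
}


def dinucleotide_profile(seq, table, window=3):
    """Mean property value over sliding windows of overlapping dinucleotide steps.

    Raises KeyError if the sequence contains a dinucleotide missing from the
    table (for example one containing N).
    """
    seq = seq.upper()
    # One value per overlapping dinucleotide step: positions (0,1), (1,2), ...
    steps = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    if window <= 0 or window > len(steps):
        raise ValueError("window must be between 1 and the number of steps")
    return [
        sum(steps[i:i + window]) / window
        for i in range(len(steps) - window + 1)
    ]
```

Because the property is defined per dinucleotide step rather than per base, a sequence of length n yields n - 1 step values before windowing.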
With the modification of the existing Ensembl Core database schema, and development of associated pre-processing and analysis pipelines, the HEMA system provides a platform for generating unique queries relating to genome-wide features and sequence-derived attributes. Also, it allows genomic researchers to carry out more sophisticated sequence analysis between any taxa within the Ensembl system, which may reveal potential genomic signals imposed by selection pressure.
Details
- Title
- A comparative computational approach for exploring sequence-derived properties of genomic features
- Creators
- Eslam Ibrahim Amin Ibrahim
- Contributors
- Graham King (Supervisor) - Southern Cross University; Ramil Mauleon (Supervisor) - Southern Cross University; Nedeljka Rosic (Supervisor) - Southern Cross University; Abdul K M Baten (Supervisor) - Southern Cross University
- Awarding Institution
- Southern Cross University; Doctor of Philosophy (PhD)
- Theses
- Doctor of Philosophy (PhD), Southern Cross University
- Publisher
- Southern Cross University, Plant Science
- Number of pages
- 247
- Identifiers
- 991012963600402368
- Copyright
- © Eslam Ibrahim 2021
- Academic Unit
- School of Environment, Science and Engineering; Southern Cross Plant Science
- Resource Type
- Thesis