Resources for Cannabis sativa L. omics data analyses

Locedie A. Mansueto

doi:10.25918/thesis.408

Global changes in Cannabis legislation and growing demand for its industrial and medicinal applications have spurred a resurgence of research and development. Decades of stringent regulation have impeded public Cannabis research, leaving it behind traditional crops in resources, knowledge, and capacity. This thesis aims to fill these gaps by developing platforms and tools to support Cannabis genetic, functional genomic and pre-breeding objectives. In the first research chapter, a Tripal instance was set up to host public Cannabis omics datasets (www.icgrc.info). Most Cannabis datasets from high-throughput methods published as of December 2022 were loaded, including genome assemblies and gene annotations, transcriptomes, transcript and protein expression, genetic maps and QTLs, and metabolite abundance. Genomic and RNA-Seq data were re-analyzed to discover variants, assemble transcripts, predict genes, and quantify expression levels. The detailed steps, published in the accompanying protocols.io protocol, also serve as a manual for setting up Tripal for other resource-limited non-model crops. The Tripal and Chado data models were reviewed in order to develop a multi-omics, multi-source data integration module. The implemented solution queries the various Tripal modules, each returning tuples of (datatype, property, sample, value), merges the results by union, and pivots them into a table of datatype-properties by samples. Web-service Application Programming Interfaces (APIs) were defined and used by template Jupyter notebooks in various multi-omics analyses to discover candidate genes. The second research chapter aimed to create a powerful and versatile allele mining tool. Public next-generation sequencing data were used for variant calling against three reference genomes: cs10, Purple Kush and Finola. This compute-intensive task was also used to benchmark GATK against GPU-accelerated Parabricks.
Variant discovery yielded 90-110 million raw SNPs and 17-21 million raw indels from 380 samples with high heterozygosity. The resulting large genotype matrix is hosted in SNP-Seek for interactive query and analysis. In the final research chapter, a mid-density genotyping platform was designed through Integer Linear Programming using SNP data generated in the previous chapters. The High-throughput Amplicon-based SNP-platform for medicinal Cannabis and industrial Hemp (HASCH) comprises 1504 highly informative genome-wide targets. Empirical evaluation using hemp samples showed high (92%) concordance with GBS data, produced a comparable phylogenetic tree, and demonstrated the ability to generate genetic maps and detect QTL. In conclusion, the outputs of this thesis provide a much-needed step change for Cannabis genomic research. By delivering a suite of genomic platforms and tools taken for granted for conventional crops, they will accelerate Cannabis crop improvement for the benefit of the budding Cannabis industries.
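
The data integration step described in the abstract (tuples of datatype, property, sample and value, merged by union and pivoted into a table) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names merge_by_union and pivot, the example rows, and the choice to mark missing cells with None are all assumptions made for illustration, and the real module queries Tripal rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical query results from three data modules, each already
# normalised to (datatype, property, sample, value) tuples.
expression_rows = [
    ("expression", "geneA", "sample1", 12.5),
    ("expression", "geneA", "sample2", 3.1),
]
genotype_rows = [
    ("genotype", "snp_001", "sample1", "A/G"),
    ("genotype", "snp_001", "sample3", "G/G"),
]
metabolite_rows = [
    ("metabolite", "CBD", "sample2", 4.2),
]


def merge_by_union(*sources):
    """Concatenate tuple lists, dropping exact duplicates, keeping order."""
    seen = set()
    merged = []
    for rows in sources:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def pivot(rows):
    """Pivot tuples into rows keyed by (datatype, property), one column per sample.

    Returns the sorted sample names and a dict mapping each
    (datatype, property) key to a list of values aligned with those names.
    Samples without a value for a given key get None (an assumption).
    """
    cells = defaultdict(dict)
    samples = set()
    for datatype, prop, sample, value in rows:
        cells[(datatype, prop)][sample] = value
        samples.add(sample)
    columns = sorted(samples)
    table = {key: [by_sample.get(s) for s in columns] for key, by_sample in cells.items()}
    return columns, table


columns, table = pivot(merge_by_union(expression_rows, genotype_rows, metabolite_rows))

# Print a tab-separated view of the integrated table.
print("\t".join(["datatype", "property"] + columns))
for (datatype, prop), values in table.items():
    print("\t".join([datatype, prop] + [str(v) for v in values]))
```

Because every source is reduced to the same four-field tuple, adding another data type in this sketch only requires one more list of tuples; the merge and pivot steps stay unchanged.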
