Streamline unsupervised machine learning to survey and graph indel-based haplotypes from pan-genomes
Bosen Zhang, Haiyan Huang, Laura E. Tibbs-Cortes, Adam Vanous, Zhiwu Zhang, Karen Sanguinet, Kimberly A. Garland-Campbell, Jianming Yu, Xianran Li
Molecular Plant. Published: May 17, 2023
Pan-genomes with high-quality de novo assemblies are shifting the paradigm of biology research in genome evolution, speciation, and functional annotation (Shi et al., 2023). An all-vs.-all comparison across assemblies can overcome the limitation of mapping short reads to a single assembly when cataloging polymorphisms, especially large insertions and deletions (indels) that contribute to phenotypic variation by altering gene structure or expression (Chen et al., 2021). However, for specific genes, surveying and graphing large indels across assemblies are challenging and painstaking tasks (Mahmoud et al., 2019). Here, we constructed an interactive webapp, BRIDGEcereal (https://bridgecereal.scinet.usda.gov/), to expedite this process by streamlining unsupervised machine learning.
A large indel is flanked by two high-scoring segment pairs (HSPs). We devised two unsupervised machine learning algorithms to identify large indels (Figure 1A). The first algorithm, clustering HSPs for ortholog identification via coordinates and equivalence (CHOICE; Figure 1B), identifies and extracts the segment harboring the ortholog from each assembly. The segments are then subjected to an all-vs.-all comparison to survey potential large indels. The second algorithm, clustering via large-indel permuted slopes (CLIPS; Figure 1C and Supplemental Figure 1), groups segments sharing the same set of indels to graph a concise haplotype depiction for visualizing potential large indels, their impacts on the gene, and relationships among haplotypes. Because the sizes and locations of indels outside of genes are unknown, multiple iterations may be needed to obtain the optimal haplotype graph by probing different upstream and downstream search boundaries and different orders of haplotypes (Supplemental Figure 2). Through the interactive graphical user interface of BRIDGEcereal, these parameters can be instantly adjusted based on visual inspection.
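To make the idea that a large indel is flanked by two HSPs concrete, the following minimal Python sketch reports a candidate indel wherever the gap between two consecutive HSPs differs in length between the two sequences, and then groups assemblies that carry an identical set of candidates. It is an illustration only and is not the CHOICE or CLIPS implementation: the `HSP` tuple, the function names, the 50-bp `min_size` threshold, and the exact-match grouping (used here in place of the slope-based clustering) are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class HSP(NamedTuple):
    """One high-scoring segment pair (hypothetical layout, 1-based inclusive)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def large_indels(hsps, min_size=50):
    """Report length differences between consecutive HSP gaps as candidate indels.

    Assumes the HSPs are collinear and on the same strand.
    Returns (kind, query_position, size) tuples.
    """
    ordered = sorted(hsps, key=lambda h: h.q_start)
    calls = []
    for left, right in zip(ordered, ordered[1:]):
        q_gap = right.q_start - left.q_end - 1  # unaligned bases in the query
        s_gap = right.s_start - left.s_end - 1  # unaligned bases in the subject
        size = q_gap - s_gap
        if abs(size) >= min_size:
            kind = "insertion_in_query" if size > 0 else "insertion_in_subject"
            calls.append((kind, left.q_end, abs(size)))
    return calls


def group_by_indel_set(indels_by_assembly):
    """Group assembly names whose candidate indel sets are identical."""
    groups = defaultdict(list)
    for name, calls in indels_by_assembly.items():
        groups[frozenset(calls)].append(name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical HSPs from comparing one reference segment against three assemblies.
hsps_by_assembly = {
    "A": [HSP(1, 1000, 1, 1000), HSP(1001, 2000, 1501, 2500)],
    "B": [HSP(1, 1000, 1, 1000), HSP(1001, 2000, 1501, 2500)],
    "C": [HSP(1, 2000, 1, 2000)],
}
indels = {name: large_indels(h) for name, h in hsps_by_assembly.items()}
haplotype_groups = group_by_indel_set(indels)
```

In this toy input, assemblies A and B each carry 500 extra bases between the two flanking HSPs and fall into one group, while C aligns as a single HSP and forms its own group. Real data would require tolerance for small coordinate differences between assemblies, which exact set matching does not provide.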