A Natural Language Processing Model for Bacterial Genome Assembly Using Gene Annotations

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

In this project, we explored a new way of assembling bacterial genomes: applying a natural language processing (NLP) approach, normally used to model text, to the problem of piecing together fragments of DNA. Bacterial genome assembly is complicated by repetitive DNA sequences, which make it difficult for current methods to order the fragments correctly. Our approach aimed to address this by training a model to "understand" connections between DNA fragments, much as word-embedding models in language processing learn relationships between words. We started from the gene sequences of known bacterial genomes, annotated them with specific codes using the tool Bakta, and trained a Word2Vec model on these codes. This allowed the model to learn how these genes usually connect in a bacterial genome. To test the approach, we assembled short-read DNA data into fragments (called contigs) and annotated these with Bakta as well. By comparing the codes at the beginning and end of each contig, the model could suggest possible links between contigs based on the similarity of their code patterns. We trained one model on multiple bacterial species together and also separate models for each species. The species-specific models performed better, but the combined model still worked surprisingly well, which could simplify the process when working with mixed bacterial samples in real-world applications. The approach shows promise for handling repetitive DNA regions and potentially improving the accuracy of bacterial genome assembly.
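
The contig-linking idea can be illustrated with a minimal sketch. This is not the thesis implementation: to stay dependency-free it swaps Word2Vec for simple co-occurrence count vectors, and the gene codes (`g1` to `g8`), the function names, and the `window` and `k` parameters are all hypothetical placeholders. It also ignores contig orientation, which a real assembler would have to handle.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(genomes, window=2):
    """Map each gene code to counts of the codes seen within `window` positions.

    Stand-in for Word2Vec training: each reference genome is treated as a
    "sentence" whose "words" are annotation codes.
    """
    vectors = defaultdict(Counter)
    for genes in genomes:
        for i, gene in enumerate(genes):
            lo, hi = max(0, i - window), min(len(genes), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[gene][genes[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def edge_vector(genes, vectors):
    """Sum the vectors of the genes at one end of a contig."""
    total = Counter()
    for gene in genes:
        total.update(vectors.get(gene, Counter()))
    return total


def rank_links(contigs, vectors, k=2):
    """Score every ordered pair (a, b): end of contig a against start of contig b."""
    scores = []
    for a, genes_a in contigs.items():
        tail = edge_vector(genes_a[-k:], vectors)
        for b, genes_b in contigs.items():
            if a != b:
                head = edge_vector(genes_b[:k], vectors)
                scores.append((cosine(tail, head), a, b))
    return sorted(scores, reverse=True)


# Toy data: one annotated reference genome and two annotated contigs.
reference = [["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]]
contigs = {"c1": ["g1", "g2", "g3", "g4"], "c2": ["g5", "g6", "g7", "g8"]}

vectors = cooccurrence_vectors(reference)
links = rank_links(contigs, vectors)
for score, a, b in links:
    print(f"{a} -> {b}: {score:.2f}")
```

On this toy input the link from `c1` to `c2` ranks above the reverse link, because the codes at the end of `c1` share neighbours with the codes at the start of `c2` in the reference. A trained embedding model would replace `cooccurrence_vectors` while the ranking step could stay the same.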
