A SURVEY OF DISCRETE DIFFUSION METHODS FOR NATURAL LANGUAGE AND DNA SEQUENCE GENERATION

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

While diffusion models have achieved state-of-the-art results in continuous domains like image generation, their application to inherently discrete data such as natural language and DNA presents unique challenges. Continuous-space adaptations often introduce artifacts and complexities, motivating a focused investigation into models that operate directly on discrete data. This survey provides a comprehensive overview of the methods and advancements in the field of discrete diffusion models. We review the foundational formulations, including Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), and their theoretical adaptations to discrete state spaces. We then survey advancements chronologically across two key modalities, natural language processing and DNA sequences, and examine critical research topics such as novel forward processes and the adaptation of pre-trained language models. By synthesizing these developments and outlining future research directions, this paper offers a structured overview of this rapidly evolving field.

Keywords

Diffusion, Discrete Diffusion, Diffusion Language Model, Natural Language Processing, Generative Modeling, Denoising Diffusion Probabilistic Models, Score-Based Generative Models, DNA Generation

Citation