Quantifying Stylistic Convergence in Stable Diffusion Models

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

Text-to-image (TTI) models are increasingly integrated into creative workflows, but their commercial use remains controversial because of concerns that include reduced artistic diversity. This thesis investigates how the diversity of visual style in Stable Diffusion models (SD1.5, SD2.1, SDXL, SD3.5) changes across versions and examines whether the models show a growing inclination towards convergent or dominant "default styles". Using five controlled prompt sets for image generation, a large real-world community image set, and three visual similarity metrics (DiffSim, DINO, CLIP), the research evaluates stylistic similarity under both controlled and real-world conditions. Results indicate that identifiable aesthetic tendencies are weak in early versions of Stable Diffusion but become more pronounced in newer models, especially when prompts provide little guidance. SDXL shows high levels of stylistic similarity across categories and particularly strong characteristics of a "default style", while SD3.5 follows similar but sometimes weaker trends. However, the real-world data show the opposite trend: community images generated with SDXL-based models displayed greater stylistic variety than those produced with models that use SD1.5 as a backbone. These findings suggest that model usability and controllability, fine-tuning ecosystems, and user practices may strongly influence the stylistic variability that users actually realize, and therefore that base models can converge stylistically while community effort keeps their creative possibilities open. Overall, the results reveal a tension between the increasingly strong aesthetic biases of recent base models and the expanded range of creative options available to users.
The research contributes new empirical evidence on stylistic behavior in diffusion models and provides insight into how this behavior differs between controlled and practical environments.

Keywords

generative AI; text-to-image; AI; Stable Diffusion; creativity; Artificial Intelligence; image style; style diversity; diffusion; TTI; civitAI; image generation; synthetic images; AI creativity
