Quantifying Stylistic Convergence in Stable Diffusion Models
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
Text-to-image (TTI) models have become increasingly integrated into creative workflows,
but their commercial use remains controversial, in part because of concerns about
reduced artistic diversity. This thesis investigates how the stylistic diversity of
Stable Diffusion models (SD1.5, SD2.1, SDXL, SD3.5) changes across versions
and examines whether newer models increasingly converge on dominant "default styles".
Using images generated from five controlled prompt sets, a large real-world community
image set, and three visual similarity metrics (DiffSim, DINO, CLIP), the research
evaluates stylistic similarity under both controlled and real-world conditions.
Results indicate that identifiable aesthetic tendencies are weak in early versions of Stable
Diffusion but become more pronounced in newer models, especially when prompts provide
little guidance. SDXL shows high stylistic similarity across categories and particularly
strong characteristics of a "default style," while SD3.5 follows similar but sometimes
weaker trends.
However, the real-world data show an opposing trend: community
images generated with SDXL-based models exhibit greater stylistic variety than
those produced with models that use SD1.5 as a backbone. These findings suggest that
model usability and controllability, fine-tuning ecosystems, and user practices may
strongly influence the stylistic variability that users actually realize, and therefore
that, with community effort, base models can converge stylistically without restricting
the creative possibilities of those models.
Overall, the results reveal a tension between the increasingly strong aesthetic biases
of recent base models and the expanded range of creative options available
to users. The research contributes new empirical evidence of stylistic behavior in diffusion
models and provides insight into how this behavior differs between controlled and
practical environments.
Keywords
generative AI; text-to-image; AI; Stable Diffusion; creativity; Artificial Intelligence; image style; style diversity; diffusion; TTI; civitAI; image generation; synthetic images; AI creativity