Unmasking Memorization: Assessing Dutch Language Memorization in mT5 Models

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

This study investigates the memorization of Dutch-language content in mT5, the multilingual variant of the T5 family of Transformer models. A fill-mask evaluation technique is used to assess memorization and how it varies across model sizes. Results show that memorization increases with model size up to a point: substantial memorization is observed in the 580M- and 1.2B-parameter models, while the smallest (300M) and largest (3.7B) models remain close to baseline generalization performance, showing little relative memorization. The findings further reveal that both data duplication and the masking level affect memorization: moderately duplicated sequences exhibit the strongest memorization, and a masking level similar to the pre-training setting yields the highest observed memorization, which declines sharply as the masking level is increased. These findings have implications for model reliability as well as ethical and legal questions, particularly regarding the use of copyrighted training data. This research underscores the need to balance training data and adjust model design to promote generalization and minimize memorization in multilingual models.
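As an illustrative sketch only (the abstract does not include the thesis's evaluation code), the fill-mask setup can be approximated with T5-style span corruption: a chosen fraction of tokens is replaced by `<extra_id_N>` sentinel tokens, the hidden tokens form the target sequence, and memorization is scored by whether a model reproduces those hidden spans verbatim. All function names below are hypothetical, and the masking scheme is a simplification of T5's actual span-corruption objective.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, seed=0):
    """Replace a fraction of tokens with T5-style sentinel tokens.

    Returns (input_tokens, target_tokens): masked positions in the input
    become <extra_id_N> sentinels (adjacent masked tokens share one
    sentinel), and the target lists each sentinel followed by the tokens
    it hides, mirroring T5's span-corruption format.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inp, tgt, sentinel = [], [], 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:  # open a new span with a fresh sentinel
                inp.append(f"<extra_id_{sentinel}>")
                tgt.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            tgt.append(tok)  # hidden token goes to the target side
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    return inp, tgt


def exact_match(predicted, target):
    """Score 1.0 only if the masked-out spans are reproduced verbatim."""
    return float(predicted == target)
```

Varying `mask_ratio` corresponds to the masking-level experiments summarized above: a ratio near the pre-training setting (around 15%) would be expected to elicit the most memorization, with higher ratios giving the model less context to reconstruct the hidden spans from.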

Keywords

LLMs;mT5;memorization;copyright;Dutch;NLP