Dutch style model trained using authorship verification

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

Currently, no models exist that are trained specifically to create representations of Dutch linguistic style, nor has a task been developed to evaluate and verify such embeddings. In this thesis, I construct a model that creates a style representation for Dutch, and I create evaluation data to test whether the resulting representation truly captures style. To create these embeddings, RobBERT-base is fine-tuned on the contrastive authorship verification task. To find the best-performing model, two datasets are constructed, and experiments are run with both the choice of loss function and the value of the margin. The performance of the fine-tuned models is in line with results reported in similar research on English style. For the evaluation, the STEL framework is adapted into a Dutch version: some categories are copied from the English variant and translated so that they properly reflect Dutch style, while other categories are novel to this version. There are two versions of the STEL task, one of which controls for content to ensure that the embedding makes its decision based on style. The performance of the embeddings on the STEL task shows similarities to the results reported for the English equivalent, and shows that on most content-controlled tasks the fine-tuned model learns to perform better than the baseline model. Therefore, this thesis concludes that it is possible to take methods devised for creating and evaluating English style representations and transform them into a Dutch version that shows results similar to the originals.
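As a rough illustration of the contrastive authorship verification objective mentioned above, the sketch below implements the standard contrastive loss with a margin (the Hadsell et al. formulation) over toy, pre-computed embedding vectors. The embeddings, the cosine-distance choice, and the margin value here are illustrative assumptions, not the thesis's actual configuration: same-author pairs are pulled together, and different-author pairs are pushed apart until their distance exceeds the margin.

```python
import math

def contrastive_loss(dist: float, same_author: bool, margin: float = 0.5) -> float:
    """Margin-based contrastive loss: penalize distance for same-author
    pairs, and penalize different-author pairs only while their distance
    is still inside the margin."""
    if same_author:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def cosine_distance(u, v):
    """1 - cosine similarity over plain Python lists (toy helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 2-d style embeddings of three texts.
e_a = [0.9, 0.1]   # text A
e_b = [0.8, 0.2]   # text B, written by the same author as A
e_c = [0.1, 0.9]   # text C, written by a different author

same_loss = contrastive_loss(cosine_distance(e_a, e_b), same_author=True)
diff_loss = contrastive_loss(cosine_distance(e_a, e_c), same_author=False)
```

In this toy example the different-author pair already lies beyond the margin, so its loss is zero, while the near-identical same-author pair incurs only a tiny penalty; during fine-tuning, gradients of this loss reshape the encoder so that real pairs end up in this configuration.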
