Validation of a Bayesian mixture model for language contact with the use of synthetic language data

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

Speaker communities typically interact to some degree and are rarely completely isolated. When speakers of different languages come into contact, their languages are likely to converge. Ranacher et al. (2021) developed a method, sBayes, to estimate the relative role of language contact, as opposed to inheritance and universal preference, in creating similarities between languages. The model promises to identify contact areas from empirical data using Bayesian inference. However, validating the approach is difficult because the authors use empirical data from real-world languages, for which, by definition, the actual contributions of language contact, inheritance and universal preference are not known. To further validate the sBayes model, a dataset is needed for which the expected contact, inheritance and universal preference values are known before the model is run. These known values can then be compared with the output of sBayes.

For this purpose, we created synthetic language datasets using an agent-based model to test the accuracy of sBayes. Using these datasets, we conducted two experiments. The first validates sBayes's ability to detect isolated causal explanations per language feature. The second tests how well sBayes fits an artificial language dataset and how well it determines language areas (clusters) and overall causality counts.

Our results suggest that synthetic language data can successfully be used to validate the sBayes language model. In our simulations, sBayes identifies clearly isolated causalities with a combined mean squared error of 0.05. In a simulated real-life situation, the model finds a similar number of contact areas. In addition, the overall distribution of feature state causality in our synthetic data is the same as in a benchmark experiment.

Keywords

bayesian;sBayes;artificial;language;data;contact

Citation