Hidden uncertainty in data analysis: Understanding sources of variability in many-analyst projects

This study examines (1) how analytical decisions contribute to variability in many-analyst studies and (2) whether specific decisions can be identified as key drivers. Several models of varying complexity were trained and validated on a synthetic multiverse dataset and tested for generalization on the many-analyst dataset from Breznau et al. (2022). While non-linear models performed well on the multiverse dataset (XGBoost R² = 0.96), none generalized to the many-analyst dataset (R² ≈ 0.0), possibly due to noise or the absence of key decisions in the synthetic data. SHAP values and feature importance indicated that choices about variables, especially the type of independent variables, were the most impactful. Although the current models failed to explain variance in many-analyst settings, the findings suggest that efforts to explain variability in many-analyst projects should employ complex models that capture non-linear relationships and should emphasize the choice of variables.
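To make the coefficient of determination reported in the abstract concrete, the following minimal sketch computes it as 1 - SS_res / SS_tot in plain Python. It is not taken from the thesis: the function name `r_squared` and all numbers are hypothetical and chosen only for illustration. It shows that a model which does no better than predicting the mean outcome scores 0, which is how a value near 0.0 on the many-analyst dataset can be read.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical effect estimates from different analysts (illustrative only).
observed = [0.10, -0.05, 0.02, 0.08, -0.01]

# A model that only ever predicts the mean explains none of the variance.
mean_only = [sum(observed) / len(observed)] * len(observed)

# A model whose predictions track the observations closely.
close_fit = [0.09, -0.04, 0.03, 0.07, -0.02]

r2_mean_only = r_squared(observed, mean_only)
r2_close_fit = r_squared(observed, close_fit)
print(r2_mean_only, r2_close_fit)
```

Under these assumptions, the mean-only predictions give a value of zero, while the close-fitting predictions give a value near one.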

Keywords

many-analyst projects, meta research, analytical decisions, multiverse analysis, machine learning

URI

https://studenttheses.uu.nl/handle/20.500.12932/49822
