Hidden uncertainty in data analysis: Understanding sources of variability in many-analyst projects
Publication date
Authors
DOI
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
This study examines (1) how analytical decisions contribute to variability in many-analyst studies and (2) whether specific decisions can be identified as key drivers.
Several models of varying complexity were trained and validated on a synthetic multiverse dataset and then tested for generalization on the many-analyst dataset from Breznau et al. (2022). While non-linear models performed well on the multiverse dataset (XGBoost R² = 0.96), none generalized to the many-analyst dataset (R² ≈ 0.0), possibly due to noise or the absence of key decisions in the synthetic data. SHAP values and feature importance indicated that choices about variables, especially the type of independent variables, were the most impactful.
Although the current models failed to explain variance in many-analyst settings, the findings suggest that efforts to explain variability in many-analyst projects should employ complex models that capture non-linear relationships and should emphasize the choice of variables.
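The gap between in-sample fit and generalization described above can be illustrated with a minimal sketch. It is not the thesis pipeline: a one-variable least-squares line stands in for the XGBoost model, and both datasets are hypothetical toy data generated here, not the multiverse or Breznau et al. (2022) data. The names `r_squared`, `fit_line`, `x_multi`, and `x_many` are assumptions made for illustration. The sketch only shows how R² is computed and why a model that fits one dataset well can score at or below zero on another.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical "multiverse" data: the outcome depends on one encoded decision.
x_multi = [rng.uniform(0, 1) for _ in range(200)]
y_multi = [2.0 * xi + rng.gauss(0, 0.1) for xi in x_multi]

slope, intercept = fit_line(x_multi, y_multi)
r2_in = r_squared(y_multi, [slope * xi + intercept for xi in x_multi])

# Hypothetical "many-analyst" data: the outcome is unrelated to that decision.
x_many = [rng.uniform(0, 1) for _ in range(200)]
y_many = [rng.gauss(0, 1.0) for _ in x_many]
r2_out = r_squared(y_many, [slope * xi + intercept for xi in x_many])

print(f"R^2 on training-like data: {r2_in:.2f}")
print(f"R^2 on unrelated data:     {r2_out:.2f}")
```

An R² of zero corresponds to predicting the mean of the target, so values near or below zero on the second dataset mean the fitted relationship carries no usable information there.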
Keywords
many-analyst projects, meta research, analytical decisions, multiverse analysis, machine learning