Detecting and Mitigating Goal Misgeneralisation with Logical Interpretability Tools

Abstract

This thesis examines the problem of AI alignment and specific instances of misalignment. It discusses current and anticipated problems to stress the growing importance of alignment, and presents reward specification and goal misgeneralisation as two difficulties in aligning an agent's behavior with the objective its designer intended. The original research elicits and studies properties of goal misgeneralisation in a novel collection of toy environments. Furthermore, rule induction algorithms are implemented as an interpretability tool that generates multiple different explanations for an agent's behavior, which can aid in detecting goal misgeneralisation.

Keywords

AI Safety; Alignment; Interpretability; Misalignment; Goal Misgeneralisation; Logical Induction Algorithms

URI

https://studenttheses.uu.nl/handle/20.500.12932/44216
