Context-aware error detection for relational datasets using Large Language Models
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
Detecting erroneous values in datasets remains a challenging and time-consuming task.
Each dataset can contain different types of errors. Errors are either syntactic, meaning
they do not conform to the structure or domain of the other values, or semantic, meaning
the values are syntactically correct but appear in the wrong context. Because the context
in which these errors occur varies widely, it is hard to design a single tool that detects
all errors in all contexts. Existing methods address parts of this problem, but accounting
for context during detection remains difficult and often relies on expensive human
intervention. In this research, we developed a new tool that leverages the context
awareness of Large Language Models (LLMs) to perform context-aware detection of both
syntactic and semantic errors. By pruning datasets to optimize the size and quality of
the model's input, and by employing prompt engineering tailored to error detection, the
tool extends the range of detectable syntactic and semantic errors, catching errors that
existing methods cannot.
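The two ingredients named in the abstract, pruning the dataset to fit the model's context window and prompting the model to flag context-dependent errors, can be sketched as follows. This is an illustrative sketch only, not the thesis implementation: the function names (`prune_table`, `build_prompt`), the sampling strategy, and the output format are assumptions, and the actual LLM call is left out.

```python
import json
import random

def prune_table(rows, max_rows=20, seed=0):
    """Reduce input size while keeping variety: deduplicate rows, then
    sample up to max_rows so the serialized table fits the LLM's context
    window. (Sketch; the thesis may use a different pruning criterion.)"""
    unique = sorted({json.dumps(r, sort_keys=True) for r in rows})
    random.Random(seed).shuffle(unique)
    return [json.loads(r) for r in unique[:max_rows]]

def build_prompt(columns, rows):
    """Serialize the pruned table and instruct the model to flag cells
    that are syntactically malformed (wrong structure or domain) or
    semantically out of context (valid value, wrong place)."""
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(r[c]) for c in columns) for r in rows)
    return (
        "You are a data-quality assistant. For the table below, list every "
        "cell whose value is syntactically malformed or semantically "
        "inconsistent with its row and column context. Answer as a JSON "
        "list of objects with keys 'row', 'column', and 'reason'.\n\n"
        f"{header}\n{body}"
    )
```

The prompt string returned by `build_prompt` would then be sent to an LLM, whose JSON answer is parsed into a list of flagged cells; a semantic error such as `{"city": "Lyon", "country": "Germany"}` is exactly the kind of value that is well-formed in isolation but wrong in context.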
Keywords
Dataset Context; Syntactic Structure; Semantics; Large Language Models (LLMs)