Context-aware error detection for relational datasets using Large Language Models

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

Detecting erroneous values in datasets remains a challenging and time-consuming task. Errors can be syntactic, where values do not conform to the structure or domain of the other values, or semantic, where values are syntactically correct but appear in the wrong context. The variability of the contexts in which errors occur makes it hard to design a single tool that detects all errors in all contexts. Existing methods address parts of this problem, but accounting for context during detection remains challenging and often relies on expensive human intervention. In this research, we developed a new tool that leverages the context awareness of Large Language Models (LLMs) to perform context-aware detection of both syntactic and semantic errors. By pruning datasets to optimize the size and quality of the input, and by employing prompt engineering tailored to error detection, the tool extends the range of detectable syntactic and semantic errors, catching errors that cannot be detected otherwise.
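The two ingredients named in the abstract, pruning the dataset to fit the model's input budget and prompting the model to flag context-inconsistent values, can be illustrated with a minimal sketch. All function names and the prompt wording below are assumptions for illustration, not the thesis's actual implementation, and the model call itself is omitted.

```python
# Hypothetical sketch of the pruning + prompting approach; not the thesis's
# actual implementation.

def prune_column(values, max_values=20):
    """Keep at most max_values distinct values, preserving first-seen order,
    so the column's context fits into the LLM's input budget."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
        if len(seen) == max_values:
            break
    return seen

def build_prompt(column_name, values):
    """Assemble an error-detection prompt that supplies column context so the
    model can flag both syntactic and semantic outliers."""
    listed = "\n".join(f"- {v}" for v in values)
    return (
        f"The following values come from the column '{column_name}' of a "
        "relational dataset:\n"
        f"{listed}\n"
        "List any values that are syntactically malformed or semantically "
        "inconsistent with the rest of the column."
    )

# Example: 'Pluto' is syntactically valid but semantically wrong in a column
# of European capitals, while '12#4' is syntactically malformed.
sample = ["Paris", "Berlin", "Paris", "Pluto", "12#4", "Madrid"]
pruned = prune_column(sample)
prompt = build_prompt("capital_city", pruned)
print(prompt)
```

The pruning step here simply deduplicates and caps the value list; the resulting prompt would then be sent to an LLM, whose response is parsed for the flagged values.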

Keywords

Dataset Context; Syntactic Structure; Semantics; Large Language Models (LLMs)
