A comparison of Chain-of-Thought Faithfulness: GRPO vs. DPO

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful technique for enhancing the problem-solving capabilities of large language models (LLMs), particularly on complex tasks that require multi-step reasoning. Recent research, however, has shown that CoT explanations may not accurately represent a model’s actual reasoning: models can conceal flawed thought processes or change their answers without acknowledging external influences. This limitation undermines the reliability of CoT-based methods for safety supervision and alignment monitoring, since models may offer thorough but deceptive explanations for incorrect responses. Current training and fine-tuning techniques therefore need to be evaluated for how effectively they improve both the accuracy and the faithfulness of CoT reasoning. This study shows that Group Relative Policy Optimization (GRPO) training outperforms Direct Preference Optimization (DPO) in larger models, with the Qwen2.5-14B-Instruct model attaining the highest scores across all evaluation metrics. Both approaches show a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less consistent behavior at smaller model scales. These results suggest that GRPO offers a promising path toward more transparent AI systems. Our findings also highlight the trade-off between GRPO’s superior peak performance and DPO’s steady scaling behavior, and they raise questions about computational accessibility and the need for further development of faithfulness evaluation techniques as demand for explainable AI grows across many domains.

Keywords

Citation