Leveraging Reward Machines for Efficient Multi-Objective Reinforcement Learning
Publication date
Authors
DOI
Document Type
Master Thesis
Metadata
Show full item recordCollections
License
CC-BY-NC-ND
Abstract
Reinforcement Learning (RL) provides a powerful framework for sequential decision-making, enabling agents to learn optimal behavior through interaction with an environment. While standard RL assumes a single scalar reward function and a Markovian reward structure, many real-world problems involve multiple, potentially conflicting objectives and require temporally extended behaviors that fundamentally violate the Markov property. These complexities motivate the study of Multi-Objective Reinforcement Learning (MORL) and non-Markovian reward specifications, both of which demand new representations and learning strategies to ensure effective, sample-efficient, and interpretable solutions.
This thesis proposes a novel formalism, Multi-Objective Reward Machines (MORMs), to unify these challenges. Reward Machines (RMs) are finite-state automata that encode non-Markovian reward functions in a structured and expressive manner. By exposing the internal logic of the reward function to the agent, RMs enable more efficient learning and clearer task decomposition. We extend this concept to the multi-objective setting by assigning each objective its own single-objective RM, and composing them into a product automaton that captures the full structure of the temporally extended, vector-valued reward signal.
The resulting MORM is combined with a base environment MDP, which is labeled with reward-relevant events observable by the agent, yielding an augmented MOMDP over a structured, higher-dimensional state space. While this construction allows the use of existing MORL algorithms, it also increases computational complexity due to the expanded state representation. To address this, we generalize Counterfactual Experiences for Reward Machines (CRM) to the multi-objective setting. This adaptation leverages the known structure of the MORM to produce counterfactual transitions across RM components, enabling efficient off-policy learning and substantially improving sample efficiency without compromising policy quality.
We empirically validate this framework on a collection of novel non-Markovian multi-objective tasks, each constructed by composing simple reward patterns, such as sequences and loops of labeled events, into structured MORMs. The underlying environment is a discrete-state, discrete-action, deterministic gridworld in which the agent triggers events by visiting specific locations. For each experiment, we compute the true Pareto front using Pareto Value Iteration (PVI), and compare it to the solution sets obtained by applying Pareto Q-Learning (PQL) on the augmented MORM-based MOMDP, both with and without multi-objective CRM. Evaluation using metrics such as hypervolume and expected utility demonstrates that CRM yields substantial improvements in sample efficiency, especially in settings with sparse or delayed rewards, while maintaining high-quality coverage of the true optimal set.
This work contributes a unified framework for handling non-Markovian reward structure and multiple objectives within a single formulation, enabling more principled and sample-efficient learning in complex sequential decision-making tasks.
Keywords
Multi-Objective Reinforcement Learning; Reward Machines; Reinforcement Learning