In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for the unknown utility functions of users, based on stochastic reward vectors alone. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn about the user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that their regret closely approximates the regret of UCB and regular Thompson sampling provided with the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
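The abstract does not spell out the algorithmic details, but the core idea of Interactive Thompson Sampling, sampling from a posterior over the environment *and* a posterior over the user's utility, then acting greedily on the joint sample while occasionally querying the user, can be sketched. The following is a minimal illustration under several assumptions not stated in the abstract: a linear utility with unknown weights, Gaussian vector rewards with known noise, a particle approximation of the weight posterior, and pairwise-comparison queries answered by a simulated user. All names and numerical choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MOMAB instance: 3 arms, 2 objectives (values are illustrative).
TRUE_MEANS = np.array([[0.8, 0.2],
                       [0.2, 0.8],
                       [0.5, 0.5]])
W_TRUE = np.array([0.7, 0.3])   # user's linear utility weights, unknown to the agent
NOISE_SD = 0.1                  # known observation noise (simplifying assumption)

def pull(arm):
    """Stochastic vector-valued reward for an arm."""
    return TRUE_MEANS[arm] + rng.normal(0.0, NOISE_SD, size=2)

def user_prefers(v1, v2):
    """Simulated preference oracle: does the user prefer reward vector v1 to v2?"""
    return bool(W_TRUE @ v1 > W_TRUE @ v2)

def interactive_thompson_sampling(horizon=600, query_every=20, n_particles=200):
    n_arms, n_obj = TRUE_MEANS.shape
    counts = np.zeros(n_arms)
    sums = np.zeros((n_arms, n_obj))
    for a in range(n_arms):                 # one forced initial pull per arm
        counts[a] += 1
        sums[a] += pull(a)
    # Particle posterior over utility weights on the simplex.
    particles = rng.dirichlet(np.ones(n_obj), size=n_particles)
    weights = np.ones(n_particles) / n_particles

    for t in range(horizon):
        # Thompson step for the environment: sample a mean vector per arm.
        post_mean = sums / counts[:, None]
        post_sd = NOISE_SD / np.sqrt(counts)
        sampled = post_mean + rng.normal(size=(n_arms, n_obj)) * post_sd[:, None]
        # Thompson step for the user: sample a utility-weight particle.
        w = particles[rng.choice(n_particles, p=weights)]
        arm = int(np.argmax(sampled @ w))
        counts[arm] += 1
        sums[arm] += pull(arm)
        # Occasionally elicit a pairwise preference and reweight the particles.
        if t > 0 and t % query_every == 0:
            a, b = rng.choice(n_arms, size=2, replace=False)
            ans = user_prefers(post_mean[a], post_mean[b])
            margin = particles @ (post_mean[a] - post_mean[b])
            lik = 1.0 / (1.0 + np.exp(-10.0 * margin))   # soft agreement likelihood
            weights = weights * (lik if ans else 1.0 - lik)
            weights = weights / weights.sum()
    return counts, particles, weights

counts, particles, weights = interactive_thompson_sampling()
```

With these toy settings the weight posterior concentrates on particles favouring the first objective, so pulls shift toward the arm that is optimal under the user's true utility. The paper's actual ITS and umap-UCB algorithms may differ in their posterior representations and query strategies.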

Original language: English
Title of host publication: Algorithmic Decision Theory - 5th International Conference, ADT 2017, Proceedings
Subtitle of host publication: 5th International Conference, ADT 2017, Luxembourg, Luxembourg, October 25–27, 2017, Proceedings
Editors: Jörg Rothe
Number of pages: 17
ISBN (Electronic): 978-3-319-67504-6
ISBN (Print): 978-3-319-67503-9
Publication status: Published - 25 Oct 2017
Event: International Conference on Algorithmic Decision Theory, Hotel Parc Belle-Vue, 5 Avenue Marie-Thérèse, L-2132 Luxembourg, Luxembourg
Duration: 25 Oct 2017 – 27 Oct 2017
Conference number: 5

Publication series
Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10576 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: International Conference on Algorithmic Decision Theory
Abbreviated title: ADT

ID: 36362579