Learning Action Embeddings for Off-Policy Evaluation

Authored by
Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov
Abstract

Off-policy evaluation (OPE) methods allow us to estimate the expected reward of a policy using logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can suffer from high, or even infinite, variance. Saito and Joachims [13] propose marginalized IPS (MIPS), which uses action embeddings instead, reducing the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to the doubly robust (DR) estimator for combining the low variance of the direct method (DM) with the low bias of IPS.
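For context, the standard estimator forms referred to in the abstract (not stated in this record; the notation below is assumed here, following the usual OPE setup): given logged data \((x_i, a_i, r_i)_{i=1}^{n}\) of contexts, actions, and rewards collected by a logging policy \(\pi_0\), the IPS estimator of a target policy \(\pi\), and the MIPS estimator of Saito and Joachims that replaces action-level importance weights with weights over action embeddings \(e_i\), are commonly written as

\[
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
\qquad
\hat{V}_{\mathrm{MIPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i,
\]

where \(p(e \mid x, \pi) = \sum_{a} \pi(a \mid x)\, p(e \mid x, a)\) marginalizes over actions that share the embedding \(e\). The contribution described in the abstract is to obtain the embeddings \(e\) from intermediate outputs of a trained reward model rather than requiring them to be pre-defined.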

Organisation(s)
Data Base and Information Systems Section
External Organisation(s)
Brno University of Technology
Kempelen Institute of Intelligent Technologies (KINIT)
Amazon Search
Amazon, London
Type
Conference contribution
Pages
108-122
No. of pages
15
Publication date
20.03.2024
Publication status
Published
Peer reviewed
Yes
ASJC Scopus subject areas
Theoretical Computer Science, Computer Science (all)
Electronic version(s)
https://doi.org/10.48550/arXiv.2305.03954 (Access: Open)
https://doi.org/10.1007/978-3-031-56027-9_7 (Access: Closed)