Optimistic initialization and greediness lead to polynomial time learning in factored MDPs

István Szita, A. Lőrincz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.
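
The abstract is the only algorithmic description in this record, so the following is a minimal sketch of the optimistic-initial-model idea it describes, written for a flat tabular MDP rather than a factored one. It is an illustration under stated assumptions, not the authors' implementation: FOIM maintains a factored model and plans with approximate value iteration, while this sketch uses a tabular model, exact value iteration, a hypothetical "Eden" state, and invented names (OptimisticModelAgent, r_max, gamma).

import numpy as np

class OptimisticModelAgent:
    """Tabular sketch of the optimistic-initial-model idea behind FOIM."""

    def __init__(self, n_states, n_actions, r_max, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.gamma = r_max, gamma
        # Augment the state space with a fictitious "Eden" state that
        # pays r_max forever. The empirical model starts as if every
        # (s, a) pair had been observed once jumping to Eden, so every
        # unexplored pair looks maximally attractive to a greedy planner.
        self.eden = n_states
        self.counts = np.zeros((self.nS, self.nA, self.nS + 1))
        self.counts[:, :, self.eden] = 1.0
        self.reward_sum = np.full((self.nS, self.nA), float(r_max))

    def _greedy_q(self, iters=200):
        # Plain value iteration on the current empirical model; FOIM
        # instead runs *approximate* value iteration on a factored model.
        n = self.counts.sum(axis=2, keepdims=True)
        P = self.counts / n                   # empirical transition probs
        R = self.reward_sum / n[:, :, 0]      # empirical mean rewards
        V = np.zeros(self.nS + 1)
        V[self.eden] = self.r_max / (1.0 - self.gamma)  # Eden's value
        Q = np.zeros((self.nS, self.nA))
        for _ in range(iters):
            Q = R + self.gamma * (P @ V)      # (nS, nA) action values
            V[: self.nS] = Q.max(axis=1)
        return Q

    def act(self, s):
        # Always greedy with respect to the model: exploration is driven
        # purely by the optimistic initialization, with no randomness.
        return int(np.argmax(self._greedy_q()[s]))

    def update(self, s, a, r, s_next):
        # Conventional empirical-model update after each real transition;
        # the fictitious Eden observation is gradually washed out.
        self.counts[s, a, s_next] += 1.0
        self.reward_sum[s, a] += r

A driver loop would alternate a = agent.act(s), an environment step, and agent.update(s, a, r, s_next); the paper's guarantees bound how many of those steps can be non-near-optimal relative to the AVI solution.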

Original language: English
Title of host publication: ACM International Conference Proceeding Series
Volume: 382
DOI: 10.1145/1553374.1553502
Publication status: Published - 2009
Event: 26th Annual International Conference on Machine Learning, ICML'09 - Montreal, QC, Canada
Duration: Jun 14, 2009 – Jun 18, 2009

Other

Other: 26th Annual International Conference on Machine Learning, ICML'09
Country: Canada
City: Montreal, QC
Period: 6/14/09 – 6/18/09

Fingerprint

  • Polynomials
  • Reinforcement
  • Costs

ASJC Scopus subject areas

  • Human-Computer Interaction

Cite this

Szita, I., & Lőrincz, A. (2009). Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In ACM International Conference Proceeding Series (Vol. 382, Article 125). 26th Annual International Conference on Machine Learning, ICML'09, Montreal, QC, Canada. https://doi.org/10.1145/1553374.1553502
@inproceedings{b24de61cc40845e9bc1a1efc7655ec52,
title = "Optimistic initialization and greediness lead to polynomial time learning in factored MDPs",
abstract = "In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.",
author = "Istv{\'a}n Szita and A. Lőrincz",
year = "2009",
doi = "10.1145/1553374.1553502",
language = "English",
isbn = "9781605585161",
volume = "382",
booktitle = "ACM International Conference Proceeding Series",

}
