Abstract
In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.
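The mechanism the abstract describes can be made concrete with a small sketch. The snippet below is a hypothetical, non-factored illustration of the optimistic-initial-model idea: the empirical model is seeded with fictitious transitions to a maximally rewarding "garden of Eden" state, and thereafter the agent simply acts greedily while updating its model conventionally. All names here (`OptimisticModelAgent`, `m0`, `eden`) are ours, not from the paper, and exact tabular value iteration stands in for the paper's approximate value iteration over a factored model.

```python
import numpy as np

class OptimisticModelAgent:
    """Sketch of an 'optimistic initial model' learner for a flat MDP.

    Hypothetical illustration only: FOIM applies the same idea per factor
    of an FMDP and plans with approximate value iteration (AVI); this toy
    uses exact value iteration on a small tabular model.
    """

    def __init__(self, n_states, n_actions, r_max, gamma=0.95, m0=10.0):
        # Add one fictitious "garden of Eden" state that pays r_max forever.
        # The empirical model starts as if every (s, a) had already been
        # observed m0 times, each time jumping to Eden: this is the entire
        # optimism of the algorithm; afterwards the agent is purely greedy.
        self.nS, self.nA, self.eden = n_states + 1, n_actions, n_states
        self.gamma, self.r_max = gamma, r_max
        self.counts = np.zeros((self.nS, self.nA, self.nS))
        self.counts[:, :, self.eden] = m0           # optimistic prior counts
        self.r_sum = np.zeros((self.nS, self.nA))   # observed reward totals
        self.r_cnt = np.zeros((self.nS, self.nA))
        self.V = np.zeros(self.nS)

    def update(self, s, a, r, s_next):
        # Conventional empirical update: real experience gradually outweighs
        # the optimistic prior mass, so optimism fades exactly where the
        # agent has actually been.
        self.counts[s, a, s_next] += 1.0
        self.r_sum[s, a] += r
        self.r_cnt[s, a] += 1.0

    def plan(self, iters=200):
        # Solve the current empirical model by value iteration.
        P = self.counts / self.counts.sum(axis=2, keepdims=True)
        R = np.where(self.r_cnt > 0,
                     self.r_sum / np.maximum(self.r_cnt, 1.0), 0.0)
        R[self.eden, :] = self.r_max                # Eden pays r_max forever
        for _ in range(iters):
            Q = R + self.gamma * (P @ self.V)       # (S,A,S) @ (S,) -> (S,A)
            self.V = Q.max(axis=1)
        return Q

    def act(self, s):
        # Always greedy with respect to the (initially optimistic) model.
        return int(np.argmax(self.plan()[s]))
```

Because real counts accumulate only where the agent acts, the optimistic prior decays precisely in well-explored regions, which is the intuition behind the paper's polynomial bound on non-near-optimal steps.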
| Original language | English |
| --- | --- |
| Title of host publication | ACM International Conference Proceeding Series |
| Volume | 382 |
| DOIs | 10.1145/1553374.1553502 |
| Publication status | Published - 2009 |
| Event | 26th Annual International Conference on Machine Learning, ICML'09 - Montreal, QC, Canada. Duration: Jun 14 2009 → Jun 18 2009 |
Other
| Other | 26th Annual International Conference on Machine Learning, ICML'09 |
| --- | --- |
| Country | Canada |
| City | Montreal, QC |
| Period | 6/14/09 → 6/18/09 |
ASJC Scopus subject areas
- Human-Computer Interaction
Cite this
Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. / Szita, István; Lőrincz, A.
ACM International Conference Proceeding Series. Vol. 382. 2009. 125.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Optimistic initialization and greediness lead to polynomial time learning in factored MDPs
AU - Szita, István
AU - Lőrincz, A.
PY - 2009
Y1 - 2009
N2 - In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.
AB - In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.
UR - http://www.scopus.com/inward/record.url?scp=70049101614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70049101614&partnerID=8YFLogxK
U2 - 10.1145/1553374.1553502
DO - 10.1145/1553374.1553502
M3 - Conference contribution
AN - SCOPUS:70049101614
SN - 9781605585161
VL - 382
BT - ACM International Conference Proceeding Series
ER -