Reverse Iterative Deepening for Finite-Horizon MDPs with Large Branching Factors


Andrey Kolobov, Peng Dai, Mausam, Daniel S. Weld
{kolobov, daipeng, mausam, weld}@cs.washington.edu
Dept. of Computer Science and Engineering, University of Washington, Seattle, USA, WA-98195
Google Inc., 1600 Amphitheatre Pkwy, Mountain View, USA, CA-94043

Peng Dai completed this work while at the University of Washington.
Copyright 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In contrast to previous competitions, where the problems were goal-based, the 2011 International Probabilistic Planning Competition (IPPC-2011) emphasized finite-horizon reward maximization problems with large branching factors. These MDPs modeled more realistic planning scenarios and presented challenges to the previous state-of-the-art planners (e.g., those from IPPC-2008), which were primarily based on domain determinization, a technique more suited to goal-oriented MDPs with small branching factors. Moreover, large branching factors render the existing implementations of RTDP- and LAO*-style algorithms inefficient as well. In this paper we present GLUTTON, our planner at IPPC-2011 that performed well on these challenging MDPs. The main algorithm used by GLUTTON is LR2TDP, an LRTDP-based optimal algorithm for finite-horizon problems centered around the novel idea of reverse iterative deepening. We detail LR2TDP itself as well as a series of optimizations included in GLUTTON that help LR2TDP achieve competitive performance on difficult problems with large branching factors: subsampling the transition function, separating out natural dynamics, caching transition function samples, and others. Experiments show that GLUTTON and PROST, the IPPC-2011 winner, have complementary strengths, with GLUTTON demonstrating superior performance on problems with few high-reward terminal states.

Introduction

New benchmark MDPs presented at the International Probabilistic Planning Competition (IPPC) 2011 (Sanner 2011) demonstrated several weaknesses of existing solution techniques. First, the dominating planners of past years (FF-Replan (Yoon, Fern, and Givan 2007), RFF (Teichteil-Koenigsbuch, Infantes, and Kuter 2008), etc.) had been geared towards goal-oriented MDPs with relatively small branching factors. To tackle such scenarios, they had relied on fully determinizing the domain (a small branching factor made this feasible) and solving the determinized version of the given problem. For the latter part, the performance of these solvers critically relied on powerful classical planners (e.g., FF (Hoffmann and Nebel 2001)) and heuristics, all of which assumed the existence of a goal, the uniformity of action costs, benign branching factors, or all three. In contrast, the majority of IPPC-2011 MDPs were problems with a finite horizon, non-uniform action costs, large branching factors, and no goal states, characteristics to which determinization-based planners are hard to adapt. Incidentally, large branching factors made the existing implementations of heuristic search algorithms such as LRTDP (Bonet and Geffner 2003) or AO* (Nilsson 1980) obsolete as well. These algorithms are centered around the Bellman backup operator, which is very expensive to compute when state-action pairs have many successors. Second, previous top performers optimized for the probability of their policy reaching the MDP's goal, which was the evaluation criterion at preceding IPPCs (Bryce and Buffet 2008), not the expected reward of that policy. At IPPC-2011 the evaluation criterion changed to the latter, more subtle objective, and thus became more stringent. Thus, overall, IPPC-2011 introduced much more realistic MDPs and evaluation criteria than before.
Indeed, in real-world systems, large branching factors are common and are often caused by natural dynamics, effects of exogenous events or forces of nature that cannot be directly controlled but that need to be taken into account during planning. Moreover, the controller (e.g., on a robot) may only have limited time to come up with a policy, a circumstance IPPC-2011 also attempted to model, and the expected reward of the produced policy is very important. To succeed under these conditions, a planner needs to be not only scalable but also sensitive to the expected reward maximization criterion and, crucially, have strong anytime performance.

The main theoretical contribution of this paper is LR2TDP, an algorithm that, with additional optimizations, can stand up to these challenges. LR2TDP is founded on the crucial observation that for many MDPs M(H) with horizon H, one can produce a successful policy by solving M(h), the same MDP but with a much smaller horizon h. Therefore, under time constraints, trying to solve the sequence of MDPs M(1), M(2), ... with increasing horizon will often yield a near-optimal policy even if the computation is interrupted long before the planner gets to tackle MDP M(H). This strategy, which we call reverse iterative deepening, forms the basis of LR2TDP. Although the above intuition addresses the issue of anytime performance, by itself it does not enable LR2TDP to handle large branching factors.

Accordingly, in this paper we introduce GLUTTON, a planner derived from LR2TDP and our entry in IPPC-2011. GLUTTON endows LR2TDP with optimizations that help achieve competitive performance on difficult problems with large branching factors: subsampling the transition function, separating out natural dynamics, caching transition function samples, and using primitive cyclic policies as a fall-back solution.

Thus, this paper makes the following contributions: We introduce the LR2TDP algorithm, an extension of LRTDP to finite-horizon problems based on the idea of reverse iterative deepening. We describe the design of GLUTTON, our IPPC-2011 entry built around LR2TDP. We discuss various engineering optimizations that were included in GLUTTON to improve LR2TDP's performance on problems with large branching factors due to natural dynamics. We present results of empirical studies that demonstrate that LR2TDP performs much better than the straightforward extension of LRTDP to finite-horizon MDPs. In addition, we carry out ablation experiments showing the effects of various optimizations on GLUTTON's performance. Finally, we analyze the comparative performance of GLUTTON and PROST (Keller and Eyerich 2012), the winner of IPPC-2011, and find that the two have complementary strengths.

Background

MDPs. In this paper, we focus on probabilistic planning problems modeled by finite-horizon MDPs with a start state, defined as tuples of the form M(H) = <S, A, T, R, s_0, H>, where S is a finite set of states, A is a finite set of actions, T is a transition function S × A × S → [0, 1] that gives the probability of moving from s_i to s_j by executing a, R is a map S × A → R that specifies action rewards, s_0 is the start state, and H is the horizon, the number of time steps after which the process stops. In this paper, we will reason about the augmented state space of M(H), which is the set S × {0, ..., H} of (state, number of steps-to-go) pairs. Solving M(H) means finding a policy, i.e., a rule for selecting actions in augmented states, s.t. executing the actions recommended by the policy starting at the augmented initial state (s_0, H) results in accumulating the largest expected reward over H time steps. In particular, let a value function be any mapping V : S × {0, ..., H} → R, and let the value function of a policy π be the mapping V^π : S × {0, ..., H} → R that gives the expected reward from executing π starting at any augmented state (s, h) for h steps, 1 <= h <= H. Ideally, we would like to find an optimal policy π*, one whose value function V* for all s in S obeys V*(s, h) = max_π {V^π(s, h)} for 1 <= h <= H. As it turns out, for a given MDP V* is unique and satisfies the Bellman equations (Bellman 1957) for all s in S:

    V*(s, h) = max_{a in A} { R(s, a) + Σ_{s' in S} T(s, a, s') V*(s', h-1) }  for 1 <= h <= H, and V*(s, 0) = 0 otherwise.   (1)

WLOG, we assume the optimal action selection rule π* to be deterministic Markovian, i.e., of the form π* : S × {1, ..., H} → A, since for every finite-horizon MDP at least one optimal such policy is guaranteed to exist (Puterman 1994). If V* is known, a deterministic π* can be derived from it by choosing a V*-greedy action in each state for all 1 <= h <= H.

Solution Methods. Equation 1 suggests a dynamic programming-based way of finding an optimal policy, called Value Iteration (VI) (Bellman 1957). VI uses the Bellman equations as an assignment operator, the Bellman backup, to compute V* in a bottom-up fashion for t = 1, 2, ..., H. The version of VI for infinite-horizon goal-oriented stochastic shortest path MDPs (Bertsekas 1995) has given rise to many improvements. AO* (Nilsson 1980) is an algorithm that works specifically with loop-free MDPs (of which finite-horizon MDPs are a special case).
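To make Equation 1 and the VI procedure concrete, the following is a minimal Python sketch of finite-horizon value iteration over a small, explicitly enumerated MDP. The tabular data layout is an assumption for illustration only; it is not GLUTTON's implementation.

import math

def finite_horizon_vi(S, A, T, R, H):
    """T[s][a] is a list of (s_next, prob) pairs; R[s][a] is the immediate reward.
    Returns V[h][s], the optimal expected reward from s with h steps to go,
    and a greedy policy over augmented states (s, h)."""
    V = {0: {s: 0.0 for s in S}}          # V*(s, 0) = 0
    policy = {}
    for h in range(1, H + 1):             # bottom-up over steps-to-go (Equation 1)
        V[h] = {}
        for s in S:
            best_a, best_q = None, -math.inf
            for a in A:
                q = R[s][a] + sum(p * V[h - 1][s2] for s2, p in T[s][a])
                if q > best_q:
                    best_a, best_q = a, q
            V[h][s] = best_q               # Bellman backup
            policy[(s, h)] = best_a
    return V, policy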
Trial-based methods, e.g., RTDP (Barto, Bradtke, and Singh 1995) and LRTDP (Bonet and Geffner 2003), try to reach the goal from the initial state multiple times (in multiple trials) and update the value function over the states on the trial path using Bellman backups. Unlike VI, these algorithms memorize only states reachable from s_0, thereby typically requiring much less space. As we show in this paper, LRTDP can be adapted and optimized for finite-horizon MDPs.

LR2TDP

We begin by introducing LR2TDP, an extension of LRTDP for finite-horizon problems. Like its predecessor, LR2TDP solves an MDP for the given initial state s_0 optimally in the limit. Extending LRTDP to finite-horizon problems may seem an easy task, but its most natural extension performs worse than the one we propose, LR2TDP.

As a reminder, LRTDP (Bonet and Geffner 2003) for goal-oriented MDPs operates in a series of trials starting at the initial state s_0. Each trial consists of choosing the greedy best action in the current state according to the current value function, performing a Bellman backup on the current state, sampling an outcome of the chosen action, transitioning to the corresponding new state, and repeating the cycle. A trial continues until it reaches a goal, a dead end (a state from which reaching the goal is impossible), or a converged state. At the end of each trial, LRTDP performs a special convergence check on all states in the trial to prove, whenever possible, the convergence of these states' values. Once it can prove that s_0 has converged, LRTDP halts. Thus, a straightforward adaptation of LRTDP to a finite-horizon MDP M(H), which we call LRTDP-FH, is to let each trial start at (s_0, H) and run for at most H time steps. Indeed, if we convert a finite-horizon MDP to its goal-oriented counterpart, all states H steps away from s_0 are goal states. However, as we explain below, LRTDP-FH's anytime performance is not very good, so we turn to a more sophisticated approach.

Our novel algorithm, LR2TDP, follows a different strategy, which we name reverse iterative deepening. As its pseudocode in Algorithm 1 shows, it uses LRTDP-FH in a loop to solve a sequence of MDPs M(1), M(2), ..., M(H), in that order. In particular, LR2TDP first decides how to act optimally in (s_0, 1), i.e., assuming there is only one more action to execute; this is exactly equivalent to solving M(1).

Algorithm 1: LR2TDP

Input: MDP M(H) with initial state s_0
Output: Policy for M(H) starting at s_0

function LR2TDP(MDP M(H), initial state s_0)
begin
    foreach h = 1, ..., H or until time runs out do
        Run LRTDP-FH(M(h), s_0)
    end
end

function LRTDP-FH(MDP M(h), initial state s_0)
begin
    Convert M(h) into the equivalent goal-oriented MDP M_g^h, whose goals are states of the form (s, 0).
    Run LRTDP(M_g^h, s_0), memoizing the values of all the augmented states encountered in the process.
end

Then, LR2TDP runs LRTDP-FH to decide how to act optimally starting at (s_0, 2), i.e., two steps away from the horizon; this amounts to solving M(2). Then it runs LRTDP-FH again to decide how to act optimally starting in (s_0, 3), thereby solving M(3), and so on. Proceeding this way, LR2TDP either eventually solves M(H) or, if operating under a time limit, runs out of time and halts after solving M(h') for some h' < H. Crucially, in the spirit of dynamic programming, LR2TDP reuses state values computed while solving M(1), M(2), ..., M(h-1) when tackling the next MDP in the sequence, M(h). Namely, observe that any (s, h') in the augmented state space of any MDP M(h') also belongs to the augmented state spaces of all MDPs M(h''), h'' >= h', and V*(s, h') is the same for all these MDPs. Therefore, by the time LR2TDP gets to solving M(h), the values of many of its states will have been updated or even converged as a result of handling some M(i), i < h. Accordingly, LR2TDP memoizes values and convergence labels of all augmented states ever visited by LRTDP-FH while solving for smaller horizon values, and reuses them to solve subsequent MDPs in the above sequence. Thus, solving M(h) takes LR2TDP only an incremental effort over the solution of M(h-1).

LR2TDP can be viewed as backchaining from the goal in a goal-oriented MDP with no loops. Indeed, a finite-horizon MDP M(H) is simply a goal-oriented MDP whose state space is the augmented state space of M(H), and whose goals are all states of the form (s, 0). It has no loops because executing any action leads from some state (s, h) to another state (s', h-1). LR2TDP essentially solves such MDPs by first assuming that the goal is one step away from the initial state, then two steps from the initial state, and so on, until it addresses the case when the goal is H steps away from the initial state. Compare this with LRTDP-FH's behavior when solving M(H). LRTDP-FH does not backtrack from the goal; instead, it tries to forward-chain from the initial state to the goal (via trials) and propagates state values backwards whenever it succeeds. As an alternative perspective, LRTDP-FH iterates on the search depth, while LR2TDP iterates on the distance from the horizon. The benefit of the latter is that it allows for the reuse of computation across different iterations.

Clearly, both LRTDP-FH and LR2TDP eventually arrive at the optimal solution. So, what are the advantages of LR2TDP over LRTDP-FH? We argue that if stopped prematurely, the policy of LR2TDP is likely to be much better, for the following reasons:

In many MDPs M(H), the optimal policy for M(h) for some h << H is optimal or near-optimal for M(H) itself. E.g., consider a manipulator that needs to transfer blocks regularly arriving on one conveyor belt onto another belt. The manipulator can do one pick-up, move, or put-down action per time step. It gets a unit reward for moving each block, and needs to accumulate as much reward as possible over 50 time steps. Delivering one block from one belt to another takes at most 4 time steps: move the manipulator to the source belt, pick up a block, move the manipulator to the destination belt, release the block.
Repeating this sequence of actions over 50 time steps clearly achieves maximum reward for M(50). In other words, M(4)'s policy is optimal for M(50) as well. Therefore, explicitly solving M(50) for all 50 time steps is a waste of resources; solving M(4) is enough. However, LRTDP-FH will try to do the former: it will spend a lot of effort trying to solve M for horizon 50 at once. Since it spreads its effort over many time steps, it will likely fail to completely solve M(h) for any h < H by the deadline. Contrariwise, LR2TDP solves the given problem incrementally, and may have a solution for M(4) (and hence for M(50)) if stopped prematurely.

When LRTDP-FH starts running, many of its trials are very long, since each trial halts only when it reaches a converged state, and at the beginning reaching a converged state takes about H time steps. Moreover, at the beginning, each trial causes the convergence of only a few states (those near the horizon), while the values of augmented states with small time step values change very little. Thus, the time spent on executing the trials is largely wasted. In contrast, LR2TDP's trials when solving an MDP M(h) are very short, because they quickly run into states that converged while solving M(h-1) and before, and often lead to convergence of most of a trial's states. Hence, we can expect LR2TDP to be faster.

As a consequence of large trial length, LRTDP-FH explores (and therefore memorizes) many augmented states whose values (and policies) will not have converged by the time the planning process is interrupted. Thus, it risks using up the available memory before it runs out of time, and to little effect, since it will not know well how to behave in most of the stored states anyway. In contrast, LR2TDP typically knows how to act optimally in a large fraction of the augmented states in its memory.

Note that, incidentally, LR2TDP works in much the same way as VI, raising a question: why not use VI in the first place? The advantage of asynchronous dynamic programming over VI is similar in finite-horizon settings and in goal-oriented settings. A large fraction of the state space may be unreachable from s_0 in general and by the optimal policy in particular. LR2TDP avoids storing information about many of these states, especially if guided by an informative heuristic. In addition, in finite-horizon MDPs, many states are not reachable from s_0 within H steps, further increasing the potential savings from using LR2TDP.
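To make the reverse iterative deepening loop of Algorithm 1 concrete, here is a minimal Python sketch. The LRTDP-FH subroutine is passed in as a callback and the shared value table and deadline handling are simplified assumptions; this is not GLUTTON's actual implementation.

import time

def lr2tdp(mdp, s0, H, deadline, lrtdp_fh):
    """Reverse iterative deepening: solve M(1), M(2), ..., M(H) in order,
    sharing one table of augmented-state values across iterations.
    `lrtdp_fh(mdp, s0, h, values, solved)` is assumed to run LRTDP on the
    goal-oriented version of M(h) and to add converged augmented states to `solved`."""
    values = {}     # (state, steps_to_go) -> value estimate, reused across horizons
    solved = set()  # augmented states whose values have converged
    h_done = 0
    for h in range(1, H + 1):
        if time.time() >= deadline:
            break                       # anytime cutoff: keep the largest solved horizon
        lrtdp_fh(mdp, s0, h, values, solved)
        h_done = h                      # solving M(h+1) is only incremental work now
    return values, solved, h_done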

So far, we have glossed over a subtle question: if LR2TDP is terminated after solving M(h), h < H, what policy should it use in augmented states (s, h') that it has never encountered? There are two cases to consider: a) LR2TDP may have solved s for some h'' < min{h, h'}, and b) LR2TDP has not solved (or even visited) s for any time step. In the first case, LR2TDP can simply find the largest value h'' < min{h, h'} for which (s, h'') is solved and return the optimal action for (s, h''). This is the approach we use in GLUTTON, our implementation of LR2TDP, and it works well in practice. Case b) is more complicated and may arise, for instance, when s is not reachable from s_0 within h steps. One possible solution is to fall back on some simple default policy in such situations. We discuss this option when describing the implementation of GLUTTON.

Max-Reward Heuristic

To converge to an optimal solution, LR2TDP needs to be initialized with an admissible heuristic, i.e., an upper bound on V*. For this purpose, GLUTTON uses an estimate we call the Max-Reward heuristic. Its computation hinges on knowing the maximum reward R_max any action can yield in any state, or an upper bound on it. R_max can be automatically derived for an MDP at hand with simple domain analysis. To produce a heuristic value V(s, h) for (s, h), Max-Reward finds the largest horizon value h' < h for which GLUTTON already has an estimate V(s, h'). Recall that GLUTTON is likely to have V(s, h') for some such h', since it solves the given MDP in the reverse iterative deepening fashion with LR2TDP. If so, Max-Reward sets V(s, h) = V(s, h') + R_max (h - h'); otherwise, it sets V(s, h) = R_max h. The bound obtained in this way is often very loose but is guaranteed to be admissible.

The Challenge of Large Branching Factors

In spite of its good anytime behavior, LR2TDP by itself would not perform well on many IPPC-2011 benchmarks due to large branching factors in these MDPs. In real-world systems, large branching factors often arise due to the presence of natural dynamics. Roughly, the natural dynamics of an MDP describes what happens to various objects in the system if the controller does not act on them explicitly in a given time step. In physical systems, it can model laws of nature, e.g., the effects of radioactive decay on a collection of particles. It can also capture effects of exogenous events. For instance, in the MDPs of Sysadmin (Sanner 2011), one of the IPPC-2011 domains, the task is to control a network of servers. At any time step, each server is either up or down. The controller can restart one server per time step, and that server is guaranteed to be up at the next time step. The other servers can change their state spontaneously: those that are down can go back up with some small probability, and those that are up can go down with probability proportional to the fraction of their neighbors that are down. These random transitions are the natural dynamics of the system, and they cause the MDP to have a large branching factor. Imagine a Sysadmin problem with 50 servers. Due to the natural dynamics, the system can transition to any of the 2^50 states from any given one in just one time step. The primary effect of a large branching factor on the effectiveness of algorithms such as VI, RTDP, or AO* is that computing Bellman backups (Equation 1) explicitly becomes prohibitively expensive, since the summation in it has to be carried out over a large fraction of the state space. We address this issue in the next section.
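Before moving on, here is a small sketch of the Max-Reward heuristic defined above. The layout of the value table (keyed by (state, steps-to-go)) is an assumption for illustration; only the update rule comes from the text.

def max_reward_heuristic(V, s, h, R_max):
    """Admissible upper bound on V*(s, h) via the Max-Reward rule:
    reuse the estimate for the largest h' < h already present in the table V
    and pad it with R_max for every remaining step."""
    for h_prev in range(h - 1, 0, -1):        # largest h' < h with an estimate
        if (s, h_prev) in V:
            return V[(s, h_prev)] + R_max * (h - h_prev)
    return R_max * h                           # nothing known about s: R_max per step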
GLUTTON

In this section, we present GLUTTON, our LR2TDP-based entry at the IPPC-2011 competition that endows LR2TDP with mechanisms for efficiently handling natural dynamics and other optimizations. Below we describe each of these optimizations in detail. A C++ implementation of GLUTTON is available at http://www.cs.washington.edu/ai/planning/glutton.html.

Subsampling the Transition Function. GLUTTON's way of dealing with a high-entropy transition function is to subsample it. For each encountered state-action pair (s, a), GLUTTON samples a set U_{s,a} of successors of s under a, and performs Bellman backups using states in U_{s,a}:

    V(s, h) <- max_{a in A} { R(s, a) + Σ_{s' in U_{s,a}} T(s, a, s') V(s', h-1) }   (2)

The size of U_{s,a} is chosen to be much smaller than the number of states to which a could transition from s. There are several heuristic ways of setting this value, e.g., based on the entropy of the transition function. At IPPC-2011 we chose |U_{s,a}| for a given problem to be a constant.

Subsampling can give an enormous improvement in efficiency for GLUTTON at a reasonably small reduction in solution quality compared to full Bellman backups. However, subsampling alone does not make solving many of the IPPC benchmarks feasible for GLUTTON. Consider, for instance, the aforementioned Sysadmin example with 50 servers (and hence 50 state variables). There is a total of 51 ground actions in the problem, one for restarting each server plus a noop action. Each action can potentially change all 50 variables, and the value of each variable is sampled independently from the values of the others. Suppose we set |U_{s,a}| = 30. Even for such a small size of U_{s,a}, determining the current greedy action in just one state could require 51 (50 x 30) = 76,500 variable sampling operations. Considering that the procedure of computing the greedy action in a state may need to be repeated billions of times, the need for further improvements, such as those that we describe next, quickly becomes evident.
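The sketch below is in the spirit of the subsampled backup in Equation 2. Here the sum over sampled successors is replaced by the empirical mean of their values, a common Monte Carlo estimator; the sampler interface, the sample count, and the value-table default are assumptions, not GLUTTON's actual code.

def subsampled_backup(s, h, actions, sample_successor, R, V, n_samples=30):
    """Approximate Bellman backup: estimate each action's value from a small
    set of sampled successors instead of summing over the full state space.
    `sample_successor(s, a)` is assumed to draw one successor of s under a."""
    best_q = float("-inf")
    for a in actions:
        U = [sample_successor(s, a) for _ in range(n_samples)]     # U_{s,a}
        q = R(s, a) + sum(V.get((s2, h - 1), 0.0) for s2 in U) / len(U)
        best_q = max(best_q, q)
    return best_q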

Separating Out Natural Dynamics. One of our key observations is the fact that the efficiency of sampling successor states for a given state can be drastically increased by reusing some of the variable samples when generating successors for multiple actions. To do this, we separate each action's effects into those due to natural dynamics (exogenous effects), those due to the action itself (pure effects), and those due to some interaction between the two (mixed effects). More formally, assume that an MDP with natural dynamics has a special action noop that captures the effects of natural dynamics when the controller does nothing. In the presence of natural dynamics, for each non-noop action a, the set X of the problem's state variables can be represented as a disjoint union

    X = X_ex^a ∪ X_pure^a ∪ X_mixed^a ∪ X_none^a.

Moreover, for the noop action we have

    X = ( ∪_{a != noop} (X_ex^a ∪ X_none^a ∪ X_mixed^a) ) ∪ X_none^noop,

where X_ex^a are variables acted upon only by the exogenous effects, X_pure^a only by the pure effects, X_mixed^a by both the exogenous and pure effects, and X_none^a are not affected by the action at all. For example, in a Sysadmin problem with n machines, for each action a other than the noop, |X_pure^a| = 0, |X_ex^a| = n - 1, and |X_none^a| = 0, since natural dynamics acts on any machine unless the administrator restarts it. |X_mixed^a| = 1, consisting of the variable for the machine the administrator restarts. Notice that, at least in the Sysadmin domain, for each non-noop action, |X_ex^a| is much larger than |X_pure^a| + |X_mixed^a|. Intuitively, this is true in many real-world domains as well: natural dynamics affects many more variables than any single non-noop action.

These observations suggest generating the set U_{s,noop} of successor states for the noop action, and then modifying these samples in order to obtain successors for other actions by resampling some of the state variables using each action's pure and mixed effects. We illustrate this technique on the example of approximately determining the greedy action in some state s of the Sysadmin-50 problem. Namely, suppose that for each action a in s we want to sample a set of successor states U_{s,a} to evaluate Equation 2. First, we generate |U_{s,noop}| noop sample states using the natural dynamics (i.e., the noop action). Setting |U_{s,noop}| = 30 for the sake of the example, this takes 50 x 30 = 1,500 variable sampling operations, as explained previously. Now, for each resulting s' in U_{s,noop} and each a != noop, we need to re-sample the variables X_pure^a ∪ X_mixed^a and substitute their values into s'. Since |X_pure^a ∪ X_mixed^a| = 1, this takes one variable sampling operation per action per s' in U_{s,noop}. Therefore, the total number of additional variable sampling operations to compute the sets U_{s,a} for all a != noop is 30 noop state samples x 1 variable sample per non-noop action per noop state sample x 50 non-noop actions = 1,500. This gives us 30 state samples for each non-noop action. Thus, to evaluate Equation 2 in a given state with 30 state samples per action, we have to perform 1,500 + 1,500 = 3,000 variable sampling operations. This is about 25 times fewer than the 76,500 operations we would have to perform if we subsampled naively. Clearly, in general the speedup will depend on how localized the actions' pure and mixed effects in the given MDP are compared to the effects of natural dynamics.

The caveat of sharing the natural dynamics samples for generating non-noop action samples is that the resulting non-noop action samples are not independent, i.e., are biased. However, in our experience, the speedup from this strategy (as illustrated by the above example) and the associated gains in policy quality when planning under time constraints outweigh the disadvantages due to the bias in the samples. We note that several techniques similar to subsampling and separating natural dynamics have been proposed in the reinforcement learning (Proper and Tadepalli 2006) and concurrent MDP (Mausam and Weld 2004) literature. An alternative way of increasing the efficiency of Bellman backups is performing them on a symbolic value function representation, e.g., as in symbolic RTDP (Feng, Hansen, and Zilberstein 2003). A great improvement over Bellman backups with explicitly enumerated successors, it nonetheless does not scale to many IPPC-2011 problems.
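The following sketch illustrates the sample-sharing idea in a Sysadmin-like setting: the noop (natural dynamics) successors are sampled once and then patched per action. The factored state representation and the probability parameters are assumptions for illustration only.

import random

def shared_action_samples(s, n_servers, n_samples, p_up, p_down, neighbors):
    """Generate successor samples for every action by first sampling the natural
    dynamics (noop) and then re-sampling only the one variable a non-noop action fixes.
    State s is a tuple of booleans: s[i] is True iff server i is up."""
    def sample_noop(state):
        nxt = []
        for i, up in enumerate(state):
            if up:
                down_frac = sum(not state[j] for j in neighbors[i]) / max(1, len(neighbors[i]))
                nxt.append(random.random() >= p_down * down_frac)   # may go down
            else:
                nxt.append(random.random() < p_up)                   # may come back up
        return tuple(nxt)

    noop_samples = [sample_noop(s) for _ in range(n_samples)]
    samples = {"noop": noop_samples}
    for i in range(n_servers):                        # action i: restart server i
        # reuse the noop samples; only the restarted server's variable changes
        samples[f"restart_{i}"] = [t[:i] + (True,) + t[i + 1:] for t in noop_samples]
    return samples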
Caching the Transition Function Samples. In spite of the already significant speedup due to separating out the natural dynamics, we can compute an approximation to the transition function even more efficiently. Notice that nearly all the memory used by algorithms such as LR2TDP is occupied by the state-value table containing the values for the already visited (s, h) pairs. Since LR2TDP populates this table lazily (as opposed to VI), when LR2TDP starts running the table is almost empty and most of the available memory on the machine is unused. Instead, GLUTTON uses this memory as a cache for samples from the transition function. That is, when GLUTTON analyzes a state-action pair (s, a) for the first time, it samples successors of s under a as described above and stores them in this cache (we assume the MDP to be stationary, so the samples do not need to be cached separately for each ((s, h), a) pair). When GLUTTON encounters (s, a) again, it retrieves the samples for it from the cache, as opposed to re-generating them. Initially the GLUTTON process is CPU-bound, but due to caching it quickly becomes memory-bound as well. Thus, the cache helps it make the most of the available resources. When all of the memory is filled up, GLUTTON starts gradually shrinking the cache to make room for the growing state-value table. Currently, it chooses state-action pairs for eviction and replacement randomly.
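A small sketch of a transition-sample cache with random eviction in the spirit just described; the entry-count limit stands in for a real memory-pressure check, and the data layout is a simplifying assumption.

import random

class TransitionSampleCache:
    """Cache successor samples per (state, action); the MDP is assumed stationary,
    so samples need not be stored separately for each steps-to-go value."""
    def __init__(self, max_entries):
        self.max_entries = max_entries   # stand-in for checking actual free memory
        self.cache = {}

    def get(self, s, a, sampler, n_samples):
        key = (s, a)
        if key not in self.cache:
            if len(self.cache) >= self.max_entries:
                # shrink the cache to make room: evict a random (s, a) entry
                self.cache.pop(random.choice(list(self.cache)))
            self.cache[key] = [sampler(s, a) for _ in range(n_samples)]
        return self.cache[key]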

Default Policies. Since GLUTTON subsamples the transition function, it may terminate with an incomplete policy: it may not know a good action in states it missed due to subsampling. To pick an action in such a state (s, h'), GLUTTON first attempts to use the trick discussed previously, i.e., to return either the optimal action for some solved state (s, h''), h'' < h', or a random one. However, if the branching factor is large or the amount of available planning time is small, GLUTTON may need to do such random substitutions for so many states that the resulting policy is very bad, possibly worse than the uniformly random one.

As it turns out, for many MDPs there are simple cyclic policies that do much better than the completely random policy. A cyclic policy consists in repeating the same sequence of steps over and over again. Consider, for instance, the robotic manipulator scenario from before. The optimal policy for it repeats an action cycle of length 4. In general, near-optimal cyclic policies are difficult to discover. However, it is easy to evaluate the set of primitive cyclic policies for a problem, each of which repeats a single action. This is exactly what GLUTTON does. For each action, it evaluates the cyclic policy that repeats that action in any state by simulating this policy several times and averaging the reward. Then, it selects the best such policy and compares it to three others, also evaluated by simulation: (1) the smart policy computed by running LR2TDP, with random actions substituted in previously unencountered states, (2) the smart policy with the action from the best primitive cyclic policy substituted in these states, and (3) the completely random policy. For the actual execution, GLUTTON uses the best of these four. As we show in the Experiments section, on several domains, pure primitive cyclic policies turned out to be surprisingly effective.

Performance Analysis

Our goals in this section are threefold: a) to show the advantage of LR2TDP over LRTDP-FH, b) to show the effects of the individual optimizations on GLUTTON's performance, and c) to compare the performance of GLUTTON at IPPC-2011 to that of its main competitor, PROST. We report results using the setting of IPPC-2011 (Sanner 2011). At IPPC-2011, the competitors needed to solve 80 problems. The problems came from 8 domains, 10 problems each. Within each domain, problems were numbered 1 through 10, with problem size/difficulty roughly increasing with its number. All problems were reward-maximization finite-horizon MDPs with a horizon of 40. They were described in the new RDDL language (Sanner 2010), but translations to the older format, PPDDL, were available and participants could use them instead. The participants had a total of 24 hours of wall clock time to allocate in any way they wished among all the problems. Each participant ran on a separate large instance of an Amazon EC2 node (4 virtual cores on 2 physical cores, 7.5 GB RAM).

The 8 benchmark domains at IPPC-2011 were Sysadmin (abbreviated as SysAdm in figures in this section), Game of Life (GoL), Traffic, Skill Teaching (Sk T), Recon, Crossing Traffic (Cr Tr), Elevators (Elev), and Navigation (Nav). The Sysadmin, Game of Life, and Traffic domains are very large (many with over 2^50 states). Recon, Skill Teaching, and Elevators are smaller but require a larger planning lookahead to behave near-optimally. Navigation and Crossing Traffic essentially consist of goal-oriented MDPs. The goal states are not explicitly marked as such; instead, they are the only states visiting which yields a reward of 0, whereas the highest reward achievable in all other states is negative.

A planner's solution policy for a problem was assessed by executing the policy 30 times on a special server. Each of the 30 rounds would consist of the server sending the problem's initial state, the planner sending back an action for that state, the server executing the action, noting down the reward, and sending a successor state, and so on. After 40 such exchanges, another round would start. A planner's performance was judged by its average reward over 30 rounds. In most of the experiments, we show planners' normalized scores on various problems. The normalized score of planner Pl on problem p always lies in the [0, 1] interval and is computed as follows:

    score_norm(Pl, p) = max{0, s_raw(Pl, p) - s_baseline(p)} / ( max_i {s_raw(Pl_i, p)} - s_baseline(p) )

where s_raw(Pl, p) is the average reward of the planner's policy for p over 30 rounds, max_i {s_raw(Pl_i, p)} is the maximum average reward of any IPPC-2011 participant on p, and s_baseline(p) = max{s_raw(random, p), s_raw(noop, p)} is the baseline score, the maximum of the expected rewards yielded by the noop and random policies. Roughly, a planner's score is its policy's reward as a fraction of the highest reward of any participant's policy on the given problem.
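A small sketch of the normalized-score computation defined above; the variable names are illustrative.

def normalized_score(raw_score, baseline_score, best_raw_score):
    """Normalized IPPC-2011 score: the planner's reward above the noop/random
    baseline, as a fraction of the best participant's reward above that baseline.
    Clamped to [0, 1]."""
    if best_raw_score <= baseline_score:
        return 0.0                      # degenerate case: nobody beat the baseline
    score = (raw_score - baseline_score) / (best_raw_score - baseline_score)
    return max(0.0, min(1.0, score))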
We start by presenting the experiments that illustrate the benefits of the various optimizations described in this paper. In these experiments, we gave different variants of GLUTTON at most 18 minutes to solve each of the 80 problems (i.e., divided the available 24 hours equally among all instances).

Reverse Iterative Deepening. To demonstrate the power of iterative deepening, we built a version of GLUTTON denoted GLUTTON-NO-ID that uses LRTDP-FH instead of LR2TDP. A priori, we may expect two advantages of GLUTTON over GLUTTON-NO-ID. First, according to the intuition in the section describing LR2TDP, GLUTTON should have better anytime performance. That is, if GLUTTON and GLUTTON-NO-ID are interrupted T seconds after starting to solve a problem, GLUTTON's solution should be better. Second, GLUTTON should be faster because GLUTTON's trials are on average shorter than GLUTTON-NO-ID's. The length of the latter's trials is initially equal to the horizon, while most of the former's end after only a few steps. Under limited-time conditions such as those of IPPC-2011, both of these advantages should translate to better solution quality for GLUTTON. To verify this prediction, we ran GLUTTON-NO-ID under IPPC-2011 conditions (i.e., on a large instance of Amazon EC2 with a 24-hour limit) and calculated its normalized scores on all the problems as if it had participated in the competition.

Figure 1: Average normalized scores of GLUTTON and GLUTTON-NO-ID on all of the IPPC-2011 domains.

Figure 1 compares GLUTTON's and GLUTTON-NO-ID's results. On most domains, GLUTTON-NO-ID performs worse than GLUTTON, and on Sysadmin, Elevators, and Recon the difference is very large. This is a direct consequence of the above theoretical predictions. Both GLUTTON-NO-ID and GLUTTON are able to solve small instances on most domains within the allocated time. However, on larger instances, both GLUTTON-NO-ID and GLUTTON typically use up all of the allocated time for solving the problem, and both are interrupted while solving. Since GLUTTON-NO-ID has worse anytime performance, its solutions on large problems tend to be worse than GLUTTON's. In fact, the Recon and Traffic domains are so complicated that GLUTTON-NO-ID and GLUTTON are almost always stopped before finishing to solve them. As we show when analyzing cyclic policies, on Traffic both planners end up falling back on such policies, so their scores are the same. However, on Recon cyclic policies do not work very well, causing GLUTTON-NO-ID to fail dramatically due to its poor anytime performance.

Separating Out Natural Dynamics. To test the importance of separating out natural dynamics, we created a version of our planner, GLUTTON-NO-SEP-ND, lacking this feature.

Figure 2: Average normalized scores of GLUTTON and GLUTTON-NO-SEP-ND on all of the IPPC-2011 domains.

Namely, when computing the greedy best action for a given state, GLUTTON-NO-SEP-ND samples the transition function of each action independently. For any given problem, the number of generated successor state samples N per state-action pair was the same for GLUTTON and GLUTTON-NO-SEP-ND, but varied slightly from problem to problem. To gauge the performance of GLUTTON-NO-SEP-ND, we ran it on all 80 problems under the IPPC-2011 conditions. We expected GLUTTON-NO-SEP-ND to perform worse overall: without factoring out natural dynamics, sampling successors should become more expensive, so GLUTTON-NO-SEP-ND's progress towards the optimal solution should be slower.

Figure 2 compares the performance of GLUTTON and GLUTTON-NO-SEP-ND. As predicted, GLUTTON-NO-SEP-ND's scores are noticeably lower than GLUTTON's. However, we discovered the performance pattern to be richer than that. As it turns out, GLUTTON-NO-SEP-ND solves small problems from small domains (such as Elevators, Skill Teaching, etc.) almost as fast as GLUTTON. This effect is due to the presence of caching. Indeed, sampling the successor function is expensive during the first visit to a state-action pair, but the samples get cached, so on subsequent visits to this pair neither planner incurs any sampling cost. Crucially, on small problems, both GLUTTON and GLUTTON-NO-SEP-ND have enough memory to store the samples for all state-action pairs they visit in the cache. Thus, GLUTTON-NO-SEP-ND incurs a higher cost only at the initial visit to a state-action pair, which results in an insignificant speed increase overall. In fact, although this is not shown explicitly in Figure 2, GLUTTON-NO-SEP-ND occasionally performs better than GLUTTON on small problems. This happens because for a given state, the GLUTTON-NO-SEP-ND-produced samples for all actions are independent. This is not the case with GLUTTON, since these samples are derived from the same set of samples from the noop action. Consequently, GLUTTON's samples have more bias, which makes the set of samples somewhat unrepresentative of the actual transition function. The situation is quite different on larger domains such as Sysadmin. On them, both GLUTTON and GLUTTON-NO-SEP-ND at some point have to start shrinking the cache to make space for the state-value table, and hence may have to resample the transition function for a given state-action pair over and over again. For GLUTTON-NO-SEP-ND, this causes an appreciable performance hit, immediately visible in Figure 2 on the Sysadmin domain.

Caching Transition Function Samples. To demonstrate the benefits of caching, we pit GLUTTON against its clone without caching, GLUTTON-NO-CACHING. GLUTTON-NO-CACHING is so slow that it cannot handle most IPPC-2011 problems. Therefore, to show the effect of caching we run GLUTTON and GLUTTON-NO-CACHING on instance 2 of six IPPC-2011 domains (all domains but Traffic and Recon, whose problem 1 is already very hard), and record the amount of time it takes them to solve these instances. Instance 2 was chosen because it is harder than instance 1 and yet is easy enough that GLUTTON can solve it fairly quickly on all six domains both with and without caching.

Figure 3: Time it took GLUTTON with and without caching to solve problem 2 of six IPPC-2011 domains.

As Figure 3 shows, even on problem 2 the speed-up due to caching is significant, reaching about 2.5 on the larger domains such as Game of Life, i.e., where it is most needed.
On domains with big branching factors, e.g., Recon, caching makes the difference between success and utter failure.

Cyclic Policies. The cyclic policies evaluated by GLUTTON are seemingly so simple that it is hard to believe they ever beat the policy produced after several minutes of GLUTTON's honest planning. Indeed, on most problems GLUTTON does not resort to them. Nonetheless, they turn out to be useful on a surprising number of problems. Consider, for instance, Figures 4 and 5. They compare the normalized scores of GLUTTON's smart policy produced at IPPC-2011 and the best primitive cyclic policy across various problems from these domains.

Figure 4: Normalized scores of the best primitive cyclic policies and of GLUTTON's smart policies on Game of Life.

Figure 5: Normalized scores of the best primitive cyclic policies and of the smart policies produced by GLUTTON on Traffic.

On Game of Life (Figure 4), GLUTTON's smart policies for the easier instances clearly win. At the same time, notice that as the problem size increases, the quality of the cyclic policies nears and eventually exceeds that of the smart policies. This happens because the increase in difficulty of problems within the domain is not accompanied by a commensurate increase in time allocated for solving them. Therefore, the quality of the smart policy GLUTTON can come up with within the allocated time keeps dropping, as seen in Figure 4. Granted, on Game of Life the quality of cyclic policies is also not very high, although it still helps GLUTTON score higher than 0 on all the problems. However, the Traffic domain proves (Figure 5) that even primitive cyclic policies can be very powerful. On this domain, they dominate anything GLUTTON can come up with on its own, and approach in quality the policies of PROST, the winner on this set of problems. It is due to them that GLUTTON performed reasonably well at IPPC-2011 on Traffic. Whether the success of primitive cyclic policies is particular to the structure of IPPC-2011 problems or generalizes beyond them is a topic for future research.

Comparison with PROST. On nearly all IPPC-2011 problems, either GLUTTON or PROST was the top performer, so we compare GLUTTON's performance only to PROST's. When looking at the results, it is important to keep in mind one major difference between these planners. PROST (Keller and Eyerich 2012) is an online planner, whereas GLUTTON is an offline one. When given n seconds to solve a problem, GLUTTON spends this entire time trying to solve the problem from the initial state for as large a horizon as possible (recall its reverse iterative deepening strategy). Instead, PROST plans online, only for states it gets from the server. As a consequence, it has to divide up the n seconds into smaller time intervals, each of which is spent planning for a particular state it receives from the server.

Since these intervals are short, it is unreasonable to expect PROST to solve a state for a large horizon value within that time. Therefore, PROST explores the state space only up to a preset depth from the given state, which, as far as we know from personal communication with PROST's authors, is 15.

Both GLUTTON's and PROST's strategies have their disadvantages. GLUTTON may spend considerable effort on states it never encounters during the evaluation rounds. Indeed, since each IPPC-2011 problem has horizon 40 and needs to be attempted 30 times during evaluation, the number of distinct states for which performance really matters is at most 30 x 39 + 1 = 1,171 (the initial state is encountered 30 times). The number of states GLUTTON visits and tries to learn a policy for during training is typically many orders of magnitude larger. On the other hand, PROST, due to its artificial lookahead limit, may fail to produce good policies on problems where most high-reward states can only be reached after more than 15 steps from (s_0, H), e.g., goal-oriented MDPs.

During IPPC-2011, GLUTTON used a more efficient strategy of allocating time to different problems than simply dividing the available time equally, as we did for the ablation studies. Its high-level idea was to solve easy problems first and devote more time to harder ones. To do so, GLUTTON first solved problem 1 from each domain. Then it kept redistributing the remaining time equally among the remaining problems and picking the next problem from the domain whose instances on average had been the fastest to solve. As a result, the hardest problems got 40-50 minutes of planning.
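A small sketch of the greedy time-allocation scheme just described; the `solve(domain, index, budget)` helper (returning the seconds actually used) and the bookkeeping details are hypothetical, not GLUTTON's actual interface.

import time

def allocate_and_solve(domains, total_seconds, solve):
    """Solve problem 1 of each domain first, then repeatedly split the remaining
    time equally among the unsolved problems and take the next problem from the
    domain that has been fastest to solve so far."""
    deadline = time.time() + total_seconds
    pending = {d: list(range(1, 11)) for d in domains}        # 10 problems per domain
    spent = {d: [] for d in domains}

    def budget():
        n_left = sum(len(v) for v in pending.values())
        return max(0.0, deadline - time.time()) / max(1, n_left)

    for d in domains:                                         # warm-up: problem 1 everywhere
        spent[d].append(solve(d, pending[d].pop(0), budget()))
    while any(pending.values()) and time.time() < deadline:
        d = min((d for d in domains if pending[d]),
                key=lambda d: sum(spent[d]) / len(spent[d]))  # historically fastest domain
        spent[d].append(solve(d, pending[d].pop(0), budget()))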
Figure 6: Average normalized scores of GLUTTON and PROST on all of the IPPC-2011 domains.

Figure 6 shows the averages of GLUTTON's and PROST's normalized scores on all IPPC domains, with GLUTTON using the above time allocation approach. Overall, GLUTTON is much better on Navigation and Crossing Traffic, at par on Elevators, slightly worse on Recon and Skill Teaching, and much worse on Sysadmin, Game of Life, and Traffic. As it turns out, GLUTTON's successes and failures have a fairly clear pattern. Sysadmin, Game of Life, and Traffic, although very large, do not require a large lookahead to produce a reasonable policy. That is, although the horizon of all these MDPs is 40, for many of them the optimal policy with a lookahead of only 4-5 has good performance. As a result, GLUTTON's attempts to solve the entire problem offline do not pay off: by timeout, GLUTTON learns how to behave well only in the initial state and many of the states at depths 2-3 from it. However, during policy execution it often ends up in states it failed to even visit during the training stage, and is forced to resort to a default policy. It fails to visit these states not only because it subsamples the transition function, but also because many of them cannot be reached from the initial state within a small number of steps. On the other hand, PROST copes with such problems well. Its online nature ensures that it does not waste as much effort on states it ends up never visiting, and it knows what to do (at least to some degree) in all the states encountered during evaluation rounds. Moreover, trying to solve each such state for only horizon 15 allows it to produce a good policy even if it fails to converge within the allocated time.

Recon, Skill Teaching, and Elevators are smaller, so before timeout, GLUTTON manages to either solve them completely or explore their state spaces to significant horizon values and visit most of their states at some distance from s_0. Therefore, although GLUTTON still has to use default policies in some states, in most states it has a good policy. In Navigation and Crossing Traffic, the distance from (s_0, H) to the goal (i.e., the highest-reward states) is often larger than PROST's lookahead of 15. This means that PROST often does not see goal states during the learning stage, and hence fails to construct a policy that aims for them. Contrariwise, GLUTTON, due to its strategy of iterative deepening, can usually find the goal states and solve for a policy that reaches them with high probability.

Conclusion

Unlike previous planning competitions, IPPC-2011 emphasized finite-horizon reward maximization problems with large branching factors. In this paper, we presented LR2TDP, a novel LRTDP-based optimal algorithm for finite-horizon problems centered around the idea of reverse iterative deepening, and GLUTTON, our LR2TDP-based planner at IPPC-2011 that performed well on these challenging MDPs. To achieve this, GLUTTON includes several important optimizations: subsampling the transition function, separating out natural dynamics, caching the transition function samples, and using primitive cyclic policies as the default solution. We presented an experimental evaluation of GLUTTON's core ideas and a comparison of GLUTTON to the IPPC-2011 top-performing planner, PROST. GLUTTON and PROST have complementary strengths, with GLUTTON demonstrating superior performance on problems with goal states, although PROST won overall. Since PROST is based on UCT and GLUTTON on LRTDP, it is natural to ask: is UCT a better algorithm for finite-horizon MDPs, or would LR2TDP outperform UCT if LR2TDP were used online? A comparison of an online version of GLUTTON and PROST should provide an answer.

Acknowledgments

We would like to thank Thomas Keller and Patrick Eyerich from the University of Freiburg for valuable information about PROST, and the anonymous reviewers for insightful comments. This work has been supported by NSF grant IIS-1016465, ONR grant N00014-12-1-0211, and the UW WRF/TJ Cable Professorship.

References

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81-138.

Bellman, R. 1957. Dynamic Programming. Princeton University Press.

Bertsekas, D. 1995. Dynamic Programming and Optimal Control. Athena Scientific.

Bonet, B., and Geffner, H. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS'03, 12-21.

Bryce, D., and Buffet, O. 2008. International planning competition, uncertainty part: Benchmarks and results. http://ippc-2008.loria.fr/wiki/images/0/03/Results.pdf.

Feng, Z.; Hansen, E. A.; and Zilberstein, S. 2003. Symbolic generalization for on-line planning. In UAI, 109-116.

Hoffmann, J., and Nebel, B. 2001. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research 14:253-302.

Keller, T., and Eyerich, P. 2012. PROST: Probabilistic Planning Based on UCT. In ICAPS'12.

Mausam, and Weld, D. S. 2004. Solving concurrent Markov decision processes. In AAAI'04.

Nilsson, N. 1980. Principles of Artificial Intelligence. Tioga Publishing.

Proper, S., and Tadepalli, P. 2006. Scaling model-based average-reward reinforcement learning for product delivery. In ECML, 735-742.

Puterman, M. 1994. Markov Decision Processes. John Wiley & Sons.

Sanner, S. 2010. Relational dynamic influence diagram language (RDDL): Language description. http://users.cecs.anu.edu.au/~ssanner/IPPC_2011/RDDL.pdf.

Sanner, S. 2011. ICAPS 2011 international probabilistic planning competition. http://users.cecs.anu.edu.au/~ssanner/IPPC_2011/.

Teichteil-Koenigsbuch, F.; Infantes, G.; and Kuter, U. 2008. RFF: A robust, FF-based MDP planning algorithm for generating policies with low probability of failure. In Sixth International Planning Competition at ICAPS'08.

Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A baseline for probabilistic planning. In ICAPS'07, 352-359.