Debugging Agent Interactions: a Case Study

David Flater
National Institute of Standards and Technology
100 Bureau Drive, Stop 8260
Gaithersburg, MD 20899-8260 USA
2000-05-31

Abstract

The Contract Net protocol is a general-purpose protocol for distributed problem solving. Many modern agent infrastructures facilitate the generation of agents supporting Contract Net. We used one such infrastructure to simulate a Contract Net-based approach to job scheduling and found that some jobs failed to get scheduled even though the resources were available. This paper describes two phases of the subsequent debugging effort. The first phase was enhancing the visualization of the agent community to reveal the causes of failed negotiations. The second phase was formalizing the problem using a Temporal Calculus of Communicating Systems (TCCS) and attempting to find a solution. After exploring a number of solutions that would not generalize, we found that switching from one-phase to two-phase commitment sufficed to fix the problem.

Introduction

For some time, the manufacturing sector has maintained interest in agent-based approaches to supply chain management, planning, scheduling, and control. The difficulty of coordinating the flow of information through the many domains of responsibility within or among manufacturing enterprises often makes autonomous agents seem an attractive way to simplify the problem. However, systems composed of interacting agents are notoriously difficult to test and debug. It is hard enough to achieve sufficient visibility into agents' interactions to be able to determine whether individual agents are behaving as specified. It is harder yet to know where to begin when the collective behavior of a group of apparently sane agents is not as expected.

Many modern agent infrastructures facilitate the generation of agents supporting the general-purpose Contract Net protocol [1] for distributed problem solving. We used one such package, Zeus* [2] version 1.02, in a case study of debugging agent interactions. In the following sections we describe our test scenario, the testability enhancements that helped us understand its behavior, and the
analysis that revealed the true depth of the problem we faced.

* Commercial equipment and materials are identified in order to describe certain procedures. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.

Test scenario

We came upon our test scenario honestly when a learning exercise to build an agent-based simulation of job scheduling went awry. The scenario contains a merged user interface and supervisory agent called the Guardian. To achieve the user's goals, the Guardian must arrange for jobs to be done by a pool of three
workcells, aptly named Good, Cheap, and Fast (see Fig. 1). By refactoring the standard shop floor scheduling problem as procurement of labor as a commodity, instead of centralized choreography of completely subordinated workcells, we open the door to a style of virtual manufacturing wherein the means of production would be rented as needed by transient manufacturing enterprises. But it is not really necessary to motivate the particular scenario, since the encountered problem and the techniques used to analyze it are scenario-independent.

Having identified the Guardian as a Contract Net manager and the workcells as Contract Net contractors, we needed only to set the parameters of the workcells' tasks to be able to generate the entire simulation. Workcell Good takes two time slices and charges 1000 units of the currency of choice to supply the labor for one job; Cheap takes two time slices and charges 500, while Fast takes one time slice and charges 1000. The relative quality of the work is not quantified, so Good is at a competitive disadvantage.

The negotiation in our scenario involves only five different message types: Call For Proposals (CFP), Refuse, Propose, Accept-Proposal, and Reject-Proposal. The Guardian sends a CFP message to each workcell to initiate negotiations. Each workcell responds with either a Propose message, if it is willing and able to do the work at some price, or a Refuse message, if it is unwilling or unable to do the work. The Guardian collects proposals and then chooses one that is cheapest. The "winner" receives an Accept-Proposal message; any other bidders receive Reject-Proposal messages. If no proposals are received, the work does not get done.

Given three workcells, all capable of performing the same jobs, we innocently requested the Guardian to accomplish three jobs at the same time. For no particular reason, we expected one job to go to each workcell; it would have been equivalent for two jobs to go to Fast and one to go to Cheap. To our shock and horror, the simulation routinely gave one job to Fast, one job to Cheap, and one job to no one at all.

Debugging

Zeus includes a set of tools for monitoring and analyzing agent
behavior. At the forefront, the agent society viewer (Fig. 1) provides an animated view of the types of messages passing between agents at runtime. We edited the Zeus source code that colorizes messages so that Reject-Proposal and Refuse messages would be easier to see. Confusingly, we saw two rejections (one to Good, one to Fast) and seven refusals (three from Good, two from Cheap, two from Fast). At this point it was not obvious to us why Good refused all three jobs.

The Zeus agent viewer permits viewing of the incoming and outgoing messages for each agent. We again edited source code to expand the size of the mailbox buffers so that all messages sent during the run could be recalled. Looking at the messages (see Fig. 2), we saw that refusals arrive with various attributes, but with no obvious way to determine the reason for the refusal.

We tried different values for the Guardian's budget and the workcells' costs, to no effect. We then tried different confirm and end times for the requested jobs. The confirm time is the deadline for the awarding of the contract, while the end time is the deadline for completion of the contracted work. The resulting behaviors (see Table 1) showed that there were more potential problems, but were not sufficient to diagnose the problem at hand.

Table 1: Effects of changing confirm time and end time

Confirm time         End time   Behavior
Now+2                Now+5      Good idle, Fast one job, Cheap one job.
                                Rarely: Good idle, Fast two jobs, Cheap one job.
-1 ("Don't care")    Now+5      Would not run
Now+5                Now+5      Would not run
Now+4                Now+5      All workcells idle
Now+3                Now+5      Good idle, Fast one job, Cheap idle
Now+3                Now+6      Indistinguishable from 2/5
Now+2                Now+6      Good idle, Fast two jobs, Cheap one job

Despite its regular success, the last variant (2/6) was unremarkable because it represented a scenario with substantially less schedule pressure.

We tried running the scenario without Fast. On the first try, Good and Cheap were each awarded one of the jobs; the third job was not feasible. On the second try, Cheap was awarded one job while the other two failed.

We changed the Guardian's negotiation strategy from Default-No-Negotiation to GrowthFunction. This increased the amount of negotiation, but then all three jobs failed. When we made the analogous change to the Workcell strategies, the behavior was as it was before, but with more negotiation.

We turned on verbose debugging, but gained no information from it.

We finally examined the source code implementing the various states of negotiation. The coordination engine view (see Fig. 3) shows the state graph of the negotiation, with failures in red (in monochrome copy, dark gray). In most states there are multiple places where some condition will cause them to return a failure. When this occurs, we see the failure but not its cause.

Testability enhancements

The Foundation for Intelligent Physical Agents (FIPA) specifies that reject and refuse messages should contain a reason field in the content. Slightly paraphrased, the specification reads:

    The agent receiving a refuse act is entitled to believe that the (causal) reason for the refusal is represented by the third term of the tuple, which may be the constant true. [3]

We modified Zeus 1.02 to add a reason field to the content and present this information in the coordination engine view. We also attached reason codes to local failures that do not generate messages between agents. Now, the Guardian's view of the job that failed (see Fig. 4) shows that it failed because all three workcells refused it as infeasible. Good and Fast's views of the first job (see Fig. 5) show that they bid on the contract but did not win; the first job was awarded to Cheap. Fast was awarded the second job because both Good and Cheap refused
it, as well as the third job, as being infeasible (see Fig. 6). Fast refused only the third job.

The behavior of the agents is now more transparent, and we can guess what is going on:

1. Job 1 arrives first, is bid on by all three workcells, and is awarded to Cheap. The other two workcells respond to the Guardian's Reject-Proposal messages with Refuse messages, obfuscating the fact that it was the Guardian who called off negotiation. Now it is obvious from the reason embedded in the refusal: "Rejected (Received better proposal)."
2. Job 2 arrives second and is refused by Good and Cheap because they already have Job 1 tentatively on their schedules. Fast bids on it because it theoretically has time to do both Job 1 and Job 2, and is awarded the job.
3. Job 3 arrives last and is immediately refused by all three workcells as not feasible.

Even though we ran it as multiple processes on the same computer, the distributed nature of the simulation caused the ordering of messages to be semi-random. In rare instances, these random perturbations led Fast to bid on jobs 2 and 3 instead of 1 and 2, in which case Fast could be awarded two jobs.

The root of the problem, then, is that tentatively scheduled jobs are sufficient to cause the immediate refusal of other jobs that would require the same time slot, even though the subsequent rejection of the
tentative job would make the refused job feasible. But we cannot simply change the workcells to allow overbooking of tentative jobs, because we would then accept contracts that are not feasible.

Formalization of the problem

We model a simplified version of our scenario using a Temporal Calculus of Communicating Systems (TCCS) [4, 5] (one of several that exist by the same acronym). This model is then fed into the Edinburgh Concurrency Workbench [6] (CWB) version 7.1 for analysis.

We originally modeled the complete scenario with three workcells and Guardian. The resulting model exceeded 100K states before we had even completed it, and could not be analyzed on a workstation having a full gigabyte of memory. The formulas and machine-readable TCCS for the original model are available on request. Here, we present a simplified model where the workcell Good has been removed. The remaining two workcells are sufficient to demonstrate a range of system behaviors. The other simplifications are as follows:

1. We revert to the Default-No-Negotiation and Default-Fixed-Margin strategies for the Guardian and workcells respectively, which reduces the complexity of the interactions but does not change the nature of the "failure" or the underlying "fault."
2. We treat negotiation messages as if they were instantaneous, whereas workcell labor takes time.
3. We pretend that work on accepted proposals begins immediately and must be completed by t = 2.
4. We do not model the unnecessary messages that are sent by workcells in response to reject-proposal messages.
5. We do not model the many alternative modes of failure, such as agents failing to respond.
6. We do not allow for indefinite future scheduling of jobs, but only deal with the scheduling of the three jobs from our scenario.

Let C, F, and N refer to the agents Cheap, Fast, and Guardian respectively.

Actions, for each workcell in {C, F} and each job numbered 1 through 3 (the action symbols are not recoverable from the source text, except for reject):

- Call for proposals to the workcell for the job
- The workcell bids to do the job
- The workcell refuses to do the job
- The Guardian accepts the proposal for the workcell to do the job at the bid price
- The Guardian rejects the proposal for the workcell to do the job at the bid price (reject)

Guardian

Let CFP be the Guardian subprocess trying to obtain labor for a given job.

[Equations not recoverable from the source text: the definitions of N and of CFP, the latter leading into Feedback.]
Each job produces two CFP messages. Each of the two workcells can send either a proposal or a refusal in response to a CFP, giving four distinct cases. These are multiplied by two permutations of the CFP responses. The actual order is unimportant, but failing to accept all permutations leads to deadlocks in the TCCS. Although we have not parameterized the messages with the bid price of the labor, the
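Guardian's preference for cheaper labor is captured in the decision tree below, based on the known properties of the workcells.

[Equation not recoverable from the source text: the definition of Feedback, whose branches are marked with obs actions (including obsnil when no proposal is received) and in which losing proposals are answered with reject.]

The actions of the form obs are included for the technical reason that system traces from the CWB show communications between agents as opaque tau (internal) actions of the system. We require observable actions to be able to interpret the traces that lead to any given state.

Cheap

[Equation not recoverable from the source text: the definition of the Free state of Cheap, from which a bid on job 1, 2, or 3 leads to the corresponding Tentative state.] (Had we a Good workcell, it would be defined analogously.)

The scheduler policy at the root of our problem is codified in Tentative:

[Equation not recoverable from the source text: the definition of Tentative for job 1, which leads to Confirmed on acceptance and back to Free on reject, and which refuses jobs 2 and 3 in the meantime.]

Finally, in Confirmed we simulate a task requiring two time slices. The following is sufficient for our purposes, but does not allow for scheduling of additional future jobs while the task is running.

[Equation not recoverable from the source text: the definition of Confirmed for job 1.]

Analogously for the Tentative and Confirmed states of jobs 2 and 3.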
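The Tentative policy described above can be sketched as a small state machine. This is an illustration only, not the TCCS model or Zeus code: the class `SingleSlotWorkcell`, its method names, and the reply strings are hypothetical, and the sketch assumes a workcell that can hold exactly one job, as Cheap does in the scenario.

```python
from enum import Enum, auto


class State(Enum):
    FREE = auto()
    TENTATIVE = auto()
    CONFIRMED = auto()


class SingleSlotWorkcell:
    """Hypothetical stand-in for a workcell that can hold one job at a time."""

    def __init__(self, name):
        self.name = name
        self.state = State.FREE
        self.job = None

    def on_cfp(self, job):
        # A free workcell bids and books the job tentatively.
        if self.state is State.FREE:
            self.state, self.job = State.TENTATIVE, job
            return "propose"
        # A tentative or confirmed booking causes an immediate refusal,
        # even if the tentative job is rejected a moment later.
        return "refuse"

    def on_accept(self, job):
        self._expect_tentative(job)
        self.state = State.CONFIRMED

    def on_reject(self, job):
        self._expect_tentative(job)
        self.state, self.job = State.FREE, None

    def _expect_tentative(self, job):
        # Block awards or rejections for proposals that were never issued.
        if self.state is not State.TENTATIVE or self.job != job:
            raise ValueError(f"{self.name} has no tentative bid on job {job}")


def demo():
    cheap = SingleSlotWorkcell("Cheap")
    replies = [cheap.on_cfp(1), cheap.on_cfp(2)]  # job 2 arrives while job 1 is tentative
    cheap.on_reject(1)                            # the slot is freed too late for job 2
    replies.append(cheap.on_cfp(3))
    return replies
```

In `demo()`, the refusal of job 2 is final even though the rejection of job 1 frees the slot immediately afterwards, which is the behavior identified as the root of the problem.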
Although it is feasible to reduce the number of processes by merging the three Tentative processes into a single Tentative, and similarly for Confirmed, we keep them separated here to clarify the intended behaviors of the system.

Fast

[Equations not recoverable from the source text: the definitions of the free, tentative (Te), and confirmed states of Fast. Because Fast can hold two jobs, the states are indexed by one or two job numbers, with the remaining cases given as "analogously for" the other index combinations.]

Analysis

To make the analysis terminate, we include an additional process T that halts the system after four time slices. The system can now be composed as
$$\mathit{System} \stackrel{\mathrm{def}}{=} (N \mid C \mid F \mid T) \setminus \{\text{list of unobservable actions elided}\}$$

Loading the model into the CWB, we find that the system has approximately† 2732 states. There are four distinct "deadlock" states, all of which represent intentional haltings of the system at t = 4. Using the CWB function to list all observations of length three, we find 66 of them. Modulo the various combinatorics, we only have three distinct system behaviors:

1. C bids on one job; F bids on the other two. There are 18 ways this can happen.
2. F competes unsuccessfully with C for one job and wins another, while the third job is refused by both workcells. There are 36 ways this can happen.
3. C and F collide as in case 2, but F is rejected before the CFP for the third job arrives, so F bids on the last job and gets it. There are 12 ways this can happen.

† Quoting from the Edinburgh Concurrency Workbench user manual (Version 7.1), dated 1999-07-18: "The number of states of an agent is not as clear a concept as you might think: treat the number as a rough indication of size only."

In practice, the order in which CFP messages arrive at workcells is not uniformly random, and it is highly unlikely for one negotiation to run to completion while the CFP for another languishes en route, so there is a strong bias in favor of case 2. But it is interesting to note that if any sequence were as likely as any other, one job would still fall through more often than not (36 of the 66 observations). This would probably not be true of our original three-workcell scenario, where there are more ways for all three jobs to get done.

Evaluation of prospective solutions

In a free market, both Guardian and workcells would suffer when jobs fall through unnecessarily. Neither side has a motive to leave this problem unfixed.

Workcell-serialized negotiations

One approach that seems completely wholesome and general at first glance is for the workcells to delay responding to CFPs while their schedules are tentatively full. We can model this in the TCCS by failing to accept CFP messages while in the relevant states:

[Equations not recoverable from the source text: the revised Tentative states of Cheap and the revised tentative and confirmed states of Fast, which now respond only to acceptance and reject.]

The resulting system unfortunately contains deadlocks. For example:

[Equation not recoverable from the source text: an example deadlocked system state, in which Cheap is in a Tentative state and Fast is in a state with two tentative jobs.]

In the general case, this problem would be a show-stopper. In our particular scenario, if we presume that the Guardian has prior knowledge of the price differential between Cheap and Fast and is only inquiring to see if their schedules are clear, we could work around it by accepting the proposal from Cheap before Fast even responds:

[Equation not recoverable from the source text: the revised Feedback; other permutations are unchanged.]

The resulting System has (approximately) 1685 states and 18 distinct observations, all of which get all three jobs done.

Even with the stopgap solution, this approach contains an inherent tradeoff in that it delays responses to the later CFPs. Although we have not explicitly modeled the time consumed in negotiations, it is clear that the delay will get worse as additional CFPs pile up. Failure to respond to a CFP is equivalent to a refusal, so in the simple case nothing would be lost, but at some point the backlog of "bad" CFPs would begin to impact the execution of later "good" ones. Moreover, the delays could have unacceptable social consequences in practice, particularly if the Guardian also fails to accept or reject proposals in a timely fashion. Our scenario is too limited to permit analysis of these behaviors.

Guardian-serialized negotiations

In scenarios having only a single Contract Net manager, it would suffice to issue the CFPs one at a time, delaying the next until negotiation on the previous has completed. This is accomplished in our model by letting N be the CFP for job 1, revising the Feedback for jobs 1 and 2 to get rid of the parallel operators, and replacing the eight nontemporal deadlocks remaining in each with the CFP for the next job. The resulting system has a mere 36 states and only one observable behavior: both workcells bid on the first job, which goes to Cheap; the second and third jobs are only bid on by Fast. Of course, this introduces more of the unmodeled negotiation delay that we discussed above.

A more scenario-specific solution is to alter the Guardian to issue the CFPs to Fast only after they have been refused by Cheap:

[Equations not recoverable from the source text: the revised definitions of N, CFP, and Feedback.]
This is merely an escalation of the "pre-selection" of Cheap that we made previously, where proposals from Cheap were accepted before Fast had a chance to respond, but this one requires no modifications to the workcells' behavior. The resulting system has 431 states and 18 distinct observations, all of which award all three jobs without inter-workcell competition.

Insistent Guardian

A simple strategy to implement is to have the Guardian try again if a job fails to attract a proposal the first time around. We can attempt this in our modeled scenario by replacing the nontemporal deadlock following each obsnil in Feedback with CFP, effectively looping back immediately as soon as a CFP fails. Unfortunately, this creates a cycle that is not guaranteed to terminate. The next best thing is to wait one time slice before trying again: (1).CFP

In order to handle these CFPs, we must extend our Confirmed states slightly. We know that Cheap would be incapable of finishing a job by t = 2 if its CFP arrived at t = 1:

[Equations not recoverable from the source text: the extended Confirmed states of Cheap.]

The confirmed states of Fast already permit Fast to accept CFPs at t = 1. It is not necessary to extend the states in which Fast has two confirmed jobs for this scenario; entering such states implies that all three jobs will have been allocated.

The resulting system has 2812 states and 66 distinct observations that break down in the same proportions as in the original model, with case 2 revised as follows:

2. F competes unsuccessfully with C for one job and wins another, while the third job is refused by both workcells. Time passes, and then F accepts the third job on the second try. There are 36 ways this can happen.

In practice, the number of retries would need to be constrained in various ways, as retries of proposals having no chance of success tie up valuable resources. More significantly, this approach, while generally applicable, does not help us in all cases. It is only by virtue of the fact that Fast can begin a job one cycle late and still finish on time that we achieve acceptable results. In scenarios where all workcells must begin work at t = 0 to achieve acceptable results, trying
again later is of no help.

Better upward communication

The workcells could inform the Guardian that two jobs under consideration are mutually exclusive, requiring the same resources at the same time, and force the Guardian to make a choice. In our scenario, this would degenerate to centralized scheduling with many superfluous interactions. Obviously, if there were multiple requestors with conflicting needs, the decision could not be passed upwards in this way.
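As a rough illustration of why tentative bookings lose a job under the one-phase protocol, and of how a confirmation step avoids the loss, the following Python sketch replays the two-workcell scenario. It is not the TCCS model and not Zeus code: the `Workcell` class, the function names, and the `capacity` field are hypothetical. The sketch assumes that all CFPs are answered before any award is made (the common ordering noted in the analysis), that Cheap can finish one job by the deadline and Fast two, and it uses the prices from the test scenario.

```python
from dataclasses import dataclass


@dataclass
class Workcell:
    name: str
    price: int      # price charged per job
    capacity: int   # jobs that can be finished by the deadline (assumed)
    held: int = 0   # slots currently booked, tentatively or firmly

    def has_room(self):
        return self.held < self.capacity


def make_cells():
    return [Workcell("Cheap", 500, 1), Workcell("Fast", 1000, 2)]


def one_phase(cells, jobs):
    """A bid books a slot tentatively; a full schedule refuses at once."""
    bids = {}
    for job in jobs:                      # every CFP is answered first
        bids[job] = []
        for cell in cells:
            if cell.has_room():
                cell.held += 1            # tentative booking made at bid time
                bids[job].append(cell)
    awards = {}
    for job in jobs:                      # awards are made afterwards
        bidders = sorted(bids[job], key=lambda c: c.price)
        awards[job] = bidders[0].name if bidders else None
        for loser in bidders[1:]:
            loser.held -= 1               # rejection frees the slot too late
    return awards


def two_phase(cells, jobs):
    """A bid books nothing; the slot is taken only on confirmation."""
    awards = {}
    for job in jobs:
        awards[job] = None
        # Every workcell bids, since each job is feasible in isolation.
        for cell in sorted(cells, key=lambda c: c.price):
            if cell.has_room():           # the workcell seals the contract
                cell.held += 1
                awards[job] = cell.name
                break
            # Otherwise the workcell backs out and the Guardian sends
            # its acceptance to the next cheapest bidder.
    return awards
```

Under these assumptions, `one_phase(make_cells(), [1, 2, 3])` leaves job 3 unassigned, while `two_phase(make_cells(), [1, 2, 3])` assigns job 1 to Cheap and jobs 2 and 3 to Fast. The sketch fixes one message ordering; the formal analysis covers all of them.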
Two-phase commitment

Two-phase commitment is commonly used with distributed databases to ensure global consistency [7]. We can use a different kind of two-phase commitment in the context of distributed scheduling to help global coherency [8]. Workcells are no longer obliged to accept a contract when they send a proposal, and the Guardian is no longer assured of getting a job done by sending an acceptance. Upon receipt of an acceptance, the workcell must either seal the contract or back out of it. This second "commitment" is the firm one. If one bidder backs out, the Guardian is able to send an acceptance to another bidder.

To analyze our scenario with a two-phase protocol, we add a [...] message type, which is defined by FIPA, and remove the reject message, which becomes redundant. The workcell sends [...] if it is willing to firm up the commitment; otherwise it sends [...].

[Equation not recoverable from the source text: the revised Feedback; other permutation elided.]

Because the two-phase protocol is more complex than the original, separating out states that can theoretically be combined now becomes a burden. The following two states enable Cheap to react appropriately to all feasible negotiations, assuming sane behavior on the part of the Guardian. If the Guardian were to attempt something insane, like accepting a proposal that was never issued, the separated processes would block the action, but the merged process would blithely progress into a confirmed state.

[Equations not recoverable from the source text: the definitions of the Uncommitted and Confirmed states of Cheap, and of the corresponding uncommitted and confirmed states of Fast.]

The resulting system has 2304 states and 72 distinct observations, none of which allow any jobs to fail. We again have three distinct system behaviors:

1. C and F each bid on all three jobs. C is forced to back out of two of them, which are then awarded to F. There are 18 ways this can happen.
2. C and F compete for two jobs while the third is only bid on by F. C backs out of one of the two contested jobs, which is then awarded to F. There are 36 ways this can happen.
3. C and F compete for one job, which is awarded to C, while the other two are only bid on by F. There are 18 ways this can happen.

F never fails to bid on all three jobs because it cannot possibly win the first awarded contract.

Conclusion

Our experiences support the accepted wisdom that obtaining globally coherent behavior from autonomous agents is an ambitious goal. Nevertheless, a simple two-phase commit protocol sufficed in this case. Perhaps the more valuable results are the lessons learned about testing and analysis of agent-based systems. The FIPA specification leaves testability out of scope [9], and the reasons embedded in Reject and Refuse messages are implementation-dependent content. Future standards for agent infrastructures might standardize the communication of reasons to facilitate the development of interoperable testing and debugging tools.

Acknowledgements

We thank Steve Ray for suggesting the two-phase commitment protocol, and the other reviewers for their valuable input.
Figures

Figure 1: Society view with messages in transit
Figure 2: Refusal in inbox
Figure 3: Guardian view of failed job
Figure 4: Guardian view with reasons added
Figure 5: Workcell view of rejection
Figure 6: Workcell view of refusal
References

"By selecting these links, you will be leaving NIST webspace. We have provided these links to other web sites because they may have information that would be of interest to you. No inferences should be drawn on account of other sites being referenced, or not, from this page. There may be other web sites that are more appropriate for your purpose. NIST does not necessarily endorse the views expressed, or concur with the facts presented on these sites. Further, NIST does not endorse any commercial products that may be mentioned on these sites."

[1] Randall Davis and Reid G. Smith, "Negotiation as a Metaphor for Distributed Problem Solving," Artificial Intelligence 20(1), 1983, pp. 63-103.
[2] Zeus home page, http://www.labs.bt.com/projects/agents/zeus/, 2000.
[3] FIPA 97 Specification Version 2.0, Part 2 (Agent Communication Language), section 6.5.16. Available from http://www.fipa.org/ as file fipa8a22.doc.
[4] Faron Moller and Chris Tofts, "A Temporal Calculus of Communicating Systems," in Lecture Notes in Computer Science #458, Springer-Verlag, 1990, pp. 401-415.
[5] Robin Milner, Communication and Concurrency, Prentice Hall, 1989.
[6] Edinburgh Concurrency Workbench home page, http://www.dcs.ed.ac.uk/home/cwb/, 2000.
[7] James Gray, "Notes on Data Base Operating Systems," in Operating Systems: An Advanced Course, Springer-Verlag, 1978. See also any subsequent textbook on distributed databases.
[8] Sarah Wallace, M. K. Senehi, Ed Barkmeyer, Steve Ray, and Evan K. Wallace, "Manufacturing Systems Integration: Control Entity Interface Specification," NISTIR 5272, September 1993. Available at http://www.mel.nist.gov/msidlibrary/summary/9335.html.
[9] FIPA Architectural Overview (99/07/09), section 6113. Available from http://www.fipa.org/ as file fipa9710.doc.