On the suboptimality of the Optimal Planning Track at IPC (Comments welcome!)
It is undeniable that the planning competition has been a great influence on the planning community. While most of it has been for the good, there are some mixed blessings..
One case in point is the "Optimality track"--which has been run for the second time this year.
Not only is the way this track is being run profoundly misguided (IMHO), but it also has lead to equally erroneous conclusions (e.g. "optimal planning" = "SAT planning").
The first time it was run, the organizers allowed entrants to choose the optimality metric they want to guarantee (which, if you read "performance metrics 101" is a big no-no!). The obvious effect was that several entrants--especially the bounded-length planners--chose the "easiest for them to ensure" metric--which is the "step-optimality".
Notice that the choice of step-optimality has no compelling justification for classical planning--where the only reasonable metrics are cost-oriented--find the plan with the lowest cumulative action cost ---with "sequential optimality" being the special case when actions have uniform cost. (One usual argument in favor of step-optimality is that it is akin to "make-span" in temporal planning. I don't buy this unless we are actually doing temporal planning benchmarks--which pose a whole slew of additional challenges to bounded length SAT-style planners.)
The second time it was run (which was this year), the organizers let people continue using step optimality as the optimality metric.
The result is that we now have two optimal planning competitions where the performance metric is very hard to justify from the users' point of view, and the winners of the track--the bounded length SAT planners--are arguably only good at step-optimality (that, arguably, no real users would ever care about!).
Other approaches, such as heuristic planners using admissible heuristics, which can actually provide cost-optimal plans, are naturally shut-out, since step-optimal plans can have lengths that are upto a factor of n less than sequential or cost-optimal plans.
This state-of-affairs has been mistakenly interpreted to mean that optimal planning = SAT planning (and Henry Kautz--in his recent AAAI talk--went as far as to suggest that we should all focus on optimal planning, and this means SAT planning is pretty much the way to go..)
Arguably, if the track measures cost-optimality, bounded length planners are not likely to have that much of a performance edge. This is because the performance of bounded length planners such as SAT depends crucially on the length of the encoding over which they need to search. With sequential optimality, they will have to search increasingly larger sized encodings. With cost-optimality, they would have to do branch-and-bound search at each encoding level.
I think it is time optimal track focused on optimality metrics that make sense to the users (rather than to the planners). This might in turn let us all focus on novel techniques that really solve the optimal planning problem (rather than lull us into the false complacency that optimal planning can be handled by current SAT planners).
Rao
7/25
---------
One post script. My beef (or carrot--since I am a vegetarian), is not with bounded length planning (which is a great idea), but rather with the way optimality track has been run. I think it is possible for heuristic search and bounded length planners to play complementary roles. Bounded length planners may still be best at showing that plans below a certain cost do not exist, thus providing useful lower bounds.
One case in point is the "Optimality track"--which has been run for the second time this year.
Not only is the way this track is being run profoundly misguided (IMHO), but it also has lead to equally erroneous conclusions (e.g. "optimal planning" = "SAT planning").
The first time it was run, the organizers allowed entrants to choose the optimality metric they want to guarantee (which, if you read "performance metrics 101" is a big no-no!). The obvious effect was that several entrants--especially the bounded-length planners--chose the "easiest for them to ensure" metric--which is the "step-optimality".
Notice that the choice of step-optimality has no compelling justification for classical planning--where the only reasonable metrics are cost-oriented--find the plan with the lowest cumulative action cost ---with "sequential optimality" being the special case when actions have uniform cost. (One usual argument in favor of step-optimality is that it is akin to "make-span" in temporal planning. I don't buy this unless we are actually doing temporal planning benchmarks--which pose a whole slew of additional challenges to bounded length SAT-style planners.)
The second time it was run (which was this year), the organizers let people continue using step optimality as the optimality metric.
The result is that we now have two optimal planning competitions where the performance metric is very hard to justify from the users' point of view, and the winners of the track--the bounded length SAT planners--are arguably only good at step-optimality (that, arguably, no real users would ever care about!).
Other approaches, such as heuristic planners using admissible heuristics, which can actually provide cost-optimal plans, are naturally shut-out, since step-optimal plans can have lengths that are upto a factor of n less than sequential or cost-optimal plans.
This state-of-affairs has been mistakenly interpreted to mean that optimal planning = SAT planning (and Henry Kautz--in his recent AAAI talk--went as far as to suggest that we should all focus on optimal planning, and this means SAT planning is pretty much the way to go..)
Arguably, if the track measures cost-optimality, bounded length planners are not likely to have that much of a performance edge. This is because the performance of bounded length planners such as SAT depends crucially on the length of the encoding over which they need to search. With sequential optimality, they will have to search increasingly larger sized encodings. With cost-optimality, they would have to do branch-and-bound search at each encoding level.
I think it is time optimal track focused on optimality metrics that make sense to the users (rather than to the planners). This might in turn let us all focus on novel techniques that really solve the optimal planning problem (rather than lull us into the false complacency that optimal planning can be handled by current SAT planners).
Rao
7/25
---------
One post script. My beef (or carrot--since I am a vegetarian), is not with bounded length planning (which is a great idea), but rather with the way optimality track has been run. I think it is possible for heuristic search and bounded length planners to play complementary roles. Bounded length planners may still be best at showing that plans below a certain cost do not exist, thus providing useful lower bounds.
8 Comments:
I couldn't agree more. I would love to a see a pure sequentially optimal planning track at the next competition, so that everyone has to put their cards on the table for once. We don't even know what the state of the art in optimal planning is. This is an appalling situation.
-- Malte
I agree as well.
I believe you hit the nail on the head. The competition gives us an infinite loop based on poor initialization:
(1) planners choose what to optimize (i.e., planners choose the game they are playing)
(2) planner "A" wins optimal planning track (i.e., it is faster at winning its own game than others are at winning their own game)
(3) A's definition of optimality becomes the only definition of optimality
(4) goto 2
Rao has provided a break.
Just to add: If allowing a planner to choose a single metric on problems (competition a) gave us the current issues, another scary idea may be to let organizers to choose the metric (competition b). We should presume that users have some idea of what they want, and let them specify it using plan metric specifications. I think the benefits of plan metrics were properly highlighted in the last planning competition. These all involved numeric information (e.g., costs of preference violation). To me it seems not too much of a stretch to specify different numbers (e.g., (:metric (minimize (steps))); the possibility also exists to weight different metrics (e.g., (:metric (minimize (* 0.2 (steps)) + (* 0.8 (cost))))).
Given that our "user" = benchmark planning problems (i.e. those created by the organizers) this leads us to competition b. Presumably a nice workshop used to elicit varying domains would give us some better competition.
J.
Hi,
I certainly agree with most of what is said here. I somehow fail to see why it is being discussed so hotly. Yes, let's go to another split parallel/sequential optimal in IPC-6. What's the big deal? Henry must have really beat the drum LOUDLY for you all being so excited.
At IPC-4, our rationale was simply to not introduce too many splits to make the data non-understandable. If there is an increasing amount of planners optimizing things other than step-optimal -- which I would welcome full-heartedly -- then of course it makes sense to give them separate tracks. Personally I was delighted to see that Stefan entered Mips.BDD in this competition. Still there is the question of what optimal planners can actually solve. And I daresay that, of all the impracticality we still have in optimal planning, satplan is presumably still the least impractical. What's a planner good for that optimizes, let's say summed up cost but that cannot solve any but the tiniest examples?
So yes, I agree we should make finer distinctions in the future. But, not having heard Henrys talk, I guess I fail to see the need for satplan-bashing.
BTW, just to drive this point home, we'll probably do a version of satplan that optimizes summed up action costs. See you at the next IPC...
cheers,
Joerg
(Will). Naturally I agree. But preaching to a choir isn't fun, so I'm going to try to disagree:
At programming time, if the metric to optimize is known, one can improve performance by careful algorithm selection.
If you were really interested in problems where step-optimality was important, you would want SATPLAN to exist -- some general planner that can handle arbitrary metrics would not perform as well.
So there is a place for such planners; if there are such problems. There are such problems -- temporal problems.
---
Continueing disagreement would be hard after that point.
First of all, if we interpret SATPLAN as hardcoding optimizing makespan, then it can only handle the very restricted case of unit duration actions. There is no reason why any current approach can't be trivially modified to handle non-unit durations and still optimize makespan (it takes only a little bit of imagination to figure out how to fix SATPLAN in this regard) -- so if one is to hold the contention that this is a special enough metric to hardcode, then there should be duration information on the actions in the problem. Because it isn't believable that there are lots of unit-duration temporal planning problems.
Another place for such an approach is when optimality is not, in fact, the only criterion -- SATPLAN can be used as a heuristic planner for different metrics than step-optimality, when computation time is important, but cost is also important.
---
The way to balance such things would be to really give the actual evaluation metric up front -- each second of computation costs so much, each action costs so much and takes so long to execute, achieving the goal gives so much utility (negative cost) ... minimize cost/maximize utility.
Or, if converting down to a single dimension is as hard as one imagines it will be, then demand pareto-optimality -- really, not just theoretically (i.e., SAPA talks about pareto-optimality, but does no more than talk about it [to my knowledge]).
---
There is a sneaky problem: "arbitrary metric" isn't so easy to work with or specify. In particular, the PDDL language for specifying metrics doesn't let one specify that it is desirable to produce a solution quickly, sacrificing plan quality. Another limitation, but more minor, is that the metric has to be a simple combination of values of fluents -- e.g., specifying a robustness metric would be very difficult (the sum of the shortest distances from each state of the plan to a 'bad' state is minimized, as one example).
Arguably, metrics should be functions of all the relevant information about an entire plan, in any of its representations (causal link/proof structure, state trajectory, final state, ...), as well as the time (and memory?) used to synthesize it.
Parsing and evaluating such things, let alone searching in an intelligent fashion for good, or optimal, solutions would be very difficult.
---
So disagreement can resume at this point: if one is to handle flexible metrics, then one really ought to be handling very, very, flexible metrics. That seems like it will be too hard (though LPG would probably do great), and that perhaps one must settle on having "domain-independent" planners, with hardcoded metrics (so, in fact, one must carefully select which metric is a good approximation to your real metric).
But the competition should still present results based on the actual metric for each problem (which could vary by problem), and simply have a "quality" vs. "quantity" track -- how good is each plan, vs. how many plans per hour.
Or just one track, and every planner has to compete on the same evaluation metric (which would vary by problem in terms of the tradeoff between plan cost and computation time). There are still problems though; one can stack the competition in favor of any favorite (undominated) planner by including more or less examples of problems that the favorite planner is good at.
It's me again -- notwithstanding whether or not Henry was serious about all this or whether he just wanted to rock the boat, I feel the need to defend the IPC organizers a little. I fully agree that we should do something about the optimal track next time around, presumably introduce separate tracks for separate optimization criteria, under the condition that enough participants (3 per track?) actually register.
At IPC 2004, we did not even have full support for introducing the optimal track in the first place. Noone would have dreamed of suggesting separate tracks for separate optimization criteria. Recall that the IPC results are already heavily splitted into all sorts of categories. It just didn't seem wise to split an expectedly small group of planners into even smaller groups. Alfonso et al can speak for themselves, but I would suppose that the situation was similar. The organizers were certainly not trying to shut sequential optimal planners out. Perhaps we should thank Henry for stirring us up so much, so that something better will actually happen the next time around ;-) because of the increased awareness of this issue.
I think this is just the way the IPC event develops. So let's just calm down and see to make it better next time.
cheers,
Joerg
Whauu -- it looks that by the time I finished organizing my thoughts, most of them have been already expressed by others :)
I think that the main issue is that we all want to enjoy both having few alternative (optimal planning) tracks, and getting an easy-to-understand picture of state of the art. Needless to say that these two desiderata compete with each other ... In addition, I agree with Joerg that we should also take into account that *somebody* has to make all the hard work on running these competitions ...
Having said that, I think that the picture today is simpler than it may look like. Looking at the current state of the literature, it seems that having two optimal planning tracks at the next competition will make most people happy:
Track 1: makespan minimization
Track 2: sum-of-action costs minimization
(numbering of the tracks is meaningless :)
I am fully agree with Hector that having a flexible metric specification is not really relevant at this point --- we are simply not there yet, so first things first. (My personal opinion goes even farther --- I feel that extending the formalisms is advancing way faster than dealing with them ...)
Finally, I (again :) agree with Joerg --- this is exactly the way how events like IPC should probably evolve.
--Carmel
p.s. just to make a personal committment: I do plan to enter Track 2 next time, if such will take place.
I guess most that can be said on this topic has already been said above, and, like everyone else, I pretty much agree with everything... But I think that the main reasons for the results of the optimal track of at least the latest competition are (1) a lack of competitors and (2) some limitations of PDDL.
In IPC5, competitors were not allowed to choose to optimize whatever measure they wanted: they were supposed to optimize the problem metric, and nothing else. However, IPC5 had, like the two last competitions, a "STRIPS" subset of problems, that is, problems using only the STRIPS fragment of PDDL. In the STRIPS fragment of PDDL, you're very limited in what you can specify as :metric: you can't use fluents and you can't use preferences. (It's even questionable whether you can specify a :metric at all, and still be within the STRIPS fragment.) In the absence of a :metric specification, the default metric (according to the PDDL language specification) is "total-time", that is makespan: in the STRIPS case (where action durations can not be specified) this corresponds to number of time steps. So, for the STRIPS subset of problems, we "choose" this optimization criterion because there is no other choice. (In the STRIPS version of at least one of the domains, openstacks, optimizing the number of actions actually makes sense, in the sense that this is related to the "real" optimization criteron of the modelled problem. This domain was therefore coded in such a way that no parallel actions are possible, so that parallel length equals sequential length. One may also note that in this domain/version, satplan and maxplan were outperformed by sequential optimal planners.)
There were (if memory serves) six optimal planners that participated in IPC5 (satplan, maxplan, one version of IPPLAN, FDP, MIPS-BDD and CPT2 - correct me if I've left anyone out). CPT2 handles temporal problems (makespan minimization only) and MIPS-BDD handles preferences and is able to optimize a metric expression over preferences in the problem. (On problems with weighted preferences, MIPS-BDD is optimal "in the limit": it does a branch-and-bound-type search, finding successively better plans, and when it has exhausted the space of plans, the best plan found is known to be optimal; however, the version that ran in the competition was configured to output the best plan found within the given time/memory bounds, regardless of whether this plan had been proven optimal or not. Because of this, it could not be considered an optimal planner, for the purpose of the competition.) The other four optimal planners participated only in the STRIPS "subtrack".
Finally, I think there's one more point that deserves to be emphasized (even if everyone already knows it): the competition is not a rigorous scientific experiment aimed at determining which is the universally best planner (optimal or otherwise). The "1st place", prizes etc. are a very coarse summary of the competition results (and have mainly entertainment value, I'd say). The complete results (plans and runtimes) are available from the competition websites (back to IPC2, at least) for anyone to examine. If you want to know the state of the art in optimal sequential planning, get the data and compare the results of the optimal sequential planners (there were two in IPC5, and a few (I think two) in IPC4).
/P@trik (sorry for the long rant ;)
Agree as well. However, IMHO, it is still hard to decide which one is computationally easier, although in most cases it is the truth that to minimize action can be more complicated.
Actually, holding such kind of competition, and setting up first/second prize is not a good practise for the purpose of boosting research.
Why not only competition, but no winner, until one day we can find out a reasonable metric.
Post a Comment
<< Home