This one. It looks like they're using GPT3 to translate the natural-language problem context and goal into a format called PDDL (Planning Domain Definition Language), then feeding the result into a separate planner program that generates a plan from that context and goal.
With that in mind, the thing they're really testing here is how well GPT3 can translate the natural-language prompt into PDDL, evaluated on whether the planner can actually solve the problem from the generated PDDL and how long the resulting solution takes.
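To make that evaluation concrete, here's a rough sketch of how I'd score one generated problem. Everything in it is my own assumption rather than anything from their setup: `planner` is a hypothetical callable standing in for whatever external planner they run, and since I'm not sure whether "how long" means planner run time or number of plan steps, the sketch records both.

```python
import time
from typing import Callable, Optional, Sequence

# Hypothetical planner signature: (domain_pddl, problem_pddl) -> list of steps, or None on failure.
Planner = Callable[[str, str], Optional[Sequence[str]]]


def evaluate(planner: Planner, domain_pddl: str, problem_pddl: str) -> dict:
    """Run the planner on generated PDDL and record success, plan length, and wall-clock time."""
    start = time.perf_counter()
    plan = planner(domain_pddl, problem_pddl)
    elapsed = time.perf_counter() - start
    return {
        "solved": plan is not None,
        "plan_length": len(plan) if plan is not None else None,
        "seconds": elapsed,
    }
```

A real check would also want to validate the returned plan against the original task, since a planner can happily solve a mistranslated problem.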
Naturally, I could be wrong, but that's at least what it looks like.
To summarise, they assume a human expert can provide a domain description that specifies every action that can be taken in each situation, along with its effects. Then it looks like they include that domain description in the prompt, along with an example of the kind of planning task they want solved, and get the LLM to generate PDDL in the context of that prompt.
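As I understand it, the whole pipeline could be sketched roughly like this. This is only my reading, not their code: `build_prompt`, `plan_from_text`, the prompt wording, and the `llm` and `planner` callables are all hypothetical placeholders for the model call and the external planner.

```python
from typing import Callable, Optional, Sequence


def build_prompt(domain_pddl: str, example_task: str, example_pddl: str, new_task: str) -> str:
    """Assemble the prompt: expert-written domain, one worked example, then the new task."""
    return "\n".join([
        "Domain description (PDDL):",
        domain_pddl.strip(),
        "",
        "Example task:",
        example_task.strip(),
        "Example problem (PDDL):",
        example_pddl.strip(),
        "",
        "New task:",
        new_task.strip(),
        "New problem (PDDL):",
        "",
    ])


def plan_from_text(
    new_task: str,
    domain_pddl: str,
    example_task: str,
    example_pddl: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, generated PDDL out
    planner: Callable[[str, str], Optional[Sequence[str]]],  # hypothetical: (domain, problem) -> plan
) -> Optional[Sequence[str]]:
    """Translate the task to PDDL with the LLM, then hand the result to a separate planner."""
    prompt = build_prompt(domain_pddl, example_task, example_pddl, new_task)
    problem_pddl = llm(prompt)
    return planner(domain_pddl, problem_pddl)
```

The point of the split is that the LLM only ever does translation; the actual search for a plan happens in the planner.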