Graph Of Thought: The Bitter Lesson(s)

The Prompt Report: A Systematic Survey of Prompting Techniques offers a comprehensive overview of various methods used in prompting large language models (LLMs). One specific approach highlighted is problem decomposition, which enhances LLM performance by breaking down complex tasks into simpler subproblems. Unfortunately, many papers reuse similar terminology, so it is often unclear which specific technique a given name refers to. The blog post Something-of-Thoughts in LLM Prompting: An Overview of Structured LLM Reasoning provides a clear summary of techniques primarily focused on problem decomposition.

The idea behind chain/tree/graph-of-thought is to decompose complex problems into a series of interconnected solution steps, much like human reasoning. Breaking a problem down into a graph of smaller subproblems can enable less powerful but faster and more cost-effective models to solve tasks that were previously too difficult for them. Alternatively, this approach might allow powerful models to tackle even more sophisticated problems.

Since both chains and trees can be modeled as graphs, my team is using a simple internal Graph-of-Thought (GoT) framework written in Python for tasks that LLMs handle well. During development, we wrote tests to verify that the implementation was correct. As a final integration test, we created an LLM agent that played Tic-Tac-Toe according to a simple problem decomposition we designed for this purpose. Surprisingly, Tic-Tac-Toe proved too difficult for most models; in fact, the agent often did not even adhere to the game's rules. However, since the goal was only to verify the framework itself, not to build a strong player, this finding did not matter much at the time.

Nevertheless, I was intrigued by the question of whether a better decomposition of the problem would let me create a strong Tic-Tac-Toe agent even with weaker models.

As this was a personal experiment, I chose to use Clojure to leverage leonoel/missionary, which lets me write highly efficient asynchronous workflows.

Structure of the prompt template

I had the idea of formulating the subproblems without mentioning Tic-Tac-Toe at all. Instead, the prompt is about Alice and Bob owning cells in a table. Alice represents the LLM-agent, Bob the random-agent.

All the prompts are formulated as sections of a markdown document. I used links to reference other sections and Django template syntax for placeholders. The final prompt is then generated by parsing the document via nextjournal/markdown, including the referenced sections, and finally inserting the values for the placeholders via yogthos/Selmer.
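
As a rough illustration of the placeholder step (not the post's actual build code), this is how Selmer could fill a template once the referenced sections have been spliced in. board->params and render-prompt are hypothetical helpers; only selmer.parser/render is the real library call.

(ns got.prompt
  (:require [selmer.parser :as selmer]))

;; Hypothetical helper: map a board, here a map of [row column] -> owner,
;; onto the {{player-<row>-<column>}} placeholders used in the template.
(defn board->params [board]
  (into {}
        (for [row [1 2 3] column [1 2 3]]
          [(keyword (str "player-" row "-" column))
           (get board [row column] "empty")])))

;; Fill the placeholders of an already include-expanded section with Selmer.
(defn render-prompt [template board player]
  (selmer/render template (assoc (board->params board) :player player)))

(render-prompt "| 1 | 1 | {{player-1-1}} | Can {{player}} make a move ...?"
               {[1 1] "Alice"} "Alice")
;; => "| 1 | 1 | Alice | Can Alice make a move ...?"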

After extensive experimentation with less capable versions of ChatGPT and Gemini, I decided to use the following Markdown document as the foundation for my prompts:

Markdown document

This table shows who is the owner of each cell of the table. It can be either Alice, Bob or the cell can be empty:

| Row | Column | Cell Owner     |
|-----|--------|----------------|
| 1   | 1      | {{player-1-1}} |
| 2   | 1      | {{player-2-1}} |
| 3   | 1      | {{player-3-1}} |
| 1   | 2      | {{player-1-2}} |
| 2   | 2      | {{player-2-2}} |
| 3   | 2      | {{player-3-2}} |
| 1   | 3      | {{player-1-3}} |
| 2   | 3      | {{player-2-3}} |
| 3   | 3      | {{player-3-3}} |

Row Win

include: visual descriptor

Can {{player}} make a move to own an empty cell so that three cells of the same row have {{player}} as the Cell Owner?

include: description of expected result

Column Win

include: visual descriptor

Can {{player}} make a move to own an empty cell so that three cells of the same column have {{player}} as the Cell Owner?

include: description of expected result

Diagonal Win 1

include: visual descriptor

include: question

| Row | Column | Cell Owner |
|-----|--------|------------|
| 1   | 1      | {{player}} |
| 2   | 2      | {{player}} |
| 3   | 3      | {{player}} |

?

include: description of expected result

Diagonal Win 2

include: visual descriptor

include: question

| Row | Column | Cell Owner |
|-----|--------|------------|
| 1   | 3      | {{player}} |
| 2   | 2      | {{player}} |
| 3   | 1      | {{player}} |

?

include: description of expected result

Best Strategic Move

include: visual descriptor

Which cell would you recommend {{player}} to capture in order to minimize the number of cells left to capture either a complete row, a complete column or a complete diagonal?

include: description of expected result

Question

Can {{player}} claim an empty cell in such a way that, after making this move, {{player}} owns at least these three cells?

Expected result

Think hard if this condition is fulfilled. If not, return this XML:

<none/>

But if there is a move to an empty cell that fulfills the condition, return the move in this XML format:

<move>
    <row><!--insert row index here--></row>
    <column><!--insert column index here--></column>
</move>

Example:

<move>
    <row>1</row>
    <column>2</column>
</move>

Only return the move as XML and nothing else.
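
The XML contract above is small, and a few lines of Clojure are enough to turn an answer into a move. The following is only an illustrative sketch using clojure.data.xml; the post does not show its own parsing code, and parse-answer is a hypothetical name.

(ns got.answer
  (:require [clojure.data.xml :as xml]))

;; Sketch: parse the model's XML answer into either :none or {:row r :column c}.
(defn parse-answer [answer]
  (let [root (xml/parse-str answer)]
    (if (= :none (:tag root))
      :none
      (let [elements (filter map? (:content root)) ; drop whitespace between tags
            value    (fn [tag]
                       (some #(when (= tag (:tag %))
                                (Long/parseLong (apply str (:content %))))
                             elements))]
        {:row (value :row) :column (value :column)}))))

(parse-answer "<move><row>1</row><column>2</column></move>")
;; => {:row 1, :column 2}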

Graph-of-thought implementation

One of the most natural ways to represent a graph is through code, and this is especially true in Clojure:

(defn play-game
  ([] (play-game (init-context)))
  ([context]
   (print-board context) ; visual feedback
   (if-let [end (end? context)]
     (finish-game end)
     (recur
       (ex/with-ex context ; exception-handling that dispatches on keywords 
         (if (alice-turn? context) ; Alice is the LLM-agent, Bob the agent playing randomly
           (if-let [winning-move (find-winning-move context)]
             (move winning-move context)
             ;; look into the future if we need to defend against a move
             (if-let [bob-winning-move (find-winning-move (switch-player context))]
               (move bob-winning-move context) ; make the move with Alice to prevent Bob from winning
               (move (find-alternative context) context)))
           (make-random-move context)) ; random-agent's turn
         (catch :no-retries-left _e ; because of LLM rate limit, wrong answer schema, timeouts, etc.
           (make-random-move context))
         (catch :illegal-move _e ; because of LLM hallucinations
           (make-random-move context)))))))

The functions find-winning-move and find-alternative encapsulate the handling of numerous failure modes encountered when working with LLMs, such as incorrect schemas, incomplete answers, prompts wrongly flagged as harmful, timeouts, rate limits, and many more. Sometimes a retry works (Have you tried turning it off and on again?), but sometimes we need a non-LLM fallback, in our case make-random-move.
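
As a rough illustration of the retry-then-fallback idea in plain Clojure (without missionary, and not the decorator-based implementation discussed below), a wrapper could look like this; with-retries is a hypothetical name, and the :no-retries-left tag only mirrors the keyword used in play-game above.

;; Sketch of a retry wrapper: call the (side-effecting) function f up to n times
;; and, once all attempts have failed, throw an ex-info tagged :no-retries-left
;; so that the caller can fall back to make-random-move.
(defn with-retries [n f]
  (loop [attempt 1]
    (let [result (try {:ok (f)}
                      (catch Exception e {:error e}))]
      (cond
        (contains? result :ok) (:ok result)
        (< attempt n)          (recur (inc attempt))
        :else                  (throw (ex-info "LLM call kept failing"
                                               {:type :no-retries-left}
                                               (:error result)))))))

;; Usage sketch: (with-retries 3 #(ask-llm prompt))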

In a future post, I will discuss how to use leonoel/missionary and the decorator pattern to maintain high performance and reduce complexity when handling these various failure modes.

The bitter lesson(s)

Among all the models I tested, Gemini 1.5 Flash performed the worst at playing Tic-Tac-Toe out of the box, making it the most interesting candidate to test my GoT agent with. However, my agent was only able to achieve mediocre results: when the GoT agent moved first, it won 79% of the games; when the random agent moved first, the random agent won 37% of the games. This indicates that the GoT agent's play is far from optimal.

Graph of human thought vs. more computation

It seems that The Bitter Lesson struck again. I attempted to embed my human reasoning into the agent, but despite significant programming effort and added complexity, the results were still not great. Instead of using prompt engineering techniques, I could have simply used ChatGPT o1-preview, thrown the laziest prompt at it, and watched it play optimally. Ironically, o1-preview uses some form of Chain-of-Thought approach to improve its performance, but from the viewpoint of a user, this is an implementation detail.

I could have perhaps broken down the problem into even smaller subproblems, but that would have resulted in even more calls to the LLM, which would become uneconomical compared to using a more powerful model.

Prompting as a form of incantation

Another bitter lesson was that the model sometimes performed better when I added a phrase; sometimes it was necessary to remove a phrase. It's a bit like in a fantasy world, where wizards study magic without truly understanding the physics behind it, but there is some lore (prompting techniques) that, when followed, usually leads to the desired outcome (but not always). Developing the GoT agent involved a lot of guesswork to find out which secret spells I had to cast to make the model perform better. My prompting was more incantation than engineering.

Complexity

Recognizing this problem, some companies and tool builders now offer automated prompt optimization solutions. These tools add another layer on top of the inherently indeterminate nature of LLMs in an effort to make them more deterministic. This could be a viable option if the optimization process is transparent to the user, effectively providing the experience of using a more powerful model without additional effort. However, if manual optimization is still required, the approach is probably only worthwhile for use cases that were previously impossible, and hard to justify where effective solutions already exist.

This situation is reminiscent of the initial enthusiasm for serverless computing, where many anticipated a complete shift to serverless architectures by merely connecting lambda functions with existing services. However, the significant complexity introduced was often overlooked, limiting serverless adoption to specific cases rather than becoming the dominant approach. Similarly, while LLMs offer great potential, the complexity of prompt engineering can render some tasks uneconomical. In terms of effectiveness, speed, and cost-efficiency, Tic-Tac-Toe might be one of them.
