Building Intelligent Agents with Large Language Models¶
The idea of creating autonomous agents powered by Large Language Models (LLMs) is an exciting frontier in AI. These systems are more than just tools for generating polished text, stories, or programs. They represent a leap toward general problem-solving capabilities. Proof-of-concept demonstrations like AutoGPT, GPT-Engineer, and BabyAGI showcase the potential of these systems and inspire new possibilities.
What Makes an LLM-Powered Agent?¶
An LLM-powered agent functions as an autonomous system with the LLM at its core, acting as the brain. However, to handle real-world challenges effectively, it requires several supporting components. These elements work together to make the system adaptable, reflective, and capable of tackling complex tasks.
Planning¶
Planning is at the heart of an intelligent agent. It enables the system to break down problems and continuously improve its approach.
- Subgoal and Decomposition: The agent divides complex tasks into smaller, manageable steps. For instance, if tasked with organizing a conference, the system can create subgoals like finding a venue, arranging speakers, and scheduling logistics.
- Reflection and Refinement: By analyzing its past decisions and outcomes, the agent learns from mistakes. This self-reflection allows it to refine its process and produce better results over time.
Memory¶
Memory enhances the agent's ability to retain and use knowledge for tasks requiring both immediate and long-term context.
- Short-Term Memory: In-context learning allows the agent to adapt dynamically during a session. This is often achieved through carefully designed prompts that guide the agent’s responses.
- Long-Term Memory: By integrating external memory systems, like vector databases, the agent can store and retrieve vast amounts of data. This enables continuity across sessions and makes it possible to tackle long-duration or large-scale projects.
Tool Use¶
No LLM can know everything or perform every task on its own. To address this limitation, agents are equipped to use external tools.
- APIs for Real-Time Data: The agent can fetch live information like current events, stock prices, or weather updates.
- Code Execution: Tasks requiring computation, such as generating reports or running algorithms, can be handled by executing code.
- Access to Specialized Knowledge: The agent can interact with proprietary databases or domain-specific resources, expanding its ability to provide accurate and relevant outputs.
The Power of LLM Agents¶
LLM-powered agents stand out for their adaptability and versatility. They can function with minimal human guidance, making them suitable for a variety of applications. Examples include workflow automation, personalized tutoring systems, and intelligent customer support.
Notable Examples¶
- AutoGPT: Automates complex tasks by iteratively planning and acting.
- GPT-Engineer: Focuses on generating and refining software solutions.
- BabyAGI: Handles dynamic, task-driven processes efficiently.
A Glimpse into the Future¶
The integration of planning, memory, and tool usage transforms LLM-powered agents into general-purpose problem solvers. These systems can seamlessly adapt to different tasks, reflect on their performance, and harness external resources to exceed the limitations of their pre-trained knowledge.
The journey of building intelligent agents is just beginning. With continuous advancements in LLM technology, these systems have the potential to reshape industries and redefine how we approach problem-solving.
Component 1: Planning¶
Task Decomposition: Strategies for Smarter Problem Solving¶
Task decomposition is a foundational concept in AI that enables models to tackle complex problems by breaking them down into smaller, manageable steps. Recent advancements like Chain of Thought (CoT) and Tree of Thoughts (ToT) have refined this process, while approaches like LLM+P integrate external tools to handle long-horizon tasks. Let’s explore these techniques and their applications.
Chain of Thought (CoT)¶
Chain of Thought, introduced by Wei et al. (2022), is a standard prompting technique that enhances model performance by encouraging step-by-step reasoning.
- How It Works: The model is instructed to “think step by step,” transforming large tasks into multiple manageable subtasks. For example, solving a math problem can be broken into logical sequential steps.
- Key Benefits:
- Increases test-time computation by leveraging detailed reasoning.
- Provides interpretability by revealing the model's thought process.
CoT is widely used for tasks requiring logical progression, such as reasoning-based question answering or multi-step computations.
The study demonstrates that such reasoning capabilities naturally emerge in sufficiently large models. For instance, prompting a 540-billion-parameter model with just eight chain-of-thought examples achieved state-of-the-art accuracy on the GSM8K benchmark for math word problems, surpassing even fine-tuned GPT-3 models with verifiers.
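To make the idea concrete, here is a minimal sketch of few-shot CoT prompting. It assumes a generic `complete(prompt)` helper that wraps whatever LLM API you use; the helper name and the single worked exemplar are illustrative, not taken from the paper.

```python
# Minimal Chain-of-Thought prompting sketch.
# `complete` is a placeholder for any text-completion call (an assumption, not a specific API).

FEW_SHOT = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Let's think step by step. Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
"""

def cot_answer(question: str, complete) -> str:
    """Prepend a worked example and ask the model to reason step by step."""
    prompt = f"{FEW_SHOT}\nQ: {question}\nA: Let's think step by step."
    return complete(prompt)
```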
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective (https://arxiv.org/pdf/2305.15408)¶
Direct Answer Generation¶
- Transformers struggle to directly produce answers (e.g., \( x=1, y=1, z=0 \)) for complex problems without showing intermediate steps.
- While sufficiently large Transformers can theoretically solve these tasks, the required size becomes impractically large due to representation efficiency constraints.
Log-Precision Transformers¶
- Real-world Transformers operate under log-precision, where internal neurons store floating-point values with only \( O(\log n) \) bits of precision, \( n \) being the input size.
- This practical limitation prevents Transformers from representing or calculating solutions that require higher precision.
Impossibility Results (Theorems 3.1 and 3.2)¶
- Shallow (bounded-depth), log-precision Transformers cannot solve certain arithmetic and equation-solving tasks unless the complexity classes \( TC^0 \) and \( NC^1 \) collapse, which is widely believed not to happen.
- The tasks have complexity lower-bounded by \( NC^1 \), making them intrinsically harder than what shallow Transformers can handle.
Circuit Complexity Theory¶
- Log-precision Transformers are computationally equivalent to shallow circuits in \( TC^0 \), while the math problems studied belong to \( NC^1 \), a more powerful complexity class.
- This mismatch explains why shallow Transformers fail on these tasks.
Chain-of-Thought (CoT) Solutions¶
- CoT prompting allows Transformers to break problems into intermediate steps, significantly increasing their "effective depth."
- Surprisingly, constant-size Transformers (e.g., fixed depth \( L=4 \) or \( L=5 \)) can generate CoT solutions for both arithmetic and equation-solving tasks.
Transformer Operations for CoT¶
- Key operations, such as conditional COPY (retrieving specific data) and MEAN (averaging values), are implemented through softmax attention.
- Feed-forward networks handle operations like multiplication, lookup tables, and conditional selection.
CoT's Effective Depth¶
- By recursively feeding outputs back as inputs, CoT effectively deepens the computational circuit of Transformers.
- This increased depth allows even small Transformers to handle problems requiring high computational complexity.
Comparison with RNNs¶
- Constant-size RNNs cannot solve the same math tasks using CoT, highlighting the architectural advantages of Transformers in structured reasoning.
Dynamic Programming and CoT in LLMs¶
This section explores how Chain of Thought (CoT) extends the problem-solving capabilities of Large Language Models (LLMs) beyond mathematics to tackle general Dynamic Programming (DP) problems. Dynamic Programming, a powerful decision-making framework, benefits greatly from CoT's ability to handle complex tasks.
Dynamic Programming solves problems by breaking them into smaller, interrelated subproblems and solving them sequentially. Key components of a DP algorithm include:
State Space \( I_n \):
- Represents all decomposed subproblems.
- Example: For Longest Increasing Subsequence (LIS), the states are array indices.
Transition Function \( T \):
- Defines how to solve each subproblem using results from previous subproblems.
- Example: In LIS, update \( dp[i] \) based on \( dp[j] \) for \( j < i \).
Aggregation Function \( A \):
- Combines results from all subproblems to compute the final answer.
- Example: For LIS, take the maximum \( dp[i] \).
Representation of DP in LLMs:
- CoT-enabled LLMs generate intermediate reasoning steps, aligning with DP's sequential structure.
- Example: For LIS, CoT generates states and transitions step-by-step.
Efficiency Assumptions:
- State space, input elements, and outputs are polynomially bounded.
- Transition and aggregation functions can be approximated efficiently by multi-layer perceptrons (MLPs).
Theoretical Guarantees (Theorem 4.7):
- Any DP problem satisfying the assumptions can be solved by a CoT-enabled autoregressive Transformer with:
  - constant depth \( L \),
  - hidden dimension \( d \), and
  - polynomially bounded parameter values.
1. Longest Increasing Subsequence (LIS):¶
- Input: An array of integers.
- State Space: Array indices.
- Transition Function: Update \( dp[i] = \max(dp[i], dp[j] + 1) \) for \( j < i \) and \( array[j] < array[i] \).
- Aggregation Function: \( A = \max(dp[i]) \).
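For reference, the same state/transition/aggregation structure written out as ordinary Python (a plain DP implementation, not a model-generated CoT trace):

```python
def longest_increasing_subsequence(arr):
    """Classic O(n^2) DP: dp[i] = length of the LIS ending at index i."""
    n = len(arr)
    dp = [1] * n                          # state space: one entry per index
    for i in range(n):
        for j in range(i):                # transition: extend any LIS ending before i
            if arr[j] < arr[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp) if dp else 0           # aggregation: best value over all states

assert longest_increasing_subsequence([10, 9, 2, 5, 3, 7, 101, 18]) == 4
```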
2. Edit Distance (ED):¶
- Input: Two strings.
- State Space: A 2D matrix of indices.
- Transition Function: Compute \( dp[i][j] \) based on insertion, deletion, or substitution costs.
- Aggregation Function: The final value at \( dp[n_1][n_2] \).
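And the corresponding edit-distance DP written as plain Python, again just to show the structure the CoT steps mirror:

```python
def edit_distance(s1, s2):
    """dp[i][j] = minimum number of edits to turn s1[:i] into s2[:j]."""
    n1, n2 = len(s1), len(s2)
    dp = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        dp[i][0] = i                              # delete all of s1[:i]
    for j in range(n2 + 1):
        dp[0][j] = j                              # insert all of s2[:j]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n1][n2]                             # aggregation: final cell

assert edit_distance("kitten", "sitting") == 3
```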
Theorem 4.8 states that bounded-depth Transformers without CoT cannot solve certain DP problems, such as Context-Free Grammar (CFG) Membership Testing, due to computational complexity limitations. However, CoT significantly enhances the expressivity of Transformers, enabling them to solve P-complete problems like CFG Membership Testing.
CoT plays a critical role in enhancing the problem-solving abilities of LLMs for DP tasks. By generating intermediate steps, CoT enables Transformers to tackle tasks that are computationally infeasible for shallow, bounded-depth models.
Tree of Thoughts (ToT) Framework: BFS and DFS¶
The "Tree of Thoughts" (ToT) framework enhances Large Language Models (LLMs) by enabling them to explore multiple reasoning paths systematically. This is achieved through search algorithms like Breadth-First Search (BFS) and Depth-First Search (DFS), which allow the model to navigate through a "tree" of possible thoughts or solutions.
Breadth-First Search (BFS) in ToT:¶
BFS explores all possible thoughts at the current depth before moving deeper, ensuring a comprehensive examination of all immediate options.
Steps:¶
- Initialization: Start with the initial input as the root node.
- Expansion: Generate all possible next thoughts from the current level's nodes.
- Evaluation: Assess the quality of each thought using a state evaluator.
- Selection: Choose the top thoughts based on evaluation scores, up to a specified limit.
- Iteration: Repeat the process for the next level until reaching a predefined depth or finding a satisfactory solution.
Depth-First Search (DFS) in ToT:¶
DFS explores one path deeply before backtracking, allowing for an in-depth examination of each potential solution.
Steps:¶
- Initialization: Start with the initial input as the root node.
- Expansion: Generate the next possible thought from the current node.
- Evaluation: Assess the thought's quality.
- Pruning: If the thought's evaluation exceeds a certain threshold, continue down this path; otherwise, backtrack.
- Iteration: Continue this process, exploring each path deeply and backtracking as necessary, until a solution is found or all paths are exhausted.
Comparison:¶
BFS: Provides a broad exploration, ensuring all potential solutions at each level are considered. This is beneficial for finding the shortest path or when all options need equal consideration.
DFS: Allows for deep exploration of each path, which is useful when a solution is likely to be found deep within a particular branch or when memory resources are limited.
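The two strategies can be sketched as a small skeleton. This is an illustration of the control flow, not the reference ToT implementation: `propose_thoughts(state, k)` and `score(state)` are assumed placeholders for LLM-driven thought generation and state evaluation.

```python
import heapq

def tot_bfs(root, propose_thoughts, score, breadth=5, depth=3):
    """Breadth-first ToT: keep only the top-`breadth` states at each level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for s in frontier for t in propose_thoughts(s, k=breadth)]
        frontier = heapq.nlargest(breadth, candidates, key=score)  # prune to best thoughts
    return max(frontier, key=score)

def tot_dfs(state, propose_thoughts, score, threshold, depth=3):
    """Depth-first ToT: follow promising thoughts, backtrack when a branch scores too low."""
    if depth == 0:
        return state
    for thought in propose_thoughts(state, k=3):
        if score(thought) >= threshold:              # prune weak branches
            result = tot_dfs(thought, propose_thoughts, score, threshold, depth - 1)
            if result is not None:
                return result
    return None                                      # exhausted: caller backtracks
```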
Tree of Thoughts (ToT)¶
Building on CoT, Tree of Thoughts was introduced by Yao et al. (2023). It extends the idea by exploring multiple reasoning possibilities at each step, creating a tree structure.
- How It Works:
- The problem is decomposed into multiple steps.
- For each step, the model generates multiple "thoughts" or solutions.
- The results form a tree structure, allowing systematic exploration.
- Search Techniques:
- Breadth-First Search (BFS): Explores all thoughts at one level before moving to the next.
- Depth-First Search (DFS): Dives deep into one possibility before backtracking.
Each state in the tree is evaluated using methods like classifier prompts or majority votes to choose the best path. This approach is ideal for strategic decision-making and complex planning tasks.
Task Decomposition Methods¶
Task decomposition can be achieved using different strategies, depending on the task's nature and requirements:
1. LLM-Driven Prompting¶
- Examples of prompts:
- "Steps for XYZ:\n1."
- "What are the subgoals for achieving XYZ?"
- The LLM generates sequential steps or subgoals directly from the prompt.
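A minimal sketch of LLM-driven decomposition, using the prompt pattern above. The `complete` callable is an assumed text-completion helper, not a specific API.

```python
# Sketch: ask the model for numbered subgoals and parse them into a list.
def decompose(task: str, complete) -> list[str]:
    prompt = f"What are the subgoals for achieving the following task?\nTask: {task}\nSubgoals:\n1."
    text = "1." + complete(prompt)
    # Parse the numbered list the model returns into individual subgoals.
    return [line.split(".", 1)[1].strip()
            for line in text.splitlines()
            if line.strip() and line.strip()[0].isdigit()]
```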
2. Task-Specific Instructions¶
- Examples:
- "Write a story outline." for creative writing.
- This approach guides the model with specific instructions to structure its output.
3. Human Inputs¶
- Human users manually provide the decomposition, leveraging their expertise to assist the model.
LLM+P: Integrating Classical Planning with LLMs¶
A distinct approach, LLM+P, was introduced by Liu et al. (2023) and combines LLMs with external classical planners to handle long-horizon tasks. This method leverages the Planning Domain Definition Language (PDDL) as an interface.
How It Works:
- The LLM translates the problem into a “Problem PDDL.”
- An external classical planner generates a plan based on the “Domain PDDL.”
- The LLM translates the plan back into natural language.
Advantages:
- Ensures precise and domain-specific planning.
- Effective for fields like robotics, where classical planning tools are prevalent.
Limitations:
- Requires domain-specific PDDL definitions and compatible planners.
- Less practical for domains lacking formal planning tools.
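The overall LLM+P loop can be sketched as below. The helper names (`llm`, a text-completion callable, and the placeholder planner command) are illustrative assumptions; any PDDL-compatible classical planner would fit, and the exact command line depends on the planner you install.

```python
import subprocess

def llm_plus_p(task_description: str, domain_pddl_path: str, llm) -> str:
    """Sketch of the LLM+P pipeline: LLM -> Problem PDDL -> classical planner -> LLM."""
    # 1. LLM translates the natural-language task into a Problem PDDL.
    problem_pddl = llm(
        "Translate this task into a PDDL problem file compatible with the given domain:\n"
        + task_description
    )
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)

    # 2. An external classical planner solves it. The command below is a placeholder;
    #    substitute the invocation of whatever planner you use (e.g. Fast Downward).
    result = subprocess.run(
        ["my-pddl-planner", domain_pddl_path, "problem.pddl"],
        capture_output=True, text=True, check=True,
    )
    plan = result.stdout

    # 3. LLM translates the symbolic plan back into natural language.
    return llm("Explain this plan in plain English, step by step:\n" + plan)
```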
Planning Domain Definition Language (PDDL)¶
The Planning Domain Definition Language (PDDL) is used to describe the components of a planning problem, including the objects, actions, and the goal. It provides a formal way to represent the problem, which can then be solved by a classical planner. Let’s break down how a typical PDDL file for a simple problem, such as cleaning a room, might look.
Example Problem: Cleaning a Room¶
Goal: The robot needs to clean the living room and kitchen.
PDDL Structure¶
A PDDL file is typically divided into two sections:
- Domain Definition (describes the types of actions that can be performed).
- Problem Definition (describes the specific problem, including the initial state and goal).
1. Domain Definition:¶
In the domain definition, we specify the actions the robot can perform, such as moving between rooms, cleaning a room, or picking up objects.
(define (domain cleaning)
  (:requirements :strips :typing) ; Specify requirements (e.g., STRIPS and typing)
  (:types room)                   ; Define types of objects (e.g., room)
  (:predicates
    (clean ?r - room)             ; Predicate: Room is cleaned
    (dirty ?r - room)             ; Predicate: Room is dirty
    (at ?r - room)                ; Predicate: Robot is at room
  )
  ; Action: Move from one room to another
  (:action move
    :parameters (?from - room ?to - room)
    :precondition (and (at ?from) (not (at ?to)))
    :effect (and (not (at ?from)) (at ?to))
  )
  ; Action: Clean a room
  (:action clean
    :parameters (?r - room)
    :precondition (and (at ?r) (dirty ?r))
    :effect (and (not (dirty ?r)) (clean ?r))
  )
)
Explanation of Domain:¶
Predicates:
- `(clean ?r - room)` means room `?r` is clean.
- `(dirty ?r - room)` means room `?r` is dirty.
- `(at ?r - room)` means the robot is at room `?r`.

Actions:
- `move`: Moves the robot from one room to another. It takes the starting and destination rooms as parameters, with a precondition that the robot must be in the starting room and not in the destination room. The effect is that the robot is no longer at the starting room and is now at the destination room.
- `clean`: Allows the robot to clean a room. It takes the room as a parameter, with a precondition that the robot is in the room and the room is dirty. The effect is that the room becomes clean and is no longer dirty.
2. Problem Definition:¶
The problem definition includes specific details like the initial state (e.g., which rooms are dirty, where the robot is) and the goal (e.g., which rooms need to be cleaned).
(define (problem clean-house)
  (:domain cleaning)
  (:objects
    living-room kitchen - room ; Define rooms
  )
  (:init
    (at living-room)           ; Robot starts in the living room
    (dirty living-room)        ; Living room is dirty
    (dirty kitchen)            ; Kitchen is dirty
  )
  (:goal
    (and
      (clean living-room)      ; The living room must be clean
      (clean kitchen)          ; The kitchen must be clean
    )
  )
)
Explanation of Problem:¶
Objects:
- We define the objects in the problem, which are the rooms (`living-room`, `kitchen`).

Initial State:
- The robot starts in the living room (`(at living-room)`).
- Both the living room and kitchen are initially dirty.

Goal:
- The goal is to have both the living room and kitchen clean.
Execution Flow (Using the PDDL):¶
- The planner receives the domain and problem definitions.
- It starts in the initial state where the robot is in the living room, and both rooms are dirty.
- The planner applies actions, like `move` or `clean`, according to the preconditions and effects until the goal is achieved.
- Once the rooms are clean, the planner has found a valid plan.
Sample Plan Output:¶
A solution plan generated by a classical planner could look like this:
; Plan
1. move living-room kitchen
2. clean kitchen
3. move kitchen living-room
4. clean living-room
In this plan, the robot first moves to the kitchen, cleans it, then moves back to the living room and cleans it. This series of steps satisfies the goal of cleaning both rooms.
In the context of LLM+P, natural language descriptions (like "clean the living room and kitchen") are first converted into PDDL format, processed by a classical planner, and then the results are converted back into human-readable text. This combination of LLMs and classical planning allows for the efficient generation of complex, optimal plans.
Self-Reflection¶
Self-reflection is a vital capability that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable. The ReAct (Reasoning + Acting) framework combines reasoning with action: agents iterate between thinking (reasoning) and acting (performing tasks) based on environmental feedback. Here's a step-by-step explanation of how you can code this framework, using Python with OpenAI's GPT models for reasoning and a task-oriented environment (e.g., web scraping, robotic environments, or text-based interactions) for actions.
Steps to Code the ReAct Framework:¶
Set up your environment: You need a task environment where the agent will perform its actions, such as querying APIs, interacting with the web, or interacting with a simulated world.
Integrate Reasoning (via Language Models): The agent uses reasoning models like OpenAI’s GPT-3/4 to decide on the next action based on the observations.
Iterate Between Reasoning and Acting: After reasoning through the problem, the agent performs an action (e.g., API query, system command), gets the observation (feedback), and uses that information to update its reasoning.
Example Code Implementation¶
Let’s consider a simple example where the agent answers a question by searching a knowledge base (like Wikipedia) using OpenAI’s GPT model for reasoning. The agent will iterate between reasoning and acting to find the correct answer.
import openai
import wikipedia

# Set up OpenAI API key
openai.api_key = "your-openai-api-key"

# Function to create a reasoning prompt for the agent
def generate_reasoning_prompt(question):
    prompt = f"""
    Task: I need to answer the question based on available knowledge.
    Question: {question}
    Steps:
    1. I will think about the topic first.
    2. I will decide where to find information (e.g., searching Wikipedia or web).
    3. I will gather relevant details.
    4. Based on what I find, I will update my reasoning and actions.
    """
    return prompt

# Function to simulate reasoning via OpenAI's GPT model
def reason_with_gpt(prompt):
    completion = openai.Completion.create(
        engine="text-davinci-003",  # Use GPT-3 or any available GPT model
        prompt=prompt,
        max_tokens=150
    )
    return completion.choices[0].text.strip()

# Function to search Wikipedia (acting phase)
def search_wikipedia(query):
    try:
        # Using the Wikipedia API to get a summary
        summary = wikipedia.summary(query, sentences=2)
        return summary
    except wikipedia.exceptions.DisambiguationError as e:
        # Handling ambiguity in the search query
        return f"Found multiple results: {e.options}"

# Main ReAct loop
def react_framework(question):
    # Step 1: Reasoning about how to approach the task
    reasoning_prompt = generate_reasoning_prompt(question)
    print("Reasoning Prompt:")
    print(reasoning_prompt)

    # Step 2: Use GPT to generate reasoning for action
    reasoning_output = reason_with_gpt(reasoning_prompt)
    print("Reasoning Output:")
    print(reasoning_output)

    # Step 3: Perform the action based on reasoning (search for info)
    print(f"Searching Wikipedia for: {question}")
    wiki_summary = search_wikipedia(question)
    print("Wikipedia Summary:")
    print(wiki_summary)

    # Step 4: Update reasoning based on observation and continue
    update_reasoning_prompt = f"""
    Based on the Wikipedia summary: "{wiki_summary}",
    Now I will refine my reasoning and answer the question.
    """
    updated_reasoning = reason_with_gpt(update_reasoning_prompt)
    print("Updated Reasoning:")
    print(updated_reasoning)

    return wiki_summary, updated_reasoning

# Run the ReAct framework for a sample question
question = "What is quantum computing?"
summary, updated_reasoning = react_framework(question)
Reasoning Prompt:
Task: I need to answer the question based on available knowledge.
Question: What is quantum computing?
Steps:
1. I will think about the topic first.
2. I will decide where to find information (e.g., searching Wikipedia or web).
3. I will gather relevant details.
4. Based on what I find, I will update my reasoning and actions.
Reasoning Output:
Quantum computing is a complex field that deals with quantum mechanics. I'll first search Wikipedia for a quick overview.
Searching Wikipedia for: What is quantum computing?
Wikipedia Summary:
Quantum computing is an area of computer science and physics that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform computation.
Updated Reasoning:
Based on the Wikipedia summary: "Quantum computing is an area of computer science and physics that uses quantum-mechanical phenomena...", I will now look for specific research papers or advancements in quantum computing.
Reasoning (via GPT):¶
- The `generate_reasoning_prompt` function generates a prompt for the reasoning process. The prompt helps the agent plan how it will gather information.
- The `reason_with_gpt` function sends the reasoning prompt to the GPT model (e.g., GPT-3) to simulate the agent's reasoning. GPT produces a response about how to proceed based on the task.
Action (Search):¶
- The `search_wikipedia` function acts as the agent's ability to interact with an external knowledge source (in this case, Wikipedia). The agent searches for the query term and retrieves a brief summary.
Update Reasoning:¶
- After performing an action (in this case, retrieving a Wikipedia summary), the agent updates its reasoning and continues refining the task.
Iteration:¶
- The agent alternates between reasoning and acting. It starts with reasoning, performs an action (like searching Wikipedia), observes the result, and then refines its reasoning before acting again.
Key Concepts:¶
Iterative Reasoning and Action:¶
- The agent first reasons about how to approach the problem (e.g., whether to search Wikipedia, use a database, or ask for more detailed queries). Then it acts based on that reasoning. After observing the results, the reasoning is updated, and the process continues iteratively.
Action-Based Environments:¶
- While this example uses Wikipedia for gathering information, in more advanced setups, the actions could involve interacting with APIs, databases, physical robots, or web pages.
Dynamic Problem Solving:¶
- ReAct allows the agent to adapt and refine its approach based on the observations it gathers, making it highly adaptable to complex real-world problems.
Enhancing This Framework:¶
Multi-Step Tasks:¶
- For more complex tasks, you can expand the reasoning step to involve multiple actions, like querying different databases or interacting with more complex APIs.
Error Handling and Feedback:¶
- You can include error handling for when an action doesn’t give the expected result, such as retrying searches or taking different actions based on the outcome.
Action-Reasoning Hybrid:¶
- You could integrate reasoning and actions with other models or systems, such as reinforcement learning, to optimize actions and decisions based on rewards or goals.
By using the ReAct framework, you can create agents capable of dynamically adjusting to their environments, making them effective at solving complex, multi-step tasks.
Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al. 2023).
Interpretation of the ReAct Framework Example¶
The image shows two examples of the ReAct (Reasoning + Acting) framework applied to different environments: HotpotQA and AlfWorld Env.
1. HotpotQA Example (Question Answering Task):¶
Act 1: Search Apple Remote: The agent reasons that it needs to search for the Apple Remote and find out which program it is designed to interact with.
- Observation (Obs 1): The agent finds initial information about the Apple Remote and its use for controlling the "Front Row" media center program.
Act 2: Search for Front Row: The agent continues by searching for "Front Row" to identify what devices can interact with it.
- Observation (Obs 2): The search returns unrelated results (e.g., 'Front Row Seat to Earth'), leading the agent to refine its query.
Act 3: Search for Front Row (software): The agent refines its search to "Front Row software" to find relevant information.
- Observation (Obs 3): It finds that "Front Row" is discontinued software, which doesn’t fully answer the agent’s question.
Act 4: Finish: The agent concludes that the Apple Remote is controlled by either the "Front Row software" or the "keyboard function keys."
- Observation (Obs 4): The agent finalizes its answer by deciding that the Apple Remote is controlled via keyboard function keys.
This example demonstrates how the agent refines its search query and reasoning, leading to a conclusion based on the observed data.
2. AlfWorld Env Example (Physical Task Interaction):¶
Act 1: Think about where to find the pepper shaker: The agent begins by reasoning that the pepper shaker is likely to be found in one of several locations (e.g., cabinets or countertops).
- Observation (Obs 1): The agent finds a vase instead of the pepper shaker on the cabinet.
Act 2: Search Cabinet 1: The agent then checks Cabinet 1.
- Observation (Obs 2): It finds another vase.
Act 6: Go to Countertop 3: The agent moves to Countertop 3.
- Observation (Obs 6): The agent discovers a pepper shaker on the countertop.
Act 7: Take the Pepper Shaker: The agent picks up the pepper shaker from the countertop.
- Observation (Obs 7): The agent successfully collects the pepper shaker.
Act 8: Think about where to store the pepper shaker: The agent reasons that it should store the pepper shaker in Drawer 1.
Act 9: Go to Drawer 1: The agent heads to Drawer 1.
- Observation (Obs 9): Drawer 1 is closed.
Act 10: Open Drawer 1: The agent opens the drawer.
- Observation (Obs 10): The drawer opens successfully.
Act 11: Put the Pepper Shaker in Drawer 1: The agent places the pepper shaker in the drawer.
- Observation (Obs 11): The pepper shaker is successfully stored in the drawer.
This example shows how the agent iterates through physical actions and reasoning steps, leading to a successful outcome.
Summary of ReAct Framework:¶
- Reasoning + Acting: The ReAct framework combines reasoning and acting in a dynamic, iterative process. In both examples, the agent refines its actions based on the feedback it receives and adapts accordingly.
- Iterative Process: The agent’s reasoning and actions are adjusted through iterations. In HotpotQA, it refines its search for the correct information, and in AlfWorld Env, it adapts its physical actions to achieve the task's goal.
- Practical Application: The framework is highly adaptable, whether for information retrieval tasks like HotpotQA or physical world tasks like AlfWorld Env, showcasing its versatility in various environments.
Explanation of Reflexion Framework (Shinn & Labash, 2023)¶
The Reflexion framework is a method designed to improve the reasoning capabilities of autonomous agents by introducing self-reflection and dynamic memory. By incorporating these elements, Reflexion enables agents to learn from their past mistakes and adapt their reasoning strategies over time. This is especially useful in complex, real-world tasks where agents need to continuously adjust their approach based on feedback and observed failures.
Key Features of Reflexion¶
Action Space with Augmented Reasoning:¶
- Reflexion builds on the ReAct framework by using an augmented action space that combines task-specific actions with language. This allows the agent to reason through the task, explain its thought process, and consider potential actions before executing them.
- In addition to standard task actions (like querying a database or moving an object), the agent can also perform "reasoning" actions, such as generating explanations or asking questions, enhancing its ability to tackle complex scenarios.
Self-Reflection:¶
- Self-reflection is the core feature of the Reflexion framework. It allows the agent to assess its previous actions and evaluate whether they were successful or not.
- Reflexion uses two-shot examples in which each example consists of:
- A failed trajectory: This is a sequence of actions that leads to an unsuccessful outcome. The agent reviews these to understand why the actions did not work.
- An ideal reflection: This is a corrective guide, showing how the task could be handled differently in the future to avoid repeating the failure.
- These reflections help the agent improve its future decision-making by providing a model of successful behavior in similar contexts.
Heuristic Function:¶
- Reflexion uses a heuristic function to evaluate whether an action trajectory (a sequence of actions) is productive. If a trajectory is deemed inefficient or repetitive, the agent will reconsider its approach.
- Inefficient planning: If the agent’s actions are taking too long without making any progress or reaching the goal, it flags the trajectory as inefficient.
- Hallucination: If the agent repeats the same action multiple times with the same result, it is considered a "hallucination" — an action that doesn’t lead to any new information or progress. Reflexion helps the agent recognize when it is trapped in a loop and guides it to stop and reconsider.
Working Memory:¶
- Reflexion also introduces a working memory that stores up to three reflections. These stored reflections act as useful context when the agent needs to make decisions. For example, if the agent encounters a similar problem again, it can reference previous reflections to avoid repeating the same mistakes.
- By maintaining a small but meaningful set of past experiences (reflections), Reflexion enables the agent to continuously improve its decision-making over time, enhancing its reasoning abilities.
How Reflexion Works¶
Self-Reflection in Action:¶
- After each action, the agent assesses its success or failure. It uses a heuristic function to check whether the trajectory (sequence of actions) was efficient and effective. If the trajectory is deemed inefficient or filled with hallucinations, the agent stops and decides whether to reset the environment and start a new trial.
Learning from Past Mistakes:¶
- The agent uses self-reflection to learn from its mistakes. By storing reflections of failed actions and ideal corrections, the agent can improve its reasoning and decision-making over time. The working memory helps the agent maintain context, enabling it to handle similar situations more effectively in the future.
Continuous Improvement:¶
- Reflexion is not a one-time process; it involves iterative improvement. The agent continually refines its reasoning and actions, creating a loop of learning from past failures and improving future decision-making. This is particularly helpful for tasks that require long-term planning and continuous adjustments.
Explanation of Reflexion Reinforcement Algorithm (Shinn & Labash, 2023)¶
The algorithm for Reinforcement via Self-Reflection is designed to improve the reasoning capabilities of an agent by allowing it to learn from past experiences and improve decision-making through self-reflection. Below is a detailed explanation of how the algorithm works.
Algorithm Overview:¶
Initialization:
- The Actor, Evaluator, and Self-Reflection components are initialized with their respective memories:
  - \( M_a \): Memory for the Actor (short-term memory).
  - \( M_e \): Memory for the Evaluator (internal feedback).
  - \( M_{sr} \): Memory for Self-Reflection (reflective memory).
- The agent's policy \( \pi_\theta \) is initialized with parameters that incorporate the memories of both the Actor and Evaluator. This policy dictates how the agent selects actions based on its current state.
Generate Initial Trajectory:
- The agent generates an initial trajectory using its policy \( \pi_\theta \). This trajectory consists of a sequence of actions \( a_0, a_1, \dots, a_i \) and their corresponding observations and rewards \( o_0, o_1, \dots, o_i \).
Evaluate the Initial Trajectory:
- The Evaluator \( M_e \) evaluates the initial trajectory \( \tau_0 \) to check whether it was successful and efficient. The evaluation helps the agent determine whether its actions were aligned with the goal.
Generate Initial Self-Reflection:
- After the evaluation, the agent generates its first self-reflection \( sr_0 \) based on the initial trajectory and its evaluation. This reflection helps the agent understand why it succeeded or failed and provides a guide for improving future actions.
Set Initial Memory:
- The agent sets its memory `mem` to the first self-reflection \( sr_0 \). This memory will be used to guide future decisions by helping the agent learn from past experiences.
Main Loop:
- The algorithm enters a while loop, where the agent continues to generate trajectories and self-reflections until a stopping condition is met: the Evaluator \( M_e \) must pass, or the agent must reach the maximum number of trials.
  - Generate New Trajectory \( \tau_t \): In each iteration, the agent generates a new trajectory based on the current policy \( \pi_\theta \).
  - Evaluate New Trajectory \( \tau_t \): The Evaluator checks whether the new trajectory is efficient and successful.
  - Generate Self-Reflection: After evaluating the new trajectory, the agent generates a new self-reflection \( sr_t \) that guides future actions.
  - The self-reflection \( sr_t \) is appended to the memory `mem` to provide context for the agent's decision-making in future iterations.
  - The trial counter \( t \) is incremented to track the number of trials completed.
Termination:
- The loop continues until either the evaluator passes (indicating the task has been completed successfully) or the maximum number of trials is reached.
- Once the while loop ends, the algorithm terminates.
In summary:
- The Actor generates a trajectory based on its current policy.
- Evaluator checks if the trajectory is successful and efficient.
- Self-Reflection allows the agent to learn from past actions by providing feedback on what worked and what didn’t.
- Memory stores reflections to help guide future decisions.
- The agent iterates through this process, learning from past experiences and continuously improving its reasoning and decision-making.
The algorithm enables the agent to refine its decision-making process through self-reflection and feedback, creating a loop of continuous improvement. It’s particularly useful for tasks where agents must adapt and learn from failures over time.
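Stripped of detail, the loop above can be expressed as a short sketch. The `actor`, `evaluator`, and `reflect` callables and the `env` object stand in for the LLM-backed components described in the paper; this illustrates the control flow under those assumptions, not the authors' implementation.

```python
def reflexion_loop(actor, evaluator, reflect, env, max_trials=5, max_memory=3):
    """Reinforcement via self-reflection: trial -> evaluate -> reflect -> retry."""
    memory = []                                        # reflective memory (a few past reflections)
    trajectory = actor(env, memory)                    # initial trajectory from the current policy
    for trial in range(max_trials):
        if evaluator(trajectory):                      # internal feedback on the trajectory
            return trajectory                          # task solved: stop
        reflection = reflect(trajectory)               # verbal self-reflection on what went wrong
        memory = (memory + [reflection])[-max_memory:] # keep only the most recent reflections
        env.reset()
        trajectory = actor(env, memory)                # retry, conditioned on the reflections
    return trajectory
```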
Creating Self-Reflection in LLM (Large Language Models)¶
Two-shot Examples:¶
- Two-shot examples refer to providing the agent with a pair of examples to guide its self-reflection process. These examples are designed to help the agent understand how to improve its decision-making by comparing:
- A failed trajectory: A sequence of actions that didn’t lead to a desired outcome.
- An ideal reflection: A guide to correct its actions in the future.
Failed Trajectory:¶
- This is a sequence of actions that the agent took, but they resulted in an unsuccessful outcome or an inefficient path. The agent reflects on why these actions were ineffective or why they failed to achieve the desired result.
Ideal Reflection:¶
- This is a corrective guide showing how the task could have been handled differently to avoid failure. The ideal reflection provides insights into better actions the agent could have taken, helping it learn from its mistakes.
Process of Reflection:¶
- The agent is shown two-shot examples that consist of the following:
- Example 1: A failed trajectory and an ideal reflection for how to improve that trajectory.
- Example 2: Another failed trajectory and its corresponding ideal reflection.
These examples help the agent compare its own failures with the ideal ways to handle those situations.
Adding Reflections to Working Memory:¶
After the agent processes the two-shot examples, it generates reflections based on its own experiences and stores them in its working memory.
The agent can store up to three reflections at a time in its working memory. These reflections provide context for future decisions, allowing the agent to apply the lessons learned from past experiences when making new decisions.
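A minimal sketch of how such a reflection prompt and bounded working memory might be assembled. The exemplar text, class, and helper names are placeholders for illustration, not taken from the paper.

```python
from collections import deque

# Two-shot exemplars: each pairs a failed trajectory with an ideal reflection (placeholder text).
TWO_SHOT_EXAMPLES = """Failed trajectory: <example trajectory 1>
Reflection: <how the task should have been handled>

Failed trajectory: <example trajectory 2>
Reflection: <how the task should have been handled>
"""

class ReflectionMemory:
    """Working memory that keeps at most three reflections."""
    def __init__(self, capacity=3):
        self.buffer = deque(maxlen=capacity)

    def add(self, reflection: str):
        self.buffer.append(reflection)

    def as_context(self) -> str:
        return "\n".join(self.buffer)

def build_reflection_prompt(failed_trajectory: str) -> str:
    # The agent reflects on its own failure, guided by the two-shot exemplars.
    return f"{TWO_SHOT_EXAMPLES}\nFailed trajectory: {failed_trajectory}\nReflection:"
```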
Explanation of Chain of Hindsight (CoH) and Algorithm Distillation (AD)¶
Chain of Hindsight (CoH) and Algorithm Distillation (AD) are techniques that improve a model by presenting it with a history of past outputs or actions annotated with feedback, so that it learns to build on prior attempts. Both methods rely on the same idea of progressively improving output based on past experience, but they apply it in different contexts: language models (CoH) and reinforcement learning (AD).
Chain of Hindsight (CoH; Liu et al. 2023)¶
Purpose of CoH: The goal of Chain of Hindsight (CoH) is to improve the model's ability to self-correct by presenting a history of past outputs, each annotated with human feedback, and teaching the model to use this feedback to progressively improve its outputs.
Key Steps and Components:¶
Human Feedback Data:
The human feedback data is composed of feedback tuples, which include:
- Prompt \( P \): The input or question the model needs to respond to.
- Model Completion \( O \): The output generated by the model.
- Human Rating \( R \): The human-assigned rating of the model's completion, indicating its quality.
- Hindsight Feedback \( H \): The human-provided feedback on how to improve the output.
These feedback tuples are ranked by reward (higher-ranked tuples correspond to more successful outputs with better feedback).
Supervised Fine-Tuning:
- The model is fine-tuned on these feedback sequences. The training data takes the form of sequences \( (P, O, R, H) \).
- The model is conditioned on the sequence prefix (previous feedback) to predict the next part of the sequence, particularly the hindsight feedback \( H \).
- This allows the model to self-reflect on its outputs and adjust future generations based on prior feedback.
Training Objective:
- CoH is designed to improve the model’s predictions using a feedback sequence. The model learns to predict the hindsight feedback given the sequence of past prompts, outputs, and ratings.
- By conditioning on the sequence of past outputs and feedback, the model is encouraged to improve over time, learning how to produce better outputs based on the feedback.
Avoiding Overfitting:
- To avoid overfitting to the feedback and common patterns, CoH introduces a regularization term that maximizes the log-likelihood of the pre-training dataset (the data used before fine-tuning). This ensures that the model doesn’t just memorize the feedback patterns but also retains its ability to generalize.
- Additionally, CoH randomly masks 0% - 5% of past tokens during training. This reduces the chance of the model shortcutting and copying feedback directly, which is a common issue when there are repetitive phrases in the feedback sequences.
Training Data:
- The training dataset used in CoH experiments is a combination of:
- WebGPT Comparisons: Comparing outputs generated by WebGPT, a web-based GPT model.
- Summarization from Human Feedback: Datasets where humans provide feedback on summarizations generated by the model.
- Human Preference Dataset: This dataset contains human preferences for different model completions, helping guide the model toward higher-quality outputs.
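A rough sketch of how a CoH-style training example could be serialized, including the random masking of a small fraction of past tokens. The field layout and the feedback-to-text template are assumptions for illustration, not the paper's exact format.

```python
import random

def build_coh_sequence(feedback_tuples, mask_rate=0.05):
    """Serialize ranked (prompt, output, rating, hindsight) tuples into one training sequence."""
    parts = []
    for prompt, output, rating, hindsight in feedback_tuples:   # ordered worst -> best
        parts.append(f"Prompt: {prompt}\nOutput: {output}\nRating: {rating}\nFeedback: {hindsight}")
    text = "\n\n".join(parts)

    # Randomly mask 0-5% of past tokens so the model cannot simply copy earlier feedback.
    tokens = text.split()
    n_mask = int(len(tokens) * random.uniform(0.0, mask_rate))
    for idx in random.sample(range(len(tokens)), n_mask):
        tokens[idx] = "[MASK]"
    return " ".join(tokens)
```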
Algorithm Distillation (AD; Laskin et al. 2023)¶
Purpose of AD: Algorithm Distillation (AD) applies the same idea as CoH but in the context of reinforcement learning (RL). AD uses a history of interactions in RL tasks to improve the agent’s policy, encouraging it to build upon its past learning experiences across multiple episodes.
Key Steps and Components:¶
Cross-Episode Trajectories:
- AD is based on the idea that learning from a history of episodes can improve an agent's performance. The policy is conditioned on the history of actions taken over multiple episodes, rather than just the current episode.
- AD concatenates the learning history from previous episodes and uses this to guide the agent's future actions, expecting that the next action will perform better due to the accumulated experience.
Training with Multi-Episode History:
- AD creates a history-conditioned policy where the agent learns from a sequence of interactions across episodes. Instead of learning from a single episode, the model considers the entire history of past actions to predict better outcomes in future episodes.
- This process helps the agent learn the overall process of reinforcement learning (RL) rather than focusing solely on a task-specific policy.
Behavioral Cloning:
- The paper hypothesizes that any RL algorithm generating a learning history can be distilled into a neural network by performing behavioral cloning over the actions.
- The history data used for training comes from source policies trained for specific tasks. During training, a random task is sampled, and a subsequence of multi-episode history is used to train the policy.
Short Episodes for Multi-Episode History:
- In practice, since the model has a limited context window length, episodes need to be short enough to construct a multi-episode history. Multi-episodic contexts of 2-4 episodes are used for training to learn a near-optimal in-context RL algorithm.
Comparison with Baselines:
AD is compared with three baselines:
- ED (Expert Distillation): A method where the agent learns from expert trajectories rather than learning from its own experiences.
- Source Policy: Policies used to generate trajectories for distillation.
- RL^2: An upper bound method that uses online RL (RL with continual learning over time).
AD outperforms these baselines in environments requiring memory and exploration, demonstrating the power of in-context RL. It improves much faster than ED and approaches the performance of RL^2, even without requiring online RL.
The training dataset in the CoH experiments is a combination of WebGPT comparisons, summarization from human feedback, and the human preference dataset.
After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence. (Image source: Liu et al. 2023)
The Algorithm Distillation (AD) framework captures the idea of improving performance in reinforcement learning (RL) tasks by leveraging learning histories across multiple episodes. Here's an overview of the process illustrated in the diagram:
Data Generation:
- Learning Histories: The training data consists of RL algorithm learning histories from multiple tasks. Each history is a sequence of observations \( o \), actions \( a \), and rewards \( r \) over time for a given task.
- The data spans episodes, and the learning progress is tracked to show the agent's improvement over time.
Model Training:
- Training with a Causal Transformer: The model is trained with a causal Transformer that processes the sequence of past observations, actions, and rewards. The goal is to predict the next action \( a_t \) from the previous context \( (h_{t-1}, o_t) \), conditioned on the historical learning progress.
- The model learns to predict actions over time by conditioning on both past actions and observations, facilitating better decision-making across episodes.
Prediction:
- The final model predicts the next action \( a_t \) given the historical context \( (h_{t-1}, o_t) \) using the learned policy \( P_\theta(a_t \mid h_{t-1}, o_t) \), optimizing performance based on cumulative learning from previous episodes.
The paper hypothesizes that any RL algorithm that generates a set of learning histories (a sequence of actions, observations, and rewards) can be distilled into a neural network by performing behavioral cloning over the actions.
Behavioral Cloning: This is a method where a neural network is trained to mimic the actions of an expert or a source policy. In the case of AD, the model learns by cloning the behavior based on the historical actions, observations, and rewards from previous episodes.
Source policies are trained for specific tasks. These policies are responsible for generating the learning history data that the model will use for training. For example, a source policy might be trained on a specific task like navigation or robotic manipulation. During the training of the AD model, a random task is sampled, and a subsequence of the multi-episode history from the source policy is used for training.
The goal of AD is to train a model that is task-agnostic. This means that, rather than learning a policy for a specific task, the model learns a general policy that works across various tasks. The training uses the history of interactions (comprising multiple episodes) to condition the model’s predictions, which enables it to perform better in a wide range of tasks.
The model used in AD has a limited context window (a fixed number of tokens or time steps that the model can consider at once). This makes it difficult for the model to access long sequences of data at any given time.
To create an effective learning model, AD uses multi-episodic context. This means that rather than using just one episode of experience, the model is trained with a sequence of 2-4 episodes. A multi-episodic context is necessary for learning a near-optimal in-context RL algorithm. It helps the model see patterns and learn from experiences across multiple episodes, which is crucial for RL tasks that require long-term decision making and understanding.
The model's ability to learn from in-context RL requires the model to consider enough past experience to understand the long-term effects of its actions. This is where the history from multiple episodes becomes important, allowing the model to improve its decision-making progressively.
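The data-side idea can be sketched as follows: concatenate a few consecutive episodes from a source policy's learning history into one long sequence, then train a causal model with behavioral cloning on the actions. The data layout and window size here are illustrative assumptions.

```python
import random

def sample_multi_episode_context(learning_history, episodes_per_context=3):
    """learning_history: list of episodes, each a list of (observation, action, reward) steps,
    ordered by when they were generated, so later episodes reflect more learning progress."""
    start = random.randrange(len(learning_history) - episodes_per_context + 1)
    window = learning_history[start:start + episodes_per_context]

    context, targets = [], []
    for episode in window:
        for obs, act, rew in episode:
            context.append((obs, act, rew))   # full (o, a, r) history fed to the causal model
            targets.append(act)               # behavioral-cloning target: the action taken
    return context, targets
```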
AD is compared to several other methods to highlight its performance and efficiency.
Expert Distillation (ED) uses behavior cloning but with expert trajectories (data collected from an expert agent) rather than the learning history of the model. AD improves much faster than ED because it uses its own learning history, which helps the model generalize better and adapt more quickly.
The source policy generates trajectories for distillation using an approach called Upper Confidence Bound (UCB). AD improves more quickly than the source policy method, as it directly learns from the cumulative experience of multiple episodes.
RL^2 is an advanced RL method that uses online RL (learning continuously during training), where the agent learns in real-time by interacting with the environment. Despite using offline RL (learning without real-time interaction), AD shows performance close to RL^2 and learns much faster than other methods.
When conditioned on partial training history from the source policy, AD not only outperforms ED but also learns significantly faster than other methods.
Comparison of AD, ED, source policy and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for "dark" environments and DQN for watermaze. (Image source: Laskin et al. 2023)
Component 2: Memory¶
Memory, both in humans and in computational systems, refers to the processes by which information is acquired, stored, retained, and retrieved when necessary. In the human brain, memory is often divided into different types based on how information is processed and for how long it is retained. Here's an explanation of the different types of memory and their relevance to computational systems, particularly in machine learning and artificial intelligence:
Sensory memory is the ability to retain brief impressions of sensory information (such as visual, auditory, or tactile inputs) after the original stimulus has ended. This type of memory holds information for just a few seconds, allowing us to process and interpret sensory data continuously. Examples include:
- Iconic memory: The brief retention of visual stimuli (e.g., a quick glimpse of a scene or image).
- Echoic memory: The retention of sounds for a few seconds after hearing them.
- Haptic memory: The sensory memory for touch, such as briefly feeling an object.
In AI, sensory memory can be likened to how an AI system processes raw input data (such as images, text, or sound) into initial representations, typically in the form of embeddings. These embeddings represent the raw sensory data, allowing the system to handle it effectively in the next stages.
Short-term memory (STM) or working memory refers to the temporary storage of information that is currently being used for cognitive tasks like reasoning, learning, and decision-making. It has a limited capacity (about 7 items, according to Miller's Law) and can store information for around 20-30 seconds. Examples include:
- Holding a phone number in your mind long enough to dial it.
- Keeping track of instructions or thoughts while solving a problem.
In the context of machine learning, short-term memory can be likened to in-context learning, which is used in models like Transformers. These models have a limited context window where they can process and remember information (like text in a sentence or a sequence of images). The context window is restricted, so the model can only keep a short-term memory of the data being processed at any given time.
Long-term memory (LTM) refers to the storage of information for extended periods, from a few days to decades. The capacity is essentially unlimited. It is divided into two subtypes:
- Explicit/Declarative Memory: This involves conscious recall of facts and events.
- Episodic memory: Memory of personal experiences and events.
- Semantic memory: Memory of general knowledge, facts, and concepts.
- Implicit/Procedural Memory: This involves unconscious memory of skills and habits that are performed automatically, like riding a bike or typing on a keyboard.
In AI, long-term memory can be compared to external vector stores or databases, where a model stores vast amounts of information for retrieval. For example, models can store knowledge such as general facts, learned experiences, and skills that can be retrieved when needed, enabling the model to make more informed decisions over time.
Sensory memory in machine learning is similar to the initial representation of raw inputs (e.g., text, images, or other modalities). When raw data is input into a model, it is first converted into an embedding, which captures the essential features of the data to be processed further.
Short-term memory in models like Transformers is an example of in-context learning. Here, the model keeps track of a limited context (like the recent words in a sentence or the frames in a video) to make decisions. However, this memory is short-lived, constrained by the model’s context window, similar to how humans can hold a limited amount of information in short-term memory.
Long-term memory in AI systems is akin to a vector store or external database, where the model can store and retrieve large amounts of data when needed. This memory can be accessed efficiently to enhance decision-making, similar to how humans use past experiences to inform current decisions.
By incorporating these types of memory into AI systems, models can handle tasks more effectively, using short-term context for immediate tasks and long-term memory to improve their overall performance through accumulated knowledge.
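As a toy illustration of the long-term-memory idea, here is a minimal in-memory vector store using exact inner-product search with NumPy. A real agent would typically use an ANN library such as those discussed below, and `embed` is an assumed embedding function supplied by the caller.

```python
import numpy as np

class VectorMemory:
    """Tiny long-term memory: store text with its embedding, retrieve by inner product."""
    def __init__(self, embed):
        self.embed = embed          # callable: str -> 1-D numpy array
        self.texts, self.vectors = [], []

    def add(self, text: str):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def search(self, query: str, k: int = 3):
        q = self.embed(query)
        scores = np.stack(self.vectors) @ q          # inner-product similarity (MIPS)
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]
```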
MIPS (Maximum Inner Product Search) is a method used in situations where we need to retrieve the most relevant or similar data points based on the maximum inner product. The goal is to quickly find the data that is most similar to a given query in a high-dimensional space, where data is typically represented as vectors (e.g., embeddings). This process can be computationally expensive, so optimization strategies are necessary to handle large-scale datasets efficiently.
External memory systems can help alleviate the limitations of finite attention spans (the amount of data a model can process at once). A common practice to address this is storing the embedding representations of data in a vector store database, which allows for quick similarity searches. To make these searches faster, approximate nearest neighbor (ANN) algorithms are commonly used, which prioritize speed over perfect accuracy.
Here’s how some of the popular ANN algorithms work for fast MIPS:
LSH (Locality-Sensitive Hashing)
LSH is a technique that uses a hashing function to map similar items to the same hash buckets with high probability. The idea is that, for a given data set, similar input items will likely be hashed to the same bucket, reducing the number of potential comparisons needed for searching. The number of buckets is much smaller than the number of inputs, making this method very efficient. LSH is widely used for high-dimensional data as it reduces the problem size when searching for similar vectors.
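A minimal random-hyperplane LSH sketch (this variant targets cosine/angular similarity, the classic textbook example, rather than inner product directly):

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hash vectors by the sign pattern of random projections; similar vectors collide often."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))   # one random hyperplane per bit
        self.buckets = {}

    def _hash(self, v):
        return tuple((self.planes @ v) > 0)            # sign pattern = bucket key

    def add(self, key, v):
        self.buckets.setdefault(self._hash(v), []).append(key)

    def query(self, v):
        # Only items in the same bucket are candidate neighbors.
        return self.buckets.get(self._hash(v), [])
```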
ANNOY (Approximate Nearest Neighbors Oh Yeah)
ANNOY builds the search structure using random projection trees, which are binary trees. Each non-leaf node in the tree represents a hyperplane that splits the input space into two halves. Each leaf node stores a data point. Multiple random trees are built, and the search process happens across all trees. When performing a search, ANNOY checks which half of the data space is closest to the query and iterates through the trees accordingly.
The technique is somewhat similar to KD trees but more scalable because it builds trees randomly rather than deterministically. ANNOY provides fast search speeds while still maintaining reasonable accuracy.
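A short usage sketch with the `annoy` library (assuming `pip install annoy` and a reasonably recent version that supports the `"dot"` metric); the number of trees and other parameter values are illustrative.

```python
from annoy import AnnoyIndex
import random

dim = 64
index = AnnoyIndex(dim, "dot")   # "dot" scores neighbors by inner product
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                  # build 10 random projection trees; more trees -> better recall

query = [random.gauss(0, 1) for _ in range(dim)]
ids, dists = index.get_nns_by_vector(query, 5, include_distances=True)
print(ids, dists)                # 5 approximate nearest items and their scores
```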
HNSW (Hierarchical Navigable Small World)
HNSW is based on the concept of small-world networks, which are networks where most nodes can be reached from any other node within a few steps (think of the “six degrees of separation” idea). HNSW constructs hierarchical layers of small-world graphs. The bottom layer contains the actual data points, while the middle layers provide shortcuts to speed up searches.
During a search, HNSW starts from a random node in the top layer and navigates down towards the target by moving to nodes that are closer to the query. Each move in the upper layers can cover large distances, while each move in the lower layers refines the search, improving accuracy. This combination of coarse, long-range traversal in the upper layers and fine-grained refinement in the lower layers makes HNSW highly effective for large-scale searches.
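A usage sketch with the `hnswlib` library (assuming `pip install hnswlib`); the `M`, `ef_construction`, and `ef` values are illustrative knobs that trade accuracy against speed and memory.

```python
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.default_rng(0).normal(size=(num_elements, dim)).astype(np.float32)

index = hnswlib.Index(space="ip", dim=dim)             # "ip" = inner product
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)                                       # higher ef -> more accurate, slower queries
labels, distances = index.knn_query(data[:3], k=5)     # approximate top-5 for three queries
print(labels)
```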
FAISS (Facebook AI Similarity Search)
FAISS assumes that, in high-dimensional space, the distances between data points often follow a Gaussian distribution, meaning there’s a tendency for points to cluster together. FAISS uses this property to perform efficient searches by applying vector quantization. The vector space is partitioned into clusters, and the search process first identifies the most relevant clusters with coarse quantization. Then, it uses finer quantization within those clusters to refine the search and retrieve the most similar items.
FAISS is especially effective for large-scale search problems and is optimized for both CPU and GPU acceleration.
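A usage sketch with the `faiss` library (assuming `pip install faiss-cpu`): a coarse IVF quantizer partitions the space into clusters, and the search only visits `nprobe` of them; the `nlist`/`nprobe` values are illustrative.

```python
import faiss
import numpy as np

dim, nlist = 128, 100
data = np.random.default_rng(0).normal(size=(50_000, dim)).astype("float32")

quantizer = faiss.IndexFlatIP(dim)    # coarse quantizer over the cluster centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(data)                     # learn the nlist cluster centroids
index.add(data)

index.nprobe = 10                     # number of clusters to visit per query
scores, ids = index.search(data[:3], 5)   # approximate top-5 for three query vectors
print(ids)
```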
ScaNN (Scalable Nearest Neighbors)
ScaNN innovates by using anisotropic vector quantization. Instead of just quantizing a data point to the closest centroid, ScaNN quantizes the data point such that the inner product is as similar as possible to the original distance, which ensures that the vector’s relationship with other data points is preserved. This improves the search process, especially for high-dimensional data where traditional quantization might not capture all relationships accurately. ScaNN provides efficient and scalable nearest neighbor search, especially when working with large datasets.
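A usage sketch following the builder pattern from the ScaNN Python package (assuming `pip install scann`), based on the example in the ScaNN README; the tree and reordering parameters are illustrative, and `score_ah` is the asymmetric-hashing scorer that applies the anisotropic quantization described above.

```python
import numpy as np
import scann

data = np.random.default_rng(0).normal(size=(50_000, 128)).astype(np.float32)

searcher = (
    scann.scann_ops_pybind.builder(data, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=25_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)   # anisotropic quantization scoring
    .reorder(100)                                          # exact re-scoring of top candidates
    .build()
)

neighbors, distances = searcher.search(data[0])   # approximate top-10 by inner product
print(neighbors)
```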
Summary
These approximate nearest neighbor algorithms offer different methods for balancing speed and accuracy in the Maximum Inner Product Search (MIPS) process:
- LSH uses hashing to group similar items together.
- ANNOY builds random projection trees for efficient searches.
- HNSW leverages hierarchical small-world networks for fast traversal.
- FAISS applies vector quantization and clustering to refine search results.
- ScaNN uses anisotropic quantization to preserve relationships between data points.
Each of these algorithms is designed to optimize the retrieval of the most relevant data points quickly, making them invaluable tools for large-scale machine learning and AI systems.
Here's a simple analogy to help remember each of the algorithms used for fast Maximum Inner Product Search (MIPS):
LSH (Locality-Sensitive Hashing)
- Analogy: Think of LSH like a library where books are sorted into categories based on topics. Instead of searching through every single book, you only need to look in the category (or "bucket") that most likely contains the book you're looking for. The books that are similar to each other are more likely to be in the same category.
- Key Idea: Similar items are grouped together in buckets to speed up the search.
ANNOY (Approximate Nearest Neighbors Oh Yeah)
- Analogy: Imagine you are trying to find the closest neighbors in a neighborhood. You walk down a random street (the first tree), and if you don’t find your target, you follow the path in the next tree that seems closest, repeatedly. You move through different streets (trees) to narrow down the best match.
- Key Idea: A set of random trees splits the space, and you search through them to find the closest match.
HNSW (Hierarchical Navigable Small World)
- Analogy: Picture a hierarchical maze where the higher floors (top layers) have shortcuts to faster paths, and as you go down, the paths become more specific. You start from the top, taking broad steps, then move down to refine your search in more detail.
- Key Idea: The search starts broadly in high layers and becomes more specific as you go deeper, mimicking the "six degrees of separation" idea.
FAISS (Facebook AI Similarity Search)
- Analogy: Imagine you are searching for a specific type of fruit in a supermarket. The supermarket is divided into sections (clusters) like apples, bananas, etc. First, you identify the section, then within that section, you find the exact fruit. FAISS uses this concept of clustering and refining the search within smaller groups.
- Key Idea: You search within "broad" clusters first, then refine within those clusters for more accurate results.
ScaNN (Scalable Nearest Neighbors)
- Analogy: Think of ScaNN as sorting a deck of cards by their relative values. Instead of just grouping cards by color or number, you organize them to preserve their relationships (like higher cards being close to each other). This way, when you want to find a nearby card, you are more likely to find the right one by focusing on those relationships.
- Key Idea: ScaNN organizes data by preserving its relationships for better accuracy in nearest neighbor searches.
More MIPS algorithms and performance comparisons can be found at ann-benchmarks.com.
MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language,” is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules, and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g., deep learning models) or symbolic (e.g., math calculator, currency converter, weather API).
They ran an experiment on fine-tuning an LLM to call a calculator, using arithmetic as a test case. The experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because the LLM (a 7B Jurassic-1 large model) failed to reliably extract the right arguments for basic arithmetic. The results highlight that even when external symbolic tools can work reliably, knowing when and how to use them is crucial, and that is determined by the LLM's capability.
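A minimal, hypothetical sketch of the routing idea (not the MRKL implementation): the LLM, stubbed out here as `call_llm`, picks an expert module by name and extracts its argument, and the symbolic expert does the actual work.

```python
# Hypothetical sketch of MRKL-style routing; `call_llm` is a stand-in for a real LLM.
EXPERTS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy symbolic module
    "weather": lambda city: f"(a real system would query a weather API for {city})",
}

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; here we fake a routing decision so the sketch runs.
    return "calculator: 2 * (3 + 4)"

def route(query: str) -> str:
    menu = "\n".join(f"- {name}" for name in EXPERTS)
    decision = call_llm(
        "Choose the best expert for the query and extract its argument.\n"
        f"Experts:\n{menu}\nQuery: {query}\nAnswer as 'expert: argument'."
    )
    name, _, argument = decision.partition(":")
    return EXPERTS[name.strip()](argument.strip())

print(route("What is two times the sum of three and four?"))  # -> 14
```

As the MRKL experiments suggest, the hard part in practice is not executing the expert module but getting the LLM to route correctly and extract well-formed arguments.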
Both TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) fine-tune an LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation improves the quality of the model's outputs.
ChatGPT Plugins [https://openai.com/blog/chatgpt-plugins] and OpenAI API function calling [https://platform.openai.com/docs/guides/gpt/function-calling] are good examples of LLMs augmented with tool-use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calling).
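A sketch of tool use via function calling, following the style of the OpenAI guide linked above at the time it was written (the client library has since evolved, so treat the exact call shape as illustrative); `get_current_weather` is a hypothetical developer-defined tool, not a built-in.

```python
import json
import openai  # pre-1.0 client style, matching the guide linked above

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Boston"},
        },
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Do I need an umbrella in Boston today?"}],
    functions=functions,
    function_call="auto",          # let the model decide whether to call the tool
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    # The developer executes the tool and feeds the result back in a follow-up message.
    print("Model wants to call:", message["function_call"]["name"], args)
```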
HuggingGPT (Shen et al. 2023) is a framework to use ChatGPT as the task planner to select models available in the HuggingFace platform according to the model descriptions and summarize the response based on the execution results.
Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)
The system comprises four stages that guide the process of handling user requests through an LLM-based framework.
Task Planning:
In this stage, the LLM acts as the brain of the system, parsing user input into multiple tasks. Each task is associated with four attributes:
- Task type: The type of operation or processing to be done (e.g., generating text, processing an image).
- ID: A unique identifier for the task.
- Dependencies: The tasks that must be completed before the current task can proceed.
- Arguments: The inputs (such as text, image URL, audio URL) required for the task.
The LLM uses few-shot examples to help guide it in parsing the user request and planning the tasks. The system must ensure that the tasks are logically ordered, and dependencies are respected. For example, if a task requires a text generated by another task, the dependency task ID is recorded.
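A hypothetical example of what a parsed task plan could look like, using the task/ID/dependency/argument attributes described above (the exact field names and placeholder syntax in the paper's prompts may differ):

```python
# Hypothetical parsed plan for: "describe this image and then read the description aloud".
task_plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "example.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0], "args": {"text": "<resource>-0"}},
]
# dep = [-1] means no prerequisite; "<resource>-0" is an illustrative placeholder for
# the text produced by task 0, which task 1 consumes.
```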
Model Selection:
Once the tasks are planned, the LLM selects appropriate expert models to handle each task. The LLM is given a list of candidate models and asked to choose the most suitable one for the specific task. The selection process involves framing the task as a multiple-choice question and filtering the available models based on the task type. The LLM outputs the selected model ID along with a detailed reason for its choice.
Task Execution:
The selected expert models execute the tasks based on the inputs provided. The results of these tasks are logged, and the LLM describes the execution process and results. The system outputs the user-requested response first, then provides an explanation of the task execution process, including how the models arrived at their predictions. If the results involve file paths, the LLM also provides the full path to the user.
Response Generation:
In this final stage, the LLM compiles the results from the executed tasks and provides a summarized response to the user. This is the end output of the system, presenting the final results after the tasks have been processed and the necessary models have been executed.
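Putting the four stages together, a minimal, hypothetical orchestration loop could look like the sketch below; `call_llm`, `run_model`, and `candidate_models` are injected stand-ins for the LLM planner/selector, HuggingFace model execution, and the model registry, none of which are specified by the paper in this form.

```python
def run_hugginggpt_style(user_request, call_llm, run_model, candidate_models):
    # 1. Task planning: the LLM parses the request into structured tasks.
    tasks = call_llm(f"Plan tasks (task, id, dep, args) for: {user_request}")

    results = {}
    for task in sorted(tasks, key=lambda t: t["id"]):
        # 2. Model selection: framed as a multiple-choice question over candidates
        #    filtered by task type.
        options = candidate_models[task["task"]]
        model_id = call_llm(f"Pick the best model for '{task['task']}' from {options}")

        # 3. Task execution: run the selected model; any argument value that names an
        #    earlier task id is replaced by that task's result.
        inputs = {k: results.get(v, v) for k, v in task["args"].items()}
        results[task["id"]] = run_model(model_id, inputs)

    # 4. Response generation: the LLM summarizes all execution results for the user.
    return call_llm(f"Summarize these results for the user: {results}")
```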
To make this system functional in real-world applications, several challenges need to be addressed:
- Efficiency: The process can slow down due to the multiple rounds of LLM inference and interactions with other models. Optimizing for speed is necessary.
- Context Window: The system relies on a long context window to manage the complexity of task content and dependencies. Handling this effectively is crucial for managing large-scale tasks.
- Stability: Ensuring stable outputs from both the LLM and external models is essential for reliable performance in real-world applications.
This multi-stage framework helps in handling complex tasks by breaking them down into manageable steps, delegating them to specialized models, and generating a coherent response for the user.
Imagine the system as a project management process. First, in the task planning stage, the LLM acts as the project manager, breaking down the user's request into smaller, manageable tasks. Each task is like a project with its own type, ID, dependencies, and required resources. Next, in model selection, the LLM is like a manager choosing the right expert for each task. It selects the best model from a list, much like picking the right person for the job based on the task's requirements. Then, in task execution, the experts carry out their assigned tasks, much like specialists working on different parts of a project. The LLM logs the results and explains how the task was completed. Finally, in response generation, the LLM compiles the results and provides the client (the user) with a summary of the completed tasks. Throughout this process, challenges like efficiency, handling complex data, and ensuring stability are akin to managing a large project where everything must be coordinated smoothly to deliver the final product.
API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculators, calendar queries, smart home control, schedule management, health data management, account authentication workflows, and more. Because there are a large number of APIs, the LLM first has access to an API search engine to find the right API to call and then uses the corresponding documentation to make the call.
The provided algorithm outlines the process for handling an API call based on a user's statement. Here's a breakdown of the process:
- Input: The user's statement (`us`) is received as input.
- Check if an API call is needed: The algorithm first checks whether an API call is necessary based on the user's statement. If it is, the process continues.
- Search for the API: While an appropriate API has not been found, the system summarizes the user's statement (`keywords ← summarize(us)`) and searches for an API based on those keywords (`api ← search(keywords)`). If the "Give Up" condition is reached, the loop breaks and no API is called.
- Retrieve documentation: Once an API is found, the algorithm retrieves its documentation (`api_doc ← api.documentation`).
- Generate and execute the API call: While the response is not satisfactory, the system generates an API call (`api_call ← gen_api_call(api_doc, us)`) and executes it (`api_re ← execute_api_call(api_call)`). If the "Give Up" condition is met again, the loop breaks.
- Generate a response: If an API response was obtained (`if response`), the system generates a reply based on it (`re ← generate_response(api_re)`); otherwise, a default reply is generated (`re ← generate_response()`).
- Output: Finally, the generated reply is returned as `ResponseToUser`.
In summary, the algorithm is designed to search for an appropriate API, attempt to generate and execute the call based on the user’s statement, and return a response, either from the API or a default one. If the process is unsuccessful or the system cannot find an API, it can stop based on the "Give Up" condition.
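A hedged Python sketch of this control flow, with `summarize`, `search_api`, `gen_api_call`, `execute_api_call`, `generate_response`, and the predicate helpers all assumed to be LLM- or service-backed callables supplied by the surrounding system, as named in the pseudocode above:

```python
def handle_request(us, summarize, search_api, gen_api_call, execute_api_call,
                   generate_response, needs_api, satisfied, give_up):
    # Mirrors the API-Bank pseudocode; every callable is an injected stand-in.
    if not needs_api(us):
        return generate_response()

    api = None
    while api is None:                     # search loop
        keywords = summarize(us)
        api = search_api(keywords)
        if give_up():
            return generate_response()     # stop without calling any API

    api_doc = api.documentation            # retrieve the API documentation

    api_re = None
    while not satisfied(api_re):           # generate-and-execute loop
        api_call = gen_api_call(api_doc, us)
        api_re = execute_api_call(api_call)
        if give_up():
            break

    return generate_response(api_re) if api_re else generate_response()
```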
Scientific discovery agents combine large language models (LLMs) with specialized expert tools to tackle complex scientific problems, such as organic synthesis, drug discovery, and materials design. One example is ChemCrow (Bran et al. 2023), a domain-specific agent built to assist in the chemical and biological sciences.
The workflow in ChemCrow follows a structure similar to other reasoning-based systems like ReAct and MRKL. It uses Chain-of-Thought (CoT) reasoning to guide the model's problem-solving approach, combined with expert-designed tools for specific tasks. The LLM is provided with:
- A list of tool names,
- Descriptions of each tool's utility,
- Information about the expected input and output for each tool.
Using this information, the LLM is instructed to answer a user’s prompt, using the appropriate tools when necessary. The instruction follows the ReAct format, which consists of:
- Thought: The model's reasoning about the task.
- Action: The steps it needs to take to accomplish the task.
- Action Input: The input needed for those actions.
- Observation: The feedback or output from those actions.
The model performs a task by combining reasoning with the right tools, such as a chemical synthesis tool, a drug discovery database, or a material design application. In the case of ChemCrow, it has access to 13 expert-designed tools that help the LLM accomplish its tasks in organic chemistry, drug development, and more.
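A hypothetical sketch of how such a ReAct-style prompt might be assembled in code (the tool names and wording are illustrative, not taken from ChemCrow's actual tool set or prompts):

```python
TOOLS = {
    "reaction_predictor": "Predicts the product of a described chemical reaction.",
    "literature_search": "Searches papers for synthesis routes and safety data.",
}

def build_react_prompt(question: str) -> str:
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "Answer the question, using the tools below when helpful.\n"
        f"Tools:\n{tool_list}\n\n"
        "Use this format:\n"
        "Thought: reasoning about what to do next\n"
        "Action: the tool to use\n"
        "Action Input: the input to the tool\n"
        "Observation: the tool's output\n"
        "(repeat Thought/Action/Action Input/Observation as needed)\n"
        "Final Answer: the answer to the question\n\n"
        f"Question: {question}\n"
    )

print(build_react_prompt("Propose a one-step synthesis of aspirin."))
```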
An interesting finding from ChemCrow’s experiments showed that while GPT-4 and ChemCrow performed similarly when evaluated using LLM-based methods, human experts who focused on the chemical correctness of the solutions concluded that ChemCrow outperformed GPT-4 significantly. This highlights a potential limitation of LLMs—they may not be able to accurately evaluate the correctness of solutions in specialized domains without expertise. The LLM, without knowledge of the domain, could miss flaws in its outputs and fail to correctly assess their validity.
In another study by Boiko et al. (2023), they explored LLM-empowered agents for scientific discovery, which could autonomously design, plan, and perform complex scientific experiments. These agents were equipped with various tools, such as the ability to browse the internet, read documentation, execute code, and call APIs for robotics experimentation.
For example, when tasked with "developing a novel anticancer drug," the agent followed a series of reasoning steps:
- Inquired about current trends in anticancer drug discovery.
- Selected a target for the drug.
- Requested a scaffold to target these compounds.
- Once a compound was identified, the agent attempted to synthesize it.
While testing this approach, the researchers also highlighted risks, such as the potential for the agent to produce illicit drugs or bioweapons. In their test set, which included requests to synthesize known chemical weapons, the agent was asked to produce synthetic pathways for these dangerous compounds. It successfully generated solutions for 4 out of 11 requests (36%). However, 7 of the requests were rejected, with 5 of them happening after a Web search, and 2 being rejected based on the prompt alone.
This work demonstrates the promising potential of LLM-augmented agents for scientific discovery, but also underscores the importance of safeguards to ensure that these systems do not engage in dangerous or unethical activities.
Generative Agents (Park, et al. 2023) [https://arxiv.org/pdf/2304.03442] is an experiment that uses large language models (LLMs) to simulate the behavior of virtual characters in a sandbox environment. Inspired by games like The Sims, this experiment aims to create lifelike simulations of human behavior where 25 virtual characters interact with each other and their environment, guided by their past experiences.
Here's a breakdown of how it works:
Memory Stream: This component acts as a long-term memory for the agents. It stores a record of all the events or experiences the agents encounter in natural language. These experiences are updated as the agents interact with each other or with their environment. If one agent communicates with another, it could trigger a new statement in the memory, making the memory stream a dynamic and evolving record of the agents' experiences.
Retrieval Model: When the agent needs to make a decision or react to something in the environment, it uses the retrieval model to surface relevant past experiences from the memory stream. The model ranks memories with a weighted combination of three factors (a minimal scoring sketch follows this list):
- Recency: More recent memories are given higher priority.
- Importance: Core memories, which are more meaningful or significant, are distinguished from mundane memories and given more weight.
- Relevance: The memories that are more closely related to the current situation are prioritized.
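A minimal sketch of that weighted scoring; the weights, the decay rate, and the memory record fields are illustrative assumptions rather than the paper's exact values, and the embeddings are assumed to come from an external encoder.

```python
import numpy as np

def retrieval_score(memory, query_vec, now_hours,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0, decay=0.99):
    # memory is assumed to carry: last_access_hours, importance in [0, 1], embedding
    recency = decay ** (now_hours - memory["last_access_hours"])   # recent memories score higher
    importance = memory["importance"]                              # e.g. LLM-rated, rescaled to [0, 1]
    v = memory["embedding"]
    relevance = float(v @ query_vec /
                      (np.linalg.norm(v) * np.linalg.norm(query_vec) + 1e-9))
    return w_recency * recency + w_importance * importance + w_relevance * relevance

def retrieve(memories, query_vec, now_hours, k=5):
    # Rank all candidate memories and keep the top-k to include in the agent's prompt.
    return sorted(memories,
                  key=lambda m: retrieval_score(m, query_vec, now_hours),
                  reverse=True)[:k]
```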
Reflection Mechanism: This is a higher-level process where the agent synthesizes its memories to make more abstract inferences. Instead of just recalling memories, the agent reflects on past events and forms high-level conclusions that guide its future behavior. For example, the system may prompt the LLM with the 100 most recent observations and ask it to identify the three most important questions related to those observations. The LLM answers those questions, and the agent uses these insights to plan its next actions.
Planning & Reacting: After reflecting on past experiences and current events, the agent needs to decide what actions to take next. The planning process helps the agent balance short-term and long-term needs to optimize believability in the simulation. The agent’s plan is influenced by its relationships with other agents, as well as the observations it has made about others. The environment in which these agents live is structured as a tree, which helps organize information for planning and decision-making.
In essence, generative agents are capable of remembering past events, reflecting on them to make better decisions, and using those reflections to create a plan for future actions, all while interacting with other agents and the environment in a meaningful way.
This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others).
References¶
- Wei et al. "Chain of Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
- Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv preprint arXiv:2305.10601 (2023).
- Liu et al. "Chain of Hindsight Aligns Language Models with Feedback." arXiv preprint arXiv:2302.02676 (2023).
- Liu et al. "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency." arXiv preprint arXiv:2304.11477 (2023).
- Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
- Google Blog. "Announcing ScaNN: Efficient Vector Similarity Search." July 28, 2020.
- Shinn & Labash. "Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection." arXiv preprint arXiv:2303.11366 (2023).
- Laskin et al. "In-Context Reinforcement Learning with Algorithm Distillation." ICLR 2023.
- Karpas et al. "MRKL Systems: A Modular, Neuro-Symbolic Architecture that Combines Large Language Models, External Knowledge Sources, and Discrete Reasoning." arXiv preprint arXiv:2205.00445 (2022).
- Nakano et al. "WebGPT: Browser-Assisted Question-Answering with Human Feedback." arXiv preprint arXiv:2112.09332 (2021).
- Parisi et al. "TALM: Tool Augmented Language Models." arXiv preprint (2022).
- Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv preprint arXiv:2302.04761 (2023).
- Weaviate Blog. "Why is Vector Search so Fast?" Sep 13, 2022.
- Li et al. "API-Bank: A Benchmark for Tool-Augmented LLMs." arXiv preprint arXiv:2304.08244 (2023).
- Shen et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace." arXiv preprint arXiv:2303.17580 (2023).
- Bran et al. "ChemCrow: Augmenting Large Language Models with Chemistry Tools." arXiv preprint arXiv:2304.05376 (2023).
- Boiko et al. "Emergent Autonomous Scientific Research Capabilities of Large Language Models." arXiv preprint arXiv:2304.05332 (2023).
- Park et al. "Generative Agents: Interactive Simulacra of Human Behavior." arXiv preprint arXiv:2304.03442 (2023).
- AutoGPT
- GPT-Engineer