XAgent: An Autonomous Agent for Complex Task Solving

by: The XAgent Team, Sep 28, 2023


Introduction

The aspiration to develop intelligent agents, capable of mimicking human cognition and executing intricate tasks autonomously, has always captivated the attention of the AI community. The emergence of large language models (LLMs) has ushered in a new era of autonomous agents: LLMs can interpret human intent, generate intricate plans, and act autonomously. Hence, they possess an unparalleled ability to make decisions with human-like complexity.

While pioneering projects (e.g., AutoGPT, BabyAGI, CAMEL, MetaGPT, AutoGen, DSPy, AutoAgents, OpenAgents, Agents, AgentVerse, ChatDev) have already demonstrated potential in this direction, the journey to fully autonomous AI agents still presents formidable challenges. Specifically, they fall short in the following aspects:

  • Limited Autonomy: Existing agents are bound by human-imposed rules, knowledge, and biases, confining their problem-solving capacities in diverse real-world scenarios.
  • Rigid Task Management: Existing agents lack flexibility in high-level task management and low-level task execution, often struggling to divide and conquer complex tasks.
  • Instability and Insecurity: Existing agents' decision-making and execution processes are usually tightly coupled without clear separation, risking system stability and security.
  • Inconsistent Communication Frameworks: Existing agents lack a standardized mode of communication, leading to potential misunderstandings and integration challenges.
  • Limited Human-Agent Interaction: Existing agents do not allow for active human intervention, making them less adaptable and less collaborative in uncertain situations.

In light of these issues, we introduce XAgent, an autonomous agent designed for complex task solving.

Core Design Philosophy

Figure: Overview of XAgent.

Dual-loop Mechanism for Planning and Execution

Existing AI agents (e.g., MetaGPT) have largely been dictated by human-crafted pipelines, making them less agents of their own will and more extensions of their human designers. Such systems, though effective on certain specific tasks, limit the potential of agents by confining them within the bounds of pre-existing human knowledge and biases. To pave the way for genuine autonomy, XAgent deliberately abstains from infusing human prior knowledge into the system design. Instead, we empower our agents with the capability to undertake their own planning and decision-making processes.

We believe that tackling multifaceted tasks demands considering both the holistic view of the problem and its individual components. Previous agents often lacked the ability to globally plan, focusing more on executing tasks based on predefined strategies. XAgent adopts a dual-loop mechanism: an outer-loop process for high-level task management and an inner-loop process for low-level task execution. The outer-loop process enables the agent to discern and segment overarching tasks into smaller, more actionable components. This hierarchical division mirrors the natural cognitive process humans adopt when approaching intricate challenges. The inner-loop process, in contrast, functions as the detailed executor, laser-focused on the granular aspects of the segmented tasks. The inner-loop embodies the meticulous steps we undertake to solve specific parts of a problem. By separating high-level planning from low-level task execution, XAgent mirrors the natural cognitive hierarchy humans employ and could iteratively refine plans based on the execution results.
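As a minimal sketch (with hypothetical class and method names; not XAgent's actual code), the dual-loop mechanism can be pictured as an outer planning loop that dispatches subtasks to an inner execution loop and refines the remaining plan based on its feedback:

```python
from collections import deque

def solve(task, plan_agent, tool_agent):
    """Minimal dual-loop sketch: the outer loop plans, the inner loop executes.

    `plan_agent` and `tool_agent` are hypothetical objects standing in for
    XAgent's PlanAgent and ToolAgent.
    """
    # Outer loop: decompose the task into a queue of subtasks.
    subtasks = deque(plan_agent.initial_plan(task))
    while subtasks:
        subtask = subtasks.popleft()
        # Inner loop: a ToolAgent works on this subtask until it submits.
        feedback = tool_agent.execute(subtask)
        # Outer loop: refine the remaining plan based on the feedback.
        subtasks = deque(plan_agent.refine(list(subtasks), feedback))
    return plan_agent.summarize(task)
```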

ToolServer: Tool Execution Docker

Achieving resilience, efficiency, and scalability is paramount for an agent system. Unlike conventional systems, XAgent champions these attributes using ToolServer, which serves as the execution engine. It operates within a Docker environment, providing an isolated and secure space for tool execution. This isolation ensures that actions carried out by the tool do not jeopardize the stability or security of the main system. This design brings many benefits: (1) safety: Running tools within a Docker container protects the main system from potential harm; (2) modularity: Separating the roles of agent planning and tool execution allows for more manageable code, easier debugging, and scalability; (3) efficiency: The system can start, stop, and restart nodes based on the demand and usage patterns, leading to optimal resource usage. With ToolServer, XAgent disentangles the complexities of the LLM's decision-making process from the tool execution.
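As an illustration of this separation, the agent side can treat tool execution as a remote call to ToolServer. The endpoint name and payload shape below are assumptions made for the sketch, not XAgent's published API:

```python
import requests

TOOLSERVER_URL = "http://localhost:8080"  # hypothetical ToolServer address

def call_tool(session_id: str, tool_name: str, arguments: dict) -> dict:
    """Send one tool call to an isolated ToolServer node and return its result."""
    response = requests.post(
        f"{TOOLSERVER_URL}/execute_tool",  # illustrative endpoint name
        json={"session_id": session_id, "tool": tool_name, "arguments": arguments},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```

Because every call crosses a process and container boundary, a crash or destructive command inside a tool cannot corrupt the agent's own state.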

Function Calling: The Universal Language of XAgent

A structured mode of communication is essential to the robustness of an agent system. Hence, we employ OpenAI's function calling as the universal language of XAgent. This brings several critical attributes: (1) Structured Communication: Function calls are inherently formatted in a manner that clearly states what is required and expected. This structure minimizes misunderstandings and potential errors; (2) Unified Framework: Different tasks, be it summarization, planning, or API calling, might require distinct approaches in traditional AI systems. By transmuting all tasks into specific function calls, we ensure that every task is approached in a consistent manner. This unification simplifies system design; (3) Seamless Integration with External Tools: Agents often need to communicate with external systems, databases, or tools. Function calling allows for this communication to be standardized, offering a common language that both the agent and the external tool can understand.
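For concreteness, here is what a single tool exposed through function calling might look like, using the pre-1.0 openai Python SDK; the web_search function is a hypothetical example, not one of XAgent's built-in tools:

```python
import openai  # pre-1.0 SDK style (openai.ChatCompletion); adjust for newer versions

# A hypothetical tool described in OpenAI's function-calling JSON Schema format.
web_search_function = {
    "name": "web_search",
    "description": "Search the web and return the top results.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."},
            "num_results": {"type": "integer", "description": "How many results to return."},
        },
        "required": ["query"],
    },
}

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Find recent news about autonomous agents."}],
    functions=[web_search_function],
    function_call="auto",  # let the model decide whether to call the function
)
# The model replies either with plain text or with a structured function call
# (response.choices[0].message.get("function_call")) that the system can execute.
```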

Synergistic Human-Agent Collaboration

XAgent adopts an interactive mechanism tailored for enhanced human-agent interaction. XAgent allows users to actively intervene and guide its decision-making process. Firstly, it provides an intuitive interface where users can override or modify actions proposed by XAgent, thereby combining machine efficiency with human intuition and expertise. Secondly, in cases where XAgent confronts an unfamiliar challenge, it is equipped with the "AskHumanforHelp" tool. This tool solicits real-time feedback, suggestions, or guidance from the user, ensuring that the agent functions optimally even in uncertain terrain. This interactive paradigm, merging machine autonomy with human wisdom, fosters a symbiotic relationship between humans and XAgent.
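A minimal stand-in for this kind of tool, assuming a simple console interaction rather than XAgent's actual interface, could look like:

```python
def ask_human_for_help(question: str) -> str:
    """Illustrative sketch of an AskHumanforHelp-style tool.

    When the agent lacks information, it pauses, asks the user, and folds the
    answer back into its context. Here we simply read from standard input.
    """
    print(f"[XAgent needs your help] {question}")
    return input("Your answer: ")
```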

Framework

Overview: Planning (Outer-loop) and Execution (Inner-loop)

In XAgent, the decision-making and task-execution processes are orchestrated through a dual-loop mechanism: the outer-loop and the inner-loop. In essence, the outer-loop deals with the high-level management and distribution of tasks, and the inner-loop focuses on the low-level execution and optimization of each sub-task.

Figure: Dual-loop mechanism of XAgent.

Outer-Loop

The outer-loop serves as the high-level planner and the primary orchestrator of tasks, acting as a supervisor for the entire problem-solving sequence. Its responsibilities can be broken down as follows:

  • Initial Plan Generation: The PlanAgent first generates an initial plan, which lays down a basic strategy for task execution. This involves breaking down a given complex task T into smaller, more manageable sub-tasks, which can be organized as a task queue. These sub-tasks are clear in intent and can be executed more directly without overwhelming the system. Formally, the PlanAgent decomposes the complex task T into a series of subtasks {T_1, ..., T_N}.
  • Iterative Plan Refinement: After the initial planning, the PlanAgent pops the first subtask from the task queue and passes it to the inner-loop. The PlanAgent continuously monitors the progress and status of tasks. After the execution of each subtask, the inner-loop reports Feedback from the ToolAgent. Based on this feedback, the PlanAgent triggers the appropriate handling mechanism, such as refining the plan or continuing with subsequent subtasks. Formally, we have {T_1^new, ..., T_N^new} = Refine({T_1^old, ..., T_N^old}, Feedback), where Feedback is given by the ToolAgent for the latest subtask (a minimal sketch of this refinement step follows the list). The outer-loop terminates once no subtasks remain in the queue.
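A minimal sketch of the refinement step, assuming a hypothetical `llm` callable that returns the revised plan as plain text (this is not XAgent's actual prompt or parser):

```python
def refine_plan(remaining_subtasks: list[str], feedback: str, llm) -> list[str]:
    """Sketch of the outer-loop refinement step:
    {T_1^new, ..., T_N^new} = Refine({T_1^old, ..., T_N^old}, Feedback)."""
    prompt = (
        "Current remaining subtasks:\n"
        + "\n".join(f"- {t}" for t in remaining_subtasks)
        + f"\n\nFeedback from the last subtask:\n{feedback}\n\n"
        + "Revise the remaining subtasks (split, delete, modify, or add) and "
        + "return one subtask per line."
    )
    # Parse the model's reply back into a flat subtask list.
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]
```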

Inner-Loop

The inner-loop is pivotal for executing the individual sub-tasks assigned by the outer-loop. Given a subtask Ti, an appropriate ToolAgent is designated, ensuring Ti reaches its intended outcome. Key aspects of the inner-loop include:

  • Agent Dispatch and Tool Retrieval: Based on the nature of the subtask, an appropriate ToolAgent is dispatched, who possesses the abilities required to complete the task.
  • Tool Execution: The ToolAgent first retrieves tools from external systems to aid in task completion. Then the agent employs ReAct to solve the subtask, searching for the optimal series of actions (tool calls) to complete the subtask T_i.
  • Feedback and Reflection: After a series of actions, the ToolAgent can issue a specific action called "subtask_submit" to finish processing the current subtask and pass feedback and reflection to the PlanAgent. This feedback can indicate whether the subtask was finished successfully or highlight potential refinements (a sketch of this loop follows the list).
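The inner loop can be sketched as follows; the `tool_agent` and `toolserver` interfaces and the action format are illustrative assumptions, not XAgent's actual code:

```python
def run_subtask(subtask: str, tool_agent, toolserver, max_rounds: int = 20) -> dict:
    """Sketch of the inner loop: the ToolAgent issues tool calls (ReAct-style)
    until it emits the special `subtask_submit` action."""
    history = []  # alternating (action, result) pairs
    for _ in range(max_rounds):
        action = tool_agent.next_action(subtask, history)  # one function call
        if action["name"] == "subtask_submit":
            # Feedback (success flag, reflection) goes back to the PlanAgent.
            return action["arguments"]
        result = toolserver.execute(action["name"], action["arguments"])
        history.append((action, result))
    return {"success": False, "reflection": "Exceeded the round budget."}
```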

PlanAgent: Dynamic Planning and Iterative Refinement

Sophisticated agent systems require adeptness in continuously formulating and revising plans to accommodate mutable environments and emergent requirements. These capabilities are central to ensuring flexibility, resilience, and efficiency in response to unforeseen challenges. To imbue the outer-loop with this adaptability, we introduce the PlanAgent. The PlanAgent is tailored to support the outer-loop by generating the initial plan and continuously revising it. Specifically, we define four functions for the PlanAgent to refine an existing plan (a data-structure sketch follows the list):

  • Subtask Split: Empowers the system to decompose a specific subtask into granular, more manageable units. Only subtasks that are currently in execution or remain uninitiated are eligible for this operation.
  • Subtask Deletion: To remove a subtask that has not started. Subtasks that are already in progress or completed are not eligible for deletion. This ensures a degree of agility, where redundant or non-pertinent tasks can be pruned to optimize overall execution.
  • Subtask Modification: To alter the content of a subtask. The subtask to be modified should not have already started or been completed, preserving the integrity of the overall plan.
  • Subtask Addition: To insert new subtasks as siblings after a specific subtask. Subtasks can only be added after the subtask currently being handled or its successors. This guarantees that new tasks are orchestrated sequentially, streamlining the execution flow and maintaining coherence.
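The four operations and their eligibility constraints can be pictured with a simple plan structure; the class and field names below are illustrative, not XAgent's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    content: str
    status: str = "todo"                          # "todo" | "doing" | "done"
    children: list = field(default_factory=list)

class Plan:
    """Illustrative plan container exposing the four refinement operations."""

    def __init__(self, subtasks: list[Subtask]):
        self.subtasks = subtasks

    def split(self, i: int, parts: list[str]) -> None:
        # Only in-progress or not-yet-started subtasks may be split.
        assert self.subtasks[i].status in ("todo", "doing")
        self.subtasks[i].children = [Subtask(p) for p in parts]

    def delete(self, i: int) -> None:
        # Only subtasks that have not started may be deleted.
        assert self.subtasks[i].status == "todo"
        del self.subtasks[i]

    def modify(self, i: int, new_content: str) -> None:
        # The subtask must not have started or finished.
        assert self.subtasks[i].status == "todo"
        self.subtasks[i].content = new_content

    def add(self, i: int, new_contents: list[str]) -> None:
        # New subtasks are inserted as siblings right after position i; callers
        # should only pass the current subtask's index or a later one.
        self.subtasks[i + 1:i + 1] = [Subtask(c) for c in new_contents]
```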

ToolAgent: Synergizing Reasoning and Acting in Function Calling

As mentioned before, the ToolAgent uses ReAct to search for the optimal series of actions (tool calls) {a_1, ..., a_N} to complete the subtask T_i. At each round t for T_i, the agent generates an action a_t^i based on previous interactions, i.e., P(a_t^i | {a_1^i, r_1^i, ..., a_{t-1}^i, r_{t-1}^i}, S_1, ..., S_{i-1}, T_i), where r denotes the tool execution result of the corresponding action a^i returned by the ToolServer, and S denotes the summary of a subtask T, i.e., S = Summarize({a_1, r_1, a_2, r_2, ...}, T).

For each action a_t, we synergize agent reasoning and acting in the same function call, i.e., both the reasoning trace ("thoughts") and the action to be taken are treated as parameters of a specific function. Specifically, each a_t (a function call) has the following components (an illustrative example follows the list):

  • Thought: A succinct representation of the agent's primary insight about the situation.
  • Reasoning: Traces the logical trajectory the agent traverses to arrive at its thought.
  • Criticism: Captures the agent's self-reflection on its actions, acting as a feedback loop. It highlights potential oversights or areas of improvement.
  • Command: Dictates the next action the agent decides to undertake based on its reasoning.
  • Parameters: Enumerates the specific arguments or details for the action to be executed.
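Concretely, a single action might carry arguments of the following shape; the tool name and values are made up for illustration:

```python
# Illustrative shape of one ToolAgent action a_t: one function call whose
# arguments carry both the reasoning trace and the action to execute.
example_action = {
    "thought": "The uploaded archive should be unpacked before any analysis.",
    "reasoning": "Knowing the file layout first avoids writing code on wrong assumptions.",
    "criticism": "I have not yet checked which analysis libraries are installed.",
    "command": "FileSystemEnv_unzip",                 # hypothetical tool name
    "parameters": {"path": "iris.zip", "destination": "./data"},
}
```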

ToolServer: Diverse Supported Tools

The ToolServer comprises three key components: (1) ToolServerManager manages the lifecycle of docker containers (i.e., nodes), handling their creation, monitoring, and shutdown. The manager can create a new node whenever a new session starts. The status of these nodes is regularly checked to ensure they are healthy and running; (2) ToolServerMonitor checks the status of nodes, updating their states, and ensuring their efficient execution. If a node remains idle for an extended period, the monitor can stop it to conserve resources; (3) ToolServerNode is the individual execution unit where actions (e.g., API call, file uploading, tool retrieving, etc.) are performed.
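A rough sketch of what the manager and monitor do, using the docker Python SDK (the image name and idle policy are placeholders, not XAgent's configuration):

```python
import docker  # Docker SDK for Python

client = docker.from_env()

def create_node(image: str = "toolserver-node:latest"):
    """What a ToolServerManager might do when a new session starts:
    launch an isolated container (node) for tool execution."""
    return client.containers.run(image, detach=True, auto_remove=True)

def stop_if_idle(node, idle_seconds: float, limit: float = 600.0) -> None:
    """A ToolServerMonitor-style policy: stop nodes idle beyond a limit."""
    node.reload()                      # refresh status from the Docker daemon
    if node.status == "running" and idle_seconds > limit:
        node.stop()
```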

Currently, our ToolServer supports the following tools:

  • FileSystemEnv offers a sophisticated interface for managing file system operations. This tool provides diverse functionalities ranging from basic file reading and writing to intricate operations like content modification and hierarchical representation.
  • PythonNotebook leverages the capabilities of Jupyter Notebooks, offering an interface that facilitates the seamless execution of Python code. XAgent can create or modify notebook cells dynamically and execute them in a controlled and isolated environment.
  • WebEnv is designed for web interactions and content extraction. It offers the dual capability of searching the web using Bing and subsequently browsing the retrieved web pages. WebEnv also identifies and collates hyperlinks from the source page, offering a more comprehensive perspective on the queried content.
  • ExecuteShell is designed to programmatically execute shell commands. It returns both the output and any potential errors from the executed command. To ensure robustness in operations, it incorporates mechanisms to handle timeouts, mitigating the risks associated with commands that might otherwise stall indefinitely (a minimal sketch of such timeout handling appears after this list).
  • RapidAPIEnv facilitates seamless interaction with RapidAPI, a leading API marketplace. With RapidAPIEnv, XAgent is able to connect with 160,000+ real-world APIs.
  • AskHumanforHelp can be invoked when XAgent finds that the current subtask cannot be fulfilled on its own. In this case, XAgent would issue specific requirements for humans to participate in task-solving.
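As referenced above for ExecuteShell, timeout handling for shell commands can be sketched with the standard library; this is an illustrative wrapper, not XAgent's implementation:

```python
import subprocess

def execute_shell(command: str, timeout: float = 60.0) -> dict:
    """Run a shell command, capture its output and errors, and guard against
    commands that would otherwise stall indefinitely."""
    try:
        completed = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return {"stdout": completed.stdout, "stderr": completed.stderr,
                "returncode": completed.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"Command timed out after {timeout}s",
                "returncode": -1}
```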

Experiments

We evaluate XAgent (based on GPT-4) on a suite of benchmarks that require reasoning, planning, and the ability to use external tools, including (1) searching the web for question answering, tested on both FreshQA and HotpotQA; (2) Python programming based on MBPP; (3) mathematical reasoning based on MATH; (4) interactive coding based on InterCode; and (5) embodied reasoning in a textual game based on ALFWorld. On these benchmarks, we compare XAgent with the vanilla GPT-4 baseline. Considering the lack of a suitable, high-quality benchmark targeted at AI agents, we also manually curate 50 complex instructions and submit them to both XAgent and AutoGPT. We then employ multiple experts to evaluate preference (win rate) between the results produced by XAgent and AutoGPT.

Figure: XAgent vs. GPT-4 on existing AI benchmarks.

Figure: XAgent vs. AutoGPT on our curated instructions.

As shown in the above figure, XAgent surpasses the vanilla GPT-4 on all the benchmarks, which shows that the system-level design of XAgent could fully unleash the foundational capabilities within GPT-4. When comparing XAgent with AutoGPT, we find that XAgent is significantly favored, achieving a remarkable preference rate of nearly 90%. This highlights that not only does XAgent excel in traditional AI benchmarks, but it also demonstrates superior adaptability, efficiency, and precision in handling complex real-world instructions.

Case Study

Below we showcase XAgent's ability using a few non-cherry-picked instructions:

Figure: Data statistics by XAgent.

Figure: Data clustering by XAgent.

Data Analysis: Demonstrating the Effectiveness of Dual-Loop Mechanism

We start with a case of aiding users in intricate data analysis. Here, our user submitted an iris.zip file to XAgent, seeking assistance in data analysis. XAgent swiftly broke down the task into four sub-tasks: (1) data inspection and comprehension, (2) verification of the system's Python environment for relevant data analysis libraries, (3) crafting data analysis code for data processing and analysis, and (4) compiling an analytical report based on the Python code's execution results. During execution, XAgent adeptly employed data analysis libraries such as pandas, scikit-learn, seaborn, and matplotlib, alongside skills in file handling, shell commands, and Python notebooks, even delving into visual data analysis (see the figures above). In contrast, AutoGPT, when attempting the same task, plunged into code writing without preliminary checks on the Python environment and related libraries. This led to failures and errors in using essential libraries such as scipy and matplotlib, ultimately resulting in an incomplete data analysis.
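To make the case concrete, the analysis code XAgent writes for such a task is of roughly the following kind; the file name, cluster count, and plotting choices here are illustrative assumptions, not a transcript of XAgent's output:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("iris.csv")            # assumed to be extracted from iris.zip
print(df.describe())                    # summary statistics for each column

features = df.select_dtypes("number")   # cluster on the numeric measurements
df["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(features)

sns.pairplot(df, hue="cluster")         # visualize the clusters pairwise
plt.savefig("iris_clusters.png")
```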

Recommendation: A New Paradigm of Human-Agent Interaction

Figure: Illustration of XAgent asking for human help.

Being able to actively seek human assistance and collaborate, XAgent achieves a new level of human-agent cooperation. As depicted in the above figure, a user sought XAgent's aid in recommending great restaurants for a friendly gathering, yet failed to provide specific details. Recognizing the insufficiency of the provided information, XAgent employed the AskHumanforHelp tool, prompting human intervention to elicit the user's preferred location, budget constraints, culinary preferences, and any dietary restrictions. Armed with this valuable feedback, XAgent generated tailored restaurant recommendations, ensuring a personalized and satisfying experience for the user and their friends.

Conversely, AutoGPT, lacking the proactive human interaction element, resorted to indiscriminately scouring the web for restaurant information, leading to recommendations that were off-target and failed to align with the user's budget and preferences. This case demonstrates that XAgent could effectively provide personalized and user-centric solutions.

Training Model: A Sophisticated Tool User

Figure: Model training process by XAgent.

XAgent not only tackles mundane tasks but also serves as an invaluable aid in complex tasks such as ML model training. Here we show a scenario where a user desires to analyze movie reviews and evaluate the public sentiment surrounding particular films. In response, XAgent promptly initiates the process by downloading the IMDB dataset to train a cutting-edge BERT model, harnessing the power of deep learning. Armed with this trained BERT model, XAgent seamlessly navigates the intricate nuances of movie reviews, offering insightful predictions regarding the public's perception of various films.
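A pipeline of the kind described here can be sketched with the Hugging Face datasets and transformers libraries; this is our own illustrative reconstruction, not the exact code XAgent produced:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="imdb-bert", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
                  eval_dataset=tokenized["test"].select(range(1000)))
trainer.train()
print(trainer.evaluate())               # evaluation loss on held-out reviews
```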

Through this seamless integration of advanced data processing and machine learning, XAgent showcases its exceptional capacity to facilitate complex tasks previously reserved for specialized data scientists and AI experts. Its ability to swiftly adapt to diverse training requirements underscores its pivotal role in democratizing AI capabilities, empowering users to delve into sophisticated analytical tasks with unprecedented ease and efficiency.