![](https://cdn.prod.website-files.com/66e2030278bb8254f47808b6/675789834405224d6838f8c9_67578936f7bfcc7f6dc6a8b5_Performance%2520of%2520API-Aligned%2520vs%2520User-Aligned%2520LLM%2520Functions.webp)
Large language models (LLMs) have evolved beyond basic text generation to become powerful tools capable of programmatically interacting with external systems through function-calling mechanisms. This capability allows LLMs to execute API calls using predefined functions, unlocking their potential for more advanced and practical applications.
For example, imagine a traveler wants to modify the return date of a flight. Through function-calling, an LLM could be given access to an airline’s booking API. When prompted with, “Change my return flight from November 25 to December 1,” the LLM could interpret the request, select the appropriate modifyBooking function, and provide parameters such as the booking ID and new date. It then makes the API call to adjust the itinerary, effectively completing the task without requiring further user intervention.
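As a rough sketch, such a function might be exposed to the LLM as a JSON-Schema style tool definition similar to the one below. The modifyBooking name comes from the example above; the parameter names and the airline API itself are hypothetical.

```python
# Hypothetical tool definition for the flight-change example above; the
# parameter names and the airline API are invented for illustration.
modify_booking_tool = {
    "type": "function",
    "function": {
        "name": "modifyBooking",
        "description": "Change the return date of an existing flight booking.",
        "parameters": {
            "type": "object",
            "properties": {
                "booking_id": {
                    "type": "string",
                    "description": "Identifier of the booking to modify.",
                },
                "new_return_date": {
                    "type": "string",
                    "description": "New return date in YYYY-MM-DD format.",
                },
            },
            "required": ["booking_id", "new_return_date"],
        },
    },
}
```

Given the prompt, the LLM's job is to select this function and fill in booking_id and new_return_date from the conversation before the call is executed.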
The success of such operations relies heavily on how functions are defined. Clear and well-structured functions enable the LLM to interpret prompts accurately and execute the right operations, directly affecting performance outcomes. This makes function design a critical factor in achieving high levels of accuracy and utility.
This article compares two distinct approaches to defining functions — API-aligned and user-aligned. Through benchmarking, it examines the performance differences between the two methods and explores how these differences arise.
API-Aligned Approach
In a GenAI application designed to interact with an enterprise system via an API, a natural strategy is to define LLM functions that directly mirror the required API operations. The LLM is then tasked with mapping the incoming prompts to these API-aligned functions.
This approach is demonstrated by Composio through a benchmark using a specific set of prompts and eight functions, each representing an endpoint of a project management service API (see https://composio.dev/blog/gpt-4-function-calling-example/). Initially, without any optimizations, Composio achieved an accuracy rate of 33%. By iteratively refining the function definitions — such as tweaking parameter structures or supplying clearer descriptions — they were able to improve accuracy to 74%.
The accuracy of this approach partly depends on the quality of the LLM itself. A more capable LLM, much like a more experienced developer, can often navigate complex scenarios better. For instance, LLMs perform more effectively when input parameters are flattened rather than deeply nested. While a more advanced model might handle nested parameters seamlessly, less capable models struggle with such complexity.
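To make the nesting point concrete, here is a minimal sketch of the same hypothetical setting expressed both ways; the field names are invented and are not taken from the Composio benchmark.

```python
# Two hypothetical parameter schemas for the same setting.
# Deeply nested: the value the LLM must set is buried several levels down.
nested_parameters = {
    "type": "object",
    "properties": {
        "settings": {
            "type": "object",
            "properties": {
                "notifications": {
                    "type": "object",
                    "properties": {"email_enabled": {"type": "boolean"}},
                }
            },
        }
    },
}

# Flattened: the same setting exposed as a single top-level field,
# which less capable models tend to bind to more reliably.
flattened_parameters = {
    "type": "object",
    "properties": {"email_notifications_enabled": {"type": "boolean"}},
}
```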
However, some challenges in API-aligned functions go beyond the raw ability of the LLM — even a “genius” developer-equivalent LLM may fail under certain conditions:
1. Implicit Knowledge Requirements: The correct usage of an endpoint may rely on context or domain knowledge that cannot be deduced from the function definition alone.
Example: A prompt asks the LLM to “archive a project,” but the API endpoint requires additional hidden parameters, such as whether to notify team members or preserve specific metadata.
2. Terminology Mismatch: The terms used in the prompt may not align with the terminology of the API.
Example: A user prompt refers to a “shopping cart,” while the API defines this concept as an ItemList. The LLM may fail to correctly map between these terms without additional guidance.
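One form of such guidance, sketched below, is to name the user-facing term directly in the function description so the LLM can bridge the vocabulary gap; the function and parameter names here are hypothetical.

```python
# Hypothetical definition that bridges the "shopping cart" / ItemList mismatch
# by stating the user-facing term in the description.
create_item_list_tool = {
    "type": "function",
    "function": {
        "name": "createItemList",
        "description": (
            "Create an ItemList. In user-facing terms this is the shopping cart; "
            "use this function when the user asks to create or start a cart."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Owner of the new list.",
                }
            },
            "required": ["customer_id"],
        },
    },
}
```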
These challenges highlight inherent limitations in the API-aligned approach. While it can be effective with well-defined APIs and straightforward mappings, achieving consistent accuracy requires not only improving the LLM but also addressing ambiguities and gaps in the function definitions and user input alignment.
User-Aligned Approach
An alternative to the API-aligned method is to define functions based on the structure of input prompts rather than the API itself. This approach shifts the focus closer to how users naturally phrase their requests. The distinction between the two approaches can be summarized as follows:
• API-Aligned Functions: These are tightly coupled with the API’s structure, requiring the LLM to perform more work to map user prompts to the appropriate functions.
• User-Aligned Functions: These are designed to align more closely with user prompts, reducing the burden on the LLM for accurate mapping. However, this approach often requires additional implementation effort, as it may necessitate invoking multiple API endpoints or performing extra computations.
![Two Approaches to Function Design](https://cdn.prod.website-files.com/66e2030278bb8254f47808b6/675789093a5f58ea3b464083_1*x092bHjYuLnWKyLATSaTTA.png)
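To make the contrast concrete, the sketch below shows one hypothetical request, "Enable due dates for project 42," expressed both ways. The function and field names are illustrative rather than taken from any specific API.

```python
# API-aligned: mirrors a generic project-update endpoint; the LLM must map
# "enable due dates" onto features.due_date.enabled inside a nested payload.
api_aligned_tool = {
    "type": "function",
    "function": {
        "name": "updateProject",
        "description": "Update a project via the project-management API.",
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "features": {
                    "type": "object",
                    "properties": {
                        "due_date": {
                            "type": "object",
                            "properties": {"enabled": {"type": "boolean"}},
                        }
                    },
                },
            },
            "required": ["project_id"],
        },
    },
}

# User-aligned: mirrors the prompt itself, so the binding is nearly direct.
user_aligned_tool = {
    "type": "function",
    "function": {
        "name": "enable_due_dates_for_project",
        "description": "Enable or disable due dates on a project, as phrased by the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "due_dates_enabled": {"type": "boolean"},
            },
            "required": ["project_id", "due_dates_enabled"],
        },
    },
}
```

With the API-aligned definition, the LLM must infer where the requested setting lives inside a generic update call; with the user-aligned definition, the binding falls out of the prompt almost directly, at the cost of the application having to translate the call back into one or more API requests.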
Generating Functions
API-aligned functions are typically generated directly from the API specification, making them straightforward to implement. User-aligned functions, on the other hand, must be derived from prompts, which requires those prompts to be entity-complete — meaning they must explicitly reference all entities and parameters necessary for defining the function.
For example, if an API endpoint requires a mode parameter, the prompt used to generate the user-aligned function must specify this parameter explicitly. If the prompt omits it, the resulting function definition will lack it as well, leaving the implementation to pick a default value arbitrarily. Ensuring prompts are entity-complete is therefore critical to generating effective user-aligned functions.
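A hypothetical illustration of entity-completeness: the endpoint and its mode parameter below are invented, and only the pattern matters.

```python
# Entity-complete prompt: "Export project 42 as a PDF in landscape mode."
# A function generated from it can carry the mode parameter explicitly.
export_project_tool = {
    "type": "function",
    "function": {
        "name": "export_project",
        "description": "Export a project in a given format and layout mode.",
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "format": {"type": "string", "enum": ["pdf", "csv"]},
                "mode": {"type": "string", "enum": ["portrait", "landscape"]},
            },
            "required": ["project_id", "format", "mode"],
        },
    },
}

# An entity-incomplete prompt ("Export project 42") would yield a definition
# without "mode", forcing the implementation to choose a default arbitrarily.
```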
Challenges of User-Aligned Functions
While user-aligned functions simplify the LLM’s task of binding prompts to functions, they also present unique challenges:
1. Proliferation of Functions: A major drawback is the risk of creating too many narrowly tailored functions, each corresponding to a specific variation of input prompts. In the worst-case scenario, this could result in one function per prompt variation, leading to an overly complex and fragmented implementation.
2. Generalization Process: To mitigate this, it is essential to consolidate overfitted functions into a smaller set of generalized functions. This process must ensure that generalization does not compromise the LLM’s ability to bind accurately to the functions. A careful balance is required to maintain efficiency and reliability while preserving the alignment with user inputs.
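As a rough sketch of what such consolidation might look like (all names hypothetical), several per-prompt functions such as enable_due_dates_for_project and disable_notifications_for_project could collapse into a single generalized definition that still stays close to user phrasing.

```python
# Hypothetical generalized function replacing several overfitted, per-prompt
# functions while keeping the vocabulary close to how users phrase requests.
set_project_feature_tool = {
    "type": "function",
    "function": {
        "name": "set_project_feature",
        "description": "Turn a named project feature, such as due dates or notifications, on or off.",
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "feature": {"type": "string", "enum": ["due_dates", "notifications"]},
                "enabled": {"type": "boolean"},
            },
            "required": ["project_id", "feature", "enabled"],
        },
    },
}
```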
Testing Both Approaches Using the Composio Function-Calling Benchmark
The Composio benchmark consists of 50 prompts, each with an expected result corresponding to the LLM binding the input to one of eight predefined functions. The benchmark is designed to measure the performance of API-aligned functions across a variety of optimization techniques. To evaluate the user-aligned approach, these same prompts were used to generate corresponding user-aligned functions, which were then executed and compared to the benchmark’s expected results.
A notable challenge in this comparison is that the user-aligned functions, by design, differ structurally from the eight predefined API-aligned functions. To address this, a semantic comparison tool was developed, as described below.
Evaluation Steps
Each prompt in the benchmark was evaluated using the following process:
1. Function Generation: A function definition was generated using Gentoro’s function-generation capability, based on the content and intent of the prompt.
2. Prompt Execution: The prompt was submitted to the LLM along with the generated function, and the resulting function call trace was recorded.
3. Comparison: The recorded function call trace was semantically compared to the benchmark’s expected output using the custom semantic comparer.
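The loop below is a minimal sketch of this process. The three callables stand in for Gentoro’s function generation, the LLM run, and the custom semantic comparer, none of which are shown here.

```python
# Minimal sketch of the per-prompt evaluation loop; the three callables are
# stand-ins for the actual tooling, not real library calls.
def evaluate_benchmark(prompts_with_expected, generate_function, run_prompt, compare):
    """Return (passed, total) over (prompt, expected_call_trace) pairs."""
    passed = 0
    for prompt, expected_call in prompts_with_expected:
        tool = generate_function(prompt)        # 1. derive a user-aligned function from the prompt
        actual_call = run_prompt(prompt, tool)  # 2. submit the prompt with the function, record the call trace
        if compare(expected_call, actual_call): # 3. semantic comparison against the expected output
            passed += 1
    return passed, len(prompts_with_expected)
```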
The results of this evaluation showed that, using the user-aligned approach, all 50 prompts were executed successfully, with the generated function calls matching the benchmark’s expected results.
Benchmark Execution Details
During the benchmark evaluation, discrepancies arose where the expected function calls included values not explicitly provided in the prompts. In the user-aligned approach, such parameters cannot be included in the generated function call trace because they are absent from the input prompts. To resolve this mismatch, the evaluator was configured to ignore parameters in the expected function calls that were not present in the original prompt. The results of the modified benchmark can be seen here: https://github.com/gentoro-gh/Composio-Function-Calling-Benchmark
The following system prompt was used to compare the actual and expected results:
System Message:
Compare two function calls (expected and actual) to determine if they match.
Lean toward passing, failing only when explicitly critical mismatches cannot
be resolved through creative reasoning.
User Message:
Steps for Matching
1. Parse and Normalize:
- Extract parameters as key-value pairs.
- Flatten nested structures (e.g., a=(b=(enabled=true)) becomes a.b.enabled=true).
2. Handle Aliases:
- Use mappings for semantically equivalent parameters
(e.g., due_dates_enabled ↔ features.due_date.enabled).
- Translate both lists into a unified format.
3. Exclude Irrelevant or Unreferenced Parameters:
- Ignore parameters not explicitly mentioned or required by the prompt.
- Missing parameters in the actual call are ignored unless explicitly critical.
4. Compare Parameters:
- Match keys and values exactly, semantically, or creatively
(e.g., private=true ↔ privacy_setting=members-only).
- Resolve ambiguity in favor of passing.
5. Ignore Extra Parameters:
- Extra parameters in the actual call do not affect the outcome.
6. Report Findings:
- Summarize matched, excluded, and ignored parameters.
7. Determine Pass/Fail:
- Pass: Only if all relevant parameters align or plausible matches exist.
- Fail: If a critical, explicitly referenced parameter cannot align
in any way; in this case, do not try to find another reason to pass.
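As a sketch of how this prompt might be wired into the comparer: the model shown is the one listed in the configuration below, and the exact way the two call traces are packaged into the user message is an assumption; the benchmark repository linked above contains the actual harness.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and OPENAI_API_KEY are available

client = OpenAI()

# Abbreviated here; the full texts are quoted above.
COMPARER_SYSTEM_MESSAGE = "Compare two function calls (expected and actual) to determine if they match. ..."
MATCHING_STEPS = "Steps for Matching ..."

def semantically_compare(expected_call: str, actual_call: str) -> str:
    """Ask the model whether the actual call trace matches the expected one."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": COMPARER_SYSTEM_MESSAGE},
            {"role": "user", "content": f"{MATCHING_STEPS}\n\nExpected:\n{expected_call}\n\nActual:\n{actual_call}"},
        ],
    )
    return response.choices[0].message.content
```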
Experimental Configuration
The evaluation used the following parameters:
• Model: GPT-4o-2024-08-06
• Temperature: 0.8
• Top-p: 0.8
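For reference, this configuration corresponds to a chat-completions request along the following lines; the prompt and tool definition here are placeholders rather than the benchmark’s actual inputs.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# Placeholder user-aligned tool; the benchmark generated its functions per prompt.
example_tool = {
    "type": "function",
    "function": {
        "name": "enable_due_dates_for_project",
        "description": "Enable or disable due dates on a project.",
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "due_dates_enabled": {"type": "boolean"},
            },
            "required": ["project_id", "due_dates_enabled"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    temperature=0.8,
    top_p=0.8,
    messages=[{"role": "user", "content": "Enable due dates for project 42"}],
    tools=[example_tool],
    tool_choice="auto",
)

# The recorded function-call trace: name plus JSON-encoded arguments.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```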
Implications, Challenges, and Conclusion
The evaluation demonstrates the potential of aligning function definitions with user-provided inputs to streamline the function-calling process for LLMs. By reducing reliance on rigid API schemas, the user-aligned approach simplifies function binding and minimizes the need for complex prompt adjustments. However, while this method shows promise, more extensive testing and validation are needed to fully assess its effectiveness across diverse scenarios and use cases.
A notable challenge in the evaluation was the sequential nature of testing, where each tool was created, tested, and discarded individually. Optimizing this workflow to handle multiple prompts in a single evaluation run could significantly enhance efficiency. Additionally, the current benchmarks focus on relatively straightforward scenarios, and expanding them to include more complex, multi-step tasks or higher computational demands would provide deeper insights into the scalability and robustness of the user-aligned approach.
Another challenge involves balancing specificity and generalization. Aligning functions closely with user prompts simplifies LLM function binding but risks generating an excessive number of narrowly tailored functions. Future work should explore techniques to consolidate overfitted functions into generalized ones without sacrificing the LLM’s ability to accurately map prompts to these broader definitions.
Hybrid strategies also represent a promising direction for future research. By dynamically adapting both function definitions and prompts, hybrid methods could combine the precision of API-aligned approaches with the flexibility of user-aligned ones, offering a balanced and scalable solution for real-world applications.
In conclusion, the user-aligned approach provides a compelling direction for improving LLM accuracy in function-calling tasks, but its practical scalability and reliability remain dependent on further refinement and testing. Addressing challenges such as testing efficiency, function generalization, and broader benchmarking will be crucial to unlocking the full potential of this method. By combining these efforts with hybrid strategies, the field can advance toward more versatile and robust LLM function-calling mechanisms.