A comprehensive guide covering function calling capabilities in language models from 2023, including structured outputs, tool interaction, API integration, and its transformative impact on building practical AI agent systems that interact with external tools and environments.

This article is part of the free-to-read History of Language AI book.
2023: Function Calling and Tool Use
The introduction of function calling capabilities in language models in 2023 marked a transformative moment in the development of practical AI agent systems, enabling language models to interact with external tools, APIs, and environments in a structured and reliable way. Prior to this development, language models were largely confined to generating text, with their interactions with the outside world limited to what could be conveyed through text prompts and responses. This limitation severely constrained the practical applications of language models, as they could not directly access databases, execute code, call APIs, or manipulate external systems. The development of function calling changed this fundamental constraint, allowing language models to become active agents that could reason about when and how to use tools to accomplish tasks.
OpenAI's release of function calling capabilities in June 2023, integrated into GPT-3.5-turbo and GPT-4, represented a significant advance in making language models practical for real-world applications. The innovation allowed developers to describe functions or tools to the model, and the model could intelligently choose to call these functions when appropriate, providing structured outputs that could be executed programmatically. This development emerged at a time when the AI community was actively exploring how to build agent systems that could use tools to extend their capabilities beyond pure language generation. The work drew on earlier research in tool-augmented language models and agent frameworks, but brought these capabilities to production-ready systems with reliable structured outputs.
The broader significance of function calling extended beyond just technical capability. This development enabled the creation of AI agents that could truly interact with the world, opening up applications in areas ranging from software development and data analysis to customer service and automation. The ability to reliably extract structured information from natural language and to invoke external functions based on model reasoning represented a step toward more autonomous and capable AI systems. This development also addressed a fundamental limitation of language models: their inability to access real-time information or perform actions in the world beyond generating text.
The timing of this development was particularly important. Coming after the initial success of ChatGPT in late 2022, function calling addressed the next critical question for AI deployment: how to make language models truly useful in production systems. While ChatGPT had demonstrated the value of conversational AI, it remained largely a text generation tool. Function calling provided the bridge between language models and the broader ecosystem of software tools and services that power modern applications. This connection proved crucial for the development of practical AI applications that could integrate seamlessly into existing workflows and systems.
The Problem
Before function calling, language models faced a fundamental limitation that prevented them from being truly useful in many practical applications. Models could generate text that described actions or provided information, but they could not actually perform those actions or access real-time information. This gap between language capability and practical utility created a frustrating divide between what language models could theoretically help with and what they could actually accomplish.
Consider a scenario where a user wanted a language model to check the weather, book a flight, or query a database. The model could generate text that looked like it was doing these things. It might say "I would check the weather for you" and then generate a plausible weather forecast, but this would be completely fabricated based on training data patterns, not actual current weather information. Similarly, the model might generate text that resembled a flight booking confirmation, but this would be meaningless text, not an actual booking. Users had to manually execute these actions themselves, using the model's text output as a guide but doing the actual work through separate interfaces.
The problem extended to structured data extraction as well. If a user needed to extract specific information from a conversation and use it programmatically, they faced the challenge of parsing natural language output that might be formatted inconsistently. The model might say "The user's email is john@example.com" in one response and "Contact: john@example.com" in another. Applications requiring structured data had to rely on brittle text parsing that could easily fail with slight variations in formatting. This made it difficult to build reliable integrations between language models and other software systems.
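A small sketch makes this fragility concrete. Suppose an application tries to pull an email address out of the model's free-form reply with a regular expression tuned to one phrasing (the pattern and example replies below are illustrative, not from any real system):

```python
import re

# A pattern tuned to one phrasing the model happened to produce.
pattern = r"email is (\S+@\S+)"

print(re.search(pattern, "The user's email is john@example.com"))  # matches
print(re.search(pattern, "Contact: john@example.com"))             # None: silently fails
```

Any rewording the model happens to choose breaks the pattern, and every such variation has to be anticipated by hand.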
For developers building AI-powered applications, the inability to get structured outputs from language models created significant friction. They had to write complex parsing logic to extract information from text responses, deal with various formatting inconsistencies, and handle edge cases where the model's output didn't match expected patterns. This added development overhead and introduced points of failure in applications that relied on language model outputs. The lack of structured interfaces also made it difficult to chain multiple operations together reliably, limiting the complexity of tasks that could be automated.
Another critical limitation was the model's inability to access real-time or contextual information. A language model trained on data up to a certain date could not know about events, prices, or system states that occurred after its training cutoff. This made language models unsuitable for applications requiring current information, such as checking stock prices, retrieving user account details, or querying live databases. The model could only work with what was in its training data, creating a fundamental constraint for many practical applications.
The development of agent frameworks prior to 2023 had begun to address some of these issues through techniques like chain-of-thought reasoning and tool description in prompts. However, these approaches were unreliable. Models might misunderstand tool descriptions, fail to format outputs correctly, or hallucinate function calls that didn't match the actual API structure. The lack of a standardized way to describe functions and to ensure models produced valid, executable outputs prevented these approaches from becoming production-ready. Developers were forced to build extensive error handling and retry logic, making agent systems complex and fragile.
The Solution
Function calling provided a solution to these problems by introducing a standardized way for language models to interact with external functions and to produce structured outputs. The solution had two key components: a structured way to describe functions to the model, and a reliable mechanism for the model to request function calls that matched those descriptions exactly.
At its core, function calling worked by allowing developers to provide function schemas to the language model using a structured format, typically JSON Schema. These schemas described the function name, purpose, and parameters with their types and constraints. The model could then reason about whether to call a function based on the conversation context, and if so, what parameters to use. The model's response would include a structured request to call a specific function with specific parameters, formatted according to the provided schema.
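As a concrete illustration, here is what such a schema might look like in Python, in the JSON Schema style used by OpenAI's original June 2023 `functions` parameter. The `get_weather` function and its fields are hypothetical, chosen purely for illustration:

```python
# Hypothetical function schema: a name, a description the model reads
# to decide when to call the function, and typed, constrained parameters.
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Singapore'.",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit for the result.",
            },
        },
        "required": ["city"],
    },
}
```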
The technical implementation involved extending the language model's training and inference process to recognize function descriptions and to generate function call requests as a special type of output. When a model was presented with function schemas in the conversation, it learned to recognize situations where calling those functions would be helpful. Instead of just generating text, the model could generate a structured function call that the application could then execute programmatically.
The function calling mechanism used a structured output format that was separate from normal text generation. When the model decided to call a function, it would output a special token or format indicating a function call, followed by the function name and a JSON object containing the parameters. This structured format could be reliably parsed by the application, eliminating the need for brittle text parsing. Because the output was designed to conform to the provided schema, function calls could be validated mechanically against the described interface rather than parsed heuristically from free text.
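The sketch below shows what a function-call message looked like under the original 2023 response format and how little code is needed to parse it. The message contents are illustrative, not captured API output:

```python
import json

# A hypothetical assistant message in the original 2023 response format:
# when the model decides to call a function, `content` is null and
# `function_call` carries the name and JSON-encoded arguments.
assistant_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "get_weather",
        "arguments": '{"city": "Singapore", "unit": "celsius"}',
    },
}

# Because the shape is fixed, parsing is a single json.loads call
# rather than pattern matching over free-form text.
call = assistant_message["function_call"]
args = json.loads(call["arguments"])
print(call["name"], args)  # get_weather {'city': 'Singapore', 'unit': 'celsius'}
```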
The system also included a mechanism for handling function responses. After the application executed a function call and obtained results, it could provide those results back to the model as tool responses. The model could then incorporate this information into its reasoning and generate a natural language response to the user that referenced the function results. This created a cycle: the model could reason about what information or actions it needed, request function calls to obtain them, process the results, and provide a comprehensive response.
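Putting the pieces together, a minimal sketch of this round trip might look like the following, using the current OpenAI Python SDK's `tools` interface (the successor to the original `functions` parameter). The `get_weather` function, model choice, and messages are assumptions for illustration, and running it requires a configured API key:

```python
import json
from openai import OpenAI  # assumes the openai package and a configured API key

client = OpenAI()

def get_weather(city: str) -> dict:
    """Stand-in for a real weather API call."""
    return {"city": city, "temperature_c": 31, "conditions": "humid"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "How hot is it in Singapore right now?"}]

# First pass: the model sees the schema and may request a function call.
response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    result = get_weather(**json.loads(call.function.arguments))

    # Second pass: return the result as a tool message so the model can
    # answer in natural language, grounded in the real data.
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result),
    })
    final = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```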
OpenAI's implementation introduced several important design decisions that made function calling practical and reliable. The model was trained to recognize when function calling was appropriate based on context, rather than requiring explicit instructions in every prompt. It could handle multiple functions simultaneously, selecting the right one based on the user's request. The output format was strictly structured, using JSON that matched the provided schema, making parsing and validation straightforward for developers.
The system also supported a mode where the model could be instructed to always call a function for extracting structured information, even when a user's request was conversational. This enabled use cases like extracting entities from text, classifying content, or parsing user intents into structured formats. The model could act as a sophisticated parser, using its language understanding to extract structured data from natural language input with high reliability.
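A sketch of this extraction mode, again using the OpenAI Python SDK: forcing `tool_choice` to name a specific function turns the model into a structured parser. The `extract_contact` schema and message contents are hypothetical:

```python
import json
from openai import OpenAI  # assumes a configured API key

client = OpenAI()

# Hypothetical extraction schema. The "function" is never executed;
# it exists only to define the structured shape we want back.
tools = [{
    "type": "function",
    "function": {
        "name": "extract_contact",
        "description": "Extract contact details mentioned in a message.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["email"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Hi, I'm John, reach me at john@example.com."}],
    tools=tools,
    # Forcing this tool means the model must reply with a structured
    # call rather than conversational text.
    tool_choice={"type": "function", "function": {"name": "extract_contact"}},
)

arguments = response.choices[0].message.tool_calls[0].function.arguments
print(json.loads(arguments))  # e.g. {"name": "John", "email": "john@example.com"}
```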
Applications and Impact
Function calling enabled a wide range of practical applications that had been difficult or impossible to build with language models alone. One of the most immediate applications was in building AI agents that could interact with software systems and APIs. Developers could create agents that could check emails, query databases, make API calls, or control software systems, all through natural language interaction. The agent would understand user requests, reason about what functions to call, execute them, and provide meaningful responses based on the results.
Customer service applications were transformed by function calling capabilities. Instead of just generating text responses, language models could now access customer databases to retrieve account information, check order status, modify account settings, or perform other actions. A customer service agent built with function calling could answer "What's my order status?" by actually querying the order database and providing real, current information, rather than generating plausible but fabricated responses.
In software development, function calling enabled powerful coding assistants that could interact with development tools. These assistants could search documentation, execute code, test functions, read files, and perform other development tasks. They could go beyond just generating code suggestions to actually running tests, checking results, and iterating based on feedback. This created a new class of development tools where AI assistants could actively participate in the software development workflow.
Data analysis applications were also revolutionized. Language models with function calling could query databases, run analysis scripts, generate visualizations, and extract insights from data, all through natural language interaction. Users could ask complex questions about their data, and the model would determine what queries or analyses to run, execute them, and explain the results. This made data analysis more accessible to non-technical users while enabling powerful automated analysis workflows.
The e-commerce and business automation sectors found numerous applications for function calling. AI agents could check inventory levels, process orders, update customer records, generate reports, and automate routine business tasks. These agents could integrate with existing business systems through their APIs, providing a natural language interface to complex backend systems. This enabled new ways for businesses to automate workflows and provide customer service.
Function calling also enabled new types of AI-powered applications that combined language understanding with real-world actions. Travel assistants could check flight availability and prices in real-time, make bookings, and provide itinerary information. Personal assistants could manage calendars, send emails, set reminders, and interact with various productivity tools. Research assistants could search academic databases, retrieve papers, and synthesize information from multiple sources.
The impact extended to the development of more sophisticated agent frameworks and architectures. Function calling provided a foundation upon which developers could build complex multi-step workflows where agents could reason about tasks, break them down into subtasks, call functions to gather information or perform actions, and synthesize results. This enabled the development of agent systems that could handle complex, multi-step problems requiring coordination between multiple tools and information sources.
The reliability and structure of function calling outputs also enabled more robust production applications. Applications could validate function call parameters against schemas, handle errors gracefully, and retry operations with confidence that the format would remain consistent. This reduced the brittleness that had plagued earlier attempts at building agent systems with language models, making AI-powered automation more reliable and trustworthy.
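For instance, an application can re-validate the model's arguments against the same schema before executing anything. A minimal sketch using the third-party `jsonschema` package (an assumption; any JSON Schema validator would serve):

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

# The same (hypothetical) parameter schema handed to the model,
# reused here to check what the model actually sent back.
params_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

raw_arguments = '{"city": "Singapore"}'  # arguments string from a function call

try:
    args = json.loads(raw_arguments)
    validate(instance=args, schema=params_schema)
except (json.JSONDecodeError, ValidationError) as err:
    # Reject, retry, or surface a clear error instead of silently
    # executing a malformed call.
    raise SystemExit(f"Rejected function call: {err}")

print("validated arguments:", args)
```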
Limitations
Despite its transformative impact, function calling had several important limitations that constrained its applications and highlighted areas for future development. One fundamental limitation was that function calling required the model to correctly reason about when and how to use functions, and this reasoning could still fail in edge cases. The model might choose the wrong function, provide incorrect parameters, or fail to recognize situations where a function call was needed. These errors could lead to application failures or incorrect actions, requiring developers to build extensive error handling and validation logic.
The reliability of function calling depended on the quality of function descriptions provided to the model. If function schemas were ambiguous, incomplete, or poorly described, the model might misunderstand how to use them. Developers had to carefully craft function descriptions, parameter names, and documentation to ensure the model could use them correctly. This created an additional burden on developers and required expertise in prompt engineering and schema design.
Function calling also introduced new security and safety concerns. Unlike text generation, function calls could have real-world effects: they could modify data, make purchases, send emails, or perform other actions with consequences. This created risks of malicious use, accidental actions, or unauthorized access. Applications using function calling had to implement careful authorization and validation to ensure that function calls were appropriate and safe. The model itself could not always distinguish between safe and unsafe actions, requiring application-level safeguards.
The structured nature of function calling also had limitations in terms of flexibility. While structured outputs were valuable for reliability, they also constrained the model's ability to express nuanced or complex information that might not fit neatly into predefined schemas. Applications requiring highly flexible or creative outputs might find function calling too rigid, preferring the free-form text generation that language models could provide without function constraints.
Another limitation was the overhead involved in function calling workflows. Each function call required multiple API interactions: sending the conversation with function schemas, receiving the function call request, executing the function, and sending results back to the model. This created latency and cost that might not be necessary for simple tasks that could be handled with direct text generation. Applications had to balance the benefits of function calling against these overheads, sometimes choosing simpler approaches for straightforward tasks.
The model's ability to reason about function calls was also limited by its training data and reasoning capabilities. If a function performed a task that was not well-represented in the model's training data, or if the task required specialized domain knowledge, the model might struggle to use the function appropriately. This limited function calling's effectiveness for highly specialized or novel applications that fell outside common use patterns.
Function calling also did not solve the fundamental limitation of language models regarding real-time information. While functions could provide current data, the model itself still worked with its training knowledge and might not understand recent events, changes in APIs, or domain-specific contexts that emerged after training. The model could call functions to get information, but it might not know what functions to call or what questions to ask if it lacked relevant context.
Legacy and Looking Forward
Function calling established a foundational pattern for building AI agent systems that has become central to modern language model applications. The approach of describing tools to models and enabling structured interactions has influenced the development of agent frameworks, AI-powered applications, and the design of language model APIs. This pattern has proven so useful that it has been adopted by multiple language model providers and has become a standard feature in modern AI development.
The introduction of function calling marked a shift in how language models are conceptualized and deployed. Rather than being pure text generators, models came to be seen as reasoning systems that could orchestrate external tools and services. This shift in perspective has influenced the development of more sophisticated agent architectures, multi-agent systems, and AI applications that combine language understanding with action-taking capabilities.
Modern AI agent frameworks build extensively on the function calling pattern, using it as a core mechanism for tool use and external interaction. Frameworks like LangChain, LangGraph, and others provide abstractions that make it easier to define and use functions with language models, while also adding capabilities like function result validation, error handling, and workflow orchestration. The function calling pattern has become so fundamental that it is now considered a basic building block for AI agent systems.
The development of function calling also influenced research directions in language model development and agent architectures. Researchers have explored extensions like parallel function calling, where models can request multiple functions simultaneously, and more sophisticated reasoning about tool selection and sequencing. The pattern has also influenced work on multimodal models that can use tools beyond just language interfaces, incorporating vision, audio, and other modalities into agent systems.
Looking forward, function calling continues to evolve with improvements in model reasoning capabilities, better function description mechanisms, and more sophisticated agent architectures. The pattern has enabled a new generation of AI applications that truly integrate language understanding with real-world actions, from coding assistants and data analysis tools to customer service agents and automation systems. As language models become more capable and agent frameworks become more sophisticated, function calling remains a core enabler for practical AI applications that go beyond pure language generation.
The legacy of function calling extends to how we think about AI safety and reliability in systems that take actions. The challenges of ensuring safe and appropriate function calls have driven research into better validation, authorization, and safety mechanisms for AI agent systems. These developments have implications not just for function calling but for the broader field of safe AI deployment in systems with real-world effects.
Function calling represents a crucial milestone in the practical deployment of language models, demonstrating that models can do more than generate text when given the right interfaces and training. The development opened up new possibilities for AI applications and established patterns that continue to shape how we build AI systems today. As agent systems become more capable and widespread, the function calling pattern remains foundational, enabling the bridge between language understanding and real-world action that makes truly useful AI applications possible.