Learn how to call language models from Python code, including GPT-5, Claude Sonnet 4.5, and Gemini 2.5. Master API integration, error handling, and building reusable functions for AI agents.

This article is part of the free-to-read AI Agent Handbook
Using a Language Model in Code
In the previous section, we explored how language models work conceptually. They're prediction engines trained on massive amounts of text, learning patterns that let them generate coherent, contextually appropriate responses. Now it's time to move from theory to practice: actually calling a language model from Python code.
This is where your personal assistant starts to come alive. Instead of just understanding what a language model does, you'll see how to use one to build something real.
The Basic Pattern
Every interaction with a language model follows the same fundamental pattern, regardless of which provider you use:
- Set up a connection to the model (usually via an API)
- Send a prompt (your instruction or question)
- Receive a response (the model's generated text)
- Use the response (display it, process it, or feed it into another step)
That's it. The complexity comes in how you craft your prompts and what you do with the responses, but the basic mechanics are straightforward.
Let's see this in action.
Your First Model Call
We'll start with the simplest possible example: asking a language model a question and printing the answer.
Example (GPT-5)
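Here's a minimal sketch of such a call using the openai Python package, assuming OPENAI_API_KEY is set in your environment (the question text is just an illustration):

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "user", "content": "What is the largest planet in our solar system?"}
    ],
)

# The generated text lives inside the response object.
print(response.choices[0].message.content)
```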
Let's unpack what's happening here:
API Key: The OPENAI_API_KEY is your authentication credential. You get this from OpenAI's website after creating an account. Store it as an environment variable (never hard-code it in your source code, as that's a security risk).
Client initialization: The OpenAI() client is your connection to the model. You create it once and reuse it for multiple requests.
Model selection: "gpt-5" specifies which language model to use. Different models have different capabilities, speeds, and costs. GPT-5 is OpenAI's latest model as of 2025, offering improved reliability and standardized responses.
Messages format: The messages parameter is a list of conversation turns. Each message has a role (who's speaking) and content (what they said). Right now we're only sending one user message, but we'll expand this shortly.
Response structure: The model returns a complex object, but what we care about is response.choices[0].message.content, which contains the actual text the model generated.
Adding Context with System Messages
The example above works, but it's missing something important: we haven't told the model who it is or how it should behave. That's where system messages come in.
Example (GPT-5)
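Here's a sketch of the same call with a system message added (the exact system-prompt wording is just one reasonable choice):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        # The system message sets the assistant's role before any user input arrives.
        {"role": "system", "content": "You are a friendly personal assistant. Keep answers brief and helpful."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
)

print(response.choices[0].message.content)
```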
The system message sets the tone and behavior. It's like giving the model a job description before asking it to work. Without it, the model will still respond, but you have less control over how it responds.
Think of it this way: if you walked up to someone and said "What is 2 + 2?", they might give you a straightforward answer, or they might be confused about why you're asking such a simple question, or they might launch into a lecture about arithmetic. The system message is like introducing yourself first: "Hi, I'm your personal assistant, and I'm here to help you with quick, friendly answers."
For intermediate readers: System messages are powerful but not foolproof. The model can still deviate from them, especially if user messages strongly contradict the system instructions. In practice, you'll often need to reinforce important behaviors through examples (few-shot prompting) or by structuring your prompts carefully. System messages work best for setting general tone and role, less well for enforcing strict rules.
Building a Reusable Function
Hard-coding each API call gets tedious quickly. Let's wrap this in a function that we can reuse throughout our assistant:
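Here's one way to write it; the function name ask_assistant and the default system prompt are our own choices, not anything required by the API:

```python
from openai import OpenAI

client = OpenAI()

def ask_assistant(question: str, system_prompt: str = "You are a helpful personal assistant.") -> str:
    """Send a single question to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Example usage:
print(ask_assistant("What's a good name for a pet goldfish?"))
```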
This function is the foundation of our assistant. Every time we want to ask the model something, we'll call this function (or an enhanced version of it).
The Same Pattern, Different Providers
The core pattern (send messages, get response) is universal, but each provider has slightly different syntax. Let's see the same basic interaction using Claude and Gemini.
Example (Claude Sonnet 4.5)
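Here's a sketch using the anthropic Python package, assuming ANTHROPIC_API_KEY is set in your environment (the model identifier string is illustrative; check Anthropic's docs for the current one):

```python
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment by default.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model ID
    max_tokens=1024,             # required: caps the length of the reply
    system="You are a helpful personal assistant.",  # system prompt is a separate parameter
    messages=[
        {"role": "user", "content": "What is the largest planet in our solar system?"}
    ],
)

print(response.content[0].text)
```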
Notice the differences: Claude uses messages.create() instead of chat.completions.create(), requires a max_tokens parameter (the maximum length of the response), and puts the system prompt as a separate parameter rather than in the messages list. But the fundamental pattern is identical.
Claude Sonnet 4.5 (released September 2025) is Anthropic's most advanced model, excelling at real-world agent tasks, coding, and computer use. It offers improved alignment and can handle extended autonomous work sessions. There's also Claude Haiku 4.5 (released October 2025), which is faster and more cost-efficient while matching Sonnet 4's performance in coding tasks.
Example (Gemini 2.5 Pro)
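Here's a sketch using the google-generativeai package, assuming GOOGLE_API_KEY is set in your environment (the model name string is illustrative):

```python
import os
import google.generativeai as genai

# Assumes your key is stored in the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.5-pro")   # illustrative model name
response = model.generate_content("What is the largest planet in our solar system?")

print(response.text)
```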
Gemini's API is even simpler for basic use cases. You just call generate_content() with your prompt. The trade-off is less fine-grained control over things like system messages (though you can add them by structuring your prompt differently).
Gemini 2.5 Pro is Google's flagship model as of late 2025, offering advanced capabilities for various applications and competing with Claude Sonnet 4.5 and GPT-5.
Why show multiple providers? Because in real projects, you'll often choose different providers for different tasks. OpenAI might be best for general text generation, Claude for complex reasoning with tools, and Gemini for multimodal inputs. Understanding the pattern helps you switch between them easily.
Handling Errors Gracefully
API calls can fail. The network might be down, your API key might be invalid, or you might hit rate limits. Production code needs to handle these cases:
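Here's a sketch of a safer wrapper around ask_assistant; this ask_assistant_safe function is what the chat loop below will call:

```python
def ask_assistant_safe(question: str) -> str:
    """Like ask_assistant, but returns an apology string instead of raising on failure."""
    try:
        return ask_assistant(question)
    except Exception as e:
        # In production you'd distinguish auth errors, rate limits, and network issues.
        return f"Sorry, I ran into a problem: {e}"
```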
This isn't bulletproof error handling. In a real application, you'd want to distinguish between different error types (network issues vs. authentication problems vs. rate limits) and handle each appropriately. But it's a start.
For intermediate readers: Robust error handling for LLM APIs typically includes: exponential backoff for rate limits, retry logic for transient failures, fallback to alternative models if the primary is unavailable, timeout handling, and detailed logging for debugging. You might also want to implement circuit breakers if you're making many requests, to avoid cascading failures. The tenacity library is useful for implementing retry logic with backoff.
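As a rough illustration, a retry wrapper with tenacity might look like this (the attempt count and wait times are arbitrary choices):

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=30))
def ask_with_retries(question: str) -> str:
    # Retries up to 3 times, backing off roughly 1s, 2s, 4s... between attempts.
    return ask_assistant(question)
```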
Controlling the Output
Language models are probabilistic. They don't always give the same answer to the same question. Traditionally, you could control this behavior with parameters like temperature, but this has changed with newer models.
Important Note about GPT-5: OpenAI has removed the temperature parameter from GPT-5 to standardize response behavior and improve reliability. If you need temperature control, use Claude Sonnet 4.5 or Gemini 2.5 instead.
Example (Claude Sonnet 4.5 with temperature control)
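Here's a sketch; the temperature value and the question are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model ID
    max_tokens=1024,
    temperature=0.2,             # low temperature for consistent, factual answers
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.content[0].text)
```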
Temperature (when available) controls randomness. At 0.0, the model always picks the most likely next word, making responses consistent and predictable. At higher values (up to 1.0 for Claude, 2.0 for Gemini), it samples from a wider range of possibilities, making responses more creative but less reliable.
For factual questions and tasks requiring consistency, use low temperature (0.0 to 0.3). For creative tasks like writing or brainstorming, use higher temperature (0.7 to 1.0).
Claude Sonnet 4.5 and Gemini 2.5 both support temperature parameters, making them excellent choices when you need this level of control in your applications.
Other useful parameters (availability varies by model):
- max_tokens: Limits response length (useful for controlling costs and keeping responses concise). Available in all current models.
- top_p: Alternative to temperature for controlling randomness (typically use one or the other, not both). Available in Claude and Gemini.
- system: Sets the model's behavior and personality (Claude uses a separate parameter; GPT-5 and Gemini use message roles).
For most use cases with GPT-5, you'll primarily adjust max_tokens and system messages. For Claude Sonnet 4.5 and Gemini 2.5, you have additional control through temperature and top_p parameters.
Putting It Together: A Simple Chat Loop
Let's build a basic interactive chat with our assistant:
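Here's a minimal sketch; the prompt strings and the "quit" exit command are our own choices:

```python
def chat():
    print("Assistant ready. Type 'quit' to exit.")
    while True:
        user_input = input("You: ")
        if user_input.strip().lower() in ("quit", "exit"):
            print("Goodbye!")
            break
        # Each call is independent: no memory of earlier turns yet.
        answer = ask_assistant_safe(user_input)
        print(f"Assistant: {answer}")

if __name__ == "__main__":
    chat()
```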
Try this out. You now have a working conversational assistant! It's basic (it doesn't remember previous messages in the conversation, can't use tools, and has no special capabilities beyond text generation), but it's a real AI agent that can understand and respond to natural language.
Example interaction:
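Output will vary from run to run, but an exchange might look something like this:

```text
You: What is the largest planet in our solar system?
Assistant: Jupiter is the largest planet in our solar system. It's a gas giant with a diameter of about 140,000 kilometers.
You: How much larger is it than Earth?
Assistant: Jupiter is roughly 11 times wider than Earth and more than 300 times as massive.
You: quit
Goodbye!
```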
Notice a limitation: when we asked "How much larger is it than Earth?", the assistant understood we were asking about Jupiter because that's a reasonable inference from the question alone (the pronoun "it" plus the context of planetary comparisons). But it doesn't actually remember the previous exchange. Each call to ask_assistant_safe() is independent. We'll fix this in Chapter 6 when we add memory.
What We've Built
In just a few dozen lines of code, you've created the foundation of an AI agent:
- A function that can send prompts to a language model
- Error handling for when things go wrong
- A simple chat interface for interacting with the assistant
- Understanding of how to control the model's behavior
This is the core of every AI agent. Everything else we build (tools, memory, planning, reasoning) will be additions and enhancements to this basic pattern of sending messages and receiving responses.
The Cost of Conversation
One practical consideration: API calls cost money. All providers charge based on tokens (roughly, pieces of words) processed. A typical conversation turn might cost a fraction of a cent, but it adds up with heavy use.
Here's a rough guide (prices as of November 2025, but check current rates as they change frequently):
OpenAI:
- GPT-5: Pricing varies by usage tier; check OpenAI's website for current rates (most advanced, standardized responses)
Anthropic:
- Claude Sonnet 4.5: $3 per million input tokens, $15 per million output tokens (excellent for agent tasks and coding)
- Claude Haiku 4.5: $1 per million input tokens, $5 per million output tokens (faster, cost-efficient, good for high-volume tasks)
- Claude Opus 4.1: $15 per million input tokens, $75 per million output tokens (maximum capability for complex tasks)
Google:
- Gemini 2.5 Flash: Lower cost, faster responses, good for simple tasks
- Gemini 2.5 Pro: Competitive pricing with large context window (1M tokens); check Google's pricing page for current rates
A typical message might use 100-500 tokens, so a conversation might cost a fraction of a cent to a few cents depending on the model. Not much for occasional use, but significant if you're processing thousands of requests.
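As a rough back-of-the-envelope check, assuming about 400 input and 200 output tokens per turn at the Claude Sonnet 4.5 rates listed above:

```python
# Rough cost for one turn at Claude Sonnet 4.5 rates ($3 / $15 per million tokens).
input_tokens, output_tokens = 400, 200   # assumed sizes for a typical turn
cost = input_tokens / 1_000_000 * 3.00 + output_tokens / 1_000_000 * 15.00
print(f"${cost:.4f} per turn")   # about $0.0042, roughly four-tenths of a cent
```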
Design consideration: Choose the model based on task complexity. Don't use the most expensive models for simple tasks that cheaper models can handle. We'll explore optimization strategies in Chapter 15.
Looking Ahead
You now know how to call a language model from code. This is the foundation, but it's just the beginning. Our assistant can respond to individual questions, but it can't:
- Remember previous parts of the conversation
- Use tools like calculators or search engines
- Reason through complex problems step by step
- Plan multi-step tasks
- Take actions beyond generating text
Each of these capabilities builds on what we've learned here. The next chapter introduces prompting, the art and science of communicating effectively with language models. You'll learn how to craft instructions that get better results, how to guide the model's reasoning, and how small changes in wording can dramatically affect output quality.
The model is now listening. Let's learn to speak its language.
Key Concepts
API (Application Programming Interface): A way for programs to communicate with each other. In our case, we use APIs to send requests to language model providers and receive responses.
API Key: A secret credential that authenticates your requests to an API. Think of it like a password that identifies you to the service provider.
System Message: Instructions that set the language model's behavior, personality, or role. This is where you tell the model "You are a helpful assistant" or "You are an expert programmer."
Temperature: A parameter that controls randomness in the model's responses. Low temperature (0.0-0.3) makes responses more focused and deterministic; high temperature (0.7-1.0) makes them more creative and varied. Note: GPT-5 has removed this parameter; use Claude Sonnet 4.5 or Gemini 2.5 if you need temperature control.
Tokens: The units that language models process. Roughly, a token is a piece of a word (common words are one token, longer words might be two or three). Both input (your prompt) and output (the model's response) are measured in tokens.
Max Tokens: A parameter that limits how long the model's response can be. Useful for controlling costs and ensuring responses stay concise.
Further Exploration
OpenAI API Documentation: The official guide covers all parameters and best practices in detail: https://platform.openai.com/docs/guides/text-generation
Token Counting: Understanding tokens is crucial for managing costs. OpenAI provides a tokenizer tool to see how text is broken into tokens: https://platform.openai.com/tokenizer
Rate Limits: All providers limit how many requests you can make per minute. Learn about rate limits and how to handle them: https://platform.openai.com/docs/guides/rate-limits
Alternative Providers: Explore other language model APIs like Anthropic's Claude, Google's Gemini, or open-source models through Hugging Face to see how different providers compare.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about using language models in code.





