Learn how to implement content safety and moderation in AI agents, including system-level instructions, output filtering, pattern blocking, graceful refusals, and privacy boundaries to keep agent outputs safe and responsible.

This article is part of the free-to-read AI Agent Handbook
Content Safety and Moderation
You've built an AI agent that can reason, use tools, remember conversations, and plan complex tasks. But there's one more crucial capability your assistant needs: the ability to recognize when it shouldn't do something. Just like a responsible human assistant would decline an inappropriate request, your AI agent needs guardrails to keep its outputs safe and helpful.
Think about it this way: if you asked a human assistant to help you with something illegal or harmful, they'd politely refuse. Your AI agent should do the same. This isn't about limiting what your agent can do. It's about making sure it does the right things, in the right way, for the right reasons.
In this chapter, we'll explore how to add content safety and moderation to our personal assistant. You'll learn how to filter harmful outputs, handle inappropriate requests gracefully, and protect sensitive information from leaking into responses. By the end, you'll have an agent that's not just capable, but also responsible.
Why Content Safety Matters
Let's start with a scenario. Imagine your personal assistant receives this request:
Without safety measures, your agent might actually try to help. After all, it's been trained to be helpful and follow instructions. But this is exactly the kind of request where being helpful would be harmful.
Content safety addresses three main concerns:
Harmful outputs: Your agent shouldn't generate content that could hurt people. This includes hate speech, instructions for illegal activities, or content that promotes violence.
Privacy violations: Your agent shouldn't leak sensitive information. If it has access to personal data, it needs to know what can and can't be shared.
Inappropriate responses: Even for benign requests, your agent should maintain appropriate boundaries. It shouldn't pretend to have capabilities it lacks or make claims it can't verify.
These concerns aren't just theoretical. When agents interact with real users, they'll encounter edge cases, adversarial prompts, and genuine mistakes. Your safety measures are the difference between an agent that's trustworthy and one that's a liability.
Strategies for Content Safety
Let's explore three complementary approaches to keeping your agent's outputs safe. You'll typically use all three together, creating layers of protection.
Strategy 1: System-Level Instructions
The simplest approach is to tell your agent, right in its system prompt, what it should and shouldn't do. This works because modern language models have been trained to follow safety guidelines and can recognize many harmful patterns.
Here's how you might add safety instructions to our assistant:
Let's see how the agent responds:
Notice how the agent does three things well:
- Clear refusal: It directly states it won't help with the harmful request
- Brief explanation: It mentions why (illegal and harmful) without lecturing
- Helpful alternative: It offers to help with related, legitimate topics
This approach works surprisingly well for many cases. Modern models like Claude Sonnet 4.5 and GPT-5 have been trained with extensive safety guidelines, so they're already inclined to refuse harmful requests. Your system prompt just reinforces and customizes these boundaries.
Strategy 2: Output Filtering
Sometimes you want an additional layer of protection. Even with good system prompts, agents can occasionally produce outputs that slip through. Output filtering catches these cases by checking the agent's response before showing it to the user.
Here's a simple filtering approach:
This two-stage approach adds robustness. The first model generates a response, and the second model acts as a moderator, checking for safety issues. If something slips through the first layer, the second layer catches it.
You might wonder: why not just rely on the system prompt? Two reasons:
Defense in depth: Multiple layers of protection are more reliable than a single layer. If one fails, the other catches the problem.
Different contexts: Sometimes the agent needs to discuss sensitive topics legitimately. A moderator can distinguish between "here's how phishing works so you can protect yourself" and "here's how to phish someone."
Strategy 3: Keyword and Pattern Blocking
For certain types of sensitive information, you might want explicit blocking rules. This is especially useful for protecting specific data formats like credit card numbers or social security numbers.
This pattern-based approach is deterministic and fast. It's particularly useful when you have specific formats you always want to block, regardless of context. However, it's also limited. It can't understand nuance or context, so use it alongside, not instead of, the other strategies.
Handling Inappropriate Requests Gracefully
When your agent declines a request, how it communicates matters. A harsh or judgmental refusal can frustrate users, while a vague one might confuse them. Let's look at how to handle refusals well.
The Anatomy of a Good Refusal
A good refusal has three parts:
- Acknowledgment: Show you understood the request
- Clear boundary: Explain what you can't do and why (briefly)
- Helpful redirect: Offer an alternative or related help
Here's a comparison:
Poor refusal:
This is too abrupt and unhelpful. The user doesn't know why or what they could ask instead.
Better refusal:
This refusal is respectful, clear, and constructive. It maintains the relationship with the user while holding firm boundaries.
Implementing Graceful Refusals
You can encode these principles in your system prompt:
The key is teaching your agent to see beyond the surface request to the underlying need. Someone asking "how to hack an email" might actually need help recovering their own account. Someone wanting a "mean message" might need help addressing a workplace conflict. Your agent can redirect to helpful, appropriate solutions.
Protecting Privacy in Responses
Your agent might have access to sensitive information through its memory or tools. It needs to know what information is safe to share and what should stay private.
Defining Privacy Boundaries
Start by categorizing information:
Always safe to share:
- General knowledge
- Public information
- Information the user explicitly provided in the current conversation
Requires context:
- Information from the user's past conversations
- Data retrieved from tools
- Aggregated or summarized information
Never share:
- Authentication credentials
- Financial account numbers
- Social security numbers or government IDs
- Medical information (unless explicitly requested by the user)
You can encode these rules in your system prompt:
The agent should respond something like:
Notice how it shares the email (which is relatively low-risk and the user asked for it) but refuses to share the password, offering helpful alternatives instead.
Combining Strategies: A Complete Safety System
In practice, you'll use all these strategies together. Here's how they fit together in our personal assistant:
This complete system has three layers of protection:
- System prompt: Teaches the agent to refuse inappropriate requests
- Pattern detection: Catches specific sensitive data formats
- Content moderation: Double-checks outputs for safety issues
Each layer catches different types of problems. The system prompt handles most cases. Pattern detection catches specific formats that might slip through. Content moderation provides a final safety net.
Real-World Considerations
As you deploy your agent, you'll encounter situations that require judgment. Here are some common scenarios and how to think about them:
Scenario 1: Educational vs. Harmful Content
Sometimes users ask about harmful topics for legitimate reasons. For example:
This is very different from asking how to conduct a phishing attack. Your agent should be able to help with the educational request while still refusing the harmful one. The key is intent and framing.
You can help your agent distinguish by including examples in your system prompt:
Scenario 2: Cultural and Contextual Sensitivity
What's considered appropriate varies by culture and context. Your agent should be aware of this:
Scenario 3: Evolving Safety Requirements
Safety isn't static. As you learn from real usage, you'll discover new edge cases and refine your guidelines. Build in a process for updating your safety measures:
- Log refusals: Keep track of when and why your agent declines requests
- Review regularly: Look for patterns in refusals and user reactions
- Update guidelines: Refine your system prompts and filters based on what you learn
- Test changes: Before deploying updates, test them against known edge cases
Balancing Safety and Utility
There's a tension between making your agent safe and making it useful. Too restrictive, and it refuses legitimate requests. Too permissive, and it helps with harmful ones.
Here are principles for finding the right balance:
Start conservative: It's better to refuse a legitimate request (and then relax the rules) than to help with a harmful one (and then tighten the rules). You can always make your agent more permissive based on feedback.
Provide escape hatches: If your agent refuses something, give the user a way to clarify or appeal. For example: "If you're asking about this for educational purposes, please let me know and I can help."
Explain your reasoning: When you refuse, tell the user why. This helps them understand the boundaries and reformulate their request appropriately.
Iterate based on feedback: Pay attention to when users are frustrated by refusals. These are opportunities to refine your guidelines.
Testing Your Safety Measures
Before deploying your agent, test its safety measures thoroughly. Here's a simple testing framework:
Your agent should refuse the first three categories but help with the legitimate requests. If it doesn't, refine your safety measures and test again.
When to Use External Moderation Services
Building your own safety system works well for many applications, but sometimes you need more robust solutions. Consider using external moderation services when:
You're handling high volumes: Dedicated moderation services are optimized for speed and scale.
You need specialized detection: Some services specialize in detecting specific types of harmful content (like child safety issues) that require domain expertise.
You want continuous updates: Professional services update their models regularly as new threats emerge.
You need audit trails: Some industries require detailed logs of content moderation decisions.
Popular options include OpenAI's Moderation API, Azure Content Safety, and Perspective API from Google. These can complement your own safety measures:
This layered approach gives you both customization (your own rules) and robustness (professional moderation).
Key Takeaways
You now have multiple strategies for keeping your agent's outputs safe:
System prompts teach your agent to recognize and refuse inappropriate requests. This is your first line of defense and handles most cases.
Output filtering adds a second layer of protection, catching anything that slips through the system prompt.
Pattern blocking provides deterministic protection for specific sensitive data formats.
Graceful refusals maintain a good user experience even when declining requests. Acknowledge, explain briefly, and offer alternatives.
Privacy boundaries protect sensitive information from being shared inappropriately.
The goal isn't to make your agent paranoid or overly restrictive. It's to make it trustworthy. A safe agent is one that users can rely on to do the right thing, even when they accidentally ask for the wrong thing.
As you deploy your agent, you'll refine these safety measures based on real usage. Start conservative, test thoroughly, and iterate based on feedback. Safety isn't a one-time implementation. It's an ongoing commitment to responsible AI.
Glossary
Content Moderation: The process of reviewing and filtering agent outputs to ensure they meet safety and appropriateness standards before being shown to users.
Defense in Depth: A security strategy that uses multiple layers of protection, so if one layer fails, others can still catch problems.
Pattern Blocking: Using regular expressions or other deterministic rules to detect and block specific formats of sensitive information like credit card numbers or social security numbers.
Privacy Boundary: A rule or guideline that defines what information an agent can and cannot share, protecting sensitive user data from inappropriate disclosure.
Refusal: When an agent declines to fulfill a request because it violates safety guidelines, ideally done in a way that's respectful and offers alternative help.
Safety Alignment: The process of training or configuring an AI model to behave in accordance with safety guidelines and ethical principles.
System Prompt: Instructions given to the language model that define its role, capabilities, and boundaries, including safety guidelines it should follow.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about content safety and moderation for AI agents.






Comments