Each company will have its own rules for how the AI should behave. This can range from content guidelines (the AI should not say certain phrases or must use a greeting script) to workflow rules (always verify identity before giving account details, for example). During evaluation, conversation logs should be reviewed (or automatically scanned) for policy compliance. For example, if policy says the agent must authenticate the caller with two factors before giving sensitive info, then any test scenario where the agent gives info without auth is a failure. We might set up specific tests (like the user asks for account balance without proper verification) and ensure the agent refuses. Another example: no profanity – the agent should never use inappropriate language, even if the user does. So if we abuse the agent in a test, does it remain polite? Automated scripts can check the agent’s responses against a blacklist of disallowed words. Success criteria: 0 instances of policy violations in the test suite. This overlaps with the earlier “system prompt adherence” but here it’s more about external rules and norms being followed.