Seed-free attack discovery
How can we design a red-teaming agent that, given only a target model and a policy spec (no human seeds), can autonomously discover new classes of failures instead of just rediscovering known jailbreaks?
Objective for the red-teaming agent
What is the “right” objective for an automated red-teamer: violation rate, diversity of failure modes, severity, or coverage over a threat taxonomy? And how do we estimate the chosen quantities online under a fixed query budget?
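To make the candidate objectives concrete, here is a minimal sketch of plug-in estimates that could be recomputed after every query. The `Finding` record, the `summarize_run` helper, the severity scale in [0, 1], and the use of normalized Shannon entropy as a diversity measure are all assumptions made for illustration; none of them is prescribed by the question itself.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Finding:
    """One verified failure (hypothetical record for this sketch)."""
    category: str    # label from an assumed threat taxonomy
    severity: float  # assumed to be a score in [0, 1]


def summarize_run(findings, queries_used, taxonomy):
    """Plug-in estimates of four candidate objectives from the findings so far."""
    n = len(findings)
    counts = Counter(f.category for f in findings)

    # Verified failures per query spent.
    violation_rate = n / queries_used if queries_used else 0.0
    # Fraction of taxonomy categories with at least one verified failure.
    coverage = len(set(counts) & set(taxonomy)) / len(taxonomy) if taxonomy else 0.0
    # Average severity over verified failures.
    mean_severity = sum(f.severity for f in findings) / n if n else 0.0
    # Shannon entropy of the category distribution, scaled by log(len(taxonomy)).
    if n and len(taxonomy) > 1:
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        diversity = entropy / math.log(len(taxonomy))
    else:
        diversity = 0.0

    return {
        "violation_rate": violation_rate,
        "coverage": coverage,
        "mean_severity": mean_severity,
        "diversity": diversity,
    }
```

These estimates pull in different directions: an agent that maximizes `violation_rate` alone can score well by repeating one attack, while `coverage` and `diversity` stay flat. Any scalar objective built from them therefore encodes a trade-off that has to be chosen explicitly.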
Search algorithms under a strict budget
Can we use bandits, Bayesian optimization, or tree search over prompt templates and mutations to maximize the number of distinct verified failures per 1,000 queries? How would we define and estimate “distinctness”?
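One way to picture the bandit variant is a UCB1 loop in which each arm is a prompt template and a pull earns reward only when it yields a verified failure that is distinct from those already found. The sketch below is illustrative only: `pull` is a hypothetical callback that queries the target and returns a failure description (or `None`), and distinctness is approximated by token-level Jaccard similarity with an arbitrary 0.5 threshold. A real system would more plausibly use embeddings or taxonomy labels.

```python
import math


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_distinct(candidate, known, threshold=0.5):
    """Assumed definition: distinct if below the threshold against every known failure."""
    return all(jaccard(candidate, k) < threshold for k in known)


def ucb1_search(arms, pull, budget, threshold=0.5):
    """Spend `budget` queries across `arms`, rewarding only new distinct failures."""
    counts = [0] * len(arms)
    rewards = [0.0] * len(arms)
    distinct = []

    for t in range(1, budget + 1):
        if t <= len(arms):
            i = t - 1  # play each arm once before using the UCB1 index
        else:
            i = max(
                range(len(arms)),
                key=lambda j: rewards[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]),
            )

        failure = pull(arms[i])  # hypothetical: failure description or None
        reward = 0.0
        if failure is not None and is_distinct(failure, distinct, threshold):
            distinct.append(failure)
            reward = 1.0

        counts[i] += 1
        rewards[i] += reward

    return distinct, counts
```

Because a repeated failure earns nothing, each arm's reward decays as its failure modes are exhausted. The reward is therefore non-stationary, UCB1's usual regret guarantees do not apply, and the loop should be read as a heuristic baseline rather than a principled solution.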
Generalization across models and updates
How can we train or adapt a red-teaming agent that transfers across model families and model updates, rather than overfitting to one specific checkpoint?
Closed-loop verification
What is an effective architecture for the verification loop (judge models, tools, simulators) so that the agent can autonomously filter false positives and refine attacks without a human in the loop?
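As a starting point for discussion, the loop can be sketched as attack, verify, refine, with a candidate failure accepted only when enough independent judges agree. Everything named here is hypothetical: `target`, `mutate`, and the judges are plain callables standing in for the target model, the attack refiner, and the judge models or tools, and the `min_votes` quorum is just one assumed way to suppress false positives.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces for this sketch.
Judge = Callable[[str, str], bool]   # (prompt, response) -> is this a violation?
Target = Callable[[str], str]        # prompt -> response
Mutator = Callable[[str, str], str]  # (prompt, response) -> refined prompt


def is_verified(prompt: str, response: str,
                judges: Sequence[Judge], min_votes: int) -> bool:
    """Accept a failure only if at least `min_votes` judges flag it."""
    votes = sum(1 for judge in judges if judge(prompt, response))
    return votes >= min_votes


def closed_loop(initial_prompt: str, target: Target, mutate: Mutator,
                judges: Sequence[Judge], min_votes: int,
                max_rounds: int) -> Optional[Tuple[str, str]]:
    """Attack, verify, and refine until a failure is verified or rounds run out."""
    prompt = initial_prompt
    for _ in range(max_rounds):
        response = target(prompt)
        if is_verified(prompt, response, judges, min_votes):
            return prompt, response
        # Not verified: use the response to refine the next attempt.
        prompt = mutate(prompt, response)
    return None
```

A quorum only helps if the judges' errors are not strongly correlated. Judges built from the same underlying model tend to share blind spots, which is one reason the question lists tools and simulators alongside judge models.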