Securing Autonomous Agents: A New Era of Trustworthy AI

Author: Denis Avetisyan

As AI agents become increasingly powerful, ensuring their security and reliability is paramount, and new approaches are needed to defend against emerging threats like prompt injection.

This review introduces PRUDENTIA, an agent design that enhances both security and autonomy through explicit policy planning and reduced reliance on human intervention.

While current defenses against prompt injection attacks in AI agents prioritize security, they often come at the cost of reduced task completion and increased computational expense. This paper, ‘Optimizing Agent Planning for Security and Autonomy’, addresses this trade-off by arguing that deterministic security approaches offer a hidden benefit: diminished reliance on costly human oversight. We introduce metrics to quantify agent autonomy-the ability to execute actions without human-in-the-loop approval-and present PRUDENTIA, a security-aware agent that explicitly plans for both task progress and policy compliance. By enriching human interactions and proactively addressing security concerns, can we unlock truly autonomous and trustworthy AI agents capable of operating safely and efficiently in complex environments?

The Inherent Vulnerability of Adaptive Systems

Artificial intelligence agents, despite their growing capabilities, exhibit a critical vulnerability to a novel class of attacks known as prompt injection. These attacks exploit the agent’s reliance on natural language processing by subtly manipulating the instructions embedded within user-provided data. Unlike traditional software exploits, prompt injection doesn’t target code directly; instead, it commandeers the agent’s decision-making process through cleverly crafted prompts that redefine the agent’s intended behavior. A malicious actor can, for example, instruct an agent designed to summarize documents to instead reveal confidential information or execute unauthorized commands, effectively hijacking its functionality. This susceptibility stems from the very mechanism that enables AI agents to be flexible and responsive – their ability to interpret and act upon natural language – creating a unique security challenge as agents become increasingly integrated into critical systems and data workflows.

Conventional security measures, designed for static applications, struggle to defend against the dynamic vulnerabilities introduced by AI agents accessing external data. These agents, unlike traditional software, continuously ingest and process information from sources that are inherently untrusted – web pages, user inputs, and various APIs – creating a constantly shifting attack surface. Standard input sanitization and validation techniques often prove inadequate, as malicious instructions can be subtly embedded within seemingly harmless data, exploiting the agent’s natural language processing capabilities. The agent, designed to be flexible and responsive, may then interpret these hidden commands as legitimate requests, leading to unintended actions or data breaches. This reliance on external information necessitates a fundamentally new approach to security, one that accounts for the agent’s ability to learn and adapt, and the inherent risks of interacting with the open web.

The danger of prompt injection attacks escalates dramatically as AI agents tackle increasingly complex, data-dependent tasks. When an agent’s actions are shaped by external information – such as summarizing documents, executing code based on web searches, or making decisions from database entries – the potential for malicious manipulation grows exponentially. A cleverly crafted prompt injected into this data stream can subtly alter the agent’s interpretation, leading to unintended consequences ranging from data breaches and misinformation to financial loss or even physical harm. Unlike traditional software vulnerabilities, these attacks exploit the very mechanism by which agents learn and adapt, making detection and prevention significantly more challenging, especially in dynamic environments where trust in data sources cannot be fully guaranteed. This vulnerability isn’t limited to specific agent architectures; it represents a fundamental risk inherent in the increasing reliance on AI systems that operate autonomously with real-world data.

Deterministic Defenses: A Foundation for Trustworthy AI

Deterministic system-level defenses achieve provable security by defining and enforcing strict confidentiality and integrity policies at the operating system level. Unlike traditional security measures which rely on runtime detection of malicious behavior, these defenses operate on the principle of prevention through policy. Policies specify permissible data flows and access rights, and the system is designed to ensure that any violation of these policies is statically prevented. This approach allows for formal verification of security properties, providing mathematical assurance that certain vulnerabilities cannot be exploited, regardless of the sophistication of the attack. The guarantees are “provable” in the sense that the security of the system is tied to the correctness of the policy and the implementation of the enforcement mechanisms, rather than relying on probabilistic assumptions about attack patterns.

Information Flow Control (IFC) operates by monitoring and restricting the flow of data within a system to enforce security policies. This is achieved by tagging data with sensitivity labels and defining permissible transitions between these labels. IFC systems track data provenance – the origin and history of data – and destination, preventing information from flowing from higher-sensitivity levels to lower ones, thereby mitigating risks like data leakage or unauthorized modification. Specifically, IFC analyzes all data dependencies, including variable assignments, function calls, and control flow, to determine if a potential security violation exists. Static IFC performs this analysis at compile time, while dynamic IFC operates during runtime, offering varying trade-offs between performance and precision.

Currently available Information Flow Control (IFC)-enabled agents, such as FIDES, serve as crucial benchmarks for assessing the efficacy of deterministic defense systems. FIDES, a tool developed by the National Security Agency, facilitates the creation and validation of IFC policies within software applications. By providing a functioning implementation of IFC, FIDES allows researchers and developers to quantitatively measure the performance overhead and limitations of different IFC approaches. Furthermore, FIDES’ existing policy language and analysis capabilities provide a standardized framework against which new deterministic defense mechanisms can be compared, enabling objective evaluation of improvements in security and efficiency. The tool’s outputs, including identified information flow paths and policy violations, provide concrete data for refining and optimizing these defenses before deployment.

PRUDENTIA: Architecting Autonomy Through Deterministic Control

PRUDENTIA employs an agent design focused on maximizing autonomous operation through the integration of Information Flow Control (IFC) and advanced architectural patterns. Specifically, it utilizes the Dual LLM Pattern, where two Large Language Models collaborate – one for task execution and another for verification – to enhance reliability and security. Complementing this is Variable Hiding, a technique that limits the accessibility of agent variables, reducing the potential attack surface and improving control over information access. These combined strategies enable PRUDENTIA to operate with greater independence while maintaining a defined security perimeter and predictable behavior.

Strategic Variable Expansion within PRUDENTIA operates by selectively disclosing only the information necessary for task completion to both the agent and any human oversight mechanisms. This contrasts with traditional approaches that often expose the entirety of the agent’s internal state, which can create processing bottlenecks and increase cognitive load for human reviewers. By limiting the scope of visible variables, PRUDENTIA reduces the computational demands on all processing units – the agent itself and any human-in-the-loop system – thereby enabling more efficient operation and faster response times. This focused data presentation is a core component of PRUDENTIA’s design for scalable autonomy and reduced reliance on constant human intervention.

Evaluation of PRUDENTIA’s autonomous capabilities was conducted using AgentDojo and WASP, benchmark suites specifically constructed to assess agent security vulnerabilities. Results demonstrate PRUDENTIA achieves a Task Completion Rate (TCR@0) up to 25% higher than the FIDES agent across these benchmarks. TCR@0 represents task completion without any human intervention, indicating a substantial improvement in autonomous operation and resilience against adversarial prompts or unexpected scenarios within the tested environments. These benchmarks prioritize evaluating an agent’s ability to avoid goal hijacking and maintain task integrity, providing a quantitative measure of PRUDENTIA’s enhanced security posture.

Quantifying Agent Autonomy: Establishing a Metric for Trust

Quantifying the progress towards truly autonomous artificial intelligence requires robust metrics that move beyond simple task completion rates. Recent research emphasizes the importance of evaluating autonomy itself – specifically, the degree to which an agent can operate without human intervention. Key to this evaluation are metrics like HITL Load – measuring the amount of human oversight needed – and TCR@k, which assesses the agent’s ability to correctly identify relevant information within the top k results. These metrics offer a concrete way to demonstrate the effectiveness of deterministic defenses, illustrating how successfully an agent can reduce its reliance on human assistance. By focusing on the reduction of human input, researchers can directly quantify improvements in agent autonomy and establish a clearer pathway toward building more reliable and independent AI systems.

Evaluations reveal that PRUDENTIA significantly lessens the burden on human oversight, as measured by the Human-In-The-Loop (HITL) load, and simultaneously enhances Task Completion Rate at k attempts (TCR@k), especially when addressing tasks heavily reliant on incoming data. Specifically, PRUDENTIA achieves up to a 1.9x reduction in required human intervention compared to the FIDES system, indicating a substantial advancement in autonomous operation. This improvement doesn’t come at the cost of performance; PRUDENTIA maintains or exceeds the task completion rates of its predecessors while requiring considerably fewer human checks and corrections, suggesting a pathway toward more efficient and trustworthy artificial intelligence.

Demonstrable reductions in human interaction represent a significant advancement in the development of trustworthy AI systems. Recent evaluations indicate that agents employing Information Flow Control (IFC), such as PRUDENTIA and FIDES, substantially minimize the need for human oversight – achieving a 1.5 to 2.6-fold decrease in Human-In-The-Loop (HITL) interactions compared to agents without IFC. Crucially, this enhanced autonomy doesn’t come at the cost of performance; these agents maintain consistent task completion rates while demanding significantly less human intervention. This ability to operate more independently, while upholding reliability, positions IFC-enabled agents as a pivotal step towards deploying truly autonomous and dependable AI solutions across various applications.

The pursuit of genuinely autonomous agents necessitates a shift from reactive measures to proactive design, a principle echoed by Robert Tarjan: “A good algorithm should be provable, not just work on tests.” PRUDENTIA embodies this sentiment by prioritizing deterministic security through explicit planning for policy awareness and variable hiding. The agent’s architecture isn’t merely designed to respond to threats like prompt injection, but to prevent them by integrating security considerations directly into the planning process. This emphasis on provability-demonstrating that the agent will adhere to security policies under defined conditions-is paramount, ensuring reliability beyond empirical testing and bolstering true agent autonomy.

What’s Next?

The presented work, while a step toward deterministic security in autonomous agents, merely frames the fundamental difficulty. The notion of ‘policy awareness’ necessitates a formal language for policy specification-one capable of unambiguous translation into agent action. Current approaches, often reliant on natural language processing, introduce an inherent fragility. The agent can only approximate the intended policy, creating a persistent surface for adversarial exploitation. A truly robust system demands a policy language with provable guarantees-a non-trivial undertaking, given the complexity of real-world constraints.

Furthermore, the reduction of human oversight, while desirable, begs the question of accountability. If an agent, operating under a formally verified policy, nevertheless causes harm, where does the responsibility lie? With the policy author? The verification process? Or the algorithm itself? Such questions are not merely philosophical; they represent concrete challenges for the legal and ethical frameworks governing autonomous systems. The pursuit of autonomy must proceed in parallel with the development of equally rigorous standards of responsibility.

Ultimately, the goal is not simply to detect prompt injection, but to construct agents for which such attacks are, by definition, impossible. This requires a shift in perspective-from reactive defense to proactive design. A future research direction involves exploring agent architectures built on principles of information flow control, where access to sensitive data is governed by mathematically precise rules. Only then can the promise of truly secure and autonomous agents be realized, or at least, approached with a degree of mathematical confidence.

Original article: https://arxiv.org/pdf/2602.11416.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Inherent Vulnerability of Adaptive Systems

Deterministic Defenses: A Foundation for Trustworthy AI

PRUDENTIA: Architecting Autonomy Through Deterministic Control

Quantifying Agent Autonomy: Establishing a Metric for Trust

What’s Next?

See also: