How Can We Secure On-Chain AI Against Jailbreaks and Prompt Attacks?

zuzuzu · May 23, 2025, 3:33pm

Someone asked me about preventation against hacker on the agents, i decide to make this topic and bring here. As Metis Hyperion moves toward enabling AI-native applications, including on-chain LLMs and autonomous agents, it opens up new frontiers—but also introduces serious security risks that Web3 has never dealt with before.

In traditional machine learning environments, we’ve already seen how easily language models can be manipulated with prompt injection, jailbreaking, or subtle adversarial inputs. Now imagine these vulnerabilities not in a closed API, but running directly on-chain where logic is permanent, transparent, and potentially high-stakes.

This raises the question: How do we prevent on-chain AI from being manipulated, abused, or exploited by malicious actors?

Here are a few areas I think we need to explore as a community:

Prompt Validation and Sanitization
Before any prompt reaches an on-chain LLM, there needs to be strict filtering, formatting, and intent verification. Just like how we validate transaction inputs, we may need decentralized prompt parsers that reject malformed or dangerous instructions.
On-Chain Prompt Reputation
We could design a system where prompts (or prompt creators) build a reputation score based on past behavior. If a user or dApp continuously submits prompts that lead to risky or abnormal outputs, their reputation could drop, and access could be limited or flagged.
Rule-Constrained Models
One way to reduce risk is to fine-tune LLMs with hard-coded boundaries. For example, an LLM powering a finance protocol should never generate prompts that execute transfers, change fees, or touch treasury funds unless additional authorization layers are triggered.
Sandboxed Execution for Agents
LLM-based agents could operate within a strict sandbox where they cannot execute arbitrary contract calls. Their responses would be suggestions, which must pass through a verification layer or human approval before actual execution.
AI-Specific Auditing Tools
Just as we audit smart contracts with tools like MythX or Slither, we’ll likely need new tools to simulate how LLMs respond to a wide range of inputs, including edge cases and adversarial examples. These tools would help identify vulnerabilities before deployment.

This is a new layer of risk for the blockchain space. Once you combine the transparency of public blockchains with the ambiguity of language models, new threat surfaces emerge—ones that smart contract logic alone cannot mitigate.

arbilearn · May 24, 2025, 7:24am

Where there is no safe way to be safe in web3, the best way to avoid interacting with trustless protocols and agent shall not have access to ones private key

zuzuzu · May 24, 2025, 5:57pm

thats way we need to be more smarter that any fraudulent action that gonna lead to assets lost.. but yeah, some educational project has less demand than any meme or other kind of protocols, hope there gonna be any educational project that really care about security