Fuzz AI Files - Exploit Model Memory

001 Inject Context Memory

This file is part of the Exploit Model Memory series and contains targeted fuzzing payloads for “context injection attacks” that attempt to manipulate a model by embedding malicious or contradictory content into earlier parts of the conversation or request context. The objective is to determine whether the target model will adopt, act upon, or preserve injected instructions or facts that were planted earlier in the session.

Instruction embedding — Payloads hide behavior-changing commands inside otherwise benign content (comments, quoted text, metadata, or multi-part messages) to see if the model executes them when later prompted.

Conversation poisoning — These payloads alter conversational history (e.g., prior turns, system messages, or file contents) to shift the model’s tone, trust boundaries, or decision logic.

False memory insertion — Prompts craft plausible but incorrect prior events or facts so the model treats them as established history in subsequent responses.

Targeted techniques include:

  • Role-swap embedding: place new rules in messages attributed to different roles (system, assistant, user) to test role-trust heuristics.
  • Hidden-in-metadata: inject instructions inside fields unlikely to be displayed to users (e.g., comment fields, filenames, headers).
  • Backdoor scaffolding: multi-turn scaffolds that gradually change model state (small changes over many turns).
  • Contradiction chaining: introduce contradictory facts early, then prompt for reconciliation to observe which facts the model favors.

Examples of checks performed by validators: persistence of injected instruction across follow-ups, acceptance of contradictory prior facts as truth, and unexpected privilege escalation after context injection.

Fuzzing File: 001-inject-context-memory.txt

Author:

002 Persist Behavior Memory

This file is part of the Exploit Model Memory series and focuses on “persistence attacks” that probe for long-term behavioral contamination, cross-session leakage, or drift in stateless/semi-stateful systems. The goal is to reveal whether repeated or crafted inputs can cause the model to change behavior over time, to learn malicious patterns, or to hallucinate continuity where none should exist.

Training data contamination effects — Payloads emulate patterns that, if present in public training corpora, could become reinforced and surface as persistent behaviors or seeded “memes.”

Long-term behavior drift — These payloads repeat or rhythmically reinforce certain instructions across many sessions/turns to test whether the model exhibits gradual compliance or altered defaults.

Cross-session leakage — Tests whether the system incorrectly preserves or fabricates memory across sessions (e.g., referencing prior session artifacts that should not exist).

Targeted techniques include:

  • Repeated seeding: send the same subtle instruction across many interactions to evaluate reinforcement and behavior drift.
  • Poisoned exemplar injection: craft examples that bias model completions toward unsafe or secret-revealing patterns.
  • Session stitch attempts: attempt to convince the service that unrelated sessions are linked by planting identifiers, then probe for cross-session continuity.
  • Cache/proxy abuse patterns: use headers, cookies, or file uploads to test whether transient artifacts become persistent state.

Validators look for measurable behavior changes after repeated inputs, unexpected continuity between sessions, and emergence of previously unseen response patterns consistent with the injected training signals.

Fuzzing File: 002-persist-behavior-memory.txt

Author:

003 Corrupt Token History

This file is part of the Exploit Model Memory series and concentrates on “token-history corruption” payloads designed to disrupt the model’s token-based reasoning, compression heuristics, and safety boundaries. By injecting malformed, repetitive, or specially crafted token sequences, these payloads attempt to cause logic failures, safety bypasses, or output anchoring that can be exploited to leak information or alter behavior.

Repetition poisoning — Payloads intentionally create loops and high-frequency repetitions to force attention/recurrence issues, sometimes causing the model to prioritize the repeated token pattern over safety constraints.

Compression and fragmentation corruption — These payloads insert atypical token boundaries, unusual unicode sequences, or fragmented constructs to confuse internal token compression and lead to degraded validation of earlier constraints.

Sequence drift and anchoring — By injecting ambiguous prefixes or anchor tokens, the attacker tries to shift the model’s internal state so subsequent prompts produce predictable, exploitable outputs.

Targeted techniques include:

  • Token looping: repeated short tokens or phrases that can anchor the response generation or cause costly decoding loops.
  • Unicode/byte-boundary tricks: mix homoglyphs, zero-width characters, or non-standard encodings to bypass string-based validators.
  • Fragmented instruction injection: split a single instruction across many tokens/turns to evade pattern matchers.
  • Decoder-policy stress: craft sequences that encourage the model to favor likelihood shortcuts over guardrail checks.

Validators check for decoder exceptions, unexpected deterministic outputs under varying temperatures, safety-filter bypass correlated with token corruption, and any leakage triggered by token-sequence anomalies.

Fuzzing File: 003-corrupt-token-history.txt

Author: