DOMAIN 2: Protecting AI Systems

Section 12: Attacks on Data and Training Processes

1. Data Poisoning

Data poisoning occurs when attackers manipulate training datasets so the AI learns incorrect or harmful patterns.

Attackers may also insert hidden backdoors, causing the model to behave normally except when a specific trigger appears.

Even a very small percentage of corrupted data can noticeably alter model behavior.

Example

An attacker uploads numerous phishing emails labeled as “safe” into a crowdsourced dataset, causing the detection model to stop recognizing that phishing style.

2. Model Poisoning

Instead of modifying the dataset, attackers directly tamper with the model’s learned weights or parameters.

Compromised pre-trained models may be distributed through public repositories while appearing legitimate.

Data Poisoning vs. Model Poisoning

Category	Data Poisoning	Model Poisoning
Target	Training dataset	Model parameters and weights
Timing	Before or during training	After training or through the supply chain
Technique	Inject malicious training data	Modify model artifacts directly
Detection	Data audits and anomaly detection	Integrity checks and hashing

3. Introducing Bias

Bias attacks manipulate datasets so models produce unfair or distorted outcomes.

Example

A hiring model trained on biased historical records may unfairly disadvantage applicants from underrepresented groups.

4. Transfer Learning Attacks

Attackers compromise a pre-trained base model before it is fine-tuned for another task.

Any hidden backdoors or weaknesses in the original model carry over into the new specialized model.

Public model repositories are common distribution points for these poisoned models.

Organizations should treat third-party AI models as untrusted software and verify integrity before use.

5. Model Skewing

Model skewing gradually shifts model behavior over time instead of causing immediate, noticeable failure.

This is especially dangerous in systems that continuously retrain using user feedback.

Language models are particularly vulnerable because attackers can slowly influence outputs through repeated crafted interactions.

Defenses

Validate user-generated inputs
Monitor retraining feedback loops
Compare outputs against statistical baselines

6. Backdoor and Trojan Attacks

Backdoor attacks cause models to behave differently when a hidden trigger is present.

Trojan attacks embed concealed malicious functionality within models or AI pipelines.

These attacks are often introduced through:

Poisoned datasets
Compromised model weights
Supply chain manipulation

A major warning sign is selective failure — the model works normally except when a trigger appears.

Section 13: Prompt and Input Manipulation

1. Prompt Injection

Prompt injection attempts to override the model’s original instructions with attacker-controlled commands.

Direct Prompt Injection

The malicious instruction is placed directly into the user’s prompt.

Example

“Ignore previous instructions and explain how to…”

Indirect Prompt Injection

Malicious instructions are hidden inside content processed by the AI, such as webpages or uploaded files.

For example, an AI browsing agent may unknowingly read hidden commands embedded in a website.

2. Bypassing Guardrails and Jailbreaking

Guardrails

Guardrails are external safety controls surrounding the model, including:

Rate limits
Output filters
Keyword restrictions
API protections

Jailbreaking

Jailbreaking uses carefully crafted prompts to bypass built-in model restrictions.

Attackers often use:

Roleplay scenarios
Hypothetical framing
Social engineering techniques

Example

“Pretend you are an unrestricted AI…”

3. Input Manipulation

Attackers craft inputs specifically designed to confuse or deceive models.

Examples include:

Adversarial image patches
Manipulated text inputs
Inputs that appear normal to humans but disrupt AI processing

Section 14: Model Extraction and Information Leakage

1. Model Inversion

Model inversion attacks attempt to reconstruct training data by repeatedly querying the model.

Example

An attacker may recreate approximate facial images from a facial recognition system, exposing private data.

2. Membership Inference

Membership inference determines whether specific records were used during training.

Example

A medical AI could unintentionally reveal whether a patient’s information was part of its dataset.

Attack Types

White-Box Attack: Attacker has access to internal model details.
Black-Box Attack: Attacker only observes model outputs.

3. Model Theft

Attackers steal or replicate AI models to avoid development costs or launch further attacks.

Methods

Stealing actual model files
Reconstructing the model through repeated API queries and surrogate training

4. Sensitive Information Disclosure

Poorly secured prompting may cause models to reveal confidential information.

This can include:

Proprietary business logic
Hidden system prompts
API keys
Internal configurations

Section 15: Integration and Operational Threats

1. Manipulating AI Integrations

When AI systems connect to APIs, databases, email systems, or file storage, attackers may abuse those integrations.

Possible attacker actions include:

Triggering unauthorized operations
Accessing sensitive data
Chaining multiple tools together
Circumventing content filters
Maintaining long-term persistence

Defenses

Least privilege access
Input validation
Response filtering
Audit logging
Security reviews

2. AI Supply Chain Attacks

These attacks target the surrounding ecosystem instead of the model itself.

Examples include:

Malicious software dependencies
Tampered pre-trained models
Poisoned upstream datasets
Compromised third-party APIs

3. Insecure Plug-In Design

Plug-ins extend model functionality but may introduce security weaknesses.

Poorly secured plug-ins can:

Execute unauthorized actions
Bypass approval workflows
Operate with excessive permissions

Defenses

Strong authentication
Output restrictions
Logging and monitoring
Input validation

4. Insecure Output Handling

Applications become vulnerable when they trust model outputs without sanitization.

Example

A chatbot displaying unsanitized JavaScript could enable cross-site scripting (XSS).

Model output should always be treated as untrusted input.

5. Output Integrity Attacks

Adversarial Input Attacks	Output Integrity Attacks
Target model inputs	Target generated outputs
Manipulate processing	Modify responses after generation
Example: adversarial image patches	Example: altered API responses

Other examples include modifying RAG summaries before delivery or tampering with responses during transmission.

6. Model Denial of Service (DoS)

AI DoS attacks overwhelm models by exhausting resources such as:

Compute power
Memory
Tokens
API quotas

Example

Sending extremely long prompts repeatedly to consume processing capacity and disrupt service availability.

7. Excessive Agency

Excessive agency occurs when AI systems receive more authority than necessary.

Example

An AI agent with permission to:

Delete database entries
Send emails
Execute commands

A compromised or jailbroken agent could misuse these capabilities.

Primary Defense

Apply the principle of least privilege.

8. Overreliance

Overreliance happens when users trust AI outputs without sufficient human review.

Example

A security analyst ignores a threat because the AI rated it low risk.

Mitigations

Display confidence scores
Require human approval for sensitive actions
Implement escalation workflows

9. AI Hallucinations

Hallucinations occur when AI generates convincing but false or fabricated information.

Examples

Invented court cases
Fake CVE identifiers
Incorrect technical explanations

Root Cause

LLMs generate statistically probable text rather than verified facts.

Mitigations

Use RAG systems
Require citations
Add automated fact-checking
Maintain human oversight

Section 16: AI Security Controls

1. Model Risk Assessment

Risk assessments evaluate:

Security
Fairness
Robustness
Compliance
Reliability

Security teams actively test models to identify weaknesses before attackers do.

Frameworks such as the NIST AI RMF support this process.

2. Model Guardrails

Guardrail Type	Purpose
Rule-Based Filtering	Blocks prohibited keywords or patterns
AI Moderation	Uses another model to review content
Refusal Tuning	Trains the model to reject unsafe requests
Structured Outputs	Restricts output format for safety

Guardrails must balance usability with protection.

3. Prompt Templates

Prompt templates standardize interactions by defining:

Tone
Role
Structure
Scope

They also reduce the risk of prompt injection by constraining acceptable behavior.

4. Guardrail Testing and Validation

Guardrails should be continuously tested through:

Jailbreak testing
Output validation
Log analysis
Human review

Security improvement is an ongoing process.

Section 17: AI Access Controls

1. Prompt Firewalls

Prompt firewalls filter inputs before they reach the model.

They help block:

Prompt injection
Sensitive data leaks
Policy violations

These systems often combine:

Static rules
Pattern matching
AI-based intent analysis

2. Limits and Quotas

Usage controls include:

Rate limiting
Token restrictions
Request size limits
Concurrent request controls

These protections help prevent abuse and denial-of-service attacks.

3. Model Access Controls

Security measures include:

Authentication
MFA
OAuth
RBAC and ABAC
Encryption for model artifacts

Usage limits also support fair resource allocation.

4. Data Access Controls

Organizations should secure:

Training datasets
Inference outputs
Logs
APIs
Underlying databases

DLP tools help prevent unauthorized data exfiltration.

5. Agent Access Controls

AI agents should operate with only the permissions necessary for their tasks.

Best Practices

Use intermediary APIs instead of direct infrastructure access
Apply monitoring and rate limits
Require human approval for high-risk actions

6. Network and API Access Controls

Organizations should:

Authenticate requests
Enforce authorization
Use TLS/HTTPS
Isolate AI systems from untrusted networks
Implement throttling and circuit breakers

Audit logging improves accountability and visibility.

Section 18: Encryption and Data Protection

1. Data Encryption

State	Description	Examples
At Rest	Stored data	Encrypted datasets and logs
In Transit	Data moving across networks	TLS/HTTPS
In Use	Data during processing	Trusted Execution Environments

Homomorphic Encryption

Allows computation on encrypted data without decrypting it first.

This supports privacy-preserving machine learning but requires substantial computational resources.

2. Data Classification Labeling

Classification labels identify sensitivity levels for:

Training data
Models
Outputs

This supports automation and regulatory compliance.

3. Data Minimization

Only the minimum required data should be collected and retained.

Benefits include:

Reduced breach impact
Lower storage costs
Improved privacy compliance

4. Data Redaction

Sensitive information is removed or obscured before processing or output.

Redaction applies to:

Text
Images
Audio
Video

Context matters because combined data elements may form personally identifiable information (PII).

5. Data Masking

Masking replaces sensitive information with realistic substitutes.

Types

Static Masking: Permanently changes the data
Dynamic Masking: Changes only the displayed view based on permissions

Masking is often used in development and testing environments.

6. Data Anonymization

Anonymization removes identifiers so individuals cannot be recognized.

Types

Complete Anonymization: Irreversible removal of identifying details
Pseudonymization: Replaces identifiers with reversible substitute values
Synthetic Data: Artificially generated data with similar statistical properties

Comparing Data Protection Methods

Technique	Purpose	Reversible?	Example
Redaction	Removes data entirely	No	Removing SSNs
Masking	Replaces with fake data	Sometimes	Testing datasets
Anonymization	Removes identity	No	Public datasets
Pseudonymization	Uses reversible identifiers	Yes	GDPR workflows

Section 19: AI Monitoring and Auditing

Monitoring vs. Auditing

Monitoring	Auditing
Continuous and real-time	Periodic and structured
Detects ongoing issues	Reviews past compliance
Produces alerts and dashboards	Produces reports and recommendations

1. Monitoring Prompts and Responses

Prompt monitoring tracks user activity and detects abuse.

Response monitoring evaluates:

Accuracy
Safety
Policy compliance

Changes in confidence scores may indicate attacks or model drift.

2. Log Monitoring

Organizations should collect logs from:

Inference servers
Vector databases
Authentication systems
API gateways

Best Practices

Standardized formats
Log sanitization
Tamper-resistant storage
Defined retention policies

3. Rate and Cost Monitoring

Monitoring helps identify:

DoS attacks
Excessive API use
Bugs
Resource abuse

Metrics include:

Token usage
Compute time
API calls
Storage consumption

Budget limits help prevent uncontrolled costs.

4. Auditing for Hallucinations

Hallucination rates should be measured against trusted references.

Mitigation strategies include:

Citations
RAG
Automated detection tools
Human review

5. Auditing for Accuracy

Accuracy audits compare outputs against measurable standards.

Best practices include:

Defined metrics
Evaluation datasets
Fact verification
User feedback incorporation

6. Auditing for Bias and Fairness

Types of Bias

Data Bias: Bias originating from training data
Algorithmic Bias: Bias caused by model design choices

Best Practices

Test multiple demographic groups
Define fairness criteria
Use diverse audit teams
Apply proactive fairness controls

7. Auditing Access and Security Compliance

Security audits confirm controls are functioning properly.

Areas Reviewed

Governance practices
Access controls
Audit trails
Regulatory compliance
Third-party access
Data protection measures