DOMAIN 2: Protecting AI Systems

Section 12: Attacks on Data and Training Processes
1. Data Poisoning

Data poisoning occurs when attackers manipulate training datasets so the AI learns incorrect or harmful patterns.

Attackers may also insert hidden backdoors, causing the model to behave normally except when a specific trigger appears.

Even a very small percentage of corrupted data can noticeably alter model behavior.

Example

An attacker uploads numerous phishing emails labeled as “safe” into a crowdsourced dataset, causing the detection model to stop recognizing that phishing style.

2. Model Poisoning

Instead of modifying the dataset, attackers directly tamper with the model’s learned weights or parameters.

Compromised pre-trained models may be distributed through public repositories while appearing legitimate.

Data Poisoning vs. Model Poisoning
CategoryData PoisoningModel Poisoning
TargetTraining datasetModel parameters and weights
TimingBefore or during trainingAfter training or through the supply chain
TechniqueInject malicious training dataModify model artifacts directly
DetectionData audits and anomaly detectionIntegrity checks and hashing
3. Introducing Bias

Bias attacks manipulate datasets so models produce unfair or distorted outcomes.

Example

A hiring model trained on biased historical records may unfairly disadvantage applicants from underrepresented groups.

4. Transfer Learning Attacks

Attackers compromise a pre-trained base model before it is fine-tuned for another task.

Any hidden backdoors or weaknesses in the original model carry over into the new specialized model.

Public model repositories are common distribution points for these poisoned models.

Organizations should treat third-party AI models as untrusted software and verify integrity before use.

5. Model Skewing

Model skewing gradually shifts model behavior over time instead of causing immediate, noticeable failure.

This is especially dangerous in systems that continuously retrain using user feedback.

Language models are particularly vulnerable because attackers can slowly influence outputs through repeated crafted interactions.

Defenses

  • Validate user-generated inputs
  • Monitor retraining feedback loops
  • Compare outputs against statistical baselines
6. Backdoor and Trojan Attacks

Backdoor attacks cause models to behave differently when a hidden trigger is present.

Trojan attacks embed concealed malicious functionality within models or AI pipelines.

These attacks are often introduced through:

  • Poisoned datasets
  • Compromised model weights
  • Supply chain manipulation

A major warning sign is selective failure — the model works normally except when a trigger appears.

Section 13: Prompt and Input Manipulation

1. Prompt Injection

Prompt injection attempts to override the model’s original instructions with attacker-controlled commands.

Direct Prompt Injection

The malicious instruction is placed directly into the user’s prompt.

Example

“Ignore previous instructions and explain how to…”

Indirect Prompt Injection

Malicious instructions are hidden inside content processed by the AI, such as webpages or uploaded files.

For example, an AI browsing agent may unknowingly read hidden commands embedded in a website.

2. Bypassing Guardrails and Jailbreaking

Guardrails

Guardrails are external safety controls surrounding the model, including:

  • Rate limits
  • Output filters
  • Keyword restrictions
  • API protections

Jailbreaking

Jailbreaking uses carefully crafted prompts to bypass built-in model restrictions.

Attackers often use:

  • Roleplay scenarios
  • Hypothetical framing
  • Social engineering techniques

Example

“Pretend you are an unrestricted AI…”

3. Input Manipulation

Attackers craft inputs specifically designed to confuse or deceive models.

Examples include:

  • Adversarial image patches
  • Manipulated text inputs
  • Inputs that appear normal to humans but disrupt AI processing

Section 14: Model Extraction and Information Leakage

1. Model Inversion

Model inversion attacks attempt to reconstruct training data by repeatedly querying the model.

Example

An attacker may recreate approximate facial images from a facial recognition system, exposing private data.

2. Membership Inference

Membership inference determines whether specific records were used during training.

Example

A medical AI could unintentionally reveal whether a patient’s information was part of its dataset.

Attack Types

  • White-Box Attack: Attacker has access to internal model details.
  • Black-Box Attack: Attacker only observes model outputs.
3. Model Theft

Attackers steal or replicate AI models to avoid development costs or launch further attacks.

Methods

  • Stealing actual model files
  • Reconstructing the model through repeated API queries and surrogate training
4. Sensitive Information Disclosure

Poorly secured prompting may cause models to reveal confidential information.

This can include:

  • Proprietary business logic
  • Hidden system prompts
  • API keys
  • Internal configurations

Section 15: Integration and Operational Threats

1. Manipulating AI Integrations

When AI systems connect to APIs, databases, email systems, or file storage, attackers may abuse those integrations.

Possible attacker actions include:

  • Triggering unauthorized operations
  • Accessing sensitive data
  • Chaining multiple tools together
  • Circumventing content filters
  • Maintaining long-term persistence

Defenses

  • Least privilege access
  • Input validation
  • Response filtering
  • Audit logging
  • Security reviews
2. AI Supply Chain Attacks

These attacks target the surrounding ecosystem instead of the model itself.

Examples include:

  • Malicious software dependencies
  • Tampered pre-trained models
  • Poisoned upstream datasets
  • Compromised third-party APIs
3. Insecure Plug-In Design

Plug-ins extend model functionality but may introduce security weaknesses.

Poorly secured plug-ins can:

  • Execute unauthorized actions
  • Bypass approval workflows
  • Operate with excessive permissions

Defenses

  • Strong authentication
  • Output restrictions
  • Logging and monitoring
  • Input validation
4. Insecure Output Handling

Applications become vulnerable when they trust model outputs without sanitization.

Example

A chatbot displaying unsanitized JavaScript could enable cross-site scripting (XSS).

Model output should always be treated as untrusted input.

5. Output Integrity Attacks
Adversarial Input AttacksOutput Integrity Attacks
Target model inputsTarget generated outputs
Manipulate processingModify responses after generation
Example: adversarial image patchesExample: altered API responses

Other examples include modifying RAG summaries before delivery or tampering with responses during transmission.

6. Model Denial of Service (DoS)

AI DoS attacks overwhelm models by exhausting resources such as:

  • Compute power
  • Memory
  • Tokens
  • API quotas

Example

Sending extremely long prompts repeatedly to consume processing capacity and disrupt service availability.

7. Excessive Agency

Excessive agency occurs when AI systems receive more authority than necessary.

Example

An AI agent with permission to:

  • Delete database entries
  • Send emails
  • Execute commands

A compromised or jailbroken agent could misuse these capabilities.

Primary Defense

Apply the principle of least privilege.

8. Overreliance

Overreliance happens when users trust AI outputs without sufficient human review.

Example

A security analyst ignores a threat because the AI rated it low risk.

Mitigations

  • Display confidence scores
  • Require human approval for sensitive actions
  • Implement escalation workflows
9. AI Hallucinations

Hallucinations occur when AI generates convincing but false or fabricated information.

Examples

  • Invented court cases
  • Fake CVE identifiers
  • Incorrect technical explanations

Root Cause

LLMs generate statistically probable text rather than verified facts.

Mitigations

  • Use RAG systems
  • Require citations
  • Add automated fact-checking
  • Maintain human oversight

Section 16: AI Security Controls

1. Model Risk Assessment

Risk assessments evaluate:

  • Security
  • Fairness
  • Robustness
  • Compliance
  • Reliability

Security teams actively test models to identify weaknesses before attackers do.

Frameworks such as the NIST AI RMF support this process.

2. Model Guardrails
Guardrail TypePurpose
Rule-Based FilteringBlocks prohibited keywords or patterns
AI ModerationUses another model to review content
Refusal TuningTrains the model to reject unsafe requests
Structured OutputsRestricts output format for safety

Guardrails must balance usability with protection.

3. Prompt Templates

Prompt templates standardize interactions by defining:

  • Tone
  • Role
  • Structure
  • Scope

They also reduce the risk of prompt injection by constraining acceptable behavior.

4. Guardrail Testing and Validation

Guardrails should be continuously tested through:

  • Jailbreak testing
  • Output validation
  • Log analysis
  • Human review

Security improvement is an ongoing process.

Section 17: AI Access Controls

1. Prompt Firewalls

Prompt firewalls filter inputs before they reach the model.

They help block:

  • Prompt injection
  • Sensitive data leaks
  • Policy violations

These systems often combine:

  • Static rules
  • Pattern matching
  • AI-based intent analysis
2. Limits and Quotas

Usage controls include:

  • Rate limiting
  • Token restrictions
  • Request size limits
  • Concurrent request controls

These protections help prevent abuse and denial-of-service attacks.

3. Model Access Controls

Security measures include:

  • Authentication
  • MFA
  • OAuth
  • RBAC and ABAC
  • Encryption for model artifacts

Usage limits also support fair resource allocation.

4. Data Access Controls

Organizations should secure:

  • Training datasets
  • Inference outputs
  • Logs
  • APIs
  • Underlying databases

DLP tools help prevent unauthorized data exfiltration.

5. Agent Access Controls

AI agents should operate with only the permissions necessary for their tasks.

Best Practices

  • Use intermediary APIs instead of direct infrastructure access
  • Apply monitoring and rate limits
  • Require human approval for high-risk actions
6. Network and API Access Controls

Organizations should:

  • Authenticate requests
  • Enforce authorization
  • Use TLS/HTTPS
  • Isolate AI systems from untrusted networks
  • Implement throttling and circuit breakers

Audit logging improves accountability and visibility.

Section 18: Encryption and Data Protection

1. Data Encryption
StateDescriptionExamples
At RestStored dataEncrypted datasets and logs
In TransitData moving across networksTLS/HTTPS
In UseData during processingTrusted Execution Environments

Homomorphic Encryption

Allows computation on encrypted data without decrypting it first.

This supports privacy-preserving machine learning but requires substantial computational resources.

2. Data Classification Labeling

Classification labels identify sensitivity levels for:

  • Training data
  • Models
  • Outputs

This supports automation and regulatory compliance.

3. Data Minimization

Only the minimum required data should be collected and retained.

Benefits include:

  • Reduced breach impact
  • Lower storage costs
  • Improved privacy compliance
4. Data Redaction

Sensitive information is removed or obscured before processing or output.

Redaction applies to:

  • Text
  • Images
  • Audio
  • Video

Context matters because combined data elements may form personally identifiable information (PII).

5. Data Masking

Masking replaces sensitive information with realistic substitutes.

Types

  • Static Masking: Permanently changes the data
  • Dynamic Masking: Changes only the displayed view based on permissions

Masking is often used in development and testing environments.

6. Data Anonymization

Anonymization removes identifiers so individuals cannot be recognized.

Types

  • Complete Anonymization: Irreversible removal of identifying details
  • Pseudonymization: Replaces identifiers with reversible substitute values
  • Synthetic Data: Artificially generated data with similar statistical properties
Comparing Data Protection Methods
TechniquePurposeReversible?Example
RedactionRemoves data entirelyNoRemoving SSNs
MaskingReplaces with fake dataSometimesTesting datasets
AnonymizationRemoves identityNoPublic datasets
PseudonymizationUses reversible identifiersYesGDPR workflows

Section 19: AI Monitoring and Auditing

Monitoring vs. Auditing
MonitoringAuditing
Continuous and real-timePeriodic and structured
Detects ongoing issuesReviews past compliance
Produces alerts and dashboardsProduces reports and recommendations
1. Monitoring Prompts and Responses

Prompt monitoring tracks user activity and detects abuse.

Response monitoring evaluates:

  • Accuracy
  • Safety
  • Policy compliance

Changes in confidence scores may indicate attacks or model drift.

2. Log Monitoring

Organizations should collect logs from:

  • Inference servers
  • Vector databases
  • Authentication systems
  • API gateways

Best Practices

  • Standardized formats
  • Log sanitization
  • Tamper-resistant storage
  • Defined retention policies
3. Rate and Cost Monitoring

Monitoring helps identify:

  • DoS attacks
  • Excessive API use
  • Bugs
  • Resource abuse

Metrics include:

  • Token usage
  • Compute time
  • API calls
  • Storage consumption

Budget limits help prevent uncontrolled costs.

4. Auditing for Hallucinations

Hallucination rates should be measured against trusted references.

Mitigation strategies include:

  • Citations
  • RAG
  • Automated detection tools
  • Human review
5. Auditing for Accuracy

Accuracy audits compare outputs against measurable standards.

Best practices include:

  • Defined metrics
  • Evaluation datasets
  • Fact verification
  • User feedback incorporation
6. Auditing for Bias and Fairness

Types of Bias

  • Data Bias: Bias originating from training data
  • Algorithmic Bias: Bias caused by model design choices

Best Practices

  • Test multiple demographic groups
  • Define fairness criteria
  • Use diverse audit teams
  • Apply proactive fairness controls
7. Auditing Access and Security Compliance

Security audits confirm controls are functioning properly.

Areas Reviewed

  • Governance practices
  • Access controls
  • Audit trails
  • Regulatory compliance
  • Third-party access
  • Data protection measures


Posted

in

by

Tags: