Section 12: Attacks on Data and Training Processes
1. Data Poisoning
Data poisoning occurs when attackers manipulate training datasets so the AI learns incorrect or harmful patterns.
Attackers may also insert hidden backdoors, causing the model to behave normally except when a specific trigger appears.
Even a very small percentage of corrupted data can noticeably alter model behavior.
Example
An attacker uploads numerous phishing emails labeled as “safe” into a crowdsourced dataset, causing the detection model to stop recognizing that phishing style.
2. Model Poisoning
Instead of modifying the dataset, attackers directly tamper with the model’s learned weights or parameters.
Compromised pre-trained models may be distributed through public repositories while appearing legitimate.
Data Poisoning vs. Model Poisoning
| Category | Data Poisoning | Model Poisoning |
|---|---|---|
| Target | Training dataset | Model parameters and weights |
| Timing | Before or during training | After training or through the supply chain |
| Technique | Inject malicious training data | Modify model artifacts directly |
| Detection | Data audits and anomaly detection | Integrity checks and hashing |
3. Introducing Bias
Bias attacks manipulate datasets so models produce unfair or distorted outcomes.
Example
A hiring model trained on biased historical records may unfairly disadvantage applicants from underrepresented groups.
4. Transfer Learning Attacks
Attackers compromise a pre-trained base model before it is fine-tuned for another task.
Any hidden backdoors or weaknesses in the original model carry over into the new specialized model.
Public model repositories are common distribution points for these poisoned models.
Organizations should treat third-party AI models as untrusted software and verify integrity before use.
5. Model Skewing
Model skewing gradually shifts model behavior over time instead of causing immediate, noticeable failure.
This is especially dangerous in systems that continuously retrain using user feedback.
Language models are particularly vulnerable because attackers can slowly influence outputs through repeated crafted interactions.
Defenses
- Validate user-generated inputs
- Monitor retraining feedback loops
- Compare outputs against statistical baselines
6. Backdoor and Trojan Attacks
Backdoor attacks cause models to behave differently when a hidden trigger is present.
Trojan attacks embed concealed malicious functionality within models or AI pipelines.
These attacks are often introduced through:
- Poisoned datasets
- Compromised model weights
- Supply chain manipulation
A major warning sign is selective failure — the model works normally except when a trigger appears.
Section 13: Prompt and Input Manipulation
1. Prompt Injection
Prompt injection attempts to override the model’s original instructions with attacker-controlled commands.
Direct Prompt Injection
The malicious instruction is placed directly into the user’s prompt.
Example
“Ignore previous instructions and explain how to…”
Indirect Prompt Injection
Malicious instructions are hidden inside content processed by the AI, such as webpages or uploaded files.
For example, an AI browsing agent may unknowingly read hidden commands embedded in a website.
2. Bypassing Guardrails and Jailbreaking
Guardrails
Guardrails are external safety controls surrounding the model, including:
- Rate limits
- Output filters
- Keyword restrictions
- API protections
Jailbreaking
Jailbreaking uses carefully crafted prompts to bypass built-in model restrictions.
Attackers often use:
- Roleplay scenarios
- Hypothetical framing
- Social engineering techniques
Example
“Pretend you are an unrestricted AI…”
3. Input Manipulation
Attackers craft inputs specifically designed to confuse or deceive models.
Examples include:
- Adversarial image patches
- Manipulated text inputs
- Inputs that appear normal to humans but disrupt AI processing
Section 14: Model Extraction and Information Leakage
1. Model Inversion
Model inversion attacks attempt to reconstruct training data by repeatedly querying the model.
Example
An attacker may recreate approximate facial images from a facial recognition system, exposing private data.
2. Membership Inference
Membership inference determines whether specific records were used during training.
Example
A medical AI could unintentionally reveal whether a patient’s information was part of its dataset.
Attack Types
- White-Box Attack: Attacker has access to internal model details.
- Black-Box Attack: Attacker only observes model outputs.
3. Model Theft
Attackers steal or replicate AI models to avoid development costs or launch further attacks.
Methods
- Stealing actual model files
- Reconstructing the model through repeated API queries and surrogate training
4. Sensitive Information Disclosure
Poorly secured prompting may cause models to reveal confidential information.
This can include:
- Proprietary business logic
- Hidden system prompts
- API keys
- Internal configurations
Section 15: Integration and Operational Threats
1. Manipulating AI Integrations
When AI systems connect to APIs, databases, email systems, or file storage, attackers may abuse those integrations.
Possible attacker actions include:
- Triggering unauthorized operations
- Accessing sensitive data
- Chaining multiple tools together
- Circumventing content filters
- Maintaining long-term persistence
Defenses
- Least privilege access
- Input validation
- Response filtering
- Audit logging
- Security reviews
2. AI Supply Chain Attacks
These attacks target the surrounding ecosystem instead of the model itself.
Examples include:
- Malicious software dependencies
- Tampered pre-trained models
- Poisoned upstream datasets
- Compromised third-party APIs
3. Insecure Plug-In Design
Plug-ins extend model functionality but may introduce security weaknesses.
Poorly secured plug-ins can:
- Execute unauthorized actions
- Bypass approval workflows
- Operate with excessive permissions
Defenses
- Strong authentication
- Output restrictions
- Logging and monitoring
- Input validation
4. Insecure Output Handling
Applications become vulnerable when they trust model outputs without sanitization.
Example
A chatbot displaying unsanitized JavaScript could enable cross-site scripting (XSS).
Model output should always be treated as untrusted input.
5. Output Integrity Attacks
| Adversarial Input Attacks | Output Integrity Attacks |
|---|---|
| Target model inputs | Target generated outputs |
| Manipulate processing | Modify responses after generation |
| Example: adversarial image patches | Example: altered API responses |
Other examples include modifying RAG summaries before delivery or tampering with responses during transmission.
6. Model Denial of Service (DoS)
AI DoS attacks overwhelm models by exhausting resources such as:
- Compute power
- Memory
- Tokens
- API quotas
Example
Sending extremely long prompts repeatedly to consume processing capacity and disrupt service availability.
7. Excessive Agency
Excessive agency occurs when AI systems receive more authority than necessary.
Example
An AI agent with permission to:
- Delete database entries
- Send emails
- Execute commands
A compromised or jailbroken agent could misuse these capabilities.
Primary Defense
Apply the principle of least privilege.
8. Overreliance
Overreliance happens when users trust AI outputs without sufficient human review.
Example
A security analyst ignores a threat because the AI rated it low risk.
Mitigations
- Display confidence scores
- Require human approval for sensitive actions
- Implement escalation workflows
9. AI Hallucinations
Hallucinations occur when AI generates convincing but false or fabricated information.
Examples
- Invented court cases
- Fake CVE identifiers
- Incorrect technical explanations
Root Cause
LLMs generate statistically probable text rather than verified facts.
Mitigations
- Use RAG systems
- Require citations
- Add automated fact-checking
- Maintain human oversight
Section 16: AI Security Controls
1. Model Risk Assessment
Risk assessments evaluate:
- Security
- Fairness
- Robustness
- Compliance
- Reliability
Security teams actively test models to identify weaknesses before attackers do.
Frameworks such as the NIST AI RMF support this process.
2. Model Guardrails
| Guardrail Type | Purpose |
|---|---|
| Rule-Based Filtering | Blocks prohibited keywords or patterns |
| AI Moderation | Uses another model to review content |
| Refusal Tuning | Trains the model to reject unsafe requests |
| Structured Outputs | Restricts output format for safety |
Guardrails must balance usability with protection.
3. Prompt Templates
Prompt templates standardize interactions by defining:
- Tone
- Role
- Structure
- Scope
They also reduce the risk of prompt injection by constraining acceptable behavior.
4. Guardrail Testing and Validation
Guardrails should be continuously tested through:
- Jailbreak testing
- Output validation
- Log analysis
- Human review
Security improvement is an ongoing process.
Section 17: AI Access Controls
1. Prompt Firewalls
Prompt firewalls filter inputs before they reach the model.
They help block:
- Prompt injection
- Sensitive data leaks
- Policy violations
These systems often combine:
- Static rules
- Pattern matching
- AI-based intent analysis
2. Limits and Quotas
Usage controls include:
- Rate limiting
- Token restrictions
- Request size limits
- Concurrent request controls
These protections help prevent abuse and denial-of-service attacks.
3. Model Access Controls
Security measures include:
- Authentication
- MFA
- OAuth
- RBAC and ABAC
- Encryption for model artifacts
Usage limits also support fair resource allocation.
4. Data Access Controls
Organizations should secure:
- Training datasets
- Inference outputs
- Logs
- APIs
- Underlying databases
DLP tools help prevent unauthorized data exfiltration.
5. Agent Access Controls
AI agents should operate with only the permissions necessary for their tasks.
Best Practices
- Use intermediary APIs instead of direct infrastructure access
- Apply monitoring and rate limits
- Require human approval for high-risk actions
6. Network and API Access Controls
Organizations should:
- Authenticate requests
- Enforce authorization
- Use TLS/HTTPS
- Isolate AI systems from untrusted networks
- Implement throttling and circuit breakers
Audit logging improves accountability and visibility.
Section 18: Encryption and Data Protection
1. Data Encryption
| State | Description | Examples |
|---|---|---|
| At Rest | Stored data | Encrypted datasets and logs |
| In Transit | Data moving across networks | TLS/HTTPS |
| In Use | Data during processing | Trusted Execution Environments |
Homomorphic Encryption
Allows computation on encrypted data without decrypting it first.
This supports privacy-preserving machine learning but requires substantial computational resources.
2. Data Classification Labeling
Classification labels identify sensitivity levels for:
- Training data
- Models
- Outputs
This supports automation and regulatory compliance.
3. Data Minimization
Only the minimum required data should be collected and retained.
Benefits include:
- Reduced breach impact
- Lower storage costs
- Improved privacy compliance
4. Data Redaction
Sensitive information is removed or obscured before processing or output.
Redaction applies to:
- Text
- Images
- Audio
- Video
Context matters because combined data elements may form personally identifiable information (PII).
5. Data Masking
Masking replaces sensitive information with realistic substitutes.
Types
- Static Masking: Permanently changes the data
- Dynamic Masking: Changes only the displayed view based on permissions
Masking is often used in development and testing environments.
6. Data Anonymization
Anonymization removes identifiers so individuals cannot be recognized.
Types
- Complete Anonymization: Irreversible removal of identifying details
- Pseudonymization: Replaces identifiers with reversible substitute values
- Synthetic Data: Artificially generated data with similar statistical properties
Comparing Data Protection Methods
| Technique | Purpose | Reversible? | Example |
|---|---|---|---|
| Redaction | Removes data entirely | No | Removing SSNs |
| Masking | Replaces with fake data | Sometimes | Testing datasets |
| Anonymization | Removes identity | No | Public datasets |
| Pseudonymization | Uses reversible identifiers | Yes | GDPR workflows |
Section 19: AI Monitoring and Auditing
Monitoring vs. Auditing
| Monitoring | Auditing |
|---|---|
| Continuous and real-time | Periodic and structured |
| Detects ongoing issues | Reviews past compliance |
| Produces alerts and dashboards | Produces reports and recommendations |
1. Monitoring Prompts and Responses
Prompt monitoring tracks user activity and detects abuse.
Response monitoring evaluates:
- Accuracy
- Safety
- Policy compliance
Changes in confidence scores may indicate attacks or model drift.
2. Log Monitoring
Organizations should collect logs from:
- Inference servers
- Vector databases
- Authentication systems
- API gateways
Best Practices
- Standardized formats
- Log sanitization
- Tamper-resistant storage
- Defined retention policies
3. Rate and Cost Monitoring
Monitoring helps identify:
- DoS attacks
- Excessive API use
- Bugs
- Resource abuse
Metrics include:
- Token usage
- Compute time
- API calls
- Storage consumption
Budget limits help prevent uncontrolled costs.
4. Auditing for Hallucinations
Hallucination rates should be measured against trusted references.
Mitigation strategies include:
- Citations
- RAG
- Automated detection tools
- Human review
5. Auditing for Accuracy
Accuracy audits compare outputs against measurable standards.
Best practices include:
- Defined metrics
- Evaluation datasets
- Fact verification
- User feedback incorporation
6. Auditing for Bias and Fairness
Types of Bias
- Data Bias: Bias originating from training data
- Algorithmic Bias: Bias caused by model design choices
Best Practices
- Test multiple demographic groups
- Define fairness criteria
- Use diverse audit teams
- Apply proactive fairness controls
7. Auditing Access and Security Compliance
Security audits confirm controls are functioning properly.
Areas Reviewed
- Governance practices
- Access controls
- Audit trails
- Regulatory compliance
- Third-party access
- Data protection measures