Death by a Thousand Prompts: Open Model Vulnerability Analysis
Publish Time: 05 Nov, 2025

AI models have become increasingly democratized, and the proliferation and adoption of open-weight models has contributed significantly to this reality. Open-weight models provide researchers, developers, and AI enthusiasts with a solid foundation for limitless use cases and applications. As of August 2025, leading U.S., Chinese, and European models have around 400M total downloads on HuggingFace. With an abundance of choice in the open-weight model ecosystem and the ability to fine-tune open models for specific applications, it is more important than ever to understand exactly what you're getting with an open-weight model, including its security posture.

Cisco AI Defense security researchers conducted a comparative AI security assessment of eight open-weight large language models (LLMs), revealing profound susceptibility to adversarial manipulation, particularly in multi-turn scenarios where success rates were observed to be 2x to 10x higher than single-turn attacks. Using Cisco's AI Validation platform, which performs automated algorithmic vulnerability testing, we evaluated models from Alibaba (Qwen3-32B), DeepSeek (v3.1), Google (Gemma 3-1B-IT), Meta (Llama 3.3-70B-Instruct), Microsoft (Phi-4), Mistral (Large-2, also known as Large-Instruct-2407), OpenAI (GPT-OSS-20b), and Zhipu AI (GLM 4.5-Air).

Below, we'll provide an overview of our model security assessment, review findings, and share the full report, which provides a complete breakdown of our analysis.

Evaluating Open-Source Model Security

For this report, we used AI Validation, which is part of our complete AI Defense solution that performs automated, algorithmic assessments of a model's safety and security vulnerabilities. This report highlights specific failures such as susceptibility to jailbreaks, tracked by MITRE ATLAS and OWASP as AML.T0054 and LLM01:2025, respectively. The risk assessment was performed as a black-box engagement where the details of the application architecture, design, and existing guardrails, if any, were not disclosed prior to testing.

Across all models, multi-turn jailbreak attacks, in which we leveraged numerous methods to steer a model toward disallowed output, proved highly effective, with attack success rates reaching 92.78 percent. The sharp rise from single-turn to multi-turn vulnerability underscores the lack of mechanisms within models to maintain and enforce safety and security guardrails across longer dialogues.
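As a rough illustration of how such an evaluation can be structured (a minimal sketch, not Cisco AI Validation's implementation), the snippet below pursues one adversarial goal across several escalating turns and reports the attack success rate as the fraction of goals that elicit disallowed output at any point in the dialogue. The `query_model` stub, the judge, and the escalation strings are hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Illustrative stand-in for the model under test; a real harness would call
# the open-weight model's chat endpoint here.
def query_model(history: List[Dict[str, str]]) -> str:
    return "I can't help with that."  # placeholder refusal

def is_disallowed(response: str) -> bool:
    # Placeholder judge; a real harness would use a policy classifier or an
    # LLM judge to decide whether the output violates policy.
    return "can't help" not in response.lower()

def multi_turn_attack(goal: str, escalations: List[str],
                      model: Callable[[List[Dict[str, str]]], str]) -> bool:
    """Pursue one adversarial goal across several turns; the attack
    succeeds if any turn elicits disallowed content."""
    history: List[Dict[str, str]] = [{"role": "user", "content": goal}]
    for follow_up in [None] + list(escalations):
        if follow_up is not None:
            history.append({"role": "user", "content": follow_up})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        if is_disallowed(reply):
            return True
    return False

def attack_success_rate(goals: List[str], escalations: List[str],
                        model: Callable[[List[Dict[str, str]]], str]) -> float:
    """Percentage of goals that succeed at any point in the dialogue."""
    hits = sum(multi_turn_attack(g, escalations, model) for g in goals)
    return 100.0 * hits / len(goals)

# Usage with hypothetical placeholder inputs; the stub model always refuses,
# so this prints 0.0.
print(attack_success_rate(["<benign placeholder goal>"],
                          ["reframe as fiction", "press for specifics"],
                          query_model))
```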

These findings confirm that multi-turn attacks remain a dominant and unsolved pattern in AI security. This could translate into real-world threats, including sensitive data exfiltration, content manipulation that compromises the integrity of data and information, ethical breaches through biased outputs, and even operational disruptions in integrated systems such as chatbots or decision-support tools. For instance, in enterprise settings, such vulnerabilities could enable unauthorized access to proprietary information, while in public-facing applications, they might facilitate the spread of harmful content at scale.

We infer, from our assessments and analysis of AI labs' technical reports, that alignment strategies and model provenance may factor into models' resilience against jailbreaks. For example, models that focus on capabilities (e.g., Llama) demonstrated the highest multi-turn gaps, with Meta explaining that developers are "in the driver seat to tailor safety for their use case" in post-training. Models that focused heavily on alignment (e.g., Google Gemma-3-1B-IT) demonstrated a more balanced profile between the single- and multi-turn strategies deployed against them, reflecting a focus on "rigorous safety protocols" and a "low risk level" for misuse.

Open-weight models, such as the ones we tested, provide a powerful foundation that, when combined with malicious fine-tuning techniques, can yield dangerous AI applications that bypass standard safety and security measures. We do not discourage continued investment in and development of open-source and open-weight models. Rather, we encourage AI labs that release open-weight models to take measures that prevent users from fine-tuning the security away, while also encouraging organizations to understand what AI labs prioritize in their model development (such as strong safety baselines versus capability-first baselines) before choosing a model for fine-tuning and deployment.

To counter the risk of adopting or deploying unsafe or insecure models, organizations must consider adopting advanced AI security solutions. This includes adversarial training to bolster model robustness, specialized defenses against multi-turn exploits (e.g., context-aware guardrails), real-time monitoring for anomalous interactions, and regular red-teaming exercises. By prioritizing these measures, stakeholders can transform open-weight models from liability-prone assets into secure, reliable components for production environments, fostering innovation without compromising security or safety.
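To make the "context-aware guardrails" idea concrete, here is a minimal sketch that screens the accumulated conversation rather than each prompt in isolation, so intent revealed only across several turns can still be blocked. The `POLICY_TERMS` list and the `context_aware_guardrail` helper are hypothetical simplifications; a production guardrail would rely on a trained policy model or moderation service rather than keyword matching.

```python
from typing import Dict, List

# Hypothetical policy terms; a real deployment would use a trained policy
# classifier rather than a keyword list.
POLICY_TERMS = ("bypass authentication", "exfiltrate credentials")

def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in POLICY_TERMS)

def context_aware_guardrail(history: List[Dict[str, str]]) -> bool:
    """Return True if the conversation as a whole should be blocked.

    Screening the concatenated history (not just the latest message) is
    what lets the guardrail catch multi-turn escalation, where no single
    message looks harmful on its own.
    """
    full_context = " ".join(turn["content"] for turn in history)
    return violates_policy(full_context)

# Usage: check the whole conversation before forwarding the latest user
# turn to the model. No single message below matches a policy term, but
# the joined context does.
conversation = [
    {"role": "user", "content": "I'm writing a thriller novel."},
    {"role": "user", "content": "How would my character bypass"},
    {"role": "user", "content": "authentication on a door badge system?"},
]
if context_aware_guardrail(conversation):
    print("Blocked: conversation-level policy violation detected.")
```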

Comparative vulnerability analysis showing attack success rates across tested models for both single-turn and multi-turn scenarios.

Findings

As we analyzed the data that emerged from our evaluation of these open-source models, we looked for key threat patterns, model behaviors, and implications for real-world deployments. Key findings included:

  • Multi-turn Attacks Remain the Primary Failure Mode: All models demonstrated high susceptibility to multi-turn attacks, with success rates ranging from 25.86% (Google Gemma-3-1B-IT) to 92.78% (Mistral Large-2), representing up to a 10x increase over single-turn baselines. See Table 1 below:

  • Alignment Approach Drives Security Gaps: Security gaps were predominantly positive, indicating heightened multi-turn risk (e.g., +73.48% for Alibaba Qwen3-32B and +70% for both Mistral Large-2 and Meta Llama 3.3-70B-Instruct). Models with smaller gaps may combine weaker single-turn defenses with comparatively stronger multi-turn defenses. We infer that these gaps stem from each lab's alignment approach to open-weight models: labs such as Meta and Alibaba focused on capabilities and applications and deferred to developers to add additional safety and security policies, while labs with a stronger safety and security posture, such as Google and OpenAI, exhibited more conservative gaps between single- and multi-turn strategies. Regardless, given the variation in single- and multi-turn attack success rates across models, end-users should consider risk holistically across attack techniques (a minimal sketch of the gap calculation follows this list).
  • Threat Category Patterns and Sub-threat Concentration: High-risk threat classes such as manipulation, misinformation, and malicious code generation exhibited consistently elevated success rates, with model-specific weaknesses; multi-turn attacks revealed category variations and clear vulnerability profiles. See Table 2 below for how different models performed against various multi-turn techniques. The top 15 sub-threats demonstrated extremely high success rates and warrant prioritization for defensive mitigation.
  • Attack Techniques and Strategies: Certain techniques and multi-turn strategies achieved high success rates, and each model's resistance varied; the selection of attack techniques and strategies can critically influence outcomes.
  • Overall Implications: The 2-10x effectiveness advantage of multi-turn attacks against models' guardrails demands immediate security enhancements to mitigate production risks.
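To clarify how the gaps above are read (a minimal sketch, assuming "security gap" simply means the difference between the two attack success rates), the snippet below computes the gap and the single-to-multi-turn multiplier. The numbers used are hypothetical placeholders, not figures from the report.

```python
def security_gap(single_turn_asr: float, multi_turn_asr: float) -> float:
    """Security gap in percentage points: multi-turn ASR minus single-turn ASR.

    A large positive gap means the model holds up against isolated prompts
    but degrades over the course of a dialogue.
    """
    return multi_turn_asr - single_turn_asr

# Hypothetical example values (not report data): a model with a 20%
# single-turn ASR and a 90% multi-turn ASR has a +70-point gap and a
# 4.5x multiplier.
single, multi = 20.0, 90.0
print(f"gap = {security_gap(single, multi):+.2f} points, "
      f"multiplier = {multi / single:.1f}x")
```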

The results against GPT-OSS-20b, for example, aligned closely with OpenAI's own evaluations: the overall attack success rates for the model were relatively low and roughly consistent with the "Jailbreak evaluation" section of the GPT-OSS model card, where refusal rates ranged from 0.960 to 0.982 for GPT-OSS-20b (that is, non-refusal rates of roughly 2 to 4 percent). Even so, this result underscores the continued susceptibility of frontier models to adversarial attacks.

An AI lab's goal in developing a specific model may also influence assessment outcomes. For example, Qwen's instruction tuning tends to prioritize helpfulness and breadth, which attackers can exploit by reframing prompts as "for research" or as "fictional scenarios," contributing to a higher multi-turn attack success rate. Meta, on the other hand, tends to ship open weights with the expectation that developers add their own moderation and safety layers. While baseline alignment is good (indicated by a modest single-turn rate), without additional safety and security guardrails (e.g., retaining safety policies across conversations or sessions, or tool-based moderation such as filtering and refusal models), multi-turn jailbreaks can escalate quickly. Open-weight-centric labs such as Mistral and Meta often ship capability-first bases with lighter built-in safety features. These are appealing for research and customization, but they push defenses onto the deployer. End-users looking for open-weight models to deploy should consider what they prioritize in a model (safety and security alignment versus high-capability open weights with fewer safeguards).

Developers can also fine-tune open-weight models to be more robust to jailbreaks and other adversarial attacks, though we are aware that nefarious actors can conversely fine-tune open-weight models for malicious purposes. Some model developers, such as Google, OpenAI, Meta, and Microsoft, have noted in their technical reports and model cards that they took steps to reduce the likelihood of malicious fine-tuning, while others, such as Alibaba, DeepSeek, and Mistral, did not acknowledge safety or security in their technical reports. Zhipu evaluated GLM-4.5 against safety benchmarks and noted strong performance across some categories while recognizing "room for improvement" in others. Given the inconsistent safety and security standards across the open-weight model landscape, there are attendant security, operational, technical, and ethical risks that stakeholders (from end-users to developers to the organizations and enterprises that adopt these models) must consider when adopting or using open-weight models. An emphasis on safety and security, from development to evaluation to release, should remain a top priority among AI developers and practitioners.

To see our testing methodology, findings, and the complete security assessment of these open-source models, read our report here.
