Adversarial attacks manipulate LLM outputs by introducing subtle modifications to the input text, exploiting vulnerabilities in the model's architecture and training process5. These attacks can cause the LLM to generate harmful or biased outputs, compromising the model's integrity and user trust. The attacks can be carried out through prompt manipulation or by fine-tuning the model with poisoned training data.
The NCBI study focused on three medical tasks: COVID-19 vaccination guidance, medication prescribing, and diagnostic test recommendations. The objectives of the attacks in these tasks were to discourage vaccination, suggest harmful drug combinations, and advocate for unnecessary medical tests.
Adversarial attacks significantly impacted vaccine recommendations, with prompt-based attacks causing a dramatic decline in vaccine recommendations from 74.13% to 2.49%. This highlights the vulnerability of LLMs to malicious manipulation, emphasizing the need for robust safeguards in critical sectors like healthcare.