LLMs such as ChatGPT explain their deepfake detection findings by identifying specific features or anomalies in an image, such as incorrect shadows or mismatched earrings. They provide a plain-language explanation of their analysis, making it easier for humans to understand the basis for their conclusions. This ability to provide comprehensible explanations sets LLMs apart from traditional deepfake detection models, which often do not give clear reasons for their determinations.
Multimodal LLMs analyze images by using an encoder to generate embeddings for the image data. These embeddings are then aligned with the embeddings of other modalities, such as text, in a shared multimodal embedding space. This process allows the LLM to interpret and generate content across multiple modalities, enabling tasks such as image captioning, visual question answering, and multimodal classification.
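The alignment step above can be sketched in a few lines. This is a minimal toy illustration, not any particular model's implementation: the "encoders" here are stand-in random projections (real systems use trained image and text encoders), but the mechanics of projecting both modalities into one shared space and comparing them by cosine similarity are the same.

```python
import numpy as np

# Hypothetical dimensions chosen for illustration only.
IMAGE_FEATS, TEXT_FEATS, EMBED_DIM = 128, 32, 64

rng = np.random.default_rng(0)
# Placeholder "encoders": fixed random projections into the shared space.
# A trained model would learn these so matching image/text pairs land close together.
W_image = rng.standard_normal((IMAGE_FEATS, EMBED_DIM))
W_text = rng.standard_normal((TEXT_FEATS, EMBED_DIM))

def embed(features, W):
    """Project raw features into the shared space and L2-normalize."""
    v = features @ W
    return v / np.linalg.norm(v)

def similarity(image_feats, text_feats):
    """Cosine similarity between an image and a text in the shared space."""
    return float(embed(image_feats, W_image) @ embed(text_feats, W_text))

image = rng.standard_normal(IMAGE_FEATS)   # stand-in for raw image features
caption = rng.standard_normal(TEXT_FEATS)  # stand-in for raw text features

score = similarity(image, caption)
# With unit-normalized embeddings the score always lies in [-1, 1];
# in a trained model, higher means the caption matches the image better.
print(round(score, 3))
```

Tasks like visual question answering then amount to feeding the image embeddings into the LLM's context alongside text tokens, so the model can reason over both in one sequence.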
ChatGPT detects synthetic artifacts with 79.5% accuracy on images generated by latent diffusion models and 77.2% accuracy on StyleGAN-generated images. While these rates are lower than those of the latest deepfake detection algorithms, ChatGPT's natural language processing could make it a more practical detection tool in the future.