AAI_2025_Capstone_Chronicles_Combined
Prior work demonstrated that it was possible to build a CNN-based classifier to detect images generated through GAN architectures, but Wang et al. (2022) later showed that the performance of such models may drop when tested on new datasets, unseen image generators, or compressed video frames. They also argue that the main mechanism behind CNN-based fake image detection is the ability of CNNs to locate subtle, common flaws introduced by the GAN generation process.

Another architecture widely employed for this task is the Vision Transformer (ViT) (Dosovitskiy et al., 2020). ViTs extend the popular Transformer building block to image tasks and have proved better than CNNs at modelling long-range dependencies. More specifically, Heo et al. (2022) highlight that ViTs achieve stronger performance on several standard fake image detection datasets, arguing that their global receptive field may help capture subtle inconsistencies that CNNs miss. Some researchers have even built hybrid CNN-ViT models, so that the small-scale pattern detection of CNNs complements the broader dependency modelling of ViTs (Soudy et al., 2024).

Encoder-decoder architectures, such as autoencoders or Variational Autoencoders (VAEs) (Kingma & Welling, 2013), offer another solution to this problem. These models work differently from CNN and ViT classifiers: instead of labelling an input image as real or fake, they detect anomalies. Khalid and Woo (2020) presented a compelling case study for this kind of architecture, in which their model achieved very high accuracy by training only on real images and labelling anomalies as fake. This is a great advantage for new kinds of fake generators, for which quality labelled data is scarce.
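To make the ViT mechanism concrete, the following is a minimal sketch of how a ViT tokenizes an image before the Transformer attends over the whole sequence. The image, patch size, and projection matrix here are illustrative toy stand-ins (in a real ViT the projection is learned and positional embeddings are added):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # toy RGB image, purely illustrative
patch = 8                         # patch size -> (32/8)^2 = 16 patches

# Split the image into non-overlapping 8x8 patches and flatten each one
patches = image.reshape(32 // patch, patch, 32 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# Linearly project each flattened patch to a token (learned in a real ViT)
embed = rng.normal(size=(patch * patch * 3, 64))
tokens = patches @ embed          # sequence of 16 tokens of dimension 64
```

Because self-attention relates every token to every other token, the model sees all 16 patches at once, which is the "global receptive field" credited with catching inconsistencies that a CNN's local filters can miss.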
364
Made with FlippingBook - Share PDF online