Fool a Neural Network

A small neural network is running in your browser and reading the digit on the left. The middle panel shows the direction in pixel space that most increases its error, which is the network's own gradient turned against it. Slide ε up to add that pattern to the image: each pixel moves by at most ε, in the direction given by the sign of the gradient. The prediction changes even though the image looks the same to you. This is the Fast Gradient Sign Method (FGSM), and attacks like it are part of why my PhD work was about making models more robust.
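For readers who want to see the step itself, here is a minimal sketch of an untargeted FGSM update. It assumes a tiny linear softmax classifier as a stand-in for the network in the demo, whose code is not shown on this page. The names `predict`, `input_gradient`, and `fgsm`, the toy weights `W` and `b`, and the 4-pixel "image" are all illustrative, not taken from the demo.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Class probabilities of a linear softmax classifier (toy stand-in)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss with respect to the input pixels."""
    p = predict(W, b, x)
    # d(loss)/d(logit_k) = p_k - 1[k == label]; chain rule through logits = W x + b
    d = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(d[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(W, b, x, label, eps):
    """x_adv = clip(x + eps * sign(grad_x loss)), pixels kept in [0, 1]."""
    g = input_gradient(W, b, x, label)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, g)]

def argmax(p):
    return max(range(len(p)), key=p.__getitem__)

# Hypothetical 2-class model on a 4-pixel "image".
W = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]
b = [0.0, 0.0]
x = [0.6, 0.4, 0.6, 0.4]  # classified as class 0

x_adv = fgsm(W, b, x, label=0, eps=0.15)
print(argmax(predict(W, b, x)), argmax(predict(W, b, x_adv)))  # prints: 0 1
```

No pixel changes by more than ε = 0.15, yet the predicted class flips. In a real network the gradient would come from backpropagation rather than the closed form used here.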

Attack strength ε: 0.10
Use the adversarially-trained model

Click a confidence bar to aim the attack at that digit. You can also draw on the left image (right-drag erases; on a touchscreen, press and hold, then drag, to erase); the network reads it live and the attack adapts. The adversarially-trained model has seen attacks during training, so it is much harder to fool (its clean accuracy is a little lower in exchange).
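One common way to aim an attack at a chosen digit is the targeted variant of FGSM: instead of stepping up the loss for the true label, step down the loss for the target label. The sketch below assumes this is roughly what clicking a confidence bar does, which the page does not confirm. The model is again a hypothetical linear softmax classifier, and `targeted_fgsm`, `W`, `b`, and the 3-pixel input are illustrative names and values.

```python
import math

def predict(W, b, x):
    """Class probabilities of a linear softmax classifier (toy stand-in)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    m = max(logits)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def targeted_fgsm(W, b, x, target, eps):
    """x_adv = clip(x - eps * sign(grad_x loss(x, target))), pixels in [0, 1]."""
    p = predict(W, b, x)
    # Gradient of cross-entropy toward the *target* class, w.r.t. the logits.
    d = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    # Chain rule through logits = W x + b gives the gradient w.r.t. the pixels.
    g = [sum(d[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
    step = [(gi > 0) - (gi < 0) for gi in g]
    # Minus sign: move downhill on the target's loss, i.e. toward the target.
    return [min(1.0, max(0.0, xi - eps * si)) for xi, si in zip(x, step)]

def argmax(p):
    return max(range(len(p)), key=p.__getitem__)

# Hypothetical 3-class model on a 3-pixel input, initially classified as 0.
W = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
b = [0.0, 0.0, 0.0]
x = [0.6, 0.5, 0.4]

for target in (1, 2):
    x_adv = targeted_fgsm(W, b, x, target, eps=0.15)
    print(target, argmax(predict(W, b, x_adv)))
# prints:
# 1 1
# 2 2
```

The same ε budget pushes the prediction to whichever class is chosen as the target, which is the behaviour the confidence bars expose.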

Advanced: PGD, robustness curve, transfer