Intriguing Properties of Neural Networks outlines several key properties that demonstrate counterintuitive behaviors of deep neural networks (DNNs); those properties are listed below:
- 1) Semantic Information is Distributed: The results show that semantic information is not tied to individual neurons in the network. Instead, it is distributed across the entire activation space of high-level layers. This is demonstrated by the fact that random linear combinations of units are just as semantically interpretable as individual units.
- 2) Adversarial Examples via Input Sensitivity: DNNs are sensitive to small, imperceptible perturbations of the input, called adversarial examples, which can cause the network to misclassify inputs that would otherwise be classified correctly. The same adversarial perturbations that cause one network to misclassify can also cause other networks to misclassify (universality), even if those networks were trained on different subsets of the data. This demonstrates the non-random nature of adversarial examples. Key fact: Adversarial examples generated on one neural network can generalize to other networks, even if those networks have different architectures, hyperparameters, and training data.
- 3) Intrinsic "Blind Spots": The networks are shown to have intrinsic "blind spots" where small perturbations in input space lead to large changes in output. This instability persists across different network architectures and training sets. The authors used spectral analysis to show that the instability of a network can be quantified by the operator norms of its weight matrices. The results show that early layers of a network can already exhibit significant instability, meaning that small changes in the input can propagate and amplify through the network.
- 4) Lipschitz Constants: Upper bounds on the Lipschitz constant of each layer were computed, revealing how much each layer's output can change in response to small changes in its input. Larger Lipschitz constants indicate less stable layers.
- 5) Adversarial Training for Robustness: The paper also explores training neural networks on adversarial examples to improve their robustness. The authors found that continuously introducing newly generated adversarial examples into the training set can help reduce test error and make the model more resistant to adversarial attacks. This continuous-retraining setup is loosely reminiscent of the model collapse paper (which I will explain in a separate post).
Key fact: Adversarial examples generated for higher layers of the network are more useful for improving robustness than those generated for the input or lower layers.
- 6) Universality of adversarial examples: The cross-network transfer described in point 2 suggests that the network's vulnerability is related to the fundamental nature of its learned representations rather than to overfitting on specific data.
- 7) Widespread nature of semantic information: The distributed nature of semantic information makes it harder to interpret neural networks by simply looking at individual units, as the information is encoded in the overall pattern of activations rather than in specific neurons. (More results are needed in mechanistic interpretability.)
- 8) Non-smooth decision boundaries: The network’s decision boundaries are found to be non-smooth. This non-smoothness contributes to the ease with which adversarial examples can fool the network, as targeted perturbations can push inputs across these boundaries.
- 9) Hard-negative mining in CV: The process of identifying and optimizing adversarial examples is related to the technique of hard-negative mining used in the computer vision domain. Hard-negative mining involves identifying difficult-to-classify examples in the training set that the model consistently gets wrong and emphasizing them during training to improve the model's performance (a minimal sketch follows below).
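As a rough illustration of the hard-negative mining idea above, here is a minimal PyTorch sketch (the `model`, `x`, and `y` names are hypothetical placeholders, not anything from the paper): rank the current training examples by loss and keep the hardest ones for extra emphasis in the next pass.

```python
import torch
import torch.nn.functional as F

def mine_hard_examples(model, x, y, k=256):
    """Return the k training examples the current model gets most wrong (highest loss)."""
    with torch.no_grad():
        losses = F.cross_entropy(model(x), y, reduction="none")   # per-example loss
    hard = losses.topk(k).indices
    return x[hard], y[hard]

# The mined hard negatives would then be oversampled or re-emphasized in the next
# training pass, which is the analogy the paper draws to searching for adversarial examples.
```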
Experiments
- Experiment 1: Semantic information is not confined to individual neurons. The authors trained neural networks on datasets such as MNIST and ImageNet, analyzed the activations of individual neurons in higher layers, and identified which input patterns maximally activate those neurons (the natural-basis analysis). They then performed the same analysis using random linear combinations of neuron activations: instead of looking at one neuron’s activation, they looked at activations along random directions in the feature space (a minimal sketch appears after this list).
- Experiment 2: Fundamental weakness of neural network representations. The authors generated adversarial examples for different neural networks trained on MNIST and ImageNet. These adversarial examples were created by adding small, imperceptible perturbations that cause the network to misclassify otherwise correctly classified images; the perturbations were found by maximizing the network's prediction error (a simplified sketch appears after this list).
- Experiment 3: Universality of adversarial examples. The authors split MNIST into two disjoint subsets (P1 and P2), trained separate networks on each, and generated adversarial examples for one of them. These examples were often still effective at causing errors in networks trained on the other subset.
- Experiment 4: Spectral Analysis of Instability. The authors performed a spectral analysis of the network’s layers: for each layer, they computed its operator norm (the largest singular value of its weight matrix) and used these norms to upper-bound the Lipschitz constant of each layer. The analysis showed that even the early layers of a neural network can have significant instability, which can propagate and amplify through later layers, leading to vulnerabilities in the model (see the spectral-analysis section below).
- Experiment 5: Adversarial training improves robustness. Training networks on adversarial examples can improve their robustness: on the MNIST dataset, a network trained with adversarial examples achieved a test error below 1.2%. Continuously regenerating and retraining on these hard examples makes networks more resistant to such vulnerabilities (a training-loop sketch appears after this list).
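A minimal PyTorch sketch of the Experiment 1 comparison, using a toy untrained stand-in network rather than the paper's models: retrieve the images that most strongly activate a single hidden unit versus a random direction in the same activation space.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained classifier's feature extractor; `features` plays the role
# of the last hidden layer. (In the paper this would be a network trained on MNIST/ImageNet.)
features = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
images = torch.rand(1000, 1, 28, 28)           # stand-in for held-out images

def top_activating(direction, k=8):
    """Return the k images whose hidden activations project most strongly onto `direction`."""
    with torch.no_grad():
        acts = features(images)                # (N, 256) activations in the feature space
    scores = acts @ direction                  # projection onto the chosen direction
    return images[scores.topk(k).indices]

e_i = torch.zeros(256); e_i[17] = 1.0          # natural-basis direction (unit 17 alone)
v = torch.randn(256); v /= v.norm()            # random direction in activation space

# The paper's observation: for a trained network, both sets of retrieved images
# tend to look equally semantically coherent.
imgs_unit, imgs_random = top_activating(e_i), top_activating(v)
```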
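For Experiments 2 and 3, the paper finds minimal perturbations with box-constrained L-BFGS; the sketch below swaps in a simpler iterative gradient-ascent step (my simplification, not the paper's optimizer) to push inputs toward higher loss on one toy model, then checks how often those perturbed inputs also fool a second, independently initialized model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def perturb(model, x, y, step=0.01, steps=20, eps=0.1):
    """Gradient-ascent sketch: nudge the input to maximize the classification loss
    while keeping the distortion small and the pixels in [0, 1].
    (The paper finds minimal perturbations with box-constrained L-BFGS; this is a
    simplified stand-in for illustration.)"""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()            # move uphill on the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # keep the distortion small
            x_adv = x_adv.clamp(0.0, 1.0)                 # stay a valid image
    return x_adv.detach()

# Toy stand-ins for two independently trained classifiers and a test batch.
model_a = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 100), nn.ReLU(), nn.Linear(100, 10))
model_b = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 100), nn.ReLU(), nn.Linear(100, 10))
x_test, y_test = torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,))

# Experiment 3 flavor: craft on model_a, then measure the error rate on model_b.
x_adv = perturb(model_a, x_test, y_test)
err_a = (model_a(x_adv).argmax(1) != y_test).float().mean()
err_b = (model_b(x_adv).argmax(1) != y_test).float().mean()   # cross-model transfer
```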
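For Experiment 5, a minimal sketch of the adversarial training loop, assuming the `perturb` helper from the Experiment 2-3 sketch and hypothetical `model`, `loader`, and `opt` objects.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, opt):
    """One pass over the data that trains on a mix of clean and adversarial examples,
    regenerating the adversarial half from the current model (the continuous
    "retrain on hard examples" idea). Reuses the `perturb` sketch from Experiments 2-3."""
    model.train()
    for x, y in loader:
        x_adv = perturb(model, x, y)                      # fresh adversarial batch
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Hypothetical usage (names assumed):
# opt = torch.optim.SGD(model_a.parameters(), lr=0.01)
# for epoch in range(10):
#     adversarial_training_epoch(model_a, train_loader, opt)
```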
Cross-Model Generalization of Adversarial Examples (source: Table 2 in the paper)
The table below shows, for adversarial examples generated on the model in each row, the error rates they induce on each model in the columns, along with the average distortion relative to the original training set.
Distortion | FC10(10⁻⁴) | FC10(10⁻²) | FC10(1) | FC100-100-10 | FC200-200-10 | AE400-10 | Average Distortion |
---|---|---|---|---|---|---|---|
FC10(10⁻⁴) | 100% | 11.7% | 22.7% | 2% | 3.9% | 2.7% | 0.062 |
FC10(10⁻²) | 87.1% | 100% | 35.2% | 35.9% | 27.3% | 9.8% | 0.1 |
FC10(1) | 71.9% | 76.2% | 100% | 48.1% | 47% | 34.4% | 0.14 |
FC100-100-10 | 28.9% | 13.7% | 21.1% | 100% | 6.6% | 2% | 0.058 |
FC200-200-10 | 38.2% | 14% | 23.8% | 20.3% | 100% | 2.7% | 0.065 |
AE400-10 | 23.4% | 16% | 24.8% | 9.4% | 6.6% | 100% | 0.086 |
Gaussian noise, stddev=0.1 | 5.0% | 10.1% | 18.3% | 0% | 0% | 0.8% | 0.1 |
Gaussian noise, stddev=0.3 | 15.6% | 11.3% | 22.7% | 5% | 4.3% | 3.1% | 0.3 |
For me, the most interesting section is (4.3) on Spectral Analysis of Instability, which explains how to measure and control the instability of DNNs by analyzing the spectral properties of each layer — specifically, the operator norm of the weight matrices. The network is represented as a series of transformations across multiple layers, denoted as:
$$\phi(x) = \phi_K(\phi_{K-1}(\dots \phi_1(x; W_1); W_2) \dots; W_K),$$
where $\phi_k$ denotes the mapping from layer $k-1$ to layer $k$, and $W_k$ are the trained weights of layer $k$. The instability is measured using the Lipschitz constant $L_k$ of each layer, defined by:
$$\forall x, r:\quad \|\phi_k(x; W_k) - \phi_k(x + r; W_k)\| \le L_k \|r\|.$$
The overall instability is determined by the product of the Lipschitz constants of all layers:
$$L = \prod_{k=1}^{K} L_k.$$
In rectified layers (ReLU), the mapping is defined as:
$$\phi_k(x; W_k, b_k) = \max(0, W_k x + b_k),$$
and the operator norm of $W_k$, denoted $\|W_k\|$, provides an upper bound on its Lipschitz constant. Pooling layers are contractive, so the output change is bounded by:
$$\|\phi_k(x) - \phi_k(x + r)\| \le \|r\|.$$
Contrast-normalization layers scale changes in the input by a factor $\gamma \in [0.5, 1]$. The operator norm of a convolutional layer is computed using the Fourier transform and Parseval's theorem, giving:
$$\|W\| = \sup_{\xi} \|A(\xi)\|,$$
where $A(\xi)$ is a matrix derived from the Fourier transforms of the convolutional kernels. This spectral analysis quantifies network instability and helps mitigate vulnerabilities through control of the Lipschitz constants and operator norms.
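A minimal PyTorch sketch of these per-layer bounds, using random placeholder weights rather than a trained network: the operator norm of a fully connected layer is its largest singular value, and for a single-channel circular convolution Parseval's theorem reduces the norm to the peak magnitude of the kernel's Fourier transform; multiplying the per-layer bounds gives the global Lipschitz upper bound.

```python
import torch

def dense_operator_norm(W):
    """Operator norm of a fully connected layer = largest singular value of its weight matrix."""
    return torch.linalg.svdvals(W).max()

def conv_operator_norm(kernel, input_size):
    """Operator norm of a single-channel circular convolution: by Parseval's theorem it is
    the largest magnitude of the kernel's 2-D Fourier transform (the general multi-channel
    case takes the supremum over frequencies of ||A(xi)||, as in the formula above)."""
    return torch.fft.fft2(kernel, s=input_size).abs().max()

# Per-layer upper bounds on the Lipschitz constants of a small fully connected stack
# (ReLU and pooling are 1-Lipschitz, so they do not increase the product).
weights = [torch.randn(100, 784), torch.randn(100, 100), torch.randn(10, 100)]
layer_bounds = torch.stack([dense_operator_norm(W) for W in weights])

# Global bound: the product of the per-layer Lipschitz constants.
L_upper = torch.prod(layer_bounds)

# Example conv-layer bound: a 3x3 kernel applied (circularly) to 28x28 inputs.
k_bound = conv_operator_norm(torch.randn(3, 3), (28, 28))
print(layer_bounds, L_upper, k_bound)
```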
For more details, see the original paper: Intriguing properties of neural networks