# Trainer IRM
## Invariant Risk Minimization
Decompose a classification task into a feature extractor \(\Phi(\cdot)\) and a classification layer \(w(\cdot)\); the task loss on domain \(d\) is then
\(\ell^{(d)} (w \circ \Phi) = \mathbb{E}_{(X, Y) \sim \mathcal{D}_d}[\ell(w \circ \Phi(X), Y)]\), where \(\ell\) denotes the cross-entropy loss for classification and \(\mathcal{D}_d\) the data distribution of domain \(d\).
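For concreteness, here is a minimal PyTorch sketch of this decomposition (the layer choices and shapes are illustrative assumptions, not DomainLab internals):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Decomposition: feature extractor Phi followed by classification layer w
phi = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
w = nn.Linear(32, 10)          # classification layer
net = nn.Sequential(phi, w)    # the composed classifier w ∘ Phi

# Monte-Carlo estimate of ell^{(d)}: cross entropy on a mini-batch from domain d
x_d, y_d = torch.randn(8, 3, 28, 28), torch.randint(0, 10, (8,))
loss_d = F.cross_entropy(net(x_d), y_d)
```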
The idea of IRM is to choose the classifier \(w\) from the intersection of the optimal classifiers of every domain \(d\), regardless of the feature extractor \(\Phi(\cdot)\); this serves as a constraint on the choice of \(w\).
The feature extractor \(\Phi(\cdot)\) is then optimized under this constraint.
IRM thus forms a bi-level optimization problem over \(\Phi\) and \(w\) jointly (spelled out below), which is hard to solve, so in practice the relaxation IRMv1 is used.
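Written out, the constrained program (restating the formulation of Arjovsky et al. in this document's notation) reads:

\[\min_{\Phi,\, w} \; \sum_{d} \ell^{(d)}(w \circ \Phi) \quad \text{subject to} \quad w \in \arg\min_{\bar{w}} \; \ell^{(d)}(\bar{w} \circ \Phi) \quad \text{for all domains } d.\]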
## IRMv1
In DomainLab, we write the loss function as \(\ell(\cdot) + \lambda R(\cdot)\), which results in the optimization below:

\[\min_{\Phi} \; \sum_{d} \left[ \ell^{(d)}(1.0 \cdot \Phi) + \lambda \left\| \nabla_{w \mid w = 1.0} \, \ell^{(d)}(w \cdot \Phi) \right\|^2 \right]\]

where \(\lambda\) is a hyperparameter that controls the trade-off between the empirical risk and the penalty \(R\), and the classifier is fixed to the dummy scalar \(w = 1.0\). One interpretation is that the penalty encourages the representation \(\Phi\) to be orthogonal to the gradient of the loss (e.g. cross entropy) at \(w = 1.0\) simultaneously across all domains, i.e. \(w = 1.0\) is already a stationary (locally optimal) classifier for every domain.
In practice, one can simply divide a mini-batch into two subsets, indexed by \(i\) and \(j\); the inner product of the gradients computed on subset \(i\) and on subset \(j\) forms an unbiased estimate of the squared L2 norm of the gradient, since the two subsets give independent gradient estimates (squaring a single estimate would introduce a variance bias). In detail, the squared gradient norm is estimated via the inner product of \(\nabla_{w \mid w=1.0} \, \ell(w \circ \Phi(X^{(d, i)}), Y^{(d, i)})\) with \(\nabla_{w \mid w=1.0} \, \ell(w \circ \Phi(X^{(d, j)}), Y^{(d, j)})\), both of dimension dim(Grad). For more details, see Section 3.2 and Appendix D of Arjovsky et al., “Invariant Risk Minimization.”
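As an illustration, below is a minimal PyTorch sketch of this two-subset estimator (the function name `irm_penalty` and the scalar dummy classifier scaling the logits are assumptions for illustration, not necessarily DomainLab's exact implementation):

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Unbiased estimate of ||grad_{w|w=1.0} loss||^2 for one domain's mini-batch."""
    # Dummy scalar classifier w = 1.0 multiplying the logits.
    w = torch.ones(1, device=logits.device, requires_grad=True)
    # Split the mini-batch into two disjoint subsets i (even rows) and j (odd rows).
    loss_i = F.cross_entropy(logits[::2] * w, y[::2])
    loss_j = F.cross_entropy(logits[1::2] * w, y[1::2])
    # create_graph=True keeps the penalty differentiable w.r.t. Phi's parameters.
    grad_i = torch.autograd.grad(loss_i, [w], create_graph=True)[0]
    grad_j = torch.autograd.grad(loss_j, [w], create_graph=True)[0]
    # Inner product of the two independent gradient estimates.
    return (grad_i * grad_j).sum()
```

The per-domain training loss would then be, e.g., `F.cross_entropy(logits, y) + lam * irm_penalty(logits, y)`, summed over domains.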
## Examples
```shell
python main_out.py --te_d=0 --task=mnistcolor10 --model=erm --trainer=irm --nname=conv_bn_pool_2
```