MIRO: Mutual-Information Regularization
Mutual Information Regularization with Oracle (MIRO).
Pre-requisite: Variational lower bound on mutual information
Barber, David, and Felix Agakov. "The IM Algorithm: A Variational Approach to Information Maximization." Advances in Neural Information Processing Systems 16 (2004): 201.
Given a variational distribution \(q(x|y)\) acting as a decoder (i.e. \(Y\) encodes information from \(X\)):

Since the Kullback-Leibler divergence is non-negative,

\[\mathrm{KL}\bigl(p(X|Y)\,\|\,q(X|Y)\bigr)\ge 0 \;\Rightarrow\; {\langle\log p(X|Y)\rangle}_{p(x,y)}\ge{\langle\log q(X|Y)\rangle}_{p(x,y)}.\]

We have

\[I(X;Y)=H(X)-H(X|Y)=H(X)+{\langle\log p(X|Y)\rangle}_{p(x,y)}.\]

Then

\[I(X;Y)\ge H(X)+{\langle\log q(X|Y)\rangle}_{p(x,y)},\]

with the lower bound being \(H(X)+{\langle\log q(X|Y)\rangle}_{p(x,y)}\).
To optimize the lower bound, one can iterate between two steps (a minimal sketch follows this list):

- fix the decoder \(q(X|Y)\) and optimize the encoder \(Y=g(X;\theta)+\epsilon\);
- fix the encoder parameter \(\theta\) and tune the decoder to tighten the lower bound.
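The following is a minimal PyTorch sketch of this alternating scheme with a linear stochastic encoder and a diagonal Gaussian decoder; the module names, dimensions, and noise scale are illustrative assumptions, not part of Barber and Agakov's formulation.

```python
import torch
import torch.nn as nn

x_dim, y_dim = 8, 4

encoder = nn.Linear(x_dim, y_dim)             # deterministic part g(x; theta)
decoder = nn.Linear(y_dim, x_dim)             # mean of the Gaussian decoder q(x|y)
log_sigma = nn.Parameter(torch.zeros(x_dim))  # log diagonal std of q(x|y)

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_dec = torch.optim.Adam(list(decoder.parameters()) + [log_sigma], lr=1e-3)

def neg_lower_bound(x):
    """Negative of <log q(X|Y)>; H(X) is constant w.r.t. the parameters."""
    y = encoder(x) + 0.1 * torch.randn(x.size(0), y_dim)  # Y = g(X; theta) + eps
    x_hat = decoder(y)
    # Gaussian log-density, dropping the constant -d/2 * log(2*pi)
    log_q = -0.5 * (((x - x_hat) / log_sigma.exp()) ** 2).sum(dim=1) - log_sigma.sum()
    return -log_q.mean()

x = torch.randn(128, x_dim)
for step in range(100):
    # step 1: fix the decoder, optimize the encoder
    opt_enc.zero_grad()
    neg_lower_bound(x).backward()
    opt_enc.step()
    # step 2: fix the encoder, tune the decoder to tighten the bound
    opt_dec.zero_grad()
    neg_lower_bound(x).backward()
    opt_dec.step()
```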
Laplace approximation
decoding posterior:
when \(|Y|\) is large (a large deviation from zero carries more information, which must be explained by a non-typical \(X\))
Linear Gaussian
For a linear Gaussian decoder \(q(x|y)=\mathcal{N}(x;\,Wy+b,\,\Sigma)\), the bound \(H(X)+{\langle\log q(X|Y)\rangle}_{p(x,y)}\) becomes

\[H(X)-\tfrac{1}{2}{\bigl\langle (x-Wy-b)^{\top}\Sigma^{-1}(x-Wy-b)\bigr\rangle}_{p(x,y)}-\tfrac{1}{2}\log|\Sigma|-\tfrac{d}{2}\log 2\pi.\]
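As a quick numerical illustration of this linear-Gaussian term, the snippet below estimates \({\langle\log q(X|Y)\rangle}_{p(x,y)}\) from samples, dropping the \(-\tfrac{d}{2}\log 2\pi\) constant; the names W, b, Sigma and the toy data are assumptions for the example only.

```python
import numpy as np

def gaussian_decoder_term(x, y, W, b, Sigma):
    """Sample estimate of <log q(X|Y)> for q(x|y) = N(x; W y + b, Sigma),
    dropping the constant -d/2 * log(2*pi)."""
    resid = x - y @ W.T - b                      # (n, d) residuals x - W y - b
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum("ni,ij,nj->n", resid, Sigma_inv, resid)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * quad.mean() - 0.5 * logdet

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))                    # samples of X
y = rng.normal(size=(256, 2))                    # samples of Y
W, b, Sigma = rng.normal(size=(3, 2)), np.zeros(3), np.eye(3)
print(gaussian_decoder_term(x, y, W, b, Sigma))
```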
MIRO
MIRO matches the pre-trained model's features, layer by layer, to those of the target neural network we want to train, enforcing domain invariance in terms of mutual information. The variational decoder uses a constant identity encoder as the mean map on the target network's features, together with a population variance \(\Sigma\) (forced to be diagonal).
Let \(z\) denote the intermediate features of each layer, \(f_0\) the pre-trained model, \(f\) the target neural network, and \(x\) the input data.
The lower bound on the mutual information for instance \(i\) is then (up to an additive constant)

\[\log q\bigl(z^{(i)}_{f_0}\mid z^{(i)}_{f}\bigr)=-\tfrac{1}{2}\bigl(z^{(i)}_{f_0}-\mathrm{id}(z^{(i)}_{f})\bigr)^{\top}\Sigma^{-1}\bigl(z^{(i)}_{f_0}-\mathrm{id}(z^{(i)}_{f})\bigr)-\tfrac{1}{2}\log|\Sigma|,\]

where \(\mathrm{id}\) is the identity mean map.
For diagonal \(\Sigma\), the determinant is simply the product of the diagonal entries, so \(\log|\Sigma|=\sum_{d}\log\Sigma_{dd}\).
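Putting the pieces together, here is a minimal per-layer sketch of a MIRO-style regularizer with the identity mean map and a learned diagonal \(\Sigma\); the class name, the detach on the pre-trained features, and the \(\lambda\) weighting mentioned in the usage comment are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class MIROLayerReg(nn.Module):
    """Per-layer regularizer: negative MI lower bound with identity mean map
    and a learned diagonal Sigma (constants dropped)."""

    def __init__(self, num_features):
        super().__init__()
        # log of the diagonal entries of Sigma, one per feature dimension
        self.log_var = nn.Parameter(torch.zeros(num_features))

    def forward(self, z_f, z_f0):
        var = self.log_var.exp()
        # 0.5 * (z_f0 - id(z_f))^T Sigma^{-1} (z_f0 - id(z_f))
        quad = ((z_f0 - z_f) ** 2 / var).sum(dim=1)
        # 0.5 * log|Sigma|, with log|Sigma| = sum of the log diagonal entries
        logdet = self.log_var.sum()
        return 0.5 * (quad + logdet).mean()

# usage sketch: add lambda_reg * sum over layers of reg(z_f, z_f0.detach())
# to the task loss, where z_f0 comes from the frozen pre-trained model f_0
reg = MIROLayerReg(num_features=512)
z_f = torch.randn(32, 512, requires_grad=True)   # features from the target model f
z_f0 = torch.randn(32, 512)                      # features from the pre-trained model f_0
loss = reg(z_f, z_f0)
loss.backward()
```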