Hi Paul,
I don't have personal experience with the approach described in [TM1] "Bayesian inference with optimal maps", but here are my insights with regard to map composition and tempering. In the following I assume that you have whitened the posterior, so that the prior distribution is a Standard Normal and tempering is applied to the whitened likelihood (either by noise covariance progression, data injection or forward accuracy).
- I would say that the progression of the noise covariance should be slow enough that, at each step, the map is able to capture most of the relation between the reference (Standard Normal) and the tempered posterior. You can measure this with the variance diagnostic (a sketch of which follows this list).
- Regarding computational cost: if the maps are simple enough (at least order two or three for accuracy), then the approach can be more efficient than directly computing a single high-order transport map. You can think of the final map as a neural network with many layers, each layer corresponding to one map in the composition; this helps explain the expressivity of the representation. Another upside of using compositions of monotone triangular maps is that the transformation is invertible (a property sometimes needed); see the composition sketch after this list.
- The composition of maps should take the prior/reference \(\nu_\rho\) to the posterior \(\nu_\pi\) progressively, but it will not, in general, result in the "optimal composition", due to the inherently sequential construction. One could correct the maps in a final sweep to improve them, e.g. by solving the \(n\) problems:
\[ T_i = \arg\min_{T\in \mathcal{T}_>} \mathcal{D}_{\text{KL}}\left( (T_1\circ \cdots \circ T_{i-1})_\sharp \nu_\rho \,\middle\Vert\, T^\sharp (T_{i+1}\circ \cdots \circ T_n)^\sharp \nu_\pi \right) \]
warm-started at the \(T_i\) already learned through the tempering procedure (a sketch of one such sweep update is given below).
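To make the variance diagnostic concrete, here is a minimal NumPy sketch; the interface (`log_ref`, `log_target`, `tmap`, `tmap_logdet`) is hypothetical. When the reference and the pulled-back tempered posterior are close, the KL divergence is approximately half the variance, over reference samples, of their log-density ratio, and normalizing constants do not affect it:

```python
import numpy as np

def variance_diagnostic(ref_samples, log_ref, log_target, tmap, tmap_logdet):
    """Estimate D_KL(nu_rho || T^sharp nu_pi) ~ 0.5 * Var_rho[ log rho(x) - log (T^sharp pi)(x) ].

    ref_samples    : (N, d) samples from the Standard Normal reference
    log_ref(x)     : log-density of the reference, shape (N,)
    log_target(y)  : unnormalized log-density of the (tempered) posterior, shape (N,)
    tmap(x)        : evaluation of the candidate map T, shape (N, d)
    tmap_logdet(x) : log |det grad T(x)|, shape (N,)
    """
    # Pullback density: (T^sharp pi)(x) = pi(T(x)) |det grad T(x)|
    log_pullback = log_target(tmap(ref_samples)) + tmap_logdet(ref_samples)
    log_ratio = log_ref(ref_samples) - log_pullback
    # Normalizing constants only shift log_ratio, so the variance is unchanged.
    return 0.5 * np.var(log_ratio)
```

If this estimate stays small after each tempering step, the covariance progression is slow enough; a sudden jump signals that the step was too aggressive.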
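As a small illustration of the layer analogy (again just a sketch, with a made-up `forward`/`inverse` interface), a composition of monotone maps is evaluated layer by layer and inverted by applying the layer inverses in reverse order:

```python
class ComposedMap:
    """Composition of monotone (hence invertible) maps, evaluated like the
    layers of a feed-forward network."""

    def __init__(self, layers):
        self.layers = layers  # each layer exposes .forward(x) and .inverse(y)

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def inverse(self, y):
        # Invertibility of every monotone layer gives invertibility of the whole map.
        for layer in reversed(self.layers):
            y = layer.inverse(y)
        return y
```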
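Finally, a rough sketch of one sweep update for \(T_i\): the KL objective is estimated by Monte Carlo over reference samples, and since the normalizing constant of \(\nu_\pi\) only shifts the objective, the unnormalized posterior density suffices. The interface here (`make_map`, `.forward`, `.grad_log_det`) is hypothetical, and the list order is taken as the order of application, so adjust it to your composition convention:

```python
import numpy as np
from scipy.optimize import minimize

def sweep_objective(theta, x_ref, head_maps, tail_maps, log_post, make_map):
    """Monte Carlo estimate (up to a constant in theta) of
    D_KL( (T_1 o ... o T_{i-1})_sharp nu_rho || T^sharp (T_{i+1} o ... o T_n)^sharp nu_pi ).

    x_ref     : (N, d) samples from the Standard Normal reference nu_rho
    head_maps : [T_1, ..., T_{i-1}]   (kept fixed)
    tail_maps : [T_{i+1}, ..., T_n]   (kept fixed)
    log_post  : unnormalized log-density of the whitened posterior nu_pi
    make_map  : builds the parametric monotone candidate T from the vector theta
    """
    T = make_map(theta)
    # Push reference samples through the maps before position i.
    x = x_ref
    for head in head_maps:
        x = head.forward(x)
    # Log pullback density of nu_pi through T and the maps after position i.
    y = T.forward(x)
    logdet = T.grad_log_det(x)            # log |det grad T(x)|
    for tail in tail_maps:
        logdet += tail.grad_log_det(y)
        y = tail.forward(y)
    # Minimizing D_KL is equivalent to maximizing this expectation.
    return -np.mean(log_post(y) + logdet)

# One correction sweep, warm-started at the maps found during tempering:
# for i in range(n):
#     res = minimize(sweep_objective, maps[i].params,
#                    args=(x_ref, maps[:i], maps[i+1:], log_post, make_map))
#     maps[i] = make_map(res.x)
```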
Can you clarify what you mean by "first map to get low-rank coupling"? Do you mean finding the rotation such that the subsequent maps become the identity almost everywhere (as in Likelihood-Informed Subspaces or Active Subspaces)?
Also, what do you mean by "unscaled"? Usually the scaling problem arises when one tries to find maps from samples, not when constructing maps from densities. So using a linear map at the beginning is not strictly necessary, but it shouldn't hurt either (a minimal sketch follows below).
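For completeness, a minimal sketch of such an initial linear map fitted from samples (just a whitening/standardization step; the function name is mine):

```python
import numpy as np

def fit_linear_whitening(samples):
    """Fit an affine map L(x) = A^{-1} (x - mu) that standardizes the samples,
    i.e. a cheap first 'layer' removing scale and correlation before the nonlinear maps."""
    mu = samples.mean(axis=0)
    A = np.linalg.cholesky(np.cov(samples, rowvar=False))   # cov = A A^T
    forward = lambda x: np.linalg.solve(A, (x - mu).T).T     # L(x)
    inverse = lambda z: z @ A.T + mu                         # L^{-1}(z)
    return forward, inverse
```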
I hope to have addressed some of your questions.
I will try to gather more insights from my colleagues in the coming days.
Best,
Daniele