Hi Paul,
I don't have personal experience with the approach described in [TM1] "Bayesian inference with optimal maps", but here are my insights with regard to map composition and tempering. In the following I assume that you have whitened the posterior, so that the prior distribution is a Standard Normal and tempering is applied to the whitened likelihood (either by noise covariance progression, data injection or forward accuracy).
- I would say that the progression of the noise covariance should be slow enough that, at each step, the map is able to capture most of the relation between the reference (Standard Normal) and the tempered posterior. You can measure this with the variance diagnostic (a sketch of which follows this list).
- Regarding computational cost: if the maps are simple enough (at least order two or three for accuracy), then the approach can be more efficient than directly computing a single high-order transport map. You can think of the final map as a neural network with many layers, each layer corresponding to one map in the composition; this helps explain the expressivity of the representation. Another upside of using compositions of monotone triangular maps is that the transformation is invertible (a property sometimes needed); see the composition sketch after this list.
- The composition of maps should take the prior/reference \(\nu_\rho\) to the posterior \(\nu_\pi\) progressively, but it will not, in general, result in the "optimal composition", due to the inherently sequential construction. One could correct the maps in a final sweep to improve them, e.g. by solving the \(n\) problems:
\[ T_i = \arg\min_{T\in \mathcal{T}_>} \mathcal{D}_{\text{KL}}\left( (T_1\circ \cdots \circ T_{i-1})_\sharp \nu_\rho \,\middle\Vert\, T^\sharp (T_{i+1}\circ \cdots \circ T_n)^\sharp \nu_\pi \right) \]
warm-started at the \(T_i\) already learned through the tempering procedure (a sketch of one such sweep update is given below).
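To make the variance diagnostic concrete, here is a minimal NumPy sketch; the interface (`log_ref`, `log_target`, `tmap`, `tmap_logdet`) is hypothetical. When the reference and the pulled-back tempered posterior are close, the KL divergence is approximately half the variance, over reference samples, of their log-density ratio, and normalizing constants do not affect it:

```python
import numpy as np

def variance_diagnostic(ref_samples, log_ref, log_target, tmap, tmap_logdet):
    """Estimate D_KL(nu_rho || T^sharp nu_pi) ~ 0.5 * Var_rho[ log rho(x) - log (T^sharp pi)(x) ].

    ref_samples    : (N, d) samples from the Standard Normal reference
    log_ref(x)     : log-density of the reference, shape (N,)
    log_target(y)  : unnormalized log-density of the (tempered) posterior, shape (N,)
    tmap(x)        : evaluation of the candidate map T, shape (N, d)
    tmap_logdet(x) : log |det grad T(x)|, shape (N,)
    """
    # Pullback density: (T^sharp pi)(x) = pi(T(x)) |det grad T(x)|
    log_pullback = log_target(tmap(ref_samples)) + tmap_logdet(ref_samples)
    log_ratio = log_ref(ref_samples) - log_pullback
    # Normalizing constants only shift log_ratio, so the variance is unchanged.
    return 0.5 * np.var(log_ratio)
```

If this estimate stays small after each tempering step, the covariance progression is slow enough; a sudden jump signals that the step was too aggressive.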
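As a small illustration of the layer analogy (again just a sketch, with a made-up `forward`/`inverse` interface), a composition of monotone maps is evaluated layer by layer and inverted by applying the layer inverses in reverse order:

```python
class ComposedMap:
    """Composition of monotone (hence invertible) maps, evaluated like the
    layers of a feed-forward network."""

    def __init__(self, layers):
        self.layers = layers  # each layer exposes .forward(x) and .inverse(y)

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def inverse(self, y):
        # Invertibility of every monotone layer gives invertibility of the whole map.
        for layer in reversed(self.layers):
            y = layer.inverse(y)
        return y
```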
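Finally, a rough sketch of one sweep update for \(T_i\): the KL objective is estimated by Monte Carlo over reference samples, and since the normalizing constant of \(\nu_\pi\) only shifts the objective, the unnormalized posterior density suffices. The interface here (`make_map`, `.forward`, `.grad_log_det`) is hypothetical, and the list order is taken as the order of application, so adjust it to your composition convention:

```python
import numpy as np
from scipy.optimize import minimize

def sweep_objective(theta, x_ref, head_maps, tail_maps, log_post, make_map):
    """Monte Carlo estimate (up to a constant in theta) of
    D_KL( (T_1 o ... o T_{i-1})_sharp nu_rho || T^sharp (T_{i+1} o ... o T_n)^sharp nu_pi ).

    x_ref     : (N, d) samples from the Standard Normal reference nu_rho
    head_maps : [T_1, ..., T_{i-1}]   (kept fixed)
    tail_maps : [T_{i+1}, ..., T_n]   (kept fixed)
    log_post  : unnormalized log-density of the whitened posterior nu_pi
    make_map  : builds the parametric monotone candidate T from the vector theta
    """
    T = make_map(theta)
    # Push reference samples through the maps before position i.
    x = x_ref
    for head in head_maps:
        x = head.forward(x)
    # Log pullback density of nu_pi through T and the maps after position i.
    y = T.forward(x)
    logdet = T.grad_log_det(x)            # log |det grad T(x)|
    for tail in tail_maps:
        logdet += tail.grad_log_det(y)
        y = tail.forward(y)
    # Minimizing D_KL is equivalent to maximizing this expectation.
    return -np.mean(log_post(y) + logdet)

# One correction sweep, warm-started at the maps found during tempering:
# for i in range(n):
#     res = minimize(sweep_objective, maps[i].params,
#                    args=(x_ref, maps[:i], maps[i+1:], log_post, make_map))
#     maps[i] = make_map(res.x)
```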
Can you clarify what you mean by "first map to get low-rank coupling"? Do you mean finding the rotation such that the subsequent maps become the identity almost everywhere (as in Likelihood-Informed Subspaces or Active Subspaces)?
Also, what do you mean by "unscaled"? Usually the scaling problem arises when one tries to find maps from samples, not when constructing maps from densities. So using a linear map at the beginning is not strictly necessary, but it shouldn't hurt either (a minimal sketch follows below).
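For completeness, a minimal sketch of such an initial linear map fitted from samples (just a whitening/standardization step; the function name is mine):

```python
import numpy as np

def fit_linear_whitening(samples):
    """Fit an affine map L(x) = A^{-1} (x - mu) that standardizes the samples,
    i.e. a cheap first 'layer' removing scale and correlation before the nonlinear maps."""
    mu = samples.mean(axis=0)
    A = np.linalg.cholesky(np.cov(samples, rowvar=False))   # cov = A A^T
    forward = lambda x: np.linalg.solve(A, (x - mu).T).T     # L(x)
    inverse = lambda z: z @ A.T + mu                         # L^{-1}(z)
    return forward, inverse
```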
I hope to have addressed some of your questions.
I will try to gather more insights from my colleagues in the coming days.
Best,
Daniele