VIP - Variational Inversion Package with example implementations of Bayesian tomographic imaging

Bayesian inference has become an important tool to solve inverse problems and to quantify uncertainties in their solutions. Variational inference is a method that provides probabilistic, Bayesian solutions efficiently by using optimization. In this study we present a Python Variational Inversion Package (VIP) to solve inverse problems using variational inference methods. The package includes automatic differential variational inference (ADVI), Stein variational gradient descent (SVGD) and stochastic SVGD (sSVGD), and provides implementations of 2D travel time tomography and 2D full waveform inversion, including test examples and solutions. Users can solve their own problems by supplying an appropriate forward function and a gradient calculation code. In addition, the package provides a scalable implementation which can be deployed easily on a desktop machine or on modern high performance computational facilities. The examples demonstrate that VIP is an efficient, scalable, extensible and user-friendly package that can be used to solve a wide range of low and high dimensional inverse problems in practice.


Introduction
In a variety of academic and practical applications that concern the Earth's subsurface we wish to find answers to specific scientific questions. In the geosciences this is often achieved by imaging subsurface properties using data recorded on the surface, and by interpreting those images to address questions of interest. The subsurface is usually parameterised in some way, and a physical relationship is defined that predicts data that would be recorded for any particular set of model parameters, while the inverse relationship cannot be determined uniquely. Once real data have been observed, the imaging problem is thus established as an inverse problem (Tarantola, 2005).
Because of non-linearity in the physical relationship, insufficient data coverage and noise in the data, inverse problems almost always have non-unique solutions: many sets of parameter values can fit the data to within their uncertainty. It is therefore important to characterize the family of possible solutions (in other words, the solution uncertainty) in order to interpret the results with the correct level of confidence, and to provide well-justified and robust answers to the scientific questions (Arnold and Curtis, 2018).
Solutions to an inverse problem are often found by seeking an optimal set of parameter values that minimizes the difference or misfit between observed data and model-predicted data to within the data noise. Since most inverse problems have non-unique solutions, some form of regularization is often imposed on the parameters in order to make the computational solution unique (Aki and Lee, 1976; Tarantola, 2005; Aster et al., 2018). Many codes have been developed using this class of methods (Rawlinson, 2005; Rücker et al., 2017; Afanasiev et al., 2019; Wathelet et al., 2020; Komatitsch et al., 2023). However, since regularization is often chosen using ad-hoc criteria, these methods produce deliberately biased results, and valuable information can be concealed in the process (Zhdanov, 2002). Moreover, no such optimisation method can provide accurate estimates of uncertainty. To overcome these issues, the SOLA-Backus-Gilbert inversion method has recently been applied to large scale linearised tomographic problems. This method evaluates weighted averages of the true model parameters and provides both resolution and uncertainty estimates (Zaroli, 2016; Zaroli et al., 2017). In addition, the method does not require regularization and can be conducted in a parameter-free way which avoids bias caused by parameterisation (Zaroli, 2019). Unfortunately, the method is only developed for linear problems; since most geophysical problems are significantly nonlinear, our goal is to provide methods that estimate solutions and uncertainties for that class of problems.
Bayesian inference solves both linear and nonlinear inverse problems by updating a prior probability density function (pdf) with new information contained in the data to produce a posterior pdf which describes the full state of information about the parameters post inversion (Tarantola, 2005). If we define the prior pdf as p(m), the posterior pdf p(m|d_obs) can be computed using Bayes' theorem:

p(m|d_obs) = p(d_obs|m) p(m) / p(d_obs)    (1)

where p(d_obs|m) is the likelihood function which describes the probability of observing the recorded data d_obs if the model parameters took the values in m, and p(d_obs) is a normalization factor called the evidence. This posterior pdf describes the full uncertainty in parameter values by combining the prior information and the uncertainty contained in the data. Markov chain Monte Carlo (McMC) is one commonly-used method to solve Bayesian inference problems and has been used widely in many fields. The method constructs a set (chain) of successive samples that are distributed according to the posterior pdf by performing a structured random walk through parameter space (Brooks et al., 2011); thereafter, these samples can be used to estimate statistical information about parameters in the posterior pdf (Mosegaard and Tarantola, 1995; Tarantola, 2005) and to find answers to specific scientific questions (Arnold and Curtis, 2018; Siahkoohi et al., 2022b; Zhang and Curtis, 2022; Zhao et al., 2022b; McKean et al., 2023). The Metropolis-Hastings algorithm is one such method that originates from physics (Metropolis and Ulam, 1949; Hastings, 1970), and has been applied to a range of geophysical inverse problems (Mosegaard and Tarantola, 1995; Malinverno et al., 2000; Andersen et al., 2001; Mosegaard and Sambridge, 2002; Sambridge and Mosegaard, 2002; Ramirez et al., 2005; Gallagher et al., 2009). However, the algorithm becomes inefficient in high dimensional space because of poor scaling due to its random walk behaviour.
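Bayes' theorem above can be made concrete with a toy one-dimensional linear problem in which the posterior is evaluated on a grid. The forward operator, datum and noise level below are illustrative values, not taken from VIP:

```python
import numpy as np

# Toy 1D Bayesian inversion: infer a parameter m from one noisy datum
# d_obs = G*m + noise. G, d_obs and sigma are illustrative values, not VIP's.
G = 2.0                      # linear forward operator: d = G * m
d_obs = 1.1                  # observed datum
sigma = 0.1                  # data noise standard deviation (Gaussian likelihood)

m_grid = np.linspace(0.0, 1.0, 1001)           # grid over parameter space
dm = m_grid[1] - m_grid[0]
prior = np.ones_like(m_grid)                   # Uniform prior p(m) on [0, 1]
likelihood = np.exp(-0.5 * ((d_obs - G * m_grid) / sigma) ** 2)
unnorm = prior * likelihood                    # numerator of Bayes' theorem
evidence = unnorm.sum() * dm                   # p(d_obs): normalisation factor
posterior = unnorm / evidence

mean = (m_grid * posterior).sum() * dm         # posterior mean, here d_obs/G = 0.55
```

Grid evaluation of the evidence is only feasible in very low dimensions, which is why sampling and variational methods are needed for realistic problems.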
In an attempt to improve the efficiency of Bayesian inference for certain types of problems, variational inference has been introduced to geophysics as an alternative to McMC. In variational inference one seeks a best approximation to the posterior pdf within a predefined family of (simplified) probability distributions by minimizing the difference between the approximating pdf and the posterior pdf (Bishop, 2006; Blei et al., 2017). One commonly-used measure of the difference between the pdfs is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), as it is easier to estimate computationally than other measures. Variational inference therefore solves Bayesian inference problems by minimizing the KL divergence, which is an optimisation rather than a stochastic sampling problem. The method has been demonstrated to be computationally more efficient and more scalable to high dimensionality in some classes of problems (Bishop, 2006; Zhang et al., 2018a). The method can also be applied to large datasets by dividing the data set into random minibatches and using stochastic and distributed optimisation (Robbins and Monro, 1951; Kubrusly and Gravier, 1973). By contrast, the same strategy cannot easily be used for McMC because it breaks the detailed balance condition required by most McMC methods (O'Hagan and Forster, 2004). In addition, variational inference methods can usually be parallelized at the individual sample level, whereas in McMC this cannot be achieved because of dependence between successive samples.
Variational inference has been applied to a range of geophysical inverse problems. Nawaz and Curtis (2018) used mean-field variational inference to invert for subsurface geological facies distributions and petrophysical properties using seismic data, with further developments by Nawaz and Curtis (2019) and Nawaz et al. (2020). Although these methods are computationally efficient, the mean-field approximation ignores correlations between parameters, and the methods of Nawaz and Curtis involved the development of bespoke mathematical derivations and implementations for each class of problem. While these developments result in exceptional speed of calculation, this approach restricts the method to a small range of problems for which correlations are not important and the derivations can be performed (Parisi, 1988; Bishop, 2006; Blei et al., 2017). To extend variational inference to general inverse problems, Kucukelbir et al. (2017) used a Gaussian family in variational inference to create a method called automatic differential variational inference (ADVI), which has been applied to travel time tomography (Zhang and Curtis, 2020a) and earthquake slip inversion (Zhang and Chen, 2022), and extended to the family of sums (mixtures) of multiple Gaussians by Zhao and Curtis (2024). By using a sequence of invertible and differentiable transforms (called normalizing flows), Rezende and Mohamed (2015) proposed normalizing flow variational inference, in which flows (functions, or simply, relationships) are designed which convert a simple initial distribution to an arbitrarily complex distribution that approximates the posterior pdf. In geophysics and related fields the method has been applied to travel time tomography (Zhao et al., 2022a), seismic imaging (Siahkoohi et al., 2020, 2022a), seismic data interpolation (Kumar et al., 2021), transcranial ultrasound tomography (Orozco et al., 2023) and cascading hazards estimation (Li et al., 2023).
By using a set of samples of parameter values (called particles) to represent the density of an approximating pdf, Liu and Wang (2016) introduced a method called Stein variational gradient descent (SVGD), which iteratively updates those particles by minimizing the KL divergence so that the final particle density provides an approximation to the posterior pdf. SVGD has been demonstrated to be an efficient method in a range of geophysical applications, such as travel time tomography (Zhang and Curtis, 2020a), full waveform inversion (FWI) (Zhang and Curtis, 2020b, 2021; Lomas et al., 2023; Wang et al., 2023), earthquake source inversion (Smith et al., 2022), hydrogeological inversion (Ramgraber et al., 2021), post-stack seismic inversion (Izzatullah et al., 2023) and neural network based seismic tomography (Agata et al., 2023). However, the method becomes inefficient and inaccurate in high dimensional problems because of the finite number of particles and the practical limitation of computational cost (Ba et al., 2022). To reduce this issue, Gallego and Insua (2018) introduced the stochastic SVGD (sSVGD) method by combining SVGD and McMC: the efficiency of this method has recently been demonstrated when it was used to estimate the first Bayesian solution for a fully nonlinear, 3D FWI problem (Zhang et al., 2023).
Despite these theoretical and practical advances, variational inference has not been widely used in geophysics. This is partly because the method is not easily accessible to non-specialists, and also because there is no common code framework to perform geophysical inversions using the method. In this study we therefore present a Python variational inversion package (VIP), which includes ADVI, SVGD and sSVGD, to make it more straightforward to solve geophysical inverse problems using variational inference methods. The package provides complete implementations of 2D travel time tomography and 2D full waveform inversion problems, including test results for users to check that their implementation is correct. Users can also solve other inverse problems by supplying their own forward functions and gradient calculation codes. In addition, to solve large inverse problems the package is designed in a scalable way such that it can be deployed on a desktop computer as well as in modern high performance computational (HPC) facilities.
In the following section we describe the concept of variational inference, and algorithmic details of ADVI, SVGD and sSVGD. In section 3 we provide an overview of the VIP package, and in section 4 we demonstrate VIP using examples of 2D travel time tomography and 2D full waveform inversion. We thus show that VIP is an efficient, scalable, extensible and user-friendly package that will enable users to solve geophysical inverse problems using variational methods. Making these methods more tractable for practitioners should allow them to be tested on a wide range of problems.

Variational inference
Variational inference solves Bayesian inference problems using optimisation. To achieve this, we first define a simplified family of pdf's Q = {q(m)}, for example, the family of all Gaussian distributions. The method then seeks an optimal approximation q*(m) to the posterior probability distribution p(m|d_obs) within this family by minimizing the KL divergence between q(m) and p(m|d_obs):

q*(m) = argmin_{q ∈ Q} KL[q(m) || p(m|d_obs)]    (2)

The KL divergence measures the difference between two probability distributions:

KL[q(m) || p(m|d_obs)] = E_q[log q(m)] − E_q[log p(m|d_obs)]
                       = E_q[log q(m)] − E_q[log p(m, d_obs)] + log p(d_obs)    (3)

where p(m, d_obs) is the joint distribution of model m and data d_obs. The expectations are calculated with respect to the known pdf q, and we have used Bayes' theorem to expand the posterior pdf p(m|d_obs) in the second line of equation (3). It can be shown that the KL divergence is non-negative and only equals zero when q(m) = p(m|d_obs) (Kullback and Leibler, 1951). Because the evidence term log p(d_obs) is computationally intractable, the KL divergence cannot be calculated directly. We therefore rearrange the above equation by moving the evidence term and the KL divergence onto the same side:

log p(d_obs) − KL[q(m) || p(m|d_obs)] = E_q[log p(m, d_obs)] − E_q[log q(m)]    (4)

Given that the KL divergence is non-negative, the left-hand side defines a lower bound on the evidence, which is therefore called the evidence lower bound (ELBO):

ELBO[q] = E_q[log p(m, d_obs)] − E_q[log q(m)]    (5)

The latter expression can be estimated in practice using numerical methods because it does not involve the intractable evidence term. Since the evidence log p(d_obs) is a constant for a specific problem, minimizing the KL divergence is equivalent to maximizing the ELBO. Variational inference in equation (2) can therefore be expressed as:

q*(m) = argmax_{q ∈ Q} ELBO[q]    (6)

In variational inference, the choice of the variational family Q is important because it determines both the accuracy of the approximation and the complexity of the optimisation problem. A good choice should be a family which is rich enough to approximate the posterior pdf accurately, or at least to provide the information that we seek, but simple enough that the optimisation problem is tractable. Different choices of family may also allow different types of algorithm to be developed. In the VIP package we implement three different algorithms, ADVI, SVGD and sSVGD, to solve inverse problems.
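The ELBO can be estimated by Monte Carlo integration: draw samples from q and average log p(m, d_obs) − log q(m). A minimal one-dimensional sketch, with an illustrative target whose posterior is N(1, 2^2) and a Gaussian variational family:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised log joint log p(m, d_obs) for a toy 1D problem whose posterior
# is N(1, 2^2); target and variational family are illustrative only.
def log_joint(m):
    return -0.5 * ((m - 1.0) / 2.0) ** 2

# ELBO of a Gaussian q(m) = N(mu, s^2), estimated by Monte Carlo integration:
# draw m ~ q, then average log p(m, d_obs) - log q(m).
def elbo(mu, s, n_samples=100000):
    m = rng.normal(mu, s, n_samples)
    log_q = -0.5 * ((m - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_joint(m) - log_q)

# The ELBO is maximal when q equals the posterior (mu = 1, s = 2)
print(elbo(1.0, 2.0) > elbo(0.0, 1.0))   # True
```

When q exactly matches the posterior, the ELBO equals the log evidence, which for this unnormalised joint is log 2 + 0.5 log 2π.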

Automatic differential variational inference (ADVI)
ADVI uses the family of (transformed) Gaussians to solve variational inference problems (Kucukelbir et al., 2017). The transform arises because physical model parameters describe quantities that often have hard bounds, whereas Gaussian variables have infinite support. We therefore first transform the physical parameters into an unconstrained space using an invertible transform T: θ = T(m). In this unconstrained space the joint distribution p(m, d_obs) becomes:

p(θ, d_obs) = p(T^{-1}(θ), d_obs) |det J_{T^{-1}}(θ)|    (7)

where J_{T^{-1}}(θ) is the Jacobian matrix of the inverse of T, which accounts for the effects of changes in hypervolume between the unconstrained and constrained parameter spaces. In this unconstrained space we define a Gaussian variational family:

q(θ; ζ) = N(θ; μ, Σ)    (8)

where ζ represents the variational parameters, that is, the mean vector μ and the covariance matrix Σ. To ensure that the covariance matrix Σ is positive semi-definite, we use a Cholesky factorization Σ = LL^T, where L is a lower triangular matrix, to reparameterise Σ.
With the above definition, the variational problem in equation (6) becomes:

ζ* = argmax_ζ ELBO[q(θ; ζ)]    (9)

This optimisation problem can be solved by using a gradient ascent algorithm. As shown in Kucukelbir et al. (2017), the gradients of the ELBO with respect to the variational parameters μ and L can be calculated using:

∇_μ ELBO = E_{N(η|0,I)}[∇_θ log p(θ, d_obs)]    (10)
∇_L ELBO = E_{N(η|0,I)}[∇_θ log p(θ, d_obs) η^T] + (L^{-1})^T    (11)

where θ = μ + Lη and η is a random variable distributed according to the standard normal distribution N(η|0, I). The expectations can be estimated using Monte Carlo (MC) integration, which in practice only requires a small number of samples because the optimisation is performed over many iterations, so that statistically the gradients lead to convergence towards the correct solution (Kucukelbir et al., 2017). The variational problem in equation (9) can now be solved using gradient ascent methods. In the VIP package we implement four optimisation algorithms: stochastic gradient descent (SGD), ADAGRAD (Duchi et al., 2011), ADADELTA (Zeiler, 2012) and ADAM (Kingma and Ba, 2014). The final approximation to the Bayesian solution can be obtained by transforming q(θ; ζ*) back to the original space.
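The reparameterised gradient ascent above can be sketched in one dimension, where μ and L are scalars. The target posterior N(3, 0.5^2), the learning rate and the Monte Carlo sample size below are illustrative choices, not VIP's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1D sketch of ADVI's reparameterised gradient ascent. The target posterior
# N(3, 0.5^2) and all tuning values are illustrative, not from VIP.
def grad_log_joint(theta):
    return -(theta - 3.0) / 0.5 ** 2

mu, L = 0.0, 1.0          # variational mean and (scalar) Cholesky factor
lr = 0.05                 # learning rate
for _ in range(2000):
    eta = rng.normal(size=8)              # small MC sample, as in the text
    theta = mu + L * eta                  # reparameterisation theta = mu + L*eta
    g = grad_log_joint(theta)
    grad_mu = g.mean()                    # MC estimate of the ELBO gradient in mu
    grad_L = (g * eta).mean() + 1.0 / L   # entropy term contributes 1/L in 1D
    mu += lr * grad_mu
    L += lr * grad_L

# mu approaches 3 and |L| approaches 0.5 as the ELBO is maximised
```

Because the MC gradients are noisy, the estimates fluctuate around the optimum; in practice adaptive optimisers such as ADAM damp these fluctuations.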
For the transform T we implement a commonly-used logarithmic transform (Stan Development Team, 2016; Zhang and Curtis, 2020a):

θ_i = log(m_i − a_i) − log(b_i − m_i)    (12)

where m_i and θ_i represent the i-th parameter in the original and transformed space respectively, and a_i and b_i are the lower and upper bounds on m_i. The final approximation obtained using ADVI is therefore limited in complexity by the Gaussian distribution q(θ; ζ*) and the transform T. Note that if no transform is performed, the method approximates the posterior pdf using a Gaussian distribution directly.
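The logarithmic transform and its inverse (a logistic function) can be written compactly; the bounds below are illustrative values:

```python
import numpy as np

# The logarithmic transform of equation (12) and its inverse; the bounds a, b
# are illustrative values.
def to_unconstrained(m, a, b):
    return np.log(m - a) - np.log(b - m)

def to_constrained(theta, a, b):
    return a + (b - a) / (1.0 + np.exp(-theta))   # logistic inverse of the transform

a, b = 1.56, 4.8
m = np.array([2.0, 3.0, 4.5])
theta = to_unconstrained(m, a, b)      # unconstrained: any real value allowed
m_back = to_constrained(theta, a, b)   # round trip recovers the original values
print(np.allclose(m, m_back))          # True
```

The midpoint (a + b)/2 maps to θ = 0, and values approaching the bounds map to ±∞, which is how the hard constraints become compatible with a Gaussian family.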

Stein variational gradient descent (SVGD)
Instead of using a specific form of pdf (for example, the Gaussian distribution in ADVI) in variational inference, it is also possible to use the density of a set of samples to represent the approximating probability distribution. SVGD is one such method, in which the set of samples are called particles. In SVGD those particles are iteratively updated by minimizing the KL divergence so that the density of the final set of particles is distributed according to the posterior probability distribution. If we define the set of particles as {m_i}, SVGD updates each particle using a smooth transform:

T(m_i) = m_i + ε φ(m_i)    (13)

where m_i is the i-th particle, φ(m_i) is a smooth vector function which describes the perturbation direction, and ε is the magnitude of the perturbation. When ε is sufficiently small, the transform is invertible since the Jacobian of the transform is close to an identity matrix. Denote by q(m) the pdf represented by the set of particles, and by q_T(m) the transformed probability distribution of q(m) under equation (13). In order to reduce the KL divergence between q_T(m) and p(m|d_obs), we first calculate the gradient of the KL divergence with respect to ε, which is found to be (Liu and Wang, 2016):

∇_ε KL[q_T || p] |_{ε=0} = −E_q[trace(A_p φ(m))]    (14)

where A_p is the Stein operator defined as A_p φ(m) = φ(m) ∇_m log p(m|d_obs)^T + ∇_m φ(m). This equation implies that one can obtain the steepest descent direction of the KL divergence by maximizing the right-hand expectation E_q[trace(A_p φ(m))], and consequently the KL divergence can be reduced by stepping a small distance in that direction. Iteratively re-calculating equation (14) and stepping in each revised direction locates a minimum in the KL divergence.
The optimal direction φ* that maximizes the expectation E_q[trace(A_p φ(m))] in equation (14) can be found using kernels. Assume x, y ∈ X and define a mapping φ from X to a space in which an inner product ⟨·, ·⟩ is defined (called a Hilbert space); a kernel is a function that satisfies k(x, y) = ⟨φ(x), φ(y)⟩. Given a kernel function k(m′, m), the optimal φ* can be calculated using (see details in Liu and Wang, 2016):

φ*(m) = E_{m′∼q}[k(m′, m) ∇_{m′} log p(m′|d_obs) + ∇_{m′} k(m′, m)]    (15)

In the VIP package, we implement a commonly-used kernel function, the radial basis function (RBF):

k(m′, m) = exp(−||m − m′||² / (2h²))    (16)

where h is a scale factor that controls the magnitude of similarity between two particles based on their distance apart. Given equations (14) and (15), the KL divergence can be minimized by iteratively applying the transform in equation (13) with the optimal φ* to a set of initial particles:

m_i^{l+1} = m_i^l + ε_l φ*_l(m_i^l)    (17)

where l represents the l-th iteration. Note that the expectation in equation (15) can be estimated using the particles' mean value, so we can compute φ*_l using:

φ*_l(m) = (1/n) Σ_{j=1}^{n} [k(m_j^l, m) ∇_{m_j^l} log p(m_j^l|d_obs) + ∇_{m_j^l} k(m_j^l, m)]    (18)

where n is the number of particles. For sufficiently small {ε_l} the transform is invertible, and the process converges to the posterior distribution asymptotically as n → ∞ (Liu and Wang, 2016). Note that even though the posterior distribution p(m_j^l|d_obs) is unknown in practice, we can always calculate its value up to an unknown constant for a specific model. As a result, its gradient ∇_{m_j^l} log p(m_j^l|d_obs) can be obtained, and hence φ*_l.
The first term in equation (15) is the kernel-weighted average of the gradients of the posterior pdf from all particles, and drives particles toward high probability areas. For the RBF kernel the second term becomes Σ_j ((m − m_j)/h²) k(m_j, m), which moves each particle away from its neighbouring particles. This term therefore acts as a repulsive force that prevents particles from collapsing to a single mode. SVGD balances the drive towards high probabilities against the repulsive force such that the density of particles moves towards the posterior pdf.
Note that the scale factor h in the RBF kernel controls the weighting between particles. As suggested in several studies (Liu and Wang, 2016; Zhang and Curtis, 2020a), we take h = d̃/√(2 log n), where d̃ is the median of the pairwise distances between all particles. This choice ensures that for particle m_i the contribution from its own gradient is balanced by that from all other particles, since Σ_{j≠i} k(m_i, m_j) ≈ n exp(−d̃²/(2h²)) = 1. If h → 0, the method reduces to independent gradient ascent for each particle.
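The SVGD update with the RBF kernel and the median heuristic can be sketched in one dimension. The target posterior N(2, 1), step size and particle count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal 1D SVGD sketch using the update of equation (18), the RBF kernel
# and the median heuristic for h. The target posterior N(2, 1) is illustrative.
def grad_log_post(m):
    return -(m - 2.0)

def svgd_step(m, eps):
    n = len(m)
    diff = m[:, None] - m[None, :]                    # diff[i, j] = m_i - m_j
    h = np.median(np.abs(diff)) / np.sqrt(2 * np.log(n))
    k = np.exp(-diff ** 2 / (2 * h ** 2))             # RBF kernel matrix
    grad_k = diff / h ** 2 * k                        # gradient of k w.r.t. m_j
    phi = (k @ grad_log_post(m) + grad_k.sum(axis=1)) / n
    return m + eps * phi                              # transform of equation (13)

m = rng.normal(-5.0, 0.5, 50)      # particles start far from the target
for _ in range(1000):
    m = svgd_step(m, 0.1)

# particle mean approaches 2; the repulsive term keeps the particles spread out
```

Without the repulsive `grad_k` term all particles would collapse onto the MAP model; with it, the final spread of the particles approximates the posterior standard deviation.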
In SVGD the accuracy of the estimate increases with the number of particles. For a single particle the method reduces to standard gradient ascent towards the model with maximum a posteriori (MAP) pdf value. This implies that even for a small number of particles SVGD can still produce an accurate parameter estimate, since MAP estimation has been demonstrated to be an effective method in practice. Thus, one can start from a small number of particles and gradually increase that number to produce more accurate estimates of the uncertainty.

Stochastic SVGD
Although SVGD has been applied in many fields (Gong et al., 2019; Zhang and Curtis, 2020a; Pinder et al., 2020; Ramgraber et al., 2021; Ahmed et al., 2022), the method can produce biased results in high dimensional problems because of the finite number of particles and the limitation of computational cost in practice (Ba et al., 2022). In order to further improve the accuracy of the method, Gallego and Insua (2018) proposed a variant of SVGD, called stochastic SVGD (sSVGD), which combines SVGD and McMC by adding a Gaussian noise term to the dynamics of SVGD. By doing this sSVGD becomes an McMC method with multiple interacting Markov chains, and since every set of particle values can be regarded as a sample of the posterior pdf, the method can generate many samples that are distributed according to the posterior pdf. Under certain conditions (see below), sSVGD guarantees asymptotic convergence to the posterior pdf as the number of iterations tends to infinity, which standard SVGD with a finite number of particles cannot achieve. As a result sSVGD can produce more accurate results than the SVGD method, provided that the number of iterations is sufficient to remove effects of the distribution of samples near the start of the chain (the so-called burn-in period) (Gallego and Insua, 2018; Zhang et al., 2023).
To introduce sSVGD, we start from a stochastic differential equation (SDE). For a random variable z, the SDE is defined as:

dz = f(z) dt + √(2 D(z)) dW(t)    (19)

where f(z) is called the drift, W(t) is a Wiener process, and D(z) represents a positive semidefinite diffusion matrix. All continuous Markov processes can be expressed as an SDE, and consequently one can construct a Markov chain by simulating the SDE (Oksendal, 2013). Taking p(z) to be the posterior distribution, an SDE that converges to p(z) can be constructed as (Ma et al., 2015):

f(z) = [D(z) + Q(z)] ∇ log p(z) + Γ(z),    Γ_i(z) = Σ_j ∂/∂z_j [D_ij(z) + Q_ij(z)]    (20)

where Q(z) is a skew-symmetric curl matrix and Γ(z) is a correction term. To simulate this process, we can discretize the above equation using the Euler-Maruyama discretization:

z_{t+1} = z_t + ε_t [(D(z_t) + Q(z_t)) ∇ log p(z_t) + Γ(z_t)] + N(0, 2ε_t D(z_t))    (21)

where N(0, 2ε_t D(z_t)) represents a Gaussian distribution with covariance 2ε_t D(z_t). The gradient ∇ log p(z_t) can be computed using the full data set, or using uniformly randomly selected minibatch data subsets, which results in a stochastic gradient approximation. In either case the above process converges to the posterior distribution asymptotically as ε_t → 0 and t → ∞ (Ma et al., 2015). The matrices D(z) and Q(z) can be adjusted to obtain faster convergence to the posterior distribution. For example, if we set D = I and Q = 0, we obtain stochastic gradient Langevin dynamics (Welling and Teh, 2011). If we construct an augmented space z̄ = (z, x) by concatenating a momentum term x to the state space z, and set

D = 0,    Q = [[0, −I], [I, 0]]

then the stochastic gradient Hamiltonian Monte Carlo method can be derived (Chen et al., 2014).
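The special case D = I, Q = 0 of the discretization in equation (21), stochastic gradient Langevin dynamics, can be simulated in a few lines. The target posterior N(0, 1), step size and chain length below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Euler-Maruyama simulation of equation (21) with D = I and Q = 0, i.e.
# stochastic gradient Langevin dynamics, for an illustrative 1D posterior N(0, 1).
def grad_log_post(z):
    return -z

eps = 0.01                     # step size
z = 5.0                        # deliberately poor starting point
samples = []
for t in range(30000):
    noise = rng.normal(0.0, np.sqrt(2 * eps))     # draw from N(0, 2*eps*D), D = 1
    z = z + eps * grad_log_post(z) + noise
    if t >= 5000:                                 # discard burn-in samples
        samples.append(z)

samples = np.array(samples)
# sample mean approaches 0 and standard deviation approaches 1
```

The burn-in period here plays the same role as in the sSVGD inversions described later: samples generated before the chain reaches the stationary distribution are discarded.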
In sSVGD we define an augmented space z = (m_1, m_2, ..., m_n) by concatenating the set of particles {m_i}, and use equation (21) to generate samples from the posterior distribution p(z). Define the nd × nd matrix K whose (i, j)-th d × d block is:

K_ij = (1/n) k(m_i, m_j) I_{d×d}    (22)

where k(m_i, m_j) is the kernel function defined in equation (16), I_{d×d} is an identity matrix and d is the dimensionality of each particle. According to the definition of kernel functions, the matrix K is positive definite (Gallego and Insua, 2018). By setting Q(z_t) = 0 and D(z_t) = K, we obtain the stochastic SVGD algorithm:

z_{t+1} = z_t + ε_t [K ∇_z log p(z_t) + ∇_z · K] + N(0, 2ε_t K)    (23)

Note that without the noise term N(0, 2ε_t K), the above equation becomes the standard SVGD method: the drift term in equation (23) is the update direction of equation (18), repeated here:

φ*_l(m) = (1/n) Σ_{j=1}^{n} [k(m_j^l, m) ∇_{m_j^l} log p(m_j^l|d_obs) + ∇_{m_j^l} k(m_j^l, m)]

sSVGD is therefore an McMC method that uses the gradients from SVGD to produce successive samples. According to equation (20), this process converges to p(z) = Π_{i=1}^{n} p(m_i|d_obs) asymptotically. Note that when n is sufficiently large, the noise term N(0, 2ε_t K) becomes arbitrarily small because of the 1/n factor in K. In such cases sSVGD and SVGD produce the same results.
The process defined in equation (23) requires samples to be generated from the distribution N(0, 2ε_t K). In order to perform this efficiently, we first define the block-diagonal matrix

D_K = (1/n) diag(K, ..., K)    (d copies of K)

where K here is the n × n matrix with K_ij = k(m_i, m_j). The matrix D_K can be constructed from the nd × nd matrix K of equation (22) using D_K = PKP^T, where P is a permutation matrix. The action of this permutation matrix on a vector z rearranges the order of the vector from the basis in which the particles are listed sequentially to that in which the first coordinates of all particles are listed, then the second coordinates, and so on. With these definitions, a random sample η from N(0, 2ε_t K) can be generated efficiently using

η = √(2ε_t) P^T L_{D_K} ξ,    ξ ∼ N(0, I)

where L_{D_K} is the lower triangular Cholesky decomposition of the matrix D_K. Taking into account the fact that D_K is block-diagonal, L_{D_K} can be computed easily since only the lower triangular Cholesky decomposition of the n × n matrix K is required. In practice this calculation is computationally negligible because the number of particles n is usually modest (< 1000). One can now use equation (23) to generate samples from the posterior distribution.
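The block-diagonal structure can be exploited as follows. In this sketch the permutation is implemented implicitly by working dimension by dimension: each column of the standard normal draw is multiplied by the same small Cholesky factor. Particle positions, the kernel scale h = 1 and the step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Drawing the noise term N(0, 2*eps*K) of equation (23) using only the
# Cholesky factor of the small n x n kernel matrix. Working dimension by
# dimension realises the permutation described in the text implicitly.
n, d, eps = 5, 3, 0.01
particles = rng.normal(size=(n, d))
sq = ((particles[:, None] - particles[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0) / n                # n x n RBF kernel matrix with 1/n scaling
L = np.linalg.cholesky(K)                # small Cholesky: O(n^3) with modest n

xi = rng.normal(size=(n, d))             # independent standard normal draws
noise = np.sqrt(2 * eps) * (L @ xi)      # column j: correlated noise in dimension j
# noise[i, :] is the Gaussian perturbation applied to particle i
```

Because `L @ xi` handles all d columns with one n × n factor, the cost is O(n³ + n²d) rather than the O((nd)³) of a naive Cholesky of the full covariance.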

Code overview
The VIP package implements the above suite of variational methods to solve geophysical inverse problems using the Python programming language. The package includes a set of specific forward and inverse problems such as 2D travel time tomography and 2D full waveform inversion, and also allows users to provide their own forward functions. In variational inference one needs to compute the gradient of the posterior pdf with respect to model parameters. We use the adjoint method to calculate the gradient in the case of seismic full waveform inversion (Lions, 1971; Tarantola, 1984; Tromp et al., 2005; Fichtner et al., 2006; Plessix, 2006), and the ray tracing method in the case of travel time tomography (Rawlinson and Sambridge, 2004). For user-specified forward problems, users are required to implement their own function that computes gradients. The prior pdf is important in Bayesian inference as it provides information about model parameters independent of the data. The VIP package provides two commonly-used prior distributions: Uniform and Gaussian pdf's (note that these are only used as prior pdf's, and do not place any additional constraints on the variational families described above). To implement the Uniform distribution we employ two strategies. In the first strategy we impose hard constraints on model parameters: any parameter that assumes a value outside the distribution is reset to the closest limit. Note that a similar strategy cannot be used in ADVI as the method assumes a Gaussian variational family, which cannot be defined in a constrained space. The second strategy uses equation (12) to transform model parameters into an unconstrained space and performs variational inversion in that space, which provides a more flexible way to employ a variety of variational families. In addition, users can provide their own prior distributions by implementing an appropriate pdf function (see details in the code documentation). Python is a popular
high-level interpreted programming language which suffers from slow execution for computationally intensive numerical simulations. We therefore implement time-consuming components of the code (e.g., the forward modelling functions) in Fortran, and produce compiled C extensions for these codes using the Cython framework (Behnel et al., 2010). By doing this the code achieves C-like speeds. To further improve the efficiency of the code, we use a Python library called Dask, which is designed for parallel and distributed computing, to parallelize the forward computation at the sample (particle) level (Rocklin et al., 2015). The package therefore provides an efficient, scalable and user-friendly implementation which can be deployed on a desktop as well as on modern high performance computation facilities. Our aim is to implement a framework which can be used to solve various inverse problems, ranging from educational examples to complex, realistic studies.
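VIP performs this sample-level parallelism with Dask; the same pattern can be sketched using only the standard library, since each particle's forward simulation is independent of the others. The forward function below is a trivial stand-in for a real solver:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sample-level parallelism: each particle's forward simulation is independent,
# so all can run concurrently. VIP uses Dask for this; ThreadPoolExecutor is a
# standard-library stand-in, and forward() below is a trivial placeholder.
def forward(model):
    return float(np.sum(model ** 2))

particles = [np.full(4, i, dtype=float) for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    predictions = list(pool.map(forward, particles))

print(predictions[:3])   # [0.0, 4.0, 16.0]
```

With Dask the executor is replaced by a distributed scheduler, so the same map-over-particles pattern scales from a desktop to an HPC cluster without changing the inversion code.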
Figure 1 shows the structure of VIP. The inversion code (vip in Figure 1) is implemented separately from the forward modelling codes (forward in Figure 1), and only requires an interface of forward functions that returns logarithmic posterior pdf values and gradients (details can be found in the code documentation and in the two examples tomo and fwi2d). Thus, users can easily combine their own forward functions with the package. In the vip code the prior distributions, kernel functions and variational algorithms are implemented in three different directories (prior, kernel and pyvi in Figure 1) so that the code can easily be extended to other prior pdfs, kernel functions and variational methods. For example, users can implement their own prior pdfs by adding a proper pdf function in the pdf code in the prior directory. Note that both the SVGD and sSVGD methods are implemented in the svgd code.

Travel time tomography
As a first example we use the VIP package to solve a 2D tomographic problem. Specifically, we create Love wave group velocity maps of the British Isles using ambient seismic noise data recorded by 61 seismometers (blue triangles in Figure 2a). The geological setting and the main terrane boundaries of the British Isles are shown in Figure 2b. The ambient noise data were recorded in 2001-2003, 2006-2007 and 2010 using three different subarrays. The two horizontal components of the data (N and E) were first rotated to the transverse and radial directions, and the transverse data were cross correlated to produce Love waves between different station pairs. Travel times associated with the group velocity at different periods were then estimated from those Love waves for each station pair. Details of the data processing procedures can be found in Galetti et al. (2017). In this study we use a total of 401 travel time measurements at 10 s period.
We parameterise the study region using a regular grid of 37 × 40 cells with a spacing of 0.33° in both the longitude and latitude directions. The prior pdf for the group velocity in each cell is set to be a Uniform distribution between 1.56 km/s and 4.8 km/s, whose lower and upper bounds were chosen to exceed the range of average group velocities between all station pairs when assuming great circle ray paths (Zhao et al., 2022a). The likelihood function is chosen to be a Gaussian distribution to represent the data noise, which is estimated from independent travel time measurements made by stacking randomly selected subsets of daily cross correlations (Galetti et al., 2017). In the inversion the predicted travel times are calculated using the fast marching method (Rawlinson and Sambridge, 2004).
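The ingredients above define a log-posterior of the kind a VIP-style inversion evaluates. The following toy sketch combines a Uniform prior on cell velocities with a Gaussian likelihood over travel times, using a two-cell linear forward operator with illustrative ray lengths and noise in place of the fast marching solver:

```python
import numpy as np

# Sketch of the log-posterior that a VIP-style inversion evaluates: a Uniform
# prior on cell velocities plus a Gaussian likelihood over travel times. The
# two-cell linear forward operator G (ray lengths) stands in for the fast
# marching solver; all numbers are illustrative.
a, b = 1.56, 4.8                          # prior velocity bounds (km/s)
G = np.array([[10.0, 5.0], [0.0, 12.0]])  # ray lengths through each cell (km)
d_obs = np.array([6.0, 4.0])              # observed travel times (s)
sigma = np.array([0.1, 0.1])              # travel time noise std (s)

def log_posterior(v):
    if np.any(v < a) or np.any(v > b):
        return -np.inf                    # zero prior probability outside bounds
    t_pred = G @ (1.0 / v)                # travel time = sum of length / velocity
    return -0.5 * np.sum(((d_obs - t_pred) / sigma) ** 2)   # up to a constant

v_good = np.array([2.5, 3.0])             # predicts ~[5.67, 4.0] s: close fit
v_bad = np.array([4.5, 4.5])              # predicts ~[3.33, 2.67] s: poor fit
print(log_posterior(v_good) > log_posterior(v_bad))   # True
```

A function of exactly this shape (log-posterior value plus its gradient) is what the forward interface must supply for the variational algorithms to operate on.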
We apply the above suite of methods to solve this tomographic problem, and compare the results with those obtained using the Metropolis-Hastings McMC (MH-McMC) method (Zhao et al., 2022a). The Uniform prior distribution is implemented using the second strategy, which transforms variables into an unconstrained space for the variational inversions. For ADVI, we started the method with a standard Gaussian distribution in the unconstrained space, and performed 10,000 iterations using the ADAM optimisation algorithm (Kingma and Ba, 2014), at which point the misfit value ceased to decrease. To visualize the results we generated 5,000 samples from the obtained Gaussian distribution and transformed them back to the original space to estimate posterior statistics. For SVGD, we generated 500 particles from the prior distribution and updated them using equation (18) for 3,000 iterations, at which point the mean and standard deviation models became stable. The final particles are used to calculate the mean and standard deviation of the posterior distribution. For sSVGD, we started from 20 particles generated from the prior distribution, and updated them using equation (23) for 6,000 iterations after an additional burn-in period of 2,000 iterations, after which the average misfit value across all particles became approximately stationary. To reduce the memory and storage cost, we only retained samples every fourth iteration after the burn-in period, which results in a total of 30,000 samples.
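For reference, the SVGD particle update referred to above follows the standard algorithm of Liu and Wang (2016). A minimal sketch of one update step with a radial basis function kernel and a median-heuristic bandwidth might look like the following; this is a generic illustration of the algorithm, not the VIP code itself.

```python
import numpy as np

def svgd_step(x, grad_logp, step=1e-2, h=None):
    """One SVGD update with an RBF kernel (sketch of the standard
    algorithm, not VIP's implementation).
    x: (n, d) particles; grad_logp: (n, d) log posterior gradients."""
    n = x.shape[0]
    # pairwise squared distances between particles
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)  # (n, n)
    if h is None:
        h = np.median(d2) / np.log(n + 1) + 1e-12  # median heuristic
    k = np.exp(-d2 / h)                            # kernel matrix
    # attractive term k @ grad_logp pulls particles toward high
    # probability; the repulsive term sum_j grad_{x_j} k(x_j, x_i)
    # pushes particles apart to represent the posterior spread
    repulse = 2.0 / h * (x * k.sum(axis=1, keepdims=True) - k @ x)
    phi = (k @ grad_logp + repulse) / n
    return x + step * phi
```

sSVGD augments this deterministic update with a noise term so that the particles form a Markov chain sampling the posterior, which is why samples from many iterations can be accumulated as in the description above.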
Figure 3 shows the mean and standard deviation maps obtained using the suite of variational methods, as well as those obtained using the MH-McMC algorithm (Zhao et al., 2022a). Overall the results obtained using different methods show similar mean structures, which are in good agreement with the known geology and with previous tomographic studies of the British Isles (Nicolson et al., 2012, 2014; Galetti et al., 2017; Zhao et al., 2022a). For example, in the Scottish Highlands the mean maps clearly exhibit high velocities (annotation 1 in Figure 3) which are consistent with the distribution of Lewisian and Dalradian complexes in this area. Similarly, high velocities associated with the accretionary complex of the Southern Uplands (annotation 2) are clearly visible around 4°W, 55°N following a SW-NE trend. Between the Highland Boundary Fault and the Southern Uplands Fault a low velocity zone with a similar trend (annotation 3) is found in the Midland Valley. Low velocities are also observed in a number of sedimentary basins such as the East Irish Sea (4.5°W, 54°N - annotation 4), the Cheshire Basin (2.5°W, 52.5°N - annotation 6), the Anglian-London Basin (0°, 52°N - annotation 7), the Weald Basin (0°, 51°N - annotation 8) and the Wessex Basin (3°W, 50.5°N - annotation 9). By contrast, high velocities are found in regions of granitic intrusions, for example in northwest Wales (around 4°W, 53°N - annotation 5) and Cornwall (around 4.5°W, 50.5°N - annotation 10). More detailed discussion and interpretation of the velocity structures can be found in Galetti et al. (2017).
Among these results the mean map obtained using ADVI shows the smoothest structure, whereas the other maps provide more detailed information. This has also been observed in previous studies (Zhang and Curtis, 2020a; Zhao et al., 2022a) and is likely caused by the limitation of the implicit Gaussian assumption made in ADVI. In far offshore areas, because few ray paths cross the open marine regions, the mean maps obtained using ADVI and SVGD show almost homogeneous velocity structure whose value is consistent with the mean of the prior distribution. In comparison, the results obtained using sSVGD and MH-McMC exhibit more heterogeneous structures, which probably indicates that the two methods have not converged sufficiently. These areas are only loosely constrained by the data (or not at all) and hence have large posterior uncertainties, requiring many more randomly generated samples to explore and represent the posterior distribution accurately compared to areas with tighter constraints from the data. Note that both sSVGD and MH-McMC involve random sampling of the posterior distribution, whereas samples obtained using SVGD are found deterministically by optimisation. As a result, SVGD produces smoother results (Zhang and Curtis, 2021; Zhang et al., 2023).
Overall the standard deviation maps obtained using SVGD, sSVGD and MH-McMC show similar structures. For example, the results show lower uncertainties in the Scottish Highlands and southern England because of the dense arrays in those areas. In the offshore areas the standard deviation is around 0.93 km/s, which is the standard deviation of the prior, since no ray path crosses these regions. On the east side of the island just off the coast, although no seismometer is deployed there, some rays do travel through those areas (see details in Galetti et al., 2017), and consequently the standard deviation is smaller than that of the prior. There is a loop of high uncertainty around the low velocity anomaly in the Anglian-London Basin (annotation 7 in Figure 3), which has also been observed in previous studies (Galetti et al., 2015, 2017) and reflects uncertainty in the shape of the anomaly. In addition, the East Irish Sea (annotation 4) shows high uncertainties. This is probably because few ray paths pass through this area due to its lower velocity, and consequently the area is not well constrained by the data. By contrast, the standard deviation map obtained using ADVI shows different features. Although in the Scottish Highlands the results still show lower uncertainty, the rest of the area within the receiver array has almost the same uncertainty level with little variation. In addition, in the West Irish Sea and in the North Sea between northern Scotland and the Shetland Islands the results show lower uncertainties which are not observed in the results obtained using the other methods. This suggests that ADVI can produce biased results because of its underlying Gaussian assumption, as found in previous studies (Zhang and Curtis, 2020a).
Table 1 compares the number of forward simulations required by each method to obtain these results, which provides a good metric of the computational cost since the forward simulation is the most computationally expensive component of each method. Note that the three variational methods require computation of derivatives of the posterior pdf with respect to the model parameters, which adds computational cost compared with the MH-McMC method. In this travel time tomography example the derivatives are calculated using ray paths, which are traced through the computed travel time field. This calculation requires a computation equivalent to approximately 0.08 forward simulations. We therefore compute the equivalent number of simulations by multiplying the number of simulations required by the three variational methods by 1.08, shown in the third column of Table 1.
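The conversion above is trivial, but writing it out makes the accounting explicit; the helper name is purely illustrative.

```python
def equivalent_simulations(n_forward, gradient_cost=0.08):
    """Convert a raw forward-simulation count to an equivalent count
    when each gradient evaluation costs gradient_cost of one forward
    solve (0.08 for the ray-based gradients described above)."""
    return n_forward * (1 + gradient_cost)

# e.g. ADVI's 10,000 simulations correspond to 10,800 equivalent simulations
```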
The results indicate that ADVI is apparently the most efficient method as it requires only 10,000 simulations, but we have demonstrated that the method probably produces biased results. SVGD demands the highest computational cost among the three variational methods, while sSVGD requires about 10 times fewer simulations than SVGD. This makes sSVGD a good choice for practical applications, as noted in Zhang et al. (2023). Nevertheless, all three variational methods are significantly more efficient than the basic MH-McMC method implemented here as a benchmark, which required 15 million simulations in total across 10 independent parallel chains.
We note that the above comparison depends on subjective assessment of the point of convergence of each method, so the absolute number of simulations required by each method may not be entirely accurate (especially the number used for the MH-McMC algorithm). Nevertheless the comparison at least provides insight into the relative computational cost of each method. A more careful and thorough comparison between the same MH-McMC method and variational methods can be found in Zhao et al. (2022a), which again demonstrated that variational methods are computationally efficient.

Full-waveform inversion
For the second example we use the VIP package to solve a 2D full waveform inversion problem. The input model is selected to be a part of the Marmousi model (Figure 4a; Martin et al., 2006), and is discretized using a regular 120 × 200 grid with a spacing of 20 m. Ten sources are equally spaced at 20 m water depth (red stars in Figure 4), and 200 receivers are equally spaced at a depth of 360 m on the seabed across the horizontal extent of the model. We simulate the waveform data using a time-domain finite difference method with a Ricker wavelet of 10 Hz central frequency, and add Gaussian noise to the data whose standard deviation is set to 2 percent of the median of the maximum amplitudes of the seismic traces. The gradients of the logarithm of the posterior pdf with respect to velocity are calculated using the adjoint method (Tarantola, 1988; Tromp et al., 2005; Fichtner et al., 2006; Plessix, 2006).
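The source wavelet and the noise model described above can be sketched as follows. The Ricker wavelet definition is standard; the noise helper mirrors the 2 percent rule stated in the text, and both function names are illustrative rather than part of the VIP API.

```python
import numpy as np

def ricker(t, f0=10.0):
    """Standard Ricker wavelet with central frequency f0 (Hz),
    evaluated at times t (s)."""
    a = (np.pi * f0 * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def add_trace_noise(data, frac=0.02, rng=None):
    """Add Gaussian noise whose standard deviation is frac times the
    median of the per-trace maximum amplitudes, as described above.
    data: array of shape (n_traces, n_samples). Illustrative helper."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = frac * np.median(np.max(np.abs(data), axis=1))
    return data + rng.normal(0.0, sigma, data.shape)
```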
The prior distribution is set to be a Uniform distribution over an interval of 2 km/s at each depth (Figure 4b). To ensure that the rock velocity is higher than the velocity in the water, we impose an extra lower bound of 1.5 km/s. For the likelihood function we use a Gaussian distribution to represent uncertainties on the waveform data:

p(d_obs | m) ∝ exp( -Σ_i [d_i^obs - d_i(m)]² / (2σ_i²) )

where i is the index of time samples, d_i^obs and d_i(m) are respectively the observed and predicted waveform amplitudes, and σ_i is the standard deviation of that sample.
We apply SVGD and sSVGD to solve this full waveform inversion problem, as we have demonstrated that these methods provide more accurate results than ADVI. For SVGD we used 600 particles that were initially generated from the prior distribution (an example is shown in Figure 4c) and updated them using equation (18) for 600 iterations. The final particles are used to calculate statistics of the posterior distribution. For sSVGD we generated 20 particles from the prior distribution and updated them for 4,000 iterations after an additional burn-in period of 2,000 iterations. Similarly to the previous example, to reduce the memory and storage cost we only retain samples from every tenth iteration, which results in a total of 8,000 samples. Those final samples are then used to compute statistics of the posterior distribution.
Figure 5 shows the mean and standard deviation models obtained using SVGD and sSVGD. Overall the two methods produce similar results. For example, both mean models (Figure 5a and c) show similar structures to the true structure, especially in the shallow part (< 1.5 km). In the deep part (> 1.5 km) and close to the sides, the mean models appear to be less similar to the true structure because the waveform data are less sensitive to the velocity structure in those areas. However, the mean obtained using sSVGD is more similar to the true structure than that obtained using SVGD. This reflects the fact that sSVGD can produce more accurate results than SVGD in high dimensional spaces, which has also been observed in other studies (Gallego and Insua, 2018; Zhang et al., 2023). Note that, similarly to the travel time tomography example above, the mean obtained using SVGD shows smoother structures than that obtained using sSVGD. This is likely because sSVGD is an McMC method which generates samples using stochastic sampling, whereas in SVGD particles are obtained deterministically using optimisation. A similar phenomenon has also been observed in other studies when comparing results obtained using SVGD and sSVGD or McMC (Zhang and Curtis, 2021; Zhang et al., 2023).
Overall the standard deviation models show similar structural shapes to those in the mean models, as has been observed in other studies (Gebraad et al., 2020; Zhang and Curtis, 2020b, 2021; Zhang et al., 2023). In the shallow part (< 1.0 km) the results show lower uncertainties, and in the deeper part the uncertainty is higher because of lower data coverage. The higher velocity anomalies in the deeper part are clearly associated with lower standard deviations, which likely reflects the fact that those anomalies have large influences on the waveform data and hence have lower uncertainty. Similarly to the mean structures, the standard deviations obtained using SVGD show smoother structures than those obtained using sSVGD. In addition, the magnitude of the standard deviation obtained using SVGD is slightly lower than that obtained using sSVGD, which is likely because SVGD can underestimate uncertainties in high dimensional spaces due to the limited number of posterior samples produced (Ba et al., 2022; Zhang et al., 2023).
To further understand the results we show marginal distributions obtained using SVGD and sSVGD along three vertical profiles whose locations are denoted by dashed black lines in Figure 5. Overall the results show broader distributions in the deeper part (> 1 km) than in the shallow part, as we observed in the standard deviation models. Furthermore, the distributions obtained using sSVGD are broader than those obtained using SVGD, which again demonstrates that SVGD can underestimate uncertainties. Note that in the results obtained using SVGD some true velocities lie outside the high probability area at large depths (> 1.5 km), whereas the results obtained using sSVGD generally include the true velocity within the region of non-zero probability. This shows that SVGD can produce biased results for high dimensional problems, as noted in several studies (Ba et al., 2022; Zhang et al., 2023). Similarly to the above section, we measure the computational cost required by each method using the number of forward and adjoint simulations (Table 2). Specifically, SVGD required 360,000 simulations to converge, while sSVGD used 120,000 simulations. This again demonstrates that sSVGD can be more computationally efficient than SVGD because sSVGD requires fewer particles yet generates many more samples. To give an overall idea of the computational cost, the above inversions required 49 hours for sSVGD using 40 AMD EPYC CPU cores, and 3 days for SVGD using 90 CPU cores.

Table 2 Computational cost required by SVGD and sSVGD for FWI.

Discussion
Although in the VIP package we have only implemented 2D travel time tomography and 2D full waveform inversion, the code can easily be applied to other types of problems, and also to larger scale problems by using modern high performance computing (HPC) facilities. For example, users can implement 3D full waveform inversion by providing a 3D forward and adjoint simulation code (see more details in the code documentation, and an example in Zhang et al., 2023). To enable easy deployment on HPC facilities, the code provides a guide on how to parallelize the computation using the Sun Grid Engine queuing system. Other queuing systems can be supported in a similar way.
Although we have demonstrated that sSVGD can generate more accurate results than SVGD in high dimensional problems and requires less computational cost in total, the method generally requires many more iterations. As a result, sSVGD may be less efficient than SVGD in wall clock time when a large number of CPU cores is available. This is why we also implement SVGD in the VIP package, as in practice it may be a better choice for low dimensional problems. ADVI may become inefficient in high dimensional spaces because of the increased size of the covariance matrix. To enable applications in such cases, we also implement a diagonal covariance matrix, that is, a mean-field approximation (Kucukelbir et al., 2017). For SVGD and sSVGD, besides the radial basis function kernel used in the above examples, the package also implements diagonal matrix-valued kernel functions, which are constructed by combining a positive definite diagonal matrix Q with the radial basis function (Wang et al., 2019; Zhang and Curtis, 2021). The elements of Q can be set to the inverse of the variance calculated across particles (Zhang and Curtis, 2021).
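One possible sketch of such a construction, in the spirit of the matrix-valued kernels of Wang et al. (2019), measures particle distances under a diagonal metric Q built from the inverse per-dimension variances and returns Q^{-1} for scaling the kernel term. The exact construction in VIP may differ in detail; this is an illustration of the idea only.

```python
import numpy as np

def diag_matrix_kernel(x, h=1.0):
    """Sketch of a diagonal matrix-valued RBF kernel: distances are
    measured under a diagonal metric Q whose entries are the inverse
    per-dimension variances across particles, and the scalar kernel is
    to be scaled per dimension by Q^{-1}. Illustrative only.
    x: (n, d) particles. Returns the (n, n) scalar kernel matrix and
    the (d,) diagonal of Q^{-1}."""
    q = 1.0 / (np.var(x, axis=0) + 1e-12)          # diagonal of Q
    # squared distances weighted by Q, broadcasting over particle pairs
    d2 = np.sum(q * (x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * h))                    # scalar RBF under Q
    return k, 1.0 / q                              # Q^{-1} per dimension
```

Preconditioning the kernel in this way rescales the update per parameter, which can help when different model parameters have very different posterior scales.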
To promote reproducibility and to show how to use the code, we include several examples along with the code which can be used to reproduce the results presented in the sections above. We encourage interested readers to begin with these examples to familiarize themselves with the code. Finally, we note that VIP is actively being developed and expanded, and contributions from the community are welcome.

Conclusion
VIP is a Python package which solves general inverse problems using variational inference methods, including automatic differential variational inference (ADVI), Stein variational gradient descent (SVGD) and stochastic SVGD (sSVGD). The package is designed to be easy enough for beginners to use, and efficient enough to solve complex inverse problems. In addition, VIP is implemented in a scalable way such that it can be deployed on a desktop as well as on high performance computation facilities. We demonstrated the package using two examples: 2D travel time tomography and 2D full waveform inversion. Users can also use the package to solve their own inverse problems by providing an appropriate forward modelling and gradient calculation code. We conclude that VIP can be used to solve a wide range of inverse problems in practice. The most recent release of the code can be downloaded from GitHub (https://github.com/xin2zhang/VIP) and a stable version is available on Zenodo (Zhang and Curtis, 2023).

Figure 1
Figure 1 Code structure of VIP. Each rectangle represents a folder or file in the package. Users can implement their own forward functions similarly to the way this is done in the examples tomo and fwi2d.

Figure 2
Figure 2 (a) Locations of seismometers (blue triangles) around the British Isles used in this study. (b) Terrane boundaries in the British Isles from Galetti et al. (2017). Abbreviations are as follows: OIT, Outer Isles Thrust; GGF, Great Glen Fault; HBF, Highland Boundary Fault; SUF, Southern Uplands Fault; WBF, Welsh Borderland Fault System.

Figure 3
Figure 3 Mean (top row) and standard deviation (bottom row) maps of group velocity at 10 s period obtained using ADVI, SVGD, sSVGD and MH-McMC, respectively. White triangles denote locations of seismometers. Black dashed lines show the terrane boundaries in Figure 2. Black numbers are referred to in the main text.

Figure 4
Figure 4 (a) The true structure used in the full waveform inversion example. Ten sources are located at a depth of 20 m (red stars) and 200 receivers (not shown) are equally spaced at a depth of 360 m on the seabed. (b) The prior distribution of seismic velocity, which is set to be a Uniform distribution with an interval of 2 km/s at each depth. An additional lower bound of 1.5 km/s is imposed on the velocity to ensure that the rock velocity is higher than the velocity in water. (c) An example particle generated from the prior distribution.

Figure 5
Figure 5 The mean (top row) and standard deviation (bottom row) obtained using SVGD (left panel) and sSVGD (right panel), respectively. Black dashed lines denote well log locations referred to in the main text.

Figure 6
Figure 6 Marginal distributions at three well logs (black dashed lines in Figure 5) obtained using (a) SVGD and (b) sSVGD, respectively. Red lines show the true velocity profiles and white dashed lines show the lower and upper bounds of the prior distribution.

Table 1 A comparison of computational cost for ADVI, SVGD, sSVGD and MH-McMC.