SeisMIC - an Open Source Python Toolset to Compute Velocity Changes from Ambient Seismic Noise

We present SeisMIC, a fast, versatile, and adaptable open-source software to estimate seismic velocity changes from ambient seismic noise. SeisMIC includes a broad set of tools and functions to facilitate end-to-end processing of ambient noise data, from data retrieval and raw data analysis via spectrogram computation, over waveform coherence analysis, to post-processing of the final velocity change estimates. A particular highlight of the software is its ability to invert velocity change time series onto a spatial grid, making it possible to create maps of velocity changes.


Introduction
Over the past twenty years, the analysis of temporal changes in seismic velocity has become a standard tool in seismology. Seismologists exploit records of repeating sources, such as explosives (e.g., Nishimura et al., 2000; Hirose et al., 2017), vibrators (e.g., Clymer and McEvilly, 1981; Ikuta et al., 2002), airguns (e.g., Wegler et al., 2006; Yang et al., 2018), or earthquake doublets (e.g., Poupinet et al., 1984; Sawazaki et al., 2015), to quantify such changes. Commonly, the analysis of delays focuses on the later arriving, multiply scattered wave train, the so-called coda, which samples the medium to a greater spatial extent than the first-arriving energy and is sensitive even to minute velocity changes (dv/v) on the order of per mille (Snieder et al., 2002). We refer to this technique as coda wave interferometry.
While active-source coda wave interferometry accurately resolves dv/v, studies using artificial sources are logistically challenging and expensive. Repeating natural sources, on the other hand, rarely occur in regular patterns.

Modular Structure

Whom is it for? The Philosophy behind SeisMIC
As outlined above, monitoring surveys are applied to a broad spectrum of research scopes, resulting in a high diversity of requirements for research software. With that in mind, we developed SeisMIC to be flexible and adaptable to user needs. As opposed to working with a black box, users work close to the source code, making it easy to develop individualised workflows or to use individual modules, submodules, or even single objects and functions of the code. Yet, the software remains a light and fast package, in which we avoid overhead due to non-essential functionality. For example, in contrast to MSNoise, we avoid heavy database management structures for continuous observatory monitoring, resulting in significantly faster processing (see section 2.3.2) and giving SeisMIC an advantage in the analysis of campaign-based data.
Learning to use a new code, or even just determining whether a code satisfies one's needs, is a large time investment. To guarantee a fast start and quick learning, we aligned SeisMIC closely with ObsPy (Beyreuther et al., 2010), with whose syntax almost all seismologists are familiar. In addition, we host tutorials and extensive, regularly updated documentation at https://petermakus.github.io/SeisMIC/. All objects, methods, and functions have documentation strings according to the Sphinx standard.
As developers, we follow the FAIR principles (Hong et al., 2022). That is, we make SeisMIC findable, accessible, interoperable, and reusable. SeisMIC is a community code with clearly communicated community standards, and users can discuss or report issues, suggest changes, or submit pull requests via GitHub. We distribute SeisMIC under an open-source license.

Figure 1 A flowchart summarising SeisMIC's modules and their purposes. A general workflow starts with data retrieval and continues with the computation of correlation functions, from which a velocity change time series can subsequently be estimated. We illustrate this with the example given in section 3. The depicted floppy disk marks database management modules. Operations and processes are shown in blue, whereas objects and databases are shown in orange. For the sake of simplicity, we omit non-essential objects and functions; instead, the flowchart focuses on the core processes.
As shown in Figure 1, SeisMIC consists of four main modules. seismic.trace_data hosts the code for reading raw waveform data and station information. Alternatively, it can request data from FDSN servers. SeisMIC handles waveform data in miniseed format in daily chunks, while it saves station information in StationXML format. Generally, station response information is only necessary if the user opts to remove the station response before correlating. However, basic station information, such as the station's geographic coordinates, is always required.
All objects and functions to preprocess waveform data and compute correlation functions (CFs) are located in seismic.correlate. We include commonly used preprocessing functions such as detrending, tapering, amplitude clipping, sign-bit normalisation, or spectral whitening (Bensen et al., 2007). For a complete and up-to-date list of preprocessing functions, consult SeisMIC's documentation. Users can easily import custom processing functions into the workflow. We compute CFs by transferring traces to matrices, computing the Fourier transform, and then computing their cross-correlation in the frequency domain. Suppose we want to calculate all available correlations from a dataset of M waveforms, of which each has N samples. Then, the respective mathematical operations can be expressed as follows. First, we compute the discrete Fourier transform of the matrix s containing the waveforms in the time domain:

S_{mk} = \sum_{n=0}^{N-1} s_{mn}\, e^{-2\pi i k n / N} ,   (1)

where i is the imaginary number. Secondly, we obtain the correlation matrix C by computing the product of the matrix with the complex conjugate of itself. We then repeat the operation M times, each time rolling the complex conjugate matrix by j = {1, 2, ..., M} lines:

C^{j}_{mk} = S_{mk}\, \bar{S}_{(m+j)k} ,   (2)

where the bar indicates the complex conjugate and the row index m + j is understood modulo M. In the described scenario, we obtain M^2 CFs, which are subsequently transferred back to the time domain:

c^{j}_{mn} = \frac{1}{N} \sum_{k=0}^{N-1} C^{j}_{mk}\, e^{2\pi i k n / N} .   (3)

The CFs are then stored as special objects with attributes, plotting, and post-processing methods. Finally, SeisMIC writes the CFs to a storage- and computationally-efficient HDF5 container (Koranne, 2011).
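The following self-contained NumPy sketch illustrates the operations in equations (1) to (3). It is a minimal stand-in for demonstration purposes, not SeisMIC's actual implementation:

```python
import numpy as np

# Minimal sketch of frequency-domain correlation, illustrating
# equations (1)-(3); this is not SeisMIC's implementation.
M, N = 4, 1024                      # number of waveforms and samples
s = np.random.randn(M, N)           # matrix of time-domain waveforms

S = np.fft.fft(s, axis=1)           # eq. (1): DFT of every row
cfs = np.empty((M, M, N))           # will hold all M*M CFs
for j in range(M):                  # eq. (2): roll the conjugate matrix
    C = S * np.conj(np.roll(S, -j, axis=0))
    cfs[:, j, :] = np.fft.ifft(C, axis=1).real  # eq. (3): back to time domain
```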
All functionality to estimate velocity changes from the CFs resides in seismic.monitor. Currently, SeisMIC supports the estimation of velocity changes using the stretching technique (Sens-Schönfelder and Wegler, 2006), and we are implementing the wavelet cross-spectrum analysis (Mao et al., 2020).
The stretching technique compares a reference correlation function C_n to a CF C_n^l computed from data in an arbitrary subwindow l of the total time series. Note that, below, we omit the index n indicating the station pair, since this operation is executed independently for each station pair. There are several approaches to obtaining the reference C, all with their unique advantages; SeisMIC supports the use of single or multiple references (Sens-Schönfelder et al., 2014b). In SeisMIC, we implemented a grid search, in which we evaluate C at a new time vector τ_j stretched (or compressed) with the stretching factor κ_j:

\tau_j = \tau\, e^{\kappa_j} .   (4)

Note that we base the exponential stretching on a Taylor expansion for small velocity changes. This form is more accurate than the more common approximation τ_j ≈ τ(1 + κ_j) and has the advantage of yielding linearly reversible stretched functions. In the supplementary material, we provide a derivation.
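For clarity, the reversibility argument can be made explicit (a short addition consistent with the derivation referenced above):

```latex
% Stretching by \kappa and then by -\kappa recovers the original time
% vector exactly in the exponential formulation,
\tau \, e^{\kappa}\, e^{-\kappa} = \tau ,
% whereas the linear approximation leaves a second-order residual,
\tau\,(1+\kappa)(1-\kappa) = \tau\,(1-\kappa^{2}) \neq \tau .
% Both forms agree to first order for small \kappa, since
e^{\kappa} = 1 + \kappa + \mathcal{O}(\kappa^{2}) .
```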
Using our stretched time vector, we obtain a stretched reference correlation matrix with J lines, where J is the total number of tested stretching factors. Afterwards, we compute the zero-lag correlation (i.e., the normalised dot product) between each stretched reference C_j and C^l:

R_j^l = \frac{\sum_t C^l(t)\, C_j(t)}{\sqrt{\sum_t \left(C^l(t)\right)^2 \, \sum_t \left(C_j(t)\right)^2}} .   (5)

The stretching factor κ_j = −dv/v resulting in the maximum R_j^l corresponds to the negative apparent velocity change at time step l. The maximum value of R measures the velocity change estimate's stability and is often referred to as coherence. We then compute R_j^l for all time steps, resulting in the similarity matrix R, the final velocity change time series, and a corresponding coherence time series. Note that R is usually not computed for the whole coda, but just for a user-defined subset of lag time samples.
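A compact NumPy sketch of such a stretching grid search follows. It is a simplified stand-in for SeisMIC's implementation, with interpolation and windowing reduced to their essentials:

```python
import numpy as np

def stretch_dv_v(ref, cur, tau, kappas):
    """Grid search over stretching factors following eqs. (4) and (5).

    ref, cur : reference and current CF sampled at lag times tau
    kappas   : trial stretching factors (kappa = -dv/v)
    Returns the dv/v estimate and the corresponding coherence.
    """
    R = np.empty(len(kappas))
    for j, kappa in enumerate(kappas):
        # eq. (4): evaluate the reference at the stretched time vector
        ref_j = np.interp(tau * np.exp(kappa), tau, ref)
        # eq. (5): zero-lag correlation (normalised dot product)
        R[j] = np.dot(cur, ref_j) / np.sqrt(
            np.dot(cur, cur) * np.dot(ref_j, ref_j))
    jmax = np.argmax(R)
    return -kappas[jmax], R[jmax]  # dv/v = -kappa, coherence = max(R)

# Example: a synthetic current CF generated by stretching the reference
# with kappa = 0.005, i.e., dv/v = -0.5 % under the sign convention above.
tau = np.linspace(0.0, 12.0, 1201)
ref = np.exp(-tau / 8.0) * np.sin(2.0 * np.pi * 3.0 * tau)
cur = np.interp(tau * np.exp(0.005), tau, ref)
dvv, coh = stretch_dv_v(ref, cur, tau, np.linspace(-0.02, 0.02, 401))
# dvv is approximately -0.005 and coh is close to 1
```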
Finally, the computed velocity change time series can be post-processed and plotted using pre-implemented or custom functions. In addition, SeisMIC can invert a set of velocity change time series from different stations onto a map using the inversion method described by Obermann et al. (2013a). To our knowledge, SeisMIC is currently the only software that supports spatial inversion of velocity change time series.
The workflow steps outlined above rely entirely on well-known Python libraries, including NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ObsPy (Beyreuther et al., 2010), Matplotlib (Hunter, 2007), and h5py (Collette et al., 2020). To ensure the best stability, we only utilise the most well-maintained projects and keep the number of dependencies to a minimum. Some of SeisMIC's core functionalities are based on the MIIC software project (Sens-Schönfelder et al., 2014a). SeisMIC's latest beta version 0.5.3 is compatible with Python 3.10 and 3.11.

Benchmark and Performance
In ambient noise seismology, it is not uncommon to work with data volumes on the order of terabytes. We address the arising computational and storage challenges with an efficient, high-performance-computing-compatible code design enabling parallel computation of correlations, velocity change estimates, and spatial inversions, where the computation of CFs is the most expensive operation by a large margin. We implement parallel computing using mpi4py (Dalcin and Fang, 2021), which relies on the message passing interface (MPI). In contrast to other Python multi-threading solutions, MPI-based solutions work seamlessly on high-performance computing (HPC) and cluster systems.
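As a minimal illustration of this pattern (generic mpi4py usage, not SeisMIC's internal code), a list of channels can be distributed across MPI ranks as follows; the channel names are purely illustrative:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical list of channels to correlate; each rank processes
# every size-th channel (round-robin distribution).
channels = ['X9.IR1..HHE', 'X9.IR1..HHN', 'X9.IR1..HHZ']
for channel in channels[rank::size]:
    # ... read, preprocess, and correlate this channel ...
    print(f'rank {rank} processes {channel}')

comm.Barrier()  # wait until all ranks are done before writing results
```

Run, e.g., with mpirun -n 4 python script.py; MPI then launches four independent processes that coordinate through the communicator.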

Multicore Scaling
To test how SeisMIC's computational performance scales with the number of used threads, we compute autocorrelations from three-component data on a single cluster node featuring an Intel Cascade Lake CPU architecture with 2 CPU sockets, each holding 20 physical cores that can each execute two threads in parallel. For our test, we compute CFs from 30 days of waveform data. SeisMIC reads daily chunks of miniseed files, which it subsequently decimates, here to a sampling rate of 25 Hz, after imposing an anti-alias filter. The daily waveforms are then detrended, tapered, and filtered with a pass band between 0.01 and 12 Hz. The data is then sliced into hourly traces, which are again linearly detrended, filtered between 2 and 8 Hz, and clipped if the amplitude exceeds a threshold of 2.5 times its standard deviation. Then, SeisMIC computes hourly CFs in the frequency domain and saves them in a customised HDF5 container after performing an inverse Fourier transform. We provide the YAML file containing the processing parameters in the supplementary material; a sketch of its structure follows below. We execute this operation using 1, 2, 4, 8, 16, 32, and 64 threads for data from 1, 2, 4, and 8 stations (i.e., 3, 6, 12, and 24 channels and component combinations). For each configuration, we repeat the computation ten times.
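To give an impression of this parametrisation, the following Python dictionary mirrors the structure of such a configuration. The key names are hypothetical and do not reproduce SeisMIC's exact YAML schema; the authoritative file is the one in the supplementary material:

```python
# Illustrative excerpt of the benchmark parametrisation (hypothetical keys).
params = {
    'sampling_rate': 25,                     # decimate after anti-alias filter
    'daily_preprocessing': [
        ('detrend', {}),
        ('taper', {}),
        ('bandpass', {'freqmin': 0.01, 'freqmax': 12.0}),
    ],
    'subdivision': {'window_length': 3600},  # slice into hourly traces
    'hourly_preprocessing': [
        ('detrend', {'type': 'linear'}),
        ('bandpass', {'freqmin': 2.0, 'freqmax': 8.0}),
        ('clip', {'std_factor': 2.5}),       # clip at 2.5 x standard deviation
    ],
}
```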
Figure 2 shows the mean processing time and standard deviation over the ten operations per unique combination of n_threads and n_stations. We normalise the processing times by the time required for n_threads = 1 and n_stations = 1.
While n_threads ≤ n_channels, where, in our case, n_channels = 3 n_stations, the processing time scales close to linearly with the number of used threads, indicating an excellent parallel computing performance. As most of the parallel processing in SeisMIC works on a one-core-per-channel basis, only a marginal speed-up can be expected beyond this threshold.
Indeed, for n_channels < n_threads, the code reaches a performance plateau. From there on, the processing time increases with a further increase of n_threads, probably due to MPI's communication overhead. Based on the shown results, we discourage hyperthreading (i.e., using more threads than available physical cores), which leads to a significant performance drop. Generally, one should not employ more threads than the total number of available channels for the computation of correlation functions, or the total number of channel combinations for the dv/v estimation.

Figure 2 Multi-core scaling properties of SeisMIC. We show compute times for autocorrelations as a function of the number of three-component datasets and the number of parallel processing threads. The data points correspond to the mean processing time and the error bars to its standard deviation over ten operations (mostly too small to be visible). The processing times are normalised by the time needed to compute the correlations for one station using only one thread. The shaded area marks the region where the number of threads exceeds the number of physical cores, 40, i.e., where hyperthreading is employed.

Comparison with MSNoise
To analyse how SeisMIC's processing speed compares to the latest release of MSNoise (Lecocq et al., 2014), 1.6.3, we choose to calculate cross-correlations, the most expensive operation in a standard workflow, taking up more than 95% of the total compute time. In this benchmark, we retrieve hourly cross-correlations for 14 days of raw waveform data between eight 3-component broadband seismometers sampling at 100 Hz. We set the preprocessing to be identical for both programs. First, the data are decimated to 25 Hz. Subsequently, we detrend, taper, and bandpass filter the data between 2 and 4 Hz. Before computing the CFs, we apply one-bit normalisation and spectral whitening. Finally, we save the hourly CFs and daily CF stacks with a length of 50 seconds for all six unique component combinations. We perform the benchmark on the same Intel-Cascade-Lake-based node that we use in section 2.3.1. We show the processing times required by MSNoise and SeisMIC for the outlined operation as a function of employed threads in Figure 3. Despite having received a significant performance boost with the update to version 1.6.x, MSNoise still needs about five times as long and three times as much random access memory (RAM) as SeisMIC to execute the cross-correlation workflow, putting SeisMIC at a similar efficiency level as NoisePy (see Jiang and Denolle, 2020).
In addition, SeisMIC offers a broader range of preprocessing options than NoisePy or MSNoise. MSNoise creates one miniseed file per CF, resulting in less complex and more evenly distributed writing operations. For this benchmark, this translates to slightly better scaling but a high number of files, which can be undesirable for large datasets. SeisMIC, on the other hand, creates one file per component combination. In every case, MSNoise remains more than twice as slow as SeisMIC. Note that the shown times do not include the time that MSNoise takes to set up a database and scan new data, which can be significant, whereas these operations are practically instantaneous in SeisMIC.
While the presented results are encouraging, we remark that we could decrease compute times even further by exploiting the potential of modern graphics processing units (GPUs), which can correlate ambient seismic noise with high efficiency (Clements and Denolle, 2021; Wu et al., 2022). Implementing such algorithms belongs to the intermediate-term goals of SeisMIC's development.

Practical Example of a Workflow: From Raw Waveform Data to a Velocity Change Time Series
In this section, we demonstrate how to obtain a dv/v time series using a minimal workflow in SeisMIC. In the supplementary material, we provide two Jupyter notebooks containing the source code used for this workflow. The exemplary data were recorded by station X9.IR1 around the date of the M7.2 Zhupanov earthquake in Kamchatka, Russia.
In the following, we investigate the impact of the event on the seismic velocity in the station's vicinity. A discussion of the result lies beyond the scope of this technical paper and can be found in Makus et al. (2023b).
We conducted this analysis using SeisMIC's implemented workflow, which is parametrised using a simple YAML file (see supplementary material). In the following, we take a step-by-step tour through said workflow and provide some minimal code examples. For further examples, we advise the reader to consult SeisMIC's documentation and our GitHub page.

Data Retrieval
To start, we download data from an FDSN-compatible server. In our case, we download data from station X9.IR1, available over the GEOFON FDSN service (Quinteros et al., 2021). For conciseness, we restrict this example to 11 days of data from 25 January to 5 February 2016. This short time window comprises the 28 January Zhupanov earthquake, whose coseismic velocity drop we want to investigate. In SeisMIC, we can initiate the data download using the Store_Client class and its method download_waveforms_mdl (Listing 1). Under the hood, this initiates ObsPy's (Beyreuther et al., 2010) MassDownloader to download continuous waveform data from the specified station if it is not already present locally. Here, we will compute autocorrelations using only the east component of the seismogram. We can use SeisMIC to get a first idea of the spectral content of our waveform and to investigate in which frequency bands we might find stable noise sources suitable for passive image interferometry (PII). We show a spectrogram computed using Welch windows (see, e.g., Barbe et al., 2010) as implemented in SeisMIC in Figure 4.
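Listing 1 Downloading data using SeisMIC. The snippet below is a minimal sketch of this step; the constructor and keyword arguments reflect our reading of the documented Store_Client interface, and the project path is a placeholder, so both should be checked against the current documentation:

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from seismic.trace_data.waveform import Store_Client

# Local project directory and FDSN client; the path and argument names
# are illustrative and may differ between SeisMIC versions.
root = 'data'
sc = Store_Client(Client('GEOFON'), root, read_only=False)

# Download 11 days of east-component data around the Zhupanov earthquake.
sc.download_waveforms_mdl(
    UTCDateTime(2016, 1, 25), UTCDateTime(2016, 2, 5),
    network='X9', station='IR1', location='*', channel='*E')
```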

Computing Autocorrelations
After downloading the waveforms, we can correlate them to obtain CFs. When computing correlations, we have ample preprocessing options, which, for brevity, we will not discuss here in detail. Most fundamentally, we must set the correlation length (i.e., the duration of the time windows to be correlated), the correlation method (in our case, autocorrelation), and the frequency window to be filtered. The user defines all options in the YAML file, but they can also be provided in a Python dictionary. For this example, we choose a correlation length of one hour and a frequency band between 2 and 4 Hz. In SeisMIC, the Correlator class handles the correlation workflow.
The Correlator's pxcorr method internally handles preprocessing and correlation. It also initiates MPI to enable parallel processing (Listing 2). In Figure 5, we plot the CFs using SeisMIC's plotting tools. Due to the high noise level in the chosen time window and frequency band, a well-defined coda emerges from the CFs (see Makus et al., 2023b, for details).
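Listing 2 Computing autocorrelations using SeisMIC. Again, this is a hedged sketch; the Correlator constructor follows our reading of the documentation, and 'params.yaml' stands in for the parameter file provided in the supplementary material:

```python
from obspy.clients.fdsn import Client
from seismic.correlate.correlate import Correlator
from seismic.trace_data.waveform import Store_Client

# Reuse the local archive from Listing 1; file names are placeholders.
sc = Store_Client(Client('GEOFON'), 'data', read_only=True)
c = Correlator(sc, options='params.yaml')

# Preprocess, correlate, and write the CFs to an HDF5 container.
c.pxcorr()
```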

Waveform Coherence
For a first assessment of which frequency bands are well-suited for a velocity change analysis, we can use a spectrogram like the one we show in Figure 4. Additionally, one can use SeisMIC's waveform coherence function. The waveform coherence corresponds to the averaged zero-lag cross-correlation between a reference CF and CFs at time t (Steinmann et al., 2021). In Figure 6, we show the waveform coherence for our exemplary dataset computed between hourly CFs and the average CF as a reference. We determine the coherence for 5 s long lapse-time windows and one-octave-wide frequency bands, jointly for positive (causal) and negative (acausal) lag times. SeisMIC computes waveform coherence using the Monitor class and its compute_waveform_coherence_bulk() method (see the supplementary material and the sketch below). Figure 6 leads us to infer a high stability and energy content between 0.5 and 4 Hz. The coherence remains high until late lag times, e.g., up to 75 periods for a centre frequency of 3 Hz. From this, we infer a highly scattering medium paired with a high energy content in this frequency band originating from the volcanic system (see Makus et al., 2023b). Therefore, we henceforth focus on the analysis of dv/v between 2 and 4 Hz.
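A minimal sketch of the corresponding call; our assumption is that the Monitor constructor, like the Correlator, reads the same YAML parameter file, whose name here is a placeholder:

```python
from seismic.monitor.monitor import Monitor

# 'params.yaml' is the parameter file also used for the correlation step.
m = Monitor('params.yaml')
m.compute_waveform_coherence_bulk()  # coherence for all channel combinations
```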

Computing Velocity Changes Using the Stretching Method
Using the procedure theoretically outlined in section 2.2, we can estimate the evolution of the seismic velocity in the study period. As previously, the parametrisation is handled via the YAML file (see supplementary material).
Before computing dv/v, we smooth the one-hour CFs with a 4-hour-long Hanning window. As reference CF, we use the mean of all CFs. Then, we compute dv/v for lag times between 3.5 s and 12 s, simultaneously from the causal and acausal parts of the coda (see the sketch below). We plot the resulting velocity change time series using one of SeisMIC's standard plotting templates in Figure 7.
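In code, this step again runs through the Monitor object. The method name below is modelled on the bulk coherence call and reflects our reading of the documented API; it is an assumption and should be verified against the current documentation:

```python
from seismic.monitor.monitor import Monitor

# Stretching parameters (lag-time window, reference, smoothing) live in
# the YAML file; the method name may differ between SeisMIC versions.
m = Monitor('params.yaml')
m.compute_velocity_change_bulk()  # dv/v for all channel combinations
```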
Even though we do not focus on data interpretation in this article, we should take a brief look at the presented results. Most notably, we identify a clear velocity drop coinciding with the regional M7.2 Zhupanov earthquake.
Interestingly, the resolution of the dv/v time series is high enough to identify a diurnal cycle that could be caused by air temperature and pressure variations, as observed, for example, by Wang et al. (2020), or be due to lunar and solar tides, as reported by Yamamura et al. (2003) and Sens-Schönfelder and Eulenfeld (2019).

Spatial Imaging of Velocity Changes
Velocity change estimates like the one presented in Figure 7 show dv/v as a function of time but do not directly yield insight into the spatial distribution of these velocity changes. The spatially extended sampling of coda waves increases the sensitivity to distributed weak velocity changes and the detectability of localised changes, but prevents a simple inference of the affected location along a ray path or Fresnel volume. The affected location can, however, be estimated using sensitivity kernels that describe the time-dependent energy distribution of the wavefield for a statistically uniform medium. For a theoretical derivation of the sensitivity kernels based on radiative transfer theory, refer to Mayor et al. (2014), Margerin et al. (2016), and Zhang et al. (2022).
In SeisMIC, we implemented a simplified approach relying on sensitivity kernels derived from an approximate solution of the Boltzmann equation for a homogeneous medium (Paasschens, 1997) describing isotropic scattering of acoustic waves. Using these sensitivity kernels and a linearised inversion scheme proposed by Obermann et al. (2013b), we can map a 2-dimensional distribution of dv/v at a fixed time t_i, resulting in dv/v(t_i, x, y).
In SeisMIC, the module seismic.monitor.spatial contains the necessary functions for the outlined approach.
To illustrate the procedure and make our example easily adaptable and reproducible, we create a synthetic velocity change model, which we then forward model onto a random station configuration. After adding noise to the synthetic data, we try to recover the initial model using the inverse algorithm. In detail, we proceed as follows: First, we create a synthetic velocity change model with an extent of 40 km × 40 km and a spatial resolution of 1 km (Figures 8 (b) and (d)). The background medium has a homogeneous velocity of 3 km/s and a transport mean free path l0 of 30 km. Then, we place an arbitrary number of stations at random positions on the grid. Using sensitivity kernels of cross- and autocorrelations, we solve the forward problem to compute dv/v as it would be obtained from the CFs in the presence of the spatial velocity variations. The sensitivity kernels are computed for lapse time windows between 14 and 34 s. To the dv/v values, we add random noise. This noise follows a Gaussian distribution around 0% velocity change with a standard deviation of 0.1%. Finally, we invert for the synthetic model employing the damped linearised inversion (Obermann et al., 2013b), as sketched below. We show the results of this inversion in Figures 8 (a) and (c) for 4 and 32 stations, respectively. There, we also indicate the used damping parameters. The optimal damping parameters minimise both the misfit between the initial and the retrieved model and the model complexity, and can be found using the L-curve criterion, as discussed by Obermann et al. (2013b).
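The following self-contained NumPy sketch illustrates the principle of such a damped linearised (least-squares) inversion. It is generic conceptual code, not SeisMIC's seismic.monitor.spatial implementation; the sensitivity matrix is a random stand-in, and the covariance parametrisation is one common choice under our assumptions:

```python
import numpy as np

# Generic damped linearised inversion sketch; not SeisMIC's implementation.
rng = np.random.default_rng(42)
ncells, ndata = 1600, 200          # 40 x 40 km grid at 1 km resolution
G = rng.random((ndata, ncells))    # stand-in for the sensitivity kernels
G /= G.sum(axis=1, keepdims=True)  # each kernel integrates to one
m_true = np.zeros(ncells)
m_true[:400] = -0.5                # synthetic dv/v anomaly in percent
d = G @ m_true + rng.normal(0.0, 0.1, ndata)  # forward model plus noise

# Model covariance with correlation length lam over inter-cell distances;
# the prefactor is one common parametrisation, assumed here.
res, lam, sigma_m = 1.0, 2.0, 0.1
xy = np.indices((40, 40)).reshape(2, -1).T * res
dist = np.linalg.norm(xy[:, None] - xy[None], axis=-1)
Cm = (sigma_m * res / lam) ** 2 * np.exp(-dist / lam)
Cd_inv = np.eye(ndata) / 0.1 ** 2  # data covariance from the noise level

# Damped least-squares solution for the dv/v map
m_est = np.linalg.solve(G.T @ Cd_inv @ G + np.linalg.inv(Cm),
                        G.T @ Cd_inv @ d)
```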
The results demonstrate that increasing the number of stations is the most effective way to decrease the misfit between the inversion result and the input model. While the geometry of the synthetic model is poorly retrieved for a configuration using only four stations, we can reproduce the model quite accurately with 32 stations.
The supplementary material contains a Jupyter notebook to reproduce or modify these results with an arbitrary number of stations, velocity change model, and damping parameters. We also include options to invert for dv/v utilising only data from auto- or cross-correlations, and using sensitivity kernels from split coda windows (i.e., with lapse time windows sliced into narrow sub-windows). In the supplement, we show results that exploit these options. Based on these, we argue that adding dv/v information from both auto- and cross-correlations improves the accuracy of the result notably, whereas splitting the coda yields only minor improvements.

Conclusion and Outlook
We presented SeisMIC, a software to estimate changes in the seismic propagation velocity from continuous records of seismic ambient noise. SeisMIC contains functionalities for the end-to-end processing of velocity change time series, including data retrieval, the computation of correlation functions, the calculation of velocity change time series using the stretching method, and post-processing as well as inversion of dv/v time series onto a spatial grid. While these functions can be part of a workflow, they are also intended to be used separately and can easily be altered and adapted to individual processes.
In the near future, we will release versions capable of estimating dv/v employing algorithms other than the stretching method, like the wavelet cross-spectrum analysis (Mao et al., 2020). Other future milestones include exploiting the computational power of GPUs to decrease the compute time of noise correlations even further and adding solutions that automatically update correlation function databases.
SeisMIC complements existing software to process ambient noise. Highlights are its broad functionality, high efficiency, and versatility, making it applicable to local small-scale studies on a laptop computer as well as to surveys using large-N arrays processed on computer clusters. SeisMIC is available on GitHub as well-documented and regularly maintained open-source software.

Figure 3 Compute times for a cross-correlation workflow for all six unique component combinations between eight seismic stations using MSNoise 1.6.3 (Lecocq et al., 2014) and SeisMIC 0.5.3. The height of the bars indicates the mean processing time over five iterations, with the error bars representing the standard deviation. For hardware information and the exact parametrisation of the workflows, consult the text body.

Figure 4 Time-dependent spectrogram of the raw waveform at X9.IR1. We compute the spectrogram after removing the instrument response, using 2-hour Welch windows. Note the energy spike caused by the Zhupanov earthquake. The energy amplitude is normalised by its maximum.

Figure 5 Hourly autocorrelation functions of station X9.IR1, plotted with SeisMIC's plotting tools; for details, consult the text body.

Figure 6 The waveform coherence as a function of lag time and frequency for the dataset from station X9.IR1 and channel HHE. For details, consult the text body.

Figure 7 The velocity change time series for station X9.IR1 estimated with the stretching method; for details, consult the text body.

Figure 8 Two examples of the spatial inversion using different parametrisations and station configurations. (a) Result of the spatial inversion algorithm using four stations, a model variance σm = 0.1 km km⁻², and a correlation length λ = 2 km. (b) The synthetic velocity model and station configuration used to obtain (a). (c) Result of the spatial inversion algorithm using 32 stations, σm = 0.01 km km⁻², and λ = 2 km. (d) The synthetic velocity model and station configuration used to obtain (c). For an exhaustive description of the parametrisation and the inversion steps, consult the text body.
