Curated Pacific Northwest AI-ready Seismic Dataset


Introduction
The Pacific Northwest (PNW) region of the United States is a dynamic tectonic plate boundary between the North American continental plate and the Juan de Fuca oceanic plate. The active margin between the two plates is a subduction zone that hosts a wide variety of earthquake behaviors: fast and large megathrust earthquakes (Witter et al., 2003), intraslab earthquakes (Gene A. Ichinose, 2004), crustal earthquakes (Gomberg and Bodin, 2021), slow repeating earthquakes (Rogers and Dragert, 2003; Wech and Bartlow, 2014; Bartlow, 2020), tectonic tremor (Wech et al., 2010), and low-frequency events (A.A. Royer and M.G. Bostock, 2014). The PNW has over twenty active volcanoes that have experienced eruptions in the historical record. The PNW has hundreds of glaciers in the Cascades, the Olympic Peninsula, and sitting atop the Cascade Volcanoes. Due to the active tectonics and the particular mid-latitude climate, the PNW also experiences hundreds of landslides every year (Luna and Korup, 2022). Such geohazards generate seismic waves that are well recorded (Allstadt, 2013; Allstadt et al., 2018a; Hibert et al., 2019).
The Pacific Northwest Seismic Network (PNSN) is the Beroza, 2022) that is driven by the high dimensionality of seismological data, the dramatic growth in data volumes (Hutko et al., 2017), and the effort by the community to curate seismic datasets. Several curated datasets have become standards for machine-learning seismological research: STEAD, a dataset of local and regional earthquakes and high-frequency noise recorded globally (STanford EArthquake Dataset, Mousavi et al., 2019), INSTANCE (Italian seismic dataset for machine learning, Michelini et al., 2021), ETHZ (Eidgenössische Technische Hochschule Zürich, Woollam et al., 2022), SCEDC (Southern California Earthquake Data Center, SCEDC, 2013), and Iquique, a data collection of subduction-zone earthquakes and regional recordings (Woollam et al., 2019). These datasets contain earthquake and noise time series recorded by various seismometers. The typical data attributes are basic earthquake source and receiver characteristics, including locations, magnitudes, focal mechanisms, and waveforms. The majority of the earthquake sources in these datasets are of tectonic origin: transform plate boundaries such as in California, subduction zones, and intra-continental crustal earthquakes (Woollam et al., 2019; Michelini et al., 2021). Such datasets are considered "AI-ready" since their data and attributes are packaged in data formats commonly used by the machine-learning community.
Surface processes may also generate seismic waves. Environmental seismology is a growing field that uses seismic waves to understand surface and environmental processes. There is a body of research on the seismic signatures of landslide events (Chmiel et al., 2021; Yan et al., 2020; Hibert et al., 2014), avalanche signals (Braun et al., 2020), and debris flows (Chmiel et al., 2021), most of which investigates specific case studies. Catalogs of such events are available in the Incorporated Research Institutions for Seismology (IRIS) Exotic Seismic Event Catalog (ESEC) (e.g., Allstadt et al., 2017; Bahavar et al., 2019; Collins et al., 2022); these refined and ground-truth catalogs only contain a few (∼100) events.
Our study provides a novel curated AI-ready dataset of event and waveform data for a diverse range of short-duration seismic sources that include tectonic earthquakes, explosions, surface events such as ice/rock falls and avalanches, sonic booms, and thunderstorms. Not included are phenomena such as non-volcanic tremors or low-amplitude low-frequency earthquakes (LFEs). We leverage the 21 years of data curation by the PNSN seismic analysts and researchers to measure the event P- and S-phase arrival times and other attributes. To enable optimal re-usability of our dataset for machine learning studies, we organized the dataset using the SeisBench data format (Woollam et al., 2022) to improve accessibility in the machine learning ecosystem. We acknowledge that the human biases that often pollute AI-ready datasets (Paullada et al., 2021) are present in our catalog of event and waveform attributes. Some of these identified biases are discussed below and are obvious topics of future investigations.

Data Selection and Preparation
The PNSN has been monitoring the seismicity in the PNW since 1969. However, seismic waveform data from PNSN were recorded on film and paper until 1980, when digital data became available. From 1980 to 2002, event-triggered waveform data (often with a limited duration) were saved, but continuous archiving did not start until 2002. For machine-learning applications, long seismic traces are preferred as input data to allow user flexibility when trimming and shifting the data in future investigations (e.g., data augmentation, Zhu et al., 2020). The data must also have the same dimensions, i.e., the same number of samples. To get waveforms that are long enough (i.e., 150 seconds and longer in this study), we start the curation in 2002, when continuous data become available from the IRIS Data Management Center (DMC). The drawback of this choice is that it excludes the largest tectonic earthquakes in the region because they occurred before 2002 (e.g., Nisqually Earthquake of 28 February 2001). In addition, we require that both a P-wave and an S-wave arrival time are available at the same station for each event. This requirement removes some of the smaller, older earthquakes for which no S-picks were available. In the context of AI-ready datasets, the associated metadata (labels or attributes) include event-derived parameters, station parameters, and waveform parameters. We use the SeisBench metadata format: Table 2 lists the attributes that we associate with each set of waveforms.

Event Parameters
The detection of new events is automated and then manually reviewed by the regional seismic network staff. The PNSN monitors and reports on the seismicity in the region using data from seismic stations (Figures S1 and S2). A trigger at a station occurs when the ratio of the short-term average to the long-term average of the seismic data (STA/LTA, Allen, 1982) exceeds a threshold. When a few stations from a designated geospatial group of seismic stations, called a subnet, experience a trigger, events are automatically saved. The PNSN analysts review all automatically detected events and remove erroneous ones by visual inspection of the event waveforms, a process they refer to as "trigger review". Teleseisms are also identified but not further processed.
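The STA/LTA trigger described above can be illustrated with a minimal sketch. It uses one simple variant (trailing, overlapping windows on the squared signal) and is not the PNSN implementation; the window lengths, the threshold of 4, and the synthetic trace are assumptions chosen for illustration.

```python
import numpy as np

def sta_lta_ratio(trace, n_sta, n_lta):
    """STA/LTA of the signal energy using trailing windows (illustrative variant)."""
    energy = np.asarray(trace, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(energy)))
    ratio = np.zeros(energy.size)
    # Evaluate only where a full long-term window is available.
    ends = np.arange(n_lta, energy.size + 1)  # exclusive end index of each window
    sta = (csum[ends] - csum[ends - n_sta]) / n_sta
    lta = (csum[ends] - csum[ends - n_lta]) / n_lta
    ratio[ends - 1] = np.divide(sta, lta, out=np.zeros_like(sta), where=lta > 0)
    return ratio

# Synthetic example: unit-variance noise with a 5 Hz burst starting at sample 2000.
fs = 100  # Hz
rng = np.random.default_rng(0)
trace = rng.normal(0.0, 1.0, 3000)
trace[2000:2200] += 10.0 * np.sin(2 * np.pi * 5.0 * np.arange(200) / fs)

ratio = sta_lta_ratio(trace, n_sta=int(0.5 * fs), n_lta=int(10 * fs))
threshold = 4.0  # hypothetical trigger threshold
triggered = np.flatnonzero(ratio > threshold)
onset = int(triggered[0]) if triggered.size else None
```

A network trigger would then require several stations of the same subnet to exceed the threshold within a short time of each other.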
If the waveform has a clear but emergent signal, does not contain distinct P and S arrivals, and the frequency content is relatively low, the PNSN assigns a "surface event" label (su) to the source type. Most surface events are "ice"-quakes or avalanches associated with glaciers in the Cascades and on the volcanoes; however, some may be debris flows or rock falls. Other non-earthquake phenomena occasionally saved by analysts are recordings of sonic booms, thunderstorms, and other "interesting" events. Such waveforms are picked at very few nearby stations (one or two), and we gather the phase pick information in a catalog that we refer to as the "Exotic Event" catalog.
Once the trigger review identifies an event as an actual earthquake, the PNSN analysts further process the data. First, the automated system picks the arrival times of seismic phases from the recorded seismograms, which are one of the most important and primary data products extracted from the raw waveforms. The analyst reviews and modifies the picks.

Figure 1
The event counts of ComCat and exotic catalog included in the AI-ready PNW dataset as a function of time.
Seismic phase picking is the cornerstone of seismological research. With accurate phase arrival information, the analysts can locate the event and estimate its origin time. At the PNSN, the first P- and S-waves are the phases picked for local and regional events. As a part of the PNSN's ANSS Quake Monitoring System (AQMS), the network analysts use Jiggle, a graphical user interface written in Java, to pick arrivals, locate events, and recalculate magnitudes (Hartog et al., 2019). The analysts manually annotate the arrival time and estimate the uncertainties of their picks. The phase arrivals are only picked on a single component per station, with P-waves usually picked on vertical channels (Z component) and S-waves on horizontal channels (E/N or 1/2 components). When it is clear, the polarity of the P-phase (first motion up, i.e., positive, or down, i.e., negative) is labeled by the analyst as well. Both acceleration and velocity channels are used for phase picking, although velocity channels are the most commonly used. The PNSN operates sites with both velocity channels (broadband or short-period high-gain seismometers) and acceleration channels (low-gain accelerometers used for "strong motion" seismology). Velocity channels are preferred when both instrument types exist since they usually have a higher signal-to-noise ratio than the strong-motion channel.
Additional earthquake characteristics may be obtained from the phase polarity and amplitudes, such as focal mechanisms and magnitudes. All event parameters are saved in PNSN's AQMS database, and reasonably well-located earthquakes and explosions are reported to the ANSS Comprehensive Earthquake Catalog (ComCat, Survey, 2017) via USGS Product Distribution Layer (PDL), the software-server infrastructure that all the ANSS regional networks use to distribute earthquake products. It is important to note that the combination of automated tools, which get updated through time, and manual intervention renders the event parameters not statistically stationary over time.
This study splits the PNW catalog into several datasets: one that has PNSN analyst-verified event attributes that were sent to the USGS, which we refer to as the "ComCat event" dataset, one that we refer to as the "exotic event" dataset and that has remained internal in the PNSN AQMS database, and one that focuses on the 2022 Northern California earthquake sequence. These datasets are packaged in different files because they have different window lengths and data attributes. We collect and organize the data from these. We show in Figure 1 the annual event counts for the two sets of events, ComCat and exotic, that are selected for the curated dataset. The temporal patterns ought not to be interpreted as changes in seismicity rate since there are systematic biases in the detection and labeling of the events through time, whether they are human (analyst) or instrumental (increased instrumental coverage).

ComCat Events
We query the ANSS ComCat and download 65,384 events with magnitudes greater than 0 from 1 January 2002 to 31 December 2022, which we refer to as "ComCat events". We only select the events from ComCat sent by the PNSN, whose event ID has a "uw" prefix. The event metadata, including phase picks, are downloaded using libcomcat (Hearne and Schovanec, 2020) and stored in the QuakeML format (v1.2, Schorlemmer et al., 2011). The source type of these events is either earthquake or explosion. The download contains 997,213 associated phase picks. Among these picks, 944,220 were made on velocity channels and only 52,982 (5.3%) on strong-motion channels. For single-channel stations where only the vertical channel (Z) exists (e.g., EHZ), S-waves were also picked, but only if the onsets were clear. The temporal evolution of the ComCat events reflects a combination of increased coverage and sensitivity of the seismometers. In 2009, a large number of the cataloged events came from an intense swarm of earthquakes at Wooded Island in eastern Washington (Gomberg et al., 2012). As listed in Table 1, the number of events represented in our final curated dataset is less than what we originally downloaded due to data selection criteria described in Section .

Exotic Events
We also collect data from 5,657 events cataloged by the PNSN since 2002 that are labeled as neither earthquakes nor explosions. The exotic events are not incorporated in the ANSS ComCat and are only available through the PNSN's ANSS Quake Monitoring System (AQMS) database. In this dataset, we include events that were labeled as "surface event", "thunder", "sonic boom", and unfortunately a "plane crash" (a confirmed event near Whidbey Island, Washington, 3 March 2013). We refer to these events as "exotic events" herein. Figure 2 shows the number of events in each category for our final dataset.
The temporal evolution of the exotic event catalog depends on manual intervention by the analysts. Because non-tectonic events are not the priority of the PNSN, analysts only pick them when time permits. Most of the labeled exotic events, such as surface events, are detected on well-instrumented volcanoes (see Figure S2). The lower event count in the period 2005-2008 coincides with volcanic unrest at Mt. St. Helens, when the network was desensitized to events around Mt. St. Helens due to the intense rate of volcano-tectonic seismicity. It is quite possible that other surface events outside of the volcanoes are missing because there are fewer stations elsewhere.
Most of the exotic events are small in magnitude and seismic amplitude and thus local to a few stations. Due to a lack of additional observation of the events (e.g., a ground truth imagery as done in the ESEC catalog), source characteristics such as the source origin time, location, and magnitude are not provided for these events.

Northern California Ferndale Earthquake Sequence
We also include events associated with the 20 December 2022 M6.4 Ferndale (northern California) Earthquake. This sequence provided us with a rare opportunity to add labels for moderate-to-large earthquake sizes. These events are outside of the PNSN's authoritative boundary and thus are not routinely processed by the network. We select 20 events of M ≥3 reported by the California Integrated Seismic Network (CISN) from that sequence and manually pick 609 P-wave arrivals. Table S1 lists events included in the dataset.

Station Metadata
The station metadata describes the technical information necessary for seismic data processing and tracks the history of any metadata changes. The IRIS DMC stores station metadata as dataless SEED files, but they can be downloaded in the StationXML format from the IRIS International Federation of Digital Seismograph Networks Web Service (FDSN-WS, http://service.iris.edu/fdsnws). The up-to-date station metadata we use is downloaded using ObsPy (Krischer et al., 2015). These stations are either long-term installations maintained by a seismic network (e.g., UW, University of Washington, 1963) or long-term experiments that last several years (e.g., US Transportable Array, FDSN code TA, IRIS Transportable Array, 2003).

Event Waveforms
All digitized data from the PNSN are requested and downloaded through the IRIS FDSN-WS. In total, we download ∼70 TB of continuous data in miniSEED format from 1 January 2002 to 31 December 2022, which took 2 months to complete. We first curate waveforms from high-gain velocity seismometers, specifically channels from short-period (EH?) and broad-band (either BH? or HH?) seismometers. We do not use the SL? and SH? channels since they are simply derived from the EH? channels after low-pass filtering or down-sampling. We also include waveforms from strong-motion EN? stations separately since the analysts also made picks on these channels. We do not correct for the instrumental response and do not integrate the acceleration to velocity. All waveforms are resampled to 100 Hz from their original sampling rates, which may be 40 Hz (most BH? channels) or 100 Hz (most EH? and HH? channels). The resampling step is necessary for deep neural networks with fixed input sizes. We keep the data as is, even if it is clipped.
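To illustrate why resampling yields a fixed input size, the sketch below brings a 150-second, 40 Hz three-component trace onto a 100 Hz grid. The text does not state which resampling method was used; plain linear interpolation is used here only as a stand-in, and a band-limited resampler would normally be preferable. The function name and the synthetic trace are hypothetical.

```python
import numpy as np

def resample_linear(data, rate_in, rate_out):
    """Interpolate each channel (row) of `data` from rate_in to rate_out (Hz)."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    n_in = data.shape[-1]
    duration = (n_in - 1) / rate_in                # seconds covered by the trace
    n_out = int(round(duration * rate_out)) + 1    # samples on the new grid
    t_in = np.arange(n_in) / rate_in
    t_out = np.arange(n_out) / rate_out
    return np.stack([np.interp(t_out, t_in, channel) for channel in data])

# 150 s of a 0.5 Hz sine sampled at 40 Hz on three channels (synthetic data).
t = np.arange(6001) / 40.0
bh = np.vstack([np.sin(2 * np.pi * 0.5 * t)] * 3)

resampled = resample_linear(bh, rate_in=40, rate_out=100)
print(resampled.shape)  # every stream ends up with the same number of samples
```

After this step a 40 Hz and a 100 Hz recording of the same window have identical array shapes, which is what a fixed-input network requires.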
Table 1 Number of included ComCat events in each magnitude range. The magnitude used here includes duration (Md), local (Ml), and hand (Mh) magnitude. The number of streams includes both velocity and acceleration channels. Also provided is the number of included events as a percentage of downloaded ComCat events. Numbers in parentheses show the events and streams from the 20 December 2022 Northern California earthquake sequence and are not included in the total number of events/streams.
For each ComCat event, we only select the stations where both P- and S-waves are picked. We prepare 150-second windows for ComCat events: the window starts 50 seconds before and ends 100 seconds after the source origin time (200 seconds after the origin time for the Northern California earthquake sequence). A trace of the same length immediately before this time window is curated as the noise waveform. The reason for including such a long noise window ahead of the origin time is to allow user flexibility when trimming and shifting the data in future investigations. In the ComCat events, fewer than 1% of the S-wave picks arrive later than 60 seconds after the origin time. Thus, most S-wave arrivals are included in the time window. Then, we apply a linear detrending. We also resample all waveforms to 100 Hz, which upsamples the broad-band BH? channels. Due to the small inaccuracy (∼0.00008%) of the digitizer clock of the analog EHZ stations, the sampling rate at these stations deviates slightly from exactly 100 Hz. We correct this by resampling to 100 Hz. Gappy traces are discarded. Missing channels, for example the horizontal components of vertical-component-only instruments (e.g., channel EHZ), are filled with zeros to keep a consistent three-component stream (further detailed below). Picks are only made with data from a single instrument per site, even if a site has several sensors. Therefore, each "stream" is independent of the others. Examples of earthquake waveforms can be found in Figures S3 and S4 for the velocity seismograms and Figure S5 for the acceleration seismograms. Examples of explosion waveforms can be found in Figures S6, S7, and S8. The PNSN operates seismic stations that are particularly remote. The transfer of data through telemetry sometimes leads to artifacts in the time series. Furthermore, the transition from triggered to continuous data was progressive. Sometimes, both triggered waveforms, which are detrended, and continuous data, which are unprocessed, are sent together: the triggered data overwrite the continuous data, creating a step in the data. These steps appear at both short-period (EH?) and broad-band (BH? and HH?) stations. For example, the time series may contain offsets that could be corrected in the future in the seismic archive at the IRIS DMC (see Figures S9 and S10).
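The windowing and zero-filling steps described above can be sketched as follows. The 50 s / 100 s offsets and the 100 Hz rate come from the text; the helper names, the component order ("E", "N", "Z"), and the example origin time are assumptions made for illustration.

```python
from datetime import datetime, timedelta

import numpy as np

SAMPLING_RATE = 100   # Hz
PRE_ORIGIN_S = 50     # window starts 50 s before the origin time
POST_ORIGIN_S = 100   # window ends 100 s after the origin time
N_SAMPLES = (PRE_ORIGIN_S + POST_ORIGIN_S) * SAMPLING_RATE + 1  # 15001 samples

def event_window(origin_time):
    """Start and end time of the 150 s event window."""
    start = origin_time - timedelta(seconds=PRE_ORIGIN_S)
    end = origin_time + timedelta(seconds=POST_ORIGIN_S)
    return start, end

def noise_window(origin_time):
    """Window of the same length that ends where the event window starts."""
    event_start, event_end = event_window(origin_time)
    return event_start - (event_end - event_start), event_start

def to_three_component(channels, order=("E", "N", "Z")):
    """Stack available components; missing ones stay filled with zeros."""
    stream = np.zeros((len(order), N_SAMPLES))
    for row, component in enumerate(order):
        if component in channels:
            stream[row] = channels[component]
    return stream

origin = datetime(2020, 1, 1, 0, 0, 0)  # hypothetical origin time
start, end = event_window(origin)
noise_start, noise_end = noise_window(origin)

# A vertical-only station (e.g., EHZ) provides just the Z component.
stream = to_three_component({"Z": np.ones(N_SAMPLES)})
```

Users who train on such streams may want to mask or flag the zero-filled rows, since they carry no information.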
The waveforms extracted for an exotic event are not aligned with the source origin time, which is mostly unknown. Instead, we align the waveforms by the phase picks provided by the analysts. The waveforms start 70 seconds before P-wave picks or 80 seconds before S-wave picks, whichever is available. Most exotic events have no picked S-waves, but if both P- and S-wave picks exist, the P-wave is prioritized to align the time window. The time window is 180 seconds long for all types of exotic events, given the occasional long duration and elongation (e.g., cigar-shaped waveforms, Manconi et al., 2017) of the surface events. We follow the same data-curating process and formats as for the ComCat events. Examples of surface-event waveforms can be found in Figures S11 and S12. Examples of thunderquakes can be found in Figures S13 and S14. Examples of sonic boom events are found in Figures S15 and S16, and all waveforms from the plane crash event in Figure S17.
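The pick-based alignment rule above can be written as a small helper. This is a sketch of the stated rule only (70 s before a P pick, otherwise 80 s before an S pick, 180 s long); the function name and the example pick times are hypothetical.

```python
from datetime import datetime, timedelta

def exotic_window(p_pick=None, s_pick=None):
    """Return (start, end) of the 180 s window for an exotic event."""
    if p_pick is not None:
        # The P pick takes priority when both picks exist.
        start = p_pick - timedelta(seconds=70)
    elif s_pick is not None:
        start = s_pick - timedelta(seconds=80)
    else:
        raise ValueError("at least one phase pick is required")
    return start, start + timedelta(seconds=180)

p = datetime(2015, 6, 1, 12, 0, 0)   # hypothetical P pick
s = p + timedelta(seconds=5)         # hypothetical S pick
start_both, end_both = exotic_window(p_pick=p, s_pick=s)
start_s_only, end_s_only = exotic_window(s_pick=s)
```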
We also extract noise-only waveforms. These waveforms are extracted just ahead of the event waveforms. We randomly select high-gain velocity channels (EH?, HH?, and BH?). To check for hidden events in the noise waveforms, we run the machine learning model (see Section ) on them and find very few cases where events may be present.
We organize the three-component waveforms into NumPy arrays and define a stream as a three-component array (Harris et al., 2020; Krischer et al., 2015). To improve accessibility in the machine-learning ecosystem, we follow the SeisBench data format convention. The metadata is stored in CSV (comma-separated values) files, while all waveforms are stored in the Hierarchical Data Format version 5 (HDF5) format. The signal-to-noise ratios (SNR) are calculated (detailed below) and saved as attributes in the metadata file.
After applying the selection criteria described above, more than 70% of the ComCat events are kept in the dataset. Figure 3 shows the map of the selected events. The datasets cover events within the authoritative boundary of the PNSN, offshore on the Juan de Fuca Ridge, underneath Vancouver Island, and farther east in Idaho. We provide an overview of the final number of ComCat waveforms and events in Table 1. The summary compiles the data volume across magnitudes from 0 to 6.4. It is possible that most of the events discarded by the selection had no S-wave picks because the waveforms were clipped. Our selection criteria also excluded more events before 2010, which we attribute to the far fewer S picks available when the data are clipped or when only vertical-component stations are available.

Machine Learning Phase Picker and Enhanced Earthquake Picks
We provide an alternative catalog of phase picks from the earthquake event catalog as a use-case of the dataset and a research-grade catalog of new picks of P and S waves using Machine Learning (ML). Automating phase picking using deep neural networks has revived the methodological development for picking seismic waves (Mousavi and Beroza, 2022; Münchmeyer et al., 2022).
Here, we use the Earthquake Transformer architecture from Mousavi et al. (2020) and implement phase-picking benchmark tests on the ComCat events. The SeisBench toolbox provides a set of Earthquake Transformer weights for models pre-trained with different datasets. We select all windowed waveforms from HH?, BH? and EH? channels and detrend the waveforms. We compare the picks made by these models trained on the STEAD, ETHZ, SCEDC, and INSTANCE datasets with the PNSN analyst picks recorded in the ComCat events. We demonstrate their performance by showing the residuals between ComCat picks and ML-predicted picks for P- and S-waves. The performance metrics are the mean absolute error (MAE), the root-mean-square error (RMS) for the phase picking, and the percentage of detected picks relative to ground truth picks. The input size of the Earthquake Transformer using SeisBench is 3-component, 60 seconds at 100 Hz. The probability threshold for picking is 10%. Figure 4 shows the distributions of the residuals among models and for both P and S wave picks.
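The three performance metrics can be computed as in the sketch below. The sign convention (ML pick minus analyst pick, so that a negative residual means an early ML pick) is inferred from the discussion of negative mean residuals; the function name, the use of NaN for missed picks, and the example values are assumptions.

```python
import numpy as np

def pick_metrics(analyst_picks, ml_picks):
    """Compare ML picks to analyst picks (seconds); NaN marks a missed pick."""
    analyst = np.asarray(analyst_picks, dtype=float)
    ml = np.asarray(ml_picks, dtype=float)
    detected = ~np.isnan(ml)
    residuals = ml[detected] - analyst[detected]  # negative: ML pick is earlier
    return {
        "mean": float(residuals.mean()),
        "mae": float(np.abs(residuals).mean()),
        "rms": float(np.sqrt((residuals ** 2).mean())),
        "detected_pct": float(100.0 * detected.mean()),
    }

# Four analyst picks, one of which the model missed (made-up numbers).
metrics = pick_metrics([10.0, 20.0, 30.0, 40.0], [9.9, 20.1, np.nan, 39.8])
```

Reporting the detection percentage next to MAE and RMS matters because the error statistics are computed only over picks the model actually made.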
We benchmark the detection and picking performance against i) the seismic network-specific expectations for the manual picking uncertainties and ii) the bias and variance of the residual distributions reported in other studies (Mousavi et al., 2020; Münchmeyer et al., 2022). We find a general trade-off between detection accuracy (completeness) and phase-pick quality (low errors). The model trained with the STEAD dataset has the best quality in phase picks relative to the (ground truth) analyst's picks, but it misses more than 20% of the detections. In contrast, the model trained with the SCEDC dataset has the best detectability and only misses about 5% of arrivals for both P- and S-waves, but the picking accuracy, especially that of S-waves, is poor. Both models show negative mean residuals for both P- and S-waves, indicating that the ML picks are on average earlier than the manual picks. The models trained with the ETHZ and INSTANCE datasets show a similar pattern in Figure 4. The performance trade-off between detection and picking accuracy makes retraining the phase pickers using the PNW data necessary.
Using our curated dataset of ComCat earthquakes and explosions, we retrain the Earthquake Transformer model. Instead of training from scratch (randomly initialized weights), we start the training from the SeisBench-trained model, which used the STEAD dataset, and continue training for an additional 100 epochs on our dataset. We note that about 3% of the STEAD dataset contains PNSN data, which may be problematic for data leakage. However, the STEAD data is bandpass-filtered at 1-45 Hz, while we do not filter the data. We randomly select 70% of the ComCat dataset for training and use the remaining 30% for testing. We use a triangular label with a 10-sample half-width. We use the same loss function that Mousavi et al. (2020) used to train the Earthquake Transformer (a weighted sum of the losses from the P, S, and detection branches). We use a small learning rate (1 × 10⁻⁴) with the Adam optimizer (Kingma and Ba, 2014) during the training. Compared with the other pre-trained models, the transfer learning on the PNW dataset improves the detection accuracy, considerably improves the S-wave picks, and performs as well as the STEAD-trained model (see Figure 4). Although not eliminated, the negative mean residuals are reduced after retraining. We also test all these models on strong-motion (acceleration) channels, for which INSTANCE contains the most acceleration waveforms (28.3%). The PNW transfer-learned model outperforms other pre-trained models on strong-motion channels, as shown in Figure S18.
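Two of the training ingredients above, the triangular pick label and the random 70/30 split, can be sketched as follows. The sketch assumes that a "10-sample half-width" means the label decays linearly from 1 at the pick to 0 ten samples away, and it splits by item index because the text does not say whether the split is by event or by stream. Function names and the seed are hypothetical.

```python
import numpy as np

def triangular_label(n_samples, pick_index, half_width=10):
    """Label that is 1 at the pick and falls linearly to 0 at +/- half_width."""
    distance = np.abs(np.arange(n_samples) - pick_index)
    return np.clip(1.0 - distance / half_width, 0.0, 1.0)

def split_indices(n_items, train_fraction=0.7, seed=0):
    """Random, non-overlapping train/test index sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_fraction * n_items))
    return order[:n_train], order[n_train:]

# 60 s at 100 Hz is the model input size quoted above (6000 samples).
label = triangular_label(6000, pick_index=3000)
train_idx, test_idx = split_indices(1000)
```

Splitting by event rather than by stream would avoid having recordings of the same earthquake in both sets; which choice was made here is not stated.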
The ability of the retrained Earthquake Transformer to find more picks, and more accurate ones, makes it possible to create a future machine-learning-enhanced earthquake catalog. We revisit waveforms from the ComCat events that included either P or S picks. There are 683,133 P- and 244,431 S-wave picks for 62,054 events from these waveforms. We detect 16,201 (2%) and 207,146 (85%) new arrivals out of 686,748 time windows for P- and S-waves using the refined phase picker. As a crude quality control, we remove the picks where the ratio between the S-wave travel time and the P-wave travel time exceeds 2.5 or falls below 1.5. We add these picks, together with the PNSN manual picks, as a part of the curated dataset in a separate file. We also use this retrained model to make predictions on the noise waveforms and drop those with any prediction greater than 0.1. This step effectively removes unpicked seismic events from the noise waveforms.
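The two quality-control rules above are simple enough to state as code. The thresholds (1.5, 2.5, and 0.1) come from the text; the function names, the rejection of non-positive P travel times, and the example values are assumptions.

```python
def keep_pick_pair(origin_time_s, p_time_s, s_time_s, low=1.5, high=2.5):
    """Keep a P/S pair only if the S-to-P travel-time ratio is within [low, high]."""
    p_travel = p_time_s - origin_time_s
    s_travel = s_time_s - origin_time_s
    if p_travel <= 0:
        # Assumption: a pick at or before the origin time is treated as invalid.
        return False
    return low <= s_travel / p_travel <= high

def is_clean_noise(probabilities, threshold=0.1):
    """A noise window is kept only if no prediction exceeds the threshold."""
    return max(probabilities) <= threshold

# Travel times of 10 s (P) and 17.3 s (S) give a ratio of 1.73.
plausible = keep_pick_pair(0.0, 10.0, 17.3)
too_late = keep_pick_pair(0.0, 10.0, 30.0)
too_early = keep_pick_pair(0.0, 10.0, 12.0)
quiet = is_clean_noise([0.01, 0.05, 0.02])
contaminated = is_clean_noise([0.01, 0.30, 0.02])
```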

Description of the AI-ready Dataset
The datasets consist of two files per set, one HDF5 file containing the waveforms and a CSV file with the metadata (attributes).

Waveforms
There are 190,016 and 9,267 three-component streams curated from ComCat and exotic event catalogs, respectively. Figure 5 shows the counts of streams arranged by channel type as a yearly estimate. We store all waveforms in HDF5 files using h5py (Collette et al., 2021) and index them by the trace name in the metadata. The attribute trace_start_time in YYYY-MM-DDTHH:MM:SS.SSSZ format describes the UTC time at which the stream begins. Listing 1 illustrates how users can read the waveform data and locate the stream in Python.
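Listing 1 Read stream data from SeisBench format waveform file using h5py
    import h5py
    f = h5py.File("/path/to/waveform.hdf5", "r")
    trace_name = "bucket1$0,:3,:15001"
    bucket, array = trace_name.split('$')
    # array is "0,:3,:15001": stream index, number of components, number of samples
    x, y, z = iter([int(i) for i in array.split(',:')])
    # Assumes the SeisBench layout, in which buckets are stored under the "data" group
    waveform = f['/data/%s' % bucket][x, :y, :z]
The data is saved as vertically concatenated NumPy arrays of fixed window length (here 150 s) with three components. It is distributed over several "buckets" that are "groups" under the HDF5 taxonomy. The trace name (a data attribute saved in the metadata data frame) encodes the bucket, the index of the stream within the bucket along the first dimension, and the extent of the component and sample dimensions.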
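To locate a given sample in absolute time, the trace name and the trace_start_time attribute can be combined without opening the HDF5 file. The sketch below assumes the 100 Hz sampling rate stated earlier and the YYYY-MM-DDTHH:MM:SS.SSSZ format described above; the helper names and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def parse_trace_name(trace_name):
    """Split e.g. 'bucket1$0,:3,:15001' into (bucket, index, n_components, n_samples)."""
    bucket, array = trace_name.split("$")
    index, n_components, n_samples = (int(part) for part in array.split(",:"))
    return bucket, index, n_components, n_samples

def sample_time(trace_start_time, sample_index, sampling_rate=100.0):
    """UTC time of a sample, given the stream start time string."""
    start = datetime.strptime(trace_start_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    start = start.replace(tzinfo=timezone.utc)
    return start + timedelta(seconds=sample_index / sampling_rate)

bucket, index, n_components, n_samples = parse_trace_name("bucket1$0,:3,:15001")
# For a ComCat stream, sample 5000 is 50 s after the window start.
t_5000 = sample_time("2020-01-01T00:00:00.000Z", 5000)
```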

Metadata
The metadata describes the waveform data and its attributes and is essential to our dataset. Each stream corresponds to one record (or a row) in the metadata file. We follow SeisBench conventions again. The unit of each attribute is appended as part of the attribute's name. For example, source_latitude_deg indicates the latitude of the source in degrees. A full description of the attributes is listed in Table 2. As many attributes are self-explanatory, we provide more details below.

Station Network Code
Stations selected in both datasets may come from nine different FDSN network codes. These stations are either installed and maintained by PNSN (e.g., UW and UO) or used by PNSN for phase picking and event location (e.g., PB, CC, IU, CN, HW, TA, US). Maps of the stations in the dataset show a similar distribution for both ComCat (Figure S1) and exotic events (Figure S2). All stations are inland stations, and no offshore stations (e.g., OOI) are used in our dataset. The numbers of streams from each FDSN network and their references are listed in Table 3. PNSN stations contribute more than 85% of streams in the ComCat and Exotic event datasets.

Event ID
An event identifier (ID) is given to each event by the PNSN after the processing is finalized and sent to ANSS through USGS Product Distribution Layer (PDL). The ComCat events contributed by the PNSN have IDs of eight-digit numbers with a "uw" prefix, e.g., "uw10568488". The event IDs are unique in the catalog. The exotic event IDs are internal to the PNSN AQMS database and cannot be accessed through USGS. To distinguish them from ComCat events, we add a "pnsn" prefix to their event IDs.
Table 2 Attributes in the metadata file. Some source attributes are not available for exotic events.

Event Type
When processing a seismic event as the seismic data comes in, the event type is manually specified by the network analysts. For example, the PNSN labels as "probable explosion" the waveforms that have the characteristics of shallow quarry blasts (strong P waves and location near known quarries). Until the 1990s, the PNSN would confirm these explosions by phone, though this is no longer routinely done. When sending the finalized event from the AQMS database to ComCat, PNSN maps and merges several types of events into one: "earthquake", "slow earthquake", and "long period volcanic earthquake" are mapped into the "earthquake" category; "explosion", "shot" and "probable explosion" are merged into the "explosion" category. For simplicity and consistency, we use the event types "earthquake" and "explosion" for the ComCat events, but their original event types are also included for reference in the metadata. Table S2 lists the latest PNSN event-type labels from the PNSN AQMS database.
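The merging of PNSN event types into the two ComCat categories amounts to a lookup table. The sketch below encodes only the six labels named above; the dictionary and function names are hypothetical, and labels outside the table (such as those of exotic events) map to None.

```python
# PNSN AQMS event type -> ComCat event type, as described in the text.
COMCAT_EVENT_TYPE = {
    "earthquake": "earthquake",
    "slow earthquake": "earthquake",
    "long period volcanic earthquake": "earthquake",
    "explosion": "explosion",
    "shot": "explosion",
    "probable explosion": "explosion",
}

def to_comcat_type(pnsn_type):
    """Return the merged ComCat category, or None if the label is not mapped."""
    return COMCAT_EVENT_TYPE.get(pnsn_type.strip().lower())

merged = [to_comcat_type(t) for t in ("shot", "slow earthquake", "thunder")]
```

Keeping the original label alongside the merged one, as the metadata does, lets users undo the merge when the finer classes matter.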

Source Magnitude and Type
The event size, as represented by the source magnitude, is only available for the ComCat events. All ComCat events included in the dataset have magnitudes less than seven and greater than zero, as shown in Table 1. The magnitude of completeness of the catalog is estimated using the method of Wiemer and Wyss (2000) and found to be around 2 for the years 2019-2022 (Figure S19). The types of magnitudes reported are typical for local and regional earthquakes: the local magnitude (Ml) and the duration magnitude (Md).
There are three types of magnitude used in the dataset. The PNSN uses a local magnitude (Ml, Richter, 1958;Jennings and Kanamori, 1983) that measures the magnitude of a local earthquake using the average maximum amplitudes of two horizontal seismograms converted to have the Wood-Anderson response, preferably taken from broad-band seismometers, and corrected for the distance between the source and the receiver. Such magnitude is reported by the National Earthquake Information Center (NEIC) for all earthquakes in the US and Canada. The coda duration magnitude Md is calculated based on the duration of shaking measured on the vertical component and could be the only available magnitude product for small events or those not well recorded on well-calibrated stations with horizontal components. Over the course of time, processes to calculate the magnitudes vary because of varied processing routines and analyst interventions.
Until 2012, the PNSN only reported duration magnitude to ComCat for most earthquakes using the algorithm from Crosson (1972), except for a few significant events that were manually changed to the local magnitude. The early seismic stations of the PNSN only had vertical components, a small dynamic range, and short-period sensors that would clip even for relatively small magnitude events. It is not possible to obtain a local magnitude from such data. As the network modernized over time, higher dynamic-range three-component sensors were added, and the data quality improved, which allowed the PNSN to determine an Ml for more events. From 2002 to 2011, 46,326 events had duration magnitude preferred, while only 483 events (average magnitude 2.45) had local magnitude reported as the preferred magnitude type. From 2012 to 2015, the PNSN calculated and reported both duration and local magnitudes, though the local magnitude was still only calculated for larger events. Since 2015, the PNSN has used the local magnitude, rather than the duration magnitude, as the preferred and default magnitude. 80% of all events included in the ComCat dataset until 2008 have a duration magnitude preferred, after which Ml-preferred magnitudes became increasingly common (Figure 6). While the duration magnitude is still calculated, it is the preferred magnitude for only about 10% of the events each year. From 2002 to 2022, there were also 111 events with an Mh magnitude in the dataset, extracted from the NEIC and manually added by the network analysts. Note that there is no moment magnitude Mw reported in this dataset because the moment magnitude is obtained from low-frequency seismograms, which are often buried in the seismic noise for small earthquakes. Mw magnitudes may be included as Mh.
There are potential challenges in interpreting the magnitudes as ground truth labels. Md and Ml have known systematic biases that arise from the particularly high near-source scattering of shallow earthquakes or quarry blasts (Koper et al., 2020; Wang et al., 2021). In 2012, the PNSN adopted AQMS, which included a method to measure coda duration that was inconsistent with the previously used method. The PNSN staff did a rough recalibration of their Md relationship to partially account for the systematic difference. However, there is a known inconsistency in the Md magnitudes of the smallest events before and after 2012. Future efforts must be made to re-calculate the magnitudes more systematically, ideally using consistent methods, throughout the 2002-2022 period. Table 1 shows the event counts per magnitude bin for this dataset. The largest event in the dataset is the Mw 6.4 Northern California earthquake of 20 December 2022. This event was outside the PNSN's authoritative boundaries, so ComCat preferred the origin contributed by the CISN. The largest earthquake in this dataset within the PNSN's authoritative boundaries is the Md 4.8 Brinnon, Washington, earthquake of 25 April 2003 (event ID uw10583988), for which relatively small magnitude uncertainty (0.04), depth uncertainty (0.59 km), and horizontal uncertainty (0.347 km) were reported.

Stream Signal-to-Noise Ratio
The signal-to-noise ratio (SNR) is an important measure of the noise level in the traces. Similar to Michelini et al. (2021), we define the noise window as the 8 seconds before the P-wave arrival for the ComCat events. To better capture the energy of emergent S-wave onsets, the signal window is defined as 1 second before to 2 seconds after the S-wave arrival. For the exotic event catalog, since P-wave and S-wave arrivals may not be available, the noise window is defined to begin 12 seconds after the beginning of the traces. The signal window for exotic events is defined in the same way as above, around the P- or S-wave arrival, whichever is available. For each component, the SNR is defined as
$$\mathrm{SNR} = 20 \log_{10}\left(\frac{|S_{98}|}{|N_{98}|}\right),$$
where $|S_{98}|$ and $|N_{98}|$ are the 98th percentiles of the absolute values in the signal and noise windows, respectively. When no data is available, e.g., for a single-channel station with only the EHZ channel, NaN (not-a-number) is filled in as a placeholder for the missing channels. Figure 8 shows the distribution of individual SNRs calculated from the ComCat and exotic event catalogs. The traces with SNR > 80 dB (indicating an error in the noise window) or < −20 dB (indicating too weak a signal) are removed from the dataset.
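The SNR computation for one component of a ComCat event can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: it assumes the usual amplitude-ratio decibel convention, 20 log10(|S98|/|N98|), linear interpolation for the percentile, and arrival times given as sample indices. All function names are hypothetical.

```python
import math


def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between sorted samples."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty window")
    rank = (len(ordered) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def comcat_windows(p_index, s_index, fs):
    """Noise: 8 s before P. Signal: 1 s before to 2 s after S (fs in Hz)."""
    noise = slice(max(0, p_index - int(round(8 * fs))), p_index)
    signal = slice(max(0, s_index - int(round(fs))), s_index + int(round(2 * fs)))
    return noise, signal


def snr_db(trace, noise_slice, signal_slice):
    """SNR in dB from the 98th percentiles of absolute amplitude.

    Returns NaN when a window is empty or a level is zero (assumed handling).
    """
    noise = [abs(x) for x in trace[noise_slice]]
    signal = [abs(x) for x in trace[signal_slice]]
    if not noise or not signal:
        return math.nan
    n98 = percentile(noise, 98)
    s98 = percentile(signal, 98)
    if n98 == 0 or s98 == 0:
        return math.nan
    return 20.0 * math.log10(s98 / n98)


def out_of_range(snr):
    """True for SNR > 80 dB or < -20 dB; NaN placeholders are not flagged."""
    return snr > 80.0 or snr < -20.0
```

Using a high percentile rather than the maximum makes the estimate less sensitive to isolated spikes in either window.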

Uncertainties
The metadata includes four types of uncertainties for the ComCat events. The P- and S-wave arrival uncertainties are estimated at the time of picking. Before the PNSN used AQMS, the uncertainty was directly measured and recorded in the phase data, and a weight was calculated. Since 2012, using Jiggle from AQMS, the analysts assign each pick an integer weight from zero to four by visually assessing the impulsivity of the arrival. A weight of zero indicates the highest pick accuracy, typically for P-wave arrivals, and corresponds to 0.03 seconds of uncertainty. A weight of three indicates a low pick accuracy, typically for S-wave arrivals, and corresponds to 0.3 seconds of uncertainty. Phase uncertainties are used when locating the events, but picks with a weight of four are typically not used in earthquake locations. Before 2012, the PNSN used Spong (an adaptation of Fasthypo, Herrmann, 1979) as the location engine. This changed to HYPOINVERSE (Klein, 2002) after the PNSN started using AQMS and Jiggle.
The origin location (depth and horizontal) uncertainties are the errors estimated by the location engine. Figure S28 shows the locations of the events with horizontal uncertainty greater than 20 km. Note the cluster offshore Oregon that is outside of the PNSN authoritative boundaries. The PNSN has poor location constraints on these events since there are almost no offshore seismic stations except for those of the Ocean Observatories Initiative Regional Cabled Array (FDSN network code OO, Rutgers University, 2013), which are occasionally picked during PNSN routine data processing. ComCat may not choose these origin products from the PNSN as preferred. However, the events with high horizontal uncertainty only make up 0.4% of all ComCat events, and their picks are still accurate enough to be part of the dataset.
Table 3 Description of network FDSN codes and their references. Networks annotated with an asterisk (*) are maintained by the PNSN. The number of streams shown for each network is from ComCat events, exotic events, and noise, respectively. The PB and HW networks do not have a registered FDSN network DOI.
We also include the magnitude uncertainties in the metadata. The magnitude is first evaluated at the channel level. For three-component stations, the channel-level local magnitude is calculated only if a P- or S-wave is picked on one of the components, so that only clear signals are selected. Since 2012, a few single-component stations (EHZ) also contribute to the local magnitude and have the same weight as three-component stations. The event magnitude is the median of all channel magnitudes that meet the SNR criteria. The event magnitude uncertainty is the median absolute deviation (MAD) of the channel magnitudes used for the event magnitude calculation. These uncertainties are calculated for all magnitude types except Mh.
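The median and MAD aggregation described above can be illustrated with a short sketch. It assumes that the channel magnitudes have already passed the SNR criteria and that the MAD is reported unscaled (no normal-consistency factor). The function name `event_magnitude` is hypothetical.

```python
import statistics


def event_magnitude(channel_magnitudes):
    """Return (median, MAD) of the channel-level magnitudes.

    The MAD is the median of absolute deviations from the median, unscaled
    (an assumption; the source does not state a scaling factor).
    """
    if not channel_magnitudes:
        raise ValueError("no channel magnitudes")
    med = statistics.median(channel_magnitudes)
    mad = statistics.median(abs(m - med) for m in channel_magnitudes)
    return med, mad


# Example with made-up channel magnitudes, one of them an outlier (2.6).
magnitude, uncertainty = event_magnitude([2.1, 2.3, 2.2, 2.6, 2.0])
```

Both statistics are robust to a single outlying channel, which is one reason to prefer them over the mean and standard deviation when a channel is clipped or poorly calibrated.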

P-wave Polarity
When analysts pick the phase arrivals, Jiggle also automatically measures the first motion of the P-wave picks with weights less than one (i.e., the best waveforms), leaving the rest as "undecidable". The analysts can manually override these polarities if they are confident. Less than 42% of P-waves in this dataset have undecidable polarity information. The ratio of positive to negative P-wave polarities as a function of year is shown in Figure S20. The sudden shift toward assigning or reporting positive polarities in 2012 strongly suggests that the switch to AQMS and Jiggle in that year affected the PNSN analysts' output. Until this data collection effort, we were unaware of this fact, and the reason for the abrupt change is unclear.
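A yearly polarity ratio of the kind plotted in Figure S20 could be computed along the following lines. This is only a sketch: the input layout, a sequence of (year, polarity) pairs, and the label strings "positive", "negative", and "undecidable" are assumptions for illustration and may not match the metadata field names or values.

```python
from collections import Counter, defaultdict


def polarity_ratio_by_year(picks):
    """Return {year: positive/negative ratio} from (year, polarity) pairs.

    Undecidable picks are counted but do not enter the ratio. Years without
    any negative pick are skipped to avoid a division by zero.
    """
    counts = defaultdict(Counter)
    for year, polarity in picks:
        counts[year][polarity] += 1
    ratios = {}
    for year, c in sorted(counts.items()):
        if c["negative"] > 0:
            ratios[year] = c["positive"] / c["negative"]
    return ratios
```

A ratio that stays near one before a given year and departs from it afterward would point to a change in processing rather than in the earthquakes themselves.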

Conclusion
This work contributes to collecting and curating a seismic dataset for the Pacific Northwest region. The curated dataset builds on the long-standing work and labeling of the Pacific Northwest Seismic Network analysts and seismologists. We described the temporal and spatial characteristics of the data attributes.
This original contribution focused on preparing the seismic waveforms and PNSN-provided data attributes (phase picks and default source parameters). We picked additional waveforms for the 20 December 2022 Northern California earthquake sequence, the largest event recorded recently in proximity to the PNSN authoritative boundaries. We also applied transfer learning to an established phase picker, the Earthquake Transformer (Mousavi et al., 2020), using the highest-quality PNSN picks, and used it to generate additional S-wave picks, which we provide in this contribution as an alternate catalog of picks.
There remains tremendous work to improve the quality and consistency of the data attributes. We use version control on the curated dataset through GitHub to allow for future development of the data and metadata (Ni, 2023). Examples of future developments may be a refinement of the current attributes or the addition of new labels. In particular, the attribute "magnitude" should be carefully interpreted as 60% of the catalog uses duration magnitude, and 40% of the catalog uses the local magnitude, but both may have biases. Therefore, a follow-up task is to re-calculate these magnitudes using consistent methods. Another avenue for improvement is to re-estimate the polarity of the P and S waves, using the known labels and predicting the "undecided" labels. Furthermore, we have not yet included other types of tectonic events, such as low-frequency earthquakes (Ducellier and Creager, 2022), but these would improve the diversity of events. Finally, an obvious next step will be event classification work that will take the waveforms and predict the event type.