Cactus to Clouds: Processing The SCEDC Open Data Set on AWS
Tim Clements
Conference paper to accompany poster presentation at the 2019 SCEC Annual Meeting in Palm Springs, CA.
Abstract
Data from the Southern California Earthquake Data Center (SCEDC) are now in the cloud! Amazon Web Services (AWS) is currently hosting twenty years of data (1999-2019, 552 stations, ~50 TB) from the Southern California Seismic Network as an Open Data Set. Here, we share the promises and pitfalls of leveraging cloud computing for seismic research, drawn from our initial work with SCEDC data on AWS. We present an AWS-based workflow for our use case: ambient noise cross-correlation of all stations in Southern California for groundwater monitoring. Ambient noise cross-correlation is both I/O- and compute-heavy.
Analyzing seismic data in the cloud substantially reduces download times. Our Julia and Python APIs for transferring miniSEED files from Amazon Simple Storage Service (S3) to Amazon Elastic Compute Cloud (EC2) achieve transfer speeds of up to 250 MB/s. We use AWS ParallelCluster to deploy Simple Linux Utility for Resource Management (SLURM)-based clusters on EC2, allowing us to spin up thousands of cores in minutes. Processing one year (8 TB) of data at 20 Hz costs less than $100. Our initial work with AWS suggests that cloud computing will decrease time-to-science for I/O- and compute-heavy seismic workloads.
What is cloud computing and why is it better to store data on the cloud?
You may have heard that cloud computing allows for scaled, on-demand high performance computing (HPC) and storage. But what is cloud computing and how does one access it? Cloud computing comprises a number of different services and models under one umbrella. Google Drive and Dropbox are examples of cloud computing software as a service: the user interacts with software in a web browser rather than with the underlying cloud infrastructure. Cloud computing providers like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Compute (GCC) act much like water authorities, providing infrastructure for customers to use however they need. The providers (AWS, Azure, GCC) pay for the wells and pumps (CPUs), the delivery infrastructure of pipes and canals (bandwidth), and short- to long-term storage in the form of water towers and reservoirs (SSDs and HDDs). Water authorities only charge customers for the water (compute resources) they use, and for larger customers, per-unit prices fall as usage grows.
Data on the Cloud
As of now (September 2019) there are more than 50 TB of data from the Southern California Earthquake Data Center on Amazon Simple Storage Service (S3). Downloading all of these data over a typical 1 MB/s connection would take well over a year. If instead we bring our code to the data, we can start computing in minutes. To do this, one would normally start an Amazon Elastic Compute Cloud (EC2) instance, install their code (from GitHub or with a package manager), transfer data from S3 to EC2, and start computing. This tutorial skips the steps of spinning up an EC2 instance and installing software by using a Nextjournal Julia notebook to access the SCEDC Open Data Set on AWS.
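As a taste of what "bringing code to the data" looks like, here is a minimal sketch that pulls one day-long miniSEED file from the scedc-pds bucket onto an instance using AWSS3.jl. This is only a sketch: it assumes AWSS3.jl is installed and that an AWS configuration pointing at the us-west-2 region is already set up (the exact configuration call depends on the AWSS3.jl version).
using AWSS3
# Key follows the year/year_day layout of the bucket, described below.
key = "2016/2016_183/CISDD__HHZ___2016183.ms"
bytes = s3_get("scedc-pds", key)   # download the object from S3 into memory
write(basename(key), bytes)        # save it to local disk on the instance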
Accessing the SCEDC Dataset on AWS
This Julia notebook demonstrates how to access the Southern California Earthquake Data Center Open Data Set on AWS. The data are stored in the scedc-pds bucket in the us-west-2 region on Amazon Simple Storage Service (S3). For this tutorial, we'll use Nextjournal to connect to AWS instead of an API such as Boto3 or AWSS3.jl. Though not shown in this tutorial, using the SeisNoise.jl download client on AWS we can achieve transfer speeds of up to 250 MB/s from S3 to EC2! This is 10x faster than downloading files using a node on a cluster. 250 MB/s is equivalent to roughly 1 TB/hour/instance, so with 100 instances we could transfer the entire SCEDC catalog in about 30 minutes.
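A quick back-of-the-envelope check of that estimate, using the same round numbers quoted above:
rate_MB_per_s = 250.0                      # per-instance S3 -> EC2 throughput
TB_per_hour = rate_MB_per_s * 3600 / 1e6   # ≈ 0.9 TB/hour per instance
hours = 50.0 / (TB_per_hour * 100)         # ~50 TB catalog spread over 100 instances
# ≈ 0.56 hours, or roughly 30 minutes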
Connecting to AWS with Nextjournal and Julia
Nextjournal can mount an AWS bucket as a local read-only directory. A bucket is a storage container on S3; scedc-pds is publicly readable. Here, I've mounted the scedc-pds bucket as a directory.
Let's see what's in scedc-pds
readdir("/scedc-pds")
Data are stored in /year/year_day/{network}{station}{channel}{location}{year}{day}.ms format, where the station code is padded to 5 characters and the location code to 3 characters with underscores. Let's have a look at the first day of 2013:
readdir("/scedc-pds/2013/2013_001")
We'll load these data using the SeisIO.jl package (a lean Julia equivalent of ObsPy).
Loading data with SeisIO
Before we load anything we need to install packages to load and play with data (this may take 90 seconds or so).
using Pkg
Pkg.add("Plots")
Pkg.add(PackageSpec(name="SeisIO", rev="master"))
Pkg.add(PackageSpec(name="SeisNoise", rev="master"))
using Dates, SeisIO, SeisNoise, Plots
Now we can read a daylong file using SeisIO.jl's read_data function:
# read one day of BHZ data from station CI.WWC
S = read_data("mseed", "/scedc-pds/2013/2013_001/CIWWC__BHZ___2013001.ms")
Now that we have data, let's plot a seismogram:
starttime, endtime = start_end(S[1])                       # first and last sample times
dt = Millisecond(1 / S[1].fs * 1e3)                        # sample interval in milliseconds
t = starttime:dt:endtime + dt                              # time axis for plotting
ind = findall(x -> x < DateTime(2013, 1, 1, 0, 10, 0), t)  # keep the first ten minutes of the day
SeisIO.detrend!(S)                                         # remove linear trend
SeisIO.taper!(S)                                           # taper the ends of the trace
filtfilt!(S, fl=0.1, fh=1., np=2, rt="Bandpass")           # zero-phase bandpass, 0.1-1 Hz
Plots.plot(t[ind], S[1].x[ind], xlabel="time")
Instrument Responses
Instrument responses (stationXML) for all CI network stations are stored on AWS in the scedc-pds bucket. To access the stationXML files, head to scedc-pds/FDSNstationXML/CI:
readdir("/scedc-pds/FDSNstationXML/CI/")
Let's read the stationXML for station CI.WWC using SeisIO:
R = read_sxml("/scedc-pds/FDSNstationXML/CI/CI_WWC.xml")  # read the station metadata
R.id                                                      # list the channel ids in the file
There are 59 different channels for this one station! We'll need to select the BHZ response.
ind = findfirst(x -> occursin("BHZ", x), R.id)  # index of the first BHZ channel
R = R[ind]
Now that we have the correct channel, we can remove the instrument response using remove_response!
freqmin = 0.01  # minimum frequency in Hz for response removal
freqmax = 8.    # maximum frequency in Hz
remove_response!(S, "/scedc-pds/FDSNstationXML/CI/CI_WWC.xml", freqmin, freqmax)
Ambient Noise Cross-Correlation on AWS
Now we'll do some ambient noise cross-correlations using the SeisNoise.jl package, which is written for easy, fast cross-correlation in Julia. Here are the input parameters for the cross-correlation:
fs = 10.                     # sampling frequency in Hz
freqmin, freqmax = 0.1, 0.2  # minimum and maximum frequencies in Hz
cc_step, cc_len = 450, 1800  # correlation step and window length in seconds
maxlag = 60.                 # maximum lag time in the correlation, in seconds
smoothing_half_win = 3       # spectral smoothing half-window length
Next we'll read waveforms from two separate stations in the SCEDC dataset:
S1 = read_data("mseed", "/scedc-pds/2016/2016_183/CISDD__HHZ___2016183.ms")  # CI.SDD HHZ
S2 = read_data("mseed", "/scedc-pds/2016/2016_183/CIHOL__HHZ___2016183.ms")  # CI.HOL HHZ
Cross-correlations are computed more efficiently in the frequency domain. Let's compute the Fourier transforms of the waveforms:
# slice each station into cc_len-second windows, pre-process, and FFT
FFT1 = compute_fft(S1, freqmin, freqmax, fs, cc_step, cc_len, max_std=10.)
FFT2 = compute_fft(S2, freqmin, freqmax, fs, cc_step, cc_len, max_std=10.)
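The speed-up comes from the correlation theorem: a cross-correlation in the time domain becomes an element-wise product of spectra in the frequency domain. A minimal illustration in plain Julia is below; FFTW.jl is assumed here only for the sketch and is not needed for the SeisNoise calls in this notebook.
using FFTW
a, b = rand(1024), rand(1024)
# circular cross-correlation of a with b, computed via the frequency domain
cc = real(ifft(fft(a) .* conj(fft(b))))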
Now we can compute cross-correlations from the FFTs:
coherence!(FFT1, smoothing_half_win)  # spectral normalization (whitening) of each FFT
coherence!(FFT2, smoothing_half_win)
C = compute_cc(FFT1, FFT2, maxlag, corr_type="coherence")  # cross-correlate in the frequency domain
clean_up!(C, freqmin, freqmax)  # detrend, taper, and bandpass the correlations
abs_max!(C)                     # normalize each correlation by its absolute maximum
And plot it!
corrplot(C)