Cactus to Clouds: Processing The SCEDC Open Data Set on AWS
Tim Clements
Conference paper to accompany poster presentation at the 2019 SCEC Annual Meeting in Palm Springs, CA.
Abstract
Data from the Southern California Earthquake Data Center (SCEDC) are now in the cloud! Amazon Web Services (AWS) is currently hosting twenty years of data (1999-2019, 552 stations, ~50 TB) from the Southern California Seismic Network as an Open Data Set. Here, we share the promises and pitfalls of leveraging cloud computing for seismic research, drawn from our initial work with SCEDC data on AWS. We present an AWS-based workflow for our use case: ambient noise cross-correlation of all stations in Southern California for groundwater monitoring. Ambient noise cross-correlation is both I/O- and compute-heavy.
Analyzing seismic data in the cloud substantially reduces download times. Our Julia and Python APIs for transferring miniSEED files from Amazon Simple Storage Service (S3) to Amazon Elastic Compute Cloud (EC2) achieve transfer speeds of up to 250 MB/s. We use AWS ParallelCluster to deploy Simple Linux Utility for Resource Management (SLURM)-based clusters on EC2, allowing us to spin up thousands of cores in minutes. Processing one year (8 TB) of data at 20 Hz costs less than $100. Our initial work with AWS suggests that cloud computing will decrease time-to-science for I/O- and compute-heavy seismic workloads.
What is cloud computing and why is it better to store data on the cloud?
You may have heard that cloud computing allows for scaled, on-demand high performance computing (HPC) and storage. But what is cloud computing and how does one access it? Cloud computing comprises a number of different services and models under one umbrella. Google Drive and Dropbox are examples of cloud computing software as a service: the user interacts with software in a web browser rather than with the underlying cloud infrastructure. Cloud computing providers like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Compute (GCC) act much like water authorities, providing infrastructure for customers to use however they need. The providers (AWS, Azure, GCC) pay for the wells and pumps (CPUs), the delivery infrastructure of pipes and canals (bandwidth), and short- to long-term storage in the form of water towers and reservoirs (SSDs and HDDs). Water authorities only charge customers for the water (compute resources) they use, and for larger customers, per-unit prices fall as usage grows.
Data on the Cloud
As of now (September 2019) there are more than 50 TB of data from the Southern California Earthquake Data Center on Amazon Simple Storage Service (S3). Downloading all of these data over a typical 1 MB/s connection would take well over a year. If instead we bring our code to the data, we can start computing in minutes. To do this, one would normally start an Amazon Elastic Compute Cloud (EC2) instance, install their code (from GitHub or with a package manager), transfer data from S3 to EC2, and start computing. This tutorial skips the steps of spinning up an EC2 instance and installing software by using a Nextjournal Julia notebook to access the SCEDC Open Data Set on AWS.
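As a taste of what "bringing code to the data" looks like, here is a minimal sketch that pulls one day-long miniSEED file from the scedc-pds bucket onto an instance using AWSS3.jl. This is only a sketch: it assumes AWSS3.jl is installed and that an AWS configuration pointing at the us-west-2 region is already set up (the exact configuration call depends on the AWSS3.jl version).
using AWSS3
# Key follows the year/year_day layout of the bucket, described below.
key = "2016/2016_183/CISDD__HHZ___2016183.ms"
bytes = s3_get("scedc-pds", key)   # download the object from S3 into memory
write(basename(key), bytes)        # save it to local disk on the instance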
Accessing the SCEDC Dataset on AWS
This Julia notebook demonstrates how to access the Southern California Earthquake Data Center Open Data Set on AWS. The data are stored in the scedc-pds bucket in the us-west-2 region on Amazon Simple Storage Service (S3). For this tutorial, we'll use Nextjournal to connect to AWS instead of an API such as Boto3 or AWSS3.jl. Though not shown in this tutorial, using the SeisNoise.jl download client on AWS we can achieve transfer speeds of up to 250 MB/s from S3 to EC2! This is 10x faster than downloading files using a node on a cluster. 250 MB/s is equivalent to roughly 1 TB/hour/instance, so with 100 instances we could transfer the entire SCEDC catalog in about 30 minutes.
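A quick back-of-the-envelope check of that estimate, using the same round numbers quoted above:
rate_MB_per_s = 250.0                      # per-instance S3 -> EC2 throughput
TB_per_hour = rate_MB_per_s * 3600 / 1e6   # ≈ 0.9 TB/hour per instance
hours = 50.0 / (TB_per_hour * 100)         # ~50 TB catalog spread over 100 instances
# ≈ 0.56 hours, or roughly 30 minutes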
Connecting to AWS with Nextjournal and Julia
Nextjournal can mount an AWS bucket as a local read-only directory. A bucket is a storage container on S3; scedc-pds is publicly readable. Here, I've mounted the scedc-pds bucket as a directory.
Let's see what's in scedc-pds
readdir("/scedc-pds")
Data are stored in /year/year_day/{network}{station}{channel}{location}{year}{day}.ms format, where the station code is padded to 5 characters and the location code to 3 characters with underscores. Let's have a look at the first day of 2013:
readdir("/scedc-pds/2013/2013_001")
We'll load these data using the SeisIO.jl package (a lean Julia equivalent of ObsPy).
Loading data with SeisIO
Before we load anything we need to install packages to load and play with data (this may take 90 seconds or so).
using Pkg
Pkg.add("Plots")
Pkg.add(PackageSpec(name="SeisIO", rev="master"))
Pkg.add(PackageSpec(name="SeisNoise", rev="master"))
using Dates, SeisIO, SeisNoise, Plots
Now we can read a daylong file using SeisIO.jl's read_data function:
# read one day of BHZ data from station CI.WWC
S = read_data("mseed", "/scedc-pds/2013/2013_001/CIWWC__BHZ___2013001.ms")
Now that we have data, let's plot a seismogram:
starttime, endtime = start_end(S[1])                       # first and last sample times
dt = Millisecond(1 / S[1].fs * 1e3)                        # sample interval in milliseconds
t = starttime:dt:endtime + dt                              # time axis for plotting
ind = findall(x -> x < DateTime(2013, 1, 1, 0, 10, 0), t)  # keep the first ten minutes of the day
SeisIO.detrend!(S)                                         # remove linear trend
SeisIO.taper!(S)                                           # taper the ends of the trace
filtfilt!(S, fl=0.1, fh=1., np=2, rt="Bandpass")           # zero-phase bandpass, 0.1-1 Hz
Plots.plot(t[ind], S[1].x[ind], xlabel="time")
Instrument Responses
Instrument responses (stationXML) for all CI network stations are stored on AWS in the scedc-pds bucket. To access the stationXML files, head to scedc-pds/FDSNstationXML/CI:
readdir("/scedc-pds/FDSNstationXML/CI/")
Let's read the stationXML for station CI.WWC using SeisIO:
R = read_sxml("/scedc-pds/FDSNstationXML/CI/CI_WWC.xml")  # read the station metadata
R.id                                                      # list the channel ids in the file
There are 59 different channels for this one station! We'll need to select the BHZ response.
ind = findfirst(x -> occursin("BHZ", x), R.id)  # index of the first BHZ channel
R = R[ind]
Now that we have the correct channel, we can remove the instrument response using remove_response!
freqmin = 0.01  # minimum frequency in Hz for response removal
freqmax = 8.    # maximum frequency in Hz
remove_response!(S, "/scedc-pds/FDSNstationXML/CI/CI_WWC.xml", freqmin, freqmax)
Ambient Noise Cross-Correlation on AWS
Now we'll do some ambient noise cross-correlations using the SeisNoise.jl package, which is written for easy, fast cross-correlation in Julia. Here are the input parameters for the cross-correlation:
fs = 10.                     # sampling frequency in Hz
freqmin, freqmax = 0.1, 0.2  # minimum and maximum frequencies in Hz
cc_step, cc_len = 450, 1800  # correlation step and window length in seconds
maxlag = 60.                 # maximum lag time in the correlation, in seconds
smoothing_half_win = 3       # spectral smoothing half-window length
Next we'll read waveforms from two separate stations in the SCEDC dataset:
S1 = read_data("mseed", "/scedc-pds/2016/2016_183/CISDD__HHZ___2016183.ms")  # CI.SDD HHZ
S2 = read_data("mseed", "/scedc-pds/2016/2016_183/CIHOL__HHZ___2016183.ms")  # CI.HOL HHZ
Cross-correlations are computed more efficiently in the frequency domain. Let's compute the Fourier transforms of the waveforms:
# slice each station into cc_len-second windows, pre-process, and FFT
FFT1 = compute_fft(S1, freqmin, freqmax, fs, cc_step, cc_len, max_std=10.)
FFT2 = compute_fft(S2, freqmin, freqmax, fs, cc_step, cc_len, max_std=10.)
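The speed-up comes from the correlation theorem: a cross-correlation in the time domain becomes an element-wise product of spectra in the frequency domain. A minimal illustration in plain Julia is below; FFTW.jl is assumed here only for the sketch and is not needed for the SeisNoise calls in this notebook.
using FFTW
a, b = rand(1024), rand(1024)
# circular cross-correlation of a with b, computed via the frequency domain
cc = real(ifft(fft(a) .* conj(fft(b))))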
Now we can compute cross-correlations from the FFTs:
coherence!(FFT1, smoothing_half_win)  # spectral normalization (whitening) of each FFT
coherence!(FFT2, smoothing_half_win)
C = compute_cc(FFT1, FFT2, maxlag, corr_type="coherence")  # cross-correlate in the frequency domain
clean_up!(C, freqmin, freqmax)  # detrend, taper, and bandpass the correlations
abs_max!(C)                     # normalize each correlation by its absolute maximum
And plot it!
corrplot(C)