SeisIO: A Fast, Efficient Geophysical Data Architecture for the Julia Language
Joshua P. Jones, Kurama Okubo, Tim Clementsand Marine A. Denolle
Portland, Oregon, USA
Department of Earth and Planetary Sciences, Harvard University, Cambridge, MA, USA
This is a tutorial to accompany the article published in Seismological Research Letters doi: https://doi.org/10.1785/0220190295
Abstract
SeisIO for the Julia language is a new geophysical data framework that combines the intuitive syntax of a high‐level language with performance comparable to FORTRAN or C. Benchmark comparisons against recent versions of popular programs for seismic data download and analysis demonstrate significant improvements in file read speed and orders‐of‐magnitude improvements in memory overhead. Because the Julia language natively supports parallel computing with an intuitive syntax, we benchmark test parallel download and processing of multiweek segments of contiguous data from two sets of 10 broadband seismic stations, and find that SeisIO outperforms two popular Python‐based tools for data downloads. The current capabilities of SeisIO include file read support for several geophysical data formats, online data access using a variety of services, and optimized versions of several common data processing operations. Tutorial notebooks and extensive documentation are available to improve the user experience. As an accessible example of performant scientific computing for the next generation of researchers, SeisIO offers ease of use and rapid learning without sacrificing computational efficiency.
Installation
SeisIO.jl is a registered Julia package and can be installed using the Julia package manager.
using Pkg
Pkg.add("SeisIO")
Getting Started
Start SeisIO by loading the package into memory:
using SeisIO
You'll need to do this at the start of every Julia session. SeisIO uses an array-like structure called SeisChannel
for single-channel seismic data.
C = SeisChannel()
Access fields by name to view their contents directly:
C.x
You can also overwrite them by name:
C.x = rand(Float32,1024)
C.loc = GeoLoc(lat=-90.,lon=-0.0, el=9300.)
C
The SeisData
structure is the basic container that SeisIO uses. These objects behave like arrays of SeisChannel
objects: they store multiple channels of data, but they're not matrices (more like arrays of arrays). You can initialize SeisData
structures as empty containers for 0 or more channels:
S = SeisData(6)
Let's create a random example for tutorial purposes.
using SeisIO.RandSeis
S = randSeisData(6)
SeisData fields are accessed by name and index.
x1 = S.x[1]
If you access a SeisData structure by a single integer index, the output is a SeisChannel:
C_new = S[1]
If you access a SeisData
structure by a range of indices, the output is a SeisData
structure whose channel indices match your request:
S2 = S[1:3]
These commands access identical information, but the first is faster:
x1 = S.x[1] # get first index of data field :x
x2 = S[1].x # extract S[1] to a SeisChannel, then get its data field :x
x1 == x2 # true unless there are NaNs
Working with Structures
A collection of SeisChannel
objects becomes a SeisData
structure:
S = SeisData(randSeisChannel(), randSeisChannel())
You can push channels onto existing SeisData structures:
push!(S, C_new)
S
SeisData
structures can be concatenated:
append!(S, randSeisData(4))
S
The addition operator also does this:
S += randSeisChannel()
S
Thus, this command works: (note: please don't actually code like this)
S = SeisData(
randSeisData(5),
randSeisChannel(),
SeisChannel(
id="UW.SEP..EHZ",
name="Darth Exploded",
loc=GeoLoc(lat=46.1967, lon=-122.1875, el=1440.0),
t=[1 1569374150379000; 1024 0],
x=rand(1024)
)
)
Warning: Combining structures calls prune!
to remove empty channels, defined as channels with no data (empty :x
); this is one of two operations\ needed to guarantee commutativity (i.e., for any SeisData
structures S1, S2,\ S1 + S2 == S2 + S1
). Thus, in the (heinous) creation syntax below, the command SeisData(3)
, which should add three empty channels, does nothing; prune!
deletes the new channels. (To actually append empty channels, use append!
)
S1 = SeisData(
randSeisData(5),
randSeisChannel(),
SeisChannel(
id="UW.SEP..EHZ",
name="Darth Exploded",
loc=GeoLoc(lat=46.1967, lon=-122.1875, el=1440.0),
t=[1 1569374150379000; 1024 0],
x=rand(1024)
)
) + SeisData(3)
Navigating Structures
There are two easy ways to find channels of interest in a data structure. The command findid
returns the first channel number that matches the ID supplied:
i = findid("UW.SEP..EHZ", S)
If you want partial matches, use the function findchan
:
findchan("EHZ", S)
Variant Structures
SeisData
is a subtype of GphysData
. The Abstract Type GphysData
includes variant data structures, all of which have the same basic fields (:id
, :x
, etc). Other data structures will be added to GphysData
as modules develop, but they will always contain the same basic fields. For example, in the Quake submodule, the EventTraceData
structure has additional fields for location parameters like azimuth (:az
) and has a custom Type for a phase catalog (:pha
). You can convert between GphysData
subtypes with convert
, but extraneous fields will be lost, and fields that are not contained in the source Type will be initialized to default values:
using SeisIO.Quake
S_ev = randSeisData(6)
Tr = convert(EventTraceData, S_ev)
Tr
Let's add some phases to see how conversion works:
Tr.pha[1]["P"] = SeisPha(amp = 1.0e6, d = 40075.0, tt = 1344.0, unc = 1.5)
Tr.pha[2] = randPhaseCat()
S_ev2 = convert(SeisData, Tr)
S_ev == S_ev2 # true unless S_ev.x contains NaNs
...but since the phase catalog wasn't retained by the SeisData
object S_ev2
,
Tr2 = convert(EventTraceData, S_ev2)
Tr == Tr2
File I/O with SeisIO
Here we'll explain how to read and download data.
Before we do that, we'll need to grab some data from the SeisIO github page using svn
sudo apt-get update
sudo apt-get install subversion
svn export https://github.com/jpjones76/SeisIO-TestData/trunk/Tutorial/
Loading Entire Files
read_data
reads one or more entire files into memory. Use this with legacy data formats like SAC, SEG Y, and mini-SEED:
?read_data
If you know the file format that you're trying to read, pass it as the first argument to read_data
in lowercase:
S = read_data("mseed", "Tutorial/2018.224.00.00.00.000.TA.C26K..BHZ.R.mseed")
S = read_data("sac", "Tutorial/2018.224.00.00.00.000.TA.C26K..BHZ.R.SAC")
f you don't know a file's format, read_data
calls a (somewhat slower) function called guess
that can usually identify it:
S2 = read_data("Tutorial/2018.224.00.00.00.000.TA.C26K..BHZ.R.SAC")
S == S2
Reading from large volumes
Modern data formats create large volumes on disk, from which data are read in user-specified time windows. SeisIO currently supports reading from ASDF files through the function read_hdf5
:
?read_hdf5
hdf = "Tutorial/2days-40hz.h5"
id = "CI.SDD..HHZ"
ts = "2019-07-07T23:00:00"
te = "2019-07-08T01:59:59.975"
S = read_hdf5(hdf, ts, te, id = id)
FDSN-style wildcards can be used for the ID string.
idr = "C*.SDD..HH?"
S1 = read_hdf5(hdf, ts, te, id = id)
S == S1
If the id
keyword isn't used, then all channels with data matching the time query are read into memory.
S2 = read_hdf5(hdf, ts, te)
Contents of an HDF5 volume can be scanned at the "station" (default) or "trace" level with scan_hdf5
:
scan_hdf5(hdf)
scan_hdf5(hdf, level="channel")
File and Format Information
Information on files and formats can be found in a number of places, including the command-line interface.
guess("Tutorial/2018.224.00.00.00.000.TA.C26K..BHZ.R.SAC")
ls Tutorial
fname = "Tutorial/99011116541W"
g = guess("Tutorial/99011116541W")
SeisIO.formats[g[1]] # what are we looking at...?
S2 = read_data(g[1], fname)
Saving Data
SeisData
and SeisChannel
structures can be written to ASDF, SAC, or to SeisIO's native format. ASDF format is portable (most languages can read HDF5) and well- suited to large data sets. SAC has the advantage that it's almost universally readable, but only writes 32-bit float data, and each contiguous data segment creates one file. SeisIO format saves complete information and does low-level object writing with a file index; it's the fastest of the three writers.
Writing to ASDF files
ASDF is an HDF5 file format designed for seismic data. Write to ASDF with the\ command write_hdf5
.
?write_hdf5
write_hdf5( hdf_out::String, S::GphysData )
Write data in a seismic HDF5 format to file hdf_out
from structure S
.
See also: read_hdf5
The simplest ASDF write creates a new file:
hdf_out = "test.h5"
write_hdf5(hdf_out, S)
Writing to an existing ASDF volume adds the data to the Waveforms/
group, creating new Waveforms/(net.sta)/
paths as needed. Channel information is written to Waveforms/(net.sta)/StationXML
:
write_hdf5(hdf_out, S1)
Writing to SAC files
Writing to SAC creates one file for each contiguous data segment in each data channel. Although large collections of small files have been an obsolete data archival strategy since the early 2000s, SeisIO supports writing to SAC because the data format is almost universal.
writesac(S) # filenames are auto-generated. no need to specify.
writesacpz(S, "req_1.pz") # in case you need instrument responses later.
Two useful keywords control the creation of SAC files:
fname=FF
uses filename FF, rather than creating file names automatically. This keywork only works with GphysChannel objects.xy=true
writes generic x-y data to file with time as the independent variable.
writesac(S[1], fname="tmp.SAC") # writesac also works with SeisChannel objects.
Writing to SeisIO files
The SeisIO low-level file format is comprehensive and fast. A SeisIO file can contain multiple structures, and structures written to file can be any Type in SeisIO (including, for example, just an instrument response). Information in structures is fully preserved except for exotic data Types stored in the :misc
field. To read from a SeisIO file, you'll need to specify one or more object numbers, or the output will be of Type Array{Any,1}
:
wseis("req_1.seis", S)
S2 = rseis("req_1.seis")[1]
S == S2
Data Requests
Here, we'll learn to download data using SeisIO.
get_data
is the wrapper to online time-series data requests. You can use it with FDSN dataselect and IRIS timeseries functions.
?get_data
Let's try an example.
First, we'll get the current local time.
using Dates
ds = Dates.now(); ds -= (Day(1) + Millisecond(ds) + Second(ds))
s = string(ds)
Now, let's use that to request some data. From the help text, the keywords s=
and t=
accept Strings, DateTime objects, and numbers. So let's start at s
, as defined above, and end at t=600
, or 10 minutes later.
cha_str = "UW.MBW..EHZ, UW.SHW..EHZ, UW.HSR..EHZ, UW.TDH..EHZ, CC.PALM..EH?"
S = get_data("FDSN", cha_str, src="IRIS", s=s, t=600)
What each positional argument does
"FDSN"
tells get_data to use the FDSN dataselect service for our requestThe long string gives the data channels requested as a comma-separated list
What each keyword does
s=s
sets the start time tos
, the string created in the cell above.t=600
sets the termination (end) time to 600 seconds afters
.src="IRIS"
tells get_data to check the IRIS FDSN dataselect server.
Note: the last keyword isn't equivalent to setting the first positional argument to "IRIS", because IRIS runs FDSN dataselect and its own timeseries request service (plus many others).
Any sign of TDH?
findid("UW.TDH..EHZ", S)
Where can we look for data? What servers are available?
?seis_www
I bet that CalTech is happy to handle a random download request!
S2 = get_data("FDSN", "CI.SDD..BHZ", src="SCEDC", s=s, t=600, fmt="mseed", msr=true, w=true, demean=true, rr=true)
What the new keywords do:
src="SCEDC"
tellsget_data
to use the SCEDC FDSN servers.fmt="mseed"
specifies the data format for the download. (Note: mseed is actually the default, but including this keyword is useful for tutorial purposes.)w=true
write the download directly to disk, byte for byte, before any parsing happens. The file extension is always ".fmt
". The entire request is saved even if a parsing error happens -- which is rare, but possible with SEED. (Some Blockettes and data decoders are so rare that we've literally never seen them)demean=true
removes the mean of each channel after downloading.rr=true
removes the instrument response, flattening to DC.msr=true
uses the multi-stage instrument response. Most users don't need that much detail, somsr
defaults tofalse
.
# Example of single-stage response
S.resp[1]
# Example of multi-stage response
S2.resp[1]
Check logs early and often
All file I/O and data processing operations are logged to the :notes
\ fields of SeisIO data structures. For example:
S2.notes[1]
Saving requests
Remember, from above: data requests can be written directly to disk with keyword w=true
. This writes raw output to file, even if data parsing somehow fails.
In addition, SeisData and SeisChannel structures can be written to ASDF, SAC, or to SeisIO's native format, as we saw above.
wseis("req_1.seis", S)
Data request syntax is always the same
NN.SSSSS.LL.CC (net.sta.loc.cha, separated by periods) is the expected syntax for all web functions. The maximum field width in characters corresponds to the length of each field (e.g. 2 for network). Fields can’t contain whitespace.
Data requests in SeisIO all use this syntax, even though IRIS timeseries, FDSN dataselect, and SeedLink format strings differently. Request strings are converted to the appropriate syntax for the request protocol.
# these are identical requests
channels = "UW.KMO.., IU.COR.00.BHZ, CC.LON..BH?" # single String
channels = ["UW.KMO..", "IU.COR.00.BHZ", "CC.LON..BH?"] # Vector{String}
channels = ["UW" "KMO" "" ""; "IU" "COR" "00" "BHZ"; "CC" "LON" "" "BH?"] # Matrix{String}
?chanspec # where to find help on channel specification
See also: https://seisio.readthedocs.io/en/latest/src/Appendices/web_syntax.html
Multiple data requests to one structure
\ Web requests to a structure S always create a new channel in S for each channel in each request, even if a channel with the same ID exists. This is necessary to prevent TCP congestion. \
This is different from multiple file reads to one structure; file reads \ always attempt to append channels with matching IDs.
You can "flatten" structures with redundant channels by calling merge!
. To see how this works, let's append a new data request to our first one:
get_data!(S, "FDSN", cha_str, src="IRIS", s=ds+Minute(10), t=600)
S.id
With two sets of requests to the same channels, each channel ID should appear twice. Let's clean this up.
merge!(S)
Check the results:
S.id
Saving your work
All data from this tutorial can be written to file using commands from the File IO tutorial. To review:
wseis("fname.seis", S)
writes S
to low-level SeisIO native format file fname.seis
.
writesac(S)
writes S
to SAC files with auto-generated names.
write_hdf5("fname.h5", S)
writes S
to ASDF (HDF5) file fname.h5
.
Additional Examples
The examples below are also found in https://seisio.readthedocs.io/en/latest/src/Appendices/examples.html
FDSN get_data
Request the last 600 seconds of data from the IRIS FDSNWS server \ for channels CC.PALM, UW.HOOD, CC.TIMB, CC.HIYU, UW.TDH
S_fdsn = get_data("FDSN", "CC.PALM, UW.HOOD, CC.TIMB, CC.HIYU, UW.TDH", src="IRIS", s=-600, t=0)
S.id
IRIS get_data
S_iris = get_data("IRIS", ["CC.TIMB..BHE", "CC.TIMB..BHN", "CC.TIMB..BHZ", "UW.HOOD..HHE", "UW.HOOD..HHN", "UW.HOOD..HHZ"], s=-3600, t=-1800)
FDSNevt
Request waveform data for the Tohoku-Oki great earthquake, recorded by some borehole strain meters and seismometers in WA (USA), from IRIS (USA). This function is part of the Quake submodule.
using SeisIO.Quake
S_evt = FDSNevt("201103110547", "PB.B004..EH?,PB.B004..BS?,PB.B001..BS?,PB.B001..EH?")
Processing
Now, we'll explain how to process data. Let's start with the data we acquired at the end of the last tutorial:
S = (
if isfile("req_1.seis")
rseis("req_1.seis")[1]
else
ds = Dates.now(); ds -= (Day(1) + Millisecond(ds) + Second(ds))
s = string(ds)
get_data("FDSN", "UW.MBW..EHZ, UW.SHW..EHZ, UW.HSR..EHZ, UW.TDH..EHZ, CC.PALM..EH?", src="IRIS", s=s, t=600)
end
)
if isfile("req_1.seis") == false
wseis("req_1.seis", S)
end
List of Data Processing Functions
convert_seis
: differentiate and integrate seismogramsdemean
: remove meandetrend
: remove trend lineenv
: compute envelopefiltfilt
: zero-phase filternanfill
: fill NaNsresample
: Data resamplingsync
: time-synchronize a SeisData structuremerge
: merge dataungap
: remove gaps in channelsunscale
: divide out :gain from all trace data
"In-Place" vs. "Safe" Operations
Processing functions each have two versions: an "in-place" variant that ends with an exclamation mark ("!"), and a "safe" version that stores the result in a new data structure. The in-place variant overwrites data in place. This generally isn't reversible. Here's a simple set of processing operations, common for seismic data:
U = deepcopy(S) # back up, just ... in... case...
detrend!(S)
ungap!(S)
resample!(S, fs=50.0)
filtfilt!(S, rt="Lowpass", fh=10.0)
S.notes[1]
The filter uses a mix of default values and custom keyword parameters; the result applies a 4-pole Butterworth filter to each (regularly-sampled) channel of S in place.
Instrument Response
Translating instrument responses can be tricky. Let's work an example of that next. First, we create a target instrument response. Since the data in S are mostly short-period geophones, let's keep it reasonable -- a 10s rolloff frequency.
?fctoresp
fctoresp(f)
fctoresp(f, c)
Create PZResp or PZResp64 instrument response from lower corner frequency f
and damping constant c
. If no damping constant is supplies, assumes c = 1/sqrt(2)
.
See Also
PZResp, PZResp64
resp_new = fctoresp(0.1)
Now let's update the sensitivity, so that the translated instrument responses have meaningful velocity amplitudes.
resp_a0!(resp_new)
resp_new
?translate_resp
Update the first three channels of S to the new response:
translate_resp!(S, resp_new, chans=1:3)
Remove the instrument response from the rest completely, flattening to DC:
Sf = remove_resp(S, chans=4:S.n)
Synchronization
In some situations, it's useful to force all traces in a SeisData
container to start (and perhaps end) at the same time(s). Do this with the command sync
:
S = deepcopy(U)
# don't worry about memorizing these commands; they're
# only for demonstration purposes.
d0 = u2d(1.0e-6*(minimum([S.t[i][1,2] for i in 1:S.n]) - 1000000))
d0 -= Millisecond(d0)
d0 += Millisecond(5)
t0 = string(d0)
t1 = string(d0+Second(300))
Synchronized traces can still have trace lengths that differ by a sample if the start and end times align exactly with one trace's sample intervals but not another's; that's why we added 5 ms to t0 above.
sync!(S, s=t0, t=t1)
S
[length(i) for i in S.x]
...are they all equal? They should be within two samples.
For more help
Please consult the Processing chapter of the official documentation: https://seisio.readthedocs.io/en/latest/src/Processing/processing.html