Reverse Engineering an Opera

A Visual Guide to Spectral Analysis and the Fast Fourier Transform

Periodicity surrounds us. We find cyclical signals in just about any observable data, and at all scales, from the motions of trans-galactic structures to vibrating nuclei and electron waves.

Spectral analysis helps us isolate underlying oscillations in these observations. Spectral from the study of light, where the spectrum of visible colors equates to a spectrum of light wave frequencies. In our case study, we'll analyze music. Audio data is reasonably familiar, the frequency ranges are approachable (sampling audio yields managable quantities of data), and spectral analysis of audio has many real-world applications.

Let's focus on the case of isolating instruments in a clip from the overture to Mozart's Opera, the Magic Flute [from a Musopen project recording].

Magic Flute Clip.mp3
Open in new window

We'll build towards taking this audio data, and transforming it to create this spectrogram: a plot type designed for analyzing dynamic signals.

Figure: A spectrogram of the overture to Mozart's Magic Flute Opera. Note that individual instruments are identifiable, particularly near the opening of the clip, between 0-20 seconds and around 0.5-1.5 kHz.

In this plot, color intesity marks the spectral power in the recording at a given frequency, at a specific time. Even the untrained eye can detect patterns and link them to specific events in the audio. For example, the rising flutes at the opening near 1 kHz, the other woodwinds that follow around 0.5 and 1.5 kHz, and at around 41s we see the slowly growing volume and bandwidth of the entire string section leading up to the end of the clip. Throughout, even in the busier times, we see the alignment of individual notes across the entire orchestra.

1.
Defining the waveform

The term waveform can cover a lot of ground. In general, any recorded observations of a single parameter, taken at a regular cadence, may contain waveform data. We can find periodicities in such data taken in any realm. We expect to find them in the displacement of a violin string vs. its own length, the ripples in the Earth's crust due to quakes, or fluctuations in the fabric of spacetime due to gravity waves.

Where we might not readily expect to see oscillations? In the luminosity of a star, for example; however, even from our own star we see a regular cycle of sunspots. Moreover, subtle changes in luminosity are a big part of how we now detect exoplanets. Even where we do expect to find periodic signals, like in radio waves, careful analysis yields knowledge about unknown signal sources—some manmade, but some natural, such as the Jovian decametric radiation or the Cosmic Microwave Background radiation.

Thus, if periodicities exist in data, some form of spectral analysis can yield significant amounts of information: about wave frequencies, relative or absolute power of the oscillations, and given the right measurements, information about polarization or arrival direction of incoming waves.

A general, one-dimensional, monochromatic (single frequency), sinusoidal wave—whether the dimension is time, distance, or something less concrete—is fully characterized by three parameters: Amplitude AA, frequency ff, and phase ϕ\phi. These combine to yield a function of amplitude vs. time as

A(t)=Asin(2πft+ϕ)=Asin(ωt+ϕ),A(t) = A \sin\left(2\pi f t + \phi\right) = A \sin\left(\omega t+\phi\right),

where the angular frequency ω=2πf\omega = 2 \pi f. This A(t)A(t) symbolizes what we'll measure in the real world, whether it's displacement of a physical body, the field detected by a radio antenna, or the voltage induced on a wire by a microphone.

In this plot, the y-axis is frequency in kilohertz, the x-axis is time in the audio recording, in seconds, and then the color represents the spectral power in the record at a given frequency, at a specific time. Many interesting features are visible, and can be linked to specific events in the audio, such as the rising flutes at the opening near 1 kHz, the other woodwinds that follow around 0.5 and 1.5 kHz, and at around 41s we can see the slowly growing volume and bandwidth of the entire string section leading up to the end of the clip. Throughout, even in the busier times, it is often possible to see the alignment of individual notes across the entire orchestra.

One might ask how this 'spectral power' relates back to the actual loudness of the instruments, or the performing group as a whole. In the case of audio recordings, the original form of the power relates to the voltage differences generated at a microphone (or set of microphones), but after going through the myriad production stages of amplification, mixing, and output processing, it is effectively only a relative measure vs. the entirety of a given recording.

To start out on our quest to create the above spectrogram, let's build some Julia types for future utility. The Timeseries type will serve to hold sample times for our waves, with the sampling rate, start and end times, and the actual points in time. The Waveform type can store the three basic parameters discussed above in a StaticWaveform, or alternatively a function describing a more complex wave in a DynamicWaveform. Finally, the Wave type can store the combination of Waveforms and Timeseries as a MonochromaticWave, and the summation of such waves as a CompositeWave. We'll also add the convenience function plotxy() to clean up plot-making, and its methods plotxyl() and plotxyp() for line-only and marker-only plots.

Pkg.add("PlotlyJS");
using PlotlyJS
plot(args...; kwargs...) = Plot(args...; kwargs...)
PlotlyJS Setup
/
Julia
INFO: Nothing to be done
INFO: METADATA is out-of-date — you may not have the latest version of PlotlyJS
INFO: Use `Pkg.update()` to get the latest versions of your packages
# Takes either a rate (in Hz), a start time, and an end time, 
# or an explicit Array or Range, and stores all of the above.
immutable Timeseries{T<:Number,T2<:Union{Array, Range}}
  rate::T
  t_start::T
  t_end::T
  times::T2
end
Timeseries(rate, t_start, t_end) = Timeseries(rate, t_start, t_end, t_start:1/rate:t_end-1/rate)
Timeseries(times) = Timeseries(-1, times[1], times[end], times)

abstract Waveform

# Takes and stores Amplitude, frequency, phase.
immutable StaticWaveform{T<:Union{Number,Symbol}} <: Waveform
  A::T
  f::T
  Phi::T
end

# Takes a function of t that will be run against a Timeseries
immutable DynamicWaveform <: Waveform
  ft::Function
end
Waveform(A::Number, f::Number, Phi::Number) = StaticWaveform(promote(A,f,Phi)...)
Waveform(ft::Function) = DynamicWaveform(ft)

abstract Wave

# Takes a Waveform and a Timeseries, does the actual math of calculating 
# samples, and stores the samples and the original inputs.
immutable MonochromaticWave <: Wave
  samples::Array
  wf::Waveform
  ts::Timeseries
end
MonochromaticWave(wf::StaticWaveform, ts::Timeseries) = MonochromaticWave([ wf.A*cos(2*pi*wf.f*t + wf.Phi) for t in ts.times ], wf, ts)
MonochromaticWave(wf::DynamicWaveform, ts::Timeseries) = MonochromaticWave([ wf.ft(t) for t in ts.times ], wf, ts)

# Made up of multiple Waveforms
immutable CompositeWave <: Wave
  samples::Array
  wf::Array{Waveform}
  ts::Timeseries
end

# Define how a Wave is constructed, either as a monochromatic or composite,
# and either directly, or from a Waveform and Timeseries.
Wave(wf::Waveform, ts) = MonochromaticWave(wf, ts)
Wave(samples, wf::Waveform, ts) = MonochromaticWave(samples, wf, ts)
Wave(wfs::Array{Waveform}, ts) = sum([MonochromaticWave(wf, ts) for wf in wfs])
Wave(samples, wf::Array, ts) = CompositeWave(samples, wf, ts)

# Add functionality to length and +, so we can inspect and add things
import Base: +,length

length(ts::Timeseries) = length(ts.times)
length(wv::Wave) = length(wv.ts.times)

# This isn't great
function +(x::Timeseries, y::Timeseries)
  newarray = sort(vcat(x.times,y.times))
  Timeseries(-1, newarray[1], newarray[end], newarray)
end

# But this is pretty okay
function +(x::Wave, y::Wave)
  if (x.ts != y.ts) error("Waves must be based on the same timeseries to add!") end
  CompositeWave(x.samples+y.samples, [x.wf y.wf], x.ts)
end
Type Definitions
/
Julia
# Convenience function wrapping around scatter(), Layout(), and plot().
function plotxy(xd::Union{Array,Range}, yd::Union{Array,Range}, ms::Union{Array,String}; name=nothing, layoutargs...)
  if length(xd) != length(yd)
    warn("plotxy() Unequal input lengths!  This may have terrible consequences.")
  end
  layout = Layout(legend_orientation="h"; layoutargs...)
  if length(xd[1]) > 1
    if typeof(name) == Void
      lines = [ scatter(x=xd[i],y=yd[i],mode=ms) for i in 1:length(xd) ]
    else
      lines = [ scatter(x=xd[i],y=yd[i],name=name[i],mode=ms[i]) for i in 1:length(xd) ]
    end
    plot(lines,layout)
  else
    line = scatter(x=xd, y=yd, name=name, mode=ms)
    plot(line,layout)
  end
end
# Plus two even lazier methods for line-only or marker-only plots
plotxyl(xd, yd; args...) = plotxy(xd, yd, (length(xd[1]) > 1 ? repmat(["lines"],length(xd)) : "lines") ; args...)
plotxyp(xd, yd; args...) = plotxy(xd, yd, (length(xd[1]) > 1 ? repmat(["markers"],length(xd)) : "markers") ; args...)
Plot Setup
/
Julia

Using these, let's build a simple 5 Hz wave. We'll sample it at 100 Hz for one second to make sure we get a good picture of it.

ts1 = Timeseries(100, 0, 1) # 100 Hz sampling, for one second
wf1 = Waveform(2.5, 5, 0) 	# A = 2.5, f = 5, no phase shift
wv1 = Wave(wf1, ts1)

plotxy([ wv1.ts.times, wv1.ts.times ], [ wv1.samples, wv1.samples ],
  [ "lines", "markers" ], name=[ "Function","Samples"], 
  xaxis_title="Time [s]", yaxis_title="Amplitude",
  title="$(wv1.wf.f) Hz Wave"
)
5 Hz First Wave
/
Julia

Figure: A simple 5 Hz wave (blue line), sampled at 100 Hz (orange points).

Here, the blue line takes the form of the 5 Hz function itself, while the orange dots are the points we've sampled at 100 Hz. Alright, so that's a single monochromatic wave, how about we liven things up?

ts = Timeseries(100, 0, 1)
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
threewaves = [ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ]

plots = [ plotxyl(ts.times, wv.samples, yaxis_range=(-2.5,2.5)) 
  for wv in threewaves ]

p = hcat(plots...)
p.layout["height"] = 200
p.layout["showlegend"] = false
p
Three Waves
/
Julia

Figure: Three monochromatic waves with different parameters.

Three waves with different amplitudes, frequencies, and phases. To a good approximation, sound waves (and most waves) are linear, and so they follow the superposition principle. That means that to combine waves, we can simply sum their amplitudes at each sample time.

wv = sum(threewaves) 

plotxyl(wv.ts.times, wv.samples, showlegend=false,
  yaxis_title="Amplitude", xaxis_title="Time [s]", 
  title="Composite Wave")
Composite Wave
/
Julia

Figure: A composite wave which is the sum of the three waves shown above.

Now that's a lot more interesting. Given some work, one could probably figure out by-hand the base frequencies when combining three waves, but what if there are ten different frequencies? A hundred? Or, as in a more real-world application, some frequency sources aren't even close to monochromatic, but rather emit over a broad range of frequencies? Along the same lines, the specific aural characteristics of most instruments come from various harmonics and resonances, and the way these shift over time—all things we'd find quite difficult to pick out from a raw waveform.

2.
The Fast Fourier Transform

This is where spectral analysis comes in, and Fast Fourier Transforms (FFTs) are one of the oldest and most widely implemented forms of this.

We'll leave the mathematical details to another article, but the fundamental concept is that a Fourier transform takes a function of a particular variable (such as time), and transforms it into a function of the frequency domain related to that variable. For our case, we'll literally just be transforming from the time domain to the frequency domain, whereas e.g. a transform dependent on a spatial measure would yield the frequency domain of changes over spatial distances.

Aside: The mathematical viewpoint on why this is possible is that the sine and cosine functions form a complete, orthonormal basis for a large set of continuous, integrable functions (the 'square-integrable' functions). This means that any function which falls within this set can be represented by a sum of sines or cosines—though in many cases it will be an infinite sum. The Fourier transform allows us to determine the frequencies of the most powerful sinusoids that make up our signal. For a different view of sinusoids and Fourier transforms, check this article out.

Thus, while our signal is in an Amplitude vs. Time space, the Fourier transform of the signal can yield an Amplitude (or more commonly, Power) vs. Frequency space, showing us the actual spectrum of frequencies that make up our signal. A single transformation of data taken over one timeframe is commonly referred to as a power spectrum, and an array of power spectra chained into a pseudo-3D timewise plot is called a spectrogram.

Figure: Example from the Cornell Ornithology Department's Raven Sound Analysis Software, showing a waveform, its spectrogram, and a power vs. frequency slice.

Alright, but then what is a Fast Fourier Transform? Well, the original form of the Fourier transform as applied to discrete samples of a function or signal is now simply known as the Discrete Fourier Transform, or DFT. Variants on this sort of analysis have been used since Baylonian times, and formulas mathematically equivalent to modern ones have existed since the 1700s, but directly computing a DFT on an nn-point snapshot of an arbitrary signal has a computational complexity of O(n2)O(n^2). The FFT was first developed by Gauss in 1805, and then rediscovered and modernized for computing by Cooley and Tukey in 1965. The Cooley-Tukey algorithm can reduce computation time to O(nlogn)O(n \log n) for many inputs, and further advancements in FFT algorithms for specific cases can optimize it even further. Many algorithms work fastest (or exclusively) when applied to snapshots of 2n2^n samples for any integer nn, and so such array sizes are often seen in high data rate, time sensitive, or computationally bereft applications.

The code for this article is written in Julia, and Julia includes a direct interface to FFTW, "The Fastest Fourier Transform in the West", which is an open-source discrete transform library created at MIT. FFTW has been found in benchmark tests to be generally the fastest FFT library. Its speed—even competing with and surpassing libraries custom-tuned for specific hardware—comes in part from its 'planning' structure. Before running the actual transforms, the library plans to a specified thoroughness, examining the hardware it will be running on and determining optimal methods for computing transforms on the user-specified array sizes and dimensions, memory positions, and transform types, and running on a given number of compute threads.

FFTW is an extremely flexible library, but it isn't terribly intimidating at the basic level at which Julia interfaces with it: there are several different functions for different transform types and their inverse transforms. Each function takes input data, and specifications on which dimensions to transform along. There are also planned versions of the functions which take example data matrices and return a plan object which is applied to the actual data matrix with the * operator, and in-place versions of most functions, which place the output into the input array, saving some memory (though not necessarily halving memory use).

The more advanced interfaces in the full C library allow for extensive control over data structure, input and output locations, repetition of transforms, and parallel transforms via multithreading and/or distributed computing.

Because of everything FFTW takes into account when planning (including running speed tests on various architectural features such as SIMD instructions), thorough planning can often take a long time—up to several minutes if the FFTW.PATIENT or FFTW.EXHAUSTIVE flags are passed. Because the information the library learns while making its plans is often applicable to future planning, FFTW is able to save the 'wisdom' it has learned from its planning to memory and to disk, allowing for reuse of wisdom at future times. In the Julia interface, this data may be saved to disk with the undocumented FFTW.import_wisdom(wisdomfile) and FFTW.export_wisdom(wisdomfile) functions.

3.
FFT Input

Now let's look at the actual process of going from a signal to a power spectrum. To do so we'll start adding frameworks, starting with a one-shot function to make a single line spectrum.

"""
Generates frequency array
"""
function fft_freq(samples::Integer,ts::Timeseries)
  sampling_freq = ts.rate > 0 ? ts.rate : (ts.times[end]-ts.times[1])/length(ts)
  0:sampling_freq/samples:sampling_freq/2
end

"""
fft_oneshot(<Real data>; window=<window Function>)

Returns a single line spectra.  Applies a window function, if provided.
"""
function fft_oneshot{T<:Real}(data::Array{T,1}; window::Union{Function,Void}=nothing)
  len = length(data)
  
  # Check for existing wisdom
  wisdomdir = "/files/fftw_plans/"
  wisdomfile = "$(wisdomdir)rdft-1d-$(len).plan"
  try FFTW.import_wisdom(wisdomfile) end
  
  # Plan patiently
  plan = plan_rfft(similar(data), 1, flags=FFTW.PATIENT, timelimit=Inf)

  # Try to save generated wisdom
  if !isdir(wisdomdir)
    try mkdir(wisdomdir) end
  end
  try FFTW.export_wisdom(wisdomfile) end

  # Check for window
  if typeof(window) != Void
    w = [window(n,len+1) for n in 1:len]
    data = w .* data
    wcorrect = sum(w)/len
  else
    wcorrect = 1
  end

  # Run the actual FFT, normalize
  # We'll use the same array for output, to save memory.
  data = (plan * data)/len
  
  # Output square-magnitude for powers, corrected for window
  ( (abs(data.*data)).^2 )/wcorrect
end
fft_oneshot(data, ts::Timeseries; args...) = fft_oneshot(data; args...), fft_freq(length(data),ts)
fft_oneshot(wv::Wave; args...) = fft_oneshot(wv.samples; args...), fft_freq(length(wv),wv.ts)
One-Shot FFT Function
/
Julia

We'll start out by looking at the input side—here's our simple 5 Hz wave again:

3.1.
Sampling

Figure: 5 Hz wave.

As before, we'll look at the line as the fundamental signal, whether it's a microphone picking up sound waves, an antenna picking up radio waves—whatever. The points then are where we've actually sampled, in this case at 100 Hz, and since we've sampled for one second, we have 100 data points in this 1-second snapshot of the 5 Hz wave.

To the eye, it's pretty clear that the wave is very well sampled, and this leads to the question of just how many samples we need in a given time period to properly capture a waveform. This is answered by Nyquist Theory, which gives us that, for a given sampling rate SS, the maximum frequency of wave that can be accurately sampled, or Nyquist frequency, is S/2S/2.

However, waves with frequency greater than the Nyquist, known as undersampled waves, do not just disappear. The undersampled waves will show an effect called aliasing, in which they mirror down, appearing as potentially spurious signals in a frequency spectrum.

ts = Timeseries(500,0,5)
wv = sum([ Wave(Waveform(1,f,0), ts) for f in 30:30:10*30 ])

spec,freqs = fft_oneshot(wv)

p = plotxyl(freqs, spec, 
  xaxis_title="Frequency [Hz]", yaxis_title="Power");
p.layout["height"] = 200; p
Aliasing in a Spectrum
/
Julia

Figure: A line spectrum sampled at 500 Hz, showing ten input frequencies spaced 30 Hz apart. Note that at the top end, where the last two inputs—270 Hz and 300 Hz—are greater than the Nyquist frequency, the last two peaks have aliased, mirroring around and showing up at 230 Hz and 200 Hz.

Here, note that the input frequencies are evenly spaced by 30 Hz, and so when we reach the Nyquist frequency, they mirror around, showing as peaks at false frequencies starting near the Nyquist. To try to grasp why this happens, let's look at some waveforms.

Figure: Why aliasing occurs. A 7 Hz wave (orange line), sampled at 10 Hz (green points) looks exactly the same as a 3 Hz wave (blue line).

If we have the orange 7 Hz signal, but only sample it at 10 Hz (i.e. on the green points), then those samples end up looking exactly the same as what we'd see from the blue 3 Hz signal. As we saw from the spectrum above, the aliased wave frequency appears at the sampling frequency minus the real wave's frequency.

When a data acquisition system is designed for it, undersampling and aliasing can be used in various ways, such as bandpass downconversion. However, for a straight sampling system, aliased signals are a huge problem, and so hardware filters will commonly be used on the input side to remove frequencies above the Nyquist.

It's worth elaborating that the Nyquist frequency is an absolute limit, and so it applies most precisely to ideal, clean, phase-aligned signals. For real-world observations even in very good conditions, it's advisable to have a Nyquist well above—4x or more—the maximum frequency of interest, and strong filters that cut off frequencies well below the Nyquist. In cases where higher harmonics, noise, and other unknowns may be pertinent to the investigation, a Nyquist dozens of times the fundamental frequency involved would not be out of the question.

In general the Nyquist frequency is also a good measure of how well a given sampling system can reproduce a signal. Modern, consumer-grade ('CD-quality') audio is sampled at 44.1 kHz, and the average upper limit of human hearing is generally considered to be about 20 kHz, with most perception below 10 kHz. Thus we see that consumer-grade systems can do a reasonable job of reproducing audio; however, production-grade often goes further, sampling at 96 kHz or even 192 kHz in order to better capture and reproduce the audio signal as well as allow for more mixing and filtering techniques.

3.2.
Snapshots

Okay, so that's an answer to what the minimum is for how many samples we need in a given time period to capture waves at a given range of frequencies. Going in the other direction, do we get anything out of oversampling? This question leads into one of the main tradeoffs that occurs on the input side of sampling for FFTs.

No matter what our sampling rate, we can technically feed as many samples as we want to the FFT—we'll just need more space in memory to store the snapshot's input and output arrays. While we'll cover the relation more in the output section, for every two input samples we add to our snapshot, we'll get one more frequency bin in the output, and therefore more frequency resolution. Thus, larger snapshots can allow us to distinguish separate waves that are very close in frequency.

However, at a given sampling rate, we'll need to sample for a longer time period in order to get those additional samples. Thus, if we're trying to make a spectrogram, the tradeoff is between resolution on the frequency axis, and resolution on the time axis: longer snapshots will give us better frequency resolution, but our view of time will reciprocally get more coarse.

This is particularly important when you add in the subject of time-limited signals with potentially changing frequencies, and even more so when these signals do not themselves have any inherent perodicity over the course of their lifetime. This includes many manmade signals; e.g, instruments in a musical piece need not necessarily repeat any specific pattern. Other examples of such signals can be found among natural radio emissions, such as whistler waves.

Figure: Whistler wave observed at Palmer Station, Antarctica by the Stanford VLF Group.

These waves are generated in the upper atmosphere by lightning, and propagate down magnetic field lines. They have important effects in the Earth's plasma environment, including precipitating charged particles in the radiation belts. They were named because of their audible descending tone when the radio signal is directly converted to an audio signal.

So, what happens if our snapshots are so long that they end up encompassing an entire whistler wave? With zero time resolution along the period of the whistler, we'll just see a broad band of power along the entire range of the whistler's frequencies, and we won't grasp the true nature of the wave in time at all.

Of course, if make our snapshots too narrow in time, we won't be able to resolve the different frequencies that make up the whistler, either. When possible, it's often helpful to try different snapshot sizes, yielding multiple views of the frequency vs. time structure in the data.

That said, raising the sampling rate will also raise the overall resolution possible in either or both directions, but this requires additional data storage, and possibly better sampling equipment. While this isn't terribly burdensome in the audio frequency range, equipment capable of fully sampling well above the Nyquist for high-frequency radio waves (3 MHz up) can cost tens of thousands of dollars, and can easily require a gigabyte of storage for each minute of observation.

3.3.
Windowing

A complication of the FFT process lies in the fact that, while we only input a time-limited snapshot to the FFT algorithm, the transform is based on math on infinitely periodic signals. The practical effect this has is as if the waveform snapshot we feed in is repeated infinitely. So, if we take the simple 5 Hz waveform from earlier, what's actually being transformed is more like...

Figure: A depiction of the theoretical view of what's being transformed in an FFT: the input snapshot (orange line) extends infinitely in both directions (blue line, for particularly small values of infinity).

This presents a problem, because in general the endpoints of our input waveform are not going to be continuous when connected end-to-end. If we take our earlier example of a three-component waveform, with 1-second snapshots we see a significant discontinuity at the endpoints.

wv = sum(threewaves)

ts2 = Timeseries(100,-1,2)
r_wv = Wave(repmat(wv.samples,3), wv.wf, ts2)

p = plotxyl([r_wv.ts.times, wv.ts.times], [r_wv.samples, wv.samples], 
  name=["What's Transformed", "Input Snapshot"], title="3-Component Wave", 
  yaxis_title="Amplitude", xaxis_title="Time [s]")
p.layout["height"] = 300
p
Composite Wave End-Point Discontinuities
/
Julia

Figure: An example of boundary discontinuities when a snapshot (orange line) is repeated (blue line). Note the vertical jump where the orange and blue lines connect.

Reproducing discontinuities like this would add a bunch of extraneous frequencies to our spectrum, potentially ruining our analysis. This generation of spurious frequencies is known as spectral leakage.

So, what can we do about this? The canonical solution is to modify our input waveform samples in a way that preserves as much of the information we're trying to extract as possible, adds as little distortion as possible, and makes the endpoints match up when the snapshots are placed end-to-end. This is done by windowing the input so that the ends go to zero in a smooth manner, using a window function w(n)w(n). With the ends at zero, there is no discontinuity at the periodic boundaries.

For each sample FnF_n in our NN-sample input, w(n)w(n) will produce a number by which we multiply FnF_n to get our windowed input set, fnf_n:

fn=Fn×w(n)n{1N}f_n = F_n \times w(n) \quad \forall n\in \left\{1 \rightarrow N\right\}

One of the more common and simple window functions is the Hann window, which makes use of a cosine function:

w(n)=0.5(1cos(2πnN1))w(n) = 0.5\left(1-\cos\left(\frac{2\pi n}{N-1}\right)\right)
"Hann Window Function"
function window_hann{T<:Integer}(n::T, N::T)
  0.5*(1-cos(2pi*n/(N-1)))
end

ts = Timeseries(100,0,1); wf = Waveform(2.5,5,0); wv = Wave(wf,ts)

p1 = plotxyl(wv.ts.times, wv.samples; yaxis_title="Amplitude", xaxis_title="Time [s]", title="$(wv.wf.f) Hz Wave")

N = length(wv); n_set = 1:N;
w = [ window_hann(n,N+1) for n in n_set ]
p2 = plotxyl(n_set, w; yaxis_title="Multiplier", xaxis_title="Sample", title="Hann Window")

winwave = Wave(wv.samples .* w, wv.wf, wv.ts)
p3 = plotxyl(winwave.ts.times, winwave.samples; yaxis_title="Amplitude", xaxis_title="Time [s]", title="Windowed Wave")

p = [p1 p2 p3]
p.layout["showlegend"] = false
p.layout["height"] = 200

p
Hann Window
/
Julia

Figure: The shape and effect of a Hann window on an input snapshot. On the left a snapshot FF of a 5 Hz wave, in the center the function w(n)w(n) for the Hann window, and right the effect of applying the window, F×w(n)F \times w(n).

Adding a window is equivalent to impressing an additional signal onto the data (and thus the spectral components required to reproduce that signal), but as long as we remain aware of this, it can be controlled far more effectively than the boundary-discontinuity noise of spectral leakage.

There are a wide variety of window functions, with wider peaks, flat regions, or different roll-offs towards zero. Each one is developed to have particular characteristics affecting the signal, with different noise levels, bandwidth, and frequency resolution. There are even some specific cases where the 'rectangular' or 'boxcar' window (equivalent in most cases to applying no windowing function at all) has the most desirable characteristics.

Let's look at five different windows on two signals: first, a single 5 Hz sine wave, and second, a 3-wave composite signal, the most powerful signal at 23.7 Hz, a signal with 1/3rd of the amplitude at 8.92 Hz, and a weak signal at 1/15th the amplitude, at 5 Hz. We'll sample at 1 kHz for pretty good frequency resolution, and zoom in on the 0-50 Hz range.

Figure: A comparison between five window functions. In the columns, left to right, no window (aka rectangular or boxcar window), the Hann window, the Blackman-Harris window, a Flat-top window, and a Planck-taper window. On the top row, the shape of the window function w(n)w(n) itself, middle line spectra of a windowed 5 Hz wave, and on the bottom line spectra of three waves of various parameters. Note for each spectrum the distance between peaks and the noise floor, and the sharpness or spreading of the peaks.

There's a lot going on above. One of the most important things to note is that the spectra are all in a logarithmic scale with units of decibels (dB), and shifted so that the highest peak is at 0 dB. The top row of spectra (for the 5 Hz wave) are all set to the same 800 dB y-axis range, while the lower row are individually set to show the most important data. From just this relatively simple comparison, we can see a clear tradeoff exists.

While all four of the cosine-based windows have very low noise floors, approximately 600 dB below peak for the single, pure sine wave, the rectangular window is clearly the best, with a narrow, sharp peak. The more complex the construction of a cosine-based window, the more the peak spreads out, and the Planck-taper window is just shy of nonsense.

However, when we add in just two other waves, the rectangular window becomes comparatively bad due to the spectral leakage: the spreading from the peaks is huge, and the noise floor rises to more like 50 dB off-peak, almost entirely hiding the weak 5 Hz peak, which only shows up as a tiny blip above the spread. For our signal, the best window appears to be the Hann window, which shows all three peaks well, with significantly less spread than the rectangle.

Aside: Is the noise and spreading of peaks entirely due to windowing? No, the discretization/sampling process itself introduces noise and artifacts. One way to look at it is that each frequency bin covers a range of frequencies dependent on the snapshot size and sampling rate, and it's fairly unlikely that a given fundamental wave in our signal will fall directly at the center of a bin. Effectively, each frequency in a bin (or sample in time) is a little fuzzy in frequency/time, and this introduces noise.

From this small test comparison we can see at least a hint towards the need for case-by-case consideration of various window functions, depending on factors such as the signal source, probable peak separation, dynamic range, and noise levels. Particularly careful consideration is necessary in cases where the transformed data is what is stored, such as systems with limited storage (remote collection) or bandwidth (space probes) that can only store and/or transmit a selection of frequencies. In such systems, you only have one chance to get it right.

A final complication of windowing is that the window can affect the absolute powers in the result. This may be fairly obvious simply by noting that the the sum over the amplitudes in a Hann window—c=n(w(n))c = \sum_n\left(w(n)\right)—comes to 0.5, i.e. half of the amplitude in a waveform is going to be damped by the Hann window. A simple and oft-used fix for this is to simply globally correct the resultant powers by dividing out cc. This is a useful rough approximation, but is only exact for a pure-noise signal. In general, a window function will have different effects over the range of the spectrum, and the distortion will depend upon the input signal.

4.
FFT Output

The raw output of an FFT is not specifically what we're looking for here, where we'll be concentrating on spectral power. To get to power, we first need to look at what the Fourier transform is actually producing.

4.1.
i\mathcal{i}

Briefly, both the input and output of the Fourier transform ideally contain all of the information about the wave—amplitude, frequency, and phase—in the form of a complex numberX=x+iyX=x+iy. But wait, we didn't feed it a complex number—we only had real samples, right? This is true, and the simplest way to think about it is that this means we need two samples on the input side to yield a single complex number on the output side. There are ways of sampling a signal that yield complex data, but for real samples such as ours the output of a standard DFT/FFT will be 'mirrored', i.e. one half of the output will be complex numbers arrayed by frequency, and the other half will be the same numbers, in reverse order.

The specific FFTW function we're using here, rfft(), knows that we're inputing real numbers, and so it returns the correct-order half of the output (plus a 0-frequency offset bin), and we don't have to deal with discarding half of the outputs. This also saves some processing time on the transform.

The full complex-number output is useful in various ways. Taking tan1(Fim/Fre)\tan^{-1}\left(F_{\textrm{im}}/F_{\textrm{re}}\right) will yield the phase angle of each frequency component. With multiple perpendicular sampling antennas, the complex components can be mixed in various ways to find things like wave polarization and direction of arrival. For our purposes, let's look at the details of turning our complex numbers into powers.

4.2.
Power Spectra

To get the power PP that we're interested in, we'll have to go through a number of steps to process a given output datum XX.

  • Textbook definitions of the transform can vary, but FFTW returns unnormalized output. What this means is that, to get accurate output power, we must first divide our output by the length of our FFT snapshot, but then multiply by 2 because our input is real.
  • We'll use the general correction for our window, i.e. divide out the total gain found by summing over the window—for the Hann window this means dividing out 0.5.
  • To get an amplitude, we then take the magnitude of the complex number XX, i.e. the absolute value of the conjugate transpose times the original, XX|\overline{X}X|.
  • Finally, squaring the magnitude gives us a power.

Putting it all together yields

P=X2N2X2N22=XNXN2=XXN22.P = \left|\frac{\overline{X}}{2\frac{N}{2}} \frac{X}{2\frac{N}{2}}\right|^2 = \left|\frac{\overline{X}}{N}\frac{X}{N}\right|^2 = \left|\frac{\overline{X} X}{N^2}\right|^2.

4.3.
Frequency Bins & Snapshot Times

The last pieces we need to get some nicely formatted output are to associate our output bins to frequencies, and make sure our times are aligned correctly.

We already know that, for an FFT snapshot of size NN, our output will have N/2N/2 bins. The only other thing we need is the range of frequencies we're dealing with. This range has its max at the Nyquist frequency, which at our sampling frequency SS will just be S/2S/2. Thus, we're spreading a range of 00 to S/2S/2 Hz over N/2N/2 bins, meaning each bin will have an F/NF/N span, and so our frequencies are the range 0:F/N:F/2.

The times assigned to each of a spectrogram's snapshots are easier, but contain a choice. For each sample nn in our total data set, we have a tnt_n from the set of all times T\mathcal{T}. With NN samples per snapshot, to get our times we just have to go through T\mathcal{T} with a step size of NN. The choice left to the user is where within each step you take the time for a given snapshot: from the beginning, middle, or end. For the following we'll use the beginning time, and so snapshot_times = [ T[i] for i in 1:N:length(T) ]

5.
Spectra

With our inputs sorted and our outputs processed, we can look at our frequency-domain results. Let's take a look at our original 3-component signal, sampled at 100 Hz. From here on we'll use a Hann window.

ts = Timeseries(100, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])
wp = plotxyl(wv.ts.times,wv.samples; name="Signal", xaxis_range=[0,1], xaxis_title="Time [s]", yaxis_title="Amplitude")

spec,freqs = fft_oneshot(wv, window=window_hann)
sp = plotxyl(freqs, spec; name="Power Spectrum", xaxis_range=[0,10], xaxis_title="Frequency [Hz]", yaxis_title="Power")

p = [wp sp]
p.layout["showlegend"] = false
p.layout["height"] = 300

p
Composite Wave Power Spectrum
/
Julia

Figure: Power spectrum of a 3-frequency composite wave.

We can see that this clean signal—with a Nyquist far above any of the input frequencies and a long snapshot time—comes out pretty nicely, although it's tough to see the lowest-amplitude wave. What if we view it scaled logarithmically? The standard is to take the base-10 logarithm of a given power (bels), and then multiply by ten (decibels, or dB).

ts = Timeseries(100, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])
wp = plotxyl(wv.ts.times,wv.samples; name="Signal", xaxis_range=[0,1], xaxis_title="Time [s]", yaxis_title="Amplitude")
spec,freqs = fft_oneshot(wv, window=window_hann)
spec = 10*log10(spec)
sp = plotxyl(freqs, spec; name="Power Spectrum", xaxis_range=[0,10], yaxis_range=[-200,0], xaxis_title="Frequency [Hz]", yaxis_title="Power [dB]")

p = [wp sp]
p.layout["showlegend"] = false
p.layout["height"] = 300

p
Decibels
/
Julia

Figure: The same power spectrum as above, but with the y-axis in the logarithmic scale of decibels (10log1010 \log_{10}).

This allows one to view even very powerful peaks while still keeping the noise floor and any lower-powered peaks in view, and is particularly helpful when viewing spectrograms rather than single line spectra, as it can fit a lot of range into a given color scale.

What if we add in some noise? We'll add in a 'wave' made up of some Gaussian noise, and bump up the sampling rate a bit to make sure we're seeing clearly.

Figure: Spectrum of the same signal, but with light, random Gaussian noise added.

We can see there's still something there in the input, though it's having a bad time. The spectrum is in pretty good shape, but let's increase the noise amplitude by about a factor of 5.

Figure: Spectrum with a much larger noise amplitude.

Now the signal is a real mess, and even in the power spectrum it's getting tough to tell signal from noise, but if we go back to a linear plot...

#using the wave from the previous cell
wp = plotxyl(wv.ts.times,wv.samples; name="Signal", xaxis_range=[0,1], xaxis_title="Time [s]", yaxis_title="Amplitude")
spec,freqs = fft_oneshot(wv, window=window_hann)
sp = plotxyl(freqs, spec; name="Power Spectrum", xaxis_range=[0,10], xaxis_title="Frequency [Hz]", yaxis_title="Power")

p = [wp sp]
p.layout["showlegend"] = false
p.layout["height"] = 300

p
Noisier Linear Spectrum
/
Julia

Figure: Power spectrum with the same noise as above, but with the y-axis on a linear scale.

Now that's the clearer one, though there can easily be spurious noise peaks in the data as well, hiding or distorting the low-powered signals. Either view can be useful in some cases, and misleading in others. To wrap things up, let's look at a more complicated signal, with 50 random component waves and a bit of noise.

Figure: Analysis of a signal with 50 random waves and some significant background noise.

Comparing the Linear and Log power spectra in the middle row allows us to easily discern which peaks are the largest, while also seeing the total group of frequencies which have a strong signal above the noise level. In the bottom row, the Power or dB range have been divided into 50 levels, and what's shown is how many bins are above a given level.

6.
Spectrograms

Now that we've looked at single line spectra, let's introduce a bit of dynamism and examine some spectrograms. As previously mentioned, a spectrogram is a pseudo-3D plot of spectral power vs. both frequency and time, showing the power at a set of frequencies over time. Of course, for a constant signal this isn't much more useful than a single line spectrum, but if it's an unknown signal the spectrogram can help you learn whether it's constant in the first place.

First, we'll build a framework for spectrograms like we had for line spectra, but more involved: a function to perform the FFTs, another to snip out specific frequency and/or time ranges, and a function to display a spectrogram, along with a set of custom color maps.

data_colormaps.csv
Open in new window
"""
fft_specgram(data, len, [window=wfunc])
 
Computes repeated 1D FFTs on `data`, using snapshots of size `len`.  If `len` does not evenly divide `length(data)`, the last snapshot will be zero-padded.  If `window` is provided, this function will be applied to each snapshot before the run.
"""
function fft_specgram{T<:Real}(data::Array{T}, len::Integer; window::Function=((x,y) -> 1))
  datasize = length(data)
  # if there aren't enough samples in the input to make an 
  # integer number of snapshots, fill with zeroes
  (dim2, rem) = fldmod(datasize,len)
  if rem > 0
    data = vcat(data,zeros(T,len-rem))
    dim2 += 1
  end
  data = reshape(data, len, dim2)

  w = T[window(n,len+1) for n in 1:len]
  if (w != ones(len))
    data = mapslices(X -> w.*X, data, 1)
    wcorrect = sum(w)/len
  else
    wcorrect = 1
  end

  sizestring = join([string(n) for n in size(data)],"x")
  wisdomdir = "/files/fftw_plans/"
  wisdomfile = "$(wisdomdir)rdft-1d-$(sizestring).plan"

  if isfile(wisdomfile)
    try FFTW.import_wisdom(wisdomfile) catch e print("Wisdom import error: $(e).") end
  end
  
  plan = plan_rfft(similar(data), 1, flags=FFTW.PATIENT, timelimit=Inf)
  
  if !isdir(wisdomdir)
    try mkdir(wisdomdir) catch e print("Wisdir create error: $(e).") end
  end
  try FFTW.export_wisdom(wisdomfile) catch e print("Wisdom export error: $(e).") end

  data = (plan * data)/len
  
  data = (abs(data.*data)).^2
  data /= wcorrect

  data.' # transpose to get the right order for Plotly freq vs. time specgrams
end
fft_specgram(data, len, ts::Timeseries; args...) = fft_specgram(data, len; args...), fft_freq(len,ts), ts.times[1:len:end]
fft_specgram(wv::Wave, len; args...) = fft_specgram(wv.samples, len; args...), fft_freq(len,wv.ts), wv.ts.times[1:len:end]

# Snip function
function speclimit(spec, freqs, times; fl="all", tl="all")
  f_ind = fl == "all" ? (1:length(freqs)) : (findfirst((x) -> x>=fl[1],freqs):findlast((x) -> x<=fl[2],freqs))
  t_ind = tl == "all" ? (1:length(times)) : (findfirst((x) -> x>=tl[1],times):findlast((x) -> x<=tl[2],times))
  spec[t_ind,f_ind], freqs[f_ind], times[t_ind]
end

#
# Load custom colormaps
#
using DataFrames

data_cmaps = readtable(data_colormaps.csv,separator=',',header=true,quotemark=['|'])
cmaps = Dict()
for name in names(data_cmaps)
  cmaps[name] = [eval(parse(line)) for line in data_cmaps[name]]
end

"""
specgram(z,x,y; z_range, x_range, y_range, colorscale, title, x_label, y_label, reverse_y)

Convenience function for making Plotly spectrogram plots.
"""
function specgram(zdata, xdata=[], ydata=[];
	z_range::Union{Array,String}="auto", colormap::Array=cmaps[:greyscale], 
  kwargs...)
  
  # heatmap() shows a grid of colored cells, exactly what we need
	data = heatmap(z=zdata, x=xdata, y=ydata, 
	  colorbar=Dict(:title => "dB"), colorscale=colormap, 
	  dx = 1, dy = 1, name="dB", 
	  x0 = 0, y0 = 0, 
	  zauto = (z_range == "auto" ? true : false), 
	  zmax = (z_range == "auto" ? nothing : z_range[1]), 
    zmin = (z_range == "auto" ? nothing : z_range[2])
	)
	layout = Layout(
    xaxis_showline=1,xaxis_linewidth=1,xaxis_mirror=true, 
    yaxis_showline=1,yaxis_linewidth=1,yaxis_mirror=true, 
    hovermode="x"; kwargs...)
    
  plot(data, layout)
end
Spectrogram Functions
/
Julia

Below, the 'unknown' signal appears to be a frequency sweep in the form of a triangle wave.

"""
t_wave(min,max,period,time)

Triangle wave function
"""
function sweep_triangle(center,sweep,period,time)
  t = time % period
  if t<period/2
    2pi*t*(center+sweep*(2t/period-1))
  else
    2pi*t*(center+sweep*(3-2t/period))
  end
end

ts = Timeseries(200,0,10)
wf = Waveform((t) -> 2.5*cos(sweep_triangle(50,10,5,t)))
wv = Wave(wf,ts)
spec,freqs,times = fft_specgram(wv,64,window=window_hann)
spec = 10*log10(spec)
p = specgram(spec,times,freqs)

p.layout["height"] = 400
p.layout["showlegend"] = false

p
Triangle-Shaped Frequency Sweep
/
Julia

Figure: Spectrogram of a triangularly sweeping signal. Note that the final spectrum is washed out due the final input snapshot being zero-filled with enough samples to complete it.

The advantage of a spectrogram over other such representations (such as waterfall plots) is that the power dimension is represented via color and/or luminosity differences. This 2D representation of 3D data can allow for easier viewing and interpretation, and is optimal for print media; however, if care is not taken, it can lead to misinterpretation of relative or absolute peak powers. In cases where a proper reading is crucial, it can be beneficial to compare both linear and decibel spectrograms, and to view individual line spectra at particularly complex times—or, conversely, slices showing the power at specific frequencies, over time.

To finish up, we'll go back to the clip from Mozart's opera.

We'll load higher-fidelity, single-channel audio from a .wav file. To get a rough idea of the audio amplitude over time, we can look at the average wave power every 50 samples.

Magic Flute Clip.wav
Open in new window
using WAV

data, fs, nbits = wavread(Magic Flute Clip.wav,format="native");
data = convert(Array{Float32,2},data)
ttime = length(data)/fs
print("Samples: ",length(data),", Frequency: ",fs,", Bit Depth: ",nbits,", Total Time: ",ttime,".")

gc() # garbage collection to keep memory use down

# decimated average
dnum = 100 # factor by which to decimate
ddata = [ sum(data[i-dnum:i+(dnum-1)])/(2*dnum) for i in dnum+1:(2*dnum):length(data)-dnum ]; gc()
ndata = length(ddata)
dtimes = 0:ttime/ndata:ttime-ttime/ndata # generate time array

gc()

plotxyl(dtimes,ddata,title="Magic Flute Signal",showlegend=false,
  yaxis_title="Sound Amplitude",yaxis_showticklabels=false,
  xaxis_title="Time [s]")
Magic Flute Signal
/
Julia
Samples: 3073706, Frequency: 48000.0, Bit Depth: 32, Total Time: 64.035545.

Figure: A 50x averaged amplitude of the Magic Flute audio data.

From the output line we can see that the audio is slightly above CD quality, with a 48 kHz sample rate. We'll have to use a different system to display this much data in a spectrogram. To generate the spectrogram as an image, we'll use heatmap() from Plots.jl with the GR backend...

# Plots.jl with GR: less interactivity (none in our case), but it 
# has no problem generating a heatmap image file with over a million cells
Plots.gr()

smlfont=Plots.font("Open Sans", 12)
midfont=Plots.font("Open Sans", 14)
bigfont=Plots.font("Open Sans", 18)
Plots.gr(size=(850,480),fillcolor=:inferno,
  titlefont=bigfont,tickfont=smlfont,guidefont=midfont,
)
Plots/GR Setup
/
Julia
ts = Timeseries(promote(fs,0,ttime)...)
spec,freqs,times = fft_specgram(data, 4096, ts; window=window_hann)
lspec,lfreqs,ltimes = speclimit(spec,freqs,times,fl=[0,10000]);
spec = 0; gc() # more memory cleanup
lspec = 10*log10(lspec)
clim = [ maximum(lspec)-110, maximum(lspec)-10 ]
gc()
Magic Flute FFTs
/
Julia
Plots.heatmap(ltimes,lfreqs/1e3,lspec.',
	xticks=ceil(ltimes[1]):5:floor(ltimes[end]),
  yticks=(ceil(lfreqs[1]):floor(lfreqs[end]))[2:end],
  clims=(clim...),
  title="Magic Flute Spectrogram", 
  xlabel="Time [s]", ylabel="Frequency [KHz]"
)
gc()
Large Spectrogram Generation
/
Julia

Here we can really begin to see the power of this form of analysis. While the resolution is limited for 48 kHz sampling, we can nevertheless see the three ascending patterns of the flute near the opening of the clip at around 1 kHz, as well as later descending patterns at slightly lower frequency ranges, and around 8s we see the oboe come in around 1.5 kHz. We can also see the ability of a logarithmic scale to encompass large changes in volume between the silent segments and the loud ones.

If we zoom in a bit on the frequency axis, we see that we're hitting the limitations of the sampling rate.

ts = Timeseries(promote(fs,0,ttime)...)
spec,freqs,times = fft_specgram(data, 4096, ts; window=window_hann)
lspec,lfreqs,ltimes = speclimit(spec,freqs,times,fl=[0,4000]);
spec = 0; gc() # more memory cleanup
lspec = 10*log10(lspec)
clim = [ maximum(lspec)-110, maximum(lspec)-10 ]

Plots.heatmap(ltimes,lfreqs/1e3,lspec.', 
	xticks=ceil(ltimes[1]):5:floor(ltimes[end]), 
  yticks=ceil(lfreqs[1]):0.5:floor(lfreqs[end]),
  fillcolor=:inferno, clims=(clim...), 
  title="Magic Flute Spectrogram, 0-4 kHz", 
  xlabel="Time [s]", ylabel="Frequency [KHz]"
);
Zoomed Spectrogram
/
Julia

Figure: A zoom on the lower frequency region of the above spectrogram. Note that while we can detect the presence of various instruments, both the frequency and time resolution are a bit blurry at this zoom level.

We can clean up the fine details of the spectrogram a little by using a simple—though computationally expensive—technique called overlapping. Here, rather than having a series of consecutive snapshots such that the first sample of snapshot S+1S+1 is the sample following the end of snapshot SS, we treat the snapshots as a sliding window. Thus, if we have an overlap of 0.5, we gain additional power spectra at half timesteps compared to a non-overlapped spectrogram. With four overlaps, we shift the snapshot by 25% of NN for each fractional timestep.

N = 1024; n_set = 1:N;
w = [ window_hann(n,N+1) for n in n_set ]
plotxyl( [ n_set, n_set+0.25N, n_set+0.50N, n_set+0.75N, n_set+N ],
    [ w, w, w, w, w ],
    name=[ "Snapshot S", "Overlap S1", "Overlap S2", "Overlap S3", "Snapshot (S+1)"], 
    xaxis_title="Samples", title="Overlapped Snapshot Windows")
Overlap
/
Julia

Figure: A depiction of the window functions of overlapping snapshots. In a spectrogram without overlap, only th leftmost (blue) and rightmost (purple) windows would be used. Note that the overlap windows analyze the data which would be close to zero in the non-overlapped case.

Here, without overlapping only the leftmost blue and rightmost purple snapshots would be used. The data near the boundaries then has minimal significance in our spectrogram, and the time resolution is in a fixed reciprocal relation with frequency resolution, as described above. So we see an additional theoretical benefit of this technique is that the overlap will more fully examine data that is multiplied towards zero near the boundaries of a window function.

One might think that there's nothing stopping us from only sliding the snapshot by one sample each time, achieving maximum overlap. Technically this is correct, but this must be weighed against the fact that each additional fractional timestep increases the number of FFTs required—and the amount of data that must be stored and plotted—by 100%.

Moreover, while in practice overlapping can allow us to increase our frequency resolution while maintaining a good time resolution, in a spectrogram this can quickly reach a limit of utility: highly monochromatic sources will become so vertically narrow on the plot that they can be hard to see, and the larger snapshots will still weaken our view of things in the time domain, regardless of overlap.

For the finest of detail work, it is still generally most effective to find a good balance of frequency resolution, time resolution, and computational demands when crafting a spectrogram (with consideration to the added flexibility that overlapping yields), and then view single line spectra at times of interest—and remember, overlapping can't help us with a spectrum at a single time.

Usually a modest number of overlaps is required to significantly increase a spectrogram's utility. Below, we'll increase our frequency resolution by a factor of 2, to 8192, and then overlap by a factor of 16, meaning each line spectrum will represent a shift of 512 samples.

snaplen = 8192
olpieces = 16; olratio = 1/olpieces; # 4 overlaps per window
ollen = convert(Integer,ceil(snaplen/olpieces))

#= 
  Rather than make a single input matrix with the data properly shifted for 
  overlap in each snapshot, we'll just do multiple runs of FFTs with the 
  data shifted by (ollen) for each run.
  The additional overhead is negligible for our size of matrix.
=#
lspecs = []
for i in 1:olpieces
	if i > 1
		data[1:ollen] = 0; data = circshift(data,ollen);
	end
	spec,freqs,times = fft_specgram(data, snaplen, ts; window=window_hann)
	lspec,lfreqs,ltimes = speclimit(spec,freqs,times, fl=[0,4000])
	unshift!(lspecs,lspec)
end

dt = diff(ltimes); push!(dt,dt[end]); # extrapolate final dt
# for each time, tack on (olpieces-1) fractional times
oltimes = [ time for (i,t0) in enumerate(ltimes[1:end]) for time in t0:olratio*dt[i]:t0+dt[i]-olratio*dt[i] ]

gc()
# rearrange from array of shifted matrices to one large matrix
spec = hcat([ spectrum[i,:] for i in 1:size(lspec,1) for spectrum in lspecs ]...)

spec = 10*log10(spec)
clim = [ maximum(spec)-120, maximum(spec)-0 ]
Overlapped FFTs
/
Julia
Plots.heatmap(oltimes,lfreqs/1e3,spec,  
	xticks=ceil(oltimes[1]):5:floor(oltimes[end]), 
  yticks=ceil(lfreqs[1]):0.5:floor(lfreqs[end]),
	fillcolor=:inferno, clims=(clim...), 
  title="Magic Flute Spectrogram", 
  xlabel="Time [s]", ylabel="Frequency [KHz]"
);
Overlapped Spectrogram
/
Julia

Figure: The return of our first figure, now that we know what goes into making it.

The overlapping particularly clears up the low end of the spectrum: the overlapping allows us to increase the snapshot size, narrowing down the vertical size of the signals of individual instruments, while the still slightly improving the time resolution. Comparing to the non-overlapped figure, it is now clear that around 10 seconds, we can see the alternating quick note successions of the bassoon and clarinet around 0.5 kHz.

While the most sonically crowded times are still difficult to analyze without better sampling, an experienced composer could probably determine the total number of instruments at each time by examining individual line spectra, and possibly even reconstruct a reasonable form of the clip from this data.

7.
Conclusion

Hopefully this article has served as a convincing presentation of the power and utility of spectral analysis. These and related techniques are pervasive in the modern world: frequency-domain data is used as a part of data compression, machine learning, medicine, and behavioral and phenomenological prediction. While we observe the world in the time domain, it is often the promininent features in the frequency domain which make up what is most interesting in our observations.

While we have only touched on some of the simplest uses of FFTs, and there are now other methods which can be more effective for some applications (such as wavelet analysis), in their fullest form FFTs remain the most accessible, widely implemented, and computationally thrifty tools available for spectral analysis.

If you'd like to further explore some of the ideas we've presented:

  • For relating component waves back to the sinusoidal functions' roots in trigonometry and circles in the complex plane, see the fancy animations here.
  • The Mathworld Fast Fourier Transform page makes a good launching point to study FFT algorithms themselves.
  • For deeper coverage of the mathematics of the Fourier transform and Fourier series, see this lecture from here, or this, from here.
  • The Scientist and Engineer's Guide to Digital Signal Processing textbook is available free online, and covers everything from the basics of data acquisition and fundamentals of spectral analysis to myriad applications.
  • NASA's educational materials include a brief introduction to how light spectra are used in astronomy and cosmology.