Reverse Engineering an Opera

Periodicity surrounds us. We find cyclical signals in just about any observable data, and at all scales, from the motions of trans-galactic structures to vibrating nuclei and electron waves.

Spectral analysis helps us isolate underlying oscillations in these observations. Spectral from the study of light, where the spectrum of visible colors equates to a spectrum of light wave frequencies. In our case study, we'll analyze music.

Audio data is reasonably familiar, the frequency ranges are approachable (sampling audio yields managable quantities of data), and spectral analysis of audio has many real-world applications.

Let's focus on the case of isolating instruments in a clip from the overture to Mozart's opera, the Magic Flute [from a Musopen project recording].

Magic Flute Clip.mp3

We'll build towards taking this audio data, and transforming it to create this spectrogram: a plot type designed for analyzing dynamic signals.

Overlapped Spectrogram

Figure: A spectrogram of the overture to Mozart's Magic Flute Opera. Note that individual instruments are identifiable, particularly near the opening of the clip, between 0-20 seconds and around 0.5-1.5 kHz.

In this plot, the y-axis is frequency in kilohertz, the x-axis is time in the audio recording, in seconds, and color intesity marks the spectral power in the recording at a given frequency, at a specific time. Even the untrained eye can detect patterns and link them to specific events in the audio. For example, the rising flutes at the opening near 1 kHz, the other woodwinds that follow around 0.5 and 1.5 kHz, and at around 41s we see the slowly growing volume and bandwidth of the entire string section leading up to the end of the clip. Throughout, even in the busier times, we see the alignment of individual notes across the entire orchestra.

One might ask how this 'spectral power' relates back to the actual loudness of the instruments, or the performing group as a whole. In the case of audio recordings, the original form of the power relates to the voltage differences generated at a microphone (or set of microphones), but after going through the myriad production stages of amplification, mixing, and output processing, it is effectively only a relative measure vs. the entirety of a given recording.

1. Defining the waveform

The term waveform can cover a lot of ground. In general, any recorded observations of a single parameter, taken at a regular cadence, may contain waveform data. We can find periodicities in such data taken in any realm. We expect to find them in the displacement of a violin string vs. its own length, the ripples in the Earth's crust due to quakes, or fluctuations in the fabric of spacetime due to gravity waves.

Where might we not readily expect to see oscillations? In the luminosity of a star, for example; however, even from our own star we see a regular cycle of sunspots. Moreover, subtle changes in luminosity are a big part of how we now detect exoplanets. Even where we do expect to find periodic signals, like in radio waves, careful analysis yields knowledge about unknown signal sources—some manmade, but some natural, such as the Jovian decametric radiation or the Cosmic Microwave Background radiation.

Thus, if periodicities exist in data, some form of spectral analysis can yield significant amounts of information: about wave frequencies, relative or absolute power of the oscillations, and given the right measurements, information about the polarization or arrival direction of incoming waves.

A general, one-dimensional, monochromatic (single frequency), sinusoidal wave—whether the dimension is time, distance, or something less concrete—is fully characterized by three parameters: amplitude , frequency , and phase . These combine to yield a function of amplitude vs. time as

A(t)=Asin(ϕ+2πft)=Asin(ϕ+ωt),\begin{array}{rl} A(t) & = A \sin\left(\phi + 2\pi f t\right) \\ & = A \sin\left(\phi + \omega t\right), \end{array}

where the angular frequency . This symbolizes what we'll measure in the real world, whether it's displacement of a physical body, the field detected by a radio antenna, or the voltage induced on a wire by a microphone.

To start out on our quest to create the above spectrogram, let's build some Julia types for future utility. The Timeseries type will serve to hold sample times for our waves, with the sampling rate, start and end times, and the actual points in time.

# import length and +, so we can overload them for our new types
import Base: length, +

# Takes either a rate (in Hz), a start time, and an end time, 
# or an explicit Array or Range, and stores all of the above.
struct Timeseries{T<:Number,T2<:Union{Array, AbstractRange}}
Timeseries(rate, t_start, t_end) = Timeseries(rate, t_start, t_end, t_start:1/rate:t_end-1/rate)
Timeseries(times) = Timeseries(-1, times[1], times[end], times)

length(ts::Timeseries) = length(ts.times)

# This isn't great, but we'll define + as ordered concatenation
function +(x::Timeseries, y::Timeseries)
  newarray = sort(vcat(x.times,y.times))
  Timeseries(-1, newarray[1], newarray[end], newarray)

The Waveform type can store the three basic parameters discussed above in a StaticWaveform, or alternatively a function describing a more complex wave in a DynamicWaveform.

abstract type Waveform end

# Takes and stores Amplitude, frequency, phase.
struct StaticWaveform{T<:Union{Number,Symbol}} <: Waveform

# Takes a function of t that will be run against a Timeseries
struct DynamicWaveform <: Waveform
Waveform(A::Number, f::Number, Phi::Number) = StaticWaveform(promote(A,f,Phi)...)
Waveform(ft::Function) = DynamicWaveform(ft);

Finally, the Wave type can store the combination of Waveforms and Timeseries as a MonoWave—a wave built off one waveform. We'll build a CompositeWave type later.

abstract type Wave end

# Takes a Waveform and a Timeseries, does the actual math of calculating 
# samples, and stores the samples and the original inputs.
struct MonoWave <: Wave
MonoWave(wf::StaticWaveform, ts::Timeseries) = MonoWave([ wf.A*cos(2*pi*wf.f*t + wf.Phi) for t in ts.times ], wf, ts)
MonoWave(wf::DynamicWaveform, ts::Timeseries) = MonoWave([ wf.ft(t) for t in ts.times ], wf, ts)

# Define how a Wave is constructed, either directly 
# or from a Waveform and Timeseries.
Wave(wf::Waveform, ts) = MonoWave(wf, ts)
Wave(samples, wf::Waveform, ts) = MonoWave(samples, wf, ts)

length(wv::Wave) = length(wv.ts.times);

We'll set up some defaults for our plots. We'll use the GR backend for static plots. Also, set up methods so we can plot() Waves directly.

using Plots
import Plots.PlotMeasures: px,pt

nj_width = 668  # current maximum unscaled plot width
nj_height = 413 # golden ratio?  ¯\_(ツ)_/¯

# Open Sans Font Size shorthand function
osfs(size::Integer) = Plots.font("Open Sans", size)

# Set defaults
gr(size=(nj_width,nj_height), linewidth=3, legend=:none, fillcolor=:inferno,
   titlefont=osfs(14), guidefont=osfs(12), tickfont=osfs(10))

import Plots: plot, plot!

# plot of Wave uses the Timeseries times as x, and samples as y
plot(wv::Wave; kwargs...) = plot(wv.ts.times, wv.samples; kwargs...)
plot!(wv::Wave; kwargs...) = plot!(wv.ts.times, wv.samples; kwargs...)
plot! (generic function with 4 methods)

Using these tools, let's build a simple 5 Hz wave. We'll sample it at 100 Hz for one second to make sure we get a good picture of it.

ts = Timeseries(100, 0, 1) # 100 Hz sampling, for one second
wf = Waveform(2.5, 5, 0) 	# A = 2.5, f = 5, no phase shift
wv = Wave(wf, ts)

# plot A(t)
plot(wv, label="Function", legend=:best, 
     xlabel="Time [s]", ylabel="Amplitude", title="$( Hz Wave")
# add sample points
plot!(wv, lw=0, marker=:circle, label="Samples")

Figure: A simple 5 Hz wave (blue line), sampled at 100 Hz (orange points).

Here, the blue line takes the form of the smooth, continuous 5 Hz function itself, while the orange dots are the points we've sampled at 100 Hz. Alright, so that's a single monochromatic wave, how about we liven things up?

ts = Timeseries(100, 0, 1)
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
threewaves = [ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ]

p = plot(ts.times, [x.samples for x in threewaves], ylims=(-2.5,2.5), 
         layout = grid(3,1))

plot!(subplot=1, title="Three Waves", xformatter=_->"")
plot!(subplot=2, ylabel="Amplitude",  xformatter=_->"")
plot!(subplot=3,  xlabel="Time [s]")

Figure: Three monochromatic waves with different parameters.

Three waves with different amplitudes, frequencies, and phases. To a good approximation, sound waves (and most waves) are linear, and so they follow the superposition principle. That means that to combine waves, we can simply sum their amplitudes at each sample time. We'll store such summations in a CompositeWave type.

struct CompositeWave <: Wave

# Define summation of Waves.  We're not going to get terribly complicated 
# here, so we'll assert that waves must be based on the same Timeseries to 
# be summed, and not deal with resampling or anything like that.
function +(x::Wave, y::Wave)
  if (x.ts != y.ts) 
    error("Waves must be based on the same timeseries to add!")
  CompositeWave(x.samples+y.samples, [], x.ts)

# construct directly, or through summation
Wave(samples, wf::Array, ts) = CompositeWave(samples, wf, ts)
Wave(wfs::Array{Waveform}, ts) = sum([MonoWave(wf, ts) for wf in wfs]);

Now we can plot the sum of the three waves above.

wv = sum(threewaves) 

plot(wv, legend=:none,
     ylabel="Amplitude", xlabel="Time [s]", title="Composite Wave")

Figure: A composite wave which is the sum of the three waves shown above.

Now that's a lot more interesting. Given some work, one could probably figure out by-hand the base frequencies when combining three waves, but what if there are ten different frequencies? A hundred? Or, as in a more real-world application, some frequency sources aren't even close to monochromatic, but rather emit over a broad range of frequencies? Along the same lines, the specific aural characteristics of most instruments come from various harmonics and resonances, and the way these shift over time—all things we'd find quite difficult to pick out from a raw waveform.

2. The Fast Fourier Transform

This is where spectral analysis comes in, and Fast Fourier Transforms (FFTs) are one of the oldest and most widely implemented forms of this.

We'll leave the mathematical details to another article, but the fundamental concept is that a Fourier transform takes a function of a particular variable (such as time), and transforms it into a function of the frequency domain related to that variable. For our case, we'll literally just be transforming from the time domain to the frequency domain, whereas e.g. a transform dependent on a spatial measure would yield the frequency domain of changes over spatial distances.

Aside: The mathematical viewpoint on why this is possible is that the sine and cosine functions form a complete, orthonormal basis for a large set of continuous, integrable functions (the 'square-integrable' functions). This means that any function which falls within this set can be represented by a sum of sines or cosines—though in many cases it will be an infinite sum. The Fourier transform allows us to determine the frequencies of the most powerful sinusoids that make up our signal. For a different view of sinusoids and Fourier transforms, check this article out.

Thus, while our signal is in an Amplitude vs. Time space, the Fourier transform of the signal can yield an Amplitude (or more commonly, Power) vs. Frequency space, showing us the actual spectrum of frequencies that make up our signal. A single transformation of data taken over one timeframe is commonly referred to as a power spectrum, and an array of power spectra chained into a pseudo-3D timewise plot is called a spectrogram.

Alright, but then what is a Fast Fourier Transform? Well, the original form of the Fourier transform as applied to discrete samples of a function or signal is now simply known as the Discrete Fourier Transform, or DFT. Variants on this sort of analysis have been used since Babylonian times, and formulas mathematically equivalent to modern ones have existed since the 1700s, but directly computing a DFT on an -point snapshot of an arbitrary signal has a computational complexity of . The FFT was first developed by Gauss in 1805, and then rediscovered and modernized for computing by Cooley and Tukey in 1965. The Cooley-Tukey algorithm can reduce computation time to for many inputs, and further advancements in FFT algorithms for specific cases can optimize it even further. Many algorithms work fastest (or exclusively) when applied to snapshots of samples for any integer , and so such array sizes are often seen in high data rate, time sensitive, or computationally bereft applications.

The code for this article is written in Julia, and Julia has a package for direct interface to FFTW, "The Fastest Fourier Transform in the West", which is an open-source discrete transform library created at MIT. FFTW has been found in benchmark tests to be generally the fastest FFT library. Its speed—even competing with and surpassing libraries custom-tuned for specific hardware—comes in part from its 'planning' structure. Before running the actual transforms, the library plans to a specified thoroughness, examining the hardware it will be running on and determining optimal methods for computing transforms on the user-specified array sizes and dimensions, memory positions, and transform types, and running on a given number of compute threads.

FFTW is an extremely flexible library, but it isn't terribly intimidating at the basic level at which Julia interfaces with it: there are several different functions for different transform types and their inverse transforms. Each function takes input data, and specifications on which dimensions to transform along. There are also planned versions of the functions which take example data matrices and return a plan object which is applied to the actual data matrix with the * operator, and in-place versions of most functions, which place the output into the input array, saving some memory (though not necessarily halving memory use).

The more advanced interfaces in the full C library allow for extensive control over data structure, input and output locations, repetition of transforms, and parallel transforms via multithreading and/or distributed computing.

Because of everything FFTW takes into account when planning (including running speed tests on various architectural features such as SIMD instructions), thorough planning can often take a long time—up to several minutes if the FFTW.PATIENT or FFTW.EXHAUSTIVE flags are passed. Because the information the library learns while making its plans is often applicable to future planning, FFTW is able to save the 'wisdom' it has learned from its planning to memory and to disk, allowing for reuse of wisdom at future times. In the Julia interface, this data may be saved to disk with the undocumented FFTW.import_wisdom(wisdomfile) and FFTW.export_wisdom(wisdomfile) functions.

3. FFT Input

Now let's look at the actual process of going from a signal to a power spectrum. To do so we'll start adding frameworks, starting with a one-shot function to make a single line spectrum.

One-Shot FFT FunctionJulia
Julia Audio Analysis Install
using FFTW

fft_oneshot(<Real data>; window=<window Function>)

Returns a single line spectra.  Applies a window function, if provided.
function fft_oneshot(data::Array{T,1}; window::Union{Function,Nothing}=nothing) where {T <: Real}
  len = length(data)
  # Check for existing wisdom
  wisdomdir = "/shared/fftw_plans/"
  wisdomfile = "$(wisdomdir)rdft-1d-$(len).plan"
  try FFTW.import_wisdom(wisdomfile) catch end
  # Plan patiently
  plan = plan_rfft(similar(data), 1, flags=FFTW.PATIENT, timelimit=Inf)

  # Save generated wisdom
  try mkdir(wisdomdir) catch end
  try FFTW.export_wisdom(wisdomfile) 
          catch e print("Wisdom export error: $(e).") end

  # Check for window
  if typeof(window) != Nothing
    w = [window(n,len+1) for n in 1:len]
    data = w .* data
    wcorrect = sum(w)/len
    wcorrect = 1

  # Run the actual FFT, normalize
  # We'll use the same array for output, to save memory.
  data = (plan * data)/len
  # Output square-magnitude for powers, corrected for window
  ( (abs.(data.*data)).^2 )./wcorrect
fft_oneshot(data, ts::Timeseries; args...) = fft_oneshot(data; args...), fft_freq(length(data),ts)
fft_oneshot(wv::Wave; args...) = fft_oneshot(wv.samples; args...), fft_freq(length(wv),wv.ts);

A helper function is also needed, to calculate the frequency of each bin in the output spectra.

function fft_freq(samples::Integer, ts::Timeseries)
  sampling_freq = ts.rate > 0 ? ts.rate : (ts.times[end]-ts.times[1])/length(ts)

3.1. Sampling

We'll start out by looking at the input side—here's our simple 5 Hz wave again:

5 Hz First Wave

Figure: 5 Hz wave.

As before, we'll look at the line as the fundamental signal, whether it's a microphone picking up sound waves, an antenna picking up radio waves—whatever. The points then are where we've actually sampled, in this case at 100 Hz, and since we've sampled for one second, we have 100 data points in this 1-second snapshot of the 5 Hz wave.

To the eye, it's pretty clear that the wave is very well sampled, and this leads to the question of just how many samples we need in a given time period to properly capture a waveform. This is answered by Nyquist Theory, which gives us that, for a given sampling rate , the maximum frequency of wave that can be accurately sampled, or Nyquist frequency, is .

However, waves with frequency greater than the Nyquist, known as undersampled waves, do not just disappear. The undersampled waves will show an effect called aliasing, in which they mirror down, appearing as potentially spurious signals in a frequency spectrum.

ts = Timeseries(500,0,5)
# build a wave with ten frequencies, 30 Hz apart
wv = sum([ Wave(Waveform(1,f,0), ts) for f in 30:30:10*30 ])

spec,freqs = fft_oneshot(wv)

plot(freqs, spec, xlab="Frequency [Hz]", ylab="Power", size=(nj_width,200))

Figure: A line spectrum sampled at 500 Hz, showing ten input frequencies spaced 30 Hz apart. Note that at the top end, where the last two inputs—270 Hz and 300 Hz—are greater than the Nyquist frequency, the last two peaks have aliased, mirroring around and showing up at 230 Hz and 200 Hz.

Here, note that the input frequencies are evenly spaced by 30 Hz, and so when we reach the Nyquist frequency, they mirror around, showing as peaks at false frequencies starting near the Nyquist. To try to grasp why this happens, let's look at some waveforms.

ts = Timeseries(200,0,1)
wv = Wave(Waveform(1,3,0),ts)
plot(wv, ylims=[-1.1, 1.1], size=(nj_width,300), 
     legend=:best, label="$( Hz")

wv = Wave(Waveform(1,7,0),ts)
plot!(wv, label="$( Hz")

ts = Timeseries(10,0,1)
wv = Wave(Waveform(1,7,0),ts)
plot!(wv, lw = 0, marker=:circle, label="$(wv.ts.rate) Hz Sample Points")

Figure: Why aliasing occurs. A 7 Hz wave (orange line), sampled at 10 Hz (green points) looks exactly the same as a 3 Hz wave (blue line).

If we have the orange 7 Hz signal, but only sample it at 10 Hz (i.e. on the green points), then those samples end up looking exactly the same as what we'd see from the blue 3 Hz signal. As we saw from the spectrum above, the aliased wave frequency appears at the sampling frequency minus the real wave's frequency.

When a data acquisition system is designed for it, undersampling and aliasing can be used in various ways, such as bandpass downconversion. However, for a straight sampling system, aliased signals are a huge problem, and so hardware filters will commonly be used on the input side to remove frequencies above the Nyquist.

It's worth elaborating that the Nyquist frequency is an absolute limit, and so it applies most precisely to ideal, clean, phase-aligned signals. For real-world observations even in very good conditions, it's advisable to have a Nyquist well above—4x or more—the maximum frequency of interest, and strong filters that cut off frequencies well below the Nyquist. In cases where higher harmonics, noise, and other unknowns may be pertinent to the investigation, a Nyquist dozens of times the fundamental frequency involved would not be out of the question.

In general the Nyquist frequency is also a good measure of how well a given sampling system can reproduce a signal. Modern, consumer-grade ('DVD-quality') audio is sampled at 48 kHz, and the average upper limit of human hearing is generally considered to be about 20 kHz, with the most detailed perception below 10 kHz. Thus we see that consumer-grade systems can do a reasonable job of reproducing audio; however, production-grade often goes further, sampling at 96 kHz or even 192 kHz in order to better capture and reproduce the audio signal as well as allow for more mixing and filtering techniques.

3.2. Snapshots

Okay, so that's an answer to what the minimum is for how many samples we need in a given time period to capture waves at a given range of frequencies. Going in the other direction, do we get anything out of oversampling? This question leads into one of the main tradeoffs that occurs on the input side of sampling for FFTs.

No matter what our sampling rate, we can technically feed as many samples as we want to the FFT—we'll just need more space in memory to store the snapshot's input and output arrays. While we'll cover the relation more in the output section, for every two input samples we add to our snapshot, we'll get one more frequency bin in the output, and therefore more frequency resolution. Thus, larger snapshots can allow us to distinguish separate waves that are very close in frequency.

However, at a given sampling rate, we'll need to sample for a longer time period in order to get those additional samples. Thus, if we're trying to make a spectrogram, the tradeoff is between resolution on the frequency axis, and resolution on the time axis: longer snapshots will give us better frequency resolution, but our view of time will reciprocally get more coarse.

This is particularly important when you consider time-limited signals with potentially changing frequencies, and even more so when these signals do not themselves have any inherent perodicity over the course of their lifetime. This includes many manmade signals; e.g, instruments in a musical piece need not necessarily repeat any specific pattern. Other examples of such signals can be found among natural radio emissions, such as whistler waves.

These waves are generated in the upper atmosphere by lightning, and propagate down magnetic field lines. They have important effects in the Earth's plasma environment, including precipitating charged particles in the radiation belts. They were named because of their audible descending tone when the radio signal is directly converted to an audio signal.

So, what happens if our snapshots are so long that they end up encompassing an entire whistler wave? With zero time resolution along the period of the whistler, we'll just see a broad band of power along the entire range of the whistler's frequencies, and we won't grasp the true nature of the wave in time at all.

Of course, if make our snapshots too narrow in time, we won't be able to resolve the different frequencies that make up the whistler, either. When possible, it's often helpful to try different snapshot sizes, yielding multiple views of the frequency vs. time structure in the data.

That said, raising the sampling rate will also raise the overall resolution possible in either or both directions, but this requires additional data storage, and possibly better sampling equipment. While this isn't terribly burdensome in the audio frequency range, equipment capable of fully sampling well above the Nyquist for high-frequency radio waves (3 MHz up) can cost tens of thousands of dollars, and can easily require a gigabyte of storage for each minute of observation.

3.3. Windowing

A complication of the FFT process lies in the fact that, while we only input a time-limited snapshot to the FFT algorithm, the transform is based on math on infinitely periodic signals. The practical effect this has is as if the waveform snapshot we feed in is repeated infinitely. So, if we take the simple 5 Hz waveform from earlier, what's actually being transformed is more like...

ts = Timeseries(100, 0, 1); wf = Waveform(2.5, 5, 0); wv1 = Wave(wf,ts)
ts = Timeseries(100,-3,4); wv2 = Wave(repeat(wv1.samples,7),, ts)

plot([wv2.ts.times, wv1.ts.times], [wv2.samples, wv1.samples], 
  size=(nj_width,300), lw=3, 
  legend=:best, labels=["What's Tranformed" "Input Snapshot"], 
  title="$( Hz Wave", ylab="Amplitude", xlab="Time [s]")

Figure: A depiction of the theoretical view of what's being transformed in an FFT: the input snapshot (orange line) extends infinitely in both directions (blue line, for particularly small values of infinity).

This presents a problem, because in general the endpoints of our input waveform are not going to be continuous when connected end-to-end. If we take our earlier example of a three-component waveform, with 1-second snapshots we see a significant discontinuity at the endpoints.

wv = sum(threewaves)

ts2 = Timeseries(100,-1,2)
r_wv = Wave(repeat(wv.samples,3),, ts2)

plot([r_wv.ts.times, wv.ts.times], [r_wv.samples, wv.samples], 
     labels=["What's Transformed", "Input Snapshot"], 
     ylab="Amplitude", xlab="Time [s]", title="3-Component Wave")

Figure: An example of boundary discontinuities when a snapshot (orange line) is repeated (blue line). Note the vertical jump where the orange and blue lines connect.

Reproducing discontinuities like this would add a bunch of extraneous frequencies to our spectrum, potentially ruining our analysis. This generation of spurious frequencies is known as spectral leakage.

So, what can we do about this? The canonical solution is to modify our input waveform samples in a way that preserves as much of the information we're trying to extract as possible, adds as little distortion as possible, and makes the endpoints match up when the snapshots are placed end-to-end. This is done by windowing the input so that the ends go to zero in a smooth manner, using a window function . With the ends at zero, there can be no discontinuity at the periodic boundaries.

For each sample in our -sample input, will produce a number by which we multiply to get our windowed input set, :

fn=Fn×w(n)n{1N}f_n = F_n \times w(n) \quad \forall n\in \left\{1 \rightarrow N\right\}

One of the more common and simple window functions is the Hann window, which makes use of a cosine function:

w(n)=0.5(1cos(2πnN1))w(n) = 0.5\left(1-\cos\left(\frac{2\pi n}{N-1}\right)\right)
"Hann Window Function"
function window_hann(n::T, N::T) where {T <: Integer}
ts = Timeseries(100,0,1); wf = Waveform(2.5,5,0); wv = Wave(wf,ts)

plot(layout = grid(1,3), size=(nj_width,nj_height/2), 

plot!(wv; subplot=1,
      ylab="Amplitude", xlab="Time [s]", title="$( Hz Wave")

N = length(wv); n_set = 1:N;
w = [ window_hann(n,N+1) for n in n_set ]
plot!(n_set, w; subplot=2, ylim=[-1,1], 
      ylab="Multiplier", xlab="Sample", title="Hann Window")

winwave = Wave(wv.samples .* w,, wv.ts)
plot!(winwave; subplot=3, 
      ylab="Amplitude", xlab="Time [s]", title="Windowed Wave")

Figure: The shape and effect of a Hann window on an input snapshot. On the left a snapshot of a 5 Hz wave, in the center the function for the Hann window, and right the effect of applying the window, .

Adding a window is equivalent to impressing an additional signal onto the data (and thus the spectral components required to reproduce that signal), but as long as we remain aware of this, it can be controlled far more effectively than the boundary-discontinuity noise of spectral leakage.

There are a wide variety of window functions, with wider peaks, flat regions, or different roll-offs towards zero. Each one is developed to have particular characteristics affecting the signal, with different noise levels, bandwidth, and frequency resolution. There are even some specific cases where the 'rectangular' or 'boxcar' window (equivalent in most cases to applying no windowing function at all) has the most desirable characteristics.

Let's look at five different windows on two signals: first, a single 5 Hz sine wave, and second, a 3-wave composite signal, the most powerful signal at 23.7 Hz, a signal with 1/3rd of the amplitude at 8.92 Hz, and a weak signal at 1/15th the amplitude, at 5 Hz. We'll sample at 1 kHz for pretty good frequency resolution, and zoom in on the 0-50 Hz range.

Additional Window FunctionsJulia
Julia Audio Analysis Install
# Boxcar Window Function
function window_boxcar(n::T, N::T) where {T<:Integer}

# Blackman-Harris Window Function
function window_blackman_harris(n::T, N::T) where {T<:Integer}
  a = [ 0.35875 0.48829 0.14128 0.01168 ]
  sum([(-1)^k * a[k+1] * cos(2*pi*k*n/(N-1)) for k in 0:2])
#  a[1] - a[2]*cos(2pi*n/(N-1)) + a[3]*cos(4pi*n/(N-1)) 
#       - a[4]*cos(6pi*n/(N-1))

# Flat-top Window Function
function window_flattop(n::T, N::T) where {T<:Integer}
  a = [ 1 1.93 1.29 0.388 0.028 ]
  sum([(-1)^k * a[k+1] * cos(2*pi*k*n/(N-1)) for k in 0:4])
#  a[1] - a[2]*cos(2pi*n/(N-1)) + a[3]*cos(4pi*n/(N-1)) 
#       - a[4]*cos(6pi*n/(N-1)) + a[5]*cos(8pi*n/(N-1))

# Planck-taper Window Function
function window_planck_taper(n::T, N::T; epsilon::Real=0.25) where {T<:Integer}
  lbound = epsilon*(N-1); rbound = (1-epsilon)*(N-1) 
  if 0 <= n < lbound
    res = 1/(exp(2epsilon*(1/(1 + (2n/(N-1)-1)) 
                         + 1/(1 - 2epsilon + (2n/(N-1)-1))))+1)
  elseif lbound <= n <= rbound
    res = 1
  elseif rbound < n <= N-1
    res = 1/(exp(2epsilon*(1/(1 - (2n/(N-1)-1)) 
                         + 1/(1 - 2epsilon - (2n/(N-1)-1))))+1)
    res = 0
Window ComparisonJulia
Julia Audio Analysis Install
# Sample at 1 kHz
ts = Timeseries(1000,0,1); wf = Waveform(1,5,0); wv = Wave(wf,ts);
wavedefs = [[1 5 0],[15 23.7 pi/4],[5 8.92 pi/7]]
wvs = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])

N = length(ts); n_set = 1:N; plots = []; specs = []
nyquist = ts.rate/2; freq_set = 0:nyquist/(N/2):nyquist;
winticks = 0:0.25:1;
specxrange = (0, 50); spec1yrange = (-750, 50); spec1yticks = -600:200:0;
specxticks = 0:10:50; spec2yrange = (-200, 0); spec2yticks = -200:50:0;

# Build list of Dicts of windows
windows = [Dict(:name => "No Window", :s_ylim => (-90,10), 
      :func => window_boxcar, :mult => 1)]
push!(windows, Dict(:name => "Hann", :s_ylim => (-150,10), 
      :func => window_hann, :mult => 1))
push!(windows, Dict(:name => "Blackmann-Harris", :s_ylim => (-240,20), 
      :func => window_blackman_harris, :mult => 1))
# scaling by 1/5 to make comparable to other windows
push!(windows, Dict(:name => "Flat-top", :s_ylim => (-240,20), 
      :func => window_flattop, :mult => 1/5)) 
push!(windows, Dict(:name => "Planck-taper", :s_ylim => (-190,10), 
      :func => window_planck_taper, :mult => 1))

p = plot(layout=grid(3,5), size=(nj_width,600),
for (i, win) in enumerate(windows)
  w = [ win[:func](n, N+1) for n in n_set ]*win[:mult]
  plot!(n_set, w; ylim=[-0.1, 1.1], title=win[:name], subplot=i,
        xticks=(0:250:1000, ["0","","","","1000"]), yticks=winticks)
  spec = 10*log10.(fft_oneshot(wv.samples, window=win[:func]))
  spec .-= maximum(spec)
  # limit data to workaround
  xinds = findall(f -> (f >= specxrange[1] && (f <= specxrange[2])), freq_set)
  plot!(freq_set[xinds], spec[xinds]; subplot=5+i, 
        xlim=specxrange, ylim=spec1yrange, 
        xticks=(specxticks, []), yticks=spec1yticks)
  spec = 10*log10.(fft_oneshot(wvs.samples, window=win[:func]))
  spec .-= maximum(spec)
  # limit data — still some jank in y, but whatever
  xinds = findall(f -> (f >= specxrange[1]) && (f <= specxrange[2]), freq_set)
  plot!(freq_set[xinds], spec[xinds]; subplot=10+i, 
        xlim=specxrange, ylim=spec2yrange, 
        xticks=specxticks, yticks=spec2yticks)

# spiff things up a little
plot!(xlab="Frequency [Hz]", subplot=13)
plot!(xlab="Sample", subplot=3)
plot!(ylab="Multiplier", subplot=1)
plot!(ylab="Power [dB]", subplot=6)
plot!(ylab="Power [dB]", subplot=11)
for i=2:5
  plot!(subplot=i,    yticks=(winticks,[]))
  plot!(subplot=5+i,  yticks=(spec1yticks,[]))
  plot!(subplot=10+i, yticks=(spec2yticks,[]))


Figure: A comparison between five window functions. In the columns, left to right, no window (aka rectangular or boxcar window), the Hann window, the Blackman-Harris window, a Flat-top window, and a Planck-taper window. On the top row, the shape of the window function itself; middle row line spectra of a windowed 5 Hz wave; and on the bottom line spectra of a composite of three waves of various parameters. Note for each spectrum the distance between peaks and the noise floor, and the sharpness or spreading of the peaks.

There's a lot going on above. One of the most important things to note is that the spectra are all in a logarithmic scale with units of decibels (dB), and shifted so that the highest peak is at 0 dB. The top row of spectra (for the 5 Hz wave) are all set to the same 800 dB y-axis range, while the lower row are individually set to show the most important data. From just this relatively simple comparison, we can see a clear tradeoff exists.

While all four of the cosine-based windows have very low noise floors, approximately 600 dB below peak for the single, pure sine wave, the rectangular window is clearly the best, with a narrow, sharp peak. The more complex the construction of a cosine-based window, the more the peak spreads out, and the Planck-taper window is just shy of nonsense.

However, when we add in just two other waves, the rectangular window becomes comparatively bad due to the spectral leakage: the spreading from the peaks is huge, and the noise floor rises to more like 50 dB off-peak, almost entirely hiding the weak 5 Hz peak, which only shows up as a tiny blip above the spread. For our signal, the best window appears to be the Hann window, which shows all three peaks well, with significantly less spread than the rectangle.

Aside: Is the noise and spreading of peaks entirely due to windowing? No, the discretization/sampling process itself introduces noise and artifacts. One way to look at it is that each frequency bin covers a range of frequencies dependent on the snapshot size and sampling rate, and it's fairly unlikely that a given fundamental wave in our signal will fall directly at the center of a bin. Effectively, each frequency in a bin (or sample in time) is a little fuzzy in frequency/time, and this introduces noise.

From this small test comparison we can see at least a hint towards the need for case-by-case consideration of various window functions, depending on factors such as the signal source, probable peak separation, dynamic range, and noise levels. Particularly careful consideration is necessary in cases where the transformed data is what is stored, such as systems with limited storage (remote collection) or bandwidth (space probes) that can only store and/or transmit a selection of frequencies. In such systems, you only have one chance to get it right.

A final complication of windowing is that the window can affect the absolute powers in the result. This may be fairly obvious simply by noting that the the sum over the amplitudes in a Hann window— —comes to 0.5, i.e. half of the amplitude in a waveform is going to be damped by the Hann window. A simple and oft-used fix for this is to simply globally correct the resultant powers by dividing out . This is a useful rough approximation, but is only exact for a pure-noise signal. In general, a window function will have different effects over the range of the spectrum, and the distortion will depend upon the input signal.

4. FFT Output

The raw output of an FFT is not specifically what we're looking for here, where we'll be concentrating on spectral power. To get to power, we first need to look at what the Fourier transform is actually producing.


Briefly, both the input and output of the Fourier transform ideally contain all of the information about the wave—amplitude, frequency, and phase—in the form of a complex number . But wait, we didn't feed it a complex number—we only had real samples, right? This is true, and the simplest way to think about it is that this means we need two samples on the input side to yield a single complex number on the output side. There are ways of sampling a signal that yield complex data, but for real samples such as ours the output of a standard DFT/FFT will be 'mirrored', i.e. one half of the output will be complex numbers arrayed by frequency, and the other half will be the same numbers, in reverse order.

The specific FFTW function we're using here, rfft(), knows that we're inputing real numbers, and so it returns the correct-order half of the output (plus a 0-frequency offset bin), and we don't have to deal with discarding half of the outputs. This also saves some processing time on the transform.

The full complex-number output is useful in various ways. Taking will yield the phase angle of each frequency component. With multiple perpendicular sampling antennas, the complex components can be mixed in various ways to find things like wave polarization and direction of arrival. For our purposes, let's look at the details of turning our complex numbers into powers.

4.2. Power Spectra

To get the power that we're interested in, we'll have to go through a number of steps to process a given output datum .

  • Textbook definitions of the transform can vary, but FFTW returns unnormalized output. What this means is that, to get accurate output power, we must first divide our output by the length of our FFT snapshot, but then multiply by 2 because our input is real.
  • We'll use the general correction for our window, i.e. divide out the total gain found by summing over the window—for the Hann window this means dividing out 0.5.
  • To get an amplitude, we then calculate the magnitude of the complex number , i.e. the absolute value of the conjugate transpose times the original, .
  • Finally, squaring the magnitude gives us a power.

Putting it all together yields

P=X2N2X2N22=XNXN2=XXN22.P = \left|\frac{\overline{X}}{2\frac{N}{2}} \frac{X}{2\frac{N}{2}}\right|^2 = \left|\frac{\overline{X}}{N}\frac{X}{N}\right|^2 = \left|\frac{\overline{X} X}{N^2}\right|^2.

4.3. Frequency Bins & Snapshot Times

The last pieces we need to get some nicely formatted output are to associate our output bins to frequencies, and make sure our times are aligned correctly.

We already know that, for an FFT snapshot of size , our output will have bins. The only other thing we need is the range of frequencies we're dealing with. This range has its max at the Nyquist frequency, which at our sampling frequency will just be . Thus, we're spreading a range of to Hz over bins, meaning each bin will have an span, and so our frequencies are the range 0:F/N:F/2.

The times assigned to each of a spectrogram's snapshots are easier, but contain a choice. For each sample in our total data set, we have a from the set of all times . With samples per snapshot, to get our times we just have to go through with a step size of . The choice left to the user is where within each step you take the time for a given snapshot: from the beginning, middle, or end. For the following we'll use the beginning time, and so snapshot_times = [ T[i] for i in 1:N:length(T) ]

5. Spectra

With our inputs sorted and our outputs processed, we can look at our frequency-domain results. Let's take a look at our original 3-component signal, sampled at 100 Hz. From here on we'll use a Hann window.

ts = Timeseries(100, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])

plot(wv, label="Signal", xlim=(0,1), 
     xlab="Time [s]", ylab="Amplitude", size=(nj_width,300), layout=grid(1,2))

spec,freqs = fft_oneshot(wv, window=window_hann)
plot!(freqs, spec; label="Power Spectrum", xlim=(0,10), 
      xlab="Frequency [Hz]", ylab="Power", subplot=2)

Figure: Power spectrum of a 3-frequency composite wave.

We can see that this clean signal—with a Nyquist far above any of the input frequencies and a long snapshot time—comes out pretty nicely, although it's tough to see the lowest-amplitude wave. What if we view it scaled logarithmically? The standard is to take the base-10 logarithm of a given power (bels), and then multiply by ten (decibels, or dB).

ts = Timeseries(100, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])

plot(wv, label="Signal", xlim=(0,1), 
     xlab="Time [s]", ylab="Amplitude", size=(nj_width,300), layout=grid(1,2))

spec,freqs = fft_oneshot(wv, window=window_hann)
spec = 10*log10.(spec)
plot!(freqs, spec; label="Power Spectrum", xlim=(0,10), ylim=(-200,0), 
      xlab="Frequency [Hz]", ylab="Power [dB]", subplot=2)

Figure: The same power spectrum as above, but with the y-axis in the logarithmic scale of decibels ( ).

This allows one to view even very powerful peaks while still keeping the noise floor and any lower-powered peaks in view, and is particularly helpful when viewing spectrograms rather than single line spectra, as it can fit a lot of range into a given color scale.

What if we add in some noise? We'll add in a 'wave' made up of some Gaussian noise, and bump the sampling rate to Hz to make sure we're seeing clearly.

ts = Timeseries(400, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])
# Gaussian noise
wv += Wave(2.5*randn(length(ts)),Waveform(0,0,0),ts)
plot(wv, label="Signal", xlim=(0,1), 
     xlab="Time [s]", ylab="Amplitude", size=(nj_width,300), layout=grid(1,2))

spec,freqs = fft_oneshot(wv, window=window_hann)
spec = 10*log10.(spec)
plot!(freqs, spec; label="Power Spectrum", xlim=(0,10), 
      xlab="Frequency [Hz]", ylab="Power [dB]", subplot=2)

Figure: Spectrum of the same signal, but with light, random Gaussian noise added.

We can see there's still something there in the input, though it's having a bad time. The spectrum is in pretty good shape, but let's increase the noise amplitude by about a factor of 5.

ts = Timeseries(400, 0, 4);
wavedefs = [[2.5 5 0],[1 7.25 pi/4],[1.84 3.77 pi/7]]
wv = sum([ Wave(Waveform(wd[1], wd[2], wd[3]), ts) for wd in wavedefs ])
# Gaussian noise
wv += Wave(12*randn(length(ts)),Waveform(0,0,0),ts)
plot(wv, label="Signal", xlim=(0,1), 
     xlab="Time [s]", ylab="Amplitude", size=(nj_width,300), layout=grid(1,2))

spec,freqs = fft_oneshot(wv, window=window_hann)
spec = 10*log10.(spec)
plot!(freqs, spec; label="Power Spectrum", xlim=(0,10), 
      xlab="Frequency [Hz]", ylab="Power [dB]", subplot=2)

Figure: Spectrum with a much larger noise amplitude.

Now the signal is a real mess, and even in the power spectrum it's getting tough to tell signal from noise, but if we go back to a linear plot...

#using the wave from the previous cell
plot(wv, label="Signal", xlim=(0,1), 
     xlab="Time [s]", ylab="Amplitude", size=(nj_width,300), layout=grid(1,2))

spec,freqs = fft_oneshot(wv, window=window_hann)
plot!(freqs, spec; label="Power Spectrum", xlim=(0,10), 
      xlab="Frequency [Hz]", ylab="Power", subplot=2)

Figure: Power spectrum with the same noise as above, but with the y-axis on a linear scale.

Now that's the clearer one, though there can easily be spurious noise peaks in the data as well, hiding or distorting the low-powered signals. Either view can be useful in some cases, and misleading in others. To wrap things up, let's look at a more complicated signal, with 50 random component waves and a bit of noise.

Analysis of Random Signals & NoiseJulia
Julia Audio Analysis Install
using Random

ts = Timeseries(1000,0,10); rdev = RandomDevice();
rwf = :(Waveform([2.5,100,2pi].*rand(rdev,3)...)) # reusable random waveform symbol
wv = sum([Wave(eval(rwf),ts) for i in 1:50 ])

plot(size=(nj_width,600), layout=grid(3,2))

plot!(wv, title="Clean Signal", subplot=1, 
     xlim=(0,1), ylim=(-25,25), 
     xlab="Time [s]", ylab="Amplitude")

# add a little noise
wv += Wave(5*randn(length(ts)),Waveform(0,0,0),ts)
plot!(wv, title="Noisy Signal", subplot=2, 
      xlim=(0,1), ylim=(-25,25), 
      xlab="Time [s]", ylab="Amplitude")

spec,freqs = fft_oneshot(wv, window=window_hann)
plot!(freqs, spec; title="Linear Spectrum", subplot=3, 
      xlim=(0,125), xlab="Frequency [Hz]", ylab="Power")

lspec = 10*log10.(spec)
plot!(freqs, lspec; title="Log Spectrum", subplot=4, 
      xlim=(0,125), ylim=(-50,0), 
      xlab="Frequency [Hz]", ylab="Power [dB]")

smin = minimum(spec); smax = maximum(spec); ths = smin:(smax-smin)/49:smax
spth = [ length(findall((p) -> p>th, spec)) for th in ths ]
plot!(ths, spth; title="Bins vs. Threshold", subplot=5, 
      xlab="Threshold", ylab="# Bins >")

smin = minimum(lspec); smax = maximum(lspec); ths = smin:(smax-smin)/49:smax
spth = [ length(findall((p) -> p>th, lspec)) for th in ths ]
plot!(ths, spth; title="Bins vs. Threshold (log)", subplot=6, 
      xlab="Threshold", ylab="# Bins >")

Figure: Analysis of a signal with 50 random waves and some significant background noise.

Comparing the Linear and Log power spectra in the middle row allows us to easily discern which peaks are the largest, while also seeing the total group of frequencies which have a strong signal above the noise level. In the bottom row, the Power or dB range have been divided into 50 levels, and what's shown is how many bins are above a given level.

6. Spectrograms

Now that we've looked at single line spectra, let's introduce a bit of dynamism and examine some spectrograms. As previously mentioned, a spectrogram is a pseudo-3D plot of spectral power vs. both frequency and time, showing the power at a set of frequencies over time, with the power dimension displayed via color. Of course, for a constant signal this isn't much more useful than a single line spectrum, but if it's an unknown signal the spectrogram can help you learn whether it's constant in the first place.

First, we'll build a framework for spectrograms like we had for line spectra, but a little more involved. First, a function to perform the sequence of FFTs. This takes a one-dimensional array of samples of length and breaks it up into snapshots of a specified length , performing FFTs on each and returning an matrix.

FFT Spectrogram GeneratorJulia
Julia Audio Analysis Install
fft_specgram(data, len, [window=wfunc])
Computes repeated 1D FFTs on `data`, using snapshots of size `len`.  If `len` does not evenly divide `length(data)`, the last snapshot will be zero-padded.  If `window` is provided, this function will be applied to each snapshot before the run.
function fft_specgram(data::Array{T}, len::Integer; 
                      window::Function=((x,y) -> 1)) where {T<:Real}
  datasize = length(data)
  # if there aren't enough samples in the input to make an 
  # integer number of snapshots, fill with zeroes
  (dim2, rem) = fldmod(datasize,len)
  if rem > 0
    data = vcat(data,zeros(T,len-rem))
    dim2 += 1
  data = reshape(data, len, dim2)

  w = T[window(n,len+1) for n in 1:len]
  if (w != ones(len))
    data = mapslices(X -> w.*X, data, dims = 1)
    wcorrect = sum(w)/len
    wcorrect = 1

  sizestring = join([string(n) for n in size(data)],"x")
  wisdomdir = "/shared/fftw_plans/"
  wisdomfile = "$(wisdomdir)rdft-1d-$(sizestring).plan"
  try FFTW.import_wisdom(wisdomfile) catch end
  plan = plan_rfft(similar(data), 1, flags=FFTW.PATIENT, timelimit=Inf)
  try mkdir(wisdomdir) catch end
  try FFTW.export_wisdom(wisdomfile) 
    catch e print("Wisdom export error: $(e).") end
  data = (plan * data)/len
  data = (abs.(data.*data)).^2
  data /= wcorrect
  # transpose to get the right order for Plotly freq vs. time specgrams
fft_specgram(data, len, ts::Timeseries; args...) = 
  fft_specgram(data, len; args...), fft_freq(len,ts), ts.times[1:len:end]
fft_specgram(wv::Wave, len; args...) = fft_specgram(wv.samples, len; args...), fft_freq(len,wv.ts), wv.ts.times[1:len:end];

Another to snip out a specific frequency- and/or time-range from a spectrogram result matrix.

# Snip function
function speclimit(spec, freqs, times; fl="all", tl="all")
  f_ind = fl == "all" ? (1:length(freqs)) : 
    (findfirst((x) -> x>=fl[1],freqs):findlast((x) -> x<=fl[2],freqs))
  t_ind = tl == "all" ? (1:length(times)) : 
    (findfirst((x) -> x>=tl[1],times):findlast((x) -> x<=tl[2],times))
  spec[t_ind,f_ind], freqs[f_ind], times[t_ind]

And a method to simplify spectrogram plotting by setting some defaults

specgram(x, y, z; kwargs...) = heatmap(
  x, y/1e3, transpose(z); # default kHz yscale
  xticks = ceil(x[1]):5:floor(x[end]), 
  yticks = ceil(y[1]):0.5:floor(y[end]),
  clims = ( maximum(z)-120, maximum(z)-0 ), # default 90 dB range

Finally, a small function to generate a triangle wave, for testing.


Triangle wave function
function sweep_triangle(center,sweep,period,time)
  t = time % period
  if t<period/2

Now below, a spectrogram of the signal clearly reveals the sweeping frequency.

ts = Timeseries(200,0,10)
wf = Waveform((t) -> 2.5*cos(sweep_triangle(50,10,5,t)))
wv = Wave(wf,ts)
spec,freqs,times = fft_specgram(wv,64,window=window_hann)
spec = 10*log10.(spec)

specgram(times,freqs,spec, title="Triangle Sweep", 
         xlab="Time", ylab="Frequency")

Figure: Spectrogram of a triangularly sweeping signal. Note that the final spectrum is washed out due the final input snapshot being zero-filled with enough samples to complete it.

The advantage of a spectrogram over other such representations (such as waterfall plots) is that the power dimension is represented via color and/or luminosity differences. This 2D representation of 3D data can allow for easier viewing and interpretation, and is optimal for print media; however, if care is not taken, it can lead to misinterpretation of relative or absolute peak powers. In cases where a proper reading is crucial, it can be beneficial to compare both linear and decibel spectrograms, and to view individual line spectra at particularly complex times—or, conversely, slices showing the power at specific frequencies, over time.

Now to finish up, we can finally go back to the clip from Mozart's opera.


We'll load higher-fidelity, single-channel audio from a .wav file.

Magic Flute Clip.wav
using WAV

data, fs, nbits = wavread(
Magic Flute Clip.wav
,format="native"); data = convert(Array{Float32,2},data) ttime = length(data)/fs print("Samples: ",length(data),", Frequency: ",fs, ", Bit Depth: ",nbits,", Total Time: ",ttime,".")

From the output line we can see that the audio is slightly above CD quality, with a 48 kHz sample rate. To get a rough idea of the audio amplitude over time, we can look at the average wave power every samples.

dnum = 
Decimation Factor
# factor by which to decimate ddata = [ sum(data[i-dnum:i+(dnum-1)])/(2*dnum) for i in dnum+1:(2*dnum):length(data)-dnum ] ndata = length(ddata) dtimes = 0:ttime/ndata:ttime-ttime/ndata # generate time array plot(dtimes, ddata; title="Magic Flute Signal", ylab="Sound Amplitude", xlab="Time [s]")

Figure: A 50x averaged amplitude of the Magic Flute audio data.

We'll run the FFTs with 4096-sample snapshots, and use our Hann window.

ts = Timeseries(promote(fs,0,ttime)...)
spec,freqs,times = fft_specgram(data, 4096, ts; window=window_hann);

Now we can take a look at a spectrogram of real data. We'll limit our view to the region below 10 kHz, because there is very little activity at higher frequencies.

lspec,lfreqs,ltimes = speclimit(spec,freqs,times,fl=[0,10000]);
lspec = 10*log10.(lspec)

specgram(ltimes,lfreqs,lspec, title="Magic Flute Spectrogram", 
         clims = ( maximum(lspec)-110, maximum(lspec)-10 ), 
         yticks = ceil(lfreqs[1]):floor(lfreqs[end]),
         xlab="Time [s]", ylab="Frequency [KHz]")

Here we can really begin to see the power of this form of analysis. While the resolution is limited for 48 kHz sampling, we can nevertheless see the three ascending patterns of the flute near the opening of the clip at around 1 kHz, as well as later descending patterns at slightly lower frequency ranges, and around 8s we see the oboe come in around 1.5 kHz. We can also see the ability of a logarithmic scale to encompass large changes in volume between the silent segments and the loud ones.

If we zoom in a bit on the frequency axis, we see that we're hitting the limitations of the sampling rate.

ts = Timeseries(promote(fs,0,ttime)...)
spec,freqs,times = fft_specgram(data, 4096, ts; window=window_hann)
lspec,lfreqs,ltimes = speclimit(spec,freqs,times,fl=[0,4000]);
lspec = 10*log10.(lspec)

specgram(ltimes,lfreqs,lspec, title="Magic Flute Spectrogram, 0-4 kHz", 
         clims=( maximum(lspec)-110, maximum(lspec)-10 ),
         xlabel="Time [s]", ylabel="Frequency [KHz]")

Figure: A zoom on the lower frequency region of the above spectrogram. Note that while we can detect the presence of various instruments, both the frequency and time resolution are a bit blurry at this zoom level.

We can clean up the fine details of the spectrogram a little by using a simple—though computationally expensive—technique called overlapping. Here, rather than having a series of consecutive snapshots such that the first sample of snapshot is the sample following the end of snapshot , we treat the snapshots as a sliding window. Thus, if we have an overlap of 0.5, we gain additional power spectra at half timesteps compared to a non-overlapped spectrogram. With four overlaps, we shift the snapshot by 25% of for each fractional timestep.

N = 1024; n_set = 1:N;
w = [ window_hann(n,N+1) for n in n_set ]
plot([ n_set, n_set.+0.25N, n_set.+0.50N, n_set.+0.75N, n_set.+N ],
     [ w, w, w, w, w ];
     label=[ "Snapshot S", "Overlap S1", "Overlap S2",
             "Overlap S3", "Snapshot (S+1)"], 
     xlab="Samples", title="Overlapped Snapshot Windows")

Figure: A depiction of the window functions of overlapping snapshots. In a spectrogram without overlap, only th leftmost (blue) and rightmost (purple) windows would be used. Note that the overlap windows analyze the data which would be close to zero in the non-overlapped case.

Here, without overlapping only the leftmost blue and rightmost purple snapshots would be used. The data near the boundaries then has minimal significance in our spectrogram, and the time resolution is in a fixed reciprocal relation with frequency resolution, as described above. So we see an additional theoretical benefit of this technique is that the overlap will more fully examine data that is multiplied towards zero near the boundaries of a window function.

One might think that there's nothing stopping us from only sliding the snapshot by one sample each time, achieving maximum overlap. Technically this is correct, but this must be weighed against the fact that each additional fractional timestep increases the number of FFTs required—and the amount of data that must be stored and plotted—by 100%.

Moreover, while in practice overlapping can allow us to increase our frequency resolution while maintaining a good time resolution, in a spectrogram this can quickly reach a limit of utility: highly monochromatic sources will become so vertically narrow on the plot that they can be hard to see, and the larger snapshots will still weaken our view of things in the time domain, regardless of overlap.

For the finest of detail work, it is still generally most effective to find a good balance of frequency resolution, time resolution, and computational demands when crafting a spectrogram (with consideration to the added flexibility that overlapping yields), and then view single line spectra at times of interest—and remember, overlapping can't help us with a spectrum at a single time.

Usually a modest number of overlaps is required to significantly increase a spectrogram's utility. Below, we'll increase our frequency resolution by a factor of 2, to 8192, and then overlap by a factor of 16, meaning each line spectrum will represent a shift of 512 samples.

Overlapped FFTsJulia
Julia Audio Analysis Install
snaplen = 8192
olpieces = 16; olratio = 1/olpieces; # 4 overlaps per window
ollen = convert(Integer,ceil(snaplen/olpieces))

  Rather than make a single input matrix with the data properly shifted for 
  overlap in each snapshot, we'll just do multiple runs of FFTs with the 
  data shifted by (ollen) for each run.
  The additional overhead is negligible for our size of matrix.

lspecs = []
for i in 1:olpieces
	if i > 1
		data[1:ollen] .= 0
	spec,freqs,times = fft_specgram(data, snaplen, ts; window=window_hann)
	global lspec,lfreqs,ltimes = speclimit(spec,freqs,times, fl=[0,4000])

dt = diff(ltimes); push!(dt,dt[end]); # extrapolate final dt
# for each time, tack on (olpieces-1) fractional times
oltimes = [ time for (i,t0) in enumerate(ltimes[1:end]) 
            for time in t0:olratio*dt[i]:t0+dt[i]-olratio*dt[i] ]

# rearrange from array of shifted matrices to one large matrix
spec = transpose(hcat([ spectrum[i,:] for i in 1:size(lspec,1) 
              for spectrum in lspecs ]...))

spec = 10*log10.(spec);
specgram(oltimes,lfreqs,spec, title="Magic Flute Spectrogram", 
         clims=(maximum(spec)-120, maximum(spec)-0), 
         xlab="Time [s]", ylab="Frequency [KHz]")

Figure: The return of our first figure, now that we know what goes into making it.

The overlapping particularly clears up the low end of the spectrum: the overlapping allows us to increase the snapshot size, narrowing down the vertical size of the signals of individual instruments, while the still slightly improving the time resolution. Comparing to the non-overlapped figure, it is now clear that around 10 seconds, we can see the alternating quick note successions of the bassoon and clarinet around 0.5 kHz.

While the most sonically crowded times are still difficult to analyze without better sampling, an experienced composer could probably determine the total number of instruments at each time by examining individual line spectra, and possibly even reconstruct a reasonable form of the clip from this data.

7. Conclusion

Hopefully this article has served as a convincing presentation of the power and utility of spectral analysis. These and related techniques are pervasive in the modern world: frequency-domain data is used as a part of data compression, machine learning, medicine, and behavioral and phenomenological prediction. We observe the world in the time domain, but it is often the promininent features in the frequency domain which make up what is most interesting in our observations.

While we have only touched on some of the simplest uses of FFTs, and there are now other methods which can be more effective for some applications (such as wavelet analysis), in their fullest form FFTs remain the most accessible, widely implemented, and computationally thrifty tools available for spectral analysis.

If you'd like to further explore some of the ideas we've presented:

  • For relating component waves back to the sinusoidal functions' roots in trigonometry and circles in the complex plane, see the fancy animations here.
  • The Mathworld Fast Fourier Transform page makes a good launching point to study FFT algorithms themselves.
  • For deeper coverage of the mathematics of the Fourier transform and Fourier series, see this lecture from here, or this, from here.
  • The Scientist and Engineer's Guide to Digital Signal Processing textbook is available free online, and covers everything from the basics of data acquisition and fundamentals of spectral analysis to myriad applications.
  • NASA's educational materials include a brief introduction to how light spectra are used in astronomy and cosmology.