Background

Refer to this video

Preprocessing

Reading Audio Files

Use Librosa

  • librosa.load() → (array of amplitudes, sampling rate)

    • Sampling rate is also called sampling frequency (samples/sec)

import librosa

samples, sampling_rate = librosa.load(file_path)

# Length of the audio file in seconds:
len(samples) / sampling_rate

Visualizing Audio

  • Plotting the Time Domain

    • Amplitude (loudness) against time

  • Not very useful by itself, but it can be transformed into the frequency domain, which is more useful (the process is the Fourier Transform)

import matplotlib.pyplot as plt
import librosa.display

plt.figure()
# waveplot was removed in librosa 0.10; waveshow is its replacement
librosa.display.waveshow(samples, sr=sampling_rate)
plt.xlabel("Time (seconds) -->")
plt.ylabel("Amplitude")
plt.show()

Fourier Transform (FT)

  • Transformation from Time Domain → Frequency Domain

  • Takes continuous signal as input

  • Decomposes signal into constituent frequencies

  • Returns 2 components:

    • Frequency

    • Magnitude of frequency

Downsampling

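  • Frequency-domain data becomes irrelevant beyond a certain point (the high-frequency end has much smaller magnitude)

    • Microphones typically pick up little at these frequencies

  • When downsampling (lowering the sampling rate), we omit these irrelevant data points in the frequency domain: a signal sampled at a given rate can only represent frequencies up to half that rate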
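As a rough sketch of the idea above, the following NumPy snippet downsamples a signal by dropping the frequency bins above the new limit. The helper name `downsample_fft`, the synthetic test signal, and the 44100 Hz and 16000 Hz rates are assumptions chosen for illustration; they do not come from these notes.

```python
import numpy as np

def downsample_fft(samples, orig_sr, target_sr):
    """Hypothetical helper: resample by truncating the spectrum."""
    n_in = len(samples)
    n_out = int(round(n_in * target_sr / orig_sr))
    spectrum = np.fft.rfft(samples)
    # Keep only the bins at or below the new Nyquist frequency (target_sr / 2).
    kept = spectrum[: n_out // 2 + 1]
    # irfft divides by n_out instead of n_in, so rescale to preserve amplitude.
    return np.fft.irfft(kept, n=n_out) * (n_out / n_in)

orig_sr, target_sr = 44100, 16000       # assumed rates, for illustration only
t = np.arange(orig_sr) / orig_sr        # one second of audio
# A strong 440 Hz tone plus a faint 15 kHz component.
samples = np.sin(2 * np.pi * 440 * t) + 0.01 * np.sin(2 * np.pi * 15000 * t)

down = downsample_fft(samples, orig_sr, target_sr)
print(len(samples), "->", len(down))    # fewer samples for the same duration
```

The 15 kHz component lies above the new 8 kHz limit and is discarded, while the 440 Hz tone survives. For real files, `librosa.load` accepts an `sr` argument that resamples on load.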

Fast Fourier Transform (FFT)

  • Takes a discrete (sampled) signal as input

  • Is an efficient algorithm for computing the Discrete Fourier Transform (DFT)

    • The DFT does the same thing the FT does, but with discrete input
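A minimal sketch of this decomposition, using NumPy's FFT on a synthetic signal. The 8000 Hz sampling rate and the two tone frequencies are assumptions picked so that the result is easy to check by eye.

```python
import numpy as np

sr = 8000                                # assumed sampling rate (samples/sec)
t = np.arange(sr) / sr                   # one second of sample times
# Two sine waves: 440 Hz at amplitude 1.0 and 1000 Hz at amplitude 0.5.
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.fft.rfft(signal)                    # FFT of a real-valued signal
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)    # frequency of each bin in Hz
magnitudes = np.abs(spectrum) * 2 / len(signal)   # scaled back to sine amplitudes

# The two strongest bins should sit at the two tone frequencies.
top_two = np.sort(freqs[np.argsort(magnitudes)[-2:]])
print(top_two)
```

The two outputs of interest are `freqs` (which frequencies are present) and `magnitudes` (how strong each one is), matching the two components the Fourier Transform returns.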

Short-time Fourier Transform

  • Take a short segment (window) of the audio at a time

  • Conventional window length: 25 ms

  • Conventional step size: 10 ms

  • N FFT: 512 samples (a power of 2)

  • Imagine the window being slid along the time-domain signal one step at a time: the FFT is computed for each windowed section, and stacking the results over time produces a spectrogram
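The windowing procedure can be sketched in plain NumPy as follows. The function name `stft_magnitude`, the 16000 Hz sampling rate, and the Hann window are assumptions made for this example; the notes only fix the 25 ms window, 10 ms step, and 512-point FFT.

```python
import numpy as np

def stft_magnitude(samples, sr, win_s=0.025, hop_s=0.010, n_fft=512):
    """Hypothetical helper: magnitude spectrogram, one row per time frame."""
    win_len = int(round(win_s * sr))     # 25 ms -> 400 samples at 16 kHz
    hop_len = int(round(hop_s * sr))     # 10 ms -> 160 samples at 16 kHz
    window = np.hanning(win_len)         # taper the edges of each frame
    n_frames = 1 + (len(samples) - win_len) // hop_len
    frames = np.stack([
        samples[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # Each frame is zero-padded from win_len up to n_fft before the FFT.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sr = 16000                               # assumed sampling rate
t = np.arange(sr) / sr                   # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)      # 1000 Hz test tone
spec = stft_magnitude(tone, sr)
print(spec.shape)                        # (time frames, n_fft // 2 + 1)
```

At 16 kHz a 25 ms window holds 400 samples, which is why the FFT size is rounded up to the next power of 2, 512. For real work, `librosa.stft` exposes the same knobs as `n_fft`, `hop_length`, and `win_length`.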

Spectrogram

  • Shows the magnitude of each frequency over time

  • More intense bands indicate higher amplitudes

Mel Filterbank

  • Humans have trouble telling high frequencies apart (pitch perception is much finer at low frequencies)

  • Keeps what would be considered important to humans when classifying sounds

  • In the graph below, the bank is "binned" into 10 filters

  • Creates features for ML based on the frequency domain
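One common way to build such a filterbank is with overlapping triangular filters spaced evenly on the mel scale. The sketch below assumes the widely used formula mel = 2595 * log10(1 + f / 700), a 16000 Hz sampling rate, and a 512-point FFT; the function names are hypothetical, and only the count of 10 filters comes from the notes.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=10, n_fft=512, sr=16000):
    """Hypothetical helper: triangular filters, one row per filter."""
    # n_filters + 2 edge points, evenly spaced in mel between 0 Hz and sr / 2.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i : i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: ramps up to 1 at the center, back down to 0, never negative.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, hz_points

bank, hz_points = mel_filterbank()
print(bank.shape)                  # (filters, FFT bins)
print(np.round(hz_points))         # filter edges in Hz, widening with frequency
```

Multiplying a magnitude (or power) spectrogram frame by `bank.T` gives one energy value per filter. The filters get wider toward high frequencies, which mirrors the coarser human resolution there. librosa ships its own version as `librosa.filters.mel`.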

Feature Engineering

  • Apply the discrete cosine transform to the log of the above filterbank energies to get the Mel-Frequency Cepstral Coefficients (MFCCs)

    • This compacts the representation: typically only the lower-order coefficients are kept, which describe the overall shape of the spectrum

    • All the instruments kind of have their own unique "fingerprint" at this point

    • So a machine can differentiate them much more easily at this point
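A minimal sketch of this last step, assuming the mel filterbank energies are already available as an array with one row per time frame. The function names, the choice of 26 mel bands and 13 kept coefficients, and the random stand-in data are illustrative assumptions, not values from these notes.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]     # coefficient index
    i = np.arange(n)[None, :]            # input (mel band) index
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc_from_mel_energies(mel_energies, n_coeffs=13):
    """Hypothetical helper: log of mel energies, then DCT, keep low orders."""
    log_energies = np.log(mel_energies + 1e-10)   # small offset avoids log(0)
    return dct_ii(log_energies, n_coeffs)

# Stand-in data: 98 frames x 26 mel bands of positive "energies".
rng = np.random.default_rng(0)
mel_energies = rng.random((98, 26)) + 0.1
mfccs = mfcc_from_mel_energies(mel_energies)
print(mfccs.shape)                       # (frames, kept coefficients)
```

Each row of `mfccs` is a compact feature vector for one time frame, which is what a classifier would consume. In practice `librosa.feature.mfcc` computes these directly from audio.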
