Background
Refer to this video
Use Librosa
librosa.load()
→ (array of amplitudes, sampling rate)
Sampling rate is also called the sampling frequency (samples/sec)
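A minimal sketch of the call above; "example.wav" is a placeholder path, not from the notes.

```python
import librosa

# "example.wav" is a placeholder path; sr=None keeps the file's native rate
# (by default librosa.load resamples to 22050 Hz)
y, sr = librosa.load("example.wav", sr=None)
print(y.shape, sr)  # (n_samples,), samples/sec
```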
Plotting the Time Domain
Amplitude (loudness) against time
Not very useful by itself, but we can transform it into the frequency domain, which is more useful (the process is the Fourier Transform)
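A sketch of the time-domain plot, assuming y and sr come from librosa.load as above.

```python
import numpy as np
import matplotlib.pyplot as plt
import librosa

y, sr = librosa.load("example.wav", sr=None)  # placeholder path, as above

t = np.arange(len(y)) / sr  # sample index -> seconds
plt.plot(t, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```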
Transformation from Time Domain → Frequency Domain
Takes continuous signal as input
Decomposes signal into constituent frequencies
Returns 2 components:
Frequency
Magnitude of each frequency
Frequency-domain data becomes irrelevant beyond a certain point (the high-frequency end has much smaller amplitude)
Microphones do not pick up on these frequencies
When downsampling, we can omit these irrelevant data points in the frequency domain
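A sketch of downsampling with librosa.resample; the 8 kHz target rate is an arbitrary example, not from the notes.

```python
import librosa

y, sr = librosa.load("example.wav", sr=None)  # placeholder path

# Resampling to 8 kHz discards everything above 4 kHz (the new Nyquist limit),
# i.e. the frequency-domain points that carry little useful amplitude
y_ds = librosa.resample(y, orig_sr=sr, target_sr=8000)
```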
Takes discrete signal as input
Uses Discrete Fourier Transform (DFT)
Does the same thing the FT does, but with discrete input
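A sketch of the DFT via NumPy's FFT; the 440 Hz sine is a made-up test signal.

```python
import numpy as np

sr = 16000                                 # assumed sampling rate
t = np.arange(sr) / sr                     # 1 second of sample times
y = np.sin(2 * np.pi * 440 * t)            # made-up 440 Hz sine

mags = np.abs(np.fft.rfft(y))              # magnitude of each frequency
freqs = np.fft.rfftfreq(len(y), d=1 / sr)  # the frequency of each bin
print(freqs[np.argmax(mags)])              # -> 440.0
```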
Take a small slice of the audio in time (a window)
Conventional window length: 0.025 seconds
Conventional step size: 0.010 seconds (10 ms)
n_fft: 512 samples (a power of 2)
Imagine the window being slid along the time-domain signal; taking the FFT of each windowed section produces a spectrogram (sketched below)
Shows magnitudes in frequency domain
More intense bands indicate higher amplitudes
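A sketch of the windowed FFT (STFT) and resulting spectrogram, using the conventional parameters above; the 16 kHz rate and the file path are assumptions.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("example.wav", sr=16000)  # placeholder path, assumed 16 kHz

n_fft = 512                    # FFT size, a power of 2
win_length = int(0.025 * sr)   # 25 ms window -> 400 samples
hop_length = int(0.010 * sr)   # 10 ms step   -> 160 samples

# Each column of S is the FFT magnitude of one window
S = np.abs(librosa.stft(y, n_fft=n_fft, win_length=win_length, hop_length=hop_length))

# dB scale: the more intense bands are the higher amplitudes
S_db = librosa.amplitude_to_db(S, ref=np.max)
librosa.display.specshow(S_db, sr=sr, hop_length=hop_length, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()
```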
Humans distinguish nearby high frequencies poorly (pitch perception is roughly logarithmic)
A mel filter bank keeps what humans would consider important when classifying sounds
In the graph below, the filter bank is binned into 10 filters
Creates features for ML based on the frequency domain
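A sketch of the 10-filter mel filter bank energies described above; the file path, 16 kHz rate, and STFT parameters are assumptions carried over from the earlier sketches.

```python
import librosa

y, sr = librosa.load("example.wav", sr=16000)  # placeholder path

# Mel-scaled power spectrogram: the filter bank binned into 10 filters
mel_S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=10)
log_mel_S = librosa.power_to_db(mel_S)  # log energies, closer to perceived loudness
print(log_mel_S.shape)                  # (10, n_frames)
```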
Do a discrete cosine transform (DCT) on the above filter bank and we get Mel-Frequency Cepstral Coefficients (MFCCs)
This compacts the representation, keeping only the lower-order coefficients (the smooth spectral envelope) and discarding the rest
Each instrument has its own unique "fingerprint" at this point
So a machine can easily differentiate between them
The filter bank energies are highly correlated; the DCT decorrelates them
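A sketch of computing MFCCs with librosa, which applies the DCT to the log mel filter bank internally; keeping 13 coefficients is a common convention, not from the notes.

```python
import librosa

y, sr = librosa.load("example.wav", sr=16000)  # placeholder path

# DCT of the log mel filter bank -> compact, decorrelated coefficients
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)
print(mfccs.shape)  # (13, n_frames)
```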