Data Science 2: Advanced Topics in Data Science

Final Project: Sentiment Analysis from Audio Data¶

Harvard University
Spring 2024
Team 26: Vincent Hock, Conrad Hock, Cooper Bosch, Tomas Arevalo, Jake Pappo


Notebook Contents¶

Instructions for running notebook / organization of code ....

  • Setup

  • Problem Statement

  • Data Description & Preprocessing

  • EDA

  • Baseline Model

  • Final Models

    • FFNN / CNN
    • LSTM
    • Transformer
    • SOTA
  • Discussion

Setup¶

Load libraries, install dependencies, and configure formatting

In [1]:
# HTML Formatting
import requests
from IPython.core.display import HTML
styles = requests.get(
    "https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/"
    "content/styles/cs109.css"
).text
HTML(styles);

One library our code relies on heavily is the librosa package. It is designed specifically for audio and music analysis, and it provides the functionality to convert our signal data into Mel spectrograms via fast Fourier transforms.

We also use the transformers module from Hugging Face, which is key to loading pretrained models and fine-tuning them. The resulting Wav2Vec2 model (described below) is primarily built in PyTorch, so we also installed torchinfo as a tool to summarize model features.
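As a quick illustration of how these pieces fit together, the sketch below loads a pretrained Wav2Vec2 checkpoint with an audio classification head and prints a torchinfo summary. The checkpoint name (facebook/wav2vec2-base) and the 8-label head are assumptions mirroring the setup used in the SOTA section; this is a sketch, not the exact code used later.

# Sketch: load a pretrained Wav2Vec2 backbone with an 8-way classification
# head and summarize it with torchinfo (checkpoint name is an assumption).
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
from torchinfo import summary

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec2 = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=8
)
summary(wav2vec2, depth=2)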

In [2]:
# Install dependencies
# !pip install librosa 
# !pip install evaluate 
# !pip install transformers
# !pip install torchinfo
In [3]:
# Import libraries
import re
import time
import os
import wave
import scipy
import librosa
import evaluate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import csv

from tensorflow import keras
from keras.models import Model, Sequential, load_model
from keras import layers
from keras import losses
from keras import optimizers
from keras.callbacks import EarlyStopping, LambdaCallback, ModelCheckpoint
from tensorflow.keras import layers
from keras.layers import Input, Embedding, SimpleRNN, GRU, LSTM, TimeDistributed, Bidirectional, Dense
from keras.layers import  BatchNormalization, Activation, Dropout, GaussianNoise, LayerNormalization
from keras.layers import Conv2D, MaxPooling2D, Flatten, Layer
from keras.regularizers import L1
from keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from IPython.display import Audio

from datasets import Dataset
from datasets import Audio as AudioCast
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer
from torchinfo import summary
import seaborn as sns
In [4]:
path = os.getcwd() + '/Audio_Speech_Actors_01-24'
path
Out[4]:
'/home/u_388354/project/Audio_Speech_Actors_01-24'

Problem Statement¶

The aim of this project is to develop a robust emotion classification system capable of accurately identifying the emotional state of a speaker based on an audio clip of their speech. Emotion classification from audio presents a considerable challenge due to the inherent variability in speech patterns, pitch and tone, and the subjective nature of emotions. But the benefits of such a model are apparent: its deployment in voice assistants or sentiment analysis tools would enable timely and accurate emotion recognition from audio inputs, fostering advancements in human-computer interaction and affective computing.

We will explore a number of different model architectures to accurately classify each audio clip into 1 of 8 different emotional tones, and weigh the advantages and drawbacks of each approach to better understand the difficulties and opportunities associated with audio sentiment analysis.

Data Preprocessing¶

Data Description¶

Our data come from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This database contains audio recordings from 24 professional voice actors, 12 female and 12 male, each of whom, as stated by RAVDESS, has a "neutral North American accent". Every actor says two phrases: "Dogs are sitting by the door" and "Kids are talking by the door". Each phrase is spoken in 8 different emotional tones ("neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised") and at two emotional intensities ("normal", "strong"). It is important to note that there is no strong intensity for the "neutral" emotion. Finally, each actor records every phrase-emotion-intensity combination twice, so every actor has 60 speech recordings in total. RAVDESS also provides song audio and video, but we only use the speech audio data.

There are 1440 audio samples, with an average of 177632 data points per sample. We don't have many samples, but each sample carries a lot of data, so we will need to perform per-sample dimensionality reduction and use data augmentation to generate more samples.

Summary of Data¶

After the preprocessing outlined in Milestone 2, there are 1440 samples, each with a stream of 253125 integer values. These samples are each pre-padded with zeros to standardize their length. We chose 253125 because it was the smallest number larger than the maximum sample length that is divisible by what we deemed a sufficient set of factors: 1, 3, 5, 9, 15, 25, 27, … (253125 = 3^4 * 5^5, matching MAX_LEN below). This will be useful for reducing the dimensionality of the data by averaging consecutive points or selecting every nth value, as well as in our potential exploration of Fourier transforms for feature selection.
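Because 253125 has many small factors, block-averaging for dimensionality reduction becomes a one-line reshape. A minimal sketch (the array name here is illustrative, not from the notebook):

# Sketch: downsample a zero-padded signal of length 253125 by averaging
# consecutive blocks of 5 samples (253125 is divisible by 5).
import numpy as np

signal = np.zeros(253125, dtype=np.int16)              # illustrative padded sample
downsampled = signal.reshape(-1, 5).mean(axis=1)       # shape: (50625,)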

For ease of use, we have constructed a Pandas DataFrame containing, for each audio sample, the corresponding Emotion, Intensity, Statement, Repetition, Actor, Gender of Actor, Frame Rate, and Number of Frames before padding, as well as the np.array containing the data stream itself.

Data Preprocessing¶

In [5]:
def read_wav_file(file_path):
    with wave.open(file_path, 'rb') as wav_file:
        num_channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()
        frame_rate = wav_file.getframerate()
        num_frames = wav_file.getnframes()

        # Read the raw audio data
        raw_data = wav_file.readframes(num_frames)

    # Convert the raw audio data to a numpy array
    if sample_width == 2:
        data_type = np.int16
    elif sample_width == 4:
        data_type = np.int32
    else:
        raise ValueError("Unsupported sample width")

    audio_data = np.frombuffer(raw_data, dtype=data_type)

    # Reshape the numpy array if there are multiple channels
    if num_channels > 1:
        audio_data = audio_data.reshape(-1, num_channels)

    return audio_data, frame_rate
In [6]:
wav_paths = []

# os.walk gives files recursively
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        # ignore .DS_Store
        if file.endswith('.wav'):
            wav_path = os.path.join(root, file)
            wav_paths.append(wav_path)
In [7]:
# Pick padding length that enables prime factorization
MAX_LEN = 3**4 * 5**5
In [8]:
# Helper function for info, can be made categorical instead of numeric
id2label = {
                0: 'neutral', 
                1: 'calm', 
                2: 'happy',
                3: 'sad', 
                4: 'angry',
                5: 'fearful',
                6: 'disgust',
                7: 'surprised'
              }

label2id = {v: k for k, v in id2label.items()}

def info_dict(path):
    dict = {}
    emotion_number = int(path[-18:-16]) - 1
    dict['Emotion_Number'] = emotion_number
    dict['Emotion'] = id2label[emotion_number]
    dict['Intensity'] = int(path[-15:-13])
    statement_number = int(path[-12:-10])
    dict['Statement_Number'] = statement_number
    dict['Statement'] = ['OOB', 'Kids', 'Dogs'][statement_number]
    dict['Repetition'] = int(path[-9:-7])
    dict['Actor'] = int(path[-6:-4])
    # Odd actor numbers are male (1), even are female (0)
    gn = int(int(path[-6:-4]) % 2 == 1)
    dict['Gender_Number'] = gn
    dict['Gender'] = ['Female', 'Male'][gn]
    return dict
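The slicing above relies on the RAVDESS filename convention of seven two-digit, dash-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The example path below is illustrative rather than a file from our directory:

# Example of how info_dict reads a RAVDESS-style filename.
# '03-01-06-01-02-01-12.wav' decodes (per the slices above) to:
# emotion '06' -> index 5 -> 'fearful', intensity '01', statement '02' -> 'Dogs',
# repetition '01', actor '12' -> even -> 'Female'.
example_path = '/some/dir/Actor_12/03-01-06-01-02-01-12.wav'
print(info_dict(example_path))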
In [9]:
data_list = []

for path in wav_paths:
    # helper function above
    dict = info_dict(path)
    data, fr = read_wav_file(path)
    dict['Frame_Rate'] = fr
    
    # Length without padding
    dict['Num_Frames'] = len(data)

    # Check for 5 cases where data is doubled
    if len(data.shape) != 1:
        data = data.T[0]

    # Do padding
    new_data = np.pad(data, (MAX_LEN - len(data), 0), 'constant')

    dict['Data'] = new_data
    data_list.append(dict)
    
df = pd.DataFrame(data_list)
In [10]:
df.head(3)
Out[10]:
Emotion_Number Emotion Intensity Statement_Number Statement Repetition Actor Gender_Number Gender Frame_Rate Num_Frames Data
0 5 fearful 1 2 Dogs 1 17 1 Male 48000 169770 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 3 sad 2 2 Dogs 2 17 1 Male 48000 171371 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 4 angry 2 1 Kids 2 17 1 Male 48000 179379 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1440 entries, 0 to 1439
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Emotion_Number    1440 non-null   int64 
 1   Emotion           1440 non-null   object
 2   Intensity         1440 non-null   int64 
 3   Statement_Number  1440 non-null   int64 
 4   Statement         1440 non-null   object
 5   Repetition        1440 non-null   int64 
 6   Actor             1440 non-null   int64 
 7   Gender_Number     1440 non-null   int64 
 8   Gender            1440 non-null   object
 9   Frame_Rate        1440 non-null   int64 
 10  Num_Frames        1440 non-null   int64 
 11  Data              1440 non-null   object
dtypes: int64(8), object(4)
memory usage: 135.1+ KB

EDA¶

Visualize Audio Sample¶

In [12]:
# Example usage
test_path = wav_paths[100]

file_path = test_path
audio_data, frame_rate = read_wav_file(file_path)
print("Audio data shape:", audio_data.shape)
print("Frame rate:", frame_rate)
audio_data
Audio data shape: (166566,)
Frame rate: 48000
Out[12]:
array([ 1,  2,  2, ..., -5, -4, -5], dtype=int16)
In [13]:
Audio(audio_data, rate=frame_rate)
Out[13]:
[audio playback widget]
In [14]:
ms = range(audio_data.shape[0])
fig, ax = plt.subplots(figsize=(100, 20)) 
ax.plot(ms, audio_data, c='darkorange')
ax.axis('off')
plt.show()
[Figure: waveform of the sample audio clip]

Length Variation¶

In [15]:
histos=df['Num_Frames'].hist(by=df['Emotion'], sharex=True, figsize=(20,10), layout=(2,4), density=True)

histos[0,0].set_ylabel('Density')
histos[1,0].set_ylabel('Density')

histos[1,0].set_xlabel('Number of Frames')
histos[1,1].set_xlabel('Number of Frames')
histos[1,2].set_xlabel('Number of Frames')
histos[1,3].set_xlabel('Number of Frames')

plt.suptitle('Distribution of Audio Length by Emotion');
[Figure: Distribution of Audio Length by Emotion]

The distribution of audio lengths appears to be largely similar in shape among various emotions, with disgust showing more variation than the others.

In [16]:
mean_lens=df[['Emotion_Number', 'Num_Frames']].groupby(by=['Emotion_Number']).mean()['Num_Frames']
sd_lens=df[['Emotion_Number', 'Num_Frames']].groupby(by=['Emotion_Number']).std()['Num_Frames']
xs=list(id2label.values())
In [17]:
fig,ax=plt.subplots(1,1,figsize=(9,3))
plt.bar(height=mean_lens, x=xs,  yerr=sd_lens, color='lightblue')
plt.ylim(125000, 210000)
plt.xlabel('Emotion')
plt.ylabel('Length (# of Frames)')
plt.title('Average Audio Length by Emotion +/- Standard Deviation');
[Figure: Average Audio Length by Emotion +/- Standard Deviation]

Different emotions exhibit somewhat different average recording lengths. Nevertheless, the variation in recording length within each emotion appears to be greater than the variation among the emotions' average lengths.

Amplitude Variation¶

In [18]:
emotions = range(0, 8)

# Create subplots
fig, ax = plt.subplots(figsize=(20, 10))

# Position for each box plot
positions = np.arange(1, len(emotions) + 1)

# Iterate over emotions
box_data = []
for i, emotion in enumerate(emotions):
    # Filter the DataFrame
    filtered_df = df[(df['Emotion_Number'] == emotion)]
    
    # Combine the data into a single array
    vals = np.concatenate(filtered_df['Data'].values)
    for j in range(-20, 21):
        vals = vals[vals != j]
    
    box_data.append(vals)

# Create the box plot
boxplot = ax.boxplot(box_data, positions=positions, vert=True, patch_artist=True, showfliers=False)

# Add labels and grid
ax.set_ylabel('Amplitude')
ax.set_xlabel('Emotion')
ax.set_xticklabels([id2label[emotion] for emotion in emotions])
ax.set_title("Emotion Ampltitude Distributions")

# Customize colors
for patch in boxplot['boxes']:
    patch.set_facecolor('lightblue')

for median in boxplot['medians']:
    median.set(color='black')
    
plt.show()
[Figure: Emotion Amplitude Distributions (boxplots)]

In the visual above, each boxplot represents a different emotion. It was generated by concatenating the 1D amplitude arrays of every recording for a given emotion and visualizing the combined values as a boxplot. Immediately we can see that the boxes are much narrower for emotions such as neutral and calm than for angry and fearful. This makes sense, as we would expect more heightened emotions to have higher amplitudes because they are more expressive. Another thing to notice is that the distributions are all centered around 0, which follows from the nature of sound waves. Lastly, while not obvious in this plot, it is important to mention how many values at or near 0 there were due to frames with no sound; because of this, the distribution of amplitude values appears more concentrated around 0 than it perhaps should be. The histograms below therefore exclude values from -20 to 20 (the same filtering applied in the boxplot code above) to remove these near-silent frames.

In [19]:
# Create subplots
fig, axs = plt.subplots(4, 2, figsize=(10, 10))
fig.suptitle('Histograms of Audio Data for Different Emotions and Intensities')

# Iterate over emotions
for i, emotion in enumerate(emotions):
    # Calculate subplot position
    row = i // 2
    col = i % 2

    # Filter the DataFrame
    filtered_df = df[(df['Emotion_Number'] == emotion)]
    
    # Combine the data into a single array
    vals = np.concatenate(filtered_df['Data'].values)
    for j in range(-20, 21):
        vals = vals[vals != j]
    
    # Plot the histogram
    axs[row, col].hist(vals, bins=100, color='skyblue', edgecolor='black')
    axs[row, col].set_title(id2label[emotion])
    axs[row, col].set_xlabel('Amplitude')
    axs[row, col].set_ylabel('Count')

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: Histograms of Audio Data for Different Emotions and Intensities]

Standard Deviation of $\Delta$amplitude¶

In [20]:
data_array = np.array(list(df['Data']))
data_averages = np.mean(data_array.reshape(1440, int(data_array.shape[1]/5), 5), axis=-1)
# Make an array of the deltas between time steps
data_differences = data_averages[:,1:] - data_averages[:,:-1]
# Store the standard deviations of this differences
stds = np.std(data_differences, axis = 1)
df['Change_Deviation'] = list(stds)
In [21]:
# Can see the more passionate emotions have higher variance in change
df[['Change_Deviation','Emotion']].groupby(['Emotion']).mean()
Out[21]:
Emotion      Change_Deviation
angry              899.235758
calm                50.527442
disgust            167.523488
fearful            446.503306
happy              339.889837
neutral             68.235074
sad                118.996169
surprised          198.101180

This table displays, for each emotion, the average variability of the amplitude changes. More specifically, for each recording we computed the standard deviation of the frame-to-frame changes in the (block-averaged) amplitude, and the table reports the mean of these standard deviations across all recordings of each emotion (sketched as a bar plot below). This gives us a sense of how much the voice or volume fluctuates within recordings of each emotion, and therefore of its potential as a predictor. The values make intuitive sense, with 'calm' and 'neutral' having the lowest variability and 'angry' and 'fearful' the highest; in other words, the amplitudes of less emotive voices tend to stay within a smaller range, whereas those of angry and scared voices swing much more widely. Because this standard deviation helps distinguish between emotions, it can be utilized in a model. In the simple case of logistic regression, it could be used directly as a predictor variable; a neural network should be able to learn more complex relationships, which would include characteristics like the variability of amplitudes and functions thereof.
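The table above can also be rendered as the bar plot the analysis refers to; a minimal sketch using the df and matplotlib objects already defined (the variable name mean_change is illustrative):

# Sketch: bar plot of the mean Change_Deviation per emotion.
mean_change = df.groupby('Emotion')['Change_Deviation'].mean()
fig, ax = plt.subplots(figsize=(9, 3))
ax.bar(mean_change.index, mean_change.values, color='lightblue')
ax.set_xlabel('Emotion')
ax.set_ylabel('Std. dev. of amplitude change')
ax.set_title('Average Change_Deviation by Emotion')
plt.show()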

Summary of Findings¶

Overall, it appears that emotions like happy, fearful, and angry display much greater variation in signal amplitude than emotions like neutral and calm. Emotions also display modest differences in the lengths of their audio signals: on average, disgust and anger, for example, appear to be more drawn out than neutral, fearful, and surprised.

Baseline Model¶

Baseline Evaluation¶

Data Collection Pipeline & Tools¶

In [22]:
length = len(df)
frame_rate = 48000
def get_sample():
    is_neutral = True
    # Never plays neutral audio
    while(is_neutral):
        rand_index = np.random.randint(0,length)
        audio_data = df.iloc[rand_index]['Data']
        true_val = df.iloc[rand_index]['Emotion']
        is_neutral = (true_val == 'neutral')
    return (Audio(audio_data, rate = frame_rate, autoplay = True), rand_index, true_val)

# Will delete the csv, do not run
def reset_csv(file_name):
    with open(file_name, 'w', newline='') as csvfile:
        fieldnames = ['index', 'true', 'pred']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

def add_to_csv(file_name, id, true, guess):
    with open(file_name, 'a', newline='') as csvfile:
        fieldnames = ['index', 'true', 'pred']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writerow({'index' : id, 'true' : true, 'pred' : guess})
In [23]:
audio, id, true_val = get_sample()
display(audio)
[audio playback widget]
In [24]:
for i in range(1,9):
    print(i, id2label[i-1])
1 neutral
2 calm
3 happy
4 sad
5 angry
6 fearful
7 disgust
8 surprised
In [25]:
guess = 4
add_to_csv('baseline.csv', id, true_val, guess)

Data Collection Results¶

In [26]:
baseline_df = pd.read_csv('baseline.csv')
baseline_df['pred'] = baseline_df['pred'] - 1
baseline_df['pred_emotion'] = baseline_df['pred'].apply(lambda x: id2label[x])
baseline_df['true_num'] = baseline_df['true'].apply(lambda x: label2id[x])

y_preds = list(baseline_df['pred'])

y_test = list(baseline_df['true_num'])

emotion_names = list(id2label.values())
ConfusionMatrixDisplay(confusion_matrix(y_test, y_preds), display_labels = emotion_names[1:]).plot(cmap='Blues', xticks_rotation = 40)
plt.title('Human Benchmark')
plt.show()

print(f'\nAccuracy of Human Model: {np.round(accuracy_score(y_preds,y_test), 5)}')
[Figure: Human Benchmark confusion matrix]
Accuracy of Human Model: 0.7541

Baseline Logistic Regression¶

In [27]:
# Baseline model uses change deviation and the length of the clip in logistic regression
import warnings
warnings.filterwarnings("ignore")

X = df[['Change_Deviation', 'Num_Frames']]
y = df['Emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1777)
baseline_logreg = LogisticRegression(random_state=10, max_iter=10000, multi_class='ovr').fit(X_train, y_train)
y_preds = baseline_logreg.predict(X_test)

ConfusionMatrixDisplay(confusion_matrix(y_test, y_preds), display_labels = emotion_names).plot(cmap='Blues', xticks_rotation = 40)
plt.title('Logistic Regression Confusion Matrix')
plt.show()

print(f'\nAccuracy of Baseline Model: {np.round(accuracy_score(y_preds,y_test), 5)}')
[Figure: Logistic Regression Confusion Matrix]
Accuracy of Baseline Model: 0.37847

Interpretation & Analysis¶

To establish a baseline for accuracy, we collected data from humans to estimate a human benchmark for this problem, and we implemented a simple baseline logistic regression model on our data. For the human benchmark, we collected 50 data points across 5 people: each person was played a randomly selected audio clip and asked to classify its emotion. On average, our testers scored 75% accuracy. One important thing to note is that we excluded neutral clips from this benchmark. Many people found the neutral emotion confusing, and its role in the dataset is essentially to serve as a zero-emotion reference. This means our human benchmark is biased upward relative to our models, which are also trained to classify the neutral audio clips.

Our logistic regression model takes as input the standard deviation of amplitude changes within a given audio signal as well as the number of frames in the recording. This model achieves roughly 38% test accuracy, mostly by differentiating between calm and neutral (low variance) and happy and angry (high variance) emotions. This is significantly better than the majority-class model, which would achieve an accuracy of 13.3%. We used an 80/20 train-test split, and the confusion matrix above shows the per-class prediction counts. Looking at the coefficients confirms our beliefs about the data: calm and neutral are negatively associated with the variability feature, while the opposite is true for the happy and surprised emotions.
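A minimal sketch of the coefficient check referenced above, reusing the fitted baseline_logreg (with multi_class='ovr', scikit-learn exposes one row of coefficients per emotion class):

# Inspect per-class coefficients of the baseline logistic regression
coef_table = pd.DataFrame(
    baseline_logreg.coef_,
    index=baseline_logreg.classes_,                    # emotion labels
    columns=['Change_Deviation', 'Num_Frames']
)
print(coef_table)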

Final Models¶

Modeling Preparation¶

Mel Spectrogram¶

We make heavy use of the Mel spectrogram transformation to preprocess our data before feeding it to the more complex models. Since the way humans process audio is more closely tied to variations in pitch, we need some way to turn our amplitude data into frequency information. The Fourier transform does this for us, and running an FFT over many small time windows turns our amplitude data into a representation of how the speaker's pitch content varies over the course of the recording.

Finally, the Mel transformation maps these frequencies onto a roughly logarithmic scale that gives better differentiation of pitch when specifically analyzing human speech.
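For reference, a commonly used (HTK-style) form of the Mel mapping from a frequency $f$ in Hz is $m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$, so equal steps in $m$ correspond to progressively wider frequency bands at higher pitches. (librosa's default mel filter bank uses a slightly different Slaney-style variant, but the idea is the same.)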

In [28]:
# Hyperparameters for Mel spectrogram
sr = 48000
n_fft = 2048
hop_length = 512

def get_melspectrogram(audio, n_mels=128):
    # First make sure audio data is casted to float
    audio_as_float = audio.astype(np.float32)
    mel = librosa.feature.melspectrogram(y = audio_as_float, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return mel

df['mel128'] = df['Data'].apply(lambda x: get_melspectrogram(x, n_mels=128))
df['mel256'] = df['Data'].apply(lambda x: get_melspectrogram(x, n_mels=256))
In [29]:
# Mel-SPECTROGRAM CONVERTED DATASET
mel_128 = np.array(list(df['mel128']))
mel_256 = np.array(list(df['mel256']))
In [30]:
# Visualize spectrogram (for 128 Mel features)
fig, ax = plt.subplots()
S_dB = librosa.power_to_db(mel_128[3], ref=np.max)
img = librosa.display.specshow(S_dB, x_axis='time',
                         y_axis='mel', sr=sr,
                         fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram');
[Figure: Mel-frequency spectrogram]
In [31]:
# Prepare dataset
X = mel_128
y = df['Emotion_Number'].values

Helper Functions¶

In [32]:
def evaluate_predictions(model, model_name, X_test=X_test, y_test=y_test):
    # Get predictions
    y_pred = model.predict(X_test)
    y_pred_classes = np.argmax(y_pred, axis=1)

    # Create Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred_classes)
    emotion_labels = [id2label[i] for i in range(len(emotions))]  
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=emotion_labels)

    # Graph Confusion Matrix
    disp.plot(cmap='Blues', xticks_rotation = 40)
    plt.title(f'{model_name} Confusion Matrix')
    plt.show()
In [33]:
def plot_history(history, model_name):
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    best_val_acc_loc = np.argmax(history.history['val_accuracy'])
    best_val_accuracy = max(history.history['val_accuracy'])
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='validation')
    plt.axvline(best_val_acc_loc, linestyle='--', c='k', label=("best val acc: {:.4f}".format(best_val_accuracy)))
    plt.title(f'{model_name} accuracy')
    plt.xticks(range(0, len(history.history['accuracy']), 2))
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(loc='upper left')
    
    plt.subplot(1, 2, 2)
    best_val_loss_loc = np.argmin(history.history['val_loss'])
    best_val_loss = min(history.history['val_loss'])
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'],  label='validation')
    plt.axvline(best_val_loss_loc, linestyle='--', c='k', label=("best val loss: {:.4f}".format(best_val_loss)))
    plt.title(f'{model_name} loss')
    plt.xticks(range(0, len(history.history['accuracy']), 2))
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(loc='upper left')
    
    plt.tight_layout()
    plt.show()

FFNN¶

Let's start with a very simple approach: a feed-forward neural network (FFNN). Given the dimensionality reduction that the Mel spectrogram offers, we now have a somewhat more manageable (but still enormous) input size for densely connected layers. We'll set another baseline (this time, a neural-net baseline) by simply brute-force passing the Mel spectrograms into an FFNN.

Data Preparation¶

In [34]:
# Prepare dataset
X_ffnn = mel_128
y = df['Emotion_Number'].values

X_train, X_test, y_train, y_test = train_test_split(X_ffnn, y, test_size=0.2, random_state=109, stratify=y)

Build & Compile Model¶

In [35]:
input_shape = (X_train.shape[1], X_train.shape[2], 1)

n_filters = 10

kernel_regularizer = L1(l1=0.015)
bias_regularizer = L1(l1=0.015)
dropout_rate = 0.5
In [36]:
inputs = Input(shape=input_shape)

# Flatten the spectrogram input before the fully connected layers
x = Flatten()(inputs)

# Dense layers
x = Dense(500, activation='relu', kernel_regularizer=kernel_regularizer, bias_regularizer=bias_regularizer)(x)
x = Dropout(dropout_rate)(x)
x = Dense(500, activation='relu', kernel_regularizer=kernel_regularizer, bias_regularizer=bias_regularizer)(x)
x = Dropout(dropout_rate)(x)

# Output layer
outputs = Dense(len(emotions), activation='softmax')(x)

ffnn = Model(inputs=inputs, outputs=outputs, name='ffnn')
In [37]:
ffnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

ffnn.summary()
Model: "ffnn"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 128, 495, 1)]     0         
                                                                 
 flatten (Flatten)           (None, 63360)             0         
                                                                 
 dense (Dense)               (None, 500)               31680500  
                                                                 
 dropout (Dropout)           (None, 500)               0         
                                                                 
 dense_1 (Dense)             (None, 500)               250500    
                                                                 
 dropout_1 (Dropout)         (None, 500)               0         
                                                                 
 dense_2 (Dense)             (None, 8)                 4008      
                                                                 
=================================================================
Total params: 31935008 (121.82 MB)
Trainable params: 31935008 (121.82 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


We are using a very simple FFNN architecture: we flatten the input, pass it through two fully-connected Dense layers with dropout, and then feed it to a final Dense output layer with a softmax activation.

Train Model¶

In [38]:
early_stopping = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)

ffnn_history = ffnn.fit(X_train, 
                      y_train, 
                      validation_data=(X_test, y_test), 
                      epochs=100, 
                      batch_size=16,
                      callbacks=[early_stopping])
Epoch 1/100
72/72 [==============================] - 5s 24ms/step - loss: 14110775296.0000 - accuracy: 0.1606 - val_loss: 4197664000.0000 - val_accuracy: 0.2292
Epoch 2/100
72/72 [==============================] - 1s 18ms/step - loss: 9057529856.0000 - accuracy: 0.2908 - val_loss: 5306324992.0000 - val_accuracy: 0.2257
Epoch 3/100
72/72 [==============================] - 1s 18ms/step - loss: 5125284352.0000 - accuracy: 0.3203 - val_loss: 3121011456.0000 - val_accuracy: 0.3125
Epoch 4/100
72/72 [==============================] - 1s 19ms/step - loss: 5616505856.0000 - accuracy: 0.3299 - val_loss: 4104053248.0000 - val_accuracy: 0.3299
Epoch 5/100
72/72 [==============================] - 1s 18ms/step - loss: 3889230848.0000 - accuracy: 0.3863 - val_loss: 3887718400.0000 - val_accuracy: 0.3090
Epoch 6/100
72/72 [==============================] - 1s 18ms/step - loss: 2214633984.0000 - accuracy: 0.4123 - val_loss: 5547031552.0000 - val_accuracy: 0.2778
Epoch 7/100
72/72 [==============================] - 1s 18ms/step - loss: 1584530048.0000 - accuracy: 0.4358 - val_loss: 5191700992.0000 - val_accuracy: 0.3438
Epoch 8/100
72/72 [==============================] - 1s 17ms/step - loss: 1952501504.0000 - accuracy: 0.4523 - val_loss: 3553359360.0000 - val_accuracy: 0.3125
Epoch 9/100
72/72 [==============================] - 1s 17ms/step - loss: 1480818176.0000 - accuracy: 0.4939 - val_loss: 5062780928.0000 - val_accuracy: 0.2604
Epoch 10/100
72/72 [==============================] - 1s 18ms/step - loss: 953954560.0000 - accuracy: 0.4835 - val_loss: 3849034752.0000 - val_accuracy: 0.2917
Epoch 11/100
72/72 [==============================] - 1s 17ms/step - loss: 1857220224.0000 - accuracy: 0.4887 - val_loss: 4090415360.0000 - val_accuracy: 0.2917
Epoch 12/100
72/72 [==============================] - 1s 17ms/step - loss: 2037850112.0000 - accuracy: 0.4974 - val_loss: 3606037760.0000 - val_accuracy: 0.2812
Epoch 13/100
72/72 [==============================] - 1s 18ms/step - loss: 1596498944.0000 - accuracy: 0.4818 - val_loss: 4720262656.0000 - val_accuracy: 0.3056
Epoch 14/100
72/72 [==============================] - 1s 18ms/step - loss: 1335536384.0000 - accuracy: 0.5165 - val_loss: 5685683712.0000 - val_accuracy: 0.3056
Epoch 15/100
72/72 [==============================] - 1s 17ms/step - loss: 2407896832.0000 - accuracy: 0.5217 - val_loss: 4185713408.0000 - val_accuracy: 0.2778
Epoch 16/100
72/72 [==============================] - 1s 17ms/step - loss: 783191936.0000 - accuracy: 0.5399 - val_loss: 4871812096.0000 - val_accuracy: 0.2986
Epoch 17/100
72/72 [==============================] - 1s 19ms/step - loss: 1004823808.0000 - accuracy: 0.5703 - val_loss: 7729543168.0000 - val_accuracy: 0.3021
In [39]:
plot_history(ffnn_history, 'FFNN')
[Figure: FFNN training history (accuracy and loss)]

Clearly, we're overfitting immensely with an FFNN that contains roughly 32 million parameters. The divergence between train and validation starts after only a few epochs and continues until the train accuracy begins to plateau around 60%, while the validation accuracy hovers around 30-40% throughout training. One concerning aspect is the loss: because the model makes such confident predictions, the reported loss (which includes the L1 regularization penalty) is enormous, on the order of billions. We believe this reflects a problem with how the model is set up; in any case, it's obvious that an FFNN is not the way to go. Still, even this worst-performing network performs about as well as the logistic regression, so there's promise of better results with better architectures.
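One plausible contributor to the runaway loss is that we feed raw Mel power values, which span several orders of magnitude, straight into the network. As a hedged sketch of a possible remedy (an assumption we did not verify, with illustrative variable names), the spectrograms could be converted to decibels with librosa before splitting:

# Possible remedy (sketch): log-scale the Mel power spectrograms so the
# inputs, and hence the loss, stay in a much smaller numeric range.
mel_128_db = np.array([librosa.power_to_db(m, ref=np.max) for m in df['mel128']])
X_train_db, X_test_db, y_train_db, y_test_db = train_test_split(
    mel_128_db, y, test_size=0.2, random_state=109, stratify=y
)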

Evaluation & Analysis¶

In [40]:
evaluate_predictions(ffnn, 'FFNN', X_test=X_test, y_test=y_test)
9/9 [==============================] - 0s 3ms/step
[Figure: FFNN confusion matrix]

Our FFNN confusion matrix looks very similar to the one for logistic regression. There are many false predictions, but the performance on some emotions (e.g., calm, angry, and fearful) is quite decent, while on others (e.g., sad, disgust, and neutral) it makes almost no correct predictions. Let's see if we can remedy the issues of overfitting, misprediction, huge parameter counts, and high loss with some other, more complex models.

LSTM¶

As we covered in lecture, LSTMs are a type of RNN that can capture long-term dependencies in the input data, making them particularly useful for tasks involving sequential data. Thus, we believe that an LSTM architecture could be well suited to sentiment analysis on our dataset, given its ability to process the sequential structure of our audio samples.

The configuration of our original baseline LSTM model included multiple LSTM layers with varying numbers of units (512, 512, 256, and 256), followed by dense layers with decreasing numbers of units (128, 64, and 8). The intuition behind using multiple LSTM layers was to allow the model to learn increasingly complex representations of the input data, enabling it to capture intricate patterns and dependencies in the audio samples. As we progressed through the model, we decreased the number of units in the dense layers to progressively reduce the dimensionality of the data while hoping to keep the most important features for the sentiment analysis task. The dropout layers were included to mitigate overfitting, and the output layer with 8 units and a softmax activation was designed to classify each input audio sample into one of the 8 emotion categories present in the RAVDESS dataset.

Again, this describes our baseline LSTM, which we don't expect to work that well; during training we will fine-tune specific parameters and adjust the architecture, and the model actually built below reflects those adjustments (described in the analysis section). Nonetheless, we do hope that LSTMs will perform well, because our data, audio samples, are as sequential as it gets.

Data Preparation¶

In [41]:
X_lstm = mel_128
y = df['Emotion_Number'].values

X_train, X_test, y_train, y_test = train_test_split(X_lstm, y, test_size=0.2, random_state=109)
In [42]:
# adjust shapes so can pass in sequentially
X_train = np.transpose(X_train, (0, 2, 1))
X_test = np.transpose(X_test, (0, 2, 1))

Build & Compile Model¶

In [48]:
lstm = Sequential([
    GaussianNoise(0.1, input_shape=(X_train.shape[1], X_train.shape[2])),
    LSTM(units=512, return_sequences=True),
    Dropout(0.2),
    LSTM(units=512),
    Dropout(0.2),
    Dense(units=64, activation='relu'),
    Dense(units=8, activation='softmax')
])

lstm.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

lstm.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 gaussian_noise_1 (Gaussian  (None, 495, 128)          0         
 Noise)                                                          
                                                                 
 lstm_2 (LSTM)               (None, 495, 512)          1312768   
                                                                 
 dropout_4 (Dropout)         (None, 495, 512)          0         
                                                                 
 lstm_3 (LSTM)               (None, 512)               2099200   
                                                                 
 dropout_5 (Dropout)         (None, 512)               0         
                                                                 
 dense_5 (Dense)             (None, 64)                32832     
                                                                 
 dense_6 (Dense)             (None, 8)                 520       
                                                                 
=================================================================
Total params: 3445320 (13.14 MB)
Trainable params: 3445320 (13.14 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Train Model¶

In [49]:
early_stopping = EarlyStopping(monitor='val_accuracy', patience=15, restore_best_weights=True)

filepath = "lstm_weights.h5"
checkpoint_callback = ModelCheckpoint(filepath, 
                                      monitor="val_accuracy", 
                                      save_weights_only=True, 
                                      save_best_only=True, 
                                      verbose=0)


lstm_history = lstm.fit(X_train, 
          y_train, 
          validation_data=(X_test, y_test), 
          epochs=100, 
          batch_size=16, 
          callbacks=[early_stopping, checkpoint_callback])
Epoch 1/100
72/72 [==============================] - 14s 132ms/step - loss: 2.0926 - accuracy: 0.1562 - val_loss: 2.0440 - val_accuracy: 0.1562
Epoch 2/100
72/72 [==============================] - 8s 114ms/step - loss: 2.0240 - accuracy: 0.1684 - val_loss: 2.0292 - val_accuracy: 0.1840
Epoch 3/100
72/72 [==============================] - 8s 115ms/step - loss: 2.0061 - accuracy: 0.2014 - val_loss: 2.0075 - val_accuracy: 0.1910
Epoch 4/100
72/72 [==============================] - 8s 115ms/step - loss: 1.9868 - accuracy: 0.1953 - val_loss: 1.9888 - val_accuracy: 0.2083
Epoch 5/100
72/72 [==============================] - 8s 115ms/step - loss: 1.9336 - accuracy: 0.2300 - val_loss: 1.9004 - val_accuracy: 0.2708
Epoch 6/100
72/72 [==============================] - 8s 113ms/step - loss: 1.9460 - accuracy: 0.2422 - val_loss: 1.8903 - val_accuracy: 0.2708
Epoch 7/100
72/72 [==============================] - 8s 115ms/step - loss: 1.8901 - accuracy: 0.2457 - val_loss: 1.9052 - val_accuracy: 0.2951
Epoch 8/100
72/72 [==============================] - 8s 113ms/step - loss: 1.8744 - accuracy: 0.2465 - val_loss: 1.8877 - val_accuracy: 0.2465
Epoch 9/100
72/72 [==============================] - 8s 113ms/step - loss: 1.8049 - accuracy: 0.2891 - val_loss: 1.9476 - val_accuracy: 0.2535
Epoch 10/100
72/72 [==============================] - 8s 115ms/step - loss: 1.7872 - accuracy: 0.2934 - val_loss: 1.7501 - val_accuracy: 0.3160
Epoch 11/100
72/72 [==============================] - 8s 113ms/step - loss: 1.7603 - accuracy: 0.2995 - val_loss: 1.8089 - val_accuracy: 0.2951
Epoch 12/100
72/72 [==============================] - 9s 119ms/step - loss: 1.7413 - accuracy: 0.3134 - val_loss: 1.7929 - val_accuracy: 0.2951
Epoch 13/100
72/72 [==============================] - 8s 115ms/step - loss: 1.7304 - accuracy: 0.3299 - val_loss: 1.8384 - val_accuracy: 0.3264
Epoch 14/100
72/72 [==============================] - 8s 115ms/step - loss: 1.7154 - accuracy: 0.3394 - val_loss: 1.7962 - val_accuracy: 0.3507
Epoch 15/100
72/72 [==============================] - 8s 113ms/step - loss: 1.6768 - accuracy: 0.3533 - val_loss: 1.7283 - val_accuracy: 0.3472
Epoch 16/100
72/72 [==============================] - 9s 122ms/step - loss: 1.6604 - accuracy: 0.3368 - val_loss: 1.7129 - val_accuracy: 0.3715
Epoch 17/100
72/72 [==============================] - 8s 113ms/step - loss: 1.6743 - accuracy: 0.3420 - val_loss: 1.7631 - val_accuracy: 0.3021
Epoch 18/100
72/72 [==============================] - 8s 113ms/step - loss: 1.6615 - accuracy: 0.3759 - val_loss: 1.7741 - val_accuracy: 0.3333
Epoch 19/100
72/72 [==============================] - 8s 113ms/step - loss: 1.6373 - accuracy: 0.3524 - val_loss: 1.7386 - val_accuracy: 0.3438
Epoch 20/100
72/72 [==============================] - 8s 115ms/step - loss: 1.5615 - accuracy: 0.3915 - val_loss: 1.6879 - val_accuracy: 0.3924
Epoch 21/100
72/72 [==============================] - 8s 115ms/step - loss: 1.5467 - accuracy: 0.4167 - val_loss: 1.6062 - val_accuracy: 0.4097
Epoch 22/100
72/72 [==============================] - 8s 113ms/step - loss: 1.5082 - accuracy: 0.4071 - val_loss: 1.6971 - val_accuracy: 0.3542
Epoch 23/100
72/72 [==============================] - 8s 113ms/step - loss: 1.4825 - accuracy: 0.4262 - val_loss: 1.6937 - val_accuracy: 0.3646
Epoch 24/100
72/72 [==============================] - 8s 113ms/step - loss: 1.4557 - accuracy: 0.4262 - val_loss: 1.6134 - val_accuracy: 0.4028
Epoch 25/100
72/72 [==============================] - 8s 115ms/step - loss: 1.4281 - accuracy: 0.4332 - val_loss: 1.6540 - val_accuracy: 0.4167
Epoch 26/100
72/72 [==============================] - 8s 113ms/step - loss: 1.4360 - accuracy: 0.4280 - val_loss: 1.6597 - val_accuracy: 0.3646
Epoch 27/100
72/72 [==============================] - 8s 113ms/step - loss: 1.4239 - accuracy: 0.4332 - val_loss: 1.6897 - val_accuracy: 0.3681
Epoch 28/100
72/72 [==============================] - 8s 113ms/step - loss: 1.4599 - accuracy: 0.4410 - val_loss: 1.7109 - val_accuracy: 0.3715
Epoch 29/100
72/72 [==============================] - 8s 113ms/step - loss: 1.3896 - accuracy: 0.4635 - val_loss: 1.7180 - val_accuracy: 0.3924
Epoch 30/100
72/72 [==============================] - 8s 112ms/step - loss: 1.3812 - accuracy: 0.4618 - val_loss: 1.7227 - val_accuracy: 0.3611
Epoch 31/100
72/72 [==============================] - 8s 113ms/step - loss: 1.3511 - accuracy: 0.4931 - val_loss: 1.7895 - val_accuracy: 0.3403
Epoch 32/100
72/72 [==============================] - 8s 113ms/step - loss: 1.2899 - accuracy: 0.5087 - val_loss: 1.8142 - val_accuracy: 0.3958
Epoch 33/100
72/72 [==============================] - 8s 113ms/step - loss: 1.2784 - accuracy: 0.5165 - val_loss: 1.6934 - val_accuracy: 0.3854
Epoch 34/100
72/72 [==============================] - 8s 113ms/step - loss: 1.2440 - accuracy: 0.5399 - val_loss: 1.6373 - val_accuracy: 0.3889
Epoch 35/100
72/72 [==============================] - 8s 115ms/step - loss: 1.1871 - accuracy: 0.5391 - val_loss: 1.6150 - val_accuracy: 0.4306
Epoch 36/100
72/72 [==============================] - 8s 113ms/step - loss: 1.2689 - accuracy: 0.5174 - val_loss: 1.6526 - val_accuracy: 0.4201
Epoch 37/100
72/72 [==============================] - 8s 113ms/step - loss: 1.1615 - accuracy: 0.5616 - val_loss: 1.7449 - val_accuracy: 0.4062
Epoch 38/100
72/72 [==============================] - 8s 112ms/step - loss: 1.1311 - accuracy: 0.5677 - val_loss: 1.6842 - val_accuracy: 0.4028
Epoch 39/100
72/72 [==============================] - 8s 115ms/step - loss: 1.0898 - accuracy: 0.6163 - val_loss: 1.7551 - val_accuracy: 0.4479
Epoch 40/100
72/72 [==============================] - 8s 113ms/step - loss: 1.1076 - accuracy: 0.5929 - val_loss: 1.6909 - val_accuracy: 0.3924
Epoch 41/100
72/72 [==============================] - 8s 113ms/step - loss: 1.0520 - accuracy: 0.6137 - val_loss: 1.6024 - val_accuracy: 0.4444
Epoch 42/100
72/72 [==============================] - 8s 113ms/step - loss: 1.0191 - accuracy: 0.6259 - val_loss: 1.6725 - val_accuracy: 0.4132
Epoch 43/100
72/72 [==============================] - 8s 113ms/step - loss: 1.0450 - accuracy: 0.6276 - val_loss: 1.8967 - val_accuracy: 0.4167
Epoch 44/100
72/72 [==============================] - 8s 113ms/step - loss: 0.9484 - accuracy: 0.6476 - val_loss: 1.8479 - val_accuracy: 0.4236
Epoch 45/100
72/72 [==============================] - 8s 113ms/step - loss: 0.9562 - accuracy: 0.6380 - val_loss: 1.6428 - val_accuracy: 0.4479
Epoch 46/100
72/72 [==============================] - 8s 113ms/step - loss: 0.9567 - accuracy: 0.6727 - val_loss: 1.9062 - val_accuracy: 0.4028
Epoch 47/100
72/72 [==============================] - 8s 113ms/step - loss: 0.8372 - accuracy: 0.6814 - val_loss: 1.8974 - val_accuracy: 0.4201
Epoch 48/100
72/72 [==============================] - 8s 113ms/step - loss: 0.8651 - accuracy: 0.6892 - val_loss: 1.8136 - val_accuracy: 0.4167
Epoch 49/100
72/72 [==============================] - 8s 113ms/step - loss: 0.8634 - accuracy: 0.6814 - val_loss: 1.8532 - val_accuracy: 0.4028
Epoch 50/100
72/72 [==============================] - 8s 112ms/step - loss: 0.8390 - accuracy: 0.6858 - val_loss: 1.8163 - val_accuracy: 0.4306
Epoch 51/100
72/72 [==============================] - 8s 113ms/step - loss: 0.7814 - accuracy: 0.7109 - val_loss: 1.8316 - val_accuracy: 0.4271
Epoch 52/100
72/72 [==============================] - 8s 112ms/step - loss: 0.9433 - accuracy: 0.6641 - val_loss: 1.8615 - val_accuracy: 0.4132
Epoch 53/100
72/72 [==============================] - 8s 113ms/step - loss: 0.9097 - accuracy: 0.6710 - val_loss: 1.8992 - val_accuracy: 0.4028
Epoch 54/100
72/72 [==============================] - 8s 113ms/step - loss: 0.8308 - accuracy: 0.6910 - val_loss: 1.8929 - val_accuracy: 0.4236
In [50]:
plot_history(lstm_history, "LSTM")
[Figure: LSTM training history (accuracy and loss)]

Evaluation & Analysis¶

In [51]:
evaluate_predictions(lstm, 'LSTM', X_test=X_test, y_test=y_test)
9/9 [==============================] - 1s 53ms/step
[Figure: LSTM confusion matrix]

As we covered in lecture, LSTMs are a type of RNN that can capture long-term dependencies in the input data, making them particularly useful for tasks involving sequential data. Thus, we believed that an LSTM model architecture would be potentially well suited for the task of sentiment analysis on our dataset due to its ability to effectively process our sequential data which comes in the form of audio samples.

Our original LSTM model, which was produced for Milestone 4, used 4 LSTM layers with varying numbers of units (512, 512, 256, and 256), followed by dense layers with decreasing numbers of units (128, 64, and 8). The intuition behind using multiple LSTM layers was to allow the model to learn increasingly complex representations of the input data, enabling it to capture intricate patterns and dependencies in the audio samples. As we progressed through the model, we decreased the number of units in the dense layers to progressively reduce the dimensionality of the data while hoping to keep the most important features for the sentiment analysis task. The dropout layers were put in to mitigate overfitting, and the output layer with 8 units and a softmax activation function was designed to classify the input audio samples into the 8 different emotion categories present in the RAVDESS dataset.

While we had high hopes for this architecture, like most models you train for the first time, it didn't perform well. After 50 epochs with a batch size of 32, the validation accuracy stayed around 15-17% with no signs of improvement, and even the training accuracy remained low and roughly constant. We thought our model might be a little too complex, so we decided to simplify it. Our next step was to try an architecture with fewer LSTM and Dense layers: we reduced the number of LSTM layers from 4 to 3, which had worked well in one of the homeworks, and reduced the number of Dense layers from 3 to 2. Additionally, we added some Gaussian noise and kept the Dropout layers, but reduced their rate. Training this new model, we immediately saw an increase in performance: over 70 epochs we got around 46% accuracy, which beat our baseline model. Nonetheless, our goal was to get at least over 50%, so we again made the architecture slightly simpler, reducing the number of LSTM layers to 2 but increasing the size of the first one to 1024. This got us to 50% accuracy over 75 epochs, and at this point we were happy enough with the architecture to start fine-tuning.

The first thing we tried was reducing the size of the first LSTM layer to 512 and changing the size of the final Dense layer before the output layer. We made it both smaller (32) and bigger (124), but both performed worse, giving us 28% and 49% accuracy respectively. This told us that 64 was the best size for the final Dense layer, so we retrained with that size while keeping the first LSTM layer at 512, and this produced our best result yet, with a validation accuracy of 55.9%. While we tried further architectures, which we describe in the next paragraph, this would end up being our final model, as it consistently reached validation accuracies between 54-56%. Looking at its validation accuracy and loss curves, we can definitely see some overfitting: for accuracy, the train and validation lines start to diverge around the 30th epoch, whereas for the loss it is around the 20th epoch. However, over many runs, including this final run, the best validation accuracy tends to fall in the 70-85 epoch range, whereas the best validation loss falls in the 20-35 epoch range. In this specific run, the best validation accuracy of 55.56% came at epoch 92 (val loss of 2.18) and the best validation loss was 1.57 at epoch 30 (val accuracy of 41.32%).

The next change we made to try to improve performance was re-adding a Dense layer of size 64. While this slightly improved performance the first time we trained it, reaching 56.6% accuracy (the highest of any LSTM model we tried), upon retraining we would consistently get accuracies around 50-52%. After having played around with most of the layer sizes, we decided to try a smaller batch size (16), since our dataset isn't very big. While this didn't lead to better overall performance, it did allow our model's validation accuracy to keep up with the training accuracy for longer, so the model overfits in later epochs rather than earlier ones; because of this, we kept that batch size. Finally, we tried both batch and layer normalization, but these yielded validation accuracies around 45%.

Looking at the actual predictions of our LSTM model via the confusion matrix, we can see it is really good at predicting the Calm, Angry, Fearful, Disgust, and Surprised emotions, mediocre at predicting Happy and Sad, and pretty bad at predicting Neutral. This sparked the idea of watching how it learns the emotions over epochs: we trained the model a small number of epochs at a time, printed the confusion matrix, and then continued training (a sketch of this loop is given below). From this analysis, we saw that Calm is the emotion it learns first and predicts best. It is followed by Surprised and Disgust, which it typically learns fairly confidently next, and then Fearful and Angry. Depending on the run, Happy and Sad are predicted either moderately or poorly, but they are almost always at least marginally more accurate than Neutral.
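A minimal sketch of the train-a-bit-then-inspect loop described above, reusing the lstm model, the transposed train/test splits, and the evaluate_predictions helper defined earlier (the chunk size and number of rounds here are illustrative):

# Sketch: train in short bursts and inspect the confusion matrix after
# each burst to see which emotions the model learns first.
for round_idx in range(5):                      # 5 illustrative rounds
    lstm.fit(X_train, y_train,
             validation_data=(X_test, y_test),
             epochs=5, batch_size=16, verbose=0)
    print(f"After {(round_idx + 1) * 5} additional epochs:")
    evaluate_predictions(lstm, f"LSTM (round {round_idx + 1})", X_test=X_test, y_test=y_test)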

Transformer¶

Transformers are the current state-of-the-art architecture for language models due to their ability to positionally encode embeddings, run in parallel, and learn complex contextual features. We try them on audio data since it is also sequential in nature and should therefore be well suited to transformers. Furthermore, because the data is already numerical, there is no need for an embedding layer, and the positional encodings can be applied directly to the audio data. The custom transformer layer we use includes multi-head attention, concatenation, skip connections, layer normalization, and dropout.

Data Preparation¶

In [52]:
# Prepare dataset
X_transformer = np.transpose(mel_256, (0, 2, 1))
y = df['Emotion_Number'].values

# Train-val split
X_train, X_val, y_train, y_val = train_test_split(X_transformer, y, test_size=0.2, stratify=y, random_state=109)

# Standardize
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)
X_val = scaler.transform(X_val.reshape(-1, X_val.shape[-1])).reshape(X_val.shape)
In [53]:
def make_dataset(x, y, batch_size=32):
    data = tf.data.Dataset.from_tensor_slices((x, y))
    data = data.batch(batch_size).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return data

# Make tf datasets
train_ds = make_dataset(X_train, y_train)
val_ds = make_dataset(X_val, y_val)
In [54]:
def get_relative_positions(max_seq_length):
    # Create a matrix where the element at [i, j] is j-i; i.e., the relative distance from i to j
    range_vec = tf.range(max_seq_length)
    range_mat = tf.reshape(range_vec, [1, -1])
    distance_mat = range_mat - tf.transpose(range_mat)
    return distance_mat

def get_relative_positional_encoding(max_seq_length, d_model):
    # Compute relative positions
    relative_positions = get_relative_positions(max_seq_length)
    
    # Adjust positions to be within the model's scale
    max_relative_position = max_seq_length - 1
    
    # Clamp the values in the matrix to be within [-max_relative_position, max_relative_position]
    relative_positions = tf.clip_by_value(relative_positions, -max_relative_position, max_relative_position)
    
    # Embeddings for each relative position
    relative_position_embeddings = tf.keras.layers.Embedding(
        2 * max_relative_position + 1, d_model)(relative_positions + max_relative_position)
    
    # Reduce over sequence length to match shape for broadcasting
    relative_position_embeddings = tf.reduce_mean(relative_position_embeddings, axis=1)
    
    return relative_position_embeddings

Build & Compile Model¶

In [55]:
def transformer_encoder(inputs, embed_dim, num_heads, ff_dim, rate=0.2):
    # Multi-head attention
    attention_output = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(inputs, inputs)
    attention_output = layers.Dropout(rate)(attention_output)
    attention_output = layers.LayerNormalization(epsilon=1e-6)(inputs + attention_output)
    
    # Feed-forward network
    ffn_output = layers.Dense(ff_dim, activation="relu")(attention_output)
    ffn_output = layers.Dense(embed_dim)(ffn_output)
    ffn_output = layers.Dropout(rate)(ffn_output)
    ffn_output = layers.LayerNormalization(epsilon=1e-6)(attention_output + ffn_output)
    return ffn_output
In [56]:
def build_model(input_shape, embed_dim, num_heads, ff_dim, max_seq_length, num_classes, num_layers):
    # Input validation for num_heads
    if isinstance(num_heads, int):
        # If num_heads is an integer, use the same number of heads across all layers
        num_heads_list = [num_heads] * num_layers
    elif isinstance(num_heads, list):
        # If num_heads is a list, check that its length matches num_layers
        if len(num_heads) != num_layers:
            raise ValueError(f"The length of num_heads list must be equal to num_layers ({num_layers}).")
        num_heads_list = num_heads
    else:
        raise TypeError("num_heads must be either an integer or a list of integers.")
    
    inputs = layers.Input(shape=input_shape)
    x = layers.GaussianNoise(0.1)(inputs)
    x = layers.Dense(embed_dim)(x)
    x += get_relative_positional_encoding(max_seq_length, embed_dim)
    for i in range(num_layers):
        num_heads = num_heads_list[i]
        x = transformer_encoder(x, embed_dim, num_heads, ff_dim)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=outputs)
    return model

input_shape = (495, 256)  # 495 timesteps, 256 Mel features
embed_dim = 50  # Size of the embedding vector
num_heads = 10   # Number of attention heads
ff_dim = 192  # Hidden layer size in feed forward network inside transformer
max_seq_length = 495  # Maximum sequence length
num_classes = 8  # Number of emotions
num_layers = 3 # Number of transformer blocks

transformer = build_model(input_shape, embed_dim, num_heads, ff_dim, max_seq_length, num_classes, num_layers)
In [57]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,
    decay_rate=0.9)
optimizer_sch = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

transformer.compile(optimizer=optimizer_sch, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

transformer = load_model('models/transformer.h5') # Load in model with above architecture and best weights

transformer.summary()
Model: "model_15"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 input_17 (InputLayer)       [(None, 495, 256)]           0         []                            
                                                                                                  
 gaussian_noise_16 (Gaussia  (None, 495, 256)             0         ['input_17[0][0]']            
 nNoise)                                                                                          
                                                                                                  
 dense_123 (Dense)           (None, 495, 50)              12850     ['gaussian_noise_16[0][0]']   
                                                                                                  
 tf.__operators__.add_108 (  (None, 495, 50)              0         ['dense_123[0][0]']           
 TFOpLambda)                                                                                      
                                                                                                  
 multi_head_attention_47 (M  (None, 495, 50)              101550    ['tf.__operators__.add_108[0][
 ultiHeadAttention)                                                 0]',                          
                                                                     'tf.__operators__.add_108[0][
                                                                    0]']                          
                                                                                                  
 dropout_98 (Dropout)        (None, 495, 50)              0         ['multi_head_attention_47[0][0
                                                                    ]']                           
                                                                                                  
 tf.__operators__.add_109 (  (None, 495, 50)              0         ['tf.__operators__.add_108[0][
 TFOpLambda)                                                        0]',                          
                                                                     'dropout_98[0][0]']          
                                                                                                  
 layer_normalization_92 (La  (None, 495, 50)              100       ['tf.__operators__.add_109[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 dense_124 (Dense)           (None, 495, 192)             9792      ['layer_normalization_92[0][0]
                                                                    ']                            
                                                                                                  
 dense_125 (Dense)           (None, 495, 50)              9650      ['dense_124[0][0]']           
                                                                                                  
 dropout_99 (Dropout)        (None, 495, 50)              0         ['dense_125[0][0]']           
                                                                                                  
 tf.__operators__.add_110 (  (None, 495, 50)              0         ['layer_normalization_92[0][0]
 TFOpLambda)                                                        ',                            
                                                                     'dropout_99[0][0]']          
                                                                                                  
 layer_normalization_93 (La  (None, 495, 50)              100       ['tf.__operators__.add_110[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 multi_head_attention_48 (M  (None, 495, 50)              101550    ['layer_normalization_93[0][0]
 ultiHeadAttention)                                                 ',                            
                                                                     'layer_normalization_93[0][0]
                                                                    ']                            
                                                                                                  
 dropout_100 (Dropout)       (None, 495, 50)              0         ['multi_head_attention_48[0][0
                                                                    ]']                           
                                                                                                  
 tf.__operators__.add_111 (  (None, 495, 50)              0         ['layer_normalization_93[0][0]
 TFOpLambda)                                                        ',                            
                                                                     'dropout_100[0][0]']         
                                                                                                  
 layer_normalization_94 (La  (None, 495, 50)              100       ['tf.__operators__.add_111[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 dense_126 (Dense)           (None, 495, 192)             9792      ['layer_normalization_94[0][0]
                                                                    ']                            
                                                                                                  
 dense_127 (Dense)           (None, 495, 50)              9650      ['dense_126[0][0]']           
                                                                                                  
 dropout_101 (Dropout)       (None, 495, 50)              0         ['dense_127[0][0]']           
                                                                                                  
 tf.__operators__.add_112 (  (None, 495, 50)              0         ['layer_normalization_94[0][0]
 TFOpLambda)                                                        ',                            
                                                                     'dropout_101[0][0]']         
                                                                                                  
 layer_normalization_95 (La  (None, 495, 50)              100       ['tf.__operators__.add_112[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 multi_head_attention_49 (M  (None, 495, 50)              101550    ['layer_normalization_95[0][0]
 ultiHeadAttention)                                                 ',                            
                                                                     'layer_normalization_95[0][0]
                                                                    ']                            
                                                                                                  
 dropout_102 (Dropout)       (None, 495, 50)              0         ['multi_head_attention_49[0][0
                                                                    ]']                           
                                                                                                  
 tf.__operators__.add_113 (  (None, 495, 50)              0         ['layer_normalization_95[0][0]
 TFOpLambda)                                                        ',                            
                                                                     'dropout_102[0][0]']         
                                                                                                  
 layer_normalization_96 (La  (None, 495, 50)              100       ['tf.__operators__.add_113[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 dense_128 (Dense)           (None, 495, 192)             9792      ['layer_normalization_96[0][0]
                                                                    ']                            
                                                                                                  
 dense_129 (Dense)           (None, 495, 50)              9650      ['dense_128[0][0]']           
                                                                                                  
 dropout_103 (Dropout)       (None, 495, 50)              0         ['dense_129[0][0]']           
                                                                                                  
 tf.__operators__.add_114 (  (None, 495, 50)              0         ['layer_normalization_96[0][0]
 TFOpLambda)                                                        ',                            
                                                                     'dropout_103[0][0]']         
                                                                                                  
 layer_normalization_97 (La  (None, 495, 50)              100       ['tf.__operators__.add_114[0][
 yerNormalization)                                                  0]']                          
                                                                                                  
 global_average_pooling1d_1  (None, 50)                   0         ['layer_normalization_97[0][0]
 5 (GlobalAveragePooling1D)                                         ']                            
                                                                                                  
 dense_130 (Dense)           (None, 8)                    408       ['global_average_pooling1d_15[
                                                                    0][0]']                       
                                                                                                  
==================================================================================================
Total params: 376834 (1.44 MB)
Trainable params: 376834 (1.44 MB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________

No description has been provided for this image

This multi-head transformer model takes in the audio data and outputs a prediction of each audio sample's emotion. First, the two-dimensional Mel-transformed data ($495$ timesteps, $256$ Mel features) is passed into the model. Gaussian noise with a standard deviation of $0.1$ is then applied. After this, the data is passed through a dense layer with $50$ nodes in order to get $495$ embeddings (one for each timestep) with embed_dim $=50$.

Drawing inspiration from the DeBERTa model, a custom relative positional encoder was used rather than a fixed sinusoidal positional encoder, as in traditional transformers. First, the get_relative_positions function is used to create a matrix that stores the distances between each of the $495$ timesteps in the audio data. Then the get_relative_positional_encoding function converts each relative position into a vector of size embed_dim $=50$ so that they can be added to the embeddings.

These embeddings are then passed through $3$ successive transformer layers. In each layer, multi-head attention with $10$ heads is applied; as configured above (key_dim $=$ embed_dim), each head projects the $50$-dimensional embeddings into its own query, key, and value spaces of size $50$. Dropout with a rate of $0.2$ is applied to the attention output, which is then added to its original input (a skip connection) and layer-normalized. The result is passed through a feed-forward network whose first dense layer has $192$ nodes and whose second has $50$ nodes, reshaping it back to the original embedding size of embed_dim $=50$. The feed-forward stage also ends with an add-and-normalize step.

After the embeddings have been transformed by the $3$ transformer layers, a global average pooling layer converts the $495$ x $50$ matrix into a $1$-dimensional vector of length $50$, essentially creating one embedding that represents the entire audio sample. Finally, this is passed into the softmax output layer, which returns a probability distribution across the $8$ emotions, with the highest value indicating the model's prediction. This model is also very light-weight, totaling only $376,834$ parameters.

Train Model¶

In [58]:
early_stopping = EarlyStopping(
    monitor='val_accuracy', 
    patience=20, 
    restore_best_weights=True)

callbacks = [early_stopping]

### Not training here, best model loaded in above
# history = transformer.fit(train_ds, validation_data=val_ds, epochs=200, verbose=1, callbacks=callbacks)

transformer_history

The model was trained for $83$ epochs, with the best validation accuracy of $51.4\%$ occurring at epoch $63$ (training stopped due to early stopping with a patience of $20$). The validation accuracy quickly rose to ~$35\%$ within the first couple of epochs, then climbed slowly from there. The training accuracy showed standard concave growth and was around $90\%$ by the end of training. Conversely, the training loss had a convex decline, going from $2$ to around $0.25$. Interestingly, though, the validation loss reached its minimum of $1.71$ at epoch $8$ and rose steadily thereafter; at epoch $63$, which had the highest validation accuracy, the validation loss was ~$2.5$. This means that, although the validation accuracy was improving, the loss of the output probability distributions was increasing. However, our performance metric was accuracy, so that is what we monitored.
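
The curves described above can be reproduced roughly as follows, assuming transformer_history holds the standard Keras history dictionary (i.e., the .history attribute of the object returned by fit()); this is a sketch, not the exact plotting code we used.

# Sketch of plotting the training curves, assuming transformer_history is a
# dict with the usual Keras keys ('accuracy', 'val_accuracy', 'loss', 'val_loss').
hist = transformer_history

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(hist['accuracy'], label='train')
plt.plot(hist['val_accuracy'], label='validation')
plt.title('Transformer Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(hist['loss'], label='train')
plt.plot(hist['val_loss'], label='validation')
plt.title('Transformer Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()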

Evaluation & Analysis¶

In [59]:
loss, accuracy = transformer.evaluate(val_ds)
print(f"Validation Loss: {loss:.2f}, Validation Accuracy: {accuracy:.2%}")
9/9 [==============================] - 2s 70ms/step - loss: 2.5667 - accuracy: 0.5139
Validation Loss: 2.57, Validation Accuracy: 51.39%
In [60]:
evaluate_predictions(transformer, "Transformer", X_test=X_val, y_test=y_val)
9/9 [==============================] - 1s 64ms/step
[Figure: Transformer confusion matrix on the validation set]

The validation accuracy for this transformer architecture with its best set of weights was $51.4\%$. This relatively strong performance shows up as the pronounced diagonal in the confusion matrix above. The model performed best on "calm", with an accuracy of $69.2\%$, and worst on "sad", with an accuracy of $36.8\%$. It also over-predicted "surprised" the most, with $35$ false positives and a precision of $40.7\%$; in fact, for $5$ of the $7$ other emotions, "surprised" was the most common incorrect prediction. Thus, the model tends to predict "surprised" somewhat naively. Overall, though, the transformer's errors are fairly evenly distributed, with no off-diagonal cell having a value above $9$.
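
The per-emotion figures quoted above can be read directly off the confusion matrix. A minimal sketch follows, assuming id2label (defined earlier in the notebook) maps the 0-7 class indices to emotion names and glossing over any offset between Emotion_Number and those indices.

# Sketch: per-emotion recall (diagonal / row sums) and precision (diagonal /
# column sums), with rows as true labels and columns as predictions, as in
# sklearn's confusion_matrix. The id2label mapping is assumed to match the
# class indices used here.
cm = confusion_matrix(y_val, np.argmax(transformer.predict(X_val, verbose=0), axis=1))

recall = np.diag(cm) / cm.sum(axis=1)       # per-emotion accuracy, e.g. ~0.69 for "calm"
precision = np.diag(cm) / cm.sum(axis=0)    # e.g. ~0.41 for "surprised"

for i in range(cm.shape[0]):
    print(f"{id2label[i]}: recall {recall[i]:.1%}, precision {precision[i]:.1%}")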

The transformer is also very lightweight, especially considering its performance. Its $377\text{k}$ parameter count is more than $8\text{x}$ smaller than the LSTM's ($3.1\text{M}$) and more than $250\text{x}$ smaller than the SOTA model's ($94.6\text{M}$).

SOTA¶

Here we fine-tune the base version of the second iteration of Facebook’s Wav2Vec model, described in their paper “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” This model has 95 million parameters and was trained on thousands of hours of raw speech audio sampled at 16kHz. Because our own data is sampled at 48kHz, we downsampled each input to 16kHz. To implement the model, we used a Trainer object from the transformers module, which is specialized for fine-tuning pretrained models. We also employed the module’s AutoFeatureExtractor class, which normalizes and processes the audio data into the form the model requires. Our code was heavily inspired by this HuggingFace tutorial on fine-tuning an audio model. Due to the long training time and limited GPU access on Colab (we ran into various PyTorch dependency errors on JupyterHub), we were only able to train for 10 epochs, which took approximately 32 minutes.

Data Preparation¶

In [61]:
audio_dataset = Dataset.from_dict({"audio": wav_paths}).cast_column("audio", AudioCast())
audio_dataset = audio_dataset.add_column('label', df['Emotion_Number']-1)
In [62]:
audio_dataset = audio_dataset.train_test_split(test_size=0.2)
In [63]:
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]
config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]
In [64]:
audio = audio_dataset.cast_column("audio", AudioCast(sampling_rate=16_000))
In [65]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, padding='longest')
    return inputs
In [66]:
encoded_audio = audio.map(preprocess_function, remove_columns="audio", batched=True)
Map:   0%|          | 0/1152 [00:00<?, ? examples/s]
Map:   0%|          | 0/288 [00:00<?, ? examples/s]

Load Model¶

In [67]:
accuracy = evaluate.load("accuracy")
Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
In [68]:
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
In [69]:
num_labels = len(id2label)

##Load Wav2Vec Model 
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)
pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [70]:
summary(model)
Out[70]:
================================================================================
Layer (type:depth-idx)                                  Param #
================================================================================
Wav2Vec2ForSequenceClassification                       --
├─Wav2Vec2Model: 1-1                                    768
│    └─Wav2Vec2FeatureEncoder: 2-1                      --
│    │    └─ModuleList: 3-1                             4,200,448
│    └─Wav2Vec2FeatureProjection: 2-2                   --
│    │    └─LayerNorm: 3-2                              1,024
│    │    └─Linear: 3-3                                 393,984
│    │    └─Dropout: 3-4                                --
│    └─Wav2Vec2Encoder: 2-3                             --
│    │    └─Wav2Vec2PositionalConvEmbedding: 3-5        4,719,488
│    │    └─LayerNorm: 3-6                              1,536
│    │    └─Dropout: 3-7                                --
│    │    └─ModuleList: 3-8                             85,054,464
├─Linear: 1-2                                           196,864
├─Linear: 1-3                                           2,056
================================================================================
Total params: 94,570,632
Trainable params: 94,570,632
Non-trainable params: 0
================================================================================

No description has been provided for this image

In [73]:
#Load model that was trained on Google Colab
# model = AutoModelForAudioClassification.from_pretrained('models/wav2vec10epochs')

Configure & Train Model¶

In [101]:
# training_args = TrainingArguments(
#     output_dir="wav2vec2_audio",
#     evaluation_strategy="epoch",
#     save_strategy="epoch",
#     learning_rate=3e-5,
#     per_device_train_batch_size=32,
#     gradient_accumulation_steps=4,
#     per_device_eval_batch_size=32,
#     num_train_epochs=10,
#     warmup_ratio=0.1,
#     logging_steps=10,
#     load_best_model_at_end=True,
#     metric_for_best_model="accuracy"
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=encoded_audio["train"],
#     eval_dataset=encoded_audio["test"],
#     tokenizer=feature_extractor,
#     compute_metrics=compute_metrics,
# )

# trainer.train()
In [74]:
#load saved results from training on colab
metrics=pd.read_csv('TrainingResults.csv')

epochs=metrics['Epoch'].to_list()
train_loss=metrics['Training Loss'].to_list()
val_loss=metrics['Validation Loss'].to_list()
val_accuracy=metrics['Validation Accuracy'].to_list()
In [75]:
plt.figure(figsize=(10, 4))

# Plot validation accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, val_accuracy, c='darkorange', label='validation')
plt.title('SOTA Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

# Plot training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs[1:], train_loss[1:], label='train')
plt.plot(epochs, val_loss, label='validation')
plt.title('SOTA Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()  # Adjust layout to prevent overlapping
plt.show()
[Figure: SOTA (Wav2Vec2) validation accuracy and training/validation loss over 10 epochs]

Both training and validation loss decrease steadily with more epochs, and validation accuracy reaches 62.5% by epoch 10. Although the improvement in validation accuracy and the decrease in loss appear to be slowly plateauing, these results indicate that training for more epochs would likely yield further improvement. With the addition of data augmentation techniques such as Gaussian noise, the results could potentially be improved even more. Unfortunately, limited compute resources and time restricted our ability to test different hyperparameters.
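
As one concrete (and untested) example of such augmentation, Gaussian noise could be added to the raw waveforms before feature extraction, mirroring the preprocess_function defined earlier; the noise scale below is an arbitrary assumption and this is a sketch rather than code we ran.

# Sketch of waveform-level Gaussian-noise augmentation applied before the
# Wav2Vec2 feature extractor. The 0.005 noise scale is an arbitrary assumption
# and would need tuning; augmentation should only be applied to the training split.
def preprocess_with_noise(examples, noise_std=0.005):
    audio_arrays = []
    for x in examples["audio"]:
        arr = np.asarray(x["array"], dtype=np.float32)
        audio_arrays.append(arr + np.random.normal(0.0, noise_std, size=arr.shape).astype(np.float32))
    return feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, padding="longest")

# encoded_train_aug = audio["train"].map(
#     preprocess_with_noise, remove_columns="audio", batched=True)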

Evaluation & Analysis¶

In [76]:
### Get model predictions on validation set

# output = trainer.predict(encoded_audio['test'])

# predictions = output.predictions
# label_ids = output.label_ids
# metrics = output.metrics

# preds=[np.argmax(x) for x in predictions]
In [77]:
#Load saved predictions
outputs_df = pd.read_csv('predictions.csv')
In [78]:
sns.heatmap(confusion_matrix(outputs_df['labels'], outputs_df['preds']), annot=True, xticklabels = [id2label[i] for i in range(8)], yticklabels = [id2label[i] for i in range(8)], cmap='Blues');
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Wav2Vec2 Confusion Matrix')
plt.show()
[Figure: Wav2Vec2 confusion matrix on the validation set]

Above is the confusion matrix for predictions made on the validation set. Interestingly, the model never predicts neutral and instead predicts calm for every neutral sample. This is consistent with what we expected, as we previously noted that calm and neutral are redundant emotions with little perceivable distinction. Indeed, many of the errors the model makes are sensible: it frequently mistakes sad, a low-energy emotion, for calm, and it mistakes fearful for surprised. This further indicates that the model is doing quite well at capturing the essential emotional characteristics.

Discussion¶

As the human benchmark reveals, this emotion classification task is a fundamentally difficult problem. There is significant variation even in how voice actors interpret emotions. At around 50% accuracy, both the transformer and LSTM models do quite well considering their parameter counts relative to Facebook’s pretrained Wav2Vec2 model; the LSTM and transformer have roughly 30 times and 250 times fewer parameters, respectively. Nevertheless, despite the training cost that comes with its large parameter count, the fine-tuned Wav2Vec2 robustly outperforms every other model we tried, achieving 83%, 84%, 98%, and 100% accuracy for disgust, fearful, angry, and calm respectively. Moreover, many of its primary misclassifications are understandable in that they align with human error: it frequently misclassified neutral and sad as calm, and surprised as fearful. Humans also had significant difficulty identifying sad and neutral, to the point that neutral had to be removed from the emotions tested in the human benchmark in order to avoid confusion.

One area of future improvement for both the LSTM and transformer models is reducing the overfitting that typically sets in during the early epochs, before the model has reached its best validation accuracy. We began to experiment briefly with solutions to this on the LSTM model, but much more could be done with both models. Our intuition was to make the model learn more slowly while still reaching the same performance, so that the validation accuracy could keep up. One idea we began testing was to go back to the 4-layer LSTM approach but make all of the layers much smaller (512-128 units). Because each layer now learns much less, this model took far longer to train (250 epochs) and reached slightly lower validation accuracies of around 52%, but it achieved a much better validation loss of 1.45. Additionally, when we plotted the curves, the overfitting not only started proportionally later, but the validation accuracies also stayed much closer to the training accuracies throughout training. Using a similar approach, we could test a 3-layer LSTM and tune its layer sizes to see whether this further reduces overfitting; the later the model begins to overfit, the higher we think the validation accuracy can get before flattening out. A rough sketch of this narrower model is given below.
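
Only the 512-128 unit range is stated above, so the intermediate widths, noise level, and dropout rate in this sketch are assumptions.

# Sketch of the narrower 4-layer LSTM described above; intermediate layer
# widths are assumptions within the stated 512-128 range.
slow_lstm = Sequential([
    Input(shape=(495, 256)),
    GaussianNoise(0.1),
    LSTM(512, return_sequences=True),
    Dropout(0.2),
    LSTM(256, return_sequences=True),
    Dropout(0.2),
    LSTM(256, return_sequences=True),
    Dropout(0.2),
    LSTM(128),
    Dense(64, activation='relu'),
    Dense(8, activation='softmax'),
])
slow_lstm.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])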

Another future improvement would be a more rigorous and structured analysis of which emotions are learned at which point in training for each model. We did this briefly with the LSTM model by plotting the confusion matrix at intermediate points while the model was training. However, if we ran this analysis multiple times and kept track of the exact statistics, instead of relying on a visual eye test, it could be very useful: we could use the data to train the models to better differentiate between emotions that are close together or that tend to be predicted incorrectly.
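
One way to make that analysis systematic for the Keras models would be a callback that records per-emotion recall after every epoch; the sketch below is one possible implementation, not code we actually ran.

# Sketch of a Keras callback that records per-emotion recall after each epoch,
# turning the "eye test" into exact statistics that can be compared across runs.
class PerClassRecall(keras.callbacks.Callback):
    def __init__(self, x_val, y_val, num_classes=8):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val
        self.num_classes = num_classes
        self.per_epoch_recall = []   # one array of 8 recalls per epoch

    def on_epoch_end(self, epoch, logs=None):
        preds = np.argmax(self.model.predict(self.x_val, verbose=0), axis=1)
        cm = confusion_matrix(self.y_val, preds, labels=list(range(self.num_classes)))
        recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
        self.per_epoch_recall.append(recall)

# Usage (assumed): include PerClassRecall(X_val, y_val) in the callbacks list passed to fit().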

While experimenting with the transformer model, we found a significant 5% improvement in accuracy by increasing the number of pitch frequencies returned by the Mel spectrogram at each time step, and more experimentation in this area could yield better results. One thing we were unable to do, due to computational constraints, was invert the Mel spectrogram back into an audio file to find out how much information is lost in the conversion; it would be helpful to hear how the audio sounds after being transformed to Mel and back again. Another idea we did not try is overlapping windows: our implementation with librosa separated the audio clip into discrete time windows, but others have had these windows overlap to create a more fluid transition between pitches across time. This fluidity might be important for emotion classification, or might let us increase complexity in a more meaningful way.
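
Both ideas map directly onto librosa: librosa.feature.inverse.mel_to_audio approximately inverts a Mel spectrogram via Griffin-Lim, and the hop_length argument of librosa.feature.melspectrogram controls how much consecutive windows overlap. The sketch below uses illustrative parameter values and a placeholder file path, not the settings from our preprocessing.

# Sketch: overlapping analysis windows (hop_length < n_fft) and approximate
# inversion of the Mel spectrogram back to audio to hear what is lost.
y, sr = librosa.load("path/to/some_clip.wav", sr=48000)   # placeholder path

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)   # 75% window overlap

y_reconstructed = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=2048, hop_length=512)

# Audio(y_reconstructed, rate=sr)  # listen to the reconstruction in the notebook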

Another obvious way to improve the performance of our models is to add more training data. This could include more voice actors saying the same phrases, or labelled audio samples from other forms of media, of which the internet provides a seemingly infinite supply. More data would allow the models to extract better representations from the audio. A large part of the SOTA Wav2Vec2's superior performance stems from the fact that it was pretrained on a massive audio database, whereas our own models only saw the RAVDESS data. Moving beyond the same 24 voice actors and the same 2 phrases would result in cleaner and more complex features being learned, and deeper architectures would then be needed to capture patterns and temporal dependencies with strong predictive power. At the end of the day, no matter how complex a deep learning model is, it must be given enough high-quality data to work. For example, if you fed a new phrase from a new voice into our custom models, they would likely perform much worse than the validation accuracies suggest; if, however, we significantly expanded the dataset, the trained models should learn features that generalize much better to unseen data.

With more time, we would also love to continue exploring CNNs for this task. A convolutional neural network seems like a promising option given the sheer amount of data we take as input: even after processing via the Mel spectrogram, there are still $128 \times 495 = 63360$ values per audio sample. We would like to detect the key features of the data using convolutional filters, much as we do for images. A working theory is that each emotion may be associated with certain patterns of frequencies or amplitudes, and the convolutional layers may be able to detect these patterns; anger, for example, may be associated with higher-amplitude bursts (louder voices) or higher-frequency bursts (sped-up talking). Additionally, the shift-invariance of CNNs should be beneficial: in theory it should not matter whether the patterns associated with anger appear at the beginning or the end of the audio sample, so we do not want to take their position into account. For some reason, however, the CNNs simply were not working despite our best efforts. The loss was astronomical, and we could never exceed an accuracy of around 18% (barely better than a random guess among the 8 options). We believe this could be caused by non-local relationships in the Mel spectrogram output, but we would need more time to explore it.
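
For reference, the kind of CNN we had in mind treats each Mel spectrogram as a one-channel image; a minimal sketch is below, with layer sizes and filter counts as assumptions rather than the exact configurations we tried.

# Minimal sketch of the CNN idea: treat each 128 x 495 Mel spectrogram as a
# one-channel image. Layer sizes and filter counts are illustrative assumptions.
cnn_sketch = Sequential([
    Input(shape=(128, 495, 1)),                        # (Mel bins, timesteps, channels)
    Conv2D(16, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='softmax'),
])
cnn_sketch.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])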

Overall, we were content with our progress on this challenging task, and very much enjoyed exploring the world of audio networks.