Who's fibbing? A Machine Learning Approach to Acoustic Feature Analysis¶
1 Author¶
Name: Rashad Malik
2 Problem formulation¶
The goal of this project is to develop a machine learning model capable of distinguishing between true and deceptive narratives based on audio recordings of spoken stories.
This task is both intriguing and challenging: it requires exploring subtle acoustic features, such as pauses and pitch variability, to infer the truthfulness of speech, something people often struggle to do themselves.
The ability to automate this process has significant potential applications across multiple areas:
- Law enforcement and the legal system, which would benefit greatly from being able to reliably detect true or false statements.
- Psychological research and academia, where accurately assessing the veracity of statements can be critical.
However, there are also ethical concerns should a successful tool like this fall into the hands of bad actors. There is a risk of privacy invasion if people have not consented to having their statements assessed, and misuse of such a tool could have negative psychological impacts. The social implications of this project must therefore be continually assessed and kept in mind.
Beyond the practical implications, this problem intersects fields like computational linguistics, acoustic analysis, and behavioural science, offering an opportunity to explore how deception manifests in human speech patterns and whether these cues can be quantified effectively for predictive purposes.
Additionally, this project was used as an opportunity to learn additional computer science tools:
- Learning to use an IDE (VSCode) to write Jupyter notebooks.
- Using Git and GitHub to track project progress and to host it online.
3 Methodology¶
The methodology for this project involves several stages to develop and evaluate a machine learning model capable of classifying spoken narratives as true or deceptive. The approach includes training, validation, and testing phases.
3.1 Training Task¶
The training task involved preparing predictive models based on features extracted from 30-second audio chunks, with each chunk labelled as either part of a true or deceptive story, corresponding to the original audio file's label. The dataset was first divided into an 80:20 split for training and testing. Within the training set, Group-K-Fold cross-validation was performed to ensure all chunks from the same audio file were grouped together in a fold, preventing data leakage. This cross-validation process was used to determine the best-performing model based on accuracy, while avoiding overfitting to specific speech patterns. Predictors include features such as the number of pauses, total silence duration, pause-to-speech ratio, pitch range, and pitch variability, extracted from the audio samples.
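To make the grouping concrete, the snippet below is a minimal sketch of how GroupKFold keeps every chunk from the same recording on the same side of each split. The arrays here are made-up stand-ins, not the project's actual feature matrix or labels.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical stand-ins: 20 chunks with 5 features each, a binary label per
# chunk, and the ID of the recording each chunk was cut from.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20)
file_ids = np.repeat(np.arange(5), 4)   # 5 recordings, 4 chunks each

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=file_ids)):
    # All chunks from a given recording land entirely in either the training
    # or the validation indices, preventing leakage between folds.
    print(f"Fold {fold}: validation recordings {np.unique(file_ids[val_idx])}")
```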
3.2 Test Task¶
Once the best-performing model was identified during cross-validation, it was retrained using the entire training set with the optimal hyperparameters. The final evaluation was conducted on the unseen test set, ensuring a fair assessment of the model's generalisation ability. The performance on the test set provided the final results for accuracy and other metrics.
3.3 Model Performance Definition¶
Model performance is evaluated using multiple metrics. Accuracy serves as the primary measure, providing a simple baseline for overall model correctness and for comparing models during training and validation. After the best-performing model was selected, the F1-score was used for a balanced evaluation of precision and recall, which is particularly important in the presence of class imbalances. Additionally, the confusion matrix was analysed to identify patterns of misclassification, such as false positives (deceptive stories incorrectly classified as true) and false negatives (true stories incorrectly classified as deceptive).
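As a small illustration of how these metrics are computed with scikit-learn (the labels below are made up for demonstration, not results from this project):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels purely for illustration: 1 = true story, 0 = deceptive story
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
# Rows correspond to actual classes, columns to predicted classes
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```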
3.4 Other Tasks¶
Several preparatory steps were undertaken to build the model effectively. Audio data was preprocessed to standardise sampling rates and normalise amplitudes, ensuring consistent input. Feature engineering was carried out to extract meaningful acoustic features from each audio chunk. Exploratory data analysis helped us better design the extraction of some features such as silent regions within the audio. Finally, hyperparameter tuning was conducted during the cross-validation phase to optimise model performance, resulting in better generalisation to unseen data. This multi-step methodology helped ensure that the models were trained effectively and evaluated rigorously.
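The preprocessing idea can be sketched as follows; the 22050 Hz target rate is an assumption used for illustration, not necessarily the exact rate chosen in this project.

```python
import librosa

def preprocess_audio(path, target_sr=22050):
    """Resample a recording to a common rate and peak-normalise its amplitude."""
    x, sr = librosa.load(path, sr=target_sr)  # librosa resamples on load
    x = librosa.util.normalize(x)             # scale peak amplitude to 1
    return x, sr
```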
4 Implemented ML prediction pipelines¶
The implemented machine learning prediction pipeline processes audio data to classify spoken stories as either true or deceptive. The pipeline begins by normalising the extracted acoustic features, such as pauses, silence durations, and pitch variability, to ensure consistency across all input data.
Three machine learning models were explored:
- Logistic Regression
- Support Vector Machines (SVM)
- k-Nearest Neighbors (k-NN)
Each model was tuned using GridSearchCV with GroupKFold cross-validation to prevent data leakage and identify the optimal hyperparameters. The best-performing model, an SVM with an RBF kernel, was selected based on cross-validation accuracy and subsequently evaluated on a separate test set to assess its generalisation performance. The pipeline ensures a systematic approach to feature preparation, model training, and evaluation.
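The tuning step can be sketched roughly as follows. The feature arrays are random stand-ins and the parameter grid is illustrative, not the exact grid searched in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in features, labels, and source-recording IDs for illustration only
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 5))
y_train = rng.integers(0, 2, size=40)
file_ids = np.repeat(np.arange(10), 4)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=5), scoring="accuracy")
search.fit(X_train, y_train, groups=file_ids)  # groups are routed to GroupKFold
print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", round(search.best_score_, 3))
```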
4.1 Transformation Stage¶
The transformation stage of the pipeline focuses on two main areas: the preprocessing stage, and the feature extraction stage to prepare the raw audio data for the machine learning models. The input to this stage consists of 30-second chunks of audio data, sampled at a uniform rate. The output is a set of numerical features that capture key acoustic characteristics of the audio, including the number of pauses, total silence duration, pause-to-speech ratio, pitch range, and pitch variability.
According to Rockwell, Buller, and Burgoon (1997), deceivers may exhibit increased pitch variety, potentially as a strategy to appear more truthful and expressive. Loy, Rohde, and Corley (2018) found an increase in filled pauses when lying, especially for complex lies that require greater cognitive effort to fabricate. These findings were the main factors behind our feature selection. Studies also point towards speech rate (tempo) as another strong indicator of deception; however, this is a more complicated feature to extract and, due to time constraints, it was not included in this project.
Language is another key feature that could help our models with their classification, as different languages have distinct speech patterns. However, our project's dataset is heavily skewed towards English speakers (78 of the 100 audio files are in English), so we will not use language as an extracted feature. In future work, this would be a useful feature to explore, to understand how it impacts our models.
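As a sketch of how such features could be extracted for a single chunk with librosa (the silence threshold and pitch search range below are illustrative assumptions, not the exact settings used in this project):

```python
import numpy as np
import librosa

def extract_chunk_features(x, fs, top_db=30):
    """Sketch of pause- and pitch-based features for one 30-second chunk.
    The silence threshold (top_db) and the pitch range below are assumptions."""
    # Non-silent intervals; gaps between them are treated as pauses
    intervals = librosa.effects.split(x, top_db=top_db)
    speech_duration = sum((end - start) for start, end in intervals) / fs
    total_duration = len(x) / fs
    silence_duration = total_duration - speech_duration
    num_pauses = max(len(intervals) - 1, 0)
    pause_to_speech = silence_duration / speech_duration if speech_duration > 0 else 0.0

    # Fundamental frequency track for pitch range and variability
    f0, voiced_flag, _ = librosa.pyin(x, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=fs)
    voiced_f0 = f0[~np.isnan(f0)]
    pitch_range = float(voiced_f0.max() - voiced_f0.min()) if voiced_f0.size else 0.0
    pitch_std = float(voiced_f0.std()) if voiced_f0.size else 0.0

    return [num_pauses, silence_duration, pause_to_speech, pitch_range, pitch_std]
```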
After feature extraction, the data is normalised using StandardScaler to standardise the feature distributions, ensuring that all predictors are on a consistent scale. This step is particularly crucial for machine learning models like SVM and k-NN, which are sensitive to feature magnitudes. The transformation stage ensures that the data is both interpretable and compatible with the machine learning models, enabling effective training and evaluation. By focusing on these specific features, the pipeline emphasises interpretability while leveraging acoustic properties relevant to the challenge of distinguishing true from deceptive stories.
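The effect of scaling can be illustrated with a few made-up feature rows (the values below are not from the project's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: (num_pauses, silence_s, pause_ratio, pitch_range_hz, pitch_std_hz)
X = np.array([[3, 12.0, 0.40, 180.0, 25.0],
              [7, 20.5, 0.90, 220.0, 40.0],
              [2,  8.0, 0.30, 150.0, 18.0],
              [6, 18.0, 0.80, 240.0, 45.0]])

X_scaled = StandardScaler().fit_transform(X)
# Each column now has zero mean and unit variance, so pitch range (hundreds of Hz)
# no longer dominates pause counts (single digits) in distance-based models.
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))
```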
4.2 Model stage¶
The machine learning models implemented for this project are Support Vector Machines (SVM), Logistic Regression, and k-Nearest Neighbors (k-NN). These models were selected for their complementary strengths, their ability to handle the relatively small dataset and numerical features extracted from the audio data, and their ease of use due to their availability in the scikit-learn library.
4.2.1 Logistic Regression¶
Logistic Regression was included as a baseline model due to its simplicity, interpretability, and availability in scikit-learn. It assumes a linear relationship between the input features and the target variable, which provides a useful benchmark for evaluating the complexity of the problem. Its straightforward implementation allowed for rapid prototyping and ensured that initial results could be obtained quickly, saving time for more complex analyses.
4.2.2 Support Vector Machines (SVM)¶
SVM was chosen due to its effectiveness in handling non-linear relationships through the use of kernels, such as the radial basis function (RBF) kernel. Given the complexity of audio-derived features and the possibility of non-linear decision boundaries between true and deceptive stories, SVM provides a robust framework for classification. It also performs well in high-dimensional spaces, which is beneficial for handling the multiple acoustic features extracted in this project.
4.2.3 k-Nearest Neighbors (k-NN)¶
k-NN was selected for its simplicity and ability to model local relationships in the data. As a non-parametric method, k-NN relies on the proximity of data points in feature space, which may be advantageous when working with small datasets and features like pauses and pitch variability that capture local variations. Its implementation in scikit-learn made it easy to test and tune hyperparameters such as the number of neighbors and distance metric, enabling efficient experimentation without unnecessary complexity.
4.2.4 Ease of Implementation¶
One of the primary reasons these models were chosen was their accessibility through the scikit-learn library, which offers straightforward implementations of these algorithms with consistent syntax and extensive documentation. This reduced the time required for setting up and running experiments, allowing for a focus on feature engineering, evaluation, and hyperparameter tuning. Given the project's tight timeline, this ease of use was a critical factor in selecting models that could be implemented and evaluated efficiently, without introducing undue complexity.
These models provide a balance of robustness, simplicity, and practicality, ensuring that meaningful results could be obtained within the constraints of the project timeline. Through cross-validation and hyperparameter tuning, the best-performing model (SVM) was identified, reflecting its suitability for capturing the underlying structure of the data.
4.3 Ensemble stage¶
For this project, an ensemble approach was not implemented due to time constraints and the focus on building and evaluating individual models (SVM, Logistic Regression, and k-NN). Each of these models was trained and optimised independently, and the best-performing model (SVM) was selected based on cross-validation results for evaluation on the test set.
Although ensembles were not included in the current implementation, they present a valuable future avenue for exploration. Ensemble methods, such as voting classifiers, bagging (e.g., Random Forest), or boosting (e.g., Gradient Boosting, AdaBoost), could potentially improve the overall performance by combining the strengths of individual models. For example:
- Soft Voting Ensembles could combine the probabilistic outputs of SVM, Logistic Regression, and k-NN to create a more robust classifier.
- Stacking Ensembles could use a meta-model, such as Logistic Regression, to learn how to best combine the predictions of the base models.
Ensemble methods are particularly useful in scenarios where individual models capture different aspects of the data and their errors are uncorrelated. By combining their predictions, ensembles can often achieve higher accuracy and robustness than any single model.
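To make this future direction concrete, a soft-voting ensemble of the three models could be sketched as follows (this was not implemented or evaluated in this project):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Soft voting averages the predicted class probabilities of the base models;
# SVC needs probability=True to expose predict_proba.
voting_clf = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="soft",
)
# voting_clf would then be fitted and evaluated with the same GroupKFold
# scheme used for the individual models.
```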
5 Dataset¶
For this project, we are using data from the MLEnd Small Deception dataset. This dataset, created by the Data Science and AI Teaching Group at Queen Mary University of London, is a collection of 100 audio files. Each audio file is roughly two to four minutes long (the exact range is explored in Section 5.2) and contains a story narrated by a student or faculty member at the University. Each story is either true or deceptive (made up).
In addition to the 100 audio files, the dataset also includes a CSV file which contains the Language and Story Type (whether the story was truthful or deceptive) for each audio file.
5.1 MLEnd library setup and data download¶
# Showing that the MLEnd library is installed on our system
!pip show mlend
# If you do not have the library installed, uncomment the line below (ensure it is version 1.0.0.4)
# !pip install mlend==1.0.0.4
Name: mlend
Version: 1.0.0.4
Summary: MLEnd Datasets
Home-page: https://MLEndDatasets.github.io
Author: Jesús Requena Carrión and Nikesh Bajaj
Author-email: nikkeshbajaj@gmail.com
License: MIT
Location: c:\programdata\anaconda3\envs\py310\lib\site-packages
Requires: joblib, matplotlib, numpy, pandas, scipy, spkit
Required-by:
Now, we will import the library along with some key functions, and download the dataset.
Note: The downloaded dataset requires approximately 1.8GB of available disk space.
# Importing the MLEnd library and functions
import mlend
from mlend import download_deception_small, deception_small_load
# Downloading the "small" deception dataset
datadir = download_deception_small(save_to='MLEnd', subset={}, verbose=1, overwrite=False)
Downloading 100 stories (audio files) from https://github.com/MLEndDatasets/Deception
100%|▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓|100\100|00100.wav
Done!
5.2 Preliminary data exploration¶
We will now explore our downloaded files to understand the dataset a bit further before beginning our experimentation.
import glob
# Checking how many audio files we have
sample_path = r'MLEnd\deception\MLEndDD_stories_small\*.wav'
files = glob.glob(sample_path)
print("Number of downloaded audio files:",len(files))
Number of downloaded audio files: 100
import librosa
import matplotlib.pyplot as plt
import matplotlib.style as style
# Finding the ranges of durations for our audio files
durations = []
for f in files:
    # Load at the native sampling rate and record each file's duration
    x, fs = librosa.load(f, sr=None)
    durations.append(librosa.get_duration(y=x, sr=fs))
print("Shortest audio file duration (in seconds):", round(min(durations),1))
print("Longest audio file duration (in seconds):", round(max(durations),1))
# Plotting the distribution of durations
style.use("seaborn-v0_8")
plt.hist(durations, bins=10, edgecolor="black", alpha=0.7)
plt.title("Distribution of audio durations", fontsize = 12)
plt.xlabel("Duration (seconds)", fontsize = 10, fontweight = "bold")
plt.ylabel("Frequency", fontsize = 10, fontweight = "bold")
plt.show()
Shortest audio file duration (in seconds): 43.4
Longest audio file duration (in seconds): 247.6
We can see that most audio files are around 110 to 150 seconds long, and there is a wide distribution of durations, ranging from as low as 43 seconds to 248 seconds. We will need to keep this in mind when we eventually start experimenting with these files.
We will now import the CSV file into a Pandas dataframe, and use it to describe the distribution of languages and story types.
import pandas as pd
# Loading the CSV file into a Pandas dataframe
MLEND_df = pd.read_csv(r'MLEnd\deception\MLEndDD_story_attributes_small.csv')
language_counts = MLEND_df["Language"].value_counts()
story_type_counts = MLEND_df["Story_type"].value_counts()
display(MLEND_df)
display(language_counts)
print("Number of unique languages in the dataset:", len(MLEND_df["Language"].unique()))
display(story_type_counts)
| | filename | Language | Story_type |
|---|---|---|---|
| 0 | 00001.wav | Hindi | deceptive_story |
| 1 | 00002.wav | English | true_story |
| 2 | 00003.wav | English | deceptive_story |
| 3 | 00004.wav | Bengali | deceptive_story |
| 4 | 00005.wav | English | deceptive_story |
| ... | ... | ... | ... |
| 95 | 00096.wav | English | deceptive_story |
| 96 | 00097.wav | English | true_story |
| 97 | 00098.wav | English | deceptive_story |
| 98 | 00099.wav | English | true_story |
| 99 | 00100.wav | English | deceptive_story |

100 rows × 3 columns
Language
English              78
Hindi                 4
Arabic                3
Chinese, Mandarin     2
Marathi               2
Bengali               1
Kannada               1
French                1
Russian               1
Portuguese            1
Spanish               1
Swahilli              1
Telugu                1
Korean                1
Cantonese             1
Italian               1
Name: count, dtype: int64
Number of unique languages in the dataset: 16
Story_type
deceptive_story    50
true_story         50
Name: count, dtype: int64
There are 16 unique languages in the dataset, with English being the most commonly spoken (78 audio files). Additionally, we have an even split of story types, with true and deceptive stories each making up half of the dataset.
5.3 Data preparation and training/test data split¶
5.3.1 Creating 30 second chunks¶
In this project, we are specifically required to build a machine learning model that takes a 30-second audio clip and predicts whether the story being narrated is true or not.
However, as seen in section 5.2, the durations of our audio files vary widely. Therefore, we will first need to split all of our audio files into 30-second chunks to ensure data consistency and computational efficiency. This strikes a balance between retaining meaningful context and keeping the data manageable for both feature extraction and our machine learning models.
This raises a new challenge: as our audio files are not perfectly divisible by 30 seconds, we must decide on an approach for dividing every file into 30-second chunks. The approach used in this project is as follows:
- Round the audio length down to the nearest multiple of 30 seconds.
- Subtract this from the original length to get the excess duration.
- Divide the excess by two, and trim that amount from both the start and the end.
For example, the 247.6-second file rounds down to 240 seconds, leaving 7.6 seconds of excess, so 3.8 seconds are trimmed from each end, giving exactly eight 30-second chunks.
Let us demonstrate this approach practically:
import IPython.display as ipd
# Function to trim audio evenly from the start and tail
def trim_audio_equal(audio_file):
    # Load at the file's native sampling rate
    x, fs = librosa.load(audio_file, sr=None)
    total_duration = librosa.get_duration(y=x, sr=fs)
    # Calculate the excess audio duration beyond a multiple of 30 seconds
    excess_duration = total_duration % 30
    # Trimming half of the excess from the start and half from the end
    trim_amount = int((excess_duration / 2) * fs)
    total_trimmed = (trim_amount * 2) / fs
    # Slice by explicit end index so files already divisible by 30 are returned untrimmed
    trimmed_audio = x[trim_amount : len(x) - trim_amount]
    return trimmed_audio, total_trimmed, fs, x
file_number = 1
trimmed_audio, amount, fs, x = trim_audio_equal(files[file_number - 1])
print("Original audio from file number " + str(file_number) + ":")
display(ipd.Audio(x, rate=fs))
print("File number", file_number, "has successfully been trimmed.\n",
round(amount, 2), "seconds removed.")
display(ipd.Audio(trimmed_audio, rate=fs))
Original audio from file number 1: