Speaker separation involves isolating individual speakers from a mixture of voices or background noise, a task known as the "cocktail party problem": the ability to focus on a specific sound while filtering out other distractions. In this analysis, we propose obtaining features present in the original data and then evaluating their impact on a model's ability to separate mixed audio streams.
The dataset is prepared so that these feature values can be used as predictor variables for several models (Logistic Regression, Decision Trees, SVM with both RBF and linear kernels, XGBoost, and AdaBoost) in order to identify the most influential features, that is, the features that lead to better separation. These results are then analyzed to determine which features affect the separation of the audio streams the most.
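The feature-based modeling step could be set up as follows. This is a minimal sketch using scikit-learn on synthetic placeholder data: the feature matrix, the separation-quality labels, and the model selection are all stand-ins for the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: rows = mixtures, columns = extracted audio features,
# label = 1 if the mixture was separated well, 0 otherwise (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A subset of the classifiers named in the text; XGBoost/AdaBoost would
# slot in the same way via their scikit-learn-compatible estimators.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
    "svm_linear": SVC(kernel="linear"),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```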
Initially, 400 audio streams are selected from the VoxCeleb dataset and combined pairwise to form 200 mixed utterances. After the mixtures are obtained, the pre-trained SpeechBrain model sepformer-whamr is used. This model separates each input mixture and produces two outputs that should be as close as possible to the original streams.
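The pairwise mixing step can be sketched with NumPy as below, assuming the two source signals are already loaded as mono waveform arrays at a common sample rate (file loading and resampling are omitted, and the toy sine "speakers" are placeholders for real VoxCeleb utterances).

```python
import numpy as np

def mix_pair(sig_a: np.ndarray, sig_b: np.ndarray) -> np.ndarray:
    """Overlap two mono waveforms into one mixture, zero-padding the shorter."""
    n = max(len(sig_a), len(sig_b))
    mix = np.zeros(n, dtype=np.float32)
    mix[: len(sig_a)] += sig_a
    mix[: len(sig_b)] += sig_b
    # Peak-normalize to avoid clipping when writing the mixture to a wav file.
    peak = np.abs(mix).max()
    return mix / peak if peak > 0 else mix

# Two toy "speakers": sine tones at different frequencies (placeholders).
t = np.linspace(0, 1, 16000, endpoint=False)
a = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
b = 0.5 * np.sin(2 * np.pi * 330 * t).astype(np.float32)
mixture = mix_pair(a, b)
print(mixture.shape)
```

Separation itself would then go through SpeechBrain's pretrained interface, e.g. `SepformerSeparation.from_hparams(source="speechbrain/sepformer-whamr")` followed by `separate_file` on the mixture path, which returns the estimated sources; that call is omitted here because it downloads model weights.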
A feature list is extracted from the 400 chosen audio streams, and the effect of these features on the model's capability to distinguish between multiple audio sources in a mixed recording is then assessed.
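Feature extraction might look like the following minimal sketch. The specific features here (zero-crossing rate, RMS energy, frame-energy variance) are illustrative assumptions, not the paper's actual feature list.

```python
import numpy as np

def simple_features(sig: np.ndarray, frame: int = 400) -> dict:
    """Compute a few hypothetical per-utterance features from a mono waveform."""
    # Fraction of adjacent samples where the signal changes sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(sig))) > 0))
    # Overall root-mean-square energy of the utterance.
    rms = float(np.sqrt(np.mean(sig ** 2)))
    # Variance of per-frame RMS energy, a rough loudness-dynamics measure.
    frames = sig[: len(sig) // frame * frame].reshape(-1, frame)
    energy_var = float(np.var(np.sqrt(np.mean(frames ** 2, axis=1))))
    return {"zcr": zcr, "rms": rms, "energy_var": energy_var}

# Toy input: a pure 220 Hz tone sampled at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
sig = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
feats = simple_features(sig)
print(feats)
```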
Two analysis techniques, permutation feature importance and SHAP values, are used to determine which features have the greatest effect on separation.
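The first of these techniques can be illustrated with scikit-learn's `permutation_importance` on synthetic data; SHAP values would follow a similar pattern via the `shap` package's explainers, which is omitted here. The data and classifier are stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in: only feature 0 carries signal about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature column in turn and measure the accuracy drop;
# a large drop means the model relied on that feature.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Feature 0 should dominate the importance ranking.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[0], result.importances_mean.round(3))
```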
Our hypothesis is that the features contributing the most to a good separation are invariant across datasets. To test this hypothesis, we obtain 1,000 audio streams from the Mozilla Common Voice dataset and apply the same experimental methodology described above. Our results demonstrate that the features extracted from the VoxCeleb dataset are indeed invariant and aid in separating the audio streams of the Mozilla Common Voice dataset.