Assignment 4: Clustering vowels

This is a draft. Please do not work on it before it is released.

Deadline: July 3, 10:00 CEST.

This exercise is about clustering, the evaluation of clustering results, and the use of clustering to improve classification. We will cluster vowels based on their first and second formants and evaluate the results in two ways: with an intrinsic measure, the silhouette score, which evaluates the results based only on the clustering configuration, and with a set of extrinsic measures (homogeneity, completeness and V-measure), which require gold-standard labels. We will also use the clustering, learned on a larger unlabeled data set, to improve classification results on a smaller labeled data set.
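As a rough illustration of the two kinds of evaluation, the sketch below clusters synthetic two-dimensional points (a stand-in for formant pairs, not the assignment data) and computes both the silhouette score and the label-based measures with scikit-learn. The use of KMeans, the number of clusters, and all variable names are assumptions made for this example; the assignment does not prescribe them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    homogeneity_completeness_v_measure,
    silhouette_score,
)

rng = np.random.default_rng(0)

# Synthetic stand-in for (F1, F2) pairs: three well-separated blobs.
centers = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 800.0]])
gold = np.repeat(np.arange(3), 50)  # hypothetical gold-standard labels
X = centers[gold] + rng.normal(scale=30.0, size=(150, 2))

# k-means is only one possible clustering method (an assumption here).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
pred = km.labels_

# Intrinsic: needs only the data and the cluster assignments.
sil = silhouette_score(X, pred)

# Extrinsic: compares the cluster assignments with gold labels.
h, c, v = homogeneity_completeness_v_measure(gold, pred)

print(f"silhouette={sil:.3f} homogeneity={h:.3f} "
      f"completeness={c:.3f} V-measure={v:.3f}")
```

The silhouette score lies between -1 and 1, while homogeneity, completeness and V-measure lie between 0 and 1. V-measure is the harmonic mean of homogeneity and completeness.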

Please implement all your solutions in the provided template vowels.py.

Data

The data comes from a recent study [1]. We are going to use only part of the data. All vowel instances in the part of the data set we use belong to a single speaker.

There are two data files. The file vowels-unlabeled.txt is a tab-separated file with two columns. Each row contains the first and second formants of a vowel instance in Hertz. The second file, vowels-labeled.txt, has the same format plus a third column with the class label (the vowel encoded in X-SAMPA).
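A minimal sketch of reading files in this format is given below. It assumes that the files have no header row and that every row is complete; the function name load_vowels and its labeled parameter are hypothetical and not part of the template. Quoting is disabled because X-SAMPA labels may contain characters such as a double quote.

```python
import csv


def load_vowels(path, labeled=False):
    """Read a tab-separated vowel file (assumed to have no header row).

    Returns (formants, labels): formants is a list of [F1, F2] floats,
    labels is a list of strings (empty when labeled is False).
    """
    formants, labels = [], []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines
                continue
            formants.append([float(row[0]), float(row[1])])
            if labeled:
                labels.append(row[2])
    return formants, labels
```

With the file names above, load_vowels("vowels-unlabeled.txt") would return the formants and an empty label list, and load_vowels("vowels-labeled.txt", labeled=True) would return both.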

You are encouraged to inspect and plot the data, and to try to understand it. Although this is not one of the tasks in the assignment, visualizing and understanding the data at hand often helps in finding better solutions.

Exercises

4.1 Clustering and cluster evaluation

Implement the cluster() function in the template, which

4.2 Classification

Implement the initial part of the classify() function in the template. The function should train a classifier on the labeled data set using 5-fold cross-validation, and print out the mean of the macro-averaged F1 scores of the classifier.

You can use any classifier from sklearn, or a neural network implemented with Keras.
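One possible way to obtain this score with sklearn is sketched below on synthetic data. The choice of logistic regression with feature scaling is an assumption made for illustration only, as are the variable names and the toy labels; any other sklearn classifier could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the labeled data: 3 classes, 40 instances each.
centers = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 800.0]])
idx = np.repeat(np.arange(3), 40)
X = centers[idx] + rng.normal(scale=60.0, size=(120, 2))
y = np.array(["i", "a", "u"])[idx]  # toy string labels

# Scaling is fitted inside each training fold because it is part of
# the pipeline that cross_val_score clones and refits.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 with a classifier uses stratified 5-fold splits;
# "f1_macro" is the macro-averaged F1 score.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(f"mean macro-F1: {scores.mean():.3f}")
```

The array scores holds one macro-averaged F1 value per fold, so its mean is the quantity to report.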

4.3 Semi-supervised classification with cluster labels

Extend the classify() function such that it uses the cluster labels assigned to the labeled data instances as additional features for classification. Your classifier should make use of the formant frequencies in the labeled data set and the cluster labels together. Again, print out the mean of the macro-averaged F1 scores of the classifier using 5-fold cross-validation.
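The sketch below shows one way such a feature matrix could be built, assuming a k-means model fitted on the unlabeled data and a one-hot encoding of the cluster labels. Both choices, the number of clusters, and all variable names are assumptions for illustration; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 800.0]])

# Synthetic stand-ins for the larger unlabeled and smaller labeled sets.
X_unlabeled = (centers[rng.integers(0, 3, size=300)]
               + rng.normal(scale=30.0, size=(300, 2)))
X_labeled = (centers[rng.integers(0, 3, size=30)]
             + rng.normal(scale=30.0, size=(30, 2)))

# Learn the clustering on the unlabeled data only.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_unlabeled)

# Assign each labeled instance to its nearest cluster.
cluster_ids = km.predict(X_labeled)

# One-hot encode the categorical cluster labels: row i of the identity
# matrix is the indicator vector for cluster i.
onehot = np.eye(km.n_clusters)[cluster_ids]

# Formants and cluster indicators side by side.
X_aug = np.hstack([X_labeled, onehot])
print(X_aug.shape)
```

One-hot encoding is used because cluster numbers are arbitrary identifiers; feeding them to a classifier as a single numeric column would impose an ordering that has no meaning.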

4.4 Semi-supervised classification using distances from cluster centers

Extend the classify() function as in 4.3. However, instead of using categorical cluster assignments, use the Euclidean distance of each vowel instance in the labeled data set from each of the cluster centroids as additional features.

Your classifier should be using a feature matrix similar to the following.

f1      f2       dist. from C1 centroid   dist. from C2 centroid   dist. from C7 centroid
381.1   1445.0   324.3                    40.2                     10.6

Again, print out the mean of the macro-averaged F1 scores of the classifier using 5-fold cross validation.
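A minimal sketch of computing such distance features follows, assuming the clustering was done with sklearn's KMeans, whose transform() method returns the distance of each instance to every cluster center. The data is synthetic, and the number of clusters and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 800.0]])

# Synthetic stand-ins for the unlabeled and labeled sets.
X_unlabeled = (centers[rng.integers(0, 3, size=300)]
               + rng.normal(scale=30.0, size=(300, 2)))
X_labeled = (centers[rng.integers(0, 3, size=30)]
             + rng.normal(scale=30.0, size=(30, 2)))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_unlabeled)

# Euclidean distance from each labeled instance to each centroid,
# shape (n_instances, n_clusters).
dists = km.transform(X_labeled)

# The same distances computed by hand, for clustering methods that
# expose centroids but no transform() method.
manual = np.linalg.norm(
    X_labeled[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)

# Columns: f1, f2, then one distance column per centroid.
X_aug = np.hstack([X_labeled, dists])
print(X_aug.shape)
```

Each row of X_aug then has the layout of the feature matrix shown above, with one distance column per cluster.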

References

[1] Gahl, S., and Baayen, R. H. (2019). Twenty-eight years of vowels: Tracking phonetic variation through young to middle age adulthood. Journal of Phonetics, 74, 42-54.