Assignment 2: Classifying languages

Deadline: June 5, 10:00 CEST.

In this exercise we will try to predict the family of a language from a set of typological features.

All exercises should be implemented in the bodies of the indicated functions in the provided template classify.py, which also contains some additional information and clarifications.

Data

The data you will use for this assignment is in a tab-separated file named wals-train.tsv. The first row of the file is the header. The first two columns are for information only. The third column, with the header family, is the variable we want to predict. The rest of the columns are the features. A special feature value NA marks the features that are unspecified.
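
As an illustration of the file layout described above, here is a minimal sketch that splits such a file into labels and feature rows using only plain string operations. The helper name read_tsv, the sample column names, and the sample values are made up for the example; only the family header and the NA marker come from the description.

```python
import io

def read_tsv(fp):
    """Return (feature_names, labels, rows) from an open tab-separated file."""
    header = fp.readline().rstrip("\n").split("\t")
    feature_names = header[3:]          # columns 0 and 1 are informational, 2 is the label
    labels, rows = [], []
    for line in fp:
        if not line.strip():            # skip empty lines, if any
            continue
        fields = line.rstrip("\n").split("\t")
        labels.append(fields[2])        # the family column
        rows.append(fields[3:])         # raw feature values, possibly "NA"
    return feature_names, labels, rows

# Hypothetical sample data, standing in for wals-train.tsv.
sample = io.StringIO(
    "code\tname\tfamily\tf1\tf2\n"
    "aaa\tLanguage A\tFamilyX\t1 SOV\tNA\n"
    "bbb\tLanguage B\tFamilyY\t2 SVO\t3 Yes\n"
)
feature_names, labels, rows = read_tsv(sample)
```

With a real file, the same function could be called on the object returned by open("wals-train.tsv", encoding="utf-8").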

The data is a subset of the WALS database.

Grading

The grade you receive for this assignment has two parts. The first part, 8/10, is for the correct implementation of all the exercises; each exercise is weighted equally. The second part, 2/10, is based on the test-set performance of the classifier you are asked to tune in exercise 2.3.

The performance score (2/10) is determined based on the macro-averaged F1 score on the test set, such that

if your score is within ±1 standard deviation of the average, you will receive 1 point
if your score is 1 standard deviation (or more) above the average, you will receive 2 points
if your score is 1 standard deviation (or more) below the average, or if your code does not produce the expected output, you will receive 0 points
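
The rule above can be sketched as a small function. This is only an illustration: the function name performance_points is hypothetical, and the handling of scores that fall exactly on a boundary (assigned here to the "or more" cases) is an assumption, not a statement of the grading policy.

```python
def performance_points(score, mean, std):
    """Points for the performance part, given the class average and standard deviation.

    `score` is None when the code did not produce the expected output.
    """
    if score is None:
        return 0
    if score >= mean + std:     # assumption: the boundary counts as "or more above"
        return 2
    if score <= mean - std:     # assumption: the boundary counts as "or more below"
        return 0
    return 1                    # strictly within one standard deviation
```

For instance, with an average of 0.5 and a standard deviation of 0.1, a score of 0.55 would fall in the middle band.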

Exercises

2.1 Encoding the data

Implement the function encode() in the template that reads a data file as defined above, and returns a tuple containing features and labels. features should be a two-dimensional numpy array where each row is the concatenation of the one-hot vectors for the value of each feature. For example, if we had only two features, and the one-hot encoding of the first feature for a particular language was 0000001 and the one-hot encoding of the second feature value was 00100, the encoded features corresponding to the language would be 000000100100. Make sure that your function maps NA and any unknown value to a vector of all 0s.

The labels should be a list with one entry per row of features, where each value is the family of the corresponding language.
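
The two-feature example above can be reproduced with a few lines of NumPy. This is a sketch of the concatenation step only, not of the full encode() function; the helper one_hot and the chosen indices are illustrative, and None stands in for NA or an unknown value.

```python
import numpy as np

def one_hot(index, size):
    """One-hot vector of length `size`; all zeros when `index` is None (NA/unknown)."""
    vec = np.zeros(size, dtype=int)
    if index is not None:
        vec[index] = 1
    return vec

first = one_hot(6, 7)        # first feature: 7 possible values, the last one is set
second = one_hot(2, 5)       # second feature: 5 possible values, the third one is set
encoded = np.concatenate([first, second])
as_text = "".join(str(bit) for bit in encoded)

missing = one_hot(None, 5)   # NA or an unknown value maps to all zeros
```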

For the sake of the exercise, you are not allowed to use any high-level library function (for this exercise only). Normally, you should consider using well-known and well-tested library routines.

2.2 Training a simple classifier

Implement the function classify() in the template. This function should train and test a logistic regression classifier, using two thirds of the data for training and one third for testing. You are not required to tune your classifier in this exercise. However, you should pay attention to how you divide the data into training and test sets. You should make sure that all classes are represented in both the training and test sets (with a similar distribution).
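
One possible way to obtain such a split is the stratify argument of scikit-learn's train_test_split, sketched below on synthetic data. The random features and the class names A, B, and C are placeholders, not the WALS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the family labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 8))
y = ["A"] * 15 + ["B"] * 9 + ["C"] * 6

# stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)
```

Note that a stratified split raises an error if some class has only a single member, so very rare families may need separate handling.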

Your function should print macro-averaged precision, recall, and F1 score on the training and test sets in comparison to a random baseline and a majority-class baseline. The output should look like the following.

	p_train	r_train	f_train	p_test	r_test	f_test
random	0.0	0.0	0.0	0.0	0.0	0.0
majority	0.0	0.0	0.0	0.0	0.0	0.0
classifier	0.0	0.0	0.0	0.0	0.0	0.0

p, r, and f refer to precision, recall, and F-score, respectively.
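
A table of this shape could be produced roughly as follows. This is a sketch on synthetic data, under the assumption that scikit-learn's DummyClassifier is an acceptable way to build the two baselines; the helper macro_prf and the toy features are made up for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def macro_prf(model, X, y):
    """Macro-averaged precision, recall, and F1 of `model` on (X, y)."""
    p, r, f, _ = precision_recall_fscore_support(
        y, model.predict(X), average="macro", zero_division=0
    )
    return [p, r, f]

# Synthetic data: two columns carry the class signal, the rest are noise.
rng = np.random.default_rng(0)
y = np.array(["A"] * 30 + ["B"] * 18 + ["C"] * 12)
X = rng.integers(0, 2, size=(60, 6))
X[:, 0] = (y == "A")
X[:, 1] = (y == "B")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

models = {
    "random": DummyClassifier(strategy="uniform", random_state=0),
    "majority": DummyClassifier(strategy="most_frequent"),
    "classifier": LogisticRegression(max_iter=1000),
}
results = {}
print("\t".join(["", "p_train", "r_train", "f_train", "p_test", "r_test", "f_test"]))
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = macro_prf(model, X_train, y_train) + macro_prf(model, X_test, y_test)
    print(name + "\t" + "\t".join(f"{s:.3f}" for s in results[name]))
```

Setting zero_division=0 avoids warnings for classes that a baseline never predicts; their precision is then counted as 0 in the macro average.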

We recommend using the scikit-learn library for this exercise and the next two. If you would like to use another library, please contact the instructor and/or the tutors.

2.3 Tune a classifier for the task

Implement the function tune() in the template.

You need to look for a suitable classifier and tune its hyperparameters, so as to obtain the highest possible macro-averaged F1 score.

You are free to use any classifier from the Python scikit-learn library. Note that the hyperparameters will differ depending on the classifier you choose. You can use a fixed training/development split, or a k-fold setup. Your function should return the best parameter values as a Python dictionary.
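
As one example of a k-fold setup, the sketch below runs a grid search that optimizes macro-averaged F1 and returns the best parameters as a dictionary. The function name tune_sketch, the choice of logistic regression, the grid of C values, and the synthetic data are all assumptions made for illustration; the actual signature of tune() is defined in the template.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_sketch(X, y):
    """Grid search over the regularization strength C, scored by macro F1."""
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="f1_macro",
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    return search.best_params_

# Synthetic stand-in for the encoded training data.
rng = np.random.default_rng(0)
y = np.array(["A"] * 30 + ["B"] * 18 + ["C"] * 12)
X = rng.integers(0, 2, size=(60, 6))
X[:, 0] = (y == "A")
X[:, 1] = (y == "B")

best = tune_sketch(X, y)
```

The same pattern applies to other classifiers; only the estimator and the keys of param_grid change.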

2.4 Generating predictions

Implement the predict() function in the template which takes two filename arguments, a training file and a test file, and returns the predictions as a list using the best model from exercise 2.3. The format of the test file is the same as the training file, but the language family column contains a placeholder _.

Tip: it is crucial for this exercise to encode the features in the training and test data consistently. Your encode() function should determine the one-hot codes on the training set, and use the same coding scheme for the test set.
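
The idea in the tip can be sketched as two separate steps: derive the coding scheme from the training rows, then apply that same scheme to any other rows. The helper names fit_codes and apply_codes and the toy feature values are hypothetical, and the sketch works on in-memory rows rather than files.

```python
import numpy as np

def fit_codes(rows):
    """Assign one output column to every (feature index, value) pair seen in `rows`."""
    pairs = sorted(
        {(col, value) for row in rows for col, value in enumerate(row) if value != "NA"}
    )
    # Sorting by feature index first keeps each feature's columns contiguous.
    return {pair: j for j, pair in enumerate(pairs)}

def apply_codes(rows, codes):
    """Encode `rows` with an existing scheme; NA and unseen values stay all zeros."""
    X = np.zeros((len(rows), len(codes)), dtype=int)
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            j = codes.get((col, value))
            if j is not None:
                X[i, j] = 1
    return X

train_rows = [["SOV", "2"], ["SVO", "NA"], ["SOV", "1"]]
test_rows = [["VSO", "1"], ["SVO", "3"]]   # "VSO" and "3" never occur in training

codes = fit_codes(train_rows)              # determined on the training set only
X_train = apply_codes(train_rows, codes)
X_test = apply_codes(test_rows, codes)     # same columns, same order
```

Because the scheme is fixed by the training data, X_train and X_test always have the same number of columns, which is what a classifier trained on X_train expects.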