Assignment 2: Classifying languages
Deadline: June 5, 10:00 CEST.
In this exercise we will try to predict the family of a language from a set of typological features.
All the assignments should be implemented in the bodies of the indicated functions in the provided template classify.py which also contains some additional information and clarifications.
Data
The data you will use for this assignment is in a tab-separated file named wals-train.tsv. The first row of the file is the header. The first two columns are for information only. The third column, with the header family, is the variable we want to predict. The rest of the columns are the features. A special feature value, NA, marks features that are unspecified. The data is a subset of the WALS database.
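The column layout described above can be sketched with plain string splitting. This is only an illustration: the function name read_wals, the sample column names other than family, and the sample values are all made up, and the real file may need extra care (for example, with empty lines).

```python
def read_wals(lines):
    """Split tab-separated lines into feature names, labels, and raw feature rows.

    Assumes the layout described above: a header row, two informational
    columns, the family column, then one column per feature.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    feature_names = header[3:]
    labels, rows = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        labels.append(fields[2])   # third column: language family
        rows.append(fields[3:])    # remaining columns: feature values
    return feature_names, labels, rows


# Hypothetical in-memory stand-in for wals-train.tsv.
sample = [
    "code\tname\tfamily\tfeat1\tfeat2\n",
    "abc\tLangA\tFamX\t1 Yes\tNA\n",
    "def\tLangB\tFamY\t2 No\t3 SOV\n",
]
feature_names, labels, rows = read_wals(sample)
```

With a real file, the same function could be called on an open file handle, since iterating over a text file yields its lines.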
Grading
The grade you will receive for this exercise has two parts. The first part, 8/10, is for the correct implementation of all the exercises. For this part, each exercise is equally weighted. The second part, 2/10, is based on the performance on a test set of the classifier you are asked to tune in exercise 2.3.
The performance score (2/10) is determined based on the macro-averaged F1 score on the test set, such that
- if your score is within ±1 standard deviation of the average, you will receive 1 point
- if your score is 1 standard deviation (or more) above the average, you will receive 2 points
- if your score is 1 standard deviation (or more) below the average, or if your code does not produce the expected output, you will receive 0 points
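As a rough illustration of the rule above, the mapping from a score to points can be written as a small function. The name performance_points is hypothetical, and the handling of scores exactly at ±1 standard deviation (counted here under the "or more" cases) is an assumption, since the list above leaves the boundary open. The case of code that does not produce the expected output is not modelled.

```python
def performance_points(score, mean, std):
    """Map a macro-averaged F1 score to 0, 1, or 2 points.

    Assumption: a score exactly one standard deviation away from the
    mean falls under the "1 standard deviation (or more)" cases.
    """
    if score >= mean + std:
        return 2
    if score <= mean - std:
        return 0
    return 1


# Made-up class statistics, for illustration only.
points_high = performance_points(0.80, mean=0.60, std=0.10)
points_mid = performance_points(0.65, mean=0.60, std=0.10)
points_low = performance_points(0.45, mean=0.60, std=0.10)
```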
Exercises
2.1 Encoding the data
Implement the function encode() in the template, which reads a data file as defined above and returns a tuple containing features and labels. features should be a two-dimensional numpy array where each row is the concatenation of the one-hot vectors of the corresponding value of each feature.
For example, if we had only two features, and the one-hot encoding of the first feature for a particular language was 0000001 and the one-hot encoding of the second feature value was 00100, the encoded features corresponding to the language would be 000000100100.
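The example above can be reproduced in a few lines. This sketch uses np.concatenate for brevity; whether that counts as a high-level library function for this exercise is for you to check against the template, and the variable names are made up.

```python
import numpy as np

# One-hot vectors of the two feature values from the example above.
first = np.array([0, 0, 0, 0, 0, 0, 1])
second = np.array([0, 0, 1, 0, 0])

# One row of the feature matrix is the concatenation of the two vectors.
row = np.concatenate([first, second])

# String form, only for comparison with the example in the text.
encoded = "".join(str(bit) for bit in row)
```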
Make sure that your function maps NA and any unknown value to a vector of all 0s.
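A minimal sketch of this behaviour for a single feature follows. The helper one_hot and the sample values are hypothetical; the point is only that NA and values outside the known set leave the vector at all zeros.

```python
import numpy as np


def one_hot(value, known_values):
    """Return the one-hot vector of value, or all zeros for NA/unknown values."""
    vector = np.zeros(len(known_values), dtype=int)
    if value != "NA" and value in known_values:
        vector[known_values.index(value)] = 1
    return vector


# Hypothetical values of one feature.
known = ["1 SOV", "2 SVO", "3 VSO"]
seen = one_hot("2 SVO", known)
missing = one_hot("NA", known)
unknown = one_hot("9 Other", known)
```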
The labels should be a list with the same length as the number of rows of features, where each value is the family of the corresponding language.
For the sake of the exercise, you are not allowed to use any high-level library function (for this exercise only). Normally, you should consider using well-known and well-tested library routines.
2.2 Training a simple classifier
Implement the function classify() in the template. This function should train and test a logistic regression classifier, using 2/3 of the data for training and 1/3 for testing. You are not required to tune your classifier in this exercise. However, you should pay attention to how you divide the data into training and test sets: make sure that all classes are represented in both the training and test sets (with a similar distribution).
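One possible way to get such a split is the stratify argument of scikit-learn's train_test_split. The sketch below runs on toy data; the variable names and the class sizes are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 12 "languages" from three families of unequal size.
features = np.arange(24).reshape(12, 2)
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 3

# stratify=labels keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=1 / 3, stratify=labels, random_state=0
)
```

Note that a stratified split raises an error if some class has only a single member, so families with very few languages may need separate handling in the real data.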
Your function should print macro-averaged precision, recall, and F1 score on the training and test sets in comparison to a random baseline and a majority-class baseline. The output should look like the following.
| | p_train | r_train | f_train | p_test | r_test | f_test |
|---|---|---|---|---|---|---|
| random | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| majority | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| classifier | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
p, r, and f refer to precision, recall, and F1 score.
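The scores for the three rows could be computed along the following lines. This is a sketch on toy data, not the expected solution: using DummyClassifier with strategy="uniform" for the random baseline and strategy="most_frequent" for the majority baseline is an assumption about what the baselines should be, and the printed layout only loosely follows the table above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support


def macro_prf(model, X, y):
    """Return macro-averaged (precision, recall, F1) of a fitted model."""
    p, r, f, _ = precision_recall_fscore_support(
        y, model.predict(X), average="macro", zero_division=0
    )
    return (p, r, f)


# Toy, already encoded data (hypothetical).
X_train = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_train = ["A", "A", "A", "B", "B", "B"]
X_test = np.array([[1, 0], [0, 1]])
y_test = ["A", "B"]

models = {
    "random": DummyClassifier(strategy="uniform", random_state=0),
    "majority": DummyClassifier(strategy="most_frequent"),
    "classifier": LogisticRegression(max_iter=1000),
}

results = {}
print("| | p_train | r_train | f_train | p_test | r_test | f_test |")
for name, model in models.items():
    model.fit(X_train, y_train)
    # Three training scores followed by three test scores.
    results[name] = macro_prf(model, X_train, y_train) + macro_prf(model, X_test, y_test)
    print("| " + name + " | " + " | ".join(f"{s:.2f}" for s in results[name]) + " |")
```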
We recommend using the scikit-learn library for implementing this exercise and the next two. If you would like to use another library, please contact the instructor and/or the tutors.
2.3 Tune a classifier for the task
Implement the function tune() in the template.
You need to look for a suitable classifier and tune its hyperparameters, so as to obtain the highest possible macro-averaged F1 score.
You are free to use any classifier from the Python scikit-learn library. Note that the hyperparameters will differ depending on the classifier you choose. You can use either a fixed training/development split or a k-fold cross-validation setup. Your function should return the best parameter values as a Python dictionary.
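A k-fold search of this kind could look like the following sketch, which uses GridSearchCV with the f1_macro scorer. The function name tune_sketch, the choice of logistic regression, the parameter grid, and the random toy data are placeholders, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def tune_sketch(X, y):
    """Grid-search one hyperparameter and return the best values as a dict."""
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},  # placeholder grid
        scoring="f1_macro",                  # macro-averaged F1
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    return search.best_params_


# Random toy data standing in for the encoded features and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 6))
y = np.array(["A", "B", "C"] * 10)
best_params = tune_sketch(X, y)
```

Here best_params_ is already a Python dictionary keyed by parameter name, which matches the required return type.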
2.4 Generating predictions
Implement the predict() function in the template, which takes two filename arguments, a training file and a test file, and returns the predictions as a list using the best model from exercise 2.3. The format of the test file is the same as the training file, but the language family column contains a placeholder, _.
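The last step of such a function might look like the sketch below, which starts from already encoded arrays rather than filenames. The name predict_sketch, the use of logistic regression, and the params dictionary are assumptions made for illustration. Since scikit-learn's predict() returns a numpy array, tolist() is used to obtain a plain list.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_sketch(train_X, train_y, test_X, params):
    """Fit a model with the given hyperparameters and return predictions as a list."""
    model = LogisticRegression(max_iter=1000, **params)
    model.fit(train_X, train_y)
    return model.predict(test_X).tolist()


# Toy, already encoded data (hypothetical).
train_X = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
train_y = ["A", "A", "A", "B", "B", "B"]
test_X = np.array([[1, 0], [0, 1]])
predictions = predict_sketch(train_X, train_y, test_X, {"C": 1.0})
```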
Tip: it is crucial for this exercise to encode the features in the training and test data consistently. Your encode() function should determine the one-hot codes on the training set, and use the same coding scheme for the test set.
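One way to keep the two encodings consistent is to build the coding scheme once from the training rows and then reuse it. In the sketch below, the helper names fit_codes and apply_codes and the toy rows are hypothetical, and how this fits into the encode() signature in the template is left open.

```python
import numpy as np


def fit_codes(rows):
    """Assign an output column to every (feature index, value) pair seen in rows."""
    codes = {}
    for j in range(len(rows[0])):
        for value in sorted({row[j] for row in rows if row[j] != "NA"}):
            codes[(j, value)] = len(codes)
    return codes


def apply_codes(rows, codes):
    """Encode rows with an existing scheme; NA and unseen values stay all zeros."""
    features = np.zeros((len(rows), len(codes)), dtype=int)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            column = codes.get((j, value))
            if column is not None:
                features[i, column] = 1
    return features


# Toy rows: the value "z" occurs only in the test data.
train_rows = [["a", "x"], ["b", "NA"], ["a", "y"]]
test_rows = [["b", "z"], ["NA", "x"]]

codes = fit_codes(train_rows)                   # determined on the training set only
train_features = apply_codes(train_rows, codes)
test_features = apply_codes(test_rows, codes)   # same columns as the training data
```

Because the scheme comes from the training rows alone, both matrices have the same number of columns, and a column means the same thing in each.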