Assignment 7: Text classification

Deadline: Aug 21, 10:00 CEST.

This task is about text classification. In particular, given a short text, we are interested in predicting the gender of the author.

Unlike the earlier assignments, this assignment is intentionally open-ended. You can use any of the methods we studied in class, as well as any data and resources beyond the data provided here.

The main challenge in this assignment is the size of the data set. Since our data set is small, achieving reliably good scores is rather difficult.

Data

The primary data set consists of short essays written by the class participants at the beginning of the semester. You will find two files in your repositories, one for training and one for testing. Both are provided as tab-separated-value files, where the column headers for the class label and the essay text are label and text, respectively. Note that a record may span multiple lines if the essay text contains newlines, in which case the text is quoted. We recommend using a library (e.g., Python's csv module) to read the data.
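As a minimal sketch of reading such a file with Python's csv module: the function name read_essays is hypothetical, and the code assumes UTF-8 encoding and the default double-quote character for quoted fields, neither of which is stated above.

```python
import csv

def read_essays(path):
    """Return a list of (label, text) pairs from a TSV file with a header row."""
    # newline="" lets the csv module itself handle newlines inside quoted fields.
    # The UTF-8 encoding is an assumption about the provided files.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(row["label"], row["text"]) for row in reader]
```

Splitting the file on newlines by hand would break records whose essay text contains line breaks, which is why a CSV-aware reader is the safer choice here.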

You should use essays-train.tsv for both training and tuning your model, and essays-test.tsv only for testing. It is OK to use the test file to test alternative models/systems, but it should not be used during the development/tuning process.
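One way to tune without touching the test file is k-fold cross-validation on the training data alone. The sketch below is an illustration under assumptions, not a required procedure: the helper names stratified_folds and train_dev_splits are hypothetical, and labels is assumed to be the list of class labels read from essays-train.tsv.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split indices 0..len(labels)-1 into k folds with similar label proportions."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_label):
        idx = by_label[y]
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds;
        # the offset keeps the fold sizes balanced across classes.
        for j, i in enumerate(idx):
            folds[(j + offset) % k].append(i)
        offset += len(idx)
    return folds

def train_dev_splits(labels, k=5, seed=0):
    """Yield (train_indices, dev_indices) pairs, one per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        dev = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, dev
```

Each training example lands in the development portion exactly once, so every hyperparameter setting can be scored on all of the training data while essays-test.tsv stays untouched until the final evaluation.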

Task

7.1 Implement, tune and test a text classification model

Implement one or more text classification methods, tune their hyperparameters, and test them on the provided test set.
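To make the implement/tune loop concrete, here is a minimal baseline sketch: a multinomial Naive Bayes classifier over whitespace tokens, with its smoothing constant alpha chosen on held-out development data. This is only one possible starting point, not necessarily a method covered in class. All function names, the tokenization, and the candidate alpha values are assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing (alpha > 0)."""
    word_counts = {}               # label -> Counter of token counts
    doc_counts = Counter(labels)   # label -> number of training documents
    vocab = set()
    for text, y in zip(texts, labels):
        tokens = text.lower().split()
        word_counts.setdefault(y, Counter()).update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "alpha": alpha}

def predict_nb(model, text):
    """Return the label with the highest log-probability for the given text."""
    n_docs = sum(model["doc_counts"].values())
    v = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for y in sorted(model["doc_counts"]):
        counts = model["word_counts"][y]
        total = sum(counts.values())
        score = math.log(model["doc_counts"][y] / n_docs)
        for tok in text.lower().split():
            if tok in model["vocab"]:  # tokens never seen in training are skipped
                score += math.log((counts[tok] + alpha) / (total + alpha * v))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

def accuracy(model, data):
    """Fraction of (label, text) pairs in data that the model classifies correctly."""
    return sum(predict_nb(model, t) == y for y, t in data) / len(data)

def tune_alpha(train, dev, alphas=(0.1, 0.5, 1.0, 2.0)):
    """Return (best_alpha, dev_accuracy); train and dev are lists of (label, text)."""
    train_labels = [y for y, _ in train]
    train_texts = [t for _, t in train]
    scored = [(accuracy(train_nb(train_texts, train_labels, a), dev), a)
              for a in alphas]
    best_acc, best_alpha = max(scored, key=lambda p: p[0])
    return best_alpha, best_acc
```

Both train and dev here should come from essays-train.tsv. Once alpha is fixed, the model can be retrained on the full training file and scored once on essays-test.tsv with accuracy.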

Since the data set is small, we strongly recommend that you find ways to make use of external data. Potentially useful external data include, but are not limited to, earlier data sets on author profiling or gender detection, pre-trained embeddings, or features based on clustering results on a large, unlabeled data set.

You are encouraged to try multiple models/systems and to organize your code the way you like. Please pay attention to good programming practices. You may lose points if your code is difficult to follow.

7.2 Report your results

Part of your grade will be based on a short (no more than 1000 words) report describing your approach and your results. Please write your report as a markdown document named report.md.

Your report should include the following.

Evaluation