Assignment 5: German compound splitting with RNNs
Deadline: July 17, 10:00 CEST.
In this exercise we are going to work on German compound splitting. The main objective of the assignment is to become (more) familiar with recurrent neural networks (RNNs). The task at hand is a sequence labeling task; hence, the method you will be working on is equally applicable to any task that can be cast as sequence labeling (e.g., POS tagging, named entity recognition, even parsing).
In this assignment we will implement a variant of “I/O/B tokenization.” Our aim is to label each character with one of the following labels.
- B if the letter is at the beginning of a word (part of a compound). For example, for the compound ‘Lehrprogramm’, the letters ‘L’ and ‘p’ should be labeled with ‘B’.
- I if the letter is inside a word (except the first letter).
- O if the letter is outside of a word (or token). For tokenization tasks, non-token characters (e.g., white space) are labeled with ‘O’. For most segmentation tasks, this label is not necessary; however, we will use it to label the beginning- and end-of-sequence symbols.
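To make the labeling concrete, here is a minimal sketch of the rule in Python (the helper name `iob_labels` is illustrative and not part of the template):

```python
def iob_labels(segmented):
    """Return the IOB label string for a space-segmented word:
    'B' for the first letter of each part, 'I' for the rest."""
    return ''.join('B' + 'I' * (len(part) - 1)
                   for part in segmented.split())

# 'lehr programm' -> 'BIIIBIIIIIII'; 'O' labels only appear later,
# on the beginning- and end-of-sequence symbols added during padding.
assert iob_labels('lehr programm') == 'BIIIBIIIIIII'
```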
As before, please implement all of the exercises in the provided template as specified in the instructions below.
Data
The data comes from a recent study [1]. The words were extracted from GermaNet. The data files contain one word per line, where the parts of compound words are separated by a space character. All words were converted to lowercase.
We use a training and a test file. The test file should be used only for testing. There is no specified development set. You should use part of the training data for any tuning you perform.
Exercises
5.1 Data preparation, encoding
This exercise has two steps: first, read the data file and create the appropriate labels; second, encode the characters and labels using one-hot encoding.
- Implement the function `read_data()` in the template, which reads a segmented file and returns two sequences of equal length (sketched below, after this list):
  - A sequence of words, where all spaces are removed. Each word should be prepended with a beginning-of-sequence symbol (function argument `bos`), and, if the argument `pad` is true, appended with as many end-of-sequence symbols (`eos`) as necessary to make the word length (excluding `bos`) equal to the parameter `max_len` if specified, or to the length of the longest word in the data.
  - A sequence of strings that contain the appropriate label for each character of each word in the sequence of words described above.
- Implement the class `InputOutputEncoder` in the template, which encodes both the input (a sequence of words) and the output (a sequence of label strings). The interface of the class is similar to the one you implemented in the previous assignment. The method `fit()` should create the structures necessary to consistently encode given words and labels, and the method `transform()` should take two (padded) sequences, words and labels, and return two three-dimensional numpy arrays of shape `(N, m, k)`, where `N` is the number of words in the data, `m` is the length of the (padded) words from the previous step, and `k` is the encoded length of characters or labels (`k` should be 3 for the encoded labels, and the number of unique characters in the padded words for the encoded words). A possible implementation is also sketched below.
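For orientation, here is one possible shape for `read_data()`. This is a sketch only: the default `bos`/`eos` symbols, the file encoding, and the labeling of `bos`/`eos` with ‘O’ are assumptions based on the description above, not requirements of the template.

```python
def read_data(filename, bos='<', eos='>', pad=True, max_len=None):
    """Read a segmented file (one word per line, compound parts
    separated by spaces) and return (words, labels)."""
    words, labels = [], []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue
            words.append(bos + ''.join(parts))  # bos is labeled 'O'
            labels.append('O' + ''.join(
                'B' + 'I' * (len(p) - 1) for p in parts))
    if pad:
        if max_len is None:  # longest word in the data, excluding bos
            max_len = max(len(w) - 1 for w in words)
        words = [w + eos * (max_len - len(w) + 1) for w in words]
        labels = [l + 'O' * (max_len - len(l) + 1) for l in labels]
    return words, labels
```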
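Similarly, a possible sketch of `InputOutputEncoder`, assuming all sequences passed to `transform()` have already been padded to a common length:

```python
import numpy as np

class InputOutputEncoder:
    """One-hot encoder for character sequences and their IOB labels."""

    def fit(self, words, labels):
        chars = sorted(set(''.join(words)))  # includes bos/eos symbols
        self.char_idx = {c: i for i, c in enumerate(chars)}
        self.label_idx = {'B': 0, 'I': 1, 'O': 2}
        return self

    def transform(self, words, labels):
        n, m = len(words), len(words[0])
        x = np.zeros((n, m, len(self.char_idx)))   # k = #unique chars
        y = np.zeros((n, m, len(self.label_idx)))  # k = 3 for labels
        for i, (word, lab) in enumerate(zip(words, labels)):
            for j, (c, l) in enumerate(zip(word, lab)):
                x[i, j, self.char_idx[c]] = 1.0
                y[i, j, self.label_idx[l]] = 1.0
        return x, y
```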
5.2 Building and training an RNN
Implement the method `train_rnn()` in the template, which takes words and labels encoded according to the description in 5.1, and returns a trained recurrent network model that predicts the label of each character in the given (encoded) word list. Use “early stopping” to prevent overfitting.

In such networks, it is common practice to use an embedding layer before the recurrent layer. For the sake of the exercise, please do not use an embedding layer. You are free to use any recurrent architecture (e.g., GRU, LSTM). You may, but are not required to, use a bidirectional RNN. You are encouraged to experiment with different hyperparameters.
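A minimal sketch of such a model in Keras (assuming the course setup uses Keras; the layer size, patience, and validation split are illustrative choices, not tuned recommendations, and the bidirectional wrapper is optional as noted above):

```python
from tensorflow import keras

def train_rnn(x, y):
    """Train a character-level sequence labeler on one-hot inputs x
    of shape (N, m, k_chars) and targets y of shape (N, m, 3)."""
    model = keras.Sequential([
        keras.layers.Input(shape=x.shape[1:]),  # no embedding layer
        keras.layers.Bidirectional(
            keras.layers.LSTM(64, return_sequences=True)),
        keras.layers.TimeDistributed(
            keras.layers.Dense(y.shape[-1], activation='softmax')),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    early = keras.callbacks.EarlyStopping(  # stop on held-out loss
        monitor='val_loss', patience=3, restore_best_weights=True)
    model.fit(x, y, validation_split=0.1, epochs=50, callbacks=[early])
    return model
```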
5.3 Segmenting compounds
Implement the function `segment()`, which takes a list of unsegmented (padded) words and the corresponding labels, and returns the segmented words.
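One way to realize the decoding (a sketch; it assumes ‘O’-labeled positions, i.e. the `bos`/`eos` symbols, are simply dropped, and that a stray leading ‘I’ starts a new part):

```python
def segment(words, labels):
    """Reconstruct space-separated segmentations from IOB labels."""
    out = []
    for word, lab in zip(words, labels):
        parts = []
        for c, l in zip(word, lab):
            if l == 'O':               # bos/eos padding: skip
                continue
            if l == 'B' or not parts:  # start a new part
                parts.append(c)
            else:                      # 'I': extend the current part
                parts[-1] += c
        out.append(' '.join(parts))
    return out
```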
5.4 Evaluating the segmentations
Implement the function `evaluate()`, which takes two label sequences and returns the following measures.
- prediction accuracy (ACC): accuracy of the splits, i.e., the number of correctly split (compound) words divided by the total number of words in the test set.
- boundary precision (BP), recall (BR), and F1 score (BF), where we are only interested in the label ‘B’: whether the model correctly predicts the non-trivial word beginnings. Make sure not to credit the model for predicting the obvious: the beginnings of the compound words should not affect the scores.
- word precision (WP), recall (WR), and F1 score (WF), where a correctly predicted boundary is counted as a “true positive” only if the preceding boundary was also correctly predicted.
The function should return the scores as a dictionary where keys are the short identifiers of each measure (indicated in parentheses above).
Confused? Here is an example:
| gold standard | predicted |
|---|---|
| flug angst | flug angs t |
| menschen rechts konvention | mens chenrechts konvention |
| sprach wissenschaft | sprach wissenschaft |
For the data given above,
- boundary precision is 3/5: of 5 predicted boundaries, 3 are correct.
- boundary recall is 3/4: of 4 boundaries in the data set, 3 are found by the model.
- word precision is 4/8: of 8 predicted words, 4 are correct (flug, konvention, sprach, and wissenschaft).
- word recall is 4/7: of 7 words in the gold standard, 4 are identified successfully.
- prediction accuracy is 1/3: only the last item is split correctly.
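Putting these definitions together, here is a possible sketch of `evaluate()` that reproduces the counts above; the `spans` helper is hypothetical, and the handling of padding follows the assumptions from 5.1.

```python
def spans(labels):
    """Return (start, end) spans of the words in an IOB label string."""
    out, start = [], None
    for i, l in enumerate(labels):
        if l != 'I' and start is not None:  # 'B' or 'O' closes a span
            out.append((start, i))
            start = None
        if l == 'B':
            start = i
    if start is not None:
        out.append((start, len(labels)))
    return out

def evaluate(gold, predicted):
    """Compute ACC, BP/BR/BF, WP/WR/WF from two aligned sequences
    of IOB label strings."""
    acc = tp = pred_b = gold_b = w_tp = w_pred = w_gold = 0
    for g, p in zip(gold, predicted):
        acc += int(g == p)
        first = g.index('B')  # the trivial word beginning: not scored
        for i, (gl, pl) in enumerate(zip(g, p)):
            if i == first:
                continue
            pred_b += int(pl == 'B')
            gold_b += int(gl == 'B')
            tp += int(gl == pl == 'B')
        # a predicted word counts only if both of its boundaries are
        # correct, i.e. its span exactly matches a gold span
        gs, ps = set(spans(g)), set(spans(p))
        w_tp += len(gs & ps)
        w_pred += len(ps)
        w_gold += len(gs)
    def prf(tp, pred, gold):
        p = tp / pred if pred else 0.0
        r = tp / gold if gold else 0.0
        f = (2 * p * r / (p + r)) if p + r else 0.0
        return p, r, f
    bp, br, bf = prf(tp, pred_b, gold_b)
    wp, wr, wf = prf(w_tp, w_pred, w_gold)
    return {'ACC': acc / len(gold), 'BP': bp, 'BR': br, 'BF': bf,
            'WP': wp, 'WR': wr, 'WF': wf}
```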
References
[1] Jianqiang Ma, Verena Henrich & Erhard Hinrichs (2016). Letter Sequence Labeling for Compound Splitting. In: Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Berlin, Germany, August 2016, pp. 76–81.