Assignment 5: German compound splitting with RNNs

Deadline: July 17, 10:00 CEST.

In this exercise we are going to work on German compound splitting. The main objective of the assignment is to become (more) familiar with recurrent neural networks (RNNs). The task at hand is a sequence labeling task; hence, the method you will work on is equally applicable to any other task that can be cast as sequence labeling (e.g., POS tagging, named entity recognition, even parsing).

In this assignment we will implement a variant of IOB ("inside/outside/beginning") tagging. Our aim is to label each character with one of the following labels.

As before, please implement all of the exercises in the provided template as specified in the instructions below.

Data

The data comes from a recent study [1]. The words were extracted from GermaNet. The data files contain one word per line, with the parts of compound words separated by a space character. All words were converted to lowercase.

We use a training and a test file. The test file should be used only for testing. There is no specified development set. You should use part of the training data for any tuning you perform.

Exercises

5.1 Data preparation, encoding

This exercise has two steps: first, read the data file and create the appropriate labels; second, encode the characters and labels using one-hot encoding.
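To make the intended input and output formats concrete, here is a minimal sketch of both steps. It assumes a simple two-label scheme (B for the first character of a compound part, I for every other character) purely for illustration; the actual label set is the one specified above and in the template, and all names below are only illustrative.

```python
import numpy as np

def read_data(path):
    """Read one (possibly segmented) word per line; return character and label sequences."""
    words, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(" ")
            if parts == [""]:
                continue
            chars, tags = [], []
            for part in parts:
                for i, ch in enumerate(part):
                    chars.append(ch)
                    tags.append("B" if i == 0 else "I")  # hypothetical B/I scheme
            words.append(chars)
            labels.append(tags)
    return words, labels

def one_hot_encode(words, labels):
    """One-hot encode characters and labels, padding every word to the longest one."""
    char_set = sorted({c for w in words for c in w})
    label_set = sorted({t for ts in labels for t in ts})
    char_idx = {c: i + 1 for i, c in enumerate(char_set)}    # index 0 is reserved for padding
    label_idx = {t: i + 1 for i, t in enumerate(label_set)}
    max_len = max(len(w) for w in words)

    X = np.zeros((len(words), max_len, len(char_set) + 1), dtype=np.float32)
    Y = np.zeros((len(words), max_len, len(label_set) + 1), dtype=np.float32)
    for n, (w, ts) in enumerate(zip(words, labels)):
        for i, (c, t) in enumerate(zip(w, ts)):
            X[n, i, char_idx[c]] = 1.0
            Y[n, i, label_idx[t]] = 1.0
        X[n, len(w):, 0] = 1.0    # mark padding positions
        Y[n, len(w):, 0] = 1.0
    return X, Y, char_idx, label_idx
```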

5.2 Building and training an RNN

Implement the method train_rnn() in the template, which takes words and labels encoded according to the description in 5.1 and returns a trained recurrent network model that predicts the label of each character in the given (encoded) word list. Use "early stopping" to prevent overfitting.

In such networks it is common practice to use an embedding layer before the recurrent layer; for the sake of the exercise, please do not use one. You are free to use any recurrent architecture (e.g., GRU, LSTM). You may, but are not required to, use a bidirectional RNN. You are encouraged to experiment with different parameters; a possible setup is sketched below.
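The following is one possible starting point, sketched with tf.keras: a bidirectional LSTM over the one-hot character vectors, a time-distributed softmax output layer, and an EarlyStopping callback monitoring a held-out part of the training data. The layer size, patience, and other hyperparameters are placeholders, not tuned values.

```python
import tensorflow as tf

def train_rnn(X, Y):
    """Train a character-level sequence labeler on one-hot inputs X and one-hot labels Y."""
    n_chars = X.shape[2]     # size of the one-hot character vocabulary (incl. padding slot)
    n_labels = Y.shape[2]    # number of output labels (incl. padding slot)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, n_chars)),                  # no embedding layer, per the instructions
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(64, return_sequences=True)),   # a GRU works as well
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(n_labels, activation="softmax")),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)
    model.fit(X, Y, validation_split=0.1, batch_size=32, epochs=100,
              callbacks=[early_stop], verbose=1)
    return model
```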

5.3 Segmenting compounds

Implement the function segment(), which takes a list of unsegmented (padded) words and the corresponding labels, and returns the segmented words.
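A minimal sketch of segment(), consistent with the hypothetical B/I scheme used in the sketch for 5.1: a new compound part starts at every character labeled B, and positions carrying a padding label are skipped.

```python
def segment(words, labels):
    """Turn (padded) words and per-character labels into space-separated segments."""
    segmented = []
    for word, tags in zip(words, labels):
        pieces = []
        for ch, tag in zip(word, tags):
            if tag not in ("B", "I"):    # padding label: ignore this position
                continue
            if tag == "B" and pieces:    # a new compound part starts here
                pieces.append(" ")
            pieces.append(ch)
        segmented.append("".join(pieces))
    return segmented
```

For example, under this scheme segment(["flugangst"], [list("BIIIBIIII")]) would return ["flug angst"].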

5.4 Evaluating the segmentations

Implement the function evaluate(), which takes two label sequences and returns the following measures.

The function should return the scores as a dictionary where keys are the short identifiers of each measure (indicated in parentheses above).
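Since the list of measures is not reproduced here, the sketch below only illustrates the expected interface: it returns two placeholder scores (character-level label accuracy and word-level exact match) as a dictionary. Replace these placeholders with the measures and short identifiers named above.

```python
def evaluate(gold_labels, pred_labels):
    """Compare gold and predicted label sequences; return a dictionary of scores."""
    char_correct = char_total = word_correct = 0
    for gold, pred in zip(gold_labels, pred_labels):
        char_correct += sum(g == p for g, p in zip(gold, pred))
        char_total += len(gold)
        word_correct += int(list(gold) == list(pred))
    return {
        "char_acc": char_correct / char_total,        # placeholder measure
        "word_acc": word_correct / len(gold_labels),  # placeholder measure
    }
```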

Confused? Here is an example:

gold standard                 predicted
flug angst                    flug angs t
menschen rechts konvention    mens chenrechts konvention
sprach wissenschaft           sprach wissenschaft

For the data given above,

References

[1] Jianqiang Ma, Verena Henrich, and Erhard Hinrichs (2016). Letter Sequence Labeling for Compound Splitting. In: Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Berlin, Germany, August 2016, pp. 76–81.