Assignment 1: Corpus creation
Deadline: 22nd of May 2019, 10:00 CEST
In this assignment you will create a corpus of code-switching. Code-switching refers to the use of multiple languages within a single sentence (or utterance, or conversation turn), for example "Das Meeting wurde gecancelt because nobody showed up". Code-switching is a linguistic phenomenon observed in multilingual societies, and it is particularly common in informal language use. It is an interesting area of linguistic research, and some natural language applications may benefit from corpora of such language use.
For this assignment you will collect social media text that includes code-switching between German and English. Please contact the instructor if you are interested in working with a different language pair.
For this set of exercises, we will collect a code-switching corpus from Twitter. The main part of this exercise is similar to the one from last year. An example Python solution for last year’s assignment is provided as tweet_search.py. You are allowed to adapt this to your needs, but encouraged to try to implement it yourself.
Objectives
- Becoming (more) familiar with Python
- Experimenting with corpus collection for a particular purpose
- Dealing with (somewhat) large corpora
- Experimenting with automatic annotation of corpora
Exercises
1.1 Creating word lists
Since we are interested in tweets that contain more than one language, Twitter's language detection will not be useful for our purposes. The public Twitter API allows searching for a string of limited length, or filtering the Twitter stream for a limited number of keywords; both are rather limited for our purposes. To get around these limitations, we will use a trick: we query for the most common words in one of the languages, and then filter the returned tweets further based on common words in the second language. Hence, your task in this exercise is to create word lists for both languages. The words in the first list (which we will use for querying tweets) should be frequent and distinctive, and this list has to be short due to Twitter query restrictions. The second list can be larger, but its words should be distinctive from the first language.
Write a Python script wordlist.py that uses the OpenSubtitles corpus from OPUS to prepare word lists for German and English. The word list for German should contain the 2000 most frequent words in the German corpus, excluding words shorter than 4 characters (which are likely to occur in many other languages) and words that are fairly frequent (among the 20000 most frequent words) in the English corpus. The word list for English should contain the 5000 most frequent words, excluding words that are frequent in the German corpus. (You can use a larger word list, which will catch more code-switched tweets, but it may be computationally demanding depending on the way you implement the filtering. TIP: pre-compiled regular expressions are your friend.) Write the lists of words for German and English to files named de.words and en.words, respectively. Your output files should be UTF-8 encoded text files containing one word per line. A sketch of one possible approach is given after the notes below.
Notes/rules:
- You should use the tokenized monolingual corpora from the OPUS OpenSubtitles 2018 collection.
- The files are fairly large. Dealing with the text versions (de.tok.gz and en.tok.gz) is computationally less demanding than dealing with the XML-formatted files.
- Test your solution on small test files. Depending on your implementation and hardware, obtaining the word lists as described above may require a considerable amount of time.
- If implemented reasonably efficiently, the task requires less than 500 MB of memory and runs in less than an hour on a single (and somewhat old) CPU. Part of this exercise is about dealing with large corpora, so you should try to use the full data set for both languages. However, if your implementation requires too much time and/or memory, you can truncate the data files to a reasonable size (e.g., one million lines each).
- The ideal solution for the German word list requires excluding not only frequent English words but also frequent words of any other language present on Twitter. You are not required, but encouraged, to extend your solution to exclude fairly frequent words from more languages (e.g., using all languages in OPUS OpenSubtitles, the languages available in the Python wordfreq package, or frequency lists from Wiktionary).
- Do not forget to commit your word lists as well as your implementation.
- Do not commit/push the data files from OPUS.
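The following is a minimal sketch of one way wordlist.py could count frequencies and apply the exclusion rules. It is not a reference solution: the streaming Counter approach, the lower-casing, and the isalpha() filter are assumptions you may reasonably change, and the 20000 cut-off used when building the English list is an assumption (the assignment only says "frequent in the German corpus").

    import gzip
    from collections import Counter

    def count_words(path, min_len=1):
        # Count token frequencies in a gzipped, tokenized corpus
        # (one sentence per line), streaming to keep memory use low.
        counts = Counter()
        with gzip.open(path, "rt", encoding="utf-8") as corpus:
            for line in corpus:
                for token in line.lower().split():
                    if len(token) >= min_len and token.isalpha():
                        counts[token] += 1
        return counts

    de_counts = count_words("de.tok.gz", min_len=4)
    en_counts = count_words("en.tok.gz")

    # German list: the 2000 most frequent German words that are not
    # among the 20000 most frequent English words.
    en_frequent = {w for w, _ in en_counts.most_common(20000)}
    de_words = [w for w, _ in de_counts.most_common()
                if w not in en_frequent][:2000]

    # English list: the 5000 most frequent English words that are not
    # frequent in German (the 20000 cut-off here is an assumption).
    de_frequent = {w for w, _ in de_counts.most_common(20000)}
    en_words = [w for w, _ in en_counts.most_common()
                if w not in de_frequent][:5000]

    with open("de.words", "w", encoding="utf-8") as out:
        out.write("\n".join(de_words) + "\n")
    with open("en.words", "w", encoding="utf-8") as out:
        out.write("\n".join(en_words) + "\n")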
1.2 Collecting tweets
Write a Python program, tweet_collect.py, that uses the tweepy library to obtain 50 tweets (hopefully) with code-switching for the language pair. A sketch of the collection loop is given after the notes below.
Rules/notes:
- Do not store tweets shorter than 50 (Unicode) characters.
- Store the matching tweets, including all the data obtained from Twitter, encoded as a JSON file named de-en-tweets.json.
- You will need a Twitter account, and you need to create access keys for this purpose. There are many tutorials on the Internet on how to create an application connected to a Twitter account and how to obtain the necessary access keys.
- You should never store passwords or secret keys in git repositories. The example code provided includes a common/reasonable way to deal with passwords and keys.
- Depending on the time of day, it may take considerable time to obtain the required number of tweets. You should not leave the collection process to the last minute.
- Do not forget to commit the corpus you collected as well as your implementation.
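Below is a minimal sketch of the collection loop, assuming the tweepy 3.x streaming API (StreamListener). The credentials file name, the crude whitespace tokenization, and the 400-keyword cut-off for the track parameter are assumptions, not requirements of the assignment.

    import json
    import tweepy

    class CSListener(tweepy.StreamListener):
        # Collect streamed tweets that appear to mix German and English.
        def __init__(self, en_words, target=50):
            super().__init__()
            self.en_words = en_words   # distinctive English words
            self.collected = []
            self.target = target

        def on_status(self, status):
            text = status.text
            if len(text) < 50:
                return True  # too short: skip, but keep streaming
            # Crude whitespace tokenization (an assumption; you may do better).
            tokens = {t.lower() for t in text.split()}
            if tokens & self.en_words:
                self.collected.append(status._json)  # all data from Twitter
            return len(self.collected) < self.target  # False stops the stream

    # Credentials live in an untracked file, never committed to git.
    with open("credentials.json") as f:
        cred = json.load(f)
    auth = tweepy.OAuthHandler(cred["consumer_key"], cred["consumer_secret"])
    auth.set_access_token(cred["access_token"], cred["access_secret"])

    # Query the stream with the German list; the standard filter
    # endpoint accepts at most 400 track keywords.
    with open("de.words", encoding="utf-8") as f:
        de_query = [w.strip() for w in f][:400]
    with open("en.words", encoding="utf-8") as f:
        en_words = {w.strip() for w in f}

    listener = CSListener(en_words)
    stream = tweepy.Stream(auth=auth, listener=listener)
    stream.filter(track=de_query)  # blocks until the listener stops it

    with open("de-en-tweets.json", "w", encoding="utf-8") as out:
        json.dump(listener.collected, out, ensure_ascii=False)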
1.3 Automatically annotating the corpus
Write a Python script, detect_language.py, that:
- Reads the JSON file created in the previous exercise.
- Tokenizes the tweet text using the NLTK TweetTokenizer.
- Labels each token (individually) with:
  - the most likely language in the language pair according to langdetect.detect_langs(), or
  - OTHER if the returned list does not contain the languages in the language pair, or if the token includes non-alphabetic characters.
- Writes the output as a tab-separated file where the first column contains the tokens and the second column contains the language IDs (de, en or OTHER). Name your file de-en.txt.
- Reports (writes to the console output) the number of tweets for which code-switching is observed according to langdetect.
A sketch of the labelling loop is given after the notes below.
Notes:
- Do not be alarmed if the results from langdetect differ from what you expect. Language detection generally works better for longer strings than for individual tokens.
- Do not forget to commit the processed file as well as your implementation.
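As a starting point, the following sketch shows one way to implement the per-token labelling. It assumes the tweets were saved as a JSON list of status objects with a "text" field (as in the sketch for exercise 1.2); setting DetectorFactory.seed makes langdetect's output reproducible.

    import json
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException
    from nltk.tokenize import TweetTokenizer

    DetectorFactory.seed = 0  # langdetect is non-deterministic otherwise
    PAIR = ("de", "en")
    tokenizer = TweetTokenizer()

    def label_token(token):
        # Return 'de', 'en' or 'OTHER' for a single token.
        if not token.isalpha():
            return "OTHER"
        try:
            # Language objects, sorted by descending probability.
            guesses = detect_langs(token)
        except LangDetectException:
            return "OTHER"  # no usable features in the token
        for guess in guesses:
            if guess.lang in PAIR:
                return guess.lang
        return "OTHER"

    with open("de-en-tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)  # assumed: a JSON list of status dicts

    code_switched = 0
    with open("de-en.txt", "w", encoding="utf-8") as out:
        for tweet in tweets:
            labels = set()
            for token in tokenizer.tokenize(tweet["text"]):
                lang = label_token(token)
                labels.add(lang)
                out.write(token + "\t" + lang + "\n")
            if "de" in labels and "en" in labels:
                code_switched += 1

    print("Tweets with code-switching:", code_switched)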