Assignment 1: Corpus creation

Deadline: 22nd of May 2019, 10:00 CEST

In this assignment you will create a corpus of code-switching. Code-switching refers to the use of multiple languages within a single sentence (or utterance, or conversation turn). It is a linguistic phenomenon observed in multilingual societies and is particularly common in informal language use. Code-switching is an interesting area of linguistic research, and natural language processing applications may benefit from such corpora.

For this assignment you will collect social media text containing code-switching between German and English. Please contact the instructor if you are interested in working with a different language pair.

For this set of exercises, we will collect a code-switching corpus from Twitter. The main part of this exercise is similar to last year's assignment. An example Python solution for last year's assignment is provided as tweet_search.py. You are allowed to adapt it to your needs, but you are encouraged to implement the solution yourself.

Objectives

Exercises

1.1 Creating word lists

Since we are interested in tweets that contain more than one language, Twitter's language detection will not be useful for our purposes. The public Twitter API allows searching for a query string of limited length, or filtering the Twitter stream for a limited number of keywords, which is rather restrictive for our purposes. To get around these limitations, we will use a trick: we query for some of the most common words in one of the languages, and then filter the returned tweets further based on common words in the second language. Hence, your task in this exercise is to create word lists for both languages. The words in one of the lists (the one we will use for querying tweets) should be frequent and distinctive, and this list has to be short because of Twitter's query restrictions. The other list can be larger, but its words should still be distinctive, i.e., not common in the first language.

Write a Python script wordlist.py that uses the OpenSubtitles corpus from OPUS to prepare word lists for German and English. The word list for German should contain the 2000 most frequent words in the German corpus, excluding words shorter than 4 characters (which are likely to occur in many other languages) and words that are fairly frequent in the English corpus (among the 20000 most frequent English words). The word list for English should contain the 5000 most frequent words, excluding words that are frequent in the German corpus. (You can use a larger English word list, which will catch more code-switched tweets, but it may be computationally demanding depending on how you implement the filtering. Tip: pre-compiled regular expressions are your friend.) Write the lists of words for German and English to files named de.words and en.words, respectively. Your output files should be UTF-8 encoded text files containing one word per line.
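
As a rough starting point, wordlist.py could be structured along the lines of the sketch below. It assumes the OpenSubtitles data has already been downloaded from OPUS as plain-text files (one sentence per line); the input file names, the simple tokenization, and the cut-off used for "frequent in the German corpus" are assumptions you may need to adapt.

    # wordlist.py -- sketch: build German and English word lists from OpenSubtitles.
    # The input file names below are placeholders for the plain-text corpus files
    # downloaded from OPUS (one sentence per line).
    import re
    from collections import Counter

    TOKEN_RE = re.compile(r"\w+", re.UNICODE)

    def count_words(path):
        """Count lowercased word tokens in a UTF-8 text file."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(w.lower() for w in TOKEN_RE.findall(line))
        return counts

    de_counts = count_words("opensubtitles.de")   # placeholder file name
    en_counts = count_words("opensubtitles.en")   # placeholder file name

    # German list: the 2000 most frequent German words that are at least 4
    # characters long and not among the 20000 most frequent English words.
    en_top = {w for w, _ in en_counts.most_common(20000)}
    de_words = []
    for w, _ in de_counts.most_common():
        if len(w) >= 4 and w not in en_top:
            de_words.append(w)
            if len(de_words) == 2000:
                break

    # English list: the 5000 most frequent English words that are not frequent
    # in the German corpus ("frequent" is taken here as the 20000 most frequent
    # German words -- an assumed threshold).
    de_top = {w for w, _ in de_counts.most_common(20000)}
    en_words = []
    for w, _ in en_counts.most_common():
        if w not in de_top:
            en_words.append(w)
            if len(en_words) == 5000:
                break

    with open("de.words", "w", encoding="utf-8") as f:
        f.write("\n".join(de_words) + "\n")
    with open("en.words", "w", encoding="utf-8") as f:
        f.write("\n".join(en_words) + "\n")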

Notes/rules:

1.2 Collecting tweets

Write a Python program, tweet_collect.py, that uses the tweepy library to obtain 50 tweets that (hopefully) contain code-switching for the language pair.
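
A possible shape for tweet_collect.py is sketched below, assuming a tweepy 3.x style API (OAuthHandler plus api.search) and valid Twitter credentials. The number of query words, the filter criterion (at least one word from en.words), and the output file name are illustrative assumptions, not part of the assignment specification.

    # tweet_collect.py -- sketch: query Twitter for common German words and keep
    # tweets that also contain an English word (tweepy 3.x style API).
    # Credentials, query size, filter criterion and output file name are placeholders.
    import re
    import tweepy

    API_KEY = "..."          # fill in your own credentials
    API_SECRET = "..."
    ACCESS_TOKEN = "..."
    ACCESS_SECRET = "..."

    auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    def read_words(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    # Query with a handful of German words (queries have a limited length),
    # then keep only tweets that also contain at least one English word.
    query = " OR ".join(read_words("de.words")[:10])        # assumed: 10 query words
    en_re = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, read_words("en.words"))),
                       re.IGNORECASE)

    collected = []
    for tweet in tweepy.Cursor(api.search, q=query, tweet_mode="extended").items(1000):
        text = tweet.full_text
        if en_re.search(text):
            collected.append(text)
        if len(collected) == 50:
            break

    with open("tweets.txt", "w", encoding="utf-8") as f:     # assumed output file
        for text in collected:
            f.write(text.replace("\n", " ") + "\n")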

Rules/notes:

1.3 Automatically annotating the corpus

Write a Python script, detect_language.py, that:

Notes:
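
As one possible approach, detect_language.py could label every token of a collected tweet as German, English, or unknown, using the word lists from exercise 1.1. The sketch below assumes this token-level annotation scheme as well as the input and output file names; adapt it to the actual requirements of the exercise.

    # detect_language.py -- sketch: word-list-based per-token language annotation.
    # The annotation scheme (de / en / other labels) and file names are assumptions.
    import re

    TOKEN_RE = re.compile(r"\w+", re.UNICODE)

    def read_words(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    de_words = read_words("de.words")
    en_words = read_words("en.words")

    def tag(token):
        """Return a language label for a single token."""
        t = token.lower()
        if t in de_words:
            return "de"
        if t in en_words:
            return "en"
        return "other"

    with open("tweets.txt", encoding="utf-8") as f_in, \
         open("tweets.tagged", "w", encoding="utf-8") as f_out:
        for line in f_in:
            tokens = TOKEN_RE.findall(line)
            tagged = ["%s/%s" % (tok, tag(tok)) for tok in tokens]
            f_out.write(" ".join(tagged) + "\n")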