# Taaltechnologien

Web site for the course Taaltechnologien for AI students.

The course is based on the book SPEECH and LANGUAGE PROCESSING by Daniel Jurafsky and James H. Martin.

## Exercises

1. (chapter 4): Transcribe 12 Dutch words of your choosing into Worldbet (PostScript version at CSLU.CSE.OGI).
2. (chapter 5): Implement the edit distance algorithm from figure 5.05, but include back-pointers so the best alignment can be traced (use 'S', 'I', and 'D' to indicate substitution, insertion, and deletion). Test it on a few examples. NOTE: there are errors in the pseudo-code: the iterations should start at 1, not 0, and the 0th row and column of the distance matrix must be initialized.
3. (chapter 6): Implement the backoff smoothing (equation 6.37) based on Katz discounting (equation 6.29), either in well-documented pseudo-code or as a working program. Note that:
   1. The new, smoothed zero count becomes N1/N0, i.e., the number of word N-grams seen exactly once divided by the number of unseen word N-grams.
   2. In the backoff procedure, count only over the N-grams actually seen. That is, summing over all words Wn while keeping the previous words fixed, P(Wn|Wn-N+1..Wn-1) becomes the sum of the smoothed counts Counts(Wn-N+1..Wn) over all seen N-grams ending in Wn, divided by the sum over all N-grams, seen and unseen; use smoothed counts throughout. Do the same for the backoff probabilities, but use the smoothed counts of the (N-1)-grams while still summing over the seen N-grams.
4. (chapters 7 & 8): The script POStagger_hmm.pl implements a (toy) HMM POS tagger. It can use a POS bigram table to tag a sentence (note: it is extremely slow). The tagger uses a Viterbi search to find the lowest-"cost" path, i.e., the path with minimal summed cost, where Cost = -log2(P(word|tag)) + m * -log2(P(tag|previous tag)). In this formula m is the language match factor. Answer the following questions:
   - a) Try out the following commands:
     `./POStagger_hmm.pl POSbigramtable.txt 'de was is schoon'`
     `./POStagger_hmm.pl POSbigramtable.txt 'de bal is schoon'`
     What are the differences? Explain them (you can look inside the POS bigram table).
   - b) One problem might be that the lexical factor -log2(P(word|tag)) is much larger than the language factor -log2(P(tag|previous tag)). You can equalize the weight of the two factors with the -m option. Try it:
     `./POStagger_hmm.pl POSbigramtable.txt 'de was is schoon' -m 5`
     `./POStagger_hmm.pl POSbigramtable.txt 'de bal is schoon' -m 5`
     What is the result? Also try -m 10 for the language factor and try to explain what happens. Give more examples.
   - c) Specify a broad design of this POS tagger. Include the relevant formulas for determining and smoothing P(word|tag) and P(tag|previous tag), and the pseudo-code of the Viterbi search. (Note: in the Viterbi search table, the words run along the bottom and all the tags along the vertical axis.)
   - d) Extend the design to include back-off in the language model (i.e., P(tag|previous tag)).
   - e) How to back off on P(word|tag) is less obvious. It can be done by noticing that the tag of an unknown word can often be guessed from its suffix, so you could build a tag-probability table for the last three letters of each known word and use it to back off for unknown words. Specify how this suffix back-off would look in the design of the POS tagger.
5. Student Presentations of recent papers
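For exercise 2, the dynamic-programming table with back-pointers can be sketched as follows. This is a minimal Python sketch with unit costs for insertion, deletion, and substitution; the book's figure charges 2 for substitutions, so adjust the costs if you follow it exactly. The '=' symbol for matches is an extra convention used here, not part of the exercise.

```python
def edit_distance(source, target):
    """Edit distance with back-pointers.

    Returns (distance, ops): ops is a string over 'S' (substitution),
    'I' (insertion), 'D' (deletion), and '=' (match), read left to
    right along the best alignment.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[''] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # initialize column 0 (all deletions)
        dist[i][0], back[i][0] = i, 'D'
    for j in range(1, m + 1):      # initialize row 0 (all insertions)
        dist[0][j], back[0][j] = j, 'I'
    for i in range(1, n + 1):      # note: iterations start at 1, not 0
        for j in range(1, m + 1):
            match = source[i - 1] == target[j - 1]
            sub = dist[i - 1][j - 1] + (0 if match else 1)
            dele = dist[i - 1][j] + 1
            ins = dist[i][j - 1] + 1
            dist[i][j] = min(sub, dele, ins)
            if dist[i][j] == sub:
                back[i][j] = '=' if match else 'S'
            elif dist[i][j] == dele:
                back[i][j] = 'D'
            else:
                back[i][j] = 'I'
    ops, i, j = [], n, m           # trace the back-pointers to (0, 0)
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ('S', '='):
            i, j = i - 1, j - 1
        elif op == 'D':
            i -= 1
        else:
            j -= 1
    return dist[n][m], ''.join(reversed(ops))
```

With unit costs, the distance equals the number of non-match operations in the returned alignment; the classic pair "intention"/"execution" gives distance 5.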
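For exercise 3, the back-off structure can be sketched for bigrams as below. Note that this sketch uses absolute discounting (subtracting a fixed amount from every seen count) instead of the Katz/Good-Turing discounting of equations 6.29 and 6.37, purely to keep it short; the part the exercise's normalization notes are about — reserving mass from seen N-grams and redistributing it over unseen ones via the lower-order model so the probabilities still sum to one — is the same. The function name and `discount` value are illustrative.

```python
from collections import Counter

def katz_bigram(tokens, vocab, discount=0.5):
    """Toy back-off bigram model (absolute discounting, not Katz's).

    Returns a function p(w2, w1) computing P(w2 | w1).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p_uni(w):                      # MLE unigram back-off distribution
        return unigrams[w] / total

    def p(w2, w1):
        c1, c12 = unigrams[w1], bigrams[(w1, w2)]
        if c12 > 0:                    # seen bigram: discounted estimate
            return (c12 - discount) / c1
        # unseen bigram: back off to unigrams, scaled by alpha so that
        # the probabilities over all w2 still sum to one
        seen = [w for w in vocab if bigrams[(w1, w)] > 0]
        alpha = discount * len(seen) / c1 if c1 else 1.0
        unseen_mass = sum(p_uni(w) for w in vocab if w not in seen)
        return alpha * p_uni(w2) / unseen_mass

    return p
```

The key property to check in your own implementation is exactly the one the exercise's notes describe: for any context w1, summing P(w2|w1) over the whole vocabulary (seen and unseen continuations) must give 1.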
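For exercise 4c, the Viterbi search over the cost function from the exercise could be sketched like this. The probability tables, the '<s>' start tag, and the small floor probabilities for unseen pairs are illustrative assumptions, not taken from POStagger_hmm.pl.

```python
import math

def viterbi_tag(words, tags, p_word_tag, p_tag_tag, m=1.0):
    """Viterbi search minimizing
    Cost = -log2 P(word|tag) + m * -log2 P(tag|prev_tag),
    where m is the language match factor from the exercise.
    """
    def cost(p):
        return -math.log2(p) if p > 0 else float('inf')

    # best[i][t] = (cost of best path tagging words[:i+1] with t at i,
    #               back-pointer to the previous tag)
    best = [{} for _ in words]
    for t in tags:
        best[0][t] = (cost(p_word_tag(words[0], t))
                      + m * cost(p_tag_tag(t, '<s>')), None)
    for i in range(1, len(words)):
        for t in tags:
            emit = cost(p_word_tag(words[i], t))
            prev, c = min(((pt, best[i - 1][pt][0] + m * cost(p_tag_tag(t, pt)))
                           for pt in tags), key=lambda x: x[1])
            best[i][t] = (c + emit, prev)
    # trace the back-pointers from the cheapest final tag
    t = min(tags, key=lambda tag: best[-1][tag][0])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = best[i][t][1]
        path.append(t)
    return list(reversed(path))

# Toy tables, made up for illustration:
P_WT = {('de', 'Det'): 0.9, ('was', 'N'): 0.5, ('was', 'V'): 0.5,
        ('is', 'V'): 0.9, ('schoon', 'Adj'): 0.8}
P_TT = {('Det', '<s>'): 0.6, ('N', 'Det'): 0.7, ('V', 'N'): 0.6,
        ('Adj', 'V'): 0.5}

def p_word_tag(w, t):
    return P_WT.get((w, t), 0.001)   # small floor for unseen pairs

def p_tag_tag(t, pt):
    return P_TT.get((t, pt), 0.01)

TAGS = ['Det', 'N', 'V', 'Adj']
```

The real tagger fills the same table (words along the bottom, tags along the vertical axis) and differs mainly in how P(word|tag) and P(tag|previous tag) are estimated and smoothed, which is what parts c–e ask you to design.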

## Course texts from SPEECH and LANGUAGE PROCESSING

- Chapter 2 (all)
- Chapter 3 (all)
- Chapter 4 (all)
- Chapter 5 (all)
- Chapter 6 (all)
- Chapters 4 & 5 of HMM-based continuous speech recognition, PhD thesis of Paul van Alphen, 1992 (replaces chapter 7 of the book)
- Skip chapter 8; use what was covered for exercise 4
- The Generalized LR Parsing Algorithm (Tomita parser), Masaru Tomita & See-Kiong Ng
- Sections 1-4 of Data Oriented Language Processing, Rens Bod & Remco Scha
- Sections 1-4 of Putting Language into Language Modelling, Frederick Jelinek & Ciprian Chelba (Eurospeech 99)
- The above three replace chapters 9, 10, and 12
- Table I and sections 3 and 5 of A conversation acts model for generating spoken dialogue contributions, Amanda Stent, Computer Speech and Language 16, 313-352 (2002)
- The above replaces chapters 19 and 20
- Chapter 21 is skipped