Taaltechnologien
Web site for the course Taaltechnologien for AI students.
The course is based on the book
SPEECH and LANGUAGE PROCESSING
by Daniel Jurafsky and James H. Martin
Exercises
- (chapter 4): Transcribe 12 Dutch words of your choosing into
Worldbet (PostScript version at CSLU.CSE.OGI).
- (chapter 5): Implement the edit distance algorithm from figure 5.05,
but include back-pointers to allow tracing the best alignment (use 'S', 'I', 'D'
to indicate Substitution, Insertion, and Deletion). Test it on a few examples.
NOTE: there are errors in the pseudo-code, e.g., the iterations should start
at 1, not 0, and you should initialize the 0 column and row of the distance matrix.
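A minimal sketch of the chapter 5 exercise in Python, not the book's pseudo-code verbatim. It assumes insertion and deletion cost 1 and substitution cost 2 (adjustable through the hypothetical parameters ins_cost, del_cost and sub_cost), and it adds an 'M' code for matching symbols, which the exercise does not prescribe. Row 0 and column 0 are initialized explicitly and both loops start at 1.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Return (distance, operations) for turning source into target.

    operations is a string over 'M' (match, an added assumption),
    'S' (substitution), 'I' (insertion) and 'D' (deletion).
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialize column 0 (only deletions) and row 0 (only insertions).
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
        back[i][0] = 'D'
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
        back[0][j] = 'I'

    # Fill the rest of the matrix; both loops start at 1.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (dist[i - 1][j - 1] + (0 if same else sub_cost),
                 'M' if same else 'S'),
                (dist[i - 1][j] + del_cost, 'D'),
                (dist[i][j - 1] + ins_cost, 'I'),
            ]
            # On ties the first candidate wins (diagonal, then D, then I).
            dist[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Follow the back-pointers from the bottom-right cell to (0, 0).
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ('M', 'S'):
            i, j = i - 1, j - 1
        elif op == 'D':
            i -= 1
        else:
            j -= 1
    return dist[n][m], ''.join(reversed(ops))


distance, alignment = edit_distance("intention", "execution")
print(distance, alignment)
```

With these costs the distance between "intention" and "execution" is 8. The operation string is read left to right; several alignments can share the minimal cost, and the one returned depends on the tie-breaking order of the candidates.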
- (chapter 6): Implement backoff smoothing (equation 6.37) based on Katz discounting
(equation 6.29) in well-documented pseudo-code or a working program. Note that
1) the new, smoothed zero count becomes N1/N0, i.e., the number of word N-grams
seen exactly once in the original counts divided by the number of unseen word N-grams.
2) In the backoff procedure, sum only over the N-grams actually seen.
That is, the sum over all words Wn with the previous words held fixed, P(Wn|Wn-N+1..Wn-1), becomes
the sum of Counts(Wn-N+1 ... Wn) over all seen N-grams ending in Wn, divided by
the sum over all N-grams, seen and unseen. Use smoothed counts. Do the same with the backoff
probabilities, but use the smoothed counts of the (N-1)-grams while still summing over the
seen N-grams.
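The general shape of a backoff model can be sketched for the bigram case as follows. This is a simplified illustration under stated assumptions, not a solution to the exercise as specified: it uses plain Good-Turing discounting, c* = (c+1) N(c+1) / N(c), for counts up to a threshold k instead of the full equation 6.29, it caps c* at c so that mass is never added, it backs off to unsmoothed unigram probabilities, and it does not use the N1/N0 zero count. The function names train_backoff_bigram and backoff_prob are hypothetical.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(tokens, k=5):
    """Return (p_uni, p_seen, alpha) for a bigram-to-unigram backoff model."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}

    # N(c): how many distinct bigrams occur exactly c times.
    freq_of_freq = Counter(bigrams.values())

    # Count of each word as a bigram history.
    history_count = Counter()
    for (prev, _), c in bigrams.items():
        history_count[prev] += c

    # Discounted probabilities for the bigrams that were actually seen.
    p_seen = {}
    for (prev, word), c in bigrams.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        if c <= k and n_c1 > 0:
            c_star = min(c, (c + 1) * n_c1 / n_c)  # capped (simplification)
        else:
            c_star = float(c)  # large or unsupported counts stay as they are
        p_seen[(prev, word)] = c_star / history_count[prev]

    # Sum only over the seen bigrams of each history.
    seen_mass = defaultdict(float)
    seen_uni_mass = defaultdict(float)
    for (prev, word), p in p_seen.items():
        seen_mass[prev] += p
        seen_uni_mass[prev] += p_uni[word]

    # alpha(prev) spreads the left-over mass over the unseen successors.
    alpha = {}
    for prev in history_count:
        denom = 1.0 - seen_uni_mass[prev]
        alpha[prev] = (1.0 - seen_mass[prev]) / denom if denom > 1e-12 else 0.0
    return p_uni, p_seen, alpha


def backoff_prob(model, prev, word):
    """P(word | prev): discounted bigram if seen, else weighted unigram."""
    p_uni, p_seen, alpha = model
    if (prev, word) in p_seen:
        return p_seen[(prev, word)]
    # A history that was never seen backs off fully (weight 1).
    return alpha.get(prev, 1.0) * p_uni.get(word, 0.0)


corpus = "the cat sat on the mat the cat ate the fish".split()
model = train_backoff_bigram(corpus)
print(backoff_prob(model, "cat", "sat"), backoff_prob(model, "the", "sat"))
```

For each history the probabilities over the vocabulary sum to 1, as long as at least one vocabulary word was not seen after that history. When the discount leaves no mass over (for example because N(c+1) is zero for every count involved), unseen successors receive probability 0, which a full implementation of equation 6.29 would handle more carefully.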
- (chapter 7&8): POS-tagger implements a (toy)
HMM POS tagger. It can use a POS bigram table to
tag a sentence (note: it is extremely slow). This POS-tagger
uses a Viterbi search to find the path with the lowest summed "cost", where each
step costs Cost = -log2(P(word|tag)) + m * -log2(P(Tag|PreviousTag)). In this formula,
m is the language match factor.
Answer the following questions:
- a) Try out the following commands:
./POStagger_hmm.pl POSbigramtable.txt 'de was is schoon'
./POStagger_hmm.pl POSbigramtable.txt 'de bal is schoon'
What are the differences? Explain them (you can look inside
the POS bigram table).
- b) One problem might be that the lexical factor -log2(P(word|tag)) is much
larger than the language factor -log2(P(Tag|PreviousTag)). You can equalize
the weight of both factors by using the -m option. Try it:
./POStagger_hmm.pl POSbigramtable.txt 'de was is schoon' -m 5
./POStagger_hmm.pl POSbigramtable.txt 'de bal is schoon' -m 5
What is the result? Try out -m 10 for the language factor. Try to
explain what happens.
Give more examples.
- c) Specify a broad design of this POS tagger. Include the relevant formulas
for determining and smoothing P(word|tag) and P(Tag|PreviousTag), and the
pseudo-code of the Viterbi search. (Note: the table for the Viterbi
search contains the words along the bottom and all the tags along the
vertical axis.)
- d) Specify how the design changes to include back-off in the language model (i.e.,
P(Tag|PreviousTag)).
- e) How to back off on P(word|tag) is not directly obvious. It can be done
by noticing that the tag of an unknown word can often be guessed from its
suffix. So you could build a tag probability table for the last three
letters of each known word and use it to back off for unknown
words.
Specify what this suffix back-off would look like in the design of the
POS-tagger.
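The Viterbi search over the cost Cost = -log2(P(word|tag)) + m * -log2(P(Tag|PreviousTag)) can be sketched in Python as follows. This is an illustration, not the code of POStagger_hmm.pl: the tag set, the probabilities in EMIT and TRANS, the start symbol "<s>" and the probability floor for unseen events are all made up for the example and do not come from POSbigramtable.txt.

```python
import math

# Hypothetical toy tables; the numbers are invented for illustration.
TAGS = ["Det", "N", "V", "Adj"]
EMIT = {  # P(word | tag)
    ("de", "Det"): 0.5, ("was", "N"): 0.01, ("was", "V"): 0.1,
    ("bal", "N"): 0.02, ("is", "V"): 0.2, ("schoon", "Adj"): 0.05,
}
TRANS = {  # P(tag | previous tag)
    ("<s>", "Det"): 0.5, ("Det", "N"): 0.6, ("Det", "V"): 0.01,
    ("N", "V"): 0.4, ("V", "V"): 0.05, ("V", "Adj"): 0.3,
}


def viterbi_tag(words, tags, emit, trans, m=1.0, floor=1e-6, start="<s>"):
    """Return (tag sequence, total cost) of the cheapest path."""
    if not words:
        return [], 0.0

    def cost(p):
        # Unseen events get a small floor probability instead of zero.
        return -math.log2(max(p, floor))

    # best[i][t] = (cost of the cheapest path ending in tag t at word i,
    #               back-pointer to the previous tag on that path)
    best = [{} for _ in words]
    for t in tags:
        lexical = cost(emit.get((words[0], t), 0.0))
        best[0][t] = (lexical + m * cost(trans.get((start, t), 0.0)), None)

    for i in range(1, len(words)):
        for t in tags:
            lexical = cost(emit.get((words[i], t), 0.0))
            prev, total = min(
                ((p, best[i - 1][p][0] + m * cost(trans.get((p, t), 0.0)))
                 for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = (total + lexical, prev)

    # Pick the cheapest final tag and follow the back-pointers.
    tag = min(tags, key=lambda t: best[-1][t][0])
    total_cost = best[-1][tag][0]
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path)), total_cost


for m in (0, 1):
    print(m, viterbi_tag("de was is schoon".split(), TAGS, EMIT, TRANS, m=m))
```

With these invented tables, m = 0 (lexical factor only) tags "was" as V, while m = 1 lets the tag bigrams pull it to N after the determiner. The real table will give different numbers, but the mechanism by which m shifts weight between the two factors is the same.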
- Student Presentations of recent papers
Materials
- Chapter 2 all
- Chapter 3 all
- Chapter 4 all
- Chapter 5 all
- Chapter 6 all
- Chapters 4&5 of HMM-based continuous speech recognition, PhD thesis of Paul van Alphen, 1992
(replaces chapter 7 of the book)
- Skip Chapter 8; use what was covered for exercise 4
- The Generalized LR Parsing Algorithm (Tomita parser), Masaru Tomita & See-Kiong Ng
- Sections 1-4 of Data Oriented Language Processing, Rens Bod & Remco Scha
- Sections 1-4 of Putting Language into Language Modelling, Frederick Jelinek & Ciprian Chelba (Eurospeech 99)
- The above replace Chapters 9, 10, 12
- Table I and Sections 3 and 5 of A conversation acts model for generating spoken dialogue contributions,
Amanda Stent, Computer Speech and Language 16, 313-352 (2002)
- The above replaces Chapters 19 and 20
- Chapter 21 is skipped