Web site for the course Taaltechnologien for AI students.

The course is based on the book Speech and Language Processing by Daniel Jurafsky and James H. Martin.


  1. (chapter 4): Transcribe 12 Dutch words of your choosing into Worldbet (a PostScript version is available at CSLU.CSE.OGI).
  2. (chapter 5): Implement the edit distance algorithm from Figure 5.05, but include back-pointers so the best alignment can be traced (use 'S', 'I', and 'D' to indicate Substitution, Insertion, and Deletion). Test it on a few examples. NOTE: there are errors in the pseudo-code; for example, the iterations should start at 1, not 0, and you should initialize row 0 and column 0 of the distance matrix.
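For checking your own solution, here is one possible sketch in Python. The cost scheme (insertion and deletion 1, substitution 2, match 0) follows the book's formulation; the function name and the extra '=' marker for matches are our own choices, not part of the assignment:

```python
def edit_distance(source, target):
    """Edit distance with back-pointers for tracing the best alignment.

    Returns (distance, operations), where operations is a string of
    'S' (substitution), 'I' (insertion), 'D' (deletion), '=' (match).
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    # Initialize row 0 and column 0 (the step missing from the pseudo-code).
    for i in range(1, n + 1):
        dist[i][0] = i
        back[i][0] = 'D'
    for j in range(1, m + 1):
        dist[0][j] = j
        back[0][j] = 'I'
    # Iterations start at 1, not 0 (the other error noted above).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = source[i - 1] == target[j - 1]
            sub = dist[i - 1][j - 1] + (0 if match else 2)
            dele = dist[i - 1][j] + 1
            ins = dist[i][j - 1] + 1
            best = min(sub, dele, ins)
            dist[i][j] = best
            if best == sub:
                back[i][j] = '=' if match else 'S'
            elif best == dele:
                back[i][j] = 'D'
            else:
                back[i][j] = 'I'
    # Follow the back-pointers from the bottom-right corner.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ('S', '='):
            i, j = i - 1, j - 1
        elif op == 'D':
            i -= 1
        else:
            j -= 1
    return dist[n][m], ''.join(reversed(ops))
```

With these costs the book's example pair "intention"/"execution" gives distance 8.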
  3. (chapter 6): Implement backoff smoothing (equation 6.37) based on Katz discounting (equation 6.29), either as well-documented pseudo-code or as a working program. Note that:
    1) the new, smoothed, zero count becomes N1/N0, i.e., the number of word N-grams seen exactly once divided by the number of unseen word N-grams;
    2) in the backoff procedure, count only over the N-grams actually seen.
    That is, summing over all words Wn with the previous words held fixed, P(Wn|Wn-N+1..Wn-1) becomes the smoothed count Counts(Wn-N+1 ... Wn) divided by the sum of the smoothed counts of all N-grams with that history, seen and unseen. Do the same for the backoff probabilities, but use the smoothed counts of the (N-1)-grams while still summing over the seen N-grams.
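To make the two notes concrete, here is a minimal sketch in Python for the bigram case (N = 2). It covers only the counting conventions above (the smoothed zero count N1/N0 and the normalization over seen and unseen bigrams), not the full backoff of equation 6.37; the function name, the fallback to the raw count when the Good-Turing discount is undefined, and the toy data in the usage note are all our own assumptions:

```python
from collections import Counter

def smoothed_bigram_probs(tokens, vocab):
    """Smoothed bigram probabilities per the notes above (N = 2).

    Every unseen bigram gets the smoothed zero count N1/N0 (note 1);
    each probability is normalized by the sum of smoothed counts over
    ALL bigrams with the same history, seen and unseen (note 2), so
    the distribution sums to 1 by construction.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    Nc = Counter(bigrams.values())          # Nc[c] = bigram types seen c times
    N0 = len(vocab) ** 2 - len(bigrams)     # number of unseen bigram types
    zero_star = Nc[1] / N0                  # note 1: smoothed zero count N1/N0

    def c_star(c):
        if c == 0:
            return zero_star
        # Good-Turing discount where defined; keep the raw count otherwise
        # (a practical shortcut -- on tiny data the discount can misbehave).
        return (c + 1) * Nc[c + 1] / Nc[c] if Nc[c + 1] else c

    probs = {}
    for v in vocab:
        # note 2: normalize over all bigrams starting with v, seen and unseen
        denom = sum(c_star(bigrams[(v, u)]) for u in vocab)
        for w in vocab:
            probs[(v, w)] = c_star(bigrams[(v, w)]) / denom
    return probs
```

For example, on the toy corpus `"a b a b a c".split()` with vocabulary `{"a", "b", "c"}`, every unseen bigram such as (a, a) receives a nonzero probability, and each conditional distribution sums to 1.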
  4. (chapters 7 & 8): POS-tagger implements a (toy) HMM POS tagger. It can use a POS bigram table to tag a sentence (note: it is extremely slow). The tagger uses a Viterbi search to find the path with the lowest "cost", i.e., the path with minimal summed cost Cost = -log2(P(Word|Tag)) + m * -log2(P(Tag|PreviousTag)). In this formula, m is the language match factor. Answer the following questions:
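The cost formula above can be sketched as a Viterbi search like the following. This is an illustration, not the POS-tagger program itself: the function name, the dict-based probability tables, and the sentence-start symbol "<s>" are hypothetical choices, and missing table entries are treated as probability 0 (infinite cost):

```python
import math

def viterbi_tag(words, tags, p_word_given_tag, p_tag_given_prev, m=1.0):
    """Find the tag path with minimal summed cost, where
    Cost = -log2(P(word|tag)) + m * -log2(P(tag|previous_tag))
    and m is the language match factor.

    Probability tables are plain dicts keyed by (word, tag) and
    (tag, previous_tag); missing entries count as P = 0.
    Returns (total cost, tag sequence).
    """
    INF = float("inf")

    def cost(p):
        return -math.log2(p) if p > 0 else INF

    # best maps each tag to (cost of the cheapest path ending in it, the path);
    # "<s>" is an assumed start symbol used for the first transition.
    best = {"<s>": (0.0, [])}
    for word in words:
        step = {}
        for tag in tags:
            emit = cost(p_word_given_tag.get((word, tag), 0.0))
            # Viterbi step: keep only the cheapest predecessor per tag.
            step[tag] = min(
                (prev_cost + emit
                 + m * cost(p_tag_given_prev.get((tag, prev), 0.0)),
                 prev_path + [tag])
                for prev, (prev_cost, prev_path) in best.items()
            )
        best = step
    return min(best.values())
```

Note that minimizing this summed cost is equivalent to maximizing the product of the corresponding probabilities, which is why the HMM search can use costs at all.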
  5. Student Presentations of recent papers