Advanced Natural Language Processing

Lecture Notes

LEC #   TOPICS
1       Introduction and Overview (PDF)
2       Parsing and Syntax I (PDF)
3       Smoothed Estimation, and Language Modeling (PDF)
4       Parsing and Syntax II (PDF)
5       The EM Algorithm (PDF)
6       The EM Algorithm Part II (PDF)
7       Lexical Similarity (PDF)
8       Lexical Similarity (cont.) (PDF)
9       Log-Linear Models (PDF)
10      Tagging and History-based Models (PDF)
11      Grammar Induction (PDF)
12      Computational Modeling of Discourse (PDF)
13      Text Segmentation (PDF - 3.6 MB)
14      Local Coherence and Coreference (PDF)
15      Machine Translation (PDF)
16      Machine Translation (cont.) (PDF)
17      Machine Translation (cont.) (PDF 1) (PDF 2 - 1.4 MB) (Courtesy of Philipp Koehn and Ivona Kucerova. Used with permission.)
18      Graph-based Methods for NLP Applications (PDF)
19      Word Sense Disambiguation (PDF)
20      Global Linear Models (PDF)
21      Global Linear Models Part II (PDF)
22      Dialogue Processing (PDF)
23      Dialogue Processing (cont.) (PDF)
24      Guest Lecture: Stephanie Seneff
25      Text Summarization (PDF)

Assignments

ASSIGNMENTS

SUPPORTING FILES

Homework 1 (PDF)

Homework 2 (PDF)

counts.gz (GZ - 3.2 MB) (The GZ file contains: counts.txt.)
theirthere.test (TXT)

Homework 3 (PDF)

data.gz (GZ) (The GZ file contains: data.txt.)
synrev (TXT)

Development Data

Verb pairs and their associated cosine similarity scores (note that sim.in is the same file as synrev).
sim.in (TXT)
sim.out (TXT)
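As a rough illustration of the scores in sim.out, the sketch below computes cosine similarity between two sparse feature vectors. Representing each verb as a dictionary from feature name to weight is an assumption made here for illustration; the homework handout defines the actual features.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts.

    The dict-of-weights representation is an assumption of this sketch.
    """
    # Only features present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)
```

Returning 0.0 for an all-zero vector is a convention of this sketch, not something the reference files are known to require.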

Result of complete-link clustering to the 2-cluster level.
cluster1 (TXT)
cluster2 (TXT)
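For orientation, complete-link clustering down to a fixed number of clusters can be sketched as follows. The function name, the `sim` callback, and the tie-breaking order are hypothetical choices of this sketch; the reference clusters may have been produced with different tie handling.

```python
def complete_link_clusters(items, sim, k=2):
    """Agglomerative clustering with complete linkage, stopping at k clusters.

    `sim(a, b)` is any symmetric similarity function (hypothetical callback).
    """
    clusters = [[item] for item in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: a pair of clusters is only as similar as
                # its least similar pair of members.
                link = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes every linkage on each pass, which is simple to check by hand but slow on large inputs.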

Homework 4 (PDF)

poscounts.gz (GZ) (The GZ file contains: poscounts.txt.)
wsj.19-21.test (TXT)

Extra Materials

A package containing the scripts that were used to generate the poscounts.gz corpus. We are providing this code in case you are curious about the data generation. For the purposes of the problem set, however, please use the poscounts.gz training corpus to ensure that your results comply with the reference implementation.
ft.tar.tar (TAR - 2.5 MB)

Development data for testing your tag-trigram probabilities; tritest contains tag trigrams, while tritest.probs contains the corresponding probabilities.
tritest (TXT)
tritest.probs (TXT)

Development data for testing your Viterbi tag assignments. The simplesents file contains about 530 simple sentences that admit relatively few possible tag assignments. The simplesents.bf_tagged file contains optimal tag assignments and log-probabilities as discovered by brute-force enumeration. The first element in every line of simplesents.bf_tagged gives the log-probability of the best tagging, and the rest of the line gives the tag assignment itself.
simplesents (TXT)
simplesents.bf_tagged (TXT)
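A minimal sketch of checking Viterbi output against a line of simplesents.bf_tagged is shown below. It assumes whitespace-separated fields, treats everything after the first field as opaque tokens, and uses an arbitrary tolerance; the function names are hypothetical.

```python
import math

def parse_bf_tagged_line(line):
    """Split a reference line into (log_probability, tag_assignment_tokens)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    return float(fields[0]), fields[1:]

def matches_reference(log_prob, tags, reference_line, tol=1e-6):
    """True if both the tagging and its log-probability agree with the reference.

    `tol` is an assumed tolerance, not a value given by the course materials.
    """
    ref_log_prob, ref_tags = parse_bf_tagged_line(reference_line)
    return tags == ref_tags and math.isclose(log_prob, ref_log_prob, abs_tol=tol)
```

If two taggings tie on probability, the tag sequences can differ while the log-probabilities still agree, so comparing the log-probability alone may be the more forgiving check.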

Homework 5 (PDF)

Resources

BoosTexter - The download page for the BoosTexter binaries. If you get an error message, try reloading the link. BoosTexter is UNIX®-based, so to run it on Windows you will need a UNIX® shell such as Cygwin or U/Win.

Penn Treebank Tagset - Descriptions of the Penn part-of-speech tags. NB: When you are determining the plurality of a noun phrase, you will find that the last tag is not always a noun-type tag. Use the following rule to determine the plurality of these other parts of speech:
Plural: CD, JJP, SYM
Singular: DT, JJ, RB, VBG, WDT
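The rule above can be sketched as a small lookup. The handling of noun-type tags follows the standard Penn convention (NN and NNP singular, NNS and NNPS plural); the function name and the `None` result for unlisted tags are choices of this sketch.

```python
# Non-noun tags, exactly as listed in the rule above.
PLURAL_TAGS = {"CD", "JJP", "SYM"}
SINGULAR_TAGS = {"DT", "JJ", "RB", "VBG", "WDT"}

# Noun-type tags under the standard Penn Treebank convention.
NOUN_PLURALITY = {"NN": "singular", "NNP": "singular",
                  "NNS": "plural", "NNPS": "plural"}

def plurality_of_last_tag(tag):
    """Return 'singular', 'plural', or None for a tag the rule does not cover."""
    if tag in NOUN_PLURALITY:
        return NOUN_PLURALITY[tag]
    if tag in PLURAL_TAGS:
        return "plural"
    if tag in SINGULAR_TAGS:
        return "singular"
    return None
```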

Datasets

Sentence pairs with coreference annotations.
coref_samples.train (TXT)
coref_samples.test (TXT)

BoosTexter .names template for feature generation. Please adhere to this template to ensure that your features conform to the reference results.
coref.names (TXT)

Reference features for the first 30 sentence pairs in coref_samples.train. The feature vectors were generated in a left-to-right postorder traversal of the noun phrases in a given sentence pair; e.g.
[[ [[ 1 ]] 2 ]] [[ 3 ]] [[ 4 ]] [[ [[ 5 ]] [[ [[ 6 ]] 7 ]] 8 ]]
first30.data (TXT)
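In the bracketed example, each noun phrase receives its number at the point where it closes, so nested phrases come before the phrase that contains them. The sketch below reproduces that ordering with a stack. It assumes the input is already split into tokens with `[[` and `]]` as standalone bracket tokens, which may not match the exact markup in the data files.

```python
def postorder_noun_phrases(tokens):
    """Return noun-phrase strings in left-to-right postorder.

    Assumes '[[' opens and ']]' closes a noun phrase (an assumption
    based on the example notation, not on the data files themselves).
    """
    open_starts = []  # word index at which each still-open phrase began
    words = []        # non-bracket tokens seen so far
    phrases = []
    for token in tokens:
        if token == "[[":
            open_starts.append(len(words))
        elif token == "]]":
            start = open_starts.pop()
            # A phrase is emitted when it closes, after all phrases nested in it.
            phrases.append(" ".join(words[start:]))
        else:
            words.append(token)
    return phrases
```

For instance, `postorder_noun_phrases("[[ [[ the dog ]] 's owner ]] saw [[ it ]]".split())` yields the inner phrase, then the enclosing phrase, then the final phrase.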

Homework 6 (PDF)

corpus.de.gz (GZ) (The GZ file contains: corpus.de.txt.)
corpus.en.gz (GZ) (The GZ file contains: corpus.en.txt.)

Datasets

A set of words and their associated translation probabilities. The output file is formatted as a series of lines, where each line contains a number of (German word, translation probability) pairs, all tokens separated by spaces.
devwords (TXT)
devwords.out (TXT)
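As a sketch of reading one line in the format just described, the function below pairs up alternating tokens. It assumes a line holds nothing but (German word, translation probability) pairs; if the real file carries an extra leading token, the parser would need adjusting. The function name is hypothetical.

```python
def parse_translation_line(line):
    """Parse 'word prob word prob ...' into a list of (word, probability) pairs."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of space-separated tokens")
    # Tokens alternate: German word, then its translation probability.
    return [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]
```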

A set of words for which you must provide output probabilities. Please provide a file testwords.out with the same format as devwords.out above.
testwords (TXT)