Štěpán Roh - PFL043 - Assignment #1

1. Entropy of a Text

Attached files

Main source code: assign1a.pl
TEXTEN1.txt output: TEXTEN1.log
TEXTCZ1.txt output: TEXTCZ1.log
Output processor source code: assign1a_proc.pl
Processed output: assign1a_proc.txt

Results interpretation

The processed output shows that the English text has higher entropy than the Czech text. From the table we can see that the entropy drops as the number of words with frequency 1 (freq1) grows: more low-frequency words mean a larger number of word pairs, each seen in only a few combinations. I would guess that having a large number of word forms (not lemmas) with low frequency is characteristic of inflecting languages, as opposed to isolating/analytical ones. We can also see that messing with characters results in lower entropy (the more messing, the lower the entropy); such changes increase the number of low-frequency words in both languages. Messing with words increases the entropy for English, but not for Czech, which looks like a nice anomaly. I think that for English the entropy still follows the number of low-frequency words, while for Czech the change in that number is not as large and other phenomena become more important.
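
For concreteness, the computation behind these numbers can be sketched roughly as follows, assuming the quantity reported is the conditional entropy of a word given the immediately preceding word and that the input has one token per line. This is only an illustrative sketch; assign1a.pl may be organized differently.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Conditional entropy H(J|I) in bits: entropy of a word J given the
  # immediately preceding word I, estimated from pair and left-word counts.
  sub conditional_entropy {
      my @words = @_;
      my (%left, %pair);
      for my $k (0 .. $#words - 1) {
          $left{ $words[$k] }++;                # count of w as left member of a pair
          $pair{"$words[$k] $words[$k+1]"}++;   # count of the pair (w, w')
      }
      my $n = @words - 1;                       # total number of pairs
      my $h = 0;
      for my $p (keys %pair) {
          my ($i) = split / /, $p;              # left member of the pair
          my $p_ij  = $pair{$p} / $n;           # P(i,j)
          my $p_j_i = $pair{$p} / $left{$i};    # P(j|i)
          $h -= $p_ij * log($p_j_i) / log(2);
      }
      return $h;
  }

  # One token per line on standard input is assumed (as in TEXTEN1.txt).
  chomp(my @words = <STDIN>);
  printf "%.4f\n", conditional_entropy(@words);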

Paper-and-pencil exercise

The change in conditional entropy depends on the border pair (the last word of T1 followed by the first word of T2). The new entropy is a "sum" of both old entropies and the weighted conditional probability of the border pair (not a true sum, because the probabilities are rescaled for the larger text). If the last word of T1 did not previously appear in the first position of any pair, the weighted probability of the border pair is zero, which makes the new entropy less than or equal to E. If the last word of T1 did appear in the first position of some pair, the weighted probability of the border pair is non-zero, which makes the new entropy greater than or equal to E.
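
To spell this out a little (the notation below is mine, not from the exercise): if T1 contributes n1 word pairs and T2 contributes n2, the concatenation T1T2 has n1 + n2 + 1 pairs, the extra one being the border pair, so

  c_{T_1 T_2}(i,j) = c_{T_1}(i,j) + c_{T_2}(i,j)
                     + [\, i = \mathrm{last}(T_1) \wedge j = \mathrm{first}(T_2) \,]

  H_{T_1 T_2}(J \mid I) = - \sum_{i,j} \frac{c_{T_1 T_2}(i,j)}{n_1 + n_2 + 1}
                            \log_2 \frac{c_{T_1 T_2}(i,j)}{c_{T_1 T_2}(i,\cdot)}

where [\,\cdot\,] is 1 if the condition holds and 0 otherwise, and c(i,\cdot) is the count of i as the left member of a pair. Every old pair thus reappears with a rescaled probability, and the only genuinely new term is the one contributed by the border pair.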

2. Cross-Entropy and Language Modeling

Attached files

Source code: assign1b.pl
Output: assign1b.txt

Results interpretation

The output shows that the English text has lower cross entropy than the Czech text. I think the reason is that the "word coverage" of the test data is better for TEXTEN1.txt than for TEXTCZ1.txt. This has two reasons: 1. the text is not uniform (even 75% coverage is not much), so when we split off the test and training data we get different sets of words; 2. Czech has a larger vocabulary (counting word forms, not lemmas). The output also shows that bigrams for the English text and unigrams for the Czech text have a greater impact on minimizing the cross entropy than trigrams. This is quite surprising to me and may have two explanations: 1. there is a bug in the program; 2. the characteristics of the test text are very unusual. I would say 2 is right, because the English text's coverage is only 75% and the Czech text's only 66%, which is very low, so bigrams and trigrams often contain holes caused by unknown words. This is also confirmed by the fact that increasing the trigram weight increases the entropy while decreasing it decreases the entropy (with an exception for Czech, where decreasing the trigram weight increases the entropy, but there l(3) is already very small).
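
For reference, the interpolated model behind these cross-entropy figures can be sketched as follows. The subroutine names, argument layout, and count hashes are only illustrative and need not match assign1b.pl; the weights l0..l3 are assumed to be already trained (e.g. by EM on held-out data).

  use strict;
  use warnings;

  # Linearly interpolated trigram probability:
  #   P'(w3|w1,w2) = l0/V + l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2)
  # $l  -> [l0, l1, l2, l3] smoothing weights (summing to 1)
  # $V  -> vocabulary size, $N -> number of training tokens
  # $c1, $c2, $c3 -> unigram, bigram, trigram count hashes from training data
  sub smoothed_p {
      my ($w1, $w2, $w3, $l, $V, $N, $c1, $c2, $c3) = @_;
      my $p1 = ($c1->{$w3} // 0) / $N;
      my $p2 = ($c1->{$w2} // 0)
             ? ($c2->{"$w2 $w3"} // 0) / $c1->{$w2} : 0;
      my $p3 = ($c2->{"$w1 $w2"} // 0)
             ? ($c3->{"$w1 $w2 $w3"} // 0) / $c2->{"$w1 $w2"} : 0;
      return $l->[0] / $V + $l->[1] * $p1 + $l->[2] * $p2 + $l->[3] * $p3;
  }

  # Cross entropy of the test data (arrayref of words) in bits per word.
  sub cross_entropy {
      my ($test, @model) = @_;
      my $h = 0;
      for my $k (2 .. $#$test) {
          $h -= log(smoothed_p(@$test[$k - 2 .. $k], @model)) / log(2);
      }
      return $h / ($#$test - 1);   # number of predicted positions
  }

In this sketch, the boosting or discounting of the trigram weight mentioned above corresponds to changing $l->[3] and renormalizing the remaining weights before calling cross_entropy.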