AES Consulting meeting on 9 Aug 2017

Differences in word frequency between two corpora

A client has 2 different corpora, say A and B, and is interested in differences in word frequencies (amongst ~100 words) between the corpora.

The client is currently bootstrapping residuals, though the details of that procedure were not clear from the meeting.

Advice

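A permutation (randomization) test seems like the most straightforward approach. In this approach, the corpus labels (A or B) are randomly reassigned to the documents and the statistic of interest is recalculated; this is repeated many times. The recalculated statistics form the distribution of the statistic under the null hypothesis that corpus membership has no relation to word frequency. The statistic observed with the true labels is then compared to this distribution to obtain a p-value.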
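
A minimal sketch of such a test for a single word is given below. It is illustrative only and rests on assumptions not discussed in the meeting: documents are stored as lists of tokens, the statistic is the difference in relative frequency of the word between the corpora, and the p-value is two-sided. The function names (rel_freq, freq_diff, permutation_test) and the toy data are hypothetical.

```python
import random

def rel_freq(docs, word):
    """Relative frequency of `word` over all tokens in `docs`."""
    total = sum(len(d) for d in docs)
    hits = sum(d.count(word) for d in docs)
    return hits / total if total else 0.0

def freq_diff(docs, labels, word):
    """Assumed statistic: relative frequency in A minus relative frequency in B."""
    a = [d for d, lab in zip(docs, labels) if lab == "A"]
    b = [d for d, lab in zip(docs, labels) if lab == "B"]
    return rel_freq(a, word) - rel_freq(b, word)

def permutation_test(docs, labels, word, n_perm=2000, seed=0):
    """Shuffle corpus labels across documents to approximate the null distribution."""
    rng = random.Random(seed)
    observed = freq_diff(docs, labels, word)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of corpus labels
        if abs(freq_diff(docs, shuffled, word)) >= abs(observed):
            extreme += 1
    # Add 1 to numerator and denominator so the estimated p-value is never zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Toy data (made up): each document is a list of tokens.
docs = [["x", "x", "y"]] * 6 + [["y", "y", "y"]] * 6
labels = ["A"] * 6 + ["B"] * 6
observed, p_value = permutation_test(docs, labels, "x")
print(observed, p_value)
```

Shuffling at the document level keeps each document's word counts intact, which matches the advice above that labels are reassigned to documents rather than to individual words. The same function could be called once for each of the ~100 words of interest.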