Topic Modeling
Topic modelling is a statistical learning approach in natural language processing that discovers latent or hidden “themes” (referred to here as ‘topics’) in a large textual collection, or corpus.
It is one of the most popular techniques for text categorization and textual analysis. We employ a particular form of topic modelling, Latent Dirichlet Allocation (LDA), to analyse English property transfer documents from the 11th to 13th centuries. In this model, the number of latent topics we wish to discover is specified a priori. For roughly 17,000 documents, we run the LDA model with eight topics. Documents are distributed over topics, and topics are distributed over the words in the vocabulary of the collection. In other words, each document is a percentage mixture (summing to 100%) of the eight topics, and each topic is a percentage mixture (summing to 100%) of the vocabulary.
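The two percentage mixtures can be illustrated with a minimal collapsed Gibbs sampler for LDA in plain Python. This is a toy sketch, not the pipeline used for the DEEDS collection; the tiny corpus, the hyperparameters alpha and beta, and the iteration count are all illustrative assumptions. It shows that each document’s topic mixture and each topic’s word mixture normalize to 100%.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count matrices: document-topic, topic-word, and topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * V for _ in range(n_topics)]
    nk = [0] * n_topics

    # Randomly initialize a topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1
        z.append(zs)

    # Resample each token's topic from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                k = n_topics - 1
                for t, wt in enumerate(weights):
                    if r < wt:
                        k = t
                        break
                    r -= wt
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1

    # theta: per-document topic mixtures; phi: per-topic word mixtures.
    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
             for row in ndk]
    phi = [[(c + beta) / (nk[t] + V * beta) for c in nkw[t]]
           for t in range(n_topics)]
    return theta, phi, vocab

# Hypothetical mini-corpus; each row of theta and phi sums to 1 (i.e. 100%).
docs = [["land", "grant", "land"], ["witness", "seal", "witness"],
        ["land", "grant", "seal"]]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)
```

In practice a library implementation (for example, a standard LDA package) would be used rather than a hand-rolled sampler; the point here is only the shape of the output: one topic distribution per document and one word distribution per topic.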
The words in this vocabulary are not necessarily individual words, as one would find in a dictionary. They can also be n-gram words (strings of n consecutive individual words). For example, for a document which reads “The quick brown fox jumps over the lazy dog”, the 3-gram words (trigrams) are “The_quick_brown”, “quick_brown_fox”, “brown_fox_jumps”, “fox_jumps_over”, “jumps_over_the”, “over_the_lazy” and “the_lazy_dog”.
Based on experimental analysis and the interpretability of the results, we have implemented an LDA model with eight topics and 3-gram words for the DEEDS English collection.
Topic Distribution
Word Distribution