Topic Modelling: Anglo-Saxon
Topic modelling is one of the most popular approaches to text categorization and textual analysis. We employ a particular form of topic modelling, Latent Dirichlet Allocation (LDA), to analyse Anglo-Saxon property transfer documents. In this model, the number of latent topics to be discovered is specified a priori. For a collection of roughly 1400 documents, we run the LDA model with four topics. Each document is modelled as a distribution over topics, and each topic as a distribution over the words in the vocabulary of the collection. In other words, each document is a percentage mixture (summing to 100%) of the four topics, and each topic is a percentage mixture (summing to 100%) of the words in the vocabulary.
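To make the idea of a "percentage mixture" concrete, the sketch below draws one document-topic mixture and one topic-word mixture from a Dirichlet distribution, the prior that gives LDA its name. It does not fit a model to any documents. The seed, the concentration value of 1.0, the helper name sample_dirichlet and the three-item vocabulary are illustrative assumptions, not settings from the DEEDS analysis.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed, for repeatability only
num_topics = 4          # the number of topics used in the text
vocab = ["the_quick", "quick_brown", "brown_fox"]  # toy vocabulary (assumption)

# One document's mixture over the four topics.
doc_topic = sample_dirichlet([1.0] * num_topics, rng)

# One topic's mixture over the words of the vocabulary.
topic_word = sample_dirichlet([1.0] * len(vocab), rng)

# Expressed as percentages, each mixture sums to 100 (up to rounding error).
print([round(100 * p, 1) for p in doc_topic])
print([round(100 * p, 1) for p in topic_word])
```

Each printed list is one mixture of the kind described above: non-negative shares that together account for the whole document or the whole topic.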
The words in this vocabulary are not necessarily individual words, as one would find in a dictionary. They can also be n-grams (strings of n consecutive individual words). For example, for a document which reads “The quick brown fox.”, the 2-grams (bigrams) are “The_quick”, “quick_brown” and “brown_fox”.
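The n-gram construction can be sketched in a few lines of Python. The function name ngrams and the tokenization rule (runs of word characters, so the final full stop is dropped) are assumptions made for this illustration; the preprocessing actually applied to the collection is not described here.

```python
import re


def ngrams(text, n=2):
    """Return the n-grams of `text` as underscore-joined strings."""
    # Split into runs of word characters; punctuation is discarded.
    tokens = re.findall(r"\w+", text)
    # Slide a window of n tokens across the token list.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("The quick brown fox.", n=2)
print(bigrams)  # ['The_quick', 'quick_brown', 'brown_fox']
```

With n=1 the same function returns the individual words, and a text shorter than n words yields an empty list.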
Based on experimental analysis and the interpretability of the results, we have implemented an LDA model with four topics and 2-gram words for the DEEDS English collection.
Topic Distribution
Word Distribution