HLTA Algorithm
Hierarchical Latent Tree Analysis (HLTA) is a novel method for hierarchical topic detection. It uses a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence or absence of words in a document. The variables at the other levels are binary latent variables: those at the lowest latent level represent word co-occurrence patterns, and those at higher levels represent co-occurrence of the patterns at the level below. Each latent variable gives a soft partition of the documents, and the document clusters in these partitions are interpreted as topics.
Unlike LDA-based topic models, HLTMs do not model a document generation process, and they use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
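To make the idea of a soft partition concrete, here is a minimal sketch for a single binary latent variable Z with two observed word variables. It computes P(Z = 1 | word presence/absence) for each document with Bayes' rule. The words, probabilities, and the function name `posterior_topic` are hypothetical values chosen for illustration; they are not taken from a trained HLTA model, and a real HLTM has many latent variables arranged in a tree.

```python
def posterior_topic(doc_words, prior, p_word_given_on, p_word_given_off):
    """Return P(Z = 1 | words) for one binary latent variable Z.

    doc_words: set of words present in the document.
    prior: P(Z = 1).
    p_word_given_on / p_word_given_off: P(word present | Z = 1) and
    P(word present | Z = 0) for each word attached to Z.
    """
    like_on, like_off = prior, 1.0 - prior
    for word, p_on in p_word_given_on.items():
        p_off = p_word_given_off[word]
        present = word in doc_words
        # Multiply in the probability of the observed presence or absence.
        like_on *= p_on if present else 1.0 - p_on
        like_off *= p_off if present else 1.0 - p_off
    return like_on / (like_on + like_off)


# Hypothetical parameters for a "battery" pattern (assumed, not learned).
prior = 0.3
p_on = {"battery": 0.8, "charge": 0.7}
p_off = {"battery": 0.1, "charge": 0.05}

docs = [{"battery", "charge", "phone"}, {"battery"}, {"screen"}]
memberships = [posterior_topic(d, prior, p_on, p_off) for d in docs]
for doc, m in zip(docs, memberships):
    print(sorted(doc), round(m, 3))
```

Each document receives a membership probability rather than a hard label, which is what makes the partition soft. Documents with a high probability form the cluster that is read as the topic.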
Steps for clustering
We selected 50,000 reviews to build an HLTA model. We segment the raw text into sentences and then run HLTA to obtain word clusters. The procedure is:
- Segment the raw text into sentences.
- Convert the text files to a bag-of-words representation with 1000 words and 1 concatenation.
- Use the EM algorithm to update the parameters, building the model with a maximum of 50 EM steps.
- Extract topics from the topic model.
- Find out which documents belong to each topic.
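The first two steps can be sketched in plain Python as follows. This is only an illustration of the data format, not the HLTA tool's own conversion code: the function names are hypothetical, the sentence splitter is a naive regular expression, each sentence is treated as one document (an assumption), and the vocabulary is chosen by document frequency, which may differ from the word-selection rule the HLTA tool applies.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def build_vocabulary(documents, max_words=1000):
    """Keep the max_words words that occur in the most documents (assumed rule)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return [word for word, _ in doc_freq.most_common(max_words)]


def to_binary_bow(documents, vocabulary):
    """One row per document; 1 if the word is present, 0 otherwise."""
    rows = []
    for doc in documents:
        present = set(tokenize(doc))
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows


review = "The battery is great. The screen is dim! Would I buy it again?"
sentences = split_sentences(review)
vocab = build_vocabulary(sentences, max_words=1000)
matrix = to_binary_bow(sentences, vocab)
print(len(sentences), "sentences,", len(vocab), "vocabulary words")
```

The resulting matrix holds presence/absence values rather than counts, matching the binary word variables at the bottom level of the model.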
Advantages
Compared with LDA, HLTA produces a hierarchy of topics. In addition, in the LDA approach to topic detection, a topic is characterized by the words that are used with high frequency when writing about the topic. However, words that are used with high frequency in one topic may also be used with high frequency in other topics, so they may not be the best words to characterize the topic. HLTA instead characterizes a topic by the words that appear with high frequency in the topic and with low frequency in other topics.
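The difference between the two ways of characterizing a topic can be illustrated with a small sketch. It ranks words by how much more often they appear in documents inside a topic than in documents outside it. The score used here (a simple difference of document frequencies), the function name `characteristic_words`, and the toy documents are assumptions made for illustration and are not necessarily the criterion HLTA itself uses.

```python
def characteristic_words(docs, in_topic, top_n=3):
    """Rank words by (frequency inside the topic) - (frequency outside it).

    docs: list of sets of words, one set per document.
    in_topic: list of booleans, True if the document belongs to the topic.
    """
    n_in = sum(in_topic)
    n_out = len(docs) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("need documents both inside and outside the topic")
    vocab = set().union(*docs)
    scores = {}
    for word in vocab:
        f_in = sum(1 for d, t in zip(docs, in_topic) if t and word in d) / n_in
        f_out = sum(1 for d, t in zip(docs, in_topic) if not t and word in d) / n_out
        scores[word] = f_in - f_out
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


docs = [
    {"phone", "battery", "charge"},
    {"phone", "battery", "life"},
    {"phone", "screen", "bright"},
    {"phone", "screen", "crack"},
]
in_topic = [True, True, False, False]
top = characteristic_words(docs, in_topic, top_n=2)
print(top)
```

In this toy data, "phone" appears in every document inside the topic, so a ranking by frequency alone would put it first. Because it is just as frequent outside the topic, its score is zero and it is not selected.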
Compared with latent factor models, HLTA also gives clearer thematic characterizations.