Topic Extraction Based on LDA and Its Application in Tourism 1
Hui Peng1,2, Jiapei Huang1,2, Xi Li1,2, Danyang Dong1,2, Peiying Fan1,2
1 Tourism Science School of Beijing International Studies University, Beijing, China;
2 Research Center for Beijing Tourism Development, Beijing, China

                  Abstract
                  This paper introduces LDA, an algorithm that automatically extracts topics from a large
                  amount of text, and presents a case study of its application: extracting the features of the
                  information recommended by travel microblog key opinion leaders. These features are used to
                  construct a model of influence on travel decisions, which is then used to analyze how the
                  recommendations of travel microblog key opinion leaders affect travellers' travel decisions.
                  The following conclusions were drawn: the information recommended by travel microblog key
                  opinion leaders serves as a reference for travellers' decision-making, and among the six
                  features of this information, the degree of quantification of the recommended information is
                  the factor with the greatest impact on travel decision-making.

                  Keywords
                  LDA Model, Topic Extraction, Text Mining, Recommended Information of Microblog
                  Opinion Leader, Travellers' Decision-Making

1         Introduction

   In the era of information explosion, obtaining useful information from massive text collections
requires automatically classifying and clustering texts and extracting their topics. LDA is a method
that uses a probabilistic generative model to model the latent topics of text. The basic idea is to
assume that the corpus contains several independent latent topics. All words in each document of the
corpus are generated according to the probability distributions of these topics, so that each document
can be understood as a distribution over latent topics. At present, the LDA model is widely used in
topic mining, text retrieval, text classification, citation analysis and social network analysis. This
paper introduces the principle of the LDA model, applies it to tourism text data, and analyzes the
information recommended by key opinion leaders on travel microblogs. The features of this information
are extracted with a Python implementation of the LDA model so that related issues in the tourism
field can be analyzed further.

2   The LDA model and its Python implementation
2.1 Introduction to the LDA model

    Latent Dirichlet Allocation (LDA) was proposed by Blei et al. in 2003[1]. It is a Bayesian
probabilistic generative model with a three-layer structure of documents, topics, and words[2]. As an
unsupervised machine learning method, the LDA model consists of two main steps: topic generation and
word generation. Once the number of topics K has been fixed for training, running the model yields
the probability distribution of words under each topic and the probability distribution of topics for
each document.

AIoTC2022@International Conference on Artificial Intelligence, Internet of Things and Cloud Computing Technology
EMAIL: *Corresponding author's email: penghui@bisu.edu.cn (Hui Peng)
             © 2022 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)

    The modeling process of LDA for text is shown in Figure 1, where the circles indicate the latent
variables, the directed arrows indicate the dependency between two variables, and the rectangular
boxes indicate repeated sampling. The specific steps of the LDA topic model are as follows:
    $\alpha$ and $\beta$ are the prior parameters of the Dirichlet distributions, $\theta_m$ is the
parameter of the multinomial distribution of topics in document $m$, and $\varphi_k$ is the parameter
of the multinomial distribution of words in topic $k$; they obey Dirichlet distributions with
hyperparameters $\alpha$ and $\beta$, respectively[3]:

$$\theta_m \sim \mathrm{Dirichlet}(\alpha) \tag{1}$$

$$\varphi_k \sim \mathrm{Dirichlet}(\beta) \tag{2}$$

    M represents the total number of documents, and N represents the number of feature words contained
in a document. According to the topic distribution $\theta_m$, for the nth word in any document m, its
topic $z_{m,n}$ is obtained as follows[4]:

$$z_{m,n} \sim \mathrm{Multinomial}(\theta_m) \tag{3}$$

   By combining the topic $z_{m,n}$ and the word distributions $\varphi_k$, the specific word
$w_{m,n}$ is obtained as follows[4]:

$$w_{m,n} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}}) \tag{4}$$

   Repeating these two sampling steps N times yields a document containing N words. Repeating the
whole process for every document finally generates M documents under K topics.
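
   The generative steps above can be sketched in a few lines of Python. This is a minimal illustration
using only the standard library, not code from the study: the function names, the symmetric
hyperparameter values (alpha = 0.5, beta = 0.1) and the toy vocabulary in the usage note are
assumptions chosen for readability.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(M, N, K, vocab, alpha=0.5, beta=0.1, seed=0):
    """Generate M documents of N words each from K topics.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    V = len(vocab)
    # One word distribution per topic, drawn from Dirichlet(beta)
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    thetas, docs = [], []
    for _ in range(M):
        # Topic distribution of this document, drawn from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * K, rng)
        doc = []
        for _ in range(N):
            # Draw a topic for this word position from the document's topic distribution
            z = rng.choices(range(K), weights=theta)[0]
            # Draw the word itself from the chosen topic's word distribution
            w = rng.choices(vocab, weights=phi[z])[0]
            doc.append(w)
        thetas.append(theta)
        docs.append(doc)
    return docs, thetas, phi
```

   For example, `generate_corpus(M=5, N=8, K=3, vocab=["hotel", "beach", "ticket", "food"])` returns
five synthetic documents of eight words each, together with the sampled topic and word distributions.
Fitting LDA to real text runs this process in reverse: the words are observed and the distributions
are inferred.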
   To extract topics from the information recommended by travel microblog key opinion leaders with the
LDA topic model, the optimal number of topics must be determined. The representative approach is to
measure topic coherence (also called consistency) or perplexity (also called confusion). Coherence
measures how well the words within the same topic fit together: the higher the coherence value, the
more strongly the words within a topic cohere and the better the model fits. Perplexity measures how
uncertain the model is about which topic a document belongs to. It is a commonly used evaluation
metric in natural language processing[5] and is used to test a trained language model. The smaller the
perplexity, the stronger the generalization ability of the model. The specific formula for the
perplexity is as follows[6]:
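
$$\mathrm{Perplexity}(D) = \exp\left\{ -\frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \log p(w_{m,n})}{\sum_{m=1}^{M} N_m} \right\} \tag{5}$$

$$p(w_{m,n}) = \sum_{k=1}^{K} p(z_k \mid m)\, p(w_{m,n} \mid z_k) \tag{6}$$

    $p(w_{m,n})$ represents the probability of each word in the test set $D$, and $\sum_{m=1}^{M} N_m$
denotes the sum of all feature words, where $N_m$ is the number of feature words in document $m$.
$p(z_k \mid m)$ means the probability of topic k in a given document, and $p(w_{m,n} \mid z_k)$ means
the probability of each word under a given topic.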
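
   As a worked illustration, perplexity can be computed directly from a fitted model's two
distributions: each word's probability is the mixture of its per-topic probabilities weighted by the
document's topic probabilities, and perplexity is the exponential of the negative average
log-probability per word. The sketch below assumes documents are lists of integer word ids; the
function and argument names are illustrative.

```python
import math

def perplexity(docs, thetas, phi):
    """Perplexity of a set of documents under an LDA model.

    docs:   list of documents, each a list of integer word ids
    thetas: thetas[m][k] = probability of topic k in document m
    phi:    phi[k][w]    = probability of word w under topic k
    """
    log_lik, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for w in doc:
            # Word probability: topic-word probabilities mixed by the document's topic weights
            p_w = sum(thetas[m][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            n_words += 1
    # Exponentiated negative average log-likelihood per word
    return math.exp(-log_lik / n_words)
```

   A useful sanity check: a model that spreads probability uniformly over a vocabulary of V words has
a perplexity of exactly V, so any useful model should score well below the vocabulary size.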


Figure 1 Model LDA structure and its workflow

2.2 Python Implementation of LDA

  To apply the LDA model, the text data are first downloaded and preprocessed. The modeling function
is then called in Python, and a suitable number of topics and the keywords under each topic are
obtained by adjusting the number of topics according to the coherence (consistency) and perplexity
(confusion) scores. Figure 2 shows the main code of the modeling part.


Figure 2 The main code of the modeling part
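
  The code in Figure 2 is not reproduced in the text, so the following is not the authors' code. It is
a minimal sketch of one standard way to fit LDA, collapsed Gibbs sampling, written with the standard
library only so that the whole procedure is visible. The function names, hyperparameter values and
iteration count are assumptions; in practice a library implementation (for example gensim or
scikit-learn) would normally be used instead.

```python
import random

def fit_lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * K for _ in docs]       # topic counts per document
    nkw = [[0] * V for _ in range(K)]   # word counts per topic
    nk = [0] * K                        # total word count per topic
    z = []                              # topic assignment of every token
    for d, doc in enumerate(docs):      # random initial assignment
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Conditional probability of each topic given all other assignments
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    # Smoothed estimates of the document-topic and topic-word distributions
    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab

def top_words(phi, vocab, n=3):
    """Return the n most probable words of every topic."""
    return [[vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
            for row in phi]
```

  Calling `fit_lda_gibbs(docs, K=4)` on tokenized comments returns the document-topic distribution,
the topic-word distribution and the vocabulary; `top_words` then lists the highest-probability
keywords of each topic, which is the kind of output that is interpreted by hand when naming topics.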

3 Extracting recommended information features from travel microblog key
   opinion leaders
3.1 Prepare sample data

   The samples are the microblog posts published between June 2019 and June 2021 by the bloggers on
the microblog platform lists "Top Ten Influential Travel Bloggers in 2020" and "Top Ten Popular Travel
Bloggers in 2020", together with the user comments on those posts.
   Python was used to capture the content, the numbers of likes and comments, the posting time and the
comment text for these 20 travel microblog key opinion leaders over the period from June 2019 to June
2021. In total, 7,879 posts recommended by the 20 key opinion leaders and 363,779 microblog comments
were obtained. For each post, the top 100 user comments and the associated interaction data were
collected. To ensure the validity of the research results, the data were screened to eliminate invalid
and meaningless comments, and the collected data were then preprocessed, mainly by constructing
stop-word dictionaries, segmenting the text into words, and replacing synonyms.
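
   The preprocessing steps can be sketched as follows. The stop-word set and synonym map are
hypothetical placeholders, and the regular-expression tokenizer is a stand-in that only suits
space-delimited text; Chinese microblog comments would need a dedicated word segmenter in its place.

```python
import re

# Hypothetical resources: a real study would load much larger dictionaries.
STOPWORDS = {"the", "a", "is", "to", "and"}
SYNONYMS = {"pic": "image", "photo": "image", "vid": "video"}

def preprocess(comment, stopwords=STOPWORDS, synonyms=SYNONYMS):
    """Split a comment into tokens, map synonyms to one form, drop stop words."""
    tokens = re.findall(r"\w+", comment.lower())
    tokens = [synonyms.get(t, t) for t in tokens]
    return [t for t in tokens if t not in stopwords]

def clean_corpus(comments):
    """Preprocess every comment and discard those left with no tokens."""
    processed = (preprocess(c) for c in comments)
    return [tokens for tokens in processed if tokens]
```

   Synonyms are replaced before stop words are removed so that a variant spelling is first mapped to
its canonical form and then filtered consistently. Comments that consist only of punctuation or stop
words come out empty and are dropped, which is one simple way to screen out meaningless comments.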

3.2 LDA Topic Analysis
3.2.1 Determining the number of topics for user comments

   The value of K was varied continuously while the changes in perplexity (confusion) and coherence
(consistency) were observed. When K is between 3 and 5, the perplexity is relatively low, as shown in
Figure 4. When K is 4, the topics best reflect and cover the semantics of visitors' comments and the
coherence is the highest, as shown in Figure 3, so the number of topics is set to 4.
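
   The selection described above is qualitative. One way to make it reproducible is the rule sketched
below: keep the values of K whose perplexity is close to the lowest one, then take the most coherent
among them. The rule, the 5% tolerance and the function name are assumptions, and the scores in the
usage example are made-up placeholders, not the values behind Figures 3 and 4.

```python
def choose_num_topics(scores, tolerance=0.05):
    """Pick the number of topics K.

    scores maps K -> (perplexity, coherence). Among the K whose perplexity is
    within `tolerance` (relative) of the lowest perplexity, return the one
    with the highest coherence.
    """
    best_perplexity = min(p for p, _ in scores.values())
    candidates = {k: c for k, (p, c) in scores.items()
                  if p <= best_perplexity * (1 + tolerance)}
    return max(candidates, key=candidates.get)

# Placeholder scores for illustration only (not measured data).
example_scores = {2: (120.0, 0.40), 3: (101.0, 0.45), 4: (100.0, 0.52),
                  5: (103.0, 0.48), 6: (130.0, 0.55)}
print(choose_num_topics(example_scores))  # -> 4
```

   Filtering on perplexity first prevents a model with high coherence but poor generalization (K = 6
in the placeholder scores) from being selected.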


Figure 3 User Review Topic Consistency


Figure 4 User Comments Topic Confusion

    The high-frequency keywords of the four themes extracted from the user comments on the information
recommended by travel microblog key opinion leaders are listed in Table 1. In Theme 1, words such as
"comment", "retweet" and "follow" reflect the interaction between microblog users and opinion leaders,
so Theme 1 is categorized as information interactivity. In Theme 2, users comment with words such as
"nice", "good" and "cute", expressing their own feelings of praise for the content recommended by the
opinion leaders, so Theme 2 is categorized as information expression. In Theme 3, users comment with
words such as "link", "web page", "video" and "image", reflecting the diversity of presentation forms
that opinion leaders use when recommending content, as well as users' expectation of and demand for
diverse presentation forms. Therefore, Theme 3 is categorized as information presentation format. In
Theme 4, words such as "raffle" and "prize" reflect interaction between opinion leaders and their
followers through reward activities such as raffles, and words such as "envy" and "rule" likewise
reflect interaction between travel microblog key opinion leaders and their followers. Therefore,
Theme 4 is also categorized as information interactivity.

Table 1 Theme keywords of user comments on the information recommended by travel microblog key
opinion leaders
      Topics   High-frequency Keywords                        User comment theme qualities
        1      Comment, live, retweet, share, etc.            Information interactivity
        2      Good-looking, nice, like, happy, etc.          Emotional expressions, information expressions
        3      Links, web pages, accompanying images, etc.    Information presentation format
        4      Sweepstakes, prizes, details, links, etc.      Incentive mechanism, interactivity

3.2.2 Identifying the qualities of travel microblog key opinion leaders' recommended messages

    The perceived characteristics of the recommended information, as reflected in the microblog user
comments above, were combined with the like and comment data of the 20 travel microblog key opinion
leaders and the topics to which their highly interactive content belongs. The two sources were
compiled and compared to arrive at the following characteristics of the information recommended by
travel microblog key opinion leaders, whose influence on travelers' destination decisions is then
studied.
    (1) Quantitative degree of the recommended information: the overall level of the numbers of
retweets, likes, comments, etc. of the recommended content.
    (2) Information quality: the accuracy, completeness and interest of the description of the tourist
destination, tourist products or services, etc.
    (3) Information timeliness: the frequency of the recommended information, and whether it is
combined with current hot topics and leads the latest developments in the field of tourism, etc.
    (4) Information interactivity: when recommending information, travel microblog key opinion leaders
use questions, @ mentions, topic tags, super talk and other means to communicate and interact with
potential tourists; it also covers the comment interactions among microblog users after an opinion
leader publishes a post.
    (5) Form of information presentation: the formats used by travel microblog key opinion leaders to
disseminate information, such as plain text, long text, combined pictures and text, video and live
broadcast.
    (6) Information expression: objective description of tourism product or service information, or
release and recommendation of the post-purchase experience of tourism products or services with the
opinion leader's own attitude and a certain emotional color.

4 Model Construction

    From the recommended information features above, the following tourism decision model can be
constructed, as shown in Figure 5.
    Based on the model, questionnaires and hypothesis testing lead to the conclusion that the
information recommended by travel microblog key opinion leaders serves as a reference in travelers'
decision-making behavior, and that among the six characteristics of this information, the quantitative
degree of the recommended information is the factor with the greatest impact on tourism
decision-making.


Figure 5 Theoretical model of the influence of microblog opinion leaders' recommendations on
travellers' destination decisions

5 Conclusion

   In this paper, the LDA model is used to mine the information recommended by travel microblog
opinion leaders, and six features of the information are summarized: the quantitative degree of
information, information quality, information timeliness, information interactivity, form of
information presentation and information expression. Applying these features in the model of
travellers' decision-making shows that the degree of quantification of the recommended information is
the factor with the greatest impact on travel decision-making.

6 Related works

   Hofmann[7] proposed the Probabilistic Latent Semantic Indexing (PLSI) model, which uses a
probabilistic generative model for topic analysis and extraction of text. Blei et al.[1] improved the
PLSI model by proposing the LDA (Latent Dirichlet Allocation) model, which is currently among the most
widely used models in topic modeling research. Xu Ge and Wang Houfeng[8] analyzed the important role
of probabilistic latent semantic indexing and LDA in the development of topic models, and classified
and introduced the various models derived from LDA. Shi Jing et al. provide important discussions of
LDA-based text segmentation[9] and topic extraction[10].


    In the field of tourism, LDA models are widely used in research. Chao Huang et al.[11] used LDA to
build a refined seasonal theme model for analysing the themes corresponding to each attraction in
different seasonal contexts. Zhou Wenliang[12] used LDA to mine textual themes and obtain relatively
objective tourism destination evaluation indicators, thereby reconstructing the tourism destination
evaluation system and evaluating tourist attractions in Jiangxi Province.

7 Acknowledgement

   This research was financially supported by a scientific research project of Beijing International
Studies University (LYFZ18B003).

8 References

[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation [J]. Journal of Machine Learning
     Research, 2003, (1): 993-1022.
[2] Xing Feng, Liu Xingxu. Practice and Application of Machine Learning in Data Analysis [J].
     Telecommunication Engineering Technology & Standardization, 2021, 34(12): 82-84 + 88.
[3] Li Tingyi. Study on the factors influencing the travel intention of Guangzhou residents in the
     context of the normalization of epidemic [D]. Guangxi Normal University.
[4] Zhou Wen-Liang. Research on AHP tourism destination evaluation based on LDA improvement
     [D]. Jiangxi University of Finance and Economics.
[5] Guan Peng, Wang Yuefen. Research on the Optimal Topic Number of LDA Topic Model in
     Scientific and Technological Information Analysis [J]. Modern Library and Information
     Technology, 2016(9): 42-50.
[6] Zhao Zixuan. Emotional Evaluation and Influencing Factors Analysis of Museum Tourism [D].
     East China Normal University, 2022.
[7] Hofmann T. Probabilistic latent semantic indexing [J]. International ACM SIGIR Conference
     on Research and Development in Information Retrieval, 1999, 51(2): 50-57.
[8] Xu Ge, Wang Houfeng. Development of Topic Models in Natural Language Processing [J].
     Journal of Computers, 2011, 34(08): 1423-1436.
[9] Shi Jing, Hu Ming, Shi Xin, et al. Text Segmentation Based on LDA Model [J]. Journal of
     Computers, 2008(10): 1865-1873.
[10] Shi Jing, Fan Meng, Li Wanlong. Thematic analysis based on LDA model [J]. Acta Automatica
     Sinica, 2009, 35(12): 1586-1592.
[11] Chao Huang, Qing Wang, Donghui Yang, et al. Topic mining of tourist attractions based on a
     seasonal context aware LDA model [J]. Intelligent Data Analysis, 2018, 22: 383-405.
[12] Wenliang Zhou. Research on AHP tourism destination evaluation based on LDA improvement
     [D]. Jiangxi University of Finance and Economics, 2021. doi:10.27175/d.cnki.gjxcu.2021.000448.

