=Paper=
{{Paper
|id=Vol-1318/paper1
|storemode=property
|title=Identification of Opinion Leaders Using Text Mining Technique in Virtual Community
|pdfUrl=https://ceur-ws.org/Vol-1318/paper1.pdf
|volume=Vol-1318
|dblpUrl=https://dblp.org/rec/conf/simbig/HungY14
}}
==Identification of Opinion Leaders Using Text Mining Technique in Virtual Community==
Identification of Opinion Leaders Using Text Mining Technique in
Virtual Community
Chihli Hung, Department of Information Management, Chung Yuan Christian University, Taiwan 32023, R.O.C., chihli@cycu.edu.tw
Pei-Wen Yeh, Department of Information Management, Chung Yuan Christian University, Taiwan 32023, R.O.C., mogufly@gmail.com
Abstract

Word of mouth (WOM) affects the buying behavior of information receivers more strongly than advertisements. Opinion leaders further affect others in a specific domain through their new information, ideas and opinions. Identification of opinion leaders has become one of the most important tasks in the field of WOM mining. Existing work to find opinion leaders is based mainly on quantitative approaches, such as social network analysis and involvement. Opinion leaders often post knowledgeable and useful documents. Thus, the contents of WOM are useful for mining opinion leaders as well. This research proposes a text mining-based approach to evaluate the features of expertise, novelty and richness of information from the contents of posts for the identification of opinion leaders. According to experiments on a real-world bulletin board data set, the proposed approach demonstrates high potential in identifying opinion leaders.

1 Introduction

This research identifies opinion leaders using text mining, since opinion leaders affect other members via word of mouth (WOM) on social networks. WOM, as defined by Arndt (1967), is an oral person-to-person means of communication between an information receiver and a sender, who exchange experiences of a brand, a product or a service for a non-commercial purpose. The Internet provides human beings with a new way of communication. Thus, WOM influences consumers more quickly, broadly, widely and significantly, and consumers are further influenced by other consumers without any geographic limitation (Flynn et al., 1996).

Nowadays, making buying decisions based on WOM has become one of the collective decision-making strategies. It is natural that all kinds of human groups have opinion leaders, explicitly or implicitly (Zhou et al., 2009). Opinion leaders usually have a stronger influence on other members through their new information, ideas and representative opinions (Song et al., 2007). Thus, how to identify opinion leaders has increasingly attracted the attention of both practitioners and researchers.

As opinion leadership concerns relationships between members in a society, many existing opinion leader identification tasks define opinion leaders by analyzing the entire opinion network in a specific domain, based on the technique of social network analysis (SNA) (Kim, 2007; Kim and Han, 2009). This technique depends on the relationships between initial publishers and followers. A member with the greatest value of network centrality is considered an opinion leader in this network (Kim, 2007).

However, a junk post does not present useful information. A WOM document with new ideas is more interesting. A spam link usually wastes readers' time. A long post is generally more useful than a short one (Agarwal et al., 2008). A focused document is more significant than a vague one. That is, different documents may influence readers differently owing to the quality of their WOM. WOM documents per se can also be a major indicator for recognizing opinion leaders. However, such quantitative approaches, i.e. number-based or
SNA-based methods, ignore the quality of WOM and only include quantitative contributions of WOM. Expertise, novelty, and richness of information are three important features of opinion leaders, which are obtained from WOM documents (Kim and Han, 2009). Thus, this research proposes a text mining-based approach to identify opinion leaders in a real-world bulletin board system.

The rest of this paper is organized as follows. Section 2 gives an overview of the features of opinion leaders. Section 3 describes the proposed text mining approach to identify opinion leaders. Section 4 describes the data set, experiment design and results. Finally, a conclusion and further research work are given in Section 5.

2 Features of Opinion Leaders

The term “opinion leader”, proposed by Katz and Lazarsfeld (1957), comes from the concept of communication. Based on their research, the influence of an advertising campaign for a political election is weaker than that of opinion leaders. This is similar to findings in product and service markets. Although advertising may increase recognition of products or services, word of mouth disseminated via personal relations in social networks has a greater influence on consumer decisions (Arndt, 1967; Khammash and Griffiths, 2011). Thus, it is important to identify the characteristics of opinion leaders.

According to the work of Myers and Robertson (1972), opinion leaders may have the following seven characteristics. Firstly, opinion leadership in a specific topic is positively related to the quantity of output of the leader who talks about, knows about and is interested in the same topic. Secondly, people who influence others are themselves influenced by others in the same topic. Thirdly, opinion leaders usually have more innovative ideas in the topic. Fourthly and fifthly, opinion leadership is positively related to overall leadership and an individual’s social leadership. Sixthly, opinion leaders usually know more about demographic variables in the topic. Finally, opinion leaders are domain dependent. Thus, an opinion leader influences others in a specific topic in a social network. He or she knows more about this topic and publishes more new information.

Opinion leaders usually play a central role in a social network. The characteristics of typical network hubs usually cover six aspects: they are ahead in adoption, connected, travelers, information-hungry, vocal, and exposed to media more than others (Rosen, 2002). Ahead in adoption means that network hubs may not be the first to adopt new products but they are usually ahead of the rest in the network. Connected means that network hubs play an influential role in a network, such as an information broker among various different groups. Traveler means that network hubs usually love to travel in order to obtain new ideas from other groups. Information-hungry means that network hubs are expected to provide answers to others in their group, so they pursue lots of facts. Vocal means that network hubs love to share their opinions with others and get responses from their audience. Exposed to media means that network hubs open themselves to more communication from mass media, and especially to print media. Thus, a network hub or an opinion leader is not only an influential node but also an early adopter, generator or spreader of novelty. An opinion leader has rich expertise in a specific topic and loves to be involved in group activities.

As members in a social network influence each other, the degree centrality of members and their involvement in activities are useful for identifying opinion leaders (Kim and Han, 2009). Inspired by the PageRank technique, which is based on the link structure (Page et al., 1998), OpinionRank is proposed by Zhou et al. (2009) to rank members in a network. Jiang et al. (2013) proposed an extended version of PageRank based on sentiment analysis and MapReduce. Agarwal et al. (2008) identified influential bloggers through four aspects, which are recognition, activity generation, novelty and eloquence. An influential blog is recognized by others when this blog has a lot of in-links. The feature of activity generation is measured by how many comments a post receives and the number of posts it initiates. Novelty means novel ideas, which may attract many in-links from the blogs of others. Finally, the feature of eloquence is evaluated by the length of a post. A lengthy post is treated as an influential post.

Li and Du (2011) determined the expertise of authors and readers according to the similarity between their posts and a pre-built term ontology. However, both features of information novelty and influential position are dependent on linkage relationships between blogs. We propose a novel
text mining-based approach and compare it with several quantitative approaches.

3 Quality Approach-Text Mining

Contents of word of mouth contain lots of useful information, which is closely related to important features of opinion leaders. Opinion leaders usually provide knowledgeable and novel information in their posts (Rosen, 2002; Song et al., 2007). An influential post is often eloquent (Keller and Berry, 2003). Thus, expertise, novelty, and richness of information are important characteristics of opinion leaders.

3.1 Preprocessing

This research uses a traditional Chinese text mining process, including Chinese word segmenting, part-of-speech filtering and removal of stop words for the data set of documents. As a single Chinese character is very ambiguous, segmenting Chinese documents into proper Chinese words is necessary (He and Chen, 2008). This research uses the CKIP service (http://ckipsvr.iis.sinica.edu.tw/) to segment Chinese documents into proper Chinese words and their suitable part-of-speech tags. Based on these processes, 85 words are organized into controlled vocabularies, as this approach captures the main concepts of documents efficiently (Gray et al., 2009).

3.2 Expertise

The expertise of members can be evaluated by comparing their posts with the controlled vocabulary base (Li and Du, 2011). For member i, words are collected from his or her posted documents and the member vector is represented as $f_i = (w_1, w_2, \dots, w_j, \dots, w_N)$, where $w_j$ denotes the frequency of word j used in the posted documents of user i. N denotes the number of words in the controlled vocabulary. We then normalize the member vector by his or her maximum frequency of any significant word. The degree of expertise can be calculated by the Euclidean norm as shown in (1).

$$\mathrm{exp}_i = \left\lVert \frac{f_i}{m_i} \right\rVert, \qquad (1)$$

where $\lVert \cdot \rVert$ is the Euclidean norm and $m_i$ denotes the maximum frequency of any significant word used by member i.

3.3 Novelty

We utilize the Google Trends service (http://www.google.com/trends) to obtain the first-search time tag for significant words in documents. Thus, each significant word has its specific time tag taken from the Google search repository. For example, the first-search time tag for the search term Nokia N81 is 2007 and for Nokia Windows Phone 8 is 2011. We define three degrees of novelty evaluated by the interval between the first-search year of a significant word and the collection year of our targeted document set, i.e. 2010. A significant word belongs to normal novelty if the interval is equal to two years. A significant word with an interval of less than two years belongs to high novelty and one with an interval greater than two years belongs to low novelty. We then summarize all novelty values based on the significant words used by a member in a social network. The equation of novelty for a member is shown in (2).

$$\mathrm{nov}_i = \frac{e_h + 0.66\, e_m + 0.33\, e_l}{e_h + e_m + e_l}, \qquad (2)$$

where $e_h$, $e_m$ and $e_l$ are the numbers of words that belong to the groups of high, normal and low novelty, respectively.

3.4 Richness of Information

In general, a long document suggests some useful information to the users (Agarwal et al., 2008). Thus, the richness of information of posts can be used for the identification of opinion leaders. We use both textual information and multimedia information to represent the richness of information as (3).

$$\mathrm{ric} = d + g, \qquad (3)$$

where d is the total number of significant words that the user uses in his or her posts and g is the total number of multimedia objects that the user posts.

3.5 Integrated Text Mining Model

Finally, we integrate expertise, novelty and richness of information from the content of posted documents. As each feature has its own
distribution and range, we normalize each feature to a value between 0 and 1. Thus, the weight of an opinion leader based on the quality of posts becomes the average of these three features as (4).

$$\mathrm{ITM} = \frac{\mathrm{Norm}(\mathrm{nov}) + \mathrm{Norm}(\mathrm{exp}) + \mathrm{Norm}(\mathrm{ric})}{3}. \qquad (4)$$

4 Experiments

4.1 Data Set

Owing to the lack of an available benchmark data set, we crawl WOM documents from the Mobile01 bulletin board system (http://www.mobile01.com/), which is one of the most popular online discussion forums in Taiwan. This bulletin board system allows its members to contribute their opinions free of charge and its contents are available to the public. A bulletin board system generally has an organized structure of topics. This organized structure provides people who are interested in the same or similar topics with an online discussion forum that forms a social network. Finding opinion leaders on bulletin boards is important since bulletin boards contain a large amount of readily available, focused WOM. In our initial experiments, we collected 1537 documents, which were initiated by 1064 members and attracted 9192 followers, who posted 19611 opinions on those initial posts. In this data set, the total number of participants is 9460.

4.2 Comparison

As we use real-world data, which has no ground truth about opinion leaders, a user-centered evaluation approach should be used to compare the models (Kritikopoulos et al., 2006). In our research, there are 9460 members in this virtual community. We suppose that ten of them have a high possibility of being opinion leaders.

As the identification of opinion leaders is treated as one of the important tasks of social network analysis (SNA), we compare the proposed model (i.e. ITM) with three well-known SNA approaches, which are degree centrality (DEG), closeness centrality (CLO) and betweenness centrality (BET). Involvement (INV) is an important characteristic of opinion leaders (Kim and Han, 2009). The number of documents that a member initiates plus the number of derivative documents by other members is treated as involvement.

Thus, we have one qualitative model, i.e. ITM, and four quantitative models, i.e. DEG, CLO, BET and INV. We put the top ten rankings from each model in a pool of potential opinion leaders. Duplicate members are removed and 25 members are left. We recruit 20 human testers who have used and are familiar with Mobile01.

In our questionnaire, quantitative information is provided, such as the number of documents that the potential opinion leaders initiate and the number of derivative documents that are posted by other members. For the qualitative information, a maximum of three documents from each member are provided randomly to the testers. The top 10 rankings are also considered as opinion leaders based on human judgment.

4.3 Results

We suppose that ten of the 9460 members are opinion leaders. We collect the top 10 ranked members from each model and remove duplicates. We ask 20 human testers to identify 10 opinion leaders from the 25 potential opinion leaders obtained from the five models. According to the experimental results in Table 1, the proposed model outperforms the others. This shows the significance of the documents per se. Although INV is a very simple approach, it performs much better than the social network analysis models, i.e. DEG, CLO and BET. One possible reason is the sparse network structure: the bulletin board system contains many subtopics, so these topics form several isolated subnetworks.

Model   Recall   Precision   F-measure   Accuracy
DEG     0.45     0.50        0.48        0.56
CLO     0.36     0.40        0.38        0.48
BET     0.64     0.70        0.67        0.72
INV     0.73     0.80        0.76        0.80
ITM     0.82     0.90        0.86        0.88

Table 1: Results of models evaluated by recall, precision, F-measure and accuracy
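The four measures reported in Table 1 can be computed from set overlaps between a model's top-ranked members and the members chosen by the human testers. The sketch below is illustrative only: the function name `evaluate` and the member identifiers are hypothetical. The example counts (9 correct picks out of 10, against 11 human-selected leaders in a pool of 25) are an inference that happens to reproduce the ITM row of Table 1; the paper does not state these counts.

```python
from typing import Dict, Set


def evaluate(predicted: Set[str], relevant: Set[str], pool: Set[str]) -> Dict[str, float]:
    """Score one model's picks against human-judged leaders within a candidate pool."""
    tp = len(predicted & relevant)          # picked by the model and by the testers
    fp = len(predicted - relevant)          # picked by the model only
    fn = len(relevant - predicted)          # picked by the testers only
    tn = len(pool - predicted - relevant)   # picked by neither
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / len(pool) if pool else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical pool of 25 candidates, m00..m24 (made-up identifiers).
pool = {f"m{i:02d}" for i in range(25)}
relevant = {f"m{i:02d}" for i in range(11)}            # assumed human-judged leaders
predicted = {f"m{i:02d}" for i in range(9)} | {"m20"}  # 10 model picks, 9 of them correct
scores = evaluate(predicted, relevant, pool)
print({name: round(value, 2) for name, value in scores.items()})
```

Accuracy is computed over the pooled candidates rather than all 9460 members, since only the pooled members were shown to the testers.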
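The feature definitions of Section 3 (Equations 1 to 4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Norm is assumed to be min-max scaling across members (the paper only says each feature is scaled to a value between 0 and 1), and Equation (2) is read as a weighted average with weights 1, 0.66 and 0.33 for high, normal and low novelty. The three members and their data are made up.

```python
import math
from typing import List, Sequence


def expertise(freqs: Sequence[float]) -> float:
    """Equation (1): Euclidean norm of the word-frequency vector after
    dividing it by the member's maximum word frequency."""
    m = max(freqs, default=0)
    if m == 0:
        return 0.0  # the member used no controlled-vocabulary word
    return math.sqrt(sum((w / m) ** 2 for w in freqs))


def novelty(first_search_years: Sequence[int], collected_year: int = 2010) -> float:
    """Equation (2), read as a weighted share of high/normal/low novelty words."""
    e_h = e_m = e_l = 0
    for year in first_search_years:
        interval = collected_year - year
        if interval < 2:
            e_h += 1  # high novelty
        elif interval == 2:
            e_m += 1  # normal novelty
        else:
            e_l += 1  # low novelty
    total = e_h + e_m + e_l
    if total == 0:
        return 0.0
    return (e_h + 0.66 * e_m + 0.33 * e_l) / total


def richness(significant_words: int, multimedia_objects: int) -> int:
    """Equation (3): ric = d + g."""
    return significant_words + multimedia_objects


def min_max(values: Sequence[float]) -> List[float]:
    """Assumed form of Norm(.): rescale values to [0, 1] across members."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def itm(exp: Sequence[float], nov: Sequence[float], ric: Sequence[float]) -> List[float]:
    """Equation (4): mean of the three normalized features for each member."""
    return [sum(t) / 3 for t in zip(min_max(exp), min_max(nov), min_max(ric))]


# Three hypothetical members (made-up data, not taken from the Mobile01 set).
exp_scores = [expertise([3, 0, 4]), expertise([1, 1, 0]), expertise([0, 2, 0])]
nov_scores = [novelty([2009, 2008, 2005]), novelty([2010, 2009]), novelty([2004])]
ric_scores = [richness(40, 2), richness(12, 0), richness(5, 1)]
scores = itm(exp_scores, nov_scores, ric_scores)
# Member indices ordered from highest to lowest ITM score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

Min-max scaling makes the lowest-scoring member on each feature contribute 0 and the highest contribute 1, so the ITM value depends on which members are compared together; a different normalization would change the ranking.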
5 Conclusions and Further Work

Word of mouth (WOM) has a powerful effect on consumer behavior. Opinion leaders have a stronger influence on other members in an opinion society. How to find opinion leaders has been of interest to both practitioners and researchers. Existing models mainly focus on quantitative features of opinion leaders, such as the number of posts and the central position in the social network. This research considers this issue from the viewpoint of text mining. We propose an integrated text mining model by extracting three important features of opinion leaders, namely novelty, expertise and richness of information, from documents. Finally, we compare this proposed text mining model with four quantitative approaches, i.e., involvement, degree centrality, closeness centrality and betweenness centrality, evaluated by human judgment. In our experiments, we found that the involvement approach is the best one among the quantitative approaches. The text mining approach outperforms its quantitative counterparts as the richness of document information provides a similar function to the qualitative features of opinion leaders. The proposed text mining approach further measures opinion leaders based on the features of novelty and expertise.

In terms of possible future work, strategies that integrate qualitative and quantitative approaches could take advantage of both. For example, a 2-step integrated strategy, which uses the text mining-based approach in the first step and the quantitative approach based on involvement in the second step, may achieve better performance. Larger-scale experiments, in terms of topics, the number of documents and testing, should be carried out in order to produce more general results.

References

Agarwal, N., Liu, H., Tang, L. and Yu, P. S. 2008. Identifying the Influential Bloggers in a Community. Proceedings of WSDM, 207-217.
Arndt, J. 1967. Role of Product-Related Conversations in the Diffusion of a New Product. Journal of Marketing Research, 4(3):291-295.
Flynn, L. R., Goldsmith, R. E. and Eastman, J. K. 1996. Opinion Leaders and Opinion Seekers: Two New Measurement Scales. Academy of Marketing
He, J. and Chen, L. 2008. Chinese Word Segmentation Based on the Improved Particle Swarm Optimization Neural Networks. Proceedings of IEEE Cybernetics and Intelligent Systems, 695-699.
Jiang, L., Ge, B., Xiao, W. and Gao, M. 2013. BBS Opinion Leader Mining Based on an Improved PageRank Algorithm Using MapReduce. Proceedings of Chinese Automation Congress, 392-396.
Katz, E. and Lazarsfeld, P. F. 1957. Personal Influence, New York: The Free Press.
Keller, E. and Berry, J. 2003. One American in Ten Tells the Other Nine How to Vote, Where to Eat, and What to Buy. They Are The Influentials. The Free Press.
Khammash, M. and Griffiths, G. H. 2011. Arrivederci CIAO.com Buongiorno Bing.com- Electronic Word-of-Mouth (eWOM), Antecedences and Consequences. International Journal of Information Management, 31:82-87.
Kim, D. K. 2007. Identifying Opinion Leaders by Using Social Network Analysis: A Synthesis of Opinion Leadership Data Collection Methods and Instruments. PhD Thesis, the Scripps College of Communication, Ohio University.
Kim, S. and Han, S. 2009. An Analytical Way to Find Influencers on Social Networks and Validate their Effects in Disseminating Social Games. Proceedings of Advances in Social Network Analysis and Mining, 41-46.
Kritikopoulos, A., Sideri, M. and Varlamis, I. 2006. BlogRank: Ranking Weblogs Based on Connectivity and Similarity Features. Proceedings of the 2nd International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications, Article 8.
Li, F. and Du, T. C. 2011. Who Is Talking? An Ontology-Based Opinion Leader Identification Framework for Word-of-Mouth Marketing in Online Social Blogs. Decision Support Systems, 51, 2011:190-197.
Myers, J. H. and Robertson, T. S. 1972. Dimensions of Opinion Leadership. Journal of Marketing Research, 4:41-46.
Page, L., Brin, S., Motwani, R. and Winograd, T. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University.
Rosen, E. 2002. The Anatomy of Buzz: How to Create Word of Mouth Marketing, 1st ed., Doubleday.
Song, X., Chi, Y., Hino, K. and Tseng, B. L. 2007. Identifying Opinion Leaders in the Blogosphere. Proceedings of CIKM’07, 971-974.
Zhou, H., Zeng, D. and Zhang, C. 2009. Finding Leaders from Opinion Networks. Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics, 266-268.