Identification of Opinion Leaders Using Text Mining Technique in Virtual Community

Chihli Hung
Department of Information Management
Chung Yuan Christian University
Taiwan 32023, R.O.C.
chihli@cycu.edu.tw

Pei-Wen Yeh
Department of Information Management
Chung Yuan Christian University
Taiwan 32023, R.O.C.
mogufly@gmail.com

Abstract

Word of mouth (WOM) affects the buying behavior of information receivers more strongly than advertisements do. Opinion leaders further affect others in a specific domain through their new information, ideas and opinions. Identification of opinion leaders has become one of the most important tasks in the field of WOM mining. Existing work to find opinion leaders is based mainly on quantitative approaches, such as social network analysis and involvement. Opinion leaders often post knowledgeable and useful documents. Thus, the contents of WOM are useful for mining opinion leaders as well. This research proposes a text mining-based approach to evaluate the features of expertise, novelty and richness of information from the contents of posts for the identification of opinion leaders. According to experiments on a real-world bulletin board data set, the proposed approach demonstrates high potential in identifying opinion leaders.

1 Introduction

This research identifies opinion leaders using the technique of text mining, since opinion leaders affect other members via word of mouth (WOM) on social networks. WOM, as defined by Arndt (1967), is an oral person-to-person means of communication between an information receiver and a sender, who exchange experiences of a brand, a product or a service for a non-commercial purpose. The Internet provides human beings with a new way of communication. Thus, WOM influences consumers more quickly, broadly, widely and significantly, and consumers are further influenced by other consumers without any geographic limitation (Flynn et al., 1996).

Nowadays, making buying decisions based on WOM has become one of the collective decision-making strategies. It is natural that all kinds of human groups have opinion leaders, explicitly or implicitly (Zhou et al., 2009). Opinion leaders usually have a stronger influence on other members through their new information, ideas and representative opinions (Song et al., 2007). Thus, how to identify opinion leaders has increasingly attracted the attention of both practitioners and researchers.

As opinion leadership concerns relationships between members in a society, many existing opinion leader identification tasks define opinion leaders by analyzing the entire opinion network in a specific domain, based on the technique of social network analysis (SNA) (Kim, 2007; Kim and Han, 2009). This technique depends on the relationship between initial publishers and followers. A member with the greatest value of network centrality is considered to be an opinion leader in this network (Kim, 2007).

However, a junk post does not present useful information. A WOM document with new ideas is more interesting. A spam link usually wastes readers' time. A long post is generally more useful than a short one (Agarwal et al., 2008). A focused document is more significant than a vague one. That is, different documents may have different influences on readers owing to their quality of WOM. WOM documents per se can also be a major indicator for recognizing opinion leaders. However, quantitative approaches, i.e. number-based or SNA-based methods, ignore the quality of WOM and only include quantitative contributions of WOM.

Expertise, novelty, and richness of information are three important features of opinion leaders, which are obtained from WOM documents (Kim and Han, 2009). Thus, this research proposes a text mining-based approach in order to identify opinion leaders in a real-world bulletin board system.

Besides this section, this paper is organized as follows. Section 2 gives an overview of features of opinion leaders. Section 3 describes the proposed text mining approach to identify opinion leaders. Section 4 describes the data set, experiment design and results. Finally, a conclusion and further research work are given in Section 5.

2 Features of Opinion Leaders

The term "opinion leader", proposed by Katz and Lazarsfeld (1957), comes from the concept of communication. Based on their research, the influence of an advertising campaign in a political election is smaller than that of opinion leaders. This is similar to findings in product and service markets. Although advertising may increase recognition of products or services, word of mouth disseminated via personal relations in social networks has a greater influence on consumer decisions (Arndt, 1967; Khammash and Griffiths, 2011). Thus, it is important to identify the characteristics of opinion leaders.

According to the work of Myers and Robertson (1972), opinion leaders may have the following seven characteristics. Firstly, opinion leadership in a specific topic is positively related to the quantity of output of the leader who talks, knows and is interested in the same topic. Secondly, people who influence others are themselves influenced by others in the same topic. Thirdly, opinion leaders usually have more innovative ideas in the topic. Fourthly and fifthly, opinion leadership is positively related to overall leadership and an individual's social leadership. Sixthly, opinion leaders usually know more about demographic variables in the topic. Finally, opinion leaders are domain dependent. Thus, an opinion leader influences others in a specific topic in a social network. He or she knows more about this topic and publishes more new information.

Opinion leaders usually play a central role in a social network. The characteristics of typical network hubs usually contain six aspects: they are ahead in adoption, connected, travelers, information-hungry, vocal, and exposed to media more than others (Rosen, 2002). Ahead in adoption means that network hubs may not be the first to adopt new products but they are usually ahead of the rest in the network. Connected means that network hubs play an influential role in a network, such as an information broker among various different groups. Traveler means that network hubs usually love to travel in order to obtain new ideas from other groups. Information-hungry means that network hubs are expected to provide answers to others in their group, so they pursue lots of facts. Vocal means that network hubs love to share their opinions with others and get responses from their audience. Exposed to media means that network hubs open themselves to more communication from mass media, and especially to print media. Thus, a network hub or an opinion leader is not only an influential node but also an early adopter, generator or spreader of novelty. An opinion leader has rich expertise in a specific topic and loves to be involved in group activities.

As members in a social network influence each other, degree centrality of members and involvement in activities are useful for identifying opinion leaders (Kim and Han, 2009). Inspired by the PageRank technique, which is based on the link structure (Page et al., 1998), OpinionRank was proposed by Zhou et al. (2009) to rank members in a network. Jiang et al. (2013) proposed an extended version of PageRank based on sentiment analysis and MapReduce. Agarwal et al. (2008) identified influential bloggers through four aspects, which are recognition, activity generation, novelty and eloquence. An influential blog is recognized by others when this blog has a lot of in-links. The feature of activity generation is measured by how many comments a post receives and the number of posts it initiates. Novelty means novel ideas, which may attract many in-links from the blogs of others. Finally, the feature of eloquence is evaluated by the length of a post. A lengthy post is treated as an influential post.

Li and Du (2011) determined the expertise of authors and readers according to the similarity between their posts and a pre-built term ontology. However, both features of information novelty and influential position are dependent on linkage relationships between blogs. We propose a novel text mining-based approach and compare it with several quantitative approaches.

3 Quality Approach: Text Mining

The contents of word of mouth contain a great deal of useful information, which is closely related to important features of opinion leaders. Opinion leaders usually provide knowledgeable and novel information in their posts (Rosen, 2002; Song et al., 2007). An influential post is often eloquent (Keller and Berry, 2003). Thus, expertise, novelty, and richness of information are important characteristics of opinion leaders.

3.1 Preprocessing

This research uses a traditional Chinese text mining process, including Chinese word segmenting, part-of-speech filtering and removal of stop words for the data set of documents. As a single Chinese character is very ambiguous, segmenting Chinese documents into proper Chinese words is necessary (He and Chen, 2008). This research uses the CKIP service (http://ckipsvr.iis.sinica.edu.tw/) to segment Chinese documents into proper Chinese words and their suitable part-of-speech tags. Based on these processes, 85 words are organized into controlled vocabularies, as this approach is efficient in capturing the main concepts of a document (Gray et al., 2009).

3.2 Expertise

Expertise can be evaluated by comparing members' posts with the controlled vocabulary base (Li and Du, 2011). For member i, words are collected from his or her posted documents and the member vector is represented as $f_i = (w_1, w_2, \ldots, w_j, \ldots, w_N)$, where $w_j$ denotes the frequency of word j used in the posted documents of member i and N denotes the number of words in the controlled vocabulary. We then normalize the member vector by his or her maximum frequency of any significant word, $m_i$. The degree of expertise can be calculated by the Euclidean norm as shown in (1).

$$\mathrm{exp}_i = \left\lVert \frac{f_i}{m_i} \right\rVert, \tag{1}$$

where $\lVert \cdot \rVert$ is the Euclidean norm.

3.3 Novelty

We utilize the Google Trends service (http://www.google.com/trends) to obtain the first-search time tag for significant words in documents. Thus, each significant word has its specific time tag taken from the Google search repository. For example, the first-search time tag for the search term Nokia N81 is 2007 and for Nokia Windows Phone 8 is 2011. We define three degrees of novelty evaluated by the interval between the first-search year of significant words and the collection year of our targeted document set, i.e. 2010. A significant word belongs to normal novelty if the interval is equal to two years. A significant word with an interval of less than two years belongs to high novelty and one with an interval greater than two years belongs to low novelty. We then summarize all novelty values based on the significant words used by a member in a social network. The equation of novelty for a member is shown in (2).

$$\mathrm{nov}_i = \frac{e_h + 0.66 \times e_m + 0.33 \times e_l}{e_h + e_m + e_l}, \tag{2}$$

where $e_h$, $e_m$ and $e_l$ are the numbers of words that belong to the groups of high, normal and low novelty, respectively.

3.4 Richness of Information

In general, a long document suggests some useful information to the users (Agarwal et al., 2008). Thus, the richness of information of posts can be used for the identification of opinion leaders. We use both textual information and multimedia information to represent the richness of information as (3).

$$\mathrm{ric} = d + g, \tag{3}$$

where d is the total number of significant words that the user uses in his or her posts and g is the total number of multimedia objects that the user posts.

3.5 Integrated Text Mining Model

Finally, we integrate expertise, novelty and richness of information from the content of posted documents. As each feature has its own distribution and range, we normalize each feature to a value between 0 and 1. Thus, the weights of opinion leaders based on the quality of posts become the average of these three features as (4).

$$\mathrm{ITM} = \frac{\mathrm{Norm}(\mathrm{nov}) + \mathrm{Norm}(\mathrm{exp}) + \mathrm{Norm}(\mathrm{ric})}{3}. \tag{4}$$

4 Experiments

4.1 Data Set

Owing to the lack of an available benchmark data set, we crawl WOM documents from the Mobile01 bulletin board system (http://www.mobile01.com/), which is one of the most popular online discussion forums in Taiwan. This bulletin board system allows its members to contribute their opinions free of charge and its contents are available to the public. A bulletin board system generally has an organized structure of topics. This organized structure provides people who are interested in the same or similar topics with an online discussion forum that forms a social network. Finding opinion leaders on bulletin boards is important since they contain a large amount of focused WOM. In our initial experiments, we collected 1537 documents, which were initiated by 1064 members and attracted 9192 followers, who posted 19611 opinions on those initial posts. In this data set, the total number of participants is 9460.

4.2 Comparison

As we use real-world data, which has no ground truth about opinion leaders, a user-centered evaluation approach should be used to compare the difference between models (Kritikopoulos et al., 2006). In our research, there are 9460 members in this virtual community. We suppose that ten of them have a high possibility of being opinion leaders.

As identification of opinion leaders is treated as one of the important tasks of social network analysis (SNA), we compare the proposed model (i.e. ITM) with three well-known SNA approaches, which are degree centrality (DEG), closeness centrality (CLO) and betweenness centrality (BET). Involvement (INV) is an important characteristic of opinion leaders (Kim and Han, 2009). The number of documents that a member initiates plus the number of derivative documents by other members is treated as involvement.

Thus, we have one qualitative model, i.e. ITM, and four quantitative models, i.e. DEG, CLO, BET and INV. We put the top ten rankings from each model in a pool of potential opinion leaders. Duplicate members are removed and 25 members are left. We recruit 20 human testers who have used and are familiar with Mobile01. In our questionnaire, quantitative information is provided, such as the number of documents that the potential opinion leaders initiate and the number of derivative documents that are posted by other members. For the qualitative information, a maximum of three documents from each member are provided randomly to the testers. The top 10 rankings are also considered as opinion leaders based on human judgment.

4.3 Results

We suppose that ten of the 9460 members are considered to be opinion leaders. We collect the top 10 ranking members from each model and remove duplicates. We request 20 human testers to identify 10 opinion leaders from the 25 potential opinion leaders obtained from the five models. According to the experiment results in Table 1, the proposed model outperforms the others. This shows the significance of documents per se. Although INV is a very simple approach, it performs much better than the social network analysis models, i.e. DEG, CLO and BET. One possible reason is the sparse network structure. The bulletin board system contains many sub-topics, so these topics form several isolated sub-networks.

Model   Recall   Precision   F-measure   Accuracy
DEG     0.45     0.50        0.48        0.56
CLO     0.36     0.40        0.38        0.48
BET     0.64     0.70        0.67        0.72
INV     0.73     0.80        0.76        0.80
ITM     0.82     0.90        0.86        0.88

Table 1: Results of models evaluated by recall, precision, F-measure and accuracy

5 Conclusions and Further Work

Word of mouth (WOM) has a powerful effect on consumer behavior. Opinion leaders have a stronger influence on other members in an opinion society. How to find opinion leaders has been of interest to both practitioners and researchers. Existing models mainly focus on quantitative features of opinion leaders, such as the number of posts and the central position in the social network. This research considers this issue from the viewpoint of text mining. We propose an integrated text mining model by extracting three important features of opinion leaders, namely novelty, expertise and richness of information, from documents. Finally, we compare the proposed text mining model with four quantitative approaches, i.e., involvement, degree centrality, closeness centrality and betweenness centrality, evaluated by human judgment. In our experiments, we found that the involvement approach is the best one among the quantitative approaches. The text mining approach outperforms its quantitative counterparts as the richness of document information provides a similar function to the qualitative features of opinion leaders. The proposed text mining approach further measures opinion leaders based on the features of novelty and expertise.

In terms of possible future work, integrated strategies that combine qualitative and quantitative approaches should take advantage of both. For example, a 2-step integrated strategy, which uses the text mining-based approach in the first step and the quantitative approach based on involvement in the second step, may achieve better performance. Larger scale experiments, including more topics, documents and testers, should be done in order to produce more general results.

References

Agarwal, N., Liu, H., Tang, L. and Yu, P. S. 2008. Identifying the Influential Bloggers in a Community. Proceedings of WSDM, 207-217.

Arndt, J. 1967. Role of Product-Related Conversations in the Diffusion of a New Product. Journal of Marketing Research, 4(3):291-295.

Flynn, L. R., Goldsmith, R. E. and Eastman, J. K. 1996. Opinion Leaders and Opinion Seekers: Two New Measurement Scales. Academy of Marketing

He, J. and Chen, L. 2008. Chinese Word Segmentation Based on the Improved Particle Swarm Optimization Neural Networks. Proceedings of IEEE Cybernetics and Intelligent Systems, 695-699.

Jiang, L., Ge, B., Xiao, W. and Gao, M. 2013. BBS Opinion Leader Mining Based on an Improved PageRank Algorithm Using MapReduce. Proceedings of Chinese Automation Congress, 392-396.

Katz, E. and Lazarsfeld, P. F. 1957. Personal Influence. New York: The Free Press.

Keller, E. and Berry, J. 2003. One American in Ten Tells the Other Nine How to Vote, Where to Eat and, What to Buy. They Are The Influentials. The Free Press.

Khammash, M. and Griffiths, G. H. 2011. Arrivederci CIAO.com Buongiorno Bing.com- Electronic Word-of-Mouth (eWOM), Antecedences and Consequences. International Journal of Information Management, 31:82-87.

Kim, D. K. 2007. Identifying Opinion Leaders by Using Social Network Analysis: A Synthesis of Opinion Leadership Data Collection Methods and Instruments. PhD Thesis, the Scripps College of Communication, Ohio University.

Kim, S. and Han, S. 2009. An Analytical Way to Find Influencers on Social Networks and Validate their Effects in Disseminating Social Games. Proceedings of Advances in Social Network Analysis and Mining, 41-46.

Kritikopoulos, A., Sideri, M. and Varlamis, I. 2006. BlogRank: Ranking Weblogs Based on Connectivity and Similarity Features. Proceedings of the 2nd International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications, Article 8.

Li, F. and Du, T. C. 2011. Who Is Talking? An Ontology-Based Opinion Leader Identification Framework for Word-of-Mouth Marketing in Online Social Blogs. Decision Support Systems, 51:190-197.

Myers, J. H. and Robertson, T. S. 1972. Dimensions of Opinion Leadership. Journal of Marketing Research, 4:41-46.

Page, L., Brin, S., Motwani, R. and Winograd, T. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University.

Rosen, E. 2002. The Anatomy of Buzz: How to Create Word of Mouth Marketing, 1st ed., Doubleday.

Song, X., Chi, Y., Hino, K. and Tseng, B. L. 2007. Identifying Opinion Leaders in the Blogosphere. Proceedings of CIKM'07, 971-974.

Zhou, H., Zeng, D. and Zhang, C. 2009. Finding Leaders from Opinion Networks. Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics, 266-268.
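The feature scores defined in Section 3 (expertise, novelty, richness of information and their integration into ITM) can be illustrated with a small Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the input layout (`freqs`, `years`, `media`), the use of min-max scaling for the normalization step (the paper only states that each feature is normalized to a value between 0 and 1), the choice of taking d as the sum of the controlled-vocabulary frequencies, and all function names.

```python
from math import sqrt

COLLECTED_YEAR = 2010  # collection year of the document set, as stated in Section 3.3


def expertise(freqs):
    """Euclidean norm of the frequency vector divided by its maximum entry."""
    m = max(freqs, default=0)
    if m == 0:
        return 0.0
    return sqrt(sum((w / m) ** 2 for w in freqs))


def novelty(first_search_years, collected_year=COLLECTED_YEAR):
    """Weighted share of high (1.0), normal (0.66) and low (0.33) novelty words."""
    if not first_search_years:
        return 0.0
    e_h = e_m = e_l = 0
    for year in first_search_years:
        interval = collected_year - year
        if interval < 2:
            e_h += 1      # high novelty
        elif interval == 2:
            e_m += 1      # normal novelty
        else:
            e_l += 1      # low novelty
    return (e_h + 0.66 * e_m + 0.33 * e_l) / (e_h + e_m + e_l)


def richness(num_significant_words, num_multimedia_objects):
    """ric = d + g."""
    return num_significant_words + num_multimedia_objects


def min_max(values):
    """Assumed normalization: rescale to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def itm_scores(members):
    """Mean of the three normalized features for every member."""
    names = list(members)
    nov = min_max([novelty(members[n]["years"]) for n in names])
    exp = min_max([expertise(members[n]["freqs"]) for n in names])
    # Assumption: d is the sum of controlled-vocabulary word frequencies.
    ric = min_max([richness(sum(members[n]["freqs"]), members[n]["media"])
                   for n in names])
    return {n: (a + b + c) / 3 for n, a, b, c in zip(names, nov, exp, ric)}


# Hypothetical members, not taken from the Mobile01 data set.
members = {
    "a": {"freqs": [3, 0, 1], "years": [2009, 2008, 2005], "media": 2},
    "b": {"freqs": [1, 1, 0], "years": [2004], "media": 0},
}
scores = itm_scores(members)
```

With only two members, min-max scaling pushes every feature to 0 or 1, so the example mainly shows how the pieces fit together; on a realistic member set the normalized values would spread across the interval.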
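The involvement baseline (INV) and the pooling of top-ten candidates described in Section 4.2 can be sketched in the same way. The thread representation, a pair of an initiator and a list of repliers, is a hypothetical data layout chosen for the example, as are the function names and the alphabetical tie-breaking rule; the paper does not specify how ties are broken.

```python
from collections import Counter


def involvement(threads):
    """INV per member: documents initiated plus replies received from other members."""
    scores = Counter()
    for initiator, repliers in threads:
        scores[initiator] += 1  # the initiated document itself
        scores[initiator] += sum(1 for r in repliers if r != initiator)
    return scores


def top_k(scores, k=10):
    """Names of the k highest-scoring members (ties broken alphabetically, an assumption)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def candidate_pool(rankings):
    """Union of several top-k lists with duplicates removed, first-seen order kept."""
    seen = {}
    for ranking in rankings:
        for member in ranking:
            seen.setdefault(member, None)
    return list(seen)


# Hypothetical threads: (initiator, repliers).
threads = [("a", ["b", "c", "a"]), ("b", ["a"]), ("a", [])]
inv = involvement(threads)
pool = candidate_pool([top_k(inv, 2), ["b", "c"]])
```

In the experiment reported above, the same pooling step applied to the five top-ten lists leaves 25 distinct members for the human testers to judge.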
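The four measures reported in Table 1 follow the usual definitions based on a confusion matrix over the candidate pool. The sketch below assumes that a model's top-ranked members are compared with the members selected by human judgment and that accuracy is computed over the whole pool; the function name and the sample sets are hypothetical and do not reproduce the paper's data.

```python
def evaluate(predicted, judged, pool):
    """Recall, precision, F-measure and accuracy of a model's picks against human judgment."""
    predicted, judged, pool = set(predicted), set(judged), set(pool)
    tp = len(predicted & judged)          # picked by the model and by the testers
    fp = len(predicted - judged)          # picked by the model only
    fn = len(judged - predicted)          # picked by the testers only
    tn = len(pool - predicted - judged)   # picked by neither
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pool) if pool else 0.0
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "accuracy": accuracy}


# Hypothetical pool of eight candidates.
pool = ["a", "b", "c", "d", "e", "f", "g", "h"]
result = evaluate(predicted=["a", "b", "c"], judged=["b", "c", "d"], pool=pool)
```

Here two of the three predicted members agree with the judged set and four candidates are rejected by both, which gives recall, precision and F-measure of 2/3 and an accuracy of 0.75.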