NaturalOpinions: NLP-based opinion extraction in user-generated content
NaturalOpinions: extracción de opinión basada en PLN para contenidos generados por usuarios

Antonio S. Valderrábanos
BITEXT
Cólquide 6, Las Rozas, Madrid, Spain
asv@bitext.com

Enrique Torrejón
BITEXT
Cólquide 6, Las Rozas, Madrid, Spain
etorrejon@bitext.com

Abstract: More and more people have access to the Internet, and the content they produce keeps growing. Knowing what people write in blogs, forums and social media in general about specific brands and products has become strategically important for large corporations around the globe. We will present a demo of an application, NaturalOpinions, which follows a rule-based NLP approach to parsing opinions on Twitter in Spanish. The application is able to detect the topic and extract the sentiment, either positive or negative, about particular product features or brand names. Social media intelligence solutions can thus be implemented rapidly with Bitext's language technologies.
Keywords: Information extraction, opinion analysis, sentiment classification, user-generated content, Twitter, NLP, brand intelligence.

1 Introduction

Brand intelligence tools are becoming a strategic asset for companies interested in keeping track of users' opinions and sentiments in user-generated content, most significantly in blogs, forums, Twitter, Facebook, etc. Those tools have traditionally adopted statistics-based approaches along the lines of standard text mining techniques (Popescu and Etzioni, 2005). We claim that the tasks of opinion extraction and polarity determination require a more sophisticated approach in which mainly rule-based NLP software, furnished with syntactic and semantic processing, can provide the fine-grained analysis needed for reliable reports on opinion and sentiment classification. Bitext's focus therefore continues the path of using dependency grammars and full semantic representation (Dini and Mazzini, 2002) to tackle the challenge of opinion extraction. We will demo an application of NaturalOpinions which analyzes opinions about major brands on Twitter in Spanish.

2 User-generated content

User-generated content continues to grow at an amazing pace. We will consider three types of user-generated content: blogs, Facebook, and Twitter.

Regarding blogs, as of August 2010 the number of blogs was doubling every 5 months and a new blog was published every second. In early 2009, there were approximately 200 million blogs in English. In August 2010, the number of active blogs in Spanish was 400 million. If we take other languages into account, there were 1 billion active blogs.

If we consider Facebook, in February 2010 there were 400 million active users; on July 21st, 2010, there were 500 million active users. Every user writes an average of 25 comments per month; more than 35 million users update their status every day and more than 60 million updates are made every day. Merely keeping track of all the status updates and detecting opinions in them is a considerable task, for which search engines such as Booshaka (www.booshaka.com) can be very useful.

If we focus on Twitter, there are more than 105 million registered users and 300,000 new users register every day. There are approximately 80 million tweets every day. In August 2010, Twitter had accumulated 20 billion tweets. In early September, it surpassed the barrier of 23 billion tweets, and it is estimated that by the end of 2010 Twitter will have accumulated 30 billion tweets. Merely processing the daily tweets in search of opinions about brands is a considerable task.

Spain is among the 10 countries with the largest number of registered Twitter users. It accounts for 1.7% of the users, that is, over 2 million users. According to the study "Uso de Twitter en España", from the Asociación Española para la Economía Digital, 63% of users recommend products on Twitter, 61% share complaints about products or services, and 94% follow specific companies on Twitter. These percentages mean that Twitter is a source of social opinions which cannot be ignored in brand intelligence.

3 Opinions in Twitter: social media writing style

Tweets usually have their own writing style features, which they share with other user-generated content repositories such as Facebook, blogs, forums, or even SMS. Any NLP-based opinion analysis software for Twitter must handle the following style features:

1. 140-character limit: tweets, as microblogs, have a limit of 140 characters and, therefore, users need to restrain themselves from writing full-fledged opinions and condense them as much as possible, for example:
   a. La calidad de Spotify mobile desde el Ipod Touch, Ipad o Nexus One es genial. Mucho mejor que desde el MacBook
   b. @hayleytheone no es nada bonita la coca-cola sin curvas xdd parece que es la de marca blanca xdd a mi no me gusta la coca-cola xdd

2. Use of tokens: tweets contain special tokens such as @ for user names and # for trending topics; they also have http links for related content; for example:
   a. @ranablue T recomiendo hablar sobre el iPAD a @fotomaf y @cuasante
   b. Estoy muy contento con mi #Kindle
   c. esta es la direccion si quieren bajar la aplicacion para iPhone, iPod touch, y iPad en el es gratuita http://goo.gl/aia1

3. Lack of punctuation marks: in order to save characters, users write tweets without punctuation marks, which makes parsing even harder because there are no marks separating sentences or phrases; for example:
   a. @hayleytheone no es nada bonita la coca-cola sin curvas xdd parece que es la de marca blanca xdd a mi no me gusta la coca-cola xdd
   b. Eso nunca se sabe jeje @AntonioPamos @cosechadel66 Pero el IPAD tiene más aplicaciones que la Sra. Obama

4. Relaxation in the use of accented characters (vowels): this makes morphological processing difficult, and opinion analysis software may have to include spell-checking; for example:
   a. creo que movistar jode las conexiones a proposito
   b. iphone es muchisimo mas caro y tactil...
   c. dios que miticas las canciones de los juegos de nintendo
   d. los mensajes automatizados de bienvenida en twitter me dan mala sensación

5. Spelling errors: most tweets are written with spelling errors, given the careless and colloquial writing style which is predominant in Twitter. Therefore, including a spell-checker with the opinion analysis software must be considered. For example:
   a. qué tío más pesado el chabal de coca-cola
   b. que esquisito este twitter

6. SMS style: tweets have also adopted the emoticons, abbreviations, slang, etc., which are typical of SMS. Opinion analysis software must consider whether to include them in the parsing process if they add sentiment information to the opinion conveyed in the tweets. For example, emoticons such as ;-) ^^ ¬¬ xD xDD, etc.:
   a. @Shinfu No está mal el iPhone 4 ¿verdad? ;-)
   b. @Maria__Lourdes ^^ xro el ipod tiene un lado malo... es un enganxe OO
   c. @yeyustyle jajaja un blog? pues uno tuyo personal, no va mal =P mi ipod y yo... y X horas de viaje.. ¬¬ como un dia vaya os arrepentireis xD
   d. parece ke co echofon si ke me funciona en el ipod
   e. @miguelrtorija Yo tengo flickr, molaaaaaaa!! XDD yo también pensaba comprarme un blackberry o algo jajaja el nokia no mola para twittear XD

7. Colloquial style: tweets also feature a colloquial style which can be described as "I write the way I speak"; this includes all the phrasing, swear words, chopped words, etc., which are typical of spoken language; for example:
   a. @asturking pos es muy fácil vincular el iphone con un macbook
   b. @guarroman: Flipando con el iPhone de @vego" // si es que mi iPhone es el mejor!
   c. Partidita al Parchis en el iPad con mi hermana, sobrina y mi cuñado. Como mola el iPad leches, aquí el Parchis http://yfrog.com/0m5dxdj
   d. Y joder, por la puta ballena pensaba que el iPod iba mal, su puta vida...
   e. @Natychan Entre eso, el iPhone 4 y que la pantalla de mi iPod Nano está cascando de mala manera
   f. el twitter ta mal me esta borrando los tweets

8. Space for creativity and humor: tweets also contain a great deal of humorous expressions that show the creativity users like to indulge in to make their comments more brilliant and, therefore, retweetable. Opinion analysis software has to continuously update its sentiment lexicons in order to be able to evaluate these tweets correctly; for example:
   a. rt @jakarrion a quien buen apple se arrima buen iphone le cobija #variantes
   b. este twitter falla más que una escopeta de feria
   c. el iphone 4 es la cosa más bonita que ha parío mare
   d. @aletshe @ferr_kon las penas con iphone son menos penas xd
   e. Oh dios. Mi ipod está tan jodidamente brillante, y suave, que voy a orgasmizar!
   f. dios la peña esta archienganchada al twitter

4 Overview of NaturalOpinions

NaturalOpinions for Twitter has been developed using Bitext's proprietary NLP technology. NaturalOpinions consists of three main components:

1. DataSuite NLP engine, which includes:
   a. DataLexica: a component with over 3 million Spanish words, morphologically classified and used for POS-tagging.
   b. DataGrammar: a syntactic parser which uses a dependency grammar specifically designed for opinion analysis. This parser can return either a complete syntactic tree or, when the tweet is not grammatically correct, a shallow parsed tree. The parser establishes the dependency structure from which brand features and opinions (either positive or negative) can be identified.

2. Semantic extraction component: this component takes care of parsing the syntactic tree and extracting the information relevant for opinion analysis, namely:
   a. Brand or product name about which the opinion is expressed, for example, "iPad", "iPhone", etc.
   b. Brand component/feature about which the opinion is expressed, for instance, "the screen", "the battery", etc.
   c. Brand component/feature attributes, which allow for a topic classification of opinions. These attributes may be customized according to the domain. Currently, attributes include General, Product, Service, Image, Quality, and Price.
   d. Semantic polarity: whether the opinion is an affirmative or negative statement.
   e. Comparative opinion: whether there is a comparison of two or more brands, with detection of the topical brand and the compared-against brand; likewise, with detection of the topical feature and the compared-against feature.
   f. Opinion itself: the part of the parsed sentence which contains the user's assessment of the brand/feature and includes the words expressing value.

3. Scoring component: this component takes into account the opinion and the semantic polarity returned by the previous component and calculates a score (a number with two decimal places) which measures the strength of the conveyed opinion. The scoring component consists of:
   a. Sentiment dictionaries for the following parts of speech: nouns, adjectives, verbs, adverbs, and determiners. There is also a sentiment dictionary of features.
   b. Scoring algorithm: the algorithm takes into account the values in the sentiment dictionaries. These values include:
      i. Adding value: values which are summed in the algorithm process, for instance when summing the values of adjectives.
      ii. Multiplier value: values which are multiplied in the algorithm process, for instance for adverbs which modify adjectives.
      iii. Hue value: values which have to do with the intrinsic meaning of the words; for instance, "gorgeous" and "horrible" have an intrinsic absolute value (positive hue and negative hue respectively), whereas "cheap" has a relative value (neutral hue) which may turn out to be positive or negative depending on the context, for instance "cheap price" (positive) versus "cheap material" (negative).

NaturalOpinions for Twitter also includes a graphical dashboard which visualizes tweets according to brands, features, attributes, opinion polarity, time stamps, and opinion holders, among others. It can be accessed on Bitext's website, www.bitext.com.

References

Dini, L. and Mazzini, G. 2002. Opinion classification through information extraction. In Proceedings of the Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, pages 299-310.

Popescu, A. M. and Etzioni, O. 2005. Extracting Product Features and Opinions from Reviews. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 339-346.
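
Illustrative sketch of the scoring component (Section 4)

The interplay of adding, multiplier, and hue values described in Section 4 can be illustrated with a minimal Python sketch. This is not Bitext's implementation, which the paper does not disclose. The dictionary entries, the numeric values, the names `Entry`, `LEXICON`, `FEATURE_HUE` and `score_opinion`, and the rule that a neutral-hue word takes its sign from the feature it modifies are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hue labels: the paper names positive, negative and neutral hues;
# encoding them as +1 / -1 / 0 is an assumption of this sketch.
POSITIVE, NEGATIVE, NEUTRAL = 1, -1, 0


@dataclass(frozen=True)
class Entry:
    kind: str            # "add" (summed) or "mult" (multiplied)
    value: float         # illustrative magnitude, not taken from the paper
    hue: int = NEUTRAL   # intrinsic hue of the word


# Toy sentiment dictionary (hypothetical entries and values).
LEXICON = {
    "gorgeous": Entry("add", 2.0, POSITIVE),
    "horrible": Entry("add", 2.0, NEGATIVE),
    "cheap": Entry("add", 1.0, NEUTRAL),
    "very": Entry("mult", 1.5),
}

# Toy feature dictionary: the sign a neutral-hue word takes next to
# each feature (hypothetical).
FEATURE_HUE = {"price": POSITIVE, "material": NEGATIVE}


def score_opinion(words, feature=None, negated=False):
    """Return a two-decimal score for the opinion words.

    `negated` stands in for the semantic polarity (negative statement)
    returned by the semantic extraction component.
    """
    total = 0.0
    multiplier = 1.0
    for word in words:
        entry = LEXICON.get(word)
        if entry is None:
            continue  # words outside the dictionary carry no sentiment
        if entry.kind == "mult":
            # Multiplier values scale the next adding value.
            multiplier *= entry.value
            continue
        hue = entry.hue
        if hue == NEUTRAL:
            # A relative word is resolved by the feature it modifies.
            hue = FEATURE_HUE.get(feature, NEUTRAL)
        total += hue * entry.value * multiplier
        multiplier = 1.0
    if negated:
        total = -total
    return round(total, 2)
```

Under these assumed values, `score_opinion(["cheap"], feature="price")` is positive and `score_opinion(["cheap"], feature="material")` is negative, mirroring the "cheap price" versus "cheap material" contrast, while an adverb such as "very" increases the magnitude of the adjective that follows it.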