KEYWEXT: A Multilingual Keyword Extraction Service based on Word Embeddings

KEYWEXT: un Servicio Multilingüe de Extracción de Palabras Clave basado en Word Embeddings

Eva Martínez Garcia, Luis Talegón, Iván Cañaveral, Pablo Martínez, Paul Goldbaum
SEEDTAG
c/Marqués de Valdeiglesias 6, 28004, Madrid (Spain)
{evamartinez, luis, ivancanaveral, pablomartinez, paul}@seedtag.com

Abstract: Contextual Advertising uses the content a user is seeing to understand their interest in real time and serve relevant advertising. A good representation of this context is the first step towards a more precise selection of advertisements that are relevant to the content. We describe the demonstration of SEEDTAG's keyword extraction system: KEYWEXT. It uses state-of-the-art multilingual BERT-based positional embeddings to help contextualize advertising campaigns by retrieving the n-grams that best represent the content of the document. This leads to more relevant advertising while remaining respectful of the user.

Keywords: keywords, automatic extraction, word embeddings, multilingual, word2vec, sentence-BERT.

Resumen: La Publicidad Contextual utiliza el contenido que un usuario está viendo para entender su interés en tiempo real. Una buena representación del contexto que un usuario está leyendo en un momento puntual es el primer paso para una mejor selección de los anuncios más adecuados. Presentamos la descripción de la demostración del sistema de extracción de palabras clave de SEEDTAG: KEYWEXT. Utiliza técnicas de estado del arte sobre embeddings posicionales de modelos multilingües basados en BERT para ayudar a contextualizar campañas publicitarias al extraer aquellos n-gramas que mejor representan el contenido de un documento. De esta manera se asegura una publicidad relevante a la vez que respetuosa con el usuario.

Palabras clave: palabras clave, extracción automática, word embeddings, multilingüe, word2vec, sentence-BERT.
1 Introduction

Contextual Advertising technologies allow brands to reach their target users in the right context, based on what the user is seeing at that moment. If we can understand that context, we will be able to serve more suitable advertisements, thus improving the user's experience as well as the advertiser's impact. A good understanding of the context is key to selecting the most suitable ads for a page and leads to an improved user experience and advertisement impact.

In a digital scenario, the user's context can be a web page showing a news article, a blog post, or an encyclopedia entry. Although this context is nowadays multimodal (text, images, video, etc.), the text still holds some of the most important information, especially in the case of professionally-produced content. Natural Language Processing (NLP) techniques can help us to categorize the suitability of a web article, taking into account the context of what the user is looking at and without the need to make use of any personal data.

Contextual Advertising strategies typically rely either on well-known taxonomies¹ or on the targeting of vertical domains (automotive, sports, etc.). The solutions based on these resources are usually rigid and limit the precision that can be achieved, and are thus insufficient to achieve a higher level of contextualization.

We describe the demonstration of SEEDTAG's keyword extraction system: KEYWEXT, a web service able to extract the most relevant words, bigrams, and trigrams from a web article by using the information from pre-trained word embeddings.

¹ https://www.iab.com/guidelines/content-taxonomy/

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
These words will help to enhance the context information available for improving the advertising contextualization workflow.

The rest of the paper is structured as follows: Section 2 describes the problem in more detail. Section 3 describes the KEYWEXT service and its integration in SEEDTAG's contextualization workflow, and Section 4 shows some examples of the service functionality. Finally, Section 5 draws conclusions and points out some future work.

2 Motivation

As users, we are used to being surrounded by ads related to our latest search or to the latest sites that we visited. This information does not normally match the content of the website that we are visiting, and is often distracting or even annoying.

Since SEEDTAG cares about the privacy of web users, we focus on understanding the context by implementing a cookieless contextualization strategy. We believe that the important information to decide which ad to serve on a particular site comes from the information that the user is seeing at that particular moment. Looking into the text present in a web scenario, we need to understand the information the user is reading at a particular moment to select the most suitable advertisement. Finding the keywords of a text plays an important role in improving the representation of this context. Informally, we understand a set of keywords as the set of phrases that depict the main information from a text.

There exist several keyword or keyphrase extraction methods (Campos et al., 2020; Mihalcea and Tarau, 2004). Although these approaches are fast and easy to apply, they are highly language-dependent and often return noisy lists of words that are difficult to use.

Word embeddings have shown their potential for solving different tasks (Martínez Garcia et al., 2017; Sun et al., 2019) that they were not trained for. They also provide support for a multilingual scenario (Reimers and Gurevych, 2020). Following the transfer-learning trend, KEYWEXT uses pre-trained word embeddings to build a context representation vector for a whole document and to retrieve its closest content n-grams as the keyword list. Our extractor is not the first one to use BERT-like models (Devlin et al., 2019) to extract keywords (Grootendorst, 2020) but, to the best of our knowledge, KEYWEXT is the first to take advantage of the positional embeddings and the local contextual information that they provide to retrieve better and more relevant words.

3 System Description

SEEDTAG's contextualization workflow goes as follows. When we have digital content where it is possible to serve advertising, we process the content to assess its suitability and to get all the information needed to select the most adequate advertising campaign.

Figure 1 shows SEEDTAG's contextualization workflow. The suitability of a particular piece of digital content is measured by analyzing two main features: brand safety and content adequacy. On the one hand, brand safety measures whether the content is safe to be related to a brand or product. On the other hand, content adequacy measures how much the content is related to the topics that the brand or product wants to be associated with. Content adequacy is also understood as the categorization of the content. Finally, we use the KEYWEXT service to extract the keywords and keyphrases from the content.

Figure 1: SEEDTAG advertisement contextualization workflow.

Both the categorization and brand safety modules as well as KEYWEXT work directly and in parallel on the text from the digital content. Then, their outputs are combined to feed the contextualization flow. In particular, the context information from KEYWEXT is used to refine targeting strategies and brand positioning in order to select the best advertisements according to the current context that a user is seeing.

KEYWEXT is a Python² web service built using the Tornado³ web framework. Figure 2 shows the architecture of the system. The service has different actors working together. When it receives a request with a text extracted from a web article and its detected language, the KeyWord Handler passes the information to a spaCy⁴ Named Entity Recognizer and the KeyWord Extractor. Then, these modules obtain the list of Named Entities and keywords respectively, which the KeyWord Handler uses to build the response.

Figure 2: KEYWEXT web service. (Diagram components: KEYWORD HANDLER, NER, KEYWORD EXTRACTOR; the sample response contains the fields entities and keywords.)

² https://www.python.org/
³ https://www.tornadoweb.org/
⁴ https://spacy.io/

3.1 Key Words Extraction

We want to retrieve the most relevant n-grams from a text. Thus, we need to understand the text to select the content words or n-grams that best represent it.

KEYWEXT performs the keyword extraction in two steps:

1. Build a vector representation from the whole input text.
2. Retrieve the closest words, bigrams, and trigrams to the text representation.

The first step is done by using a sentence-BERT (Reimers and Gurevych, 2019) pre-trained model. KEYWEXT sums the sentence embedding of each sentence in the text to obtain the document embedding.

The second step is done by calculating the distance between the content words of the input text and the document embedding. KEYWEXT uses the BERT positional embeddings for the input tokens to obtain the word, bigram, and trigram embeddings, and cosine similarity as the distance measure. These embeddings are the result of summing the embeddings of the tokens that form a particular word, bigram, or trigram. We decided not to consider n-grams with n > 3 in order to control the sparseness and the number of possible combinations when checking and calculating distances. Using the positional embeddings from the sentence-BERT models gives the service a local idea of the context of the text. Even though it is not yet a document-level context, this approach allows KEYWEXT to have a broader vision without conflating different senses of a word in the same embedding. In short, this helps to better disambiguate the keyword choice and to produce more adequate results.

3.2 Multilingualism

KEYWEXT is also multilingual. Having a service that is able to handle requests in different languages is crucial for its integration within SEEDTAG's workflow, so that it covers the languages of the countries the company operates in.

Multilingualism is achieved by using a multilingual sentence-BERT-based embedding model (Reimers and Gurevych, 2020) to retrieve the keyword set from articles in different languages.

4 Sample of Keyword Extraction Functionality

We show some examples of the KEYWEXT functionality on some of the most relevant languages for SEEDTAG.

If we process the following short text in English:

    How the suspension of the AstraZeneca vaccine is affecting the inoculation drive in each Spanish region. Regional authorities have administered 5.7 million doses and fully vaccinated nearly 1.7 million people, but the jabs for essential workers have been put on hold due to the decision to halt the use of the Anglo-Swedish medication.[...]
The KEYWEXT service returns the following keyword list:

    astrazeneca, administered, suspension, vaccine, astrazeneca vaccine, authorities have administered

Notice how a generic open-domain pre-trained word embedding model can detect a recent Named Entity like astrazeneca as a relevant element of the text. If a different kind of embedding model such as word2vec (Mikolov et al., 2013) had been used, this adaptation would not have been possible due to vocabulary coverage restrictions.

Moving to Spanish texts, when processing a negative news piece about an attack in Burkina Faso discussing the death of two journalists:

    Dos periodistas españoles mueren asesinados en un ataque en Burkina Faso. Un grupo de hombres armados asaltó el convoy de los reporteros David Beriain y Roberto Fraile en dos camionetas y una decena de motos.[...]

We obtain the following set of keywords using our KEYWEXT service:

    ataque, asesinados, mueren, periodistas, viajaban, asaltó, españoles mueren, periodistas españoles mueren, ataque en Burkina

Although these words or phrases can seem trivial, once fed into our contextualization models they reinforce the models' knowledge about potentially harmful content and allow SEEDTAG to help advertisers better design their Contextual Advertising strategies.

5 Conclusions and Future Work

We presented KEYWEXT, a keyword extraction system that takes advantage of pre-trained word embeddings to retrieve the most relevant n-grams from an article. These extracted keywords feed into SEEDTAG's contextual advertising workflow to identify the most suitable matches among brands, their advertising campaigns, and web articles.

KEYWEXT is a web service that uses sentence-BERT-based pre-trained models to understand the context of an article beyond the sentence level and then retrieve the closest words, bigrams, and trigrams of the document. KEYWEXT also takes advantage of the multilingual sentence-BERT models to handle articles in different languages.

New versions of KEYWEXT will improve the handling of document-level information: using document-level text representations, taking into account topic fluctuations when producing the set of top keywords, etc. The new features will improve the extraction of keywords for longer and more complex texts.

References

Campos, R., V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL2019.

Grootendorst, M. 2020. KeyBERT: Minimal keyword extraction with BERT.

Martínez Garcia, E., C. Creus, C. España-Bonet, and L. Màrquez. 2017. Using word embeddings to enforce document-level lexical consistency in machine translation. The Prague Bulletin of Mathematical Linguistics, 108.

Mihalcea, R. and P. Tarau. 2004. TextRank: Bringing order into text. In Proceedings of EMNLP2004.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th NIPS.

Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP2019.

Reimers, N. and I. Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of EMNLP2020.

Sun, C., X. Qiu, Y. Xu, and X. Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics.
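The two-step extraction described in Section 3.1 can be illustrated with a minimal Python sketch. It is not the KEYWEXT implementation: the function names (extract_keywords, toy_embed_sentence, toy_embed_tokens), the hand-written two-dimensional vectors, and the mean-pooled toy sentence embedding are hypothetical stand-ins for a sentence-BERT model, and the rule that a candidate may not start or end with a stopword is an assumption about how content n-grams could be selected. What the sketch does take from the paper is the summing of sentence embeddings into a document embedding, the summing of token embeddings into word, bigram, and trigram embeddings, and the ranking by cosine similarity.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = List[float]


def vector_sum(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_keywords(
    sentences: Sequence[Sequence[str]],
    embed_sentence: Callable[[Sequence[str]], Vector],
    embed_tokens: Callable[[Sequence[str]], List[Vector]],
    stopwords: Set[str],
    top_k: int = 5,
    max_n: int = 3,
) -> List[Tuple[str, float]]:
    # Step 1: the document embedding is the sum of the sentence embeddings.
    document = vector_sum([embed_sentence(s) for s in sentences])

    # Step 2: score every word, bigram, and trigram against the document.
    best: Dict[str, float] = {}
    for tokens in sentences:
        token_vectors = embed_tokens(tokens)  # one vector per token position
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                ngram = tokens[start:start + n]
                # Assumption: candidates may not start or end with a stopword.
                if ngram[0] in stopwords or ngram[-1] in stopwords:
                    continue
                # An n-gram embedding is the sum of its token embeddings.
                vector = vector_sum(token_vectors[start:start + n])
                score = cosine_similarity(vector, document)
                phrase = " ".join(ngram)
                best[phrase] = max(score, best.get(phrase, -1.0))

    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy embeddings; a real system would call a sentence-BERT model.
TOY_VECTORS: Dict[str, Vector] = {
    "the": [0.0, 1.0],
    "were": [0.0, 1.0],
    "vaccine": [1.0, 0.0],
    "suspension": [0.8, 0.2],
    "doses": [0.7, 0.3],
    "administered": [0.6, 0.4],
}
STOPWORDS = {"the", "were"}


def toy_embed_tokens(tokens: Sequence[str]) -> List[Vector]:
    return [TOY_VECTORS[token] for token in tokens]


def toy_embed_sentence(tokens: Sequence[str]) -> Vector:
    # Toy sentence embedding: mean of the token vectors.
    vectors = toy_embed_tokens(tokens)
    return [component / len(vectors) for component in vector_sum(vectors)]


SENTENCES = [
    ["the", "vaccine", "suspension"],
    ["vaccine", "doses", "were", "administered"],
]

for phrase, score in extract_keywords(
    SENTENCES, toy_embed_sentence, toy_embed_tokens, STOPWORDS, top_k=3
):
    print(f"{phrase}: {score:.3f}")
```

On this toy data the three highest-scoring candidates are administered, doses, and doses were administered. Passing the two embedding functions as parameters keeps the ranking logic independent of the model, so a multilingual sentence-BERT model could be plugged in without changing the scoring code.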