KEYWEXT: A Multilingual Keyword Extraction
        Service based on Word Embeddings
      KEYWEXT: un Servicio Multilingüe de Extracción de
         Palabras Clave basado en Word Embeddings
               Eva Martínez Garcia, Luis Talegón, Iván Cañaveral,
                          Pablo Martínez, Paul Goldbaum
                                       SEEDTAG
                 c/Marqués de Valdeiglesias 6, 28004, Madrid (Spain)
          {evamartinez, luis, ivancanaveral, pablomartinez, paul}@seedtag.com

      Abstract: Contextual Advertising uses the content a user is viewing to
      understand their interests in real time and serve relevant advertising. A good
      representation of this context is the first step towards a more precise
      selection of advertisements that are relevant to the content. We describe the
      demonstration of SEEDTAG’s keyword extraction system, KEYWEXT. It uses
      state-of-the-art multilingual BERT-based positional embeddings to help
      contextualize advertising campaigns by retrieving the n-grams that best
      represent the content of the document. This leads to more relevant advertising
      that remains respectful of the user.
      Keywords: keywords, automatic extraction, word embeddings, multilingual,
      word2vec, sentence-BERT.
      Resumen: La Publicidad Contextual utiliza el contenido que un usuario está
      viendo para entender su interés en tiempo real. Una buena representación del
      contexto que un usuario está leyendo en un momento puntual es el primer paso
      para una mejor selección de los anuncios más adecuados. Presentamos la
      descripción de la demostración del sistema de extracción de palabras clave de
      SEEDTAG: KEYWEXT. Utiliza técnicas de estado del arte sobre embeddings
      posicionales de modelos multilingües basados en BERT para ayudar a
      contextualizar campañas publicitarias al extraer aquellos n-gramas que mejor
      representan el contenido de un documento. De esta manera se asegura una
      publicidad relevante a la vez que respetuosa con el usuario.
      Palabras clave: palabras clave, extracción automática, word embeddings,
      multilingüe, word2vec, sentence-BERT.


          Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction
Contextual Advertising technologies allow brands to reach their target users in
the right context, based on what the user is seeing at that moment. If we can
understand that context, we can serve more suitable advertisements, improving
both the user’s experience and the advertiser’s impact. A good understanding of
the context is therefore key to selecting the most suitable ads for a page.

In a digital scenario, the user’s context can be a web page showing a news
article, a blog post, or an encyclopedia entry. Although this context is
nowadays multimodal (text, images, video, etc.), the text still holds some of
the most important information, especially in professionally produced content.
Natural Language Processing (NLP) techniques can help us assess the suitability
of a web article by taking into account the context of what the user is looking
at, without the need to use any personal data.

Contextual Advertising strategies typically rely either on well-known
taxonomies¹ or on the targeting of vertical domains (automotive, sports, etc.).
Solutions based on these resources are usually rigid and limit the precision
that can be achieved, so they are insufficient to reach a higher level of
contextualization. We describe the demonstration of SEEDTAG’s keyword extraction
system, KEYWEXT: a web service that extracts the most relevant words, bigrams,
and trigrams from a web article using the information from pre-trained word
embeddings. These keywords enhance the context information available to the
advertising contextualization workflow.

    ¹ https://www.iab.com/guidelines/content-taxonomy/

The rest of the paper is structured as follows: Section 2 describes the problem
in more detail. Section 3 describes the KEYWEXT service and its integration in
SEEDTAG’s contextualization workflow, and Section 4 shows some examples of the
service functionality. Finally, Section 5 draws conclusions and points out some
future work.

2   Motivation
As users, we are used to being surrounded by ads related to our latest search or
to the latest sites that we visited. These ads do not normally match the content
of the website that we are visiting, and are often distracting or even annoying.

Since SEEDTAG cares about the privacy of web users, we focus on understanding
the context by implementing a cookieless contextualization strategy. We believe
that the important information for deciding which ad to serve on a particular
site comes from what the user is seeing at that particular moment. For the text
present on a web page, this means understanding the information the user is
reading at that moment in order to select the most suitable advertisement.
Finding the keywords of a text plays an important role in improving the
representation of this context. Informally, we understand a set of keywords as
the set of phrases that depict the main information of a text. Several keyword
or keyphrase extraction methods exist (Campos et al., 2020; Mihalcea and Tarau,
2004). Although these approaches are fast and easy to apply, they are highly
language-dependent and often return noisy lists of words that are difficult to
use.

Word embeddings have shown their potential for solving tasks that they were not
trained for (Martínez Garcia et al., 2017; Sun et al., 2019). They also provide
support for a multilingual scenario (Reimers and Gurevych, 2020). Following the
transfer-learning trend, KEYWEXT uses pre-trained word embeddings to build a
context representation vector for a whole document and to retrieve its closest
content n-grams as the keyword list. Our extractor is not the first to use
BERT-like models (Devlin et al., 2019) to extract keywords (Grootendorst, 2020)
but, to the best of our knowledge, KEYWEXT is the first to take advantage of the
positional embeddings and the local contextual information that they provide to
retrieve better and more relevant words.

3   System Description
SEEDTAG’s contextualization workflow goes as follows. When we have digital
content where it is possible to serve advertising, we process the content to
assess its suitability and to get all the information needed to select the most
adequate advertising campaign.

[Figure 1 diagram, titled SEEDTAG ADVERTISEMENT CONTEXTUALIZATION. Labels:
PUBLISHERS, Digital content, BRAND SUITABILITY, BRAND SAFETY, CATEGORIZATION,
KEYWEXT, BRANDS, ADVERTISING CAMPAIGNS, TARGETING STRATEGY, AD SERVING.]
        Figure 1: SEEDTAG advertisement contextualization workflow.

Figure 1 shows SEEDTAG’s contextualization workflow. The suitability of a
particular piece of digital content is measured by analyzing two main features:
brand safety and content adequacy. On the one hand, brand safety measures
whether the content is safe to be related to a brand or product. On the other
hand, content adequacy measures how closely the content is related to the topics
that the brand or product wants to be associated with. Content adequacy is also
understood as the categorization of the content. Finally, we use the KEYWEXT
service to extract the keywords and keyphrases from the content.
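
As a rough picture of this workflow, the sketch below runs three placeholder
analyses over the same text and merges their outputs into a single context
record. Every name in it (ContentContext, check_brand_safety, categorize,
keyword_stub, contextualize), the word lists, and the use of a thread pool are
assumptions made for illustration, under the further assumption that the three
analyses are independent of each other. None of it describes SEEDTAG’s actual
modules.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ContentContext:
    """Hypothetical record that gathers the three signals for one article."""
    brand_safe: bool
    categories: list
    keywords: list


# The three analyzers below are toy placeholders, not real models.
def check_brand_safety(text):
    unsafe_terms = {"attack", "killed"}  # illustrative word list only
    return not any(term in text.lower() for term in unsafe_terms)


def categorize(text):
    topics = {"health": ("vaccine", "doses"), "sports": ("match", "league")}
    lowered = text.lower()
    return [name for name, cues in topics.items()
            if any(cue in lowered for cue in cues)]


def keyword_stub(text):
    # Stand-in for a call to a keyword extraction service: keeps long words.
    return sorted({word for word in text.lower().split() if len(word) > 8})


def contextualize(text):
    """Run the three analyses concurrently and combine their outputs."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        safety = pool.submit(check_brand_safety, text)
        categories = pool.submit(categorize, text)
        keywords = pool.submit(keyword_stub, text)
        return ContentContext(safety.result(), categories.result(),
                              keywords.result())


context = contextualize(
    "Regional authorities have administered 5.7 million vaccine doses")
print(context)
```

The combined record is what a downstream targeting step would consume; how the
signals are actually weighted or merged is not specified here.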


Both the categorization and brand safety modules, as well as KEYWEXT, work
directly and in parallel on the text from the digital content. Their outputs are
then combined to feed the contextualization flow. In particular, the context
information from KEYWEXT is used to refine targeting strategies and brand
positioning in order to select the best advertisements for the context that a
user is currently seeing.

KEYWEXT is a Python² web service built using the Tornado³ web framework.
Figure 2 shows the architecture of the system, in which several components work
together. When the service receives a request with a text extracted from a web
article and its detected language, the KeyWord Handler passes the information to
a spaCy⁴ Named Entity Recognizer and to the KeyWord Extractor. These modules
obtain the list of Named Entities and the list of keywords, respectively, which
the KeyWord Handler uses to build the response.

    ² https://www.python.org/
    ³ https://www.tornadoweb.org/
    ⁴ https://spacy.io/

[Figure 2 diagram: a sample input (a heading, a Lorem ipsum body text, and the
language code "en") enters the KEYWORD HANDLER inside KEYWEXT, which is
connected to the NER and KEYWORD EXTRACTOR modules. The sample output is
{ entities: aliqua, ut labore et dolore, lorem ipsum dolor, adipisicing elit;
keywords: heading, amet consectetur, sit amet consectetur }.]
        Figure 2: KEYWEXT web service.

3.1   Keyword Extraction
We want to retrieve the most relevant n-grams from a text. Thus, we need to
understand the text in order to select the content words or n-grams that best
represent it.

KEYWEXT performs the keyword extraction in two steps:

  1. Build a vector representation from the whole input text.

  2. Retrieve the closest words, bigrams, and trigrams to the text
     representation.

The first step uses a pre-trained sentence-BERT model (Reimers and Gurevych,
2019). KEYWEXT sums the sentence embeddings of all the sentences in the text to
obtain the document embedding.

The second step calculates the distance between each content word of the input
text and the document embedding. KEYWEXT uses the BERT positional embeddings of
the input tokens to obtain the word, bigram, and trigram embeddings, and uses
cosine similarity as the distance measure. Each of these embeddings is the sum
of the embeddings of the tokens that form a particular word, bigram, or trigram.
We decided not to consider n-grams with n > 3 in order to control the sparseness
and the number of possible combinations when checking and calculating distances.
Using the positional embeddings from the sentence-BERT models gives the service
a local idea of the context of the text. Even though this is not yet a
document-level context, the approach gives KEYWEXT a broader view without
conflating different senses of a word in the same embedding. In short, it helps
to disambiguate the keyword choice and to produce more adequate results.

3.2   Multilingualism
KEYWEXT is also multilingual. Having a service that is able to handle requests
in different languages is crucial for its integration within SEEDTAG’s workflow,
which must cover the languages of the countries the company operates in.

Multilingualism is achieved by using a multilingual sentence-BERT-based
embedding model (Reimers and Gurevych, 2020) to retrieve the keyword set from
articles in different languages.

4   Sample of Keyword Extraction Functionality
We show some examples of the KEYWEXT functionality in some of the most relevant
languages for SEEDTAG.

If we process the following short text in English:

    How the suspension of the AstraZeneca vaccine is affecting the inoculation
    drive in each Spanish region. Regional authorities have administered 5.7
    million doses and fully vaccinated nearly 1.7 million people, but the jabs
    for essential workers have been put on hold due to the decision to halt the
    use of the Anglo-Swedish medication.[...]
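
Before turning to the output, the two steps of Section 3.1 can be sketched on
the first sentence of this text. This is a minimal illustration under several
assumptions, not the service’s implementation: toy_vector is a hypothetical
hash-based stand-in for the embeddings of a sentence-BERT model, each sentence
embedding is taken to be the mean of its token vectors, and a tiny illustrative
stopword list is used to discard n-grams that start or end with a function word.
Only the summing of sentence embeddings, the summing of token embeddings per
n-gram, the limit n ≤ 3, and the cosine ranking come from the description above.

```python
import hashlib
import math
import re

DIM = 16  # size of the toy vectors; real models use hundreds of dimensions
# Tiny illustrative stopword list (an assumption, not the service's filter).
STOPWORDS = {"the", "of", "is", "in", "and", "have", "been", "to", "for",
             "on", "but", "how", "each", "due"}


def toy_vector(token):
    """Deterministic pseudo-embedding built from a hash (NOT a real model)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(byte - 127.5) / 127.5 for byte in digest[:DIM]]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())


def extract_keywords(text, top_k=6, max_n=3):
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [tokens for tokens in sentences if tokens]

    # Step 1: document embedding = sum of the sentence embeddings.
    # Each sentence embedding is the mean of its token vectors (assumption).
    doc = [0.0] * DIM
    for tokens in sentences:
        total = [0.0] * DIM
        for token in tokens:
            total = add(total, toy_vector(token))
        doc = add(doc, [x / len(tokens) for x in total])

    # Step 2: embed every word, bigram, and trigram as the sum of its token
    # vectors and score it by cosine similarity to the document embedding.
    scores = {}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                vec = [0.0] * DIM
                for token in gram:
                    vec = add(vec, toy_vector(token))
                scores[" ".join(gram)] = cosine(vec, doc)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


sample = ("How the suspension of the AstraZeneca vaccine is affecting the "
          "inoculation drive in each Spanish region.")
for phrase, score in extract_keywords(sample):
    print(f"{score:.3f}  {phrase}")
```

Because the toy vectors carry no meaning, the ranking printed by this sketch is
arbitrary; only the flow of the computation is of interest.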


For this text, the KEYWEXT service returns the following keyword list:

    astrazeneca, administered, suspension, vaccine, astrazeneca vaccine,
    authorities have administered

Notice how a generic open-domain pre-trained word embedding model can detect a
recent Named Entity like astrazeneca as a relevant element of the text. If a
different kind of embedding model such as word2vec (Mikolov et al., 2013) had
been used, this adaptation would not have been possible due to vocabulary
coverage restrictions.

Moving to Spanish texts, consider a negative news piece about an attack in
Burkina Faso that discusses the death of two journalists:

    Dos periodistas españoles mueren asesinados en un ataque en Burkina Faso.
    Un grupo de hombres armados asaltó el convoy de los reporteros David
    Beriain y Roberto Fraile en dos camionetas y una decena de motos.[...]

We obtain the following set of keywords using our KEYWEXT service:

    ataque, asesinados, mueren, periodistas, viajaban, asaltó, españoles
    mueren, periodistas españoles mueren, ataque en Burkina

Although these words or phrases can seem trivial, once fed into our
contextualization models they reinforce the models’ knowledge about potentially
harmful content and allow SEEDTAG to help advertisers better design their
Contextual Advertising strategies.

5   Conclusions and Future Work
We presented KEYWEXT, a keyword extraction system that takes advantage of
pre-trained word embeddings to retrieve the most relevant n-grams from an
article. The extracted keywords feed into SEEDTAG’s contextual advertising
workflow to identify the most suitable matches among brands, their advertising
campaigns, and web articles.

KEYWEXT is a web service that uses sentence-BERT-based pre-trained models to
understand the context of an article beyond the sentence level and then retrieve
the closest words, bigrams, and trigrams of the document. KEYWEXT also takes
advantage of multilingual sentence-BERT models to handle articles in different
languages.

Future versions of KEYWEXT will improve the handling of document-level
information, for example by using document-level text representations and by
taking into account topic fluctuations when producing the set of top keywords.
These new features will improve the extraction of keywords for longer and more
complex texts.

References

Campos, R., V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt.
  2020. YAKE! Keyword extraction from single documents using multiple local
  features. Information Sciences, 509:257–289.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of
  deep bidirectional transformers for language understanding. In Proceedings of
  NAACL 2019.

Grootendorst, M. 2020. KeyBERT: Minimal keyword extraction with BERT.

Martínez Garcia, E., C. Creus, C. España-Bonet, and L. Màrquez. 2017. Using
  word embeddings to enforce document-level lexical consistency in machine
  translation. The Prague Bulletin of Mathematical Linguistics, 108.

Mihalcea, R. and P. Tarau. 2004. TextRank: Bringing order into text. In
  Proceedings of EMNLP 2004.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013.
  Distributed representations of words and phrases and their compositionality.
  In Advances in Neural Information Processing Systems 26: 27th NIPS.

Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using
  Siamese BERT-networks. In Proceedings of EMNLP 2019.

Reimers, N. and I. Gurevych. 2020. Making monolingual sentence embeddings
  multilingual using knowledge distillation. In Proceedings of EMNLP 2020.

Sun, C., X. Qiu, Y. Xu, and X. Huang. 2019. How to fine-tune BERT for text
  classification? In China National Conference on Chinese Computational
  Linguistics.

