SmokPro: Towards Tobacco Product
                        Identification in Social Media Text

                Venkata Himakar Yanamandra*, Kartikey Pant*, and Radhika Mamidi

                        International Institute of Information Technology, Hyderabad
                            {himakar.yv,kartikey.pant}@research.iiit.ac.in
                                        radhika.mamidi@iiit.ac.in


                   Abstract. In this work, we explore the fine-grained classification of
                   tobacco-related tweets, focusing on the identification of tobacco
                   products. We release the SmokPro dataset, along with an extensible
                   method of labeling the tweets through a comprehensive annotation
                   schema. We then perform benchmarking experiments using state-of-the-art
                   text classification models that exploit contextual word embeddings and
                   achieve F1 scores as high as 0.971, showing the efficacy of the dataset
                   and the suitability of the models for the task.


           1     Introduction
           Smoking is one of the leading causes of preventable death, with tobacco use
           causing more than 7 million deaths per year worldwide¹. While cigarette
           smoking went down among high school students from 2011 to 2019, the number of
           students using e-cigarettes rose from 3.6 million to 5.4 million². Consequently,
           2807 cases of e-cigarette induced lung injury were reported in the United
           States alone, 68 of which resulted in death³. A search for e-cigarettes has a
           high chance of surfacing a biased conversation or an advertisement encouraging
           e-cigarette use as a socially acceptable practice [10].
                It thus becomes essential to monitor and regulate its use through the detec-
           tion of its mentions in social media text. Even though Twitter is being used as
           a resource for public health surveillance, very little information is known about
           tobacco products, especially modern tobacco products [12,13]. There is a wide
           availability of user-generated content on Twitter, and it is critical to find trends
           in different tobacco products for further research, monitoring, and regulatory
           enforcement efforts [9]. The language used in tweets is diverse, idiosyncratic,
           sprinkled with emojis, and rapidly evolving [3]. In contrast to previous studies,
           we have taken slang into account as well. The use of variable spellings with
           non-standard grammar has risen in informal media. Studying this will help us
           filter ambiguous and sarcastic tweets [4]. Although Pant et al. [13] explored the
            * The first two authors contributed equally to the work.
            ¹ https://bit.ly/2WOuQki
            ² https://bit.ly/2UrBnjf
            ³ https://bit.ly/3byz6Zf


Copyright ©  2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

           fine-grained classification of tobacco-related tweets, they did not identify
           the type of tobacco product mentioned, whether modern or traditional. We,
           therefore, extend their study by annotating their released dataset, SmokEng,
           along a different dimension.
               In this work, we explore tobacco product identification and release the
           SmokPro dataset⁴. We also provide the annotation schema used to label the
           dataset, which makes the dataset extensible. We further use existing
           state-of-the-art methods for text classification, including contextual word
           embeddings, to solve the multi-class classification problem effectively.


           2      Related Work

           There has been significant work towards the identification of tobacco-related
           social media content. In [12], the authors explored content and sentiment
           analysis of tobacco-related Twitter posts. They used basic statistical
           classifiers for detection, with emphasis on emerging products like hookah and
           e-cigarettes. However, the number of keywords was limited, and they did not
           incorporate slang into their dataset collection pipeline. Cole-Lewis et al.
           worked on content analysis for identifying trends in e-cigarette related
           tweets [2]. However, they did not give an explicit fine-grained pipeline for
           the identification of tobacco products. In [8], a supervised predictive model
           was developed for the automatic identification of proponents of e-cigarettes
           on Twitter using a dataset of 1000 independently annotated Twitter profiles.
           "Proponents" of e-cigarettes were defined as manufacturers, advocates, and
           users of e-cigarettes who actively promote the product. The authors also
           analyzed the behavior of the selected users but did not detect tweets
           mentioning e-cigarette use. Fine-grained classification of tobacco-related
           tweets incorporating slang keywords was explored in [13]. However, that work
           did not differentiate between different types of tobacco products. Marketing
           trends of e-cigarettes between 2008 and 2013 were analyzed in [9]. The authors
           collected e-cigarette tweets using keywords and classified them into
           advertising and non-advertising classes and further sub-classes. However, they
           only considered e-cigarette tweets and thus did not enable identification of
           the type of tobacco product.


           3      Dataset Creation

           3.1     Preliminaries


           We use the clean version of the SmokEng dataset [13], which consists of 2116
           tobacco-related tweets classified into five distinct classes. The authors
           collected 7,236,442 tweets between 1st October 2018 and 7th October 2018,
           which represents 1% of the entire Twitter feed. They then extracted
           tobacco-related tweets using an
            ⁴ https://github.com/kartikeypant/smokpro-tobacco-product-classification



           exhaustive list of keywords that includes colloquial slang. The authors state
           that collecting tweets over a full week avoids potential day-of-week bias.
           They then pruned non-English tweets and tweets from users with fewer than 100
           followers in an attempt to weed out spam and bot behavior. The data was then
           manually annotated into the following five classes: ambiguous mentions,
           personal or anecdotal mentions of tobacco use, advisory mentions of tobacco
           products, advertisements of tobacco products, and mentions of non-tobacco
           drugs.
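
The filtering steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the SmokEng pipeline itself: the `Tweet` fields, the `lang` tag, and the abbreviated `KEYWORDS` set are hypothetical stand-ins, and the actual keyword list is far more exhaustive.

```python
from dataclasses import dataclass

# Hypothetical, heavily abbreviated keyword list (assumption for illustration).
KEYWORDS = {"cigarette", "cigar", "hookah", "vape", "smoking", "baccy"}
MIN_FOLLOWERS = 100  # users with fewer followers are dropped, as described above


@dataclass
class Tweet:
    text: str
    lang: str        # assumed language tag, e.g. "en"
    followers: int   # follower count of the posting user


def is_candidate(tweet: Tweet) -> bool:
    """Return True if the tweet passes the language, follower, and keyword filters."""
    if tweet.lang != "en":
        return False
    if tweet.followers < MIN_FOLLOWERS:
        return False
    # Naive whitespace tokenization with surrounding punctuation stripped.
    tokens = {tok.strip(".,!?\"'").lower() for tok in tweet.text.split()}
    return bool(tokens & KEYWORDS)


tweets = [
    Tweet("Smoking alone is boring lol", "en", 250),
    Tweet("my vape stopped working Friday", "en", 12),
    Tweet("Fumar mata", "es", 900),
]
kept = [t.text for t in tweets if is_candidate(t)]
print(kept)
```

Only the first tweet survives here: the second is dropped by the follower threshold and the third by the language filter.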

           3.2     Dataset Annotation


           We annotate the data into five categories for the task of tobacco product
           identification. These categories were motivated by the general perception of
           tobacco, e-cigarettes, and non-drug related tweets.
              To decide whether a tweet is associated with a tobacco product (traditional
           or modern), we formulate the following guidelines. A tweet qualifies if it:
              – Gives an account of personal use of tobacco products, or provides
                instances of use of the products by the author or others
              – Mentions statistics of tobacco consumption
              – Refers to associated health risks
              – Belongs to a social campaign condemning the usage of tobacco
              – Endorses or targets the sale of tobacco products and analogous products
                or services
                 We then label each tweet as one of the following classes:

              1. Traditional Tobacco Product Mention: Tweets containing a tobacco
                 product mention according to the above guidelines related to a traditional
                 tobacco product. We consider cigarette, hookah, pipe, cigar, bidis, cigarillo,
                 shisha, and baccy as the traditional tobacco products.
              2. Modern Tobacco Product Mention: Tweets containing a tobacco prod-
                 uct mention according to the above guidelines related to a modern tobacco
                 product. We consider e-cigarette, e-juice, e-hookahs, e-liquid, mods, vape
                 pens, vapes, tank systems, and electronic nicotine delivery systems (ENDS)
                 as the modern tobacco products.
              3. Generic Mention of Smoking: Tweets matching the above guidelines for
                 tobacco usage without referring to any specific tobacco product, be it
                 traditional or modern, or to other drugs.
              4. Narcotics & Other Drug Mentions: Tweets denoting usage, purchase,
                 and information about narcotics and drugs other than traditional and
                 modern tobacco products⁵.
              5. Ambivalent or Unclear Mentions: Tweets containing information
                 unrelated to tobacco or any other drug, or tweets whose intent is
                 ambiguous, for example because of sarcasm.
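
The five classes and the product lists named in the schema can be encoded directly. The sketch below is an assumption-laden illustration: the dataset was labeled manually, so the `candidate_classes` helper is a hypothetical pre-labeling aid rather than the annotation procedure. It matches the listed terms exactly (no plural or spelling variants) and leaves out the acronym ENDS, which would collide with the ordinary word "ends".

```python
import re
from enum import Enum


class ProductClass(Enum):
    TRADITIONAL = "Traditional Tobacco Product Mention"
    MODERN = "Modern Tobacco Product Mention"
    GENERIC = "Generic Mention of Smoking"
    NARCOTICS = "Narcotics & Other Drug Mentions"
    AMBIVALENT = "Ambivalent or Unclear Mentions"


# Product terms taken from the class definitions above.
TRADITIONAL_TERMS = ["cigarette", "hookah", "pipe", "cigar", "bidis",
                     "cigarillo", "shisha", "baccy"]
MODERN_TERMS = ["e-cigarette", "e-juice", "e-hookahs", "e-liquid", "mods",
                "vape pens", "vapes", "tank systems"]


def _compile(terms):
    # Longest alternatives first; the look-arounds stop "cigarette" from
    # matching inside "e-cigarette" and "pipe" from matching inside "pipeline".
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"(?<![\w-])(?:{body})(?![\w-])", re.IGNORECASE)


TRADITIONAL_RE = _compile(TRADITIONAL_TERMS)
MODERN_RE = _compile(MODERN_TERMS)


def candidate_classes(text: str) -> set:
    """Return the product classes whose listed terms occur in the text."""
    found = set()
    if MODERN_RE.search(text):
        found.add(ProductClass.MODERN)
    if TRADITIONAL_RE.search(text):
        found.add(ProductClass.TRADITIONAL)
    return found


print(candidate_classes("Refilled my e-cigarette with e-liquid"))
print(candidate_classes("Smoking alone is boring lol"))
```

An empty result means that no listed product term was found, so the choice between the remaining three classes is left entirely to the annotator.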

            ⁵ https://www.incb.org/incb/en/narcotic-drugs/index.html



           Category                       Count   Example
           Ambivalent Mentions              333   "MENTION MENTION What have you been smoking"
           Traditional Tobacco Mentions     555   "Trying to sell cigars to a non-smoker is a waste of time.
                                                  Identify your target audience and create your message spe... URL"
           Narcotics Mentions               393   "MENTION Making my money and smoking my weed"
           Modern Tobacco Mentions          293   "my vape stopped working Friday so I took it back and
                                                  all he would do is send it back to the company!!! I'm furious."
           General Mentions                 542   "Smoking alone is boring lol"
                    Table 1. Class-wise distribution of tweets, with an example per class.
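
As a quick sanity check on Table 1, the class counts can be summed and converted to shares. The short sketch below uses only the counts reported in the table; the variable names are illustrative.

```python
# Class counts as reported in Table 1.
counts = {
    "Ambivalent Mentions": 333,
    "Traditional Tobacco Mentions": 555,
    "Narcotics Mentions": 393,
    "Modern Tobacco Mentions": 293,
    "General Mentions": 542,
}

total = sum(counts.values())
# Percentage share of each class, to show how balanced the dataset is.
shares = {name: 100.0 * n / total for name, n in counts.items()}

print(total)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {share:5.1f}%")
```

The counts add up to the 2116 tweets of the dataset, with traditional tobacco mentions forming the largest class and modern tobacco mentions the smallest.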


           3.3     Inter-annotator agreement

           We calculate the Inter-Annotator Agreement (IAA) between the two annotation
           sets of 2,116 tobacco-related tweets using Cohen's Kappa coefficient [6] to
           assess the quality of annotation for the fine-grained classification. The
           Kappa score of 0.861 indicates the high quality and usefulness of the schema.
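
For reference, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that computation; the toy label sequences are invented for illustration and are not drawn from SmokPro.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations (hypothetical): T=traditional, M=modern, G=generic, N=narcotics, A=ambivalent.
annotator_1 = ["T", "T", "M", "G", "N", "A", "M", "G"]
annotator_2 = ["T", "M", "M", "G", "N", "A", "M", "T"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on 6 of 8 items (p_o = 0.75) and chance agreement is 14/64, giving a Kappa of 0.68.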


           4     Methodology

           In this section, we describe the classifiers designed for the task of fine-grained
           classification. We employ widely used contextual word embedding models like
           BERT and RoBERTa to obtain state-of-the-art performance for this task of
           multi-class text classification. We also use carefully tuned FastText to serve as
           a baseline in our comparison.
               We use the following models for the experiments:

              1. FastText [7]: The FastText classifier uses embeddings for character
                 n-grams, which helps it generalize to unseen data, and word n-gram
                 features, which capture information about local word ordering. We train
                 the model in an end-to-end fashion by minimizing the cross-entropy loss
                 over the predictions using an SGD optimizer [1].
              2. BERT [5]: BERT is a contextualized word representation model based on
                 bidirectional transformers that leverages both left and right context
                 in each layer. The model can be trained in a simple yet efficient
                 manner without major architectural changes. BERT comes in two variants
                 based on the casing of the input: Uncased BERTLarge and Cased
                 BERTLarge. Both models are based on BERTLarge, which uses a 24-layer
                 transformer with 340M parameters. We finetune both models on the
                 training dataset and evaluate the finetuned models on the test dataset.
              3. RoBERTa [11]: RoBERTa is a replication study of BERT, trained on a much
                 larger dataset of over 160GB, compared to the 16GB training dataset
                 used for BERT. While BERT uses a character-level byte-pair encoding
                 (BPE) vocabulary, RoBERTa uses a larger byte-level BPE vocabulary that
                 helps to achieve better performance on various tasks. Finally, compared



                 to BERT, it removes the next sentence prediction objective from the
                 training procedure. We use its large variant, RoBERTaLarge, and
                 finetune it for the task.
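
To make the character n-gram idea behind FastText concrete, the sketch below extracts subword n-grams in the style described in [1]: the word is wrapped in boundary markers `<` and `>`, and all n-grams within a length range are collected. The range of 3 to 6 follows [1]; whether the same range was used in these experiments is an assumption, and `char_ngrams` is an illustrative helper, not part of the FastText library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Spelling variants share most subword units, which is what lets a
# subword-based model relate an unseen form to forms seen in training.
shared = set(char_ngrams("vape", 3, 3)) & set(char_ngrams("vapes", 3, 3))
print(sorted(shared))
```

This is why subword features are attractive for tweets, where variable spellings and slang produce many word forms that never occur in the training data.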

           5      Experiments and Results
           In this section, we describe the full fine-grained classification experiment towards
           detection of tobacco product mentions in the SmokPro dataset. The experiment
           is designed to show the efficacy of existing state-of-the-art models for text clas-
           sification tasks in our dataset.
                For both the BERT-based and RoBERTa models, we use a learning rate of
           $2 \times 10^{-5}$, a maximum sequence length of 50, and a weight decay of
           0.01 while finetuning the model. We use the recently released automatic
           hyperparameter optimization technique to optimize FastText's hyperparameters
           using the validation set.
                The experiment conducted in the study entails classification of tweets
           into the five classes defined above. We perform all our experiments using a
           72-20-8 train-test-validation split of the dataset. Table 2 shows the results
           of the experiment using two metrics: accuracy and weighted F1 score. We
           observe that Cased BERTLarge outperforms all other models, obtaining an
           accuracy of 97.10% and an F1 score of 0.971. Further, RoBERTaLarge performs
           competitively, obtaining an accuracy of 96.04% and an F1 score of 0.960.
           Although it is not based on contextual word embeddings, FastText performs
           fairly well, obtaining an accuracy of 80.18% and an F1 score of 0.801.
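
A 72-20-8 train-test-validation split of the kind described above could be produced as in the following sketch. The random shuffle, the seed, and the rounding of split sizes are assumptions made for illustration; the text does not state whether the split was stratified by class.

```python
import random


def split_indices(n, train_frac=0.72, test_frac=0.20, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train, test, and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remaining ~8%
    return train, test, validation


train_idx, test_idx, val_idx = split_indices(2116)
print(len(train_idx), len(test_idx), len(val_idx))
```

With 2116 tweets, these fractions give 1523 training, 423 test, and 170 validation examples under the rounding used here.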


                                Methods                 Accuracy    F1 Score
                                FastText                  80.18%       0.801
                                Uncased BERTLarge         92.84%       0.928
                                Cased BERTLarge           97.10%       0.971
                                RoBERTaLarge              96.04%       0.960
              Table 2. Experimental results for the five-class tobacco product identification task.
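
The two metrics reported in Table 2 can be computed as in the sketch below. Weighted F1 is taken here to mean the per-class F1 scores averaged with weights proportional to each class's support in the gold labels; that reading is an assumption, and the toy labels are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score


# Toy gold labels and predictions (hypothetical): T=traditional, M=modern, G=generic.
gold = ["T", "T", "M", "M", "G", "G"]
pred = ["T", "M", "M", "M", "G", "T"]
acc = accuracy(gold, pred)
f1 = weighted_f1(gold, pred)
print(round(acc, 3), round(f1, 3))
```

Weighting by support keeps the score from being dominated by small classes, which matters for a dataset whose classes range from 293 to 555 tweets.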


           6      Conclusion and Future Work
           In this work, we explored the identification of tobacco product mentions in
           tweets. We proposed an extensible data annotation schema for the task. We also
           released the SmokPro dataset, a dataset of 2116 tweets manually annotated into
           five classes as defined by the schema. We then performed experiments using
           existing state-of-the-art text classification models on the dataset and
           obtained F1 scores as high as 0.971. The effective predictive performance for
           the task paves the way for future work on disease surveillance, personal
           health mention detection, and aspect-based sentiment analysis of social media
           text pertaining to tobacco products.



           References
            1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
               subword information. Transactions of the Association for Computational Linguis-
               tics 5, 135–146 (2016)
            2. Cole-Lewis, H., Pugatch, J., Sanders, A., Varghese, A., Posada, S., Yun, C.,
               Schwarz, M., Augustson, E.: Social listening: A content analysis of e-cigarette dis-
               cussions on twitter. Journal of Medical Internet Research 17, e243 (10 2015).
               https://doi.org/10.2196/jmir.4969
            3. Collier, N., Nguyen, S., Nguyen, N.: Omg u got flu? analysis of shared health
               messages for bio-surveillance. Journal of Biomedical Semantics 2 (10 2011).
               https://doi.org/10.1186/2041-1480-2-S5-S9
            4. Conway, M., Hu, M., Chapman, W.: Recent advances in using natural language
               processing to address public health research questions using social media and
               consumer-generated data. Yearbook of Medical Informatics 28, 208–217 (08 2019).
               https://doi.org/10.1055/s-0039-1677918
            5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidi-
               rectional transformers for language understanding. In: Proceedings of the 2019
               Conference of the North American Chapter of the Association for Computational
               Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN,
               USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019),
               https://aclweb.org/anthology/papers/N/N19/N19-1423/
            6. Fleiss, J.L., Cohen, J.: The equivalence of weighted kappa and the intraclass
               correlation coefficient as measures of reliability. Educational and Psychological
               Measurement 33(3), 613–619 (1973). https://doi.org/10.1177/001316447303300309
            7. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
               classification. In: EACL (2016)
            8. Kavuluru, R., Sabbir, A.: Toward automated e-cigarette surveillance: Spotting e-
               cigarette proponents on twitter. Journal of Biomedical Informatics 61 (03 2016).
               https://doi.org/10.1016/j.jbi.2016.03.006
            9. Kim, A., Hopper, T., Simpson, S., Nonnemaker, J., Lieberman, A.A., Hansen,
               H., Guillory, J., Porter, L.: Using twitter data to gain insights into e-cigarette
               marketing and locations of use: An infoveillance study. Journal of Medical Internet
               Research (06 2015). https://doi.org/10.2196/jmir.4466
           10. Lazard, A., Saffer, A., Wilcox, G., Chung, A., Mackert, M., Bernhardt, J.: E-
               cigarette social media messages: A text mining analysis of marketing and consumer
               conversations on twitter. JMIR Public Health and Surveillance 2, e171 (12 2016).
               https://doi.org/10.2196/publichealth.6551
           11. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
               Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining
               approach (07 2019)
           12. Myslín, M., Zhu, S.H., Chapman, W., Conway, M.: Using twitter to examine
               smoking behavior and perceptions of emerging tobacco products. Journal of
               Medical Internet Research 15, e174 (08 2013). https://doi.org/10.2196/jmir.2534
           13. Pant, K., Yanamandra, V.H., Debnath, A., Mamidi, R.: SmokEng: Towards
               fine-grained classification of tobacco-related social media text. In:
               W-NUT@EMNLP (2019)

