=Paper= {{Paper |id=Vol-1737/T2-4 |storemode=property |title=Word Embeddings for Information Extraction from Tweets |pdfUrl=https://ceur-ws.org/Vol-1737/T2-4.pdf |volume=Vol-1737 |authors=Surajit Dasgupta,Abhash Kumar,Dipankar Das,Sudip Kumar Naskar,Sivaji Bandyopadhyay |dblpUrl=https://dblp.org/rec/conf/fire/DasguptaKDNB16 }}
 Word Embeddings for Information Extraction from Tweets

Surajit Dasgupta* (Jadavpur University, India) surajit.techie@gmail.com
Abhash Kumar* (Jadavpur University, India) abhashmaddi@gmail.com
Dipankar Das (Jadavpur University, India) ddas@cse.jdvu.ac.in
Sudip Kumar Naskar (Jadavpur University, India) sudip.naskar@cse.jdvu.ac.in
Sivaji Bandyopadhyay (Jadavpur University, India) sivaji.cse.ju@gmail.com

ABSTRACT
This paper describes our approach to “Information Extraction from Microblogs Posted during Disasters”, an attempt at the shared task of the Microblog Track at the Forum for Information Retrieval Evaluation (FIRE) 2016 [2]. Our method uses vector space word embeddings to extract information from microblogs (tweets) related to disaster scenarios, and can be replicated across various domains. The system, which shows encouraging performance, was evaluated on the Twitter dataset provided by the FIRE 2016 shared task.

CCS Concepts
• Computing methodologies → Natural language processing; Information extraction;

Keywords
word embedding; information retrieval; information extraction; social media

1. INTRODUCTION
Social media plays a very important role in the dissemination of real-time information, such as news of disaster outbreaks. Efficient processing of information from social media websites such as Twitter can help us pursue proper disaster mitigation strategies. Extracting relevant information from tweets proves to be a challenging task owing to their short and noisy nature. Information extraction from social media text is a well researched problem [3], [1], [9], [4], [8], [7]. Approaches using the bag-of-words model, n-gram based methods and machine learning have been used extensively to extract information from microblogs.

2. TASK DEFINITION
A set of tweets, T = {t1, t2, t3, ..., tn}, and a set of topics, Q = {q1, q2, q3, ..., qm}, are given. Each topic contains a title, a brief description and a detailed narrative on what type of tweets are considered relevant to the topic. The tweets given in the task were posted during the Nepal earthquake1 in April 2015. Each topic expresses a broad information need during a disaster, such as: availability or requirement of general or medical resources by the population in the disaster affected area, availability or requirement of resources in a geographical region, reports of relief being carried out by an organization, and reports of damage to infrastructure. The main objective of this task is to extract all tweets, ti ∈ T, that are relevant to each topic, qj ∈ Q, with high precision and high recall, and to rank them in their order of relevance.

3. DATA AND RESOURCES
This section describes the dataset and resources provided to the shared task participants. A text file containing 50,068 tweet identifiers that were posted during the Nepal earthquake in April 2015 was provided by the organizers. A Python script was provided that downloaded the tweets using the Twitter API2 into a JSON encoded tweet file, which was processed during the task. A text file of topic descriptions in TREC3 format was provided, containing the information necessary for the extraction of relevant tweets. The topic file consisted of 7 topics: FMT1, FMT2, FMT3, FMT4, FMT5, FMT6 and FMT7. Each topic consisted of the following 4 sections:

    · <num> : Topic number.

    · <title> : Title of the topic.

    · <desc> : Description of the topic.

    · <narr> : A detailed narrative which describes what types of tweets would be considered relevant to the topic.

4. SYSTEM DESCRIPTION

4.1 Preprocessing
We parsed the JSON encoded tweets and retrieved the following attributes: tweet identifier, tweet text and geolocation. From the tweets we removed the Twitter handles starting with @, URLs, and all punctuation marks except instances of a single “.” (period) and a single “,” (comma), using regular expressions. We removed the non-ASCII characters from the tweets and converted the remaining text to lower case.
We also preprocessed the <narr> sections of the topic file. We removed the punctuation marks and stop words and converted the text to lower case. The preprocessed <narr> section of each topic was used for building the word bags.
*Indicates equal contribution.
1 http://en.wikipedia.org/wiki/April_2015_Nepal_earthquake
2 https://dev.twitter.com/overview/api/tweets
3 http://trec.nist.gov
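The tweet cleaning described in Section 4.1 can be sketched as follows. This is a minimal illustration, not the authors' actual script; the exact regular expressions and the sample tweet are assumptions based on the description above.

```python
import re

def preprocess_tweet(text):
    """Clean a tweet along the lines of Section 4.1: strip @handles,
    URLs, non-ASCII characters and punctuation (keeping "." and ","),
    then lower-case the remaining text."""
    text = re.sub(r"@\w+", " ", text)                      # Twitter handles
    text = re.sub(r"https?://\S+", " ", text)              # URLs
    text = text.encode("ascii", errors="ignore").decode()  # drop non-ASCII
    text = re.sub(r"[^\w\s.,]", " ", text)                 # punctuation except . and ,
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess_tweet("Help needed! @RedCross sending tents http://t.co/xyz"))
# -> help needed sending tents
```

The same routine, minus the period/comma exception, would also serve for the <narr> sections before stop word removal.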
4.2 Word Bags
To build the topic-specific word bags, the preprocessed <narr> section was manually checked to retain the relevant words for each topic. The topic words were expanded using synonyms obtained from NLTK WordNet4. The past, past participle and present continuous forms of verbs were obtained using the NodeBox5 library for Python. Vowels, except when occurring as the initial character, were removed to create unnormalized versions of the words, which are commonly used on Twitter owing to the 140 character limit. The resultant set of words was used to create the word bag for each topic.

4.3 Entity Detection
For the topics FMT5 and FMT6, location and organization information was required to be detected from the tweets. To extract the location information, we used the geolocation attribute of the tweets and the Stanford NER tagger6 to extract location names from the tweet text. Similarly, we used the Stanford NER tagger to detect organizations in the tweet text.
We split the tweet file into 10 files containing 5,000 tweets each. The Stanford NER tagger was run in parallel on the 10 split files to identify the locations and organizations, if any. This reduced the computation time by 85%.

4.4 Word Vectors
We used the pre-trained 200 dimensional GloVe [6] word vectors trained on Twitter data7 (2 billion tweets) to create the vectors of the preprocessed tweets and the word bags.
The tweet vectors were created by taking the normalized summation of the vectors of the words in the tweet that were present in the vocabulary of the pre-trained GloVe model. Where a word was not part of the model vocabulary, it was assigned the null vector.

\[ \vec{t}_i = \frac{1}{N_v(t_i)} \sum_{j=1}^{N_v(t_i)} \vec{u}_{ij}, \qquad \vec{u}_{ij} = \vec{0} \ \text{if } u_{ij} \notin \text{vocabulary} \]

where,
\(\vec{t}_i\) = tweet vector of the ith tweet, \(t_i\);
\(N_v(t_i)\) = number of words in \(t_i\) present in the vocabulary;
\(\vec{u}_{ij}\) = vector of the jth word in the ith tweet.

Similarly, the word bag vectors were created by taking the normalized summation of the vectors of the words in the word bag that were present in the vocabulary of the pre-trained GloVe model. Out of vocabulary words were assigned the null vector.

\[ \vec{q}_i = \frac{1}{N_v(q_i)} \sum_{j=1}^{N_v(q_i)} \vec{w}_{ij}, \qquad \vec{w}_{ij} = \vec{0} \ \text{if } w_{ij} \notin \text{vocabulary} \]

where,
\(\vec{q}_i\) = topic vector of the ith word bag, \(q_i\);
\(N_v(q_i)\) = number of words in \(q_i\) present in the vocabulary;
\(\vec{w}_{ij}\) = vector of the jth word in the ith word bag.

The tweet vector \(\vec{t}_i\) and the word bag vector \(\vec{q}_j\) are used to calculate the similarity.
The Word2Vec [5] module of Gensim8 was used to create the tweet vectors and the topic vectors using the pre-trained GloVe model. The GloVe vectors were converted to Word2Vec format using code from the GitHub repository manasRK/glove-gensim9.

4.5 Similarity Metric
We used the cosine similarity measure to calculate the similarity, S, between the tweet vector and the topic vector.

\[ S = \text{cosine-sim}(\vec{t}_i, \vec{q}_j) = \frac{\vec{t}_i \cdot \vec{q}_j}{\lVert \vec{t}_i \rVert \, \lVert \vec{q}_j \rVert} \]

A high value of S denotes higher similarity between the tweet vector \(\vec{t}_i\) and the topic vector \(\vec{q}_j\), and vice versa.
For topics such as FMT5 and FMT6, where entity information such as location (LOC) or organization (ORG) was required, the consolidated score S' was calculated as follows:

\[ S' = \frac{S + I}{2}, \qquad I = \begin{cases} 1, & \text{if LOC or ORG is present} \\ 0, & \text{otherwise} \end{cases} \]

The consolidated value S' shifts the cosine similarity towards 1 if the location or organization information is present (high relevance) and towards 0 otherwise (low relevance).

5. RESULTS AND ERROR ANALYSIS
Table 1 presents the results obtained by our word embedding based approach. As seen in the table, Run 1 achieved the best results among the runs, owing to the fact that Run 1 used word bags built from the corresponding description of each topic, whereas the other runs split the word bags categorically and averaged the similarity between the tweet vector and the split topic vectors.

    Run ID      Precision@20    Recall@1000    MAP@1000    Overall MAP
    JU NLP 1    0.4357          0.3420         0.0869      0.1125
    JU NLP 2    0.3714          0.3004         0.0647      0.0881
    JU NLP 3    0.3714          0.3004         0.0647      0.0881

                 Table 1: Results of automatic runs

The lower performance obtained in Runs 2 and 3 is a result of the averaging, which only approximated the actual cosine similarity value between the tweet and topic vectors. Runs 2 and 3, which are identical in nature, used cosine distance as their similarity metric.

4 http://www.nltk.org/howto/wordnet.html
5 https://www.nodebox.net/code/index.php/Linguistics
6 http://nlp.stanford.edu/software/CRF-NER.shtml
7 http://nlp.stanford.edu/projects/glove/
8 https://radimrehurek.com/gensim/models/word2vec.html
9 https://github.com/manasRK/glove-gensim
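The vector averaging of Section 4.4 and the scoring of Section 4.5 can be sketched as follows. This is a minimal sketch, not the authors' implementation: toy 2-dimensional embeddings stand in for the 200-dimensional pre-trained GloVe Twitter vectors, and the words used are hypothetical.

```python
import numpy as np

# Toy embeddings standing in for the pre-trained GloVe model vocabulary.
EMB = {"quake": np.array([1.0, 0.0]), "relief": np.array([0.0, 1.0])}

def text_vector(words, embeddings, dim):
    """Normalized summation of in-vocabulary word vectors (Section 4.4);
    out-of-vocabulary words contribute only the null vector."""
    in_vocab = [embeddings[w] for w in words if w in embeddings]
    return np.mean(in_vocab, axis=0) if in_vocab else np.zeros(dim)

def cosine_sim(a, b):
    """S = (a . b) / (||a|| ||b||), as in Section 4.5."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def consolidated_score(s, has_loc_or_org):
    """S' = (S + I) / 2, with I = 1 if a LOC or ORG entity is present."""
    return (s + (1.0 if has_loc_or_org else 0.0)) / 2.0

t = text_vector(["quake", "relief", "oovword"], EMB, dim=2)  # OOV word ignored
q = text_vector(["quake"], EMB, dim=2)
print(round(cosine_sim(t, q), 4))  # -> 0.7071
print(consolidated_score(cosine_sim(t, q), True))
```

Note how the consolidated score pulls the similarity towards 1 when an entity is detected and towards 0 when it is not, exactly the shift described above.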
6.   CONCLUSION
   In this paper, we presented a brief overview of our system to address information extraction from microblog data. We observed that building word bags which contain all the topic words relevant to a topic gave better results than splitting the word bags; therefore, Run 1 exhibited better results than the rest. Considering hashtags as a feature should also improve the performance of the system.
   As future work, we would like to explore more sophisticated techniques to build the vectors of the tweets from the vectors of their constituent words, by taking the sequence of the words into account. We also plan to incorporate more topic specific features to improve the performance of our system.

7.   REFERENCES
[1] S. Choudhury, S. Banerjee, S. K. Naskar, P. Rosso, and
    S. Bandyopadhyay. Entity extraction from social media
    using machine learning approaches. In Working Notes
    in Forum for Information Retrieval Evaluation (FIRE
    2015), pages 106–109, Gandhinagar, India, 2015.
[2] S. Ghosh and K. Ghosh. Overview of the FIRE 2016
    Microblog track: Information Extraction from
    Microblogs Posted during Disasters. In Working notes
    of FIRE 2016 - Forum for Information Retrieval
    Evaluation, Kolkata, India, December 7-10, 2016,
    CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, and
    P. Meier. Practical extraction of disaster-relevant
    information from social media. In Proceedings of the
    22nd International Conference on World Wide Web,
    WWW ’13 Companion, pages 1021–1024, New York,
    NY, USA, 2013. ACM.
[4] M. Imran, S. M. Elbassuoni, C. Castillo, F. Diaz, and
    P. Meier. Extracting information nuggets from
    disaster-related messages in social media. Proc. of
    ISCRAM, Baden-Baden, Germany, 2013.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
    J. Dean. Distributed representations of words and
    phrases and their compositionality. In Advances in
    neural information processing systems, pages
    3111–3119, 2013.
[6] J. Pennington, R. Socher, and C. D. Manning. Glove:
    Global vectors for word representation. In EMNLP,
    volume 14, pages 1532–43, 2014.
[7] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu,
    and M. Demirbas. Short text classification in twitter to
    improve information filtering. In Proceedings of the 33rd
    International ACM SIGIR Conference on Research and
    Development in Information Retrieval, SIGIR ’10,
    pages 841–842, New York, NY, USA, 2010. ACM.
[8] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen.
    Microblogging during two natural hazards events: what
    twitter may contribute to situational awareness. In
    Proceedings of the SIGCHI conference on human factors
    in computing systems, pages 1079–1088. ACM, 2010.
[9] X. Yang, C. Macdonald, and I. Ounis. Using word
    embeddings in twitter election classification. arXiv
    preprint arXiv:1606.07006, 2016.
