A Hybrid Model For Information Retrieval From Microblogs During Disaster

Du Xin
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
duxin111@outlook.com

Wang Xiaoyu
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
wongxiaoyu1946@gmail.com

Zhuang Ziyao
Faculty of Science, Agriculture and Engineering, University of Newcastle upon Tyne, UK
zhuangziyao1@outlook.com

Qi Limin
School of Physical Sciences, Harbin Normal University, Harbin, China
qilimin111@outlook.com

Abstract

When a disaster occurs, social network sites such as Twitter are increasingly used to help direct rescue operations. This article describes the methods we used in FIRE 2017. We treated the identification of need-tweets and availability-tweets as a classification task, using logistic regression and a Support Vector Machine to decide the type of each tweet. For matching needs with availabilities, we treated the problem as an information retrieval task and used a retrieval model to complete it.

KEYWORDS

information retrieval, microblog, identifying need-tweets and availability-tweets, matching need-tweets and availability-tweets

1. Introduction

Users now publish real-time content on social networking sites, which have become one of the most extensive channels for information sharing and one of the fastest ways to spread information. Published material can contain specific locations and specific needs; when a disaster strikes, it can provide accurate rescue information that guides rescue work efficiently.

For Task 1, we need to distinguish need-tweets, availability-tweets, and others. We treat this as a classification problem and use classifiers to separate the useful need-tweets and availability-tweets from other tweets, such as useless tweets and repeatedly forwarded tweets.

For Task 2, using the tweets identified in Task 1, we perform the matching task: need-tweets must be matched with availability-tweets, and a need-tweet and its matching availability-tweet can even be in different languages. We treat this as an information retrieval problem: for each need-tweet in the query collection, we retrieve the best-matching availability-tweets.

2. Task 1

2.1 Problem Description

In Task 1, we judge whether a text is a need-tweet, an availability-tweet, or other, using the correlation between text and category. Viewed as a classification problem, classifying text content involves representing the text, preprocessing it, extracting features, and selecting and training a classifier. We use the training data (x1, y1), (x2, y2), ..., (xN, yN) to learn a classifier Y = f(x), so that the classification system can use the classifier to label new data. Here xi is the feature vector of a data point and yi is the classification output, with labels yi in {0, 1}.

2.2 Classifier Models

In this working note, two groups adopt the SVM, a binary model defined as the maximum-margin linear classifier in the feature space. Its learning strategy is margin maximization, and training reduces to a convex quadratic programming problem. We chose the RBF kernel, which maps the samples non-linearly into a higher-dimensional space and has relatively low numerical complexity. Both groups ultimately completed Task 1 with LibSVM.

The third group uses an LR (logistic regression) model. Every data point of the LR model has an influence on the classification plane, and this influence decreases with the point's distance to the plane; when the data dimension is high, the LR model is combined with parameter regularization.
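As an illustration of the Task 1 setup, a logistic-regression classifier over word features can be sketched as follows. This is a minimal toy sketch with made-up example tweets; it is not the LibSVM/LR tooling actually used for the submitted runs, and the learning rate and iteration count here are illustrative, not the submitted settings.

```python
import math

# Toy labeled tweets (hypothetical examples): 1 = need-tweet, 0 = availability-tweet
train = [
    ("need drinking water in shelter", 1),
    ("we need food packets urgently", 1),
    ("drinking water available at camp", 0),
    ("food packets available near station", 0),
]

# Word-feature vocabulary (Section 2.3 found plain words to be the best features)
vocab = sorted({w for text, _ in train for w in text.split()})

def features(text):
    """Binary bag-of-words feature vector over the vocabulary."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the logistic loss
weights = [0.0] * len(vocab)
bias = 0.0
learning_rate, iterations = 0.5, 1000  # illustrative values only
for _ in range(iterations):
    for text, y in train:
        x = features(text)
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = p - y
        weights = [w - learning_rate * err * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * err

def predict(text):
    """Return 1 (need-tweet) or 0 (availability-tweet)."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, features(text))) + bias)
    return 1 if p > 0.5 else 0
```

On this separable toy data the trained model recovers the labels; one convenience of LR noted below is that the probability score p is directly observable for each sample.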
At classification time, the computation of the LR model is minimal and prediction is very fast, so the storage requirements are low, and it is convenient to observe the probability scores of samples.

2.3 Feature Selection

For feature selection, we conducted experiments with words and with n-grams (n = 2, 3, 4, 5). The results show that using words as features gives the best effect. Table 1 compares the test results for each feature. For the preprocessing of the tweet text, each team adopted different methods; Table 2 describes the preprocessing differences between the runs (√ indicates that the method is used, × that it is not).

Note: HLJIT2017-IRMIDIS_1_task1_1 and HLJIT2017-IRMIDIS_1_task1_2 differ in their stop-word lists. Run IDs are abbreviated as follows:
(1) HLJIT2017-IRMIDIS_1_task1_1 = 1_1
(2) HLJIT2017-IRMIDIS_1_task1_2 = 1_2
(3) HLJIT2017-IRMIDIS_1_task1_3 = 1_3
(4) HLJIT2017-IRMIDIS_1_task2_1 = 2_1
(5) HLJIT2017-IRMIDIS_1_task2_2 = 2_2
(6) HLJIT2017-IRMIDIS_1_task2_3 = 2_3

Table 1: Results of the various features

Feature   Need (Pre / Re)    Availability (Pre / Re)
2-gram    0.281 / 0.245      0.477 / 0.474
3-gram    0.482 / 0.254      0.671 / 0.597
4-gram    0.428 / 0.245      0.641 / 0.606
5-gram    0.385 / 0.200      0.604 / 0.556
word      0.517 / 0.281      0.709 / 0.644

Table 2: Preprocessing of tweet text

ID                   1_1  1_2  1_3  2_1  2_2  2_3
Remove stop word     √    √    ×    √    √    ×
Remove @username     √    √    √    √    √    √
Remove punctuation   √    √    √    √    √    √
Remove URL           √    √    √    √    √    √

3. Task 2

3.1 Problem Description

Task 2 requires matching the need-tweets identified in Task 1 against the availability-tweets identified in Task 1. The need-tweets are used as a query set Q, and the availability-tweets as a collection of documents D. We use statistical language models to solve Task 2. Language models assess which word sequences are more typical according to language usage, and inject the corresponding bias into the system so that it prefers output word sequences with high probability under the language model. If a document language model gives the query a high probability, the query words must have high probability according to that document model, which further means that the query words frequently occur in the document.

3.2 Relevance

The relevance calculation can be expressed briefly as shown in Figure 1: the need-tweets (N) are used as the query set Q, the availability-tweets (A) as the document set D, and the relevance calculation yields the score R(Q, D).

Figure 1: Relevance calculation (N -> Q, A -> D, producing R(Q, D))

3.3 Language Model

We use the Indri open-source retrieval tool with the Dirichlet-smoothed language model. The formulas are as follows:

P_s(w|d) = (c(w; d) + mu * p(w|C)) / (sum_w c(w; d) + mu)

alpha_d = mu / (sum_w c(w; d) + mu)

We discount the probability of words that appear in the document and give extra probability mass to words that do not appear in it:

p(w|d) = P_s(w|d)          if w appears in d
         alpha_d * p(w|C)  otherwise

The query is then scored by

score(q, d) = log p(q|d)
            = sum_{i: c(q_i; d) > 0} log( P_s(q_i|d) / (alpha_d * p(q_i|C)) )
              + sum_i log p(q_i|C) + n * log alpha_d

Table 3 describes the parameter settings of the two tools used in this article.

Table 3: Parameter settings

SVM   svm_type=c_svc, kernel_type=rbf, gamma=0.1, nr_class=2, total_sv=8660, rho=0.313597
LR    studyRate=0.01, theta=0.05, iterNum=1000

4. Experimental Results

Table 4 shows the experimental results of Task 1. The average MAP of the LR run is higher than that of the two LibSVM runs. The availability-tweet and need-tweet MAP values of the two LibSVM runs contrast sharply: in each run one of the two is much higher than the other, which leads to a low average MAP. A preliminary analysis suggests that an improper selection of stop words caused this phenomenon.

Table 5 shows the experimental results of Task 2.
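Returning to the retrieval model of Section 3.3, the Dirichlet-smoothed query likelihood can be illustrated with a small self-contained sketch. The availability-tweets and the value of mu below are made up for illustration (Indri itself was used for the actual runs); the score is computed directly as the sum of log P_s(q_i|d), which is algebraically the same as the decomposed formula above.

```python
import math
from collections import Counter

# Hypothetical availability-tweets acting as the document collection D
docs = {
    "A1": "drinking water available at the relief camp",
    "A2": "food packets available near the railway station",
}

doc_tf = {d: Counter(text.split()) for d, text in docs.items()}   # c(w; d)
doc_len = {d: sum(tf.values()) for d, tf in doc_tf.items()}       # sum_w c(w; d)

# Collection language model p(w|C), estimated over all documents
coll_tf = Counter()
for tf in doc_tf.values():
    coll_tf.update(tf)
coll_len = sum(coll_tf.values())

MU = 2000.0  # Dirichlet prior; an illustrative, not tuned, value

def p_s(w, d):
    """Dirichlet-smoothed P_s(w|d) = (c(w;d) + mu*p(w|C)) / (|d| + mu)."""
    return (doc_tf[d][w] + MU * coll_tf[w] / coll_len) / (doc_len[d] + MU)

def score(query, d):
    """log p(q|d); query words unseen in the whole collection are skipped."""
    return sum(math.log(p_s(w, d)) for w in query.split() if coll_tf[w] > 0)

def rank(need_tweet):
    """Rank availability-tweets for a need-tweet query, best match first."""
    return sorted(docs, key=lambda d: score(need_tweet, d), reverse=True)
```

For example, a need-tweet mentioning drinking water ranks A1 above A2, because the extra count term c(w; d) raises P_s for words actually present in the document.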
The input of Task 2 is based on the output of Task 1. The precision values of the three runs were almost the same, but the recall of the first run was much lower than that of the other two. Probably too many results were submitted in Task 1, and there was a mistake in the threshold.

4.1 Data Set

At the start of the track, about 20,000 tweets were released (the training set), along with a sample of the need-tweets and availability-tweets among these 20K tweets. Later, a new set of 50,000 tweets was released (the test set).

4.2 Evaluation Metrics

Precision = number of correct messages extracted / number of extracted messages

Recall = number of correct messages extracted / number of messages in the sample

F-score = (beta^2 + 1) * P * R / (beta^2 * P + R)

where beta is a weighting parameter, P is the precision, and R is the recall.

MAP (mean average precision) is the mean, over the query set Q, of the average precision of each ranked result list:

MAP = (1 / |Q|) * sum_{q in Q} AP(q)

4.3 Parameter Setting

The parameter settings of the SVM and LR tools are listed in Table 3.

Table 4: Task 1 experimental results

ID                          1_1    1_2    1_3
Availability Precision@100  0.550  0.100  0.760
Availability Recall@1000    0.100  0.017  0.159
Availability MAP            0.760  0.001  0.112
Need Precision@100          0.060  0.630  0.610
Need Recall@1000            0.021  0.217  0.173
Need MAP                    0.001  0.140  0.128
Average MAP                 0.031  0.071  0.120

Table 5: Task 2 experimental results

ID    Precision@5  Recall  F-score
2_1   0.088        0.021   0.034
2_2   0.088        0.217   0.125
2_3   0.082        0.147   0.105

5. Conclusion

From the above experimental results, we found that the classification and ranking of informal text content depend heavily on word-level features. In future experiments, we will deepen our study of machine learning and try to select more and different features in order to better filter and rank informal text content.

Acknowledgments

This work is supported by the Philosophy and Social Science Planning Project of Heilongjiang Province, China (No. 16EDD05).

References

[1] M. Basu, S. Ghosh, K. Ghosh and M. Choudhury.
Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR.
[2] T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
[3] Qi Haoliang, Cheng Xiaolong, Yang Muyun, He Xiaoning, Li Sheng, Lei Guohua. High Performance Chinese Spam Filter. 2010, 24(2): 76-84.