A Hybrid Model For Information Retrieval From Microblogs During Disaster

Du Xin
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
duxin111@outlook.com

Wang Xiaoyu
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
wongxiaoyu1946@gmail.com

Zhuang Ziyao
Faculty of Science, Agriculture and Engineering, University of Newcastle upon Tyne, UK
zhuangziyao1@outlook.com

Qi Limin
School of Physical Sciences, Harbin Normal University, Harbin, China
qilimin111@outlook.com

Abstract

When a disaster occurs, social network sites such as Twitter are increasingly used to help direct rescue operations. This article describes the methods we used in FIRE 2017. We treated the identification of need-tweets and availability-tweets as a classification task, using logistic regression and a Support Vector Machine to decide the type of each tweet. For matching needs with availabilities, we treated the problem as an information retrieval task and used a retrieval model to complete it.

KEYWORDS

information retrieval, microblog, identifying need-tweets and availability-tweets, matching need-tweets and availability-tweets

1. Introduction

Users now publish real-time content on social networking sites, which have become one of the most extensive channels for information sharing and one of the fastest ways to spread information. Published material can contain specific locations and specific needs; when a disaster strikes, it can provide accurate rescue information that guides rescue work efficiently.

For Task 1, we need to distinguish need-tweets, availability-tweets, and others. We treat this as a classification problem and use classifiers to separate the useful need-tweets and availability-tweets from other tweets, such as useless tweets and repeatedly forwarded tweets.

For Task 2, using the tweets identified in Task 1, we perform the matching task: need-tweets must be matched with availability-tweets, and a need-tweet and its matching availability-tweet can even be in different languages. We treat this as an information retrieval problem: for each need-tweet in the query collection, we retrieve the best-matching availability-tweets.

2. Task 1

2.1 Problem Description

In Task 1, we judge whether a text is a need-tweet, an availability-tweet, or other, using the correlation between text and category. Viewed as a classification problem, classifying text content involves representing the text, preprocessing it, extracting features, and selecting and training a classifier. We use the training data (x1, y1), (x2, y2), ..., (xN, yN) to learn a classifier Y = f(x), so that the classification system can use the classifier to label new data. Here xi is the feature vector of a data point and yi is the classification output, with labels yi in {0, 1}.

2.2 Classifier Models

In this working note, two groups adopt the SVM, a binary model defined as the maximum-margin linear classifier in the feature space. Its learning strategy is margin maximization, and training reduces to a convex quadratic programming problem. We chose the RBF kernel, which maps the samples non-linearly into a higher-dimensional space and has relatively low numerical complexity. Both groups ultimately completed Task 1 with LibSVM.

The third group uses an LR (logistic regression) model. Every data point of the LR model has an influence on the classification plane, and this influence decreases with the point's distance to the plane; when the data dimension is high, the LR model is combined with parameter regularization.
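As an illustration of the Task 1 setup, a logistic-regression classifier over word features can be sketched as follows. This is a minimal toy sketch with made-up example tweets; it is not the LibSVM/LR tooling actually used for the submitted runs, and the learning rate and iteration count here are illustrative, not the submitted settings.

```python
import math

# Toy labeled tweets (hypothetical examples): 1 = need-tweet, 0 = availability-tweet
train = [
    ("need drinking water in shelter", 1),
    ("we need food packets urgently", 1),
    ("drinking water available at camp", 0),
    ("food packets available near station", 0),
]

# Word-feature vocabulary (Section 2.3 found plain words to be the best features)
vocab = sorted({w for text, _ in train for w in text.split()})

def features(text):
    """Binary bag-of-words feature vector over the vocabulary."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the logistic loss
weights = [0.0] * len(vocab)
bias = 0.0
learning_rate, iterations = 0.5, 1000  # illustrative values only
for _ in range(iterations):
    for text, y in train:
        x = features(text)
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = p - y
        weights = [w - learning_rate * err * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * err

def predict(text):
    """Return 1 (need-tweet) or 0 (availability-tweet)."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, features(text))) + bias)
    return 1 if p > 0.5 else 0
```

On this separable toy data the trained model recovers the labels; one convenience of LR noted below is that the probability score p is directly observable for each sample.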
At classification time, the computation of the LR model is minimal and prediction is very fast, so the storage requirements are low, and it is convenient to observe the probability scores of samples.

2.3 Feature Selection

For feature selection, we conducted experiments with words and with n-grams (n = 2, 3, 4, 5). The results show that using words as features gives the best effect. Table 1 compares the test results for each feature. For the preprocessing of the tweet text, each team adopted different methods; Table 2 describes the preprocessing differences between the runs (√ indicates that the method is used, × that it is not).

Note: HLJIT2017-IRMIDIS_1_task1_1 and HLJIT2017-IRMIDIS_1_task1_2 differ in their stop-word lists. Run IDs are abbreviated as follows:
(1) HLJIT2017-IRMIDIS_1_task1_1 = 1_1
(2) HLJIT2017-IRMIDIS_1_task1_2 = 1_2
(3) HLJIT2017-IRMIDIS_1_task1_3 = 1_3
(4) HLJIT2017-IRMIDIS_1_task2_1 = 2_1
(5) HLJIT2017-IRMIDIS_1_task2_2 = 2_2
(6) HLJIT2017-IRMIDIS_1_task2_3 = 2_3

Table 1: Results of the various features

Feature   Need (Pre / Re)    Availability (Pre / Re)
2-gram    0.281 / 0.245      0.477 / 0.474
3-gram    0.482 / 0.254      0.671 / 0.597
4-gram    0.428 / 0.245      0.641 / 0.606
5-gram    0.385 / 0.200      0.604 / 0.556
word      0.517 / 0.281      0.709 / 0.644

Table 2: Preprocessing of tweet text

ID                   1_1  1_2  1_3  2_1  2_2  2_3
Remove stop word     √    √    ×    √    √    ×
Remove @username     √    √    √    √    √    √
Remove punctuation   √    √    √    √    √    √
Remove URL           √    √    √    √    √    √

3. Task 2

3.1 Problem Description

Task 2 requires matching the need-tweets identified in Task 1 against the availability-tweets identified in Task 1. The need-tweets are used as a query set Q, and the availability-tweets as a collection of documents D. We use statistical language models to solve Task 2. Language models assess which word sequences are more typical according to language usage, and inject the corresponding bias into the system so that it prefers output word sequences with high probability under the language model. If a document language model gives the query a high probability, the query words must have high probability according to that document model, which further means that the query words frequently occur in the document.

3.2 Relevance

The relevance calculation can be expressed briefly as shown in Figure 1: the need-tweets (N) are used as the query set Q, the availability-tweets (A) as the document set D, and the relevance calculation yields the score R(Q, D).

Figure 1: Relevance calculation (N -> Q, A -> D, producing R(Q, D))

3.3 Language Model

We use the Indri open-source retrieval tool with the Dirichlet-smoothed language model. The formulas are as follows:

P_s(w|d) = (c(w; d) + mu * p(w|C)) / (sum_w c(w; d) + mu)

alpha_d = mu / (sum_w c(w; d) + mu)

We discount the probability of words that appear in the document and give extra probability mass to words that do not appear in it:

p(w|d) = P_s(w|d)          if w appears in d
         alpha_d * p(w|C)  otherwise

The query is then scored by

score(q, d) = log p(q|d)
            = sum_{i: c(q_i; d) > 0} log( P_s(q_i|d) / (alpha_d * p(q_i|C)) )
              + sum_i log p(q_i|C) + n * log alpha_d

Table 3 describes the parameter settings of the two tools used in this article.

Table 3: Parameter settings

SVM   svm_type=c_svc, kernel_type=rbf, gamma=0.1, nr_class=2, total_sv=8660, rho=0.313597
LR    studyRate=0.01, theta=0.05, iterNum=1000

4. Experimental Results

Table 4 shows the experimental results of Task 1. The average MAP of the LR run is higher than that of the two LibSVM runs. The availability-tweet and need-tweet MAP values of the two LibSVM runs contrast sharply: in each run one of the two is much higher than the other, which leads to a low average MAP. A preliminary analysis suggests that an improper selection of stop words caused this phenomenon.

Table 5 shows the experimental results of Task 2.
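Returning to the retrieval model of Section 3.3, the Dirichlet-smoothed query likelihood can be illustrated with a small self-contained sketch. The availability-tweets and the value of mu below are made up for illustration (Indri itself was used for the actual runs); the score is computed directly as the sum of log P_s(q_i|d), which is algebraically the same as the decomposed formula above.

```python
import math
from collections import Counter

# Hypothetical availability-tweets acting as the document collection D
docs = {
    "A1": "drinking water available at the relief camp",
    "A2": "food packets available near the railway station",
}

doc_tf = {d: Counter(text.split()) for d, text in docs.items()}   # c(w; d)
doc_len = {d: sum(tf.values()) for d, tf in doc_tf.items()}       # sum_w c(w; d)

# Collection language model p(w|C), estimated over all documents
coll_tf = Counter()
for tf in doc_tf.values():
    coll_tf.update(tf)
coll_len = sum(coll_tf.values())

MU = 2000.0  # Dirichlet prior; an illustrative, not tuned, value

def p_s(w, d):
    """Dirichlet-smoothed P_s(w|d) = (c(w;d) + mu*p(w|C)) / (|d| + mu)."""
    return (doc_tf[d][w] + MU * coll_tf[w] / coll_len) / (doc_len[d] + MU)

def score(query, d):
    """log p(q|d); query words unseen in the whole collection are skipped."""
    return sum(math.log(p_s(w, d)) for w in query.split() if coll_tf[w] > 0)

def rank(need_tweet):
    """Rank availability-tweets for a need-tweet query, best match first."""
    return sorted(docs, key=lambda d: score(need_tweet, d), reverse=True)
```

For example, a need-tweet mentioning drinking water ranks A1 above A2, because the extra count term c(w; d) raises P_s for words actually present in the document.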
The input of Task 2 is based on the output of Task 1. The precision values of the three runs were almost the same, but the recall of the first run was much lower than that of the other two. Probably too many results were submitted in Task 1, and there was a mistake in the threshold.

4.1 Data Set

At the start of the track, about 20,000 tweets were released (the training set), along with a sample of the need-tweets and availability-tweets among these 20K tweets. Later, a new set of 50,000 tweets was released (the test set).

4.2 Evaluation Metrics

Precision = number of correct messages extracted / number of extracted messages

Recall = number of correct messages extracted / number of messages in the sample

F-score = (beta^2 + 1) * P * R / (beta^2 * P + R)

where beta is a weighting parameter, P is the precision, and R is the recall.

MAP (mean average precision) is the mean, over the query set Q, of the average precision of each ranked result list:

MAP = (1 / |Q|) * sum_{q in Q} AP(q)

4.3 Parameter Setting

The parameter settings of the SVM and LR tools are listed in Table 3.

Table 4: Task 1 experimental results

ID                          1_1    1_2    1_3
Availability Precision@100  0.550  0.100  0.760
Availability Recall@1000    0.100  0.017  0.159
Availability MAP            0.760  0.001  0.112
Need Precision@100          0.060  0.630  0.610
Need Recall@1000            0.021  0.217  0.173
Need MAP                    0.001  0.140  0.128
Average MAP                 0.031  0.071  0.120

Table 5: Task 2 experimental results

ID    Precision@5  Recall  F-score
2_1   0.088        0.021   0.034
2_2   0.088        0.217   0.125
2_3   0.082        0.147   0.105

5. Conclusion

From the above experimental results, we found that the classification and ranking of informal text content depend heavily on word-level features. In future experiments, we will deepen our study of machine learning and try to select more and different features in order to better filter and rank informal text content.

Acknowledgments

This work is supported by the Philosophy and Social Science Planning Project of Heilongjiang Province, China (No. 16EDD05).

References

[1] M. Basu, S. Ghosh, K. Ghosh and M. Choudhury.
Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR.
[2] T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
[3] Qi Haoliang, Cheng Xiaolong, Yang Muyun, He Xiaoning, Li Sheng, Lei Guohua. High Performance Chinese Spam Filter. 2010, 24(2): 76-84.