=Paper=
{{Paper
|id=Vol-2036/T2-11
|storemode=property
|title=Use of the Pole-based Overlapping Clustering Method for Labeling Tweets @ IRMiDis FIRE 2017
|pdfUrl=https://ceur-ws.org/Vol-2036/T2-11.pdf
|volume=Vol-2036
|authors=Noushin Fadaei,Thomas Mandl
|dblpUrl=https://dblp.org/rec/conf/fire/FadaeiM17
}}
==Use of the Pole-based Overlapping Clustering Method for Labeling Tweets @ IRMiDis FIRE 2017==
<pdf width="1500px">https://ceur-ws.org/Vol-2036/T2-11.pdf</pdf>
<pre>
        Use of the Pole-based Overlapping Clustering Method for
                  Labeling Tweets @ IRMiDis FIRE 2017
                          Noushin Fadaei                                                        Thomas Mandl
                        Hildesheim University                                              Hildesheim University
                        Hildesheim, Germany                                                Hildesheim, Germany
                      fadaei@uni-hildesheim.de                                            mandl@uni-hildesheim.de
1   INTRODUCTION                                                          Consequently all objects which are not part of the poles are assigned
Real-time information in social media such as Twitter on major            to the clusters based on their similarity with the poles. Based on
events, for example major earthquakes or terrorist attacks, plays an      those similarities a ranking of the poles is generated for each object
important role in revealing critical needs resulted by the disaster       and it is always assigned to the most similar one (rank 1). In order
and the resources to address them; however this information is            to validate the assignment of the clusters on the other ranking po-
also influentially clouded by emotional posts. Finding the critical       sitions the following condition is required: The average similarity
information and finding their relevance to the resources would            between the object and the previous as well as the following pole
provide a valuable platform for further analysis.                         needs to be bigger or equal to the similarity of the object to the
   This working note covers the method we used to tackle the              current pole.
challenge exposed to discussion by FIRE 2017 track: Information           Following the hierarchical agglomerative clustering method “single-
Retrieval from Microblogs during Disasters (IRMiDis) [1]. The first       link” [4], the hierarchical clusters are being built until all objects
task is to identify tweets regarding the required resources such as       fill in one single cluster. The algorithm is shown in figure 1.
food, water or medical aids and the available or potential avail-
able resources like the already transported or distributed resources
among a dataset of over 50,000 tweets regarding the Nepal-India           2.2    Expert Knowledge as Relevance Feedback
earthquake in 2015.
                                                                          PoBOC builds a graph on top of the similarity matrices of the given
   A tweet can represent the needs for resources as well as the
                                                                          objects and tries to find out the strongly connected graphs namely
availability of some others at specific locations; therefore it would
                                                                          poles; however if we consider these poles are already identified by
be reasonable to use a soft clustering algorithm which shows the
                                                                          the experts (similar to relevance feedback methods), our method
relevance degree of one tweet to each of the classes. Pole-based
                                                                          tries to find the objects that are close enough to each pole. One
overlapping clustering algorithm utilizes this notion to not only
                                                                          alternative solution is to cluster the entire data and consider the
detect the proper class for each object but also let the objects share
                                                                          level of hierarchy that covers three cluster (as need class, availabil-
various classes in case they are similar enough to each class.
                                                                          ity class and class of irrelevant tweets). Although PoBOC supports
                                                                          overlapping clusters, it doesn’t explain how to organize a soft clus-
2 METHODOLOGY                                                             tering over two poles. An object belongs to the closest pole. In case
2.1 Pole-based Overlapping Clustering                                     an object is similar to both of the classes evenly, we considered
    (PoBOC)                                                               it a member of both. The similarity of each object and a pole is
                                                                          calculated as follows:
The algorithm starts off with a similarity matrix X × X containing
the similarity of each object with all the other objects. Based on this
matrix a similarity graph is built, where an edge exists between two                                           1   Õ
                                                                                            s(x i , Pm ) =           s(x i , x j )           (1)
objects, if the similarity of these two is greater than the average                                          |Pm |
                                                                                                                  x j ∈Pm
similarities of each of the two individual objects and the whole set
of objects. [2]
In the next step the poles representing the individual medoids of           where x i is an object and Pm is the mth pole.
the clusters are built. For this purpose the cliques within the sim-
ilarity graph are obtained. These cliques form subgraphs where               We have used Weka [3] for the pre-processing. We removed the
each node/vertex is linked to all other nodes/vertices, starting off      links and used Snowball stop word list in addition to some few
with a node with at least the degree of one, with the lowest average      frequent words in the data collection namely earthquake, Nepal
similarity to all other objects. The starting nodes for the next poles    and death. As tweets are quite short the minimum term frequency
are determined in order to minimize the similarity with previous          of a word is not of a matter and for weighting tfidf is used to select
poles. The termination criterion for this process is again based on       the first 500 important words. These terms formed Weka instances
the average similarity: if the average similarity of the starting node    which are taken as objects in PoBOC algorithm.
for the new pole with the already built poles is smaller than the
average similarity of that node with all other objects, then the pro-         Developing this fuzzy algorithm, Euclidean distance is used as a
cess is ended and the number of clusters based on the previously          similarity measure and the closer an object is to a pole, the higher
found poles is determined. [2]                                            it is ranked.
                                                                                           during Disasters (IRMiDis). In Working notes of FIRE 2017 - Forum for Information
                                                                                           Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
                                                                                       [2] Guillaume Cleuziou, Lionel Martin, and Christel Vrain. 2004. PoBOC: an Overlap-
                                                                                           ping Clustering Algorithm. Application to Rule-Based Classification and Textual
                                                                                           Data. In In R. Lopez de Mantaras and L. Saitta, IOS Press, editor, Proceedings of the
                                                                                           16th European Conference on Artificial Intelligence. 440–444.
                                                                                       [3] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
                                                                                           and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD
                                                                                           Explor. Newsl. 11, 1 (Nov. 2009), 10–18. https://doi.org/10.1145/1656274.1656278
                                                                                       [4] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data Clustering: A Review. ACM
                                                                                           Comput. Surv. 31, 3 (Sept. 1999), 264–323. https://doi.org/10.1145/331499.331504


 Figure 1: PoBOC: an overlapping clustering algorithm [2]

Table 1: Sub-task 1- Identifying need-tweets and availability-
tweets

    Submission Detail     Availability-Tweets Evaluation      Need-Tweets Evaluation
Run ID         RunType Precision@100Recall@1000 MAP Precision@100Recall@1000 MAP Average MAP
Iwist_task1_1Automatic       0.0300       0.0194   0.0165    0.0400       0.1639   0.0417   0.0291

          Source: Results of our methodology by IRMiDis - FIRE2017 [1]


3    RESULTS
The results of our method regarding task 1 are shown in Table
1. According to the evaluations of the organization committee, it
ranked 13t h. The lack of lexical databases and semantic analysis in
the pre-processing phase can explain the low performance of this
method. Considering the amount of emotional stop words using
sentiment analysis can also improve the representation of the vector
objects.
   In our method, we took the given training data as the experts
feedback for which there are 674 tweets relevant to the availability
class and less than one third of it namely 201 tweets relevant to the
need class. Here the presumption is that the relevant tweets form a
cluster meaning they have more similar terms overlap, however the
relatively low recalls show the distribution of the relevant tweets
does not match our assumption. Yet the recall of the need class is
much better than the one of the availability class, and since we
have less relevant tweets for the need class, it suggests the variety
of the wordings regarding each class if not the variety of experts
perspectives.

4    FUTURE WORK
Due to working on this task within a short time the semantic aspects
of the training data and the co-appearance of the word were ignored.
Also the relevance feedback methods can play an important role in
case we look at the problem as an information retrieval task where
training data can be used to produce relevant queries.

REFERENCES
[1] Moumita Basu, Saptarshi Ghosh, Kripabandhu Ghosh, and Monojit Choudhury.
    2017. Overview of the FIRE 2017 track: Information Retrieval from Microblogs
                                                                                   2

</pre>