IR-based k-Nearest Neighbor Approach for Identifying Abnormal Chat Users
Notebook for PAN at CLEF 2012

In-Su Kang 1, Chul-Kyu Kim 1, Shin-Jae Kang 2, Seung-Hoon Na 3
1 Kyungsung University, South Korea
2 Daegu University, South Korea
3 Electronics and Telecommunications Research Institute, South Korea
dbaisk@ks.ac.kr, ckkim@ks.ac.kr, sjkang@daegu.ac.kr, seunghoonna@gmail.com

Abstract. This paper addresses the task of automatically identifying abnormal chat users, where the training data is given as a collection of chat messages from both abnormal and normal users. We employ a k-NN classification based on an IR technique. A document is constructed per conversation for each user by concatenating his/her messages in that conversation. A query is constructed for a new user in the same way. A k-NN classification is then performed using the top documents retrieved in response to the query.

1 Introduction

A chat user has his/her own intended goals when taking part in chatting with others. This paper addresses the task of identifying such goals of a chat user. Our assumption is that strong clues for inferring these intended goals commonly appear in the chat messages of similar users. Based on this assumption, we represent a chat user as a document comprising his/her chat messages in a specific conversation, and then identify chat users with abnormal goals by finding chat-user documents with similar goals. We employ an information retrieval (IR) technique to discern such documents. In the training step, we prepare an IR system by indexing a collection of chat logs in which each chat user's goal is marked as either 'abnormal' or not. Given an unseen chat user, his/her chat messages are collected to formulate a query to be submitted to the IR system, and the chat goal is automatically classified by applying a k-NN approach to the top-retrieved documents.

2 Method

A chat conversation can be viewed as a set of one or more user documents, each of which consists of the sentences from a particular user in that conversation. The training conversations are thus converted into a collection of user documents, which is indexed using an information retrieval (IR) system. Given a test conversation, it is similarly divided into documents {q_i}, each of which is then submitted as a query to the IR system to retrieve the set R = {d_1, ..., d_k} of its k most highly related training documents. For each q_i, the following k-nearest neighbor classifier (Tan, 2005) is then used to determine whether q_i is uttered by a sexual predator (SP) or not:

c^* = \arg\max_{c \in \{Y, N\}} \sum_{d \in R} \mathrm{sim}(q_i, d) \, \delta(d, c)

\delta(d, c) = \begin{cases} 1 & \text{if } d \in c \\ 0 & \text{if } d \notin c \end{cases}

where Y and N denote the SP class and the non-SP class, respectively, and \mathrm{sim}(\cdot, \cdot) is the query-document similarity score returned by the IR system.
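For concreteness, the following is a minimal Python sketch, not the authors' actual implementation, of how per-user documents can be built from a conversation and how index terms can be extracted. The function names, the whitespace tokenization, and the reading of the rare-term cutoff (frequency below 3, as reported in Section 3) as a collection-frequency threshold are all illustrative assumptions.

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def build_user_documents(conversation: List[Tuple[str, str]]) -> Dict[str, str]:
    # conversation: (user_id, message) pairs in utterance order.
    # One document per user: the concatenation of that user's messages.
    per_user = defaultdict(list)
    for user_id, message in conversation:
        per_user[user_id].append(message)
    return {user_id: " ".join(messages) for user_id, messages in per_user.items()}

def ngram_terms(text: str) -> List[str]:
    # 1-gram and 2-gram index terms over a plain whitespace tokenization (assumed).
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def prune_rare_terms(documents_terms: List[List[str]], min_freq: int = 3) -> List[List[str]]:
    # Drop rare terms whose frequency in the whole collection is below min_freq
    # (interpreting the cutoff as collection frequency is an assumption).
    freq = Counter(term for terms in documents_terms for term in terms)
    return [[t for t in terms if freq[t] >= min_freq] for terms in documents_terms]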
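The decision rule above can be realized as the following minimal sketch in Python. The retrieve callable stands in for any IR backend (the experiments below use Apache Lucene) and is an assumed interface, as are the function and parameter names.

from typing import Callable, Dict, List, Tuple

def knn_classify(query_doc: str,
                 retrieve: Callable[[str, int], List[Tuple[str, float]]],  # assumed IR interface
                 labels: Dict[str, str],  # training document id -> "Y" (SP) or "N" (non-SP)
                 k: int = 31) -> str:
    # Similarity-weighted vote over the top-k retrieved training documents:
    # each retrieved document d adds sim(q_i, d) to the score of its class.
    votes = {"Y": 0.0, "N": 0.0}
    for doc_id, score in retrieve(query_doc, k):
        votes[labels[doc_id]] += score
    return max(votes, key=votes.get)  # argmax over {Y, N}

In this sketch a tie or an empty retrieval result simply falls back to the first class in the dictionary; a fallback to the majority class would be a natural alternative.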
3 Evaluation Results and Discussion

To evaluate the performance of our IR-based k-NN classifier and to find the best value of the parameter k, the number of top-retrieved documents, 5-fold cross-validation was performed on the training set. Apache Lucene (http://lucene.apache.org/core/) was employed as the IR system. Without stop-word removal or stemming, all 1-gram and 2-gram terms were used as index terms, where rare terms with frequencies less than 3 were removed. For retrieval, Lucene's default retrieval formula was used.

Figure 1. 5-fold cross-validation for training data.

Figure 1 shows the cross-validation performance on the training data for different k values. Based on the result shown in Figure 1, we chose the k values 31 and 41 to submit two official runs. The better of our two official runs was reported to achieve 0.2140 precision, 0.7960 recall, and 0.3373 F1. However, we found that roughly half of the test set had been missed when preparing the official run submissions. We have therefore fixed the error and repeated the same experiment; Figure 2 presents the revised result. As Figures 1 and 2 show, the best k values for the k-NN classifier differ considerably between the training and test data, and this is the main reason for our poor performance in this year's SPI task. Using a robust value for k was indeed important in our approach: when k was set to the value optimal for each data set, the proposed method achieved more than 60% F1 on the training data and more than 70% F1 on the test data.

Overall, our current use of the k-NN classifier was not very successful in obtaining good performance. We believe this is because our approach is not yet mature and many further explorations remain. In future work, we will further examine the effect of using document similarity for this task, focusing on finding a robust range for k, using more advanced IR similarity functions, and so on.

Figure 2. Performance for test data.

References

1. Songbo Tan. (2005). Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4): 667-671.