=Paper= {{Paper |id=Vol-1150/burnside |storemode=property |title=One Day in Twitter: Topic Detection Via Joint Complexity |pdfUrl=https://ceur-ws.org/Vol-1150/burnside.pdf |volume=Vol-1150 |dblpUrl=https://dblp.org/rec/conf/www/BurnsideMJ14 }} ==One Day in Twitter: Topic Detection Via Joint Complexity== https://ceur-ws.org/Vol-1150/burnside.pdf
One Day in Twitter: Topic Detection Via Joint Complexity

Gérard Burnside, Dimitris Milioris, Philippe Jacquet
Bell Labs, Alcatel-Lucent, Centre de Villarceaux, 91620 Nozay, France
{gerard.burnside, dimitrios.milioris, philippe.jacquet}@alcatel-lucent.com



Abstract

In this paper we introduce a novel method to perform topic detection in Twitter, based on the recent technique of Joint Complexity. Instead of relying on words, as most existing methods based on bag-of-words or n-gram techniques do, Joint Complexity relies on String Complexity, which is defined as the cardinality of the set of all distinct factors (subsequences of characters) of a given string. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree, and by overlapping two trees, in linear or sublinear average time, we obtain the Joint Complexity, defined as the cardinality of the factors that are common to both trees. The method has been extensively tested for Markov sources of any order over a finite alphabet and gave good approximations for text generation and language discrimination. One key take-away of this approach is that it is language-agnostic, since we can detect similarities between two texts in any loosely character-based language; there is therefore no need to build any specific dictionary or stemming method. The proposed method can also be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In this paper we exploit a dataset collected by using the Twitter streaming API for one full day, and we extract a significant number of topics for every timeslot.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1 Introduction

During recent years online social media services have seen a huge expansion, and as a result the value of information within these networks has increased dramatically. Interactions and communication between users help predict the evolution of information in online social networks, and the ability to study these networks can provide relevant information in real time.
The study of social networks poses several research challenges:

• searching in blogs, tweets and other social media is still an open problem, since posts are individually small in size but there is a tremendous quantity of them delivered in real time, with little and sometimes transient contextual information. Real-time search has to balance quality, authority, relevance and timeliness of the content;

• information on the correlation between groups of users can help predict media consumption, network resources and traffic. This can be used to improve the quality of service and experience;

• analyzing the relationships between members of a group or a community within a social network can reveal the important teams, which companies can then use in their marketing plans;

• spam and advertisement detection is another important challenge, since with the growth of social networks we also have a continuously growing amount of irrelevant information over the network.

In order to address these challenges, we need to extract the relevant information from online social media in real time.
In this paper we use the theory of Joint Complexity to detect bursty trends among short pieces of information. Our method is simple, context-free, with no grammar and no language assumptions, and it does not use semantics.
The Suffix Tree contains the algorithmic signature of the text, and we use it to perform fast and massive trend sensing in social communities, without human intervention.
In a prior work, we proposed a method based on Joint Complexity along with pattern matching on URLs extracted from tweets. That method was compared with a version of a Document-Pivot method described in [APM+13], which was modified in order to perform classification of tweets, i.e. assigning tweets to categories such as Sports, Politics, Economics, etc. The results showed a good performance of the method compared to Document-Pivot and motivated us to use a more simplified version of it, based on Joint Complexity, for the Snow Data Challenge 2014.
The paper is organized as follows: Section 2 introduces the Joint Complexity method, while Section 3 briefly discusses the Snow Data Challenge dataset. Section 4 describes the implementation of the proposed method for topic detection in Twitter, while Section 5 evaluates its performance on real data obtained from the mentioned dataset. Finally, Section 6 summarizes our main results and provides directions for future work.

2 Joint Complexity

In the last decades, several attempts have been made to mathematically capture the concept of complexity of a sequence, i.e. the number of distinct factors contained in the sequence. In other words, if X is a sequence and I(X) its set of factors, then |I(X)| is the complexity of the sequence. For example, if X = "apple" then I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, v} and |I(X)| = 15 (v denotes the empty string). The complexity of a string is also called the I-complexity [BH11]. The notion is connected with deep mathematical properties, including the rather elusive concept of randomness in a string [IYZ02, LV93, Nie99].
In general, the information contained in a string may be revealed by comparing it with a reference string, since its entropy or I-complexity alone does not give much insight. In a previous work [Jac07] we introduced the concept of Joint Complexity, or J-complexity, of two strings. The J-complexity is the number of common distinct factors of two sequences; in other words, the J-complexity of sequences X and Y is J(X, Y) = |I(X) ∩ I(Y)|.
The J-complexity is an efficient way to estimate the similarity degree of two sequences. Two texts written in similar languages have more factors in common than texts written in very different languages. Furthermore, texts in the same language but on different topics have a smaller Joint Complexity than texts on the same topic.
The decomposition of a sequence into its factors is done with Suffix Trees, which provide a simple, fast and low-complexity way to store and recall them from memory, especially for short sequences. Suffix Trees are frequently used in DNA sequence analysis. The construction of the Suffix Trees of the words apple and maple, as well as the comparison between them, is shown in Fig. 1.

[Figure 1: Suffix Tree of the words apple and maple. JC(apple, maple) = 9.]

In our previous work [Jac07] it is proved that the J-complexity of two texts of size m built from two different binary memoryless sources grows like

    γ m^κ / √(α log m),    (1)

for some κ < 1 and γ, α > 0 which depend on the parameters of the two sources. When the sources are identical, the J-complexity growth is O(m), hence κ = 1. When the texts are identical (i.e. X = Y), the J-complexity coincides with the I-complexity and grows as m²/2 [JLS04]. Indeed, the presence of a common factor of length O(m) inflates the J-complexity by a term O(m²).
We should point out that experiments show that the complexity estimated as above for memoryless sources converges very slowly. Furthermore, memoryless sources are not appropriate for modeling text generation. In a prior work, we extended the J-complexity estimate to Markov sources of any order on a finite alphabet. Markov models are more realistic and give a better approximation of text generation, e.g. for DNA sequences, than memoryless sources [JMS13, MJ14].
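The definition above can be made concrete with a small sketch of ours (not the paper's implementation: the paper builds Suffix Trees for linear-time behaviour, while this brute-force version enumerates factor sets in quadratic time purely to make the counting explicit; like the apple example, it counts the empty string as a factor):

```java
import java.util.HashSet;
import java.util.Set;

public class JointComplexity {

    // I(x): the set of all distinct factors (substrings) of x,
    // including the empty string v, as in the definition of I-complexity.
    static Set<String> factors(String x) {
        Set<String> f = new HashSet<>();
        f.add(""); // the empty string v
        for (int i = 0; i < x.length(); i++)
            for (int j = i + 1; j <= x.length(); j++)
                f.add(x.substring(i, j));
        return f;
    }

    // J(x, y) = |I(x) ∩ I(y)|
    static int jc(String x, String y) {
        Set<String> common = new HashSet<>(factors(x));
        common.retainAll(factors(y));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(factors("apple").size()); // |I(apple)| = 15
        // common factors: v, a, p, l, e, ap, pl, le, ple
        System.out.println(jc("apple", "maple"));    // 9
    }
}
```

Running it reproduces |I(apple)| = 15 and JC(apple, maple) = 9 from Fig. 1.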
Joint string complexity has a variety of applications related to detecting the similarity degree of two sequences, for example the detection of "copy-paste" in texts or documents, or the identification of the author and era of a text. We want to show here that it can also be used for the analysis and classification of social networks (e.g. tweets, which are limited to 140 characters). Therefore it may be a pertinent tool for automated monitoring of social networks. Real-time search in blogs, tweets and other social media has to balance quality and relevance of the content, which, due to the very short size of the texts and their huge number, is still an unsolved problem.
In a prior work [JMS13, MJ14] we derived a second-order asymptotics for the J-complexity, of the form

    γ m^κ / √(α log m + β),    (2)

for some β > 0. This new estimate converges more quickly and usually already works for texts of order m ≈ 10². In fact, for some Markov sources our analysis indicates that the J-complexity oscillates with m. This is captured by the introduction of a periodic function Q(log m) in the leading term of our asymptotics. This additional term further improves the convergence for small values of m.
In view of these facts, we can use the J-complexity to discriminate whether two Markov sources are identical or not [Ziv88]. We introduce the discriminant function

    d(X, Y) = 1 − (1 / log m) log J(X, Y),    (3)

for two sequences X and Y of length m. This discriminant allows us to determine whether X and Y are generated by the same Markov source or not, by verifying whether

    d(X, Y) = O(1 / log m) → 0, or
    d(X, Y) = 1 − κ + O(log log m / log m) > 0,

respectively, when the length of X and Y is equal to m.
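The discriminant of Eq. (3) is then a one-line computation on top of J(X, Y). A minimal sketch under our own naming, with brute-force factor counting standing in for Suffix Trees; the empty string is counted as a common factor so that J ≥ 1 and the logarithm is always defined:

```java
import java.util.HashSet;
import java.util.Set;

public class Discriminant {

    // I(s): all distinct factors of s, including the empty string.
    static Set<String> factors(String s) {
        Set<String> f = new HashSet<>();
        f.add("");
        for (int i = 0; i < s.length(); i++)
            for (int j = i + 1; j <= s.length(); j++)
                f.add(s.substring(i, j));
        return f;
    }

    // J(x, y) = |I(x) ∩ I(y)|, by brute-force enumeration.
    static int jc(String x, String y) {
        Set<String> common = factors(x);
        common.retainAll(factors(y));
        return common.size();
    }

    // d(X, Y) = 1 - log J(X, Y) / log m  (Eq. 3), for |X| = |Y| = m.
    static double d(String x, String y) {
        int m = x.length();
        return 1.0 - Math.log(jc(x, y)) / Math.log(m);
    }

    public static void main(String[] args) {
        // No common non-empty factor: J = 1, so d = 1 (maximally dissimilar).
        System.out.println(d("aaaa", "bbbb")); // 1.0
        // Identical strings: J is large, so d is small (here negative).
        System.out.println(d("abab", "abab"));
    }
}
```

Dissimilar sources push d toward 1 − κ > 0, while texts from the same source drive it toward 0, which is the decision rule stated above.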
In [JMS13, MJ14] we show experimental evidence of the usage of our discriminant function on real and simulated texts of different Markov orders. We compare the J-complexity of a simulated English text with texts of the same length generated in French, Polish, Greek and Finnish by a third-order Markov model, against our theoretical results. Even for texts shorter than a thousand characters, one can easily discriminate between these languages. Therefore, Joint Complexity is an efficient method to capture the similarity degree of short texts.

3 Snow Data Challenge Dataset

For the Social Sensor dataset of the Snow Data Challenge, we collected tweets for 24 hours, between Tuesday Feb. 25, 18:00 and Wednesday Feb. 26, 18:00 (GMT). The crawl yielded over 1,041,062 tweets between the Unix timestamps 1393351200000 and 1393437600000, and was constructed through the use of the Twitter Streaming API, by following 556,295 users and also looking for four specific keywords: Syria, terror, Ukraine and bitcoin. The dataset used for the experiments consists of 96 timeslots, where each timeslot contains the tweets of a 15-minute interval, starting at 18:00 on Tuesday Feb. 25, 2014. The challenge then consisted in providing a minimum of one and a maximum of ten different topics per timeslot, along with a headline, a set of keywords and the URL of a relevant image for each detected topic. The test dataset activity and the statistics of the dataset crawl are described more extensively in [PCA].

4 Topic Detection Method

Until now, the main methods used for text classification have been based on keyword detection and Machine Learning techniques. Using keywords in tweets has several drawbacks: words may be misspelled or used in distorted ways, lists of stop-words must be built for every language, and tweets often contain implicit references to previous texts or messages. Machine Learning techniques are generally heavy and complex, and therefore may not be good candidates for real-time text processing, especially in the case of Twitter, where we face natural language and thousands of tweets per second. Furthermore, Machine Learning processes have to be manually initiated by tuning parameters, which is a major drawback for this kind of application, where we want minimal, if any, human intervention. Other methods use information extracted by visiting the URLs contained in the text, which makes them heavyweight, since one may have limited or no access to that information, e.g. because of access rights or data size/rate.
In our method we analyze the messages (tweets) with Suffix Trees [THP04] and use the Joint Complexity as a metric to quantify the similarity between these messages.
For the dataset described in Section 3 and in [PCA] we have N = 96 timeslots, indexed n = 1...N. For every tweet t_i^n, where i = 1...M, with M being the total number of tweets in the n-th timeslot, we build a Suffix Tree, ST(t_i^n), as described in Section 2. Building a Suffix Tree is an operation that costs O(m log m) time and O(m) space in memory, where m is the length of the tweet.
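The per-tweet structure can be sketched with a naive suffix trie (a hypothetical class of ours: the construction below is O(m²), not the near-linear construction used in the paper, but it illustrates why the tree encodes the factor set, since each non-root node corresponds to exactly one distinct non-empty factor):

```java
import java.util.HashMap;
import java.util.Map;

public class SuffixTrie {
    final Map<Character, SuffixTrie> children = new HashMap<>();

    // Insert every suffix of the text character by character:
    // a naive O(m^2) construction, for illustration only.
    static SuffixTrie build(String text) {
        SuffixTrie root = new SuffixTrie();
        for (int i = 0; i < text.length(); i++) {
            SuffixTrie node = root;
            for (int j = i; j < text.length(); j++)
                node = node.children.computeIfAbsent(
                        text.charAt(j), c -> new SuffixTrie());
        }
        return root;
    }

    // Number of nodes below this one; every non-root node
    // represents one distinct non-empty factor of the text.
    int countNodes() {
        int n = 0;
        for (SuffixTrie child : children.values())
            n += 1 + child.countNodes();
        return n;
    }

    public static void main(String[] args) {
        SuffixTrie t = build("apple");
        // 14 distinct non-empty factors, 15 with the empty string:
        System.out.println(t.countNodes() + 1);
    }
}
```

Superposing two such trees and counting the shared nodes yields the Joint Complexity of the two texts.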
Then we compute the Joint Complexity metric JC(t_i^n, t_j^n) of the tweet t_i^n with every other tweet t_j^n of the n-th timeslot, where j = 1...M and j ≠ i (by convention we set JC(t_i^n, t_i^n) = 0). The Joint Complexity between two tweets is the number of their common factors, as defined in language theory, and can be computed efficiently in O(m) operations (sublinear on average) by Suffix Tree superposition. For the N timeslots we store the results of the computation in the matrices T_1, T_2, ..., T_N of dimension M × M.
We represent each timeslot by a fully-connected weighted graph. Each tweet is a node in the graph and the two-dimensional array T_n holds the weight of each edge, as shown in Fig. 2. We then calculate the score of each node in the graph by summing the weights of all the edges connected to it. The node with the highest score corresponds to the most representative and central tweet of the timeslot.

[Figure 2: Representation of the first row of the n-th timeslot via a weighted graph.]

          [ 0               JC(t_1^n,t_2^n)  JC(t_1^n,t_3^n)  ...  JC(t_1^n,t_M^n) ]
          [ JC(t_2^n,t_1^n)  0               JC(t_2^n,t_3^n)  ...  JC(t_2^n,t_M^n) ]
    T_n = [ JC(t_3^n,t_1^n)  JC(t_3^n,t_2^n)  0               ...  JC(t_3^n,t_M^n) ]
          [ ...              ...                              ...  ...             ]
          [ JC(t_M^n,t_1^n)  JC(t_M^n,t_2^n)  JC(t_M^n,t_3^n) ...  0               ]

Most of the timeslots have M ≈ 5,000 tweets, so the matrices T_1, T_2, ..., T_N have approximately 25,000,000 entries each. Since they are symmetric, only half of these entries need to be stored, i.e. the upper (or lower) triangular part of each matrix, as shown below, which saves space and makes the structure more efficient.

                     [ JC(t_1^n,t_2^n)  JC(t_1^n,t_3^n)  ...  JC(t_1^n,t_M^n)     ]
    T_n^(up triang) = [                  JC(t_2^n,t_3^n)  ...  JC(t_2^n,t_M^n)     ]
                     [                                   ...  ...                 ]
                     [                                        JC(t_{M-1}^n,t_M^n) ]

The actual implementation of the upper triangular data structure is based on the efficient TIntArrayList data structure provided by the Trove4j library (it uses arrays of int instead of Java objects), encapsulated in a class which provides methods to access individual elements by their row and column index, or a whole row by its index. Algorithm 1 illustrates the getRow() method of this class.

Algorithm 1 Method to retrieve row i of an upper triangular matrix
  // data is the internal TIntArrayList object
  int[] row = new int[data.length + 1];
  // read column i from the rows above:
  for k = 0 to i − 1 do
      row[k] ← data[k].get(i − k − 1);
  end for
  row[i] ← 0; // by convention, not computed
  if i < data.length − 1 then
      // read the i-th row until the end:
      for j = 0 to data.length − i do
          row[i + j + 1] ← data[i].get(j);
      end for
  else
      do nothing; // the last tweet does not have a row
  end if
  return row;

The whole cross-Joint-Complexity computation was run in a multithreaded way on a 24-processor machine: k = 23 threads are started and each thread works on a disjoint set of rows. Note that because the rows of the triangular matrix do not all have the same number of elements, the rows should not be divided evenly among the threads, otherwise the first threads would have much more work to do than the last ones (remember that the input is an upper triangular matrix). Instead, since the total number of non-zero elements of an n × n upper triangular matrix is n(n−1)/2, each thread should work on approximately n(n−1)/(2k) elements, so the implementation kept assigning rows to a thread until that number of elements was reached.
This implementation allowed the program to compute a 15-minute timeslot in 95 seconds on average.
These computations were only run once, as soon as the data was properly divided into 15-minute timeslots, and the results were saved in files which were subsequently used to perform the rest of the implementation.
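Algorithm 1 can be rendered in plain Java roughly as follows (a sketch with our own class name, plain int[] rows in place of Trove4j's TIntArrayList, and loop bounds adjusted to 0-based indexing):

```java
public class UpperTriangular {
    // data[i] holds row i of the upper triangle: entries (i, i+1) .. (i, M-1).
    // For M tweets there are M-1 stored rows, so data.length == M - 1.
    final int[][] data;

    UpperTriangular(int[][] data) { this.data = data; }

    // Reassemble full row i of the symmetric M x M matrix (Algorithm 1).
    int[] getRow(int i) {
        int[] row = new int[data.length + 1]; // M entries
        // read column i from the rows above:
        for (int k = 0; k < i; k++)
            row[k] = data[k][i - k - 1];
        row[i] = 0; // by convention JC(t_i, t_i) = 0
        if (i < data.length) {
            // read stored row i until the end:
            for (int j = 0; j < data[i].length; j++)
                row[i + j + 1] = data[i][j];
        } // else: the last tweet has no stored row
        return row;
    }

    public static void main(String[] args) {
        // Upper triangle of a 4x4 JC matrix:
        // (0,1)=5 (0,2)=7 (0,3)=2 ; (1,2)=4 (1,3)=9 ; (2,3)=6
        UpperTriangular t = new UpperTriangular(new int[][] {
            {5, 7, 2}, {4, 9}, {6}
        });
        System.out.println(java.util.Arrays.toString(t.getRow(1))); // [5, 0, 4, 9]
    }
}
```

getRow(i) rebuilds full row i from column i of the rows above it plus stored row i itself, which is all the row-sum scoring described below needs.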
Once we have the Joint Complexity scores, we look for the R most representative and central tweets of every timeslot. First we compute the sum of the Joint Complexity of the i-th tweet with every other tweet j in the n-th timeslot,

    S_i^n = Σ_{j=1...M, j≠i} JC(t_i^n, t_j^n),

and we collect the vector S^n = [S_1^n, S_2^n, ..., S_M^n] for every timeslot.
We sort the elements of each vector S^n in descending order and select the R most representative and central tweets in the following way: the best-ranked tweet is chosen unconditionally; the second one is picked only if its JC score with the first one is below a chosen threshold Thr_low, otherwise it is added to the list of related tweets of the first tweet; similarly, the third one is picked only if its JC score with the first two is below Thr_low, and so on. This ensures that the topics are dissimilar enough, and it classifies the best-ranked tweets into topics at the same time.
Then, by removing punctuation, special characters, etc. from each selected tweet, we construct the headline of each topic, and we run through the list of related tweets to keep only the tweets that are different enough from the selected one (we do not want duplicates); we do so by keeping only the tweets whose JC score with the selected tweet and all previously kept related tweets is above a chosen threshold Thr_max. We first chose the empirical values 400 and 600 for Thr_low and Thr_max respectively; then, a few hours before submission time, we noticed that many topics had only one related tweet (all the others were retweets), so we decided to lower the first threshold to Thr_low = 240, but we did not have time to recompute the results on the whole dataset, so only a handful of timeslots benefited from this better value. The final plan is to come up with a formula that lets the system determine these thresholds automatically, depending on the number of characters of each tweet.
While running through the list of related tweets we build the bag of words used to construct the list of keywords, and we also check the original json data to find a URL pointing to a valid image related to the topic.
We chose to print the top 8 topics of each timeslot.

4.1 Keywords

From the bag of words constructed from the list of related tweets, we remove articles (stop-words), punctuation, special characters, etc.; we obtain a list of words, which we order by decreasing frequency of occurrence. Finally, we report the k most frequent words in a list of keywords K = [K_{1...k}^1, K_{1...k}^2, ..., K_{1...k}^N], for the N timeslots.

4.2 Media URLs

The body of a tweet (in the .json file format) contains URL information for links to media files such as pictures or videos; when available, this information is stored in the following subsection of the json object: entities → media → media_url.
While reporting the most representative and central tweets, we scan the original json format in order to retrieve such a URL, from the most representative tweet or any of its related tweets, pointing to valid photos or pictures in .jpg, .png or .gif format. Then we report these pictures along with the headlines and the sets of keywords, as shown in Algorithm 2.
Almost half of the headlines (47%) produced by our method had an image retrieved from the original tweet. When no such URL was provided within the collected json objects, we had planned to visit the URLs of the linked websites and retrieve images from there, but we did not have sufficient time to implement this in a way that enforced that the retrieved image was indeed relevant to the topic. It is important to point out that the goal of the challenge was to detect newsworthy items before they hit mainstream news websites, so it was decided that parsing images from such websites was not interesting in that context.

5 Evaluation

Twitter is one of the most popular social networks and micro-blogging services in the world, with more than 540 million users connected by 24 billion links [GL12]. It allows information to spread between people, groups and advertisers, and since the relations between its users are unidirectional, information propagation within the network is similar to the way information propagates in real life.
Apart from the specific implementation for the Snow Data Challenge, the main benefits of our method are that we can both classify the messages and identify the growing trends in real time, without having to manually build lists of keywords for every language. We can track the information and its timeline within a social network and find groups of users that agree or have the same interests, i.e. perform trend sensing.
The official evaluation results of our method in the Data Challenge are included in [PCA]; therefore here we only discuss a quick comparison of our method with simply counting the number of retweets within a chosen timeslot, namely Feb.
Algorithm 2 Topic detection based on Joint Complexity
  // N = # timeslots, M = # tweets in the n-th timeslot
  for n = 1 to N do
     for i = 1 to M do
         t ← t_json.getText();
         t_ST ← suffixTreeConstruction(t);
         JCScores ← JCMetric();
     end for
     // Find the most representative & central tweets
     S^n ← sum(JCScores);
     // Get headlines for the central tweets
     R^n ← descendingOrder(S^n);
     // Get set of keywords
     K^n ← keywords(R^n);
     // Get URLs of pictures from the .json file
     P^n ← mediaURL(R^n);
     // Print the results in appropriate format
     Print(R^n);
     Print(K^n);
     Print(P^n);
  end for

25th from 8:15 pm until 8:30 pm, comprising just under 16,000 tweets, which is about 50% more than the average number of tweets per timeslot over the whole 24-hour period.
The ranking produced by our method for the timeslot 25-02-14 20:15 is listed below:

(JC1) BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans.

(JC2) WhiteHouse: "I'm here to announce that we're building Iron Man | Not really. Maybe. It's classified." President Obama ActOnJobs

(JC3) Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...

(JC4) Karinabarbosa: civicua: Free Venezuela ! Ukraine is with you! euromaidan Photo by Liubov Yeremicheva

(JC6) BBCSport: Olympiakos take the lead vs Manchester United, a hideous deflection off Alejandro Dominguez beats De Gea.

(JC7) WhatTheBit: It's going to be pretty damn near impossible to top this as photo of the week.

(JC8) Al Qaeda branch in Syria issues ultimatum to splinter group: The head of an al Qaeda-inspired militia fighting...

A small program was written to take the number of retweets observed at the first occurrence of a retweet and subtract it from the number of retweets observed at the last occurrence of the same retweet. All tweets were then sorted in descending order of the number of retweets obtained, and the result is shown below:

(RT1) RT @OllieHolt22: Ashley Young berating someone for going down too easily - that's funny

(RT2) RT @WhiteHouse: "I'm here to announce that we're building Iron ManNot really. Maybe. It's classified." @President Obama #ActOnJobs

(RT3) RT @BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans. http://t.co/d6x1Bous1S

(RT4) RT @TheEllenShow: I'm very excited @Pharrell's performing his big hit "Happy" at the #Oscars. Spolier alert: I'll be hiding in his hat.

(RT5) RT @Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...

(RT6) RT @civicua: Free #Venezuela ! #Ukraine is with you! #euromaidan Photo by Liubov Yeremicheva

(RT7) RT @russellcrowe: Holy Father @Pontifex , it would be my deepest pleasure to bring the @DarrenAronofsky film to you to screen. [...]

(RT8) RT @DjKingAssassin: $myry recovering on #bitcoin market update on #mtgox

The following differences can be observed between the two rankings:

• The best-ranked retweet does not appear in the list produced by our method! In fact a plausible explanation for RT 1 and RT 8 not being picked
(JC5) TheEllenShow: I’m very excited Pharrell’s              up by our method is a weak centrality due to a
performing his big hit "Happy" at the Oscars.                shorter size of text, as briefly explained in Sec-
Spolier alert: I’ll be hiding in his hat.                    tion 6. For RT 7 however the size of the text is not
    a factor, and the weak centrality is most probably
    explained by the unlikely co-occurrence of terms
    within the sentence. Although this is highly sub-
    jective, it is worth noting that those tweets do not
    appear to be very newsworthy.

  • Five out of eight items produced by our method
    (JC1 to JC5) are listed in the retweet ranking;
    this is rather expected, as there is a mechanical
    link between our method and the number of oc-
    currences of the same text.

  • The eighth item produced by our method (JC8)
    seems to be very newsworthy but appears only
    past the 100th position in the retweet count
    method. Again, although this is subjective, it ap-
    pears to be an interesting result.

   Finally, although the dataset that was used for this
challenge did not allow us to show this properly, the au-
thors strongly believe that one key advantage of using
Joint Complexity is that it can deal with languages
other than English [JMS13, MJ14] without requiring
any additional human intervention.

6     Conclusion and Future Work

In this short paper we presented an implementation
of a topic detection method applied to a dataset of
tweets emitted during a 24-hour period. The
implementation relies heavily on the concept of Joint
String Complexity, which has the benefit of being
language agnostic and of not requiring humans to deal
with lists of keywords. The results obtained seemed
satisfactory, and a final evaluation in the context of
the SNOW Data Challenge 2014 can be found in
[PCA].

   As the timeframe available to participate in the
challenge was quite short, a (probably non-exhaustive)
list of items has been identified as possible future
work:

  • As mentioned in Section 4, although the current
    empirical values chosen as thresholds to determine
    whether two texts are similar and whether two
    texts are near-duplicates seemed to be quite satis-
    factory on the dataset that our method was tested
    against, it would be more useful to compute val-
    ues that adjust to the size of the texts.

  • The current implementation systematically picks
    the top 8 topics from each timeslot, whereas a bet-
    ter approach might be to identify a threshold un-
    der which topics are considered not significant
    enough, thus allowing some not very active
    timeslots to contain fewer than 8 topics while other
    very active timeslots may contain more than 8.

  • Dividing the timeslots every quarter of an hour is
    arbitrary, as some topics may be cut in half if they
    started after the middle of the previous timeslot
    and may therefore not acquire enough significance
    in the current timeslot. Before aggregating the
    tweets into 15-minute timeslots, our implementa-
    tion first divided the data into 5-minute timeslots,
    and we considered building 20-minute timeslots
    (each time including the 5 minutes preceding the
    official 15-minute slot) to remedy the above is-
    sue; checking whether this provides significant
    improvements is left as future work.

  • The current implementation, which sorts the top-
    ics by finding the most central tweets, is a bit
    unfair to shorter tweets, because longer tweets
    mechanically have a better chance of obtaining a
    higher centrality score (Joint Complexity counts
    the number of common factors); a future work
    should be to mitigate this. One idea may be to
    modify the Joint Complexity metric so that it
    behaves like a distance and to perform clustering
    based on this distance.

Acknowledgment

This work is supported by the SocialSensor and RE-
VEAL FP7 projects under contract numbers 287975
and 610928, respectively. Dimitris Milioris would like
to thank INRIA, École Polytechnique ParisTech and
LINCS.

References

[APM+13] L. Aiello, G. Petkos, C. Martin, D. Corney,
          S. Papadopoulos, R. Skraba, A. Goker,
          I. Kompatsiaris, and A. Jaimes. Sensing
          trending topics in Twitter. IEEE Transac-
          tions on Multimedia, 15(6):1268–1282,
          October 2013.

[BH11]    V. Becher and P. A. Heiber. A better com-
          plexity of finite sequences. In 8th Int. Conf.
          on Computability and Complexity in Anal-
          ysis and 6th Int. Conf. on Computability,
          Complexity, and Randomness, page 7,
          2011.

[GL12]    M. Gabielkov and A. Legout. The com-
          plete picture of the Twitter social graph. In
          ACM CoNEXT Student Workshop, 2012.

[IYZ02]   L. Ilie, S. Yu, and K. Zhang. Repetition
          complexity of words. pages 320–329, 2002.

[Jac07]   P. Jacquet. Common words between two
          random strings. In IEEE International
          Symposium on Information Theory, Nice,
          France, 2007.
[JLS04]   S. Janson, S. Lonardi, and W. Sz-
          pankowski. On average sequence complex-
          ity. pages 213–227, 2004.

[JMS13]   P. Jacquet, D. Milioris, and W. Sz-
          pankowski. Classification of Markov
          sources through joint string complexity:
          Theory and experiments. In IEEE Inter-
          national Symposium on Information
          Theory, 2013.

[LV93]    M. Li and P. Vitanyi. An Introduction to
          Kolmogorov Complexity and Its Applica-
          tions. Springer-Verlag, Berlin, 1993.

[MJ14]    D. Milioris and P. Jacquet. Joint sequence
          complexity analysis: Application to social
          networks information flow. Bell Labs
          Technical Journal, 18(4), 2014. Issue
          on Data Analytics.

[Nie99]   H. Niederreiter. Some computable
          complexity measures for binary sequences.
          In C. Ding, T. Helleseth, and H. Nieder-
          reiter, editors, Sequences and their Appli-
          cations, Discrete Mathematics and The-
          oretical Computer Science, pages 67–78.
          Springer London, 1999.

[PCA]     S. Papadopoulos, D. Corney, and L. Aiello.
          SNOW 2014 data challenge: Assessing the
          performance of news topic detection meth-
          ods in social media. In Proceedings of the
          SNOW 2014 Data Challenge.

[THP04]   S. Tata, R. Hankins, and J. Patel. Practi-
          cal suffix tree construction. In Proceedings
          of the 30th VLDB Conference, 2004.

[Ziv88]   J. Ziv. On classification with empirically
          observed statistics and universal data com-
          pression. IEEE Transactions on Informa-
          tion Theory, 34:278–286, 1988.
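The per-timeslot ranking loop shown in the pseudocode earlier in this paper can be illustrated with a short, self-contained sketch. This is not the authors' implementation: it computes Joint Complexity by brute force as the number of distinct common factors (substrings) of two tweets, whereas the implementation described in the paper uses suffix trees [THP04] for efficiency. The helper names (`factors`, `joint_complexity`, `rank_central`) and the sample tweets are illustrative assumptions.

```python
from itertools import combinations

def factors(s):
    """Distinct non-empty substrings (factors) of s.
    Brute force for clarity; a suffix tree would give these efficiently."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(a, b):
    """Joint Complexity of two strings: count of distinct common factors."""
    return len(factors(a) & factors(b))

def rank_central(tweets):
    """Score each tweet by its summed JC against all other tweets in the
    timeslot (its centrality), then return indices in descending order."""
    scores = [0] * len(tweets)
    for i, j in combinations(range(len(tweets)), 2):
        jc = joint_complexity(tweets[i], tweets[j])
        scores[i] += jc
        scores[j] += jc
    return sorted(range(len(tweets)), key=lambda i: -scores[i])

# Toy timeslot: the first two tweets share many factors, the third few.
tweets = ["obama announces jobs plan",
          "president obama jobs plan live",
          "free venezuela ukraine"]
print(rank_central(tweets))  # indices, most central tweet first
```

Note that the brute-force `factors` set is quadratic in tweet length, which is acceptable for 140-character tweets but motivates the suffix-tree construction used in practice; it also makes visible the length bias discussed in the future-work list, since longer tweets simply contain more factors.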