<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One Day in Twitter: Topic Detection Via Joint Complexity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerard Burnside</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Milioris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Jacquet</string-name>
          <email>philippe.jacquet@alcatel-lucent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bell Labs, Alcatel-Lucent, Centre de Villarceaux</institution>
          ,
          <addr-line>91620 Nozay</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>In this paper we introduce a novel method to perform topic detection in Twitter based on the recent technique of Joint Complexity. Instead of relying on words, as most existing methods do with bag-of-words or n-gram techniques, Joint Complexity relies on String Complexity, which is defined as the cardinality of the set of all distinct factors (contiguous subsequences of characters) of a given string. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree, and by overlapping two trees, in linear or sublinear average time, we obtain the Joint Complexity, defined as the cardinality of the factors that are common to both trees. The method has been extensively tested for Markov sources of any order over a finite alphabet and gave good approximations for text generation and language discrimination. One key take-away of this approach is that it is language-agnostic, since we can detect similarities between two texts in any loosely character-based language; therefore, there is no need to build any specific dictionary or stemming method. The proposed method can also be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In this paper we exploit a dataset collected by using the Twitter streaming API for one full day, and we extract a significant number of topics for every timeslot.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years online social media services have expanded enormously and, as a result, the
value of the information within these networks has
increased dramatically. Interactions and communication
between users help predict the evolution of
information in online social networks, and the ability to study
these networks can provide relevant information in real
time.</p>
      <p>The study of social networks raises several research
challenges. Searching in blogs, tweets and other social
media is still an open problem, since posts are
individually small in size but delivered in tremendous
quantity in real time, with little and sometimes transient
contextual information. Real-time search has to balance
the quality, authority, relevance and timeliness of the
content. Information about the correlation between
groups of users can help predict media consumption,
network resources and traffic, which can be used to
improve the quality of service and experience. Analyzing
the relationships between members of a group or a
community within a social network can reveal the
important teams to target in companies' marketing plans.
Spam and advertisement detection is another important
challenge, since with the growth of social networks we
also have a continuously growing amount of irrelevant
information over the network.</p>
      <p>In order to address these challenges, we need to
extract the relevant information from online social
media in real time.</p>
      <p>In this paper we use the theory of Joint
Complexity to detect bursty trends among short pieces
of information. Our method is simple, context-free,
with no grammar and no language assumptions, and
it does not use semantics.</p>
      <p>The Suffix Tree contains the algorithmic
signature of the text, and we use it to perform fast and
massive trend sensing in social communities, without
human intervention.</p>
      <p>
        In a prior work, we proposed a method based
on Joint Complexity along with pattern matching
on URLs extracted from tweets. That method was
compared with a version of the Document-Pivot method
described in [APM+13], which was modified in order
to perform classification of tweets, i.e. assigning
tweets to categories such as Sports, Politics,
Economics, etc. The results showed a good performance
of the method compared to Document-Pivot, and
motivated us to use a more simplified version of
it based on Joint Complexity for the Snow
        <xref ref-type="bibr" rid="ref5">Data
Challenge 2014</xref>
        .
      </p>
      <p>The paper is organized as follows: Section 2
introduces the Joint Complexity method, while
Section 3 briefly discusses the Snow Data Challenge
dataset. Section 4 describes the implementation of
the proposed method for topic detection in Twitter,
while Section 5 evaluates its performance with real
data obtained from the mentioned dataset. Finally,
Section 6 summarizes our main results and provides
directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Joint Complexity</title>
      <p>In the last decades, several attempts have been made
to mathematically capture the concept of the complexity
of a sequence, i.e., the number of distinct factors
contained in it. In other words, if X is a sequence
and I(X) its set of factors, then |I(X)| is the
complexity of the sequence. For example, if X = "apple"
then I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl,
ple, appl, pple, apple, v} and |I(X)| = 15 (v
denotes the empty string). The complexity of a string
is also called the I-complexity [BH11]. The notion
is connected with deep mathematical properties,
including the rather elusive concept of randomness in a
string [IYZ02, LV93, Nie99].</p>
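      <p>As a toy illustration of this definition (our sketch, not the paper's implementation; the helper names are hypothetical), the I-complexity of a short string can be computed by brute-force enumeration of its distinct factors:</p>

```python
def factors(s):
    """Set of all distinct factors (substrings) of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def i_complexity(s):
    """I-complexity: cardinality of the set of distinct factors of s."""
    return len(factors(s))

print(i_complexity("apple"))  # 15, matching the example above
```

      <p>For long strings, a Suffix Tree avoids materializing the quadratic number of factors; the brute force above only serves to check the definition.</p>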
      <p>In general, the information contained in a string
may be revealed by comparing it with a reference string,
since its entropy or I-complexity alone does not give much
insight. In a previous work [Jac07] we introduced the
concept of Joint Complexity, or J-complexity, of two
strings. The J-complexity is the number of distinct
common factors of two sequences. In other words, the
J-complexity of sequences X and Y is equal to
J(X, Y) = |I(X) ∩ I(Y)|.</p>
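      <p>Continuing the toy example (our brute-force sketch, not the paper's Suffix Tree implementation), J(X, Y) can be checked by intersecting the two factor sets:</p>

```python
def factors(s):
    """Set of all distinct factors (substrings) of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def j_complexity(x, y):
    """J(X, Y): number of distinct factors common to both sequences."""
    return len(factors(x).intersection(factors(y)))

# the common factors of "apple" and "maple" are
# the empty string, a, p, l, e, ap, pl, le, ple
print(j_complexity("apple", "maple"))  # 9
```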
      <p>The J-complexity is an efficient way to estimate the
degree of similarity of two sequences. Two texts written
in similar languages have more factors in common
than texts written in very different languages.
Furthermore, texts in the same language but on different
topics have a smaller Joint Complexity than texts on the
same topic.</p>
      <p>The analysis of a sequence into subcomponents is
done with Suffix Trees, which come with a simple, fast
and low-complexity method to store and recall them
from memory, especially for short sequences. Suffix
Trees are frequently used in DNA sequence analysis.
The construction of the Suffix Trees for the words
apple and maple, as well as the comparison between
them, is shown in Fig. 1. When the two sources are
different, the J-complexity grows as O(m^γ),
for some γ &lt; 1 and constants α, β &gt; 0 which depend on the
parameters of the two sources. When the sources
are identical, the J-complexity growth is O(m),
hence γ = 1. When the texts themselves are identical (i.e.,
X = Y), the J-complexity coincides with the
I-complexity and grows as m²/2 [JLS04]. Indeed, the
presence of a common factor of length O(m) would
inflate the J-complexity by a term O(m²).</p>
      <p>We should point out that experiments show that
the complexity estimated as above for memoryless
sources converges very slowly. Furthermore,
memoryless sources are not appropriate for modeling text
generation. In a prior work, we extended the J-complexity
estimate to Markov sources of any order over a finite
alphabet. Markov models are more realistic and give
a better approximation of text generation, e.g. of DNA
sequences, than memoryless sources [JMS13, MJ14].</p>
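      <p>For intuition only (a toy sketch, not the estimator of [JMS13, MJ14]; all names are ours), a character-level Markov source of order 2 can be trained on a text and sampled as follows:</p>

```python
import random
from collections import defaultdict

def train_markov(text, order=2):
    """Record, for every context of `order` characters, the observed successors."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Sample characters one at a time, conditioning on the last len(seed) chars."""
    out = seed
    for _ in range(length):
        successors = model.get(out[-len(seed):])
        if not successors:
            break  # unseen context: stop
        out += rng.choice(successors)
    return out

model = train_markov("the quick brown fox jumps over the lazy dog " * 20)
sample = generate(model, "th", 50, random.Random(0))
```

      <p>Texts drawn from two such models trained on different languages will share far fewer factors than texts drawn from the same model, which is exactly what the discriminant below exploits.</p>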
      <p>Joint string complexity has a variety of applications,
such as detecting the degree of similarity of two
sequences, for example the use of "copy-paste" in texts
or documents, or the identification of the authors and era of
texts. We want to show here that it can also be used in the
analysis and classification of social networks (e.g. tweets,
which are limited to 140 characters). Therefore it
may be a pertinent tool for the automated monitoring of
social networks. Real-time search in blogs, tweets and
other social media has to balance the quality and
relevance of the content, which, due to the very short
size of the texts and their huge number, is still
an unsolved problem.</p>
      <p>In a prior work [JMS13, MJ14] we derived a second-order
asymptotics for the J-complexity of the following
form</p>
      <p>J(X, Y) ≈ α m^γ / √(log m) + β,  (2)
for some β &gt; 0. This new estimate converges more
quickly, and usually works for texts of length of order m ≈ 10².
In fact, for some Markov sources our analysis indicates
that the J-complexity oscillates with m. This is captured
by the introduction of a periodic function Q(log m) in
the leading term of our asymptotics. This additional
term further improves the convergence for small
values of m.</p>
      <p>In view of these facts, we can use the J-complexity
to discriminate between two identical/non-identical
Markov sources [Ziv88]. We introduce the
discriminant function as follows:
d(X, Y) = 1 - (1/log m) log J(X, Y),  (3)
for two sequences X and Y of length m. This
discriminant allows us to determine whether X and Y
are generated by the same Markov source or not, by
verifying whether
d(X, Y) = O(1/log m) -&gt; 0, or
d(X, Y) = 1 - γ + O(log log m / log m) &gt; 0,
respectively, when the length of X and Y is equal to m.</p>
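      <p>A direct transcription of Eq. (3) (our brute-force sketch; the paper computes J via Suffix Trees):</p>

```python
import math

def factors(s):
    """Set of all distinct factors of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def discriminant(x, y):
    """d(X, Y) = 1 - log J(X, Y) / log m for sequences of length m."""
    m = len(x)
    jc = len(factors(x).intersection(factors(y)))
    return 1.0 - math.log(jc) / math.log(m)

x = "joint complexity measures common factors"
y = "wqz vxk jqw zzv kxq wvz qqk xzv wkq zqx"
# d(x, x) is far below d(x, y): identical texts share all factors,
# unrelated texts share only a handful
print(discriminant(x, x), discriminant(x, y))
```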
      <p>In [JMS13, MJ14] we show some experimental
evidence of the usage of our discriminant function on real and
simulated texts of different Markov orders. We
compare the J-complexity of a simulated English text with
texts of the same length generated in French, Polish, Greek
and Finnish by a third-order Markov model against our
theoretical results. Even for texts shorter than
a thousand characters, one can easily discriminate
between these languages. Therefore, Joint Complexity is
an efficient method to capture the degree of similarity of
short texts.</p>
    </sec>
    <sec id="sec-3">
      <title>Snow Data Challenge Dataset</title>
      <p>According to the Social Sensor dataset for the Snow
Data Challenge, we collected tweets for 24 hours,
between Tuesday Feb. 25, 18:00 and Wednesday
Feb. 26, 18:00 (GMT). The result of crawling
was over 1,041,062 tweets between the Unix
timestamps 1393351200000 and 1393437600000 and was
constructed through the use of the Twitter
Streaming API by following 556,295 users and also looking
for four specific keywords: Syria, terror, Ukraine,
bitcoin. The dataset used for the experiments consists of
96 timeslots, where each timeslot contains the tweets of a
15-minute interval, starting at 18:00 on Tuesday, Feb. 25,
2014. The challenge then consisted in providing a
minimum of one and a maximum of ten different topics per
timeslot, along with a headline, a set of keywords and
a URL of a relevant image for each detected topic. The
test dataset activity and the statistics of the dataset
crawl are described more extensively in [PCA].</p>
    </sec>
    <sec id="sec-4">
      <title>Topic Detection Method</title>
      <p>To date, the main methods used for text
classification are based on keyword detection and
Machine Learning techniques. Using keywords in
tweets has several drawbacks, because of wrong
spelling or distorted usage of the words - it also
requires lists of stop-words to be built for every
language - or because of implicit references to previous
texts or messages. Machine Learning techniques are
generally heavy and complex, and therefore may not
be good candidates for real-time text processing,
especially in the case of Twitter, where we have
natural language and thousands of tweets per second
to process. Furthermore, Machine Learning processes
have to be manually initiated by tuning parameters,
which is one of the main drawbacks for this kind
of application, where we want minimal, if any,
human intervention. Some other methods use
information extracted by visiting the URLs referenced
in the text, which makes them a heavy procedure,
since one may have limited or no access to the
information, e.g. because of access rights or data size/rate.</p>
      <p>In our method we analyze the messages
(tweets) with Suffix Trees [THP04] and we use the
Joint Complexity as a metric to quantify the similarity
between these messages.</p>
      <p>According to the dataset described in Section 3 and
in [PCA], we have N = 96 timeslots, indexed n = 1 ... N.
For every tweet t_i, where i = 1 ... M, with M being
the total number of tweets in the n-th timeslot, denoted t_i^n,
we build a Suffix Tree ST(t_i^n), as described in Section
2. Building a Suffix Tree is an operation that costs
O(m log m) time and O(m) space in memory, where
m is the length of the tweet.</p>
      <p>Then we compute the Joint Complexity metric
JC(t_i^n, t_j^n) of the tweet t_i^n with every other tweet t_j^n
of the n-th timeslot, where j = 1 ... M and j ≠ i
(by convention we choose JC(t_i^n, t_i^n) = 0). The Joint
Complexity between two tweets is the number of
common factors, as defined in language theory, and can
be computed efficiently in O(m) operations (sublinear
on average) by Suffix Tree superposition. For the N
timeslots we store the results of the computation in
the matrices T_1, T_2, ..., T_N of M × M dimensions.</p>
      <p>We represent timeslots by fully-connected weighted
graphs. Each tweet is a node in the graph and the
two-dimensional array T_n holds the weight of each edge, as
shown in Fig. 2. Then, we calculate the score of each
node in the graph by summing the weights of all the edges
connected to it. The node with the highest
score corresponds to the most representative and central tweet of
the timeslot.</p>
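      <p>The scoring step above can be sketched as follows (with brute-force JC in place of Suffix Tree superposition; the function names are ours):</p>

```python
def factors(s):
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def jc(x, y):
    """Joint Complexity: number of distinct common factors."""
    return len(factors(x).intersection(factors(y)))

def most_central(tweets):
    """Score each tweet by the sum of its JC with every other tweet
    (JC with itself is 0 by convention); return the index of the best score."""
    n = len(tweets)
    scores = [sum(jc(tweets[i], tweets[j]) for j in range(n) if j != i)
              for i in range(n)]
    return max(range(n), key=scores.__getitem__)

tweets = [
    "obama announces a new plan to boost job growth",
    "president obama announces new plan for job growth",
    "xqzv wkjh pltm bnrd",
]
# one of the two similar tweets wins; the unrelated one cannot be central
print(most_central(tweets))
```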
      <p>The matrix T_n is symmetric with a zero diagonal:
T_n = [ 0, JC(t_1^n, t_2^n), JC(t_1^n, t_3^n), ..., JC(t_1^n, t_M^n);
JC(t_2^n, t_1^n), 0, JC(t_2^n, t_3^n), ..., JC(t_2^n, t_M^n);
JC(t_3^n, t_1^n), JC(t_3^n, t_2^n), 0, ..., JC(t_3^n, t_M^n);
...;
JC(t_M^n, t_1^n), JC(t_M^n, t_2^n), JC(t_M^n, t_3^n), ..., 0 ]</p>
      <sec id="sec-4-3">
        <title>Implementation of the Joint Complexity matrix</title>
        <p>Most of the timeslots have M = 5,000 tweets,
so the matrices T_1, T_2, ..., T_N have approximately
25,000,000 entries each. Since they are
symmetric, only half of these entries need to be stored, i.e. the
upper triangular or the lower triangular part of the matrices
T_1, T_2, ..., T_N, as shown below, which saves space and
makes the structure more efficient.</p>
        <p>Algorithm 1 Method to retrieve row i of an upper
triangular matrix
// data is the internal array of TIntArrayList objects
int[] row = new int[data.length + 1];
// read column i from the rows above the diagonal:
for k = 0 to i - 1 do
  row[k] = data[k].get(i - k - 1);
end for
row[i] = 0; // by convention, not computed
if i &lt; data.length then
  // read the stored part of the i-th row until the end:
  for j = 0 to data[i].size() - 1 do
    row[i + j + 1] = data[i].get(j);
  end for
else
  do nothing; // the last tweet does not have a stored row
end if
return row;</p>
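        <p>The same row-retrieval logic in Python (a sketch for illustration; the actual implementation is Java with Trove4j):</p>

```python
def get_row(data, i):
    """Rebuild full row i of a symmetric matrix with zero diagonal stored as a
    strict upper triangle: data[k] holds row k for columns k+1 .. n-1."""
    n = len(data) + 1          # matrix dimension; the last row stores nothing
    row = [0] * n
    for k in range(i):         # entries left of the diagonal: (i, k) = (k, i)
        row[k] = data[k][i - k - 1]
    # row[i] stays 0 by convention (not computed)
    if i != len(data):         # the last tweet does not have a stored row
        for j, v in enumerate(data[i]):
            row[i + j + 1] = v
    return row

# upper triangle of [[0,1,2,3], [1,0,4,5], [2,4,0,6], [3,5,6,0]]
data = [[1, 2, 3], [4, 5], [6]]
print(get_row(data, 2))  # [2, 4, 0, 6]
```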
        <p>T_n^(upper) = [ 0, JC(t_1^n, t_2^n), JC(t_1^n, t_3^n), ..., JC(t_1^n, t_M^n);
0, JC(t_2^n, t_3^n), ..., JC(t_2^n, t_M^n);
...;
0, JC(t_(M-1)^n, t_M^n);
0 ]</p>
        <p>The actual implementation of the upper triangular
data structure is based on the efficient TIntArrayList
data structure provided by the Trove4j library (it uses
arrays of int instead of Java objects), encapsulated in a
class which provides methods to access individual
elements by their row and column indices, or a whole
row by providing its index. Algorithm 1 illustrates the
getRow() method of this class.</p>
        <p>The whole cross Joint Complexity computation was
run in a multithreaded way on a 24-processor machine:
k = 23 threads are started and each thread works on
a disjoint set of rows. Note that because the rows of
the triangular matrix do not all have the same
number of elements, the rows should not be divided evenly
among the threads, otherwise the first
threads would have much more work to do than the
last ones (remember that the input is an upper
triangular matrix). Instead, because the total number of
non-zero elements of the n × n matrix is n(n - 1)/2, each
thread should work on approximately n(n - 1)/(2k)
elements, so the implementation kept assigning rows to
a thread until that number of elements was reached.</p>
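        <p>The row-assignment policy can be sketched as follows (a simplified single-pass version; the function name and structure are ours):</p>

```python
def assign_rows(n, k):
    """Split the n-1 stored rows of a strict upper-triangular n x n matrix among
    k threads. Row i holds n-1-i elements, so instead of giving every thread
    the same number of rows, rows are accumulated until roughly n(n-1)/(2k)
    elements are reached."""
    target = n * (n - 1) / (2 * k)
    buckets, current, count = [], [], 0
    for i in range(n - 1):
        current.append(i)
        count += n - 1 - i
        if count >= target and k - 1 > len(buckets):
            buckets.append(current)
            current, count = [], 0
    buckets.append(current)      # remaining rows go to the last thread
    return buckets

buckets = assign_rows(5000, 23)  # the paper's setting: 23 worker threads
```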
        <p>This implementation allowed the program to
compute a 15-minute timeslot in 95 seconds on average.</p>
        <p>These computations were only run once, as soon as
the data was properly divided into 15-minute
timeslots, and the results were saved in files which were
subsequently used to perform the rest of the
implementation.</p>
        <p>When we finally get the scores of the Joint
Complexity metric, we look for the R most
representative and central tweets of every timeslot. First
we compute the sum of the Joint Complexity of the i-th tweet
with every other tweet j in the n-th timeslot,
S_i^n = Σ_{j=1...M, j≠i} JC(t_i^n, t_j^n), and finally we obtain the
vector S^n = [S_1^n, S_2^n, ..., S_M^n] for each timeslot.</p>
        <p>We sort the elements of each vector S^n in
descending order and we get the R most representative and
central tweets in the following way: the best-ranked
tweet is chosen unconditionally; the second one is
picked only if its JC score with the first one is
below a chosen threshold Thr_low, otherwise it is added
to the list of related tweets of the first tweet; similarly,
the third one is picked only if its JC score with the first
two is below Thr_low, and so on. This ensures that the
topics are dissimilar enough, and it classifies the best-ranked
tweets into topics at the same time.</p>
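        <p>The greedy selection described above can be sketched like this (with brute-force JC; Thr_low is passed in, since the paper chose it empirically):</p>

```python
def factors(s):
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def jc(x, y):
    return len(factors(x).intersection(factors(y)))

def pick_topics(ranked_tweets, thr_low, r):
    """Walk tweets in decreasing centrality order; a tweet opens a new topic
    only if its JC with every already-picked head is below thr_low, otherwise
    it is appended to the related list of the first topic it resembles."""
    topics = []                  # each topic: {"head": ..., "related": [...]}
    for t in ranked_tweets:
        for topic in topics:
            if jc(t, topic["head"]) >= thr_low:
                topic["related"].append(t)
                break
        else:
            if r > len(topics):
                topics.append({"head": t, "related": []})
    return topics

ranked = [
    "president obama announces new jobs plan today",
    "obama announces new jobs plan for americans",
    "xqzv wkjh pltm bnrd qq",
]
topics = pick_topics(ranked, thr_low=60, r=8)
```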
        <p>Then, by removing punctuation, special characters,
etc., from each selected tweet, we construct the headline
of each topic, and we run through the list of related
tweets to keep only the tweets that are different enough
from the selected one (we do not want duplicates); we
do so by keeping only the tweets whose JC score with
the selected tweet and all previous related tweets is
above a chosen threshold Thr_max. We first chose the
empirical values 400 and 600 for Thr_low and Thr_max
respectively; then, a few hours before submission time, we
noticed that many topics had only one related tweet
(all the others were retweets), so we decided to lower
the first threshold to Thr_low = 240, but did not have time to
recompute the results on the whole dataset, so only a handful
of timeslots benefited from this better value.
The final plan is to come up with a formula letting
the system determine those thresholds
automatically, depending on the number of characters of each
tweet.</p>
        <p>While running through the list of related tweets, we
compute the bag of words used to construct the list
of keywords, and we also check the original json data
to find a URL pointing to a valid image related to the
topic.</p>
        <p>We chose to print the top 8 topics for each
timeslot.</p>
        <p>Keywords. In the bag of words constructed from the list of related
tweets, we remove articles (stop-words), punctuation,
special characters, etc., and we get a list of words, which
we then order by decreasing frequency of
occurrence. Finally we report the k most frequent words
in a list of keywords K = [K_{1...k}^1, K_{1...k}^2, ..., K_{1...k}^N],
for the N timeslots.</p>
        <p>Media URLs. The body of a tweet (in the .json file format)
contains URL information for links to media files such
as pictures or videos; when available, this information
is stored in the following field of the json object:
entities → media → media_url.</p>
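        <p>A minimal version of the keyword step (the illustrative stop-word list and names are ours):</p>

```python
import re
from collections import Counter

# a tiny illustrative stop-word list; the real system removes articles etc.
STOP = {"the", "a", "an", "to", "is", "for", "of", "and", "in", "on"}

def keywords(related_tweets, k):
    """Bag of words over the related tweets: lower-case, strip punctuation,
    drop stop-words, keep the k most frequent remaining words."""
    words = []
    for t in related_tweets:
        for w in re.findall(r"[a-z0-9']+", t.lower()):
            if w not in STOP:
                words.append(w)
    return [w for w, _ in Counter(words).most_common(k)]

tweets = ["Obama announces a new plan for jobs",
          "LIVE: Obama is announcing the jobs plan"]
print(keywords(tweets, 3))
```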
        <p>While reporting the most representative and central
tweets, we scan the original json format in order to
retrieve such a URL, from the most representative tweet
or any of its related tweets, pointing to valid photos
or pictures in .jpg, .png or .gif format. Then we
report these pictures along with the headlines and the
sets of keywords, as shown in Algorithm 2.</p>
        <p>Almost half of the headlines (47%) produced by our
method had an image retrieved from the original tweet.
When no such URL is provided within the collected
json objects, we had planned to visit the URLs of the
linked websites and retrieve images from them, but did not have
sufficient time to implement this in a way that enforced
that the retrieved image was indeed relevant to the
topic. It is important to point out that the goal of the
challenge was to detect newsworthy items before they
hit mainstream news websites, so it was decided that
parsing images from such websites was not interesting
in that context.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>Twitter is one of the most popular social networks and
micro-blogging services in the world, with more than
540 million users connected by 24 billion links [GL12].
It allows information to spread between people,
groups and advertisers, and since the relation between
its users is unidirectional, information propagation
within the network is similar to the way
information propagates in real life.</p>
      <p>Apart from the specific implementation for the
Snow Data Challenge, the main benefits of our
method are that we can both classify the messages
and identify the growing trends in real time, without
having to manually build lists of keywords for every
language. We can track the information and its timeline
within a social network and find groups of users that
agree or have the same interests, i.e., perform trend
sensing.</p>
      <p>The official evaluation results of our method in
the Data Challenge are included in [PCA];
therefore here we will only discuss a quick comparison
of our method with simply counting the number
of retweets within a chosen timeslot, namely Feb.
25th from 8:15 pm until 8:30 pm, comprising just
under 16,000 tweets, which is about 50% higher than
the average number of tweets for the whole 24h period.
The overall pipeline is summarized in Algorithm 2.</p>
      <p>Algorithm 2 Topic detection based on Joint
Complexity
// N = # timeslots, M = # tweets in the n-th timeslot
for n = 1 to N do
  for i = 1 to M do
    t = json.getText();
    ST(t) = suffixTreeConstruction(t);
  end for
  JCScores = JCMetric();
  // Find the most representative and central tweets
  Sn = sum(JCScores);
  // Get headlines for the central tweets
  Rn = descendingOrder(Sn);
  // Get the set of keywords
  Kn = keywords(Rn);
  // Get URLs of pictures from the .json file
  Pn = mediaURL(Rn);
  // Print the results in the appropriate format
  Print(Rn); Print(Kn); Print(Pn);
end for</p>
      <sec id="sec-5-2">
        <title>Comparison with a retweet-count ranking</title>
        <p>The ranking produced by our method for the
timeslot 25-02-14 20:15 is listed below:
(JC1) BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans.
(JC2) WhiteHouse: "I'm here to announce that we're building Iron Man. Not really. Maybe. It's classified." President Obama ActOnJobs
(JC3) Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...
(JC4) Karinabarbosa: civicua: Free Venezuela ! Ukraine is with you! euromaidan Photo by Liubov Yeremicheva
(JC5) TheEllenShow: I'm very excited Pharrell's performing his big hit "Happy" at the Oscars. Spoiler alert: I'll be hiding in his hat.
(JC6) BBCSport: Olympiakos take the lead vs Manchester United, a hideous deflection off Alejandro Dominguez beats De Gea.
(JC7) WhatTheBit: It's going to be pretty damn near impossible to top this as photo of the week.
(JC8) Al Qaeda branch in Syria issues ultimatum to splinter group: The head of an al Qaeda-inspired militia fighting...</p>
        <p>A small program was written to compare the
number of retweets observed at the first occurrence of a
retweet with the number of retweets observed at the last
occurrence of the same retweet, i.e. the retweets gained
inside the timeslot. All tweets were then sorted in
descending order of the number of retweets obtained,
and the result is shown below:
(RT1) RT @OllieHolt22: Ashley Young berating someone for going down too easily - that's funny
(RT2) RT @WhiteHouse: "I'm here to announce that we're building Iron Man. Not really. Maybe. It's classified." @President Obama #ActOnJobs
(RT3) RT @BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans. http://t.co/d6x1Bous1S
(RT4) RT @TheEllenShow: I'm very excited @Pharrell's performing his big hit "Happy" at the #Oscars. Spoiler alert: I'll be hiding in his hat.
(RT5) RT @Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...
(RT6) RT @civicua: Free #Venezuela ! #Ukraine is with you! #euromaidan Photo by Liubov Yeremicheva
(RT7) RT @russellcrowe: Holy Father @Pontifex , it would be my deepest pleasure to bring the @DarrenAronofsky film to you to screen. [...]
(RT8) RT @DjKingAssassin: $myry recovering on #bitcoin market update on #mtgox</p>
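        <p>The retweet-count baseline can be reproduced in a few lines (our sketch of the small comparison program described above):</p>

```python
def retweet_ranking(stream):
    """stream: (text, retweet_count) pairs in arrival order. For each retweet,
    subtract the count at its first occurrence from the count at its last
    occurrence, then rank the texts by that difference, descending."""
    first, last = {}, {}
    for text, count in stream:
        first.setdefault(text, count)
        last[text] = count
    deltas = {t: last[t] - first[t] for t in first}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

stream = [("RT @a: hello", 10), ("RT @b: news", 3),
          ("RT @a: hello", 250), ("RT @b: news", 8)]
print(retweet_ranking(stream))  # [('RT @a: hello', 240), ('RT @b: news', 5)]
```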
        <p>The following differences can be observed between
the two rankings:</p>
        <p>The best-ranked retweet does not appear in the
list produced by our method! In fact, a plausible
explanation for RT1 and RT8 not being picked
up by our method is a weak centrality due to their
shorter text, as briefly explained in
Section 6. For RT7, however, the size of the text is not
a factor, and the weak centrality is most probably
explained by the unlikely co-occurrence of terms
within the sentence. Although this is highly
subjective, it is worth noting that those tweets do not
appear to be very newsworthy.</p>
        <p>Five out of eight items produced by our method
(JC1 to JC5) are listed in the retweet ranking;
this is rather expected, as there is a mechanical
link between our method and the number of
occurrences of the same text.</p>
        <p>The eighth item produced by our method (JC8)
seems to be very newsworthy but appears only
past the 100th position in the retweet-count
method. Again, although this is subjective, it
appears to be an interesting result.</p>
        <p>Finally, although the dataset that was used for this
challenge did not allow us to show this properly, the
authors strongly believe that one key advantage of using
Joint Complexity is that it can deal with languages
other than English [JMS13, MJ14] without requiring
any additional human intervention.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>
        In this short paper we presented an implementation
of a topic detection method applied to a dataset
of tweets emitted during a 24-hour period. The
implementation relies heavily on the concept of Joint
String Complexity, which has the benefit of being
language-agnostic and of not requiring humans to
maintain lists of keywords. The results obtained seemed
satisfactory, and a final evaluation in the context of
the SNOW
        <xref ref-type="bibr" rid="ref5">Data Challenge 2014</xref>
        can be found in [PCA].
      </p>
      <p>As the timeframe available to participate in the
challenge was quite short, a (probably non-exhaustive)
number of items have been identified as possible
future work; these are listed below:</p>
      <p>As mentioned in Section 4, although the current
empirical values chosen as thresholds to determine
whether two texts are similar and whether two
texts are near-duplicates seemed quite
satisfactory on the dataset that our method was tested
against, it would be more useful to compute
values that adjust to the size of the texts.</p>
      <p>The current implementation systematically picks
the top 8 topics from each timeslot, whereas a
better approach might be to identify a threshold
under which topics may be considered not significant
enough, thus allowing some not-very-active
timeslots to contain fewer than 8 topics while other
very active timeslots may contain more than 8.</p>
      <p>Dividing the timeslots every quarter of an hour is
arbitrary, as some topics may be cut in half if they
started after the middle of the previous timeslot
and may therefore not acquire enough significance
in the current timeslot. Before aggregating the
tweets into 15-minute timeslots, our
implementation first divided the data into 5-minute timeslots,
and we considered building 20-minute timeslots
(each time including the 5 minutes preceding the
official 15-minute slot) to remedy the above
issue, but it is left as future work to check whether
this provides significant improvements.</p>
      <p>The current implementation, which sorts the
topics by finding the most central tweets, is a bit
unfair to shorter tweets, because longer tweets
mechanically have a better chance of obtaining a
higher centrality score (Joint Complexity counts
the number of common factors); future work
should therefore mitigate this somehow.
One idea may be to modify the Joint Complexity
metric so that it behaves like a distance, and
to perform clustering based on this distance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>This work is supported by the SocialSensor and
REVEAL FP7 projects under contract numbers 287975
and 610928 respectively. Dimitris Milioris would like
to thank INRIA, Ecole Polytechnique ParisTech and
LINCS.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[BH11] V. Becher and P. A. Heiber. A better complexity of finite sequences. 8th Int. Conf. on Computability and Complexity in Analysis and 6th Int. Conf. on Computability, Complexity, and Randomness, page 7, 2011.</p>
      <p>[GL12] M. Gabielkov and A. Legout. The complete picture of the twitter social graph. In ACM CoNEXT Student Workshop, 2012.</p>
      <p>[IYZ02] L. Ilie, S. Yu, and K. Zhang. Repetition complexity of words. pages 320-329, 2002.</p>
      <p>[Jac07] P. Jacquet. Common words between two random strings. In IEEE International Symposium on Information Theory, Nice, France, 2007.</p>
      <p>[JMS13] P. Jacquet, D. Milioris, and W. Szpankowski. Classification of Markov sources through joint string complexity: Theory and experiments. 2013.</p>
      <p>[MJ14] D. Milioris and P. Jacquet. Joint sequence complexity analysis: Application to social networks information flow. Bell Laboratories Technical Journal, 18(4), 2014.</p>
      <p>[Nie99] H. Niederreiter. Some computable complexity measures for binary sequences. In C. Ding, T. Helleseth, and H. Niederreiter, editors, Sequences and their Applications, Discrete Mathematics and Theoretical Computer Science, pages 67-78. Springer London, 1999.</p>
      <p>[PCA] S. Papadopoulos, D. Corney, and L. Aiello. Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge.</p>
      <p>[THP04] S. Tata, R. Hankins, and J. Patel. Practical suffix tree construction. In Proceedings of the 30th VLDB Conference, 2004.</p>
      <p>[Ziv88] J. Ziv. On classification with empirically observed statistics and universal data compression. IEEE Transactions on Information Theory, 34:278-286, 1988.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[APM+13] L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes. Sensing trending topics in twitter. IEEE Transactions on Multimedia, 15(6):1268-1282, October 2013.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[JLS04] S. Janson, S. Lonardi, and W. Szpankowski. On average sequence complexity. Theoretical Computer Science, pages 213-227, 2004.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[JMS13] P. Jacquet, D. Milioris, and W. Szpankowski. Classification of Markov sources through joint string complexity: Theory and experiments. 2013.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[LV93] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, Berlin, August 1993.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[MJ14] D. Milioris and P. Jacquet. Joint sequence complexity analysis: Application to social networks information flow. Bell Laboratories Technical Journal, 18(4), 2014. Issue on Data Analytics.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>