=Paper= {{Paper |id=Vol-1150/burnside |storemode=property |title=One Day in Twitter: Topic Detection Via Joint Complexity |pdfUrl=https://ceur-ws.org/Vol-1150/burnside.pdf |volume=Vol-1150 |dblpUrl=https://dblp.org/rec/conf/www/BurnsideMJ14 }} ==One Day in Twitter: Topic Detection Via Joint Complexity== https://ceur-ws.org/Vol-1150/burnside.pdf
One Day in Twitter: Topic Detection Via Joint Complexity

Gérard Burnside, Dimitris Milioris, Philippe Jacquet
Bell Labs, Alcatel-Lucent, Centre de Villarceaux, 91620 Nozay, France
{gerard.burnside, dimitrios.milioris, philippe.jacquet}@alcatel-lucent.com



Abstract

In this paper we introduce a novel method to perform topic detection in Twitter, based on the recent technique of Joint Complexity. Instead of relying on words, as most existing methods based on bag-of-words or n-gram techniques do, Joint Complexity relies on String Complexity, which is defined as the cardinality of the set of all distinct factors (subsequences of characters) of a given string. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree, and by overlapping two trees, in linear or sublinear average time, we obtain the Joint Complexity, defined as the cardinality of the factors that are common to both trees. The method has been extensively tested for Markov sources of any order over a finite alphabet and gave good approximations for text generation and language discrimination. One key take-away of this approach is that it is language-agnostic, since we can detect similarities between two texts in any loosely character-based language; there is therefore no need to build any specific dictionary or stemming method. The proposed method can also be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In this paper we exploit a dataset collected by using the Twitter streaming API for one full day, and we extract a significant number of topics for every timeslot.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1 Introduction

During recent years online social media services have seen a huge expansion, and as a result the value of information within these networks has increased dramatically. Interactions and communication between users help predict the evolution of information in online social networks, and the ability to study these networks can provide relevant information in real time.
The study of social networks poses several research challenges:

• searching in blogs, tweets and other social media is still an open problem, since posts are individually small in size but there is a tremendous quantity of them delivered in real time, with little and sometimes transient contextual information. Real-time search has to balance quality, authority, relevance and timeliness of the content;

• information on the correlation between groups of users can help predict media consumption, network resources and traffic. This can be used to improve the quality of service and experience;

• analyzing the relationships between members of a group or a community within a social network can reveal the important teams, which companies can then use in their marketing plans;

• spam and advertisement detection is another important challenge, since with the growth of social networks we also have a continuously growing amount of irrelevant information over the network.

In order to address these challenges, we need to extract the relevant information from online social media in real time.
In this paper we use the theory of Joint Complexity to detect bursty trends among short pieces of information. Our method is simple, context-free, with no grammar and no language assumptions, and it does not use semantics.
The Suffix Tree contains the algorithmic signature of the text, and we use it to perform fast and massive trend sensing in social communities, without human intervention.
In a prior work, we proposed a method based on Joint Complexity along with pattern matching on URLs extracted from tweets. That method was compared with a version of a Document-Pivot method described in [APM+13], which was modified in order to perform classification of tweets, i.e. assigning tweets to categories such as Sports, Politics, Economics, etc. The results showed a good performance of the method compared to Document-Pivot and motivated us to use a more simplified version of it, based on Joint Complexity, for the Snow Data Challenge 2014.
The paper is organized as follows: Section 2 introduces the Joint Complexity method, while Section 3 briefly discusses the Snow Data Challenge dataset. Section 4 describes the implementation of the proposed method for topic detection in Twitter, while Section 5 evaluates its performance on real data obtained from the mentioned dataset. Finally, Section 6 summarizes our main results and provides directions for future work.

2 Joint Complexity

In the last decades, several attempts have been made to mathematically capture the concept of complexity of a sequence, i.e. the number of distinct factors contained in the sequence. In other words, if X is a sequence and I(X) its set of factors, then |I(X)| is the complexity of the sequence. For example, if X = "apple" then I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, v} and |I(X)| = 15 (v denotes the empty string). The complexity of a string is also called the I-complexity [BH11]. The notion is connected with deep mathematical properties, including the rather elusive concept of randomness in a string [IYZ02, LV93, Nie99].
In general, the information contained in a string may be revealed by comparing it with a reference string, since its entropy or I-complexity alone does not give much insight. In a previous work [Jac07] we introduced the concept of Joint Complexity, or J-complexity, of two strings. The J-complexity is the number of common distinct factors of two sequences; in other words, the J-complexity of sequences X and Y is J(X, Y) = |I(X) ∩ I(Y)|.
The J-complexity is an efficient way to estimate the similarity degree of two sequences. Two texts written in similar languages have more factors in common than texts written in very different languages. Furthermore, texts in the same language but on different topics have a smaller Joint Complexity than texts on the same topic.
The decomposition of a sequence into its factors is done with Suffix Trees, which provide a simple, fast and low-complexity way to store and recall them from memory, especially for short sequences. Suffix Trees are frequently used in DNA sequence analysis. The construction of the Suffix Trees of the words apple and maple, as well as the comparison between them, is shown in Fig. 1.

[Figure 1: Suffix Tree of the words apple and maple. JC(apple, maple) = 9.]

In our previous work [Jac07] it is proved that the J-complexity of two texts of size m built from two different binary memoryless sources grows like

    γ m^κ / √(α log m),    (1)

for some κ < 1 and γ, α > 0 which depend on the parameters of the two sources. When the sources are identical, the J-complexity growth is O(m), hence κ = 1. When the texts are identical (i.e. X = Y), the J-complexity coincides with the I-complexity and grows as m²/2 [JLS04]. Indeed, the presence of a common factor of length O(m) inflates the J-complexity by a term O(m²).
We should point out that experiments show that the complexity estimated as above for memoryless sources converges very slowly. Furthermore, memoryless sources are not appropriate for modeling text generation. In a prior work, we extended the J-complexity estimate to Markov sources of any order on a finite alphabet. Markov models are more realistic and give a better approximation of text generation, e.g. for DNA sequences, than memoryless sources [JMS13, MJ14].
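The definition above can be made concrete with a small sketch of ours (not the paper's implementation: the paper builds Suffix Trees for linear-time behaviour, while this brute-force version enumerates factor sets in quadratic time purely to make the counting explicit; like the apple example, it counts the empty string as a factor):

```java
import java.util.HashSet;
import java.util.Set;

public class JointComplexity {

    // I(x): the set of all distinct factors (substrings) of x,
    // including the empty string v, as in the definition of I-complexity.
    static Set<String> factors(String x) {
        Set<String> f = new HashSet<>();
        f.add(""); // the empty string v
        for (int i = 0; i < x.length(); i++)
            for (int j = i + 1; j <= x.length(); j++)
                f.add(x.substring(i, j));
        return f;
    }

    // J(x, y) = |I(x) ∩ I(y)|
    static int jc(String x, String y) {
        Set<String> common = new HashSet<>(factors(x));
        common.retainAll(factors(y));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(factors("apple").size()); // |I(apple)| = 15
        // common factors: v, a, p, l, e, ap, pl, le, ple
        System.out.println(jc("apple", "maple"));    // 9
    }
}
```

Running it reproduces |I(apple)| = 15 and JC(apple, maple) = 9 from Fig. 1.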
Joint string complexity has a variety of applications related to detecting the similarity degree of two sequences, for example the detection of "copy-paste" in texts or documents, or the identification of the author and era of a text. We want to show here that it can also be used for the analysis and classification of social networks (e.g. tweets, which are limited to 140 characters). Therefore it may be a pertinent tool for automated monitoring of social networks. Real-time search in blogs, tweets and other social media has to balance quality and relevance of the content, which, due to the very short size of the texts and their huge number, is still an unsolved problem.
In a prior work [JMS13, MJ14] we derived a second-order asymptotics for the J-complexity, of the form

    γ m^κ / √(α log m + β),    (2)

for some β > 0. This new estimate converges more quickly and usually already works for texts of order m ≈ 10². In fact, for some Markov sources our analysis indicates that the J-complexity oscillates with m. This is captured by the introduction of a periodic function Q(log m) in the leading term of our asymptotics. This additional term further improves the convergence for small values of m.
In view of these facts, we can use the J-complexity to discriminate whether two Markov sources are identical or not [Ziv88]. We introduce the discriminant function

    d(X, Y) = 1 − (1 / log m) log J(X, Y),    (3)

for two sequences X and Y of length m. This discriminant allows us to determine whether X and Y are generated by the same Markov source or not, by verifying whether

    d(X, Y) = O(1 / log m) → 0, or
    d(X, Y) = 1 − κ + O(log log m / log m) > 0,

respectively, when the length of X and Y is equal to m.
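The discriminant of Eq. (3) is then a one-line computation on top of J(X, Y). A minimal sketch under our own naming, with brute-force factor counting standing in for Suffix Trees; the empty string is counted as a common factor so that J ≥ 1 and the logarithm is always defined:

```java
import java.util.HashSet;
import java.util.Set;

public class Discriminant {

    // I(s): all distinct factors of s, including the empty string.
    static Set<String> factors(String s) {
        Set<String> f = new HashSet<>();
        f.add("");
        for (int i = 0; i < s.length(); i++)
            for (int j = i + 1; j <= s.length(); j++)
                f.add(s.substring(i, j));
        return f;
    }

    // J(x, y) = |I(x) ∩ I(y)|, by brute-force enumeration.
    static int jc(String x, String y) {
        Set<String> common = factors(x);
        common.retainAll(factors(y));
        return common.size();
    }

    // d(X, Y) = 1 - log J(X, Y) / log m  (Eq. 3), for |X| = |Y| = m.
    static double d(String x, String y) {
        int m = x.length();
        return 1.0 - Math.log(jc(x, y)) / Math.log(m);
    }

    public static void main(String[] args) {
        // No common non-empty factor: J = 1, so d = 1 (maximally dissimilar).
        System.out.println(d("aaaa", "bbbb")); // 1.0
        // Identical strings: J is large, so d is small (here negative).
        System.out.println(d("abab", "abab"));
    }
}
```

Dissimilar sources push d toward 1 − κ > 0, while texts from the same source drive it toward 0, which is the decision rule stated above.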
In [JMS13, MJ14] we show experimental evidence of the usage of our discriminant function on real and simulated texts of different Markov orders. We compare the J-complexity of a simulated English text with texts of the same length generated in French, Polish, Greek and Finnish by a third-order Markov model, against our theoretical results. Even for texts shorter than a thousand characters, one can easily discriminate between these languages. Therefore, Joint Complexity is an efficient method to capture the similarity degree of short texts.

3 Snow Data Challenge Dataset

For the Social Sensor dataset of the Snow Data Challenge, we collected tweets for 24 hours, between Tuesday Feb. 25, 18:00 and Wednesday Feb. 26, 18:00 (GMT). The crawl yielded over 1,041,062 tweets between the Unix timestamps 1393351200000 and 1393437600000, and was constructed through the use of the Twitter Streaming API, by following 556,295 users and also looking for four specific keywords: Syria, terror, Ukraine and bitcoin. The dataset used for the experiments consists of 96 timeslots, where each timeslot contains the tweets of a 15-minute interval, starting at 18:00 on Tuesday Feb. 25, 2014. The challenge then consisted in providing a minimum of one and a maximum of ten different topics per timeslot, along with a headline, a set of keywords and the URL of a relevant image for each detected topic. The test dataset activity and the statistics of the dataset crawl are described more extensively in [PCA].

4 Topic Detection Method

Until now, the main methods used for text classification have been based on keyword detection and Machine Learning techniques. Using keywords in tweets has several drawbacks: words may be misspelled or used in distorted ways, lists of stop-words must be built for every language, and tweets often contain implicit references to previous texts or messages. Machine Learning techniques are generally heavy and complex, and therefore may not be good candidates for real-time text processing, especially in the case of Twitter, where we face natural language and thousands of tweets per second. Furthermore, Machine Learning processes have to be manually initiated by tuning parameters, which is a major drawback for this kind of application, where we want minimal, if any, human intervention. Other methods use information extracted by visiting the URLs contained in the text, which makes them heavyweight, since one may have limited or no access to that information, e.g. because of access rights or data size/rate.
In our method we analyze the messages (tweets) with Suffix Trees [THP04] and use the Joint Complexity as a metric to quantify the similarity between these messages.
For the dataset described in Section 3 and in [PCA] we have N = 96 timeslots, indexed n = 1...N. For every tweet t_i^n, where i = 1...M, with M being the total number of tweets in the n-th timeslot, we build a Suffix Tree, ST(t_i^n), as described in Section 2. Building a Suffix Tree is an operation that costs O(m log m) time and O(m) space in memory, where m is the length of the tweet.
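The per-tweet structure can be sketched with a naive suffix trie (a hypothetical class of ours: the construction below is O(m²), not the near-linear construction used in the paper, but it illustrates why the tree encodes the factor set, since each non-root node corresponds to exactly one distinct non-empty factor):

```java
import java.util.HashMap;
import java.util.Map;

public class SuffixTrie {
    final Map<Character, SuffixTrie> children = new HashMap<>();

    // Insert every suffix of the text character by character:
    // a naive O(m^2) construction, for illustration only.
    static SuffixTrie build(String text) {
        SuffixTrie root = new SuffixTrie();
        for (int i = 0; i < text.length(); i++) {
            SuffixTrie node = root;
            for (int j = i; j < text.length(); j++)
                node = node.children.computeIfAbsent(
                        text.charAt(j), c -> new SuffixTrie());
        }
        return root;
    }

    // Number of nodes below this one; every non-root node
    // represents one distinct non-empty factor of the text.
    int countNodes() {
        int n = 0;
        for (SuffixTrie child : children.values())
            n += 1 + child.countNodes();
        return n;
    }

    public static void main(String[] args) {
        SuffixTrie t = build("apple");
        // 14 distinct non-empty factors, 15 with the empty string:
        System.out.println(t.countNodes() + 1);
    }
}
```

Superposing two such trees and counting the shared nodes yields the Joint Complexity of the two texts.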
Then we compute the Joint Complexity metric JC(t_i^n, t_j^n) of the tweet t_i^n with every other tweet t_j^n of the n-th timeslot, where j = 1...M and j ≠ i (by convention we set JC(t_i^n, t_i^n) = 0). The Joint Complexity between two tweets is the number of their common factors, as defined in language theory, and can be computed efficiently in O(m) operations (sublinear on average) by Suffix Tree superposition. For the N timeslots we store the results of the computation in the matrices T_1, T_2, ..., T_N of dimension M × M.
We represent each timeslot by a fully-connected weighted graph. Each tweet is a node in the graph and the two-dimensional array T_n holds the weight of each edge, as shown in Fig. 2. We then calculate the score of each node in the graph by summing the weights of all the edges connected to it. The node with the highest score corresponds to the most representative and central tweet of the timeslot.

[Figure 2: Representation of the first row of the n-th timeslot via a weighted graph.]

          [ 0               JC(t_1^n,t_2^n)  JC(t_1^n,t_3^n)  ...  JC(t_1^n,t_M^n) ]
          [ JC(t_2^n,t_1^n)  0               JC(t_2^n,t_3^n)  ...  JC(t_2^n,t_M^n) ]
    T_n = [ JC(t_3^n,t_1^n)  JC(t_3^n,t_2^n)  0               ...  JC(t_3^n,t_M^n) ]
          [ ...              ...                              ...  ...             ]
          [ JC(t_M^n,t_1^n)  JC(t_M^n,t_2^n)  JC(t_M^n,t_3^n) ...  0               ]

Most of the timeslots have M ≈ 5,000 tweets, so the matrices T_1, T_2, ..., T_N have approximately 25,000,000 entries each. Since they are symmetric, only half of these entries need to be stored, i.e. the upper (or lower) triangular part of each matrix, as shown below, which saves space and makes the structure more efficient.

                     [ JC(t_1^n,t_2^n)  JC(t_1^n,t_3^n)  ...  JC(t_1^n,t_M^n)     ]
    T_n^(up triang) = [                  JC(t_2^n,t_3^n)  ...  JC(t_2^n,t_M^n)     ]
                     [                                   ...  ...                 ]
                     [                                        JC(t_{M-1}^n,t_M^n) ]

The actual implementation of the upper triangular data structure is based on the efficient TIntArrayList data structure provided by the Trove4j library (it uses arrays of int instead of Java objects), encapsulated in a class which provides methods to access individual elements by their row and column index, or a whole row by its index. Algorithm 1 illustrates the getRow() method of this class.

Algorithm 1 Method to retrieve row i of an upper triangular matrix
  // data is the internal TIntArrayList object
  int[] row = new int[data.length + 1];
  // read column i from the rows above:
  for k = 0 to i − 1 do
      row[k] ← data[k].get(i − k − 1);
  end for
  row[i] ← 0; // by convention, not computed
  if i < data.length − 1 then
      // read the i-th row until the end:
      for j = 0 to data.length − i do
          row[i + j + 1] ← data[i].get(j);
      end for
  else
      do nothing; // the last tweet does not have a row
  end if
  return row;

The whole cross-Joint-Complexity computation was run in a multithreaded way on a 24-processor machine: k = 23 threads are started and each thread works on a disjoint set of rows. Note that because the rows of the triangular matrix do not all have the same number of elements, the rows should not be divided evenly among the threads, otherwise the first threads would have much more work to do than the last ones (remember that the input is an upper triangular matrix). Instead, since the total number of non-zero elements of an n × n upper triangular matrix is n(n−1)/2, each thread should work on approximately n(n−1)/(2k) elements, so the implementation kept assigning rows to a thread until that number of elements was reached.
This implementation allowed the program to compute a 15-minute timeslot in 95 seconds on average.
These computations were only run once, as soon as the data was properly divided into 15-minute timeslots, and the results were saved in files which were subsequently used to perform the rest of the implementation.
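Algorithm 1 can be rendered in plain Java roughly as follows (a sketch with our own class name, plain int[] rows in place of Trove4j's TIntArrayList, and loop bounds adjusted to 0-based indexing):

```java
public class UpperTriangular {
    // data[i] holds row i of the upper triangle: entries (i, i+1) .. (i, M-1).
    // For M tweets there are M-1 stored rows, so data.length == M - 1.
    final int[][] data;

    UpperTriangular(int[][] data) { this.data = data; }

    // Reassemble full row i of the symmetric M x M matrix (Algorithm 1).
    int[] getRow(int i) {
        int[] row = new int[data.length + 1]; // M entries
        // read column i from the rows above:
        for (int k = 0; k < i; k++)
            row[k] = data[k][i - k - 1];
        row[i] = 0; // by convention JC(t_i, t_i) = 0
        if (i < data.length) {
            // read stored row i until the end:
            for (int j = 0; j < data[i].length; j++)
                row[i + j + 1] = data[i][j];
        } // else: the last tweet has no stored row
        return row;
    }

    public static void main(String[] args) {
        // Upper triangle of a 4x4 JC matrix:
        // (0,1)=5 (0,2)=7 (0,3)=2 ; (1,2)=4 (1,3)=9 ; (2,3)=6
        UpperTriangular t = new UpperTriangular(new int[][] {
            {5, 7, 2}, {4, 9}, {6}
        });
        System.out.println(java.util.Arrays.toString(t.getRow(1))); // [5, 0, 4, 9]
    }
}
```

getRow(i) rebuilds full row i from column i of the rows above it plus stored row i itself, which is all the row-sum scoring described below needs.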
Once we have the Joint Complexity scores, we look for the R most representative and central tweets of every timeslot. First we compute the sum of the Joint Complexity of the i-th tweet with every other tweet j in the n-th timeslot,

    S_i^n = Σ_{j=1...M, j≠i} JC(t_i^n, t_j^n),

and we collect the vector S^n = [S_1^n, S_2^n, ..., S_M^n] for every timeslot.
We sort the elements of each vector S^n in descending order and select the R most representative and central tweets in the following way: the best-ranked tweet is chosen unconditionally; the second one is picked only if its JC score with the first one is below a chosen threshold Thr_low, otherwise it is added to the list of related tweets of the first tweet; similarly, the third one is picked only if its JC score with the first two is below Thr_low, and so on. This ensures that the topics are dissimilar enough, and it classifies the best-ranked tweets into topics at the same time.
Then, by removing punctuation, special characters, etc. from each selected tweet, we construct the headline of each topic, and we run through the list of related tweets to keep only the tweets that are different enough from the selected one (we do not want duplicates); we do so by keeping only the tweets whose JC score with the selected tweet and all previously kept related tweets is above a chosen threshold Thr_max. We first chose the empirical values 400 and 600 for Thr_low and Thr_max respectively; then, a few hours before submission time, we noticed that many topics had only one related tweet (all the others were retweets), so we decided to lower the first threshold to Thr_low = 240, but we did not have time to recompute the results on the whole dataset, so only a handful of timeslots benefited from this better value. The final plan is to come up with a formula that lets the system determine these thresholds automatically, depending on the number of characters of each tweet.
While running through the list of related tweets we build the bag of words used to construct the list of keywords, and we also check the original json data to find a URL pointing to a valid image related to the topic.
We chose to print the top 8 topics of each timeslot.

4.1 Keywords

From the bag of words constructed from the list of related tweets, we remove articles (stop-words), punctuation, special characters, etc.; we obtain a list of words, which we order by decreasing frequency of occurrence. Finally, we report the k most frequent words in a list of keywords K = [K_{1...k}^1, K_{1...k}^2, ..., K_{1...k}^N], for the N timeslots.

4.2 Media URLs

The body of a tweet (in the .json file format) contains URL information for links to media files such as pictures or videos; when available, this information is stored in the following subsection of the json object: entities → media → media_url.
While reporting the most representative and central tweets, we scan the original json format in order to retrieve such a URL, from the most representative tweet or any of its related tweets, pointing to valid photos or pictures in .jpg, .png or .gif format. Then we report these pictures along with the headlines and the sets of keywords, as shown in Algorithm 2.
Almost half of the headlines (47%) produced by our method had an image retrieved from the original tweet. When no such URL was provided within the collected json objects, we had planned to visit the URLs of the linked websites and retrieve images from there, but we did not have sufficient time to implement this in a way that enforced that the retrieved image was indeed relevant to the topic. It is important to point out that the goal of the challenge was to detect newsworthy items before they hit mainstream news websites, so it was decided that parsing images from such websites was not interesting in that context.

5 Evaluation

Twitter is one of the most popular social networks and micro-blogging services in the world, with more than 540 million users connected by 24 billion links [GL12]. It allows information to spread between people, groups and advertisers, and since the relations between its users are unidirectional, information propagation within the network is similar to the way information propagates in real life.
Apart from the specific implementation for the Snow Data Challenge, the main benefits of our method are that we can both classify the messages and identify the growing trends in real time, without having to manually build lists of keywords for every language. We can track the information and its timeline within a social network and find groups of users that agree or have the same interests, i.e. perform trend sensing.
The official evaluation results of our method in the Data Challenge are included in [PCA]; therefore here we only discuss a quick comparison of our method with simply counting the number of retweets within a chosen timeslot, namely Feb.
Algorithm 2 Topic detection based on Joint Complexity
  // N = # timeslots, M = # tweets in the n-th timeslot
  for n = 1 to N do
     for i = 1 to M do
         t ← t_json.getText();
         t_ST ← suffixTreeConstruction(t);
         JCScores ← JCMetric();
     end for
     // Find the most representative & central tweets
     S^n ← sum(JCScores);
     // Get headlines for the central tweets
     R^n ← descendingOrder(S^n);
     // Get set of keywords
     K^n ← keywords(R^n);
     // Get URLs of pictures from the .json file
     P^n ← mediaURL(R^n);
     // Print the results in appropriate format
     Print(R^n);
     Print(K^n);
     Print(P^n);
  end for

25th from 8:15 pm until 8:30 pm, comprising just under 16,000 tweets, which is about 50% more than the average number of tweets per timeslot over the whole 24-hour period.
The ranking produced by our method for the timeslot 25-02-14 20:15 is listed below:

(JC1) BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans.

(JC2) WhiteHouse: "I'm here to announce that we're building Iron Man | Not really. Maybe. It's classified." President Obama ActOnJobs

(JC3) Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...

(JC4) Karinabarbosa: civicua: Free Venezuela ! Ukraine is with you! euromaidan Photo by Liubov Yeremicheva

(JC6) BBCSport: Olympiakos take the lead vs Manchester United, a hideous deflection off Alejandro Dominguez beats De Gea.

(JC7) WhatTheBit: It's going to be pretty damn near impossible to top this as photo of the week.

(JC8) Al Qaeda branch in Syria issues ultimatum to splinter group: The head of an al Qaeda-inspired militia fighting...

A small program was written to take the number of retweets observed at the first occurrence of a retweet and subtract it from the number of retweets observed at the last occurrence of the same retweet. All tweets were then sorted in descending order of the number of retweets obtained, and the result is shown below:

(RT1) RT @OllieHolt22: Ashley Young berating someone for going down too easily - that's funny

(RT2) RT @WhiteHouse: "I'm here to announce that we're building Iron ManNot really. Maybe. It's classified." @President Obama #ActOnJobs

(RT3) RT @BarackObama: LIVE: President Obama is announcing a new plan to boost job growth for middle-class Americans. http://t.co/d6x1Bous1S

(RT4) RT @TheEllenShow: I'm very excited @Pharrell's performing his big hit "Happy" at the #Oscars. Spolier alert: I'll be hiding in his hat.

(RT5) RT @Madonna: Fight for the right to be free! Fight Fascism and discrimination everywhere! Free Venezuela the Ukraine...

(RT6) RT @civicua: Free #Venezuela ! #Ukraine is with you! #euromaidan Photo by Liubov Yeremicheva

(RT7) RT @russellcrowe: Holy Father @Pontifex , it would be my deepest pleasure to bring the @DarrenAronofsky film to you to screen. [...]

(RT8) RT @DjKingAssassin: $myry recovering on #bitcoin market update on #mtgox

The following differences can be observed between the two rankings:

• The best-ranked retweet does not appear in the list produced by our method! In fact a plausible explanation for RT 1 and RT 8 not being picked
(JC5) TheEllenShow: I’m very excited Pharrell’s              up by our method is a weak centrality due to a
performing his big hit "Happy" at the Oscars.                shorter size of text, as briefly explained in Sec-
Spolier alert: I’ll be hiding in his hat.                    tion 6. For RT 7 however the size of the text is not
    a factor, and the weak centrality is most probably
    explained by the unlikely co-occurrence of terms
    within the sentence. Although this is highly sub-
    jective, it is worth noting that those tweets do not
    appear to be very newsworthy.

  • Five out of eight items produced by our method
    (JC1 to JC5) are listed in the retweet ranking;
    this is rather expected, as there is a mechanical
    link between our method and the number of oc-
    currences of the same text.

  • The eighth item produced by our method (JC8)
    seems to be very newsworthy but appears only
    past the 100th position in the retweet count
    method. Again, although this is subjective, it ap-
    pears to be an interesting result.

   Finally, although the dataset that was used for this
challenge did not allow us to show this properly, the au-
thors strongly believe that one key advantage of using
Joint Complexity is that it can deal with languages
other than English [JMS13, MJ14] without requiring
any additional human intervention.

6     Conclusion and Future Work

In this short paper we presented an implementation
of a topic detection method applied to a dataset of
tweets emitted during a 24-hour period. The
implementation relies heavily on the concept of Joint
String Complexity, which has the benefit of being
language agnostic and of not requiring humans to deal
with lists of keywords. The results obtained seemed
satisfactory, and a final evaluation in the context of
the SNOW Data Challenge 2014 can be found in
[PCA].

   As the timeframe available to participate in the
challenge was quite short, a (probably non-exhaustive)
list of items has been identified as possible future
work:

  • As mentioned in Section 4, although the current
    empirical values chosen as thresholds to determine
    whether two texts are similar and whether two
    texts are near-duplicates seemed to be quite satis-
    factory on the dataset that our method was tested
    against, it would be more useful to compute val-
    ues that adjust to the size of the texts.

  • The current implementation systematically picks
    the top 8 topics from each timeslot, whereas a bet-
    ter approach might be to identify a threshold un-
    der which topics are considered not significant
    enough, thus allowing some not very active
    timeslots to contain fewer than 8 topics while other
    very active timeslots may contain more than 8.

  • Dividing the timeslots every quarter of an hour is
    arbitrary, as some topics may be cut in half if they
    started after the middle of the previous timeslot
    and may therefore not acquire enough significance
    in the current timeslot. Before aggregating the
    tweets into 15-minute timeslots, our implementa-
    tion first divided the data into 5-minute timeslots,
    and we considered building 20-minute timeslots
    (each time including the 5 minutes preceding the
    official 15-minute slot) to remedy the above is-
    sue; checking whether this provides significant
    improvements is left as future work.

  • The current implementation, which sorts the top-
    ics by finding the most central tweets, is a bit
    unfair to shorter tweets, because longer tweets
    mechanically have a better chance of obtaining a
    higher centrality score (Joint Complexity counts
    the number of common factors); a future work
    should be to mitigate this. One idea may be to
    modify the Joint Complexity metric so that it
    behaves like a distance and to perform clustering
    based on this distance.

Acknowledgment

This work is supported by the SocialSensor and RE-
VEAL FP7 projects under contract numbers 287975
and 610928, respectively. Dimitris Milioris would like
to thank INRIA, École Polytechnique ParisTech and
LINCS.

References

[APM+13] L. Aiello, G. Petkos, C. Martin, D. Corney,
          S. Papadopoulos, R. Skraba, A. Goker,
          I. Kompatsiaris, and A. Jaimes. Sensing
          trending topics in Twitter. IEEE Transac-
          tions on Multimedia, 15(6):1268–1282,
          October 2013.

[BH11]    V. Becher and P. A. Heiber. A better com-
          plexity of finite sequences. In 8th Int. Conf.
          on Computability and Complexity in Anal-
          ysis and 6th Int. Conf. on Computability,
          Complexity, and Randomness, page 7,
          2011.

[GL12]    M. Gabielkov and A. Legout. The com-
          plete picture of the Twitter social graph. In
          ACM CoNEXT Student Workshop, 2012.

[IYZ02]   L. Ilie, S. Yu, and K. Zhang. Repetition
          complexity of words. pages 320–329, 2002.

[Jac07]   P. Jacquet. Common words between two
          random strings. In IEEE International
          Symposium on Information Theory, Nice,
          France, 2007.
[JLS04]   S. Janson, S. Lonardi, and W. Sz-
          pankowski. On average sequence complex-
          ity. pages 213–227, 2004.

[JMS13]   P. Jacquet, D. Milioris, and W. Sz-
          pankowski. Classification of Markov
          sources through joint string complexity:
          Theory and experiments. In IEEE Inter-
          national Symposium on Information
          Theory, 2013.

[LV93]    M. Li and P. Vitanyi. An Introduction to
          Kolmogorov Complexity and Its Applica-
          tions. Springer-Verlag, Berlin, 1993.

[MJ14]    D. Milioris and P. Jacquet. Joint sequence
          complexity analysis: Application to social
          networks information flow. Bell Labs
          Technical Journal, 18(4), 2014. Issue
          on Data Analytics.

[Nie99]   H. Niederreiter. Some computable
          complexity measures for binary sequences.
          In C. Ding, T. Helleseth, and H. Nieder-
          reiter, editors, Sequences and their Appli-
          cations, Discrete Mathematics and The-
          oretical Computer Science, pages 67–78.
          Springer London, 1999.

[PCA]     S. Papadopoulos, D. Corney, and L. Aiello.
          SNOW 2014 data challenge: Assessing the
          performance of news topic detection meth-
          ods in social media. In Proceedings of the
          SNOW 2014 Data Challenge.

[THP04]   S. Tata, R. Hankins, and J. Patel. Practi-
          cal suffix tree construction. In Proceedings
          of the 30th VLDB Conference, 2004.

[Ziv88]   J. Ziv. On classification with empirically
          observed statistics and universal data com-
          pression. IEEE Transactions on Informa-
          tion Theory, 34:278–286, 1988.
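The per-timeslot ranking loop shown in the pseudocode earlier in this paper can be illustrated with a short, self-contained sketch. This is not the authors' implementation: it computes Joint Complexity by brute force as the number of distinct common factors (substrings) of two tweets, whereas the implementation described in the paper uses suffix trees [THP04] for efficiency. The helper names (`factors`, `joint_complexity`, `rank_central`) and the sample tweets are illustrative assumptions.

```python
from itertools import combinations

def factors(s):
    """Distinct non-empty substrings (factors) of s.
    Brute force for clarity; a suffix tree would give these efficiently."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(a, b):
    """Joint Complexity of two strings: count of distinct common factors."""
    return len(factors(a) & factors(b))

def rank_central(tweets):
    """Score each tweet by its summed JC against all other tweets in the
    timeslot (its centrality), then return indices in descending order."""
    scores = [0] * len(tweets)
    for i, j in combinations(range(len(tweets)), 2):
        jc = joint_complexity(tweets[i], tweets[j])
        scores[i] += jc
        scores[j] += jc
    return sorted(range(len(tweets)), key=lambda i: -scores[i])

# Toy timeslot: the first two tweets share many factors, the third few.
tweets = ["obama announces jobs plan",
          "president obama jobs plan live",
          "free venezuela ukraine"]
print(rank_central(tweets))  # indices, most central tweet first
```

Note that the brute-force `factors` set is quadratic in tweet length, which is acceptable for 140-character tweets but motivates the suffix-tree construction used in practice; it also makes visible the length bias discussed in the future-work list, since longer tweets simply contain more factors.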