<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Microblog Contextualization using Continuous Space Vectors: Multi-Sentence Compression of Cultural Documents</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Elvys</forename><forename type="middle">Linhares</forename><surname>Pontes</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">LIA</orgName>
								<orgName type="institution">Université d&apos;Avignon et des Pays de Vaucluse</orgName>
								<address>
									<settlement>Avignon</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stéphane</forename><surname>Huet</surname></persName>
							<email>stephane.huet@univ-avignon.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">LIA</orgName>
								<orgName type="institution">Université d&apos;Avignon et des Pays de Vaucluse</orgName>
								<address>
									<settlement>Avignon</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Juan-Manuel</forename><surname>Torres-Moreno</surname></persName>
							<email>juan-manuel.torres@univ-avignon.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">LIA</orgName>
								<orgName type="institution">Université d&apos;Avignon et des Pays de Vaucluse</orgName>
								<address>
									<settlement>Avignon</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">École Polytechnique de Montréal</orgName>
								<address>
									<settlement>Montréal</settlement>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carneiro</forename><surname>Linhares</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Universidade Federal do Ceará</orgName>
								<address>
									<settlement>Sobral-CE</settlement>
									<country key="BR">Brasil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="institution">Univer-sité d&apos;Avignon et des Pays de Vaucluse (France)</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Microblog Contextualization using Continuous Space Vectors: Multi-Sentence Compression of Cultural Documents</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">25BA5A5BC3CC3E8249044C2B87A38EE1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Microblog Contextualization</term>
					<term>Multi-Sentence Compression</term>
					<term>Word Embedding</term>
					<term>Wikipedia</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our work for the MC2 CLEF 2017 lab. We participated in the content analysis task that involves filtering, language recognition and summarization. We combine Information Retrieval with Multi-Sentence Compression methods to contextualize microblogs using Wikipedia's pages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Many newspapers use microblogs (Twitter, Facebook, Instagram, etc.) to disseminate news quickly. These microblogs have a limited length (e.g. a tweet is limited to 140 characters) and contain few information about an event. Therefore, it is complicated to describe an event completely in a single microblog. A way to overcome this problem is to get more information from another source to better explain the microblog.</p><p>Several studies on tweet contextualization have been done on this topic. To just name a few, Liu et al. introduced a graph-based multi-tweet summarization system <ref type="bibr" target="#b7">[8]</ref>. This graph integrates the functionalities of social networks, solving partially the lack of information contained in tweets. Chakrabarti and Punera used a Hidden Markov Model in order to model the temporal events of sets of tweets <ref type="bibr" target="#b3">[4]</ref>. Linhares Pontes et al. used Word Embedding to reduce the vocabulary size and to improve the results of Automatic Text Summarization (ATS) systems <ref type="bibr" target="#b6">[7]</ref>. MC2 CLEF 2017 <ref type="bibr" target="#b0">[1]</ref> lab analyzes the context and the social impact of of a microblog at large. This lab is composed of three main tasks: Content Analysis, Microblog Search and Time Line Illustration. We participated in the Content Analysis task that involves classification, filtering, language recognition, localization, entity extraction, linking open data, and summarization of Wikipedia's pages and microblogs. Specifically, we worked on the following subtasks: filtering, language recognition and automatic summarization.</p><p>The filtering subtask analyzes whether a tweet describes an existing festival or not (values are between 0 and 1, 1 for the positive case and 0 otherwise). The language recognition subtask consists in identifying the language of a microblog. Finally, the summarization task is to generate a summary (maximum of 120 words) in four languages (English, French, Portuguese and Spanish) of Wikipedia's pages describing a microblog.</p><p>This paper is organized as follows. In Section 2 we describe the architecture of our system to solve the tasks of MC2 CLEF 2017 lab. Then, we present the process of document retrieval on Wikipedia and the summarization system in Sections 3 and 4, respectively. Finally, we conclude in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Architecture</head><p>The CLEF's organizers selected a set of microblogs (tweets) with the keyword "festival" to be contextualized by the participants using four versions of Wikipedia (English, French, Portuguese, and Spanish).</p><p>For the language identification task, we pre-processed microblogs to remove all punctuation and emoticons. Then, we use the library langdetect <ref type="bibr" target="#b8">[9]</ref> to detect the languages of microblogs.</p><p>For filtering and summarization tasks, we divided our system in two parts (Fig. <ref type="figure" target="#fig_0">1</ref>). The first part aims at retrieving the Wikipedia's pages that best describes the festival mentioned in a microblog (Section 3). We scored the Wikipedia's pages according to their relevance with respect to a microblog, which corresponds to the filtering subtask.</p><p>The second part of our system analyzes the 3 best scored pages and creates clusters of similar sentences with relevant information. Then, we use an Automatic Text Compression (ATC) system (Section 4) to compress the clusters and to generate summaries in four languages describing the festival mentioned in a microblog (summarization subtask). The algorithm 1 describes how our method analyzes the microblog, selects the 3 best Wikipedia's pages and generates the summaries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Wikipedia's Document Retrieval</head><p>The set of CLEF's microblogs is composed of tweets in different languages related to festivals in all the world. Wikipedia provides a more thorough description of a given festival according to the selected language (e.g. The festival of Avignon  is better described in the French Wikipedia). We independently analyze the four versions of Wikipedia (en, es, fr, and pt) for each microblog, repeating the whole process to first retrieve the best Wikipedia's pages and then to summarize the pages for the four versions of Wikipedia.</p><p>Our system retrieves the most related Wikipedia's pages to a microblog using a method similar to our previous work <ref type="bibr" target="#b6">[7]</ref>. We assume that the hashtags and usernames represent the keywords of a tweet, and are independent of the language <ref type="foot" target="#foot_0">4</ref> . From hashtags, usernames, and the plain text (i.e. the tweet without hashtags, usernames and punctuation), we create Indry queries to retrieve 50 Wikipedia's documents per each tweet. For each of these documents, we analyze the title and the summary in relation to the tweet's elements (hashtag, username and word). Normally, the title of the Wikipedia's document has few words and contains the core information, while the summary of the document, which is made of the first paragraphs of the article before the start of the first section, is larger and provide more information <ref type="foot" target="#foot_1">5</ref> . Therefore, we consider Equation <ref type="formula">4</ref>to compute the relevance score of the Wikipedia's document D with respect to the microblog T .</p><formula xml:id="formula_0">score title = α 1 × sim(ht, title) + α 2 × sim(un, title) + α 3 × sim(nw, title) (1) score sum = β 1 × sim(ht, sum) + β 2 × sim(un, sum) + β 3 × sim(nw, sum) (2) sim(x, y) = γ 1 × cosine(x, y) + γ 2 × occur(x, y)<label>(3)</label></formula><p>score doc = score title + score summary <ref type="bibr" target="#b3">(4)</ref> where ht are the hashtags of the tweet T , un the usernames of T , nw the normal words of T , and sum the summary of D. occur(x, y) represents the number of occurrences of x in y, while cosine(x, y) is the cosine similarity between x and y using Continuous Space Vectors<ref type="foot" target="#foot_2">6</ref>  <ref type="bibr" target="#b2">[3]</ref>. We set up empirically the parameters as follows: α 1 = α 2 = 0.1, α 3 = 0.01, β 1 = β 2 = 0.05, β 3 = 0.005, γ 1 = 1 and γ 2 = 0.5 . These coefficients give more weights to hashtags than usernames and the tweet text and compensate the shorter length of Wikipedia's article titles with respect to their summary. For each tweet, we finally keep in each language the 3 Wikipedia's documents with the highest scores to be analyzed by the ATC system.</p><p>The summary provided at the start of Wikipedia's pages is assumed to be good enough to be coherent and to provide the basic information. However, relying only of this part of the article may lead to miss relevant information about the festival that could be obtained from other sections or even pages in Wikipedia. For this reason, we preferred to use the summary of the top article as a basic abstract and to improve its quality with relevant information using Multi-Sentences Compression (MSC) (i.e. generate sentences that are shorter and more informative than the original sentences of the summary). Therefore, we consider the sentences of the summary of the best scored page as key sentences. Then for each of these sentences, we create a cluster made of the sentences of the complete 3 retrieved Wikipedia's pages which are similar; to do this, the cosine similarity is used as metrics and we empirically set up a threshold of 0.4 to consider two sentences as similar.</p><p>Then, for each cluster MSC generates a shorter and hopefully more informative compression (Section 4.1). Next, we generate the summary concatenating the most similar compression to the microblog <ref type="foot" target="#foot_3">7</ref> .</p><p>Some language versions of Wikipedia do not have a page or they have a small description describing an specific festival. Therefore, we analyzed the summaries of each microblog obtained in the four studied languages and only retain the abstract with contains the best description of the microblog, which is estimated through the similarity between each summary and the microblog. So, we used the Yandex library<ref type="foot" target="#foot_4">8</ref> to translate the kept summary to others languages (en, es, fr, and pt).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Word Graph and Optimization</head><p>Our MSC system adopts the approach proposed by Filippova <ref type="bibr" target="#b4">[5]</ref> to model a document D as a Word Graph (WG), where the vertices represent the words and arcs represent the cohesion of the words (more details in <ref type="bibr" target="#b5">[6]</ref>). The weights of the arcs represent the level of cohesion between the words of two vertices based on the frequency and the position of these words in the sentences (Equation <ref type="formula" target="#formula_1">5</ref>).</p><p>w(e i,j ) = cohesion(e i,j ) f req(i) × f req(j) ,</p><p>cohesion(e i,j ) = f req(i) + f req(j)</p><formula xml:id="formula_2">f ∈D dist(f, i, j) −1 ,<label>(6)</label></formula><formula xml:id="formula_3">dist(f, i, j) = pos(f, i) − pos(f, j), if pos(f, i) &lt; pos(f, j) 0, otherwise<label>(7)</label></formula><p>In a previous study, we proposed to extend this approach with the analysis of the keywords and the 3-grams of the document to generate a more informative MSC. Since each cluster to compress is composed of similar sentences, we consider that there is only one topic; the Latent Dirichlet Allocation (LDA) method is used to identify the keywords of this topic <ref type="bibr" target="#b1">[2]</ref>.</p><p>From the weight of 2-grams (Equation <ref type="formula" target="#formula_1">5</ref>), the relevance of a 3-gram is based on the relevance of the two 2-grams, as described in Equation 8:</p><formula xml:id="formula_4">3-gram(i, j, k) = qt 3 (i, j, k) max a,b,c∈GP qt 3 (a, b, c) × w(e i,j ) + w(e j,k ) 2 ,<label>(8)</label></formula><p>In order to generate a better compression, the objective function expressed in Equation 9 is minimized in order to improve the informativeness and the grammaticality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Minimize α</head><formula xml:id="formula_5">(i,j)∈A b i,j • x i,j − β k∈K c k • w k − γ t∈T d k • z t<label>(9)</label></formula><p>where x ij indicates the existence of the arc (i, j) in the solution, w(i, j) is the cohesion of the words i and j (Equation <ref type="formula" target="#formula_1">5</ref>), z t indicates the existence of the 3-gram t in the solution, d t is the relevance of the 3-gram t (Equation <ref type="formula" target="#formula_4">8</ref>), c k indicates the existence of a word with color (keyword) k in the solution and β is the geometric average of the arc weights in the graph (more details in <ref type="bibr" target="#b5">[6]</ref>). Finally, they calculate the 50 best solutions according to the objective (9) and we select the sentence with the lowest final score (Equation <ref type="formula" target="#formula_6">10</ref>) as the best compression.</p><formula xml:id="formula_6">score norm (f ) = e scoreopt(f ) ||f || ,<label>(10)</label></formula><p>where score opt (f ) is the value of the path to generate the compression f from Equation <ref type="formula" target="#formula_5">9</ref>. As Linhares Pontes et al. <ref type="bibr" target="#b5">[6]</ref>, we set up the parameters to α = 1.0, β = 0.9 and γ = 0.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we presented our contributions to the MC2 CLEF 2017 lab in the Content Analysis task. We considered different scores for each microblog element (hashtags, arrobases, and text) to retrieve in four languages (en, es, fr, and pt) the Wikipedia's pages most related to a microblog. Then, we generated summaries using MSC from clusters initially made of the abstract of the top retrieved article and extended with similar sentences from the 3 top retrieved articles per language. Finally, we analyzed summaries of each microblog obtained in the four languages to select the one most similar to the microblog; the kept summary is translated to other languages (en, es, fr and pt).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Our system architecture to contextualize the microblogs.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Generate the summary (lang language) with the compressed sentences that are the most similar to the tweet end for Select the best version of the summaries (most similar to the tweet) Translate the best summary version with Yandex translator to other languages end for</figDesc><table><row><cell>do</cell></row><row><cell>Create the clusters of similar sentences analyzing the 3 highest scored pages</cell></row><row><cell>end for</cell></row><row><cell>for each cluster do</cell></row><row><cell>Create the Word Graph (Section 4.1)</cell></row><row><cell>Generate compressed sentences (Section 4.1)</cell></row><row><cell>end for</cell></row></table><note>Algorithm 1 Automatic Summarization for each tweet do for lang in English, French, Portuguese and Spanish do Analyze the 3 lang Wikipedia's pages with the highest scores (Equation 4) using LEMUR system (lang version of Wikipedia) for each sentence of the abstract of the first Wikipedia's page (highest score)</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">The langdetect library is only used for the language recognition subtask.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">We did not consider the whole text of Wikipedia's page because it is sometimes huge and we preferred to rely on the work of the contributors to build the summary of the article.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">We used the pre-trained word embeddings (en, es, fr, and pt) of FastText system<ref type="bibr" target="#b2">[3]</ref> that is available in https://github.com/facebookresearch/fastText/blob/ master/pretrained-vectors.md.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">The summary is composed of the first 120 words.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">https://tech.yandex.com/translate/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">10456</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
		<ptr target="http://dl.acm.org/citation.cfm?id=944919.944937" />
	</analytic>
	<monogr>
		<title level="j">Journal Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003-03">Mar 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.04606</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Event Summarization using Tweets</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Punera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">5th AAAI International Conference on Weblogs and Social Media (ICWSM). Association for the Advancement of Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Multi-sentence compression: Finding shortest paths in word graphs</title>
		<author>
			<persName><forename type="first">K</forename><surname>Filippova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">COLING</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="322" to="330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Linhares Pontes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Da Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Linhares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Torres-Moreno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huet</surname></persName>
		</author>
		<title level="m">Métodos de Otimização Combinatória Aplicados ao Problema de Compressão MultiFrases</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Tweet contextualization using continuous space vectors: Automatic summarization of cultural documents</title>
		<author>
			<persName><forename type="first">E</forename><surname>Linhares Pontes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Torres-Moreno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Linhares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF Workshop on Cultural Microblog Contextualization</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Graph-Based Multi-Tweet Summarization using Social Signals</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">COLING</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1699" to="1714" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Language detection library for java</title>
		<author>
			<persName><forename type="first">N</forename><surname>Shuyo</surname></persName>
		</author>
		<ptr target="http://code.google.com/p/language-detection/" />
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
