<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Identifying Topic-Related Hyperlinks on Twitter</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Patrick</forename><surname>Siehndel</surname></persName>
							<email>siehndel@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University Hannover</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ricardo</forename><surname>Kawase</surname></persName>
							<email>kawase@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University Hannover</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Eelco</forename><surname>Herder</surname></persName>
							<email>herder@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University Hannover</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Risse</surname></persName>
							<email>risse@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University Hannover</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Identifying Topic-Related Hyperlinks on Twitter</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">ABB3BB7B2B820DED858B53BF12DD0AA1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The microblogging service Twitter has become one of the most popular sources of real-time information. Every second, hundreds of URLs are posted on Twitter. Due to the maximum tweet length of 140 characters, these URLs are in most cases shortened versions of the original URLs. In contrast to the original URLs, which usually provide some hints about the destination Web site and the specific page, shortened links do not tell users what to expect behind them. These links might contain relevant information or news regarding a certain topic of interest, but they might just as well be completely irrelevant, or even lead to a malicious or harmful website. In this paper, we present our work towards identifying credible Twitter users for given topics. We achieve this by characterizing the content of the posted URLs and relating it to the expertise of Twitter users.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The microblogging service Twitter has become one of the most popular and most dynamic social networks available on the Web, reaching almost 300 million active users <ref type="bibr" target="#b0">[1]</ref>. Due to its popularity and dynamics, Twitter has been the topic of various areas of research. Twitter clearly trades content size for dynamics, which results in one major challenge for researchers: tweets are too short to put them into context without relating them to other information. Nevertheless, these short messages can be combined to build a larger picture of a given user (user profiling) or a given topic. Additionally, tweets may contain hyperlinks to external Web pages. In this case, the linked Web pages can be used to enrich tweets with additional information.</p><p>An increasing number of users post URLs on a regular basis, and there are more than 500 million tweets every day <ref type="foot" target="#foot_0">1</ref>. With such a high volume, it is unlikely that all posted URLs link to relevant sources. Thus, measuring the quality of a link posted on Twitter is an open question <ref type="bibr" target="#b2">[3]</ref>.</p><p>In many cases, a lot can be deduced just from the URL of a given Web page. For example, if the URL domain belongs to a news provider, a video hosting website or a social network, the user already knows more or less what to expect after clicking on it. However, regular URLs are, in many cases, too long to fit in a single tweet. Consequently, Twitter automatically reduces the link length using shortening services. As a result, the user can no longer make an educated guess about what lies behind the link. 
In this work, we focus on ameliorating these problems by identifying those tweets containing URLs that might be relevant for the rest of the community.</p><p>The assumption behind our method is that users who usually talk about a certain topic ('experts') will post interesting links about the same topic. The strength of our method is that it is independent of the users' social graph. There is no need to verify the user's network or the retweet behavior. Thus, it can be calculated on the fly. To achieve our final goal, we divide our work into two main steps: the generation of user profiles <ref type="bibr" target="#b4">[5]</ref> and the generation of URL profiles. In this paper, we focus on the latter step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><p>In our scenario, we build profiles for Twitter users based on the content of their tweets. Besides the profiles for users, we also generate profiles for the URLs posted by the users. One of the biggest challenges in this task is to find appropriate algorithms and metrics for building comparable profiles for users and websites. The method we developed to solve this task is based on the vast amount of information provided by Wikipedia. We use the articles and the related category information supplied by Wikipedia to define the topic and the expertise level inherent in certain terms. Our method consists of three main steps to create profiles for users and websites.</p><p>Extraction: In this step, we annotate the given content (all tweets of a user, or the contents of a Web page) using the Wikipedia Miner Toolkit <ref type="bibr" target="#b3">[4]</ref>. The tool provides us with links to Wikipedia articles. The links discovered by Wikipedia Miner have a similar style to the links that can be found inside a Wikipedia article. Not every word that has a related article in Wikipedia is used as a link; only words that are relevant for the topic of the whole text are linked. Whether a detected article is relevant for the whole text is decided based on different features, such as its relatedness to other detected articles or the generality of the article.</p><p>Categorization: In the second stage, Categorization, we extract the categories of each entity that has been mentioned in the users' tweets or in the posted URL. For each category, we follow the path through all parent categories, up to the root category. In most cases, this procedure results in the assignment of several top-level categories to an entity. Since the graph structure of Wikipedia also contains links to less relevant categories, we only follow links to parent categories whose distance to the root is shorter than that of the child category. 
For each category, a weight is calculated by first defining a value for the detected entity. This value is based on the distance of the entity to the root node. Following the parent categories, we divide the weight of each node by the number of sibling categories. This step ensures that categories do not get higher values just because of a broader structure inside the graph. Based on this calculation, we give higher scores to categories that are deeper inside the category graph and more focused on one topic.</p><p>Aggregation: In the final stage, Aggregation, we perform a linear aggregation over all of the scores for a document, in order to generate the final profile for the user (or for the website). The generated profile displays the topics a user/website talks about, as well as the expertise in, or focus on, a certain topic.</p></div>
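<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Categorization and Aggregation steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the toy category graph, the function names, the use of the category depth as the starting value of an entity, and the reading of "number of sibling categories" as the number of child categories of the parent are all assumptions made for illustration only.</p>
```python
from collections import defaultdict, deque

# Hypothetical toy category graph (child category -> parent categories).
PARENTS = {
    "Main": [],
    "Technology": ["Main"],
    "Science": ["Main"],
    "Physics": ["Science"],
    "Telecommunications": ["Technology"],
    "Mobile phones": ["Telecommunications", "Technology"],
    "Particle physics": ["Physics"],
}
ROOT = "Main"


def build_children(parents):
    """Invert the child -> parents mapping into parent -> children."""
    children = defaultdict(list)
    for child, parent_list in parents.items():
        for parent in parent_list:
            children[parent].append(child)
    return children


def depths_from_root(children, root):
    """Breadth-first search: shortest distance of each category to the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def entity_profile(category, parents, children, depth):
    """Propagate the weight of one entity up to the top-level categories."""
    profile = defaultdict(float)
    # Assumption: the starting value of an entity is the depth of its category.
    stack = [(category, float(depth[category]))]
    while stack:
        node, weight = stack.pop()
        if depth[node] == 1:  # direct child of the root, i.e. top-level
            profile[node] += weight
            continue
        for parent in parents[node]:
            # Only follow parents that are closer to the root than the node.
            if depth[node] > depth[parent]:
                # Assumption: divide by the number of children of the parent.
                stack.append((parent, weight / len(children[parent])))
    return dict(profile)


def aggregate(entity_categories, parents, root):
    """Linear aggregation: sum the per-entity scores per top-level category."""
    children = build_children(parents)
    depth = depths_from_root(children, root)
    total = defaultdict(float)
    for category in entity_categories:
        scores = entity_profile(category, parents, children, depth)
        for top_level, score in scores.items():
            total[top_level] += score
    return dict(total)


# Categories of the entities detected in one (hypothetical) document.
profile = aggregate(["Mobile phones", "Mobile phones", "Particle physics"], PARENTS, ROOT)
print(profile)
```
<p>In this toy graph, the link from 'Mobile phones' to 'Telecommunications' is not followed because both categories have the same distance to the root, whereas the deeper and more focused category 'Particle physics' contributes a larger score to its top-level category.</p></div>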
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Validation</head><p>As mentioned in Section 1, in this paper we focus our attention on the generation of URL profiles and the relation to the corresponding tweets and users. Thus, in order to validate our methodology, we crawled Twitter with a number of predefined queries (keywords) and collected all resulting tweets that additionally contain URLs. We have previously validated our approach by characterizing and connecting heterogeneous resources based on the aggregated topics <ref type="bibr" target="#b1">[2]</ref>. Here, the goal is to qualitatively validate whether the topic assignment produced by our method in fact represents the topics that are expected to be covered by a given query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset</head><p>The dataset consists of around 83,300 tweets related to seven different topics. The idea behind this approach is to collect a series of tweets that contain links and certain keywords relevant for one particular topic. Within these tweets, we found 40,940 different URLs. For each of these URLs, we tried to download and extract the textual content, which resulted in 26,475 different websites. Additionally, we downloaded the last 200 posts of each user. The numbers of the dataset are shown in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topic Comparison</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the generated profiles for two of the chosen example topics. The profiles are averaged over all users and are based on the content of the crawled Web pages, on the tweets containing the URLs, and on the complete user profile (the last 200 tweets, due to API restrictions). We can see that for the very specific topic 'Israeli Palestinian Talks' the generated profiles are very similar. For the topic 'iPhone 5' the profiles are less similar. Since this topic or keyword is less specific, it becomes much harder for users to find the content they are looking for. A tweet like 'The new iPhone is really cool' together with a link may be related to many different aspects of the product. Table <ref type="table" target="#tab_1">2</ref> displays the correlation between the different profiles for the chosen exemplifying topics. While users who write about topics like 'Snowden' or 'Nexus phones' seem to write about related topics in most of their tweets, this is not true for more general topics.</p></div>
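<div xmlns="http://www.tei-c.org/ns/1.0"><p>The comparison of two profiles can be sketched as follows. The text does not state which correlation measure is used, so the Pearson correlation in this sketch is an assumption, as are the function name and the example profiles, which are invented for illustration and are not taken from the dataset.</p>
```python
from math import sqrt


def pearson(profile_a, profile_b):
    """Pearson correlation of two profiles given as category -> score dicts."""
    # Compare the profiles over the union of their categories;
    # a category missing from one profile counts as score 0.
    categories = sorted(set(profile_a) | set(profile_b))
    n = len(categories)
    if n == 0:
        return 0.0
    xs = [profile_a.get(c, 0.0) for c in categories]
    ys = [profile_b.get(c, 0.0) for c in categories]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: a flat profile has no correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical profiles for the content of one URL and the tweet linking to it.
url_profile = {"Technology": 0.6, "Business": 0.3, "Science": 0.1}
tweet_profile = {"Technology": 0.7, "Business": 0.2, "Science": 0.1}
print(round(pearson(url_profile, tweet_profile), 3))
```
<p>Under this assumption, each value in Table 2 would correspond to one such comparison between two of the three profile types for a topic.</p></div>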
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In this paper, we presented work towards the identification of credible topic-related hyperlinks in social networks. Our basic assumption is that users who usually talk about a certain topic ('experts') will post interesting (and safe) links about the same topic. The final goal of our work requires analyzing the quality of the posted URLs. Here, we presented our profiling method together with preliminary results for the URL profiles. As future work, we plan to analyze the quality of profiles and URLs in order to provide a confidence and quality score for URLs.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Coverage of Wikipedia Categories based on the URL Content for each selected topic.</figDesc><graphic coords="3,152.06,176.40,311.25,287.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics about the used dataset.</figDesc><table><row><cell></cell><cell>Items</cell><cell>Annotations</cell><cell>Annotations per Item</cell></row><row><cell>Topic Tweets</cell><cell>83,300</cell><cell>88,530</cell><cell>1.06</cell></row><row><cell>Linked Websites</cell><cell>40,940</cell><cell>457,164</cell><cell>11.1</cell></row><row><cell>All Tweets</cell><cell>11,303,580</cell><cell>30,059,981</cell><cell>3.127</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Correlations between created profiles</figDesc><table><row><cell></cell><cell>URL Content vs. Single Tweet</cell><cell>URL Content vs. User Tweets</cell><cell>Single Tweet vs. User Tweets</cell></row><row><cell>Edward Snowden</cell><cell>0.995</cell><cell>0.968</cell><cell>0.961</cell></row><row><cell>Higgs Boson</cell><cell>0.812</cell><cell>0.628</cell><cell>0.496</cell></row><row><cell>iPhone 5</cell><cell>0.961</cell><cell>0.698</cell><cell>0.664</cell></row><row><cell>Israeli Palestinian Talks</cell><cell>0.984</cell><cell>0.884</cell><cell>0.867</cell></row><row><cell>Nexus 5</cell><cell>0.968</cell><cell>0.972</cell><cell>0.956</cell></row><row><cell>Obamacare</cell><cell>0.983</cell><cell>0.79</cell><cell>0.752</cell></row><row><cell>World Music Awards</cell><cell>0.921</cell><cell>0.718</cell><cell>0.614</cell></row><row><cell>All topics average</cell><cell>0.946</cell><cell>0.808</cell><cell>0.759</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://blog.twitter.com/2013/new-tweets-per-second-record-and-how</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Acknowledgment</head><p>This work has been partially supported by the European Commission under ARCOMEM (ICT 270239) and QualiMaster (ICT 619525).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://globalwebindex.net/thinking/twitter-now-the-fastest-growing-social-platform-in-the-world/" />
		<title level="m">Twitter now the fastest growing social platform in the world</title>
				<imprint>
			<date type="published" when="2013-01">Jan. 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Exploiting the wisdom of the crowds for characterizing and connecting heterogeneous resources</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kawase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Siehndel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P</forename><surname>Nunes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Herder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nejdl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">HT</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">WarningBird: Detecting suspicious URLs in Twitter stream</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>NDSS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An open-source toolkit for mining Wikipedia</title>
		<author>
			<persName><forename type="first">D</forename><surname>Milne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">194</biblScope>
			<biblScope unit="page" from="222" to="239" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Twikime! - user profiles that make sense</title>
		<author>
			<persName><forename type="first">P</forename><surname>Siehndel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kawase</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference (Posters &amp; Demos)</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
