<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Signal Media</institution>
          ,
          <addr-line>David Corney, Dyaa Albakour, Miguel Martinez and Samir Moussa; 16-24 Underwood Street, London N1 7JQ</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop</institution>
          ,
          <addr-line>Padua, Italy, 20 March 2016, published at</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>We present a detailed description and
analysis of the Signal Media One-Million Articles
dataset. We have released this dataset to
facilitate research on emerging news-related
information retrieval (IR) challenges. In
particular, we have observed over the past decade
emerging novel paradigms for publishing and
consuming news, where users can get updated
on the go with news from multiple sources,
and at the same time, news providers are
increasingly using social media and citizen
journalism as powerful news sources. As a
result, a number of news-related IR tasks have
emerged and attracted attention in
industry and academia. These include news
verification, temporal summarization of
multiple news sources, and news recommendation
among others. A number of news datasets
were created and shared for news IR in the
past. However, such datasets are often drawn
from a single outlet, and heavily preprocessed
and cleaned. Also, they have become
outdated and are not suitable any more for the
emerging news IR challenges described above.
Our dataset aims to address this because it is a
recent collection from a wide range of sources
reflecting many real-world issues in news
collection and analysis.</p>
      <p>We present insights obtained from an analysis
of certain characteristics of the dataset, such
as article lengths; similarity between articles;
and the temporal characteristics of news
publishing. We also discuss the opportunities and
the limitations of our dataset.</p>
      <p>Copyright © 2016 for the individual papers by the paper's
authors. Copying permitted for private and academic purposes.
This volume is published and copyrighted by its editors.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>We present an analysis of the Signal Media
One-Million News Articles Dataset1. We have created and
shared this collection to stimulate research into new
and improved methods for large-scale text analysis
related to a wide range of applications. The articles were
collected from a variety of news sources in September
2015. The sources include major national and
international outlets, such as Reuters, the BBC and the
New York Times, along with many sources that have
fewer readers and less impact, including news
magazines, blogs, local outlets and specialist publications.
The collection is shared under a Creative Commons
licence2, while the copyright of the articles remains with
the original publishers.</p>
      <p>There are several existing collections of
news-related texts that have been widely used by the
information retrieval (IR) and natural language
processing (NLP) communities, such as the Twenty
Newsgroups collection [twe99]; the Reuters-21578 test
collection [Lew96]; several Reuters Corpora, such as
RCV1 [LYRL04], RCV2 and TRC2; and more recently
Yahoo's user-news feed interaction data set [Yah16].
Common Crawl3 provides a collection of 1.8 billion
pages, but this represents a snapshot sample of the
entire web, rather than a focussed news collection.</p>
      <p>While such datasets continue to be useful for
evaluating and guiding research, most are limited to a
single source or a website. Furthermore, such datasets
are highly curated and refined, which may result in
an over-estimation of performance: if an algorithm
performs well on such a clean set, will it still
perform well when presented with more noisy data, such
as content obtained from web-scraping or other
less-controlled sources? In the fast-changing era of news
publishing and consumption, such datasets have
become outdated and are not suitable for emerging IR
tasks on news. Our dataset addresses this by
providing a recent large sample of news articles from multiple
sources over a one-month period.
1 Available for research purposes from http://research.signalmedia.co/newsir16/signal-dataset.html
2 Attribution, non-commercial: https://creativecommons.org/licenses/by-nc/3.0/
3 http://commoncrawl.org/</p>
      <p>Like our collection, the British National
Corpus [Bur07] is monolingual (English-only), synchronic
(sampled from a single time period), general (not
limited to any genre or topic), and sampled from a wide
range of sources. Our dataset shares many of these
features, except that while the BNC is definitively from
a single nation, our collection spans the globe. While
focussed on news, our collection also contains
imaginative works of blogs and transcriptions of broadcast
speech.</p>
      <p>The articles of the dataset were originally collected
for Signal Media by Moreover Technologies from 1st to
30th September 2015. This included repeated
sampling from over 93,000 different news sources, ranging
from large-scale mainstream media outlets to
single-author blogs. Recent decades have seen a
blurring of the distinction between mainstream and
citizen journalism [Gla08], hence our inclusion of
multiple sources in the collection. As well as being an
attractively round number, one million articles is a
large enough collection to develop and evaluate a wide
range of tools and models, while still being
manageable enough to not require specialist or sophisticated
infrastructure.</p>
      <p>Due to the scale of the collection, and the
importance of speed, the collection process is largely
automated. This, along with the variation in quality
of source websites, inevitably leads to imperfections
in the collection. For example, the goal was to
collect English-language content only, but manual
analysis shows that a small proportion of articles are in
other languages or a mixture of languages. Similarly,
some articles contain fragments of HTML, PHP or
JavaScript code, due to problems with encoding,
rendering or scraping, and some are duplicated (see
Section 2.2). We consider such issues as the inevitable
presence of noise in any real-world collection. One
of our aims in sharing this dataset is to encourage
the development of tools and methods that are robust
enough to cope with data of this nature.</p>
      <p>In the remainder of the paper, we provide insights
from an analysis conducted on a number of the dataset
characteristics (Section 2). We share some tools for
accessing the data (Section 3), and discuss the
opportunities and the limitations of the dataset in Section 4.</p>
    </sec>
    <sec id="sec-3">
      <title>Characteristics of the dataset</title>
      <sec id="sec-3-1">
        <title>Article length</title>
        <p>Some news articles are little more than a single
statement, especially if a news story is still breaking. Other
articles may contain a lot of details and background
information or discussion. To investigate this, we use
a simple tokenizer4 to count the number of words in
each article. Figure 1 shows the distribution of article
lengths, comparing the lengths of `news' articles to the
lengths of `blog' articles. This shows that typical
articles are a few hundred words long, but with a long tail
reaching up past 2000 words. Note also that articles
between 400 and 1000 words long are more likely to be
news articles than blogs, while very short articles (200
words or less) and long articles (1200 words or more)
are more likely to be blogs than news.</p>
        <p>The tokenization also allows us to calculate that
there are 407,754,159 words in the dataset; that there
are 2,003,254 distinct words in the dataset; and that
the average number of words per article is 407.75.</p>
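        <p>The counts above can be reproduced with a simple tokenizer. The following Python sketch only approximates the ElasticSearch Standard Tokenizer we used (its Unicode and abbreviation handling differ), so exact figures on the full dataset will vary slightly:</p>
        <preformat>
```python
import re


def tokenize(text):
    # Rough approximation of ElasticSearch's Standard Tokenizer:
    # split on whitespace and punctuation, keeping alphanumeric
    # runs and internal apostrophes (e.g. "don't").
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)


def corpus_stats(articles):
    # articles: iterable of raw article strings.
    total_words = 0
    vocabulary = set()
    n_articles = 0
    for text in articles:
        tokens = tokenize(text)
        total_words += len(tokens)
        vocabulary.update(t.lower() for t in tokens)
        n_articles += 1
    return {
        "total_words": total_words,
        "distinct_words": len(vocabulary),
        "words_per_article": total_words / n_articles,
    }
```
        </preformat>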
        <p>Which articles are unusually long? And which
articles are unusually short? Just 144 articles in the
collection have more than 10,000 words. The longest article
has 12,450 words and is a transcript of a US college
football match. Other long articles include an
installment of a serialized novel; an update about a fantasy
football competition; detailed personal memoirs; and
a detailed list of fishing reports from Florida. While
these may never attain very large readerships, they
will no doubt be of interest to certain audiences. At
the other extreme, the shortest article in the collection is
nothing more than a brief announcement about
options trading with no background discussion or detail.
The article is shown in its entirety in Figure 2,
including available metadata.
4 We used the Standard Tokenizer in ElasticSearch v1.7,
which splits on white space and punctuation symbols, while
allowing for abbreviations.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Duplicated articles</title>
        <p>Identical, or near-identical articles may appear in the
collection for several reasons, including:
Syndication One publisher may publish the same
article through several outlets, such as regional
newspapers.</p>
        <p>Updates One source may publish multiple versions of
the same article over time, especially in the case
of updating breaking news stories.</p>
        <p>Aggregation Some news aggregation sites (such as
wn.com and www.newslookup.com) display copies
of articles originally published elsewhere. These
articles may have already been collected from the
primary sources.</p>
        <p>Access issues Some sites give the same content when
an automated tool attempts to access multiple
articles on the site (e.g. a copyright or login notice),
with the full text being behind a firewall.</p>
        <p>To investigate this, we measured the cosine
similarity between pairs of articles. This compares the
frequency of terms found in each article, ignoring the
word order. A score approaching one means that the
same words appear at the same frequencies; a score
close to zero means that entirely different words
appear. If two articles differ by only a few words, we still
wish to treat them as near-duplicates. In this way, we
can ignore differences caused by different page layouts,
changes in the byline or other minor edits.</p>
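        <p>This comparison can be sketched as follows; the stopword list here is a tiny illustrative stand-in for a full list:</p>
        <preformat>
```python
import math
from collections import Counter

# Illustrative stopword subset, not the full list used in our analysis.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}


def term_freqs(text):
    # Bag-of-words term frequencies, ignoring word order and stopwords.
    tokens = [t.lower() for t in text.split()]
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(tf_a, tf_b):
    # Dot product over shared terms, normalised by vector magnitudes.
    dot = sum(freq * tf_b[term] for term, freq in tf_a.items())
    norm_a = math.sqrt(sum(f * f for f in tf_a.values()))
    norm_b = math.sqrt(sum(f * f for f in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```
        </preformat>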
        <p>In Figure 3, we show the probability distribution of
cosine similarity scores between pairs of articles. To
generate this, we randomly sampled 250 articles and
measured the cosine similarity between each of these
and each of the million articles. We removed
stopwords and calculated the term frequencies of the
remaining words (though similar results were obtained
without removing stopwords). The cosine similarity
scores were grouped into 100 equal-sized bins to
generate the graph. Note that we are sampling from the
space of similarity scores and not the space of articles:
full pairwise analysis of a smaller corpus would lead to
a very different, sparser distribution.</p>
        <p>The broad peak in the central portion shows that when
choosing two articles at random, their cosine
similarity is likely to be around 0.5. 90% of article pairs
have a cosine similarity between 0.13 and 0.70. This
shows a substantial degree of overlap between articles
in terms of word frequencies, despite the wide range of
sources and topics covered. The slight up-tick at the
extreme left shows that around 3% of articles share
no terms (except stopwords) with any other article.
These are typically very short articles containing rare
terms (such as rare proper nouns). This includes some
articles that have been corrupted during collection,
and contain only fragments of code. At the other
extreme, 0.0561% of pairs have a similarity score
greater than 0.95. This suggests that articles have on
average around 2.2 duplicates or near-duplicates. This
result is consistent with the duplicate recognition
component of Signal Media's news monitoring platform5.</p>
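        <p>The near-duplicate estimate can be sketched by thresholding pairwise similarities, using cosine similarity over simple term-frequency vectors as above (stopword removal omitted here for brevity):</p>
        <preformat>
```python
import math
from collections import Counter


def tf(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(f * b[t] for t, f in a.items())
    na = math.sqrt(sum(f * f for f in a.values()))
    nb = math.sqrt(sum(f * f for f in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def near_duplicates(sample, collection, threshold=0.95):
    # For each sampled article, count collection articles whose cosine
    # similarity exceeds the threshold. An article sampled from the
    # collection matches itself, so subtract one for a "copies" count.
    vecs = [tf(t) for t in collection]
    return [sum(1 for v in vecs if cosine(tf(s), v) > threshold)
            for s in sample]
```
        </preformat>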
      </sec>
      <sec id="sec-3-3">
        <title>Typical articles</title>
        <p>Beyond considering duplicates, we investigated which
articles are `typical' and `atypical' with regard to the
rest of the collection. To do this, we removed stopwords
and generated term frequency vectors for each article,
as before. We then combined these into a single term
frequency vector representing the centroid of the
entire collection. We then measured the cosine similarity
between each article and this centroid. Articles with
a high score can be seen as typical of the collection, in
that the distribution of terms is similar. We plan to
repeat this analysis using TF-IDF, but we expect to
find a similar pattern of results overall.
5 signalmedia.co</p>
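        <p>The centroid computation can be sketched like this, again over simple term-frequency vectors (stopword removal omitted for brevity; `articles` is a hypothetical list of article strings):</p>
        <preformat>
```python
import math
from collections import Counter


def term_freqs(text):
    return Counter(text.lower().split())


def cosine(tf_a, tf_b):
    dot = sum(f * tf_b[t] for t, f in tf_a.items())
    na = math.sqrt(sum(f * f for f in tf_a.values()))
    nb = math.sqrt(sum(f * f for f in tf_b.values()))
    return dot / (na * nb) if na and nb else 0.0


def typicality(articles):
    # Build one term-frequency vector per article, sum them into a
    # centroid for the whole collection, then score each article by
    # its cosine similarity to that centroid.
    vectors = [term_freqs(text) for text in articles]
    centroid = Counter()
    for tf in vectors:
        centroid.update(tf)
    return [cosine(tf, centroid) for tf in vectors]
```
        </preformat>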
        <p>Figure 4 lists the twenty most common terms in the
collection (excluding stopwords). These include the
month and year of the collection, along with words
such as `news', `market', `business' and `company',
indicating the typical focus of stories.</p>
        <p>One of the most typical stories by this definition,
with a centroid cosine similarity of 0.9620, is about
a new portable charge storage device6. It contains
words such as `power', `time', `heat' and other terms
commonly found in news articles, as shown in this
extract: "Harnessing the thermal energy from the
included heating pot, it generates 5-watt power that can
charge a device via the attached USB cable. You
provide the heat source and liquid, and the PowerPot
charges any compatible electronic with a USB
connection. Charging time is actually comparable to a
standard outlet, at about 1 to 2 hours to fully charge a
phone."</p>
        <p>One of the least typical stories is an article about
tomato soup, in a small-town American newspaper7:
"That slight chill in the evening air? That's fall.
And what better way to chase it than with a bowl
of steamy, homemade soup. Chef Shereen Pavlides'
recipe for Tomato Basil Soup is just what the
season ordered, made with chicken stock and hearty San
Marzano tomatoes." This has a cosine similarity with
the centroid vector of just 0.00160.</p>
        <p>While this centroid approach gives some indication
of typical and atypical articles, it ignores the fact that
the collection is far from homogeneous. A more
sophisticated analysis could de ne a number of
representative centroids, each representing a typical article
belonging to a particular category. Such a
segmentation of the collection could be achieved using any of a
wide variety of document clustering algorithms.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Volume over time</title>
        <p>In Figure 5, we plot the daily volume of articles
published across the full duration of the dataset
(September 2015). We can see the expected weekly pattern
with more activity (published articles) during
weekdays and less activity during the weekends. This is
true for both media types (blogs and news).
However, a couple of exceptions can be observed in these
weekly cycles. First, there are fewer articles published
on Monday, September 7th compared to other
Mondays and other weekdays in the month. This was
a public holiday in the United States (Labor Day),
which resulted in less media activity. Note that the
dataset is sampled only from English-language sources
and therefore the majority of articles originate from
English speaking countries (with the USA the largest
of those). The second exception is Friday, September 11th,
when the volume dropped due to downtime in
the collection process on that day.</p>
        <p>We now consider the hourly distribution of article
publication in Figure 6. We calculate the average
volume of articles in each hour of the day, using the GMT
timezone, over the 30-day period. We observe a sharp
rise in the "News" volume at 7:00 GMT. The volume of
"News" increases throughout the day and reaches a
peak between 12:00 GMT and 16:00 GMT. Afterwards,
the volume starts to decrease through the late hours of
the day. To explain this behaviour, we further break
down the "News" sources based on the continent of
their origin in Figure 7. We can observe that
the peak at 7:00 GMT is mainly due to increased
activity in Europe during the early hours of the working
day there. Hours later, when it is morning in America
(12:00 to 16:00 GMT), we observe an increase of
activity from American sources, which in fact dominate
the stream.
6 Article ID 05bab596-55a3-40aa-adde-1a2cb48ebc41.
7 Article ID 497a3712-dd6d-4b0a-9365-5a7ade79d905.</p>
        <p>For "Blogs" (Figure 6), we observe a different
picture, where the volume is more stable throughout the
day with slight increases during the morning and
afternoon hours.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Open Source Tools</title>
      <p>We have created an open source repository on
GitHub8 to host useful tools and programming scripts
for processing the dataset. For example, we have
added scripts to index the data with ElasticSearch and
to convert it into the TREC format for easier
compatibility with some other IR tools. We will continue to
maintain and promote this repository, encouraging the
wider community of IR researchers to contribute.
8 https://github.com/SignalMedia/Signal-1M-Tools</p>
    </sec>
    <sec id="sec-5">
      <title>Opportunities and limitations</title>
      <p>The Signal Media One-Million News Articles Dataset
provides a distinctive research asset. The articles come
from multiple sources and reflect many of the realities
of practical large-scale text analytics. However, there
are inevitably limitations to the data, some of which
we now consider.</p>
      <p>Language The dataset is almost monolingual, being
dominated by English-language articles and with
the remaining articles mostly being a mixture of
English and non-English text.</p>
      <p>Date range The articles were all collected during
September 2015; a small number were published
in the preceding days and weeks but only
detected and downloaded in that month. By
limiting the sample to a single calendar month, we
achieved a relatively high density of related
articles, such as multiple articles written about the
same events. One month is also long enough for
some of the news stories to change over time,
making the dataset suitable for topic detection and
tracking studies. Of course, restricting the
collection to this period makes it less useful for
longer-term topic tracking.</p>
      <p>Missing links and multimedia resources The
dataset is text-only, and does not contain links
to the original articles. This is partly due to
issues around image licensing, but also to avoid
problems with link rot: the ever-changing nature
of the internet means that any published URLs
would not reliably point to the same text as in
the dataset. A dataset containing archived copies
of multimedia content related to news would be
very useful, but is beyond the scope of this work.</p>
      <p>Labelling Many tasks require labelled documents,
such as to assign documents to topic categories
or to disambiguate entities. Although the
collection currently has no such labels, our goal is to
encourage the wider community of IR researchers
to use this collection for a variety of tasks and
share label sets for parts (or all) of the articles.</p>
      <p>In conclusion, we believe that this dataset will be
useful for developing, evaluating and comparing a wide
range of news-related information retrieval tools and
algorithms. The analysis provided here offers an
initial description of the collection and forms the basis
for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bur07]
          <string-name>
            <given-names>Lou</given-names>
            <surname>Burnard</surname>
          </string-name>
          .
          <article-title>Reference guide for the British National Corpus (XML edition)</article-title>
          . http://www.natcorp.ox.ac.uk/XMLedition/URG,
          <year>February 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Gla08]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Glaser</surname>
          </string-name>
          .
          <article-title>Distinction between bloggers, journalists blurring more than ever</article-title>
          . http://mediashift.org/2008/02/distinction-between-bloggersjournalists-blurring-more-thanever059,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Lew96]
          <string-name>
            <given-names>David</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>Reuters-21578 text categorization test collection</article-title>
          . http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [LYRL04]
          <string-name>
            <given-names>David D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tony G.</given-names>
            <surname>Rose</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fan</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [twe99]
          <article-title>20 Newsgroups dataset</article-title>
          . http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Yah16]
          <article-title>Yahoo News Feed dataset</article-title>
          . http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&amp;did=75,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>