<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Webometric Analysis of Russian Scientific and Education Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Kosyakov</string-name>
          <email>kosyakovdv@ipgg.sbras.ru</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Guskov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egor Bykhovtsev</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Technologies of SB RAS</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Petroleum Geology and Geophysics, Siberian Branch, Russian Academy of Sciences</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Novosibirsk State University</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>State Public Scientific Technological Library, Siberian Branch, Russian Academy of Sciences</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>196</fpage>
      <lpage>207</lpage>
      <abstract>
        <p>The aim of this work is to create a publicly available database of webometric indicators for research and higher education organizations in Russia, updated monthly and accessible through the project's website, http://www.webometrix.ru. This paper describes the set-up of the project, including the initial gathering and updating of the list of organizations and their web domains, the sources of data, the measurement of indicator values, and the analytics available on the project's website. Starting with 613 institutions of the Russian academies of sciences in January 2015, since mid-2015 we have gathered data for 2201 organizations, including research and higher education institutions. Continuous data covering more than a year allowed us to assess the reliability of the indicators used and to draw some conclusions about the Russian scientific and educational web space.</p>
      </abstract>
      <kwd-group>
        <kwd>webometrics</kwd>
        <kwd>informetrics</kwd>
        <kwd>websites</kwd>
        <kwd>rankings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, webometric studies based on the use of web search engines [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
have become a recognized method of measuring the quality and impact of academic
institutions' websites. However, like other informetric studies, this method remains
quite controversial because of gaps in the research base, the quality of the measurement
instruments, and the weight and meaning of the measured indicators. We believe that
there is a lack of nationwide recurring measurements and of comparison between
webometric and other kinds of scientometric assessments.
      </p>
      <p>
        This research is based on monthly webometric data collection for the websites of
more than 2200 Russian research organizations and higher education institutions.
In-depth analysis of the time series of particular webometric indicators,
in some cases combined with parsing the website structure, examining its
peculiarities, and correlating with usage statistics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], allowed us to examine in
detail the significance of each indicator, to propose some correction methods,
and to compare different approaches to calculating webometric rankings.
      </p>
      <p>In this report, we consider the principles and architecture of the
webometric data collection system. It comprises a database of webometric indicators with
monthly data since January 2015 and a web interface (http://www.webometrix.ru)
that allows anyone to analyze the trends and evolution of scientific
websites. Institutions can use it to examine the position and dynamics of their
website, to compare it with others and, as a result, to find ways to improve it.
In the final part of the report, we give an overview of the
Russian academic and education web. We also compare webometric rankings with
bibliometric data on institutions' academic output and with website usage statistics.</p>
    </sec>
    <sec id="sec-2">
      <title>Area of Study</title>
      <p>We began data collection in January 2015 for institutions under the supervision
of the Russian Federal Agency of Scientific Organizations (so-called academic
institutions, since they were previously subordinate to the state academies). Official websites and
the corresponding distinct domains were determined for 613 organizations, most of
them research institutions. In July 2015 we added Russian higher
and further education institutions, non-academic research organizations, and
scientific development and production centers to the data collection.</p>
      <p>The resulting collection contains data for 2201 organizations with distinct
DNS domains, including some regional branches with separate websites
and corresponding domains. Of these, 1172 organizations are in the research and
development segment and 1029 are in the higher education segment.</p>
      <p>At the time of the initial information gathering we could not find a
consistent and complete list of such organizations, so we used the Scientific Digital
Library database combined with the Russian Science Citation Index (RSCI), located
at http://elibrary.ru. Since the eLibrary.ru database covers most of the
scientific publications of Russian authors, we assume that it covers almost all
organizations whose employees are engaged in scientific activities in some way.
The corresponding website addresses were obtained by searching Google and Yandex
by organization title, with visual verification.</p>
      <p>According to the Russian State Statistics Agency, there were 2827 research and
development organizations in Russia at the end of 2015. However, some of them
do not maintain an official website or do not publish scientific works, either because of a
restricted field of research or because of the purely technological or construction character of
their activities. In the higher education segment there are 609 state institutions and
universities and 437 private ones, 1046 in total.</p>
      <p>Among higher education institutions we distinguished well-separated classes
associated both with the development trends of Russian higher
education in recent years (the federal and national research universities) and with the
legacy of the Soviet Union (classical, technical, medical, humanitarian, educational,
economic, legal and agricultural universities). In the R&amp;D segment we distinguished the
institutions under the control of the Federal Agency of Scientific Organizations
(academic institutions) and the national research centers, as these classes are funded
in a special way under government programs. The rest of the research
institutions are under the control of various federal authorities and corporations, or are
private and public organizations of various forms.</p>
      <p>Organizations differ substantially in scale and scientific activity.
Unfortunately, detailed data on the number of researchers and faculty members at
institutions are not available, so for each organization we extracted the number of contributing authors
with publications in the last 5 years. These data do not
fully reflect the organization’s research staff; however, the numbers allow us
to make an adequate assessment. For large research institutions and
universities, the number of contributing authors may exceed the number of actual
faculty members and research staff because of temporary employees and students.
For example, Moscow State University has about 9000 faculty members, while
eLibrary counts 14211 contributing authors. Conversely, for small organizations
and universities the number of authors may be smaller than the number of faculty members and
researchers. The resulting treemap is shown in Fig. 1.</p>
      <p>
        From the same source we also extracted the number of articles published in the last 5
years and registered in the Web of Science and Scopus databases
(Fig. 2). Detailed data are also shown in Table 1.
      </p>
      <p>
        As in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we consider the following main metrics in our study:
      </p>
      <list list-type="order">
        <list-item><p>Domain size, measured as the number of hits returned by the corresponding search
engine for a request limited to the domain URL.</p></list-item>
        <list-item><p>The number of documents in the popular formats ps, pdf, doc(x), ppt(x), xls(x),
measured by narrowing the previous request with the appropriate filter.</p></list-item>
        <list-item><p>The number of scholarly papers indexed by Google Scholar, which can be
full-text documents or web pages with correct publication metadata.</p></list-item>
        <list-item><p>The number of external hyperlinks (backlinks) from other domains to pages of the target
domain, and the number of such referring domains.</p></list-item>
      </list>
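      <p>As a rough illustration of metrics 1 and 2, the sketch below builds the query strings whose hit counts would be recorded. It is not the project's code: the <monospace>site:</monospace> and <monospace>filetype:</monospace> operators are an assumption (operator syntax differs between search engines, and the queries actually used are not given here), and the function names are hypothetical.</p>
      <preformat>
```python
# Document formats named in metric 2; doc(x), ppt(x), xls(x) are expanded.
DOCUMENT_FORMATS = ["ps", "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx"]


def domain_size_query(domain: str) -> str:
    """Query whose hit count approximates the domain size (metric 1).

    Assumption: the search engine supports a ``site:`` operator.
    """
    return f"site:{domain}"


def document_queries(domain: str) -> dict[str, str]:
    """One query per document format, narrowing the domain query (metric 2).

    Assumption: the search engine supports a ``filetype:`` operator.
    """
    base = domain_size_query(domain)
    return {fmt: f"{base} filetype:{fmt}" for fmt in DOCUMENT_FORMATS}
```
      </preformat>
      <p>For example, <monospace>document_queries("inasan.ru")["pdf"]</monospace> yields the string <monospace>site:inasan.ru filetype:pdf</monospace>; the document count for a domain would then be the sum of the hit counts over all formats.</p>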
      <p>
        The first two metrics are collected via the Google, Bing and Yandex search engines,
and the last one with the help of the Ahrefs and Majestic SEO search optimization
and backlink tracking services. Additionally, we obtain basic usage statistics
from the SimilarWeb service, which uses data extracted from four main sources: 1)
a panel of web surfers made up of millions of anonymous users equipped with a
portfolio of apps, browser plugins, desktop extensions, and software; 2) global
and local ISPs; 3) web traffic measured directly from a learning set of selected
websites and intended for specialized estimation algorithms; 4) a colony of web
crawlers that scan the entire Web. A comparison of these data with the data
gathered from the corresponding Google Analytics and Yandex Metrika site counters for
several academic websites that participated in our research [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] shows that they are sufficiently
accurate.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data Collection and Processing Pipeline</title>
      <p>Data are collected by PowerShell scripts that query the Yandex and Bing web
services and process web pages obtained from Google, Google Scholar and
SimilarWeb using Internet Explorer automation under an operator's visual control. Ahrefs
and Majestic data are collected through their bulk export features. Results are stored
in a MongoDB database.</p>
      <p>During data collection we encountered two types of errors and failures, mainly in
interpreting the web pages received in response to a query: a) errors caused
by incorrect response parsing due to unexpected changes in response details, and b)
changes in a search engine’s database caused by reindexing of websites and global
index rebuilding.</p>
      <p>Distortions of the first type can be detected by a combined low-pass and high-pass filter
that compares the measured value with the average of several previous values.
This combination of a band filter with a moving average allows us to detect single
isolated outliers. If errors are detected after the data collection cycle and cannot
be corrected by recollection, the data may be recovered by linear interpolation between
neighboring values. We also retain the originally collected values.</p>
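      <p>The filtering and recovery steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the window length and the band factors <monospace>low</monospace> and <monospace>high</monospace> are hypothetical, as are the function names, and measurements are assumed to be equally spaced (monthly), so that linear interpolation reduces to the mean of the two neighbors.</p>
      <preformat>
```python
from statistics import mean


def find_outliers(values, window=3, low=0.5, high=2.0):
    """Return indices of values outside the band [low*avg, high*avg].

    avg is the mean of the `window` most recent values that passed the
    filter, so a single spike does not distort later checks.
    The window and band factors are illustrative assumptions.
    """
    flagged = []
    accepted = []  # history of values that passed the filter
    for index, value in enumerate(values):
        history = accepted[-window:]
        if len(history) == window:
            average = mean(history)
            if value > high * average or low * average > value:
                flagged.append(index)
                continue
        accepted.append(value)
    return flagged


def interpolate(values, flagged):
    """Replace isolated flagged values by the mean of their neighbors.

    A new list is returned; the originally collected values are kept.
    """
    repaired = list(values)
    last = len(values) - 1
    for index in flagged:
        inside = index > 0 and last > index
        if inside and index - 1 not in flagged and index + 1 not in flagged:
            repaired[index] = (values[index - 1] + values[index + 1]) / 2
    return repaired
```
      </preformat>
      <p>With the hypothetical series 1000, 1040, 1010, 50000, 1060, 1080, only the fourth value is flagged and it is replaced by 1035. A persistent level shift, such as one caused by reindexing, would keep being flagged by this simple scheme, which is why isolated outliers are the only case it handles.</p>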
      <p>
        The situation is much worse with effects caused by the reindexing of some
websites and the global rebuilding of search engines' indices, which occur relatively often [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and affect measurements dramatically. For example, in July 2015 Bing counted
2.8 million pages in the domain of the Institute of Astronomy of the Russian
Academy of Sciences (inasan.ru), but in subsequent months this indicator fell
back to several thousand pages. We analyzed these effects in detail in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
These effects may last more than one month, so more
sophisticated logic is needed to detect and correct them; this is one of the directions of future
investigation.
      </p>
      <p>For each domain 26 metrics are gathered: 9 main indicators and 17
supplementary ones. We retain the exact timestamp of each value collected. The monthly
cycle lasts more than a week and yields 57 226 values. The project's website
provides the following basic functionality:</p>
      <list list-type="bullet">
        <list-item><p>Ranking of organizations' web domains by one of the indicators, with its change
over time, in tabular form.</p></list-item>
        <list-item><p>Ranking of organizations' web domains by the values of every indicator for a single
month.</p></list-item>
        <list-item><p>Dynamics of totals, means and medians of selected indicators for all or
a selected part of the domains during a selected time period.</p></list-item>
        <list-item><p>Comparison of different indicator or monthly series as a scatter chart.</p></list-item>
        <list-item><p>Detailed information for a single domain.</p></list-item>
      </list>
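      <p>The size of one monthly cycle follows directly from the figures above (2201 domains times 26 metrics), and the totals, means and medians offered by the website are simple aggregates over a set of domains. The sketch below shows both; the function names and the sample domain values are hypothetical.</p>
      <preformat>
```python
from statistics import mean, median

METRICS_PER_DOMAIN = 26  # 9 main + 17 supplementary indicators
DOMAINS = 2201           # organizations with distinct domains


def values_per_cycle(domains=DOMAINS, metrics=METRICS_PER_DOMAIN):
    """Number of values collected in one monthly cycle."""
    return domains * metrics


def summarize(values_by_domain):
    """Total, mean and median of one indicator over a set of domains."""
    values = list(values_by_domain.values())
    return {"total": sum(values), "mean": mean(values), "median": median(values)}
```
      </preformat>
      <p>Here <monospace>values_per_cycle()</monospace> gives 57226, matching the number quoted in the text.</p>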
    </sec>
    <sec id="sec-4">
      <title>Analysis of Research and Education Web Space</title>
      <p>
        Data collected for the majority of Russian research and higher education
organizations over a period of 10 months allow us to give a brief review of the basic
characteristics of the Russian research and higher education web space in terms of
size and quality. By size we mean the number of pages and documents, and by
quality the number of papers in the Google Scholar index, the Yandex thematic citation
index, and the number of visits per month. As we show in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], backlink data are
quite controversial and cannot be used in a thorough analysis without complex
cleaning.
      </p>
      <p>First of all, let us look at the size of the Russian scientific and education web
and its dynamics (Fig. 3). At first sight, the total size of the
segment under consideration increases, except in the Bing data, which can
be explained by some peculiarities of the Bing engine. But if we take into account the
total size of the Russian web space, measured as the size of the .ru zone, we
see a more complicated picture (Fig. 4). While the size of the overall Russian
web in the Google index increased by more than 150% and in Bing nearly doubled,
in Yandex it oscillated around 400 million pages. At the same time, the academic share
in the Bing index decreased from 18% to 7%, in Google it remained almost the same
(7% to 6%), and in Yandex it tripled from 6% to 18%.</p>
      <p>From here on, when considering indicators based on search engines, we use aggregated
values calculated as the maximum of the values obtained from Google, Yandex and
Bing. The total size of the education and scientific web measured by
this indicator grew from 78 million pages in August 2015 to more than 100
million in May 2016. The share of HE decreased by almost 3 percentage points, from 74.3% to
71.6%. The fastest growing classes were TU, CU and NRU. The R&amp;D share
is divided almost equally between AI and other R&amp;D organizations (about 14% each), with
a minor proportion of NRC (0.3%). During the studied period the proportion
of AI decreased from 17%, with a simultaneous growth of the others from 8%. The partition of the web
space into classes is shown in Fig. 5.</p>
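      <p>The aggregated indicator and the class shares discussed above can be sketched as follows. The handling of missing measurements (ignoring them) is an assumption made for this illustration, and the function names and sample numbers are hypothetical.</p>
      <preformat>
```python
def aggregate(engine_values):
    """Aggregated indicator: the maximum over the available engines.

    Assumption: a missing measurement is stored as None and ignored;
    None is returned when no engine produced a value.
    """
    known = [value for value in engine_values.values() if value is not None]
    return max(known) if known else None


def segment_shares(sizes_by_segment):
    """Share of each segment in the total size, in percent."""
    total = sum(sizes_by_segment.values())
    return {name: 100 * size / total for name, size in sizes_by_segment.items()}
```
      </preformat>
      <p>For instance, a domain with 1200 pages in Google, 900 in Yandex and no Bing measurement gets an aggregated size of 1200. Taking the maximum makes the indicator less sensitive to a temporary drop in one engine's index, although it still inherits any spike in a single engine.</p>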
      <p>The average HE domain size increased from 111 to 146 thousand pages, and the average R&amp;D
domain size from 14 to 19.5 thousand. The most active growth of average domain size, from
303 to 525 thousand pages, was in FU; NRU domains grew from 272 to
371 thousand pages. The average size of LS, EU, MS and PU domains has
not changed, and that of HU and AU domains even decreased. In R&amp;D the average size of
AI and NRC domains has not changed; an increase was observed only among
the others.</p>
      <p>The total number of documents increased from 6 to 7.7 million (Fig. 6). The
main growth, both in absolute terms and in share, occurred in the NRU and
CU classes; the final distribution is shown in Fig. 7. The average number of
documents in the FU class increased from 40 to 50 thousand, in NRU from 30 to 42 thousand,
and in CU from 13 to 18 thousand. In R&amp;D the average number of documents grows slowly
and is slightly more than 1000 documents per domain. It should be noted that
the bulk of the documents are on the top domains of each group, as the mean values are
significantly higher than the median and even the upper quartile.</p>
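      <p>The remark that mean values exceed the median and even the upper quartile describes a strongly skewed distribution, where a few large sites hold most of the documents. A small sketch with invented numbers shows the effect; the sample data and the function name are hypothetical and are not taken from the project's database.</p>
      <preformat>
```python
from statistics import mean, median, quantiles


def skew_summary(values):
    """Mean, median and upper quartile of per-domain values."""
    q1, q2, q3 = quantiles(values, n=4)  # the three quartile cut points
    return {"mean": mean(values), "median": median(values), "upper_quartile": q3}


# Hypothetical document counts for seven domains of one class;
# a single large site dominates the group.
example = [200, 300, 400, 500, 600, 800, 40000]
summary = skew_summary(example)
```
      </preformat>
      <p>In this invented example the mean is pulled above the upper quartile by the single largest domain, which is the pattern reported for the document counts.</p>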
      <p>The number of publications indexed by Google Scholar changed
only slightly, from 664 to 693 thousand. The bulk of the indexed publications are in
the NRU, CU, and AI domains (Fig. 8) and are located on a small number of leading
websites. The main growth was observed in FU domains.</p>
      <p>Finally, let us take a look at the site traffic of research organizations and higher
education institutions. The total site traffic increased from 70 to 97 million
sessions per month. Most of the growth occurred on the sites of AI and other
R&amp;D organizations (in the R&amp;D segment as a whole we observe almost
twofold growth) as well as on the NRU, FU, PU and CU sites, where growth ranged from
40% to 58%. The final distribution is shown in Fig. 9. The highest average number
of sessions, 0.5 million per month, was in the FU and NRU classes. In R&amp;D, NRC
shows the highest value, about 153 thousand sessions, slightly less than
CU (186 thousand sessions). The most efficient were NRC sites, for which 100
pages indexed by search engines resulted in more than 1800 sessions per month.
AI showed the lowest efficiency of all the classes, with 47 sessions per 100 pages
indexed. In the HE segment NRU were the best (150 sessions), and the other average
values were about 100 sessions, except MS (70) and PU (82).</p>
      <p>A comparison of the shares of segments and classes for different indicators allows us
to identify possible points of growth. The greatest growth potential is
concentrated in the area of open science: access to the full text and metadata of scientific
publications. Most of the more than 7 million publications indexed by Google
Scholar in the Russian segment of the Internet reside on the sites of the scientific
digital libraries eLibrary.ru (about 4 million) and CyberLeninka.ru (slightly more
than 1 million). It should be noted that documents indexed by Google Scholar
are highly ranked in general Google search results, which leads to an increase in
site traffic and contributes to the promotion of scientific results. An organization's
website can provide a visitor with quite a different context than a digital library,
with information on current research projects and different kinds of scientific output,
and in most cases this leads to better results. The total number of
publications of the organizations in question for the last 5 years alone was more than
3 million, of which less than 25% are available online. Noting the great progress
and rapid development of the Internet resources of federal and national research
universities and the weak, often even negative, trend in other classes, we
can conclude that there is high unrealized development potential in some of
the classical universities and other types of higher education institutions.</p>
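      <p>The efficiency figure used earlier in this section, sessions per month per 100 indexed pages, is a simple ratio. The sketch below is only an illustration: the function name is hypothetical, and combining the rounded class averages quoted in the text (about 0.5 million sessions and 525 thousand pages for FU) is a rough consistency check, since the published values may have been computed differently, for example by averaging per-site ratios.</p>
      <preformat>
```python
def sessions_per_100_pages(sessions_per_month, indexed_pages):
    """Monthly sessions attracted per 100 pages indexed by search engines."""
    if indexed_pages == 0:
        raise ValueError("indexed_pages must be positive")
    return 100 * sessions_per_month / indexed_pages


# Rounded FU class averages quoted in the text; an approximation only.
fu_efficiency = sessions_per_100_pages(500_000, 525_000)
```
      </preformat>
      <p>The result is roughly 95 sessions per 100 pages, in line with the statement that most HE classes are at about 100.</p>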
      <p>Finally, one can see a clear lag of the R&amp;D segment combined with its high
scientific potential, which may be partly explained by the narrower, niche
nature of its web resources. However, the leaders of this segment show good results
and demonstrate broad development opportunities for the others.</p>
      <p>An analysis of the dynamics of webometric indicators allows a better
understanding of trends in the development of the studied web space, helps to neutralize
weaknesses inherent in the measuring instruments, and provides a better picture. The source
data and tools on the project site at http://www.webometrix.ru enable
researchers and owners of Internet resources to explore trends in the
development of the scientific and educational web space and to determine the position of specific
organizations.</p>
      <p>We understand that webometric rankings are quite rough because of the nature
of the measurement instruments, but we believe that conclusions drawn from
such an assessment may stimulate efforts to improve the web representation of
educational and scientific activities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bar-Ilan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The use of web search engines in information science research</article-title>
          .
          <source>Annual Review of Information Science and Technology</source>
          ,
          <volume>38</volume>
          ,
          <fpage>231</fpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aguillo</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Webometric Ranking of world universities</article-title>
          .
          <source>Higher Education in Europe</source>
          ,
          <volume>33</volume>
          ,
          <fpage>233</fpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Shokin</surname>
            ,
            <given-names>Yu.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimenko</surname>
            ,
            <given-names>O.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rychkova</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shabalnikov</surname>
            ,
            <given-names>I.V.</given-names>
          </string-name>
          :
          <article-title>Website ranking of scientific institutions of SB RAS</article-title>
          .
          <source>Computational Technologies</source>
          ,
          <volume>13</volume>
          ,
          <fpage>128</fpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Guskov</surname>
            ,
            <given-names>A. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bykhovtsev</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosyakov</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          :
          <article-title>Alternative Webometrics: Study of the traffic of the websites of scientific organizations</article-title>
          .
          <source>Scientific and Technical Information Processing, series 1</source>
          ,
          <issue>12</issue>
          ,
          <issue>12</issue>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Van den Bosch</surname>
            ,
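            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Kunder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :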
          <article-title>Estimating search engine index size variability: a 9-year longitudinal study</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>107</volume>
          ,
          <fpage>839</fpage>
          -
          <lpage>856</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kosyakov</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guskov</surname>
            ,
            <given-names>A. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bykhovtsev</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          :
          <article-title>Russia's academic institutions as mirrored by webometrics</article-title>
          .
          <source>Herald of RAS</source>
          ,
          <volume>86</volume>
          ,
          <fpage>490</fpage>
          -
          <lpage>499</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>