=Paper= {{Paper |id=Vol-2543/spaper13 |storemode=property |title=Properties of Communication Graph of Academic Web |pdfUrl=https://ceur-ws.org/Vol-2543/spaper13.pdf |volume=Vol-2543 |authors=Andrey Pechnikov |dblpUrl=https://dblp.org/rec/conf/ssi/Pechnikov19 }} ==Properties of Communication Graph of Academic Web== https://ceur-ws.org/Vol-2543/spaper13.pdf
    Properties of Communication Graph of Academic Web

                               A.A. Pechnikov 1[0000-0002-0683-0019]
 1 Institute of Applied Mathematical Research of the Karelian Research Centre of the Russian

          Academy of Sciences, 11, Pushkinskaya str., 185910, Petrozavodsk, Russia
                           pechnikov@krc.karelia.ru



        Abstract. Web graph is the most popular model of real web fragments used in
        Web science. The study of communities in the web graph contributes to a better
        understanding of the organization of the web fragment and the processes taking
        place in it. Communities of the web-graph fragments of the real Web are often
        poorly differentiated and more difficult meaningful interpretation. It is proposed
        to select in the web graph a communication graph containing only those verti-
        ces (and arcs between them) that have counter arcs, and already in it to investi-
        gate the problem of splitting into communities. By analogy with social studies,
        the ties implemented through the edges in the communication graph are pro-
        posed to be called “strong” and all others “weak”. Thematic communities with
        meaningful interpretation are built on strong ties. At the same time, weak ties
        facilitate communication between sites that do not have common features in the
        sphere of activity, geography, subordination, etc., and basically keep the web
        fragments connected even in the absence of strong ties. Experiments carried out
        for a fragment of the academic Web of Russia show the possibility of meaning-
        ful interpretation of the results and the prospects of this approach.

        Keywords: Web graph, Communication graph, Community in graph, Strength
        of ties.


1       Introduction

The study of graphs of the real (and virtual) world is an important task in many fields,
such as biology, sociology, social networks, webometrics, and many others, because
they allow you to understand the structure of objects and analyze their properties. It
has long been known that University and academic fragments of the Web both in
Russia and other countries [1–3] have quite specific properties characterizing their
structure. In particular, the corresponding web graph has a large strongly connected
component (SCC) and a significant number of "hanging" sites (having either only
links made from them, or, more rarely, links made only to them).
   SCC itself is an interesting object of research, allowing to establish, for example,
the presence or absence of the property of the "small world" [4], which, however,
does little to explain the processes of hyperlinks between sites fragments of the Web.
In this sense, much more food for thought is provided by structural elements of the
graph, such as site communities, when more links are made "within" communities
Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                                                                     415


than between communities. But again, attempts to partition the vertices that make up
the maximum connectivity component of a web graph into communities lead to diffi-
cult to interpret results. The main idea considered in this work is to build a "hard
framework" for a fragment of the Web, leaving only those sites that have counter
hyperlinks, and already on this "framework" to check the properties of the partition
into communities (such a framework will be called a communication graph). And
already further by means of known algorithms the question of structure of communi-
ties of tops, the communication graph and the "weak" web graph from which the
communication graph is removed is investigated. As a real object of research, the
scientific and educational fragment of the Web is taken, for which a meaningful inter-
pretation of the results is given.


2      Basic Concepts, Methods and Tools

The target set of sites is specified by a direct enumeration of their domain names, and
as a result of scanning, all the hyperlinks linking them are found. In the case where
sites belong to one type of activity, such a set is called thematic. Accordingly, a (the-
matic) fragment of the Web is a target set of sites and a set of hyperlinks linking them
[3].
   A web graph of Web fragment is a directed graph without loops and multiple arcs,
whose vertex set is represented by the target site set, and whose arc set is constructed
as follows: two vertices are connected by an arc if there is at least one hyperlink link-
ing the corresponding sites.
   A community (cluster, module, group) of a graph can be informally defined as a set
of vertices with more arcs between them than with the other vertices of the graph [5].
   More strictly, the partition of a graph into communities can be defined through the
notion of modularity. Modularity is a property of a graph and some subdivision of it
into subgraphs (community modules). The modularity measure shows how qualitative
a given partition is in the sense that there are many arcs lying inside subgraphs-
communities and few arcs lying outside subgraphs (connecting communities togeth-
er).
   The modularity measure Q can be given by the following formula [5]:

                                         kk 
                                  
                              1
                        Q          Aij  i j    (ci , c j ).
                                   
                             2m ij        2m 

Here m is the number of arcs in the graph; Aij is the element of the adjacency matrix
of a graph; ki, kj is the degree of the respective nodes; δ(ci, cj) is Kronecker's symbol
calculated by the formula:

                                           1 : if ci  c j
                             (ci , c j )                   ,
                                           0 : if ci  c j

where ci, cj are the class labels of the corresponding vertices.
416


    Accordingly, the best partition into communities is considered to be the partition
on which the maximum Q is achieved.
    BeeCrawler [6] and external hyperlink databases [7] were used for data collection
and analysis. The open-source graph visualization platform Gephi [8] was used for the
study of web graphs, which, among many, implemented programs for calculating the
modularity coefficient and searching for communities in the graph using the algo-
rithms proposed in [9].


3      Fragment of the Scientific and Educational Web of Russia

The initial target set of sites of a fragment of the scientific and educational Web of
Russia dates from the end of 2017, that is, after the Russian Academy of Medical
Sciences and the Russian Academy of Agricultural Sciences (RAAS) joined the Rus-
sian Academy of Sciences (RAS), but in the process of creating enlarged structures
such as federal research centers, as well as enlarging universities.
   Such a “footprint” allows us to expect on the fact that a fragment of the Web con-
tains fairly stable links between sites that have been formed over the past few years.
The initial target set contains 867 objects (it is websites of 279 universities and 588
research institutions). Scientific institutions are understood as organizations of the
RAS from the level of institutes to regional scientific centers and boards of the RAS.
Website of the RAS itself (www.ras.ru) in the target set is not included, because it is a
powerful communicator, significantly distorting the common picture. Further, for the
sake of brevity, we will call scientific institutions "institutes".
   Scanning all 867 sites allows you to get about 20,000 hyperlinks made between
these sites, including multiple links. Replace multiple links to single arcs and get a
web graph containing 867 vertices and 5030 arcs. A test for connectivity and strong
connectivity shows the presence of 73 isolated vertices and 236 hanging (36 vertices
do not have incoming arcs and 200 (!) - outgoing). In terms of content, we note that
73 isolated vertices in the vast majority relate to the sites of institutes that were previ-
ously part of the RAAS.
   The only SCC contains 534 vertices and 4026 arcs. A web graph equal to the max-
imum SCC has a diameter equal to 10 and a modularity coefficient of 0.398 [10], and
is divided into 7 communities containing 40 to 139 vertices. The modularity coeffi-
cient indicates a low tendency of sites to organize into communities: the values of Q
belong to the segment [-1, 1], and clustering is considered "good" somewhere when Q
is greater than 0.6.
   However, one of the communities built has a good meaningful explanation. It in-
cludes 40 vertices corresponding to 4 university sites and 36 institute sites, and a total
of 97 arcs (which on average is significantly less than in the SCC web graph). Of the
40 websites, 35 belong to the current Department of agricultural sciences RAS, as
well as to the Krasnoyarsk scientific center of Siberian Branch of RAS, Kemerovo
Institute of technology of food industry, Kuban state technological University, Voro-
nezh state university of forestry and St. Petersburg Mining University.
   Give comments:
                                                                                    417


   * Krasnoyarsk scientific center for the period of the study managed to become a
Federal research center, which includes 2 agricultural institutes;
   * Kemerovo Institute of technology of food industry is close to agriculture;
   * Kuban state technological university has close ties with agricultural institutes of
Kuban;
   * Voronezh state forestry university has a hyperlink on its website to the website of
the Central scientific agricultural library (which falls under the term "institutes»);
   * St. Petersburg mining university has on its website a hyperlink to the website of
the All-Russian Institute of plant genetic resources named after N.I. Vavilov, where
there is a section of the journal "Agro-industrial complex", issued by the University
library.
   Apart from the last two cases, we can say that this community has a pronounced
agro-industrial theme.


4      The Communication Graph of the Web Graph

Newman and coauthors, when solving the problem of splitting a graph into communi-
ties, avoid graph orientation when dealing with web graphs.
   Quoting from [10, p. 026113-5]: “…Some networks are directed, i.e., their edges
run in one direction only. The world wide web is an example; links in the web point
in one direction only from one web page to another. … However, we have found that
in many cases it is better to ignore the directed nature of a network in calculating
community structure. Often an edge acts simply as an indication of a connection be-
tween two nodes, and its direction is unimportant…”. And if direction (orientation)
matters? - Let's look at fig. 1, which shows agro-industrial community.




                     Fig. 1. Web graph “agro-industrial community”.
418


There are questions. Can the website vniiz.org (All-Russian research institute of grain
and products of its processing) consider a full member of the community when it does
not have hyperlinks to other sites? Can the website spmi.ru (St. Petersburg mining
university) consider a member of a community when there are no links to it from
other members?
   The term "community" (Gemeinschaft) originated in German sociology at the end
of the XIX century and implies the joint activity of its participants with common
goals [11], and in this context the answer to the formulated questions should be nega-
tive. Therefore, we will further consider an undirected graph, which by construction
implies “joint activity”, namely, counter hyperlinks.
   A communication graph of a web graph is an undirected graph that has the same
set of vertices as the web graph, and the set of edges is constructed according to the
following rule: an edge (i, j) belongs to the set of edges of a communication graph if
and only if in the web graph there exist arcs (i, j) and (j, i). A communication graph
constructed in this way can have several connected components and / or isolated ver-
tices. In this case, we exclude isolated vertices (since they do not affect connectivity),
and then study the connected components individually, starting with the maximum.
   The communication graph of the web graph of the scientific and educational Web
contains 313 vertices and 468 edges (Fig. 2).




              Fig. 2. Communication graph of scientific and educational Web.

Of the 313 vertices, 67 belong to universities and 246 to institutes, i.e. the proportion
of universities in the communication graph has decreased by one tenth. The coloring
of the vertices is given according to the resulting partition into communities.
                                                                                             419


   The modularity coefficient of this partition is 0.695, which is quite high. Built 13
communities contain from 7 to 51 vertices, of which 4 can be accurately identified by
one of two features: geography or scientific direction.
   Some of the built communities can be given a meaningful (thematic) interpretation.
   Two "geographical" communities are clearly distinguished:
   No. 2 – "Ural": 1 university (Perm Polytechnical), 23 institutes of the Ural Branch
of RAS and the Research center of fiber optics of RAS from Moscow;
   No. 10 – "Siberia": 4 Siberian universities, 47 Siberian institutes and 1 Institute
from Sevastopol.
   "Humanitarian" community No. 12 contains 10 institutes (archaeology, ethnogra-
phy, linguistic research, linguistics, history, world literature, Slavic studies, philoso-
phy) and two universities (Altai State University and Gorno-Altaisk State University).
   "Medical" community No. 9 is a "star" of websites of medical institutes of Siberia
and the Far East, linked to the website of the Siberian Department of medical Scienc-
es, but not related to each other.
   Members of any communication graph community are present in the web graph by
construction. However, it is not necessary that all members from the same communi-
cation graph community belong to the same web graph community. The reverse is not
true: the sites participating in the agro-industrial community of the web graph are
completely absent from the communication graph.
   Now remove all pairs of counter arcs corresponding to the edges of the communi-
cation graph from the SCC web graph. The resulting "weak" web graph is large
enough – 528 vertices and 3090 arcs, but has a low modularity coefficient of 0.356
and is broken up by 8 communities that do not have a convincing meaningful inter-
pretation.
   Find the maximum SCC of the "weak" web graph, remove all vertices not included
in the SCC and get the SCC "weak" web graph with 461 vertices and 2744 arcs.
Again, the modularity coefficient remains low.


5      Discussion of Results. Strong and Weak Ties

More than 45 years ago, Granovetter [12] proposed the concept of the strength of
social ties. All connections are divided into two categories, strong and weak, in order
to formalize interpersonal relationships based on the duration and frequency of con-
tacts. For example, a strong connection is inherent in friends, and a weak one is inher-
ent in neighbors. Let's try to use the concept of communication strength as an analogy
for sites, although it is initially clear that the analogy is far from complete, since
Granovetter believes that the connections are symmetrical.
   Take a pair of vertices i, j of a directed graph. If there are arcs (i, j) and (j, i) in the
graph, then we will talk about a strong connection of vertices i, j.
   If for vertices i, j there is only one of the arcs (i, j) or (j, i), then we will talk about a
weak connection between vertices i and j.
   Let's look at the dynamics of changes in the proportion of institutes and universi-
ties in the considered graphs, for which we will summarize the data in Table 1.
420


             Table 1. Proportion of sites of institutes and universities in graphs.

 Graph                  Number Number         Proportion Proportion Diame-            Average
                        of       of arcs      of insti-  of univer- ter               path
                        vertices (edges)      tutes      sities                       length
 Initial web graph      867       5030        0,68          0,32          -           -
 SCC web graph          534       4026        0,67          0,33          10          3,6
 Communication web                936
                        313                   0,78          0,22          9           3,45
 graph                            (468)
 "Weak" web graph 528             3090        0,66          0,34          -           -
 SCC "weak" web
                  461             2744        0,63          0,37          11          4,2
 graph

The initial scientific and educational web graph like the SCC web graph contain
strong and weak arcs, the communication graph lacks weak arcs, and the" weak "web
graph and the SCC "weak" web graph lack strong arcs.
   From Table 1 it can be seen that the institutes/universities ratio is practically the
same for all graphs having weak or strong and weak arcs, but changes dramatically
for a graph containing only strong arcs. It can be concluded that strong ties are more
inherent in institutions than universities.
   In [12, p. 1362] it is noted that " ... the stronger the ties the more similar individu-
als are to each other in different aspects". This is confirmed in the case of the com-
munication graph of the scientific and educational Web. We can say that strong con-
nections contribute to the emergence of sustainable thematically "similar" communi-
ties. Although there are few such communities, only 4 out of 13, they have a clear
meaningful interpretation.
   However, if the communication graph is "built up" to the SCC web graph (and
even more so, to the initial web graph), it turns out that none of the four communities
of the communication graph not only was not the basis for communities in web
graphs, but even their members were in different communities. It seems that the weak
links lead to the erosion of communities.
   Recall that in the SCC web graph we found an " agro-industrial" community, and
in the initial web graph we found 73 isolated vertices also related to agricultural insti-
tutes. By linking these two results, the following explanation for this exception can be
offered. The websites of the former RAAS simply did not have time to establish links
with other sites of the target set. Therefore, those RAAS sites that were somehow
connected to each other had few links "outside" and organized a community, and the
rest remained isolated.
   What is the role of weak links? According to Granovetter, weak connections can be
characterized as a system of irregular contacts that do not cover the friends of an indi-
vidual, but connect him with members of other closely related groups in which he
does not belong. From this it follows that an individual can establish relationships
with people from a large number of mutually disjoint groups, while not being a mem-
ber of each of them.
                                                                                          421


   In this sense, weak connections play the role of "bridges" in the graph. Recall that a
bridge is an edge of a (undirected) graph whose removal increases the number of
connectivity components [13]. Granovetter in [12] says that strong ties are almost
never bridges, and as a rule, weak bonds are bridges. In our case, indirect confirma-
tion of this fact is the diameter and average path length, which are almost the same in
the SCC "weak" web graph, SCC web graph and communication graph (Table 1).
That is, the removal of strong ties has almost no effect on the diameter of the graphs,
which means that the arcs corresponding to them are often not bridges.
   Hence the conclusion that the weak ties of the fragment of the Web serve to estab-
lish contacts of sites from disjoint groups, which is very important in the sense of
obtaining new information that does not lie in the circle of interests of this group.
   The work was supported by The Russian Foundation for Basic Research, project
18-07-00628-a.


References
 1. Thelwall, M., Wilkinson, D.: Graph structure in three national academic Webs: Power
    laws with anomalies. Journal of the American Society for Information Science and Tech-
    nology 54(8), pp. 706–712 (2003).
 2. Ortega, J.L., Aguillo, I.F.: Visualization of the Nordic academic web: Link analysis using
    social network tools. Information Processing and Management 44(4), pp. 1624–1633
    (2008).
 3. Pechnikov, A.A.: Metody issledovanija reglamentiruemyh tematicheskih fragmentov Web.
    Trudy Instituta sistemnogo analiza Rossiiskoi akademii nauk. Serija: Prikladnye problem
    upravlenija makrosistemami 59, pp. 134–145 (2010).
 4. Watts, D.J.; Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393,
    pp. 440–442 (1998).
 5. Ermolin, N.A., Mazalov, V.V., Pechnikov, A.A.: Teoretiko-igrovye metody nahojdenija
    soobschestv v akademicheskom Webe. Trudy SPIIRAN 55, pp. 237–254 (2017).
 6. Pechnikov, A.A., Chernobrovkin, D.I. Adaptive Crawler for External Hyperlinks Search
    and Acquisition. Automation and Remote Control 75(3), pp. 587–593 (2014).
 7. Golovin, A.S., Pechnikov, A.A.: Baza dannyh vneshnih giperssylok dlja issledovanija
    fragmentov Weba. In: Informacionnaja sreda vuza XXI veka: materialy VII Vserossiiskoi
    nauchno-prakticheskoi konferencii, pp. 55–57. PetrSU, Petrozavodsk (2013).
 8. The Open Graph Viz Platform, https://gephi.org, last accessed 2019/12/03.
 9. Blondel, V.D., Guillaume, J-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communi-
    ties in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008
    (2008).
10. Newman, M.E., Girvan, M. Finding and evaluating community structure in networks.
    Physical Review E. 69(2), P026113 (2004).
11. Levkina, L.I.: Social’no-istoricheskaja rol’ soobschestv. Rusains, Moscow (2016).
12. Granovetter, M.S.: The Strength of Weak Ties. The American Journal of Sociology 78(6),
    pp. 1360–1380 (1973).
13. Harary, F.: Graph Theory. Addison-Wesley, Boston (1969).