CompeGe: Computing Company Competitor Pairs By
Knowledge Based Inference Combined With
Empirical Validation
Georg Gottlob1 , Jinsong Guo1,*,† , Aditya Jami2 , Markus Kröll2 , Stéphane Reissfelder2 ,
Lukas Schweizer2 , Eric Aichinger2 and Stefano Sferrazza2

1
    Department of Computer Science, University of Oxford, Oxford, OX1 3QD, UK
2
    Meltwater, San Francisco, California, USA


                                         Abstract
                                         This paper is an interim report about an industrial application that uses Datalog combined with empirical
                                         methods to compute competitor information from a knowledge graph. The Owler knowledge graph
                                         is one of the world’s largest companies information systems (CIS). It contains data about 16+ million
                                         companies crowd-sourced from over 1 million experts. In particular, for most companies, it contains
                                         a set of competitors. Such competitor information is very useful for many B2B applications such as
                                         lead generation. However, competitor relations in crowd-sourced CIS are naturally incomplete. This
                                         paper presents CompeGen, a method that applies Vadalog (a particular variant of Datalog) rules to infer
                                         new competitor pairs from existing ones in the Owler CIS. Since using the Vadalog inference program
                                         alone is insufficient, CompeGen combines its inference process with a “learning” process to acquire
                                         some required logical facts and further validates the inference results via an empirical validation process.
                                         CompeGen was tested using the companies belonging to the “Internet Software” sector in Owler. It has
                                         discovered 23,180 new competitors of which over 80% were correct. We are improving the system and
                                         will report further results in the full paper.


                                         Keywords
                                         Datalog applications, Vadalog, competitor computation, knowledge-based inference, empirical validation


Datalog 2.0 2022: 4th International Workshop on the Resurgence of Datalog in Academia and Industry, September 05,
2022, Genova - Nervi, Italy
*
  Corresponding author.
†
  This work was done while Jinsong Guo was working as an external consultant for DeepReason.ai.
$ georg.gottlob@cs.ox.ac.uk (G. Gottlob); jinsong.guo@cs.ox.ac.uk (J. Guo); aditya.jami@meltwater.com (A. Jami);
markus.kroell@meltwater.com (M. Kröll); stephane.reissfelder@meltwater.com (S. Reissfelder);
lukas.schweizer@meltwater.com (L. Schweizer); eric.aichinger@meltwater.com (E. Aichinger);
stefano.sferrazza@meltwater.com (S. Sferrazza)
 0000-0002-2353-5230 (G. Gottlob); 0000-0002-1142-3610 (J. Guo); 0000-0003-1167-1777 (L. Schweizer);
0000-0001-9569-4428 (S. Sferrazza)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                                         207
Georg Gottlob et al. CEUR Workshop Proceedings                                             207–211


1. Introduction
In today’s global competitive environment, competitor data constitutes information useful to
many business applications, such as Competitive Intelligence, Lead Generation, Recommender
Systems, and so on. For example, for lead generation, the competitors of a customer A of some
company C may be good sales leads for C, as they are likely to require the same products
or services as A. Competition-data sellers usually maintain a manually curated database or
knowledge graph containing competitor pairs as part of a companies information system (CIS)
that also maintains other useful information about companies such as the industry sectors in
which they operate. The Owler1 knowledge graph (a.k.a., the Owler competitive graph, see
the graph on the left of Fig. 1 as an example) is one of the world’s largest CIS. It contains data
about 16+ million companies crowd-sourced from over 1 million experts. However, competitor
relations in such crowd-sourced CIS are naturally rather incomplete. The main goal that we
target is to infer new competitor pairs from existing competitor pairs in the Owler CIS.
Datalog programs that perform inference based on knowledge about companies are naturally
suitable for inferring new competitor pairs from existing ones in a CIS. However, using a Datalog
program alone is not sufficient because the required knowledge may be absent, and the inference
results are sometimes inaccurate. In this paper, we present CompeGen designed based on our
key idea:
    Key Idea: Combine a knowledge-based inference process with an upstream knowledge-
    learning process and a downstream validation process.

CompeGen “learns” new knowledge from the existing knowledge in Owler, and then performs
an inference process based on both the learned and the existing knowledge about companies.
The inference results are further validated by an empirical validation process based on co-
occurrences of companies in documents of a document repository. The combination of (𝑖)
knowledge about companies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], and (𝑖𝑖) co-occurrences of companies
in documents of a document repository [11, 12, 13, 14, 15], distinguishes CompeGen from
conventional methods that leverage either one or the other of the two tools without combining
them. Fig. 1 illustrates the workflow of CompeGen, which will be detailed in the next two
sections.


2. Knowledge-based Inference
The inference process of CompeGen uses several different types of knowledge (represented as
logical facts) about companies: (𝑎) Competitor(A, B, 𝑠) which represents an existing competitor
pair in Owler that a company A has a competitor of B with a proximity score 𝑠 ranging from 0
(extremely unlikely to compete) to 100000 (sure competitors); (𝑏) CompSector(C1,S1) which
represents the fact that a company identified by the company ID C1 belongs to the industry
sector S1; (𝑐) CompatSec(S1,S2,𝑐) which represents the fact that two sectors S1 and S2 are
compatible sectors with a sector compatibility score of 𝑐. Two sectors, S and S’, are compatible if
there are more than a certain amount of competitor pairs in each of which one company belongs
1
    https://corp.owler.com/


                                                 208
Georg Gottlob et al. CEUR Workshop Proceedings                                                 207–211


Figure 1: A simple example of the workflow of CompeGen.
to S and the other belongs to S’ (see Section 3). For example, in Figure 1, the sectors S1 and
 S2 are compatible. Knowledge of types (𝑎) and (𝑏) comes directly from the Owler CIS while
the knowledge about sector compatibilities is learned from existing knowledge in the Owler
CIS, which is explained in the next section. Based on such knowledge, CompeGen computes
candidate competitor pairs via Vadalog, which is a particular variant of Datalog well-suited for
knowledge graphs [16]:
Cand(C1,C2,PScore):-Competitor(C1,C2,PScore).
Cand(C1,C3,PScore):-Cand(C1,C2,PS12),Competitor(C2,C3,PS23),
   CompSector(C1,SEC1),CompSector(C3,SEC3),C1 != C3,
   CompatSec(SEC1,SEC3,SeCoScore),Penalty(A), Cutoff(B),
   PScore=max((PS12+PS23-100000-A)*SeCoScore), PScore>B.

Each candidate competitor pair (C1,C3) computed by the above program is represented by the
fact Cand(C1,C3,PScore) where PScore is a plausibility score expressing a degree of plausibility
that company C3 is a competitor of company C1. The above Vadalog program computes a new
fact Cand(C1,C3,PScore), either if such a fact is already in the Competitor relation, or if there
is an already computed fact Cand(C1,C2,PS12) and a fact Competitor(C2,C3,PS23), where the
certain conditions are satisfied. These conditions require that C1 be different from C3, and that
the computed plausibility score PScore be larger than some cutoff constant B (represented as
Cutoff(B)), where PScore is the maximum value of (PS12 + PS23 − 100000−𝐴) × 𝑆𝑒𝐶𝑜𝑆𝑐𝑜𝑟𝑒
(SeCoScore is the compatibility score of the sectors of C1 and of C3) over all matching choices of
C2, SEC1, and SEC3. A penalty constant 𝐴 (represented as Penalty(A)), with 0 < 𝐴 < 100000,
lowers the proximity scores of new candidate pairs generated by transitivity.


3. System Overview
The CompeGen approach can be intuitively explained via the main steps described below.
First, at Step I , a knowledge-learning process is performed to learn the knowledge about sector
compatibilities, i.e., sector compatibility scores, from the data in the Owler CIS. Let 𝑆 be a sector
and 𝑆1 , . . . , 𝑆𝑛 be all sectors such that there is at least one competitor pair from 𝑆 and 𝑆𝑖 , i.e.,
there is one edge from 𝑆 to 𝑆𝑖 in the competitive graph, for 1 ≤ 𝑖 ≤ 𝑛. For every such 𝑆𝑖 ,
𝑁𝑖 is defined to be the number of edges from 𝑆 to 𝑆𝑖 . Let 𝑁 = max1≤𝑖≤𝑛 (𝑁𝑖 ). Let 𝑐 = 0.2


                                                  209
Georg Gottlob et al. CEUR Workshop Proceedings                                             207–211


be some empirically decided cutoff-constant. If 𝑁  𝑁 < 𝑐 this then means that 𝑆𝑖 and 𝑆 are not
                                                    𝑖

compatible, and thus the compatibility weight 𝑓 (𝑆, 𝑆𝑖 ) = 0. Otherwise, 𝑓 (𝑆, 𝑆𝑖 ) is calculated
via an empirically determined function 𝑓 (𝑆, 𝑆𝑖 ) = 1 − (1 − 𝑁 𝑁 ) , where 𝑚 = 3. A smaller 𝑚,
                                                                𝑖 𝑚

such as 1, may cause 𝑓 (𝑆, 𝑆𝑖 ) to be lower than expected, especially when 𝑁 is very large while
𝑁𝑖 is also large but much smaller than 𝑁 . For example, when 𝑁 = 20000 and 𝑁𝑖 = 10000,
𝑓 (𝑆, 𝑆𝑖 ) = 0.5 if 𝑚 = 1, while 𝑓 (𝑆, 𝑆𝑖 ) = 0.875 if 𝑚 = 3, and the latter is more reasonable.
The compatibility score 𝑓 * (𝑆, 𝑆𝑖 ) is the maximum of the compatibility weights 𝑓 (𝑆, 𝑆𝑖 ) and
𝑓 (𝑆𝑖 , 𝑆).
Next, at Step II , candidate competitor pairs, e.g., (A,C) in Fig. 1, are generated via the Vadalog
program described in Section 2.
The next task (Step III ) is to validate each generated candidate competitor pair, e.g., (A,C),
against a document repository. The Web is used as the document repository in CompeGen.
A Competition Likelihood Score CLS(A,C) of A and C, ranges from 0 (not competitors) to 1
(competitors), is determined based on the co-occurrences of A and C in different Web pages.
CLS(A,C) is calculated based on a comparison of a number of search results for two groups of
queries to the document repository (Step III’ ): (i) a first group of queries, for co-occurrences of
names of A and of C together with names of some competitors A*𝑖 of A (if any), such as D, or
some competitors C*𝑖 of C (if any), such as E. (ii) a second group of queries, corresponding to the
first queries, where either A or C is replaced by random companies R(A) or R(C) from the CIS
not known to be in a competitor relationship with A or C, such as H. Examples of web pages that
match these two groups of queries are in dashed boxes labeled Q1 and Q2 , respectively. From
the average number of search results of queries in query groups (i) and (ii), denoted by 𝑛1 and
𝑛2 , respectively, a Query-result Difference Ratio (QDR) is calculated by 𝑛1 /𝑛2 (Step III” ). Based
on the QDR, the likelihood score CLS(A,C) is calculated (Step III”’ ) according to a predefined
QDR-to-CLS lookup table, such as T1 , whereby a higher CLS is achieved if the QDR is larger.
The final part (Step IV ) computes for each candidate pair (A,C) a final proximity score via:
                                        5
fs(𝐴, 𝐶) = 𝑃 𝑆𝑐𝑜𝑟𝑒(𝐴,𝐶)+𝐶𝐿𝑆(𝐴,𝐶)×10
                           2              . If (i) fs(𝐴, 𝐶) is larger than a given constant (e.g.,
90000), and (ii) (A,C) is not already stored in the Owler CIS with a score 𝑠 ≥ fs(𝐴, 𝐶) then
(A,C) is inserted into the CIS with fs(𝐴, 𝐶).


4. Evaluation
CompeGen was tested using the companies belonging to the “Internet Software” sector in Owler.
CompeGen has discovered 23,180 new competitor pairs. We randomly sampled 200 of these new
competitor pairs and asked internal experts to validate these 200 pairs’ correctness. It turned
out that 162 pairs were correct, thus the precision of CompeGen is around 0.81. Since the total
number of missing competitor pairs in Owler was unknown, it was hard to evaluate the recall
of CompeGen. However, discovering 23,180 new competitor pairs with a precision of 0.81 for a
single sector has shown the potential to improve the Owler CIS using CompeGen. That has led
to Meltwater’s adoption of CompeGen in the completion task of Owler. We are making several
improvements to CompeGen, of which the results will be reported in the full paper.


                                                 210
Georg Gottlob et al. CEUR Workshop Proceedings                                           207–211


References
 [1] B. H. Clark, D. B. Montgomery, Managerial identification of competitors, Journal of
     Marketing 63 (1999) 67–83.
 [2] J. F. Porac, H. Thomas, Taxonomic mental models in competitor definition, Academy of
     management Review 15 (1990) 224–240.
 [3] T. Lappas, G. Valkanas, D. Gunopulos, Efficient and domain-invariant competitor min-
     ing, in: Proceedings of the 18th ACM SIGKDD international conference on Knowledge
     discovery and data mining, 2012, pp. 408–416.
 [4] G. Valkanas, T. Lappas, D. Gunopulos, Mining competitors from large unstructured
     datasets, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 1971–1984.
 [5] N. Joseph, B. S. Kumar, Top-k competitor trust mining and customer behavior investiga-
     tion using data mining technique, Journal of Network Communications and Emerging
     Technologies (JNCET) 8 (2018) 26–30.
 [6] G. Pant, O. R. Sheng, Web footprints of firms: Using online isomorphism for competitor
     identification, Information Systems Research 26 (2015) 188–209.
 [7] T.-N. Doan, F. C. T. Chua, E.-P. Lim, Mining business competitiveness from user visitation
     data, in: International Conference on Social Computing, Behavioral-Cultural Modeling,
     and Prediction, Springer, 2015, pp. 283–289.
 [8] N. Raman, G. Bang, A. Nematzadeh, Multigraph attention network for analyzing company
     relations, in: Proceedings of the 2019 8th International Conference on Computing and
     Pattern Recognition, 2019, pp. 426–433.
 [9] Y.-P. Chen, T.-L. Hsu, W.-K. Chung, S.-C. Dai, L.-W. Ku, Upstream, downstream or
     competitor? detecting company relations for commercial activities, in: International
     Conference on Human-Computer Interaction, Springer, 2019, pp. 42–52.
[10] W. Qin, X. Luo, H. Wang, Implicit business competitor inference using heterogeneous
     knowledge graph, in: 2021 IEEE International Conference on Big Knowledge (ICBK), IEEE,
     2021, pp. 198–205.
[11] R. Li, S. Bao, J. Wang, Y. Yu, Y. Cao, Cominer: An effective algorithm for mining competitors
     from the web, in: Sixth International Conference on Data Mining (ICDM’06), IEEE, 2006,
     pp. 948–952.
[12] R. Li, S. Bao, J. Wang, Y. Liu, Y. Yu, Web scale competitor discovery using mutual informa-
     tion, in: International Conference on Advanced Data Mining and Applications, Springer,
     2006, pp. 798–808.
[13] S. Bao, R. Li, Y. Yu, Y. Cao, Competitor mining with the web, IEEE Transactions on
     Knowledge and Data Engineering 20 (2008) 1297–1310.
[14] R. L. Cilibrasi, P. M. Vitanyi, The google similarity distance, IEEE Transactions on
     knowledge and data engineering 19 (2007) 370–383.
[15] O. P. Damani, Improving pointwise mutual information (pmi) by incorporating significant
     co-occurrence, arXiv preprint arXiv:1307.0596 (2013).
[16] L. Bellomarini, E. Sallinger, G. Gottlob, The vadalog system: datalog-based reasoning for
     knowledge graphs, Proceedings of the VLDB Endowment 11 (2018) 975–987.


                                                 211