=Paper=
{{Paper
|id=None
|storemode=property
|title=Analysis of the DBLP Publication Classification Using Concept Lattices
|pdfUrl=https://ceur-ws.org/Vol-706/poster09.pdf
|volume=Vol-706
|dblpUrl=https://dblp.org/rec/conf/dateso/AlwahaishiMSK11
}}
==Analysis of the DBLP Publication Classification Using Concept Lattices==
Saleh Alwahaishi, Jan Martinovič, Václav Snášel, and Miloš Kudělka
Department of Computer Science, FEECS, VŠB – Technical University of Ostrava,
17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic
{salehw, jan.martinovic, vaclav.snasel, milos.kudelka}@vsb.cz
Abstract. The definitive classification of scientific journals depends on their
aims and scopes. In this paper, we present an approach to facilitate the
classification of the journals in the DBLP dataset. For the analysis, the DBLP
dataset was pre-processed by assigning each journal attributes defined by its
topics. We then show how the theory of formal concept analysis can be
applied to analyze the relations between journals and the topics extracted from
their aims and scopes, and how this approach can be used to facilitate
the classification of scientific journals.
1 Introduction
Formal Concept Analysis (FCA) was invented in the early 1980s by Rudolf Wille as a
mathematical theory [1]. FCA is concerned with the formalization of concepts and
conceptual thinking and has been applied in many disciplines such as software
engineering, knowledge discovery and information retrieval during the last two
decades. The mathematical foundation of FCA is described in [2]. In this paper, we
describe how we used FCA to create a visual overview of the DBLP scientific
journals classification based on their aims and scopes. As a case study, we zoom in on
the top journals based on their impact factors.
FCA is a mathematical theory for concepts and concept hierarchies that reflects an
understanding of “concept”. It explicitly formalizes extension and intension of a
concept, their mutual relationships, and the fact that increasing intent implies
decreasing extent and vice versa. Based on lattice theory, it allows deriving a concept
hierarchy from a given dataset. FCA thus complements other conceptual
knowledge representations, and the combination of FCA with other representations
has been the topic of many publications. For instance, several approaches combined
FCA with description logics [3, 4] and with conceptual graphs [5, 6].
The remainder of this paper is organized as follows. Section 2 gives an
overview of the Digital Bibliography & Library Project (DBLP). Section 3
explains the classification criterion of the journals and applies the concept
lattice to the selected journals, illustrated with examples. Section 4 concludes
the paper.
V. Snášel, J. Pokorný, K. Richta (Eds.): Dateso 2011, pp. 132–139, ISBN 978-80-248-2391-1.
2 Digital Bibliography & Library Project (DBLP)
Digital libraries are collections of resources and services stored in digital formats and
accessed by computers. They offer an interesting case study for researchers
for two reasons: first, they grow quickly; second, they represent a
multidisciplinary domain that has attracted researchers from a wide range of
expertise. DBLP (Digital Bibliography & Library Project) is a computer science
bibliography database hosted at the University of Trier, Germany.
It was started at the end of 1993 and listed more than one million articles on
computer science in January 2010. These articles were published in journals such as
VLDB, the IEEE and the ACM Transactions, and in conference proceedings [7, 8].
DBLP is a credible resource for finding publications, and its dataset has also
been widely investigated in a number of studies related to data mining and social
networks to solve tasks such as recommender systems, expert finding, and name
ambiguity. Even though the DBLP dataset provides abundant information about
author relationships, conferences, and scientific communities, it has a major limitation:
its records provide only the paper title, without the abstract and index terms.
In addition to finding academic experts, the DBLP dataset has been used
extensively in academic recommender systems. A number of studies have
recommended academic events and collaborators to researchers using different
methods and techniques. For example, a recommender system for academic
collaboration called DBconnect was presented in [9]. Its authors used
DBLP data to generate bipartite (author-conference) and tripartite
(author-conference-topics) graph models, and designed a random walk algorithm
for these models to calculate the relevance score between authors. In another
study [10], a recommender system for events and scientific communities for
researchers was proposed based on social network analysis.
Querying large datasets produces large result sets, which leaves users unable to
decide where to start looking at the results. To solve this problem,
clustering and ranking were suggested in many papers. A system to visualize author
information and relationships simultaneously was presented in [11]. The authors
applied two types of clustering, keyword clustering and author clustering, to visualize
the relationships and groupings of authors. In [12], document clustering was applied to
provide an overview of the recent trends in data mining activities. Clustering and
ranking are often applied separately, but in [13] a novel framework called RankClus
was proposed to integrate them. To increase the accuracy of IR clustering, the authors
in [14] proposed transferring knowledge available on the word side to the document
side; they introduced a model based on nonnegative matrix factorization to achieve it.
3 Concept Analysis Of Journals Classification
This section describes how formal concept analysis is employed to analyze the
classification of DBLP journals. Formal Concept Analysis can be used as an
unsupervised clustering technique. The starting point of the analysis is a database
table consisting of rows G (i.e. objects), columns M (i.e. attributes) and crosses
I ⊆ G×M (i.e. relationships between objects and attributes). The mathematical structure
used to represent such a cross table is called a formal context (G, M, I).
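As an illustration of this definition, a formal context can be held as three plain sets. The sketch below is only one convenient representation under stated assumptions, not the authors' implementation: the class name `FormalContext` and the helper `cross_table` are hypothetical, and the two sample rows (journals H and Q) are taken from the extent-intent pairs reported later in this section.

```python
from typing import FrozenSet, NamedTuple, Tuple


class FormalContext(NamedTuple):
    """A formal context (G, M, I); the field names are illustrative only."""
    objects: FrozenSet[str]                 # G: journals
    attributes: FrozenSet[str]              # M: topics
    incidence: FrozenSet[Tuple[str, str]]   # I, a subset of G x M


def cross_table(ctx: FormalContext) -> str:
    """Render the context as a plain-text cross table with 'x' marks."""
    cols = sorted(ctx.attributes)
    lines = ["  " + " ".join(cols)]
    for g in sorted(ctx.objects):
        cells = ["x".center(len(m)) if (g, m) in ctx.incidence
                 else " " * len(m) for m in cols]
        lines.append(g + " " + " ".join(cells))
    return "\n".join(lines)


G = frozenset({"H", "Q"})
M = frozenset({"AIIP", "CTM", "CS", "ISLIS", "CA", "DIP", "STVV"})
I = frozenset(
    [("H", m) for m in ("CTM", "CS", "STVV")]
    + [("Q", m) for m in ("AIIP", "CTM", "CS", "ISLIS", "CA", "DIP")]
)
ctx = FormalContext(G, M, I)

# Every cross must relate a known object to a known attribute.
assert all(g in ctx.objects and m in ctx.attributes for g, m in ctx.incidence)
print(cross_table(ctx))
```

Storing I as a set of pairs keeps the representation close to the mathematical definition; a boolean matrix is an equivalent alternative.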
A group of similar journals of interest, covering the scope of computer
science, was selected. The list of selected journals (objects) was obtained from the
well-known DBLP database, which contains information about published articles and
their authors. The selected list contains 115 journals. The
next step was to identify the main topics (attributes) that each journal covers.
From the journal web sites we collected the aim and scope of each journal and
manually extracted the main topics, such as Pattern Recognition, Image Processing,
etc. Because journals describe the same topic under their own or similar names,
each journal was then mapped onto an existing classifier [15]. This classifier
contains about 1224 sub-disciplines grouped into disciplines, which are in turn grouped
into discipline fields; e.g. the sub-discipline Pattern Recognition belongs to the
discipline Artificial Intelligence and Image Processing, which belongs to the field
Information and Computing Sciences. We selected only sub-disciplines in the fields
Technology and Information and Computing Sciences. In many cases our manually
extracted topics correspond to the classified disciplines, but in some cases it was
necessary to assign an extracted topic to the most similar sub-discipline. The journals
were thus classified into a list of topics based on their relation to each topic. The
classification process ends up with ten main topics that have twenty-nine subfields or
disciplines. Table 3 shows the main topics and their subfields.
A journal is represented as a list of topics. The topics are the disciplines
covered by the journals, based on the data extracted from their aims and scopes. Each
topic is assigned a weight of 0 or 1. A topic’s weight for a journal expresses
whether the journal covers the topic: a value of 1 denotes that the
journal covers the column’s topic and 0 denotes the lack of coverage. Formally, these
data can be represented as a matrix of journals by topics whose m rows and n columns
correspond to m journals and n topics, respectively. The elements of the journal-topic
matrix are the weights of each topic for a particular journal, that is:
$$
\begin{pmatrix}
y_{11} & y_{12} & \cdots & y_{1n} \\
y_{21} & y_{22} & \cdots & y_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m1} & y_{m2} & \cdots & y_{mn}
\end{pmatrix},
\qquad y_{ij} \in \{0, 1\},
$$
where yij denotes the weight assigned to topic Tj for journal Ji.
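The journal-topic matrix can be sketched in a few lines. This is an illustrative example under assumptions, not the authors' code: the function name `journal_topic_matrix` is hypothetical, the column order follows the header of the formal context table, and only the rows for journals H and Q are filled in, using the topic sets reported later in this section.

```python
# Column order as in the header of the formal context table.
TOPICS = ["AIIP", "CTM", "CS", "ISLIS", "DF", "DC", "CT", "CA", "DIP", "STVV"]

# Topic sets for two journals, taken from the extent-intent pairs in the text.
COVERAGE = {
    "H": {"CTM", "CS", "STVV"},
    "Q": {"AIIP", "CTM", "CS", "ISLIS", "CA", "DIP"},
}


def journal_topic_matrix(coverage, topics):
    """Return (journals, Y) where Y[i][j] is 1 if journal i covers topic j."""
    journals = sorted(coverage)
    matrix = [[1 if t in coverage[j] else 0 for t in topics] for j in journals]
    return journals, matrix


journals, Y = journal_topic_matrix(COVERAGE, TOPICS)
for name, row in zip(journals, Y):
    print(name, row)
```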
The formal concept analysis of the data starts with the creation of a formal context.
The formal objects of the formal context are the journals Ji that were retrieved from
the DBLP database. The set of these journals is denoted by J. The topics Tj, extracted
from the aims and scopes of the journals in J and indicating which topics those
journals cover, constitute the formal attributes of the formal context. The set
containing these attributes is denoted by T.
The cross table of the resulting formal context has a row for each journal in J, a
column for each topic in T, and a cross in the row of Ji and the column of Tj if the
corresponding weight yij is 1. To reduce the size of the cross table, the journals’
impact factors are used to decrease the number of journals analyzed. Only journals
with an impact factor of 3.0 and above are included in the matrix, reducing the number
of selected journals to 18, as shown in Table 1. After the formal context is
constructed, formal concept analysis is applied to produce the concept lattice.
Table 2 represents the formal context. A cross in the row of Ji and the column of Tj
indicates that Tj is considered a topic covered by journal Ji.
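The impact-factor cut-off amounts to a simple filter applied before the cross table is built. The sketch below is a hedged illustration: the function names `select_journals` and `restrict` are hypothetical, and the dictionary holds only a subset of the values listed in Table 1.

```python
# Abbreviation -> impact factor, a subset of the values listed in Table 1.
IMPACT = {"A": 6.878, "B": 5.96, "C": 5.358, "Q": 3.178, "R": 3.095}


def select_journals(impact, threshold=3.0):
    """Return abbreviations with impact factor >= threshold, highest first."""
    kept = [j for j, factor in impact.items() if factor >= threshold]
    return sorted(kept, key=lambda j: impact[j], reverse=True)


def restrict(coverage, kept):
    """Keep only the cross-table rows of the selected journals."""
    return {j: topics for j, topics in coverage.items() if j in kept}


print(select_journals(IMPACT))        # every listed journal passes 3.0
print(select_journals(IMPACT, 5.0))   # a stricter, purely illustrative cut-off
```

The threshold is a parameter here only to show the effect of the cut-off; the paper itself uses 3.0.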
Table 1. Journals’ impact factors and abbreviations
Abbreviation Journal Impact Factor
A Nucleic Acids Research 6.878
B IEEE Transactions on Pattern Analysis and Machine Intelligence 5.96
C International Journal of Computer Vision 5.358
D Computer Applications in the Biosciences 4.328
E Journal of Selected Areas in Communications 4.249
F Transactions on Medical Imaging 4.004
G Transactions on Information Theory 3.793
H BMC Bioinformatics 3.78
I Transactions on Neural Networks 3.726
J Journal of Chemical Information and Computer Sciences 3.643
K Transactions on Fuzzy Systems 3.624
L Journal of Computational Chemistry 3.39
M Transactions on Graphics 3.383
N Transactions on Mobile Computing 3.352
O Transactions on Image Processing 3.315
P Pattern Recognition 3.279
Q Automatica 3.178
R Information Sciences 3.095
The intent of each formal concept contains precisely those topics covered by all
journals in the extent. Conversely, the extent contains precisely those journals sharing
all topics in the intent.
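The two derivation operators behind this definition, and a naive enumeration of all formal concepts, can be sketched as follows. This is a minimal illustration under assumptions, not the algorithm used by the authors (the paper does not name one): the function names are hypothetical, the enumeration is exponential in the number of journals, and the context is restricted to journals H and Q, whose topic sets are the ones reported in this section. The resulting lattice is therefore much smaller than the one in Fig. 1.

```python
from itertools import combinations

TOPICS = ["AIIP", "CTM", "CS", "ISLIS", "DF", "DC", "CT", "CA", "DIP", "STVV"]
ROWS = {
    "H": {"CTM", "CS", "STVV"},
    "Q": {"AIIP", "CTM", "CS", "ISLIS", "CA", "DIP"},
}


def intent(journals, rows, topics):
    """Topics covered by all given journals (all topics for the empty set)."""
    shared = set(topics)
    for j in journals:
        shared &= rows[j]
    return frozenset(shared)


def extent(topic_set, rows):
    """Journals that cover every topic in topic_set."""
    return frozenset(j for j, covered in rows.items() if topic_set <= covered)


def concepts(rows, topics):
    """Enumerate all (extent, intent) pairs by closing every journal subset."""
    found = set()
    names = sorted(rows)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            b = intent(subset, rows, topics)
            found.add((extent(b, rows), b))
    return found


for a, b in sorted(concepts(ROWS, TOPICS), key=lambda c: len(c[0])):
    print(sorted(a), sorted(b))
```

For larger contexts a dedicated algorithm such as NextClosure would be preferable to closing every subset.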
The line diagram of the concept lattice, which shows the partially ordered set of
concepts, is given in Fig. 1. It has the minimal set of edges necessary; all other edges
can be derived by using reflexivity and transitivity. Journals and topics label the node
that represents the formal concept they generate. All concept nodes above a node
labeled by a journal have the journal in their extent. All concept nodes below a node
labeled by a topic have the topic in their intent. The extent of the concept node labeled
by the topic “STVV”, for example, is easily found by collecting the journal H, which
labels this concept node, on a path going downward.
Table 2. Formal context
AIIP CTM CS ISLIS DF DC CT CA DIP STVV
A x x
B x x
C x x
D x x
E x
F x x x x
G x x x x x
H x x x
I x
J x x x
K x
L x x
M x
N x x x
O x x
P x
Q x x x x x x
R x x x x x x
The intent of this concept is found by first collecting the topic “STVV” and then going
upward to collect the topics “CTM” and “CS” labeling the two concepts found on
paths going upward. The resulting extent-intent pair of this concept is ({H},
{CTM,CS,STVV}).
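The "minimal set of edges" of a line diagram is the covering relation of the concept order: an edge joins two concepts only if no third concept lies strictly between them. The sketch below illustrates this under assumptions; the function name `cover_relation` is hypothetical, and the four extents are those of the small sub-context containing only journals H and Q, not the full lattice of Fig. 1.

```python
def cover_relation(extents):
    """Return (lower, upper) pairs where upper covers lower under inclusion."""
    edges = []
    for lower in extents:
        for upper in extents:
            if lower < upper and not any(lower < mid < upper for mid in extents):
                edges.append((lower, upper))
    return edges


# Extents of the four concepts of the two-journal sub-context {H, Q}.
EXTENTS = [
    frozenset(),
    frozenset({"H"}),
    frozenset({"Q"}),
    frozenset({"H", "Q"}),
]

for lower, upper in cover_relation(EXTENTS):
    print(sorted(lower), "->", sorted(upper))
```

The pair formed by the empty extent and {H, Q} is not an edge, because it can be derived by transitivity through {H} or {Q}.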
The concept generated by the topic “DF” is a subconcept of the concept generated
by the topic “AIIP”, because the extent of the former concept is contained in the extent
of the latter. All journals classified by the topic “DF” were also classified by the
topic “AIIP”, suggesting that within the given formal context “DF” is a more specific
topic than “AIIP”.
A more complex example is the concept node labeled by the topic “DIP”, whose
extent is found by collecting the journal Q labeling this
concept node on a path going downward. The intent of this concept is found by
collecting the topics “CTM”, “CA”, and “ISLIS” labeling the three concepts found on
paths going upward. The concepts generated by the latter two topics, however, are
subconcepts of the concept generated by the topic “AIIP”. The resulting extent-intent
pair of this concept is ({Q}, {AIIP,CTM,CS,ISLIS,CA,DIP}).
Fig. 1. Concept lattice for journals classification
4 Conclusion
The concept lattice uncovers relational and contextual information. Journals’ topic
categorizations are put into relational context depending on how they are associated
by the journals’ aims and scopes. The topics “Computation Theory and Mathematics
(CTM)” and “Data and Information Processing (DIP)”, for example, are shown as
related because they share a similar classification context. The implicit
structures revealed help researchers to classify journals more efficiently. This
approach has the potential to support the emergence of new knowledge by identifying
concept relations, making them explicit, and enabling researchers to inspect them.
Concept lattices are not intended to build or substitute traditional static ontologies;
rather, they aim to support specifications of less rigorous relations, or associations
[16], which might be more intuitive to knowledge workers and lead to more
interesting links via associations.
Table 3. Journals’ main topics and subfields
Abbreviation | Main Topic | Subfields
AIIP | Artificial Intelligence and Image Processing | Adaptive Agents and Intelligent Robotics; Neural, Evolutionary and Fuzzy Computation; Simulation and Modeling; Computer Vision; Pattern Recognition and Data Mining; Signal Processing; Image Processing; Computer Graphics
CTM | Computation Theory and Mathematics | Other Computation Theory and Mathematics; Numerical Computation; Applied Discrete Mathematics; Computational Logic and Formal Languages; Analysis of Algorithms and Complexity
CS | Computer Software | Software Engineering; Operating Systems; Computer System Security; Bioinformatics Software
ISLIS | Information Systems and Library and Information Studies | Database and Database Management; Information Retrieval and Web Search; Inter-organizational Information Systems and Web Services; Information Systems Management; Information Systems Development Methodologies
DF | Data Format | Data Encryption; Data Structures
DC | Distributed Computing | Mobile Technologies; Distributed Computing
CT | Communications Technologies | Computer Communications Networks (computer network); Wireless Communications; Other Communications Technologies (telecommunications)
CA | Computer Architecture | -
DIP | Data and Information Processing | -
STVV | Software Testing and Verification & Validation | -
5 References
1. Wille, R. (1982) Restructuring lattice theory: an approach based on hierarchies of
concepts. In I. Rival (Ed.), Ordered Sets. Reidel, Dordrecht-Boston, 445-470.
2. Ganter, B., Wille, R. (1999) Formal Concept Analysis: Mathematical foundations.
Springer
3. Beydoun, G. (2009) Using Formal Concept Analysis towards Cooperative E-Learning. D.
Richards and B.H. Kang (Eds.): PKAW, LNAI 5465, 109-117. Springer
4. Priss, U. (2006) Formal Concept Analysis in Information Science. In B. Cronin (Ed.),
Annual Review of Information Science and Technology, ASIST, Vol. 40.
5. Ganter, B., Kuznetsov, S.O. (2008) Scale Coarsening as Feature Selection. Medina and S.
Obiedkov (Eds.) : ICFCA, LNAI 4933, 217-228. Springer.
6. Stumme, G., Wille, R., Wille, U. (1998) Conceptual knowledge discovery in databases
using Formal Concept Analysis Methods. PKDD, 450-458.
7. Ley, M. (2002) The dblp computer science bibliography: Evolution, research issues,
perspectives. SPIRE 2002: Proceedings of the 9th International Symposium on String
Processing and Information Retrieval. London, UK: Springer-Verlag, pp. 1–10.
8. URL, http://en.wikipedia.org/wiki/DBLP.
9. Zaiane, O. R., Chen, J., and Goebel, R. (2007) Dbconnect: mining research community on
dblp data. WebKDD/SNA-KDD ’07: Proceedings of the 9th WebKDD and 1st SNA-
KDD 2007 workshop on Web mining and social network analysis. New York, NY, USA:
ACM, pp. 74–81.
10. R. Klamma, P. M. Cuong, and Y. Cao (2009) You never walk alone: Recommending
academic events based on social network analysis. Complex (1), pp. 657–670.
11. Chan, S., Pon, R., and Cardenas, A. (2006) Visualization and clustering of author social
networks. Distributed Multimedia Systems Conference, pp. 174–180.
12. Peng, Y., Kou, G., and Shi, Y. (2006) Recent trends in data mining: Document clustering
of dm publications. International Conference on Service Systems and Service
Management, vol. 2, pp. 1653–1659.
13. Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, “Rankclus: integrating clustering
with ranking for heterogeneous information network analysis,” in EDBT ’09:
Proceedings of the 12th International Conference on Extending Database Technology.
New York, NY, USA: ACM, 2009, pp. 565–576.
14. T. Li, C. Ding, Y. Zhang, and B. Shao, “Knowledge transformation from word space to
document space,” in SIGIR ’08: Proceedings of the 31st annual international ACM
SIGIR conference on Research and development in information retrieval. New York, NY,
USA: ACM, 2008, pp. 187–194.
15. Obadi, G., Drazdilova, P., Hlavacek, L., Martinovic, J., and Snasel, V. (2010) A Tolerance
Rough Set Based Overlapping Clustering for the DBLP Data. 2010 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent Technology,
pp. 57-60.
16. Krohn, U., Davies, N. J., Weeks, R. (1999) Concept lattices for knowledge management. BT
Technology Journal, 17(4):108-116.