=Paper=
{{Paper
|id=Vol-3233/paper7
|storemode=property
|title=Lazy Classification of Underground Forums Messages Using Pattern Structures
|pdfUrl=https://ceur-ws.org/Vol-3233/paper7.pdf
|volume=Vol-3233
|authors=Abdulrahim Ghazal,Sergei O. Kuznetsov
|dblpUrl=https://dblp.org/rec/conf/ijcai/GhazalK22
}}
==Lazy Classification of Underground Forums Messages Using Pattern Structures==
Abdulrahim Ghazal¹, Sergei O. Kuznetsov²

¹ National Research University Higher School of Economics, Pokrovsky boulevard, 11, 109028, Moscow, Russian Federation
² National Research University Higher School of Economics, Pokrovsky boulevard, 11, 109028, Moscow, Russian Federation

Abstract
Underground forums are monitored platforms where hackers announce attacks, and tools to carry out attacks, on businesses or organizations. In this paper, we experiment with assessing the risk of a dataset of such messages, using pattern structures and a lazy classification scheme, together with some complexity-reducing elements and natural language analysis techniques. The results show that the method is promising for this problem and serve as an introductory step for deeper investigation.

Keywords
Formal concept analysis (FCA), Threat intelligence, Underground forums, Pattern structures

1. Introduction

1.1. Threat Intelligence

Threat intelligence constitutes a critical part of the cybersecurity world nowadays, with the intensifying cases of cyber attacks that cost many industries and governments millions of dollars [1] and improve every day in quantity and quality. The practice of collecting information about attacks, or offers to attack a target, from various sources has attracted a lot of research and business attention. This is done by monitoring online forums and instant messaging services, where threat actors post to let other members know that they have some unauthorized access to a network or database, or that they have made a new tool that can help achieve such goals. Most of the serious criminal activity is done for money [2], but sometimes it has other motives (political statements [3], sabotage or even corporate espionage).

The forums consist of sub-forums and sections that deal with different topics, and other administrative sections to control the announcement, sales and membership processes. In addition, many forums have a direct marketplace section, where stricter rules about content are enforced; these sections are designated for buying and selling illegal products, ranging from cracked software to confidential databases of entities.

Published in Sergei O. Kuznetsov, Amedeo Napoli, Sebastian Rudolph (Eds.): The 10th International Workshop "What can FCA do for Artificial Intelligence?", FCA4AI 2022, co-located with IJCAI-ECAI 2022, July 23 2022, Vienna, Austria, Proceedings, pp. 63–74.
agazal@hse.ru (A. Ghazal); skuznetsov@hse.ru (S. O. Kuznetsov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Figure 1: An example of forum sections (partial structure).

The most prestigious forums also have an "escrow" service that acts as a mediator between buyers and sellers on the forum and provides a measure of impartial control over the interaction. Many forums also have tiers of membership (VIP, Premium, Golden, etc.). Some forums have no free membership tier, meaning that one has to pay to access their content. The content posted in paid sections or forums is supposed to have more credibility and be written by more interesting members, depending on the prices, which range from $10 up to $300. Most transactions are paid in cryptocurrency.

1.2. Threat Intelligence Workflow

Detecting threats is usually performed with the help of human analysts.
The analysts' workflow has three main phases: discovery, reaction and analysis. The most time-consuming and difficult is the first phase, where the analyst has to browse through thousands of messages posted daily to find credible threats to report. This is even harder when there is a constant stream of spam messages flooding the forums. This work focuses on helping the human analysts in the discovery phase, using textual analysis of the messages and learning approaches to filter out irrelevant messages and to tag all relevant messages with the right threat category, which further eases the next two phases of the workflow. To avoid confusion, we need to define some terms as they are used in our system vs. how they are usually used in the real world. Table 1 gives the terminology.

We need a system that can detect threats without capturing many false positive examples or missing a true positive case, which carries a high business cost. Another constraint on the system is that it must be time-sensitive, as many threats are very volatile: a threat actor may post an offer to sell access, and within minutes or hours some other criminal entity (or, in some cases, the victim) will contact them about the sale, pay, and request that the announcement be removed. The last important feature of the system is explainability, as human analysts need to understand the reasoning behind a classification result.

Table 1: Terminology disambiguation.
Usual terminology | Our terminology | Meaning
Post | Thread | The group of messages that are posted under one topic.
Comment, message, reply | Message | The single item posted in a thread.
- | Last post | The HTML tags that contain all the relevant information for extracting a message.
Author, Hacker, original poster | Threat Actor | The person who wrote the message.
Title, Headline | Topic | The thread title.

This will be achieved by building a learning-based system. The rest of the paper is organized as follows: in Section 2 we recall basic definitions of formal concept analysis and pattern structures and describe the lazy classification method using pattern structures; Section 3 describes the experimental setting; in Section 4 we discuss the preliminary results of applying lazy classification to underground forum messages; we conclude the work in Section 5.

2. Formal Concept Analysis

2.1. Main Definitions

Formal Concept Analysis (FCA) as defined in [4] is a mathematical theory based on concepts and conceptual hierarchy. It is applied to knowledge discovery and data analysis [5, 6]. Let G be a set of objects, M a set of attributes or descriptions of these objects, and I ⊆ G × M a binary relation between G and M. We call the triple (G, M, I) a formal context. If g ∈ G has the attribute m ∈ M, then (g, m) ∈ I. We then define the derivation operators (·)′ on A ⊆ G and B ⊆ M:

A′ = {m ∈ M | ∀g ∈ A : gIm}    (1)
B′ = {g ∈ G | ∀m ∈ B : gIm}    (2)

We call a pair (A, B) such that A′ = B and B′ = A a formal concept, where A is called its extent and B its intent. A partial order ≤ is defined on the set of concepts: (A, B) ≤ (C, D) iff A ⊆ C (equivalently, D ⊆ B). In this case, (A, B) is called a subconcept and (C, D) a superconcept. This partial order gives rise to a complete lattice on the set of all formal concepts. We call it the concept lattice of the formal context (G, M, I).
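To make the derivation operators (1)–(2) concrete, here is a minimal Python sketch on a toy context; the objects, attributes and incidence relation are invented for illustration and are not taken from the forum data.

```python
# A minimal sketch of the derivation operators (1)-(2) on a toy formal context.
# The context (G, M, I) below is invented for illustration only.
G = {"msg1", "msg2", "msg3"}
M = {"database", "access", "price"}
I = {("msg1", "database"), ("msg1", "access"),
     ("msg2", "database"), ("msg2", "access"), ("msg2", "price"),
     ("msg3", "price")}

def prime_objects(A):
    """A' = the attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B):
    """B' = the objects having every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

A = {"msg1", "msg2"}
B = prime_objects(A)             # {'database', 'access'}
print(B, prime_attributes(B))    # (A, B) is a formal concept iff B' == A
```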
The hierarchical structure of concept lattices can be applied to mining association rules [5, 7], ontology design [8, 9] and recommendation systems, because of the ability to explain the rationale behind the recommended item [10]. A large part of the FCA domain focuses on building concept lattices and extracting the concepts from a given formal context in an efficient manner, so that the framework becomes more applicable in the real world [11]. In [12] researchers compared the performance of several algorithms for building concept lattices and gave insights on which algorithm to choose depending on the data. There is also a number of works discussing other applications of FCA in learning [13, 14].

2.2. Pattern Structures

To broaden the applications of Formal Concept Analysis, it had to be able to represent more complex data, like graphs or non-binary attributes. This need led to the development of pattern structures [15]. Let G be a set of objects and (D, ⊓) be a meet-semilattice of possible object descriptions, or patterns (for standard FCA, this would be the powerset of the attribute set), with the similarity operator ⊓. Elements of D are ordered by a subsumption relation ⊑ such that for a, b ∈ D one has a ⊑ b ⇔ a ⊓ b = a. We also define δ : G → D as a mapping taking each object to its description. We call (G, D, δ), where D = (D, ⊓), a pattern structure. We can define the operators (·)◇ on A ⊆ G and d ∈ (D, ⊓), forming a Galois connection between the powerset of objects and the ordered set of descriptions:

A◇ = ⊓g∈A δ(g)    (3)
d◇ = {g ∈ G | d ⊑ δ(g)}    (4)

These operators give us, respectively, the similarity (the most specific common description) of the objects in A, and the maximal set of objects whose descriptions subsume d. A pair (A, d), with A ⊆ G and d ∈ (D, ⊓), that satisfies A◇ = d and d◇ = A is called a pattern concept, where A is called the extent and d the pattern intent of (A, d). A partial order ≤ is defined on the set of concepts: (A, d1) ≤ (B, d2) iff A ⊆ B (or, equivalently, d2 ⊑ d1). This partial order forms a complete lattice on the set of all pattern concepts. We call it the pattern concept lattice of the pattern structure (G, D, δ).

Pattern structures are helpful in applications with complex data, like graphs, many-valued attributes [5], intervals or interval vectors [6]. For classification tasks we do not need to extract the full hidden knowledge from a dataset in terms of implications, hypotheses or association rules; instead, a so-called lazy classification can be applied [16, 17].
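To illustrate how the operators (3)–(4) act on interval descriptions (the representation used in the experiments below), here is a small hedged Python sketch; the toy object descriptions are invented for the example.

```python
# A small sketch of the operators (3)-(4) for interval-vector descriptions.
# The toy descriptions below are illustrative, not taken from the paper's dataset.
from functools import reduce

def meet(d1, d2):
    """Similarity of two interval vectors: componentwise convex hull."""
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(d1, d2))

def leq(c, d):
    """c is subsumed by d (c is more general): c ⊑ d iff c ⊓ d = c."""
    return meet(c, d) == c

delta = {                                   # δ: object -> interval-vector description
    "g1": ((0.1, 0.1), (0.4, 0.4)),
    "g2": ((0.2, 0.2), (0.3, 0.3)),
    "g3": ((0.9, 0.9), (0.0, 0.0)),
}

def ext_to_desc(A):
    """A◇ = similarity (⊓) of the descriptions of all objects in A."""
    return reduce(meet, (delta[g] for g in A))

def desc_to_ext(d):
    """d◇ = {g | d ⊑ δ(g)}: all objects whose descriptions fall inside d."""
    return {g for g in delta if leq(d, delta[g])}

d = ext_to_desc({"g1", "g2"})   # ((0.1, 0.2), (0.3, 0.4))
print(d, desc_to_ext(d))        # ({'g1', 'g2'}, d) is a pattern concept
```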
2.3. Lazy Classification with Pattern Structures

In classification problems we have a target attribute, which, in the simplest case of two classes, has two values, denoted by + and −. By G+ we denote the set of objects that have the target attribute (positive examples) and by G− the set of objects that do not have the target attribute (negative examples), so that G+ ∩ G− = ∅. Elements of G that do not belong to either of these subsets are called unclassified examples Gτ. A version of the lazy classification method [16, 17] is described in Algorithm 1. This algorithm takes O(|G| · (p(⊓) + |G| · p(⊑))) time, where p(⊓), p(⊑) are the times needed to compute ⊓ and ⊑, respectively.

Algorithm 1: Lazy Classification with Pattern Structures
Requires: pattern structure (G, D, δ), test example gt ∈ Gτ with description δ(gt), parameter 0 ≤ α ≤ 1.
1: for g ∈ G+ ∪ G−:
2:   compute sim = δ(g) ⊓ δ(gt)
3:   extsim = (sim)◇
4:   if at least α% of the objects in extsim have the target attribute, classify gt as positive
5:   if at least α% of the objects in extsim do not have the target attribute, classify gt as negative
6: classify gt as undetermined (the algorithm terminates without classification).

3. Experiments

The dataset used in the following experiments is composed of underground forum messages as the objects and features extracted from these messages as attributes. After building the concept lattice from this dataset, we construct another dataset for testing the classification. Both datasets contain positive and negative examples. The target attribute is a simple flag stating whether the message is a real threat or not. The procedure realizing the lazy classification scheme is as follows: we go through the objects of the formal context and check the extension of the intersection of the test object and the current training object; if all objects of this extension have the target attribute, the test object is classified positively; if none of them has the target attribute, the test object is classified negatively; otherwise, the model cannot classify the object. To avoid object selection bias, we shuffle the objects randomly. We now describe the datasets used in this setting and then discuss the results of the experiments.

3.1. The Datasets

The training examples revolve around messages that were classified as real threats by human analysts. These messages were collected using a system that is part of a commercial product (provided by the cybersecurity firm Group-IB). The only constraint on these examples is that they were published on the underground forums in 2021 or later. These examples constitute the positive part of the training dataset. We checked the underground forums where these messages were written, and the list of these forums is used to generate the negative examples, in such a way that for each forum with X positive examples, Y = n·X negative examples are collected at random, subject to the same date constraint, where n is a factor, called the balance ratio, which will be controlled in the experiments. The number of positive training examples is 595. The testing was performed on 960 positive examples, and the negative testing examples were constructed in the same manner as the training negative examples. These messages were collected from 12 different forums.

Several experiments are run, varying parameters that might affect the final results. These factors are:

1. Dataset size and distribution: the only constant is the number of positive examples used. We test several values of the negative-to-positive balance ratio. In addition to this, we try to make a less random choice of negative examples by filtering out messages that might be spam (for instance, messages like "thanks" or "up" are considered to be of low value). We can also choose messages posted only in subsections of forums that are relevant to the classification (if we are trying to find messages about database leaks, we do not need messages coming from the credit cards section). Filtering based on message length is also needed, because some messages are articles, and while they contain many triggering keywords, they are not actual threats.

2. The intersection operator: we use set-theoretic intersection, and then interval intersection (we look into two-sided and one-sided intervals).
3. The number of attributes: the attributes are the tf-idf values of the words occurring in the messages. We examine changing the number of words kept in the attribute list, as well as the document-frequency threshold below which a keyword is ignored and removed from the list. In the case of set-theoretic intersection the attribute values are binary (the message either has the keyword or not). In the case of interval intersection, the value of an attribute is an interval beginning and ending at the tf-idf value.

4. The tolerance factor α, which represents the probabilistic relaxation allowed for counterexamples.

3.2. Assessment Methods

We use the usual classification assessment measures, but we need to be attentive to the case of unclassified examples. Handling unclassified examples can be done in three different ways:

• Ignoring them by considering them negative examples. This depends on how harsh our model is about classifying new examples: if the model classifies almost all positive examples as positive (high recall) and leaves only a very small number of unclassified examples relative to the number of new examples, then it might be possible to treat unclassified examples as negative, as the probability of them being actually positive is very low. This approach is not acceptable in cases where the cost of missing a positive example is high, which is our case, as missing a real threat is very costly.

• Moving all unclassified examples to a human analyst, so that they can assess their credibility. This can be done only when the number of unclassified examples is low.

• Removing unclassified examples altogether, by introducing a probabilistic relaxation of the classifier, so that it classifies examples based on how close they are to a class rather than as a zero-one decision.

In addition to the usual assessment measures, we define another measure of the improvement this brings to human workers: how much the set of messages they have to check shrinks. We call it "saved effort" and define it as 1 − |G_uncl| / |Gτ|, where G_uncl ⊆ Gτ is the set of unclassified examples.

3.3. Testing Parameters

We first fix the parameters used below. We have |G+| = trp positive examples and |G−| = trn negative examples in the training dataset. We have zero unclassified examples in the training data. The number of negative examples is trn = trp · n, where n is a positive integer. The number of attributes is labeled att. In the test dataset, we have tsp positive examples and tsn negative examples. The model might leave a number of test examples unclassified. This set of examples is labeled uncl, and it is divided into two subsets: true positive examples left unclassified by the model (labeled pos-uncl) and true negative examples left unclassified by the model (labeled neg-uncl).

In the method described above, the explanation of the model is the intersection of the attributes of the test object and of the training object that gave us the classification result (meaning that all objects in the extension of this intersection have, or all do not have, the target attribute). This set is denoted by sim.

While we can control the balance of the training dataset in our case, this might not be the real-world situation, which means that we would obtain a varying range of values of saved effort and the F1 measure, presenting us with a trade-off between coverage and the correctness of the model's predictions.
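Putting the setting together, the following is a minimal, hedged sketch of how Algorithm 1 with the α-relaxation could be realized over such descriptions; the helpers meet and leq refer to the interval sketch given earlier, and the function and variable names (lazy_classify, train_pos, train_neg, alpha) are illustrative rather than taken from the actual implementation.

```python
# A minimal sketch of Algorithm 1 with the tolerance factor α (here alpha in [0, 1]).
# The helpers meet/leq are those from the earlier interval sketch; names are illustrative.
def lazy_classify(test_desc, train_pos, train_neg, meet, leq, alpha=1.0):
    """Return '+', '-' or None (undetermined) for the description of a test message."""
    labelled = [(d, 1) for d in train_pos] + [(d, 0) for d in train_neg]
    for desc, _ in labelled:
        sim = meet(desc, test_desc)                        # δ(g) ⊓ δ(g_t)
        ext = [lab for d, lab in labelled if leq(sim, d)]  # labels of objects in (sim)◇
        pos_share = sum(ext) / len(ext)
        if pos_share >= alpha:                             # α% of extsim are positive
            return "+"
        if 1.0 - pos_share >= alpha:                       # α% of extsim are negative
            return "-"
    return None                                            # left unclassified

# Illustrative call with the toy data from the earlier sketch:
# lazy_classify(((0.15, 0.15), (0.35, 0.35)),
#               [delta["g1"], delta["g2"]], [delta["g3"]],
#               meet, leq, alpha=1.0)   # -> '+'
```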
4. Results

We performed multiple experiments. The first uses only binary attributes; the next uses intervals as attributes; in the following two experiments, the data is represented by one-sided intervals. After this, we repeat the pattern-structure experiments with varying values of α. Due to space limitations, we present the full results at https://github.com/abdulrahimGhazal/FCA-results.

4.1. Binary Attributes

The attributes here are the keywords contained in the vectorizer's vocabulary resulting from building the tf-idf model. The value of an attribute for an object is:

att_value(keyword) = 1 if keyword ∈ vectorizer vocabulary, 0 otherwise    (5–6)

We test a range of 5 values of the ratio between positive and negative examples in the training dataset (from 1 to 5). We also test 5 different values of min_df, a factor that controls the number of attributes: it specifies the threshold below which the tf-idf builder ignores a term because its document frequency is too low. Set-theoretic intersection is used.

The highest F1 in these experiments occurred in the first experiments, where the dataset had the most keywords (lowest min_df step), and the same goes for the saved-effort measure. The highest values were F1 = 98.8 and saved effort = 88.8 at (ratio, min_df) = (1, 0.01). We can observe peaks whenever we reset the min_df step in the training dataset, followed by a decline in F1 until the next peak. This is a natural observation, since min_df increases between the peaks, meaning that we ignore more keywords that might play a role in classifying the messages correctly.

4.2. Relaxed Binary Attributes

We now repeat the same experiment, using only the lowest balance ratio, varying the values of min_df, and also varying α to allow for a certain small amount of counterexamples (objects of the other class matching the intersections used as classifiers). The highest values were F1 = 98.6 at (α, ratio, min_df) = (90, 1, 0.01) and saved effort = 92.5 at (α, ratio, min_df) = (75, 1, 0.01). We now look at using pattern structures to classify underground messages by representing the textual data as intervals (two-sided and one-sided).

4.3. Interval Representation

In this part, we represent the tf-idf values as intervals of the floating-point value, such that if the tf-idf value is x, the attribute value is [x, x]. In this setting, the intersection operator is defined as the interval that starts at the minimum of the two intervals' starts and ends at the maximum of the two intervals' ends:

[a1, b1] ⊓ [a2, b2] = [min(a1, a2), max(b1, b2)]    (7)

The highest F1 in these experiments occurred in the first experiments, where the dataset had the most keywords (lowest min_df step), and the same goes for the saved-effort measure. The highest values were F1 = 88.0 and saved effort = 87.7 at (ratio, min_df) = (1, 0.01). The values of F1 and saved effort decrease after that with increasing values of min_df.
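For concreteness, here is a hedged sketch of how the two encodings above (binary attributes per (5–6) and degenerate intervals per Section 4.3) could be produced with scikit-learn's TfidfVectorizer; the corpus and the min_df value are illustrative, and the paper's actual feature pipeline may differ.

```python
# A hedged sketch of the attribute encodings of Sections 4.1 and 4.3, assuming
# scikit-learn's TfidfVectorizer; the corpus and min_df value are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["selling access to corporate network",
          "database leak for sale",
          "thanks, up"]                       # toy messages, not from the real dataset

vec = TfidfVectorizer(min_df=0.01)            # min_df controls the vocabulary size
tfidf = vec.fit_transform(corpus).toarray()   # rows: messages, columns: keywords

binary_attrs = (tfidf > 0).astype(int)        # Section 4.1: keyword present or not
interval_attrs = [                            # Section 4.3: degenerate intervals [x, x]
    tuple((x, x) for x in row) for row in tfidf
]
print(vec.get_feature_names_out())
print(binary_attrs[0], interval_attrs[0][:3])
```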
4.4. One-Sided Interval Representation (Max)

Here, we represent the tf-idf values as one-sided intervals, such that if the tf-idf value is x, the attribute value is [x, ∞). In this setting, the intersection operator is defined as the interval that starts at the maximum of the two intervals' starts and extends to infinity:

[a1, ∞) ⊓ [a2, ∞) = [max(a1, a2), ∞)    (8)

The highest F1 in these experiments occurred in the first experiments, where the dataset had the most keywords (lowest min_df step), and the same goes for the saved-effort measure. The highest values were F1 = 88.0 and saved effort = 87.7 at (ratio, min_df) = (1, 0.01). The values of F1 and saved effort decrease after that with increasing values of min_df.

4.5. One-Sided Interval Representation (Min)

Here, we again represent the tf-idf values as one-sided intervals [x, ∞), but the intersection operator is defined as the interval that starts at the minimum of the two intervals' starts and extends to infinity:

[a1, ∞) ⊓ [a2, ∞) = [min(a1, a2), ∞)    (9)

The highest F1 in these experiments occurred in the first experiments, where the dataset had the most keywords (lowest min_df step), and the same goes for the saved-effort measure. The highest values were F1 = 89.7 and saved effort = 87.6 at (ratio, min_df) = (1, 0.01). The values of F1 and saved effort decrease after that with increasing values of min_df.

4.6. Interval Representation With Probabilistic Relaxation

In this part, we again represent the tf-idf values as degenerate intervals [x, x], and, as before, the intersection operator is the interval that starts at the minimum of the two intervals' starts and ends at the maximum of the two intervals' ends. The difference now is that we test several values of α (see Algorithm 1).

[a1, b1] ⊓ [a2, b2] = [min(a1, a2), max(b1, b2)]    (10)

The highest values were F1 = 94.2 at (α, ratio, min_df) = (80, 1, 0.01) and saved effort = 94.1 at (α, ratio, min_df) = (75, 1, 0.01). The pattern noticed here is that F1 peaks when min_df increases locally at constant values of α; then, when we increase min_df significantly, the values of F1 decrease again.

4.7. One-Sided Interval Representation (Max) With Probabilistic Relaxation

Here, we represent the tf-idf values as one-sided intervals [x, ∞) and test several values of α (see Algorithm 1). In this setting, the intersection operator is defined as follows:

[a1, ∞) ⊓ [a2, ∞) = [max(a1, a2), ∞)    (11)

The highest values were F1 = 91.7 at (α, ratio, min_df) = (95, 1, 0.01) and saved effort = 94.5 at (α, ratio, min_df) = ((85, 90, 95), 1, 0.01). The pattern noticed here is that F1 peaks when min_df increases locally at constant values of α; then, when we increase min_df significantly, the values of F1 decrease again.

4.8. One-Sided Interval Representation (Min) With Probabilistic Relaxation

Here, we represent the tf-idf values as one-sided intervals [x, ∞) and test several values of α (see Algorithm 1). In this setting, the intersection operator is defined as the interval that starts at the minimum of the two intervals' starts and extends to infinity:

[a1, ∞) ⊓ [a2, ∞) = [min(a1, a2), ∞)    (12)

The highest values were F1 = 94.1 at (α, ratio, min_df) = (75, 1, 0.01) and saved effort = 94.2 at (α, ratio, min_df) = ((75, 80), 1, 0.01). The pattern noticed here is that F1 peaks when min_df increases locally at constant values of α; then, when we increase min_df significantly, the values of F1 decrease again.
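To summarize, the three similarity operators compared in Sections 4.3 through 4.8 can be written as the following small illustrative helpers, with math.inf standing in for the ∞ endpoint of the one-sided intervals.

```python
# Illustrative helpers for the interval similarity operators of equations (7)-(12).
import math

def meet_two_sided(i, j):          # eq. (7)/(10): convex hull of two intervals
    (a1, b1), (a2, b2) = i, j
    return (min(a1, a2), max(b1, b2))

def meet_one_sided_max(i, j):      # eq. (8)/(11): [max(a1, a2), inf)
    (a1, _), (a2, _) = i, j
    return (max(a1, a2), math.inf)

def meet_one_sided_min(i, j):      # eq. (9)/(12): [min(a1, a2), inf)
    (a1, _), (a2, _) = i, j
    return (min(a1, a2), math.inf)

print(meet_two_sided((0.1, 0.1), (0.3, 0.3)))                 # (0.1, 0.3)
print(meet_one_sided_max((0.1, math.inf), (0.3, math.inf)))   # (0.3, inf)
print(meet_one_sided_min((0.1, math.inf), (0.3, math.inf)))   # (0.1, inf)
```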
4.9. Discussion

Looking at the results, we can see that the model with binary attribute values performed best, because its attribute values and intersection method (set-theoretic intersection) are the most restrictive of all, leaving little room for error. The drawback of such a method is that it suffers from this restrictiveness when completely new keywords are introduced, as it does not accept any keyword that was not part of the exact intersection. We can also see that the less restrictive the conditions of the intersection, the less accurate it becomes.

The best-performing non-binary representation of the data in our experiments is the minimum one-sided interval, because by its nature it is the most inclusive of non-zero values in the classification scheme. The next best is the maximum one-sided interval: while worse than the minimum one-sided interval, because it restricts attribute values to a smaller interval, it still outperforms the two-sided interval representation, because of how limited the values of that representation are.

When relaxation is introduced, the values of F1 and saved effort are higher than in the case with no relaxation, supporting the idea that flexibility is useful when classifying text messages. The size of pos-uncl was always smaller than that of neg-uncl, because the model works with a limited set of keywords compared to the large number of keywords possible in negative messages.

5. Conclusion

Underground forum messages are hackers' announcements shared on the internet about attacks or tools used to carry out attacks. We presented an FCA-based approach to classifying forum messages according to their probability of being risky. The experimental results show that the use of binary attributes (standard FCA) gave better accuracy, while the use of interval pattern structures gave better saved effort.

6. Acknowledgements

The article was prepared within the framework of the Basic Research Program at HSE University, RF.

References

[1] McAfee, The Economic Impact of Cybercrime: No Slowing Down, Technical Report, Santa Clara, CA, USA, 2021.
[2] S. Pastrana, D. R. Thomas, A. Hutchings, R. Clayton, CrimeBB: Enabling cybercrime research on underground forums at scale, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1845–1854.
[3] Zone-H Team, Zone-H homepage, 2022. URL: http://www.zone-h.org/.
[4] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer Science & Business Media, 2012.
[5] M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis, Mining gene expression data with pattern structures in formal concept analysis, Information Sciences 181 (2011) 1989–2001.
[6] A. Masyutin, Y. Kashnitsky, S. O. Kuznetsov, Lazy classification with interval pattern structures: Application to credit scoring, in: FCA4AI@IJCAI, 2015.
[7] W. Saidi, Formal concept analysis based association rules extraction (2012) 490–497.
[8] M. Obitko, V. Snasel, J. Smid, Ontology design with formal concept analysis, in: CLA, volume 128, 2004, pp. 1377–1390.
[9] G. Jiang, K. Ogasawara, A. Endoh, T. Sakurai, Context-based ontology building support in clinical domains using formal concept analysis, International Journal of Medical Informatics 71 (2003) 71–81.
[10] P. Vilakone, K. Xinchang, D.-S. Park, Movie recommendation system based on users' personal information and movies rated using the method of k-clique and normalized discounted cumulative gain, Journal of Information Processing Systems 16 (2020) 494–507.
[11] S. M. Dias, N. J. Vieira, Concept lattices reduction: Definition, analysis and classification, Expert Systems with Applications 42 (2015) 7084–7097.
[12] S. O. Kuznetsov, S. A. Obiedkov, Comparing performance of algorithms for generating concept lattices, Journal of Experimental & Theoretical Artificial Intelligence 14 (2002) 189–216.
[13] S. O. Kuznetsov, Machine learning and formal concept analysis, in: International Conference on Formal Concept Analysis, Springer, 2004, pp. 287–312.
[14] S. O. Kuznetsov, N. Makhazhanov, M. Ushakov, On neural network architecture based on concept lattices, in: International Symposium on Methodologies for Intelligent Systems, Springer, 2017, pp. 653–663.
[15] B. Ganter, S. O. Kuznetsov, Pattern structures and their projections, in: International Conference on Conceptual Structures, Springer, 2001, pp. 129–142.
[16] S. O. Kuznetsov, Scalable knowledge discovery in complex data with pattern structures, in: International Conference on Pattern Recognition and Machine Intelligence, Springer, 2013, pp. 30–39.
[17] S. O. Kuznetsov, Fitting pattern structures to knowledge discovery in big data, in: International Conference on Formal Concept Analysis, Springer, 2013, pp. 254–266.