<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Open Source Intelligence Telegram-bot development</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kateryna Vasiuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Karelina</string-name>
          <email>karelina@tntu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerij Muzh</string-name>
          <email>vmuzh@tntu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liliana Dzhydzhora</string-name>
          <email>lilyadzhydzhora1970@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ternopil Ivan Puluj National Technical University</institution>
          ,
          <addr-line>Ruska, 56, Ternopil, 46001</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open Source Intelligence Telegram-bot for corporate information retrieval is developed. The bot presents information about the corporate Internet domain and e-mail addresses of employees. If the e-mail address is available in data sources, the access password is displayed. Mathematical models underlying the information retrieval algorithms are considered. The software implementation of OSINT Telegram-bot in Python is described. The system testing results are given.</p>
      </abstract>
      <kwd-group>
        <kwd>OSINT</kwd>
        <kwd>Telegram-bot</kwd>
        <kwd>Python</kwd>
        <kwd>automation</kwd>
        <kwd>corporate information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Various aspects of corporate information retrieval from open sources have been studied in the
literature. A solution for retrieval among academic information sources is developed in [1].
A solution for automating a pentester's reconnaissance is proposed in [2]. Papers [3, 4]
focus on the detection of indicators of compromise (IoC). Automating corporate information
retrieval from selected open sources remains an important and unsolved problem.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Development of the requirements for Telegram-bot of OSINT automation</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Functional requirements</title>
      <p>The OSINT automation project consists of three main tasks, each of which covers a number of
functions. For these functions to operate correctly, requirements that ensure the expected result
must be defined. The ways of applying the product are shown in Fig. 1, and a detailed
description of the requirements is given in Table 1.</p>
      <p>Figure 1: OSINT automation. Diagram labels: account search; emails; passwords; involvement in data breaches; domain search; stack of technologies; SEO rating; related pages in social networks.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Non-functional requirements</title>
      <p>Usability can be considered the most important non-functional requirement. The main focus
during bot development is a user-friendly interface. The following tasks should be solved:
• to display information in a readable format;
• to create a recognition module that detects incorrect input and informs the user
about it;
• to ensure continuous service availability.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Mathematical modeling of information retrieval</title>
      <p>3. Probabilistic models treat document retrieval as probabilistic
inference. Similarity is calculated as the probability that a document is relevant to the query.
Probability theorems, such as Bayes' theorem, are often used in these models, which include:
• binary independence model;
• uncertain inference;
• language models;
• divergence-from-randomness model;
• latent Dirichlet allocation.</p>
      <p>4. Feature-based retrieval models treat documents as vectors of feature function values and look for the
best way to combine these features into a single relevance score, usually by learning-to-rank methods.</p>
      <p>Models without term interdependence treat different terms (words) as independent. This
is usually represented in vector space models by the assumption that term vectors are orthogonal,
or in probabilistic models by the assumption that term variables are independent.</p>
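      <p>The orthogonality assumption can be illustrated with a minimal sketch of a vector space model. The sample strings and the function names <monospace>term_vector</monospace> and <monospace>cosine_similarity</monospace> are chosen here for illustration and are not part of the described bot. Each distinct term is its own axis, so only terms shared by the query and the document contribute to the similarity.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: every distinct term is a separate, orthogonal axis."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Different terms are orthogonal, so only shared terms enter the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("corporate email leak")
doc_a = term_vector("email leak found in corporate domain")
doc_b = term_vector("seo rating of the site")

print(cosine_similarity(query, doc_a))  # shares three terms with the query
print(cosine_similarity(query, doc_b))  # shares no terms with the query
```
]]></preformat>
      <p>Because terms are independent, related words such as "mail" and "email" get zero similarity in this sketch; models with term interdependence exist to relax exactly this limitation.</p>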
      <p>Models with transcendent term interdependence allow interdependencies
between terms to be represented, but they do not state how the interdependence between two terms is defined. They rely on
an external source for the degree of interdependence between two terms (e.g., a human or a complex
algorithm).</p>
      <p>There are many ways to evaluate how well the retrieved documents match the query. Precision is
defined as the ratio of the number of relevant documents found by the information retrieval system to the total number of
documents found:</p>
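      <p>
        <disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{Precision} = \frac{|D_{rel} \cap D_{retr}|}{|D_{retr}|}</tex-math></disp-formula>
where <inline-formula><tex-math>D_{rel}</tex-math></inline-formula> is the set of relevant documents in the database, and <inline-formula><tex-math>D_{retr}</tex-math></inline-formula> is the set of documents
found by the system.
      </p>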
      <p>Recall (completeness) is the ratio of the number of relevant documents found to the total number of relevant
documents in the database:</p>
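      <p>
        <disp-formula id="eq-2"><label>(2)</label><tex-math>\mathrm{Recall} = \frac{|D_{rel} \cap D_{retr}|}{|D_{rel}|}</tex-math></disp-formula>
where <inline-formula><tex-math>D_{rel}</tex-math></inline-formula> is the set of relevant documents in the database, and <inline-formula><tex-math>D_{retr}</tex-math></inline-formula> is the set of documents
found by the system.
      </p>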
      <p>Fall-out characterizes the probability of retrieving an irrelevant document and is defined as the ratio
of the number of irrelevant documents found to the total number of irrelevant documents in the
database:</p>
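      <p>
        <disp-formula id="eq-3"><label>(3)</label><tex-math>\mathrm{Fall\text{-}out} = \frac{|D_{nrel} \cap D_{retr}|}{|D_{nrel}|}</tex-math></disp-formula>
where <inline-formula><tex-math>D_{nrel}</tex-math></inline-formula> is the set of irrelevant documents in the database, and <inline-formula><tex-math>D_{retr}</tex-math></inline-formula> is the set of
documents found by the system.
      </p>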
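      <p>The three ratios defined above can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name <monospace>retrieval_metrics</monospace> and the ten-document example are assumptions made for illustration and do not come from the bot's code.</p>
      <preformat><![CDATA[
```python
def retrieval_metrics(all_docs, relevant, retrieved):
    """Return (precision, recall, fall_out) for one query.

    all_docs  -- identifiers of every document in the database
    relevant  -- identifiers of the documents relevant to the query
    retrieved -- identifiers of the documents returned by the system
    """
    all_docs, relevant, retrieved = set(all_docs), set(relevant), set(retrieved)
    irrelevant = all_docs - relevant

    hits = relevant & retrieved            # relevant documents that were found
    false_alarms = irrelevant & retrieved  # irrelevant documents that were found

    # Empty denominators are mapped to 0.0 to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(false_alarms) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical database of 10 documents, 4 of them relevant, 5 returned.
precision, recall, fall_out = retrieval_metrics(
    all_docs=range(1, 11),
    relevant={1, 2, 3, 4},
    retrieved={3, 4, 5, 6, 7},
)
print(precision, recall, fall_out)  # 0.4 0.5 0.5
```
]]></preformat>
      <p>In this example two of the five returned documents are relevant (precision 0.4), two of the four relevant documents are found (recall 0.5), and three of the six irrelevant documents are returned (fall-out 0.5).</p>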
      <p>Van Rijsbergen's measure.</p>
      <p>Sometimes it is useful to combine precision and completeness into a single average value. The
arithmetic mean is not suitable for this purpose: a search engine can, for example, return all
documents and thereby reach completeness equal to one with precision close to zero, yet the
arithmetic mean of precision and completeness is then at least 1/2. The harmonic mean does not
have this disadvantage, because when the two averaged values differ greatly it is close to their
minimum.</p>
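      <p>A short numerical check makes this argument concrete. The sketch below assumes a hypothetical engine that returns all 10,000 documents of a database in which only 10 are relevant; the numbers are invented for illustration.</p>
      <preformat><![CDATA[
```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # The harmonic mean of two values is 0 when either value is 0.
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Returning every document: all relevant ones are found, but almost
# everything that is returned is irrelevant.
precision = 10 / 10_000
recall = 1.0

print(arithmetic_mean(precision, recall))  # stays above 1/2
print(harmonic_mean(precision, recall))    # stays close to the smaller value
```
]]></preformat>
      <p>The arithmetic mean rewards the degenerate strategy with a score above 0.5, while the harmonic mean never exceeds twice the smaller of the two values and therefore stays near zero.</p>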
      <p>Therefore, a good measure for the joint assessment of precision and completeness is the F-measure,
which is defined as the weighted harmonic mean of precision P and completeness R:</p>
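      <p>
        <disp-formula id="eq-4"><label>(4)</label><tex-math>F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}</tex-math></disp-formula>
      </p>
      <p>As a rule, the F-measure is presented in the following way:</p>
      <p>
        <disp-formula id="eq-5"><label>(5)</label><tex-math>F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \qquad \beta^2 = \frac{1 - \alpha}{\alpha}</tex-math></disp-formula>
      </p>
      <p>When <inline-formula><tex-math>\alpha = 1/2</tex-math></inline-formula> or <inline-formula><tex-math>\beta = 1</tex-math></inline-formula>, the F-measure gives the same weight to precision and completeness and is
called the balanced or <inline-formula><tex-math>F_1</tex-math></inline-formula>-measure (it is accepted to specify the <inline-formula><tex-math>\beta</tex-math></inline-formula>
value in the lower index); the expression for it is simplified:</p>
      <p>
        <disp-formula id="eq-6"><label>(6)</label><tex-math>F_1 = \frac{2 P R}{P + R}</tex-math></disp-formula>
      </p>
      <p>The use of the balanced F-measure is optional: if <inline-formula><tex-math>\beta &lt; 1</tex-math></inline-formula>, precision is preferred, and if <inline-formula><tex-math>\beta &gt; 1</tex-math></inline-formula>, completeness gains greater weight [6].</p>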
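      <p>The weighted F-measure can be sketched in a few lines. The function below uses the standard form given in [6], F = (beta^2 + 1) P R / (beta^2 P + R), where beta is the weighting parameter; the name <monospace>f_measure</monospace> and the sample values P = 0.4, R = 0.5 are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (completeness).

    beta below 1 favours precision, beta above 1 favours recall,
    and beta equal to 1 gives the balanced F1-measure.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (b2 + 1) * precision * recall / denominator


p, r = 0.4, 0.5
print(f_measure(p, r, beta=0.5))  # pulled towards precision
print(f_measure(p, r))            # balanced F1, equal to 2PR / (P + R)
print(f_measure(p, r, beta=2.0))  # pulled towards recall
```
]]></preformat>
      <p>With precision lower than recall, as in this example, decreasing beta lowers the score and increasing beta raises it, which matches the stated role of the parameter.</p>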
    </sec>
    <sec id="sec-7">
      <title>5. Software implementation of OSINT Telegram-bot</title>
      <p>The automated system is implemented as a bot in the Telegram messenger.
Bots are third-party programs that run inside Telegram. Users interact with bots by sending them
messages, commands and inline queries. Bots are controlled through HTTPS requests to the Bot API.</p>
      <p>At present Telegram is one of the most popular instant messaging platforms, because it
stores messages in the cloud and not only on the device. Telegram can be used on Android, iOS, Windows
and almost any other platform that supports the web version.</p>
      <p>Telegram uses its own MTProto encryption protocol. The MTProto API (also known as the Telegram API) is the API
through which the Telegram application communicates with the server. The Telegram API is completely open,
so any developer can write their own messenger client.</p>
      <p>A new bot is created using the BotFather bot. The /newbot command
creates a new bot, as shown in Figure 3. BotFather asks for a name and a username and then generates
an authorization token for the new bot.</p>
      <p>The bot name is displayed in the contact information. The username is a short name used in mentions
and t.me links. Usernames are 5-32 characters long, case-insensitive, and may contain only Latin
characters, digits and underscores. The bot's username must end in “bot”; we called ours
“OSINTautomation_bot”. After the bot is created, BotFather provides a token that looks like this:
110201543:AAHdqTcvCH1vGWJxfSeofSAs0K5PALDsaw. The token is what allows the bot to be
controlled.</p>
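      <p>The username rules quoted above can be checked before a name is submitted to BotFather. The following is a minimal sketch; the helper <monospace>is_valid_bot_username</monospace> is a hypothetical function written for illustration and encodes only the rules stated in the text.</p>
      <preformat><![CDATA[
```python
import re

# 5-32 characters; Latin letters, digits and underscores only.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{5,32}")


def is_valid_bot_username(username):
    """Check a candidate bot username against the rules quoted in the text."""
    if USERNAME_PATTERN.fullmatch(username) is None:
        return False
    # Usernames are case-insensitive, so the suffix is compared in lower case.
    return username.lower().endswith("bot")


print(is_valid_bot_username("OSINTautomation_bot"))  # True
print(is_valid_bot_username("osint-bot"))            # False: hyphen is not allowed
print(is_valid_bot_username("OSINTautomation"))      # False: does not end in "bot"
```
]]></preformat>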
      <p>The bot's appearance is configured in BotFather via the menu /mybots → Edit Bot. The following can be changed there:
• Bot name;
• Description is the text that users see at the beginning of a dialogue with the bot under
the title "What can this bot do?";
• About is the text that is visible in the bot profile;
• Bot avatar, which, unlike user and chat avatars, cannot be animated and must be a static picture;
• Commands are the command hints shown in the bot. More information about the
commands is given below.</p>
      <p>When a user opens the bot for the first time, they see the “Start” button. Clicking this
button sends the /start command.</p>
      <p>There are two main ways to work with Telegram in Python: by polling the server with HTTPS requests or by using
a webhook. The developed project has three elements: a computer with Python, the Telegram server and
the Telegram client.</p>
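      <p>Both approaches go through the same Bot API endpoint, whose URL has the form https://api.telegram.org/bot&lt;token&gt;/&lt;method&gt;. The sketch below only builds the request URLs and sends nothing; the helper <monospace>bot_api_url</monospace>, the placeholder token and the example.org address are assumptions made for illustration. <monospace>getUpdates</monospace> is the Bot API method used for polling and <monospace>setWebhook</monospace> registers a webhook address.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode

API_ROOT = "https://api.telegram.org"


def bot_api_url(token, method, **params):
    """Build the HTTPS URL for one Bot API method call."""
    url = f"{API_ROOT}/bot{token}/{method}"
    if params:
        url += "?" + urlencode(params)
    return url


TOKEN = "123456:PLACEHOLDER-TOKEN"  # hypothetical value, not a live token

# Polling: the program repeatedly asks the server for new updates.
polling_url = bot_api_url(TOKEN, "getUpdates", offset=0, timeout=30)

# Webhook: the server is told where to push updates instead.
webhook_url = bot_api_url(TOKEN, "setWebhook", url="https://example.org/hook")

print(polling_url)
print(webhook_url)
```
]]></preformat>
      <p>Polling needs no public address and suits a bot run from a workstation, whereas a webhook requires an HTTPS endpoint reachable by Telegram.</p>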
      <p>The Python interpreter runs on the computer, and the Python program runs inside the interpreter. The program is
responsible for all content: text templates, logic and behavior. The program uses
a library that handles communication with the Telegram server. The secret key (token) is passed to
the library so that the Telegram server can associate the program with the specific bot.
When a Telegram client requests information from the bot, the request goes to the server. The
request is processed by the Python program, the response is sent to the Telegram server, and the server
responds to the client.</p>
      <p>The bot logic is implemented with the python-telegram-bot library. This library makes it quick
and simple to create bots, as it provides a large number of ready-made methods for accessing Telegram
data.</p>
      <p>The telegram.ext submodule is built on top of the pure API implementation and provides an easy-to-use
interface. The submodule consists of several classes, the two most important being telegram.ext.Updater
and telegram.ext.Dispatcher. The Updater class continuously receives updates from Telegram and passes
them to the Dispatcher class. Command and message handlers are registered in the Dispatcher. The Dispatcher sorts
the updates received by the Updater according to the registered handlers and passes each update
to the specified callback function. Each handler is an instance of a subclass of the
telegram.ext.Handler class. The library offers handler classes for almost all cases.</p>
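      <p>The routing idea behind the Dispatcher can be shown with a toy model that does not use the library at all. The classes <monospace>ToyCommandHandler</monospace>, <monospace>ToyTextHandler</monospace> and <monospace>ToyDispatcher</monospace> below are hypothetical and greatly simplified; they are not the python-telegram-bot implementation, and updates are reduced to plain strings.</p>
      <preformat><![CDATA[
```python
class ToyCommandHandler:
    """Matches text whose first word is '/<command>'."""

    def __init__(self, command, callback):
        self.command = command
        self.callback = callback

    def matches(self, text):
        parts = text.split()
        return bool(parts) and parts[0] == "/" + self.command


class ToyTextHandler:
    """Matches any non-empty text that is not a command."""

    def __init__(self, callback):
        self.callback = callback

    def matches(self, text):
        return bool(text.strip()) and not text.startswith("/")


class ToyDispatcher:
    """Passes each update to the first registered handler that matches it."""

    def __init__(self):
        self.handlers = []

    def add_handler(self, handler):
        self.handlers.append(handler)

    def process_update(self, text):
        for handler in self.handlers:
            if handler.matches(text):
                return handler.callback(text)
        return None  # no handler is registered for this update


dispatcher = ToyDispatcher()
dispatcher.add_handler(ToyCommandHandler("start", lambda text: "welcome"))
dispatcher.add_handler(ToyTextHandler(lambda text: "unknown input"))

print(dispatcher.process_update("/start"))  # welcome
print(dispatcher.process_update("hello"))   # unknown input
```
]]></preformat>
      <p>The design keeps the matching rule (the handler) separate from the reaction (the callback), so new commands are added by registering one more handler rather than by editing a central conditional.</p>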
      <p>The access token obtained during bot creation is used when creating the instance of the Updater class.
The creation of the Updater and Dispatcher instances, together with the registration of the end-user
command and message handlers, is shown in Listing 1.</p>
      <p>Listing 1: Bot initialization</p>
      <preformat>from telegram.ext import CommandHandler, Filters, MessageHandler, Updater

from config import BOT_API_KEY
from telegram_bot_handlers import TelegramBotHandlers

# The Updater receives updates from Telegram; its Dispatcher routes them to handlers.
updater = Updater(token=BOT_API_KEY, use_context=True)
dispatcher = updater.dispatcher

# /start and /help both show the greeting message.
start_handler = CommandHandler('start', TelegramBotHandlers.wellcome_message)
help_handler = CommandHandler('help', TelegramBotHandlers.wellcome_message)
# Any plain text that is not a command is treated as incorrect input.
unknown_message_handler = MessageHandler(
    Filters.text &amp; (~Filters.command), TelegramBotHandlers.unknown_message
)
scan_email_handler = CommandHandler('scan_email', TelegramBotHandlers.scan_email)
scan_domain_handler = CommandHandler('scan_domain', TelegramBotHandlers.scan_domain)

dispatcher.add_handler(start_handler)
dispatcher.add_handler(help_handler)
dispatcher.add_handler(unknown_message_handler)
dispatcher.add_handler(scan_email_handler)
dispatcher.add_handler(scan_domain_handler)

# Start receiving updates from Telegram.
updater.start_polling()</preformat>
      <p>The TelegramBotHandlers class contains all the event handler functions, which
perform the following actions:
• wellcome_message - displays a greeting message to the user, containing information about
the other available functions and their description;
• unknown_message - displays an informational message about incorrect input and a hint about
the /help command, which lists all the functions available in the bot;
• scan_email - command to scan an e-mail address;
• scan_domain - command to scan a domain.</p>
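      <p>A possible shape of such a handler class is sketched below. The method names follow the list above, but the bodies, the message texts and the input checks are assumptions made for illustration, not the authors' code. The callbacks take the (update, context) pair used by python-telegram-bot when <monospace>use_context=True</monospace>, and are exercised here with stand-in objects so that the sketch runs without the library.</p>
      <preformat><![CDATA[
```python
from types import SimpleNamespace

HELP_TEXT = (
    "Available commands:\n"
    "/scan_email <address> - check an e-mail address for data leaks\n"
    "/scan_domain <domain> - collect information about a domain"
)


class TelegramBotHandlers:
    @staticmethod
    def wellcome_message(update, context):
        update.message.reply_text(HELP_TEXT)

    @staticmethod
    def unknown_message(update, context):
        update.message.reply_text("Unrecognized input. Send /help to see the commands.")

    @staticmethod
    def scan_email(update, context):
        # context.args holds the words that follow the command.
        if len(context.args) != 1 or "@" not in context.args[0]:
            update.message.reply_text("Usage: /scan_email <address>")
            return
        # Placeholder: the real lookup would be called here.
        update.message.reply_text(f"Scanning {context.args[0]} for data leaks...")

    @staticmethod
    def scan_domain(update, context):
        if len(context.args) != 1 or "." not in context.args[0]:
            update.message.reply_text("Usage: /scan_domain <domain>")
            return
        # Placeholder: the real lookup would be called here.
        update.message.reply_text(f"Scanning domain {context.args[0]}...")


# Stand-ins for the library objects: replies are collected in a list.
sent = []
fake_update = SimpleNamespace(message=SimpleNamespace(reply_text=sent.append))
fake_context = SimpleNamespace(args=["user@example.com"])

TelegramBotHandlers.scan_email(fake_update, fake_context)
print(sent)
```
]]></preformat>
      <p>Static methods are used because Listing 1 registers the callbacks through the class itself, without creating an instance.</p>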
      <p>An instance of the IntelxScanner class is created in the e-mail scanning function. IntelxScanner
queries the IntelX API to obtain data about the involvement of an e-mail address in data
leaks. If the search succeeds, it reports the total number of data leaks and, if available,
the credentials, including the password. The e-mail scanning function is shown in Listing 2.</p>
      <p>Listing 2: Email scanning function</p>
      <preformat>import os


class IntelxScanner:
    # The private helpers called below (__search_email, __parse_email_stats,
    # __download_first_file, __parse_downloaded_file) are not shown in the listing.

    def search_email(self, email: str):
        """Search for an e-mail address in the IntelX database and return the results.

        Args:
            email (str): e-mail address as a string, e.g. example@domain.com
        """
        result_str = f"No information found about {email}!"
        record_count, search = self.__search_email(email)
        if record_count == 0:
            return result_str

        result_str = f"Information found about {email}:\n\n"
        # Summary statistics for the leaks that mention the address.
        stats_str = self.__parse_email_stats(search=search)
        result_str = result_str + stats_str

        # Download the first matching file and extract the data for this address.
        file_name = self.__download_first_file(search)
        email_data = self.__parse_downloaded_file(file_id=file_name, email=email)
        result_str = result_str + "\n\n" + email_data

        # Delete the downloaded file once it has been parsed.
        os.remove(f"downloads/intelx/{file_name}")
        return result_str</preformat>
      <p>The domain scan function uses the scanner provided by the BuiltWith service. It returns data about
the services and tools used by the domain, including their versions and descriptions, and additional
information such as site rankings and links to social networks, if any are found.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Testing OSINT Telegram-bot functionality</title>
      <p>By purpose, software testing is divided into two
classes: functional and non-functional testing. Functional testing checks whether the
software product complies with the functional requirements stated in its technical design
specification. Non-functional testing evaluates qualities of the software product such as ergonomics or
performance.</p>
      <p>During the load test, it was found that the Telegram server limits the number of requests. According to the Bots FAQ on the
Telegram website, the limits are as follows:
• no more than one message per second in one chat;
• no more than 30 messages per second in total;
• no more than 20 messages per minute in one group.</p>
      <p>The limits can be increased for large bots by Telegram support service.</p>
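      <p>One way for a bot to stay within these limits is to delay outgoing messages on the client side. The sketch below covers the per-chat and the overall limit only; the class <monospace>SendThrottle</monospace> and its method names are hypothetical, the group limit is left out for brevity, and the current time is passed in explicitly so that the behavior can be checked without waiting.</p>
      <preformat><![CDATA[
```python
from collections import deque


class SendThrottle:
    """Tracks send times and reports how long to wait before the next send."""

    def __init__(self, per_chat_interval=1.0, global_limit=30, global_window=1.0):
        self.per_chat_interval = per_chat_interval  # seconds between messages in one chat
        self.global_limit = global_limit            # messages allowed per window overall
        self.global_window = global_window          # window length in seconds
        self.last_sent = {}                         # chat_id -> time of the last message
        self.recent = deque()                       # send times inside the current window

    def delay_for(self, chat_id, now):
        """Seconds to wait before a message to chat_id may be sent at time now."""
        # Forget sends that have already left the global window.
        while self.recent and now - self.recent[0] >= self.global_window:
            self.recent.popleft()
        delay = 0.0
        if chat_id in self.last_sent:
            delay = max(delay, self.last_sent[chat_id] + self.per_chat_interval - now)
        if len(self.recent) >= self.global_limit:
            delay = max(delay, self.recent[0] + self.global_window - now)
        return max(delay, 0.0)

    def record(self, chat_id, now):
        """Register that a message was sent to chat_id at time now."""
        self.last_sent[chat_id] = now
        self.recent.append(now)


throttle = SendThrottle()
throttle.record(chat_id=1, now=0.0)
print(throttle.delay_for(chat_id=1, now=0.25))  # 0.75: same chat, must wait
print(throttle.delay_for(chat_id=2, now=0.25))  # 0.0: another chat may send now
```
]]></preformat>
      <p>In a running bot the caller would pass a monotonic clock reading such as <monospace>time.monotonic()</monospace>, sleep for the returned delay, send the message, and then call <monospace>record</monospace>.</p>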
      <p>The program output with information about the data entered by users and the time of
entry is shown in Fig. 4. The message output from the user is shown in Fig. 5.</p>
      <p>To demonstrate a successful search for data leaks involving an e-mail address, a
sample satisfying the specified conditions, landry.todd@gmail.com, is selected. The output
for an e-mail address with and without data leakage is shown in Figs.
6 and 7.</p>
      <p>Information retrieval by domain is implemented in the bot. This integration provides
detailed data about the tools used, the domain rating and the pages in social networks. An example of the
output is shown in Fig. 8.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusions and directions of further investigations</title>
      <p>The developed OSINT Telegram bot automates corporate information retrieval from open
sources. It is reasonable to use this solution in the Security Operations Centers of cybersecurity companies.
The solution is designed on the basis of the authors' experience at the Cyberoo company.</p>
      <p>The proposed development will be extended with new sources of information
(search engines, specialized databases in the Clearnet and the Deep Web). For certain companies (for example,
design bureaus) graphic information is of primary importance. Further investigations should
study mathematical methods of similarity detection for graphic, audio and video files.</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. O.</given-names>
            <surname>Zoder</surname>
          </string-name>
          .
          <source>Automated Collection of Open Source Intelligence. Master's thesis</source>
          , Masaryk University,
          <year>2020</year>
          , 82 p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mejia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Helling</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Olmsted</surname>
          </string-name>
          ,
          <article-title>Automation of cyber-reconnaissance: A Java-based open source tool for information gathering</article-title>
          ,
          <source>2017 12th International Conference for Internet Technology and Secured Transactions (ICITST)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>424</fpage>
          -
          <lpage>426</lpage>
          , doi: 10.23919/ICITST.2017.8356437.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Azevedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Medeiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessani</surname>
          </string-name>
          .
          <article-title>Automated Solution for Enrichment and Quality IoC Creation from OSINT</article-title>
          .
          <source>INForum</source>
          ,
          <year>2018</year>
          . URL: https://www.researchgate.net/profile/AlyssonBessani/publication/327835294_Automated_Solution_for_Enrichment_and_Quality_IoC_Creation_from_OSINT/links/5cc80ea44585156cd7bbe469/Automated-Solution-for-Enrichmentand-Quality-IoC-Creation-from-OSINT.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Martins</surname>
          </string-name>
          .
          <article-title>Generating Threat Intelligence based on OSINT and a Cyber Threat Unified Taxonomy</article-title>
          .
          <source>Ph. D. thesis</source>
          , Lisbon University,
          <year>2020</year>
          , 125 p.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuropka</surname>
          </string-name>
          .
          <article-title>Models for the representation of natural language documents. Ontology-based information filtering and retrieval with relational databases</article-title>
          .
          <source>Advances in Information Systems and Management Science</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          . Introduction to Information Retrieval. Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>