<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Educational content aggregator with an open API</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhii Yevseiev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanna Zavolodko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostiantyn Foksha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerii Zavolodko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Aksonova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Individual Entrepreneur</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The article presents a complete architecture and implementation of a web-based educational content aggregator system that collects, structures, and updates information from open sources via an open API. The relevance of the topic stems from the rapid growth in the number of online courses and learning platforms, which creates a need for tools for easy navigation and personalized access to educational resources. The paper substantiates technical and methodological approaches to the creation of such a system, including the implementation of a Python/Django-based backend, the use of the MariaDB database, web crawling modules, and the creation of a mobile application in Flutter. The proposed architecture takes a modular approach, using interfaces for connecting to different platforms, which allows the system to scale and adapt easily to new sources. At the center of user interaction is an aggregator that analyzes the request, checks the availability of relevant data in the database, and, if necessary, launches procedures for collecting and updating information. Special attention is paid to the logic of the parsers, which can adapt to different web page structures and thereby ensure the solution's versatility. Multilevel functionality testing was conducted on four educational platforms: Coursera, Alison, Sololearn, and Edx. The analysis covers processing speed, the amount of processed data, parsing accuracy, and error tolerance. The results demonstrated the stability of the system, especially when working with platforms that have clearly structured content. The developed aggregator can be integrated into broader EdTech ecosystems, including at the state level, and used to build personalized educational trajectories. The system is able to serve both formal and informal educational requests, promoting the spread of digital skills, the openness of education, and improvements in its quality.</p>
      </abstract>
      <kwd-group>
        <kwd>content aggregator</kwd>
        <kwd>web service</kwd>
        <kwd>open API</kwd>
        <kwd>educational platforms</kwd>
        <kwd>web crawling</kwd>
        <kwd>cloud technologies</kwd>
        <kwd>mobile application</kwd>
        <kwd>data parsing</kwd>
        <kwd>Python/Django</kwd>
        <kwd>digital education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today's digital environment, aggregator sites play an important role in providing access to a large
amount of information from various sources in a unified manner. As noted by Matyash D. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Google
actively indexes aggregator sites, as they provide users with aggregated data that simplifies search
and decision-making.
      </p>
      <p>
        The main technical tool for implementing aggregators is web crawling, an automated process of
collecting data from web pages. A detailed explanation of the principles of crawling, as well as
methods of robot management, is presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Along with this, Henderson A. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides an
overview of modern free web crawling tools that allow you to effectively collect and update
information in aggregator databases.
      </p>
      <p>
        An important role in building aggregators is also played by APIs (application programming
interfaces), which allow structured data to be received directly from source servers. IBM
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] clearly explains the essence of APIs as a standard for interaction between systems. An alternative
approach to obtaining structured information is to use RSS feeds, described by Whitehead C. T. [5]
as another convenient tool for content aggregation.
      </p>
      <p>
        In practice, data collection tools are implemented using programming languages, in particular
Python, which has a developed ecosystem of libraries for working with HTTP requests. Gorobtsov
V. [6] demonstrates the use of a web scraper for data collection, while the requests library [
        <xref ref-type="bibr" rid="ref5 ref6">7-11</xref>
        ] acts as
a de facto standard for executing HTTP requests in Python.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The purpose of this study</title>
      <p>The purpose of this study is to develop an architecture for a web service for aggregating educational
content with support for an open API and a mobile application that provides convenient access to
non-formal education.</p>
      <p>This paper presents the development and functionality of a web service that acts as an aggregator
of educational content from different platforms. The main tasks are: to implement the server
infrastructure for course aggregation; to create a mobile client on Flutter; to provide automatic data
updates via web crawling and API; to provide tools for forming a personal educational trajectory; and
to implement and test the system comprehensively as an integrated solution.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The methodological basis</title>
      <p>The methodological basis is the use of modern web technologies and principles of data integration:
the Python/Django stack is used for the server side; the MariaDB database provides storage of user
profiles, courses, and platforms; the web scraper system is developed in Python using the requests
[8] and BeautifulSoup libraries; the architecture is modular, which allows adaptation to different
platforms through interfaces.</p>
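      <p>A minimal sketch of the requests and BeautifulSoup combination described above is shown
below; the CSS selector is hypothetical, since every platform uses its own markup:</p>
      <preformat>
# Sketch of static-page parsing with requests + BeautifulSoup.
# The selector "h3.course-title" is hypothetical; each platform needs its own.
import requests
from bs4 import BeautifulSoup

def parse_course_titles(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Collect the visible text of every element that marks a course title.
    return [tag.get_text(strip=True) for tag in soup.select("h3.course-title")]
      </preformat>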
      <p>The system also has a mobile application built on Flutter that interacts with the backend via an
open API, providing cross-platform compatibility and convenient access to learning content.</p>
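      <p>The paper does not enumerate the API routes, so the following Django sketch only illustrates
the kind of read-only JSON endpoint the Flutter client could consume; the model and field names
are assumptions for illustration:</p>
      <preformat>
# Hypothetical read-only endpoint of the open API (Django view).
# The Course model and its fields are assumptions, not the project's actual code.
from django.http import JsonResponse

from courses.models import Course  # assumed app and model


def course_list(request):
    """Return aggregated courses as JSON for the mobile client."""
    data = [
        {"title": c.title, "platform": c.platform, "url": c.url}
        for c in Course.objects.all()[:100]
    ]
    return JsonResponse({"courses": data})
      </preformat>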
      <p>The system can support both free and paid courses, allowing users to create their own educational
trajectory according to their interests and professional goals. The project helps to increase access to
quality education and develop digital skills in society.</p>
      <p>The diagram in Figure 1 demonstrates the architecture and information flows of a web service
that aggregates and provides access to training courses from various non-formal education
platforms. The architecture includes both the server side and the mobile client.</p>
      <p>The platform's architecture implements a full cycle of user interaction with the educational
content aggregator - from the first request to receiving updated results.</p>
      <p>The upper part of Figure 2 shows the target audience of the platform: all citizens who want to
improve their digital competencies. This includes schoolchildren, students, teachers, entrepreneurs,
IT professionals, opinion leaders, civil servants - in fact, anyone interested in modern non-formal
education.</p>
      <p>The aggregator is at the center of the interaction. Its main function is to receive requests from the
user (client), check the availability of data, and return relevant information. The whole process is
divided into three conditional blocks: Client, Web server, and Database server.</p>
      <p>During the development of the educational content aggregator, a comprehensive analysis of
available web scraping tools was conducted to select the most suitable technologies for the task. The
key evaluation criteria included the ability to handle dynamic content, session support, processing
speed for large HTML/XML datasets, and capabilities for bypassing protection mechanisms or
asynchronous data loading. Based on this analysis, a combination of Requests and BeautifulSoup4
was chosen as the primary toolset for parsing static web pages, offering ease of implementation and
high code readability. For more complex scenarios involving JavaScript-rendered content, Selenium
was employed to simulate full browser interactions. While LXML was considered for high-speed
parsing of structured XML documents and Scrapy for scalable and rule-based crawling pipelines,
their use was limited due to the specific structure and access policies of educational platforms. This
hybrid approach enabled a flexible and effective aggregation mechanism tailored to heterogeneous
data sources.</p>
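      <p>The hybrid strategy can be expressed as a small dispatch routine: requests for static pages,
Selenium for JavaScript-rendered ones. The sketch below simplifies driver setup, and whether a
platform needs the browser path is a per-source decision:</p>
      <preformat>
# Sketch of the hybrid fetching strategy described above.
import requests
from selenium import webdriver

def fetch_html(url: str, needs_browser: bool = False) -> str:
    if not needs_browser:
        # Fast path: plain HTTP request for statically rendered pages.
        return requests.get(url, timeout=10).text
    driver = webdriver.Chrome()  # assumes a locally available Chrome driver
    try:
        driver.get(url)          # lets JavaScript render the page
        return driver.page_source
    finally:
        driver.quit()
      </preformat>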
      <p>The process starts with a user request. If the database is unavailable, the system offers to retry;
if that also fails, it generates a page from the available or previously updated data.</p>
      <p>If the database is available, the web server makes a request to it. In response, the system either
returns the information already stored or updates it: the aggregator collects new data from open
educational platforms, checks its relevance, and stores it in the database. Then, an HTML page with
relevant content is generated and delivered to the user.</p>
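      <p>In Python-like pseudocode, this cache-or-refresh logic could be sketched as follows; all helper
names are hypothetical placeholders for the components just described:</p>
      <preformat>
# Sketch of the cache-or-refresh decision; every helper is a placeholder.
def handle_request(query, db):
    if not db.is_available():
        return render_error_page(retry=True)      # offer the user a retry
    courses = db.find_courses(query)
    if courses and db.is_fresh(query):
        return render_page(courses)               # serve stored data as-is
    fresh = collect_from_platforms(query)         # launch crawling/parsing
    db.update(fresh)                              # store validated results
    return render_page(db.find_courses(query))
      </preformat>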
      <p>This approach ensures the flexibility and reliability of the system: even in the event of temporary
failures, it is able to update the information independently, check it for compliance, and provide the
user with the result without the intervention of the administrator.</p>
      <p>Thanks to this architecture, the aggregator can effectively serve different user groups and scale
to the needs of educational initiatives, EdTech platforms, or government services.</p>
      <p>To give users access to courses from other learning platforms, prostoEDU needs to store those
courses in its own database. To receive courses from different platforms, an information aggregator
is used that scans data from other platforms and updates the relevant information. The general
scheme of the aggregator is shown in Figure 3.</p>
      <p>The basic requirements for the program are: verification of data collection; definition of
characteristics; error handling; speed and efficiency; and availability of process data.</p>
      <p>The aggregator uses parser programs, each of which processes an individual website according
to a given algorithm. Once started, the aggregator launches these data-collection programs, which
read and analyze the structure of web pages and extract the necessary information, such as headings,
text, images, and links. After processing each site, the aggregator collects all the information obtained
and enters the data into a database, for example, as a list of new courses on the platform. Thus, the
aggregator collects and compiles information from the platforms.</p>
      <p>Since site structures differ, it becomes necessary to write an individual parser for each platform.
However, each parser uses common parts of the code that should not be duplicated but transferred
to separate interfaces. Using these interfaces, it is convenient to change the code for all programs in
one place at once, rather than changing the code for each one separately.</p>
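      <p>One natural way to express such shared interfaces in Python is an abstract base class from
which each platform-specific parser inherits; the class and method names below are assumptions,
not the project's actual identifiers:</p>
      <preformat>
# Sketch of a shared parser interface: common logic lives in the base class,
# and only the platform-specific extraction is overridden per site.
from abc import ABC, abstractmethod

import requests


class BaseParser(ABC):
    def fetch(self, url: str) -> str:
        """Shared download logic, written once for all parsers."""
        return requests.get(url, timeout=10).text

    @abstractmethod
    def extract_courses(self, html: str) -> list[dict]:
        """Platform-specific extraction, implemented by each subclass."""


class CourseraParser(BaseParser):
    def extract_courses(self, html: str) -> list[dict]:
        ...  # parse Coursera's page structure here
      </preformat>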
      <p>To ensure that modern educational platforms can promptly provide up-to-date courses, articles,
or news from dozens of other sources, an aggregator works behind the scenes. It all starts with
getting the URL of the site from which you want to collect data. The system then generates a list of
pages that contain potentially useful information, such as course lists or individual training modules.</p>
      <p>In order not to overload the database with duplicates, the aggregator performs an important
check: it excludes links from which information has already been saved in the system. This allows
you to store only new or updated data, saving resources and reducing the workload.</p>
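      <p>This duplicate check amounts to a set difference between the links just discovered and those
already stored; a sketch, with the storage lookup assumed:</p>
      <preformat>
# Sketch of the duplicate-exclusion step: keep only links whose data has
# not been saved yet. `load_saved_urls` is a hypothetical storage helper.
def filter_new_links(discovered: list[str], db) -> list[str]:
    saved = set(db.load_saved_urls())
    return [url for url in discovered if url not in saved]
      </preformat>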
      <p>Next, the system checks whether all pages have been processed. If not, it proceeds to the stage of
collecting information from a particular page. This is usually done by so-called parsers, which are
mini-programs that extract text, headings, images, or metadata and prepare them for storage.</p>
      <p>Once all the pages have been processed, the system proceeds to update the database, i.e., adds
new records or updates existing ones. Finally, it shuts down, preparing for the next update
cycle.</p>
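      <p>Putting the stages together, one full aggregation cycle reads as a short loop; every helper here
is a placeholder for the components described above:</p>
      <preformat>
# Sketch of one full update cycle: list candidate pages, skip known ones,
# parse the rest, and update the database. All helpers are placeholders.
def run_update_cycle(site_url: str, parser, db):
    pages = parser.list_candidate_pages(site_url)  # course lists, modules
    pages = filter_new_links(pages, db)            # exclude saved links
    records = []
    for page_url in pages:
        html = parser.fetch(page_url)
        records.extend(parser.extract_courses(html))
    db.update(records)                             # add new or refresh rows
      </preformat>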
      <p>Thanks to this approach, aggregators can keep content up to date, work without human
intervention, and most importantly, provide users with convenient and quick access to the most
important things.</p>
      <p>Data processing algorithms depend on page content, web page structure, data type and format,
data volume and complexity, data availability, website limitations, and other factors, but the general
steps of a web parser are shown in Figure 4.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Research Results</title>
      <p>Within the master's project, an educational content aggregation system was tested that allows: receiving data
from various sources (including paid and free courses); storing aggregated information in a unified
structure; viewing content through a web interface or mobile application; supporting automatic data
updates (through a task scheduler); scaling the solution for use by educational institutions or
communities.</p>
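      <p>The task scheduler is not named in the paper; a minimal stdlib-only periodic loop is sketched
below, although a production deployment would more likely rely on cron or a task queue:</p>
      <preformat>
# Minimal periodic-update loop (standard library only). A real deployment
# would typically delegate scheduling to cron or a task queue instead.
import time

UPDATE_INTERVAL_SECONDS = 6 * 60 * 60  # every six hours (assumed interval)

def scheduler_loop(sources, db):
    while True:
        for site_url, parser in sources:
            run_update_cycle(site_url, parser, db)
        time.sleep(UPDATE_INTERVAL_SECONDS)
      </preformat>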
      <p>The developed educational content aggregator system was tested in three stages with different
amounts of input data. This allowed us to evaluate the stability
of the system, the speed of information processing, and the accuracy of the results for four popular
educational platforms: Coursera, Alison, Sololearn, and Edx. A table comparing the results of the
program is shown in Table 1.</p>
      <p>At the first stage of testing with a small amount of data, the system demonstrated high accuracy
and fast response. For the Coursera and Sololearn platforms, the average processing time for one
piece of information did not exceed 1 second. All data was collected without errors.</p>
      <p>The second stage involved increasing the number of courses, which made it possible to identify
the dependence of performance on the amount of input data. The Coursera and Sololearn platforms
maintained low processing time per unit (less than 1 second), but in the case of Coursera, there were
isolated errors related to network timeouts. At the same time, the Alison and Edx platforms showed
significantly slower processing - over 3 and 7 seconds per unit, respectively.</p>
      <p>At the third stage of testing, the aggregator processed the full data set. Despite the heavy load,
the system coped with the task: more than 8,700 Coursera courses were processed with an average
time of about half a second per unit, while the Alison and Edx platforms took longer and recorded
more errors. Sololearn proved the least demanding platform, consistently showing the lowest
processing time and no errors.</p>
      <p>A separate test of adding 100 courses to the database showed stable backend operation: all data
was saved without errors, duplication, or loss. This indicates the effective integration of the modules
for collecting, processing, and storing information.</p>
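      <p>Duplication-free persistence of this kind is commonly achieved in Django with
update_or_create, using the course URL as a natural unique key; the model and field names below
are assumptions:</p>
      <preformat>
# Sketch of duplicate-free persistence via Django's ORM.
# The Course model and its fields are assumptions for illustration.
from courses.models import Course  # assumed app and model

def save_courses(records: list[dict]) -> None:
    for rec in records:
        # Existing rows (matched by URL) are updated in place;
        # unseen URLs produce new rows, so no duplicates appear.
        Course.objects.update_or_create(
            url=rec["url"],
            defaults={"title": rec["title"], "platform": rec["platform"]},
        )
      </preformat>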
      <p>The study has confirmed that the system has a high level of scalability, flexibility to the structure
of different platforms, and the ability to keep data up-to-date in real time. Despite differences in
processing speed for different sources, the developed aggregator demonstrates stable and reliable
operation at all stages of testing.</p>
      <p>The aggregator collects data by launching individual parsers for each platform. Thanks to the use
of unified interfaces, the processing logic can be changed centrally. This increases reliability and
simplifies system maintenance.</p>
      <p>We also describe typical processing algorithms that depend on the structure of the sites and the
features of their data (Fig. 4).</p>
      <p>The proposed educational content aggregator system has been successfully tested as an integrated
solution that includes a server part, a data collection module, and a mobile application. This approach:
increases the accessibility of quality education; promotes the digital transformation of the
educational process; provides integration with other EdTech solutions; and has the potential to be
scaled and used in formal and non-formal educational environments.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Matyash</surname> <given-names>D.</given-names></string-name>
          <article-title>Why do we need aggregator sites, why does Google love them so much?</article-title>
          [Electronic resource]. Access mode: https://jam.in.ua/blоg/navishchо-pоtribni-sajty-ahrehatоry-chоmugооgle-ikh-tak-liubyt/. Title from the screen.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>What is crawling and how to manage robots</article-title>
          [Electronic resource]. Access mode: https://www.bizmaster.xyz/2019/04/schо-take-krauling-i-yak-keruvaty-rоbоtamy.html. Title from the screen.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Henderson</surname> <given-names>A.</given-names></string-name>
          <article-title>15 Best FREE Website Crawler Tools &amp; Software (2023 Update)</article-title>
          [Electronic resource]. Access mode: https://www.guru99.com/web-crawling-tools.html. Title from the screen.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>Digital Commerce Intelligence</article-title>
          [Electronic resource]. Access mode: https://www.dexi.io. Title from the screen.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Popovych</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zavolodko</surname>
            <given-names>G.</given-names>
          </string-name>
          <article-title>Analysis of Methods for Classification and Aggregation of Textual Data From Images</article-title>
          .
          <source>Security of Infocommunication Systems and Internet of Things</source>
          ,
          <year>2024</year>
          ,
          <volume>2</volume>
          .1:
          <fpage>01008</fpage>
          -
          <lpage>01008</lpage>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [11]
          <string-name><surname>Korolekh</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Zavolodko</surname> <given-names>G.</given-names></string-name>.
          <article-title>Enhancing digital search: Synergizing the Levenshtein algorithm with NLP techniques</article-title>,
          in IX International Scientific and Practical Conference "Scientific Problems and Options for Their Solution," Bucharest, Romania, Feb. 7-9, <year>2024</year>, International Scientific Unity, pp. <fpage>60</fpage>-<lpage>64</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>