<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Structured Text Extraction from ID Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olti Qirici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GenAI</institution>
          ,
          <addr-line>Gemini Flash 2.0, AI Agent, Text extraction, ID Cards information</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes an experiment, with its methodology and results, addressing a problem on which, not many years ago, entire PhD theses were written and the results described with little success. The Generative AI revolution has made it possible, with small amounts of code and limited resources, to achieve almost instantaneously results that only years ago were considered state-of-the-art. Companies offering such services have faced a sudden fall in income, and some are even threatening to close business. While this is not an economic analysis but a technological assessment of our time, we use these technological advancements to illustrate the ease with which text can be extracted from images, even complex images such as ID cards, which have watermarks and color differences, in a simple and more affordable manner. In the following paper, the author describes the results of an experiment in which Generative AI services were used to extract the details of specimen ID cards from various European countries, and the same pipeline was then tested on smartphone photos taken under poor lighting, with the same results. The text information is extracted in JSON format, ready to be published by any web service.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>For centuries humans have tried to vest machines with human capabilities. Speaking, listening, seeing, sensing machines that do hard work for humans have been the dream of millennia. While the aim of having machines rather than men perform hard labor has accompanied humanity for ages, it was the privilege of the 20th and, most notably, the 21st century to make a stupendous leap in this direction.</p>
      <p>
        We could not but notice the tremendous advancement brought by one specific achievement, which enlightened the post-COVID period and indeed made it possible for this state-of-the-art technology to disrupt the way computers behave, opening a new field of communication. This breakthrough is GenAI. There had been attempts for some decades to build such machines, but it was the privilege of the third decade of the 21st century to produce such a marvelous technology.
      </p>
      <p>
        During a discussion in our Computer Science Department on the achievements of GenAI, the author emphasized that, years ago, doctorates with dubious results would have been written on the matter we present here: the extraction of text from an image. Similar solutions, such as extraction from PDF documents, including distilled PDFs containing images, have been reported in the literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While there were methods ranging from filtering an image to find its edges (see the relatively recent paper by Kumar R. et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to feeding the transformed image into a pretrained pattern-recognition system, such as a neural network, a PhD candidate would sometimes try to achieve the desired results by meticulously reproducing the very font type used to train the system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Such a challenge was described by several authors, including Sumathi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], who, despite the different methods used, still notify the readers of the difficulties of extracting the text from the image. Nowadays, this is no longer needed: utilizing a simple GenAI service and requesting the required information with an adequate prompt produces far better results in a shorter time.
      </p>
      <p>16th International Conference in Recent Trends and Applications in Computer Sciences and Information Technologies. olti.qirici@fshn.edu.al (O. Qirici); ORCID 0000-0003-1080-5403 (O. Qirici). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>We will hereafter build a system based on two basic GenAI calls to the Gemini API, extract the text from specimen documents of various ID cards, and then produce a JSON that can feed an API endpoint organizing these data. This method can achieve excellent results, especially when the extraction is supervised by a human who corrects and updates wrongly extracted values.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The methodology utilizes a Generative AI service to extract the information on the ID cards of various European countries. In order to have a broader sample, we include languages that do not necessarily use Latin letters, such as Greek on the Greek ID card, specific European languages such as Albanian, and further documents such as the Kosovo ID card and the Croatian ID card. For all these ID cards, specimens were used to extract the information (the specimens of ID cards are general ID card patterns for a specific country that define the visual standard of such a document).</p>
      <p>
        After choosing the input information, a model should be specified. The model chosen for GenAI is Gemini Flash 2.0. As identified by Gemini [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Gemini API offers different models that are optimized for specific use cases, and more specifically Gemini Flash 2.0 is optimized for next-generation features, speed, thinking, real-time streaming, and multimodal generation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An additional difficulty arising from the same source is that Gemini (at least on the day of the experimentation) does not support the Albanian language. But, as we will see, this is not a problem, at least for the kind of task we are trying to solve.
      </p>
      <p>
        According to Charles Brecque [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], whilst text extraction from PDFs can leverage natural language processing and large language model technologies, text extraction from images relies more heavily on computer vision and optical character recognition. Nevertheless, we will show that the results from models such as Gemini Flash 2.0 are indeed impressive. This is something that has been anticipated in various technological environments. As D. Regalado put it: “This is a huge leap forward, not just in model performance but in AI’s ability to be more integrated and effective, it is a big step towards more powerful AI tools for all of us.” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] Another advantage of using Gemini Flash 2.0 is that Google offers limited utilization of the APIs for Enterprise Google users, which makes experimentation easy and cost-efficient for didactical purposes.
      </p>
      <sec id="sec-2-1">
        <title>System design</title>
        <p>In the following image, a system has been designed for this specific experiment:</p>
        <p>As can be seen from the schema, the ID card that the user uploads is posted to the Gemini Flash 2.0 API with the instruction for text extraction. After the text is received from the cloud service, it must be formatted in the specified JSON format, so a second call is sent to the API, which tries to allocate the information to the respective fields. For this purpose, the AI service must semantically analyze the text and assign the appropriate information to the appropriate JSON label.</p>
        <p>To make it simpler to delve into the design of the system rather than program its generic parts, Python was chosen as the programming language. Simple snippets linking to the Gemini API services are available across the web.</p>
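        <p>As a minimal sketch, the two calls of the schema can be written with the google-generativeai client library. The model name string, the prompt wording, and the JSON labels below are illustrative assumptions, not the exact ones used in the experiment; the network-facing part is kept inside one function so the helper logic stands on its own.</p>

```python
import json

# Hypothetical JSON template for the second call; the exact labels used in
# the experiment are not given in the paper, so these are assumptions.
ID_TEMPLATE = {
    "surname": "",
    "given_name": "",
    "date_of_birth": "",
    "document_number": "",
    "nationality": "",
}

EXTRACT_PROMPT = "Extract every piece of text visible on this ID card."


def structuring_prompt(raw_text: str) -> str:
    """Build the second-step prompt that asks the model to allocate the
    extracted text to the matching JSON labels."""
    return (
        "Fill this JSON template using only the extracted text below. "
        "Reply with valid JSON and nothing else.\n"
        f"Template: {json.dumps(ID_TEMPLATE)}\n"
        f"Extracted text:\n{raw_text}"
    )


def parse_reply(reply: str) -> dict:
    """Models often wrap JSON answers in Markdown fences; strip them."""
    body = reply.strip().removeprefix("```json").removeprefix("```")
    body = body.removesuffix("```")
    return json.loads(body)


def extract_id_card(image_path: str, api_key: str) -> dict:
    """The two Gemini Flash 2.0 calls described in the schema: raw text
    extraction from the image, then semantic allocation to JSON labels.
    Imports are local so the pure-Python helpers above run without the
    client library installed."""
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-2.0-flash")
    # Call 1: post the uploaded ID card with the extraction instruction.
    raw = model.generate_content([EXTRACT_PROMPT, Image.open(image_path)]).text
    # Call 2: ask the model to fill the structured JSON template.
    reply = model.generate_content(structuring_prompt(raw)).text
    return parse_reply(reply)
```

        <p>A call such as <monospace>extract_id_card("specimen_al.jpg", api_key)</monospace> would then return a Python dictionary ready to be serialized for an API endpoint.</p>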
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>As previously explained, the first step of the experiment is acquiring ID cards for some European countries. The following image shows some samples (specimens) of real ID cards used for experimentation (all specimens were taken from the specific Wikipedia pages of the ID cards of each of the above countries):</p>
      <sec id="sec-3-1">
        <title>Extraction results</title>
        <p>The above cards were used for experimentation.</p>
        <p>The output of the system after executing the first step, the text extraction, is the following (here the solution is demonstrated only for the Albanian ID card, with the emphasis that the same results were obtained by executing the code on the ID cards of the other countries):</p>
        <p>Following the above information extraction, a JSON structured template was sent to the Gemini Flash 2.0 API. The structured information was correctly detected and provided. Indeed, if the images above are thoroughly inspected, it can be verified that all the extracted data were correct (even the handwritten information, such as the signature field).</p>
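        <p>Such a correctness check, and the human-supervised correction suggested in the introduction, can be sketched as a simple comparison between the model's JSON and a manually verified record; the field names below are illustrative assumptions, not the exact schema of the experiment.</p>

```python
def fields_to_review(extracted: dict, verified: dict) -> list:
    """Return the JSON labels whose extracted value disagrees with the
    human-verified value; missing keys count as disagreements, so a
    reviewer can correct or fill them in."""
    return [
        key for key, expected in verified.items()
        if extracted.get(key) != expected
    ]
```

        <p>An empty result means the extraction matched the verified record on every label, which is the outcome observed for the specimens above.</p>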
        <p>The results were also checked on other ID cards (which cannot be shown for ethical reasons, as they hold personal information), and the extracted information was always correct.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We demonstrated above the ease of implementing a solution such as the one proposed to solve the problem of Document Information Extraction. The same experiment was run on different types of documents, and excellent extraction results were achieved (for specific documents such as water supply bills or electricity bills). Even when the medium was badly handled (the paper washed out in a washing machine, or shredded) and the data could hardly or doubtfully be read by the human eye, the Gemini Flash service was still capable of extracting the text.</p>
      <p>Even though the experiment for this paper was performed with Gemini Flash, mainly for reasons of cost cautiousness, the same results were obtained by executing the experiment on ChatGPT, GPT-4o mini, etc. Of course, Agentic AI (AI agents in general) is transforming the technology and, as such, utilizes the information produced by GenAI to plan and act accordingly (even trying to build a logical map such as the one designed above).</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools to create or edit the above text.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] IBM watsonx, "<article-title>Extracting text from documents</article-title>," 26 March <year>2025</year>. [Online]. Available: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-api-textextraction.html?context=wx.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><surname>Kumar R.</surname></string-name> et al., "<article-title>A Study: Extraction of Text on Image</article-title>," <source>International Journal of Advance Research and Innovative Ideas in Education</source>, Vol. <volume>7</volume>, Issue 1, pp. <fpage>904</fpage>-<lpage>909</lpage>, <year>2021</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>A.</given-names> <surname>Rosales</surname></string-name>, "<article-title>Top AI tools for text extraction</article-title>," 21 June 2024. [Online]. Available: https://medium.com/@andrea.rosales08/top-ai-tools-to-extract-text-fromdocuments-43c3641124a2.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><surname>Sumathi</surname> <given-names>C. P.</given-names></string-name> et al., "<article-title>Techniques and challenges of automatic text extraction in complex images: A survey</article-title>," <source>Journal of Theoretical and Applied Information Technology</source>, Vol. <volume>35</volume>, No. 2, pp. <fpage>225</fpage>-<lpage>235</lpage>, <year>January 2012</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] "<article-title>Gemini models</article-title>," 27 March <year>2025</year>. [Online]. Available: https://ai.google.dev/gemini-api/docs/models.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>C.</given-names> <surname>Brecque</surname></string-name>, "<article-title>An Introduction To Text Extraction From Images</article-title>," 19 July 2024. [Online]. Available: https://textmine.com/post/an-introduction-to-text-extraction-from-images.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>D.</given-names> <surname>Regalado</surname></string-name>, "<article-title>Google Gemini 2.0 Flash Is Here!</article-title>," 12 December <year>2024</year>. [Online]. Available: https://davidregalado255.medium.com/%EF%B8%8F-gemini-2-0-is-here-c593b6fe6de0.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>