<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DOLAP</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Human in Data Analytics: Who leads this Dance?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgia Koutrika</string-name>
          <email>georgia@athenarc.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Athena Research Center</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <abstract>
        <p>Integrating AI in data analysis methods promises to unleash the power of data and to democratize data access. It also raises several concerns. Perhaps the most important one is the risk of less conscious decision-making, or even of pushing away domain and technical experts, as automation is applied over the whole data lifecycle.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. AI in Data Analytics: The Opportunities</title>
      <p>
        Machine learning has been used for extracting insights
from data for a long time. For example, it is used for
analyzing historical data to forecast future trends and
behaviours, for extracting customer sentiment and
preferences from customer reviews, or for anomaly detection
in financial transactions to identify fraudulent activities.
Recently, the introduction of deep learning methods to
data analysis has enabled processing larger volumes of
complex data at higher speeds and generating deeper
insights. For instance, by using AI chatbots, businesses
can allow average users to analyze large data sets and
quickly extract key insights. For example, a salesperson
can ask questions such as: “Why did the sales decrease in
March?”. Generative AI can provide summary statistics
and visualizations, offering an immediate and intuitive
understanding of the data (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]). If code is needed,
AI can also help with code generation by translating to
different programming languages and summarising code
snippets (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
AI used in data access and analysis promises data
democratization, as more users can directly leverage stored
data without programmers as middlemen; increased
efficiency, as many tasks, from programming to data
acquisition, computations, and experimentations, become faster
or easier; and new discoveries that were not possible
before or would be harder to reach.
      </p>
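        <p>As a concrete, entirely hypothetical illustration, the kind of analysis such an assistant might generate behind the scenes for the question “Why did the sales decrease in March?” can be sketched as plain aggregation code; the data, field names, and region labels below are illustrative, not from any real tool:</p>

```python
from collections import defaultdict

# Hypothetical (month, region, revenue) records -- illustrative only.
sales = [
    ("Feb", "North", 120.0), ("Feb", "South", 80.0),
    ("Mar", "North", 70.0),  ("Mar", "South", 78.0),
]

def biggest_drops(records, before, after):
    """Rank regions by how much revenue fell between two months."""
    totals = defaultdict(float)
    for month, region, revenue in records:
        totals[(month, region)] += revenue
    regions = {region for _, region, _ in records}
    drops = {r: totals[(before, r)] - totals[(after, r)] for r in regions}
    # Largest absolute decline first.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

print(biggest_drops(sales, "Feb", "Mar"))
# [('North', 50.0), ('South', 2.0)] -- North accounts for most of the drop
```

        <p>An assistant would then render such a ranking as a chart or a sentence; the point is that the underlying computation is simple enough for a non-programmer to request in natural language.</p>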
    </sec>
    <sec id="sec-3">
      <title>2. AI in Data Analytics: The Risks</title>
      <p>Example. At the time of the writing, a researcher asks
an AI chatbot about existing work on a topic and receives
a list of papers, including:</p>
      <p>2. Handling Unanswerable Questions in Text-to-SQL with</p>
      <p>3. Unanswerable Question Detection in Text-to-SQL with
Natural Language Inference (arXiv 2023)</p>
      <p>4. A Benchmark for Unanswerable Question Detection in
Text-to-SQL (arXiv 2023)</p>
      <sec id="sec-3-2">
        <p>
          The papers are not only highly relevant but also
published in well-known venues. The researcher happily
announces that there is work on this topic and that the
original paper is already well cited. Unfortunately, none of
these papers exists. □
Inevitably, AI is transforming the way we work with and
leverage data, and that raises strong concerns [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>AI tools can lead to more reliance on pattern
recognition without a real understanding of the data. In the
case of generative AI, people tend to treat it as a data
retrieval tool and hence trust its responses. However, as
generative AI models learn the patterns and structure of their
input training data, they can easily make up facts
(hallucinations). Furthermore, algorithmic results can hide, or
even amplify, bias in the data.</p>
        <p>On the other hand, when using deep learning models,
answer verification, provenance, and explainability are
hard. Results can be irreproducible. Consider a
traditional search engine, where a user submits a keyword
search, say “data governance act” and the engine returns
results. The user can easily check the result relevance
and their credibility by simply checking the data (web
pages). Using a tool like ChatGPT makes this (currently)
impossible as there is no way to check how the answer
is supported by the data. How can we trust the results?
How can we base decisions on non-provable outcomes?</p>
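        <p>To make the concern tangible, here is a toy sketch, not a real verification method, of what “checking how the answer is supported by the data” could look like: flag answer sentences that share too few content words with any source passage. All names and the threshold are made up for illustration:</p>

```python
def support_score(sentence, passage):
    """Fraction of the sentence's words that also occur in the passage."""
    s = set(sentence.lower().split())
    p = set(passage.lower().split())
    return len(s & p) / max(len(s), 1)

def unsupported(answer_sentences, passages, threshold=0.5):
    """Return answer sentences that no source passage supports."""
    return [s for s in answer_sentences
            if max(support_score(s, p) for p in passages) < threshold]

passages = ["the data governance act entered into force in 2022"]
answer = ["the data governance act entered into force in 2022",
          "it was repealed last year"]
print(unsupported(answer, passages))
# ['it was repealed last year'] -- only the unsupported claim is flagged
```

        <p>Real systems would need far stronger entailment checks, but even this crude overlap test shows what a generative tool currently does not expose: a link from each claim back to the data.</p>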
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. AI and the Human in Data Analytics</title>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>One of the biggest challenges we need to solve concerns
the limits of AI, the nature of human intelligence, and
how best to regulate the interaction between the two.
This challenge encloses all applications of AI in our lives,
and has several dimensions, including algorithmic,
ethical, regulatory, and environmental ones. In this statement
paper, we focus specifically on how AI in data access and
analysis should lead to accelerated but not less conscious
decision-making. We discuss two inter-connected
directions researchers should work on.</p>
      <p>Human-led Decision Making. Domain and
technical experts are involved in all phases of the data lifecycle,
from building the data pipelines and data analysis tools
to using these tools for decision making. As AI is
applied over the whole data lifecycle, it is natural to see the
number of people involved decreasing. Nevertheless,
humans should always be part of the process. Programmers
should check that the code generated by AI is bug-free
and does what it is intended to do. Domain experts
should check that the algorithm or the data analysis tool
performs well for a given domain. For example, can we
build validation tools and benchmarks that help experts
test the data analysis tool, and more specifically the AI
algorithms working behind the scenes?</p>
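        <p>One possible shape for such a validation tool, sketched under the assumption that the AI-backed component can be called as a plain function; the stub tool and test cases below are hypothetical, not a real benchmark:</p>

```python
def validate(tool, cases):
    """Run expert-labeled cases through the tool; return the failures."""
    failures = []
    for question, expected in cases:
        got = tool(question)
        if got != expected:
            failures.append((question, expected, got))
    return failures

# A stubbed "tool" standing in for an AI-backed analysis component.
stub_tool = {"avg salary?": 52000, "headcount?": 11}.get
# Cases labeled by a domain expert with the answers they know to be true.
cases = [("avg salary?", 52000), ("headcount?", 10)]

print(validate(stub_tool, cases))
# [('headcount?', 10, 11)] -- one disagreement surfaced for expert review
```

        <p>The value of such a harness is less the pass/fail count than the surfaced disagreements, which give the expert concrete cases to inspect.</p>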
      <p>Humans should also be part of any decision-making
process empowered by an AI algorithm or a tool. The
tools should aid but not replace the humans. Humans,
with their depth of understanding, ethical judgement and
creativity, are irreplaceable. To actually enable conscious
and informed decisions by humans, we need to build
tools that enable and require human involvement.</p>
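        <p>A minimal sketch of a tool design that requires human involvement: an AI-suggested action runs only after an explicit human decision. The function and callback names are hypothetical:</p>

```python
def run_with_human_gate(suggestion, execute, approve):
    """Execute an AI suggestion only if the human approver says yes."""
    if approve(suggestion):
        return execute(suggestion)
    return None  # rejected: nothing happens without the human

# Usage: a lambda stands in for a real review interface.
result = run_with_human_gate(
    "DELETE FROM staging_table",
    execute=lambda sql: f"executed: {sql}",
    approve=lambda sql: False,  # a cautious human declines
)
print(result)  # None -- the destructive action was blocked
```

        <p>The design choice is that approval is a required argument, not an optional flag: the tool cannot act without a human decision in the loop.</p>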
      <p>Explainable by Design. Towards this direction, we
need to build tools that can explain their answers to
the user. In general, explainable AI implements specific
techniques and methods to ensure that each decision
made by the ML algorithm can be traced and explained. A
data analysis tool supported by AI should be able to show
how the answer is traced back to or supported by the
original data. This entails work at both the algorithmic
and the user interaction level.</p>
      <p>At the algorithmic level, there are two ways to achieve
explainability: (in-processing) algorithms with built-in
explainability capabilities, and (post-processing) methods
that trace the answer back to the data. At the interaction
level, we can draw inspiration from the search engine
paradigm, where results are connected to their source.</p>
      <p>Data analysis tools could follow a similar interaction
paradigm. Results should be accompanied by evidence.</p>
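        <p>The post-processing route can be sketched as results that carry pointers back to the source records that produced them; all names below are illustrative, not from the paper:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Answer:
    value: float
    evidence: list = field(default_factory=list)  # ids of supporting rows

def total_with_provenance(rows):
    """Aggregate while recording which input rows support the result."""
    total = sum(amount for _, amount in rows)
    return Answer(value=total, evidence=[row_id for row_id, _ in rows])

ans = total_with_provenance([("r1", 10.0), ("r2", 5.0)])
print(ans.value, ans.evidence)  # 15.0 ['r1', 'r2']
```

        <p>A user interface can then render the evidence list as clickable sources, mirroring how a search engine links every result to its page.</p>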
      <p>In this way, checking the correctness of the answer
becomes an inseparable part of the data analysis process.</p>
      <p>In many ways, explainable AI is more critical than
responsible or fair AI. Responsible AI looks at AI during the
planning stages to make the AI algorithm responsible
before the results are computed. Explainable AI looks at AI
results after the results are computed. Without
explainability, we cannot tell for sure whether the results are
correct, biased, or generated by a responsible algorithm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Castro</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <article-title>Solo: Data discovery using natural language questions via a selfsupervised approach</article-title>
          ,
          <source>Proc. ACM Manag. Data</source>
          <volume>1</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3626756. doi:10.1145/3626756.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Tableau GPT</source>
          , https://tableau.com/products/tableauai (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Unified pre-training for program understanding and generation</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2655</fpage>
          -
          <lpage>2668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Van Noorden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Perkel</surname>
          </string-name>
          ,
          <source>AI and science: what 1,600 researchers think</source>
          , Nature (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>