<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Preliminary Evaluation of Open-Source LLMs for Datalog-Based Semantic Parsing in the ASVIN Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Alviano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Capalbo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Gottlob</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irfan Kareem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Lo Scudo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastiano Piccolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Calabria</institution>
          ,
          <addr-line>Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a preliminary evaluation of open-source Large Language Models (LLMs) for semantic parsing in the context of the ASVIN (Assistente Virtuale di Negozio) project, which aims to develop a regulatory-compliant virtual assistant for the fashion retail sector. The core task involves translating natural language user queries into structured semantic representations using Datalog (or Answer Set Programming), enabling logical reasoning and personalized interaction in e-commerce applications. We construct a pilot dataset of fashion-domain queries annotated with Datalog ground truths and evaluate multiple LLMs using zero-shot and one-shot prompting strategies. The evaluation focuses on syntactic and semantic accuracy, using F1-score as the primary metric, and compares performance across a range of open models including Mistral, Gemma, Qwen, and DeepSeek. Our results show that smaller models such as Mistral-small3.1 can outperform larger counterparts when guided by well-structured prompts, highlighting the importance of prompt design and task framing. This study lays the groundwork for a full-stack integration of LLM-based reasoning in virtual retail assistants.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Semantic Parsing</kwd>
        <kwd>Datalog</kwd>
        <kwd>Answer Set Programming (ASP)</kwd>
        <kwd>Virtual Assistants</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advancements in Large Language Models (LLMs) have significantly improved the capabilities
of conversational agents, enabling fluid and deep interactions across a wide range of domains. These
systems have shown remarkable success in understanding and generating natural language, making
them appealing candidates for applications in virtual assistance, including e-commerce and customer
support scenarios [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. However, while LLMs excel at producing linguistically coherent outputs,
they exhibit notable limitations in tasks requiring formal logical reasoning, consistency, or verifiability
[
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. This shortcoming becomes especially evident in domains such as fashion retail, where users
often seek actionable, context-aware, and personalized recommendations based on incomplete or
underspecified input.
      </p>
      <p>
        In this paper, we present the first systematic effort to leverage open-source LLMs for semantic parsing
into Datalog within the ASVIN project (Assistente Virtuale di Negozio), which develops a
regulatory-compliant virtual shopping assistant for the fashion retail sector. Our central objective is
to bridge the gap between natural language understanding and symbolic reasoning, by transforming
user queries into structured logical representations suitable for downstream reasoning tools such as
Datalog or Answer Set Programming (ASP) [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. This step is critical for grounding the responses
generated by the assistant in formal semantics, enabling explainable and verifiable behaviors in product
recommendation, item compatibility checking, and style advice delivery.
      </p>
      <p>
        Current chatbot systems built on LLMs often rely on retrieval-augmented generation (RAG) or fine-tuned
models to provide answers based on existing product catalogs or textual descriptions [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
However, these systems struggle when user queries are vague, incomplete, or implicit, as often happens
with natural language. For example, a user might ask, “What can I wear with a floral skirt
for a wedding?” or “I need something trendy but comfortable for a summer party.” These inputs lack
explicit references to catalog items or well-formed constraints. A traditional LLM might respond with
generic style suggestions, missing the opportunity to reason over item compatibility, availability, or the
preferences previously expressed by the user.
      </p>
      <p>Example 1. Consider the user query: “What can I wear with a floral skirt for a wedding?” When prompted
with this input, a general-purpose LLM such as ChatGPT may produce a fluent but vague recommendation,
for example:
Classic &amp; Elegant Look
Top: Silk or satin camisole or blouse in a neutral or complementary
color (e.g., ivory, blush, soft grey).</p>
      <p>Shoes: Nude or metallic (gold/silver) heels or strappy sandals.
Accessories: Dainty gold or pearl jewelry, clutch in a neutral or
metallic tone.</p>
      <p>While the language is stylistically appropriate, the suggestions are not grounded in the actual wardrobe of
the user, catalog availability, or any regulatory or stylistic policies that might constrain recommendations.
Moreover, they are not verifiable: the system cannot explain why a satin camisole is compatible with a
floral skirt, or whether that combination aligns with brand-specific guidelines for wedding attire.</p>
      <p>Reinforcing the prompt with additional context might improve the output, but even with such
enhancements, the assistant lacks a structured, verifiable semantic representation of the query. This limits its ability
to reason over inventory, enforce stylistic constraints, or provide explainable recommendations/requirements
that are essential in the ASVIN setting. ■</p>
      <p>In ASVIN, we aim to go further: by extracting a structured semantic representation of the user’s
intent, we can perform formal reasoning over product attributes, compatibility rules, and catalog
constraints. The core idea is to parse natural language inputs into Datalog predicates, which formalize
user goals, constraints, and preferences. These predicates can then be used as input to a logic engine (e.g.,
Datalog or ASP solver), whose output can guide product recommendation or inform further dialogue.
This approach allows us to benefit from both sides: the expressive power of LLMs and the precision of
logical reasoning frameworks.</p>
      <p>Example 2. From the following user question “What can I wear with a floral skirt for a wedding?” we
want ASVIN to extract the following Datalog representation:
request_type(fashion_advice). companion_item(skirt).</p>
      <p>style_of_companion_item(skirt, floral). event(wedding).</p>
      <p>Such a representation can inform and guide the answers of the virtual assistant. If more information is
needed, ASVIN can request it; otherwise, ASVIN can suggest items in the catalog that are appropriate for the
specified style and occasion. Even if reasoning is not yet implemented in the current prototype, this Datalog
representation is designed to serve as input to a logic-based module. Given catalog facts (e.g., available
items and their attributes) and style compatibility rules encoded in ASP, the system could infer suitable
combinations (for example, recommending pastel tops or neutral-toned shoes that match the floral pattern
and formality of a wedding). ■</p>
      <p>Mapping natural language to formal logic is challenging due to its ambiguity and context dependence,
which traditional parsers cannot handle well. To overcome this, we use LLMs with in-context learning
to extract Datalog facts, guiding them through carefully designed domain-specific predicates and
task-specific prompts. Additionally, to address the lack of suitable datasets, we introduce a novel benchmark
consisting of 50 hand-curated fashion-related user queries, each annotated with a corresponding set
of ground truth Datalog predicates. The dataset spans a variety of scenarios, from stylistic queries to
occasion-specific recommendations and compatibility checks.</p>
      <p>Our evaluation focuses on a range of competitive open-source LLMs, including Mistral (3.1-small),
Qwen, Gemma, and DeepSeek, assessed for their ability to extract semantically accurate predicates.
In the zero-shot setting, where models are only provided with a list of valid predicates and a task
description, performance is generally poor, with limited consistency and low F1-scores. In contrast,
in the one-shot setting, where a single example is added to the prompt, performance improves
significantly. The best results are achieved by Mistral 3.1-small, with an F1-score reaching 0.82, illustrating
that even relatively small models can perform well when guided by carefully constructed prompts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>We refer to the ASP-Core-2 format [12] for common constructs of Answer Set Programming (ASP).
Here we consider constants being integers or strings starting with a lowercase letter. A fact has the form
p(t), where p is a predicate and t is a possibly empty sequence of constants. A database is a possibly empty
set of facts. A program (or ASP Knowledge Base) is a set of rules defining conditions to derive new
facts from an input database, or to eliminate undesirable solutions. For the purposes of this paper, it
is sufficient to see a program as a black-box associating one input database to zero or more output
databases (according to the stable model semantics [13]).</p>
      <sec id="sec-2-1">
        <title>Example 3. Let us consider the following database:</title>
        <p>is_item(1, trousers).
is_item(2, shirt).
is_item(3, trousers).
is_appropriate(1, wedding).</p>
        <p>is_appropriate(2, wedding).</p>
        <p>Let us consider the following ASP rule which defines if a product is suitable for a certain type of event:
suitable(ID, Item, Event) :- asks(Item), is_item(ID, Item),
event_type(Event), is_appropriate(ID, Event).</p>
        <p>Finally, let us consider the user query: “Show me the trousers that are appropriate for a wedding”, from
which ASVIN extracts the following facts:
asks(trousers).</p>
        <p>event_type(wedding).</p>
        <p>Feeding the full program into an ASP solver, we obtain the answer set
{suitable(1, trousers, wedding)}. In fact, from the database we know that the product with id=2,
although appropriate for a wedding, is not trousers, while the product with id=3 is trousers but
is not appropriate for a wedding. ■</p>
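<p>To make the semantics of this example concrete, the following plain-Python sketch (our illustration; the paper delegates this to an ASP solver) computes the same join that the rule expresses over the database and the extracted facts.</p>

```python
# Plain-Python sketch of what the ASP rule in Example 3 computes: a join over
# the database and the facts extracted from the user query. Predicate and
# constant names follow the example; the code itself is illustrative only.

is_item = {(1, "trousers"), (2, "shirt"), (3, "trousers")}
is_appropriate = {(1, "wedding"), (2, "wedding")}

# Facts extracted by ASVIN from "Show me the trousers that are appropriate
# for a wedding":
asks = {"trousers"}
event_type = {"wedding"}

# suitable(ID, Item, Event) :- asks(Item), is_item(ID, Item),
#                              event_type(Event), is_appropriate(ID, Event).
suitable = {
    (item_id, item, event)
    for item in asks
    for (item_id, i) in is_item if i == item
    for event in event_type
    if (item_id, event) in is_appropriate
}

print(suitable)  # {(1, 'trousers', 'wedding')}
```

As in the example, only product 1 satisfies all four conditions of the rule body.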
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>The problem addressed in this work is the semantic parsing of natural language user queries into
structured Datalog facts, with the goal of enabling formal reasoning in the context of a virtual assistant
for fashion retail. Specifically, given a user request, we want to obtain a structured representation
of it using a set of predefined domain-specific Datalog predicates. These predicates are designed to
capture semantically relevant aspects of the query, such as the intent of the user (e.g., request for a
recommendation), contextual constraints (e.g., event type, weather, color preferences), or compatibility
conditions (e.g., which items should go together).</p>
      <p>Given a query q in natural language and a dictionary of domain-specific predicates D, we want
to extract a set of Datalog facts F = {p1(t1), p2(t2), . . .} where each pi ∈ D and each ti is a possibly
empty sequence of constants drawn from relevant domains (e.g., clothing items, occasions, styles). The
resulting set of facts F expresses the semantic content of q in a form suitable for downstream reasoning
using Datalog or ASP.</p>
      <p>This paper reports on the initial attempt, in the context of the ASVIN project, to enable the translation
(q, D) ↦ F. Such a translation is part of a broader pipeline, where the set F of facts is subsequently
processed by an ASP solver to address formal reasoning, and enrich user queries expressed in natural
language. Indeed, once a query is parsed into a set of facts, it can be combined with a domain-specific
rule base to derive logical consequences that inform recommendation and retrieval tasks.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Representing User Requests via Domain-specific Predicates</title>
      <p>In order to support formal reasoning over user requests in the ASVIN project, we developed a tailored
set of domain-specific predicates designed to capture the essential semantic elements of fashion-related
queries. These predicates serve as an intermediate representation between natural language
input and logical inference, enabling the system to produce grounded, explainable, and context-aware
recommendations.</p>
      <p>ASVIN focuses primarily on enhancing the in-store shopping experience through a virtual assistant
capable of interpreting vague or under-specified user queries. We identified two primary classes of user
intent: item recommendations and style advice. In an item recommendation, the user explicitly requests
a type of item; for example, “What shoes can I wear under blue navy trousers?”, where the goal is to
suggest suitable shoes. In contrast, style advice occurs when the user seeks guidance without explicitly
specifying what they want; for example, “What can I wear with a blue shirt?”, where the mentioned
item is already possessed by the user, and the assistant must infer compatible additions.</p>
      <p>To formally capture these variations, we designed a set of predicates that captures the type of request
as well as contextual attributes that influence recommendation quality: item properties (such as color,
material, and style), situational parameters (event type, formality level), and temporal context (time of
day, season, or part of the year).</p>
      <p>This structured representation serves two key purposes: (i) it enables symbolic reasoning via ASP to
filter and generate suitable item combinations from the product catalog, and (ii) it provides a stable
target for the semantic parsing task performed by the LLM. Rather than attempting to generate logic
rules or full programs from scratch, we focus on extracting a constrained set of well-defined facts, which
can be reliably interpreted and composed with existing background knowledge.</p>
      <p>type_of_request(request_type): The request type, either recommendation or style advice.
request_item(item_name): An item explicitly requested by the user.
material_of_request_item(item, material): Material of the requested item.
color_of_request_item(item, color): Color of the requested item.
style_of_request_item(item, style): Style of the requested item.
advice_item(item_name): The item for which style advice is being requested.
material_of_advice_item(item, material): Material of the item in the advice request.
color_of_advice_item(item, color): Color of the item in the advice request.
style_of_advice_item(item, style): Style of the item in the advice request.
companion_item(item_name): An item to be matched by the recommended one.
material_of_companion_item(item, material): Material of the companion item.
color_of_companion_item(item, color): Color of the companion item.
style_of_companion_item(item, style): Style of the companion item.
season(season_name): Season or part of the year relevant to the request.
event(event_name): Type of event (e.g., wedding, business meeting).
location_event(location_name): Location where the event will take place.
formality(formality_level): Formality level of the occasion (e.g., casual, formal).
time_of_event(time): Specific time information for the event.
weather(weather_condition): Weather condition relevant to the request.</p>
      <p>We report the set of predicates in Table 1. Besides the predicate type_of_request/1, the other
predicates fall into one of the following groups: 1) predicates that describe the requested item; 2)
predicates that describe the companion item; 3) predicates that describe the item for which the user
is requesting style advice; and 4) situational predicates that describe the event for which the user is
asking the query.</p>
      <p>To facilitate accurate semantic parsing by LLMs, we constrained all domain-specific predicates to
have at most two arguments. This design choice reduces syntactic complexity, aligns well with common
natural language structures, and lowers the likelihood of generation errors—particularly in zero- or
few-shot prompting settings. While this results in a larger set of predicates, the benefits in terms of
model reliability and ease of logical integration outweigh the cost in verbosity.</p>
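<p>A lightweight syntactic check can enforce these design constraints before generated facts reach a solver. The following sketch is our own illustration (not code from the ASVIN prototype): it accepts a fact string only if its predicate belongs to the Table 1 vocabulary with the expected arity, using a regular expression that admits at most two lowercase arguments.</p>

```python
import re

# Hypothetical validator, illustrative only: accepts facts built from the
# Table 1 vocabulary, with at most two lowercase constant arguments.
VOCABULARY = {
    "type_of_request": 1, "request_item": 1, "advice_item": 1, "companion_item": 1,
    "material_of_request_item": 2, "color_of_request_item": 2, "style_of_request_item": 2,
    "material_of_advice_item": 2, "color_of_advice_item": 2, "style_of_advice_item": 2,
    "material_of_companion_item": 2, "color_of_companion_item": 2, "style_of_companion_item": 2,
    "season": 1, "event": 1, "location_event": 1, "formality": 1,
    "time_of_event": 1, "weather": 1,
}
# Predicate name, then one or two comma-separated lowercase constants.
FACT = re.compile(r"^([a-z_]+)\(([a-z0-9_]+(?:,\s*[a-z0-9_]+)?)\)$")

def is_valid_fact(fact: str) -> bool:
    m = FACT.match(fact.strip().rstrip("."))
    if not m:
        return False
    predicate, args = m.group(1), [a.strip() for a in m.group(2).split(",")]
    return VOCABULARY.get(predicate) == len(args)

print(is_valid_fact("style_of_companion_item(skirt, floral)"))  # True
print(is_valid_fact("style(skirt, floral, long)"))              # False
```

Facts with an unknown predicate, a wrong arity, or more than two arguments are rejected before any logical processing.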
      <p>The set of predicates, although minimal, can capture even complex queries, such as queries in which
the user asks for more than one recommendation, queries involving more than one companion item, or even
queries where the user combines both a recommendation and a style advice.</p>
      <sec id="sec-4-1">
        <title>Example 4. Let us consider the following user query:</title>
        <p>“I am looking for a jacket for a winter business trip. I usually wear gray wool trousers. What
would go well with that?”
This query contains both a recommendation request and a style advice. It can be structured as follows:
type_of_request(request).
request_item(jacket).
season(winter).
event(business_trip).
type_of_request(advice).
advice_item(trousers).
companion_item(trousers).
color_of_companion_item(trousers, gray).</p>
        <p>material_of_companion_item(trousers, wool).</p>
        <p>This example showcases the flexibility of the predicates we designed. Here, the trousers are both a companion
item for the recommendation of the jacket, and an item to take into account for a style advice. ■</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. A Synthetic Dataset of Fashion-related Queries</title>
      <p>In the fashion domain, user queries are typically expressed in natural language. To address our
problem, there is a need for a dataset that captures these queries and translates them into structured
representations, such as those based on Datalog facts.</p>
      <p>To create the dataset, we manually generated 10 meta queries, each comprising query and ground_truth
in JSON file format. These meta queries are divided into two types of requests: (i) item recommendation
and (ii) style advice. From these meta-queries, we expanded the dataset to 50 examples by generating 5
examples from each of the 10 meta-queries. Each example comprises a query, a user request in natural
language simulating a possible request in the fashion domain, and a corresponding ground truth, which
represents the semantic meaning of the query as a set of predicates.</p>
      <p>Example 5. Below is an example entry in the produced dataset:
{ "query": "Recommend for gold sequin skirt a strappy high heels,
a satin clutch, and a cashmere wrap for a dinner.",
"ground_truth": [
"type_of_request(item_recommendation)",
"request_item(heels)",
"style_of_request_item(heels, strappy_high)",
"request_item(clutch)",
"material_of_request_item(clutch, satin)",
"request_item(wrap)",
"material_of_request_item(wrap, cashmere)",
"companion_item(skirt)",
"color_of_companion_item(skirt, gold_sequin)",
"event(dinner)" ]}
The above entry illustrates an item recommendation query along with its ground truth, where the attributes
are represented as structured facts. ■</p>
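<p>For downstream evaluation, each ground-truth string must be compared with model output as structured data rather than raw text. The sketch below (our illustration, not the project's code) loads a trimmed version of the entry above and parses each fact into a (predicate, arguments) tuple suitable for set-wise comparison.</p>

```python
import json
import re

# Illustrative sketch: load a (trimmed) dataset entry and parse each
# ground-truth fact string into a (predicate, arguments) tuple.
entry = json.loads("""
{ "query": "Recommend for gold sequin skirt a strappy high heels, a satin clutch, and a cashmere wrap for a dinner.",
  "ground_truth": [
    "type_of_request(item_recommendation)",
    "request_item(heels)",
    "companion_item(skirt)",
    "event(dinner)" ]}
""")

def parse_fact(fact: str):
    # Split "predicate(arg1, arg2)" into its name and argument tuple.
    name, args = re.match(r"(\w+)\((.*)\)", fact).groups()
    return (name, tuple(a.strip() for a in args.split(",")))

facts = {parse_fact(f) for f in entry["ground_truth"]}
print(("event", ("dinner",)) in facts)  # True
```

The same parsing can be applied to the facts emitted by an LLM, so that predictions and ground truth become comparable sets.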
    </sec>
    <sec id="sec-6">
      <title>6. Experiment</title>
      <p>In this section, we present a preliminary evaluation of open-source LLMs for the task of translating
natural language fashion-related queries into structured Datalog predicates. The goal of this experiment
is to assess the ability of different models to perform accurate semantic parsing, providing the formal
representations necessary for downstream logical reasoning in the ASVIN system. We conduct this
evaluation using the benchmark dataset of fashion-related queries introduced in Section 5. The selected
models are: Mistral-small 3.1 24B, Gemma 3 27B, Gemma 3 12B, Mistral 7B, Mixtral, Llama 3 70B, Llama
3 8B, Deepseek R1 70B, Qwen 3 32B, and Phi4 14B. We selected these models to cover a
representative sample of contemporary open-source LLMs with competitive performance across general NLP
benchmarks, despite none being specifically trained for semantic parsing into formal languages.</p>
      <p>We evaluate the LLMs under two prompting strategies: zero-shot and one-shot. In the zero-shot
setting, the model is given only the task description and the input query. In the one-shot setting, a
single example of a successfully parsed query is appended to the prompt, providing a template for the
expected output structure. The full prompt used in both settings is illustrated in Figure 1. The one-shot
variant includes the example at the end, whereas the zero-shot version omits it.</p>
      <p>The prompt frames the model as a specialized AI assistant tasked with translating natural language
requests into structured Datalog facts. It imposes strict output requirements: no explanations or extra
text, only the Datalog program. The core rules instruct the model to (i) identify the request type, (ii)
annotate items and attributes conditionally based on whether the query is a recommendation or a style
advice, and (iii) ensure semantic and syntactic correctness in predicate use. A predefined vocabulary of
allowable predicates is provided, encompassing item attributes (e.g., material, color, style), contextual
cues (e.g., event, season, location), and structural roles (e.g., request or advice items, companion items).
This controlled prompt design ensures consistency in model outputs and facilitates precise evaluation
of the ability of the model to semantically interpret domain-specific queries.</p>
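<p>The two prompting strategies can be sketched as follows. The message contents here are placeholders of our own; the actual task description and predicate vocabulary are those of Figure 1. The only structural difference between the settings is the optional example message.</p>

```python
# Hypothetical assembly of the zero-shot and one-shot prompts described above.
# TASK_RULES stands in for the Figure 1 task description and vocabulary; the
# example pair is appended only in the one-shot variant.
TASK_RULES = (
    "You translate natural language fashion requests into Datalog facts. "
    "Output only the Datalog program, with no explanations or extra text."
)
ONE_SHOT_EXAMPLE = (
    "Query: What can I wear with a floral skirt for a wedding?\n"
    "Facts: request_type(fashion_advice). companion_item(skirt). "
    "style_of_companion_item(skirt, floral). event(wedding)."
)

def build_messages(query: str, one_shot: bool) -> list[dict]:
    messages = [{"role": "system", "content": TASK_RULES}]
    if one_shot:
        # One-shot: a single parsed example is added to the prompt.
        messages.append({"role": "system", "content": ONE_SHOT_EXAMPLE})
    messages.append({"role": "user", "content": query})
    return messages

zero = build_messages("I need a scarf for a winter hike.", one_shot=False)
one = build_messages("I need a scarf for a winter hike.", one_shot=True)
print(len(zero), len(one))  # 2 3
```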
      <p>All experiments were executed on a high-performance server equipped with three AMD MI210 GPUs
(64 GB each), running the models in parallel. To evaluate performance, we use the F1-score, which is
defined as</p>
      <p>F1-Score = 2 × (Precision × Recall) / (Precision + Recall)</p>
      <p>where</p>
      <p>Precision = TP / (TP + FP) and Recall = TP / (TP + FN).</p>
      <p>Here, TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes
the number of false negatives. For each query in the test set, we compute precision, recall, and F1-score
by comparing the facts predicted by the model to the annotated ground truth. The overall F1-score for each
LLM is then obtained as the macro-average across all 50 test queries. That is, we calculate the F1-score
independently for each example and then take the unweighted mean. This approach ensures that each
test case contributes equally, regardless of the number of predicted facts.</p>
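<p>The metric computation described above can be sketched as follows, with facts represented as sets of strings; the per-query data here are illustrative, not taken from our experiments.</p>

```python
# Sketch of the evaluation: per-query precision, recall, and F1 from set
# comparison, then the unweighted (macro) mean of F1 over all queries.
def f1_score(predicted: set, ground_truth: set) -> float:
    tp = len(predicted & ground_truth)   # true positives
    fp = len(predicted - ground_truth)   # false positives
    fn = len(ground_truth - predicted)   # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(results: list[tuple[set, set]]) -> float:
    # Unweighted mean: each test case contributes equally.
    return sum(f1_score(p, g) for p, g in results) / len(results)

queries = [
    ({"event(dinner)", "request_item(heels)"}, {"event(dinner)", "request_item(heels)"}),
    ({"event(party)"}, {"event(dinner)", "request_item(heels)"}),
]
print(macro_f1(queries))  # 0.5
```

The first illustrative query is parsed perfectly (F1 = 1.0), the second yields no correct facts (F1 = 0.0), so the macro-average is 0.5.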
      <sec id="sec-6-1">
        <title>6.1. Results</title>
        <p>Zero-shot Prompt. Without examples, the best-performing models are gemma3 27b and gemma3
12b (F1 = 0.36), outperforming deepseek-r1 70b (0.32) and qwen3 32b (0.26), despite the latter being
designed with reasoning tasks in mind. However, even the highest F1-score of 0.36 indicates poor
overall performance in this setting. This suggests that while some models can partially generalize
the task from instructions alone, semantic parsing to Datalog predicates remains challenging without
guidance. As an archetypal example, llama3.3 70b, a very recent and large model, shows clear difficulty
interpreting the task in zero-shot conditions, with a disappointing F1-score of 0.11.
One-shot Prompt. With a single example, performance improves substantially. Mistral-small3.1
shows the greatest improvement, from 0.14 (zero-shot) to 0.82, achieving the best result overall (Figure 2).
This is notable given its smaller size relative to larger models like deepseek-r1 70b and llama3.3 70b.
Other strong performers include gemma3 27b (0.74), llama3.3 70b (0.72), and deepseek-r1 70b (0.70).
While within the same family larger models tend to perform better (for instance, gemma3 27b performs
better than gemma3 12b, and llama3.3 70b performs better than llama3.1 8b), Figure 2 highlights that
model size, particularly across models of different families, is not a good predictor of performance. In
our case, mistral-small3.1 and gemma3 27b outperform both llama3.3 70b and deepseek-r1 70b, which
have a larger number of parameters. This emphasizes that architecture and training quality matter
as much as size. Interestingly, Gemma models maintain solid results in both zero-shot and one-shot
settings, suggesting strong instruction-following capabilities likely supported by diverse training data.
Overall, these findings confirm that LLMs benefit significantly from example-driven prompting, aligning
with established observations that transformers generalize better in multi-task settings with explicit
guidance.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Interactive Evaluation via ASP Chef</title>
      <p>To facilitate hands-on experimentation with our semantic parsing framework, we provide an executable
ASP Chef recipe, i.e., a self-contained environment that enables users to test LLM-based Datalog
generation and evaluate the results against the annotated ground truth. The recipe is hosted at
https://asp-chef.alviano.net/s/ASVIN/ASPOCP25, and a screenshot focused on its input and output is shown
in Figure 3.</p>
      <sec id="sec-7-1">
        <title>7.1. Purpose and Setup</title>
        <p>This recipe, titled “ASVIN First Report Playground”, allows users to explore how large language models
translate natural language fashion queries into Datalog facts, following the schema and rules established
in our dataset. It supports both zero-shot and one-shot prompting strategies and visualizes performance
through automatic statistics calculation using ASP.</p>
        <p>To run the recipe, users must:</p>
        <p>• Register and configure a Groq API key, following the setup guide provided online (https://asp-chef.alviano.net/s/LLMs/getting-started).</p>
        <p>• Optionally choose a different LLM model from the Groq model catalog by updating the model
name in the @LLMs/Config ingredient. The recipe defaults to using the llama-3.3-70b-versatile
model.</p>
        <p>• Ensure that the temperature is fixed at 0 to maintain deterministic behavior.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Recipe Structure and Ingredients</title>
        <p>The ASP Chef recipe consists of a sequence of ingredients (i.e., modular operations), each handling a
key phase in the evaluation pipeline:
1. Input and Prompt Encoding: The recipe begins with an input JSON object representing a single
dataset entry, consisting of a query and its associated ground_truth. A system message encodes
the structured instructions to the model. If the user wishes to test the one-shot setting, a full
example query with its Datalog translation is included as an additional system message. For
zero-shot, this second message is omitted.
2. User Message Extraction: The natural language query is dynamically extracted from the input
using JSONPath and used as the user message for the LLM interaction. This decouples prompt
design from the data instance and enables automated testing across examples.
3. LLM Interaction: Once the prompt is assembled, the LLM is queried via the Groq API. The response
is expected to be a pure list of Datalog facts, adhering to the rules specified in the prompt.
4. Fact Extraction and Evaluation: The predicted output is parsed and transformed into structured
ASP facts. These are then compared against the ground_truth using ASP rules to compute true
positives (TP), false positives (FP), and false negatives (FN).
5. Metric Calculation: Using ASP expressions, we calculate precision, recall, and F1-score, providing
a clear view of how well the model has reproduced the intended semantics. The results are shown
both as text and as radar/bar charts, offering a concise yet informative summary.</p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Exploration and Insight</title>
        <p>This interactive setup serves as both a debugging interface and an educational tool. It allows users to (i)
test prompt variations (e.g., 0-shot vs. 1-shot), (ii) switch between different LLMs, (iii) directly observe
the syntactic correctness and semantic completeness of generated programs, and (iv) quantitatively
assess model output through logic-based validation.</p>
        <p>By combining prompt-based generation, ASP reasoning, and visual analytics, this recipe exemplifies
a principled and interpretable way to study LLM behavior in structured parsing tasks. It also represents
a prototype for future integrations with broader systems like LLMASP, where such evaluation modules
could be reused in a domain-agnostic and scalable fashion.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Related work</title>
      <p>The use of natural language interfaces in data-centric systems has received sustained attention over
the past decades. In the database domain, numerous efforts have sought to enable users to express
data queries in natural language, bridging the gap between lay users and structured query languages
such as SQL [14]. Despite advances in natural language processing (NLP), substantial challenges persist
in terms of semantic interpretation and accurate query generation, particularly when dealing with
complex user intents and domain-specific knowledge.</p>
      <p>One of the most studied approaches is text-to-SQL parsing, which aims to convert natural language
queries into executable SQL statements [15]. However, the expressiveness of SQL is tightly bound to
the underlying database schema and lacks access to implicit or contextual knowledge. For instance,
a query that implicitly relies on common-sense knowledge (e.g., identifying products appropriate for
Christmas) cannot be resolved purely through schema-bound SQL translation. Some researchers have
proposed augmenting databases with knowledge graphs to address these gaps, but such solutions
often suffer from scalability and maintenance limitations. A recent survey by Hong et al. [16] offers
a comprehensive update on these evolving techniques and their limitations. In the ASVIN project,
we adopt a different approach: rather than generating SQL queries, we extract semantic facts in the
form of Datalog predicates. This has several advantages. First, it decouples query understanding
from rigid schemas, enabling the assistant to operate on logical abstractions that are stable even when
product catalogs evolve. Second, it supports symbolic reasoning, allowing ASVIN to combine user
intent with regulatory constraints and stylistic compatibility rules in a principled, explainable manner.
Finally, as shown in our experiments, open-source LLMs can generate these facts reliably using prompt
engineering, even in zero- or one-shot settings, without requiring access to large SQL-based training
corpora.</p>
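<p>The Christmas example can be sketched as follows: the LLM extracts an intent fact, and a Datalog-style rule joins it with background knowledge that no schema-bound SQL translation would see. The predicates and the tiny forward-chaining routine are illustrative assumptions, not the ASVIN knowledge base.</p>

```python
# Illustrative sketch (hypothetical predicates, not the ASVIN knowledge base):
# intent facts extracted by the LLM are joined with common-sense background
# knowledge by a Datalog-style rule
#   suggest(P) :- requested_occasion(O), suitable_for(P, O).

intent_facts = {("requested_occasion", "christmas")}          # from the LLM
background = {("suitable_for", "wool_sweater", "christmas"),  # domain knowledge
              ("suitable_for", "swimsuit", "summer")}

def suggestions(intent, knowledge):
    """Forward-chain the single rule above over the two fact sets."""
    occasions = {o for (p, o) in intent if p == "requested_occasion"}
    return {item for (p, item, o) in knowledge
            if p == "suitable_for" and o in occasions}

print(suggestions(intent_facts, background))  # {'wool_sweater'}
```

<p>Because the rule operates on logical abstractions rather than table columns, the background knowledge can evolve independently of any product catalog schema.</p>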
      <p>Closely related to our objective, Shaw et al. [17] investigated compositional semantic parsing,
highlighting how complex queries can be decomposed into smaller semantic units to improve interpretation
and system robustness. Their results underscore the importance of generalization in parsing models,
particularly in handling linguistic variability and nested semantic structures. This line of work aligns
with our own focus on breaking down fashion-related queries into structured Datalog facts using
predefined predicate templates.</p>
      <p>With the advent of LLMs, new opportunities have emerged for improving semantic parsing. Brown
et al. [18] demonstrated that LLMs can significantly outperform earlier rule-based and template-based
approaches, especially for complex SQL features like multi-table joins and nested queries. Notably,
their integration of user feedback loops allows for dynamic refinement of generated queries, improving
both interpretability and accuracy. More recently, Schneider et al. [19] evaluated LLMs on generating
structured queries (e.g., graph queries) from dialogue inputs, finding that few-shot prompting and
fine-tuning substantially enhance performance, particularly for smaller models with weak zero-shot
capabilities.</p>
      <p>In parallel with these developments, our previous work on LLMASP [20, 21] introduced a
general-purpose framework for connecting LLMs with symbolic reasoning via configurable prompting strategies.
LLMASP supports domain-independent applications through two configurable layers: application files,
which define the target logic task, and behavior files, which provide meta-prompts and formatting
instructions. This modularity enables LLMASP to operate across diverse problem domains (including
planning, diagnosis, and classification) without changing its internal structure.</p>
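<p>As a rough illustration of this two-layer design, an application file and a behavior file might look as follows; this is a schematic, hypothetical fragment, and the field names are not taken from the LLMASP distribution.</p>

```yaml
# application file: defines the target logic task (hypothetical fields)
predicates:
  - category(Item)     # what kind of product the user wants
  - occasion(Event)    # when the item will be worn
task: "Map the user request to ground facts over the predicates above."

# behavior file: meta-prompt and formatting instructions (hypothetical fields)
preamble: "You are a semantic parser. Output only Datalog facts."
output_format: "One fact per line, terminated by a dot."
```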
      <p>The present work differs from LLMASP in several key respects. First, it is domain-specific, targeting
semantic parsing within the fashion retail space as part of the ASVIN project. Second, we adopt a static
prompt design, hardcoding a highly specialized template for Datalog generation, rather than relying
on dynamically assembled meta-prompts. Third, while LLMASP was evaluated exclusively on LLaMA
3 models (8B and 70B), our analysis spans a broader range of open-source LLMs, including Mistral,
Gemma, Qwen, Phi, and DeepSeek, allowing for a more comprehensive comparison of model behaviors
under constrained logical tasks.</p>
      <p>Despite these diferences, we consider LLMASP a natural next step for future extensions of our work.
Integrating the models and task defined here into the LLMASP pipeline would allow us to investigate
whether the best-performing models in our hard-coded prompt setting (notably, mistral-small3.1 and
gemma3:27b) retain their advantage under dynamic prompting strategies. Such an integration could
also offer insights into the trade-offs between fixed-task optimization and prompt flexibility, as well as
the impact of meta-prompt design on semantic fidelity and generalization.</p>
      <p>In summary, our study builds on a growing body of research at the intersection of LLM-based semantic
parsing and structured query generation, contributing a focused evaluation in a practical, regulated
domain while pointing toward broader generalization through systems like LLMASP.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>This paper presented a preliminary evaluation of open-source LLMs for semantic parsing in the fashion
retail domain, a core component of the ASVIN project. The goal is to translate natural language queries
into Datalog facts to enable explainable, regulation-compliant virtual assistance. Fashion-related queries
present unique challenges due to their reliance on style, context, and implicit semantics. Our results
show that zero-shot prompting yields limited accuracy, highlighting the difficulty of the task. However,
the substantial improvement in one-shot settings demonstrates that the problem is feasible when
supported by well-crafted prompts, validating the importance of prompt engineering.</p>
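<p>The gap between the two settings comes down to whether the prompt carries a worked example; the sketch below assembles both variants, with instruction wording and the example pair chosen purely for illustration rather than taken from the paper's actual template.</p>

```python
# Sketch of zero- vs one-shot prompt assembly for Datalog fact extraction.
# The instruction wording and the worked example are illustrative only.

INSTRUCTION = ("Translate the user request into Datalog facts over the "
               "predicates category/1, color/1 and occasion/1.")

ONE_SHOT_EXAMPLE = (
    "Request: I need a red dress for a wedding.\n"
    "Facts: category(dress). color(red). occasion(wedding)."
)

def build_prompt(request: str, one_shot: bool = False) -> str:
    parts = [INSTRUCTION]
    if one_shot:                      # one-shot: include a worked example
        parts.append(ONE_SHOT_EXAMPLE)
    parts.append(f"Request: {request}\nFacts:")
    return "\n\n".join(parts)

print(build_prompt("something warm for Christmas", one_shot=True))
```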
      <p>These findings open several avenues for future work. Tools like LLMASP have shown that
predicate-focused prompts can improve control and accuracy; adapting our task to this framework is a natural
next step. Moreover, grammar-constrained generation techniques (though not explored here) represent
a promising direction for ensuring syntactic and semantic correctness. A current limitation is the lack
of domain-specific datasets. Our pilot dataset is a first step, and we plan to extend it with greater variety
and real-world data from fashion companies. Additionally, we envision developing fashion knowledge
bases that encode concepts such as styles and outfit logic. These could guide ASVIN in interactive
sessions, helping the assistant request missing information and reason over product catalogs more
effectively.</p>
      <p>Overall, our results demonstrate that semantic parsing in this domain is both challenging and
achievable, laying the groundwork for intelligent assistants that blend language, logic, and fashion.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of University and Research (MUR) under PRIN project
PRODE “Probabilistic declarative process mining”, CUP H53D23003420006, under PNRR project FAIR
“Future AI Research”, CUP H23C22000860006, under PNRR project Tech4You “Technologies for climate
change adaptation and quality of life improvement”, CUP H23C22000370006, and under PNRR project
SERICS “SEcurity and RIghts in the CyberSpace”, CUP H73C22000880001; by the Italian Ministry of
Health (MSAL) under POS projects CAL.HUB.RIA (CUP H53C22000800006) and RADIOAMICA (CUP
H53C22000650006); by the Italian Ministry of Enterprises and Made in Italy under project STROKE 5.0
(CUP B29J23000430005); under PN RIC project ASVIN “Assistente Virtuale Intelligente di Negozio” (CUP
B29J24000200005); and by the LAIA lab (part of the SILA labs). Mario Alviano is member of Gruppo
Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM).</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used ChatGPT-4o for grammar and spelling checks.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication's content.</p>
      <p>[11] K. Olawore, M. McTear, Y. Bi, Development and Evaluation of a University Chatbot Using Deep
Learning: A RAG-Based Approach, in: International Symposium on Chatbots and Human-Centered
AI, Springer, 2024, pp. 96–111.
[12] F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone, M. Maratea,
F. Ricca, T. Schaub, ASP-Core-2 Input Language Format, TPLP 20 (2020) 294–309. doi:10.1017/S1471068419000450.
[13] M. Gelfond, V. Lifschitz, Logic programs with classical negation, in: D. Warren, P. Szeredi (Eds.),
Logic Programming: Proc. of the Seventh International Conference, 1990, pp. 579–597.
[14] C. Ma, B. Molnár, Ontology Learning from Relational Database: Opportunities for Semantic
Information Integration, Vietnam Journal of Computer Science 09 (2022) 31–57. doi:10.1142/S219688882150024X.
[15] G. Katsogiannis-Meimarakis, G. Koutrika, A survey on deep learning approaches for text-to-SQL,
The VLDB Journal 32 (2023) 905–936.
[16] Z. Hong, Z. Yuan, Q. Zhang, H. Chen, J. Dong, F. Huang, X. Huang, Next-generation database
interfaces: A survey of LLM-based text-to-SQL, arXiv preprint arXiv:2406.08426 (2024).
[17] P. Shaw, M.-W. Chang, P. Pasupat, K. Toutanova, Compositional generalization and natural
language variation: Can a semantic parsing approach handle both?, arXiv preprint arXiv:2010.12725 (2020).
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information
Processing Systems 33 (2020) 1877–1901.
[19] P. Schneider, M. Klettner, K. Jokinen, E. Simperl, F. Matthes, Evaluating large language models in
semantic parsing for conversational question answering over knowledge graphs, arXiv preprint
arXiv:2401.01711 (2024).
[20] M. Alviano, L. Grillo, Answer Set Programming and Large Language Models interaction with
YAML: Preliminary Report, in: CILC, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[21] M. Alviano, L. Grillo, F. Lo Scudo, L. A. Rodriguez Reiners, Integrating Answer Set Programming
and Large Language Models for Enhanced Structured Representation of Complex Knowledge in
Natural Language, in: IJCAI 2025, Montreal, Canada, August 16-22, 2025, ijcai.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Llasa: Large language and e-commerce shopping assistant</article-title>
          ,
          <source>arXiv preprint arXiv:2408.02006</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <article-title>Intelligent virtual assistants with llm-based process automation</article-title>
          ,
          <source>arXiv preprint arXiv:2312.06677</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. I.</given-names>
            <surname>Roumeliotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Tselikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Nasiopoulos</surname>
          </string-name>
          ,
          <article-title>LLMs in e-commerce: a comparative analysis of GPT and LLaMA models in product review evaluation</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>100056</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Vedula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rokhlenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <article-title>Question suggestion for conversational shopping assistants using product metadata</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2960</fpage>
          -
          <lpage>2964</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>van Rooij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Empowering llms with logical reasoning: A comprehensive survey</article-title>
          ,
          <source>arXiv preprint arXiv:2502.15652</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alotaibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Graph of Logic: Enhancing LLM Reasoning with Graphs and Symbolic Logic</article-title>
          , in: 2024
          <source>IEEE International Conference on Big Data (BigData)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>5926</fpage>
          -
          <lpage>5935</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-t.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <article-title>LogicAsker: Evaluating and improving the logical reasoning ability of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.00757</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Marek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Truszczyński</surname>
          </string-name>
          ,
          <article-title>Stable models and an alternative logic programming paradigm</article-title>
          ,
          <source>in: The Logic Programming Paradigm: a 25-year Perspective</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>375</fpage>
          -
          <lpage>398</lpage>
          . doi:10.1007/978-3-642-60085-2_17.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Niemelä</surname>
          </string-name>
          ,
          <article-title>Logic programming with stable model semantics as a constraint programming paradigm</article-title>
          ,
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>25</volume>
          (
          <year>1999</year>
          )
          <fpage>241</fpage>
          -
          <lpage>273</lpage>
          . doi:10.1023/A:1018930122475.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tangarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for optimizing rag for domain chatbots</article-title>
          ,
          <source>arXiv preprint arXiv:2401.06800</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>