<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Contrastive Learning for Fine-grained Attribute Extraction from Fashion Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shubham Paliwal</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bhagyashree Gaikwad</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mayur Patidar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manasi Patwardhan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lovekesh Vig</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meghna Mahajan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bagya Lakshmi V</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shirish Karande</string-name>
        </contrib>
        <aff>TCS Research, India</aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>27</volume>
      <issue>2023</issue>
      <abstract>
        <p>Fashion attributes are key to many downstream tasks in e-commerce such as product recommendation, fashion captioning, item matching, and fashion image retrieval and generation. Generally, fashion attributes are arranged in an ontology where one fashion attribute may be assigned one or more values. Most state-of-the-art (SOTA) approaches model attribute extraction as a multi-label classification problem and do not consider attribute-value relatedness information during training, which leads to poor performance on fine-grained attribute extraction. To address this issue, we propose Ontology Guided Supervised Contrastive Learning for Fine-grained Fashion Attribute Extraction (OGSCL-FAE), where we leverage a fashion ontology to create strong negative pairs, model attribute extraction as a matching problem, and fine-tune a pre-trained CLIP model for attribute extraction. The proposed approach outperforms existing SOTA approaches on two public datasets, DeepFashion and FashionAI, by 11.65% top-5 recall rate and 0.93% mAP, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Fashion</kwd>
        <kwd>Fashion attribute classification</kwd>
        <kwd>Ontology guided training</kwd>
        <kwd>Supervised contrastive loss</kwd>
        <kwd>CLIP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>To extract fine-grained attributes, a model must first
learn to focus on an appropriate part of the image which depicts the position of an attribute type,
such as Sleeve Length, and then has to perform the hard-to-distinguish task of differentiating
between fine-grained attribute values for that attribute type.</p>
      <p>
        Most off-the-shelf SOTA approaches [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref7 ref8 ref9">7, 8, 9, 10, 11, 12, 13, 14, 15</xref>
        ] treat attribute recognition
as a multi-stage, hierarchical, multi-label (or multi-class) classification task. Some of these
approaches use multi-task learning, with product category classification, landmark and/or
key-point detection as auxiliary task(s), to improve performance. However, these approaches
do not leverage the relatedness of attribute values embedded in the attribute ontology. For
example, an image representation (e.g., I1 in Figure 1) should be closer to the attribute value
representations with which it is labelled (e.g., text prompt T1 for attribute value ‘wrist length
sleeves’). This indirectly brings the image representations of two images having the same
attribute value (e.g., ‘wrist length sleeves’) for an attribute type (e.g., ‘Sleeve Length’) closer, and
pushes image representations farther apart when the two images hold distinct values (e.g., ‘turtle neck’ and
‘Ruffle Semi-High Collar neck’) for an attribute type (e.g., ‘neck design’). More importantly, to
embed the attribute relatedness depicted by the ontology, the image representation should be
farther from the attribute values which are siblings (other values of the same attribute type) of
the attribute value the image is annotated with. For example, image representation I1 in Figure
1 should be farther from the text prompt representation T2 for attribute value ‘long length sleeves’,
which is a sibling of (belongs to the same attribute type ‘Sleeve Length’ as) the attribute value
‘wrist length sleeves’ with which I1 is labelled. Such attribute values belonging to the same
attribute type are hard to distinguish.
      </p>
      <p>
        In this work, we model attribute extraction as a matching problem [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. We fine-tune
the pre-trained CLIP model by contrasting image-attribute representations. We address the
aforementioned limitation of prior work via a novel supervised contrastive learning-based
training mechanism that leverages the fashion ontology to create hard negative pairs. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] use
contrastive learning with object-level supervision to align pre-trained language and vision
models by increasing the difficulty of mini-batches over training epochs based on an object-level
ontology. In contrast, our approach exploits an ontology built for the fashion domain for the
more fine-grained task of fashion attribute extraction. The proposed approach outperforms
existing SOTA approaches on two public datasets, viz. DeepFashion [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and FashionAI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] by
        11.65% top-5 recall rate and 0.93% mAP, respectively. The main contributions of this work are:
• We model fine-granular multi-label fashion attribute classification as a
matching problem to capture the relatedness of attribute values embedded in the attribute
ontology.
• We propose a novel ontology-guided supervised contrastive learning approach for fashion
attribute extraction.
• The proposed approach outperforms existing baselines on two public datasets, viz.
DeepFashion and FashionAI.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Problem Description</title>
      <p>
        A fashion ontology (𝒪) [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 20, 19</xref>
        ] consists of fashion-related concepts (e.g., Product Category
(PC), Attribute Type (AT), and Attribute Value (AV)) which are arranged in the form of a
hierarchy and are connected to each other via appropriate relationships. For example, ‘Blouse’ is
an instance of a product category with Neck, Sleeve, Print, etc., as attribute types, and turtle,
draped collar, etc., are some of the attribute values of attribute type Neck (Figure 1). Fashion
images are annotated w.r.t. an ontology 𝒪 by domain experts at all levels, i.e., product category,
valid attribute types, and corresponding attribute values.
      </p>
      <p>Given a fashion ontology 𝒪 and a corresponding annotated dataset 𝒟 = {(x_1, y_1), ..., (x_N, y_N)},
where the i-th image x_i is annotated with y_i consisting of its product category pc_i, the valid attribute
types {at_i^1, ..., at_i^k} for that category, and the corresponding valid attribute values av(at_i^j), the
objective is to automatically annotate a test image w.r.t. 𝒪.</p>
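      <p>As a concrete illustration of this setup, a tiny slice of such an ontology and one annotated example are sketched below; the category and attribute names follow the example above, while the specific value lists and the dictionary layout are only illustrative.</p>
      <preformat>
# A tiny slice of a fashion ontology O: product category -> attribute type -> attribute values
ontology = {
    "Blouse": {
        "Neck": ["turtle", "draped collar", "Ruffle Semi-High Collar"],
        "Sleeve Length": ["wrist length sleeves", "long length sleeves"],
        "Print": ["floral", "striped", "solid"],
    }
}

# One annotated training example (x_i, y_i) from the dataset D
example = {
    "image": "img_000123.jpg",
    "product_category": "Blouse",
    "attributes": {"Neck": "turtle", "Sleeve Length": "wrist length sleeves"},
}
      </preformat>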
    </sec>
    <sec id="sec-4">
      <title>3. Proposed Approach</title>
      <p>
        We model fine-grained fashion attribute extraction as a matching problem where we fine-tune
CLIP [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] via a supervised contrastive loss (SupCon) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] that maximizes the cosine similarity
between matching image and attribute representations. To handle class imbalance, we augment
SupCon with an asymmetric contrastive focal loss [23] during training. During inference, we
choose valid attribute types and values based on the cosine similarity between the image and
attribute representations.
      </p>
      <p>3.1. Training</p>
      <p>3.1.1. Supervised Contrastive Language Image Pre-training Fine-tuning (SCLIP-F)</p>
      <p>
        Similar to CLIP, we obtain the multimodal image representation by first passing the image x_i to CLIP’s image
encoder (E_img) and then through a multimodal image projection layer (W_img), i.e., v_i = W_img · E_img(x_i).
We use a textual prompt t_j to verbalize an attribute value and obtain its multimodal
representation via CLIP’s text encoder (E_txt) and a multimodal text projection layer (W_txt), i.e.,
u_j = W_txt · E_txt(t_j).
      </p>
      <p>CLIP is pre-trained on (image, text) pairs by maximizing the cosine similarity between
representations of the ℬ matching (image, text) pairs and minimizing the cosine similarity for the ℬ² − ℬ
invalid pairs in a batch of size ℬ, as shown in Eq. 1.</p>
      <p>
        ℒ_CLIP = − (1/ℬ) ∑_{i=1..ℬ} log [ exp(⟨v_i, u_i⟩/τ) / ∑_{j=1..ℬ} exp(⟨v_i, u_j⟩/τ) ]   (1)
      </p>
      <p>where v_i and u_j are the projected image and text representations, ⟨·, ·⟩ denotes cosine similarity,
and τ is a temperature parameter. Unlike CLIP, all (image, attribute) pairs in ℬ which share the same
attribute value for an attribute type (e.g., ‘collar design’) are referred to as positive pairs and all others
as negative pairs. During the fine-tuning of CLIP, we maximize the cosine similarity between
representations of positive (image, attribute) pairs and minimize the cosine similarity between
negative pairs, as shown in Eq. 2.</p>
      <p>
        ℒ_SupCon = − ∑_{i=1..ℬ} (1/|P(i)|) ∑_{p∈P(i)} log [ exp(⟨v_i, u_p⟩/τ) / ∑_{j=1..ℬ} exp(⟨v_i, u_j⟩/τ) ]   (2)
      </p>
      <p>where P(i) is the set of attribute prompts in the batch that form positive pairs with image i.</p>
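      <p>A minimal PyTorch sketch of a supervised contrastive (image, attribute) matching loss of the form of Eq. 2 is shown below; the function name, tensor layout and temperature value are illustrative assumptions rather than the exact implementation used in our experiments.</p>
      <preformat>
import torch
import torch.nn.functional as F

def supcon_matching_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Supervised contrastive (image, attribute) matching loss (Eq. 2 style).

    img_emb: (B, D) projected image embeddings v_i
    txt_emb: (B, D) projected attribute-prompt embeddings u_j
    labels:  (B,) attribute-value id of each pair; pairs sharing an id are positives
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) scaled cosine similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # average the log-probability over the positives of each image
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
      </preformat>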
      <sec id="sec-4-1">
        <title>3.1.2. Ontology Guided Supervised Contrastive Learning</title>
        <p>Class imbalance in a dataset makes it harder to learn good representations for rare classes via
supervised contrastive learning, due to the absence of positive pairs for low-frequency attribute
values in ℬ. To alleviate this issue, inspired by [23], we augment ℒ_SupCon with a focal loss, as
shown in Eq. 3, and directly minimize the cosine similarity among negative (image, attribute)
pairs (ℒ_neg), as shown in Eq. 4. In addition, we apply attribute-invariant transformations [25]
(data augmentation) over the images present in 𝒟.</p>
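        <p>As an illustration of the ontology-guided hard negative pairs used above, the sketch below samples negatives for an image from the sibling values of its annotated attribute value; the ontology dictionary and the function are hypothetical helpers, not part of a released implementation.</p>
        <preformat>
import random

def sample_hard_negatives(ontology, attribute_type, positive_value, k=4):
    """Hard negative attribute values: siblings of the positive value,
    i.e. the other values of the same attribute type in the fashion ontology."""
    siblings = [v for v in ontology[attribute_type] if v != positive_value]
    return random.sample(siblings, min(k, len(siblings)))

# Example: hard negatives for an image labelled 'wrist length sleeves'
ontology = {"Sleeve Length": ["sleeveless", "short sleeves", "elbow length sleeves",
                              "wrist length sleeves", "long length sleeves"]}
negatives = sample_hard_negatives(ontology, "Sleeve Length", "wrist length sleeves")
        </preformat>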
      </sec>
      <sec id="sec-4-2">
        <title>3.1.3. Training</title>
        <sec id="sec-4-2-1">
          <title>3.2. Inference</title>
          <p>We fine-tune pre-trained CLIP on a feature extraction dataset  by minimizing ontology-guided
supervised contrastive focal loss as shown in Eq. 5.</p>
        <p>Given a test image x, we obtain its multimodal representation v via the fine-tuned image
encoder and the multimodal image projection layer, i.e., v = W_img · E_img(x). To predict the
product category, we calculate the cosine similarity between v and the multimodal representation
of the textual prompt corresponding to each product category present in 𝒪 and choose the one
with maximum cosine similarity, argmax_pc ⟨v, u_pc⟩. For attribute prediction, we calculate the
cosine similarity between v and the multimodal representations of the textual prompts
corresponding to each attribute value of an attribute type (applicable to that product category)
and choose the value with maximum cosine similarity; we repeat this for all attribute types
independently.</p>
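        <p>A sketch of this inference procedure is given below; the encoder callables and the prompt template are assumed helpers named only for illustration.</p>
        <preformat>
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_attributes(image, ontology, encode_image, encode_prompts):
    """Predict the attribute value of every attribute type for one test image
    by cosine similarity between the image embedding and prompt embeddings."""
    v = F.normalize(encode_image(image), dim=-1)             # fine-tuned image embedding, (D,)
    predictions = {}
    for attribute_type, values in ontology.items():
        prompts = [f"a photo of clothing with {val} {attribute_type}"   # illustrative prompt template
                   for val in values]
        u = F.normalize(encode_prompts(prompts), dim=-1)     # (num_values, D)
        scores = u @ v                                       # cosine similarity per value
        predictions[attribute_type] = values[scores.argmax().item()]
    return predictions
        </preformat>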
        <p>Table 1: Comparison of different approaches on DeepFashion for attribute classification using recall-rate@k; rows list approaches and columns report the attribute types Shape, Texture, Fabric, Part and Style.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Attributes</title>
      </sec>
      <sec id="sec-4-4">
        <title>Shape</title>
      </sec>
      <sec id="sec-4-5">
        <title>Texture</title>
      </sec>
      <sec id="sec-4-6">
        <title>Fabric</title>
      </sec>
      <sec id="sec-4-7">
        <title>Part</title>
      </sec>
      <sec id="sec-4-8">
        <title>Style Table 1 Approach</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>4.1. Dataset Details</title>
        <p>
          DeepFashion [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]: It consists of 289,222 fashion images (Train: 209,222, Validation: 40,000 and
Test: 40,000), each belonging to one of 50 different categories and annotated w.r.t. the ontology,
which consists of 5 attribute types and 1000 attribute values.
        </p>
        <p>
          FashionAI [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]: It consists of 180,335 fashion images (Train: 144,335, Validation: 18,000 and
Test: 18,000) which belong to 6 different categories and are annotated w.r.t. the ontology,
which consists of 8 design-specific attribute types and 54 attribute values.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Baselines</title>
        <p>
          Approaches such as CSN [26], ASEN [27], DARN [28] and CAMNet [29] are designed for fashion
image retrieval by learning fine-grained attribute-specific embeddings for fashion images with
metric learning. In contrast, our approach takes attribute relatedness into consideration for
learning image representations via ontology-guided training in a contrastive setting. WTBI [30],
FashionNet [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], BCRNNs[31], TS-FashionNet[32], STL w/ HLS, MTL w/ (RNN + VA)[33] and
TwoStreamMN[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] treat attribute extraction as a multi-class classification task and leverage
auxiliary task(s) such as pose estimation, landmark prediction, category identification
and/or object type detection, by either jointly learning the model or following a staged pipeline,
leading to improved attribute extraction performance. As opposed to these
approaches, instead of multi-class classification, we treat the attribute extraction task as a
matching problem. HABP [34] addresses the problem of class imbalance for fashion attribute
extraction by adaptively focusing training on hard data (attributes with very few tagged
samples), followed by a method to synthesize complementary samples for such hard attributes.
In our approach, we handle class imbalance by using focal loss, data augmentation and
ontology-guided hard negative sampling.
        </p>
        <p>
          Contrastive Language-Image Pre-Training, Pretrained (CLIP-P) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is our baseline where we
use the pre-trained version of the CLIP model without any task-specific fine-tuning. Supervised
Contrastive Language-Image Pre-Training Finetuned (SCLIP-F) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is the variant where we
perform task-specific fine-tuning of CLIP for domain adaptation. Multilabel Classification (MLC)
uses the same base model that serves as the image encoder in the CLIP setting
and fine-tunes it for multi-label attribute classification.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Training Details</title>
        <p>We use pre-trained ViT-B/16 as our CLIP image encoder, implemented in PyTorch. For all of
our experiments, the models are trained on an Nvidia A100, using a batch size of 96 and a
learning rate of 3e-6. For the MLC baseline, we append linear layers of size 512, 1024
and the number of attribute classes to the end of the pre-trained image encoder, and
fine-tune using an asymmetric focal loss [35]. For DeepFashion we use a sigmoid activation layer
per attribute, while for FashionAI we use a grouped (as per attribute type) softmax activation
distributed over attribute values. We also use validation-set assistance in training: for each
epoch, the negative pair sampling frequency of each attribute pair is set in proportion to the
corresponding non-diagonal value of the validation-set confusion matrix, which helps in better
distinguishing confusing attribute pairs.</p>
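        <p>A sketch of this validation-set-assisted sampling is given below: the sampling weight of each confusable attribute-value pair is made proportional to the corresponding off-diagonal entry of the validation confusion matrix; the array layout and function name are illustrative assumptions.</p>
        <preformat>
import numpy as np

def negative_pair_weights(confusion):
    """Turn a validation confusion matrix (rows: true value, columns: predicted value)
    into sampling frequencies for negative pairs; frequently confused pairs
    are sampled more often in the next epoch."""
    weights = confusion.astype(float).copy()
    np.fill_diagonal(weights, 0.0)      # only confusions between distinct values matter
    total = weights.sum()
    return weights / total if total else weights

# Example: values 0 and 2 are confused most often, so that pair dominates sampling.
conf_matrix = np.array([[50, 2, 10],
                        [1, 60, 3],
                        [12, 2, 40]])
pair_weights = negative_pair_weights(conf_matrix)
        </preformat>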
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Evaluation</title>
        <sec id="sec-5-4-1">
          <title>4.4.1. Top-k Recall</title>
          <p>
            For a given attribute type, Top-k recall is the fraction of test images for which the true attribute
value is present in the top-k predicted attribute values. For a dataset, Top-k recall is the
mean of Top-k recall over all attribute types. To compute this metric we use the official evaluation code
(attr_predict_eval.py at https://github.com/open-mmlab/mmfashion/) provided by the authors of [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ].
          </p>
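          <p>A small sketch of this metric, following the definition above (computed per attribute type and then averaged), is shown below; it is a simplified illustration, not the official mmfashion implementation.</p>
          <preformat>
import numpy as np

def topk_recall(scores, true_idx, k=5):
    """scores: (num_images, num_values) predicted scores for one attribute type;
    true_idx: (num_images,) index of the ground-truth attribute value."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == true_idx[:, None]).any(axis=1)
    return hits.mean()

def dataset_topk_recall(per_type_scores, per_type_truth, k=5):
    """Mean of Top-k recall over all attribute types."""
    return np.mean([topk_recall(s, t, k)
                    for s, t in zip(per_type_scores, per_type_truth)])
          </preformat>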
        </sec>
        <sec id="sec-5-4-2">
          <title>4.4.2. Mean Average Precision (mAP)</title>
          <p>For a given attribute type, it refers to the fraction of test images for which the predicted attribute
value matches the ground truth. For a dataset, mAP is the mean of this value over all
attribute types.</p>
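          <p>Following this description, the metric can be computed as the top-1 accuracy per attribute type averaged over attribute types, as in the sketch below; this mirrors the definition above rather than a ranking-based average precision.</p>
          <preformat>
import numpy as np

def map_per_attribute_type(per_type_pred, per_type_truth):
    """Fraction of test images whose predicted attribute value matches the ground truth,
    computed per attribute type and then averaged over attribute types."""
    return np.mean([np.mean(np.asarray(p) == np.asarray(t))
                    for p, t in zip(per_type_pred, per_type_truth)])
          </preformat>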
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results And Discussion</title>
      <p>Pre-training vs. fine-tuning for fine-grained fashion attribute extraction: As shown
in Tables 1, 2 and 3, for both datasets SCLIP-F outperforms CLIP by a significant margin,
suggesting the need to fine-tune pre-trained CLIP for attribute extraction in the fashion domain.</p>
      <p>Multilabel classification vs. matching: As depicted in Tables 1 and 2, OGSCL-FAE outperforms
MLC on DeepFashion by 20.74% and 20% in terms of Top-3 and Top-5 recall, respectively.
Similarly, in Table 3, it also outperforms MLC on FashionAI by 8.1% in terms of mAP. This
suggests that the fashion ontology is a key component for achieving better performance on
fine-grained attribute classification: during training, contrasting an attribute value with all its
siblings (OGSCL-FAE) is more effective than maximizing the likelihood of an attribute value
in isolation (MLC).</p>
      <p>OGSCL-FAE vs. baselines: In terms of overall performance, OGSCL-FAE outperforms the
best baseline SCLIP-F by 5.04% (Top-3) and 5.8% (Top-5) on DeepFashion, and the baseline CAMNET
by 0.93% mAP on FashionAI. We use the best-performing variant of CAMNET as a baseline,
where HRNet [36] with two-step attention layers is used as the backbone, as opposed to
ViT-B/16 [37] in the proposed approach. Even so, OGSCL-FAE outperforms the best baseline, i.e.
CAMNET, on FashionAI in terms of overall mAP (Table 3) for 5 out of 8 attribute types. Since
the code for CAMNET is not publicly available, it is not possible to test CAMNET with ViT-B/16
as a backbone. For DeepFashion, except for Style and Texture, OGSCL-FAE outperforms all
baselines for all attribute types.</p>
      <p>Discussion of ablations: Data augmentation and ontology-guided supervised
contrastive learning are key components of OGSCL-FAE, as removing both causes a drop in
performance of 6.07% (Top-3) and 6.72% (Top-5), as shown in Table 4 (OGSCL-FAE w/o DA &amp; OGSCL).
CLIP fine-tuning with a self-supervised contrastive loss and data augmentation performs very
poorly as compared to the supervised contrastive loss with data augmentation (OGSCL-FAE w/o
OGSCL). Ontology-guided negative sampling improves the performance of OGSCL-FAE over
random sampling by 2.95% (Top-3) and 3.16% (Top-5) (OGSCL-FAE w/o OG). Data augmentation
also affects the overall performance of OGSCL-FAE, by 5.64% (Top-3) and 5.24% (Top-5) (OGSCL-FAE
w/o DA).</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>This paper proposes a novel approach to fine-granular fashion attribute extraction that exploits
an ontology-guided negative sampling strategy for supervised contrastive learning with pre-trained
CLIP. The proposed method outperforms existing state-of-the-art results on the DeepFashion and
FashionAI datasets by 11.65% top-5 recall rate and 0.93% mAP, respectively. Future work
will include using the attribute extraction module for attribute-guided product copy generation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Automatic controllable product copywriting for e-commerce</article-title>
          ,
          <source>Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          , X. Han,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          , L. Nie,
          <article-title>Generative attribute manipulation scheme for flexible fashion search</article-title>
          ,
          <source>Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Divitiis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Becattini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baecchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bimbo</surname>
          </string-name>
          ,
          <article-title>Disentangling features for fashion recommendation</article-title>
          ,
          <source>ACM Transactions on Multimedia Computing, Communications and Applications</source>
          <volume>19</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. J. Kuo</surname>
          </string-name>
          , Pager:
          <article-title>Progressive attribute-guided extendable robust image generation</article-title>
          ,
          <source>ArXiv abs/2206</source>
          .00162 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Ngo</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Interpretable multimodal retrieval for fashion products</article-title>
          ,
          <source>in: Proceedings of the 26th ACM international conference on Multimedia</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1571</fpage>
          -
          <lpage>1579</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Attribute-guided fashion image retrieval by iterative similarity learning</article-title>
          ,
          <source>2022 IEEE International Conference on Multimedia and Expo (ICME)</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sudheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pugliese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rabiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>Attentive hierarchical label sharing for enhanced garment and attribute classification of fashion imagery</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhen</surname>
          </string-name>
          ,
          <article-title>Two-stream multi-task network for fashion recognition</article-title>
          ,
          <source>2019 IEEE International Conference on Image Processing (ICIP)</source>
          (
          <year>2019</year>
          )
          <fpage>3038</fpage>
          -
          <lpage>3042</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shajini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          <article-title>, Multi-staged feature-attentive network for fashion clothing classification and attribute prediction, ELCVIA Electronic Letters on Computer Vision and Image Analysis (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sirts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishel</surname>
          </string-name>
          , G. Anbarjafari,
          <article-title>Multimodal sequential fashion attribute prediction</article-title>
          ,
          <source>Inf</source>
          .
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>308</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>Semi-supervised learning with a teacher-student network for generalized attribute prediction</article-title>
          , ArXiv abs/
          <year>2007</year>
          .06769 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Parekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shaik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chelliah</surname>
          </string-name>
          ,
          <article-title>Fine-grained visual attribute extraction from fashion wear</article-title>
          ,
          <source>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          (
          <year>2021</year>
          )
          <fpage>3968</fpage>
          -
          <lpage>3972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kolisnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Zulkernine</surname>
          </string-name>
          ,
          <article-title>Condition-cnn: A hierarchical multi-label fashion image classification model</article-title>
          ,
          <source>Expert Syst. Appl</source>
          .
          <volume>182</volume>
          (
          <year>2021</year>
          )
          <fpage>115195</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. M. Yoo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Seol</surname>
            ,
            <given-names>S. goo Lee</given-names>
          </string-name>
          ,
          <article-title>Leveraging class hierarchy in fashion classification</article-title>
          ,
          <source>2019 IEEE/CVF International Conference on Computer Vision</source>
          Workshop (ICCVW) (
          <year>2019</year>
          )
          <fpage>3197</fpage>
          -
          <lpage>3200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Fine-grained fashion similarity prediction by attribute-specific embedding learning</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>8410</fpage>
          -
          <lpage>8425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Goei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hendriksen</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. de Rijke</surname>
          </string-name>
          , et al.,
          <article-title>Tackling attribute fine-grainedness in crossmodal fashion search with multi-level features</article-title>
          ,
          <source>in: SIGIR 2021 Workshop on eCommerce. ACM</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomason</surname>
          </string-name>
          ,
          <article-title>Curriculum learning for data-eficient vision-language alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2207.14525</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Deepfashion:
          <article-title>Powering robust clothes recognition and retrieval with rich annotations</article-title>
          ,
          <source>in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. K.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Fashionai: A hierarchical dataset for fashion understanding</article-title>
          ,
          <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          (
          <year>2019</year>
          )
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sirotenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , Fashionpedia: Ontology, segmentation, and
          <article-title>an attribute localization dataset</article-title>
          ,
          <source>in: European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: M.
          <string-name>
            <surname>Meila</surname>
          </string-name>
          , T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          . URL: https://proceedings.mlr.press/v139/ radford21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Teterwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maschinot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <article-title>Supervised contrastive learning</article-title>
          , in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>18661</fpage>
          -
          <lpage>18673</lpage>
          . URL: https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] V. Vito, L. Y. Stefanus, An asymmetric contrastive loss for handling imbalanced datasets, Entropy 24 (2022).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, International Conference on Learning Representations (2021).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] S. G. Müller, F. Hutter, Trivialaugment: Tuning-free yet state-of-the-art data augmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 774-782.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Veit, S. Belongie, T. Karaletsos, Conditional similarity networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] J. Dong, Z. Ma, X. Mao, X. Yang, Y. He, R. Hong, S. Ji, Fine-grained fashion similarity prediction by attribute-specific embedding learning, IEEE Transactions on Image Processing 30 (2021) 8410-8425. doi:10.1109/TIP.2021.3115658.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J. Huang, R. Feris, Q. Chen, S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1062-1070. doi:10.1109/ICCV.2015.127.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] C. H. Song, H. Joo Han, Convolutional attribute mask with two-step attention for fashion image retrieval, in: 2022 26th International Conference on Pattern Recognition (ICPR), 2022, pp. 2093-2099. doi:10.1109/ICPR56361.2022.9955640.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] H. Chen, A. Gallagher, B. Girod, Describing clothing by semantic attributes, in: Proceedings of the 12th European Conference on Computer Vision - Volume Part III, ECCV'12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 609-623. URL: https://doi.org/10.1007/978-3-642-33712-3_44. doi:10.1007/978-3-642-33712-3_44.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] W. Wang, W. Wang, Y. Xu, J. Shen, S.-C. Zhu, Attentive fashion grammar network for fashion landmark detection and clothing category classification, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4271-4280. doi:10.1109/CVPR.2018.00449.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] Y. Zhang, P. Zhang, C. Yuan, Z. Wang, Texture and shape biased two-stream networks for clothing classification and attribute recognition, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13535-13544. doi:10.1109/CVPR42600.2020.01355.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] S.-I. Papadopoulos, C. Koutlis, M. Sudheer, M. Pugliese, D. Rabiller, S. Papadopoulos, I. Kompatsiaris, Attentive hierarchical label sharing for enhanced garment and attribute classification of fashion imagery, in: N. Dokoohaki, S. Jaradat, H. J. Corona Pampín, R. Shirvany (Eds.), Recommender Systems in Fashion and Retail, Springer International Publishing, Cham, 2022, pp. 95-115.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] Y. Ye, Y. Li, B. Wu, W. Zhang, L. Duan, T. Mei, Hard-aware fashion attribute classification, arXiv preprint arXiv:1907.10839 (2019).</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, L. Zelnik-Manor, Asymmetric loss for multi-label classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 82-91.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>