<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An interactive atomic-cluster watershed-based system for lifelog moment retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Van-Luon Tran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trong-Dat Phan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh-Vu Mai-Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh-Khoa Vo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Son Dao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Koji Zettsu</string-name>
          <email>zettsug@nict.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Information and Communications Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we introduce a new interactive atomic-cluster watershed-based system for lifelog moment retrieval. We investigate three essential components that help improve accuracy and support both amateur and professional users in refining their queries based on different content and context hypotheses. These components are (1) the atomic cluster function, which clusters the dataset into sets of time-consecutive images that share the same content and context constraints, (2) the text-to-sample image generation, which helps to overcome the gap between users' textual queries and the visual-feature-vector database, and (3) the interactive interface, which assists users in better imagining what they are looking for. The system is customized to meet the lifelog moment retrieval challenge of imageCLEFlifelog2020. The evaluation and comparison of our method against others confirm the stability of our method when users want to retrieve a large number of results within the top 100 results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Finding a moment in our past with a few hints or cues is an activity we probably
carry out almost every day. Except for extraordinary people with a fantastic memory
who can recall every moment of their lives within a split second, ordinary people need
more time to narrow their search from a very abstract level down to details. The same
situation happens when people want to find a historical moment in their lifelog data.
Hence, if people have an interactive system that can help them turn their queries from
an amateur sketch into an artist's painting, they will retrieve their moment faster and
more precisely [1], [2].</p>
      <p>Besides, turning the few keywords and low-semantic content of users' text
queries into something that can be understood by a search engine is another
challenge [3]. There is still a big gap between the natural language spoken by
users and the machine language designed for search engines [4], which can hinder
improvements in accuracy. Feature selection is another factor that can assist in
bridging this gap [5], [6], and in supporting a well-organized dataset.</p>
      <p>Based on the discussion above, we design an interactive atomic-cluster
watershed-based system for lifelog moment retrieval. This system is customized to
meet the lifelog moment retrieval (LMRT) challenge of imageCLEFlifelog2020 [7], a
lab task of imageCLEF2020 [8].</p>
      <p>Our system's main contributions are:
1. We introduce the atomic cluster, a set of time-consecutive images that share
the same content and context constraints.
2. We build the text-to-sample image generation to overcome the gap between
users' textual queries and the visual-feature-vector database.
3. We create an interactive interface that helps users better imagine what they
are looking for.</p>
      <p>We organize this paper as follows: Section 2 describes our method in
detail, Section 3 discusses the challenge and evaluates our results, and Section 4
concludes our paper and sketches our future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>The principal idea of our method is based on the following observations:
1. Daily activities of people can be divided into sequential atomic moments that
have a consensus of both content and context. In other words, lifelog data recorded
during a day can be divided into sequential atomic clusters whose content reflects a
unique semantic meaning with a consensus along the spatiotemporal dimension. These
atomic clusters cannot be divided into smaller clusters. Hence, if we can find one
image that matches the query (i.e., a seed), we can count in the atomic cluster the
image belongs to and the neighbors of that atomic cluster (i.e., watershed).
2. People can decide which data do not satisfy their queries. In other words,
people can remove irrelevant data and modify their queries to get more relevant
data. Hence, if we provide people a friendly interface for interactive querying,
they can improve the quality of the querying system.</p>
      <p>
        We call the system built upon these observations an interactive atomic-cluster
watershed system. The system has five vital components: (1) atomic-cluster clustering
(Cluster function), (2) text-to-sample image generation (Attention function),
(3) querying by text-to-sample images (Query function), (4) interaction (Interactive
function), and (5) querying by user's images (Query function). Algorithm 1 and
Figure 1 describe and illustrate how the system works, respectively.
There are several notations and feature vectors used throughout our system. We define them below:
- Let Q denote the query sentence.
- Let I = {I_i}, i = 1..N, denote the set of given images (e.g., the lifelog).
- Let C = {C_k} denote the set of atomic clusters.
- Let S denote the set of samples for each query, and S_i denote the status of S at time i.
- Let BoV = {V_i^k}, i = 1..N, k = 1..m, denote the set of feature vectors of objects extracted from I, where V_i^k denotes the feature vector of the k-th object of I_i and BoV_i denotes the set of all object vectors of I_i.
- Let BoVDB denote the database that stores all object vectors of all images in I.
- Let Seed and LMRT denote the set of seeds and the set of lifelog moments, respectively.
- Let Vo_i denote the 1024-D vector representation of the i-th object region in the photos.
- Let p_i denote the output vector of the i-th image.
- Let Vw_i denote the word embedding vector of the i-th word.
      </p>
      <p>Algorithm 1: Query-to-Sample Attention-based Search Engine</p>
      <p>In this subsection, we give a detailed explanation of our Interactive
Multimodal Lifelog Retrieval System, depicted in Figure 1 and described in Algorithm 1.</p>
      <p>
        There are two stages in our system's workflow: (1) the offline stage and (2) the online stage.
      </p>
      <p>The former stage is for data preprocessing. Firstly, we divide the lifelog images
into atomic clusters by utilizing the Clustering function, described in 2.4. Then, all
lifelog images are converted into Vsample by using the Feature Extraction (FE)
function, described in 2.3. In other words, Vsample contains the feature vectors extracted
from images. To make full use of FAISS [9], we embed these Vsample into a unified
database by applying FAISS's functions.</p>
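      <p>To make this indexing step concrete, the following minimal sketch (our own illustration in Python, assuming L2-normalized vectors and a flat inner-product index; the exact index type and dimensions used by our system may differ) builds the unified FAISS database over Vsample:</p>
      <preformat>
import numpy as np
import faiss

def build_index(feature_vectors):
    """Embed all Vsample vectors into a unified FAISS database (offline stage)."""
    vectors = np.asarray(feature_vectors, dtype="float32")
    faiss.normalize_L2(vectors)                # cosine similarity via inner product
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)                         # row id corresponds to the vector id
    return index

# Hypothetical usage: 1024-D object vectors produced by the FE function
vsample = np.random.rand(1000, 1024).astype("float32")  # stand-in for real FE vectors
index = build_index(vsample)
      </preformat>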
      <p>The latter stage is for textual and visual querying. For textual querying, our
system activates the Attention function, described in 2.5, to generate sample images
from texts. Then, the sample images (and the input images, if users carry out visual
querying) are fed into the FE function to create the related Vsample. The Vsample
is used to find the most similar feature vectors in the FAISS-based database
with a predefined similarity threshold. Next, we enrich Vsample by adding these
found feature vectors and re-querying the FAISS-based database until no new
feature vectors are found. All images whose feature vectors appear in this
set are considered queried results and set as seeds. The final results are all
atomic clusters containing these seeds. Then, users use the interactive tools described
in 2.6 to polish the output until they receive the desired results.</p>
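      <p>The iterative enrichment loop can be sketched as follows (our simplified illustration, assuming a flat FAISS index that supports vector reconstruction and a fixed similarity threshold; these are assumptions, not the exact parameters of our system): query with the current sample vectors, add newly found neighbors above the threshold, and repeat until no new vectors appear. The matched image ids then become the seeds.</p>
      <preformat>
import numpy as np
import faiss

def expand_seeds(index, query_vectors, threshold=0.8, top_k=50):
    """Iteratively enrich Vsample with similar vectors until no new ones are found."""
    frontier = np.asarray(query_vectors, dtype="float32")
    faiss.normalize_L2(frontier)
    seeds = set()
    while len(frontier) > 0:
        scores, ids = index.search(frontier, top_k)      # query the FAISS database
        new_ids = {int(i) for srow, irow in zip(scores, ids)
                   for s, i in zip(srow, irow)
                   if i != -1 and s >= threshold and int(i) not in seeds}
        if not new_ids:
            break
        seeds.update(new_ids)                             # enrich Vsample
        frontier = np.vstack([index.reconstruct(i) for i in new_ids])
    return seeds  # ids of matched images; their atomic clusters form the final result
      </preformat>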
      <sec id="sec-2-1">
        <title>Feature Extraction</title>
        <p>- Vo_i is extracted by using an object detection model (Faster-RCNN with a
ResNet backbone) trained on a scaled Visual Genome dataset [10] (removing semantically
overlapping classes).
- p_i is extracted by utilizing the place detection model described in 2.4.
- Vw_i is built as follows: hidden-state 768-D vectors extracted from BERT [11]
are combined with one linear Conditional Random Field layer to construct a
seq2seq model [12] that outputs keywords (from a long input query sentence)
together with their representation vectors.</p>
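        <p>As a minimal sketch of obtaining the 768-D hidden-state vectors Vw_i (our illustration using the HuggingFace transformers library; the pretrained model name is an assumption, and the CRF layer and seq2seq keyword decoder are omitted):</p>
        <preformat>
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vectors(query):
    """Return 768-D hidden-state vectors, one per token of the query sentence."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (num_tokens, 768)

vw = word_vectors("having beers in a bar")
        </preformat>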
      </sec>
      <sec id="sec-2-2">
        <title>Atomic-Cluster Clustering</title>
        <p>
          As mentioned in previous sections, an atomic cluster contains a set of
consecutive lifelog images (and related metadata) whose content reflects a particular
activity constrained by location, time, and semantic meaning. We cluster the whole
dataset into atomic clusters in two steps: (1) enhancing the quality of the metadata
and (2) clustering the multimodal data. The former applies a self-supervised
learning method to regenerate metadata: utilizing the SimCLR method [13], we
manually label place names for about 20k images and then train a new model
to label the remaining images in the dataset automatically. Finally, we strengthen
the metadata's location constraints by having more precise place names than the
original metadata. The latter step feeds the updated metadata and the feature vectors
extracted from images into the clustering method proposed in [14] to
form the atomic clusters.
        </p>
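        <p>The clustering method of [14] combines several modalities; purely as a rough, simplified sketch of the atomic-cluster idea, the snippet below starts a new cluster whenever the place label changes or the time gap exceeds a threshold. The field names and the five-minute gap are our assumptions, not parameters of the actual method.</p>
        <preformat>
from datetime import timedelta

def atomic_clusters(images, max_gap=timedelta(minutes=5)):
    """Group time-consecutive images sharing the same place label (simplified).

    `images` is a list of dicts with 'id', 'time' (datetime), and 'place' keys,
    already sorted by time; a real implementation would also use visual features.
    """
    clusters, current = [], []
    for img in images:
        if current and (img["place"] != current[-1]["place"]
                        or img["time"] - current[-1]["time"] > max_gap):
            clusters.append(current)
            current = []
        current.append(img)
    if current:
        clusters.append(current)
    return clusters
        </preformat>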
      </sec>
      <sec id="sec-2-3">
        <title>Text-to-sample Image Generation</title>
        <p>The essential idea of this function is to replace a textual query with a set of
visual queries. First, we create a dataset of objects using open image datasets
(e.g., COCO, image365). This means we have a set of object names, and each
object name links to a set of images that contain the object. Then, we parse a
textual query to extract the object names, which are replaced by the linked images.
Notably, we utilize the attention mechanism [15] to build our function, as described in
Algorithm 2. We first utilize the Top-Down Attention LSTM in a two-layer LSTM
model for captioning images from the feature vectors of regions detected by the
object detection model [16]. We then determine a useful feature transformation
from the word vector space to the visual space using a well-trained Bottom-Up
Attention model.</p>
        <p>Algorithm 2: Text-to-sample Image Generation
Input: word set {Word_i}, i = 1..M; object set {Obj_j}, j = 1..N
Output: {Word_i : Obj_j}, a map from each word to its most relevant object
1: {Vw_i}, i = 1..M ← WordEmb(Word)
2: {Vo_j}, j = 1..N ← FE(Obj)
3: Train the bottom-up attention model as in [15]
4: for all Vw_k do
5:   v̂_k ← Σ_{j=1..N} α_{k,j} v_j
6:   j0 ← argmax_j α_{k,j}
7:   v̂_k ← v_{j0}
8:   v̂_k is the optimized representation of Word_k in the visual space
9: end for
10: return {Word_i : Obj_j}, where i = 1..M, j = 1..N</p>
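        <p>As a minimal numerical sketch of steps 4-7 above (assuming the attention weights α_{k,j} are softmax-normalized dot products between projected word vectors and object vectors; the projection matrix W stands in for the trained bottom-up attention model and is our assumption):</p>
        <preformat>
import numpy as np

def map_words_to_objects(Vw, Vo, W):
    """Vw: (M, d_w) word vectors; Vo: (N, d_v) object vectors;
    W: (d_w, d_v) projection standing in for the trained attention model."""
    logits = (Vw @ W) @ Vo.T                            # (M, N) word-object scores
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)    # attention weights alpha_{k,j}
    v_hat = alpha @ Vo                                  # step 5: attended vector per word
    j0 = alpha.argmax(axis=1)                           # step 6: most attended object
    v_hat = Vo[j0]                                      # step 7: snap to that object's vector
    return dict(enumerate(j0)), v_hat                   # word index to object index, plus vectors
        </preformat>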
      </sec>
      <sec id="sec-2-4">
        <title>Interaction</title>
        <p>After the first results are generated by the query-by-sample function, users
can filter the results using other metadata such as visible objects, places, and
time. These metadata are saved as text files and stored in PostgreSQL on the
Logic Server, as described in 2.7. Besides, users can re-query by manually
selecting samples from the results visualized on the system's interface or by adding
more query categories as text. Moreover, users can delete images they consider
inappropriate; these images are taken into account by the system and marked as
outliers or unnecessary items for the next query. Algorithm 3 explains how the
interaction works.</p>
        <p>Algorithm 3: Interactive Algorithm
Input: PostgreSQL database for metadata P, I ← LMRT
Output: I
1: while Interactive do
2:   I ← Remove(I) {continue if there is no removed image}
3:   F ← input filters
4:   I ← P.select(I, F) {continue if F is none}
5:   I ← I ∪ Re-query(input images or text)
6: end while
7: return I</p>
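        <p>A minimal sketch of one interaction round could look as follows (our illustration, assuming the metadata lives in a PostgreSQL table named metadata with image_id and place columns; table and column names are hypothetical): removed images are excluded, and the remaining ids are filtered by the user's criteria.</p>
        <preformat>
import psycopg2

def filter_results(conn, result_ids, removed_ids, place=None):
    """One interactive round: drop removed images, then apply an optional place filter."""
    kept = [i for i in result_ids if i not in set(removed_ids)]
    if place is None or not kept:
        return kept
    with conn.cursor() as cur:
        cur.execute(
            "SELECT image_id FROM metadata WHERE image_id = ANY(%s) AND place = %s",
            (kept, place),
        )
        return [row[0] for row in cur.fetchall()]

# conn = psycopg2.connect(dbname="lifelog")   # hypothetical connection
        </preformat>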
      </sec>
      <sec id="sec-2-5">
        <title>Interactive System Architecture</title>
        <p>To build a flexible system, we design our system following a three-tier,
three-layer architecture, depicted in Figure 2. The first layer is the presentation layer
on the User Client, the second is the logic layer on the Logic Server, and the last one
is the core layer on the Core Server.</p>
        <p>The first layer is a convenient web-based interface through which users
interact with our system. This interface can easily be installed on a wide range
of operating systems. It first allows users to type text queries, select filters, and
input sample images, which is a powerful way for users to describe which images
they would like to retrieve. Then, these data, along with the IDs of removed images
(in case users delete queried results from the previous interaction), are
pushed to the Logic Server. Next, the interface is responsible for presenting the
images sent back from the Logic Server. Before users re-query, they can modify their text
query, adjust filters, choose images from other sources, and remove unwanted
images. They can re-query until the presented images satisfy their demands. Finally,
users use the export function to download images or image IDs.</p>
        <p>
          At the second layer, the Logic Server is responsible for processing requests
from the User Client. Firstly, this server converts the query into a suitable form and
sends it to the Core Server. Then, the Logic Server receives the results, consisting of
image IDs and the IDs of the related atomic clusters. The result is saved directly to
the Cache, a temporary memory on the Logic Server. In the following steps, depending
on the type of filter, this server applies the filters either to the whole dataset or only
to the results stored in the Cache. There are two types of filter: (1) the Extend Filter
and (2) the Narrow Filter. With the former, the Logic Server finds all images whose
metadata match the filter before adding these images' IDs to the Cache. With the
latter, from the IDs in the Cache, the Logic Server selects the images whose metadata
fit the filter. Finally, the server returns the filtered images and ranked clusters
to the User Client.
        </p>
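        <p>The sketch below contrasts the two filter types under our own simplifying assumption of in-memory metadata (rather than the actual Cache and PostgreSQL setup): an Extend Filter searches the whole dataset and adds matching ids to the cache, while a Narrow Filter keeps only cached ids whose metadata fit.</p>
        <preformat>
def apply_filter(cache_ids, all_metadata, predicate, mode):
    """all_metadata: dict image_id to metadata dict; predicate: metadata to bool."""
    if mode == "extend":
        # Extend Filter: search the whole dataset and add matching ids to the Cache
        matches = {i for i, m in all_metadata.items() if predicate(m)}
        return set(cache_ids) | matches
    if mode == "narrow":
        # Narrow Filter: keep only cached ids whose metadata fit the filter
        return {i for i in cache_ids if predicate(all_metadata[i])}
    raise ValueError("mode must be 'extend' or 'narrow'")
        </preformat>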
        <p>In terms of the Core Server, it receives input from the Logic Server and sends the
results back after processing is complete. The Core Server is an always-on server where
the AI components are deployed.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section, we present our system's experimental results when applied to the
CLEF2020 lifelog dataset.</p>
      <sec id="sec-3-1">
        <title>Dataset and Evaluation Metrics</title>
        <p>
          The CLEF2020 dataset was captured by one active lifelogger over 114 days
between 2015 and 2018. It contains not only over 191,000 lifelog images but also
metadata, including visual concepts, attributes, and semantic content, to name a
few. The training set has ten topics, and each topic is described by a title and a
description. These titles are: (1) Having beers in a bar, (2) Building Personal
Computer, (3) In A Toy Shop, (4) Television Recording, (5) Public Transport In
Home Country, (6) Seaside Moments, (7) Grocery Stores, (8) Photograph of The
Bridge, (9) Car Repair, (10) Monsters. The topic descriptions explain
in detail the content and context of each query. Similar to the training
set, the testing set has ten topics, which are: (1) Praying Rite, (2) Recall, (3)
Bus to work - Bus to home, (4) Bus at the Airport, (5) Medicine cabinet, (6)
Order Food in the Airport, (7) Seafood at Restaurant, (8) Meeting with people,
(9) Eating Pizza, (10) Socialising.
        </p>
        <p>The evaluation metrics are defined by ImageCLEFlifelog 2020 as follows:
- Cluster Recall at X (CR@X): a metric that assesses how many different
clusters from the ground truth are represented among the top X results;
- Precision at X (P@X): measures the number of relevant photos among the
top X results;
- F1-measure at X (F1@X): the harmonic mean of the previous two.</p>
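        <p>For clarity, these metrics can be computed as in the following sketch (our own illustration, not the official evaluation script); image_to_cluster maps each ground-truth image to its cluster id and is an assumed structure:</p>
        <preformat>
def precision_at_x(retrieved, relevant, x):
    """P@X: fraction of relevant photos among the top X results."""
    top = retrieved[:x]
    return sum(1 for img in top if img in relevant) / float(x)

def cluster_recall_at_x(retrieved, image_to_cluster, gt_clusters, x):
    """CR@X: fraction of ground-truth clusters represented in the top X results."""
    found = {image_to_cluster[img] for img in retrieved[:x] if img in image_to_cluster}
    return len(found.intersection(gt_clusters)) / float(len(gt_clusters))

def f1_at_x(retrieved, relevant, image_to_cluster, gt_clusters, x):
    """F1@X: harmonic mean of P@X and CR@X."""
    p = precision_at_x(retrieved, relevant, x)
    cr = cluster_recall_at_x(retrieved, image_to_cluster, gt_clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
        </preformat>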
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation and Comparison</title>
        <p>
          The ImageCLEFlifelog challenge has five participating teams: (1) RRibeiro,
(2) FatmaBA RegimLab, (3) DCU Team, (4) BIDAL HCMUS (ourselves), and (5)
HCMUS. We are ranked in the second position. Tables 1 and 2 show our results
on the training and testing sets, while Tables 3 and 4 present the comparison
to the other teams. Figures 3-12 illustrate our results for the testing stage.
        </p>
        <p>When comparing the results evaluated by the F1@10 and F1@50 metrics, we
found that our scores fluctuate less than the others' (some other teams show
a massive reduction in their scores), as described in Tables 3 and 4. This suggests
that our proposed method is stable, especially when the user wants to retrieve a
large number of images.</p>
        <p>In some queries, we have worse scores because we misunderstood the content
and context of the queries. For instance, query 5 has the title 'Medicine cabinet'
and the description 'Find the moment when u1 was looking inside the medicine
cabinet in the bathroom at home'. We were very confused when trying to confirm whether
the lifelogger really looked inside the medicine cabinet or only appeared nearby (i.e., the
medicine cabinet is captured by the lifelog camera, but u1 does not look at it).
The result of query 5 is shown in Figure 7.</p>
        <p>Furthermore, we found that the ground truth could have some incorrect
points. We have verified with the organizers that the ground truth might not
be precise. For example, the image ID b00000986 21i6bq 20150225 161718e (in
query 9) and the image ID 20160904 120624 000 (in query 5) should have been
in the ground truth. Figures 7 and 11 illustrate the results of queries 5 and 9,
where the red rectangle denotes the mentioned images. That probably makes our
results less precise than we expected.</p>
        <p>Fig. 7: The top ten results of query 5 "Medicine cabinet" (F1@10 = 0.74)</p>
        <p>Fig. 10: The top ten results of query 8 "Meeting with people" (F1@10 = 0.75)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>We introduced a new interactive atomic-cluster watershed-based system for
lifelog moment retrieval. The system is specially customized to meet the
requirements of the imageCLEFlifelog2020 challenge. The system first indexes the
database into atomic clusters that contain similar data according to our
similarity measure; the idea behind the atomic clusters is that whenever one image
is found, its whole atomic cluster counts in. We store the feature vectors extracted from
the data in a FAISS database for further querying, and we convert all textual queries into
visual queries using the attention mechanism approach. The system provides
a friendly interactive interface that allows users to select precise results and
re-query with modifications. Our results are evaluated and compared to the other
participants' with positive accuracy. In the future, we will investigate the atomic clustering
function to improve the consensus and compactness of atomic clusters.
Moreover, we will consider wrapping spatiotemporal information into the
querying engine by strengthening the semantic constraints. Last but not least, we will
focus on feature engineering and similarity measures to achieve higher querying
accuracy.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This research is conducted under the Collaborative Research Agreement
between the National Institute of Information and Communications Technology and the
University of Science, Vietnam National University at Ho-Chi-Minh City.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Ro</surname>
          </string-name>
          , \Ivist:
          <article-title>Interactive video search tool in vbs 2020,"</article-title>
          <source>in International Conference on Multimedia Modeling</source>
          . Springer,
          <year>2020</year>
          , pp.
          <volume>809</volume>
          {
          <fpage>814</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>B.</given-names>
            <surname>Jonsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Koelma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudinac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zahalka</surname>
          </string-name>
          , \
          <article-title>Exquisitor at the video browser showdown 2020,"</article-title>
          in MultiMedia Modeling,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Ro</surname>
          </string-name>
          , W.-H. Cheng, J. Kim, W.-T. Chu,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-C. Hu</surname>
          </string-name>
          , and W. De Neve, Eds. Cham: Springer International Publishing,
          <year>2020</year>
          , pp.
          <volume>796</volume>
          {
          <fpage>802</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Andreadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moumtzidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Apostolidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gkountakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Michail</surname>
          </string-name>
          , I. Gialampoukidis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Kompatsiaris</surname>
          </string-name>
          , \Verge in vbs
          <year>2020</year>
          ,
          <article-title>"</article-title>
          <source>in International Conference on Multimedia Modeling</source>
          . Springer,
          <year>2020</year>
          , pp.
          <volume>778</volume>
          {
          <fpage>783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , \
          <article-title>Spatial keyword search: a survey,"</article-title>
          <source>Geoinformatica</source>
          , vol.
          <volume>24</volume>
          , p.
          <volume>85</volume>
          {
          <issue>106</issue>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and X.</given-names>
            <surname>Wei</surname>
          </string-name>
          , \
          <article-title>Feature selection with multi-view data: A survey,"</article-title>
          <source>Information Fusion</source>
          , vol.
          <volume>50</volume>
          , pp.
          <volume>158</volume>
          {
          <issue>167</issue>
          ,
          <year>2019</year>
          . [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253518303841
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Sheikhpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sarram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gharaghani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A. Z.</given-names>
            <surname>Chahooki</surname>
          </string-name>
          , \
          <article-title>A survey on semi-supervised feature selection methods,"</article-title>
          <source>Pattern Recognition</source>
          , vol.
          <volume>64</volume>
          , pp.
          <volume>141</volume>
          {
          <issue>158</issue>
          ,
          <year>2017</year>
          . [Online]. Available: http://www.sciencedirect.com/science/ article/pii/S0031320316303545
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. V.-T. Ninh,
          <string-name>
            <surname>T.-K. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          l Halvorsen, M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          , and D.
          <string-name>
            <surname>-T.</surname>
          </string-name>
          Dang-Nguyen, \Overview of ImageCLEF Lifelog 2020:
          <article-title>Lifelog Moment Retrieval and Sport Performance Lifelog," in CLEF2020 Working Notes, ser</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          . Thessaloniki, Greece: CEURWS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          ,
          <source>September</source>
          <volume>22</volume>
          -25
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , H. Muller,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          , D. DemnerFushman,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kozlovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , V.-T. Ninh,
          <string-name>
            <surname>T.-K. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          l Halvorsen, M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
            Dang-Nguyen,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          <string-name>
            <surname>Stefan</surname>
            , and
            <given-names>M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Constantin</surname>
          </string-name>
          , \
          <article-title>Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, ser</article-title>
          .
          <source>Proceedings of the 11th International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), vol.
          <volume>12260</volume>
          .
          <string-name>
            <surname>Thessaloniki</surname>
          </string-name>
          ,
          <source>Greece: LNCS Lecture Notes in Computer Science</source>
          , Springer, September
          <volume>22</volume>
          -25
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          , \
          <article-title>Billion-scale similarity search with gpus,"</article-title>
          <source>arXiv preprint arXiv:1702.08734</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>Fei-Fei</surname>
          </string-name>
          , \
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations,"</article-title>
          <year>2016</year>
          . [Online]. Available: https://arxiv.org/abs/1602.07332
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>J. Devlin</surname>
            , M.-
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            , and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Toutanova</surname>
          </string-name>
          , \Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding,"</article-title>
          arXiv preprint arXiv:
          <year>1810</year>
          .04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. I.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Vinyals</surname>
            , and
            <given-names>Q. V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          , \
          <article-title>Sequence to sequence learning with neural networks,"</article-title>
          <source>in Advances in neural information processing systems</source>
          ,
          <year>2014</year>
          , pp.
          <volume>3104</volume>
          {
          <fpage>3112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. T. Chen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          , and G. Hinton, \
          <article-title>A simple framework for contrastive learning of visual representations,"</article-title>
          arXiv preprint arXiv:
          <year>2002</year>
          .05709,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Phan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Zettsu</surname>
          </string-name>
          , \
          <article-title>An interactive watershed-based approach for lifelog moment retrieval,"</article-title>
          <source>in 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)</source>
          ,
          <source>Sep</source>
          .
          <year>2019</year>
          , pp.
          <volume>282</volume>
          {
          <fpage>286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077-6086.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>