<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Compressing Multi-Modal Temporal Knowledge Graphs of Videos</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Shusaku</forename><surname>Egami</surname></persName>
							<email>egami@aist.go.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Advanced Industrial Science and Technology (AIST)</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Takanori</forename><surname>Ugai</surname></persName>
							<email>ugai@fujitsu.com</email>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Advanced Industrial Science and Technology (AIST)</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Fujitsu Limited</orgName>
								<address>
									<settlement>Kanagawa</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ken</forename><surname>Fukuda</surname></persName>
							<email>ken.fukuda@aist.go.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Advanced Industrial Science and Technology (AIST)</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Compressing Multi-Modal Temporal Knowledge Graphs of Videos</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A83B4DEB0D15EADE12ADFE83703B876C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multi-Modal Knowledge Graph</term>
					<term>RDF Compression</term>
					<term>Video Dataset</term>
					<term>Temporal Knowledge Graph</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of multi-modal temporal knowledge graphs (MMTKGs), which ground non-symbolic, time-series data such as videos into entities in the graph, is still at an early stage. Consequently, there has been little discussion of how to compress and publish MMTKGs, whose data sizes can be very large. In this paper, we propose compression methods for MMTKGs of videos based on image splitting and inference rules, and we conduct experiments to evaluate their performance. Our methods reduced the size of the MMTKGs by 27.7-36.1%. This study contributes to reducing the cost of distributing large MMTKGs on the web.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Multi-modal knowledge graphs (MMKGs) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, which ground non-symbolic data into symbolic entities, have attracted attention as datasets for semantic and conceptual processing across modalities. However, the construction and publication of multi-modal temporal knowledge graphs (MMTKGs), which ground multi-modal, time-series data such as videos into entities in the graph, is still at an early stage.</p><p>Typical MMKGs describe multi-modal content by URLs or file paths. This approach may be unsuitable for the permanent publication of MMKGs, as the multi-modal content can become inaccessible through broken links. This issue can be mitigated by encoding a file's binary data as an entity in the KG <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. However, an MMTKG that describes the content of a video at fine-grained time intervals, such as seconds or video frames, results in a huge data size, making it expensive to publish and share.</p><p>We propose methods for compressing MMTKGs of videos and conduct experiments to evaluate their effectiveness. We focus on two types of MMTKGs: KGs with video frame images encoded in Base64 and KGs with entire video files encoded in Base64. The proposed methods comprise differential compression based on a knowledge representation that splits video frame images, and the removal of redundant triples based on inference rules. The results demonstrate that our compression methods reduce the size of the MMTKGs by 27.7-36.1%. This study contributes to reducing the cost of distributing large MMTKGs on the web.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Zhu et al. <ref type="bibr" target="#b0">[1]</ref> and Chen et al. <ref type="bibr" target="#b1">[2]</ref> comprehensively surveyed and summarized work on MMKGs. Typical MMKGs include MMpedia <ref type="bibr" target="#b4">[5]</ref> and IMGpedia <ref type="bibr" target="#b5">[6]</ref>, which ground images to entities in the graph. VisionKG <ref type="bibr" target="#b6">[7]</ref> is an MMKG containing bounding boxes (bboxes) of objects extracted from various image datasets such as MS-COCO <ref type="bibr" target="#b7">[8]</ref>, CIFAR <ref type="bibr" target="#b8">[9]</ref>, and PASCAL VOC <ref type="bibr" target="#b9">[10]</ref>. These MMKGs represent images by URIs or file paths. Studies on video KGs have evolved in the context of video indexing and retrieval <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. VEKG <ref type="bibr" target="#b13">[14]</ref> is an MMKG built from events extracted from videos, bboxes, and image features; however, its data is not publicly available. There have been many studies on compression methods for KGs <ref type="bibr" target="#b14">[15]</ref>, but they do not cover MMKGs of videos.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Approach</head><p>MMKGs usually describe images and videos by URIs or file paths, which can lead to broken links to the multi-modal files. We therefore focus on permanently accessible MMTKGs that embed multi-modal files in the KG as entities, and we propose compression methods for such MMTKGs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data preparation</head><p>As an example, we constructed MMTKGs of indoor daily activities from the multi-modal data (videos, text, and JSON) output by VirtualHome-AIST<ref type="foot" target="#foot_0">1</ref>  <ref type="bibr" target="#b15">[16]</ref>, as shown in the upper left of Figure <ref type="figure" target="#fig_0">1</ref>. The multi-modal data were output once every five frames. The dataset contains over 3,500 videos, including both fixed camera views and moving third-person camera views. The average video length is 64.2 seconds, with a maximum of 268.9 seconds and a minimum of 12.5 seconds. We prepared two types of MMTKGs: a KG in which every fifth video frame image is encoded in Base64 and described as a literal value (the image-embedded MMTKG), and a KG in which the videos are encoded in Base64 and described as literal values (the video-embedded MMTKG). We reused the Multimedia Semantic Sensor Network (MSSN) ontology <ref type="bibr" target="#b16">[17]</ref> and the VirtualHome2KG <ref type="bibr" target="#b17">[18]</ref> ontology for the schema design.</p></div>
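The Base64 embedding described above can be sketched in a few lines. This is an illustrative round trip only; the function names are hypothetical, and the actual property and datatype of the literal follow the MSSN-based schema, which is not shown here.

```python
import base64

def frame_to_base64_literal(image_bytes: bytes) -> str:
    """Encode a frame image's raw bytes as the Base64 string that is
    stored as an RDF literal value in the MMTKG (illustrative sketch)."""
    return base64.b64encode(image_bytes).decode("ascii")

def base64_literal_to_frame(literal: str) -> bytes:
    """Recover the original bytes from the literal, e.g. to write the
    image back out to a file for display."""
    return base64.b64decode(literal)
```

Because the literal is plain ASCII, it can be stored in any RDF serialization, at the cost of roughly a 4/3 size overhead over the raw bytes.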
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">MMTKG compression</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Compressing image-embedded MMTKG</head><p>If the MMTKG contains video frame image data, each frame image is first compressed as a JPEG and then split into a grid. Each grid image is encoded in Base64 and described in the knowledge representation shown in the upper right of Figure <ref type="figure" target="#fig_0">1</ref>. If a grid image does not differ from the grid image at the same position in the previous frame, no new entity or literal value is created for the current grid image; instead, the entity and literal of the previous frame are reused.</p></div>
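A minimal sketch of this differential reuse step, assuming byte payloads stand in for cropped JPEG grid cells and the `grid_t_pos` entity IDs are hypothetical, not the paper's actual URI scheme:

```python
def split_into_grid(frame: bytes, n: int) -> list:
    """Split a frame's byte payload into n*n equal chunks, standing in
    for cropping an image into an n-by-n grid of cell images.
    (Any remainder bytes are dropped; this is a sketch, not image code.)"""
    cell = max(1, len(frame) // (n * n))
    return [frame[i * cell:(i + 1) * cell] for i in range(n * n)]

def compress_frames(frames: list, n: int = 4):
    """For each frame, create a grid-cell entity only when the cell differs
    from the cell at the same position in the previous frame; otherwise
    reuse the previous frame's entity, as in Section 3.2.1."""
    entities = {}     # entity id -> payload that would become a Base64 literal
    frame_cells = []  # per frame: one entity id per grid position
    prev = None
    for t, frame in enumerate(frames):
        cells = split_into_grid(frame, n)
        ids = []
        for pos, cell in enumerate(cells):
            if prev is not None and prev[pos] == cell:
                ids.append(frame_cells[-1][pos])  # unchanged: reuse entity
            else:
                eid = f"grid_{t}_{pos}"           # hypothetical entity id
                entities[eid] = cell
                ids.append(eid)
        frame_cells.append(ids)
        prev = cells
    return entities, frame_cells
```

For fixed-camera videos, most grid cells are unchanged between consecutive frames, so far fewer cell entities (and Base64 literals) are created than frames times grid positions.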
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Compressing video-embedded MMTKG</head><p>We adopted MPEG-4 <ref type="bibr" target="#b18">[19]</ref> to reduce the video data size. Each video frame entity holds a frame number instead of a Base64 value, and the video entity holds the Base64 value of the compressed video. Arbitrary frame images can be extracted from the video using FFmpeg <ref type="bibr">[20]</ref>. This further reduces the MMTKG size, but longer videos take more time to decompress.</p></div>
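One common way to pull a single frame by its number is FFmpeg's select filter. The snippet below only builds the command line rather than running it; the file names are placeholders, and this is a plausible invocation rather than the tool the authors distribute.

```python
def ffmpeg_extract_frame_cmd(video_path: str, frame_number: int,
                             out_path: str) -> list:
    """Build an FFmpeg command that writes exactly one frame, selected by
    its zero-based frame index n. The comma inside eq(n, K) must be
    escaped with a backslash because commas separate filters in a
    filtergraph."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select=eq(n\\,{frame_number})",
        "-frames:v", "1",     # stop after the selected frame
        out_path,
    ]
```

Passing the list to `subprocess.run` avoids shell quoting issues; decoding time grows with how deep into the video the requested frame sits.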
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">Removing redundant triples using inference rules</head><p>The MMTKGs contain redundant triples when the 2D bboxes do not change between frames. We reduced the number of entities and triples by referring to the previous entities when the current 2D bboxes have not changed since the previous frame.</p><p>Moreover, inspired by the approach of removing triples that can be inferred from rules <ref type="bibr" target="#b19">[21]</ref>, we create only the relation equivalentFrame(𝑒𝑐𝑓, 𝑒𝑝𝑓) between the current frame entity 𝑒𝑐𝑓 and the previous frame entity 𝑒𝑝𝑓 when no 2D bbox has changed from the previous frame. We defined the rule as follows: hasMediaDescriptor(𝑒𝑝𝑓, 𝑒𝑏𝑏𝑜𝑥) ∧ equivalentFrame(𝑒𝑐𝑓, 𝑒𝑝𝑓) → hasMediaDescriptor(𝑒𝑐𝑓, 𝑒𝑏𝑏𝑜𝑥). Similarly, for grid images, we removed triples that can be inferred from the rule image(𝑒𝑝𝑓, 𝑒𝑖𝑚𝑎𝑔𝑒) ∧ equivalentImage(𝑒𝑐𝑓, 𝑒𝑝𝑓) → image(𝑒𝑐𝑓, 𝑒𝑖𝑚𝑎𝑔𝑒). Note that the image property here refers to a split image.</p></div>
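The first rule can be sketched as a small forward-chaining step over triples: a consumer of the compressed graph re-materializes the removed hasMediaDescriptor triples from the equivalentFrame links. Plain string tuples stand in for RDF terms here, and only the bbox rule is shown; the image rule is analogous.

```python
def materialize(triples: set) -> set:
    """Re-infer triples removed under the rule
    hasMediaDescriptor(e_pf, e_bbox) and equivalentFrame(e_cf, e_pf)
    implies hasMediaDescriptor(e_cf, e_bbox).
    Iterates to a fixed point so chains of equivalent frames resolve."""
    result = set(triples)
    changed = True
    while changed:
        changed = False
        equiv = [(s, o) for (s, p, o) in result if p == "equivalentFrame"]
        for cf, pf in equiv:
            for (s, p, o) in list(result):
                if s == pf and p == "hasMediaDescriptor":
                    inferred = (cf, p, o)
                    if inferred not in result:
                        result.add(inferred)
                        changed = True
    return result
```

The fixed-point loop matters when several consecutive frames are pairwise equivalent, since a frame may need a descriptor that was itself inferred in an earlier pass.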
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Result</head><p>Tables <ref type="table" target="#tab_1">1 and 2</ref> show the results of the compression experiments. Our methods achieved data size reductions of 36.1% for the image-embedded MMTKG and 28.3% for the video-embedded MMTKG.</p><p>There is a trade-off between the number of grid divisions and the number of triples. In our experiments, the 4 × 4 grid performed best. We experimented with 𝑛 × 𝑛 grid divisions; experiments with 𝑛 × 𝑚 grid divisions would be necessary for a more detailed analysis. We published the MMTKGs in a permanently accessible format. <ref type="foot" target="#foot_1">2</ref> In addition, tools for decoding and extracting images and videos from the compressed MMTKGs are available.<ref type="foot" target="#foot_2">3</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>We proposed compression methods for two types of MMTKGs: image-embedded and video-embedded MMTKGs. The former can display arbitrary images on the web using HTML &lt;img&gt; tags without decoding any video. The latter can apply video compression methods, and once the video is decoded, any frame can be extracted by the frame number of its image. These MMTKGs can help create benchmark datasets for vision-language models, since arbitrary text and images can be extracted using SPARQL queries <ref type="bibr" target="#b15">[16]</ref>. The compression method for image-embedded MMTKGs may be effective for image stream data for which no video file is created, whereas the method for video-embedded MMTKGs is more effective when video files are available. Our compression methods are effective for fixed-camera-view videos but less effective for first-person-view videos.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We proposed compression methods for two types of permanently available MMTKGs in which video data are directly embedded as literal values. Our methods achieved data size reductions of 36.1% for the image-embedded MMTKG and 28.3% for the video-embedded MMTKG. The two MMTKG datasets and the tools are available on GitHub. In the future, we will consider combining our methods with other RDF compression methods <ref type="bibr" target="#b20">[22,</ref><ref type="bibr" target="#b21">23]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of multi-modal temporal knowledge graph compression</figDesc><graphic coords="2,89.29,84.19,416.69,273.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell cols="3">Image-embedded MMTKG</cell></row><row><cell>MMTKG</cell><cell># of triples</cell><cell>Size [GB]</cell></row><row><cell>raw</cell><cell>134,945,485</cell><cell>62.0</cell></row><row><cell>3×3 grid</cell><cell>64,242,296</cell><cell>41.8 (-32.5%)</cell></row><row><cell>4×4 grid</cell><cell>78,384,156</cell><cell>39.6 (-36.1%)</cell></row><row><cell>5×5 grid</cell><cell>96,401,621</cell><cell>39.9 (-35.6%)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell cols="3">Video-embedded MMTKG</cell></row><row><cell>MMTKG</cell><cell># of triples</cell><cell>Size [GB]</cell></row><row><cell>raw</cell><cell>131,786,665</cell><cell>17.3</cell></row><row><cell>w/o redundant triples</cell><cell>37,646,681</cell><cell>12.5 (-27.7%)</cell></row><row><cell>w/o redundant and inferable triples</cell><cell>36,284,402</cell><cell>12.4 (-28.3%)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/aistairc/virtualhome_aist</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/aistairc/vhakg</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://github.com/aistairc/vhakg-tools</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and JSPS KAKENHI Grant Number JP22K18008 and JP23H03688.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Multi-Modal Knowledge Graph Construction and Application: A Survey</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Yuan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.05391</idno>
		<title level="m">Knowledge graphs meet multi-modal learning: A comprehensive survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The knowledge graph as the default data model for learning on heterogeneous knowledge</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wilcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bloem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">De</forename><surname>Boer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Science</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="39" to="57" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">kgbench: A collection of knowledge graph datasets for evaluating relational and multimodal machine learning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bloem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wilcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Berkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>De Boer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Hose</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P.-A</forename><surname>Champin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Maleshkova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Corcho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Ristoski</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Alam</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="614" to="630" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">MMpedia: A Large-Scale Multi-modal Knowledge Graph</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ruan</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-47243-5_2</idno>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -ISWC 2023</title>
				<editor>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Payne</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Qi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Poveda-Villalón</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Stoilos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Hollink</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kaoudi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</editor>
		<meeting><address><addrLine>Nature Switzerland, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="18" to="37" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">IMGpedia: A Linked Dataset with Content-Based Analysis of Wikimedia Images</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ferrada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bustos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-68204-4_8</idno>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -ISWC 2017</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Amato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Fernandez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Tamma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Lecue</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Cudré-Mauroux</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Sequeda</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="84" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">VisionKG: Unleashing the Power of Visual Datasets via Knowledge Graph</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Le-Tuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nguyen-Duc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-K</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hauswirth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Le-Phuoc</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-60635-9_5</idno>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Meroño Peñuela</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Dimou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Hartig</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Acosta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Alam</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Lisena</surname></persName>
		</editor>
		<meeting><address><addrLine>Nature Switzerland, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="75" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Microsoft COCO: Common Objects in Context</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-10602-1_48</idno>
	</analytic>
	<monogr>
		<title level="m">Computer Vision -ECCV 2014</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Fleet</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Pajdla</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Schiele</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Tuytelaars</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<title level="m">Learning multiple layers of features from tiny images</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The Pascal Visual Object Classes (VOC) Challenge</title>
		<author>
			<persName><forename type="first">M</forename><surname>Everingham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Gool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Winn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-009-0275-4</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">88</biblScope>
			<biblScope unit="page" from="303" to="338" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Rdf-powered semantic video annotation tools with concept mapping to linked data for next-generation video indexing: a comprehensive review</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Sikos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="page" from="14437" to="14460" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Massive semantic video annotation in high-end customer service</title>
		<author>
			<persName><forename type="first">K</forename><surname>Fukuda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vizcarra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nishimura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">HCI in Business, Government and Organizations</title>
				<editor>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>-H. Nah</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Siau</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="46" to="58" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Ontology-based human behavior indexing with multimodal video data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Vizcarra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nishimura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fukuda</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSC50631.2021.00052</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Conference on Semantic Computing (ICSC)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="262" to="267" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">VEKG: Video event knowledge graph to represent video streams for complex event pattern matching</title>
		<author>
			<persName><forename type="first">P</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Curry</surname></persName>
		</author>
		<idno type="DOI">10.1109/GC46384.2019.00011</idno>
	</analytic>
	<monogr>
		<title level="m">2019 First International Conference on Graph Computing (GC)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="13" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Besta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hoefler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.01799</idno>
		<title level="m">Survey and taxonomy of lossless graph compression and space-efficient graph representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">VHAKG: A multi-modal knowledge graph based on synchronized multi-view videos of daily activities</title>
		<author>
			<persName><forename type="first">S</forename><surname>Egami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ugai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N N</forename><surname>Htun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fukuda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33rd ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 33rd ACM International Conference on Information and Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>To appear</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">MSSN-Onto: An ontology-based approach for flexible event processing in Multimedia Sensor Networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Angsuchotmetee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chbeir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cardinale</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.future.2018.01.044</idno>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">108</biblScope>
			<biblScope unit="page" from="1140" to="1158" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Synthesizing Event-Centric Knowledge Graphs of Daily Activities Using Virtual Space</title>
		<author>
			<persName><forename type="first">S</forename><surname>Egami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ugai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oono</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kitamura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fukuda</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2023.3253807</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="23857" to="23873" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">MPEG-4 natural video coding - an overview</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ebrahimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Horne</surname></persName>
		</author>
		<idno type="DOI">10.1016/S0923-5965(99)00054-5</idno>
		<ptr target="https://doi.org/10.1016/S0923-5965(99)00054-5" />
	</analytic>
	<monogr>
		<title level="j">Signal Processing: Image Communication</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="365" to="385" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Logical linked data compression</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: Semantics and Big Data</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Corcho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Hollink</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Rudolph</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="170" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Compact representation of large RDF data sets for publishing and exchange</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Martínez-Prieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gutierrez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2010</title>
				<editor>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Pan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Pan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Horrocks</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Glimm</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="193" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">RDSZ: An approach for lossless RDF stream compression</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Arias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fuentes-Lorenzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ó</forename><surname>Corcho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: Trends and Challenges</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>D'Amato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Gandon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>D'Aquin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tordai</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="52" to="67" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
