Abstract: Knowledge graphs rich in relational information are widely used as the underlying data support for retrieval platforms. Multi-modal knowledge graphs gain a further advantage by attaching images and text descriptions to node information. In cross-modal retrieval platforms, multi-modal knowledge graphs can improve retrieval accuracy and efficiency because they provide abundant relational information, and the representation learning method is crucial to applying them. As a foundation for efficient, high-precision multimodal data retrieval, this paper proposes a distributed collaborative vector retrieval platform (DCRL-KG) built on the multimodal knowledge graph VisualSem. First, to improve retrieval efficiency, the platform uses distributed technology to classify and store the data in the knowledge graph. Second, it uses BabelNet to expand the knowledge graph through multiple filtering processes, increasing the diversity of information. Finally, it develops a variety of retrieval models and fuses their results with linear combination methods to achieve high-precision language retrieval and image retrieval. According to the paper, the platform optimizes the multimodal knowledge graph's storage structure and performs well in a multimodal environment.
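The linear-combination fusion named in the abstract can be sketched in a few lines. The Python sketch below is illustrative only: the min-max normalization, the two model names, and the fixed weights are assumptions for the example, not details taken from the paper.

# Hypothetical sketch of linear-combination fusion of retrieval results.
# Scores from each retrieval model are min-max normalized so they are
# comparable, then combined with fixed per-model weights.
from typing import Dict, List

def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {node: (s - lo) / span for node, s in scores.items()}

def fuse(model_scores: Dict[str, Dict[str, float]],
         weights: Dict[str, float],
         top_k: int = 5) -> List[str]:
    """Linearly combine per-model relevance scores; return the top-k nodes."""
    combined: Dict[str, float] = {}
    for model, scores in model_scores.items():
        for node, s in normalize(scores).items():
            combined[node] = combined.get(node, 0.0) + weights[model] * s
    return sorted(combined, key=combined.get, reverse=True)[:top_k]

# Example: fuse a text-similarity model with an image-similarity model.
results = fuse(
    model_scores={
        "text_model": {"node_a": 0.91, "node_b": 0.47, "node_c": 0.33},
        "image_model": {"node_a": 0.40, "node_b": 0.88, "node_c": 0.10},
    },
    weights={"text_model": 0.6, "image_model": 0.4},
)
print(results)  # ['node_a', 'node_b', 'node_c']

Varying the weights shifts the ranking between text-dominated and image-dominated retrieval; the abstract does not specify the paper's actual weighting scheme.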
Cite: Shaik, N., Harichandana, B., & Chitralingappa, P. (2023). Collaborative Representation Learning-Based Distributed Multi-Modal Knowledge Graph Retrieval Platform (1st ed., pp. 141-151). Noble Science Press. https://doi.org/10.52458/9789388996587.2023.eb.ch29
References:
R. Xie, Z. Liu, H. Luan and M. Sun, “Image-embodied knowledge representation learning,” arXiv preprint arXiv:1609.07028, 2017.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, Long Beach, CA, USA, 2017.
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Information Processing Systems, vol. 26, Lake Tahoe, Nevada, USA, 2013.
H. Alberts, T. Huang, Y. Deshpande, Y. Liu, K. Cho et al., “VisualSem: A high-quality knowledge graph for vision and language,” arXiv preprint arXiv:2008.09150, 2020.
X. Zhu, Z. Li, X. Wang, X. Jiang, P. Sun et al., “Multi-modal knowledge graph construction and application: Survey,” arXiv preprint arXiv:2202.05786, 2022.
N. Huang, Y. R. Deshpande, Y. Liu, H. Alberts, K. Cho et al., “Endowing language models with multimodal knowledge graph representations,” arXiv preprint arXiv:2206.13163, 2022.
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li et al., “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conf. on Computer Vision and Pattern Recognition, Miami, Florida, USA, pp. 248–255, 2009.
H. Mousselly-Sergieh, T. Botschen, I. Gurevych and S. Roth, “A multimodal translation-based approach for knowledge graph representation learning,” in Proc. of the Seventh Joint Conf. on Lexical and Computational Semantics, New Orleans, Louisiana, USA, pp. 225–234, 2018.
Y. Liu, H. Li, A. Garcia-Duran, M. Niepert, D. Onoro-Rubio et al., “MMKG: Multi-modal knowledge graphs,” in European Semantic Web Conf., Portorož, Slovenia, pp. 459–474, 2019.
R. Navigli and S. P. Ponzetto, “BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network,” Artificial Intelligence, vol. 193, pp. 217–250, 2012.
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. on Machine Learning, Virtual Event, pp. 8748–8763, 2021.
J. Devlin, M. W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid, “VideoBERT: A joint model for video and language representation learning,” in Proc. of the IEEE/CVF Int. Conf. on Computer Vision, Seoul, South Korea, pp. 7464–7473, 2019.
L. H. Li, M. Yatskar, D. Yin, C. J. Hsieh and K. W. Chang, “VisualBERT: A simple and performant baseline for vision and language,” arXiv preprint arXiv:1908.03557, 2019.
W. Su, X. Zhu, Y. Cao, B. Li, L. Lu et al., “VL-BERT: Pre-training of generic visual-linguistic representations,” arXiv preprint arXiv:1908.08530, 2019.
J. Lu, D. Batra, D. Parikh and S. Lee, “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Advances in Neural Information Processing Systems, vol. 32, Vancouver, BC, Canada, 2019.
G. Li, N. Duan, Y. Fang, M. Gong and D. Jiang, “Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 11336–11344, 2020.
Y. C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed et al., “UNITER: Universal image-text representation learning,” in European Conf. on Computer Vision, Glasgow, UK, pp. 104–120, 2020.
H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
J. Li, D. Li, C. Xiong and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” arXiv preprint arXiv:2201.12086, 2022.
C. Alberti, J. Ling, M. Collins and D. Reitter, “Fusion of detected objects in text for visual question answering,” arXiv preprint arXiv:1908.05054, 2019.