Scalable Distributed Machine Learning for Knowledge Graphs

01 July 2023
research

PhD Thesis - Carsten Felix Draschner, Dr. rer. nat.; supervisors: Jens Lehmann and Stefan Wrobel


TL;DR ⏱️

Scalable Distributed Machine Learning
Knowledge Graphs
Ethical and Sustainability Dimensions

In this work, we developed novel approaches for Machine Learning (ML) on Knowledge Graphs (KGs) while considering ethical and sustainability dimensions. In particular, we developed technologies that create fixed-length numeric feature vectors from KG data. These include methods that, like graph kernels, extract features from the graph and are expressed through the map-reduce operations needed for distributed computation. The feature extraction also covers the multi-modal data held in KG literals. To this end, we developed methods that enable SPARQL-based feature extraction and assist in writing complex feature-extracting queries. Building on these extracted features, we further contributed scalable, distributed, and explainable ML and data analytics methods, such as semantic similarity estimation and classification or regression ML pipelines, that demonstrate noticeable performance. We support the transparency, reusability, and reproducibility of our novel open-source approaches through the semantification of results and metadata. This semantification transfers the predicted results of the ML pipelines, together with the original graph data, the hyper-parameter setup, and the explainability information, into a native semantic KG. Because the underlying technology is complex, we make our algorithms easier to apply through complementary work, such as their use in coding notebooks and in REST API-based environments. Our work also describes the multidimensional and interwoven optimization dimensions of ethical and sustainable KG-based ML. To offer these functionalities for distributed ML on KGs, we extended SANSA, an existing technology stack for distributed processing and native semantic data handling, through several scientific publications and software framework releases.
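To make the idea of map-reduce-style feature extraction into fixed-length vectors more concrete, the following is a minimal single-machine sketch in plain Python. It is not the SANSA API and not the thesis implementation: the toy triples, the function names (map_triple, reduce_by_key, hash_features, extract_vectors), the vector length DIM, and the use of feature hashing to reach a fixed length are all assumptions made for illustration. In a distributed setting, the map and reduce steps would run on a cluster framework instead of in local Python.

```python
from collections import Counter
from functools import reduce
import zlib

# Toy triples (subject, predicate, object). All identifiers are hypothetical.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:bob", "ex:age", '"42"'),  # literal object
]

DIM = 8  # assumed fixed vector length


def map_triple(triple):
    """Map step: emit (entity, feature-count) pairs for one triple."""
    s, p, o = triple
    # Outgoing edge feature for the subject.
    yield s, Counter({("out", p, o): 1})
    # Incoming edge feature for the object, unless the object is a literal.
    if not o.startswith('"'):
        yield o, Counter({("in", p, s): 1})


def reduce_by_key(pairs):
    """Reduce step: merge the feature counts of all pairs sharing an entity."""
    def merge(acc, pair):
        key, counts = pair
        acc.setdefault(key, Counter()).update(counts)
        return acc
    return reduce(merge, pairs, {})


def hash_features(counts, dim=DIM):
    """Project a variable-size feature bag onto a fixed-length vector."""
    vec = [0] * dim
    for feature, n in counts.items():
        # crc32 is deterministic across runs, unlike the built-in hash().
        idx = zlib.crc32(repr(feature).encode("utf-8")) % dim
        vec[idx] += n
    return vec


def extract_vectors(triples, dim=DIM):
    """Run map, reduce, and hashing to get one vector per entity."""
    pairs = (pair for t in triples for pair in map_triple(t))
    grouped = reduce_by_key(pairs)
    return {entity: hash_features(c, dim) for entity, c in grouped.items()}


vectors = extract_vectors(TRIPLES)
```

Each entity ends up with a vector of length DIM regardless of how many edges it has, which is the property that downstream ML pipelines rely on. Hashing can map distinct features to the same index; a larger DIM reduces such collisions at the cost of sparser vectors.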