Authors: Haleh Dizaji, Reza Farahani, Dragi Kimovski, Joze Rozanec, Ahmet Soylu, Radu Prodan
Venue: 31st IEEE International Conference on High Performance Computing, Data, and Analytics; Bengaluru, India, 18-21 December
Abstract: The increasing size of graph structures in real-world applications, such as distributed computing networks, social media, or bioinformatics, requires appropriate sampling algorithms that simplify them while preserving key properties. Unfortunately, predicting the outcome of graph sampling algorithms is challenging due to their irregular complexity and randomized properties. Therefore, it is essential to identify appropriate graph features and apply suitable models capable of estimating their sampling outcomes. In this paper, we compare three machine learning (ML) models for predicting the divergence of five metrics produced by twelve node, edge, and traversal-based graph sampling algorithms: degree distribution (D3), clustering coefficient distribution (C2D2), hop-plots distribution (HPD2) (including the largest connected component (HPD2C)), and execution time. We use these prediction models to recommend suitable sampling algorithms for each metric and conduct mutual information analysis to extract relevant graph features. Experiments on six large real-world graphs from three categories (scale-free, power-law, binomial) demonstrate an accuracy under 20% in C2D2 and HPD2 prediction for most algorithms despite the relatively high similarity displacement. Sampling algorithm recommendations on ten real-world graphs show higher hits@3 for D3 and
C2D2 and comparable results for HPD2 and HPD2C compared to the K-best baseline method accessing true empirical data. Finally, ML models show superior runtime recommendations compared to baseline methods, with
hits@3 over 86% for synthetic and real graphs and hits@1 over 60% for small graphs. These findings are promising for algorithm recommendation systems, particularly when balancing quality and runtime preferences.