Graph-Based Exploration of Packet Capture Data

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

Gaining insights into workload behavior across hybrid cloud environments is difficult when visibility into internal systems is restricted. This study investigates whether communication patterns and deployment structure can be explored from packet capture (PCAP) data alone using a fully unsupervised, graph-based approach. Raw PCAP data from the CICIDS2017 dataset was aggregated into confirmed bidirectional flows and transformed into workload communication graphs, where nodes represent workloads and edges capture verified traffic exchanges. Node-level features were extracted by aggregating flow-level behavior and computing structural characteristics, including session volatility, TTL variability, external communication ratio, and structural role score. Additional topological context was derived through Louvain community detection and component-type labeling; both were included as node-level features. Structural roles were mined using recursive feature expansion (ReFeX), followed by non-negative matrix factorization (NMF). The resulting node-role matrix W was used for behavioral scoring and clustering, while the role-feature matrix H, which encodes interpretable structural patterns, was examined separately to support the interpretation of topological traits.

The resulting clusters exhibited distinct communication patterns, including stable services, orchestration nodes, data exporters, and volatile edge workloads. For example, short-lived, externally focused traffic was indicative of serverless or aggregator functions, while persistent, internally scoped patterns corresponded to virtual machines or long-running services. These clusters reflected structural formations such as high-degree hubs, dense clusters, and chain-like components.

Several limitations and strengths were identified. While ground-truth labels were unavailable, cluster characteristics aligned with recognizable workload types. Internal validation using the DBCV score yielded a value of 0.5695. Key limitations include the static scope of the dataset, exclusion of application-layer semantics, and the qualitative nature of interpretation. Nonetheless, the proposed pipeline offers an interpretable method to analyze workload behavior from PCAP data, providing insight in
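
As an illustration of the first pipeline stage described in the abstract, the sketch below keeps only traffic that is confirmed in both directions, builds an undirected workload graph from it, and derives one node-level feature. It is a minimal sketch under stated assumptions, not the thesis's implementation: the packet records, the `10.0.0.0/8` internal range, the function names `build_graph` and `external_ratio`, and the exact definition of the external communication ratio (share of a node's edges whose peer lies outside the internal range) are all hypothetical, and flows are simplified to IP pairs rather than full 5-tuples.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Hypothetical packet records: (source IP, destination IP, payload bytes).
PACKETS = [
    ("10.0.0.1", "10.0.0.2", 120),
    ("10.0.0.2", "10.0.0.1", 800),
    ("10.0.0.1", "8.8.8.8", 60),
    ("8.8.8.8", "10.0.0.1", 90),
    ("10.0.0.3", "10.0.0.2", 40),  # never answered, so not a confirmed exchange
]

INTERNAL_NET = ip_network("10.0.0.0/8")  # assumed internal address range


def build_graph(packets):
    """Return {frozenset({a, b}): total_bytes} for pairs seen in both directions."""
    directed = defaultdict(int)
    for src, dst, size in packets:
        if src != dst:  # ignore self-loops in this sketch
            directed[(src, dst)] += size
    edges = {}
    for (src, dst), size in directed.items():
        if (dst, src) in directed:  # reverse direction observed: confirmed
            key = frozenset((src, dst))
            edges[key] = edges.get(key, 0) + size
    return edges


def external_ratio(edges):
    """Share of each node's edges whose peer lies outside INTERNAL_NET."""
    total = defaultdict(int)
    external = defaultdict(int)
    for pair in edges:
        a, b = tuple(pair)
        for node, peer in ((a, b), (b, a)):
            total[node] += 1
            if ip_address(peer) not in INTERNAL_NET:
                external[node] += 1
    return {node: external[node] / total[node] for node in total}


if __name__ == "__main__":
    graph = build_graph(PACKETS)
    for node, ratio in sorted(external_ratio(graph).items()):
        print(node, ratio)
```

In this toy input, the one-way packet from `10.0.0.3` produces no edge, which mirrors the idea that edges capture only verified traffic exchanges. Features of this kind would then feed the later stages (community detection, ReFeX, NMF), which are not sketched here.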

Keywords

PCAP data analysis; workload behavior; deployment artifacts; graph networks; systems thinking

Citation