Beyond Packets: Identifying IoT Devices by What They Do

Author: Denis Avetisyan

A new approach analyzes network service access to create robust device fingerprints, offering a more understandable alternative to deep packet inspection.

Network service usage patterns vary considerably among IoT devices, prompting a comparative study of three representation methods (SL, SP, and G) for stable device fingerprinting. SL representations prove unreliable for devices that use dynamic ports. SP representations stabilize behavior but remain sensitive to usage variations. G representations, when appropriately parameterized, address both problems by balancing granularity and stability.

This review details a method for macroscopic behavioral fingerprinting of IoT devices via network service analysis, enabling improved traffic classification and network security.

Despite growing cybersecurity risks, identifying Internet of Things (IoT) devices within a network remains challenging due to the computational cost and opacity of current behavioral fingerprinting techniques. This paper, ‘From Flows to Functions: Macroscopic Behavioral Fingerprinting of IoT Devices via Network Services’, introduces a novel approach that characterizes devices by their stable patterns of network service usage over extended periods, effectively shifting analysis from individual traffic flows to functional behaviors. We demonstrate that these ‘service-level fingerprints’ offer a lightweight, explainable, and robust alternative for device identification, even in scenarios with previously unseen devices. Could this macroscopic perspective fundamentally reshape how we secure and manage the rapidly expanding IoT ecosystem?

The Expanding Vulnerability Surface of IoT Networks

The rapid expansion of Internet of Things (IoT) devices has created a substantial challenge for network security, largely due to a critical lack of visibility into their operational behavior. As billions of devices connect to networks – from smart thermostats to industrial sensors – traditional security tools struggle to monitor and manage this unprecedented scale and diversity. Many IoT devices lack robust security features, and their often-unmapped network activity makes it difficult to distinguish between legitimate communication and malicious intent. This limited observability creates blind spots, allowing attackers to potentially exploit vulnerabilities, establish a foothold within a network, and launch attacks without immediate detection. Consequently, organizations face increasing difficulty in accurately assessing risk and implementing effective security measures across their growing IoT ecosystems.

The expanding universe of Internet of Things (IoT) devices presents a formidable challenge to conventional security systems. These systems, often designed for more standardized network environments, grapple with the sheer heterogeneity of IoT – a landscape populated by devices employing a vast array of communication protocols, operating systems, and functionalities. This diversity undermines the efficacy of signature-based detection and behavioral analysis, as established security tools struggle to accurately identify and classify the normal behavior of each unique device. Consequently, a critical need arises for advanced identification methods capable of discerning legitimate activity from malicious intent, even amidst the complex and varied communication patterns characteristic of modern IoT deployments. Without such refined identification, network defenses remain vulnerable to exploitation through seemingly innocuous, yet compromised, devices.

Effective Internet of Things (IoT) security fundamentally relies on a comprehensive understanding of the network services each device employs. These services – such as DNS, NTP, or specific application-layer protocols – represent the pathways through which devices communicate and, consequently, potential avenues for attack. By meticulously cataloging and analyzing the network services utilized by each IoT device, security professionals can establish a baseline of normal behavior, identify anomalies indicative of compromise, and implement targeted security policies. This granular visibility allows for the creation of more precise firewall rules, intrusion detection signatures, and threat intelligence feeds, moving beyond generic security measures that often prove ineffective against the diverse landscape of IoT devices. Consequently, a deep understanding of these network interactions is not merely beneficial, but rather a prerequisite for building robust and adaptive security defenses in the increasingly interconnected world.

The expanding universe of connected devices presents a unique challenge to network security: a lack of clear device identification effectively blinds defenders to potential threats. Many Internet of Things (IoT) devices are designed with minimal user interaction and often lack robust security features, making them easy targets for compromise. Once infiltrated, these seemingly innocuous gadgets can become entry points for malicious actors, launching attacks from within the network perimeter. Without the ability to accurately identify and monitor each device’s behavior, security teams struggle to differentiate between legitimate traffic and covert malicious activity. This creates a critical visibility gap, where compromised devices can operate undetected, exfiltrating data, participating in distributed denial-of-service attacks, or serving as staging grounds for further intrusions, all while appearing as normal components of the network.

Behavioral Fingerprinting: Establishing Device Identity Through Observation

Traffic fingerprinting circumvents reliance on device-supplied identifiers – such as User-Agent strings or network configuration details – by establishing identification based on observable network behavior. This technique analyzes patterns in network traffic, including protocol usage, packet sizes, inter-packet timings, and application-layer data, to create a unique profile for each device. The resulting profiles are derived solely from externally visible traffic characteristics, offering resilience against spoofing or manipulation of self-reported information. Consequently, devices can be identified and tracked even when employing privacy-enhancing technologies like MAC address randomization or VPNs, as the behavioral fingerprint remains consistent regardless of these changes.

Device profiling through traffic fingerprinting utilizes multiple representation methods to create a comprehensive behavioral baseline. Service List Representation identifies the specific ports and protocols a device actively uses, generating a list that characterizes its network behavior. Complementing this, Service Prevalence Representation quantifies the frequency with which a device connects to various services, providing statistical data on its typical communication patterns. By combining these approaches – detailing both the types of services and their usage rates – a more robust and accurate device profile is established, improving identification accuracy and the ability to detect anomalous activity.
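As a rough illustration of the two representations, the sketch below derives both from a handful of flow observations. The sample flows, the function names, and the choice to measure prevalence as a share of flows are assumptions made for this example; the paper may define prevalence differently (for instance, per time window).

```python
from collections import Counter

# Hypothetical observations for one device: (protocol, destination port).
flows = [
    ("udp", 53), ("udp", 53), ("udp", 123),
    ("tcp", 443), ("tcp", 443), ("tcp", 443),
    ("tcp", 8883), ("udp", 53),
]

def service_list(flows):
    """Service List (SL): the distinct services the device used, sorted."""
    return sorted(set(flows))

def service_prevalence(flows):
    """Service Prevalence (SP): share of flows that went to each service.

    Assumption: prevalence is measured per flow, which may not match the paper.
    """
    counts = Counter(flows)
    total = sum(counts.values())
    return {service: count / total for service, count in counts.items()}

sl = service_list(flows)
sp = service_prevalence(flows)
```

The list answers "which services?", while the prevalence map answers "how often?". A device that opens a fresh dynamic port for every session would keep adding entries to the list, which is one way to see why a list alone can be unstable.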

Together, the Service List and Service Prevalence representations form a behavioral profile, effectively a ‘digital DNA’, built from observed communication patterns rather than from identifiers the device reports. Because the profile does not depend on addresses, a device can still be recognized when its MAC or IP address is spoofed or changes. Anomaly detection then works by comparing current traffic against the established baseline profile; deviations indicate a potential security threat or device compromise.

IPFIX (Internet Protocol Flow Information Export) records contain detailed metadata about network traffic flows, providing the foundation for behavioral fingerprinting. These records include information such as source and destination IP addresses, port numbers, protocols, and timing information. Analyzing aggregated IPFIX data allows the creation of device profiles based on observed communication patterns – specifically, the services each device attempts to utilize and the frequency of those attempts. The granularity of IPFIX data enables the identification of subtle behavioral differences between devices, even those sharing similar configurations. Continuous analysis of incoming IPFIX records allows for the refinement of existing fingerprints and the detection of deviations from established baselines, which can indicate compromised devices or anomalous network activity.
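The aggregation step described above might look like the following sketch. It assumes the IPFIX records have already been decoded into dictionaries keyed by standard information element names (`sourceIPv4Address`, `protocolIdentifier`, `destinationTransportPort`), and it assumes a device can be keyed by its source address. The sample records and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical decoded flow records; protocolIdentifier 6 = TCP, 17 = UDP.
records = [
    {"sourceIPv4Address": "10.0.0.5", "protocolIdentifier": 17, "destinationTransportPort": 53},
    {"sourceIPv4Address": "10.0.0.5", "protocolIdentifier": 6, "destinationTransportPort": 443},
    {"sourceIPv4Address": "10.0.0.5", "protocolIdentifier": 6, "destinationTransportPort": 443},
    {"sourceIPv4Address": "10.0.0.9", "protocolIdentifier": 6, "destinationTransportPort": 8883},
]

def services_per_device(records):
    """Count how often each device contacted each (protocol, port) service."""
    usage = defaultdict(lambda: defaultdict(int))
    for record in records:
        device = record["sourceIPv4Address"]  # assumption: one address per device
        service = (record["protocolIdentifier"], record["destinationTransportPort"])
        usage[device][service] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {device: dict(services) for device, services in usage.items()}

usage = services_per_device(records)
```

Only flow metadata is touched here; no packet payload is needed, which is what keeps this style of fingerprinting lightweight.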

Fingerprint convergence decreases with higher thresholds and lower granularity, while recurrence scores initially improve with stricter thresholds but diminish at very low granularity levels.

The Stability and Accuracy of Generalized Representations

The Generalized Representation, used for device identification, is constructed by weighting service usage frequency against the temporal stability of that usage. This methodology moves beyond simple counts of service interactions by incorporating a time-decay factor; recent service usage contributes more heavily to the representation than older data. By balancing these two factors, the representation is less susceptible to short-term fluctuations in service use – such as background tasks or temporary network conditions – while still accurately reflecting long-term device behavior. The resulting fingerprint is therefore more robust and reliable for identifying devices over extended periods and across varying usage patterns.
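The article does not give the weighting formula, so the sketch below is only one way a time-decay factor could work: each observation contributes a weight that halves every `half_life_days`. The exponential form, the half-life value, and the service labels are all assumptions for illustration, not the paper's method.

```python
def decayed_usage(observations, now_day, half_life_days=7.0):
    """Sum exponentially decayed weights per service.

    observations: iterable of (service, day_observed) pairs.
    An observation made `half_life_days` ago counts half as much as one made today.
    """
    weights = {}
    for service, day in observations:
        age = now_day - day
        weights[service] = weights.get(service, 0.0) + 0.5 ** (age / half_life_days)
    return weights

# Hypothetical history: DNS seen today and a week ago, a firmware check two weeks ago.
history = [("dns", 14), ("dns", 7), ("ota-update", 0)]
weights = decayed_usage(history, now_day=14)
```

With these inputs the recurring DNS usage dominates the one-off firmware check, which matches the stated goal of damping short-lived or stale activity while preserving long-term behavior.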

The Granularity Level parameter, denoted as ‘g’, directly influences the dimensionality of the service representation vectors. A higher value of ‘g’ increases the vector’s dimensionality, capturing finer-grained details of service usage patterns and potentially improving accuracy in distinguishing between devices. However, this increased dimensionality also renders the representation more sensitive to noise and transient variations in service traffic. Conversely, a lower value of ‘g’ produces a more compact representation, enhancing resilience to noise but potentially sacrificing the ability to differentiate between nuanced service behaviors. Therefore, selecting an appropriate granularity level requires balancing the trade-off between accuracy and robustness, dependent on the characteristics of the deployment environment and the expected level of signal-to-noise ratio.
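How ‘g’ maps services to vector positions is not spelled out in the article. One plausible reading, assumed here purely for illustration, is that the 65,536-port space is split into g equal bins, so that g sets the vector length directly. The function names and the binning rule are hypothetical.

```python
PORT_SPACE = 65536  # number of possible TCP/UDP port values

def port_bin(port, g):
    """Map a port to one of g equal-width bins (assumed binning rule)."""
    return port * g // PORT_SPACE

def binned_vector(port_counts, g):
    """Build a length-g usage vector from a {port: count} mapping."""
    vector = [0.0] * g
    for port, count in port_counts.items():
        vector[port_bin(port, g)] += count
    return vector

# Two sessions from the same device on different dynamic ports.
coarse = (port_bin(49152, 2048), port_bin(49170, 2048))   # 32 ports per bin
fine = (port_bin(49152, 65536), port_bin(49170, 65536))   # 1 port per bin
```

Under this assumption, at g = 2048 both dynamic ports land in the same bin and the fingerprint stays stable, while at g = 65536 they occupy separate dimensions and look like different behavior. That is the accuracy versus robustness trade-off in miniature.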

Cosine similarity is employed as the primary metric for quantifying the resemblance between service-level representations, effectively measuring the angle between two vectors representing service usage patterns. A value of 1 indicates that the vectors point in the same direction, and 0 indicates orthogonality (no services in common). In general the metric can reach -1 for opposed vectors, but since service usage weights are non-negative, scores in practice fall between 0 and 1. Computationally, cosine similarity is the dot product of two vectors divided by the product of their magnitudes: $\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$. Because the magnitudes are divided out, two devices with the same mix of services score as similar even when one generates far more traffic than the other. This makes the metric suitable for rapid comparison of usage profiles and for clustering devices with similar service behaviors. The resulting similarity scores are then used in algorithms to group devices or identify potential matches.
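A minimal sketch of the metric in plain Python follows. The zero-vector guard is a choice made for this example (returning 0.0 rather than raising), and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty profile matches nothing
    return dot / (norm_a * norm_b)

quiet_device = [3, 1, 0]    # hypothetical usage counts over three services
busy_device = [30, 10, 0]   # same mix of services, ten times the volume
same_mix = cosine_similarity(quiet_device, busy_device)
disjoint = cosine_similarity([1, 0], [0, 1])
```

The two devices with the same service mix score approximately 1 despite the difference in volume, while the profiles with no service in common score 0.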

Evaluations of the Generalized Representation demonstrate a classification accuracy of up to 98% when applied to closed-set scenarios. This performance level was achieved using a granularity level setting of $g=2048$, which defines the dimensionality of the service representation. The high accuracy indicates the method’s ability to reliably differentiate between devices under controlled conditions, establishing its precision and suitability for device identification and clustering tasks where the complete set of possible devices is known.

Classification using augmented fingerprints at parameters (2048, 0.95) achieves high accuracy in both closed-set and open-set scenarios, as demonstrated by the confusion matrices.

Extending Classification Capabilities to the Unknown

Closed-set classification functions as a fundamental security layer by relying on pre-established knowledge of expected devices within a network. This approach meticulously identifies devices based on a defined set of characteristics, effectively confirming the presence of known entities and flagging any deviations as potential anomalies. While not capable of identifying entirely novel device types, it provides a critical baseline for network security by ensuring that all recognized devices operate as expected. The strength of this method lies in its simplicity and efficiency; it’s a highly reliable system for detecting misconfigurations or compromised devices within the known inventory, forming a vital first line of defense before more complex open-set techniques are deployed.

The proliferation of Internet of Things (IoT) devices introduces a constant stream of novel technologies, rendering traditional closed-set classification methods increasingly inadequate. These systems, designed to categorize devices within a predetermined list, struggle with the emergence of previously unknown device types, potentially leading to security vulnerabilities or inaccurate data analysis. Open-set classification addresses this limitation by enabling the identification of not only known devices but also the detection of anomalies – devices falling outside the established categories. This capability is crucial for maintaining robust security and adapting to the ever-evolving IoT landscape, as it allows systems to flag potentially malicious or malfunctioning devices that would otherwise go unnoticed, ensuring a more dynamic and resilient network.
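One common way to get open-set behavior, sketched here under assumptions, is nearest-fingerprint matching with a rejection threshold: a sample is assigned to its most similar known fingerprint only if the similarity clears a cutoff, and is labeled UNKNOWN otherwise. The cutoff of 0.95, the fingerprints, and the function names are illustrative and are not taken as the paper's decision rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(sample, known, threshold=0.95):
    """Return the best-matching known label, or "UNKNOWN" below the threshold."""
    best_label, best_score = "UNKNOWN", 0.0
    for label, fingerprint in known.items():
        score = cosine(sample, fingerprint)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "UNKNOWN"

# Hypothetical known fingerprints (service -> usage weight).
known = {
    "camera": {"tcp/443": 0.7, "udp/53": 0.2, "udp/123": 0.1},
    "plug": {"tcp/8883": 0.8, "udp/53": 0.2},
}
seen = classify({"tcp/443": 0.65, "udp/53": 0.25, "udp/123": 0.10}, known)
novel = classify({"tcp/1883": 0.9, "udp/53": 0.1}, known)
```

A closed-set classifier would be forced to pick "camera" or "plug" for the second sample. The threshold is what lets the system decline to answer, at the cost of occasionally rejecting a known device whose behavior has drifted.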

The system’s efficacy in identifying IoT devices extends beyond known categories through a refined open-set classification approach, achieving remarkably high levels of accuracy. Utilizing a granularity level of $g=2048$, the methodology demonstrates 98% precision, meaning that when a device is classified, it is overwhelmingly correct. Complementing this is a 97% recall rate, indicating the system effectively identifies nearly all instances of a given device type. This high precision and recall, achieved simultaneously, represents a significant advancement in IoT security, allowing for reliable detection of both familiar and previously unseen devices within a network and minimizing the risk of misidentification.
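For readers less familiar with the two metrics, the sketch below computes them for a single device class from lists of true and predicted labels. The labels and predictions are made up for illustration and are unrelated to the paper's results.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly claimed
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and predictions.
y_true = ["camera", "camera", "camera", "plug", "plug", "UNKNOWN"]
y_pred = ["camera", "camera", "plug", "plug", "plug", "camera"]

camera_precision, camera_recall = precision_recall(y_true, y_pred, "camera")
plug_precision, plug_recall = precision_recall(y_true, y_pred, "plug")
```

In this toy data the unseen device mislabeled as a camera lowers the camera precision, while the camera mislabeled as a plug lowers the camera recall. In an open-set setting both kinds of error matter.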

The system can also reject unfamiliar devices: it correctly labels 63% of previously unseen devices as UNKNOWN, a critical function for securing dynamic Internet of Things environments. The remaining 37% or so of unseen devices are not flagged, so this is the weaker part of the results compared with the recognition of known devices. Flagging unfamiliar devices reduces the risk that a new device is silently mistaken for a trusted one. Complementing this, analysis reveals that over 90% of the system’s inferred seasonality periods (the length of the recurring cycle in a device’s behavior) are eight days or less. Such short cycles suggest that roughly a week of observation is often enough to capture a device’s routine, which in turn makes deviations from that routine easier to spot.

The pursuit of identifying IoT devices through network service analysis, as detailed in the study, echoes a fundamental principle of computational elegance. Just as a mathematical proof demands rigorous definition and logical progression, so too does accurate device fingerprinting require a clearly defined methodology. Vinton Cerf aptly stated, “The internet is not just about technology; it’s about people.” This aligns with the study’s focus on interpretable fingerprints – moving beyond opaque packet data to reveal the underlying behavioral characteristics of devices. The research demonstrates that service-level analysis, when approached with a similar insistence on logical structure, yields a more robust and understandable system for device identification, mirroring the pursuit of provable algorithms over mere functional code.

What Lies Ahead?

The pursuit of device identification, particularly within the burgeoning landscape of the Internet of Things, frequently devolves into a complex game of pattern matching. This work, by shifting the focus to service-level interactions, offers a welcome departure – a move towards understanding what a device does, rather than simply how it communicates. However, the inherent limitations of any behavioral fingerprinting scheme remain. The consistency of service access, while more stable than ephemeral packet characteristics, is not absolute. Devices evolve, software updates alter behavior, and user habits shift. The true test will lie in demonstrating robustness against these inevitable perturbations.

Future investigations should address the challenge of open-set recognition with greater rigor. Identifying known device types is a comparatively trivial exercise. The real problem – and the one with the most practical import – is the accurate classification of unknown devices, and, crucially, the reliable detection of anomalous behavior. A purely statistical approach, divorced from a deeper understanding of service semantics, risks becoming yet another arms race between fingerprint and evasion.

Ultimately, the elegance of a solution resides not in its ability to categorize existing devices, but in its capacity to define the boundaries of what constitutes a legitimate interaction. The question is not simply “Is this device X?”, but “Does this behavior fall within the expected parameters of any known device?”. A mathematically grounded framework, one that prioritizes predictability and consistency over mere accuracy, is the only path towards a truly robust and interpretable system.

Original article: https://arxiv.org/pdf/2512.16348.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
