Towards comparable and reusable research

Sep 11, 2025 by Thibault Debatty

https://cylab.be/blog/440/towards-comparable-and-reusable-research

I recently served on the jury of a Master's thesis entitled “Towards Standardized Evaluation: A Meta-Analysis of AI/ML Techniques for Network Intrusion Detection”. The student's findings struck me! So here is an excerpt of his work.

Research is only useful if it can be reused, and that requires results that are comparable to previous and future work.

In his thesis, the student presents a systematic review and meta-analysis of 86 peer-reviewed studies published between 2019 and 2025, selected from an initial pool of 1,052 records.

The contribution of the thesis is threefold.

First, it catalogues the AI/ML techniques applied to network intrusion detection systems (NIDS) and their reported performance outcomes.

Second, it examines the datasets, metrics, and experimental protocols used for evaluation. The main finding of the work is that most publications and advances in the field of AI/ML applied to intrusion detection cannot be compared with each other.

So the third contribution of the thesis is a list of recommendations towards standardized evaluation of detection algorithms.

Here are the recommendations of the thesis, which IMHO should serve as guidelines for all researchers in the field…

Adopt Consistent and Comprehensive Performance Metrics

The studies surveyed employed a wide array of metrics, often without consensus. This inconsistency makes it difficult to compare models objectively. We recommend researchers report a core set of standard metrics for every new IDS technique. At minimum, metrics should include detection rate, false positive rate, precision, F1-score, and, if possible, area under the ROC curve, to capture both effectiveness and false alarm tendencies. Such uniform reporting enables fair comparison of results. In addition, statistical significance tests or confidence intervals should be used to confirm that improvements are meaningful, rather than artifacts of particular test splits. This level of rigor will mitigate the “randomness of… testing criteria” problem identified in prior work and move the field toward more reliable conclusions.
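To make this recommendation concrete, here is a minimal sketch of such a core metric report for a binary detection task, using scikit-learn. The arrays y_true, y_pred and y_score are hypothetical model outputs, and the percentile bootstrap is just one simple way to attach a confidence interval to a score; the thesis does not prescribe any particular implementation.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

def report_core_metrics(y_true, y_pred, y_score=None):
    """Core metric set for a binary IDS task (1 = attack, 0 = benign)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    metrics = {
        "detection_rate": recall_score(y_true, y_pred),   # a.k.a. recall / TPR
        "false_positive_rate": fp / (fp + tn),
        "precision": precision_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }
    if y_score is not None:                 # needs continuous scores, not labels
        metrics["roc_auc"] = roc_auc_score(y_true, y_score)
    return metrics

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any sample-wise metric."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = [metric(y_true[idx], y_pred[idx])
              for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return tuple(np.quantile(scores, [alpha / 2, 1 - alpha / 2]))
```

Reporting the interval alongside the point estimate makes it obvious whether a claimed improvement over a baseline is larger than the noise of the test split.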

Use Up-to-Date and Diversified Evaluation Data

A major finding of this meta-analysis was the continued use of antiquated datasets in many studies, even though they no longer reflect modern network traffic or attack patterns. To ensure relevance, researchers should prioritize recent, realistic datasets and avoid single-dataset evaluation whenever possible. Contemporary benchmark corpora like CIC-IDS2017, CSE-CIC-IDS2018, or sector-specific datasets offer more credible testing grounds because they reflect current, realistic traffic.
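As an illustration of the multi-dataset point, the sketch below trains on one corpus and evaluates on every corpus, so that dataset-specific overfitting shows up directly in the reported numbers. The CSV file names, the Label column and the BENIGN value loosely mirror CICFlowMeter-style exports but are placeholders here, and the sketch assumes the corpora have been preprocessed into a shared flow-feature schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def load_flows(csv_path, label_column="Label", benign_value="BENIGN"):
    """Load a flow-feature CSV and return numeric features plus binary labels.
    Column and label names are placeholders for the actual export format."""
    df = pd.read_csv(csv_path)
    y = (df[label_column] != benign_value).astype(int)
    X = df.drop(columns=[label_column]).select_dtypes("number").fillna(0)
    return X, y

# Placeholder file names: any two (or more) recent corpora with a comparable
# feature schema would do.
corpora = {
    "cicids2017": "cicids2017_flows.csv",
    "cse_cicids2018": "cse_cicids2018_flows.csv",
}

X_train, y_train = load_flows(corpora["cicids2017"])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Evaluate on every corpus, not only the one used for training.
for name, path in corpora.items():
    X_eval, y_eval = load_flows(path)
    X_eval = X_eval[X_train.columns]  # align to the training feature set
    print(f"{name}: accuracy = {clf.score(X_eval, y_eval):.3f}")
```

The gap between the in-dataset and cross-dataset scores is often the most informative number in such a table.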

Incorporate Adversarial Robustness Testing

Hardly any studies in our review assessed how models perform under intentional evasion or poisoning attacks—yet adversaries in the real world will adapt to AI-driven defenses. To move toward standardized robustness evaluation, researchers should include experiments that simulate adversarial attempts to deceive the IDS. For example, generating adversarial network traffic can reveal if a high-accuracy model might be brittle against clever attackers.

The importance of this facet is underscored by recent findings: Kumar and Shanthini show that carefully crafted adversarial examples can “drastically reduce model accuracy” of an intrusion detector, exposing vulnerabilities that remained hidden under normal test conditions. By contrast, when they applied adversarial training, the model’s accuracy was successfully restored, greatly “strengthening resilience against attacks”. We therefore recommend that future IDS evaluations include at least one form of adversarial robustness assessment, reporting the model’s performance degradation and recovery if defenses are applied. Over time, a repository of such tests could become part of a standardized benchmark for IDS robustness, analogous to how cryptographic algorithms are tested against known attack vectors.
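As a minimal, self-contained illustration (not the attack or model used in the cited work), the sketch below applies a fast-gradient-sign-style perturbation to the feature vectors seen by a linear detector and reports the resulting accuracy drop. It assumes a fitted scikit-learn LogisticRegression clf and numpy test arrays; note that the perturbations live in feature space, so they do not necessarily correspond to traffic an attacker could actually send.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fgsm_linear(clf: LogisticRegression, X, y, eps=0.05):
    """FGSM-style perturbation for a linear model: for logistic regression the
    gradient of the log-loss w.r.t. the input is (p - y) * w, so the attack
    direction is known in closed form."""
    w = clf.coef_.ravel()
    p = clf.predict_proba(X)[:, 1]
    grad = (p - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad)

# Hypothetical fitted detector `clf` and held-out numpy arrays X_test, y_test:
# acc_clean = clf.score(X_test, y_test)
# X_adv = fgsm_linear(clf, X_test, y_test, eps=0.05)
# acc_adv = clf.score(X_adv, y_test)
# print(f"clean accuracy {acc_clean:.3f} -> adversarial accuracy {acc_adv:.3f}")
```

Reporting both numbers, plus the recovered accuracy if adversarial training is applied, gives exactly the degradation-and-recovery figures the recommendation asks for.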

Prioritize Interpretability and Explainability

Our meta-analysis noted a general lack of interpretability—many high-performing intrusion detectors operate as “black boxes”, offering little insight into why an alert was raised. Low interpretability not only hampers analyst trust in AI/ML systems but also makes it difficult to diagnose false positives or improve models using domain knowledge. To address this, researchers should integrate explainable AI techniques into their evaluation process.
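One lightweight way to add such a check to an evaluation is sketched below with scikit-learn's permutation importance; SHAP or LIME would give finer-grained, per-alert explanations. The fitted detector clf, the held-out arrays X_test and y_test, and the feature_names list are assumed to exist and do not come from any specific study.

```python
from sklearn.inspection import permutation_importance

# Hypothetical fitted detector `clf`, held-out data X_test / y_test and a
# `feature_names` list matching the columns of X_test.
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=0, scoring="f1")

# Rank features by how much shuffling them degrades the F1-score: the features
# at the top are the ones the detector actually relies on to raise alerts.
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda item: item[1], reverse=True)
for name, drop in ranking[:10]:
    print(f"{name:35s} {drop:+.4f}")
```

Publishing such a ranking alongside the performance numbers lets analysts sanity-check whether the model keys on meaningful traffic features or on dataset artifacts.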

As the field of AI/ML continues to evolve, it is essential that researchers prioritize standardized evaluation to ensure that their work is reliable, reproducible, and relevant to real-world applications. The thesis presented here serves as a critical step towards achieving this goal, and I strongly encourage all researchers in the field to adopt these recommendations and work towards a more standardized and rigorous evaluation process.

Download the thesis

This blog post is licensed under CC BY-SA 4.0
