Statistical and probabilistic analysis of interarrival and waiting times of Internet2 anomalies

Abstract

Motivated by the need to introduce design improvements that make the Internet robust to high traffic volume anomalies, we analyze statistical properties of the time separation between arrivals of consecutive anomalies in the Internet2 network. Using several statistical techniques, we demonstrate that for all unidirectional links in Internet2, these interarrival times have distributions whose tail probabilities decay like a power law. These heavy-tailed distributions have varying tail indexes, which in some cases imply infinite variance. We establish that the interarrival times can be modeled as independent and identically distributed random variables, and propose a model for their distribution. These findings allow us to use the tools of renewal theory, which in turn allow us to estimate the distribution of the waiting time for the arrival of the next anomaly. We show that the waiting time is stochastically substantially longer than the time between the arrivals, and may in some cases have infinite expected value. All our findings are tabulated and displayed in the form of suitable graphs, including the relevant density estimates.

References

  1. Adler R, Feldman R, Taqqu MS (1998) A practical guide to heavy tails: statistical techniques for analyzing heavy tailed distributions. Birkhäuser, Boston

  2. Bandara VW, Pezeshki A, Jayasumana AP (2014) A spatiotemporal model for internet traffic anomalies. IET Netw 3:41–53

  3. Bhuyan MH, Bhattacharyya DK, Kalita JK (2014) Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tutor 16:303–336

  4. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(15):58

  5. Good PI (2013) Permutation, parametric, and bootstrap tests of hypotheses. Springer, Berlin

  6. Hall P (1990) Using the bootstrap to estimate mean squared error and select smoothing parameter in nonparametric problems. J Multivar Anal 32:177–203

  7. Kallitsis M, Stoev S, Bhattacharya S, Michailidis G (2016) AMON: an open source architecture for online monitoring, statistical analysis and forensics of multi-gigabit streams. IEEE J Sel Areas Commun 34:1834–1848

  8. Kulkarni VG (2017) Modeling and analysis of stochastic systems. Chapman and Hall, Atlanta

  9. Leland WE, Taqqu MS, Willinger W, Wilson DV (1994) On the self-similar nature of ethernet traffic (extended version). IEEE/ACM Trans Netw 2:1–15

  10. Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y (2013) Intrusion detection system: a comprehensive review. J Netw Comput Appl 36:16–24

  11. Park K, Willinger W (2000) Self-similar network traffic and performance evaluation. Wiley, Hoboken

  12. Paschalidis IC, Smaragdakis G (2009) Spatio-temporal network anomaly detection by assessing deviations of empirical measures. IEEE/ACM Trans Netw 17:685–697

  13. Peng L, Qi Y (2017) Inference for heavy-tailed data analysis: applications in insurance and finance. Academic Press, Cambridge

  14. Resnick SI (1997) Heavy tail modeling and teletraffic data. Ann Stat 25:1805–1869

  15. Resnick SI (2007) Heavy-tail phenomena: probabilistic and statistical modeling. Springer, Berlin

  16. Shumway RH, Stoffer DS (2017) Time series analysis and its applications with R examples. Springer, Berlin

  17. Tsai C-F, Hsu Y-F, Lin C-Y, Lin W-Y (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 39:11994–12000

  18. Vaughan J, Stoev S, Michailidis G (2013) Network-wide statistical modeling, prediction and monitoring of computer traffic. Technometrics 55:79–93

  19. Xie M, Han S, Tian B, Parvin S (2011) Anomaly detection in wireless sensor networks: a survey. J Netw Comput Appl 34:1302–1325

  20. Zarpelao BB, Miani RS, Kawakani CT, de Alvarenga SC (2017) A survey of intrusion detection in internet of things. J Netw Comput Appl 84:25–37

Acknowledgements

This research has been partially supported by NSF grants DMS-1737795, DMS-1923142 and CNS-1932413. We thank Professor Anura P. Jayasumana of CSU’s Department of Electrical and Computer Engineering for sharing the Internet2 anomalies data.

Author information

Corresponding author

Correspondence to Piotr Kokoszka.

Appendices

Estimation of the mean time between the arrivals of the anomalies

As explained in Sect. 4, to reliably estimate the distribution of the waiting time, we need a good estimator of the mean interarrival time. In the context of our data, this is a delicate task because the interarrival times have heavy tails, which will bias the usual sample mean. We compared the performance, measured by the root mean squared error (RMSE), of the following 12 mean estimators:

  • Median,

  • Huber location estimator with varying truncation constants: \(k = 5, 10, 20, 30\),

  • Sample mean,

  • Trimmed mean with varying trimming fractions: trim = 0.025, 0.05, 0.10, 0.15, 0.20,

  • The estimator \(\hat{\tau }= \int _0^{\infty } (1 - \widehat{G}(t)) dt\), based on the formula \(\tau = E(X) = \int _0^{\infty }P(X>t)dt\).
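The estimators in the list above can be illustrated with a short Python sketch. It is not the authors' code (the paper works in R), and all function names are hypothetical. It assumes that the trimming fraction is removed from each tail, that the Huber estimator uses a MAD scale with winsorizing iterations, and that the empirical distribution function stands in for the estimate \(\widehat{G}\) of Sect. 4, which is not reproduced here. With that stand-in the integral reduces exactly to the sample mean, so the last function only illustrates the identity \(E(X) = \int _0^{\infty }P(X>t)dt\), not the performance of \(\hat{\tau }\).

```python
import statistics


def trimmed_mean(xs, trim):
    """Mean after dropping a fraction `trim` of the observations from each tail."""
    ys = sorted(xs)
    cut = int(len(ys) * trim)  # number of observations removed at each end
    kept = ys[cut:len(ys) - cut]
    return sum(kept) / len(kept)


def huber_location(xs, k, tol=1e-8, max_iter=200):
    """Huber location estimate: iterate the mean of data winsorized at mu +/- k * scale.

    Assumption: the scale is the MAD multiplied by 1.4826 and is held fixed.
    """
    mu = statistics.median(xs)
    scale = 1.4826 * statistics.median([abs(x - mu) for x in xs])
    if scale == 0:
        return mu
    for _ in range(max_iter):
        lo, hi = mu - k * scale, mu + k * scale
        new_mu = sum(min(max(x, lo), hi) for x in xs) / len(xs)
        if abs(new_mu - mu) <= tol * scale:
            return new_mu
        mu = new_mu
    return mu


def survival_integral_mean(xs):
    """Integrate 1 - G_hat(t) over [0, inf), with G_hat the empirical cdf.

    Requires nonnegative data. With the empirical cdf this equals the sample mean.
    """
    ys = sorted(xs)
    n = len(ys)
    total, prev = 0.0, 0.0
    for i, y in enumerate(ys):
        # On [prev, y) exactly n - i observations exceed t, so 1 - G_hat(t) = (n - i) / n.
        total += (y - prev) * (n - i) / n
        prev = y
    return total


# Toy data with one dominating value (illustrative only, not Internet2 data).
toy = [1.0, 2.0, 3.0, 4.0, 100.0]
print("median        ", statistics.median(toy))
print("trimmed (0.20)", trimmed_mean(toy, 0.20))
print("Huber (k = 5) ", huber_location(toy, 5))
print("sample mean   ", sum(toy) / len(toy))
print("integral form ", survival_integral_mean(toy))
```

On data with a single dominating value, the robust estimators stay close to the bulk of the observations, while the sample mean is pulled toward the largest value.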

A crucial step is to generate observations from a distribution that resembles the distribution of the real interarrival times and whose mean (expected value) can be computed analytically. We can then compare an estimated mean with the true mean in a simulation study. Since the interarrival times consist of many small values and less frequent, dominating large values, we use a mixture model: interarrival times come from a Weibull distribution with probability p and from a half-t distribution with probability \(1-p\). The Weibull component is designed to model the occurrence of small values, while the half-t component is designed to model the tail behavior and allow for either finite or infinite variance. (The t distribution satisfies (3.1) with \(\alpha \) equal to the degrees of freedom parameter \(\nu \).) The density function from which we simulated observations is thus given by

$$\begin{aligned} f(x) = p f_{w}(x; k, \lambda ) + (1-p)f_{t}(x; \nu , \sigma ), \ \ \ x > 0, \end{aligned}$$

where

$$\begin{aligned} f_{w}(x;k, \lambda )&= \frac{k}{\lambda } \left( \frac{x}{\lambda }\right) ^{k-1}e^{-(x/\lambda )^k}, \\ f_{t}(x;\nu , \sigma )&= 2\frac{\Gamma ((\nu +1)/2)}{\Gamma (\nu /2)\sqrt{\nu \pi \sigma ^2}} \left[ 1 + \frac{1}{\nu }\frac{x^2}{\sigma ^2}\right] ^{-(\nu +1)/2}, \end{aligned}$$

and where \(k, \lambda , \nu , \sigma > 0\) and \(0 \le p \le 1\). For this model the value of \(\tau \) can be computed explicitly; it is equal to

$$\begin{aligned} \tau = p\left[ \lambda \Gamma (1 + 1/k)\right] + (1-p)\left[ 2\sigma \sqrt{\frac{\nu }{\pi }} \frac{\Gamma ((\nu +1)/2)}{\Gamma (\nu /2)(\nu -1)}\right] , \ \ \ \nu > 1. \end{aligned}$$
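As a minimal sketch of this mixture model, the following Python code draws observations from the Weibull and half-t components and evaluates the closed-form mean, with mixing weight p as in the density above. The function names and the parameter values in the usage lines are hypothetical and are not the fitted values from the paper. The half-t draw uses the standard representation of a t variable as a normal divided by the square root of a scaled chi-square.

```python
import math
import random


def mixture_mean(p, k, lam, nu, sigma):
    """Closed-form mean of the Weibull / half-t mixture (infinite if nu <= 1 and p < 1)."""
    weibull_mean = lam * math.gamma(1 + 1 / k)
    if p == 1:
        return weibull_mean
    if nu <= 1:
        return math.inf
    half_t_mean = (2 * sigma * math.sqrt(nu / math.pi)
                   * math.gamma((nu + 1) / 2)
                   / (math.gamma(nu / 2) * (nu - 1)))
    return p * weibull_mean + (1 - p) * half_t_mean


def sample_mixture(n, p, k, lam, nu, sigma, rng):
    """Draw n observations: Weibull with probability p, half-t otherwise."""
    out = []
    for _ in range(n):
        if rng.random() < p:
            # random.weibullvariate takes (scale, shape), i.e. (lambda, k).
            out.append(rng.weibullvariate(lam, k))
        else:
            z = rng.gauss(0.0, 1.0)
            chi2 = rng.gammavariate(nu / 2, 2.0)  # chi-square with nu degrees of freedom
            out.append(sigma * abs(z) / math.sqrt(chi2 / nu))
    return out


# Hypothetical parameter values chosen for illustration only.
rng = random.Random(1)
params = dict(p=0.7, k=0.8, lam=50.0, nu=1.8, sigma=400.0)
sample = sample_mixture(5000, rng=rng, **params)
print("analytic mean:", mixture_mean(**params))
print("sample mean:  ", sum(sample) / len(sample))
```

With \(\nu \) below 2 the half-t component has infinite variance, so the sample mean printed above can differ noticeably from the analytic mean from one seed to the next.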

We estimated the mixture model using the maximum goodness-of-fit estimator, which minimizes the Kolmogorov–Smirnov distance, as implemented in the R package fitdistrplus. We then used the estimated model to generate \(n=1000\) samples of synthetic interarrival times and computed the Monte Carlo RMSE of each mean estimator as

$$\begin{aligned} RMSE = \sqrt{\frac{1}{n} \sum _{r=1}^{n}(\hat{\tau }_{r} - \tau )^2}, \end{aligned}$$

where \(\hat{\tau }_r\) is a mean estimator computed from the rth Monte Carlo sample, and \(\tau \) is the mean of the estimated mixture model.
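The RMSE formula above translates directly into a short loop. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the sampler is passed in as a callable, and the usage lines draw from an exponential distribution with known mean as a simple stand-in for the fitted mixture model. The size of each synthetic sample (500 here) is also an assumption, since it is not stated in this appendix.

```python
import math
import random
import statistics


def monte_carlo_rmse(estimator, sampler, tau, n_rep=1000):
    """Root mean squared error of `estimator` over n_rep synthetic samples.

    estimator: maps a list of observations to an estimate of the mean
    sampler:   zero-argument callable returning one synthetic sample
    tau:       true mean of the distribution used by `sampler`
    """
    squared_errors = 0.0
    for _ in range(n_rep):
        squared_errors += (estimator(sampler()) - tau) ** 2
    return math.sqrt(squared_errors / n_rep)


# Stand-in sampler: exponential with mean 2 (not the fitted mixture model).
rng = random.Random(0)
true_mean = 2.0


def draw_sample():
    return [rng.expovariate(1 / true_mean) for _ in range(500)]


print("RMSE, sample mean:", monte_carlo_rmse(statistics.fmean, draw_sample, true_mean))
print("RMSE, median:     ", monte_carlo_rmse(statistics.median, draw_sample, true_mean))
```

Passing the estimator and the sampler as callables makes it easy to loop over all 12 estimators with the same simulated model.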

We found that the mixture model fits the observed interarrival times well. The fits in all links are similar to those shown in Fig. 2. The Kolmogorov–Smirnov goodness-of-fit test also fails to reject, for all links, the null hypothesis that the real and simulated data have the same distribution. In all 28 links, estimates for \(\nu \) are between 1.2 and 2.2, so the half-t distribution successfully captures the tail behavior inferred from the Hill plots. Since the \(\nu \) estimates are all greater than 1, the means of the estimated mixture distributions exist.

Table 3 RMSEs of the sample mean and three best mean estimators for interarrival time; bold indicates the lowest RMSE among the 12 estimators we considered

The RMSEs for the sample mean, the most commonly used estimator, and the three best estimators are shown in Table 3. We see that the sample mean performs poorly. The estimator \(\hat{\tau }\), which we used in Sect. 4, is most often the best estimator, and when it is not, its RMSE is very close to the lowest RMSE. This justifies its choice as the preferred mean estimator for the interarrival time.

Significance tests

We present here formal statistical significance tests that confirm the conclusions stated in Sect. 4. We first consider the testing problem:

\(H_0\): The distributions of interarrival times are identical for the 28 links,

\(H_A\): The distributions of interarrival times are not identical for the 28 links.

Since, as shown in Sect. 4, \(f_B(x) = \tau ^{-1}(1 - G(x))\), this test also applies to the distributions of waiting times. If these distributions are equal, then their expected values are also equal. We therefore use a permutation test based on the usual F-statistic:

$$\begin{aligned} F = \frac{U}{V}, \ \ \ U:= \frac{\sum _{i = 1}^{28}(\overline{X}_{i.} - \overline{ X}_{..})^2 n_{i}}{28-1}, \ \ V:= \frac{\sum _{i =1}^{28}\sum _{j=1}^{n_i}(X_{ij} - \overline{ X}_{i.})^2}{N - 28}, \end{aligned}$$
(B.1)

where \(\overline{X}_{i.}\) is the sample mean of interarrival times in link i, \(\overline{X}_{..}\) is the sample mean of interarrival times across all links, \(n_i\) is the number of observed interarrival times for link i, and N is the number of observed interarrival times in all 28 links.

Fig. 7 Sampling null distribution of the F statistic (B.1) based on ten thousand permutations

The observed value of the test statistic is \(F=5.52\). However, we cannot compare it to a tabulated critical value because the distribution of the interarrival times is not normal. We therefore estimate the null distribution using permutations, see e.g. Good (2013). Under \(H_0\), the interarrival times among the 28 links are iid random variables; hence, by randomly reassigning the N interarrival times to the 28 groups, such that the number of observations in each group is not changed, we produce a new pseudo dataset for which \(H_0\) is true. We resample in this way 10,000 times, and obtain the null distribution of the test statistic shown in Fig. 7. The observed value \(F=5.52\) lies far to the right of the range of the test statistic under the null hypothesis. Formally, we approximate the p-value with the proportion of samples with \(F > 5.52\), and see that \(p\text{-value} < 0.0001\). As a result, we reject \(H_0\).
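The permutation procedure described above can be sketched as follows. This Python code is an illustration, not the authors' implementation: the function names are hypothetical, the toy data are made up, and the statistic is the usual one-way F statistic, with the within-group variation measured around each group's own mean. The sketch counts permuted statistics that are at least as large as the observed one, which for continuous data gives the same result as the strict inequality used in the text.

```python
import random


def f_statistic(groups):
    """One-way F statistic: between-group mean square over within-group mean square."""
    g = len(groups)
    n_total = sum(len(x) for x in groups)
    grand_mean = sum(sum(x) for x in groups) / n_total
    means = [sum(x) / len(x) for x in groups]
    between = sum(len(x) * (m - grand_mean) ** 2 for x, m in zip(groups, means)) / (g - 1)
    within = sum((v - m) ** 2 for x, m in zip(groups, means) for v in x) / (n_total - g)
    return between / within


def permutation_f_test(groups, n_perm=10000, seed=0):
    """Return the observed F and the permutation p-value, keeping group sizes fixed."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for x in groups for v in x]
    sizes = [len(x) for x in groups]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of observations to groups
        permuted, start = [], 0
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if f_statistic(permuted) >= observed:
            exceed += 1
    return observed, exceed / n_perm


# Toy example with three small groups (made-up numbers, not Internet2 data).
toy_groups = [[1.2, 0.4, 3.1, 0.9], [2.5, 7.8, 1.1, 4.0, 9.3], [0.3, 0.8, 1.5]]
f_obs, p_value = permutation_f_test(toy_groups, n_perm=2000)
print("observed F:", f_obs, "permutation p-value:", p_value)
```

Because the null distribution is built from the data themselves, no normality assumption is needed, which is the reason for using permutations with heavy-tailed interarrival times.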

We also performed the Anderson–Darling test with the R package kSamples. The standardized Anderson–Darling test statistic is 28.36 with \(p\text{-value} < 0.0001\). Hence, we also reject \(H_0\).

We conclude that the interarrival time distributions among the 28 links are not identical; hence the waiting time distributions among the 28 links are not identical either.

We also used three standard goodness-of-fit tests implemented in the R package EnvStats, namely the Kolmogorov–Smirnov test, the Cramér–von Mises test and the Anderson–Darling test, to check if the distribution of the interarrival times is exponential. For all 28 links, and for each test, the null hypothesis of an exponential distribution is rejected at the significance level of 5 percent. We conclude that the anomaly interarrival time does not have an exponential distribution.
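To illustrate how such a check of exponentiality can work, the Python sketch below computes the Kolmogorov–Smirnov distance between the data and an exponential distribution with estimated rate, and calibrates it by simulation. This is a stand-in technique (a parametric bootstrap), not a reproduction of what EnvStats does, and the function names are hypothetical. Because the exponential family is a scale family and the rate is estimated by maximum likelihood, the null distribution of the distance does not depend on the rate, so simulating with rate 1 is sufficient.

```python
import math
import random


def ks_distance_exponential(xs):
    """KS distance between the empirical cdf and the exponential cdf with MLE rate."""
    ys = sorted(xs)
    n = len(ys)
    rate = n / sum(ys)  # maximum likelihood estimate of the rate
    distance = 0.0
    for i, y in enumerate(ys):
        cdf = 1 - math.exp(-rate * y)
        # The empirical cdf jumps from i/n to (i+1)/n at y.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance


def exponential_gof_pvalue(xs, n_boot=1000, seed=0):
    """Simulation-based p-value for the null hypothesis of an exponential distribution."""
    rng = random.Random(seed)
    n = len(xs)
    observed = ks_distance_exponential(xs)
    exceed = 0
    for _ in range(n_boot):
        simulated = [rng.expovariate(1.0) for _ in range(n)]
        if ks_distance_exponential(simulated) >= observed:
            exceed += 1
    return observed, exceed / n_boot


# Synthetic heavy-tailed data (Pareto with tail index 1.2, shifted to start at 0).
data_rng = random.Random(42)
heavy = [data_rng.paretovariate(1.2) - 1 for _ in range(400)]
d_obs, p_value = exponential_gof_pvalue(heavy, n_boot=500)
print("KS distance:", d_obs, "p-value:", p_value)
```

For power-law data such as the synthetic sample above, the fitted exponential cannot match both the many small values and the few very large ones, which is what drives the distance up.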

About this article

Cite this article

Kokoszka, P., Nguyen, H., Wang, H. et al. Statistical and probabilistic analysis of interarrival and waiting times of Internet2 anomalies. Stat Methods Appl 29, 727–744 (2020). https://doi.org/10.1007/s10260-019-00500-x

Keywords

  • Heavy-tailed distributions
  • Interarrival times
  • Internet anomalies
  • Renewal theory