
1 Introduction

AIOps refers to multi-layered technology platforms that automate and enhance IT operations by using analytics and machine learning [6]. AIOps was introduced to reduce the cost and increase the effectiveness of O&M tasks in increasingly complex public, private, edge, mobile, and hybrid cloud environments. The transition from mainframes to virtual machines, containers, and serverless computing made existing approaches and tools that rely on simple statistical methods obsolete, due to the increasing complexity of and communication patterns between services. Notable examples of such tools include Zabbix, Cacti, and Nagios [17, 33].

Monitoring data is a key element of new AIOps tools and one of the cornerstones of research. The data generated by distributed IT systems can be classified into three main categories: metrics, application logs, and distributed traces [30]. Metrics are numeric values measured over a period of time. They describe the utilization and status of the infrastructure, typically regarding CPU, memory, disk, network throughput, and service call latency. Application logs enable developers to record what actions were executed at runtime by software. Services, microservices, and other systems generate logs, which are composed of timestamped records combining structured fields and free-form text. Distributed traces record the workflows of services executed in response to requests, e.g., HTTP or RPC requests. The records contain information about the execution graph and performance at a (micro)service level.

Recently, various approaches – focusing on a wide range of datasets, O&M tasks, and IT systems – have been proposed. They cover a variety of tasks, each extracting knowledge from a specific type of data. For example, anomaly detection has been applied to metrics (numeric data) [11, 26, 27], logs (unstructured numeric and text data) [4, 7, 24], and also to distributed system traces (unstructured numeric and text data) [18, 19].

The existing research has mainly explored publicly available data, which usually captures only a single data source category. This limits both the development of new methods that could extract knowledge from multi-source data and their proper evaluation. The absence of data repositories capturing all three data categories from modern distributed systems prevents the development of methods for multi-source mining, knowledge extraction, and semantic information learning from the naturally linked data sources. It also prevents the development of fault detection, root-cause analysis, and remediation methods that could advance the field, as existing approaches typically produce a large number of false positives.

We address these issues with the following contributions:

  • A new dataset of metrics, logs, and traces generated by a distributed system based on a microservice architecture.

  • Description of the approach developed to generate the multi-source system data and its statistics.

  • Analysis of existing datasets utilized for the evaluation of AIOps algorithms, highlighting their benefits and their limitations.

  • Applications of the multi-source data to develop new algorithms to support additional O&M tasks.

Specifically, during the development and data generation process, we derived the following requirements:

R1 Originality. The data should fill the gaps in existing datasets for various AIOps tasks, including anomaly detection and root-cause analysis. Moreover, it should open new possibilities for the development of novel methods for O&M tasks.

R2 Reusability. The data should be modular, open to, and adaptable for various use cases. In addition, the system generating it should be easy to handle. This should allow the development of single- and multi-source methods.

R3 Quality. The data should be analyzed before publishing, free of errors, and directly usable.

R4 Extendability. The testbed generating the data should allow different system configurations, fault injections, and workloads, and thus the generation of data that suits the real production scenarios of various interested parties.

2 Related Work

Metrics, logs, and traces are important data sources that are fundamental to the operation of complex distributed systems. In the following, we review related work for each of these data categories.

Metric data is a common way to extract useful information describing the state of the system. However, it is often not sufficient on its own to reliably model complex systems. Metrics are obtained from monitoring resources such as CPU, memory, disk, and network throughput and latency. A plethora of collections of metric datasets can be found at Stonybrook [31], which hosts multiple datasets for different tasks related to anomaly detection. Numenta [1] predominantly contains datasets from streaming and real-time applications, while Harvard [9], ELKI [8], and LMU [15] store network intrusion data. Recently, multiple studies have utilized these datasets for anomaly detection, root-cause analysis, and remediation. In Ahmad et al. [1], a novel anomaly detection method based on hierarchical temporal memory (HTM) is introduced. It enables anomaly detection in the streaming setting, tackling the problems of concept drift and of multiple streaming sources of metric data. In Schmidt et al. [26], an unsupervised anomaly detection framework is developed and applied to real-time monitoring data in a distributed environment.

The main challenge faced by AIOps systems analyzing log data is the unstructured nature of the logs. It usually requires proper preprocessing and/or the inclusion of domain knowledge. Often, approaches extract log key identifiers from the logs and model their sequences. Two resources of log data for cluster systems are available. The CFDR resource [3] stores links to 19 log datasets grouped into 11 data collections. The datasets cover both hardware and software logs. The second resource is the Loghub data resource [35]. It consists of 16 datasets spanning distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. The datasets cover time spans ranging from a few days to a few months. From the perspective of system description, these datasets have the weakness of providing just a single aspect of the system. In Meng et al. [16], the LogAnomaly system for detecting anomalies in logs is introduced. It utilizes a novel template2vec technique to encode the logs and further extracts quantitative patterns from them. It uses LSTMs to detect sequential and quantitative anomalies in the logs. In Du et al. [7], the DeepLog system is introduced. It models the logs as natural language sequences, allows the operator to update the model, and provides an automatic reconstruction of the workflows to enable root-cause analysis.

In microservice architectures, traces are graph-like structures composed of events or spans [22]. Traces represent the system execution workflow, hence detailed information about individual services and their causal relationships to other related services can be inferred. Nedelkoski et al. [18, 19] introduce novel anomaly detection methods for distributed tracing data. They propose a multimodal neural network with long short-term memory (LSTM) to enable learning from the sequential nature of tracing data. They describe how the data is obtained, but the datasets are not publicly available. The Azure Public Dataset is composed of two datasets containing representative virtual machine traces from Microsoft Azure [5]. It is mostly utilized to improve resource management in large cloud platforms. Alibaba's cluster data is a collection of two datasets from real-world production [2, 14, 28]. In Zhen et al. [28], a novel system that automatically diagnoses stragglers in jobs is introduced. Li et al. [14] propose a deep reinforcement learning approach to the job scheduling task. It automatically learns a fitness calculation method that optimizes the throughput of a set of jobs from experience. Google's collection of two tracing datasets originates from parts of Google's cluster management software and systems [10].

A limitation of all the above-mentioned datasets is the absence of multi-source (multi-view) data describing a single system. The lack of data from all observability components of one system does not allow the development of holistic approaches for fault detection, root-cause analysis, and remediation that consider multiple sources of data simultaneously. Our collection of data, describing the same system from the three perspectives of logs, metrics, and traces, is to the best of our knowledge the first of its kind. It enables building models on diverse, complementary information, hence allowing AIOps systems to perform better [19].

3 Dataset Generator

In this section, we describe the infrastructure, experiments, workload, and injected faults that constitute the testbed for data generation. The testbed and the generated data follow the requirements stated above, as every parameter described in the following can be easily changed, satisfying parts of R2 and R4.

3.1 Infrastructure

We first deployed an OpenStack [29] testbed based on a microservice architecture, running in a dockerized environment managed by Kolla-Ansible [13]. OpenStack is a cloud operating system that controls large pools of computing, storage, and networking resources throughout a data centre, all managed and provisioned through APIs with common authentication mechanisms.

The experimental testbed setup is shown in Fig. 1. For the purpose of data generation, it consists of one control node named wally-113 and four compute nodes: wally-122, wally-123, wally-124, and wally-117. It was deployed on bare-metal nodes of a cluster where each node has 16 GB RAM, 3x 1 TB disks, and 2x 1 Gbit Ethernet NICs. The three hard disks were combined into a software RAID 5 for data redundancy.

Fig. 1. Illustration of the infrastructure from which the data was generated.

3.2 Workloads and Faults Injected

To generate workloads and inject faults into the infrastructure we used Rally [25]. The Rally docker image was used to create the load and to inject faults via os-faults [23]. We selected a list of workloads and faults that are close representatives of real production faults. The workloads and faults listed in the following cover user requests served by the main OpenStack projects.

  • Create and delete server. Creates and deletes a server (virtual machine). The Nova project is the one mostly affected and present in the data. We injected a compute fault which restarts the API container running on the compute nodes.

  • Create and delete image. The task for creating and deleting images accepts the image location (local or over the internet) and the format of the output image. It creates and deletes an image. The Glance project of OpenStack provides a service where users can upload and discover data assets that are meant to be used with other services. Here we inject the fault into the glance-api container running on the controller node.

  • Create and delete network. Rally provides a task that accepts the configuration for creating and deleting networks for various setups, such as multiple users and tenants. Neutron is an OpenStack project that provides networking as a service between interface devices (e.g., vNICs) managed by other OpenStack services (e.g., nova, heat). We inject faults by disrupting the following services running in docker containers: neutron metadata agent, neutron l3 agent, neutron dhcp agent, neutron openvswitch agent, and neutron server.

We performed two different experiments. In the first experiment, the user actions constituting the workload were executed sequentially: when one finished, the next was started. This experiment was performed for 750, 1000, and 1000 iterations (create and delete server, create and delete image, create and delete network), where faults were injected every 250 iterations. Each fault was injected in only one iteration; however, we noticed that some of the faults take time and propagate errors to subsequent iterations as well. In the second experiment, the Rally workloads were executed concurrently. This experiment was performed for 2000, 3000, and 6000 iterations for create and delete server, create and delete image, and create and delete network, respectively. The faults were injected at different rates: every 250 iterations for create and delete server and create and delete image, and every 500 iterations for create and delete network. The number of iterations for each action was chosen so that all workloads finish at approximately the same time. The data from the second experiment is slightly better suited for multi-source methods utilizing distributed log data, as it was generated with that goal. In addition, HTML reports were collected which correlate all the events of creations, failures, and the injections that were made. These reports serve as ground truth for the normal and anomalous states of the system.


3.3 Data Collection

In the following, we describe the technologies and methods used to collect the generated data.

Metrics. For the metrics collection across the physical nodes in the infrastructure, we utilize Glances [20], a cross-platform monitoring tool which aims to present a maximum of information in a minimal space through a curses or Web-based interface. Glances is written in Python and uses the psutil library to get information from a system. It can dynamically adapt the displayed information depending on the terminal size. It can also work in client/server mode, and remote monitoring can be done via the terminal, the Web interface, or the API (XML-RPC and RESTful). Glances was used to gather information such as CPU, memory, and load of each machine (either the controller or the compute nodes). These metrics were saved into a CSV file via the glances CLI.
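Since Glances builds on psutil, the following minimal sketch illustrates how comparable metrics (CPU, memory, load) can be collected in Python; it is not the exact Glances export used for the dataset, and the output file name is a placeholder.

```python
import csv
import os
import time

import psutil  # the library Glances itself uses to read system information

# Minimal sketch: periodically sample CPU, memory, and load averages and
# append them to a CSV file, similar in spirit to the glances CLI export.
with open("metrics_sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["now", "cpu.user", "mem.used", "load.cpucore",
                     "load.min1", "load.min5", "load.min15"])
    for _ in range(10):  # collect 10 samples for illustration
        load1, load5, load15 = os.getloadavg()
        writer.writerow([
            time.strftime("%Y-%m-%d %H:%M:%S"),
            psutil.cpu_times_percent(interval=1).user,  # percent CPU time in user space
            psutil.virtual_memory().used,  # RAM usage (units may differ from the dataset)
            psutil.cpu_count(),            # number of CPU cores
            load1, load5, load15,
        ])
```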

Logs. OpenStack services use standard logging levels. For aggregating logs from all services running across the physical nodes, we used the ELK stack (Elasticsearch, Logstash, and Kibana). Elasticsearch is a search and analytics engine which resolves the search requests. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to Elasticsearch. For log collection, Fluentd, an open-source data collector for the unified logging layer, was utilized. It allows unifying data collection and consumption for better use and understanding of data. Kibana is a dashboard that gives users the ability to visualize the data collected by Elasticsearch with charts and graphs. Finally, for exporting data from Elasticsearch into CSV, the CLI tool es2csv [32] was utilized. The benefit of this tool is that it can query documents in bulk across multiple indices and retrieve only selected fields, which reduces query execution time and speeds up the aggregation of the logs residing on the various physical nodes. We provide both the aggregated logs and the raw logs, to support the development of methods that process raw logs, such as log parsing.
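As an illustration of the kind of raw-log processing the dataset enables (e.g., log parsing), the following sketch extracts fields from OpenStack-style log lines with a regular expression; the pattern reflects a common default OpenStack log format and is an assumption that may need adjustment for the raw files in the dataset.

```python
import re

# Assumed default OpenStack log layout:
# "<timestamp> <pid> <LEVEL> <python module> [<request context>] <payload>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>INFO|WARNING|ERROR|DEBUG)\s+"
    r"(?P<module>\S+)\s+"
    r"(?:\[(?P<request>[^\]]*)\]\s+)?"
    r"(?P<payload>.*)$"
)

def parse_line(line: str):
    """Return a dict of log fields, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line.rstrip())
    return match.groupdict() if match else None

# Synthetic example line for illustration only.
example = ('2019-11-19 18:02:03.192 25 INFO nova.osapi_compute.wsgi.server '
           '[req-0894fb5b admin admin] 10.0.0.1 "GET /v2.1/servers HTTP/1.1" status: 200')
print(parse_line(example))
```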

Traces. OpenStack consists of multiple projects, where each project is composed of multiple services. To process user requests, e.g., creating a virtual machine, OpenStack uses multiple services from different projects. To support troubleshooting, OpenStack provides a small but powerful library called osprofiler that is used by all OpenStack projects and their Python clients [21] to generate traces. It generates one trace per request, which goes through all involved services, and builds a tree of calls capturing the workflow of service invocations. To identify workflows, we monitor the following call types:

  • HTTP. Captures HTTP requests, the latency of service, and projects involved.

  • RPC. Represents the duration of the parts of a request related to different services within one project.

  • DB API. The time that the request spent in the DB layer.

  • Driver. In the case of nova, cinder and others we have vendor drivers.

The osprofiler library collects these records in a trace per request and stores them in a database (e.g., Redis). From Redis, we can query and analyze traces.

4 Dataset Description

The workloads and faults described in the previous section were executed on the testbed. As explained, the execution generated the three main categories of observability data: distributed traces, metrics, and application logs. These data were recorded concurrently in order to capture the state of the system from multiple points of view, which satisfies R1 (originality), as no such dataset exists in previous work. In the following two sections, we describe the main attributes, properties, and statistics of each data category for the first experiment. Due to page limitations, we refer the reader to the link in the abstract for the code that extracts the data statistics of the second experiment. All other properties hold for both experiments.

4.1 Metrics

The metrics data category contains data for the 5 physical nodes in the infrastructure. The 5 files are named metrics_wally_N, where N identifies either the controller node or one of the compute nodes. Each of these files has 7 features:

  • now. The timestamp of the recording.

  • cpu.user. Percent time spent in userspace. The user CPU time is the time spent on the processor running your program’s code (or code in libraries).

  • mem.used. The RAM usage of the physical host.

  • load.cpucore. The number of cores of the physical host.

  • load.min1, min5, min15. Linux load averages show the demand of running tasks on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing.

A small sample of the metrics data for wally113 is shown in Table 1.

Table 1. Metrics from the controller node (wally 113)
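For working with these files, a minimal sketch of loading one metrics file with pandas is given below; the file name used here (metrics_wally113.csv) and the delimiter are assumptions that should be checked against the data repository.

```python
import pandas as pd

# Hypothetical file name following the metrics_wally_N convention described above.
metrics = pd.read_csv("metrics_wally113.csv", parse_dates=["now"])

# Quick sanity checks: time range covered and summary statistics of the features.
print(metrics["now"].min(), metrics["now"].max())
print(metrics[["cpu.user", "mem.used", "load.min1",
               "load.min5", "load.min15"]].describe())
```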

4.2 Logs

The log files are distributed over the infrastructure and are grouped into directories by OpenStack project (e.g., nova, neutron, glance) on the wally nodes. Different projects run on each of the physical nodes; the control node runs more services and thus has more log files for the OpenStack projects. Each project on a physical host has its own log directory where the logs are stored, and inside each of these directories there are several log files. It is important to note that, even though the log files are highly distributed over projects and physical nodes, they all represent the state of the same system. We provide the raw log directories in this dataset along with the aggregated log file. Using the Elasticsearch and Kibana stack, all logs can be aggregated into a central database, which can serve as a starting point for the analysis.

The log entries have a total of 23 features. Not all features are present in every log entry. The features _id, _index, and _score are metadata added by Kibana. The _type refers to fluentd, the collector responsible for sending all the metrics and logs to Kibana. In the following, we describe the main features present in the log data.

  • hostname. Name of the physical host (e.g., wally113)

  • user id, project domain, tenant id, request id, user domain, domain id. Features describing the user request to OpenStack.

  • timestamp, @timestamp. The time when the record was created.

  • log level. Describes the level of the log entry. It can be info, error, warning, etc.

  • pid. Process ID.

  • Payload. Gives the most important information of the log i.e., the body of the log entry.

  • programname. The OpenStack project that generated the log entry.

  • python_module. The module responsible for the generation of the log entry.

  • logger. Tells which project logged the event.

  • http * related fields. Only present if there is an HTTP call; they describe the endpoint, status code, version, and method.

For parsing the logs, template matching, and analysis, we suggest using the aggregated file described above instead of the directories with raw log files, as all of the information is preserved and more structured for direct analysis. For multi-source log anomaly detection, if the aggregated file is utilized, we suggest splitting by “logger” in order to obtain entries grouped by their corresponding service, as sketched below.
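A minimal sketch of this split, assuming the aggregated logs have been exported to a CSV file (the file name is hypothetical) with the features listed above:

```python
import pandas as pd

# Hypothetical export of the aggregated Elasticsearch/Kibana logs.
logs = pd.read_csv("aggregated_logs.csv", low_memory=False)

# Group log entries by the service that produced them, as suggested above,
# to obtain one per-service log stream for multi-source log anomaly detection.
per_service = {name: group.sort_values("@timestamp")
               for name, group in logs.groupby("logger")}

for name, group in per_service.items():
    print(name, len(group))
```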

4.3 Traces

The traces in the dataset are contained in 3 directories: boot_delete, create_delete_image, and network_create_delete. Each directory contains the scripts for running the workload and the fault injections, along with the actual tracing data as JSON files. This structure is preserved across all types of workloads (Rally actions).

Every trace has its features in the JSON entries or events. These features depend on multiple factors such as the user request, the infrastructure, load balancers, and caching. An event is a vector of key-value pairs \((k_i, v_i)\) describing the state, performance, and further characteristics of a service at a given time \(t_i\). In the following, we describe the main features of the events in a trace:

  • host. Name of the physical host.

  • name. Event name (e.g., compute_api-stop).

  • service. Service name (e.g., osapi_compute).

  • project. OpenStack project (e.g., nova).

  • timestamp. The time when the event is recorded.

  • trace_id. ID of the span (contains two events, e.g., compute_api-stop and compute_api-start).

  • parent_id. The parent_id gives the ID of the parent event. This attribute can be used to represent the trace in a graph.

  • base_id. ID of the trace, different events and spans with same base_id belong to one trace.

A span consists of a start and a stop event (e.g., compute_api-start and compute_api-stop) sharing the same trace_id. Subtracting the start timestamp from the stop timestamp gives the duration of the span. The above features, together with the duration, are the most important for describing the trace structure, preserving the parent-child causal relationships, and capturing the response time of the invoked service.
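As an illustration of how these features can be used, the following sketch computes span durations and recovers the parent-child structure of a trace; it assumes each JSON file holds a flat list of event dictionaries with the fields listed above, ISO-formatted timestamps, and event names ending in -start/-stop, all of which should be verified against the actual files.

```python
import json
from collections import defaultdict
from datetime import datetime

def load_events(path):
    """Load one trace file; assumes it contains a flat list of event dicts."""
    with open(path) as f:
        return json.load(f)

def span_durations(events):
    """Pair *-start/*-stop events sharing a trace_id and compute durations in seconds."""
    starts, stops = {}, {}
    for e in events:
        target = starts if e["name"].endswith("-start") else stops
        target[e["trace_id"]] = datetime.fromisoformat(e["timestamp"])  # assumed ISO format
    return {tid: (stops[tid] - starts[tid]).total_seconds()
            for tid in starts if tid in stops}

def child_map(events):
    """Recover the call tree: map each parent_id to the spans it spawned."""
    children = defaultdict(set)
    for e in events:
        children[e["parent_id"]].add(e["trace_id"])
    return children
```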

The events also contain other attributes that are specific to certain call types. For example, HTTP calls carry path, scheme, and method, where path and scheme describe the HTTP endpoint and scheme, and method can be GET or POST. Further, the db statement in DB calls gives information about the SQL query, while function, name, args, and kwargs in RPC calls tell which function was invoked with which arguments.

4.4 Ground Truth Labels

The workloads and the injected faults are both recorded in Rally HTML and JSON reports, which are located in each of the directories containing trace data. These reports provide pseudo ground truth labels for the traces, metrics, and logs. They contain information about the times when the faults were injected and the resulting high-level error messages. Taking the period when an anomaly was injected and merging it with the timestamps of the data files yields labels for evaluation. We suggest using the ground truth labels to evaluate algorithms and methods based on unsupervised learning, as in production systems the injection of anomalies and access to labeled data are restricted.
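A minimal labeling sketch is shown below; it assumes that the fault-injection windows have already been extracted from the Rally reports into (start, end) timestamp pairs (the windows, column names, and file name here are placeholders).

```python
import pandas as pd

# Hypothetical fault-injection windows extracted from the Rally JSON/HTML reports.
fault_windows = [
    (pd.Timestamp("2019-11-19 18:10:00"), pd.Timestamp("2019-11-19 18:12:00")),
]

def label_anomalies(df, time_column, windows):
    """Mark every record whose timestamp falls inside an injection window as anomalous."""
    ts = pd.to_datetime(df[time_column])
    label = pd.Series(False, index=df.index)
    for start, end in windows:
        label |= ts.between(start, end)
    return label.astype(int)

# Example: label the metrics of the controller node (file name is hypothetical).
metrics = pd.read_csv("metrics_wally113.csv")
metrics["anomaly"] = label_anomalies(metrics, "now", fault_windows)
```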


5 Dataset Statistics

This section provides descriptive statistics of the generated datasets. It quantitatively describes the properties of the trace, metric, and log data, and thereby addresses requirements R2 and R3. In the following, due to page limitations, we discuss the statistics for the first experiment only. The code for extracting the statistics of the second experiment is provided in the data repository.

5.1 Metrics

The number of recordings of resource utilization (CPU, memory, and load) per node varies in the range (108900, 298251). The average number of recordings per node is 239127, and the total number of metric recordings is 1195637. All of the nodes have 8 CPU cores. It is important to note that the metrics data cover a time span larger than the period of execution of the experiments.

As depicted in Fig. 2a and b, wally113 generally experiences the greatest CPU and memory load, as observed from the distributions of these two features. Furthermore, the correlation analysis of load.min1, load.min5, and load.min15 shows that they exhibit high correlation, given their relatedness through time. The correlation analysis also shows quite distinct behaviour of these load.min correlations on the control node compared to the remaining nodes. Regarding the dependence between cpu.user, mem.used, and the load.min features, no significant correlation can be identified. Roughly three groups of features emerge: cpu.user, mem.used, and the load.min group.
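The grouping described above can be reproduced with a simple correlation matrix, as in the following sketch; the file name is again a placeholder for one of the metrics files.

```python
import pandas as pd

metrics = pd.read_csv("metrics_wally113.csv")
features = ["cpu.user", "mem.used", "load.min1", "load.min5", "load.min15"]

# Pairwise Pearson correlations; the load.min* columns are expected to correlate
# strongly with each other, while cpu.user and mem.used form separate groups.
print(metrics[features].corr().round(2))
```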

Fig. 2. Traces: counts of services per rally action

Table 2. Traces information: count of operations per workload execution.
Table 3. Traces information: median time of a service per iteration

5.2 Logs

Since the logs are semi-structured data, we first organize them and examine the range of features that can appear in them. There are 139799 log messages from the sequential execution of the operations. We used Kibana to identify the different features describing them. Each log entry has a unique identifier referenced by the label _id. The Timestamp feature has 8 missing values; however, the timestamps provided by Kibana, stored in @timestamp, contain the relevant information about the moment when the logging happened.

There are a total of 6 services recording their logs in the OpenStack logger: nova, neutron, keystone, glance, placement, and cinder. Nova and neutron are the services producing the greatest number of logs. The logs contain 3 levels of logging (INFO, WARNING, and ERROR). There are 5 host nodes, given by Hostname (wally113, wally117, wally122, wally123, wally124). Most of the logs originate from the control node wally113. The python_module feature contains the names of the 61 modules that log their information, with wsgi-related modules being the most frequent ones (neutron_wsgi, nova.osapi_wsgi, and server_wsgi). The programname refers to the program whose operations are being executed. There are a total of 127654 different Payloads in the data, and the most frequent is related to the GET operation.

For the recorded HTTP calls there is information on the http status, with 6 different code values, the http_method, with 4 possible values (GET, POST, DELETE, and PUT), the http_urls, with a total of 3655 values, and the version of the HTTP protocol, stored in http_version. Columns such as domain id, user domain, tenant_id, request_id, user_id, _score, _type, project_domain, Pid, and domain_id have either a very large or a very small number of unique values per feature. They represent start and end points in the form of IP addresses or results of a hash function.

5.3 Traces

Table 2 shows the total number of service invocations for each of the traces of the three sequentially executed operations, summed over all repetitions of the experiment. One can observe that different services are invoked per operation. For example, for the image_create_delete operation the OpenStack services involved run entirely on the controller node, hence the compute nodes are not contacted and there are no operations related to them. The most frequently occurring invocations are split between db and wsgi. Furthermore, the operations are ordered by complexity, and it can be seen that the boot_delete task involves all 7 services.

Table 3 shows the median execution time of each invoked service. The median is chosen since the distributions are skewed and the mean is not representative of the sample distribution. As can be observed, the wsgi services are slower than the db calls, since wsgi relies on HTTP communication. It is interesting to observe that for the network create delete operation the rpc time is quite small. One explanation is the small number of rpc calls per individual execution, meaning that not all executions of this operation involve rpc calls. Since the workloads involve different numbers of individual operations, the times should be compared with caution.


6 Applications of Multi-source AIOps

While previous work has generally been done on single-source data, we believe that to develop robust, holistic approaches for anomaly detection, root-cause analysis, self-healing, resource optimization, and performance analysis, multi-source data is highly desirable.

In this section, we shortly describe possible AIOps approaches that can exploit the benefits of processing multi-source observability data.

Multi-source Anomaly Detection. The logs distributed over projects and physical hosts enable multimodal end-to-end learning and more robust log anomaly detection. Of course, this adds complexity for data integration and fusion, as the distributed logs are produced with different timestamps. The distributed logs and metrics can then be combined into a more complex model or a network of models. Lastly, the graph-like structures of the tracing data can be incorporated to complete robust anomaly detection in which all available observability data is considered.

Root-Cause Analysis. The integration of multi-source observability data can be exploited, for example via Fishbone diagrams [12], to find the root causes of problems. A method can start with simple metric-only anomaly detection, which typically provides little information about the root cause of a problem, and drill down into more complex data structures that are richer in explaining anomalies. For example, one can start by analyzing the latency of microservice endpoints. If anomalies are detected after processing the metrics, the timeframe in which the anomaly occurred can be used to select and analyze structural changes in traces. Traces can provide information about which servers are possibly faulty. Afterwards, application logs can be accessed to find the root cause of the problem.
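A sketch of this drill-down, assuming that an anomalous time window has already been detected on the metrics and that traces and logs are available as pandas DataFrames with the timestamp, host, and Hostname columns described in Sect. 4 (exact column names may differ):

```python
import pandas as pd

def drill_down(anomaly_start, anomaly_end, traces, logs):
    """Select the traces and logs that overlap an anomalous metric window."""
    trace_ts = pd.to_datetime(traces["timestamp"])
    suspect_traces = traces[trace_ts.between(anomaly_start, anomaly_end)]

    # Hosts touched by the suspect traces narrow down where to inspect logs.
    suspect_hosts = set(suspect_traces["host"])

    log_ts = pd.to_datetime(logs["@timestamp"])
    suspect_logs = logs[log_ts.between(anomaly_start, anomaly_end)
                        & logs["Hostname"].isin(suspect_hosts)]
    return suspect_traces, suspect_logs
```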

Precision Increase. Ensemble learning [34] can be used to improve machine learning results by combining several models applied to the three correlated data source categories. Such an approach would allow the production of algorithms with better predictive accuracy compared to algorithms that process single data sources.

Feature Extension. Many machine learning algorithms rely on features, which for AIOps are individual measurable characteristics of the behaviour of distributed IT systems at a given time. By using multi-source data, the spectrum of features available to an algorithm is dramatically increased. Thus, we expect the quality of algorithms and of their results to increase in the future.

7 Conclusion

AIOps systems rely on suitable observability data. We released a multi-source dataset containing distributed metrics, logs, and traces obtained from a complex distributed system based on a microservice architecture. We described in detail the infrastructure, the experiments performed, and the fault injection. Furthermore, we provided descriptive statistical properties of the data.

We also motivated possible applications of this data for improvements in anomaly detection, root-cause analysis, remediation, and feature extension. We hope that this dataset will foster advances in AIOps research, which so far has mainly explored data capturing only a single source category.