Sadly, and all too often, we learn of database performance problems as a complaint from database users. Most DBAs and managers would like to stay ahead of performance problems by monitoring their databases and taking preemptive action to avoid the types of performance complaints to which most of us have become accustomed. Some level of automated monitoring is, or should be, an integral and important task for DBAs, especially in environments with hundreds if not thousands of databases, where the sheer volume makes it impossible to monitor manually.

There are currently many approaches to and implementations of database monitoring, and it is impossible in the context of this chapter to provide a full survey of the database monitoring landscape, so I will speak about monitoring solutions in a general sense. Monitoring usually goes hand in hand with alerting, so that detected problems are pushed out to the personnel who need to act on them. Most generic alerting mechanisms support the ability to create named rules that specify an arithmetic expression of the form <metric> <operator> <value>, where

  1) <metric> is the metric that you want to monitor.

  2) <operator> is the comparison operator to use.

  3) <value> is the “threshold” value of that metric that triggers the alert (two values for warning or critical threshold levels are commonly used).

When the expression that defines the rule evaluates to true, the tool sends an alert. For example, one might want an alert when the dba_hist_sysmetric_summary metric “Average Synchronous Single-Block Read Latency” is greater than 20 milliseconds.
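
A minimal sketch of what such a check could look like as plain SQL against the AWR views follows; the 20 ms threshold and the join to dba_hist_snapshot for interval times are illustrative assumptions, not settings recommended by any particular tool:

    -- Flag AWR intervals where the average synchronous single-block read
    -- latency exceeded an assumed 20 ms threshold.
    SELECT s.snap_id,
           s.begin_interval_time,
           m.average AS avg_sync_read_latency_ms
    FROM   dba_hist_sysmetric_summary m
           JOIN dba_hist_snapshot s
             ON  s.snap_id         = m.snap_id
             AND s.dbid            = m.dbid
             AND s.instance_number = m.instance_number
    WHERE  m.metric_name = 'Average Synchronous Single-Block Read Latency'
    AND    m.average > 20   -- assumed warning threshold, in milliseconds
    ORDER  BY s.begin_interval_time;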

Years ago, I implemented a consistency check system in Oracle for a large clinical trials system that used a key-value pair structure similar to the one used by the DOPA process. That system supported complex expressions: simple expressions could be combined into more complex ones with the logical operators AND/OR. For example, one might want a complex expression that looks at two or more metrics at a time, such as:

Average Active Sessions > 4 AND Database Wait Time Ratio >= 80.

Since SQL is an interpreted language, it was fairly easy to get SQL to evaluate the expressions and return true (i.e., push an alert) or false (i.e., don’t push an alert).
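
A hedged sketch of that idea follows; the alert_rule table and its columns are hypothetical names invented for illustration, not objects from the consistency check system described above:

    -- Hypothetical rule table, one row per simple expression:
    --   CREATE TABLE alert_rule (rule_name   VARCHAR2(60),
    --                            metric_name VARCHAR2(64),
    --                            operator    VARCHAR2(2),
    --                            threshold   NUMBER);
    -- A rule "fires" for an AWR interval when its expression evaluates to true.
    SELECT r.rule_name, m.snap_id, m.metric_name, m.average, r.operator, r.threshold
    FROM   dba_hist_sysmetric_summary m
           JOIN alert_rule r ON r.metric_name = m.metric_name
    WHERE  CASE r.operator
             WHEN '>'  THEN CASE WHEN m.average >  r.threshold THEN 1 ELSE 0 END
             WHEN '>=' THEN CASE WHEN m.average >= r.threshold THEN 1 ELSE 0 END
             WHEN '<'  THEN CASE WHEN m.average <  r.threshold THEN 1 ELSE 0 END
             WHEN '<=' THEN CASE WHEN m.average <= r.threshold THEN 1 ELSE 0 END
           END = 1;

A compound AND expression like the one above could be handled by grouping the simple rules that fired for the same snapshot and requiring that the count of fired rules match the number of rules in the group.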

The problem with the typical metric-by-metric thresholding approach is that it is hard to know

  1) what metrics are important,

  2) what values should be used for the thresholds,

  3) what values are specific to a system, and

  4) what values can be treated as absolutes across all systems.

Although some organizations may choose to build a monitoring system, as I did in the preceding scenario, it is also possible to use an off-the-shelf monitoring solution. Monitoring products are available, most notably Oracle’s OEM, but they are not without cost or limitations.

The Costs and Shortcomings of Current Monitoring Tools

The total cost of ownership of a monitoring tool is always an important consideration; those costs involve not only any licensing and maintenance fees that may apply but also the time and effort needed to learn the tool, provision an environment for it, implement it, and maintain it. A tool with a large footprint might well be justified for large-scale enterprise systems, but for smaller database environments, the tool may cost more than the value it delivers. As I will show in the following section, the DOPA process is a rather lightweight approach that can be used as a starting point to address many of the database monitoring requirements.

When looking at the available monitoring tools, another factor you need to consider besides cost is how well a particular tool will suit your needs. The current monitoring tools have one important characteristic in common with current performance analysis tools: they rely on a predetermined set of metrics and can therefore be referred to as small model monitoring tools. As with the performance tools, so here with the monitoring tools, they have “blind spots” built into them because they are only looking at a few hand-picked metrics. For example, the classical monitoring approach in older versions of Oracle (9i and older, but still being used by some DBAs in 12c) was to create shell scripts to identify problems and send e-mail notifications of them (e.g., tablespace size alerts, UNDO or TEMP error alerts, operating system file space alerts, long-running SQL alerts).

I view the small model approaches in both performance tools and in monitoring tools as “noise avoidance” approaches that keep you from being overwhelmed by the tens of thousands of metrics that are out there. The small model approaches focus on only preselected metrics because it is impossible to instrument many metrics manually (i.e., it is difficult to scale the small model approach). However, because they are using only a limited number of metrics, the small model approaches are only as good as the metrics they use; if the metrics used are representative of the range of problems one is likely to encounter, they can be helpful, but if not, you will miss valuable information. The “blind spots” are intentional and necessary without a way to process all the information, but I suspect the consequences are unintentional. The small model approach, by missing pertinent observations, can have the unintended consequence of allowing “mysterious,” recurring problems in your database(s) to go undiagnosed and therefore not addressed in a timely way.

In my experience, the range of problems that one can encounter on an Oracle database is very diverse, so trusting a tool that uses the small model approach is risky. The problem is you can’t monitor or fix what you can’t see.

I designed the DOPA process as a tool for the Oracle performance metrics analytics use case; however, the many DBAs with whom I’ve interacted almost without exception draw a connection between the DOPA process and database monitoring in general. While I have not yet had the time to implement DOPA for this purpose, I am confident that the DOPA process would form the basis of an extremely effective monitoring tool and that it might be able to perform that function with less cost and fewer limitations than the more established tools currently available—at least for some organizations or in some situations.

As I said, you can’t fix what you can’t see, but using the DOPA process you can pull from all the available metrics, so I am confident that it has the potential to be a more useful monitoring tool once modifications for this functionality are built. The DOPA process approach may even be paired with machine learning to provide an even greater impact. I’ll discuss that in the next chapter.

Why the DOPA Process Is a Good Candidate for Monitoring

As I’ve already stated, the intent of database monitoring is to identify, diagnose, and resolve problems before they have a negative impact on the business. Without database performance metrics, it is impossible to gain a clear picture of what is happening on the database in the background. The Oracle RDBMS is probably the most well-instrumented software on the planet, and the DOPA process takes advantage of this instrumentation. It not only exposes thousands of metrics but also detects anomalies in the metric values. It is not a large leap to observe that the DOPA process could be leveraged not only to analyze and solve performance problems that have already happened but also to inform your database monitoring and alerting decisions.

In this book, I have demonstrated the usefulness of the DOPA process for discovering performance problems. I have used the DOPA process in many performance use cases where the metrics are evaluated in the context of a known problem (i.e., correlated to a problem), using the flagged metrics to inform my conclusions and tuning actions. With monitoring and alerting, since you are trying to preempt a problem, the metrics may not be evaluated in the context of a known problem. This characteristic of no problem, or an unknown problem, makes the monitoring use case somewhat more challenging, as you have to be more confident in your conclusions and proposed actions. In the following section, I will show how the DOPA process can be used to study metrics and develop this confidence.

Development Requirements for DOPA to Be Used for Monitoring

In this section I will discuss in a general way the high-level requirements of a monitoring solution and how the DOPA process fits in.

Monitoring solutions all share some common features:

  1) Persisted historic metrics

  2) Monitoring and analysis of metrics

  3) Threshold alerting

In order to develop the DOPA process as a monitoring tool, each of these areas needs to be addressed/developed. I discuss in the following section some ideas I have about how that can be accomplished using DOPA. I reiterate, it is finite time that has prevented me from doing so thus far. I welcome any feedback from others who may have put some time and effort into doing this already.

  1) Persisted historic metrics:

    The time-series AWR metrics instrumented in DOPA represent native measurements of historic resource usage and are the source [i.e., “raw material”] that is fed into the analysis process. These metrics range from low-level information from the operating system, such as memory swapping, to more granular information from within Oracle, such as latching, specific memory areas, and specific subcomponents of Oracle. DOPA surfaces this already persisted AWR information to help identify the root cause when evaluating performance; for monitoring, the same ability can be used to mine the necessary metric data points during normal operation.

    The DOPA process makes use of the persisted historic AWR metrics and provides a powerful analysis capability for drawing from any of the available AWR (and other) sources, not just those predetermined by a monitoring tool. Using DOPA, the choice would be yours: simple SQL subqueries are all that is needed to include new metrics in the analysis, and with the built-in subsetting capability of the DOPA process, you can easily include more metrics to gain a more complete understanding, or fewer if you are focusing on a particular area of concern (see the sketch at the end of this item).

    DOPA is very adept at instrumenting and analyzing metric source data since this is what it is designed to do, but in order to use the metrics analysis for monitoring purposes, it is necessary to create a mechanism for persisting the results of a DOPA analysis as a metrics baseline. In the next subsection, I will speak to the analysis side.
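
    As a minimal sketch of the subquery idea mentioned above, an additional AWR source could be normalized into a generic key-value, time-series shape along these lines (the column aliases are illustrative; DOPA’s actual union query is not reproduced here):

        SELECT snap_id, dbid, instance_number,
               'dba_hist_sysmetric_summary' AS metric_source,
               metric_name,
               average AS metric_value
        FROM   dba_hist_sysmetric_summary
        UNION ALL
        SELECT snap_id, dbid, instance_number,
               'dba_hist_osstat' AS metric_source,
               stat_name AS metric_name,
               value     AS metric_value
        FROM   dba_hist_osstat;
        -- Note: many dba_hist_osstat values are cumulative counters, so a
        -- per-snapshot delta calculation would normally be applied first.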

  2) Monitoring and analysis of metrics:

    The DOPA process uses statistical methods to find performance problems by establishing normal ranges and then identifying metric values that fall outside of normal. This same functionality can be used to establish normal ranges for any metric. In order to use the DOPA process in the context of database monitoring, one would want to evaluate metric trends and normal/abnormal ranges within the context of the normal operation of the applications within the database. This would involve a study of the metrics and “baselining” the normal ranges in the scenarios that make the most sense for your applications. For example, what may be normal on one machine will be different from another, and it may change over time—within a workday, within the workweek, and as workloads increase over time. It would also be quite useful to be able to run the normal range analysis for various periods (e.g., for the entire month, for the 9–5 time slot over the month, for the weekend, for daytime or nighttime workloads, etc.). The ability to sample data and dynamically calculate and persist normal ranges for designated periods and update the “baselines” on a continuing basis is something I believe could be accomplished with DOPA with some modifications.
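
    As a hedged sketch, and only as one possible statistical definition of “normal,” the ranges for a designated baseline window could be computed along these lines (the dates, the 9–5 hour filter, and the mean plus/minus two standard deviations rule are assumptions for the example):

        SELECT m.metric_name,
               ROUND(AVG(m.average), 2)                          AS mean_value,
               ROUND(AVG(m.average) - 2 * STDDEV(m.average), 2)  AS lower_bound,
               ROUND(AVG(m.average) + 2 * STDDEV(m.average), 2)  AS upper_bound
        FROM   dba_hist_sysmetric_summary m
               JOIN dba_hist_snapshot s
                 ON  s.snap_id         = m.snap_id
                 AND s.dbid            = m.dbid
                 AND s.instance_number = m.instance_number
        WHERE  s.begin_interval_time >= TIMESTAMP '2019-06-01 00:00:00'  -- assumed month
        AND    s.begin_interval_time <  TIMESTAMP '2019-07-01 00:00:00'
        AND    TO_CHAR(s.begin_interval_time, 'HH24') BETWEEN '09' AND '16'  -- 9-5 slot
        GROUP  BY m.metric_name;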

Persisting Baselines

As I mentioned in the preceding section, it would be necessary to be able to designate and store the metric normal ranges according to the period/scenario during which they were collected. You could persist normal range “baselines” periodically (weekly or monthly, given a sufficient retention period) over several periods and compare the normal ranges. You would also want to distinguish between daytime and nighttime database workloads; this would be a small tweak to DOPA if, for example, I wanted just the 10pm–8am window for 30 consecutive days. The persisted baseline normal ranges could be established for a database or a set of comparable DBs and then used later in the monitoring process for threshold setting.

Further, I would want to be able to identify the metrics as having been obtained during “normal nonevent periods” as well as during periods of performance degradation or other unusual situations (e.g., operating system upgrade X, application change Y, unusual event/problem Z).
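
A hypothetical persistence structure for these labeled baselines might look something like the following (all names are invented for illustration and are not part of the current DOPA code):

    CREATE TABLE metric_baseline (
      db_name        VARCHAR2(30),
      metric_source  VARCHAR2(64),
      metric_name    VARCHAR2(64),
      period_label   VARCHAR2(60),   -- e.g., 'June, 9-5 weekdays' or '10pm-8am, 30 days'
      scenario_label VARCHAR2(60),   -- e.g., 'normal nonevent' or 'OS upgrade X'
      lower_bound    NUMBER,
      mean_value     NUMBER,
      upper_bound    NUMBER,
      collected_on   DATE DEFAULT SYSDATE
    );

A normal-range query like the one sketched earlier could then populate this table with an INSERT ... SELECT, one run per period/scenario of interest.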

Note

This persisted data could become the basis for “training sets” for further machine learning analysis/study.

Further Thoughts on the Study of Normal Ranges

An important consideration in this discussion is: “How do you decide what is the normal operation of an application/database?” What follows is an approach I think would be a good start for establishing “normal” ranges for the database.

A starting point for database monitoring using DOPA would be to look at Average Active Sessions (AAS) with a focus on the time periods having the highest AAS peaks. To do this using the DOPA process, I would use the Metrics Time-Series View, subsetting on “Average Active Sessions” and looking at all the values, not just the flagged values. For example, on one particular database, the highest workload period was between 04:00 hours and 22:00 hours on the 15th, as in Figure 9-1.

Figure 9-1

Average Active Sessions Time-Series example to find interval of highest workload
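
A hedged sketch of one way to look for the highest-workload window follows; it rolls Average Active Sessions up by hour of day directly from dba_hist_sysmetric_summary and is illustrative only, not DOPA’s actual Metrics Time-Series View:

    SELECT TO_CHAR(s.begin_interval_time, 'HH24') AS hour_of_day,
           ROUND(AVG(m.average), 2)               AS avg_aas,
           ROUND(MAX(m.maxval), 2)                AS peak_aas
    FROM   dba_hist_sysmetric_summary m
           JOIN dba_hist_snapshot s
             ON  s.snap_id         = m.snap_id
             AND s.dbid            = m.dbid
             AND s.instance_number = m.instance_number
    WHERE  m.metric_name = 'Average Active Sessions'
    GROUP  BY TO_CHAR(s.begin_interval_time, 'HH24')
    ORDER  BY hour_of_day;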

Next, I’d take the highest workload period [04:00 hours to 22:00 hours on the 15th] and generate the normal ranges using the Metrics Aggregate View.

The following example in Figure 9-2 is an extract of the normal ranges provided from the Metrics Aggregate View subset on source dba_hist_sysmetric_summary.

Figure 9-2

An extract of the normal ranges using the Metrics Aggregate View subset on source dba_hist_sysmetric_summary

So as you can see, the DOPA process can be used as is to help establish threshold values for any of the metrics. The process can be applied to single databases or across multiple databases depending on the need.

Again, I can use the DOPA process as is to study metrics across one or more databases to understand what ranges of values I could expect. For example, if I wanted to determine a typical I/O capacity in bytes/sec or I/Os/sec, I could build a model in DOPA using the Metrics Aggregate View subset on metric_name like “Physical % Total % Per Sec” in metric source dba_hist_sysmetric_summary. This would provide a table as in Figure 9-3 that focuses on the lower, average, and upper ranges. From this analysis I could establish an upper bound for normal by taking the maximum upper bound across multiple DBs.

Figure 9-3

Example compilation of metrics across multiple DBs using the DOPA process
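
A hedged sketch of the kind of query behind a compilation like Figure 9-3, assuming the per-database normal ranges had already been gathered into the hypothetical metric_baseline table sketched earlier:

    -- Take the highest observed upper bound per I/O metric across all
    -- databases as a working upper bound for "normal."
    SELECT metric_name,
           MAX(upper_bound) AS max_upper_bound_across_dbs
    FROM   metric_baseline
    WHERE  metric_name LIKE 'Physical%Total%Per Sec'
    GROUP  BY metric_name;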

In this subsection we looked at the analysis of metric normal ranges and noted that baseline normal ranges need to be persisted in a variety of scenarios and used to help satisfy the monitoring requirement of threshold setting, which is the next topic I will address.

  3) Threshold alerting—detecting a problem:

    The DOPA process does not currently have any mechanism for threshold alerting, but as outlined in the preceding section, it can be used to facilitate this requirement of a monitoring solution by engaging in metric normal range analysis and using that analysis to adjust thresholds as necessary. Additionally, if you have a mechanism to set thresholds for a nighttime vs. a daytime workload, the DOPA process could be used to inform those decisions as well. For example, daytime applications might be more sensitive to response time per transaction, whereas at night you could allow a slower response time without sending an alert, since transactions at night are usually larger, longer-running transactions.

    Typically monitoring solutions accomplish threshold alerting in two ways: by establishing absolute values for the metrics and by setting percentile values and driving the alerting off the percentiles. Each of these will be dealt with separately.

Alerting via Absolute Values

Threshold alerting using absolute values requires that the absolute values for the monitored metrics be predetermined. The DOPA model-building process can facilitate this by supporting the study of metric normal ranges as described in the previous subsection. Even with a tool like OEM, the absolute values for metrics recommended by Oracle may not be fit for all environments, so the DOPA process could be used to check or set these values in your own environment. As described in the preceding section, you could use a simple rule-based mechanism to drive the alerting by evaluating an arithmetic expression/logical expression (i.e., <metric> <operator> <value>) and alerting when the expression evaluates to true. Such a rule-based approach would be an extension to the DOPA process.

Alerting via Percentiles

Threshold alerting using percentiles is a more dynamic process that relies on statistical calculations of percentiles. Such a percentile-based approach would be an extension to the DOPA process. If I were to implement this requirement, I would want the percentile used at a particular alert level to be modifiable; for example, metric values above the 99.99th percentile could raise a critical alert, while values above the 98th percentile would just signal a warning. As with absolute value threshold setting, you must have one or more examples of typical workload or take your normal ranges from another comparator (like) database.
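
A minimal sketch of the percentile idea against dba_hist_sysmetric_summary follows; the 0.98 and 0.9999 levels come from the example above, while the alert labels and the use of the full retained history as the comparison window are assumptions:

    WITH pct AS (
      SELECT metric_name,
             PERCENTILE_CONT(0.98)   WITHIN GROUP (ORDER BY average) AS warn_level,
             PERCENTILE_CONT(0.9999) WITHIN GROUP (ORDER BY average) AS crit_level
      FROM   dba_hist_sysmetric_summary
      GROUP  BY metric_name
    )
    SELECT m.snap_id,
           m.metric_name,
           m.average,
           CASE
             WHEN m.average >= p.crit_level THEN 'CRITICAL'
             WHEN m.average >= p.warn_level THEN 'WARNING'
           END AS alert_level
    FROM   dba_hist_sysmetric_summary m
           JOIN pct p ON p.metric_name = m.metric_name
    WHERE  m.average >= p.warn_level;

Requiring a minimum number of flagged intervals before raising the alert, as discussed next, could be layered on top of this query.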

In both threshold setting situations, absolute value and percentile, you would want to be able to specify the number of occurrences required to set off an alert. I used the Host CPU Utilization % as an example in the preceding section.

Alerting is not all fun and games; there are some difficulties with alerting. For example, if you have too many alerts being sent, it could create a situation of “white noise” and allow something to slip by unnoticed. You can also have too few alerts, where, again, some important event slips by unnoticed. Therefore, I would expect that there will be some level of trial and error in setting absolute values or percentiles for alerting.

Summary

As you can see, the DOPA process can be used to facilitate the monitoring process and, with some extensions, meets many or even most of the monitoring requirements. Further, the DOPA process’s “big model” approach (or many-metrics approach), along with its statistical analysis, makes the DOPA process a very interesting entry point into monitoring. The fact that the DOPA process has a small footprint and is lightweight and accessible makes it even more interesting as a basis for a monitoring solution. I’m not suggesting that the DOPA process is the be-all and end-all of monitoring solutions (there are many very capable tools out there), but in some environments, this might be all you need. For the ever-learning DBA, at the very least, the metric normal range analysis capability of DOPA (as described in this chapter) could be used, even without monitoring, to learn more about the metrics and their values that are most at play in the database environments in which they work.

Next, we move on to further enhancements in our last chapter together.