
Gathering I/O Resource Manager Metrics




Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)




Each Exadata storage cell gathers several types of metrics that are important for understanding how the storage is being shared by multiple workloads, their I/O performance, and the effect of I/O Resource Manager. These metrics include:

  • Disk and flash usage, by database and consumer group.
  • Disk latencies.
  • If I/O Resource Manager is enabled, how long I/Os waited to be scheduled, by database and consumer group.

The Enterprise Manager 12c UI displays these metrics in user-friendly graphs. In the absence of the EM 12c UI, CellCLI can be used to view both current and historical values of these metrics. However, these metrics are much easier to interpret if the relevant ones are grouped together. 

In this document, we provide an easy-to-use script for gathering I/O metrics, along with pointers for interpreting them. The script works whether or not I/O Resource Manager is enabled, and with all versions of the Exadata storage cell software.

Scope


This document is intended to be used by Exadata administrators who are using Exadata storage cells for mixed workloads or consolidation. 

Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl


The script, metric_iorm.pl, is attached to this document. Place this script on each Exadata storage cell. We recommend putting it in your home directory for two reasons. First, files in your home directory are preserved across upgrades. Second, dcli automatically looks for the script in your home directory. Therefore, if the user name is "celladmin", the script should be located in /home/celladmin.

You can then invoke it as follows:

1. To query the storage cell's current I/O metrics:

./metric_iorm.pl 

2. To view the storage cell's historical I/O metrics, using explicit start and/or end times:

./metric_iorm.pl "where collectionTime > (start_time) and collectionTime < (end_time)" 



The start and end times must be provided in a CellCLI-compliant format. For example, to view I/O metrics from 9 AM to 10 AM on 2011-05-25:


./metric_iorm.pl "where collectionTime > '2011-05-25T09:00:00-07:00' and collectionTime < '2011-05-25T10:00:00-07:00'" 
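Building these timestamps by hand is error-prone. As a sketch, assuming GNU date (standard on the cells' Linux image), the window can be generated programmatically; the `%:z` specifier emits the UTC offset with a colon (e.g. -07:00), matching CellCLI's expected format:

```shell
# Generate CellCLI-compliant timestamps for a window from 3 hours ago
# to 2 hours ago, using GNU date's relative-date parsing (-d).
START=$(date -d '3 hours ago' '+%Y-%m-%dT%H:%M:%S%:z')
END=$(date -d '2 hours ago' '+%Y-%m-%dT%H:%M:%S%:z')

# The echoed clause can be passed directly as the script's argument.
echo "where collectionTime > '$START' and collectionTime < '$END'"
```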


3. To view the storage cell's historical I/O metrics, using relative start and/or end times:

./metric_iorm.pl "where ageInMinutes > (number_minutes) and ageInMinutes < (number_minutes)" 



For example, to view I/O metrics between 2 and 3 hours ago:

./metric_iorm.pl "where ageInMinutes > 120 and ageInMinutes < 180" 



4. To view I/O metrics from multiple storage cells, use dcli from the compute nodes. For example:



dcli -c cel01,cel02 -l celladmin "metric_iorm.pl > /var/log/oracle/diag/asm/cell/metric_output" 

dcli -g cell_names.dat -l celladmin "metric_iorm.pl > /var/log/oracle/diag/asm/cell/metric_output" 

where cell_names.dat contains the names of all the cells (cel01 and cel02), one per line.

To pass a where clause through dcli, the quotes, angle brackets, and redirection characters must be escaped:

dcli -g cell_names.dat -l celladmin metric_iorm.pl \"where collectionTime \> \'2010-12-15T17:10:51-07:00\' and collectionTime \< \'2010-12-15T17:15:51-07:00\'\" \> /var/log/oracle/diag/asm/cell/metric_output
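Escaping every quote gets unwieldy. One alternative sketch, assuming dcli's -x option (which copies a local script file to each cell and executes it there), is to put the awkwardly quoted command in a plain script:

```shell
# Write the command, with ordinary quoting, into a local script.
# iorm_last_hour.sh is a hypothetical name; cell_names.dat is as above.
cat > iorm_last_hour.sh <<'EOF'
#!/bin/sh
~/metric_iorm.pl "where ageInMinutes < 60" > /var/log/oracle/diag/asm/cell/metric_output
EOF
chmod +x iorm_last_hour.sh

# Then let dcli ship and run it on every cell (run from a compute node):
# dcli -g cell_names.dat -l celladmin -x iorm_last_hour.sh
```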






Understanding the Output

The "utilization" metrics show how busy the disks are, on average, broken down by database and by consumer group. In each case, they are further broken down into small I/Os (indicative of latency-sensitive workloads) and large I/Os (indicative of throughput-intensive workloads). They correspond to these metrics: DB_IO_UTIL_SM, DB_IO_UTIL_LG, CG_IO_UTIL_SM, and CG_IO_UTIL_LG.





The "small and large IOPS" metrics are an alternative to the "utilization" metrics. They show the rate of I/Os per second to disk, broken down by database, consumer group, and I/O size. They correspond to these metrics: DB_IO_RQ_SM_SEC, DB_IO_RQ_LG_SEC, CG_IO_RQ_SM_SEC, CG_IO_RQ_LG_SEC.

The "throughput" metrics are another alternative to the "utilization" metrics. They show the rate of I/Os to disk in megabytes per second, broken down by database and consumer group. They correspond to these metrics: DB_IO_BY_SEC and CG_IO_BY_SEC.





The "Flash Cache IOPS" metrics show the rate of I/Os to the flash cache, broken down by database and by consumer group. They correspond to these metrics: DB_FC_IO_RQ_SEC and CG_FC_IO_RQ_SEC.

The "avg qtime" metrics show the average amount of time in milliseconds that I/Os waited to be scheduled by I/O Resource Manager. They are broken down by database, consumer group, and I/O size, and correspond to these metrics: DB_IO_WT_SM_RQ, DB_IO_WT_LG_RQ, CG_IO_WT_SM_RQ, and CG_IO_WT_LG_RQ.





The "disk latency" metrics show the average disk latency in milliseconds, broken down by I/O size and by reads versus writes. They are not broken down by database or consumer group, since the disk driver cannot attribute I/Os to a database or consumer group; the disk latencies are therefore the same across databases and consumer groups. They correspond to these metrics: CD_IO_TM_R_SM_RQ, CD_IO_TM_R_LG_RQ, CD_IO_TM_W_SM_RQ, and CD_IO_TM_W_LG_RQ.

If your storage cell hosts multiple databases, then a database may not be listed for the following reasons:

  • All metrics for the database are zero because the database is idle.
  • The database is not explicitly listed in the inter-database IORM plan and the inter-database IORM plan does not have a default "share" directive. Use "list iormplan" to view the inter-database IORM plan.
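To check the second case, the inter-database plan can be inspected directly. A sketch, guarded so it only runs where the cellcli binary exists:

```shell
# Show the inter-database IORM plan so you can confirm whether the
# missing database appears in a dbPlan directive or a default applies.
if command -v cellcli >/dev/null 2>&1; then
  PLAN=$(cellcli -e "list iormplan detail")
else
  PLAN="cellcli not found; run this on a storage cell"
fi
echo "$PLAN"
```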




How to Interpret the Output

These metrics can be used to answer the following common questions:

Which database or consumer group is utilizing the disk the most? Use the disk utilization metrics to answer this question. You can also use the disk IOPS metrics. However, the total number of IOPS that can be sustained by the disk is extremely dependent on the ratio of reads vs writes, the location of the I/Os within the disk, and the ratio of small vs large I/Os. Therefore, we recommend using the disk utilization metrics.

Am I getting good latencies for my OLTP database or consumer group? The I/O latency, as seen by the database, is determined by the flash cache hit rate, the disk latency, and the IORM wait time. OLTP I/Os are small, so you should focus on the disk latency for small reads and writes. You can use IORM to improve the disk latency by giving high resource allocations to the OLTP databases and consumer groups. If necessary, you can also use the "low latency" objective. You can also decrease the IORM wait time by giving high resource allocations to the critical databases and consumer groups. 
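As a sketch of setting the "low latency" objective mentioned above (using CellCLI's ALTER IORMPLAN command; other objective values include balanced, high_throughput, and auto; guarded so it only runs where cellcli exists):

```shell
# Switch the IORM objective to favor small-I/O latency, then verify.
if command -v cellcli >/dev/null 2>&1; then
  cellcli -e "alter iormplan objective = low_latency"
  cellcli -e "list iormplan attributes objective"
  RESULT="objective updated"
else
  RESULT="cellcli not found; run this on a storage cell"
fi
echo "$RESULT"
```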

What is the flash cache hit rate for my database or consumer group? For OLTP databases and consumer groups, you should expect the flash cache to be used for a significant number of I/Os; comparing the "Flash Cache IOPS" metrics with the disk IOPS metrics gives an estimate of the hit rate. Since the latency of flash cache I/Os is very low, the I/O response time, as seen by the database, can be optimized by improving the flash cache hit rate for critical workloads.

How much is I/O Resource Manager affecting my database or consumer group? If IORM is enabled, then IORM may delay issuing an I/O when the disks are under heavy load or when a database or consumer group has reached its I/O utilization limit, if any. IORM may also delay issuing large I/Os when it is optimizing for low latencies for OLTP workloads. These delays are reflected in the "average queue time" metrics. You can decrease the delays for critical databases and consumer groups by increasing their resource allocation.
