Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)
Applies to:
Oracle Database - Enterprise Edition - Version 11.1.0.7 to 11.2.0.3 [Release 11.1 to 11.2]
Information in this document applies to any platform.
Purpose
Oracle Data Guard is a data protection and availability solution for the Oracle Database. The purpose of this document is to provide Database Administrators with best practice recommendations for configuring key parameters, database features, and operational practices that enable corruption detection, prevention, and automatic repair in an MAA or Data Guard configuration. This note also provides additional background information on each parameter, performance considerations, and relevant Oracle documentation references.
Scope
This document is intended for Database Administrators who want to learn how to prevent and detect various data block corruptions.
A corrupt block is a block that has been changed so that it differs from what Oracle Database expects to find. This note covers three data block corruption types:
- In a physical block corruption, which is also called a media corruption, the database does not recognize the block at all: the checksum is invalid, the block contains all zeros, the header and footer of the block do not match, or one of the key data block structures is incorrect, such as the data block address (DBA).
- In a logical block corruption, the contents of the block are physically sound and pass the physical block checks; however, the block is logically inconsistent. Examples of logical corruption include corruption of a row piece or index entry.
- Block corruptions caused by stray writes, lost writes, or misdirected writes can also wreak havoc on your database availability. The data block may be physically or logically correct, but its content is stale or it resides in the wrong location.
Corruptions can also be classified by their scope:
- In an intrablock corruption, the corruption occurs within the block itself and can be either a physical or a logical corruption.
- In an interblock corruption, the corruption occurs between blocks and can only be a logical corruption.
Causes of corrupted blocks
Block corruptions can be caused by various failures including, but not limited to the following:
- Faulty disks and disk controllers
- Faulty memory
- Faulty network components
- Firmware, operating system, volume manager, NFS or third party software defects
- Oracle Database software defects
Configuration Details
Configure at Primary Database:
- DB_BLOCK_CHECKSUM=FULL
- DB_BLOCK_CHECKING=FULL or MEDIUM
- DB_LOST_WRITE_PROTECT=TYPICAL
- Enable Flashback Technologies for fast point-in-time recovery from human errors (e.g. drop table, inadvertent or malicious data changes) and for fast reinstatement of a primary database following failover.
Configure at Standby Database:
- DB_BLOCK_CHECKSUM=FULL
- DB_BLOCK_CHECKING=FULL or MEDIUM
- DB_LOST_WRITE_PROTECT=TYPICAL
- Enable Flashback Technologies for fast point-in-time recovery from human errors (e.g. drop table, inadvertent or malicious data changes) and for fast reinstatement of a primary database following failover.
- Use Active Data Guard to enable Automatic Block Repair (Data Guard 11.2 onward). A sketch of the corresponding commands follows this list.
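As a minimal sketch, the following SQL*Plus commands show one way to apply these settings. All three parameters are dynamic; the fast recovery area location ('+FRA') and size (100G) are placeholder values, and on releases prior to 11.2 the database must be mounted (not open) to enable Flashback Database:
SQL> ALTER SYSTEM SET DB_BLOCK_CHECKSUM=FULL SCOPE=BOTH;
SQL> ALTER SYSTEM SET DB_BLOCK_CHECKING=MEDIUM SCOPE=BOTH;   -- or FULL
SQL> ALTER SYSTEM SET DB_LOST_WRITE_PROTECT=TYPICAL SCOPE=BOTH;
SQL> -- Flashback Database requires a fast recovery area to be configured first:
SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100G SCOPE=BOTH;
SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+FRA' SCOPE=BOTH;
SQL> ALTER DATABASE FLASHBACK ON;
The same commands apply at the standby database.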
Deploy Primary and Standby on Oracle Exadata Database Machine
In addition to the settings above that provide optimal data protection for the Oracle Database on any platform supported by Oracle, the Exadata Database Machine also implements all Oracle Hardware Assisted Resilient Data (HARD) specifications, providing a unique level of validation for Oracle block data structures. Oracle Exadata Storage Server Software detects corruptions introduced into the I/O path between the database and storage. It stops corrupted data from being written to disk, and validates data when reading the disk. This eliminates a large class of failures that the database industry had previously been unable to prevent.
Oracle Exadata Storage Server Software implements all the HARD checks, and because of its tight integration with Oracle Database, additional checks are implemented that are specific to Oracle Exadata Storage Server Software. Unlike other implementations of HARD checking, HARD checks with Oracle Exadata Storage Server Software operate completely transparently after enabling the database's DB_BLOCK_CHECKSUM parameter. The HARD checks transparently handle cases including Oracle ASM disk rebalance operations and disk failures.
The server parameter file (SPFILE) is compliant with the HARD specifications implemented by Oracle Exadata Database Machine. To fully enable HARD protection for the data in the SPFILE, the SPFILE must reside on Oracle Exadata Storage Server.
General Guidance on Performance Trade-offs
Performance implications are discussed in each of the sections below. In general, the processing that accompanies higher levels of corruption checking, automatic repair, or fast point in time recovery, will create additional overhead on primary and standby systems. While this overhead is reduced with every Oracle release as validation and repair algorithms are enhanced, the usual recommendation for conducting thorough performance testing still applies.
DB_BLOCK_CHECKSUM - Background
This parameter determines whether DBWn and the direct loader will calculate a checksum (a number calculated from all the bytes stored in the block) and store it in the cache header of every data block and redo log when writing to disk. The checksum is used to validate that a block is not physically corrupt, detecting corruptions caused by underlying disks, storage systems, or I/O systems. If checksum validation fails when it is set to FULL, Oracle will attempt to recover the block by reading it from disk (or from another instance) and applying the redo needed to fix the block. Corruptions are recorded as ORA-600 or ORA-01578 in the database or ASM alert logs.
Checksums do not ensure logical consistency of the block contents (see DB_BLOCK_CHECKING). Checksum checks happen in memory when a process reads the data or redo block into the SGA or PGA. Prior to writing an updated or new data or redo block, a new checksum is created. Potential clients of DB_BLOCK_CHECKSUM include: all foregrounds, DBWR, LGWR, LNS, RFS, ARCH, MRP, and recovery slaves.
Possible settings include:
- OFF or FALSE: When set to OFF, DBWn calculates checksums only for the SYSTEM tablespace, but not for user tablespaces. In addition, no log checksum is performed when this parameter is set to OFF.
- TYPICAL or TRUE: Checksums are verified when a block is read and the last write of the block stored a checksum.
- FULL: In addition to checks made by TYPICAL, Oracle also verifies the checksum before a change application from update/delete statements and recomputes it after the change is applied. In addition, Oracle gives every log block a checksum before writing it to the current log.
- Starting with Oracle Database 11g, CPU and cache efficiency has been improved by having the generating foreground processes perform most of the log block checksum work, while LGWR performs the remainder. Prior to Oracle Database 11g, LGWR alone performed the log block checksum, verifying the checksum of each log block generated by the foreground processes before writing it to disk.
- Best practice is to set DB_BLOCK_CHECKSUM=FULL on both the primary and standby databases, which typically incurs 4% to 5% overhead on the system. Overhead for a setting of TYPICAL ranges from 1% to 2% for an OLTP workload and has minimal effect on Redo Apply performance, but also provides less protection. If tests show that using FULL at the primary results in unacceptable performance impact, consider setting the primary to TYPICAL and the standby database to FULL for an optimal tradeoff between protection and performance (and accept that settings will have to be changed following role transitions). (A DBVERIFY example follows this list.)
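For example, checksums on an individual datafile can be spot-checked outside the instance with the DBVERIFY utility; the file path, block size, and log file name below are placeholders:
dbv FILE=/u01/oradata/prod/users01.dbf BLOCKSIZE=8192 LOGFILE=dbv_users01.log
The utility reports the number of pages examined and any pages failing checksum or other physical checks.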
DB_BLOCK_CHECKING - Background
This parameter specifies whether Oracle performs logical intra-block checking for database blocks (an in-memory semantic check). Block checking verifies block contents, including the header and user data, when changes are made to the block, and prevents in-memory corruptions from being written to disk. It performs a logical validation of the integrity of a block by walking through the data on the block, making sure it is self-consistent. When DB_BLOCK_CHECKING is set to MEDIUM or FULL, block corruptions that are detected in memory are automatically repaired by reading the good block from disk and applying the required redo. If for any reason the corruption cannot be repaired, an error is reported and the data block write is prevented. All corruptions are reported as ORA-600 or ORA-01578 errors in the database or ASM alert logs.
Possible settings include:
- OFF or FALSE - No block checking is performed for blocks in user tablespaces. However, semantic block checking for SYSTEM tablespace blocks is always turned on.
- LOW - Basic block header checks are performed after block contents change in memory (for example, after UPDATE or INSERT statements, on-disk reads, or inter-instance block transfers in Oracle RAC). LOW is of very limited use for detecting and preventing data block corruption because it does not perform any data layer checks.
- MEDIUM - All LOW checks and full semantic checks are performed for all objects except indexes (whose contents can be reconstructed by a drop and rebuild upon encountering a corruption).
- FULL or TRUE - All LOW and MEDIUM checks plus full semantic checks are performed for all objects. (A query to confirm the current settings on each database follows this list.)
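A quick way to confirm the current values of the corruption-related parameters on the primary and each standby is a simple query against V$PARAMETER (a minimal sketch):
SQL> SELECT name, value
       FROM v$parameter
      WHERE name IN ('db_block_checksum', 'db_block_checking', 'db_lost_write_protect');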
Performance testing is particularly important given that overhead is incurred on every block change. Block checking typically causes 1% to 10% overhead, but for update and insert intensive applications (such as Redo Apply at a standby database) the overhead can be much higher. OLTP compressed tables also require additional checks that can result in higher overhead depending on the frequency of updates to those tables.
If performance concerns prevent setting DB_BLOCK_CHECKING to either FULL or MEDIUM at a primary database, then it becomes even more important to enable this at the standby database. This protects the standby database from logical block corruption that would be undetected at the primary database.
Note: When DB_BLOCK_CHECKING is set on the primary database, end-to-end checksums introduced in Oracle Database 11g make it unnecessary to use DB_BLOCK_CHECKING at the standby to detect primary database corruption. Oracle, however, still recommends enabling this parameter on the standby database for the following reasons:
- The parameter is always enabled regardless of role, ensuring that the standby will have the correct setting when it assumes the primary database role.
- Even though such corruptions are very rare, enabling the parameter on the standby protects against logical corruptions that may be introduced on the standby database independent of the primary.
DB_LOST_WRITE_PROTECT - Background
This parameter enables lost write detection. A data block lost write occurs when an I/O subsystem acknowledges the completion of a block write even though the write did not actually occur in persistent storage.
Possible settings include:
- NONE: When set to NONE on either the primary database or the standby database, no lost write detection functionality is enabled.
- TYPICAL: When set on a primary database it will log buffer cache reads for read-write tablespaces in the redo log and will detect lost writes when the primary database performs media recovery. By also setting TYPICAL at a Data Guard physical standby database, the Data Guard MRP process will continuously check for lost writes that occurred to read-write tablespaces at the primary database, enabling immediate action to be taken and preventing corruptions from being applied to the standby database. It will also detect lost writes that may occur at a standby database independent of the primary.
- FULL: All logging, checking, and detecting that occurs with TYPICAL, but for both read-write and read-only tablespaces.
When a primary database lost write corruption is detected by a Data Guard physical standby database, Redo Apply (MRP) will stop and the standby will signal an ORA-752 error to explicitly indicate a primary lost write has occurred. Note that a Data Guard physical standby also detects other types of corruptions that can occur at either the primary or standby database and will likewise stop Redo Apply, in these cases signaling a more generic ORA-600 [3020] error. In cases where problems originated at the primary database this prevents corruptions from being applied to the standby database and indicates immediate action must be taken to prevent further corruption and downtime. While there is no threat to the primary database in cases where problems originate at the standby database, this provides immediate warning that corrective action must be taken to maintain data protection. For more information on lost write detection and instructions for repair from either ORA-752 or ORA-600 [3020] errors, see My Oracle Support Document 1265884.1.
Primary performance overhead is negligible when lost write protection is enabled, as the validation is performed by the standby database. For DSS and data warehouse applications, however, where there is a high volume of queries against infrequently updated, non-read-only tablespaces, primary redo generation will increase. Redo apply throughput at the standby database can be impacted due to additional data block read I/Os. In most cases, though, the impact on the primary and standby databases is negligible. The best way to estimate the impact of DB_LOST_WRITE_PROTECT on the primary or standby database is to enable it dynamically in your test environment, and eventually in your production and standby databases.
Monitor the change in application or redo apply throughput and evaluate the database statistics described below. The primary redo rate and redo size will grow (as measured by the database statistic 'redo size for lost write detection') after enabling this setting, since block read redo (BRR) is created to record the version of a block at subsequent reads. The size impact is mitigated because Oracle attempts to batch BRRs into a single redo record where possible; a typical BRR redo change is only 144 bytes and the redo header is 40 bytes on a 64-bit platform. BRRs are recorded only for physical reads from disk, since the goal is to detect lost writes or stale reads from disk. After enabling DB_LOST_WRITE_PROTECT on a physical standby database, you will see an increase in physical reads, especially for data that is updated very infrequently but read semi-frequently (forcing a physical read) on the primary database. The additional physical reads on the physical standby database are measured by the database statistic 'recovery blocks read for lost write detection', which can be extracted from the Standby Statspack report.
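For example, both statistics named above can be tracked directly (a minimal sketch; the first statistic is relevant on the primary, the second on the physical standby):
SQL> SELECT name, value
       FROM v$sysstat
      WHERE name IN ('redo size for lost write detection',
                     'recovery blocks read for lost write detection');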
For an example of the benefit of using DB_LOST_WRITE_PROTECT, refer to the Data Guard Protection From Lost-Write Corruption demo at http://www.oracle.com/technetwork/database/features/availability/demonstrations-092317.html.
Oracle Automatic Storage Management (ASM) - Background
Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good copy of the extent and copies it to the disk that had the read error.
- If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block reallocation.
- If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the write fails, the disk is taken offline.
When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.
- If the database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered successful.
- If the writes to all mirror sides fail, the database takes the appropriate action in response to a write error, such as taking the tablespace offline.
The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O. For details, see the remap command in the ASMCMD documentation.
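As a sketch, the command takes a disk group name, a disk name, and a physical block range; the names and range below are placeholders:
ASMCMD> remap DATA DATA_0001 5000-7500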
When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or application if ASM can correct the corruption automatically.
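To spot disks that have reported I/O errors, one option is to query V$ASM_DISK from the ASM instance (a minimal sketch):
SQL> SELECT group_number, name, read_errs, write_errs
       FROM v$asm_disk
      WHERE read_errs > 0 OR write_errs > 0;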
Oracle Flashback Technologies - Background
Flashback Database is used for fast point-in-time recovery to recover from human errors that cause widespread damage to a production database. It is also used for fast reinstatement of a failed primary database as a new standby database following a Data Guard failover. Flashback Database uses flashback logs to rewind an Oracle Database to a previous point in time. See My Oracle Support Document 565535.1 for Flashback Database best practices. See Section 13.2 of Data Guard Concepts and Administration for fast reinstatement using Flashback Database. Note that new optimizations in Oracle Database 11.2.0.2 reduce the impact on load operations when Flashback Database is enabled.
Flashback Database can also be used on a Data Guard standby database to enable point-in-time recovery of the primary database at a schema level without requiring that the entire primary database be flashed back. For example, take the case of a primary database hosting a number of applications and different schemas where an incorrect batch job was run for one of the applications. Assuming no object dependencies between schemas, you can perform a granular repair of the primary database in the following manner:
- Stop the application where point-in-time recovery is required and let other applications continue to run.
- Flashback the standby database to the desired point in time and extract a good copy of the affected schema.
- Repair the primary database by replacing the schema using the good copy from the standby, and restart the application.
- Restart redo apply on the standby database. Data Guard will resynchronize the standby with the primary, reapplying all redo, including the bad batch job run and the repair that followed it. (A sketch of the standby-side commands follows this list.)
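A minimal sketch of the standby-side commands for the flashback and restart steps, assuming redo apply is managed directly with SQL*Plus; the timestamp is a placeholder:
SQL> -- Stop redo apply and rewind the standby (the standby must be mounted, not open, for FLASHBACK DATABASE):
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> FLASHBACK DATABASE TO TIMESTAMP TO_TIMESTAMP('2011-03-01 02:00:00', 'YYYY-MM-DD HH24:MI:SS');
SQL> ALTER DATABASE OPEN READ ONLY;
SQL> -- Extract the good schema copy (for example, with Data Pump over a database link,
SQL> -- since the standby is open read-only), repair the primary, then restart apply:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;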
More granular levels of repair from human errors (e.g. drop table, inadvertent or malicious updates or DDLs) are enabled by other Oracle Flashback technologies that include: Flashback Query, Flashback Version Query, Flashback Transaction, Flashback Transaction Query, Flashback Table and Flashback Drop. Flashback Drop only requires configuring a Recycle Bin. All other features use automatic undo management. See Section 4.2.7 of Oracle Database HA best practices for detailed best practices on the full complement of Oracle Flashback Technologies.
Starting in Oracle 11g, manual RMAN block media recovery automatically searches flashback logs for good copies of blocks, helping to repair physical data block corruptions quickly.
Oracle Data Guard
Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. One of the key benefits of Oracle Data Guard is its continuous Oracle-aware validation of all changes using multiple checks for physical and logical consistency of structures within an Oracle data block and redo before updates are applied to a standby database. This isolates the standby database from being impacted by data corruptions that can occur on the primary system.
Active Data Guard Automatic Block Repair - Background
If Oracle detects a physical block corruption on either the primary or a standby database in a configuration that uses Active Data Guard, Oracle will automatically repair the corrupt block using a valid copy from the other database. This repair is transparent to primary database applications and users; no application modifications are necessary. If the nature of the corruption makes automatic repair impossible (e.g. a file header corruption, the 60-second timeout for repairing one block being exceeded, or the number of outstanding block corruptions reaching 100), an ORA-1578 error is returned to the application, and manual block media recovery can be executed at a later time. Automatic Block Repair requires an Active Data Guard standby and Data Guard real-time apply, and the following database initialization parameters must be configured on the standby database:
- The LOG_ARCHIVE_CONFIG parameter is configured with a DG_CONFIG list, and a LOG_ARCHIVE_DEST_n parameter is configured for the primary database.
- The FAL_SERVER parameter is configured and its value contains an Oracle Net service name for the primary database. (A sketch of these settings follows this list.)
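A sketch of those standby-side settings, assuming boston is the primary's DB_UNIQUE_NAME and Oracle Net service name and chicago is the standby's; both names are placeholders:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_CONFIG='DG_CONFIG=(boston,chicago)';
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=boston ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';
SQL> ALTER SYSTEM SET FAL_SERVER='boston';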
Starting in Oracle 11g Release 2, manual RMAN block media recovery on the primary will automatically try to use a real-time query physical standby database.
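For instance, a corrupt block can be repaired manually with RMAN block media recovery; the datafile and block numbers below are placeholders, and RECOVER CORRUPTION LIST repairs every block currently listed in V$DATABASE_BLOCK_CORRUPTION:
RMAN> RECOVER DATAFILE 4 BLOCK 123;
RMAN> RECOVER CORRUPTION LIST;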
For an example of the benefits of using Active Data Guard automatic block repair, refer to the following demos at http://www.oracle.com/technetwork/database/features/availability/demonstrations-092317.html:
- Automatic Block Repair at a Primary Database - Active Data Guard demo
- Automatic Block Repair at a Standby Database - Active Data Guard demo
Additional Operational Practices to detect block corruptions
- Query v$database_block_corruption periodically on the primary and standby databases for any persistent data block corruptions
- Scan and alert on any ORA-1578s in the primary, standby and ASM alert.logs for every instance
- Scan for IO errors in the database alert.log, ASM alert.log and cell alert.log files (Example: “IO Error” or “I/O error”)
- Execute RMAN backup and restore operations which will automatically check for physical block corruptions. This is particularly helpful for data that is rarely queried or updated.
- Periodically execute RMAN commands with “check logical” option to detect logical block corruptions.
- Use the ANALYZE statement with the VALIDATE STRUCTURE option if you suspect inter-object or interblock corruptions (see the examples at the end of this section).
For example, in rare cases such as hardware or other system failures, an index can become corrupted and not perform correctly. When validating the index, you can confirm that every entry in the index points to the correct row of the associated table. If the index is corrupt, you can drop and re-create it.
If a table, index, or cluster is corrupt, you should drop it and re-create it. If a materialized view is corrupt, perform a complete refresh and ensure that you have remedied the problem. If the problem is not corrected, drop and re-create the materialized view.
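Hedged examples of the checks above; the table name is a placeholder, and BACKUP VALIDATE reads the database without producing backup pieces:
RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;
SQL> SELECT file#, block#, blocks, corruption_type FROM v$database_block_corruption;
SQL> ANALYZE TABLE hr.employees VALIDATE STRUCTURE CASCADE;
The CASCADE option cross-checks the table against its indexes, which performs the index-to-table validation described above.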
Automatic and Manual ASM DISK Repair in 11g
Oracle 11g New Features Tips
Oracle 11g ASM is capable of repairing corrupt blocks on the fly when reading them. When Oracle finds a corrupt block, it attempts to read the mirror copy of the respective extent and repairs the corrupt version automatically. This can only happen on disk groups with at least normal redundancy, and it only happens for blocks that are actually read. The new remap command of asmcmd can be used to recover a range of unreadable blocks: ASM reads the bad blocks from the good copy and restores them to a new location on the disk. This can also be run from the Enterprise Manager interface.
ASM Disk Group Checks Enhanced
The check command was already available in Oracle ASM 10g, where it was necessary to specify what to check. As of Oracle ASM 11g, the command has been simplified: it is no longer necessary to specify what to check, because the CHECK keyword now performs all available checks in one pass.
Either of the following commands, issued in an ASM instance, checks all disks in the disk group and writes a message to the ASM instance's alert log with information about the run of the check and about any corrupt blocks found. The first command also recovers the bad blocks; REPAIR is the default for the check command.
SYS AS SYSDBA @ +ASM SQL> ALTER DISKGROUP data CHECK REPAIR;
The next command only checks for bad blocks but does not recover them:
SYS AS SYSDBA @ +ASM SQL> ALTER DISKGROUP data CHECK NOREPAIR;
[oracle@rhas4 ~]$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM/alert/log.xml
The following listing shows the message in the XML version of the alert log of the ASM instance:
<txt>SUCCESS: check of diskgroup DATA found no errors
</txt>
</msg>
<msg time='2007-10-22T10:26:33.700+02:00' org_id='oracle' comp_id='asm'
 client_id='' type='UNKNOWN' level='16' host_id='rhas4.mydomain'
 host_addr='192.168.1.1' module='OMS' pid='12810'>
<txt>SUCCESS: ALTER DISKGROUP DATA CHECK
</txt>
</msg>
Packing the full check of all metadata into one single command makes the checks easier, but the drawback is the additional overhead on the I/O subsystem, which might slow down performance.
ASM metadata
ASM storage is described in a number of so-called fixed tables. It is possible to view important parts of this metadata via (g)v$ views which, in fact, read the information from these x$ tables. In 11gR1, there are 28 x$ tables which fully describe the ASM storage and 22 v$ views which show information about ASM. Some important x$ tables are:
x$kffxp: mapping between files and allocation units
x$kfdpartner: disk-to-partner relationship of two disks of a given ASM disk group which hold a mirror copy of the same extent
x$kfdat: details of all free and used allocation units
List of dynamic performance views related to ASM:
SYS AS SYSDBA @ +ASM SQL> select view_name from v$fixed_view_definition where view_name like '%ASM%';
VIEW_NAME
------------------------------
GV$ASM_TEMPLATE
V$ASM_TEMPLATE
GV$ASM_ALIAS
V$ASM_ALIAS
GV$ASM_FILE
V$ASM_FILE
GV$ASM_CLIENT
V$ASM_CLIENT
GV$ASM_DISKGROUP
V$ASM_DISKGROUP
GV$ASM_DISKGROUP_STAT
V$ASM_DISKGROUP_STAT
GV$ASM_DISK
V$ASM_DISK
GV$ASM_DISK_STAT
V$ASM_DISK_STAT
GV$ASM_DISK_IOSTAT
V$ASM_DISK_IOSTAT
GV$ASM_OPERATION
V$ASM_OPERATION
GV$ASM_ATTRIBUTE
V$ASM_ATTRIBUTE
22 rows selected.
List of fixed tables related to ASM:
SYS AS SYSDBA @ +ASM SQL> select name from v$fixed_table where name like 'X$KF%';
NAME
------------------------------
X$KFALS
X$KFCB
X$KFCCE
X$KFBH
X$KFDSK
X$KFDSK_STAT
X$KFDAT
X$KFDFS
X$KFDDD
X$KFGRP
X$KFGRP_STAT
X$KFGMG
X$KFGBRB
X$KFKID
X$KFKLIB
X$KFMDGRP
X$KFNCL
X$KFNSDSKIOST
X$KFTMTA
X$KFFIL
X$KFFXP
X$KFDPARTNER
X$KFCLLE
X$KFENV
X$KFVOL
X$KFVOLSTAT
X$KFVOFS
X$KFVOFSV