FlashCache
Exadata - Flash Cache
Picture of an F20 PCI Flash Card as delivered with the Exadata Storage Server X2-2
This Flash Card* (see the Addendum at the bottom) delivers 96 GB of Flash Storage, divided into 4 Flash Drives. With 4 of these cards in each Storage Server, that adds up to roughly 5 TB of Flash Storage for a Full Rack – many databases will probably fit completely into it. In an OLTP database, the number of IOs per second that can be delivered is one of the most critical factors. With the above configuration, we can deliver up to 1 million IOs per second.
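On the cell itself, the individual Flash Drives appear as cell disks of type FlashDisk; a minimal CellCLI sketch to list them (the attribute selection is illustrative):

CellCLI> **list celldisk attributes name, diskType, size where diskType = 'FlashDisk'**

On the cells used in this posting, that returns the 16 flash cell disks per cell that together back the 365.25 GB Flash Cache shown below.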
The default way to deal with the Flash Storage is to use it completely as Flash Cache. You may think of the Flash Cache as an extension of the Database Buffer Cache. It is populated automatically by the system with objects deemed useful to cache. Without any intervention, it is used that way:

CellCLI> **list flashcache detail**
         name:                   exa5cel02_FLASHCACHE
         cellDisk:               [... list of all 16 flashdisks]
         creationTime:           2011-02-02T01:14:04-08:00
         degradedCelldisks:
         effectiveCacheSize:     365.25G
         id:                     880d826b-47cc-4cf5-95fc-c36d6d315ba8
         size:                   365.25G
         status:                 normal

That was from one of the cells, connected as the celladmin user using the command line interface. The whole Flash Storage is in use for the Flash Cache on that cell. On the Database Layer, we may see the effect like this:
SQL> **select name,value from v$sysstat where name in**
     **('physical read total IO requests','cell flash cache read hits');**

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                         32344851
physical read total IO requests                                    51572139

That is already a good ratio: The majority of IO requests is resolved from the Flash Cache. We can specify storage attributes on the segment layer to influence the caching behavior of that segment:
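The statement that set this attribute is not shown in the posting; a minimal sketch of how it could have been done, assuming the SALES table from the previous posting:

SQL> **alter table sales storage (cell_flash_cache keep);**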
SQL> **select segment_name,cell_flash_cache from user_segments;**

SEGMENT_NA CELL_FL
---------- -------
SALES      KEEP

We have 3 possible values here: DEFAULT (the default), KEEP and NONE. KEEP means that the SALES table will be kept in the Flash Cache "more aggressively" than with DEFAULT. In other words, it increases the chance to read it from there. The table is the same as in the previous posting. Because of this setting and previous selects on the table that populated the Flash Cache with it, I am now able to read it from there. I reconnect to initialize v$mystat:
SQL> **connect adam/adam**
Connected.
SQL> **set timing on**
SQL> **select count(*) from sales;**

  COUNT(*)
----------
  20000000

Elapsed: 00:00:00.50

SQL> **select name,value from v$mystat natural join v$statname where name in**
     **('physical read total IO requests','cell flash cache read hits');**

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
physical read total IO requests                                       10265
cell flash cache read hits                                            10265

Every physical read request of that session was resolved from the Flash Cache.
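The cells maintain Flash Cache metrics of their own; a hedged sketch of how to look at them from CellCLI (the set of metrics varies with the cell software version):

CellCLI> **list metriccurrent where objectType = 'FLASHCACHE'**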
The second possibility to deal with the Flash Storage is to take a part of it for building ASM diskgroups upon. All files on these ASM diskgroups will then reside permanently on Flash Storage:
CellCLI> **drop flashcache**
Flash cache exa5cel01_FLASHCACHE successfully dropped

CellCLI> **create flashcache all size=100g**
Flash cache exa5cel01_FLASHCACHE successfully created

CellCLI> **create griddisk all flashdisk prefix=flashdrive**
GridDisk flashdrive_FD_00_exa5cel01 successfully created
GridDisk flashdrive_FD_01_exa5cel01 successfully created
GridDisk flashdrive_FD_02_exa5cel01 successfully created
GridDisk flashdrive_FD_03_exa5cel01 successfully created
[...]
GridDisk flashdrive_FD_14_exa5cel01 successfully created
GridDisk flashdrive_FD_15_exa5cel01 successfully created

The Flash Cache for this cell is now reduced to 100 GB; "all" means "upon all 16 Flash Drives" here. I am doing the same on the second cell – my Database Machine is limited to only one Server Node and two Cells. That gives me 32 Grid Disks based on Flash Drives to create ASM diskgroups upon:
CellCLI> **drop flashcache**
Flash cache exa5cel02_FLASHCACHE successfully dropped

CellCLI> **create flashcache all size=100g**
Flash cache exa5cel02_FLASHCACHE successfully created

CellCLI> **create griddisk all flashdisk prefix=flashdrive**
GridDisk flashdrive_FD_00_exa5cel02 successfully created
GridDisk flashdrive_FD_01_exa5cel02 successfully created
GridDisk flashdrive_FD_02_exa5cel02 successfully created
GridDisk flashdrive_FD_03_exa5cel02 successfully created
GridDisk flashdrive_FD_04_exa5cel02 successfully created
GridDisk flashdrive_FD_05_exa5cel02 successfully created
[...]
GridDisk flashdrive_FD_15_exa5cel02 successfully created

One of these Grid Disks in detail:

CellCLI> **list griddisk flashdrive_FD_10_exa5cel02 detail**
         name:                   flashdrive_FD_10_exa5cel02
         availableTo:
         cellDisk:               FD_10_exa5cel02
         comment:
         creationTime:           2011-02-02T02:56:52-08:00
         diskType:               FlashDisk
         errorCount:             0
         id:                     0000012d-e604-87f2-0000-000000000000
         offset:                 6.28125G
         size:                   16.578125G
         status:                 active
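To verify the flash-based Grid Disks of a cell in one listing, a sketch like this could be used (the attribute selection is illustrative):

CellCLI> **list griddisk attributes name, size, status where diskType = 'FlashDisk'**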
Changing to the Database Server Node to create the ASM diskgroup as sysasm:
SQL> **select path,header_status,os_mb,free_mb from v$asm_disk where path like '%flash%';**

PATH                                               HEADER_STATU      OS_MB    FREE_MB
-------------------------------------------------- ------------ ---------- ----------
o/192.168.14.9/flashdrive_FD_06_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_12_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_15_exa5cel01          CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_00_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_02_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_03_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_05_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_08_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_11_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_14_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_15_exa5cel02         CANDIDATE         16976          0
[...]

32 rows selected.

SQL> **create diskgroup flashdrive normal redundancy**
     **disk 'o/*/flashdrive*'**
     **attribute 'compatible.rdbms'='11.2.0.0.0',**
     **'compatible.asm'='11.2.0.0.0',**
     **'cell.smart_scan_capable'='TRUE',**
     **'au_size'='4M';**

Diskgroup created.

Please notice the Allocation Unit size of 4 MB, which is necessary for Exadata. Normal Redundancy is strongly recommended – each cell automatically becomes a Failure Group. In other words: should one Storage Server crash, no data will be lost, regardless of whether we use Flash Drives or spinning drives to build the diskgroups upon. After the creation, we use the diskgroup like any other:
SQL> **create tablespace veryfast datafile '+flashdrive' size 10g;**

Tablespace created.
Any segment created in this tablespace will reside on Flash Drives permanently.
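For illustration, a hypothetical segment placed there explicitly (table name and columns are made up for this sketch):

SQL> **create table hot_orders            -- hypothetical table, for illustration only**
     **(id number, order_date date, amount number)**
     **tablespace veryfast;**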
Let’s take the opportunity to give an example of Information Lifecycle Management (ILM):
SQL> **create tablespace compahigh datafile size 1g;**

Tablespace created.

SQL> **alter tablespace compahigh default compress for archive high;**

Tablespace altered.
The tablespace was created on spinning drives, because my DB_CREATE_FILE_DEST parameter points to a diskgroup built on them. Same for the next two:
SQL> **create tablespace querylow datafile size 1g;**

Tablespace created.

SQL> **alter tablespace querylow default compress for query low;**

Tablespace altered.

SQL> **create tablespace ordinary datafile size 1g;**

Tablespace created.

My plan is to store one large partitioned table partly compressed, partly uncompressed on spinning drives and partly on Flash Drives – the newest and most volatile part:
SQL> **create table sales_part**
     **(id number, flag number, product char(25), channel_id number, cust_id number,**
     **amount_sold number, order_date date, ship_date date)**
     **partition by range (order_date)**
     **interval (numtoyminterval(1,'year'))**
     **store in (veryfast)**
     **(**
     **partition archhigh values less than (to_date('01.01.1991','dd.mm.yyyy')) tablespace compahigh,**
     **partition querylow values less than (to_date('01.01.1998','dd.mm.yyyy')) tablespace querylow,**
     **partition ordi1998 values less than (to_date('01.01.1999','dd.mm.yyyy')) tablespace ordinary,**
     **partition ordi1999 values less than (to_date('01.01.2000','dd.mm.yyyy')) tablespace ordinary,**
     [...]
     **partition ordi2004 values less than (to_date('01.01.2005','dd.mm.yyyy')) tablespace ordinary,**
     **partition ordi2005 values less than (to_date('01.01.2006','dd.mm.yyyy')) tablespace ordinary**
     **);**

Table created.
This uses the 11g new feature Interval Partitioning; thanks to the STORE IN clause, new partitions are created automatically on Flash Drives. Now loading the table from the old sales table:
SQL> **alter table sales_part nologging;**

Table altered.

SQL> **insert /*+ append */ into sales_part select * from sales order by order_date;**

20000000 rows created.

Notice the order by above. It later on makes it possible not only to do partition pruning but also to use Storage Indexes if we query by ORDER_DATE or even SHIP_DATE. The single partitions now look like this:
SQL> **select partition_name,tablespace_name,bytes/1024/1024 as mb**
     **from user_segments where segment_name='SALES_PART';**

PARTITION_NAME                 TABLESPACE_NAME                        MB
------------------------------ ------------------------------ ----------
ARCHHIGH                       COMPAHIGH                               8
ORDI1998                       ORDINARY                               56
ORDI1999                       ORDINARY                               56
ORDI2000                       ORDINARY                               56
[...]
SYS_P105                       VERYFAST                               56
SYS_P106                       VERYFAST                                8

16 rows selected.

SQL> **select count(*) from sales_part partition (archhigh);**

  COUNT(*)
----------
   5330000

SQL> **select count(*) from sales_part partition (querylow);**

  COUNT(*)
----------
   5114000

SQL> **select count(*) from sales_part partition (ordi1998);**

  COUNT(*)
----------
    730000

SQL> **select count(*) from sales_part partition (sys_p101);**

  COUNT(*)
----------
    730000
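To illustrate the point about partition pruning and Storage Indexes made above, a sketch of the kind of query that would benefit (the date range is chosen for illustration):

SQL> **select sum(amount_sold)              -- prunes to partition ORDI1998**
     **from sales_part**
     **where order_date >= to_date('01.01.1998','dd.mm.yyyy')**
     **and order_date < to_date('01.01.1999','dd.mm.yyyy');**

Because the rows were inserted in date order, an additional filter on SHIP_DATE could also profit from Storage Indexes on the cells.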
During the life cycle of the data, partitions may no longer be highly volatile and can be moved to spinning drives or even get compressed:
SQL> **alter table sales_part move partition sys_p101 tablespace ordinary;**

Table altered.

SQL> **alter table sales_part move partition ordi1998 compress for query low tablespace querylow;**

Table altered.

We have no indexes in place – so there is no rebuild needed :-) The partitions now look like this:

SQL> **select partition_name,tablespace_name,bytes/1024/1024 as mb**
     **from user_segments where segment_name='SALES_PART';**

PARTITION_NAME                 TABLESPACE_NAME                        MB
------------------------------ ------------------------------ ----------
ARCHHIGH                       COMPAHIGH                               8
ORDI1998                       QUERYLOW                                8
ORDI1999                       ORDINARY                               56
ORDI2000                       ORDINARY                               56
ORDI2001                       ORDINARY                               56
[...]
SYS_P105                       VERYFAST                               56
SYS_P106                       VERYFAST                                8

16 rows selected.
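The same mechanism could carry a partition through the rest of its life cycle; for instance, a sketch of moving an aged partition into the ARCHIVE HIGH compressed tablespace (the choice of partition and target tablespace is illustrative):

SQL> **alter table sales_part move partition querylow**
     **compress for archive high tablespace compahigh;**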
Summary: Flash Storage inside the Oracle Exadata Database Machine is used completely as Flash Cache by default, effectively working as an extension of the Database Buffer Cache and delivering faster access together with a very high IO-per-second rate, which is especially important for OLTP. Additionally, we may take a part of the Flash Storage to build ASM diskgroups upon it. Files placed on these diskgroups will reside permanently on Flash Storage – no caching needed.
- Addendum: The posting reflects the state of X2. X3 cells come with 4 x F40 Flash Cards that each deliver 400 GB of Flash capacity, for a total of 1600 GB of Flash capacity per Storage Server. Given that the F40 is also faster than the old F20, and together with the new Write Back Flash Cache technology, it is even less likely than with X2 that there will be a need to build ASM diskgroups upon Flash Storage.
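Whether a cell runs its Flash Cache in WriteThrough or WriteBack mode can be checked on the cell itself; a minimal sketch:

CellCLI> **list cell attributes flashCacheMode**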
Second Addendum: Already with newer versions of X2, we introduced Flash Logging, which takes 512 MB of Flash Storage from each cell (regardless of whether it is X2 or X3). This relatively small amount reduces the capacity of the Flash Cache accordingly.
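On cells where Flash Logging is configured, it can be inspected from CellCLI; a minimal sketch (the attributes shown depend on the cell software version):

CellCLI> **list flashlog detail**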