Storage – Glenn Fawcett's Database blog

Direct NFS access to Sun Storage 7410 with Oracle 11g and Solaris… configuration and verification

During the course of experimentation with 11gR2, I was given some space on a Sun Storage 7410 NAS.  In the past, NAS meant using NFS with obscure mount options that seemed to vary from platform to platform.  So, at first I went scrambling for the “best practices” for using Oracle on NAS with Solaris.

There is a nice Metalink article Note:359515.1 with the latest information for all platforms.  This Metalink note does include the “tcp” option which is not necessary on Solaris.  So it boiled down to the following mount options for using Oracle data files on NAS devices with Solaris.

rw,bg,hard,nointr,rsize=32768,wsize=32768,noac,forcedirectio,vers=3,suid

But wait, what about the new 11g feature to use direct NFS “dNFS”?    More searching…

configuring dNFS on Solaris

This is a fairly simple process.  Although Oracle dNFS configuration is fairly well documented for Linux, I will post my interpretation and commentary to help other Solaris users who might want to configure dNFS.

First, mount the NFS share just as you would have in the past.  Oracle still needs to see the file system from the OS point of view.  You don’t have to use the same mount options as before, but you might want them anyway since OS tools may still access the mount.  You would most likely place these options in the “/etc/vfstab” file, but I will just show the mount command.

mount -o rw,bg,hard,nointr,rsize=32768,\
wsize=32768,noac,forcedirectio,vers=3,suid \
toromondo.west:/export/glennf /ar1

Second, you have to link the direct NFS libraries in place of ODM.  This is a little clunky, but not terrible.

cd $ORACLE_HOME/lib
cp libodm11.so libodm11.so_stub
ln -s libnfsodm11.so libodm11.so
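
As a side note, and something to verify against your exact release, later 11g patch sets also ship a make target in ins_rdbms.mk that performs this relinking for you:

cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dnfs_on     # dnfs_off switches back to the stub library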

Third, create the “$ORACLE_HOME/dbs/oranfstab” file.  This file defines the various details Oracle needs to directly access the NFS share.  You can configure multiple paths so that Oracle can multiplex access to the NFS share.  This is for redundancy and load balancing.  There is another Metalink article ID:822481.1 that details how to configure dNFS with multiple paths across the same subnet and force the OS to not route packets.  This is a great feature, which I will try once I get some more network plumbing.  For now, I just did the simplest configuration as shown below.

cat $ORACLE_HOME/dbs/oranfstab
server: toromondo.west
path: toromondo.west
export: /export/glennf mount:/ar1
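
For reference, a multi-path oranfstab (per the Metalink note mentioned above) just adds local/path pairs, one per network; the addresses below are purely hypothetical:

server: toromondo.west
local: 192.168.1.10
path: 192.168.1.101
local: 192.168.2.10
path: 192.168.2.101
export: /export/glennf mount: /ar1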

Finally, you will be able to see if this takes effect by looking at the “alert.log” file.  When Oracle starts up it places debug information in the alert.log file so we can see if Oracle is using Direct NFS or not.

grep NFS alert_*.log
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 2.0
Direct NFS: attempting to mount /export/glennf on filer toromondo.west defined in oranfstab
Direct NFS: channel config is:
Direct NFS: mount complete dir /export/glennf on toromondo.west mntport 38844 nfsport 2049
Direct NFS: channel id [0] path [toromondo.west] to filer [toromondo.west] via local [] is UP
Direct NFS: channel id [1] path [toromondo.west] to filer [toromondo.west] via local [] is UP

That’s all there is to it.  Hopefully, you will find this useful.


Posted in Oracle, Storage Tagged: 7410, dNFS, NAS, NFS, ODM, Oracle, Solaris

Monitoring Direct NFS with Oracle 11g and Solaris… peeling back the layers of the onion.

When I start a new project, I like to check performance from as many layers as possible.  This helps to verify things are working as expected and helps me to understand how the pieces fit together.  In my recent work with dNFS and Oracle 11gR2, I started down the path of monitoring performance and was surprised to see that things are not always as they seem.  This post will explore the various ways to monitor and verify performance when using dNFS with Oracle 11gR2 and Sun Open Storage “Fishworks”.

why is iostat lying to me?

iostat(1M)” is one of the most common tools to monitor IO.  Normally, I can see activity on local devices as well as NFS mounts via iostat.  But, with dNFS, my device seems idle during the middle of a performance run.

bash-3.0$ iostat -xcn 5
cpu
us sy wt id
8  5  0 87
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0    6.2    0.0   45.2  0.0  0.0    0.0    0.4   0   0 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf
cpu
us sy wt id
7  5  0 89
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0   57.9    0.0  435.8  0.0  0.0    0.0    0.5   0   3 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf

From the DB server perspective, I can’t see the IO.  I wonder what the array looks like.

what does fishworks analytics have to say about IO?

The analytics package available with Fishworks is the best way to verify performance with Sun Open Storage.  This package is easy to use, and indeed I was quickly able to verify activity on the array.

There are 48,987 NFSv3 operations/sec and ~403MB/sec going through the nge13 interface, so this array is cooking pretty good.  Let’s take a peek at the network on the DB host.

nicstat to the rescue

nicstat is a wonderful tool developed by Brendan Gregg at Sun to show network performance.  Nicstat really shows you the critical data for monitoring network speeds and feeds by displaying packet size, utilization, and rates of the various interfaces.

root@saemrmb9> nicstat 5
Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
15:32:11    nxge0    0.11    1.51    1.60    9.00   68.25   171.7  0.00   0.00
15:32:11    nxge1  392926 13382.1 95214.4 95161.8  4225.8   144.0  33.3   0.00

So, from the DB server point of view, we are transferring about 390MB/sec… which correlates to what we saw with the analytics from Fishworks.  Cool!

why not use DTrace?

Ok, I wouldn’t be a good Sun employee if I didn’t use DTrace once in a while.  I was curious to see the Oracle calls for dNFS so I broke out my favorite tool from the DTrace Toolkit. The “hotuser” tool shows which functions are being called the most.  For my purposes, I found an active Oracle shadow process and searched for NFS related functions.

root@saemrmb9> hotuser -p 681 |grep nfs
^C
oracle`kgnfs_getmsg                                         1   0.2%
oracle`kgnfs_complete_read                                  1   0.2%
oracle`kgnfswat                                             1   0.2%
oracle`kgnfs_getpmsg                                        1   0.2%
oracle`kgnfs_getaprocdata                                   1   0.2%
oracle`kgnfs_processmsg                                     1   0.2%
oracle`kgnfs_find_channel                                   1   0.2%
libnfsodm11.so`odm_io                                       1   0.2%
oracle`kgnfsfreemem                                         2   0.4%
oracle`kgnfs_flushmsg                                       2   0.4%
oracle`kgnfsallocmem                                        2   0.4%
oracle`skgnfs_recvmsg                                       3   0.5%
oracle`kgnfs_serializesendmsg                               3   0.5%

So, yes it seems Direct NFS is really being used by Oracle 11g.

performance geeks love V$ tables

There is a set of V$ tables that allows you to sample the performance of dNFS as seen by Oracle.  I like V$ tables because I can write SQL scripts until I run out of Mt. Dew.  The following views are available to monitor activity with dNFS.

  • v$dnfs_servers: Shows a table of servers accessed using Direct NFS.
  • v$dnfs_files: Shows a table of files now open with Direct NFS.
  • v$dnfs_channels: Shows a table of open network paths (or channels) to servers for which Direct NFS is providing files.
  • v$dnfs_stats: Shows a table of performance statistics for Direct NFS.

With some simple scripting, I was able to monitor the NFS IOPS by sampling the v$dnfs_stats view.  The script simply samples the nfs_read and nfs_write operations, pauses for 5 seconds, then samples again to determine the rate.
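
The full PL/SQL version of the script appears in a later post; as a rough illustration of the idea, the same sampling can be done from the shell.  This sketch assumes a local “/ as sysdba” connection and is not the script that produced the output below.

#!/bin/bash
# Sketch only: sample v$dnfs_stats every 5 seconds and print timestamp|IOPS.
get_io() {
  sqlplus -s "/ as sysdba" <<'EOF' | awk 'NF {print $1}'
set heading off feedback off pagesize 0
select sum(nfs_read + nfs_write) from v$dnfs_stats;
EOF
}
while true
do
  b=$(get_io); sleep 5; a=$(get_io)
  echo "$(date +%H:%M:%S)|$(( (a - b) / 5 ))"
done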

timestmp|nfsiops
15:30:31|48162
15:30:36|48752
15:30:41|48313
15:30:46|48517.4
15:30:51|48478
15:30:56|48509
15:31:01|48123
15:31:06|48118.8

Excellent!  Oracle shows 48,000 NFS IOPS which agrees with the analytics from Fishworks.

what about the AWR?

Consulting the AWR shows “Physical reads” in agreement as well.

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               93.1            1,009.2       0.00       0.00
       DB CPU(s):               54.2              587.8       0.00       0.00
       Redo size:            4,340.3           47,036.8
   Logical reads:          385,809.7        4,181,152.4
   Block changes:                9.1               99.0
  Physical reads:           47,391.1          513,594.2
 Physical writes:                5.7               61.7
      User calls:           63,251.0          685,472.3
          Parses:                5.3               57.4
     Hard parses:                0.0                0.1
W/A MB processed:                0.1                1.1
          Logons:                0.1                0.7
        Executes:           45,637.8          494,593.0
       Rollbacks:                0.0                0.0
    Transactions:                0.1

so, why is iostat lying to me?

iostat(1M) monitors IO to devices and NFS mount points.  But with Oracle Direct NFS, the mount point is bypassed and each shadow process simply mounts files directly.  To monitor dNFS traffic you have to use other methods as described here.  Hopefully, this post was instructive on how to peel back the layers in order to gain visibility into dNFS performance with Oracle and Sun Open Storage.


Posted in Oracle, Storage Tagged: 7410, analytics, dNFS, monitoring, network, NFS, Oracle, performance, Solaris

Direct NFS vs Kernel NFS bake-off with Oracle 11g and Solaris… and the winner is

NOTE:  Please see my next entry on Kernel NFS performance and the improvements that come with the latest Solaris.

==============

After experimenting with dNFS it was time to do a comparison with the “old” way.  I was a little surprised by the results, but I guess that really explains why Oracle decided to embed the NFS client into the database :)

bake-off with OLTP style transactions

This experiment was designed to load up a machine, a T5240, with OLTP style transactions until no more CPU was available.  The dataset was big enough to push about 36,000 IOPS read and 1,500 IOPS write during peak throughput.  As you can see, dNFS performed well, which allowed the system to scale until the DB server CPU was fully utilized.  On the other hand, Kernel NFS throttles after 32 users and is unable to use the available CPU to scale transactional throughput.

lower cpu overhead yields better throughput

A common measure for benchmarks is to figure out how many transactions per CPU are possible.  Below, I plotted the CPU consumed for a particular transaction rate.  This chart shows the total measured CPU (user+system) for a given TPS rate.

dNFS vs kNFS (TPS/CPU)

As expected, the transaction rate per CPU is greater when using dNFS vs kNFS.  Please do note that this is a T5240 machine, which has 128 threads or virtual CPUs.  I don’t want to go into the semantics of sockets, cores, pipelines, and threads, but thought it was at least worth noting.  Oracle sees a thread of a T5240 as a CPU, so that is what I used for this comparison.

silly little torture test

When doing the OLTP style tests with a normal sized SGA, I was not able to fully utilize the 10gigE interface or the Sun 7410 storage.   So, I decided to do a silly little micro benchmark with a real small SGA.  This benchmark just does simple read-only queries that essentially result in a bunch of random 8k IO.  I have included the output from the Fishworks analytics below for both kNFS and dNFS.

Random IOPS with kNFS and Sun Open Storage

Random IOPS with dNFS and Sun 7410 open storage

I was able to hit ~90K IOPS with 729MB/sec of throughput with just one 10gigE interface connected to Sun 7410 unified storage.  This is an excellent result with Oracle 11gR2 and dNFS for a random IO test… but there is still more bandwidth available.  So, I decided to do a quick DSS style query to see if I could break the 1GB/sec barrier.

===dNFS===
SQL> select /*+ parallel(item,32) full(item) */ count(*) from item;
 COUNT(*)
----------
 40025111
Elapsed: 00:00:06.36

===kNFS===
SQL> select /*+ parallel(item,32) full(item) */ count(*) from item;
 COUNT(*)
----------
 40025111

Elapsed: 00:00:16.18

kNFS table scan

dNFS table scan

Excellent, with a simple scan I was able to do 1.14GB/sec with dNFS more than doubling the throughput of kNFS.

configuration notes and basic tuning

I was running on a T5240 with Solaris 10 Update 8.

$ cat /etc/release
Solaris 10 10/09 s10s_u8wos_08a SPARC
Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009

This machine has a built-in 10gigE interface which uses multiple threads to increase throughput.  Out of the box, there is very little to tune as long as you are on Solaris 10 Update 8.  I experimented with various settings, but found that only basic tcp settings were required.

ndd -set /dev/tcp tcp_recv_hiwat 400000
ndd -set /dev/tcp tcp_xmit_hiwat 400000
ndd -set /dev/tcp tcp_max_buf 2097152
ndd -set /dev/tcp tcp_cwnd_max 2097152
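
The same ndd command can be used to verify the settings took effect; note that ndd changes do not survive a reboot, so they are typically reapplied from a startup script.

ndd -get /dev/tcp tcp_recv_hiwat
ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_max_buf
ndd -get /dev/tcp tcp_cwnd_max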

Finally, on the storage front, I was using the Sun Storage 7410 Unified Storage server as the NFS server for this test.  This server was born out of the Fishworks project and is an excellent platform for deploying NFS based databases… watch out NetApp.

what does it all mean?

dNFS wins hands down.  Standard kernel NFS essentially allows only one client connection per “mount” point, so eventually we see data queued up at the mount point.  This clips the throughput far too soon.  Direct NFS solves this problem by having each Oracle shadow process mount the device directly.  Also, with dNFS, all the usual tuning and mount point options are not necessary.  Oracle knows what options are most efficient for transferring blocks of data and configures the connection properly.

When I began down this path of discovery, I was only using NFS attached storage because nothing else was available in our lab… and IO was not initially a huge part of the project at hand.  Being a performance guy who benchmarks systems to squeeze out the last percentage point of performance, I was skeptical about NAS devices.  Traditionally, NAS was limited by slow networks and clumsy SW stacks.   But times change.   Fast 10gigE networks and Fishworks storage combined with clever SW like Direct NFS really showed this old dog a new trick.


Posted in Oracle, Storage Tagged: 11g, 7410, analytics, dNFS, fishworks, NAS, NFS, Oracle, performance, Solaris, Sun

Kernel NFS fights back… Oracle throughput matches Direct NFS with latest Solaris improvements

After my recent series of postings, I was made aware of David Lutz’s blog on NFS client performance with Solaris.  It turns out that you can vastly improve the performance of NFS clients using a new parameter to adjust the number of client connections.

root@saemrmb9> grep rpcmod /etc/system
set rpcmod:clnt_max_conns=8
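
Settings in /etc/system take effect at the next reboot.  If you want to check the value the running kernel is actually using, something along these lines should work (assuming the rpcmod symbol is visible to mdb on your release):

echo 'rpcmod`clnt_max_conns/D' | mdb -k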

This parameter was introduced in a patch for various flavors of Solaris.  For details on the various flavors, see David Lutz’s recent blog entry on improving NFS client performance.  Soon, it should be the default in Solaris, making out-of-the-box client performance scream.

DSS query throughput with Kernel NFS

I re-ran the DSS query referenced in my last entry and now kNFS matches the throughput of dNFS with 10gigE.

Kernel NFS throughput with Solaris 10 Update 8 (set rpcmod:clnt_max_conns=8)

This is great news for customers not yet on Oracle 11g.  With this latest fix to Solaris, you can match the throughput of Direct NFS on older versions of Oracle.  In a future post, I will explore the CPU impact of dNFS and kNFS with OLTP style transactions.


Posted in Oracle, Storage Tagged: 11g, 7410, analytics, database, dNFS, NAS, NFS, Oracle, performance, Solaris, Sun, tuning

Simple script to monitor dNFS activity

In my previous series regarding “dNFS” vs “kNFS”, I referenced a script that monitors dNFS traffic by sampling the v$dnfs_stats view.  A reader requested the script, so I thought it might be useful to a wider audience.  This simple script samples some values from the view and outputs a date/timestamp along with rate information.  I hope it is useful.

------ mondnfs.sql -------

set serveroutput on format wrapped size 1000000
create or replace directory mytmp as '/tmp';

DECLARE
n number;
m number;
x number := 1;
y number := 0;

bnio number;
anio number;

nfsiops number;

fd1 UTL_FILE.FILE_TYPE;

BEGIN
fd1 := UTL_FILE.FOPEN('MYTMP', 'dnfsmon.log', 'w');

LOOP
bnio := 0;
anio := 0;

select  sum(nfs_read+nfs_write) into bnio from v$dnfs_stats;

n := DBMS_UTILITY.GET_TIME;
DBMS_LOCK.SLEEP(5);

select  sum(nfs_read+nfs_write) into anio from v$dnfs_stats;

m := DBMS_UTILITY.GET_TIME - n ;

nfsiops := ( 100*(anio - bnio) / m ) ;

UTL_FILE.PUT_LINE(fd1, TO_CHAR(SYSDATE,'HH24:MI:SS') || '|' || nfsiops );
UTL_FILE.FFLUSH(fd1);
x := x + 1;
END LOOP;

UTL_FILE.FCLOSE(fd1);
END;
/

========================
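
One way to run it (assuming a local “/ as sysdba” connection and the MYTMP directory pointing at /tmp as above) is to start the sampler in the background and tail the log it writes.  The loop never exits, so kill the sqlplus session when you are done.

sqlplus / as sysdba @mondnfs.sql &
tail -f /tmp/dnfsmon.log     # lines look like HH24:MI:SS|nfsiops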


Filed under: Oracle, Storage

Open Storage S7000 with Exadata… a good fit for ETL/ELT operations.

I have worked on Exadata V2 performance projects with Kevin Closson for nearly a year now and have had the opportunity to evaluate several methods of loading data into a data warehouse. The most common, and by far the fastest method, involves the use of “External Tables”. External tables allow the user to define a table object made up of text files that live on a file system.   Using External Tables allows for standard SQL parallel query operations to be used to load data into permanent database tables.

SQL> alter session enable parallel dml ;
SQL> insert /*+ APPEND */ into mytable select * from ext_tab ;

With the size and power of Exadata, businesses are creating larger and larger data warehouses.  There will often be dozens of machines that collect and stage data for ingest by the data warehouse.  This means the staging area for these flat files must be huge, really fast, and accessible from multiple networks.

What options are available for staging input files?

With Exadata V2, or any RAC environment, flat-file data has to be present on all nodes in order to fully utilize parallel query.  The natural first choice with Exadata V2 is to use DBFS.

DBFS comes with Exadata and allows for easy clustering across all Exadata database nodes.  The real data store for DBFS is a set of database tables residing in a tablespace within the database machine.  The DBFS client program is then used to mount the DBFS filesystem such that it appears to the Linux user to be just another file system.  This allows for file system data to be managed just like any other database data while using the full power of Exadata.  DBFS is quite fast and works well for housing external tables, but it does cut down on the storage available for the data warehouse.  Also, since DBFS is simply a client on an Exadata database node, it uses CPU resources on the database machine to initially transfer or create the flat files.

Open Storage S7000 a natural staging area for Exadata

If you want to extend the amount of storage available to stage data for your warehouse, then the S7000 is an excellent choice.  The S7000 can stage files off traditional networks using 1gigE and 10gigE connections.  This allows multiple machines to seamlessly connect to the S7000 in order to stage data for ingest.  This activity has no effect on the Exadata users since the S7000 is a self-contained storage server – unlike DBFS, which uses CPU cycles from the database grid to manage and store the flat-file data.

Once the data is on the S7000, we can use IPoIB and connect directly into the high-bandwidth Infiniband network that is part of Exadata V2.  This allows the S7000 to be positioned neatly between Exadata and the traditional gigE networks.

what about performance?

As part of a larger project, I was able to run a quick test.  I had the following:

  • S7410 with 12 drives
  • 128 x 1GB files on a share
  • 8 db nodes active (x4170) with the share mounted on all nodes.

I created an external table across all the files and performed two tests:

  1. Select count(*).
    SQL> select count(*) from ext_tab;
  2. Insert as Select “IAS”
    SQL> alter session enable parallel dml ;
    SQL> insert /*+APPEND */ into mytable select * from ext_tab;
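
The DDL for ext_tab is not shown in the post; a rough sketch of what an external table over the staged flat files might look like is below.  The directory path, column list, delimiter, and file names are all hypothetical.

sqlplus / as sysdba <<'EOF'
create or replace directory stage_dir as '/stage';

create table ext_tab (
  id     number,
  tdate  varchar2(20),
  txt    varchar2(100)
)
organization external (
  type oracle_loader
  default directory stage_dir
  access parameters (
    records delimited by newline
    fields terminated by '|'
  )
  location ('file_001.dat', 'file_002.dat')  -- ...one entry per staged file
)
reject limit unlimited
parallel;
EOF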

Both when querying and loading data with “IAS”, I was able to get 1.2GB/sec throughput as I saw with my earlier tests with S7000 and 10gigE.  That is over 4TB/hr with just one head node for the S7410.  With a clustered configuration and multiple mount points, the load rate could be increased even further.

summary

The Sun Open Storage S7410 server is an excellent choice for managing file system data.  With the ability to connect to multiple networks, it is a perfect fit to stage data for Exadata environments as well.


Filed under: Exadata, Oracle, Storage

Exadata drives exceed the laws of physics… ASM with intelligent placement improves IOPS

I recently had an interesting time with a customer who is all too familiar with SANs.  SAN vendors typically use IOPS/drive sizing numbers of 180 IOPS per drive.  This is a good conservative measure for SAN sizing, but the drives are capable of much more, and indeed we state higher with Exadata.  So, how could this be possible?  Does Exadata have an enchantment spell that makes the drives magically spin faster?  Maybe a space-time warp to service IO?

The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high performance 600GB 15K rpm drives.  This works out to be about 300 IOs per second per drive.  At first glance, you might notice that 300 IOPS for a drive that spins at 250 revolutions per second seems strange.  But really, it only means that you have to service, on average, more than one IO per revolution.  So, how do you service more than one IO per revolution?

Drive command queuing and short stroking

Modern drives have the ability to queue up more than one IO at a time.  If queues are deep enough and the seek distance is short enough, it is more than possible to exceed one IO per revolution.  As you increase the queue, the probability of having an IO in the queue that can be serviced before a full revolution increases.  Lots of literature exists on this topic and indeed many have tested this phenomenon.  A popular site, “Tom’s Hardware”, has tested a number of drives and shows that with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.

This effect of servicing more than one IO per revolution is enhanced when the seek distances are short.  There is an old benchmark trick to use only the outer portion of the drive to shrink the seek distance.  This technique combined with command queuing increases the probability of servicing more than one IO per revolution.

But how can this old trick work with real world environments?

ASM intelligent data placement to the rescue

ASM has a feature, “Intelligent Data Placement” (IDP), that optimizes the placement of data such that the most active data resides on the outer portions of the drive.  The drive is essentially split into “Hot” and “Cold” regions.  This care in placement helps to reduce the seek distance and achieve a higher IOPS/drive.  This is the realization of an old benchmark trick, using a real feature in ASM.
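
The post does not show the DDL side of IDP; for reference, it is driven from ASM with statements along these lines.  The disk group and file names here are hypothetical, and the disk group compatibility attributes must be 11.2 or higher.

sqlplus / as sysasm <<'EOF'
-- move an existing file to the hot (outer) region; data moves on the next rebalance
alter diskgroup data modify file '+DATA/orcl/datafile/users.259.679156903'
  attribute (hot mirrorhot);
EOF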

the proof is in the pudding… “calibrate” command shows drive capabilities

The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage.  The throughput and IOPS of both the drives and the Flash modules can be tested at any point to see if they are performing up to expectations.  The calibrate command uses the popular Orion IO test utility designed to mimic Oracle IO patterns.  This utility randomly seeks over the first half of the drive in order to show the capabilities of the drives.  I have included an example output from an X2-2 machine below.

CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0  on drive [20:0     ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1  on drive [20:1     ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10    ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11    ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2  on drive [20:2     ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3  on drive [20:3     ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4  on drive [20:4     ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5  on drive [20:5     ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6  on drive [20:6     ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7  on drive [20:7     ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8  on drive [20:8     ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9  on drive [20:9     ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3  on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.

As you can see,  the drives can actually be driven even higher than the stated 300 IOPS per drive.

So, why can’t SANs achieve this high number?

A SAN that is dedicated to one server with one purpose should be able to take advantage of command queuing.  But SANs are not typically configured in this manner.  SANs are a shared, general purpose disk infrastructure used by many departments and applications, from database to email.  When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset.  In Solaris, SAN vendors require the setting of the “sd_max_throttle” parameter, which limits the amount of IO presented to the SAN.  This is typically set very conservatively so as to protect the shared SAN resource by queuing the IO on the OS.

long story short…

A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.

Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.


Filed under: Exadata, Oracle, Storage

Tuning is in the eye of the beholder… Memory is memory right?

It is human nature to draw from experience to make sense of our surroundings.  This holds true in life and in performance tuning.  A veteran systems administrator will typically tune a system differently than an Oracle DBA would.  This is fine, but often what is obvious to one is not to the other.  It is sometimes necessary to take a step back and tune from another perspective.

I have recently run across a few cases where a customer was tuning “Sorts” in the database by adding memory.  Regardless of your perspective, everyone knows memory is faster than disk, and the goal of any good tuner is to do as much in memory as possible.  So, when the systems administrator noticed that the “TEMP” disks for Oracle were doing a tremendous amount of IO, the answer was obvious, right?

RamDisk to the rescue

To solve this problem, the savvy systems administrator added a RAM disk to the database.  Since it was only for “TEMP” space, this seemed reasonable.

ramdiskadm -a oratmp1 1024m
/dev/ramdisk/oratmp1

Indeed, user performance improved.  There are some minor issues around recovery upon system reboot or failure that are annoying, but they are easily addressed with startup scripts.  So, SLAs were met and everyone was happy.  And so things were fine for a few years.

Double the HW means double the performance… right?

Fast forward a few years.  The system was upgraded to keep up with demand by doubling the amount of memory and CPU resources.  Everything should be faster, right?  Well, not so fast.  This action increased the NUMA ratio of the machine, and after doubling memory and CPU the average user response time doubled from ~1 second to 2 seconds.  Needless to say, this was not going to fly.  Escalations were mounted and the pressure to resolve this problem reached a boiling point.  The Solaris support team was contacted by the systems administrator, and some of the best kernel engineers in the business began to dig into the problem, searching for ways to make the “ramdisk” respond faster in the face of an increased NUMA ratio.

A fresh set of eyes

Since I have worked with the Solaris support engineers on anything Oracle performance related for many years, they asked me to take a look.  I took a peek at the system and noticed the ramdisk in use for TEMP.  To me this seemed odd, but I continued to look at SQL performance.  Things became clear once I saw that the “sort_area_size” was left at the default.

It turns out Oracle was attempting to do in-memory sorts, but with the default settings all users were spilling out to temp.  With 100s of users on the system, this became a problem real fast.  I had the customer increase the sort_area_size until the sorts occurred in memory without the added overhead of spilling out to disk (albeit fast disk).  With this slight adjustment, the average user response time was better than it had ever been.
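
Not from the original writeup, but a quick way to confirm this kind of spilling is to compare the cumulative in-memory and on-disk sort counters:

sqlplus / as sysdba <<'EOF'
select name, value from v$sysstat
where name in ('sorts (memory)', 'sorts (disk)');
EOF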

lessons learned

  • Memory is memory, but how you use it makes all the difference.
  • It never hurts to broaden your perspective and get a second opinion.

Filed under: Linux, Oracle, Solaris, Storage Tagged: Oracle, ramdisk, Solaris, sort_area_size, temp, tuning

Analyzing IO at the Exadata Cell level… a simple tool for IOPS.

Lately I have been drawn into a fair number of discussions about IO characteristics while helping customers run benchmarks.  I have been working with a mix of developers, DBAs, sysadmins, and storage admins.  As I have learned, every group has their own perspective – certainly when it comes to IO and performance.

  • Most DBAs want to see data from the DB point of view, so AWRs or EM work just fine.
  • Most system admins look at storage from the filesystem or ASM disk level.
  • Storage admins want to see what is going on within the array.
  • Performance geeks like myself like to see all up and down the stack :)

As part of pulling back the covers, I came up with a simple little tool for showing IOPS at the cell level.

Mining IO statistics from cellcli

The cellsrv process collects data about various events and performance metrics in an Exadata storage cell.  I certainly am a huge fan of the table and index usage data gathered using the “pythian_cell_cache_extract.pl” script written by Christo Kutrovsky.  It really provides a great look inside the Exadata Smart Flash Cache.  So, this got me thinking: what about IOPS data?

With the introduction of the Write Back Flash cache in X3, there is much more analysis about what is going to flash vs disk – and how what is written to flash is flushed to disk.

To look at all the current metrics gathered from the storage cells in your Exadata or SuperCluster, you can run “cellcli -e list metriccurrent” on all the storage cells.  The “metriccurrent” parameters are updated every minute by cellsrv to store performance data.  There are a few convenient parameters that can be used to sum up all the IOPS.

  • CD_IO_RQ_R_LG_SEC + CD_IO_RQ_R_SM_SEC
  • CD_IO_RQ_W_LG_SEC + CD_IO_RQ_W_SM_SEC

These parameters show the number of IO/sec for reads and writes.  By mining this data and breaking it down by “FD” vs “CD” celldisks, you can see hit ratios for reads from an overall cell point of view, and now you can also see how many writes are going to FLASH vs DISK.

The “ciops-all.sh” script looks at all the cells, sums up all the IOPS, and reports the findings.  This is very useful to get a quick look at the IO profile in the cells.
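
The script itself is not included in the post; a minimal sketch of the idea might look like the following.  It assumes a dcli group file listing the cells, the celladmin user, and that the metric value lands in the 4th field once dcli prefixes each line with the cell name – all of which you should verify in your environment.

#!/bin/bash
# Sketch only: sum small+large read/write requests per second across all cells,
# split by celldisk name prefix (FD_* = flash celldisks, CD_* = spinning disks).
sum_metric() {   # $1 = metric name, $2 = celldisk prefix (FD or CD)
  dcli -g ~/cell_group -l celladmin "cellcli -e list metriccurrent $1" |
    awk -v p="$2" '$3 ~ "^"p {gsub(/,/,"",$4); s+=$4} END {printf "%d\n", s+0}'
}
fr=$(( $(sum_metric CD_IO_RQ_R_SM_SEC FD) + $(sum_metric CD_IO_RQ_R_LG_SEC FD) ))
dr=$(( $(sum_metric CD_IO_RQ_R_SM_SEC CD) + $(sum_metric CD_IO_RQ_R_LG_SEC CD) ))
fw=$(( $(sum_metric CD_IO_RQ_W_SM_SEC FD) + $(sum_metric CD_IO_RQ_W_LG_SEC FD) ))
dw=$(( $(sum_metric CD_IO_RQ_W_SM_SEC CD) + $(sum_metric CD_IO_RQ_W_LG_SEC CD) ))
echo "FLASH_READ_IOPS: $fr";  echo "DISK_READ_IOPS: $dr"
echo "FLASH_WRITE_IOPS: $fw"; echo "DISK_WRITE_IOPS: $dw"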

[oracle@exa6db01 WB]$ ./ciops-all.sh
FLASH_READ_IOPS: 6305
DISK_READ_IOPS: 213
FLASH_WRITE_IOPS: 488203
DISK_WRITE_IOPS: 6814
TOTAL_NUMBER_OF_DRIVES: 84
WRITE_PCT_to_FLASH: 98
READ_PCT_from_FLASH: 96
IOPS_PER_DISK: 83

This can be very helpful when trying to figure out if you need to go with high performance or high capacity disks.  This case shows most IO going to flash, with only 83 IOPS spilled to each disk.  So, in this case HC disks would be a fine choice.  With a simple modification, I made the “ciops-mon.sh” script to print out the throughput every few minutes to graph the results over time.

ciops_data_x3-2

This has been helpful as I have been investigating and explaining the inner workings of the Exadata smart flash cache.  Hopefully, you will find this useful when trying to analyze and understand Exadata Cell level IO with your workload.


Filed under: Exadata, Linux, Oracle, Solaris, Storage

Dtrace probes in Oracle 12c… v$kernel_io_outlier is populated by dtrace!!

Oracle 12c certainly has some great features, but for a performance guy like myself, the performance monitoring features are particularly interesting.  There are three new v$ tables that track anomalies in the IO path.  The idea is to provide more information about really poorly performing IO that lasts more than 500ms.

  • V$IO_OUTLIER : tracks the attributes of an IO.  The size and latency, as well as ASM information, are recorded.
  • V$LGWRIO_OUTLIER : tracks information specifically on Log writer IO.

These two tables are going to be useful to monitor when performance issues occur.  I can already see the SQL scripts to monitor this activity starting to pile up.  But, there is one little extra table that dives even further into the IO stack using Dtrace.

  • “V$KERNEL_IO_OUTLIER” : This table dives into the KERNEL to provide information about Kernel IO.  It uses my old friend Dtrace to show where the waits are occurring when Kernel IO is in play, along with the time for every step involved in the setup and teardown of Kernel IO.  This information allows us to more easily debug anomalies in the IO stack.
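
A trivial first look at the new views (not from the original post) is simply to count what they have captured so far:

sqlplus / as sysdba <<'EOF'
select count(*) from v$io_outlier;
select count(*) from v$lgwrio_outlier;
select count(*) from v$kernel_io_outlier;
EOF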

Back in 2009, when Oracle was buying Sun, I posted “Oracle buys Sun! Dtrace probes for Oracle?” and mused about how cool that would be… It is good to know that someone was listening :)


Filed under: Exadata, Linux, Oracle, Solaris, Storage

Analyzing IO at the Exadata Cell level… iostat summary

While analyzing Write-Back cache activity on Exadata storage cells, I wanted something to interactively monitor IO while I was running various tests.  The problem is summarizing the results from ALL the storage cells.  So, I decided to use my old friend “iostat” and a quick and easy script to roll up the results for both DISK and FLASH.  This allowed me to monitor the IOPS, IO size, wait times, and service times.

The “iostat-all.sh” tool shows the following data:

day           time  device  r      w   rs       ws     ss    aw    st
---------------------------------------------------------------------
 2013-06-24 14:40:11 DISK  47  40252   54  2667941  66.15  0.28  0.07
 2013-06-24 14:40:11 FLASH  9  40354  322  2853674  70.70  0.13  0.13
 2013-06-24 14:41:13 DISK  48  39548   80  2691362  67.95  0.31  0.08
 2013-06-24 14:41:13 FLASH  9  53677  324  3975687  74.06  0.14  0.13
…
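
The iostat-all.sh script is not included in the post; as a hedged starting point, the raw per-cell samples that such a rollup needs could be gathered like this (assuming a dcli cell group file), with the DISK/FLASH aggregation done afterwards in awk:

dcli -g ~/cell_group -l root "iostat -x -k 5 2" > /tmp/cell_iostat.out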

Hopefully this will be useful for those that like to dive into the weeds using our good old friends.


Filed under: Exadata, Linux, Oracle, Storage Tagged: Exadata, iostat, storage cell, storage cells

Analyzing IO at the Cell level with cellcli… a new and improved script

Recently I had the pleasure of corresponding with Hans-Peter Sloot.  After looking at my simple tool in this post to gather cell IO data from cellcli, he took it several steps further and created a nice Python version that goes to the next level to pull IO statistics from the cells.

current_rw_rq.py

This script breaks down the IO by “Small” and “Large”, as is commonly done by Enterprise Manager.  It also provides a summary by cell.  Here is a sample output from this script.

Hans-Peter also added two other scripts to drill into historical data stored in cellcli.  Thanks for sharing your tools and further expanding my toolbox!


Filed under: Exadata, Linux, Oracle, Storage

“external table write” wait events… but I am only running a query?

I was helping a customer debug some external table load problems.  They are developing some code to do massive inserts via external tables.  As the code was being tested, we saw a fair number of tests that were doing simple queries of an external table.  I expected to see “external table read” wait events, but was surprised when we saw more “external table write” wait events than reads.

I thought this was due to writes to the “log” file and possibly the “bad” file, but I had to be sure.  I searched the docs but could not find a reference to this wait event.  Specifically, I was seeing the following:

WAIT #139931007587096: nam='external table write' ela= 7 filectx=139931005791096 file#=13 size=41 obj#=-1 tim=1398264093597968
WAIT #139931007587096: nam='external table write' ela= 3 filectx=139931005791096 file#=13 size=89 obj#=-1 tim=1398264093597987

I searched on how to debug the filectx and file# but still couldn’t find anything.  So, I resorted to my good old friend “strace” from the Linux side of the house.  By running “strace” on the oracle shadow process, I was able to confirm that these write events were indeed going to the LOG file for the external table.
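
The exact strace invocation is not shown in the post; something along these lines (the PID is hypothetical) would produce the trace file that is grepped below:

strace -f -tt -e trace=write -p 12345 -o strace-truss-trace.txt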

mylap:EXTW glennf$ egrep 'write\(13' strace-truss-trace.txt
 write(13, "\n\n LOG file opened at 04/23/14 0"..., 41) = 41
 write(13, "KUP-05004: Warning: Intra sour"..., 100) = 100
 write(13, "Field Definitions for table ET_T"..., 36) = 36
 write(13, " Record format DELIMITED, delim"..., 43) = 43
 write(13, " Data in file has same endianne"..., 51) = 51
 write(13, " Rows with all null fields are "..., 41) = 41
 write(13, "\n", 1) = 1
 write(13, " Fields in Data Source: \n", 26) = 26
 write(13, "\n", 1) = 1
 write(13, " ID "..., 47) = 47
 write(13, " Terminated by \"7C\"\n", 25) = 25
 write(13, " Trim whitespace same as SQ"..., 41) = 41
 write(13, " TDATE "..., 46) = 46
  ....
  ....

Each time you open an external table, the time is logged as well as the table definition.  We have some very wide tables, so there was actually more data logged than queried.  With the proper amount of data now in the dat files, we are indeed seeing more “external table read” requests as expected.  Regardless, this was a fun exercise.

So, the moral of the story… Sometimes you have turn over a few rocks and drill down a bit to find the pot of gold.

 


Filed under: Exadata, Linux, Oracle, Storage, Uncategorized
