Monthly Archives: October 2014

Monitoring Oracle with oratop

Intro

I recently had a situation where I was troubleshooting a system that had fallen victim to some intermittent and very high load averages.  At first, there were no apparent reasons for the high load averages, so that sent me searching for some monitoring tools to help identify the issue.  Besides using OSWBB and dstat, I also used the oratop utility which turns out to be a great way to get a quick overview and monitor active sessions in near real time in Oracle 11g and 12c databases.  It is also RAC and ASM aware.

 

References

Here’s the MOS note for more info and downloads.
oratop – Utility for Near Real-time Monitoring of Databases, RAC and Single Instance (Doc ID 1500864.1)

 

Supported Oracle Releases

11gR2 (11.2.0.3, 11.2.0.4)

12cR1

oratop in Action

There is a user guide available for download on the MOS note, but I will show my favorite options.

 

Startup

> ./oratop -f  -d -i 10 / as sysdba
The switches:
f – detailed format option
d – real time wait events ( the default is cumulative )
i – update interval in seconds

The remaining part of the string is the login.  Basically, any string that you use for sqlplus should work with oratop.

oratop_df

 

Output Sections

Here’s an explanation of the sections.  These menus can be found when pressing “h” when running oratop in interactive mode.

 
1 Database
Section 1 - database
        Global Database information

   Version        : Oracle major version
   role           : database_role
   db name        : db_unique_name
   time        [s]: time as of the most recent stats (hh24:mi:ss)
   up          [T]: database uptime
   ins         [N]: total number of instance(s)
   sn        [c,N]: total user sessions (active/inactive)
   us        [c,N]: number of distinct users
   mt        [s,N]: global database memory total (sga+pga)
   fra         [N]: flashback recovery area %used, (red > 90%)
   er          [N]: diag active problem count (faults)
   % db      [s,N]: database time as %(dbtime/cpu) (red if > 99%)

 

2 Instance

Section 2 - instance
        Top 5 Instance(s) Activity
        o Ordered by Database time desc

   ID        [c,N]: inst_id (instance id)
   %CPU      [m,N]: host cpu busy %(busy/busy+idle). (red if > 90%)
   LOAD      [m,N]: current os load. (red if > 2*#cpu & high cpu)
   %DCU      [m,N]: db cpu otusef as %host cpu. (red if > 99% & high AAS)
   AAS       [s,N]: Average Active Sessions. (red if > #cpu)
   ASC       [c,N]: active Sessions on CPU
   ASI       [c,N]: active Sessions waiting on user I/O
   ASW       [c,N]: active Sessions Waiting, non-ASI (red if > ASC+ASI)
   ASP       [m,N]: active parallel sessions (F/G)
   AST       [c,N]: Active user Sessions Total (ASC+ASI+ASW)
   UST       [c,N]: user Sessions Total (ACT/INA)
   MBPS      [m,N]: i/o megabytes per second (throughput)
   IOPS      [m,N]: i/o requests per second
   IORL      [m,T]: avg synchronous single-block read latency. (red > 20ms)
   LOGR      [s,N]: logical reads per sec
   PHYR      [s,N]: physical reads per sec)
   PHYW      [s,N]: physical writes per sec
   %FR       [s,N]: shared pool free %
   PGA       [s,N]: total pga allocated
   TEMP      [s,N]: temp space used
   UTPS      [s,N]: user transactions per sec
   UCPS    [c,m,N]: user calls per sec
   SSRT    [c,m,T]: sql service response time (T/call)
   DCTR      [m,N]: database cpu time ratio
   DWTR      [m,N]: database wait time ratio. (red if > 50 & high ASW)
   %DBT      [s,N]: instance %Database Time (e.g. non-rac shows 100%)

 

3 Wait Events

Section 3 - db wait events
        Top 5 Timed Events
        o Cluster-wide, non-idle
        o Ordered by wait time desc

  EVENT      : wait event name. (red if active)
        (RT) : Real-Time mode
  WAITS      : total waits
  TIME(s)    : total wait time in seconds)
  AVG_MS     : average wait time in milliseconds
  PCT        : percent of wait time (all events)
  WAIT_CLASS : name of the wait class

 

 

4 Processes

Section 4 - process
        o Non-Idle processes
        o Ordered by event wait time desc

   ID          [N]: inst_id. (red if blocking)
   SID         [N]: session identifier. (red if blocking)
   SPID        [N]: server process os id
   USERNAME       : Oracle user name
   PROGRAM        : process program name
   SRV            : SERVER (dedicated, shared, etc.)
   SERVICE        : db service_name
   PGA         [N]: pga_used_mem. (red if continuously growing)
   SQL_ID/BLOCKER : sql_id or the final blocker's (inst:sid, in red)
   OPN            : operation name, e.g. select
   E/T         [T]: session elapsed time (active/inactive)
   STA            : ACTive|INActive|KILled|CAChed|SNIped
   STE            : process state, e.g. on CPU or user I/O or WAIting
   WAIT_CLASS     : wait_class for the named event
   EVENT/*LATCH   : session wait event name. Auto toggle with *latch name.
                    (red if process is hung/spin)
   W/T         [T]: event wait time. (red if > 1s)

 

 

Blocking Sessions

If there are blocking sessions present, the instance and sid will be shown in the SQL ID/BLOCKER column.

oratop_blocking_session

 

Stopping

To quit the program, user may press any of the following keyboard keys:
Character “q” or “Q”, or Esc key
Ctrl+c (to abort).

 

Execution Plans

By pressing “x”, and then giving the sqlid, the explain plan of the sql in question will be displayed.  Although this is a nice and quick feature, it does not show as much detail about the plan as the display_cursor procedure will as shown below.  As far as I can tell, it does show the actual plan and not the estimated ones that are displayed without acttually running the query, but I have not thorougly tested this.
select * from table(dbms_xplan.display_cursor(‘&sql_id’,’&child_no’,’typical’));

 

Space

The “a” option can be used to show info on the disk groups.
The “t” option can be used to show info on the tablespaces.

Oracle Golden Gate – OGG-01028 Incompatible record

Intro

The incompatible record error is one I have seen a couple of times now with DB2 as the source and it seems to hing around something that occurs on weekends when they take the system down for maintenance.  I have yet to pinpoint the cause, but this workaround does get everything back up and running.

 

Symptoms

From the report file:

Switching to next trail file ./dirdat/qt000141 at 2014-10-20 09:14:57 due to EOF, with current RBA 99999985
Opened trail file ./dirdat/qt000141 at 2014-10-20 09:14:57


Source Context :
  SourceModule            : [er.processloop]
  SourceID                : [/scratch/aime1/adestore/views/aime1_adc4150267/oggcore/OpenSys/src/app/er/processloop.cpp]
  SourceFunction          : [process_extract_loop]
  SourceLine              : [874]
  ThreadBacktrace         : [7] elements
                          : [/apps/oracle/ggs/libgglog.so(CMessageContext::AddThreadContext()+0x1e) [0x7fc89734bc3e]]
                          : [/apps/oracle/ggs/libgglog.so(CMessageFactory::CreateMessage(CSourceContext*, unsigned int, ...)+0x2cc) [0x7fc89734498c]]
                          : [/apps/oracle/ggs/libgglog.so(_MSG_ERR_ER_GENERIC_FAILURE(CSourceContext*, char const*, CMessageFactory::MessageDisposition)+0x31) [0x7fc8
9732f839]]
                          : [/apps/oracle/ggs/replicat(process_extract_loop()+0x31f1) [0x53e581]]
                          : [/apps/oracle/ggs/replicat(main+0x732) [0x54faa2]]
                          : [/lib64/libc.so.6(__libc_start_main+0xfd) [0x3e5ce1ed1d]]
                          : [/apps/oracle/ggs/replicat(__gxx_personality_v0+0x332) [0x4c233a]]

2014-10-20 09:15:17  ERROR   OGG-01028  Incompatible record (101) in ./dirdat/qt000141, rba 1155 (getting header).

 

Fix

Use the logdump utility to gather the required info from the trail file.

 

> cd $GG_HOME
> ./logdump

Oracle GoldenGate Log File Dump Utility for Oracle
Version 11.2.1.0.7 16934304 OGGCORE_11.2.1.0.7_PLATFORMS_130709.1600.1

Copyright (C) 1995, 2013, Oracle and/or its affiliates. All rights reserved.



Logdump 22 >log to incompat_record.log
--- Session log incompat_record.log opened 2014/10/20 09:31:32.434.370 ---
Logdump 23 >open /apps/oracle/ggs/dirdat/qt000141
Current LogTrail is /apps/oracle/ggs/dirdat/qt000141
Logdump 24 >ghdr on
Logdump 25 >pos 1155
Reading forward from RBA 1155
Logdump 26 >n
Bad record found at RBA 1155, format 5.50 Header token)
 8004 8647                                         | ...G
Logdump 27 >sfh prev
Scanned    1155 bytes and did not find a header
Bad record found at RBA 1155, format 5.50 Header token)
 8004 8647                                         | ...G
Logdump 28 >sfh next
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :   119  (x0077)   IO Time    : 2014/10/18 00:26:10.455.904
IOType     :    15  (x0f)     OrigNode   :   255  (xff)
TransInd   :     .  (x01)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :          0       AuditPos   : 58745476396002
Continued  :     N  (x00)     RecCount   :     1  (x01)

2014/10/18 00:26:10.455.904 FieldComp            Len   119 RBA 1158
Name: schema_name.table_name
After  Image:                                             Partition 4      m
 0000 000a 0000 0000 0000 0335 dba7 0001 000a 0000 | ...........5........
 0000 0000 0000 0001 0021 0034 0000 c4e3 d3f0 f0f1 | .........!.4........
 c2f3 4040 4040 4040 4040 4040 4040 4040 4040 4040 | ..@@@@@@@@@@@@@@@@@@
 4040 4040 4040 4040 4040 4040 4040 4040 4040 4040 | @@@@@@@@@@@@@@@@@@@@
 4040 4040 0022 001f 0000 f2f0 f1f4 60f1 f060 f1f8 | @@@@."........`..`..
 7af0 f07a f2f6 7af5 f54b f8f7 f0f8 f5f4 f0f0 f0   | z..z..z..K.........

Logdump 29 >exit


Since the issue is with the header of the file, I went to the next record instead of the previous one.  Looks like RBA 1158 is good, and doing a count from here works so the rest of the file is good.  So, let’s alter the replicat to this next good record and start it up.

 

> ./ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.7 16934304 OGGCORE_11.2.1.0.7_PLATFORMS_130709.1600.1_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Jul 18 2013 07:04:28

Copyright (C) 1995, 2013, Oracle and/or its affiliates. All rights reserved.

GGSCI > info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
REPLICAT    ABENDED     DB_REP    56:32:07      00:35:28

GGSCI > info *

REPLICAT   DB_REP  Last Started 2014-10-20 10:05   Status ABENDED
Checkpoint Lag       57:39:50 (updated 00:02:38 ago)
Log Read Checkpoint  File ./dirdat/qt000140
                     2014-10-18 00:26:09.793226  RBA 99981632


GGSCI > alter replicat DB2P_REP, extseqno 141, extrba 1158
REPLICAT altered.

GGSCI > start replicat DB_REP

Sending START request to MANAGER ...
REPLICAT DB_REP starting

 

After starting the replicat, normal processing of records started again.

 

References

 

How to Recover from an OGG-01028 Incompatible Record If the Trail Is Not Corrupt (Doc ID 1507462.1)

Using dstat for monitoring CPU, Memory, Disk, etc

Overview

dstat is a good tool for replacing vmstat, netstat, iostat, and ifstat.  It it a very lightweight tool that can consolidate and correlate data  that  you would normally have to capture in different terminals.

 

Examples

 

RMAN Backup Throughput

I’ve used dstat before in the past to monitor backup throughput over interfaces.  Of course, this is possible if the backups are going to disk.

Run ifconfig to get the name of the interface you want to monitor.

Run dstat and watch the receive and send columns.

dstat -cn -N bond0.133 -C total -f 5

----total-cpu-usage---- net/bond0.1
usr sys idl wai hiq siq| recv  send
  6   7  83   4   0   0|   0     0
 11  40  44   5   0   0|1303k 1686k
 14  39  40   7   0   0|5295k 6131k
 11  36  47   6   0   0|1733k 3630k
 10  34  51   5   0   0|2334k   11M
  8  32  54   5   0   0|2189k   14M
  6  10  78   6   0   0|1418k   25M
  6   0  86   7   0   0|1277k 1563k
  8   1  84   8   0   0|1847k 2013k
  8   1  84   7   0   0|1831k 4063k
 13  47  35   5   0   0|1974k 7948k
 12  28  55   5   0   0|1269k   16M
 10  19  64   7   0   0|1308k 2674k
 10  11  73   7   0   0|1474k 2590k
 15  65  18   3   0   0|1566k 4440k

 

CPU, Load, Disk I/O and Memory

Running “dstat -h” will show the options available.  Below is a script that shows some of the options I normally use. 

 cat dstat.sh

#!/bin/sh
echo
dstat -h | egrep "time|load|proc|cpu|disk|io|sys|vm" | egrep -v "cpu0|aio|filesystem|epoch|Usage"
echo

CURRENT_TIME=`date +"%m_%d_%Y_%H_%M"`

#dstat --time --load -p -c --disk --mem --top-cpu --top-bio --top-latency --output /home/oracle/rick/log/dstat_output_${CURRENT_TIME}_.csv 10
#dstat --time --load -p -c --disk --mem --top-cpu --top-bio --top-latency 10
dstat --time --load --proc --cpu --top-cpu --disk --io --sys --vm 60

exit 0

Below that is a sample run on a busy server.

 ./dstat.sh

Versatile tool for generating system resource statistics
Dstat options:
  -c, --cpu              enable cpu stats
  -d, --disk             enable disk stats
  -l, --load             enable load stats
  -p, --proc             enable process stats
  -r, --io               enable io stats (I/O requests completed)
  -t, --time             enable time/date output
  -y, --sys              enable system stats
  --vm                   enable vm stats
  -v, --vmstat           equals -pmgdsc -D total

----system---- ---load-avg--- ---procs--- ----total-cpu-usage---- -most-expensive- -dsk/total- --io/total- ---system-- -----virtual-memory----
  date/time   | 1m   5m  15m |run blk new|usr sys idl wai hiq siq|  cpu process   | read  writ| read  writ| int   csw |majpf minpf alloc  free
15-10 14:17:19|15.9 21.3 24.1|0.0 0.0 150|  6   7  83   4   0   0|ora_as04_LASP0.2|  89M   33M|1141  1095 |  24k   21k|   0    31k   52k   52k
15-10 14:18:19|30.5 24.5 25.0| 13 6.2 465|  8  15  72   5   0   0|oracleORCL  1.0|  32M   21M|1674  1273 |  43k   31k|   0   122k   91k   91k
15-10 14:19:19|22.8 23.0 24.5| 11 4.6 473|  6  12  77   5   0   0|oracleORCL  1.0|  25M   15M|1645  1062 |  40k   30k|   0   115k   82k   84k
15-10 14:20:19|23.6 22.9 24.3| 12 5.2 476|  6  13  76   5   0   0|oracleORCL  1.2|  66M   13M|1716  1127 |  40k   29k|   0   117k   86k   86k
15-10 14:21:19|14.9 20.5 23.4|5.3 5.2 467|  3   6  86   5   0   0|oracleORCL  0.4|  28M   14M|1530  1069 |  32k   26k|   0    99k   57k   57k
15-10 14:22:19|20.5 20.6 23.2|8.8 6.4 466|  5  10  80   5   0   0|oracleORCL  0.8|  70M   20M|1824  1296 |  39k   32k|   0   104k   73k   73k
15-10 14:23:19|15.7 19.2 22.6| 12 7.3 476|  6  12  76   6   0   0|oracleORCL  1.5|  39M   18M|1840  1218 |  41k   33k|   0   117k   89k   89k
15-10 14:24:19|15.8 18.6 22.2| 12 7.5 278|  7  11  75   6   0   0|oracleORCL  1.3| 160M   95M|2142   885 |  37k   30k|   0    64k   77k   77k
15-10 14:25:19|28.2 22.4 23.3| 18 7.5 203|  8  22  65   5   0   0|oracleORCL  1.1|  98M   58M|2608   684 |  42k   31k|   0    70k  110k  110k
15-10 14:26:19|31.6 23.8 23.7| 12 6.8 176|  7  13  75   5   0   0|oracleORCL  1.1|  54M   14M|1944   715 |  30k   25k|   0    39k   56k   57k missed 2 ticks
15-10 14:27:19|32.0 26.4 24.6| 18 4.8 200|  6  20  70   4   0   0|oracleORCL  0.9|  30M   17M| 932  1214 |  37k   27k|   0    72k   98k   98k
15-10 14:28:19|50.7 32.5 26.8| 26 5.3 227| 10  34  52   4   0   0|oracleORCL  1.3|  55M   20M|1200  1088 |  47k   25k|   0   108k  152k  152k missed 2 ticks
15-10 14:29:19|35.4 31.4 26.8| 22 4.5 217|  7  26  64   3   0   0|oracleORCL  0.8|  13M   22M|1071  1113 |  40k   23k|   0    94k  127k  127k
15-10 14:30:19|30.9 30.8 26.8| 20 3.5 212| 11  23  63   3   0   0|oracleORCL  1.5|  70M   16M|1062  1426 |  37k   20k|   0    83k   98k   98k missed 2 ticks
15-10 14:31:19|26.7 28.9 26.4| 10 2.4 182|  7  11  79   3   0   0|oracleORCL  1.0|  38M   16M| 786   892 |  27k   19k|   0    59k   59k   59k
15-10 14:32:19|15.7 25.4 25.4| 18 2.5 208|  6  23  69   2   0   0|oracleORCL  0.7|  53M   16M| 577  1012 |  33k   17k|   0    85k  107k  108k