Category Archives: Linux

Monitoring Oracle with oratop

Intro

I recently had a situation where I was troubleshooting a system that had fallen victim to intermittent and very high load averages.  At first there was no apparent reason for the high load averages, so that sent me searching for monitoring tools to help identify the issue.  Besides OSWBB and dstat, I also used the oratop utility, which turns out to be a great way to get a quick overview and to monitor active sessions in near real time in Oracle 11g and 12c databases.  It is also RAC and ASM aware.

 

References

Here’s the MOS note for more info and downloads.
oratop – Utility for Near Real-time Monitoring of Databases, RAC and Single Instance (Doc ID 1500864.1)

 

Supported Oracle Releases

11gR2 (11.2.0.3, 11.2.0.4)

12cR1

oratop in Action

There is a user guide available for download in the MOS note, but I will show my favorite options.

 

Startup

> ./oratop -f -d -i 10 / as sysdba
The switches:
-f : detailed format option
-d : real-time wait events (the default is cumulative)
-i : update interval in seconds

The remaining part of the string is the login.  Basically, any connect string that you would use with sqlplus should work with oratop.
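
Any sqlplus-style connect string should work for the login portion; for example (the host, port, and service names below are hypothetical, not from the original system):

```shell
# OS-authenticated local connection (as above)
./oratop -f -d -i 10 / as sysdba

# EZConnect to a remote instance (dbhost01 and orclsvc are made-up names)
./oratop -f -d -i 10 system@//dbhost01:1521/orclsvc
```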

[Screenshot: oratop running with the -f and -d options]

 

Output Sections

Here’s an explanation of the sections.  These descriptions can be displayed by pressing “h” while running oratop in interactive mode.

 
1 Database
Section 1 - database
        Global Database information

   Version        : Oracle major version
   role           : database_role
   db name        : db_unique_name
   time        [s]: time as of the most recent stats (hh24:mi:ss)
   up          [T]: database uptime
   ins         [N]: total number of instance(s)
   sn        [c,N]: total user sessions (active/inactive)
   us        [c,N]: number of distinct users
   mt        [s,N]: global database memory total (sga+pga)
   fra         [N]: flashback recovery area %used, (red > 90%)
   er          [N]: diag active problem count (faults)
   % db      [s,N]: database time as %(dbtime/cpu) (red if > 99%)

 

2 Instance

Section 2 - instance
        Top 5 Instance(s) Activity
        o Ordered by Database time desc

   ID        [c,N]: inst_id (instance id)
   %CPU      [m,N]: host cpu busy %(busy/busy+idle). (red if > 90%)
   LOAD      [m,N]: current os load. (red if > 2*#cpu & high cpu)
   %DCU      [m,N]: db cpu otusef as %host cpu. (red if > 99% & high AAS)
   AAS       [s,N]: Average Active Sessions. (red if > #cpu)
   ASC       [c,N]: active Sessions on CPU
   ASI       [c,N]: active Sessions waiting on user I/O
   ASW       [c,N]: active Sessions Waiting, non-ASI (red if > ASC+ASI)
   ASP       [m,N]: active parallel sessions (F/G)
   AST       [c,N]: Active user Sessions Total (ASC+ASI+ASW)
   UST       [c,N]: user Sessions Total (ACT/INA)
   MBPS      [m,N]: i/o megabytes per second (throughput)
   IOPS      [m,N]: i/o requests per second
   IORL      [m,T]: avg synchronous single-block read latency. (red > 20ms)
   LOGR      [s,N]: logical reads per sec
   PHYR      [s,N]: physical reads per sec
   PHYW      [s,N]: physical writes per sec
   %FR       [s,N]: shared pool free %
   PGA       [s,N]: total pga allocated
   TEMP      [s,N]: temp space used
   UTPS      [s,N]: user transactions per sec
   UCPS    [c,m,N]: user calls per sec
   SSRT    [c,m,T]: sql service response time (T/call)
   DCTR      [m,N]: database cpu time ratio
   DWTR      [m,N]: database wait time ratio. (red if > 50 & high ASW)
   %DBT      [s,N]: instance %Database Time (e.g. non-rac shows 100%)

 

3 Wait Events

Section 3 - db wait events
        Top 5 Timed Events
        o Cluster-wide, non-idle
        o Ordered by wait time desc

  EVENT      : wait event name. (red if active)
        (RT) : Real-Time mode
  WAITS      : total waits
  TIME(s)    : total wait time in seconds
  AVG_MS     : average wait time in milliseconds
  PCT        : percent of wait time (all events)
  WAIT_CLASS : name of the wait class

 

 

4 Processes

Section 4 - process
        o Non-Idle processes
        o Ordered by event wait time desc

   ID          [N]: inst_id. (red if blocking)
   SID         [N]: session identifier. (red if blocking)
   SPID        [N]: server process os id
   USERNAME       : Oracle user name
   PROGRAM        : process program name
   SRV            : SERVER (dedicated, shared, etc.)
   SERVICE        : db service_name
   PGA         [N]: pga_used_mem. (red if continuously growing)
   SQL_ID/BLOCKER : sql_id or the final blocker's (inst:sid, in red)
   OPN            : operation name, e.g. select
   E/T         [T]: session elapsed time (active/inactive)
   STA            : ACTive|INActive|KILled|CAChed|SNIped
   STE            : process state, e.g. on CPU or user I/O or WAIting
   WAIT_CLASS     : wait_class for the named event
   EVENT/*LATCH   : session wait event name. Auto toggle with *latch name.
                    (red if process is hung/spin)
   W/T         [T]: event wait time. (red if > 1s)

 

 

Blocking Sessions

If any blocking sessions are present, the instance and SID of the blocker will be shown in the SQL_ID/BLOCKER column.

[Screenshot: oratop showing a blocking session in the SQL_ID/BLOCKER column]

 

Stopping

To quit the program, press any of the following keys:
“q” or “Q”, or the Esc key
Ctrl+c (to abort)

 

Execution Plans

By pressing “x” and then supplying a SQL ID, the explain plan of the SQL in question will be displayed.  Although this is a nice and quick feature, it does not show as much detail about the plan as the display_cursor function will, as shown below.  As far as I can tell, it does show the actual plan and not the estimated one that is displayed without actually running the query, but I have not thoroughly tested this.
select * from table(dbms_xplan.display_cursor('&sql_id','&child_no','typical'));

 

Space

The “a” option can be used to show info on the disk groups.
The “t” option can be used to show info on the tablespaces.

Using dstat for monitoring CPU, Memory, Disk, etc

Overview

dstat is a good replacement for vmstat, netstat, iostat, and ifstat.  It is a very lightweight tool that can consolidate and correlate data that you would normally have to capture in separate terminals.

 

Examples

 

RMAN Backup Throughput

I’ve used dstat in the past to monitor backup throughput over network interfaces.  Of course, this is only possible if the backups are going to disk over the network.

Run ifconfig to get the name of the interface you want to monitor.

Run dstat and watch the receive and send columns.

dstat -cn -N bond0.133 -C total -f 5

----total-cpu-usage---- net/bond0.1
usr sys idl wai hiq siq| recv  send
  6   7  83   4   0   0|   0     0
 11  40  44   5   0   0|1303k 1686k
 14  39  40   7   0   0|5295k 6131k
 11  36  47   6   0   0|1733k 3630k
 10  34  51   5   0   0|2334k   11M
  8  32  54   5   0   0|2189k   14M
  6  10  78   6   0   0|1418k   25M
  6   0  86   7   0   0|1277k 1563k
  8   1  84   8   0   0|1847k 2013k
  8   1  84   7   0   0|1831k 4063k
 13  47  35   5   0   0|1974k 7948k
 12  28  55   5   0   0|1269k   16M
 10  19  64   7   0   0|1308k 2674k
 10  11  73   7   0   0|1474k 2590k
 15  65  18   3   0   0|1566k 4440k
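
The recv/send columns use dstat’s 1024-based k/M suffixes, which makes comparing values awkward at a glance.  A tiny helper can normalize them to bytes (a sketch; the function name to_bytes is my own, and it assumes dstat’s default 1024-based units):

```shell
# Convert a dstat-style value like "1303k" or "25M" to bytes.
to_bytes() {
  echo "$1" | awk '{
    u = substr($0, length($0), 1)          # trailing unit character
    v = substr($0, 1, length($0) - 1)      # numeric part
    if      (u == "k") print v * 1024
    else if (u == "M") print v * 1024 * 1024
    else               print $0            # no suffix: already bytes
  }'
}

to_bytes 25M     # 26214400
to_bytes 1303k   # 1334272
```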

 

CPU, Load, Disk I/O and Memory

Running “dstat -h” will show the options available.  Below is a script with some of the options I normally use.

 cat dstat.sh

#!/bin/sh
echo
dstat -h | egrep "time|load|proc|cpu|disk|io|sys|vm" | egrep -v "cpu0|aio|filesystem|epoch|Usage"
echo

CURRENT_TIME=`date +"%m_%d_%Y_%H_%M"`

#dstat --time --load -p -c --disk --mem --top-cpu --top-bio --top-latency --output /home/oracle/rick/log/dstat_output_${CURRENT_TIME}_.csv 10
#dstat --time --load -p -c --disk --mem --top-cpu --top-bio --top-latency 10
dstat --time --load --proc --cpu --top-cpu --disk --io --sys --vm 60

exit 0

Below is a sample run on a busy server.

 ./dstat.sh

Versatile tool for generating system resource statistics
Dstat options:
  -c, --cpu              enable cpu stats
  -d, --disk             enable disk stats
  -l, --load             enable load stats
  -p, --proc             enable process stats
  -r, --io               enable io stats (I/O requests completed)
  -t, --time             enable time/date output
  -y, --sys              enable system stats
  --vm                   enable vm stats
  -v, --vmstat           equals -pmgdsc -D total

----system---- ---load-avg--- ---procs--- ----total-cpu-usage---- -most-expensive- -dsk/total- --io/total- ---system-- -----virtual-memory----
  date/time   | 1m   5m  15m |run blk new|usr sys idl wai hiq siq|  cpu process   | read  writ| read  writ| int   csw |majpf minpf alloc  free
15-10 14:17:19|15.9 21.3 24.1|0.0 0.0 150|  6   7  83   4   0   0|ora_as04_LASP0.2|  89M   33M|1141  1095 |  24k   21k|   0    31k   52k   52k
15-10 14:18:19|30.5 24.5 25.0| 13 6.2 465|  8  15  72   5   0   0|oracleORCL  1.0|  32M   21M|1674  1273 |  43k   31k|   0   122k   91k   91k
15-10 14:19:19|22.8 23.0 24.5| 11 4.6 473|  6  12  77   5   0   0|oracleORCL  1.0|  25M   15M|1645  1062 |  40k   30k|   0   115k   82k   84k
15-10 14:20:19|23.6 22.9 24.3| 12 5.2 476|  6  13  76   5   0   0|oracleORCL  1.2|  66M   13M|1716  1127 |  40k   29k|   0   117k   86k   86k
15-10 14:21:19|14.9 20.5 23.4|5.3 5.2 467|  3   6  86   5   0   0|oracleORCL  0.4|  28M   14M|1530  1069 |  32k   26k|   0    99k   57k   57k
15-10 14:22:19|20.5 20.6 23.2|8.8 6.4 466|  5  10  80   5   0   0|oracleORCL  0.8|  70M   20M|1824  1296 |  39k   32k|   0   104k   73k   73k
15-10 14:23:19|15.7 19.2 22.6| 12 7.3 476|  6  12  76   6   0   0|oracleORCL  1.5|  39M   18M|1840  1218 |  41k   33k|   0   117k   89k   89k
15-10 14:24:19|15.8 18.6 22.2| 12 7.5 278|  7  11  75   6   0   0|oracleORCL  1.3| 160M   95M|2142   885 |  37k   30k|   0    64k   77k   77k
15-10 14:25:19|28.2 22.4 23.3| 18 7.5 203|  8  22  65   5   0   0|oracleORCL  1.1|  98M   58M|2608   684 |  42k   31k|   0    70k  110k  110k
15-10 14:26:19|31.6 23.8 23.7| 12 6.8 176|  7  13  75   5   0   0|oracleORCL  1.1|  54M   14M|1944   715 |  30k   25k|   0    39k   56k   57k missed 2 ticks
15-10 14:27:19|32.0 26.4 24.6| 18 4.8 200|  6  20  70   4   0   0|oracleORCL  0.9|  30M   17M| 932  1214 |  37k   27k|   0    72k   98k   98k
15-10 14:28:19|50.7 32.5 26.8| 26 5.3 227| 10  34  52   4   0   0|oracleORCL  1.3|  55M   20M|1200  1088 |  47k   25k|   0   108k  152k  152k missed 2 ticks
15-10 14:29:19|35.4 31.4 26.8| 22 4.5 217|  7  26  64   3   0   0|oracleORCL  0.8|  13M   22M|1071  1113 |  40k   23k|   0    94k  127k  127k
15-10 14:30:19|30.9 30.8 26.8| 20 3.5 212| 11  23  63   3   0   0|oracleORCL  1.5|  70M   16M|1062  1426 |  37k   20k|   0    83k   98k   98k missed 2 ticks
15-10 14:31:19|26.7 28.9 26.4| 10 2.4 182|  7  11  79   3   0   0|oracleORCL  1.0|  38M   16M| 786   892 |  27k   19k|   0    59k   59k   59k
15-10 14:32:19|15.7 25.4 25.4| 18 2.5 208|  6  23  69   2   0   0|oracleORCL  0.7|  53M   16M| 577  1012 |  33k   17k|   0    85k  107k  108k

GNU screen Command

The screen utility can be used to start a session that persists if the terminal connection is lost, and it also lets you log in from another computer and pick up where you left off simply by reattaching to the named screen session.

 
Here are some essential commands to know:

Start new session with a name and a log file

screen -S <name> -L

List running sessions

screen -ls

Attach to a running session

screen -r <name>

Detach

Ctrl+a d

Detach, ReAttach/create

screen -d -R <name>

 
 
The “screen -S <name> -L” command will create a session with the specified name and also create a logfile with a naming convention of screenlog.<#>.
 
Once you have your session open, you can start a long running process, and then exit the session via the detach command.
 
When you are ready to reconnect, you can do so with the “screen -r <name>” command.  If you didn’t specify a name, or forgot it, run “screen -ls” to list the screen sessions.
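 
Putting those commands together, a typical workflow looks like this (the session name backupjob is made up for illustration):

```shell
screen -S backupjob -L      # start a named session, logging to screenlog.0
# ... kick off a long-running task inside the session ...
# press Ctrl+a d to detach; the task keeps running

screen -ls                  # later: list the running sessions
screen -r backupjob         # reattach from this or another login
```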
 
 
 

Root Filesystem Full – No Space Left on Device due to open files

Here’s an interesting scenario I was asked to look into recently: the root file system on an Oracle database server was filling up. Normally cleaning up disk space is straightforward; find the large and/or old files and delete them. However, in this case there was a difference in space usage reported between df and du, and the find utility could not locate any file over 1G in size.

Here’s the status of the root file system which was causing the “No Space Left on Device” error message.

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   30G     0 100% /

After deleting around 2G of old files and logs, the error went away, but the output of df -h showed the root file system slowly filling up again. The directory sizes hardly changed at all, differing only by a few MB. From the “/” directory, here are all the directories that are on the “/” file system (as seen with df -h $dir):

# du -sh *
7.7M     bin
67M     boot
3.5M     dev
8.5M     etc
1.2G     home
440M     lib
28M     lib64
16K     lost+found
4.0K     media
1.2G     mnt
6.8G     opt
1.1G     root
41M     sbin
4.0K     selinux
4.0K     srv
0     sys
12M     tmp
3.0G     usr
260M     var

Notice here that the sum of these directories only adds up to around 15G, leaving the rest of the used space unaccounted for, and the file system’s used space was still increasing.
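
Doing the arithmetic makes the gap concrete (a sketch, with the sizes hard-coded from the df and du output shown above):

```shell
# df reported ~30G used on /, while du could only account for ~15G.
# The difference is space held by files that were deleted but are still open.
df_used_gb=30
du_total_gb=15
echo "unaccounted for: $((df_used_gb - du_total_gb))G"   # unaccounted for: 15G
```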

Next was to look at open files. It is worth mentioning here that even if a file is deleted, its space may not be reclaimed while a process that created it, or is still using it, remains running. The lsof (list open files) utility will show these files.

# lsof | grep deleted
...
expdp      7271  oracle    1w      REG              253,0 16784060416    2475867 /home/oracle/nohup.out (deleted)
expdp      7271  oracle    2w      REG              253,0 16784060416    2475867 /home/oracle/nohup.out (deleted)
…
#
# ps -ef | grep 7271
oracle    7271     1 99 May31 ?        3-10:43:36 expdp               directory=DP_DIR dumpfile=exp_schema.dmp logfile=exp_schema.log schemas=schema

The above shows an export Data Pump job (pid 7271) whose process was still running at the OS level, although it was no longer running in the database. The job was probably canceled for some reason but never cleaned up; the nohup.out file had been deleted, yet the background process still held it open, so the file continued taking up space and filling the “/” partition. It is worth mentioning here that the use of nohup is NOT needed with Data Pump: the Data Pump utilities are server-side processes, so if you kick off a job and then lose your terminal for whatever reason, the job is still running.
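
To see how much space such deleted-but-open files are holding in total, the SIZE column (field 7) of lsof’s output can be summed. A sketch, fed with the captured line above rather than live lsof output; note that the same file appears once per open descriptor, so the awk below dedupes by inode (field 8) to avoid double counting:

```shell
# Sum the sizes (field 7, in bytes) of deleted-but-open files reported by
# lsof. In practice, pipe from:  lsof | grep deleted
echo 'expdp 7271 oracle 1w REG 253,0 16784060416 2475867 /home/oracle/nohup.out (deleted)' |
awk '/\(deleted\)/ && !seen[$8]++ { sum += $7 }
     END { printf "%.1f GB held by deleted files\n", sum / 1024^3 }'
# prints: 15.6 GB held by deleted files
```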

Once the expdp process 7271 was killed at the OS level, the space was reclaimed.

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   13G   16G  45% /