
Exadata Flashdisk in Poor Performance Status After Replacement

I recently encountered a bad flashdisk on an Exadata X2-2 machine and had it replaced. That part is simple enough: open a ticket with Oracle and schedule the hardware replacement. However, after the replacement, two other flash modules on the same card went into a “poor performance” state. Oracle support advised me to follow the note below to fix it.

Flash Disks may report ‘Not Present’ or ‘Poor Performance’ after FDOM/Flash Disk Replacement [ID 1306635.1]

The note says it covers versions 11.2.1.2.1 to 11.2.2.2.0 [Release 11.2], but I was on version 11.2.3.2.1; the steps in the note still worked for me.
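If you need to confirm the storage software version on a cell, imageinfo (run as root on the cell) reports it. As a quick sketch:

[root@exaxcel07 ~]# imageinfo | grep 'Active image version'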

A couple of notes about 1306635.1, as I encountered two errors with its statements:

1) The “drop celldisk all flashdisk” command is listed out of order; “drop flashlog” needs to come first.
2) “drop celldisk all flashlog” is not a valid command.

Please be sure to read the entire note, as some additional commands may be required depending on the storage software version and whether the flash cache is in WriteBack or WriteThrough mode.
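If you are not sure which mode the flash cache is in, the cell's flashCacheMode attribute will tell you (it returns either WriteThrough or WriteBack); a quick check looks something like this:

CellCLI> list cell attributes flashCacheMode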

Here’s the original status showing the part that was later replaced; FLASH_4_1 reported as critical.

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk
FLASH_1_0 1206M0CY4H normal
FLASH_1_1 1206M0CYL2 normal
FLASH_1_2 1206M0CY43 normal
FLASH_1_3 1206M0CY44 normal
FLASH_2_0 1202M0CN7C normal
FLASH_2_1 1202M0CRLX normal
FLASH_2_2 1202M0CN6D normal
FLASH_2_3 1202M0CQXA normal
FLASH_4_0 1202M0CR8P normal
FLASH_4_1 1202M0CR7P critical
FLASH_4_2 1202M0CRLQ normal
FLASH_4_3 1202M0CMWU normal
FLASH_5_0 1202M0CRJT normal
FLASH_5_1 1202M0CRL1 normal
FLASH_5_2 1202M0CRLT normal
FLASH_5_3 1202M0CRLP normal

The flashdisk in question ( FLASH_4_1 ) was successfully replaced and reported normal; however, immediately afterwards two more flashdisks in the same PCI slot (slot 4) went into a “poor performance” state.

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk
FLASH_1_0 1206M0CY4H normal
FLASH_1_1 1206M0CYL2 normal
FLASH_1_2 1206M0CY43 normal
FLASH_1_3 1206M0CY44 normal
FLASH_2_0 1202M0CN7C normal
FLASH_2_1 1202M0CRLX normal
FLASH_2_2 1202M0CN6D normal
FLASH_2_3 1202M0CQXA normal
FLASH_4_0 1202M0CR8P normal
FLASH_4_1 1101M067DW normal
FLASH_4_2 1202M0CRLQ warning - poor performance
FLASH_4_3 1202M0CMWU warning - poor performance
FLASH_5_0 1202M0CRJT normal
FLASH_5_1 1202M0CRL1 normal
FLASH_5_2 1202M0CRLT normal
FLASH_5_3 1202M0CRLP normal

Following the note support gave me, here’s the fix. As mentioned above, be sure to read the entire note, as additional commands may be required depending on the storage software version and the flash cache mode.

[root@exaxcel07 ~]# cellcli
CellCLI: Release 11.2.3.2.1 - Production on Tue Jun 04 13:22:11 CDT 2013

Copyright (c) 2007, 2012, Oracle.  All rights reserved.
Cell Efficiency Ratio: 6,291

CellCLI> alter lun 4_2 reenable force
LUN 4_2 on physical disk FLASH_4_2 successfully marked to status normal.
LUN 4_2 successfully reenabled.

CellCLI> alter lun 4_3 reenable force
LUN 4_3 on physical disk FLASH_4_3 successfully marked to status normal.
LUN 4_3 successfully reenabled.

CellCLI> drop flashcache
Flash cache exaxcel07_FLASHCACHE successfully dropped

CellCLI> drop flashlog
Flash log exaxcel07_FLASHLOG successfully dropped

CellCLI> drop celldisk all flashdisk
CellDisk FD_00_exaxcel07 successfully dropped
CellDisk FD_01_exaxcel07 successfully dropped
CellDisk FD_02_exaxcel07 successfully dropped
CellDisk FD_03_exaxcel07 successfully dropped
CellDisk FD_04_exaxcel07 successfully dropped
CellDisk FD_05_exaxcel07 successfully dropped
CellDisk FD_06_exaxcel07 successfully dropped
CellDisk FD_07_exaxcel07 successfully dropped
CellDisk FD_08_exaxcel07 successfully dropped
CellDisk FD_09_exaxcel07 successfully dropped
CellDisk FD_10_exaxcel07 successfully dropped
CellDisk FD_11_exaxcel07 successfully dropped
CellDisk FD_12_exaxcel07 successfully dropped
CellDisk FD_13_exaxcel07 successfully dropped
CellDisk FD_14_exaxcel07 successfully dropped
CellDisk FD_15_exaxcel07 successfully dropped

CellCLI> create celldisk all flashdisk
CellDisk FD_00_exaxcel07 successfully created
CellDisk FD_01_exaxcel07 successfully created
CellDisk FD_02_exaxcel07 successfully created
CellDisk FD_03_exaxcel07 successfully created
CellDisk FD_04_exaxcel07 successfully created
CellDisk FD_05_exaxcel07 successfully created
CellDisk FD_06_exaxcel07 successfully created
CellDisk FD_07_exaxcel07 successfully created
CellDisk FD_08_exaxcel07 successfully created
CellDisk FD_09_exaxcel07 successfully created
CellDisk FD_10_exaxcel07 successfully created
CellDisk FD_11_exaxcel07 successfully created
CellDisk FD_12_exaxcel07 successfully created
CellDisk FD_13_exaxcel07 successfully created
CellDisk FD_14_exaxcel07 successfully created
CellDisk FD_15_exaxcel07 successfully created

CellCLI> create flashlog all
Flash log exaxcel07_FLASHLOG successfully created

CellCLI> create flashcache all
Flash cache exaxcel07_FLASHCACHE successfully created

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk
	 FLASH_1_0	 1206M0CY4H	 normal
	 FLASH_1_1	 1206M0CYL2	 normal
	 FLASH_1_2	 1206M0CY43	 normal
	 FLASH_1_3	 1206M0CY44	 normal
	 FLASH_2_0	 1202M0CN7C	 normal
	 FLASH_2_1	 1202M0CRLX	 normal
	 FLASH_2_2	 1202M0CN6D	 normal
	 FLASH_2_3	 1202M0CQXA	 normal
	 FLASH_4_0	 1202M0CR8P	 normal
	 FLASH_4_1	 1101M067DW	 normal
	 FLASH_4_2	 1202M0CRLQ	 normal
	 FLASH_4_3	 1202M0CMWU	 normal
	 FLASH_5_0	 1202M0CRJT	 normal
	 FLASH_5_1	 1202M0CRL1	 normal
	 FLASH_5_2	 1202M0CRLT	 normal
	 FLASH_5_3	 1202M0CRLP	 normal

All is good now.

Exadata Auto Mgmt Process Dropping Disks

I was recently working on an Exadata X2-2 machine that had several disks that were not in ASM. Upon further inspection, the physical disks, celldisks, and griddisks all looked fine, but there was an issue adding the disks back into ASM.

I’ll show you the steps provided by Oracle support to get the disks added back into the diskgroups in ASM.

Here, you can see the status of the disks on the storage cells and within ASM.
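The CellCLI commands that produced the detail listings below were not captured in my notes, but they would have been something along these lines:

CellCLI> list celldisk where name=CD_03_exaucel02 detail
CellCLI> list griddisk where celldisk=CD_03_exaucel02 detail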

-- celldisk

name: CD_03_exaucel02
comment:
creationTime: 2011-01-05T16:42:36-06:00
deviceName: /dev/sdd
devicePartition: /dev/sdd
diskType: HardDisk
errorCount: 4
freeSpace: 0
id: 0000012d-5858-9649-0000-000000000000
interleaving: none
lun: 0_3
physicalDisk: L2HTLY
raidLevel: 0
size: 1861.703125G
status: normal

-- griddisk

name: DATA_CD_03_exaucel02
asmDiskgroupName: DATA
asmDiskName: DATA_CD_03_EXAUCEL02
asmFailGroupName: EXAUCEL02
availableTo:
cachingPolicy: default
cellDisk: CD_03_exaucel02
comment:
creationTime: 2011-01-05T16:45:47-06:00
diskType: HardDisk
errorCount: 4
id: 0000012d-585b-811e-0000-000000000000
offset: 32M
size: 1562G
status: active

name: RECO_CD_03_exaucel02
asmDiskgroupName: RECO
asmDiskName: RECO_CD_03_EXAUCEL02
asmFailGroupName: EXAUCEL02
availableTo:
cachingPolicy: default
cellDisk: CD_03_exaucel02
comment:
creationTime: 2011-01-11T10:44:37-06:00
diskType: HardDisk
errorCount: 0
id: 90c29a38-f9d3-405e-8909-d79dcdf5a909
offset: 1562.046875G
size: 299.65625G
status: active


From V$ASM_DISK
SQL:ASM> @asm_info

GROUP_NUMBER FAILGROUP                      PATH                                     MOUNT_STATUS STATE
------------ ------------------------------ ---------------------------------------- ------------ --------
           0 EXAUCEL02                      o/192.168.10.6/DATA_CD_03_exaucel02      CLOSED       NORMAL
           0 EXAUCEL02                      o/192.168.10.6/RECO_CD_03_exaucel02      CLOSED       NORMAL
           0 EXAUCEL07                      o/192.168.10.11/DATA_CD_08_exaucel07     IGNORED      NORMAL
           0 EXAUCEL07                      o/192.168.10.11/RECO_CD_08_exaucel07     IGNORED      NORMAL
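The asm_info script itself is not shown here; a minimal equivalent query against V$ASM_DISK (my approximation, not the actual script) would be something like:

SQL> select group_number, failgroup, path, mount_status, state
     from   v$asm_disk
     order  by failgroup, path;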

The disks had been in this state for some time and the relevant log files had already aged off. Since the disks appeared to be healthy, Oracle support and I decided to simply try adding them back into the ASM diskgroups. For background, the following note is useful.

After replacing disk on Exadata storage, v$asm_disk shows CLOSED/IGNORED as mount_status [ID 1347155.1]

Unfortunately, we had to go through several attempts at getting the disks back into ASM. Our attempts included:

ATTEMPT 1
Run the commands to add the disks back to their diskgroups. I did not see any errors; however, the mount_status only changed from IGNORED to CLOSED.
SQL> alter diskgroup RECO add disk 'o/192.168.10.6/RECO_CD_03_exaucel02' force;
SQL> alter diskgroup DATA add disk 'o/192.168.10.6/DATA_CD_03_exaucel02' force;

ATTEMPT 2
Next, we tried adding two of the disks to +DATA first, as they need to be added in pairs to preserve the disk partnerships.
SQL> alter diskgroup data
     add failgroup EXAUCEL04 disk 'o/192.168.10.8/DATA_CD_10_exaucel04' force
     add failgroup EXAUCEL07 disk 'o/192.168.10.11/DATA_CD_08_exaucel07' force
     rebalance power 11;

ATTEMPT 3
Next, we tried the following to clear the cache of the Exadata Auto Mgmt process.
On all DB nodes, identify the PIDs of xdmg and xdwk processes and kill them; then add the disks back into ASM.
ps -ef | grep xdmg
ps -ef | grep xdwk
Since the xd* processes are non-fatal background processes, killing them does not bring down the ASM instance; they are automatically respawned. Once the xd* processes were back up, we added the disks again (a sketch of the kill step is shown below).
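As a rough sketch of that step (the PIDs are placeholders and will differ on your system), on each DB node:

ps -ef | grep -E 'asm_xd(mg|wk)' | grep -v grep
kill <xdmg_pid> <xdwk_pid>      # non-fatal background processes; ASM respawns them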

-- ------------------------------------------------------------
-- Always the same result:
-- the disk was always dropped by the Exadata Auto Mgmt process.
-- The following is from the ASM alert log.
-- ------------------------------------------------------------

...
Starting background process XDWK
Sun May 05 02:27:30 2013
XDWK started with pid=40, OS id=28500
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup DATA drop
disk DATA_CD_03_exauCEL02 force
...

The next logical step was to drop and re-create the celldisk and griddisks through CellCLI on the cell.
To do this, it is necessary to gather some information on the names and sizes of the disks, as follows.
Please note that some of the output has been truncated for brevity.

As in most cases, there is a note that covers the steps and their explanation.
Steps to manually create cell/grid disks on Exadata V2 if auto-create fails during disk replacement [ID 1281395.1]

[root@exaucel02 ~]# cellcli
CellCLI: Release 11.2.3.2.1 - Production on Wed May 29 13:12:33 CDT 2013

Copyright (c) 2007, 2012, Oracle.  All rights reserved.
Cell Efficiency Ratio: 1,955

CellCLI> list physicaldisk
	 28:0     	 L2KDEH    	 normal
	 28:1     	 L5KQS3    	 normal
	 28:2     	 L2KD6X    	 normal
	 28:3     	 L2HTLY    	 normal
	 28:4     	 L2HTB8    	 normal
	 28:5     	 L2KJAB    	 normal
	 28:6     	 L2KJ98    	 normal
	 28:7     	 L2KD54    	 normal
	 28:8     	 L2KD6Z    	 normal
	 28:9     	 L37G5R    	 normal
	 28:10    	 L2KD6V    	 normal
	 28:11    	 L2J1LM    	 normal
...

CellCLI> list lun
	 0_0 	 0_0 	 normal
	 0_1 	 0_1 	 normal
	 0_2 	 0_2 	 normal
	 0_3 	 0_3 	 normal
	 0_4 	 0_4 	 normal
	 0_5 	 0_5 	 normal
	 0_6 	 0_6 	 normal
	 0_7 	 0_7 	 normal
	 0_8 	 0_8 	 normal
	 0_9 	 0_9 	 normal
	 0_10	 0_10	 normal
	 0_11	 0_11	 normal
...

CellCLI> list celldisk
	 CD_00_exaucel02	 normal
	 CD_01_exaucel02	 normal
	 CD_02_exaucel02	 normal
	 CD_03_exaucel02	 normal
	 CD_04_exaucel02	 normal
	 CD_05_exaucel02	 normal
	 CD_06_exaucel02	 normal
	 CD_07_exaucel02	 normal
	 CD_08_exaucel02	 normal
	 CD_09_exaucel02	 normal
	 CD_10_exaucel02	 normal
	 CD_11_exaucel02	 normal
...

CellCLI> list griddisk
	 DATA_CD_00_exaucel02	 active
	 DATA_CD_01_exaucel02	 active
	 DATA_CD_02_exaucel02	 active
	 DATA_CD_03_exaucel02	 active
	 DATA_CD_04_exaucel02	 active
	 DATA_CD_05_exaucel02	 active
	 DATA_CD_06_exaucel02	 active
	 DATA_CD_07_exaucel02	 active
	 DATA_CD_08_exaucel02	 active
	 DATA_CD_09_exaucel02	 active
	 DATA_CD_10_exaucel02	 active
	 DATA_CD_11_exaucel02	 active
	 RECO_CD_00_exaucel02	 active
	 RECO_CD_01_exaucel02	 active
	 RECO_CD_02_exaucel02	 active
	 RECO_CD_03_exaucel02	 active
	 RECO_CD_04_exaucel02	 active
	 RECO_CD_05_exaucel02	 active
	 RECO_CD_06_exaucel02	 active
	 RECO_CD_07_exaucel02	 active
	 RECO_CD_08_exaucel02	 active
	 RECO_CD_09_exaucel02	 active
	 RECO_CD_10_exaucel02	 active
	 RECO_CD_11_exaucel02	 active

CellCLI> list physicaldisk where name=28:3 detail
	 name:              	 28:3
	 deviceId:          	 24
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 28
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 foreignState:      	 false
	 luns:              	 0_3
	 makeModel:         	 "SEAGATE ST32000SSSUN2.0T"
	 physicalFirmware:  	 061A
	 physicalInsertTime:	 2010-12-21T01:04:07-06:00
	 physicalInterface: 	 sas
	 physicalSerial:    	 L2HTLY
	 physicalSize:      	 1862.6559999994934G
	 slotNumber:        	 3
	 status:            	 normal

CellCLI>  list griddisk where celldisk=CD_03_exaucel02 attributes name,size,offset
	 DATA_CD_03_exaucel02	 1562G     	 32M
	 RECO_CD_03_exaucel02	 299.65625G	 1562.046875G

Using the names and sizes above, I then dropped and re-created the celldisk and griddisks, and added the grid disks back into their respective ASM diskgroups.

CellCLI> drop   celldisk CD_03_exaucel02 force

CellCLI> create celldisk CD_03_exaucel02 lun=0_3

CellCLI> create griddisk DATA_CD_03_exaucel02 celldisk=CD_03_exaucel02,size=1562G

CellCLI> create griddisk RECO_CD_03_exaucel02 celldisk=CD_03_exaucel02,size=299.65625G

CellCLI> list griddisk where celldisk=CD_03_exaucel02 attributes name,size,offset

SQL> alter diskgroup DATA add disk 'o/192.168.10.6/DATA_CD_03_exaucel02' ;
SQL> alter diskgroup RECO add disk 'o/192.168.10.6/RECO_CD_03_exaucel02' ;

The rebalance operation was now running and could be seen in the gv$asm_operation view on the ASM instances. When it completed, the disks were back in ASM.
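To watch the rebalance progress, a query along these lines (columns trimmed) works:

SQL> select inst_id, group_number, operation, state, power, est_minutes
     from   gv$asm_operation;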

From V$ASM_DISK
SQL:ASM> @asm_info

GROUP_NUMBER FAILGROUP                      PATH                                     MOUNT_STATUS STATE
------------ ------------------------------ ---------------------------------------- ------------ --------
           1 EXAUCEL02                      o/192.168.10.6/DATA_CD_03_exaucel02      CACHED       NORMAL
           2 EXAUCEL02                      o/192.168.10.6/RECO_CD_03_exaucel02      CACHED       NORMAL

The same steps were followed to add the other disk back into ASM.

Root Filesystem Full – No Space Left on Device due to open files

Here’s an interesting scenario I was asked to look into recently: the root file system on an Oracle database server was filling up. Normally, cleaning up disk space is straightforward; find the large and/or old files and delete them. In this case, however, there was a difference in the space usage reported by df and du, and the find utility could not locate any file over 1G in size.

Here’s the status of the root file system which was causing the “No Space Left on Device” error message.

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   30G     0 100% /

After deleting around 2G of old files and logs, the error went away, but the output of df -h showed the root file system slowly filling up again, while the directory sizes hardly changed at all (only MB differences). From the “/” directory, here are the sizes of all the directories that reside on the “/” file system (confirmed with df -h $dir).

# du -sh *
7.7M     bin
67M     boot
3.5M     dev
8.5M     etc
1.2G     home
440M     lib
28M     lib64
16K     lost+found
4.0K     media
1.2G     mnt
6.8G     opt
1.1G     root
41M     sbin
4.0K     selinux
4.0K     srv
0     sys
12M     tmp
3.0G     usr
260M     var

Notice that these directories only add up to around 15G, leaving the rest of the used space unaccounted for, and the file system’s used space was still increasing.
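A quick way to double-check that sum is du restricted to a single file system with -x, which skips everything mounted below “/” on other devices:

# du -shx /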

Next, I looked at open files. It is worth mentioning that even after a file is deleted, its space may not be reclaimed if the process that created it, or is still using it, keeps running. The lsof (list open files) utility will show these files.

# lsof | grep deleted
...
expdp      7271  oracle    1w      REG              253,0 16784060416    2475867 /home/oracle/nohup.out (deleted)
expdp      7271  oracle    2w      REG              253,0 16784060416    2475867 /home/oracle/nohup.out (deleted)
…
#
# ps -ef | grep 7271
oracle    7271     1 99 May31 ?        3-10:43:36 expdp               directory=DP_DIR dumpfile=exp_schema.dmp logfile=exp_schema.log schemas=schema

The above shows an export Data Pump job ( pid = 7271 ) whose process was still running at the OS level, even though the job was no longer running in the database. The job was probably cancelled at some point but never cleaned up, and although the nohup.out file had been deleted, the background process still held it open, so its space continued to fill up the “/” partition. It is worth mentioning that nohup is NOT needed with Data Pump: the Data Pump utilities are server-side processes, so if you kick off a job and then lose your terminal for whatever reason, the job keeps running.
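Had killing the job not been an option, another way to release the space (just a sketch, not what I did here) is to truncate the deleted file through the /proc file-descriptor link that lsof reports, while leaving the process running. For this case, fd 1 of pid 7271 points at the deleted nohup.out:

# : > /proc/7271/fd/1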

Once the expdp process 7271 was killed at the OS level, the space was reclaimed.

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   13G   16G  45% /