Monday 20 August 2012

ORA-15041 Diskgroup Space Exhausted Error Message in ASM


Symptoms
An ORA-15041 (diskgroup space exhausted) error occurs during a rebalance or while adding a disk (which implicitly does a rebalance unless told otherwise).
Cause
One or more disks have V$ASM_DISK.FREE_MB below the threshold needed for a successful rebalance (roughly 50-100 MB).
Solution
1) Determine which (if any) disks contain no free space (i.e. are below the threshold)
select group_kfdat  "Group #",
       number_kfdat "Disk #",
       count(*)     "# AUs"
  from x$kfdat a
 where v_kfdat = 'V'
   and not exists (select *
                     from x$kfdat b
                    where a.group_kfdat  = b.group_kfdat
                      and a.number_kfdat = b.number_kfdat
                      and b.v_kfdat = 'F')
 group by group_kfdat, number_kfdat;
If no rows are returned, the following query can also be used:

select disk_number "Disk #", free_mb
from v$asm_disk
where group_number = *** disk group number ***
order by 2;
If rows are returned from the first query, or FREE_MB is less than 100 MB in the second, then there is probably insufficient disk space to allow a rebalance to occur. Note the disk numbers for later.


2) Determine which files have allocation units on the disk(s) that are on exhausted disks
select name, file_number
  from v$asm_alias
 where group_number in (select group_kffxp
                          from x$kffxp
                         where group_kffxp = *** disk group number ***
                           and disk_kffxp in (*** disk list from #1 above ***)
                           and au_kffxp != 4294967294
                           and number_kffxp >= 256)
   and file_number in (select number_kffxp
                         from x$kffxp
                        where group_kffxp = *** disk group number ***
                          and disk_kffxp in (*** disk list from #1 above ***)
                          and au_kffxp != 4294967294
                          and number_kffxp >= 256)
   and system_created = 'Y';
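To decide which file is worth moving, it can also help to count how many AUs each candidate file occupies on the exhausted disks. This is a sketch using the same X$KFFXP predicates as the query above, with the same placeholders to fill in:

select number_kffxp as file_number,
       count(*)     as aus_on_exhausted_disks
  from x$kffxp
 where group_kffxp = *** disk group number ***
   and disk_kffxp in (*** disk list from #1 above ***)
   and au_kffxp != 4294967294
   and number_kffxp >= 256
 group by number_kffxp
 order by aus_on_exhausted_disks desc;

With a 1 MB AU size, a file showing 100+ AUs here frees at least 100 MB on those disks when dropped or moved.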
 


3) Free up space so that the rebalance can occur

Using the file list from #2 above, we will need to either drop or move tablespace(s)/datafile(s) so that all exhausted disks have at least 100 MB free.

NOTE: the AU count above relates to a 1 MB AU size, so if a single file with at least 100 AUs can be dropped or moved, this should be sufficient to free up enough space to allow the rebalance to occur.

Droppable tablespaces may be things like:
* temporary tablespaces
* index tablespaces (assuming you know how to rebuild the indexes)

If none of the tablespaces are droppable, then the tablespace(s)/datafile(s) will need to be:
* moved to another diskgroup (at least temporarily), or
* dropped using RMAN (with the database shut down) and restored later
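As a rough sketch, one way to move a datafile to another diskgroup is RMAN's copy-and-switch technique; the datafile number (7) and target diskgroup (+TEMPDG) below are hypothetical placeholders, not values from this environment:

RMAN> SQL 'ALTER DATABASE DATAFILE 7 OFFLINE';
RMAN> BACKUP AS COPY DATAFILE 7 FORMAT '+TEMPDG';
RMAN> SWITCH DATAFILE 7 TO COPY;
RMAN> RECOVER DATAFILE 7;
RMAN> SQL 'ALTER DATABASE DATAFILE 7 ONLINE';

Reversing the move later (step 8) is the same sequence with FORMAT pointing back at the original diskgroup.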

 
4) Check to see if there is sufficient FREE_MB on the problem disks
select disk_number "Disk #", free_mb
  from v$asm_disk
 where group_number = *** disk group number ***
   and disk_number in (*** disk list from #1 above ***)
 order by 2;

If the disks do not have at least 100 MB free, repeat #3 above.
 
5) Rebalance
alter diskgroup *** disk group name *** rebalance power *** power level (1-11) ***;
 

6) Monitor the progress of the rebalance until finished
select sofar "AUs Moved So Far", est_work "Approx AUs To Be Moved"
from v$asm_operation
where group_number = *** disk group number ***;

Continue to monitor until the rebalance has completed
 

7) Check the balance of the disks
select disk_number, total_mb-free_mb
from v$asm_disk
where group_number = *** disk group number ***;
TOTAL_MB - FREE_MB = amount of space used on the disk

The amount of space used on each disk (without regard to size) should be approximately the same (within a few megabytes)

 

8) Put things back where they were

If the datafiles were moved, reverse the process and return them to their original diskgroup and location.

If the tablespace(s) were dropped, recreate them if needed.

If the datafiles were dropped, restore them using RMAN.
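If the dropped datafiles are still known to the controlfile, a minimal RMAN restore sketch (datafile number 7 is again a hypothetical placeholder) is:

RMAN> RESTORE DATAFILE 7;
RMAN> RECOVER DATAFILE 7;
RMAN> SQL 'ALTER DATABASE DATAFILE 7 ONLINE';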


Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7 - Release: 10.2 to 11.1
Information in this document applies to any platform.

Thursday 26 July 2012

Oracle Enterprise Manager Agent Upload Troubleshooting


There are times when the Oracle Enterprise Manager agent stops uploading.

There are many reasons why the agent may not be able to upload.

Examine the agent

1. First I always check the status of the agent
 
emctl status agent

2. Examine the status output; you may see that the agent is not uploading:
 
Last successful upload : (none)
Last attempted upload : (none)
Total Megabytes of XML files uploaded so far : 0.00
Number of XML files pending upload : 828
Size of XML files pending upload(MB) : 56.61
Available disk space on upload filesystem : 8.50%
Last successful heartbeat to OMS : 2012-07-21 08:45:39


3. Next I always attempt a manual upload for the agent, to verify the upload problem
 
emctl upload agent

4. Check the following log file for errors
 
$AGENT_HOME/sysman/log/emagent.trc
Other logfiles are in the $AGENT_HOME/sysman/log/ directory.

In most cases the problem is due to a bad .xml file, or the agent being unable to contact the management service. If it is a contact problem, you will need to fix the OMS service or the connectivity to it. If it is due to a bad XML file, you can try simply removing that file from the upload location. I usually save the file to another location first so that I can send it to Oracle Support, where hopefully they can tell me why the file would not upload.

Once you have removed the file from the upload location, attempt a manual upload again:

emctl upload agent

Check the logfile again for errors. You can repeat this for subsequent bad XML files, but if all else fails we can always clear the agent completely. Keep in mind that you will lose the data from the pending XMLs, so that data will not make it to the Grid Control Repository.
Clearing the agent

1. Stop the agent on the target node
 
emctl stop agent
2. Delete any pending upload files from the agent home
 
rm -r $ORACLE_HOME/sysman/emd/state/*
rm -r $ORACLE_HOME/sysman/emd/collection/*
rm -r $ORACLE_HOME/sysman/emd/upload/*
rm $ORACLE_HOME/sysman/emd/lastupld.xml
rm $ORACLE_HOME/sysman/emd/agntstmp.txt
rm $ORACLE_HOME/sysman/emd/blackouts.xml

3. Clear the agent state
 
emctl clearstate agent

4. Start the agent again
 
emctl start agent

5. Force an upload to the Oracle Management Server/Service (OMS)
 
emctl upload agent


There could be another cause: when you delete an agent and then try to reconfigure it in OMS, an issue arises if the previous delete did not complete successfully.

You may see error messages like the following:


EMD upload error: Failed to upload file A0000001.xml: Fatal Error.
Response received: 500|ORA-20618: The specified agent is in the process of being deleted from the repository, wait for deletion to complete before restarting the agent.(agent name = dbp1405.xx.xx.xxxxx.xxx:1830)(agent guid = 35F8188A75ABF6AFAAA6871B64D1C0B1)
ORA-06512: at "SYSMAN.TARGETS_INSERT_TRIGGER", line 30
ORA-04088: error during execution of trigger 'SYSMAN.TARGETS_INSERT_TRIGGER'


This is due to a duplicate entry in the OMS database.

I tried to delete that entry manually in the OMS database by connecting as the SYSMAN user:


SQL> exec mgmt_diag.PurgeOrphanTarget(HEXTORAW('35F8188A75ABF6AFAAA6871B64D1C0B1'));
BEGIN mgmt_diag.PurgeOrphanTarget(HEXTORAW('35F8188A75ABF6AFAAA6871B64D1C0B1')); END;
*
ERROR at line 1:
ORA-20000: Target is in pending delete state
ORA-06512: at "SYSMAN.MGMT_DIAG", line 1437
ORA-06512: at line 1

This means that it cannot be deleted manually because the previous attempt is still running. To find out when that attempt started, I ran this query:

select delete_request_time, delete_complete_time, last_updated_time
  from mgmt_targets_delete
 where target_name = 'dbp1405.xx.xx.xxxx.xxx:1830'
   and target_type = 'oracle_emd';

DELETE_REQUEST_TIME           DELETE_COMPLETE_TIME          LAST_UPDATED_TIME
----------------------------- ----------------------------- -----------------------------
01-jun-2012 11:48:36                                        01-jun-2012 11:48:36
  

The empty DELETE_COMPLETE_TIME shows that the previous deletion has been hanging for a long time.


There are a couple of methods of progressing further from this stage. One is to use the repvfy utility to remove the duplicate host from the repository. If repvfy won't work, you can remove it manually using the following steps.

Stop the agent

emctl stop agent

Then log in to the repository database as the SYSMAN user and find the GUID details for that object in the MGMT_TARGETS_DELETE table, or use this statement:

SQL> select TARGET_NAME, TARGET_TYPE, EMD_URL from MGMT_TARGETS_DELETE where TARGET_GUID = '35F8188A75ABF6AFAAA6871B64D1C0B1';

Check all of the tables below. The main tables are:

MGMT_TARGETS_DELETE
MGMT_EMD_PING
MGMT_EMD_PING_CHECK
MGMT_AVAILABILITY
MGMT_AVAILABILITY_MARKER
MGMT_CURRENT_AVAILABILITY

Also check the following reference tables:

MGMT_AGENT_SEC_INFO
MGMT_BLACKOUT_PROXY_TARGETS
MGMT_CURRENT_METRICS
MGMT_LAST_VIOLATION
MGMT_METRICS_1DAY
MGMT_POLICY_ASSOC_EVAL_DETAILS
MGMT_POLICY_ASSOC_EVAL_SUMM
MGMT_PURGE_POLICY_TARGET_STATE
MGMT_RT_BOOTSTRAP_TIMES
MGMT_STRING_METRIC_HISTORY
MGMT_TARGET_PROPERTIES
MGMT_VIOLATIONS
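
A sketch (run as SYSMAN) that checks every listed table for the orphaned GUID in one pass; the GUID is the one from the error above, and the lookup against USER_TAB_COLUMNS skips any table that has no TARGET_GUID column:

set serveroutput on
declare
  n number;
begin
  for t in (select table_name
              from user_tab_columns
             where column_name = 'TARGET_GUID'
               and table_name in ('MGMT_TARGETS_DELETE','MGMT_EMD_PING','MGMT_EMD_PING_CHECK',
                                  'MGMT_AVAILABILITY','MGMT_AVAILABILITY_MARKER',
                                  'MGMT_CURRENT_AVAILABILITY','MGMT_AGENT_SEC_INFO',
                                  'MGMT_BLACKOUT_PROXY_TARGETS','MGMT_CURRENT_METRICS',
                                  'MGMT_LAST_VIOLATION','MGMT_METRICS_1DAY',
                                  'MGMT_POLICY_ASSOC_EVAL_DETAILS','MGMT_POLICY_ASSOC_EVAL_SUMM',
                                  'MGMT_PURGE_POLICY_TARGET_STATE','MGMT_RT_BOOTSTRAP_TIMES',
                                  'MGMT_STRING_METRIC_HISTORY','MGMT_TARGET_PROPERTIES',
                                  'MGMT_VIOLATIONS')) loop
    -- count rows referencing the orphaned agent in each table
    execute immediate 'select count(*) from ' || t.table_name ||
                      ' where target_guid = hextoraw(''35F8188A75ABF6AFAAA6871B64D1C0B1'')' into n;
    if n > 0 then
      dbms_output.put_line(t.table_name || ': ' || n || ' row(s)');
    end if;
  end loop;
end;
/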


If you find an entry for agent dbp1405.nl.eu.abnamro.com in any of these tables, delete the target.

Example:

SQL> delete from MGMT_TARGETS_DELETE where TARGET_GUID = '35F8188A75ABF6AFAAA6871B64D1C0B1';
SQL> commit;

This should delete the duplicate or old entry from OMS and allow you to continue securing and uploading the agent.



  

Wednesday 25 January 2012

Oracle RAC node killed by CRS

I faced this error message:


[    CSSD]2012-01-12 15:24:19.352 [1199618400] >TRACE:   clssnmWaitThread: thrd(2), timeout(1000), wakeonpost(0)
[    CSSD]2012-01-12 15:24:19.353 [1220598112] >ERROR:   ###################################
[    CSSD]2012-01-12 15:24:19.353 [1220598112] >ERROR:   clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
[    CSSD]2012-01-12 15:24:19.353 [1220598112] >ERROR:   ###################################

First of all, the log file is located in $CRS_HOME/log/<hostname>/cssd/
and the file name is ocssd.log

There could be many reasons for this error, but in a nutshell CSSD has killed the local node's connection to the rest of the RAC cluster. In this case you will notice a hint on the line above the ERROR, which shows that a timeout is occurring. Investigating the log files further, I noticed that the heartbeat between the nodes was not fast enough.

When I checked the interface used by the interconnect, I noticed it was running at a slow speed.

Changing the speed of the network interface resolved this problem.
 

Tuesday 3 January 2012

BOOK REVIEW: Oracle 11g R1/R2 RAC Essentials

ORACLE 11g R1/R2 RAC ESSENTIALS
ISBN 978-1-849682-66-4



There has always been a need for a comprehensive book on this topic, and I think a very good effort has been made in writing the above-mentioned book. I have bought this book, and as I go through the chapters I'll update this page with my observations.


In the first chapter, under the topic of High Availability: Oracle 11g R1 RAC, the authors mention that RAC is not a true disaster recovery solution because it does not protect against site failure or database failure.

I think this point needs further clarification. In my opinion it depends on your setup. I recently created a RAC database using ASM disks with normal redundancy across two physically separated locations, so not only are the nodes in different data centers, the storage is in two different data centers as well. The ASM diskgroup has two failgroups, one at each site, over a stretched SAN. An instance is a combination of background processes and memory, whereas the database is the data on storage. In this setup, both the database and an instance remain available in case of a site failure.

Further points are still to come...