Tuesday, May 28, 2013

Netra T2000 PCIEX Fault

After installing a new Solaris release (u11 01/13), I got a hardware fault for PCIEX, as follows:
  ID Time              FRU               Fault
   6 JAN 01 02:08:21   IOBD              Host detected fault, MSGID: PCIEX-8000-DJ  UUID: 598ad722-6e4c-4335-f86a-ec627b5c2b3c
   7 JAN 01 02:08:21   MB                Host detected fault, MSGID: PCIEX-8000-DJ  UUID: 598ad722-6e4c-4335-f86a-ec627b5c2b3c

I finally found a workaround in the server product notes: adding these settings to /etc/system fixes it (changes to /etc/system take effect after a reboot).

/etc/system
set pcie:pcie_aer_ce_mask=0x1
set segkmem_lpsize=0x400000
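
When adding these lines it is easy to end up with duplicates in /etc/system. Below is a minimal sketch of an idempotent append. The helper name ensure_line is my own invention, not a Solaris tool, and it assumes a POSIX grep that supports -q, -x and -F (on older Solaris that may mean /usr/xpg4/bin/grep rather than /usr/bin/grep). It runs against a scratch file, not the real /etc/system.

```shell
#!/bin/sh
# ensure_line is a hypothetical helper (not a Solaris tool): it appends a line
# to a file only when an identical line is not already present.
ensure_line() {
    file=$1
    line=$2
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file; point it at /etc/system only after review.
scratch=$(mktemp)
ensure_line "$scratch" 'set pcie:pcie_aer_ce_mask=0x1'
ensure_line "$scratch" 'set segkmem_lpsize=0x400000'
ensure_line "$scratch" 'set pcie:pcie_aer_ce_mask=0x1'   # already present, skipped
cat "$scratch"
rm -f "$scratch"
```

Keeping a backup copy of /etc/system before editing is a sensible precaution, since a bad entry can stop the system from booting normally.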

I also found some other settings, which may help performance in other areas:

set ip:ip_soft_rings_cnt=0
set pcie:pcie_aer_ce_mask=0x1
forceload: sys/shmsys
forceload: sys/semsys
set semsys:seminfo_semmni=20
set semsys:seminfo_semmsl=100
set semsys:seminfo_semmns=2000
set semsys:seminfo_semmnu=2000
set shmsys:shminfo_shmmax=0x140000000
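
The hex values above are hard to read at a glance. A small sketch for converting them to bytes and MiB follows. The function name hex_to_mib is hypothetical, and it assumes bash or another POSIX shell that accepts 0x constants in arithmetic expansion (the old Solaris /bin/sh does not).

```shell
#!/bin/sh
# hex_to_mib is a hypothetical helper: print a hex value as bytes and MiB.
hex_to_mib() {
    bytes=$(( $1 ))
    echo "$1 = $bytes bytes = $(( bytes / 1048576 )) MiB"
}

hex_to_mib 0x400000      # segkmem_lpsize  -> 4 MiB
hex_to_mib 0x140000000   # shminfo_shmmax  -> 5120 MiB (5 GiB)
```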

Solaris Fault Management

SC (ILOM/ALOM):
showfaults
showfaults -v

Clear faults

clearfault <UUID>

OS
List faults with details
fmadm faulty

fmdump

TIME                 UUID                                 SUNW-MSG-ID
May 28 14:27:27.3993 247a4225-6ed7-ca23-c85f-c98c461ac7ec ZFS-8000-D3
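
To feed the UUIDs into other commands, the fmdump listing can be reduced to "UUID MSG-ID" pairs. This is only a sketch: the function name fault_ids is my own, it assumes the column layout shown above, and it needs a POSIX awk (nawk on Solaris). The sample input is the output from this post, not a live fmdump call.

```shell
#!/bin/sh
# fault_ids is a hypothetical helper: read fmdump output on stdin and print
# "UUID MSG-ID" for every line that contains a UUID-shaped field.
fault_ids() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) {
                print $i, $(i + 1)
                break
            }
    }'
}

# Sample input copied from the listing above; on a real host: fmdump | fault_ids
fault_ids <<'EOF'
TIME                 UUID                                 SUNW-MSG-ID
May 28 14:27:27.3993 247a4225-6ed7-ca23-c85f-c98c461ac7ec ZFS-8000-D3
EOF
```

The header line is skipped because none of its fields looks like a UUID.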

Fault location and more details

fmdump -v -u 247a4225-6ed7-ca23-c85f-c98c461ac7ec

Clear faults

fmadm repair <UUID>
cd /var/fm/fmd
rm e* f* c*/eft/* r*/*
fmadm reset cpumem-diagnosis
fmadm reset cpumem-retire
fmadm reset eft
fmadm reset io-retire
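
The manual cleanup above can be wrapped in a script. This is a sketch under assumptions that are not in my notes above: that fmd should be stopped while its files are removed, and that the service name is svc:/system/fmd:default. The run and clear_fmd_state names are hypothetical. DRY_RUN defaults to 1, so the commands are only printed.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# run is a hypothetical wrapper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_fmd_state() {
    # Assumption: stop fmd first so it does not rewrite its files mid-cleanup.
    run svcadm disable -s svc:/system/fmd:default
    # The globs stay inside sh -c so they expand in /var/fm/fmd at run time.
    run sh -c 'cd /var/fm/fmd && rm e* f* c*/eft/* r*/*'
    run svcadm enable -s svc:/system/fmd:default
    # fmadm reset needs fmd running again, so it comes after the enable.
    for module in cpumem-diagnosis cpumem-retire eft io-retire; do
        run fmadm reset "$module"
    done
}

clear_fmd_state
```

Set DRY_RUN=0 only on a host where wiping the fault history is really intended, since this removes the fmd logs and saved module state.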

-----------------------------------------------------------
ILOM
show /SP/faultmgmt

start /SP/faultmgmt/shell

fmadm faulty

fmadm repaired 1feeeff2-5371-c970-c8c7-d82c50b42c6b