Frequent HDR failures: -25582 (oserr = 0 or 4) generally needing restart (11.50FC7 on SLES)

Art Kagel

2013-04-22 13:42:27 UTC

Humph:

Here's what I have for you:

- Errno 4 is indeed interrupted system call, so possibly an OS error
started the trouble.
- Errno 0 is bad Ju-Ju, yes probably a bug
- Errno 104 indicates that the client at the other end of an outgoing
pipe disconnected, so probably the secondary was encountering the same
problem as the primary. That would point to some kind of network glitch.
In the network itself or in the OS? Hard to tell.
- The dr_prping thread going defunct is worrying but you still don't
know if this is cause or symptom.
- You CAN get paid support for the Innovator-C installations you use for
DEV and it is very inexpensive (less than $2000 IB, but I haven't looked at
the pricing lately). So, that is a viable option.
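
For reference, the errno numbers discussed above can be mapped to their Linux names with a small lookup. This is only a sketch: the table is hardcoded with Linux values (the boxes here run SLES), it also includes 98 and 107 because they appear in the quoted logs, and the helper name describe_oserr is made up for illustration.

```python
# Linux errno numbers that show up in this thread. These values are
# Linux-specific; other platforms number some of them differently.
LINUX_ERRNO = {
    0: ("(none)", "no OS error was recorded"),
    4: ("EINTR", "Interrupted system call"),
    98: ("EADDRINUSE", "Address already in use"),
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def describe_oserr(code):
    """Return a one-line description for an oserr value (hypothetical helper)."""
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"oserr = {code}: {name} ({text})"


for code in (0, 4, 104, 107, 98):
    print(describe_oserr(code))
```

On a live Linux box, os.strerror(code) from the Python standard library gives the same text for the non-zero codes.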

Art
Art S. Kagel
Advanced DataTools (www.advancedatatools.com)
Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions
and do not reflect on my employer, Advanced DataTools, the IIUG, nor any
other organization with which I am associated either explicitly,
implicitly, or by inference. Neither do those opinions reflect those of
other individuals affiliated with any entity with which I am affiliated nor
those of the entities themselves.

Post by Humphrey D
Hi All,
I'm more a developer than a DBA, but after our DBA left, I seem to have
inherited the job of supporting a number of servers running development
databases on a free version of Informix - Innovator-C 11.50.FC7IE
The development boxes are running as HDR pairs, and we've never used any
of the other MACH-11 options (SDS, RSS).
Increasingly, we're seeing HDR interruptions, and the odd thing is on some
blades, HDR generally recovers after a minute or two, whilst on others it
generally fails requiring intervention!
There's a lot of log extracts below, but I guess my basic question is
whether the -25582 "Network connection is broken" errors are genuinely
related to the network (I suspect not, when the HDR nodes are two blades in
an enclosure) or indicative of an error in the database engine itself (does
the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)
Could anyone suggest further diagnosis I could do? Sometimes we see a
single pair fail twice in a few hours, sometimes they'll be fine for weeks.
I'm pretty sure there's one or two pairs that have never failed in the past
year...
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery
07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational
01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server
02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server
In all the above, "oserr = 4" means an interrupted system call, yes? Does
this mean that it's an OS issue that tripped Informix, rather than a bug in
the database engine?
We also get similar failures, but with "oserr = 0" - does this mean
nothing bad at OS level, so it's likely a bug in Informix?
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server
Or, am I reading too much into the "oserr =" part of the logged errors?
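
One way to see whether the oserr values follow a pattern is to tally them per ping timeout across the message logs of all pairs. The sketch below assumes the log layout shown in the extracts above (a "DR: ping timeout" line, an optional "System error = N." line, then "DR_ERR set to"), and it treats a failure with no "System error" line as oserr 0, which is an assumption. The function name tally_oserr and the embedded sample are for illustration only.

```python
import re
from collections import Counter

# Sample copied from the log extracts in this post.
SAMPLE_LOG = """\
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:20:00 DR: Turned off on secondary server
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 DR: Turned off on secondary server
"""

TIMEOUT = re.compile(r"^\d\d:\d\d:\d\d\s+DR: ping timeout")
SYSERR = re.compile(r"System error = (\d+)")
DR_ERR = re.compile(r"DR_ERR set to")


def tally_oserr(log_text):
    """Count the OS error number reported with each DR ping timeout.

    Assumption: a timeout with no "System error" line before the next
    "DR_ERR set to" line is counted as oserr 0.
    """
    counts = Counter()
    in_failure = False
    oserr = 0
    for line in log_text.splitlines():
        if TIMEOUT.match(line):
            if in_failure:          # previous failure never reached DR_ERR
                counts[oserr] += 1
            in_failure = True
            oserr = 0
        elif in_failure:
            found = SYSERR.search(line)
            if found:
                oserr = int(found.group(1))
            elif DR_ERR.search(line):
                counts[oserr] += 1
                in_failure = False
    if in_failure:                  # log ended in the middle of a failure
        counts[oserr] += 1
    return counts


print(dict(tally_oserr(SAMPLE_LOG)))
```

Running this over the logs from each blade and comparing the counts between the pairs that recover and the pairs that stay failed may show whether oserr 0 and oserr 4 line up with the two behaviours.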
I guess we're pretty limited in what we may discover, as we don't have an
IBM support contract, but what I *have* managed to find is that whenever we
see HDR fail and stay failed, then an "onstat -g cpu" on the primary always shows:
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294
terminated
Trawling ibm.com gave me an APAR with broadly similar symptoms, but for
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED
CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF
And we do occasionally get that error message if we restart the secondary
12:48:35 DR: ping timeout
12:51:09 Checkpoint Completed: duration was 0 seconds.
0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 9059, Llog used 35583
13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 3772, Llog used 10694
13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]
13:10:34 Checkpoint Completed: duration was 0 seconds.
0x4bdc4b5 Interval: 87469
Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr =
: Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No
connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No
connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.
13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600
13:09:37 Event alarms enabled. ALARMPROG =
'/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial
Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network
driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.
Do the above pair of errors on DB23/DB22 mean that the port for HDR was
still held open after I took Informix offline? I cycled Informix on primary
and secondary again, and there was nothing odd in the logs, and this pair
has been fine since...
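
On Linux, oserr 98 is EADDRINUSE, which is what bind() reports when something still holds the address. The sketch below reproduces that condition with two plain sockets on the loopback interface. It has nothing to do with Informix itself and does not show what was holding the port in this case; the function name second_bind_errno is made up for illustration.

```python
import errno
import socket


def second_bind_errno():
    """Bind one socket, then try to bind a second one to the same port.

    Returns the errno from the second bind, or 0 if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


code = second_bind_errno()
print(code, errno.errorcode.get(code, "?"))
```

If the listener fails this way again, checking which process owns the port at that moment (for example with netstat or lsof) before restarting would tell you whether an old process was still attached to it.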
Sorry for such a rambling first post - but if anyone can tell me what the
likely problem is, or suggest additional Informix or Unix diagnostics I
should do next time, I'd really be grateful!
Thanks very much.
Humph