Discussion:
Frequent HDR failures: -25572 (oserr = 0 or 4) generally needing restart (11.50FC7 on SLES)
(too old to reply)
Humphrey D
2013-04-22 07:50:19 UTC
Permalink
Hi All,

I'm more a developer than a DBA, but after our DBA left, I seem to have inherited the job of supporting a number of servers running development databases on a free version of Informix - Innovator-C 11.50.FC7IE

The development boxes are running as HDR pairs, and we've never used any of the other MACH-11 options (SDS, RSS).
Increasingly, we seeing HDR interruptions, and the odd thing is on some blades, HDR generally recovers after a minute or two, whilst on others it generally fails requiring intervention!


There's a lot of log extracts below, but I guess my basic question is whether the -25582 "Network connection is broken" errors are genuinely related to the network (I suspect not, when the HDR nodes are two blades in an enclosure) or indicative of an error in the database engine itself (does the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)

Could anyone suggest further diagnosis I could do? Sometimes we see a single pair fail twice in a few hours, sometimes they'll be fine for weeks. I'm pretty sure there's one or two pairs that have never failed in the past year...

Example of HDR failing but recovering itself on a secondary (DB02):
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
07:19:59 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery

07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational




Example of failure on a secondary (DB32):

01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
01:23:12 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server




Slightly different report on another secondary (DB62):

02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
02:34:09 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server




In all the above, "oserr = 4" means an interrupted system call, yes? Does this mean that it's an OS issue that tripped Informix, rather than a bug in the database engine?


We also get similar failures, but with "oserr = 0" - does this mean nothing bad at OS level, so it's likely a bug in Informix?

This from DB62:
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
06:27:43 ASF Echo-Thread Server: asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server

Or, am I reading too much into the "oserr =" part of the logged errors?



I guess we're pretty limited in what we may discover, as we don't have an IBM support contract, but what I *have* managed to find is that whenever we see HDR fail and stay failed, than an "onstat –g cpu" on the primary always shows a terminated HDR ping thread:
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294 terminated


Trawling ibm.com gave me an APAR with broadly similar symptoms, but for leter version of Informix:
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF


And we do occasionally get that error message if we restart the secondary node, then we have to cycle Informix:
Primary (DB23):
12:48:35 DR: ping timeout

12:51:09 Checkpoint Completed: duration was 0 seconds.
12:51:09 Fri Apr 12 - loguniq 31282, logpos 0x91b5204, timestamp: 0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 9059, Llog used 35583

13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 3772, Llog used 10694

13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]

13:10:34 Checkpoint Completed: duration was 0 seconds.
13:10:34 Fri Apr 12 - loguniq 31282, logpos 0xbb7d018, timestamp: 0x4bdc4b5 Interval: 87469



Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr = : Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.

13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600

13:09:37 Event alarms enabled. ALARMPROG = '/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.

Do the above pair of errors on DB23/DB22 mean that the port for HDR was still held open after I took Infomrix offline? I cycled Informix on primary and secondary again, and there was nothing odd in the logs, and this pair has been fine since...

Sorry for such a rambling first post - but if anyone can tell me what the likely problem is, or suggest additional Informix or Unix diagnostits I should do next time, I'd really be grateful!

Thanks very much.
Humph
Art Kagel
2013-04-22 13:42:27 UTC
Permalink
Humph:

Here's what I have for you:

- Errno 4 is indeed interrupted system call, so possibly an OS error
started the trouble.
- Errno 0 is bad Ju-Ju, yes probably a bug
- Errno 104 indicates that the client at the other end of an outgoing
pipe disconnected, so probably the secondary was encountering the same
problem as the primary. That would point to some kind of network glitch.
In the network itself or in the OS? Hard to tell.
- The dr_prping thread going defunct is worrying but you still don't
know if this is cause or symptom.
- You CAN get paid support for the Innovator-C installations you use for
DEV and it is very inexpensive (less than $2000 IB, but I haven't looked at
the pricing lately). So, that is a viable option.

Art
Art S. Kagel
Advanced DataTools (www.advancedatatools.com)
Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions
and do not reflect on my employer, Advanced DataTools, the IIUG, nor any
other organization with which I am associated either explicitly,
implicitly, or by inference. Neither do those opinions reflect those of
other individuals affiliated with any entity with which I am affiliated nor
those of the entities themselves.
Post by Humphrey D
Hi All,
I'm more a developer than a DBA, but after our DBA left, I seem to have
inherited the job of supporting a number of servers running development
databases on a free version of Informix - Innovator-C 11.50.FC7IE
The development boxes are running as HDR pairs, and we've never used any
of the other MACH-11 options (SDS, RSS).
Increasingly, we seeing HDR interruptions, and the odd thing is on some
blades, HDR generally recovers after a minute or two, whilst on others it
generally fails requiring intervention!
There's a lot of log extracts below, but I guess my basic question is
whether the -25582 "Network connection is broken" errors are genuinely
related to the network (I suspect not, when the HDR nodes are two blades in
an enclosure) or indicative of an error in the database engine itself (does
the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)
Could anyone suggest further diagnosis I could do? Sometimes we see a
single pair fail twice in a few hours, sometimes they'll be fine for weeks.
I'm pretty sure there's one or two pairs that have never failed in the past
year...
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery
07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational
01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server
02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server
In all the above, "oserr = 4" means an interrupted system call, yes? Does
this mean that it's an OS issue that tripped Informix, rather than a bug in
the database engine?
We also get similar failures, but with "oserr = 0" - does this mean
nothing bad at OS level, so it's likely a bug in Informix?
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server
Or, am I reading too much into the "oserr =" part of the logged errors?
I guess we're pretty limited in what we may discover, as we don't have an
IBM support contract, but what I *have* managed to find is that whenever we
see HDR fail and stay failed, than an "onstat –g cpu" on the primary always
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294
terminated
Trawling ibm.com gave me an APAR with broadly similar symptoms, but for
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED
CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF
And we do occasionally get that error message if we restart the secondary
12:48:35 DR: ping timeout
12:51:09 Checkpoint Completed: duration was 0 seconds.
0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 9059, Llog used 35583
13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 3772, Llog used 10694
13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]
13:10:34 Checkpoint Completed: duration was 0 seconds.
0x4bdc4b5 Interval: 87469
Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr =
: Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No
connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No
connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.
13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600
13:09:37 Event alarms enabled. ALARMPROG =
'/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial
Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network
driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.
Do the above pair of errors on DB23/DB22 mean that the port for HDR was
still held open after I took Infomrix offline? I cycled Informix on primary
and secondary again, and there was nothing odd in the logs, and this pair
has been fine since...
Sorry for such a rambling first post - but if anyone can tell me what the
likely problem is, or suggest additional Informix or Unix diagnostits I
should do next time, I'd really be grateful!
Thanks very much.
Humph
_______________________________________________
Informix-list mailing list
http://www.iiug.org/mailman/listinfo/informix-list
Loading...