Humphrey D
2013-04-22 07:50:19 UTC
Hi All,
I'm more a developer than a DBA, but after our DBA left, I seem to have inherited the job of supporting a number of servers running development databases on a free version of Informix - Innovator-C 11.50.FC7IE
The development boxes are running as HDR pairs, and we've never used any of the other MACH-11 options (SDS, RSS).
Increasingly, we seeing HDR interruptions, and the odd thing is on some blades, HDR generally recovers after a minute or two, whilst on others it generally fails requiring intervention!
There's a lot of log extracts below, but I guess my basic question is whether the -25582 "Network connection is broken" errors are genuinely related to the network (I suspect not, when the HDR nodes are two blades in an enclosure) or indicative of an error in the database engine itself (does the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)
Could anyone suggest further diagnosis I could do? Sometimes we see a single pair fail twice in a few hours, sometimes they'll be fine for weeks. I'm pretty sure there's one or two pairs that have never failed in the past year...
Example of HDR failing but recovering itself on a secondary (DB02):
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
07:19:59 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery
07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational
Example of failure on a secondary (DB32):
01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
01:23:12 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server
Slightly different report on another secondary (DB62):
02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
02:34:09 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server
In all the above, "oserr = 4" means an interrupted system call, yes? Does this mean that it's an OS issue that tripped Informix, rather than a bug in the database engine?
We also get similar failures, but with "oserr = 0" - does this mean nothing bad at OS level, so it's likely a bug in Informix?
This from DB62:
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
06:27:43 ASF Echo-Thread Server: asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server
Or, am I reading too much into the "oserr =" part of the logged errors?
I guess we're pretty limited in what we may discover, as we don't have an IBM support contract, but what I *have* managed to find is that whenever we see HDR fail and stay failed, than an "onstat –g cpu" on the primary always shows a terminated HDR ping thread:
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294 terminated
Trawling ibm.com gave me an APAR with broadly similar symptoms, but for leter version of Informix:
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF
And we do occasionally get that error message if we restart the secondary node, then we have to cycle Informix:
Primary (DB23):
12:48:35 DR: ping timeout
12:51:09 Checkpoint Completed: duration was 0 seconds.
12:51:09 Fri Apr 12 - loguniq 31282, logpos 0x91b5204, timestamp: 0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 9059, Llog used 35583
13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 3772, Llog used 10694
13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]
13:10:34 Checkpoint Completed: duration was 0 seconds.
13:10:34 Fri Apr 12 - loguniq 31282, logpos 0xbb7d018, timestamp: 0x4bdc4b5 Interval: 87469
Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr = : Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.
13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600
13:09:37 Event alarms enabled. ALARMPROG = '/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.
Do the above pair of errors on DB23/DB22 mean that the port for HDR was still held open after I took Infomrix offline? I cycled Informix on primary and secondary again, and there was nothing odd in the logs, and this pair has been fine since...
Sorry for such a rambling first post - but if anyone can tell me what the likely problem is, or suggest additional Informix or Unix diagnostits I should do next time, I'd really be grateful!
Thanks very much.
Humph
I'm more a developer than a DBA, but after our DBA left, I seem to have inherited the job of supporting a number of servers running development databases on a free version of Informix - Innovator-C 11.50.FC7IE
The development boxes are running as HDR pairs, and we've never used any of the other MACH-11 options (SDS, RSS).
Increasingly, we seeing HDR interruptions, and the odd thing is on some blades, HDR generally recovers after a minute or two, whilst on others it generally fails requiring intervention!
There's a lot of log extracts below, but I guess my basic question is whether the -25582 "Network connection is broken" errors are genuinely related to the network (I suspect not, when the HDR nodes are two blades in an enclosure) or indicative of an error in the database engine itself (does the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)
Could anyone suggest further diagnosis I could do? Sometimes we see a single pair fail twice in a few hours, sometimes they'll be fine for weeks. I'm pretty sure there's one or two pairs that have never failed in the past year...
Example of HDR failing but recovering itself on a secondary (DB02):
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
07:19:59 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery
07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational
Example of failure on a secondary (DB32):
01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
01:23:12 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server
Slightly different report on another secondary (DB62):
02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
02:34:09 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server
In all the above, "oserr = 4" means an interrupted system call, yes? Does this mean that it's an OS issue that tripped Informix, rather than a bug in the database engine?
We also get similar failures, but with "oserr = 0" - does this mean nothing bad at OS level, so it's likely a bug in Informix?
This from DB62:
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
06:27:43 ASF Echo-Thread Server: asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server
Or, am I reading too much into the "oserr =" part of the logged errors?
I guess we're pretty limited in what we may discover, as we don't have an IBM support contract, but what I *have* managed to find is that whenever we see HDR fail and stay failed, than an "onstat –g cpu" on the primary always shows a terminated HDR ping thread:
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294 terminated
Trawling ibm.com gave me an APAR with broadly similar symptoms, but for leter version of Informix:
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF
And we do occasionally get that error message if we restart the secondary node, then we have to cycle Informix:
Primary (DB23):
12:48:35 DR: ping timeout
12:51:09 Checkpoint Completed: duration was 0 seconds.
12:51:09 Fri Apr 12 - loguniq 31282, logpos 0x91b5204, timestamp: 0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 9059, Llog used 35583
13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 3772, Llog used 10694
13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]
13:10:34 Checkpoint Completed: duration was 0 seconds.
13:10:34 Fri Apr 12 - loguniq 31282, logpos 0xbb7d018, timestamp: 0x4bdc4b5 Interval: 87469
Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr = : Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.
13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600
13:09:37 Event alarms enabled. ALARMPROG = '/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.
Do the above pair of errors on DB23/DB22 mean that the port for HDR was still held open after I took Infomrix offline? I cycled Informix on primary and secondary again, and there was nothing odd in the logs, and this pair has been fine since...
Sorry for such a rambling first post - but if anyone can tell me what the likely problem is, or suggest additional Informix or Unix diagnostits I should do next time, I'd really be grateful!
Thanks very much.
Humph