Topic: HF2 problem version 2
I decided to start a new thread for this topic. It is similar to the CM1 failure reported in other threads (which I have experienced as well), but it is not caused by the same issue and is not covered in other topics. Sorry for the long post.
The cases discussed before are either a hardware failure or a CM1 crashing due to a network storm caused by incompatibilities between RSTP/MSTP and CobraNet. In my latest problem, the CM1 crashes and restarts completely (as verified by the SysUptime variable) under normal operation, without any network changes or spanning-tree renegotiations. Here is a brief description of the system in question:
- 53 CobraNet devices: 43 1ch transmit/receive (tx/rx) devices, 2 CAB16d (16ch tx/rx), 7 CAB8i (8ch tx) and 2 CAB16o (2x8ch rx)
- 6 total NIONs
- 18 of the 1ch devices also receive serial data through serial bridging. The 485 data is input using the CAB16d.
We have designed and installed about 20-30 similar systems, but this one has the most devices receiving serial data.
At the CobraNet level, the system is arranged as follows:
Nion 1 talks to 16 1ch tx/rx devices without serial data +
Nion 2 talks to 9 1ch tx/rx devices without serial data + 6 1ch tx/rx devices with serial data
Nion 3 talks to 6 1ch tx/rx devices with serial data + 2 CAB8i
Nion 4 talks to 6 1ch tx/rx devices with serial data + 2 CAB8i
Nion 5 talks to 4 CAB8i + 2 CAB16o
Nion 6 talks to 1 CAB8i + 2 CAB16d
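To make the per-role load easier to compare, the arrangement above can be written down as data and tallied. This is only a rough sketch: the dictionary keys (`plain_1ch`, `serial_1ch`) and the helper names are made up for illustration, and the counts are simply transcribed from the list above.

```python
# Device counts per NION role, transcribed from the list above.
# "plain_1ch"  = 1ch tx/rx devices without serial data (hypothetical label)
# "serial_1ch" = 1ch tx/rx devices with serial data    (hypothetical label)
ROLES = {
    "Nion 1": {"plain_1ch": 16},
    "Nion 2": {"plain_1ch": 9, "serial_1ch": 6},
    "Nion 3": {"serial_1ch": 6, "CAB8i": 2},
    "Nion 4": {"serial_1ch": 6, "CAB8i": 2},
    "Nion 5": {"CAB8i": 4, "CAB16o": 2},
    "Nion 6": {"CAB8i": 1, "CAB16d": 2},
}


def totals_by_type(roles):
    """Sum the number of devices of each type across all roles."""
    totals = {}
    for devices in roles.values():
        for kind, count in devices.items():
            totals[kind] = totals.get(kind, 0) + count
    return totals


def roles_with(roles, *kinds):
    """Return the role names that talk to every one of the given device types."""
    return [name for name, devices in roles.items()
            if all(kind in devices for kind in kinds)]


if __name__ == "__main__":
    print(totals_by_type(ROLES))
    print(roles_with(ROLES, "serial_1ch", "CAB8i"))
```

Tallied this way, the per-role list gives 18 serial 1ch devices, and Nion 3 and Nion 4 are the only roles that combine serial 1ch devices with CAB8i units. It also adds up to 9 CAB8i, while the summary above says 7, so one of those two figures probably needs rechecking.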
If the system is run with all devices connected and serial bridging enabled, the CM1 on Nion 4 or Nion 3 will crash less than 3 minutes after the roles start. The flash code on the CM1 is 3.4.3, which stands for:
"Byte code: 77; Flash code: 3,4,3; Type: FATAL; Name: ILLEGAL_INST; Description: Illegal instruction encountered.; Expected conditions: - ; Unexpected conditions: Hardware problem with main memory or address/data busses."
Quoting Kevin Gross on this code: "The illegal instruction error usually would occur as the result of a software programming error resulting in corrupted memory. The fact that we only see it under certain stressful situations is not entirely surprising. The error could be invoked by particular timing relationship between host access, network traffic and serial bridging activity. The error could be invoked by an overflow condition caused by multiple concurrent activities, receipt of a malformed Ethernet packet or host request."
Interestingly enough, here are some tests that I have performed and their results:
- Only N3 and N4 seem to crash (at least over 3-4 days), not N2, which also handles serial data and a significant number of 1ch devices
- Which NION has the problem depends on the role it runs and not on the hardware: I have swapped roles, and the problem follows the role, not the NION box.
- The problem only happens if serial bridging is enabled
- The problem becomes worse (happens more often) with more CobraNet traffic. As desk unit CobraNet bundles are removed, the problem becomes much less frequent, happening every day or so instead of every 3 minutes or so.
- The problem also happens in a separate system installed 3 years ago with a smaller network, a different project file and, of course, different infrastructure and equipment
- The problem becomes less frequent (but does eventually happen) if the network ring is broken, as compared to having a ring with RSTP or STP (both behave the same). It also seems to be less frequent if the ring is made smaller (removing 5 of the 12 switches that are typically part of the network).
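Since the crashes are confirmed through the SysUptime variable, the crash frequency in tests like these can be estimated from periodic uptime readings. Below is a minimal sketch of that idea. It assumes the readings have already been collected somehow as (wall-clock seconds, uptime ticks) pairs, and that the uptime is reported in hundredths of a second like a standard SNMP sysUpTime; the function names and the sample values are made up for illustration.

```python
def find_restarts(samples):
    """Estimate boot times from uptime readings.

    samples: list of (wall_clock_seconds, uptime_ticks) pairs sorted by time.
    Assumption: uptime_ticks counts hundredths of a second since boot.
    A reading lower than the previous one is treated as a restart
    (a counter wrap would look the same, but takes over a year to occur).
    """
    restarts = []
    for (_, up_prev), (t_cur, up_cur) in zip(samples, samples[1:]):
        if up_cur < up_prev:
            # The device booted roughly "uptime" seconds before this reading.
            restarts.append(t_cur - up_cur / 100.0)
    return restarts


def intervals_between(restarts):
    """Seconds elapsed between consecutive estimated restarts."""
    return [later - earlier for earlier, later in zip(restarts, restarts[1:])]


if __name__ == "__main__":
    # Made-up readings taken once a minute; uptime drops twice.
    readings = [(0, 1000), (60, 7000), (120, 500), (180, 6500), (240, 300)]
    boots = find_restarts(readings)
    print(boots)
    print(intervals_between(boots))
```

Logging the intervals this way for each test (ring broken, fewer switches, bundles removed) would give numbers to compare instead of "every 3 minutes or so" versus "every day or so".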
At this point, I consider this either an inherent CobraNet problem or a problem with the way NIONs communicate with the CM1. There is no technical reason (e.g. bandwidth, CM1 capabilities, etc.) that should cause this problem. Any ideas or suggestions will of course be greatly appreciated, but also be aware that CM1 crashes are not only related to RSTP or bad hardware, and that there are other scenarios under which this problem might occur.
Thoughts, ideas, etc? Thanks!
Rodrigo