
Incident summary prepared by Laura McKenzie – Control Systems Reliability Team
Incident Overview
At 02:17 local time, a scheduled power restoration was completed following planned electrical maintenance.
One controller failed to return to service.
The affected unit was a Honeywell 10012/1/2 CPU module.
Observed Condition
- CPU powered on normally
- Status LEDs remained static
- No transition to RUN state
- No communication established with peer nodes
The system did not crash.
It simply never finished starting.
Initial Assumptions
The usual assumptions were made:
- Power instability during restoration
- Firmware mismatch
- Incomplete startup sequence
None of these were confirmed.
Voltage levels were within specification.
No recent firmware changes had been applied.
Boot Process Behavior
During startup, the 10012/1/2 performs a flash integrity check:
- Firmware image validation
- Configuration block verification
- Bootloader checksum comparison
A failure at any of these checks prevents the bootloader from transferring execution to the runtime kernel.
In this case, the process stopped silently.
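The three checks can be pictured as a gate that hands over to the runtime kernel only when every region matches its stored reference. This is a minimal Python sketch, not Honeywell's implementation: the use of CRC-32 (via zlib), the FlashImage layout and the stage names are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass
class FlashImage:
    # Hypothetical layout: three regions, each with a reference CRC-32
    # assumed to have been stored when the image was written.
    firmware: bytes
    config: bytes
    bootloader: bytes
    firmware_crc: int
    config_crc: int
    bootloader_crc: int


def boot(image: FlashImage) -> str:
    """Run the three checks in sequence and return the stage reached.

    Returns "RUN" only if every region matches its reference; otherwise
    returns "HALT:<check>" and never hands control to the runtime kernel.
    """
    checks = [
        ("firmware_image", image.firmware, image.firmware_crc),
        ("config_block", image.config, image.config_crc),
        ("bootloader", image.bootloader, image.bootloader_crc),
    ]
    for name, region, expected in checks:
        if zlib.crc32(region) != expected:
            # No logging or communication exists yet, so nothing is reported:
            # the module simply stops here.
            return f"HALT:{name}"
    return "RUN"


if __name__ == "__main__":
    fw, cfg, bl = b"firmware", b"config", b"boot"
    good = FlashImage(fw, cfg, bl, zlib.crc32(fw), zlib.crc32(cfg), zlib.crc32(bl))
    bad = FlashImage(fw, b"c0nfig", bl, zlib.crc32(fw), zlib.crc32(cfg), zlib.crc32(bl))
    print(boot(good))  # RUN
    print(boot(bad))   # HALT:config_block
```

The point of the sketch is the return path: a halt produces a stage name only inside the simulation. On the real module there is no channel yet to carry it anywhere.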
Why No Explicit Error Was Reported
The bootloader operates before:
- Full diagnostics
- Communication services
- Event logging
If flash validation fails early, there is no channel available to report the reason.
From the outside, the CPU appears “alive but frozen.”
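Because the bootloader cannot report its own failure, detection has to come from outside the module. This is a minimal sketch that assumes a monitoring layer able to see whether a node is powered, whether it has reached RUN, and whether peer communication is up. The NodeObservation fields and the 300-second timeout are hypothetical and would need to be derived from the site's own normal cold-start times.

```python
from dataclasses import dataclass

# Hypothetical limit: longest normal cold start observed on site, plus margin.
BOOT_TIMEOUT_S = 300.0


@dataclass
class NodeObservation:
    powered: bool                  # supply present at the module
    seconds_since_power_on: float
    reached_run: bool              # controller reports RUN state
    peer_comms_up: bool            # at least one peer link established


def classify(obs: NodeObservation, timeout_s: float = BOOT_TIMEOUT_S) -> str:
    if not obs.powered:
        return "OFF"
    if obs.reached_run:
        return "RUNNING" if obs.peer_comms_up else "RUNNING_NO_PEERS"
    if obs.seconds_since_power_on < timeout_s:
        return "STARTING"
    # Powered, past the timeout, never reached RUN: the "alive but frozen" case.
    return "STUCK_IN_BOOT"


if __name__ == "__main__":
    # The incident pattern: power present, LEDs static, no RUN, no peers.
    print(classify(NodeObservation(True, 900.0, False, False)))  # STUCK_IN_BOOT
```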
Root Cause Determination
Post-removal analysis showed:
- One flash memory sector intermittently unreadable
- Checksum results inconsistent between power cycles
- No physical damage visible
The flash device had degraded just enough to pass validation occasionally and to fail unpredictably.
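The inconsistent checksums can be reproduced with a small simulation. This sketch assumes a CRC-32 over the whole image and models the unreadable sector as a single flipped bit on some reads. The sector size, sector count, marginal sector index and read-failure probability are all hypothetical. It illustrates the mechanism, not the actual flash device.

```python
import random
import zlib

SECTOR_SIZE = 256   # hypothetical sector size in bytes
N_SECTORS = 8       # hypothetical number of sectors in the image
MARGINAL = 3        # index of the hypothetical marginal sector


def make_flash() -> list[bytes]:
    # Deterministic placeholder contents, not a real firmware image.
    return [bytes((i + j) % 256 for j in range(SECTOR_SIZE)) for i in range(N_SECTORS)]


def read_flash(sectors: list[bytes], p_bad_read: float, rng: random.Random) -> bytes:
    """Read every sector; the marginal one sometimes returns a flipped bit."""
    data = bytearray()
    for idx, sector in enumerate(sectors):
        if idx == MARGINAL and rng.random() < p_bad_read:
            corrupted = bytearray(sector)
            corrupted[rng.randrange(SECTOR_SIZE)] ^= 1 << rng.randrange(8)
            data += corrupted
        else:
            data += sector
    return bytes(data)


def simulate_power_cycles(cycles: int, p_bad_read: float, seed: int = 0) -> list[bool]:
    """Return, per power cycle, whether the image checksum matched the reference."""
    rng = random.Random(seed)
    sectors = make_flash()
    reference = zlib.crc32(b"".join(sectors))
    return [zlib.crc32(read_flash(sectors, p_bad_read, rng)) == reference
            for _ in range(cycles)]


if __name__ == "__main__":
    results = simulate_power_cycles(cycles=20, p_bad_read=0.3)
    print("".join("P" if ok else "F" for ok in results))
```

Because CRC-32 detects every single-bit error, each bad read fails validation. The pass/fail pattern therefore tracks the sector's read reliability directly, which matches "pass occasionally, fail unpredictably."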
Why Power Cycling Made It Worse
Repeated power cycles increased stress:
- Marginal sectors failed more frequently
- Validation timing varied
- Boot success probability dropped to zero
Once the failure became consistent, recovery was no longer possible.
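The drop to zero can be illustrated with a simple probability model. If a boot needs every marginal sector to read correctly, the chance of a successful boot is the product of each sector's chance of reading correctly, so it falls quickly as those chances shrink. The linear wear rate, the starting failure rates and the independence between sectors are assumptions for illustration, not measured values. The varying validation timing is not modelled.

```python
from math import prod


def sector_fail_prob(p0: float, wear_per_cycle: float, cycle: int) -> float:
    """Hypothetical linear wear model, clamped to the range [0, 1]."""
    return min(1.0, max(0.0, p0 + wear_per_cycle * cycle))


def boot_success_prob(p0s: list[float], wear_per_cycle: float, cycle: int) -> float:
    """Probability that every marginal sector reads correctly on this boot.

    Assumes sector read failures are independent of each other.
    """
    return prod(1.0 - sector_fail_prob(p0, wear_per_cycle, cycle) for p0 in p0s)


if __name__ == "__main__":
    marginal = [0.05, 0.10]  # hypothetical starting failure rates per boot
    for cycle in range(0, 12, 2):
        print(cycle, round(boot_success_prob(marginal, 0.1, cycle), 3))
```

With these placeholder numbers, the success probability starts at about 0.855 and reaches exactly 0 by cycle 10, once both sectors' failure probabilities are clamped at 1.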
Recovery Actions
- CPU module replaced
- Firmware and application restored from validated backup
- Startup verified under controlled power conditions
The system returned to normal operation without further anomalies.
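Restoring from a validated backup presumes the backup itself can be checked before it is loaded. One way to do this, assumed here rather than taken from the recovery procedure, is to compare a cryptographic digest of the backup file with the digest recorded when the backup was taken. The following sketch assumes SHA-256 digests stored as hex strings. How the digest is recorded, and the file name used, are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def backup_is_valid(path: Path, recorded_sha256: str) -> bool:
    """True if the backup still matches the digest recorded when it was taken."""
    return sha256_file(path) == recorded_sha256.strip().lower()


if __name__ == "__main__":
    # Demo with a throwaway file standing in for a backup image.
    with tempfile.TemporaryDirectory() as d:
        image = Path(d) / "backup.bin"  # hypothetical file name
        image.write_bytes(b"example image contents")
        recorded = hashlib.sha256(b"example image contents").hexdigest()
        print(backup_is_valid(image, recorded))  # True
        image.write_bytes(b"example image c0ntents")
        print(backup_is_valid(image, recorded))  # False
```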
Preventive Measures Implemented
- Controlled power restoration procedures
- Flash health considered during lifecycle reviews
- Cold-start testing added to maintenance routines
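A degraded flash device can pass one start and fail the next. A cold-start test is therefore more informative when a single missed start fails the whole test, rather than being averaged away. This is a minimal sketch. The result fields, the verdict names and the max_boot_s threshold are hypothetical, and flagging slow starts is only a loose proxy for the varying validation timing described earlier. Since repeated power cycles stress marginal flash, keeping the number of trials modest seems prudent.

```python
from dataclasses import dataclass


@dataclass
class ColdStartResult:
    reached_run: bool
    boot_seconds: float  # power-on to RUN; not meaningful if RUN was not reached


def assess(results: list[ColdStartResult], max_boot_s: float) -> str:
    """Return a hypothetical verdict for one module's cold-start trials."""
    if not results:
        raise ValueError("no cold-start trials recorded")
    if any(not r.reached_run for r in results):
        # One missed start fails the module; intermittent passes hide degradation.
        return "FAIL"
    if any(r.boot_seconds > max_boot_s for r in results):
        # Every start succeeded, but at least one was unusually slow.
        return "INVESTIGATE"
    return "PASS"


if __name__ == "__main__":
    trials = [ColdStartResult(True, 42.0), ColdStartResult(False, 0.0), ColdStartResult(True, 44.0)]
    print(assess(trials, max_boot_s=120.0))  # FAIL
```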
Key Findings
- Flash degradation can block startup without alarms
- Boot-stage failures are often invisible to operators
- Power events accelerate marginal flash failures
- Backup alone does not prevent startup failure
Closing Statement
The Honeywell 10012/1/2 CPU module did not fail under load.
It failed before it could even begin.
In control systems, the most dangerous failures are the ones that happen before the system can explain itself.
— Laura McKenzie
Excellent PLC
