How I Spent a Day Trying to Recover a Crashed OpenStack Environment — And What I Learned

Source: DEV Community
A real-world incident report for engineers dealing with filesystem corruption on production Linux servers

The Problem

It started with a simple complaint: our company's OpenStack Horizon portal was unreachable. The browser returned ERR_CONNECTION_TIMED_OUT. No warning, no gradual degradation, just gone.

We had two physical HPE ProLiant DL380 Gen10 servers running the environment, accessible only via HP iLO 5 remote console. No physical access. No one near the data centre. Just me, a browser, and an iLO HTML5 console.

This is the story of what happened, what we tried, what failed, and what every engineer should know before they find themselves in the same situation.

The Environment

Controller Node: HPE ProLiant DL380 Gen10 (12-core)
Compute Node: HPE ProLiant DL380 Gen10 (10-core)
OS: Ubuntu 22.04 LTS
Storage: LVM on top of hardware RAID (HPE Smart Array P408i-a)
Access: HP iLO 5 remote console (HTML5)
VPN: FortiClient VPN required to reach internal network

Step 1 — Diagnosing the Probl