Totally agree. I’ve worked on small ops teams where we drowned in graphs that never told us anything useful.
What actually helped was narrowing down to a few high-signal metrics:
- CPU load average (not just % usage): shows real strain, especially when load stays above the core count.
- Memory used vs. available, plus swap activity: high memory use is fine until the box starts swapping.
- Disk I/O latency (iostat): much more telling than disk usage alone.
- Network errors/drops: often overlooked, but packet loss or retransmits can kill app performance.
- Process/thread count: surges can indicate runaway services.
We use Netdata and Grafana with Prometheus for dashboards, but only alert on what’s actionable.
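
If it helps, here's a rough Python sketch of the first two checks (load vs. core count, and whether the box is actively swapping). It isn't what we run in production, just an illustration for a Linux host. The function names, the 1.0 load-per-core threshold, and the 5-second sample window are all made up for the example. The only things it relies on are `os.getloadavg()`, `os.cpu_count()`, and the `pswpin`/`pswpout` counters in `/proc/vmstat`.

```python
import os
import time


def load_per_core(load_avg, cores):
    """Normalize a load average by core count (guarding against cores == 0)."""
    return load_avg / max(cores, 1)


def parse_swap_counters(vmstat_text):
    """Pull the cumulative pswpin/pswpout page counts out of /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def is_swapping(before, after):
    """True if either swap counter grew between two samples."""
    return any(after[key] > before[key] for key in ("pswpin", "pswpout"))


def read_vmstat(path="/proc/vmstat"):
    """Read the raw vmstat text (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical values: tune these for your own hosts.
    LOAD_THRESHOLD = 1.0
    SAMPLE_SECONDS = 5

    # Use the 5-minute average so short spikes don't trigger a warning.
    _, load5, _ = os.getloadavg()
    ratio = load_per_core(load5, os.cpu_count() or 1)
    if ratio > LOAD_THRESHOLD:
        print(f"WARN: 5-min load is {ratio:.2f}x core count")

    # Swap counters are cumulative, so compare two samples.
    first = parse_swap_counters(read_vmstat())
    time.sleep(SAMPLE_SECONDS)
    second = parse_swap_counters(read_vmstat())
    if is_swapping(first, second):
        print("WARN: host swapped pages during the sample window")
```

The swap check compares two samples because the counters only ever go up since boot. A nonzero value by itself just means the host swapped at some point, not that it's swapping now.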