Hey everyone,
I've been keeping an eye on our "moon" server lately, and the CPU usage metrics have been consistently high, suggesting it might be time to invest in a new, more powerful machine. Before making that decision, I wanted to dig into the data to see exactly what was going on.
For some time now, I've been running a custom Python script, server_metrics.py, at frequent intervals to collect data on system performance and store it in a SQLite database. This has given me a fantastic historical dataset to work with.
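The script itself isn't shown in this post, but the storage side of a collector like this could look roughly like the sketch below. The table and column names (`metrics`, `timestamp`, `top_cpu_name`, `top_cpu_percent`) are taken from the query later in the post. The function names, the column types, and the idea of passing in pre-collected `(name, percent)` readings are assumptions for illustration, not necessarily how server_metrics.py works.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    """Create the metrics table if it does not exist yet (assumed schema)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS metrics (
            timestamp       TEXT NOT NULL,
            top_cpu_name    TEXT NOT NULL,
            top_cpu_percent REAL NOT NULL
        )
        """
    )


def record_sample(conn: sqlite3.Connection, readings):
    """Store the single highest-CPU process from one round of readings.

    `readings` is a list of (process_name, cpu_percent) tuples. How they are
    gathered (psutil, ps, /proc, ...) is deliberately left out of this sketch.
    """
    name, percent = max(readings, key=lambda r: r[1])
    # Use the same UTC text format that SQLite's datetime('now') produces, so
    # comparisons against datetime('now', '-14 days') behave as expected.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO metrics (timestamp, top_cpu_name, top_cpu_percent) "
        "VALUES (?, ?, ?)",
        (now, name, percent),
    )
    conn.commit()
    return name, percent


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a real collector would use a file path
    init_db(conn)
    # Placeholder readings, only to exercise the function.
    record_sample(conn, [("python3", 65.1), ("mariadbd", 2.9), ("caddy", 0.0)])
    print(conn.execute("SELECT * FROM metrics").fetchall())
```

Storing only the top process per sample keeps the table small, at the cost of not seeing anything that was busy but came second.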
Visualizing the Problem
The first step was to visualize the trend. A picture is worth a thousand words, and plotting the data from the last two weeks confirmed my suspicions immediately.
As you can see, the CPU usage is frequently spiking and sustaining high levels, which isn't ideal for a server running multiple applications. The question now is: what's causing it?
Digging into the Data
To find the culprits, I wrote a SQL query to go through the collected metrics. The goal was to find which process names appeared most often as the top CPU consumer, what their average CPU usage was in those moments, and their maximum recorded spike. The results were immediate and unambiguous:
-- Count the samples in which each process was the top CPU consumer
SELECT
    top_cpu_name,
    COUNT(*) AS samples_as_top,
    AVG(top_cpu_percent) AS avg_top_pct,
    MAX(top_cpu_percent) AS max_top_pct
FROM metrics
-- Restrict to the last two weeks
WHERE timestamp >= datetime('now', '-14 days')
GROUP BY top_cpu_name
ORDER BY samples_as_top DESC
LIMIT 10;
top_cpu_name | samples_as_top | avg_top_pct | max_top_pct |
---|---|---|---|
python3 | 14852 | 65.1 | 705.4 |
systemd | 2905 | 0.1 | 246.2 |
mariadbd | 661 | 2.86 | 150.0 |
php-fpm7.4 | 96 | 6.59 | 633.1 |
fail2ban-server | 93 | 0.0 | 0.0 |
caddy | 91 | 0.0 | 0.0 |
kworker/0:0-events | 33 | 0.0 | 0.0 |
kworker/0:2-events | 24 | 0.0 | 0.0 |
kworker/0:1-events | 23 | 0.0 | 0.0 |
multipathd | 22 | 0.0 | 0.0 |
As the data clearly shows, python3 processes are the runaway top consumer of CPU resources on this server. A python3 process was the top process in over 14,800 samples, with an average CPU usage of 65% during those times. Most strikingly, the maximum spike was over 700%, indicating that at certain moments, Python scripts were consuming the equivalent of 7 full CPU cores.
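The core-equivalent figure comes from reading the percentages the way the post does, with 100% meaning one fully busy core. A quick sketch of that arithmetic, applied to the non-zero maximum spikes from the results table (the helper name is made up for illustration):

```python
def cores_equivalent(cpu_percent: float) -> float:
    """Convert a per-process CPU percentage to a number of fully busy cores,
    assuming 100% corresponds to one core."""
    return cpu_percent / 100.0


# Maximum spikes copied from the results table.
max_spikes = {
    "python3": 705.4,
    "php-fpm7.4": 633.1,
    "systemd": 246.2,
    "mariadbd": 150.0,
}

for name, pct in max_spikes.items():
    print(f"{name}: {pct}% is roughly {cores_equivalent(pct):.1f} cores")
```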
This analysis narrows down the problem significantly. It's not a system-level issue with something like Caddy or the database (mariadbd); the load is coming directly from the Python applications I'm running.
The next logical step in this investigation is to dig deeper and differentiate between the various python3 processes to see which specific scripts are the heaviest hitters. But for now, we have a very clear answer to "What's using the CPU?". The answer is: Python.
Next Steps
I have added more verbose data gathering to server_metrics.py to track the command-line arguments of each process, so we know which one is which. I'll continue to monitor the data and report back to you as I find new insights.
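Once the command lines are being recorded, the earlier query can be narrowed to python3 and grouped by script. The sketch below assumes the new data lands in a column called `top_cpu_cmdline`; that name, the `top_python_scripts` helper, and the script names in the demo rows are all hypothetical placeholders, not real results.

```python
import sqlite3

# Same shape as the earlier query, but restricted to python3 and grouped by
# the (assumed) command-line column instead of the process name.
QUERY = """
SELECT
    top_cpu_cmdline,
    COUNT(*) AS samples_as_top,
    AVG(top_cpu_percent) AS avg_top_pct,
    MAX(top_cpu_percent) AS max_top_pct
FROM metrics
WHERE top_cpu_name = 'python3'
  AND timestamp >= datetime('now', '-14 days')
GROUP BY top_cpu_cmdline
ORDER BY samples_as_top DESC
LIMIT 10
"""


def top_python_scripts(conn: sqlite3.Connection):
    """Return (cmdline, samples, avg_pct, max_pct) rows for python3 processes."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metrics (timestamp TEXT, top_cpu_name TEXT, "
        "top_cpu_percent REAL, top_cpu_cmdline TEXT)"
    )
    # Made-up rows purely to exercise the query.
    demo_rows = [
        ("python3", 50.0, "python3 script_a.py"),
        ("python3", 70.0, "python3 script_a.py"),
        ("python3", 10.0, "python3 script_b.py"),
        ("caddy", 0.0, "caddy run"),
    ]
    conn.executemany(
        "INSERT INTO metrics VALUES (datetime('now'), ?, ?, ?)", demo_rows
    )
    for row in top_python_scripts(conn):
        print(row)
```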
As always,
Michael Garcia a.k.a. TheCrazyGM