Troubleshooting Periodic CPU Jitter in a uWSGI + Django Stability Test

Symptoms

During a stability test, a Django application served by uWSGI showed periodic CPU jitter. The monitoring chart shows a pronounced CPU usage peak roughly every 4 hours, each one accompanied by a drop in memory usage:

[Image: monitoring chart of CPU and memory usage (3819eef4ec5b5d97542b477262ded417.jpg)]

Preliminary Analysis: Mistakenly Blaming GC

First Instinct: Garbage Collection (GC)

  • Observation: Memory drops whenever CPU peaks, which looks like a GC cycle.
  • Time Pattern: Roughly every 4 hours, with strong periodicity.
  • Troubleshooting Direction: Checked for a scheduled GC task (none found).

Hypothesis: All Workers Run GC Simultaneously

Further research suggested that uWSGI's pre-fork model could lead to the following:

  1. All workers share the same GC configuration.
  2. Because request processing rhythms are similar, GC is triggered synchronously.
  3. Massive simultaneous GC → CPU spike.
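One way to test this hypothesis before changing any thresholds would be to record when each worker actually runs a collection. The sketch below relies on Python's standard gc.callbacks hook; the names gc_events, record_gc, and install_gc_recorder are hypothetical, and under uWSGI the installer would presumably be called from a @postfork hook so that every worker records its own events.

```python
import gc
import os
import time

# Hypothetical in-memory log; a real setup would more likely write to a logger.
gc_events = []


def record_gc(phase, info):
    """Record one entry for every completed garbage collection."""
    if phase == "stop":
        gc_events.append({
            "pid": os.getpid(),
            "time": time.time(),
            "generation": info["generation"],
            "collected": info["collected"],
        })


def install_gc_recorder():
    """Register the recorder once per process."""
    if record_gc not in gc.callbacks:
        gc.callbacks.append(record_gc)


install_gc_recorder()
gc.collect()  # force one collection so that at least one event is recorded
print(len(gc_events) >= 1)
```

If the recorded timestamps from different workers did not cluster around the CPU peaks, that would already argue against the synchronized-GC explanation.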

Implementation and Verification

Attempt 1: Randomize GC Threshold

import random
import gc
from uwsgidecorators import postfork

@postfork
def randomize_gc_threshold():
    """Set different GC thresholds for each worker to avoid simultaneous GC"""
    gc.set_threshold(
        random.randint(700, 900),  # generation 0
        random.randint(8, 12),     # generation 1
        random.randint(8, 12)      # generation 2
    )

Result: CPU jitter did not improve.

Attempt 2: Dedicated GC Thread

The original plan was to create an independent thread to execute periodic GC, but further research revealed a key clue…
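For reference, the abandoned plan might have looked roughly like the sketch below. It was never implemented or tested in this investigation; the class name PeriodicGC and the interval value are hypothetical, and under uWSGI application threads would also need the enable-threads option.

```python
import gc
import threading


class PeriodicGC:
    """Run gc.collect() on a fixed interval from a background thread."""

    def __init__(self, interval_seconds):
        self.interval_seconds = interval_seconds
        self.runs = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        # Turn off automatic cyclic GC; reference counting still frees most objects.
        gc.disable()
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        gc.enable()

    def _loop(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval_seconds):
            gc.collect()
            self.runs += 1
```

Usage would be along the lines of PeriodicGC(600).start() in each worker after fork.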

Key Breakthrough: Re-examining uWSGI Configuration

Multiple sources recommended the following configuration:

max-requests = 5000
max-requests-delta = 300

However, the actual project configuration was:

max-requests = 50000  # Restart worker after processing 50000 requests

Calculation Verification

Based on the test environment's throughput (about 3.5 req/s):

Restart Interval = 50000 / 3.5 ≈ 14285.7 seconds ≈ 4.0 hours

This matches the ~4 hour CPU peak interval in the monitoring chart.

The CPU jitter was not GC at all, but the worker restart cycle.
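The same arithmetic as a small helper, which makes it easy to compare candidate max-requests values. This is a minimal sketch: the function name is hypothetical, and it assumes the request rate passed in is the rate a single worker actually handles, since max-requests is counted per worker.

```python
def restart_interval_hours(max_requests, requests_per_second):
    """Estimate how long a worker runs before hitting max-requests."""
    seconds = max_requests / requests_per_second
    return seconds / 3600


# Project setting versus the commonly recommended setting, both at 3.5 req/s.
print(round(restart_interval_hours(50000, 3.5), 2))  # 3.97
print(round(restart_interval_hours(5000, 3.5), 2))   # 0.4
```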

The Truth: Periodic Worker Restart Causes CPU Peaks

Mechanism

  1. A worker triggers a graceful restart once it has processed max-requests requests.
  2. During the restart, the worker must:
    • Release Python interpreter memory.
    • Reload Django application.
    • Rebuild connection pool.
    • Re-import modules.

Reason for Memory Drop

Not GC, but:

  1. Old worker exits.
  2. OS completely reclaims the process memory.
  3. New worker starts and reallocates memory.
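One way to confirm that the peaks line up with worker restarts is to log every worker start with its PID and a timestamp, then compare those timestamps against the monitoring chart. The sketch below uses the same postfork decorator as the earlier snippet; the logger name and the function names are hypothetical, and the ImportError fallback exists only so the module can be imported outside uWSGI.

```python
import logging
import os
import time

try:
    from uwsgidecorators import postfork  # only importable inside uWSGI
except ImportError:
    def postfork(func):
        """Fallback no-op decorator for running outside uWSGI."""
        return func

logger = logging.getLogger("worker_lifecycle")


def format_worker_start(pid, started_at):
    """Build a log line with the worker PID and a UTC timestamp."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(started_at))
    return f"worker started pid={pid} at={stamp}"


@postfork
def log_worker_start():
    """Runs in each freshly forked worker, including respawned ones."""
    logger.warning(format_worker_start(os.getpid(), time.time()))
```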

Optimization Suggestions

Once the cause was located, the fix looked straightforward. An AI assistant suggested that staggering the workers' restart points should smooth out the CPU spike, so I adjusted the configuration and tested it.

Adjust Restart Strategy (Currently Adopted)

max-requests = 10000
max-requests-delta = 300

Advantages:

  • Avoid excessive restart costs caused by accumulating a lot of state in a single worker.
  • Random offset avoids simultaneous restarts.
  • CPU and memory curves are smoother.

Memory-based Restart Trigger

# Restart a worker when its RSS exceeds 512 MB
reload-on-rss = 512
# Restart a worker when its virtual address space exceeds 768 MB
reload-on-as = 768

The Problem Persists

After testing, no obvious "random" staggering effect was visible. I began to suspect that max-requests-delta was not supported by the installed uWSGI version.

To verify:

uwsgi --help | grep max-requests-delta

Result: no match, so the option does not exist in this build.

uwsgi --help | grep delta

Output:

--max-worker-lifetime-delta  add (worker_id * delta) seconds to the max_worker_lifetime value of each worker

This indicates that the current version only supports:

--max-worker-lifetime-delta

This mechanism also applies a linear offset proportional to worker_id, not a random one.
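The help text above describes the offset as worker_id * delta seconds added to the base lifetime. A minimal sketch of that formula follows; the function name is hypothetical, and it assumes worker IDs are numbered from 1.

```python
def effective_lifetimes(max_worker_lifetime, delta, worker_count):
    """Per-worker lifetime in seconds: base value plus worker_id * delta."""
    return [
        max_worker_lifetime + worker_id * delta
        for worker_id in range(1, worker_count + 1)  # assumption: IDs start at 1
    ]


# Four workers, 4 hour base lifetime, 300 second delta.
print(effective_lifetimes(14400, 300, 4))  # [14700, 15000, 15300, 15600]
```

With such a scheme, restarts are spaced evenly rather than randomly, so the spacing between consecutive workers is always exactly delta seconds.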

Summary of Lessons Learned

Reflections on Troubleshooting Mistakes

  1. Symptom Attribution Bias
    • Memory drop ≠ GC
    • Periodicity ≠ Scheduled Task
    • Must consider all relevant mechanisms
  2. Configuration Neglect
    • Over-focusing on code
    • Ignoring runtime configuration
    • Middleware lifecycle management has far-reaching effects
  3. AI Answers Need Verification
    • Cross-check multiple models
    • Check official documentation
    • Version differences are critical
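The version check from this incident can be scripted so that a suggested option is verified against the installed binary before it goes into the configuration. This is a sketch under assumptions: the function names are hypothetical, and it assumes each option appears as the first token of a help line, possibly joined to a short form with "|". Unlike a plain grep, it matches the option name exactly, so max-requests does not match max-requests-delta.

```python
import subprocess


def option_in_help(help_text, option):
    """Return True if --<option> is listed as an option in the help text."""
    flag = "--" + option
    for line in help_text.splitlines():
        tokens = line.split()
        # Assumption: the first token may combine forms, e.g. "-s|--socket".
        if tokens and flag in tokens[0].split("|"):
            return True
    return False


def read_uwsgi_help(binary="uwsgi"):
    """Capture the output of `uwsgi --help` (requires uwsgi on PATH)."""
    result = subprocess.run([binary, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr
```

A call such as option_in_help(read_uwsgi_help(), "max-requests-delta") would then answer the question before any test run is spent on an unsupported option.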

Configuration Audit Checklist

Key uWSGI configurations:

  • max-requests
  • max-worker-lifetime
  • reload-on-rss
  • harakiri
  • enable-metrics
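The checklist can also be run against an ini file automatically. The sketch below uses Python's standard configparser; the function name and the CHECKLIST constant are hypothetical, and it assumes the options live in a [uwsgi] section. It reports only whether an option is set, not whether the running uWSGI version supports it.

```python
import configparser

CHECKLIST = (
    "max-requests",
    "max-worker-lifetime",
    "reload-on-rss",
    "harakiri",
    "enable-metrics",
)


def audit_uwsgi_options(ini_text, section="uwsgi"):
    """Return (found, missing) for the checklist options in an ini string."""
    parser = configparser.ConfigParser(
        strict=False,         # tolerate repeated keys
        allow_no_value=True,  # tolerate bare flags without "= value"
        interpolation=None,   # keep "%" characters literal
    )
    parser.read_string(ini_text)
    found, missing = {}, []
    for option in CHECKLIST:
        if parser.has_option(section, option):
            found[option] = parser.get(section, option)
        else:
            missing.append(option)
    return found, missing


sample = "[uwsgi]\nmax-requests = 10000\nharakiri = 30\n"
found, missing = audit_uwsgi_options(sample)
print(found)
print(missing)
```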

Follow-up

uWSGI's restart mechanism is what triggers the CPU jitter, but the CPU time itself is consumed by the Django application loading in each new worker. Future work could reduce the number of workers or speed up Django startup.

Core Takeaways

  1. Infrastructure configuration is as important as code quality.
  2. Understanding middleware lifecycle is key to troubleshooting.
  3. Establish a systematic troubleshooting mental model.
  4. Monitoring combined with logs provides a complete perspective.

This investigation is a reminder that in complex systems, seemingly obvious causes are often misleading. Effective tuning comes from understanding how every layer of the stack operates, not from stopping at the surface.