Troubleshooting Common Process Governor Issues
A process governor helps control resource usage, limit runaway processes, and keep systems stable. When it malfunctions or behaves unexpectedly, applications can misbehave, users can experience slowdowns, and legitimate processes may be terminated incorrectly. This article walks through common issues, how to diagnose them, and practical fixes.
1. Process governor terminates processes too aggressively
Symptoms
- Processes that normally complete are frequently killed.
- Spikes in user complaints after heavy but legitimate workloads.
Causes
- Thresholds (CPU, memory, runtime) set too low.
- Misread metrics (e.g., cumulative vs. instantaneous CPU).
- Incorrect process classification (governor treats critical processes as background).
Troubleshooting steps
- Review thresholds: Compare limits to observed normal peaks. Temporarily raise limits to confirm.
- Check metric types: Ensure governor uses appropriate metrics (instantaneous CPU for short spikes, averaged CPU for sustained load).
- Inspect process tags: Verify process identification rules (names, UIDs, cgroups). Add explicit whitelists for critical services.
- Look at logs: Examine governor logs for kill reason and resource snapshot at termination time.
Fixes
- Increase limits or use adaptive thresholds (percentile-based).
- Use longer sampling windows or smoothing for CPU/memory metrics.
- Add whitelists or priority rules for essential services.
- Implement graceful termination (SIGTERM, delay) so processes can checkpoint.
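The last fix can be sketched as a small escalation routine, assuming a POSIX system; the grace period and the existence-check loop are illustrative, not a specific governor's API:

```python
import os
import signal
import time

def terminate_gracefully(pid: int, grace_seconds: float = 10.0) -> bool:
    """Send SIGTERM, wait up to grace_seconds for the process to checkpoint
    and exit, then escalate to SIGKILL. Returns True if it exited in time.

    Note: if pid is an unreaped child of this process, the existence check
    below will see its zombie entry and report it as still alive.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 checks existence without signaling
        except ProcessLookupError:
            return True
        time.sleep(0.1)

    try:
        os.kill(pid, signal.SIGKILL)  # grace period expired; force kill
    except ProcessLookupError:
        pass
    return False
```

The grace period should be long enough for the service's checkpoint or shutdown hook to run; too short and SIGTERM degenerates into SIGKILL with extra steps.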
2. Process governor fails to enforce limits
Symptoms
- Processes exceed configured CPU/memory limits without being throttled or killed.
- System-level resources exhausted despite governance being enabled.
Causes
- Governor not attached to target processes (wrong PID/cgroup).
- Insufficient privileges to enforce controls.
- Kernel features (cgroups, OOM killer) misconfigured or unavailable.
- Governor service crashed or in degraded mode.
Troubleshooting steps
- Verify attachment: Confirm governor shows the target PIDs or cgroups under management.
- Check permissions: Ensure governor runs with required privileges (root or CAP_SYS_RESOURCE).
- Inspect system features: Confirm cgroups (v1 or v2) are enabled and configured; check kernel logs for related errors.
- Service health: Confirm governor process is running and has not disabled enforcement.
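To verify attachment, the governor can compare its managed set against what `/proc/<pid>/cgroup` actually reports. A minimal sketch of parsing that file for the cgroup v2 unified hierarchy (the v2 entry is the line beginning `0::`):

```python
def cgroup_v2_path(proc_cgroup_text: str) -> str:
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The v2 unified hierarchy appears as a line of the form "0::/path";
    cgroup v1 entries carry a controller name in the middle field instead.
    Returns None when the host exposes no v2 entry (cgroup v1 only).
    """
    for line in proc_cgroup_text.splitlines():
        hier_id, controllers, path = line.split(":", 2)
        if hier_id == "0" and controllers == "":
            return path
    return None
```

In practice the governor would read `/proc/<pid>/cgroup` for each target PID and flag any process whose path falls outside the cgroups it manages.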
Fixes
- Correct process selection rules or reattach governor to cgroups.
- Run governor with proper capabilities or via system service manager with elevated rights.
- Reconfigure/enable cgroups, or use alternate enforcement (nice, cpulimit) where cgroups are unavailable.
- Restart the governor and enable automatic recovery or health monitoring.
3. False positives from short-lived spikes
Symptoms
- Short CPU or memory bursts trigger enforcement even though workload is transient.
- Batch jobs or build tasks frequently interrupted.
Causes
- Thresholds lack tolerance for brief bursts.
- Sampling window too narrow.
- No burst allowance configured.
Troubleshooting steps
- Examine time series: Look at resource graphs around enforcement events to see burst duration.
- Review sampling config: Check window size and smoothing parameters.
- Identify workload patterns: Determine if bursts are expected (e.g., compile/link steps) and predictable.
Fixes
- Increase sampling window or use exponential moving average to smooth spikes.
- Configure burst allowances or token-bucket style policies that allow short bursts.
- Create job-specific rules that exempt scheduled batch work or elevate its limits during known windows.
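The first two fixes can be sketched together, assuming each over-threshold sample spends one token so that enforcement only triggers once the bucket is drained; the class and parameter names are illustrative:

```python
import time

def ema(prev: float, sample: float, alpha: float = 0.2) -> float:
    """Exponential moving average; smaller alpha means heavier smoothing."""
    return alpha * sample + (1.0 - alpha) * prev

class TokenBucket:
    """Tolerates bursts of up to `capacity` over-threshold samples while
    limiting the sustained rate of such samples to `rate` per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A compile step that spikes CPU for a few samples drains a few tokens and is left alone; a process that stays over threshold eventually exhausts the bucket and is enforced.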
4. High governor CPU/memory overhead
Symptoms
- Governor itself consumes significant CPU or memory, reducing available resources.
- Resource monitoring shows governor as a frequent top consumer.
Causes
- Excessive polling frequency or overly detailed metrics collection.
- Memory leaks or inefficient data structures.
- Large numbers of managed processes causing heavy bookkeeping.
Troubleshooting steps
- Profile governor: Use perf, top, or pprof (for Go) to find hotspots.
- Check polling intervals: Review metric collection frequency.
- Inspect data structures: Look for unbounded caches or retained historical data.
- Scale test: Observe governor behavior as number of managed processes increases.
Fixes
- Reduce polling frequency or switch to event-driven metrics where possible (kernel events, inotify).
- Fix memory leaks and optimize algorithms/data structures.
- Shard management across multiple governor instances or use hierarchical cgroups to reduce per-process overhead.
- Aggregate metrics to lower cardinality.
5. Inaccurate metrics feeding enforcement decisions
Symptoms
- Enforcement decisions inconsistent with observed system state.
- Mismatches between monitoring dashboards and governor logs.
Causes
- Time skew between components.
- Incomplete or lossy metrics pipeline.
- Wrong units or sampling semantics (bytes vs MiB, percent vs absolute).
Troubleshooting steps
- Compare timestamps: Ensure synchronized clocks (NTP/chrony) across hosts and services.
- Validate metric pipeline: Check for dropped packets, buffer overflows, or serialization issues.
- Verify units/labels: Ensure consistency across data sources and governance rules.
Fixes
- Enable and verify time synchronization.
- Harden metrics transport (retries, batching, backpressure).
- Normalize units and add validation checks in metric ingestion.
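A sketch of unit normalization at ingestion, assuming memory metrics may arrive labeled in mixed binary (IEC) and decimal (SI) units; the accepted unit set is illustrative:

```python
# Binary (IEC) and decimal (SI) factors; normalize everything to bytes.
_UNIT_FACTORS = {
    "b": 1,
    "kib": 1024, "mib": 1024 ** 2, "gib": 1024 ** 3,
    "kb": 1000, "mb": 1000 ** 2, "gb": 1000 ** 3,
}

def to_bytes(value: float, unit: str) -> int:
    """Normalize a memory metric to bytes, rejecting unknown units early
    so a mislabeled source fails loudly instead of skewing decisions."""
    factor = _UNIT_FACTORS.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown unit {unit!r}; expected one of {sorted(_UNIT_FACTORS)}")
    return int(value * factor)
```

Rejecting unknown units at the ingestion boundary turns a silent 1024x skew (MiB read as bytes, for example) into an immediate, attributable error.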
6. Conflicts with other system components (OOM killer, schedulers)
Symptoms
- Governor and system OOM killer both acting, causing unpredictable terminations.
- Interaction issues with container orchestrators (Kubernetes) or batch schedulers.
Causes
- Multiple controllers managing the same resources without coordination.
- Kubernetes resource limits/requests mismatched with governor settings.
- Governor unaware of container or orchestrator semantics.
Troubleshooting steps
- Check system logs: Look for OOM events and compare with governor actions.
- Review orchestrator settings: Inspect Kubernetes QoS, requests/limits, and eviction thresholds.
- Map control boundaries: Determine which system component has primary authority for resource control.
Fixes
- Coordinate policies: let one layer be authoritative or implement hierarchical policies.
- Align Kubernetes limits/requests with governor thresholds; leverage Vertical Pod Autoscaler or LimitRange.
- Make governor orchestrator-aware (respect cgroup v2 unified hierarchy and Kubernetes QoS classes).
7. Policy complexity causes unexpected behavior
Symptoms
- Complex, overlapping rules lead to surprising outcomes.
- Difficulty predicting which rule applies.
Causes
- Rule precedence not well-defined.
- Too many special-case exceptions or overlapping selectors.
Troubleshooting steps
- Audit policies: Export and read active policies; look for overlaps and contradictions.
- Simulate rules: Run a dry-run or simulation mode to see which rule would apply.
- Prioritize rules: Identify and document precedence.
Fixes
- Simplify policies and prefer explicit, minimal rules.
- Add clear precedence and fallbacks.
- Use test suites and dry-run capability before deploying policy changes.
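The dry-run idea can be sketched as a pure rule-matching function with explicit precedence; `Rule` and its fields are hypothetical, not a real governor API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    name: str
    selector: Callable[[dict], bool]  # predicate over process attributes
    priority: int                     # higher number wins

def match_rule(process: dict, rules: list) -> Optional[Rule]:
    """Report which rule WOULD apply to a process, without enforcing it,
    and fail loudly when two matching rules tie on priority."""
    candidates = sorted(
        (r for r in rules if r.selector(process)),
        key=lambda r: r.priority,
        reverse=True,
    )
    if not candidates:
        return None
    if len(candidates) > 1 and candidates[0].priority == candidates[1].priority:
        raise ValueError(
            f"ambiguous precedence: {candidates[0].name!r} vs {candidates[1].name!r}"
        )
    return candidates[0]
```

Because matching is separated from enforcement, the same function can back both a dry-run CLI and a policy test suite, and ambiguous overlaps surface as errors before deployment rather than as surprises in production.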
8. Logs and observability gaps
Symptoms
- Lack of information to determine why actions were taken.
- Long time to diagnose incidents.
Causes
- Insufficient logging level or missing contextual data.
- Metrics not correlated with governance events.
Troubleshooting steps
- Increase log verbosity: Temporarily enable debug logs when reproducing issues.
- Add context: Ensure logs include PID, cgroup, resource snapshot, rule ID, and timestamps.
- Correlate events: Link governance actions to metric streams and system events.
Fixes
- Improve structured logging and include telemetry for audits.
- Emit events to centralized observability (logs, traces, metrics) with consistent identifiers.
- Provide a UI or CLI tools to query recent enforcement actions.
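One way to sketch the structured-logging fix: emit one JSON line per governance action, carrying the identifiers (PID, cgroup, rule ID, resource snapshot) the troubleshooting steps above call for. The field names here are illustrative:

```python
import json
import sys
import time

def log_enforcement(action: str, pid: int, cgroup: str, rule_id: str,
                    snapshot: dict, stream=sys.stdout) -> None:
    """Write one JSON object per governance action so incidents can be
    correlated with metric streams by timestamp, PID, cgroup, and rule ID."""
    event = {
        "ts": time.time(),    # epoch seconds; keep host clocks NTP-synced
        "event": "enforcement",
        "action": action,     # e.g. "throttle", "sigterm", "sigkill"
        "pid": pid,
        "cgroup": cgroup,
        "rule_id": rule_id,
        "resources": snapshot,  # CPU/memory snapshot at decision time
    }
    stream.write(json.dumps(event, sort_keys=True) + "\n")
```

Stable field names and one event per line make these logs trivially queryable by log pipelines and easy to join against metric series on `ts` and `pid`.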
Quick checklist for incident response
- Check governor service health and recent logs.
- Confirm target processes/cgroups are attached.
- Verify thresholds, sampling windows, and burst allowances.
- Ensure system features (cgroups, permissions) are functional.
- Correlate enforcement events with metric graphs and system logs.
- If uncertain, enable a dry-run mode or temporarily relax rules.
Conclusion
A reliable process governor depends on correct thresholds, accurate metrics, proper attachment to processes/cgroups, and good observability. Use conservative defaults, provide burst tolerance, keep policies simple, and instrument thoroughly to reduce both false positives and negatives.