LabPP_Solaris Troubleshooting: Common Issues and Fixes
1. Boot failures or kernel panics
- Symptom: System stalls during boot, shows panic messages, or drops to single-user mode.
- Likely causes: Corrupted kernel or initramfs, incompatible kernel modules, disk corruption, recent configuration changes.
- Fixes:
- Boot from rescue media and check filesystem integrity (fsck).
- Restore a known-good kernel or initramfs from backup; remove recently added/third-party kernel modules.
- Review /var/adm/messages and dmesg for exact error strings to identify failing drivers.
- If hardware-related, run vendor diagnostics on memory and disks.
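The log review step above can be sketched as a small shell helper. This is a minimal illustration, not a definitive tool: the function name scan_log_errors and the keyword list are assumptions chosen for the example, and the log path should be adjusted to whatever your system actually writes.

```shell
# scan_log_errors FILE [MAX]
# Print the last MAX (default 20) lines of FILE, prefixed with their line
# numbers, that mention a common failure keyword. The keyword list is an
# illustrative assumption; extend it with the strings your drivers emit.
scan_log_errors() {
    grep -n -i -E 'panic|fatal|i/o error|fail' "$1" | tail -n "${2:-20}"
}
```

For example, scan_log_errors /var/adm/messages 50 would show up to the 50 most recent matching lines, which is usually enough to spot the driver or device named in the failure.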
2. Network interface not coming up
- Symptom: No network on interface after reboot; ifconfig/ip shows interface down or missing.
- Likely causes: Misconfigured network scripts, wrong interface naming, driver/module not loaded, DHCP failure.
- Fixes:
- Check interface config files (e.g., /etc/hostname. or NetworkManager settings) and ensure the persistent interface name is correct.
- Manually bring the interface up: ip link set dev eth0 up, and obtain an IP: dhclient eth0 or ip addr add.
- Confirm the driver is loaded: lsmod/modinfo; load with modprobe.
- Inspect logs: tail -n 200 /var/log/syslog or journalctl for DHCP/NetworkManager errors.
3. Package installation or dependency failures
- Symptom: Package manager errors, unmet dependencies, failed installs.
- Likely causes: Repository misconfiguration, corrupted package cache, incompatible package versions.
- Fixes:
- Update repository metadata and clean cache: pkg update / pkg refresh --full or equivalent.
- Rebuild package database if available.
- Pin or explicitly install required dependency versions; use pkg install --reinstall.
- Check repository URLs and GPG keys; re-add or refresh keys if signature errors occur.
4. High CPU or memory usage by services
- Symptom: System slow, high load averages, swapping.
- Likely causes: Memory leaks, runaway processes, misconfigured service limits.
- Fixes:
- Identify culprits: top, htop, ps aux --sort=-%mem.
- Restart or gracefully reload misbehaving services; check their logs for errors.
- Tune service limits (ulimits, systemd service resource limits) or add swap if appropriate.
- Apply patches or update software if memory leaks are known bugs.
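Where ps lacks the --sort option, the ranking step can be done with standard tools. The sketch below assumes input rows of the form "PID RSS COMMAND" on stdin; the helper name top_by_rss is made up for this example, and the rss output field, while widely supported, is not guaranteed on every ps implementation.

```shell
# top_by_rss [N]
# Read "PID RSS COMMAND" rows on stdin and print the N rows (default 5)
# with the largest resident set size. Rows whose second field is not a
# number, such as the ps header, are dropped.
top_by_rss() {
    awk '$2 ~ /^[0-9]+$/' | sort -k2,2nr | head -n "${1:-5}"
}
```

A typical invocation would be ps -eo pid,rss,comm | top_by_rss 10. Reading from stdin keeps the helper independent of which ps flags a given platform accepts.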
5. Storage full or unexpected disk usage
- Symptom: “No space left” errors; important services fail to write.
- Likely causes: Log growth, orphaned files, snapshots, temporary files.
- Fixes:
- Find large files: du -sh /* and find / -xdev -type f -size +100M.
- Rotate or compress logs; clear tmp directories.
- Check for snapshots (ZFS/Btrfs) consuming space and prune old ones.
- Expand filesystem or add storage if consumption is legitimate.
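The +100M size suffix is not accepted by every find implementation. As a hedged alternative, the sketch below expresses the threshold in bytes with the c suffix; the function name large_files and its default threshold of 104857600 bytes (100 MiB) are assumptions for illustration.

```shell
# large_files DIR [MIN_BYTES]
# List regular files under DIR, without crossing filesystem boundaries,
# that are larger than MIN_BYTES (default 104857600, i.e. 100 MiB).
large_files() {
    find "$1" -xdev -type f -size +"${2:-104857600}"c
}
```

Running large_files /var, for instance, would restrict the search to one tree, which is often faster than scanning from / when the full filesystem is already known.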
6. Service fails to start (systemd or init)
- Symptom: System reports service start failure, exit codes, or repeated restarts.
- Likely causes: Misconfiguration, missing dependencies, permission issues, port conflicts.
- Fixes:
- Inspect service status and logs: systemctl status and journalctl -u.
- Validate config files with built-in checkers (e.g., nginx -t).
- Check file permissions, SELinux/AppArmor denials, and socket/port availability.
- Run the service manually to surface runtime errors.
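A port conflict can be checked by filtering a socket listing for the port the service wants. This sketch assumes the column layout printed by ss -ltn (header line first, local address and port in the fourth column); the helper name port_listeners is hypothetical, and other tools or ss options use a different layout.

```shell
# port_listeners PORT
# Read `ss -ltn` style output on stdin and print the rows whose local
# address (assumed to be column 4) ends in :PORT. The header row is skipped.
port_listeners() {
    awk -v port="$1" 'NR > 1 {
        n = split($4, parts, ":")      # the port is the last ":"-separated piece
        if (parts[n] == port) print
    }'
}
```

For example, ss -ltn | port_listeners 80 prints nothing when the port is free, and prints the listening socket otherwise.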
7. Authentication and access problems
- Symptom: Users cannot authenticate via SSH, LDAP, or local accounts.
- Likely causes: Incorrect PAM/SSSD configuration, expired keys, clock skew, network reachability to auth servers.
- Fixes:
- Verify PAM and SSSD configuration files and restart related services.
- Check SSH logs (/var/log/auth.log or journalctl) for authentication errors.
- Confirm system clock sync (NTP) and LDAP/AD server reachability.
- Test locally with passwd and su to isolate remote vs local issues.
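When reviewing SSH logs, a tally of failed password attempts per source address helps separate a single misconfigured client from a broader problem. The sketch below assumes the common sshd message form "Failed password for ... from ADDRESS port ..."; the function name failed_ssh_sources is an assumption, and other authentication failures (keys, PAM modules) are logged differently.

```shell
# failed_ssh_sources LOGFILE
# Count "Failed password" entries per source address, most frequent first.
# The address is taken as the word following "from" on each matching line.
failed_ssh_sources() {
    awk '/Failed password/ {
        for (i = 1; i < NF; i++)
            if ($i == "from") { print $(i + 1); break }
    }' "$1" | sort | uniq -c | sort -k1,1nr
}
```

A call such as failed_ssh_sources /var/log/auth.log prints one line per address with its attempt count in the first column.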
8. Time sync drift
- Symptom: System clock drifting, causing cert or authentication failures.
- Likely causes: NTP/chrony service stopped, wrong timezone, hardware clock issues.
- Fixes:
- Ensure chrony/ntpd is running and sync status is healthy: chronyc sources or ntpq -p.
- Set timezone correctly and sync hardware clock: timedatectl set-timezone and hwclock --systohc.
- Check for virtualization host time issues.
9. Security alerts or unusual activity
- Symptom: Unexpected outbound connections, unknown user accounts, modified binaries.
- Likely causes: Compromise, misconfigured services, exposed management interfaces.
- Fixes:
- Isolate affected systems from network and preserve logs for forensics.
- Inspect running processes, network connections (ss -tunap), and recent auth logs.
- Run integrity checks (tripwire/aide) and compare binaries to known-good versions.
- Rotate credentials, update packages, and apply security patches; consider full rebuild if compromised.
10. I/O latency or disk errors
- Symptom: Slow disk I/O, I/O errors in logs, SMART warnings.
- Likely causes: Failing disk, misconfigured RAID, heavy I/O workload.
- Fixes:
- Check SMART data: smartctl -a /dev/sdX.
- Review kernel logs for I/O errors and identify failing device.
- Rebalance or replace failing disks; rebuild RAID arrays as needed.
- Tune filesystem mount options and I/O scheduler for workload.
Troubleshooting workflow (quick checklist)
- Reproduce and capture exact error messages.
- Check logs: system, service-specific, and kernel messages.
- Isolate changes: recent updates, config edits, hardware swaps.
- Test fixes in staging if possible, then apply them to production during maintenance windows.
- Document root cause and remediation; add monitoring/alerts to detect recurrence.
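The capture and log-check steps of this checklist can be partly automated by snapshotting the tail of each relevant log before attempting a fix. The following is a minimal sketch under stated assumptions: the function name capture_logs, the .tail suffix, and the 200-line window are illustrative choices, not part of any standard tool.

```shell
# capture_logs OUTDIR FILE...
# Save the last 200 lines of each readable FILE into OUTDIR as
# <basename>.tail so the evidence survives restarts and log rotation.
# Unreadable or missing files are skipped silently.
capture_logs() (
    out=$1
    shift
    mkdir -p "$out" || exit 1
    for f in "$@"; do
        [ -r "$f" ] || continue
        tail -n 200 "$f" > "$out/$(basename "$f").tail"
    done
)
```

For example, capture_logs /var/tmp/incident /var/adm/messages /var/log/syslog collects whichever of those files exist on the host. Note that two logs with the same basename would overwrite each other in this simple version.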