SDN Controller Outage Due to Database Connection Limit
Incident Details
Summary
On August 1, 2025, AS214503 experienced a Software-Defined Networking (SDN) controller outage caused by a MariaDB max_connections setting that was inadvertently reset during a database upgrade. The incident began at 08:00 UTC and was fully resolved at 09:00 UTC, lasting 1 hour.
The connection limit reverted from the previously configured 16000 to the default 1000, preventing the SDN controller from executing critical network operations. This resulted in slower DHCP lease renewals and degraded network automation capabilities. No packet forwarding issues were observed, as the data plane remained operational throughout the incident.
Impact
- Primary Cause: MariaDB max_connections limit exceeded due to configuration reset
- Affected Systems: SDN controller, network automation, DHCP services
- Configuration Issue: max_connections reset from 16000 to default 1000 during database upgrade
- DHCP Impact: Extended lease renewal times, some renewals delayed
- Network Operations: Automated configuration changes blocked
- Data Plane: No packet forwarding disruption, traffic continued normally
- Customer Impact: Minimal; existing connections were maintained, new DHCP leases were delayed
Timeline (All times UTC)
08:00 - SDN controller alerts triggered for failed database connections
08:05 - Network operations team investigates controller unresponsiveness
08:10 - DHCP lease renewal delays reported by monitoring systems
08:15 - Database connection pool exhaustion identified in MariaDB logs
08:20 - MariaDB max_connections setting found reset to default value of 1000 (previously 16000)
08:25 - Database upgrade logs reviewed, configuration reset identified as cause
08:30 - max_connections restored to 16000, SDN controller service restarted
08:40 - DHCP lease processing returns to normal operation
08:50 - Configuration management updated to prevent future resets during upgrades
09:00 - Full SDN controller functionality restored, incident resolved
Root Cause
Primary Cause: A MariaDB upgrade reset the max_connections setting from the previously configured 16000 to the default value of 1000. The SDN controller maintains persistent database connections for network state management, configuration changes, and DHCP lease tracking, and requires approximately 200-300 concurrent database connections during normal operation.
Contributing Factors:
- Database upgrade procedure did not preserve custom configuration settings
- Lack of post-upgrade configuration validation checks
- Missing configuration management for database settings
- No automated monitoring to detect configuration drift after upgrades
Technical Details: During the MariaDB upgrade, the database configuration file was replaced with default settings, overwriting the production-tuned max_connections value. This caused connection pool exhaustion when the SDN controller attempted to maintain its normal operational connection count, preventing new network operations from being executed.
Resolution
Immediate Actions:
- Identified MariaDB upgrade as the cause of configuration reset at 08:25 UTC
- Restored MariaDB max_connections from 1000 to 16000 at 08:30 UTC
- Restarted SDN controller service to clear connection pool
- Monitored DHCP lease processing recovery
- Verified packet forwarding remained unaffected
Permanent Fixes:
- Implemented configuration management for MariaDB settings in version control
- Added post-upgrade validation procedures to verify custom configurations
- Created automated configuration drift detection with alerting
- Updated database upgrade procedures to preserve custom settings
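The post-upgrade validation and drift detection described above can be sketched roughly as follows. This is a minimal illustration, not the tooling actually deployed: the `EXPECTED_SETTINGS` baseline, the `find_drift` helper, and the `[mysqld]` section name are assumptions made for the example. It also assumes a simple INI-style configuration file; files that use `!includedir` directives before the first section, or hyphenated option names such as `max-connections`, would need extra handling.

```python
import configparser

# Hypothetical baseline; in practice this would be read from version control.
EXPECTED_SETTINGS = {"max_connections": "16000"}


def find_drift(config_text, expected, section="mysqld"):
    """Return {setting: (expected, actual)} for every setting that differs.

    A setting missing from the file is reported with actual=None, because the
    server would then fall back to its built-in default.
    """
    parser = configparser.ConfigParser(
        allow_no_value=True,   # tolerate bare flags such as "skip-name-resolve"
        strict=False,          # tolerate repeated sections or options
        interpolation=None,    # treat "%" in values literally
    )
    parser.read_string(config_text)

    drift = {}
    for name, want in expected.items():
        actual = parser.get(section, name, fallback=None)
        if actual != want:
            drift[name] = (want, actual)
    return drift


if __name__ == "__main__":
    # Example input resembling the state found after the upgrade.
    post_upgrade = "[mysqld]\nmax_connections = 1000\n"
    for name, (want, actual) in find_drift(post_upgrade, EXPECTED_SETTINGS).items():
        print(f"DRIFT {name}: expected {want}, found {actual}")
```

Run as the final step of an upgrade procedure, a non-empty result would fail the upgrade or raise an alert before the controller reconnects.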
Configuration Changes:
- max_connections: restored from 1000 (value after upgrade) to 16000
Prevention Measures
- Configuration Management: MariaDB settings now tracked in version control with automated validation
- Database Monitoring: Implemented comprehensive MariaDB connection tracking with alerting at 70% utilization
- Post-Upgrade Validation: Added automated configuration verification to upgrade procedures
- Connection Pool Optimization: Reviewed and optimized SDN controller connection pooling parameters
- Capacity Planning: Established database connection capacity planning process
- Documentation Update: Updated deployment procedures to include database tuning requirements
- Testing Enhancement: Added load testing for database connection limits in staging environment
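The 70% utilization alert mentioned above could be evaluated along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and the inputs are assumed to come from MariaDB's `Threads_connected` status variable and `max_connections` system variable, collected by whatever monitoring agent is in use.

```python
def utilization_percent(threads_connected, max_connections):
    """Return connection utilization as a percentage of the configured limit."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * threads_connected / max_connections


def should_alert(threads_connected, max_connections, threshold_percent=70):
    """True when current connections reach the alert threshold.

    Integer arithmetic is used for the comparison so that a value sitting
    exactly on the threshold is not affected by floating-point rounding.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return threads_connected * 100 >= max_connections * threshold_percent


if __name__ == "__main__":
    # 300 connections against the restored limit of 16000: no alert.
    print(should_alert(300, 16000))
    # 700 connections against a limit of 1000: alert.
    print(should_alert(700, 1000))
```

Because the threshold is relative to the configured limit, an unexpected reset of that limit shows up as a jump in utilization even when the number of connections has not changed.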
Lessons Learned
- Default database configurations are rarely suitable for production workloads
- Database upgrade procedures must preserve custom configuration settings
- Post-upgrade validation is critical to detect configuration drift
- Automated monitoring for configuration changes shortens detection time and helps prevent extended outages
- SDN controller database requirements must be properly sized during initial deployment
- Data plane and control plane separation provided resilience during the outage
Technical Notes
Data Plane Resilience: The separation between control plane (SDN controller) and data plane (packet forwarding) ensured that existing network traffic continued without interruption. Only new network configurations and DHCP lease renewals were affected during the incident.
DHCP Lease Behavior: Existing DHCP leases continued to function normally throughout the incident. Only lease renewals and new assignments experienced delays due to the SDN controller’s inability to update lease databases.
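The lease behavior above follows from standard DHCP client timers: by default a client first tries to renew at 50% of the lease duration (T1), tries to rebind at 87.5% (T2), and keeps its address until the lease expires. The sketch below illustrates that arithmetic only. The lease durations used are hypothetical, since this report does not state the configured lease length, and the function names are invented for the example.

```python
def dhcp_timers(lease_seconds):
    """Return (T1, T2, expiry) as offsets in seconds from the start of the lease.

    Uses the default timer fractions from RFC 2131: T1 = 0.5, T2 = 0.875.
    """
    t1 = lease_seconds * 0.5
    t2 = lease_seconds * 0.875
    return t1, t2, float(lease_seconds)


def survives_outage(lease_seconds, age_at_outage_start, outage_seconds):
    """True if a lease of the given age is still valid when the outage ends."""
    remaining = lease_seconds - age_at_outage_start
    return remaining > outage_seconds


if __name__ == "__main__":
    # Hypothetical 24-hour lease that is 12 hours old when a 1-hour outage begins.
    print(dhcp_timers(86400))
    print(survives_outage(86400, 43200, 3600))
```

Under these assumptions, a client whose renewal attempts fail during a one-hour outage keeps its address as long as more than an hour of lease time remains.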
Incident resolved. SDN controller operating normally with optimized database configuration and enhanced monitoring.