In October, we experienced one incident that degraded performance across all GitHub services.
October 11 05:59 UTC (lasting 19 hours and 12 minutes)
On October 11, 2024, beginning at 05:59 UTC, the DNS infrastructure at one of our sites began failing to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the site’s DNS systems. While the team worked to restore the infrastructure, the first customer impact began around 17:31 UTC.
The impact was broad: 4% of Copilot users experienced degraded IDE code completions, 25% of Actions workflow users experienced delays of more than 5 minutes, and 100% of code search requests failed for approximately 4 hours.
At 18:05 UTC, we attempted to resolve the issue by redirecting the affected site’s DNS to another site, but this only partially resolved it. While this mitigation restored connectivity within the site, it caused connectivity issues from healthy sites back to the affected site, so we began planning another remediation effort.
At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by providing temporary DNS resolution capabilities to the affected site. At 21:46 UTC, the site’s DNS resolution began to recover, and it was fully healthy again at 22:16 UTC. Lingering code search issues were resolved on October 12 at 01:11 UTC.
After public service functionality was restored, the team continued restoring original functionality within the site. GitHub is working to strengthen the resiliency and automation processes around this infrastructure so that we can diagnose and resolve such issues more quickly in the future.
Please follow our Status page for real-time updates on status changes and post-incident summaries. To find out more about what we’re working on, check out the GitHub Engineering Blog.