GitHub Availability Report: July 2024

September 9, 2024

In July, we experienced four incidents that resulted in degraded performance across GitHub services.

July 5 16:31 UTC (duration 97 minutes)

On July 5, between 16:31 and 18:08 UTC, the Webhooks service experienced degraded performance, resulting in delayed webhook deliveries with an average delay of 24 minutes and a maximum of 71 minutes. The issue was triggered by a configuration change that removed authentication from Webhooks' background job requests, causing those requests to be rejected. Because Webhooks relies on this job infrastructure, external webhook delivery failed. Deliveries resumed after the configuration was fixed.
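
To make the failure mode concrete, here is a minimal sketch in Python of why stripping authentication from background-job requests causes deliveries to be rejected rather than enqueued. All names here are hypothetical; the report does not describe GitHub's internal APIs.

```python
# Hypothetical sketch: a background-job API that rejects unauthenticated requests.
# None of these names come from GitHub's report; they are illustrative only.
from dataclasses import dataclass

SERVICE_TOKEN = "webhooks-service-token"  # the credential the config change effectively dropped

@dataclass
class JobRequest:
    queue: str
    payload: dict
    auth_token: str | None = None

def enqueue(job: JobRequest) -> str:
    """Stand-in for the background-job API: rejects unauthenticated requests."""
    if job.auth_token != SERVICE_TOKEN:
        return "401 Unauthorized"      # deliveries pile up instead of being enqueued
    return "202 Accepted"

# Before the config change: token attached, job accepted.
print(enqueue(JobRequest("webhook_deliveries", {"url": "https://example.test"}, SERVICE_TOKEN)))
# After the config change: token missing, every delivery job is rejected.
print(enqueue(JobRequest("webhook_deliveries", {"url": "https://example.test"})))
```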

After the initial fix, a secondary issue between 18:21 and 21:14 UTC caused further delays in GitHub Actions execution on pull requests: failing health checks in the background job processing service sent the background jobs API layer into a crash loop and reduced its capacity. The reduced capacity delayed job submission by an average of 45 seconds, with a maximum of 1 minute and 54 seconds. This was resolved by redeploying the affected service.

To improve incident detection, we have updated our dashboards, improved our health checks, and introduced new alerts for similar issues. We are also focused on minimizing the impact of such incidents in the future through better workload isolation.
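
As an illustration of the kind of alerting described above, the sketch below fires when the average webhook delivery lag crosses a threshold. The function name and the 5-minute threshold are assumptions chosen for illustration, not GitHub's actual tooling.

```python
# A minimal alerting sketch on webhook delivery lag (seconds per delivery).
from statistics import mean

DELAY_ALERT_THRESHOLD_S = 5 * 60   # assumed threshold: page if average lag exceeds 5 minutes

def should_alert(delivery_lags_s: list[float]) -> bool:
    """Fire an alert when the average webhook delivery lag crosses the threshold."""
    return bool(delivery_lags_s) and mean(delivery_lags_s) > DELAY_ALERT_THRESHOLD_S

# During the July 5 incident the average lag was about 24 minutes, well over a 5-minute threshold.
print(should_alert([24 * 60.0] * 10))   # True -> page the on-call
```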

July 13 00:01 UTC (duration 19 hours and 26 minutes)

On July 13, we experienced degraded performance in the GitHub Copilot service between 00:01 and 19:27 UTC. During this time, the error rate for Copilot code completions reached 1.16% and the error rate for GitHub Copilot Chat peaked at 63%. We rerouted Copilot Chat traffic between 01:00 and 02:00 UTC, reducing the Copilot Chat error rate to below 6%; the error rate for code completions generally remained below 1%. Customers may have experienced delays, errors, or timeouts in Copilot completions and Copilot Chat during this time. Code scanning autofix discarded suggested fixes between 00:01 and 12:38 UTC, and suggested fixes were delayed but eventually completed between 12:38 and 21:38 UTC.

We determined that the issue was due to a resource cleanup job run by a partner service on July 13 that mistakenly targeted a resource group containing important resources, resulting in their removal. The job was stopped in time to preserve some resources, allowing GitHub to mitigate the impact while the resources were restored.

We are working with partner services to develop safeguards against future incidents and are improving our traffic diversion processes to speed up future mitigation efforts.
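
One way such a safeguard could look, sketched in Python with hypothetical group names (the report does not say which mechanism GitHub and its partner are adopting): refuse to act on resource groups explicitly marked as protected.

```python
# Hedged sketch of a cleanup safeguard: never delete resources from a protected group.
PROTECTED_GROUPS = {"prod-copilot-core"}   # assumed deny list, not a real GitHub value

def plan_cleanup(resource_groups: dict[str, list[str]], target: str) -> list[str]:
    """Return the resources a cleanup job would delete, refusing protected groups."""
    if target in PROTECTED_GROUPS:
        raise PermissionError(f"refusing to clean up protected group {target!r}")
    return resource_groups.get(target, [])

groups = {
    "scratch-loadtest": ["vm-1", "vm-2"],
    "prod-copilot-core": ["inference-endpoint", "token-cache"],
}
print(plan_cleanup(groups, "scratch-loadtest"))      # fine: ['vm-1', 'vm-2']
try:
    plan_cleanup(groups, "prod-copilot-core")        # blocked instead of deleting critical resources
except PermissionError as err:
    print(err)
```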

July 16 00:53 UTC (duration 149 minutes)

On July 16, between 00:30 and 03:07 UTC, Copilot Chat was degraded and rejected nearly all requests. The failure rate was close to 100% during this period, and customers received error messages when attempting to use Copilot Chat.

The incident was triggered during routine maintenance by a service provider: GitHub services were disconnected from a dependency, and that dependent service became overloaded when they reconnected.

To mitigate the issue in the future, we are working to improve our reconnection and disconnection logic for dependent services to enable seamless recovery from events of this nature without overloading the other service.
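
The report does not specify the exact reconnection scheme, but exponential backoff with jitter is one common way to avoid overloading a dependency when many clients reconnect at once after maintenance. A minimal sketch:

```python
# Reconnection with exponential backoff and full jitter (illustrative, not GitHub's actual logic).
import random
import time

def reconnect(connect, max_attempts: int = 6, base_delay_s: float = 1.0, cap_s: float = 30.0) -> bool:
    """Retry `connect()` with exponentially growing, jittered delays between attempts."""
    for attempt in range(max_attempts):
        if connect():
            return True
        # Full jitter: sleep a random amount up to the exponential cap, so clients
        # spread their reconnects instead of stampeding the dependency together.
        delay = random.uniform(0, min(cap_s, base_delay_s * 2 ** attempt))
        time.sleep(delay)
    return False

# Toy "connection" that succeeds on the third attempt.
attempts = iter([False, False, True])
print(reconnect(lambda: next(attempts), base_delay_s=0.1))   # True
```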

July 18 22:47 UTC (duration 231 minutes)

Starting on July 18, 2024 at 22:38 UTC, network issues at an upstream provider impacted the performance of the Actions, Copilot, and GitHub Pages services. During this time, up to 50% of Actions workflow jobs were stuck in the queued state, including Pages deployments. Users were also unable to activate Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. The resource is configured for geo-replication, but the replication configuration prevented resiliency when a region was unavailable. Updating the replication configuration mitigated the impact by allowing requests to succeed while a region was unavailable. By July 19 at 00:12 UTC, users saw some improvement in Actions jobs and a full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy at 02:10 UTC, and large hosted runners were fully recovered at 02:38 UTC.

Copilot requests were also affected: up to 2% of Copilot Chat requests and 0.5% of Copilot completion requests resulted in errors. Copilot Chat requests were redirected to other regions after 20 minutes, while redirecting Copilot completion requests took 45 minutes.

To mitigate these issues in the future, we are improving our replication and failover workflows to better handle such situations, reduce the time required for recovery, and minimize the impact to customers.
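
For illustration, the sketch below shows the failover behavior an updated replication configuration enables: try the primary region first and fall back to replicas when it is unreachable. The region names and the fetch function are assumptions for this example, not GitHub's actual setup.

```python
# Hedged sketch of region failover for a geo-replicated resource.
class RegionUnavailable(Exception):
    pass

def read_with_failover(regions: list[str], fetch) -> str:
    """Attempt each region in order; fail only if every replica is unreachable."""
    last_err: Exception | None = None
    for region in regions:
        try:
            return fetch(region)
        except RegionUnavailable as err:
            last_err = err                 # record and move on to the next replica
    raise RuntimeError("all regions unavailable") from last_err

def fetch(region: str) -> str:
    if region == "centralus":              # simulate the unreachable Central US backend
        raise RegionUnavailable(region)
    return f"ok from {region}"

print(read_with_failover(["centralus", "eastus2", "westus3"], fetch))   # "ok from eastus2"
```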


Please refer to our Status page for real-time updates on status changes and incident summaries. To learn more about our current work, visit the GitHub Engineering Blog.
