When scaling a system the size of GitHub, solving problems and staying ahead of them is a delicate process. The stack is complex and even small changes can have a big impact. Here’s a look at some of the tools in GitHub’s toolbox and how we’ve used them to solve problems. We’ll also share some of our successes and lessons learned along the way.
To keep up with our growing system, we use several tools. While we can’t list them all, here are a few that have been critical to our growth.
- When we process requests, we get a constant stream of related numbers that are of interest to us. For example, we might want to know how frequently events occur or how traffic volume compares to expected usage. We record metrics for each event in Datadog so that we can identify patterns over time and drill down into different dimensions to pinpoint the areas we need to focus on (a rough sketch of this kind of instrumentation follows this list).
- Events also contain context that can help identify details about issues we’re troubleshooting. We send all of that context to Splunk for further analysis.
- Much of our application data is stored in MySQL, and query performance can degrade over time due to factors such as database size and query frequency. We have written custom monitors that detect and report slow and timed-out queries for further investigation and remediation.
- When we introduce changes, we often need to know how those changes affect performance. We use Scientist to test proposed changes, measuring and reporting the results before making the changes permanent.
- When we are ready to release a change, we roll it out in stages to ensure that it works as expected for all use cases, and we need to be able to roll back if we see unexpected behavior. We use Flipper to limit the rollout first to early-access users and then to an increasing percentage of users as we build confidence.
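GitHub's internal instrumentation helpers aren't shown in this post; as a rough illustration of the metrics point above, recording a tagged event with the public dogstatsd-ruby client might look something like this (the metric and tag names are made up):

require "datadog/statsd"

statsd = Datadog::Statsd.new("localhost", 8125)

# Count an event and tag it so the metric can be broken down by dimension in Datadog.
statsd.increment("requests.events_processed", tags: ["event_type:push", "handler:web"])

# Record how long the work took so that latency trends are visible over time.
elapsed_ms = 42 # elapsed time in milliseconds, measured around the work being instrumented
statsd.timing("requests.processing_time_ms", elapsed_ms)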
Faster database queries
We recently observed a SQL query that was causing a high number of timeouts. Our investigation in Splunk traced the cause to GitHub’s command palette feature, which was loading a list of repositories. The code to generate this list looked something like this:
org_repo_ids = Repository.where(owner: org).pluck(:id)
suggested_repo_ids = Contribution.where(user: viewer, repository_id: org_repo_ids).pluck(:repository_id)
If an organization has many active repositories, the second line could generate a SQL query with a large IN (...) clause and an increased risk of timing out. Although we had seen this type of problem before, this particular use case was unique: we could potentially improve performance by querying the viewer's contributions first, since a given user contributes to a relatively small number of repositories.
contributor_repo_ids = Contribution.where(user: viewer).pluck(:repository_id)
suggested_repo_ids = Repository.where(owner: org, id: contributor_repo_ids).pluck(:id)
We created a Scientist experiment with a new candidate code block to evaluate performance. The Datadog dashboard for the experiment confirmed two things: the candidate code block produced the same results and improved performance by 80-90%.
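The experiment itself isn't shown in the post; a minimal sketch of what it might look like with the Scientist gem, assuming a hypothetical wrapper class around the two query shapes above (publishing timings to Datadog would additionally require a custom Scientist::Experiment subclass, not shown here):

require "scientist"

class SuggestedRepositoryIds
  include Scientist # provides the `science` helper

  def initialize(org, viewer)
    @org = org
    @viewer = viewer
  end

  def call
    science "command-palette-suggested-repos" do |experiment|
      # Control: the original queries, repositories first.
      experiment.use do
        org_repo_ids = Repository.where(owner: @org).pluck(:id)
        Contribution.where(user: @viewer, repository_id: org_repo_ids).pluck(:repository_id)
      end

      # Candidate: the proposed queries, contributions first.
      experiment.try do
        contributor_repo_ids = Contribution.where(user: @viewer).pluck(:repository_id)
        Repository.where(owner: @org, id: contributor_repo_ids).pluck(:id)
      end

      # Ordering isn't significant here, so compare the two results as sorted lists.
      experiment.compare { |control, candidate| control.sort == candidate.sort }
    end
  end
end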
We also took a closer look at the queries this feature generated and found some potential further improvements.
The first was to eliminate a SQL query and sort the results in the application instead of letting the SQL server do the sorting. We repeated the same process on a new experiment and found that the candidate code block performed 40-80% worse than the control group. We removed the candidate code block and stopped the experiment.
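The exact queries aren't shown in the post; a purely hypothetical sketch of the shape of that change, with the sort column and the already_loaded_repos collection invented for illustration:

# Control: a dedicated, sorted query, letting MySQL do the ORDER BY.
sorted_repos = Repository.where(owner: org, id: suggested_repo_ids).order(:name).to_a

# Candidate: skip the extra query and sort already-loaded records in application
# code instead; the experiment showed this approach was 40-80% slower.
sorted_repos = already_loaded_repos.sort_by(&:name)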
The second was a query that filtered the results based on the viewer's access level by iterating through the list of results. The access checks we needed could be done in batches, so we ran another experiment to do the filtering with a single batch query and confirmed that the candidate code block improved performance by another 20-80%.
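The access-check API isn't named in the post; as a purely hypothetical sketch of the iterate-versus-batch difference (viewer.can_read? and Authorization.readable_repository_ids are invented names):

# Before: one access check per result, issuing a query for each repository.
visible_repos = repos.select { |repo| viewer.can_read?(repo) }

# After: a single batched query for all candidate ids, then filtering in memory.
readable_ids = Authorization.readable_repository_ids(viewer, repos.map(&:id))
visible_repos = repos.select { |repo| readable_ids.include?(repo.id) }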
While completing these experiments, we looked for similar patterns in the associated code and found a similar filter that we could run in batches. We confirmed a 30-40% performance increase with one final experiment and left the feature in a better place, making our developers, DBAs, and users happier.
Removing unnecessary work
While our tools reveal problem areas to focus on, it’s better to get ahead of performance issues and fix problematic areas before they impact the user experience. We recently analyzed the most frequently used request endpoints for one of our teams and found room for improvement on one of them before it escalated to an urgent issue.
The data for each request to the GitHub Rails application is logged in Splunk and tagged with the associated controller and action. We started by querying Splunk for the top 10 controller/action pairs among the team's endpoints. We used this list to create a Datadog dashboard with a series of charts for each controller/action showing the total request volume, average and P99 request latency, and peak request latency. We found that the busiest endpoint on the dashboard was an action responsible for a simple redirect, and that its latency regularly spiked to the timeout threshold.
We needed to know what was slowing down these requests, so we dug into Datadog's APM feature to look at requests for the problematic controller/action. We sorted these requests by elapsed request time to see the slowest requests first and identified a pattern: slow requests were spending a lot of time performing an access check that wasn't necessary to send the redirect response.
Most requests to the GitHub Rails application generate HTML responses, where we need to be careful that all data in the response is accessible to the viewer. We can simplify the code by using shared Rails controller filters, executed before the server renders a response, to check that the viewer is allowed to see the resources they requested. These checks are not required for a redirect, so we wanted to confirm that we could serve these requests with a different set of filters and that doing so would improve performance.
Because Rails controller filters are configured when the application boots rather than when each individual request is processed, we couldn't use a Scientist experiment to test a candidate block of code. However, filters can be configured to run conditionally, so we were able to change the behavior with a Flipper feature flag. We identified the filters that weren't required for the redirect and configured the controller to skip those filters when the feature flag is enabled. Using the feature flag controls, we were able to gradually ramp up this behavior while monitoring both performance and request status in Datadog and watching for unexpected issues in Splunk.
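The controller, filter, and flag names below are invented, as is the current_user helper, and authorize_resource_access! stands in for a shared filter defined elsewhere; under those assumptions, a minimal sketch of a conditionally run filter gated by a Flipper flag might look like this:

class RepositoryRedirectsController < ApplicationController
  # Hypothetical access-check filter, run conditionally: when the Flipper flag
  # is enabled for the viewer, the check is skipped for this redirect-only action.
  before_action :authorize_resource_access!, unless: :skip_access_checks?

  def show
    redirect_to resolved_destination_url # no viewer-specific data is rendered
  end

  private

  def skip_access_checks?
    # The flag name is illustrative; checking it per request lets the behavior be
    # ramped up gradually and turned off instantly if anything looks wrong.
    Flipper.enabled?(:redirects_without_access_checks, current_user)
  end
end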
After seeing that the change improved P75/P99 request latency and, more importantly, reduced maximum latency, providing more consistent performance and reducing the likelihood of timeouts, we continued to evolve the feature and generalized the behavior so that other similar controllers could use it.
What did we learn?
We’ve learned a lot throughout this process, and here are some of the key points we’re keeping in mind.
- Investing in observability is absolutely worth it! The metrics and log information we tracked allowed us to quickly identify and resolve issues.
- Even if you’re solving a problem that has traditionally been difficult, the use case at hand may be slightly different and open the door to a new solution.
- When you’re working on a bug, look at the adjacent code; there may be related issues you can address there.
- Performance issues are a moving target; keeping an eye out for the next problem lets you fix it while it’s still just a slowdown, rather than once it’s causing timeouts and errors.
- Make small changes that you can control through a gradual rollout, and measure the results.