GitHub's Commitment to Reliability: Navigating Exponential Growth and Improving Availability

Introduction

We want to provide a transparent update on GitHub's availability following two recent incidents that fell short of our standards. We sincerely apologize for the disruption these caused to your work. In this article, we'll share details about what happened, the steps we've already taken, and our ongoing efforts to strengthen reliability for the long term.

GitHub's Commitment to Reliability: Navigating Exponential Growth and Improving Availability — Source: github.blog

The Challenge of Unprecedented Growth

The Acceleration of Agentic Development

In October 2025, we initiated a plan to increase GitHub's capacity by tenfold, aiming to dramatically improve reliability and failover processes. However, by February 2026, it became evident that we needed to prepare for a future requiring 30 times our current scale. The primary catalyst? A rapid shift in how software is built. Since December 2025, agentic development workflows—automated, AI-driven coding processes—have surged sharply. Nearly every metric confirms the trend: repository creation, pull request activity, API calls, automation, and large-repository workloads are all climbing at an exponential rate.

The Domino Effect on Systems

This exponential growth doesn't strain one system in isolation. A single pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses translate into database strain, indexes lag behind, retries amplify traffic, and one slow dependency can cascade across multiple product experiences. This interconnected nature demands a comprehensive approach.

Our Strategic Priorities

To address these challenges, we've set clear priorities: availability first, then capacity, then new features. Every decision is filtered through this lens. We are reducing unnecessary work, improving caching strategies, isolating critical services, removing single points of failure, and migrating performance-sensitive paths into systems engineered for modern workloads. This is the essence of distributed systems work: reducing hidden coupling, limiting blast radius, and ensuring GitHub degrades gracefully when any subsystem faces pressure. We're making rapid progress, but these recent incidents highlight where work remains.

Short-Term Improvements and Bottleneck Resolution

In the immediate term, we resolved several bottlenecks that appeared faster than anticipated. Key actions included:

Webhooks backend migration: Moving webhooks from MySQL to a more scalable system.
Session cache redesign: Overhauling the user session cache to reduce database churn.
Authentication and authorization flows: Re-architecting these to substantially cut database load.
Azure compute scaling: Leveraging our migration to Azure to rapidly provision additional compute resources and absorb growth.

Long-Term Infrastructure Overhaul

Isolating Critical Services

Next, we focused on isolating critical services like Git and GitHub Actions from other workloads. By minimizing single points of failure, we can contain the blast radius of any incident. This work began with careful dependency analysis and traffic tier mapping to understand what must be separated and how to shield legitimate traffic from attacks. We then addressed risks in order of severity.

Modernizing Code and Architecture

Simultaneously, we accelerated the migration of performance-sensitive and scale-sensitive code from our Ruby monolith into Go. This shift improves efficiency, reduces latency, and simplifies scaling. These changes are already paying dividends in terms of resource usage and response times.

Looking Ahead to Multi-Cloud

While we were already in the process of moving out of our smaller custom data centers to a public cloud, we have now started working toward a true multi-cloud strategy. This will provide additional redundancy, resilience, and flexibility—ensuring that a single provider outage doesn't impact your work. The transition is complex, but it's a critical part of building a more robust GitHub for the future.

Conclusion

We know that reliability is fundamental to your trust in GitHub. Every incident, while unacceptable, teaches us valuable lessons. We are committed to transparent communication and continuous improvement. Our teams are working around the clock to implement these changes, and we will continue to share updates on our progress. Thank you for your patience and understanding as we build a more resilient platform for the global developer community.

Tags: