Stories From the War-Room
So we are now emerging from the first 24 hours of our second Alpha test. As was to be expected, many lessons were learned and some pride was swallowed, but we also got a few pleasant surprises.
One of the big objectives of this Alpha is to test our whole tech stack under pressure and at increasing scale with real players, because no amount of automated testing and simulation can expose all the critical events that can occur in different parts of our stack, or the kind of domino effect they can have on the whole service.
From a LiveOps point of view, the whole Alpha test is like firing a huge rocket brimming with fuel into space, hoping it reaches orbit and preferably returns to Earth more or less intact, ready for another flight. The whole thing is rigged with all the telemetry we can fit, so that we can get a glimpse of what is happening in real time and address issues, preferably before they lead to catastrophic failures.
Our stack is quite complex, consisting of a variety of servers running a multitude of services across multiple geographical regions. For this test, we are running around 150 virtual servers distributed between the EU and the US. About two-thirds of these are Unreal servers, each running a specific zone in a specific world and started automatically as players enter that zone. The remainder are backend servers acting as API endpoints for various game services, including inventory, avatar, groups, building, and authentication. Behind all of these sit a few large databases that handle all persistence: customers, their characters, and the associated persistent state.
All of these servers and systems, as well as selected samples of clients, generate telemetry that is gathered in an observability platform called Datadog as well as in our analytics stack, which allows us to visualize in real time the state of all components and also generate alerts automatically when certain critical conditions are observed.
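To give a flavour of what "generate alerts automatically when certain critical conditions are observed" looks like in practice, here is a minimal sketch using Datadog's Python API client. The metric name, tags, thresholds, and notification handle are illustrative assumptions for this post, not our actual monitor definitions.

```python
# Sketch: defining a metric alert via the Datadog API client (datadogpy).
# Metric, tags, and thresholds below are invented for illustration.
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

api.Monitor.create(
    type="metric alert",
    # Alert when the (hypothetical) cache node averages >90% CPU over 5 minutes.
    query="avg(last_5m):avg:aws.elasticache.cpuutilization{env:alpha,region:eu} > 90",
    name="[Alpha] Cache CPU approaching saturation",
    message="Cache CPU above 90% for 5 minutes. @liveops-oncall",
    options={"thresholds": {"warning": 75, "critical": 90}},
)
```

Monitors like this page the on-call crew well before a node actually saturates, which is exactly what happened in the first story below.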
During launch events such as this one, we assemble a LiveOps crew that monitors Datadog as well as our social media channels, working in shifts around the clock. This crew consists of our most seasoned site reliability engineers, software engineers, and community managers. The team is composed so that it can troubleshoot critical conditions in a live environment, find an acceptable solution, test and deploy that solution, and then verify that it has indeed resolved the issue, all while keeping the community updated as much as possible and making sure that whatever is done affects as few players as possible. The crew convenes in the so-called War Room, a video conferencing channel where issues are discussed and resolved collectively. Think of it as Apollo Flight Control (or at least that is what they like to compare themselves to).
Here are a few stories from the War Room.
On this fine Tuesday morning, we lit our engines at 11 UTC and saw an immediate influx of players across all worlds and zones. All systems were behaving normally. During such ramp-ups, the systems to keep an eye on are the authentication servers and, subsequently, the various zones that are launched on demand as players connect to different parts of the worlds. Even though this ramp-up can be quite rapid, it is still relatively organic, as players are joining in a random fashion.
As we were reaching a population of about 5K players around 12:30 UTC, we started to get our first warnings. A Redis server was coming dangerously close to 100% CPU. Redis servers are used for caching and fast persistence of volatile state from Unreal servers across multiple worlds in the same AWS region. They are super efficient and not usually something that buckles easily under load, but you still need an educated initial guess about how much CPU capacity they will need. We are usually able to form such guesses by simulating traffic on various endpoints, but the Redis server did not fall under those types of tests; instead we relied on limited stress tests using internal players and special test bots. These tests never came close to simulating the current load, and it was clear the server needed to be switched over to a more performant node. Usually, this is something you can do live and transparently, but in this case the service had not been configured to run with a readily available replica (so-called Multi-AZ mode). This meant that it needed to be taken down and restarted. Here we were approaching unknown territory, because it wasn't clear whether all systems would handle that disruption gracefully and reconnect transparently to a new Redis instance without an orderly shutdown and restart. Since the server would eventually have failed anyway, we didn't have much of a choice, so we pulled the trigger on updating the Redis instances to more performant nodes. It quickly became evident that proper reconnection did not happen, and we started getting errors and abnormal behaviour from the parts of the stack that relied on the Redis server.
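For readers curious what "reconnect transparently" means on the client side, here is a minimal sketch of the kind of retry/reconnect configuration we would have wanted in place before swapping the node, using the redis-py client. The hostname and parameter values are illustrative assumptions; our real services use their own connection handling.

```python
# Sketch: a Redis client configured to ride out a node replacement by
# retrying transient connection failures with exponential backoff.
import redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry
from redis.exceptions import ConnectionError, TimeoutError

client = redis.Redis(
    host="volatile-state-cache.example.internal",  # hypothetical endpoint
    port=6379,
    socket_timeout=2,
    socket_connect_timeout=2,
    # Retry transient failures instead of surfacing errors the moment
    # the old node disappears.
    retry=Retry(ExponentialBackoff(cap=10, base=0.5), retries=5),
    retry_on_error=[ConnectionError, TimeoutError],
    health_check_interval=30,
)

# Callers still treat the cache as best-effort volatile state.
client.set("zone:eu-01:population", 250)
print(client.get("zone:eu-01:population"))
```

A client configured this way will pause and retry while DNS or the load balancer points it at the new node, rather than failing every in-flight request at once.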
Parallel to this, we had been keeping an eye on the memory consumption of a few Unreal zone servers that were trending above the norm, resulting in poor performance and lag for players connected to those nodes. Some investigation, by actually connecting to these suspicious nodes, revealed the culprit. Rabbits. Rabbits everywhere. It's always rabbits, isn't it? A resource distribution rule meant to control the rabbit population was misbehaving in certain biomes, resulting in a proliferation of rabbits. Even though a rabbit seems innocent enough, it counts as a full creature in our NPC count. When a server is at full player capacity, NPCs need to be kept under control, as they compete for the same server CPU for things like collision detection and pathfinding. Updating the resource distribution to kill off these rabbits was easy enough, but it required a reboot of the affected zones. As we didn't have a clear view of which nodes might be suffering from rabbit infestations, were already fighting another disturbance due to the Redis restart, and needed to apply the same changes to different regions of the world, we decided to put the whole game into maintenance mode so that we could reboot everything properly.
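To illustrate the kind of guard rail a resource distribution rule can use, here is a toy sketch (in Python, for brevity) of capping spawns against a per-zone NPC budget so that a misbehaving rule cannot flood a server. All names and numbers are invented for this post; the real rules live inside our Unreal servers.

```python
# Sketch: cap new spawns by both a per-rule hard limit and the zone's
# remaining NPC budget, so one rule can't monopolise server CPU.
from dataclasses import dataclass

@dataclass
class SpawnRule:
    creature: str
    target_density: float   # desired creatures per square km
    hard_cap: int            # absolute ceiling per zone, regardless of density

def allowed_spawns(rule: SpawnRule, zone_area_km2: float,
                   current_count: int, zone_npc_budget_left: int) -> int:
    """How many new creatures this rule may spawn this tick."""
    desired = int(rule.target_density * zone_area_km2)
    capped_desired = min(desired, rule.hard_cap)
    missing = max(0, capped_desired - current_count)
    # Never exceed what the zone as a whole has left in its NPC budget.
    return min(missing, zone_npc_budget_left)

rabbits = SpawnRule(creature="rabbit", target_density=12.0, hard_cap=300)
print(allowed_spawns(rabbits, zone_area_km2=64.0,
                     current_count=950, zone_npc_budget_left=40))  # -> 0
```

With a ceiling like this in place, an over-eager density rule degrades into "slightly fewer rabbits than intended" instead of "a server drowning in rabbits".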
Maintenance mode essentially cuts access to all players, so it immediately brought the service down, allowing us to reboot the world and finalize some of the changes we wanted to apply. We want to avoid this as much as possible because it is 100% disruptive to our player experience. Still, there is also another challenge associated with it: going out of maintenance mode.
The problem here is that going into maintenance mode with many thousands of players already connected means that when you reopen the service shortly after, there is a crowd of thousands of players reconnecting at nearly the same instant. This is very different from the organic growth pattern of connections across a typical day; it is more like a tsunami of simultaneous connections within a few minutes. Most of our stack has built-in protection for that in terms of auto-scaling and load balancing, but there are still some bottlenecks that can quickly become problematic in these scenarios. These bottlenecks can then have a domino effect on other systems that are waiting for connections and might eventually time out, creating a perfect storm. Usually this all irons itself out in the end, though the player experience during that phase might be less than optimal. Our goal for the future is to improve our maintenance mode to include some form of throttling of incoming connections, giving us an orderly ramp-up; for now, it is instructive to see which services become bottlenecks, and it helps us identify where we might want different configurations to handle these surges better.
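The throttling we have in mind is conceptually simple. Here is a minimal sketch of one way to do it, a token-bucket queue in front of the login flow; the rate, burst size, and structure are illustrative assumptions rather than the shape of our real login service.

```python
# Sketch: admit reconnecting players from a queue at a bounded rate
# instead of all at once after maintenance ends.
import time
from collections import deque

class LoginThrottle:
    def __init__(self, logins_per_second: float, burst: int):
        self.rate = logins_per_second
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()
        self.queue: deque[str] = deque()

    def enqueue(self, player_id: str) -> None:
        self.queue.append(player_id)

    def admit(self) -> list[str]:
        """Call periodically; returns the players allowed to proceed now."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        admitted = []
        while self.queue and self.tokens >= 1.0:
            admitted.append(self.queue.popleft())
            self.tokens -= 1.0
        return admitted

throttle = LoginThrottle(logins_per_second=50, burst=200)
for i in range(5000):
    throttle.enqueue(f"player-{i}")
print(len(throttle.admit()))  # the first tick admits at most the burst size (200)
```

The effect is to turn the post-maintenance tsunami back into something closer to the organic ramp-up the rest of the stack is designed for.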
After about 20 minutes of downtime, we were back in the saddle, with player numbers ramping up steadily as US customers started to join the game alongside the EU players. At around 17:30 UTC, we had reached what we anticipated to be our peak connected players for the day.
At this point, we started observing some turbulence in the force, and in this case it appeared to be related to database congestion. It seemed to affect mostly one region, but because some of our DB queries straddle multiple DBs in multiple regions, the effect could be felt everywhere. Our DBs seemed to be properly provisioned and shouldn't have been going anywhere near 100% CPU even under this high load, yet something was causing one DB to grind to a halt. We finally narrowed it down to a single rogue DB connection that was hanging and keeping a critical table locked, causing any queries using that table to block. This again had a domino effect, as the backend requests that initiated those queries eventually timed out, leading to either retries or failures of specific player activities. It was a good example of a single loose screw having radiating effects across the whole stack and across regions. Once we released the lock on the table, most services recovered gracefully, and we quickly came back to prior player levels.
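For the curious, here is a sketch of how one might hunt down that kind of rogue, lock-holding connection, written against a PostgreSQL-style database with psycopg2. The post does not say which database engine we actually run, so treat the connection string, views, and functions here purely as an illustration of the technique.

```python
# Sketch: find sessions that are blocking others, then (once confirmed)
# terminate the offender to release its locks. PostgreSQL-flavoured example.
import psycopg2

conn = psycopg2.connect("dbname=game host=db.example.internal")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        SELECT blocked.pid   AS blocked_pid,
               blocker.pid   AS blocking_pid,
               blocker.state,
               blocker.query AS blocking_query
        FROM pg_stat_activity blocked
        JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
        JOIN pg_stat_activity blocker ON blocker.pid = b.pid;
    """)
    for blocked_pid, blocking_pid, state, query in cur.fetchall():
        print(f"{blocking_pid} ({state}) is blocking {blocked_pid}: {query[:80]}")

    # Once the offender is confirmed to be safe to kill:
    # cur.execute("SELECT pg_terminate_backend(%s);", (blocking_pid,))
```

Releasing the lock is the easy part; the interesting work is confirming which session is the true root of the chain before terminating anything in a live environment.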
During the rest of the night, we had a few alerts that needed to be dealt with, but fortunately, these were all issues we could fix live without disrupting the whole service and without needing to wake up more people.
The lesson learned is that once you manage to find the root cause of a problem, the cure is usually relatively obvious, and you might be slightly embarrassed about not having caught that specific problem before going live. The truth, however, is that the kinds of issues that come up, and the combinations of different systems behaving differently under real player traffic, are difficult to predict in their totality, and it can be challenging to discern the real cause because of the interdependencies. At the same time, there is a certain thrill to the hunt and a lot of gratification once a resolution is reached.
And this is why we are running a test like this Alpha, and why we are so grateful to the community for participating in an event like this, helping us discover these hard-to-find issues and build more robust systems to handle them in the future. Once again, we want to extend a sincere thank you to our whole community for being so supportive and passionate about Pax Dei.