This guide contains instructions for troubleshooting high CPU usage and memory leaks in the game server using various diagnostics tools.
We'll go over how to interact with Metaplay-managed cloud environments, such as connecting to the Kubernetes cluster and retrieving files from the cloud. This page also covers .NET Performance Counters, which are a great starting point for finding possible problems. We'll also cover capturing and analyzing CPU profiles from the pods and troubleshooting memory leaks using tools like dotnet-gcdump and dotnet-dump.
To troubleshoot the pods running in the cloud, you'll need to interact with the Kubernetes cluster using the kubectl tool. You can get a kubeconfig with the following commands:
# Login to Metaplay cloud
npx @metaplay/metaplay-auth@latest login
# Get a kubeconfig against the specified project and environment
npx @metaplay/metaplay-auth@latest get-kubeconfig <organization>-<project>-<environment> --output my-kubeconfig
# Use the generated kubeconfig with kubectl
export KUBECONFIG=$(pwd)/my-kubeconfig
# On Windows PowerShell
$env:KUBECONFIG="$(pwd)\my-kubeconfig"
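If you prefer not to export the KUBECONFIG environment variable, kubectl also accepts the kubeconfig as a flag on each invocation. A minimal sketch, assuming the file was generated as my-kubeconfig in the current directory:
# Pass the kubeconfig explicitly instead of exporting KUBECONFIG
kubectl --kubeconfig=./my-kubeconfig get pods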
The generated my-kubeconfig uses the metaplay-auth login to authenticate to the cluster, so the authentication will be valid for as long as the metaplay-auth session lasts.
This is how to list the server pods in an environment:
kubectl get pods -l app=metaplay-server
WARNING
If the kubeconfig was generated with metaplay-auth older than v1.3.0, kubectl requires an explicit namespace argument of the form kubectl -n <namespace> ..., for example kubectl -n idler-develop get pods. On later metaplay-auth versions the namespace can be omitted, as the generated kubeconfig defaults to the target environment's namespace.
You can run the following command to start a Kubernetes ephemeral diagnostics container against one of the server pods:
kubectl debug <pod-name> -it --profile=general --image metaplay/diagnostics:latest --target shard-server
# For example:
kubectl debug all-0 -it --profile=general --image metaplay/diagnostics:latest --target shard-server
This gives you a shell which can access the running server process. The metaplay/diagnostics image contains various diagnostics tools to help debug the server's CPU or memory consumption, including the .NET diagnostics tools, Linux perf, curl, and others.
The container automatically detects which user is running the target server process: app for chiseled base images and root for classic base images. A shell is opened for that user so that the .NET diagnostics tools work without further tricks. The shell starts in the /tmp directory, as all users can write files there.
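Before collecting anything, it can be worth confirming that the tools in the diagnostics image can actually see the target server process. A minimal check, assuming the image's .NET diagnostics tools are on the PATH:
# The game server process should show up in this list
dotnet-counters ps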
Note
Starting from Release 28, Metaplay uses the chiseled .NET base images, which are distroless and contain no shell, so an ephemeral container needs to be used. The distroless images are considered much safer as their attack surface is substantially smaller than that of a full OS image.
Compatibility Note
Depending on the infrastructure version, the target container name may be metaplay-server instead of shard-server. If kubectl debug gives an error about a missing target container, try again with --target metaplay-server.
Run the following command to copy a file from a pod to your local machine:
kubectl cp <pod-name>:<path-to-file> ./<filename>
# For example:
kubectl cp all-0:/tmp/some-diagnostic-file ./some-diagnostic-file
# To copy a file from the ephemeral container created by `kubectl debug`:
kubectl cp <pod-name>:<path-to-file> ./<filename> --container <debugger-container-id>
# For example:
kubectl cp all-0:/tmp/some-diagnostic-file ./some-diagnostic-file --container debugger-qqw2k
Note
The file system on the containers is ephemeral and gets wiped out if the container is restarted. If you perform diagnostics operations that generate any files you'd like to keep, you should retrieve them immediately to your local machine to avoid accidentally losing them.
Unreliable Error Messages
If the source file does not exist, kubectl cp does not always print an error message and instead silently completes. Due to the ephemerality of the source filesystem, always copy files to a new file name or delete the destination file first to avoid using data from an earlier kubectl cp call.
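To guard against this, remove the local destination first and verify the copy afterwards. A minimal sketch of that pattern, reusing the hypothetical file and container names from the examples above:
# Delete any stale local copy, then copy and confirm the file arrived
rm -f ./some-diagnostic-file
kubectl cp all-0:/tmp/some-diagnostic-file ./some-diagnostic-file --container debugger-qqw2k
ls -l ./some-diagnostic-file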
Taking heap dumps of the game server binary can take a long time, during which the game server is completely unresponsive. This causes the Kubernetes liveness probes to fail, which leads to the container getting killed by Kubernetes after enough probe failures (around 30 seconds by default).
The metaplay-gameserver Helm chart v0.6.1 introduces sdk.useHealthProbeProxy which, when enabled, causes the health probes to be sent via the entrypoint binary running in the container.
If sdk.useHealthProbeProxy is enabled, you can override the health probe proxy to always report success for the liveness probe with the following command, which prevents Kubernetes from killing the pod when the liveness probes fail during a heap dump operation:
curl localhost:8585/setOverride/healthz?mode=Success
The override is applied for 30 minutes, after which it returns to normal behavior, i.e., forwarding the health probes to the game server process. You can explicitly remove the proxy override with the following:
curl localhost:8585/setOverride/healthz?mode=Passthrough
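Putting it together, a typical heap dump session from the debug shell might look like the sketch below. This assumes the health probe proxy is enabled and reachable on localhost:8585 from the shell you're using, and uses a hypothetical server PID of 100:
# Make the liveness probe report success while the server is frozen by the dump
curl localhost:8585/setOverride/healthz?mode=Success
# Take the heap dump (the server is unresponsive for the duration)
dotnet-gcdump collect -p 100
# Restore normal probe forwarding once done
curl localhost:8585/setOverride/healthz?mode=Passthrough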
Debugging crashed servers can be tricky in Kubernetes since, by default, a container restart also wipes the contents of the file system. This means that crash dumps written to disk are lost when the container restarts after a crash.
Metaplay configures a volume mount to be mapped onto the game server containers at /diagnostics, and core dumps are written there in case of server crashes. These volumes share the lifecycle of the pod, as opposed to the container. Thus, the core dumps are retained over server restarts and can be retrieved for debugging purposes.
You can retrieve the core dumps with kubectl:
kubectl cp <pod-name>:/diagnostics/<file> ./core-dump
# For example:
kubectl cp all-0:/diagnostics/<file> ./core-dump
To analyze the core dump, you can use the interactive analyzer:
dotnet-dump analyze <dump-path>
See the .NET Guide to dotnet-dump on how to use dotnet-dump to analyze the core dump.
You can use the /diagnostics directory for your own purposes as well, if you need storage that is slightly more persistent than the regular container file system.
Note that when the pod is re-scheduled, either due to deploying a new version of the server or, for example, due to a failed EC2 node, the volumes are lost and any crash dumps along with them.
The .NET performance counters are a good starting point for troubleshooting and give a good overview of the health of the running pod.
View the performance counters with the following:
# Find the PID of the running server
dotnet-counters ps
# Monitor PID 100
dotnet-counters monitor -p 100
Some counters serve as good overall health indicators; check whether they stay within reasonable limits.
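If you want to focus on a handful of counters instead of the full default set, you can ask dotnet-counters for specific System.Runtime counters. A minimal sketch, assuming PID 100 and a selection of commonly useful counters:
# Monitor a focused set of runtime counters for PID 100
dotnet-counters monitor -p 100 --counters System.Runtime[cpu-usage,working-set,gc-heap-size,threadpool-queue-length]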
You can also take a look at the Dotnet Diagnostic Tools CLI Design page from the .NET Diagnostics repository for an overview of the .NET troubleshooting tools available.
You can use the dotnet-trace command to collect a CPU profile from the game server. Here we're collecting it over a 30-second interval.
# Find the PID of the running server
dotnet-trace ps
# Collect from PID 100
dotnet-trace collect -p 100 --duration 00:00:00:30
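To make the resulting file easy to locate when copying it out of the debug container, you can also pass an explicit output path (-o is a standard dotnet-trace collect option; the path below is just an illustrative choice):
# Collect a 30-second CPU profile to a known path
dotnet-trace collect -p 100 --duration 00:00:00:30 -o /tmp/trace.nettrace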
This will output a file called trace.nettrace. To retrieve the file to your local machine, use the following:
# Note: File is written to the debug container's filesystem
kubectl -n <namespace> cp <pod-name>:<path>/trace.nettrace ./trace.nettrace --container <debugger-container-id>
# For example:
kubectl -n idler-develop cp all-0:/tmp/trace.nettrace ./trace.nettrace --container debugger-qqw2k
If you're using Visual Studio (recommended for Windows users), just drag the file into your IDE.
You can also use Speedscope. Run dotnet-trace convert <tracefile> --format Speedscope to convert the trace to Speedscope format, which generates a JSON file that you can open in https://www.speedscope.app/.
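The conversion step looks like this, assuming the trace file collected earlier:
# Convert the trace into a Speedscope-compatible JSON file
dotnet-trace convert trace.nettrace --format Speedscope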
Note that by default, Speedscope only shows a single thread. You can switch threads from the top-center widget. There are a few different views available. Left Heavy view is good for getting overall CPU usage, and the Time Order view is good for analyzing short-term spikes.
Pro tip!
You can check out the Dotnet Docs if you want to dive a little deeper into the Dotnet Trace tools.
The recommended way for tracing memory leaks is to start with a Load Testing run on your local machine, capturing memory dumps with dotnet-gcdump or dotnet-dump. The next section goes into detail about which tool to use and how to use them.
Danger!
Collecting a memory dump with either dotnet-gcdump or dotnet-dump can take a long time, and the process is completely frozen during the operation. This will pause the game for all the players.
If the heap is large enough (generally multiple gigabytes), the operation can take long enough for the Kubernetes health checks to consider the container unhealthy and restart it. Please see Health Probe Overrides on how to override the health probes temporarily to prevent this from happening.
For a more in-depth guide, see the .NET Guide to Debugging Memory Leaks.
dotnet-gcdump vs. dotnet-dump
In general, if you have access to a Windows machine, you should start with dotnet-gcdump (only gcdump supports macOS). See .NET Diagnostics Tools: dump vs. gcdump for a more detailed comparison between the two and detailed instructions on using each.
dotnet-gcdump
In the cloud, the diagnostic docker image comes with the tool pre-installed. To install the tool locally, you can run the following command on your machine:
dotnet tool install -g dotnet-gcdump
First, collect a heap dump from the server process.
# Find the PID of the running server
dotnet-gcdump ps
# Use tool on PID 100
dotnet-gcdump collect -p 100
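If you prefer a predictable file name for the copy step below, you can pass an explicit output path (-o is a standard dotnet-gcdump collect option; the path below is just an illustrative choice):
# Write the heap dump to a known path in the debug container's filesystem
dotnet-gcdump collect -p 100 -o /tmp/server-heap.gcdump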
Without an explicit output path, this will create a file named something like 20240325_095122_28876.gcdump. To retrieve the file from the Kubernetes pod, run the following:
# Note: File is written to the debug container's filesystem
kubectl cp <pod-name>:<path>/<filename> ./<filename> --container <debugger-container-id>
# For example:
kubectl cp all-0:/tmp/20240325_095122_28876.gcdump ./20240325_095122_28876.gcdump --container debugger-qqw2k
To analyze the heap dump, you can drag the file into Visual Studio to open it or open the file in PerfView.
dotnet-dump
The dotnet-dump tool can be used to collect and analyze full memory dumps of a running process. Analyzing the dump must happen on the same OS where the dump originated, so a Linux machine is required to analyze dumps from docker images running in Kubernetes clusters.
In the cloud, the docker images have the tool pre-installed. Alternatively, you can run the following command to install the tool locally on your machine:
dotnet tool install -g dotnet-dump
First, collect a heap dump from the server process.
# Find the PID of the running server
dotnet-dump ps
# Use tool on PID 100
dotnet-dump collect -p 100
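Full dumps of a large server heap can be very big. If size is a concern, dotnet-dump collect also supports a smaller heap-only dump type and an explicit output path; the values below are illustrative:
# Collect a heap-only dump to a known path to keep the file smaller
dotnet-dump collect -p 100 --type Heap -o /tmp/server.dmp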
Without an explicit output path, this will create a file named something like dump_20240325_095609.dmp. To retrieve the file from the Kubernetes pod, run the following:
# Note: If the dump was collected from the debug container, the file is in its filesystem
kubectl cp <pod-name>:<path>/xxxxx.dmp ./xxxxx.dmp --container <debugger-container-id>
# For example:
kubectl cp all-0:/tmp/xxxxx.dmp ./xxxxx.dmp --container debugger-qqw2k
To analyze the memory dump, you can use the interactive analyzer:
dotnet-dump analyze <dump-path>
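For instance, a short hypothetical session inspecting large allocations might look like the sketch below; the individual commands are explained in the list that follows:
# Open the dump in the interactive analyzer (file name from the earlier example)
dotnet-dump analyze ./dump_20240325_095609.dmp
# Inside the interactive prompt:
dumpheap -stat
dumpheap -type System.Byte[] -min 1024
gcroot <address>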
Here are some useful commands for a good starting point:
dumpheap -stat shows an overview of what consumes memory.
dumpheap -mt <MT> shows all entries of a given type (get the MT value using dumpheap -stat).
dumpheap -type System.Byte[] -min 1024 shows all byte arrays that are 1 kB or larger.
dumpobj <address> shows information about the object at <address>.
gcroot <address> finds the chain of references to a given object. It is useful for finding where a leaked reference is coming from.
These are some resources you can use to learn more about these tools and how to use them effectively:
The LLDB debugger can be used to dig deeper into the memory heap dumps. You can use it to dump aggregate amounts of memory used by types and trace the object graph to understand which objects are referenced by whom.
Take a look at the following articles for more information:
This slide presentation by Pavel Klimiankou also has some useful insights about using PerfView, LTTng, and LLDB: https://www.slideshare.net/pashaklimenkov/troubleshooting-net-core-on-linux