This guide contains instructions for troubleshooting high CPU usage and memory leaks in the game server using various diagnostics tools.
We'll go over how to interact with Metaplay-managed cloud environments, such as connecting to the Kubernetes cluster and retrieving files from the cloud. This page also covers .NET performance counters, which are a great starting point for finding possible problems. We'll also cover capturing and analyzing CPU profiles for the pods and troubleshooting memory leaks using tools like `dotnet-gcdump` and `dotnet-dump`.
To troubleshoot the pods running in the cloud, you'll need to interact with the Kubernetes cluster using the `kubectl` tool.
For a Metaplay-managed cloud, you can get a kubeconfig with the following commands:
```shell
# Login to Metaplay cloud
npx @metaplay/metaplay-auth login
# Get a kubeconfig against the specified project and environment
npx @metaplay/metaplay-auth get-kubeconfig <organization>-<project>-<environment> > my-kubeconfig
# Use the generated kubeconfig with kubectl
export KUBECONFIG=$(pwd)/my-kubeconfig
```
The generated `my-kubeconfig` uses the `metaplay-auth` login to authenticate to the cluster, so the authentication will be valid for as long as the `metaplay-auth` session lasts.
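To quickly verify that the generated kubeconfig and login work, you can check the active context and make a request against the cluster (standard `kubectl` commands, not Metaplay-specific):

```shell
# Show which context the generated kubeconfig points at.
kubectl config current-context

# Contacting the server for its version exercises the authentication.
kubectl version
```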
With self-hosted servers, follow the steps in the Helm deployment section of the Manually Deploying a Game Server to the Metaplay-Managed Cloud page for instructions on how to get a `kubeconfig`.
To list the server pods in an environment:
```shell
kubectl -n <namespace> get pods -l app=metaplay-server
# For example:
kubectl -n idler-develop get pods -l app=metaplay-server
```
If the `kubeconfig` was generated with `metaplay-auth` v1.3.0 or later, the namespace can be omitted, as the generated `kubeconfig` defaults to the target environment's namespace.
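For example, with such a kubeconfig the listing above simplifies to:

```shell
# The namespace comes from the kubeconfig's default, so -n can be omitted.
kubectl get pods -l app=metaplay-server
```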
You can run the following command to start a Kubernetes ephemeral diagnostics container against one of the server pods:
```shell
kubectl -n <namespace> debug <pod-name> -it --image metaplay/diagnostics --target shard-server
# For example:
kubectl -n idler-develop debug all-0 -it --image metaplay/diagnostics --target shard-server
```
This gives you a shell which can access the running server process. The `metaplay/diagnostics` image contains various diagnostics tools to help debug the server's CPU or memory consumption, including the .NET diagnostics tools, Linux perf, curl, and others.
The container automatically detects which user is running the target server process: `app` for chiseled base images and `root` for classic base images. A shell is opened for that user so that the .NET diagnostics tools work without further tricks. The shell starts in the `/tmp` directory, as all users can write files there.
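As a quick sanity check inside the ephemeral shell, you can confirm that the target server process is visible (a sketch; with `--target`, the server process appears as PID 1 in the shared process namespace):

```shell
# The target container's processes are visible through the shared
# process namespace; PID 1 should be the game server process.
cat /proc/1/cmdline; echo

# The .NET diagnostics tools should likewise see the server process.
dotnet-counters ps
```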
Note
Starting from Release 28, Metaplay uses the chiseled .NET base images, which are distroless and contain no shell, so an ephemeral container needs to be used. The distroless images are considered much safer as their attack surface is substantially smaller than that of a full OS image.
Run the following command to copy a file from a pod to your local machine:
```shell
kubectl -n <namespace> cp <pod-name>:<path-to-file> ./<filename>
# For example:
kubectl -n idler-develop cp all-0:/tmp/some-diagnostic-file ./some-diagnostic-file
```
Note
The file system on the containers is ephemeral and gets wiped out if the container is restarted. If you perform diagnostics operations that generate any files you'd like to keep, you should retrieve them immediately to your local machine to avoid accidentally losing them.
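For large files such as memory dumps, it can also help to compress them before copying them out (a sketch, assuming gzip is available in the diagnostics container; the dump name matches the example used later in this guide):

```shell
# Compress the dump in place, then copy out the much smaller archive.
gzip /tmp/dump_20240325_095609.dmp
kubectl -n idler-develop cp all-0:/tmp/dump_20240325_095609.dmp.gz ./dump_20240325_095609.dmp.gz
```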
The .NET performance counters are a good starting point for troubleshooting and give an overview of the health of the running pod.
View the performance counters with the following:
```shell
dotnet-counters monitor -p 1
```
Check whether the following counters are within reasonable limits; they are good overall health indicators:

- `% Time in GC (since last GC)` shouldn't exceed 3%.
- `Allocation Rate (Bytes / sec)` shouldn't exceed ~1kB / CCU.
- `Exceptions / sec` should be 0.
- `Monitor Lock Contention Count / sec` should be mostly < 10.

You can also take a look at the Dotnet Diagnostic Tools CLI Design page from the .NET Diagnostics repository for an overview of the .NET troubleshooting tools available.
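If you want a record of the counters instead of a live view, `dotnet-counters` can also write them to a file (a sketch; the output file name is arbitrary):

```shell
# Record the counters every 5 seconds into a CSV file for offline review.
dotnet-counters collect -p 1 --refresh-interval 5 --format csv -o counters.csv
```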
You can use the `dotnet-trace` command to collect a CPU profile from the game server. Here we're collecting over a 30-second interval:
```shell
dotnet-trace collect -p 1 --duration 00:00:00:30
```
This will output a file called `trace.nettrace`. To retrieve the file to your local machine, use the following:
```shell
kubectl -n <namespace> cp <pod-name>:<path>/trace.nettrace ./trace.nettrace
# For example:
kubectl -n idler-develop cp all-0:/tmp/trace.nettrace ./trace.nettrace
```
If you're using Visual Studio (recommended for Windows users), just drag the file into your IDE.
You can also use Speedscope. Run `dotnet-trace convert <tracefile> --format Speedscope` to convert the trace to Speedscope format, which generates a JSON file that you can open in https://www.speedscope.app/.
Note that by default, Speedscope only shows a single thread. You can switch threads from the top-center widget. There are a few different views available: the Left Heavy view is good for getting a picture of overall CPU usage, and the Time Order view is good for analyzing short-term spikes.
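For example, converting the trace retrieved above (the output name follows dotnet-trace's default `<input>.speedscope.json` pattern):

```shell
# Convert the .nettrace file into Speedscope's JSON format.
dotnet-trace convert ./trace.nettrace --format Speedscope
# Produces ./trace.speedscope.json, which opens in https://www.speedscope.app/
```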
Pro tip!
You can check out the Dotnet Docs if you want to dive a little deeper into the `dotnet-trace` tooling.
The recommended way to trace memory leaks is to start with a Load Testing run on your local machine and then take a memory dump using either `dotnet-gcdump` or `dotnet-dump`. The next section goes into detail about which tool to use and how to use them.

Danger!
Collecting a memory dump with either `dotnet-gcdump` or `dotnet-dump` can take a long time, and the process is completely frozen during the operation. This will pause the game for all the players, and if the heap is large enough (generally multiple gigabytes), the operation can take long enough for the Kubernetes health checks to consider the container unhealthy and restart it.
For a more in-depth guide, see the .NET Guide to Debugging Memory Leaks.
`dotnet-gcdump` vs `dotnet-dump`
In general, if you have access to a Windows machine, you should start with `dotnet-gcdump`: the dumps are smaller and can be analyzed with Visual Studio or PerfView (`gcdump` also supports macOS), whereas full dumps from `dotnet-dump` must be analyzed on the same OS where they were collected.
See .NET Diagnostics Tools: dump vs. gcdump for a more detailed comparison between the two and detailed instructions on using each.
`dotnet-gcdump`
In the cloud, the diagnostics docker image comes with the tool pre-installed. To install the tool locally, you can run the following command on your machine:
```shell
dotnet tool install -g dotnet-gcdump
```
First, collect a heap dump from the server process.
```shell
dotnet-gcdump collect -p 1
```
This will create a file named something like `20240325_095122_28876.gcdump`. To retrieve the file from the Kubernetes pod, run the following:
```shell
kubectl -n <namespace> cp <pod-name>:<path>/<filename> ./<filename>
# For example:
kubectl -n idler-develop cp all-0:/tmp/20240325_095122_28876.gcdump ./20240325_095122_28876.gcdump
```
To analyze the heap dump, you can drag the file into Visual Studio, or open it in PerfView.
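If you'd rather take a quick look without a GUI, `dotnet-gcdump` can also print a textual report from a collected dump (a sketch using the example file name above):

```shell
# Print per-type heap statistics directly from the .gcdump file.
dotnet-gcdump report ./20240325_095122_28876.gcdump
```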
`dotnet-dump`
The `dotnet-dump` tool can be used to collect and analyze full memory dumps of a running process. Analyzing the dump must happen on the same OS where the dump originated, so a Linux machine is required to analyze dumps from docker images running in Kubernetes clusters.
In the cloud, the docker images have the tool pre-installed. Alternatively, you can run the following command to install the tool locally on your machine:
```shell
dotnet tool install -g dotnet-dump
```
First, collect a heap dump from the server process.
```shell
dotnet-dump collect -p 1
```
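By default, `dotnet-dump` collects a full process dump. If the resulting file is unwieldy, the `--type` option can restrict the dump, for example to heap information (a smaller but still relatively comprehensive dump):

```shell
# Collect a heap-focused dump instead of a full process dump.
dotnet-dump collect -p 1 --type Heap
```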
This will create a file named something like `dump_20240325_095609.dmp`. To retrieve the file from the Kubernetes pod, run the following:
```shell
kubectl -n <namespace> cp <pod-name>:<path>/xxxxx.dmp ./xxxxx.dmp
# For example:
kubectl -n idler-develop cp all-0:/tmp/xxxxx.dmp ./xxxxx.dmp
```
To analyze the heap dump, you can use the interactive analyzer.
```shell
dotnet-dump analyze <dump-path>
```
Here are some useful commands as a starting point (a sketch of a typical session follows the list):

- `dumpheap -stat` shows an overview of what consumes memory.
- `dumpheap -mt <MT>` shows all entries of a given type (get the MT value using `dumpheap -stat`).
- `dumpheap -type System.Byte[] -min 1024` shows all byte arrays that are 1kB or larger.
- `dumpobj <address>` shows information about the given object at `<address>`.
- `gcroot <address>` finds the chain of references to a given object. It is useful for finding where a leaked reference is coming from.
These are some resources you can use to learn more about these tools and how to use them effectively:

The LLDB debugger can be used to dig deeper into the memory heap dumps. You can use it to dump aggregate amounts of memory used by types and trace the object graph to understand which objects are referenced by whom.
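As a rough sketch of getting started on a Linux machine (this assumes the `dotnet-sos` tool, which installs the SOS extension into LLDB, and uses the example dump name from above; the host binary path is hypothetical):

```shell
# One-time setup: install the SOS extension so LLDB understands .NET internals.
dotnet tool install -g dotnet-sos
dotnet-sos install

# Open the dump together with the matching host binary; SOS commands such as
# dumpheap -stat and gcroot are then available at the LLDB prompt.
lldb --core ./dump_20240325_095609.dmp /usr/share/dotnet/dotnet
```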
For more information, this slide presentation by Pavel Klimiankou has some useful insights about using PerfView, LTTng, and LLDB: https://www.slideshare.net/pashaklimenkov/troubleshooting-net-core-on-linux