Target Audience
This page is primarily intended for users of the Self Hosting (Private Cloud) tier package. If you are using any of the Metaplay SaaS plans, information from this page may not be directly relevant to your needs.
The main tools we use are:
- Promtail for collecting logs from the Kubernetes nodes.
- Grafana Loki for aggregating the collected container logs.
- AWS CloudWatch Logs for aggregating logs produced by AWS-native services.
- Grafana for querying and visualizing the aggregated log data.
Logging is a broad subject, and in the context of running Metaplay-based game servers and supporting infrastructure we can roughly categorize the sources of log data as follows:
- Logs produced by AWS-managed services, such as the EKS control plane or Lambda functions.
- Logs produced by containers running on top of EKS, such as the game server and the other components of the Metaplay stack.
Each of these sources behaves slightly differently, and capturing and utilizing their logs is handled differently by default. For example, many AWS services can be configured to publish their logs to AWS CloudWatch Logs, where the logs can be persisted and consumed. As a practical example, AWS EKS control plane logs are published to CloudWatch Logs in this way. In practice, for many AWS services we not only do not need to influence how the logs are gathered, but often cannot do so to any meaningful degree.
On the other hand, most components of the Metaplay stack run as Docker containers on top of EKS. This means that container logs are generated and exist mainly in the context of the underlying Kubernetes node that runs those containers. In our case, these hosts are often AWS EC2 instances. We leverage Promtail, which runs on each host as a Kubernetes DaemonSet, to monitor the host's /var/log directory and capture all logs generated by containers on that host. Promtail can additionally decorate each gathered log row with metadata, such as which Kubernetes pod and which container within the pod generated the row. Collecting this metadata allows easier consumption of the logs further down the line.
In practice we prefer to collect Kubernetes logs in this fashion, as it means we do not need to customize log collection per service; instead, we can easily collect the logs of all containers, be they long-running services or ephemeral jobs.
Once logs are gathered, we by default aggregate them to two different locations:
- AWS CloudWatch Logs for logs produced by AWS-native services.
- Grafana Loki for container logs collected from the Kubernetes nodes via Promtail.
Functionally, log aggregation gives us a central location where the log data is collected, and solves questions such as how the log data is persisted and how it can be queried or otherwise interacted with.
As discussed above, CloudWatch Logs offers a logical location for us to aggregate AWS-native logging data to. We attempt to segregate services into different CloudWatch Logs log groups based on logical components (e.g. EKS logs go to a different log group than Lambda function logs), and within those log groups we attempt to split logs into log streams based on further logical groupings within the component (e.g. within the EKS log group, different log streams exist for the Kubernetes API server, authentication service, controller manager, etc.).
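To illustrate this organization, the sketch below pre-creates the EKS control plane log group with a retention period using the plain Terraform AWS provider. This is not something the Metaplay modules require you to do, and the cluster name is hypothetical; it merely shows the naming convention EKS uses when publishing control plane logs to CloudWatch Logs.
resource "aws_cloudwatch_log_group" "eks_control_plane" {
  # EKS publishes control plane logs to a log group named "/aws/eks/<cluster-name>/cluster";
  # "my-game-cluster" is a hypothetical cluster name.
  name = "/aws/eks/my-game-cluster/cluster"

  # How long to keep the control plane logs before they are expired.
  retention_in_days = 30
}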
For logs originating from Kubernetes nodes via Promtail, we by default use Grafana Labs' Loki as the aggregator. Our current design runs Loki in-cluster, using AWS S3 for persisting the log data. Unlike CloudWatch Logs, which is a managed AWS service, Loki requires more operational care, especially in large deployments; the later parts of this document outline various strategies for scaling Loki up as log volumes increase.
For end users of logs, whether you are a game server developer or an operations engineer looking after the infrastructure, our base design places Grafana as the central access point for not only metrics but also log data. You can think of Grafana as a lens into these different troves of data: it offers a unified location for querying logs, and even for creating dashboards based on log analysis.
Each log aggregator has a slightly different format for querying data: Loki is queried using LogQL, while AWS CloudWatch Logs is queried using CloudWatch Logs Insights queries. The official documentation for each is a good resource for getting acquainted with these query languages.
Assuming that you manage your infrastructure through the Metaplay-provided Terraform modules, configuring a basic logging setup is relatively straightforward. The environments/aws-region module README is a good place to start to understand which parameters are available to you. To get going, we can configure our environment as follows:
module "infra" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# Snip, your other configurations are here...
promtail_enabled = true
loki_enabled = true
loki_aws_storage_allow_destroy = true
s3_name_random_suffix_enabled = true
}
The above example explicitly enables both Promtail and Loki (they are enabled by default, so technically they would not need to be called out). Additionally, we allow the creation of an S3 bucket that Loki can persist log data in. This allows for easily storing larger amounts of log data than if the data was only stored ephemerally inside the cluster's own storage. We additionally set loki_aws_storage_allow_destroy to allow Terraform to clear the contents of the Loki S3 bucket when terraform destroy is run. This is convenient for development environments, but it should be set to false in production environments where you do not want to destroy historical log data, even accidentally. Finally, we explicitly enable s3_name_random_suffix_enabled to avoid the risk of S3 bucket name collisions, as bucket names must be globally unique across all AWS accounts.
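As a point of comparison, a production environment could keep the same setup but retain the Loki data on destroy. Below is a minimal sketch of that variant, assuming otherwise the same configuration as in the example above:
module "infra" {
  source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"

  # Snip, your other configurations are here...

  promtail_enabled              = true
  loki_enabled                  = true
  s3_name_random_suffix_enabled = true

  # Retain the contents of the Loki S3 bucket even if the environment itself is destroyed.
  loki_aws_storage_allow_destroy = false
}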
With this setup deployed, you can already access container logs via Loki by opening Grafana's Explore section and selecting Loki as the data source.
As the volume of your game grows, it is quite easy to hit the limits of the above basic Loki setup. Typically this manifests as Loki becoming unresponsive when queried, or as a significant increase in the general load of the cluster.
In these cases we can apply a number of different tactics to allow Loki to scale up:
1. Increase the number of main Loki replicas (loki_replicas).
2. Increase the number of Loki reader replicas (loki_reader_replicas).
3. Enable the Loki query frontend and scale its replicas (loki_queryfrontend_enabled, loki_queryfrontend_replicas).
4. Schedule the Loki components onto a dedicated shard node group (the shard_pool parameters).
5. Run the Loki components on AWS Fargate instead of EC2 nodes (fargate_profiles).
The above tactics 1-4 can be implemented via the Terraform module using the following configurations:
module "infra" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# Snip, your other configurations are here...
promtail_enabled = true
loki_enabled = true
loki_aws_storage_allow_destroy = true
# tactic 1
loki_replicas = 2
# tactic 2
loki_reader_replicas = 3
# tactic 3
loki_queryfrontend_enabled = true
loki_queryfrontend_replicas = 1
# tactic 4
loki_shard_pool = "logging"
loki_reader_shard_pool = "logging"
loki_queryfrontend_shard_pool = "logging"
shard_node_groups = {
# Snip, your pre-existing shard node group configurations, make sure that all pool spec keys exist...
logging = {
instance_type = ["c5.large"]
desired_size = 3 # initial node group size of 6 (= 3*length(logging.azs))
min_size = 1
max_size = 10
autoscaler = true
azs = ["eu-west-1a","eu-west-1b"]
}
}
}
Tactic 4 of isolating the Loki pods on a separate Kubernetes node group is especially wise in production environments to limit the blast radius and safeguard the operations of the rest of the cluster even if Loki gets overwhelmed. In the example above we have defined a new shard node group named logging, which is guaranteed to not run any pods other than the ones that explicitly request to be run in the group. Using the shard_pool parameters we specify which Loki components we want to schedule there. We also enable the autoscaler to manage the node group for us, which makes it easier in the future to simply adjust the replica numbers as needed, since the node group will dynamically adjust accordingly.
On a more technical level, Loki is deployed by the Terraform module using the metaplay-services Helm chart. The chart is designed to prefer scheduling Loki pods away from each other, if possible. This means that as long as a sufficient number of underlying nodes is available, Kubernetes will attempt to distribute the pods across them accordingly.
DANGER
This approach is still slightly experimental, so beware of dragons!
Tactic 5 is an alternative to tactic 4, achieving fundamentally the same thing but using the AWS EKS Fargate profile approach to let the Kubernetes scheduler run Loki pods on Fargate instead of on EC2 nodes. The benefit of this is that there is no need to manage the underlying node infrastructure. The downside is that Fargate-based pods are charged separately based on Fargate pricing and are typically more expensive than running pods on EC2 nodes.
That said, we can adjust the Loki configuration as presented below to target deployment onto Fargate:
module "infra" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# Snip, your other configurations are here...
promtail_enabled = true
loki_enabled = true
loki_aws_storage_allow_destroy = true
# tactic 1
loki_replicas = 2
# tactic 2
loki_reader_replicas = 3
# tactic 3
loki_queryfrontend_enabled = true
loki_queryfrontend_replicas = 1
# tactic 4
loki_shard_pool = "logging"
loki_reader_shard_pool = "logging"
loki_queryfrontend_shard_pool = "logging"
# tactic 5
fargate_profiles = {
loki = {
namespace = "metaplay-system"
labels = {
app = "loki"
release = "metaplay-services"
}
}
loki-reader = {
namespace = "metaplay-system"
labels = {
app = "loki-reader"
release = "metaplay-services"
}
}
}
}
What we specifically do here is tell AWS EKS that any pods which are in the configured namespaces and carry the corresponding labels should be scheduled onto the corresponding Fargate profiles. In our case, we know that our Loki pods have an app label of either loki or loki-reader, and a release label with the value metaplay-services. We additionally know that the pods will exist in the metaplay-system namespace. Putting all of these together allows us to target specifically these pods for Fargate execution.
Note that if you apply this configuration change to an already existing cluster, you may need to delete the individual Loki pods to get them rescheduled onto Fargate if they are currently running on EC2 nodes.