WIP
This article is a work in progress.
Target Audience
This page is primarily intended for users of the Self Hosting (Private Cloud) tier package. If you are using any of the Metaplay SaaS plans, information from this page may not be directly relevant to your needs.
The main tools we use are Prometheus, Prometheus Alertmanager, and Grafana, complemented by AWS CloudWatch Alarms and SNS topics on the infrastructure side.
In addition to these, you may opt to integrate with your own paging setup, depending on which tooling you use (e.g. PagerDuty or Atlassian Opsgenie). These integrations are configured in relation to Prometheus Alertmanager.
Prometheus provides a fairly rudimentary web UI and most of the time it is less painful to access Prometheus data through another tool, e.g. Grafana. Sometimes, though, it may be worthwhile to access Prometheus directly.
You can access Prometheus on your Metaplay infrastructure stack via the URL https://admin-prometheus.[infrastructure domain]/ (for example, if we were running a development stack named dev.metaplay.io, Prometheus would be accessible via https://admin-prometheus.dev.metaplay.io/).
You can find Alertmanager alerts under the Alerts section of the UI.
A set of Prometheus Alertmanager alerting rules is put in place by default when you deploy a Metaplay infrastructure stack, and these rules are controlled as part of the Terraform deployment. Specifically, the rules are deployed by Terraform through the services component module, which in turn calls the metaplay-services Helm chart.
For additional Prometheus alerting and recording rules, you can use the prometheus_extra_alert_files and prometheus_extra_rule_files input parameters in the environments/aws-region and components/services Terraform modules. Please refer to the module README for details on formats and links to specific Prometheus documentation.
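For orientation, the module README links to the relevant Prometheus documentation for the exact file formats; a standard Prometheus alerting rule file looks roughly like the sketch below. The alert name, metric, and thresholds here are purely illustrative and should be replaced with ones that make sense for your setup:
groups:
  - name: custom-game-server-alerts
    rules:
      # Illustrative alert: fires if average container CPU usage in the
      # idler-develop namespace stays above 90% for 10 minutes.
      - alert: GameServerHighCpu
        expr: avg(rate(container_cpu_usage_seconds_total{namespace="idler-develop"}[5m])) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Game server CPU usage has been above 90% for 10 minutes"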
To test Prometheus alerts manually, you can trigger them through the Alertmanager API. As the Alertmanager API is not exposed to the public internet by default, the easiest approach is to use kubectl to port-forward the Alertmanager endpoint to your local machine, e.g.:
$ kubectl get svc -n metaplay-system
NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
...
metaplay-services-prometheus-alertmanager    ClusterIP   172.20.131.56   <none>        80/TCP    13d
...
$ kubectl port-forward -n metaplay-system svc/metaplay-services-prometheus-alertmanager 8080:80
Forwarding from 127.0.0.1:8080 -> 9093
Forwarding from [::1]:8080 -> 9093
In a separate terminal, we can then confirm that the forwarding is in place and test the API:
$ curl http://localhost:8080/api/v1/alerts
{"status":"success","data":{"alerts":[]}}
To trigger an alert manually, we can do the following:
$ curl -X POST http://localhost:8080/api/v1/alerts \
  -d '[{
    "status": "firing",
    "labels": {
      "alertname": "test",
      "service": "some-service",
      "severity": "warning",
      "instance": "foo.bar.example.org"
    },
    "annotations": {
      "summary": "Manually triggered test alert"
    },
    "generatorURL": "http://foo.bar.example.org"
  }]'
{"status":"success"}
$ curl http://localhost:8080/api/v1/alerts | jq .
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "test",
        "instance": "foo.bar.example.org",
        "service": "some-service",
        "severity": "warning"
      },
      "annotations": {
        "summary": "Manually triggered test alert"
      },
      "startsAt": "2020-05-20T08:59:45.042344144Z",
      "endsAt": "2020-05-20T09:04:45.042344144Z",
      "generatorURL": "http://foo.bar.example.org",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "default"
      ],
      "fingerprint": "4cbfe8d764da2f94"
    }
  ]
}
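The manually posted alert expires on its own once its endsAt timestamp passes (roughly five minutes in the output above). If you would rather clear it immediately, you can post the same alert again with endsAt set to the current time or earlier. A sketch, assuming the same port-forward is still running and using an illustrative timestamp:
$ curl -X POST http://localhost:8080/api/v1/alerts \
  -d '[{
    "labels": {
      "alertname": "test",
      "service": "some-service",
      "severity": "warning",
      "instance": "foo.bar.example.org"
    },
    "endsAt": "2020-05-20T09:00:00.000Z"
  }]'
{"status":"success"}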
TIP
If you are unsure of how this works, please refer to the Deploying Infrastructure chapter for an introduction to deploying and managing your infrastructure stack with Terraform. After completing that chapter, you can plug the additional configurations from this section into your own infrastructure deployment.
You can define Slack, PagerDuty and OpsGenie targets for any of the alerts originating from Prometheus Alertmanager. This means that any of the pre-packaged alerts or any custom alerts you create can be used to send messages via Slack or trigger incidents in paging services.
To leverage this functionality, please refer to each service's own documentation on how to set up the receiving side (e.g. Slack incoming webhooks, PagerDuty service keys, and OpsGenie API keys).
Once you have set up the targets, you can configure Alertmanager through Terraform to send alerts to any of these destinations. This can be done e.g. as follows:
module "services" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# ... rest of the infra-modules/environments/aws-region configurations
slack_targets = {
"slack-channel-target" = {
"webhook_url" = "webhook url"
"channel" = "channel" # channel name without #-prefix
"alert_level" = ["warning", "severe", "critical"]
"matchers" = [
"severity=critical"
]
"group_by" = ["alertgroup", "alertname"]
"continue" = "false"
"group_wait" = "10s"
"group_interval" = "5m"
"repeat_interval" = "3h"
"mute_time_intervals" = []
"active_time_intervals" = []
"routes" = []
}
}
pager_targets = {
"pagerduty-target" = {
"type" = "pagerduty"
"api_version" = "v2"
"alert_level" = ["critical"]
"pagerduty_service_key" = "service key"
"pagerduty_endpoint" = "pagerduty endpoint"
"matchers" = [
"severity=critical"
]
"group_by" = ["alertgroup", "alertname"]
"continue" = "false"
"group_wait" = "10s"
"group_interval" = "5m"
"repeat_interval" = "3h"
"mute_time_intervals" = []
"active_time_intervals" = []
"routes" = []
}
"opsgenie-target": {
"type": "opsgenie"
"alert_level": ["critical"]
"opsgenie_prometheus_api_key": "opsgenie api key"
"opsgenie_cloudwatch_endpoint": "opsgenie endpoint"
}
}
}
You can tweak the alert levels by providing a list of desired severity levels using the matchers field in each target's configuration, or use the alert_level parameter, which acts as a default for each target when no target-specific matchers are given.
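For instance, a target that should simply receive everything at the listed severities can rely on alert_level alone. A minimal sketch of such a fragment inside the services module configuration shown above, assuming the remaining target parameters fall back to the module's defaults (the channel name is hypothetical):
  slack_targets = {
    "oncall-channel" = {
      "webhook_url" = "webhook url"
      "channel"     = "oncall" # hypothetical channel name
      "alert_level" = ["severe", "critical"]
      # no "matchers" given, so alert_level acts as the filter for this target
    }
  }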
If you are operating multiple infrastructure stacks simultaneously, it may become inconvenient to track metrics in several separate locations. For these cases, we support sending metrics to a central location via the environments/aws-region module. Do note that you should then provide a functioning receiving side (e.g. your own, out-of-band Prometheus stack, VictoriaMetrics, or some other compliant system).
You can configure the metrics sending via the environments/aws-region module's central_monitoring_* parameters, as shown below. It is also possible to opt out of sending labels that may contain sensitive information, such as the AWS account ID, to the central platform via the central_monitoring_send_sensitive_data parameter.
Under the hood, this uses the Prometheus remote write mechanism. For more information, please refer to the Prometheus documentation.
module "infra" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# ... rest of the infra-modules/environments/aws-region configurations
central_monitoring_enabled = true
central_monitoring_url = "https://central.monitoring.endpoint"
central_monitoring_username = "your_username"
central_monitoring_password = "your_password"
central_monitoring_send_sensitive_data = false
}
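On the receiving side, any system that accepts the Prometheus remote write protocol will do. As a minimal sketch, assuming a plain Prometheus instance (version 2.33 or newer) is used as the central store, it only needs the remote write receiver enabled; the username and password configured above would typically be enforced by a reverse proxy sitting in front of it:
$ prometheus \
    --config.file=prometheus.yml \
    --web.enable-remote-write-receiver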
By default, the environments/aws-region module sets up three SNS topics for the three commonly used severity levels: info, severe, and critical. These topics are named [organization]-[environment]-alerts-[severity level]. If you provide any paging destinations through the pager_targets parameter, these targets will be configured to subscribe to the corresponding SNS topics based on each paging target's configured alert level.
These SNS topics are not used very frequently, as we prefer Prometheus-based alerting for its flexibility and customizability, which also allows alerting on game server internal metrics. However, subscribing to these SNS topics is important: their primary purpose is to alert on incidents where the Kubernetes cluster, and as a cascading result Prometheus, may be in an inoperable state. In these cases, a critical-level AWS SNS alert will almost always indicate very serious issues with the current infrastructure stack and/or game servers.
You can connect to these SNS topics programmatically via the outputs of the environments/aws-region module, for example:
module "infra" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# ... rest of the infra-modules/environments/aws-region configurations
}
resource "aws_sns_topic_subscription" "critical-email-alerts" {
topic_arn = module.infra.alert_critical_sns_arn
protocol = "email"
endpoint = "critical@metaplay.io"
}
endpoint_ping Tests
When using the environments/aws-region module to set up an infrastructure stack, an AWS Lambda function called endpoint_ping is set up by default. This function is seeded with access to the AWS Secrets Manager secrets for the game server deployments that the stack supports, and it uses these secrets to periodically establish connections to the game server through the public endpoint and verify that the game server is running appropriately.
The endpoint_ping function publishes the observed state of the game servers to AWS CloudWatch Metrics. These metrics can be ingested into Prometheus for observability, but we additionally set up per-deployment CloudWatch Alarms that trigger if a game server is detected to be down. These CloudWatch Alarms feed into an AWS SNS topic for critical alerts, and if you have configured any paging destinations through the pager_targets parameter, those destinations will be paged automatically via the AWS SNS topic if a game server becomes unresponsive.
If you have other sinks where you wish to receive the CloudWatch Alarms, you can provide them using the endpoint_ping_alarm_actions and endpoint_ping_ok_actions parameters for a given deployment. This allows you to, for example, direct the alarms to an existing AWS SNS topic that you may already have available and in use for other alerting:
resource "aws_sns_topic" "it-alarms" {
name = "it-alarms"
}
# have alarms be sent to an email address
resource "aws_sns_topic_subscription" "it-alarms-email" {
topic_arn = aws_sns_topic.it-alarms.arn
protocol = "email"
endpoint = "it-alarms@metaplay.io"
}
# ... and have alarms also be sent to an HTTPS endpoint
resource "aws_sns_topic_subscription" "it-alarms-https-endpoint" {
topic_arn = aws_sns_topic.it-alarms.arn
protocol = "https"
endpoint = "https://alarms.metaplay.io/"
}
# ... and also trigger a custom Lambda function
resource "aws_sns_topic_subscription" "it-alarms-lambda" {
topic_arn = aws_sns_topic.it-alarms.arn
protocol = "lambda"
endpoint = "arn:aws:lambda:eu-west-1:000011112222:function:my-alarm-function"
}
module "services" {
source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
# ... rest of the infra-modules/environments/aws-region configurations
deployments = {
idler-develop = {
endpoint_ping_alarm_enabled = true # you can opt out of alarms, if needed
endpoint_ping_alarm_actions = [aws_sns_topic.it-alarms.arn]
endpoint_ping_ok_actions = [aws_sns_topic.it-alarms.arn]
}
}
}