WIP
This article is a work in progress.
Target Audience
This page is primarily intended for users on Private Cloud plans. If you are using Metaplay Cloud with a Pre-launch/Production plan, information from this page may not be directly relevant to your needs.
The main tools we use are:
In addition to this, you may opt to integrate your own paging setup, depending on which tooling you use (e.g., PagerDuty or Atlassian Opsgenie). These integrations would be configured in relation to Prometheus Alertmanager.
Prometheus provides a fairly rudimentary web UI, and most of the time it is less painful to access Prometheus data through another tool, e.g., Grafana. Sometimes, though, it may be worthwhile to access Prometheus directly.
You can access Prometheus on your Metaplay infrastructure stack via the URL https://admin-prometheus.[infrastructure domain]/ (for example, if we were running a development stack named dev.metaplay.io, Prometheus would be accessible via https://admin-prometheus.dev.metaplay.io/).
You can find Alertmanager alerts under the Alerts section of the UI.
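As a rough illustration of the URL scheme described above, the following sketch derives the Prometheus URL from an infrastructure domain and wraps a query against the standard Prometheus HTTP API (`/api/v1/query`). The function names `prometheus_url` and `prometheus_query` are hypothetical helpers, not part of Metaplay tooling, and the sketch assumes that any authentication your stack requires in front of the admin endpoints is handled separately.

```shell
#!/bin/sh
# Hypothetical helper: build the Prometheus URL for an infrastructure domain,
# following the admin-prometheus.[infrastructure domain] pattern.
prometheus_url() {
  printf 'https://admin-prometheus.%s/\n' "$1"
}

# Hypothetical helper: run an instant query through the Prometheus HTTP API.
# Assumption: the endpoint is reachable and accepts the request as-is;
# add whatever authentication your stack requires.
prometheus_query() {
  curl -sS -G "$(prometheus_url "$1")api/v1/query" --data-urlencode "query=$2"
}

# Print the URL for the example stack; this does not contact Prometheus.
prometheus_url dev.metaplay.io
# prints: https://admin-prometheus.dev.metaplay.io/
```

With access in place, `prometheus_query dev.metaplay.io up` would return the scrape health of each target as JSON.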
A default set of Prometheus alerting rules is put in place when you deploy a Metaplay infrastructure stack and is managed as part of the Terraform deployment. Specifically, Terraform deploys the rules through the services component module, which in turn calls the metaplay-services Helm chart.
For additional Prometheus alerting and recording rules, you can use the prometheus_extra_alert_files and prometheus_extra_rule_files input parameters in the environments/aws-region and components/services Terraform modules. Please refer to the module README for details on formats and links to specific Prometheus documentation.
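To give a sense of what an extra alerting rule can look like, here is a minimal sketch that writes a rule file in the standard Prometheus rule format and validates it with `promtool` when that tool is installed. The file name `extra-alerts.yaml`, the group name, and the `TargetDown` alert are illustrative assumptions. How the file is passed to prometheus_extra_alert_files is described in the module README and is not shown here.

```shell
#!/bin/sh
# Hypothetical file name; adjust to wherever your Terraform setup expects it.
RULE_FILE="${RULE_FILE:-extra-alerts.yaml}"

# Write a single alerting rule in the standard Prometheus rule file format.
# "up" is the built-in metric that Prometheus records for every scrape target.
cat > "$RULE_FILE" <<'EOF'
groups:
  - name: extra-alerts
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} has been down for 5 minutes"
EOF

# Validate the file if promtool is available; otherwise skip validation.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULE_FILE"
else
  echo "promtool not found; skipping validation of $RULE_FILE"
fi
```

The severity label is what the alert routing matches on, so it is worth keeping its values aligned with the severity levels your targets are configured for.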
To test alerts manually, you can trigger them through the Alertmanager API. As this API is, by default, not exposed to the public internet, the easiest way is to use kubectl to port forward the Alertmanager service to your local machine, e.g.:
$ kubectl get svc -n metaplay-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
metaplay-services-prometheus-alertmanager ClusterIP 172.20.131.56 <none> 80/TCP 13d
...
$ kubectl port-forward -n metaplay-system svc/metaplay-services-prometheus-alertmanager 8080:80
Forwarding from 127.0.0.1:8080 -> 9093
Forwarding from [::1]:8080 -> 9093
We can then separately confirm that the forwarding is in place and test the API:
$ curl http://localhost:8080/api/v1/alerts
{"status":"success","data":[]}
To trigger an alert manually, we can do the following:
$ curl -X POST http://localhost:8080/api/v1/alerts \
    -d '[{
      "status": "firing",
      "labels": {
        "alertname": "test",
        "service": "some-service",
        "severity": "warning",
        "instance": "foo.bar.example.org"
      },
      "annotations": {
        "summary": "Manually triggered test alert"
      },
      "generatorURL": "http://foo.bar.example.org"
    }]'
{"status":"success"}
$ curl http://localhost:8080/api/v1/alerts | jq .
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "test",
        "instance": "foo.bar.example.org",
        "service": "some-service",
        "severity": "warning"
      },
      "annotations": {
        "summary": "Manually triggered test alert"
      },
      "startsAt": "2020-05-20T08:59:45.042344144Z",
      "endsAt": "2020-05-20T09:04:45.042344144Z",
      "generatorURL": "http://foo.bar.example.org",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "default"
      ],
      "fingerprint": "4cbfe8d764da2f94"
    }
  ]
}
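If you trigger test alerts often, the manual steps above can be wrapped in a small script. The following is a minimal sketch, assuming the port forward from the earlier step is still active and the same /api/v1/alerts endpoint is available. The function names `build_test_alert` and `send_test_alert` and the `ALERTMANAGER_URL` variable are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a one-element JSON alert array for the given
# alert name and severity, mirroring the payload shown above.
# Note: the arguments are inserted as-is and are not JSON-escaped.
build_test_alert() {
  alertname="$1"
  severity="${2:-warning}"
  printf '[{"status":"firing","labels":{'
  printf '"alertname":"%s","service":"some-service",' "$alertname"
  printf '"severity":"%s","instance":"foo.bar.example.org"},' "$severity"
  printf '"annotations":{"summary":"Manually triggered test alert"},'
  printf '"generatorURL":"http://foo.bar.example.org"}]\n'
}

# Hypothetical helper: POST the payload to the port-forwarded Alertmanager.
# ALERTMANAGER_URL defaults to the local port-forward address used above.
send_test_alert() {
  build_test_alert "$@" | curl -sS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL:-http://localhost:8080}/api/v1/alerts"
}

# Print the payload for inspection; this does not contact Alertmanager.
build_test_alert test critical
```

Calling `send_test_alert test critical` would then submit an alert that matches a `severity=critical` matcher, which is a convenient way to check that a paging target is routed as expected.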
TIP
If you are unsure of how this works, please refer to the Deploying Infrastructure chapter for an introduction to how to deploy and manage your infrastructure stack with Terraform. After completing the chapter, you can plug in the additional configurations from this section to your own infrastructure deployment.
You can define Slack, PagerDuty, and Opsgenie targets for any alert originating from Prometheus Alertmanager. This means that any of the pre-packaged alerts or any custom alerts you create can send messages via Slack or trigger incidents in paging services.
To leverage this functionality, please refer to the appropriate documentation from the services on how to set up the target side:
Once you have set up the targets, you can configure Alertmanager through Terraform to send alerts to any of these destinations. This can be done, e.g., as follows:
module "services" {
  source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
  # ... rest of the infra-modules/environments/aws-region configurations
  slack_targets = {
    "slack-channel-target" = {
      "webhook_url" = "webhook url"
      "channel"     = "channel" # channel name without #-prefix
      "alert_level" = ["warning", "severe", "critical"]
      "matchers" = [
        "severity=critical"
      ]
      "group_by"              = ["alertgroup", "alertname"]
      "continue"              = "false"
      "group_wait"            = "10s"
      "group_interval"        = "5m"
      "repeat_interval"       = "3h"
      "mute_time_intervals"   = []
      "active_time_intervals" = []
      "routes"                = []
    }
  }
  pager_targets = {
    "pagerduty-target" = {
      "type"                  = "pagerduty"
      "api_version"           = "v2"
      "alert_level"           = ["critical"]
      "pagerduty_service_key" = "service key"
      "pagerduty_endpoint"    = "pagerduty endpoint"
      "matchers" = [
        "severity=critical"
      ]
      "group_by"              = ["alertgroup", "alertname"]
      "continue"              = "false"
      "group_wait"            = "10s"
      "group_interval"        = "5m"
      "repeat_interval"       = "3h"
      "mute_time_intervals"   = []
      "active_time_intervals" = []
      "routes"                = []
    }
    "opsgenie-target" = {
      "type"                         = "opsgenie"
      "alert_level"                  = ["critical"]
      "opsgenie_prometheus_api_key"  = "opsgenie api key"
      "opsgenie_cloudwatch_endpoint" = "opsgenie endpoint"
    }
  }
}
You can tune which alerts reach a target by listing the desired severity levels in the matchers of that target's configuration. Alternatively, use the alert_level parameter, which acts as the default for a target when no target-specific matchers are given.
If you operate multiple infrastructure stacks simultaneously, tracking metrics in multiple locations can become inconvenient. For these cases, we support sending metrics to a central location via the environments/aws-region module. Note that you must then provide a functioning receiving side (e.g., your own out-of-band Prometheus stack, VictoriaMetrics, or some other compatible system).
You can configure metrics sending via the environments/aws-region module's central_monitoring_* parameters, as below. It is also possible to opt out of sending labels that may contain sensitive information, such as the AWS account ID, to the central platform via the central_monitoring_send_sensitive_data parameter.
Under the hood, this uses the Prometheus remote write mechanism. For more information, please refer to the Prometheus documentation.
module "infra" {
  source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
  # ... rest of the infra-modules/environments/aws-region configurations
  central_monitoring_enabled             = true
  central_monitoring_url                 = "https://central.monitoring.endpoint"
  central_monitoring_username            = "your_username"
  central_monitoring_password            = "your_password"
  central_monitoring_send_sensitive_data = false
}
By default, the environments/aws-region module sets up three SNS topics for the three commonly used severity levels: info, severe, and critical. These topics are named [Organization]-[Environment]-alerts-[severity level]. If you provide any paging destinations through the pager_targets parameter, these targets are subscribed to the SNS topics that correspond to each paging target's configured alert level.
These SNS topics are used less frequently than Prometheus alerting, which we prefer for its flexibility and customizability, including the ability to alert on game server internal metrics. They are still important to subscribe to, because their primary purpose is to alert during incidents where the Kubernetes cluster and, as a cascading result, Prometheus may be inoperable. In such cases, an SNS-based alert at the critical level almost always indicates a major problem with the infrastructure stack and/or game servers.
You can connect to these SNS topics programmatically via the outputs of the environments/aws-region module, for example:
module "infra" {
  source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
  # ... rest of the infra-modules/environments/aws-region configurations
}
resource "aws_sns_topic_subscription" "critical-email-alerts" {
  topic_arn = module.infra.alert_critical_sns_arn
  protocol  = "email"
  endpoint  = "critical@metaplay.io"
}
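For a quick check outside Terraform, the topic naming pattern [Organization]-[Environment]-alerts-[severity level] can be used to locate a topic with the AWS CLI. This is a minimal sketch: the function names `alert_topic_name` and `find_alert_topic_arn` are hypothetical, the organization and environment values are placeholders, and the sketch assumes the names are joined exactly as written in the pattern.

```shell
#!/bin/sh
# Hypothetical helper: build the expected SNS topic name from
# organization, environment, and severity level.
alert_topic_name() {
  printf '%s-%s-alerts-%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: look up the topic ARN with the AWS CLI.
# Requires AWS credentials and a configured region for the stack.
find_alert_topic_arn() {
  name="$(alert_topic_name "$1" "$2" "$3")"
  aws sns list-topics \
    --query "Topics[?ends_with(TopicArn, ':${name}')].TopicArn" \
    --output text
}

# Print the expected name for placeholder values; no AWS call is made here.
alert_topic_name acme dev critical
# prints: acme-dev-alerts-critical
```

Running `find_alert_topic_arn acme dev critical` against your own account would print the matching ARN, or nothing if no such topic exists.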
endpoint_ping Tests
When using the environments/aws-region module to set up an infrastructure stack, an AWS Lambda function called endpoint_ping is set up by default. This function is seeded with access to the AWS Secrets Manager secrets for the game server deployments that the stack supports. It uses these secrets to periodically connect to each game server through its public endpoint and test that the game server is running appropriately.
The endpoint_ping function publishes the observed state of the game servers into AWS CloudWatch Metrics. These metrics can be ingested into Prometheus for observability, but we additionally set up per-deployment CloudWatch Alarms that trigger if a game server is detected to be down. These CloudWatch Alarms feed into an AWS SNS topic for critical alerts, and if you have configured any paging destinations through the pager_targets parameter, these destinations will be automatically paged via the AWS SNS topic if a game server becomes unresponsive.
If you have other sinks where you wish to receive the CloudWatch Alarms, you can provide them using the endpoint_ping_alarm_actions and endpoint_ping_ok_actions parameters for a given deployment. This allows you to, for example, direct the alarms to an existing AWS SNS topic that you already use for other alerting:
resource "aws_sns_topic" "it-alarms" {
  name = "it-alarms"
}
# have alarms be sent to an email address
resource "aws_sns_topic_subscription" "it-alarms-email" {
  topic_arn = aws_sns_topic.it-alarms.arn
  protocol  = "email"
  endpoint  = "it-alarms@metaplay.io"
}
# ... and have alarms also be sent to an HTTPS endpoint
resource "aws_sns_topic_subscription" "it-alarms-https-endpoint" {
  topic_arn = aws_sns_topic.it-alarms.arn
  protocol  = "https"
  endpoint  = "https://alarms.metaplay.io/"
}
# ... and also trigger a custom Lambda function
resource "aws_sns_topic_subscription" "it-alarms-lambda" {
  topic_arn = aws_sns_topic.it-alarms.arn
  protocol  = "lambda"
  endpoint  = "arn:aws:lambda:eu-west-1:000011112222:function:my-alarm-function"
}
module "services" {
  source = "git@github.com:metaplay-shared/infra-modules.git//environments/aws-region?ref=main"
  # ... rest of the infra-modules/environments/aws-region configurations
  deployments = {
    idler-develop = {
      endpoint_ping_alarm_enabled = true # you can opt out of alarms, if needed
      endpoint_ping_alarm_actions = [aws_sns_topic.it-alarms.arn]
      endpoint_ping_ok_actions    = [aws_sns_topic.it-alarms.arn]
    }
  }
}
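To check that alarm actions actually reach your sinks, one option is to force an alarm into the ALARM state temporarily with the AWS CLI. The sketch below is an assumption-laden illustration rather than part of Metaplay tooling: the function name `trigger_alarm_test` and the alarm name are hypothetical (you can list real alarm names with `aws cloudwatch describe-alarms`), and the helper defaults to a dry run that only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: force a CloudWatch alarm into the ALARM state so that
# its configured alarm actions fire. CloudWatch returns the alarm to its real
# state on the next evaluation of the underlying metric.
# With DRY_RUN=1 (the default), the command is printed instead of executed.
trigger_alarm_test() {
  alarm_name="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws cloudwatch set-alarm-state --alarm-name $alarm_name --state-value ALARM --state-reason manual-test"
  else
    aws cloudwatch set-alarm-state \
      --alarm-name "$alarm_name" \
      --state-value ALARM \
      --state-reason "manual-test"
  fi
}

# Dry run with a placeholder alarm name; no AWS call is made here.
trigger_alarm_test "example-endpoint-ping-alarm"
```

Setting `DRY_RUN=0` runs the command for real, which should notify every sink listed in endpoint_ping_alarm_actions, and every sink in endpoint_ping_ok_actions once the alarm returns to OK.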