Cross-Account Database Monitoring with PMM and AWS Transit Gateway — 5-Part Series
- Part 1: Architecture and TGW Setup
- Part 2: IAM Roles and Users
- Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch (You Are Here)
- Part 5: Backup, Recovery, and Node Reconfiguration
With PMM collecting metrics from every RDS instance and EC2 node across all accounts, the next step is turning those metrics into actionable alerts. PMM's alerting stack is built on Grafana Alerting and Alertmanager, which means you can use MetricsQL (backward-compatible with PromQL) for alert expressions and integrate with any webhook-based notification target. This part covers the full alerting configuration: custom alert templates, a PagerDuty contact point with a custom notification template, notification routing policies, adding CloudWatch data sources via cross-account role assumption, and managing silences with amtool.
Alert Templates
PMM alert templates are YAML files that define the expression and parameters for an alert rule. They serve as a reusable base: you create one template and derive multiple rules from it (for example, separate rules for warning and critical thresholds, or for production versus staging).
Template Format
A custom template is defined by the following fields:
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier. No spaces or special characters. |
| version | Yes | Template format version (use 1). |
| summary | Yes | Human-readable description. |
| expr | Yes | MetricsQL query string with parameter placeholders. |
| params | No | Parameter definitions (name, type, range, default value). |
| for | Yes | How long the expression must be true before firing. |
| severity | Yes | Default alert severity: critical, warning, or notice. |
| labels | No | Additional labels attached to fired alerts (e.g., runbook URL). |
| annotations | No | Additional annotations (e.g., summary, description text). |
Template Example: High CPU Load
templates:
  - name: NodeCpuUtilization
    version: 1
    summary: NodeCpuUtilization
    expr: >
      (1 - sum(rate(node_cpu_seconds_total{mode='idle'}[1m])) by (node_name)
      / count(node_cpu_seconds_total{mode='idle'}) by (node_name)) * 100
      > [[ .threshold ]]
    params:
      - name: threshold
        summary: A percentage from configured maximum
        unit: '%'
        type: float
        range: [0, 100]
        value: 90
    for: 5m
    severity: critical
    labels:
      runbook: https://localhost/runbooks/#NodeCpuUtilization
      service: generic
    annotations:
      description: Check if a process is not using too much CPU
      summary: "CRITICAL - CPU Usage - {{ $labels.node_name }}"
The [[ .threshold ]] placeholder is replaced with the configured parameter value at rule creation time. This lets you reuse the same template for a 90% critical threshold and an 80% warning threshold without duplicating the expression.
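You can mimic the substitution in plain bash to see what two rules derived from one template look like. This is an illustration only: the expression is abbreviated, and PMM performs the real substitution internally with Go templates.

```shell
#!/bin/bash
# Abbreviated template expression with the PMM parameter placeholder.
expr_template='(1 - cpu_idle_ratio) * 100 > [[ .threshold ]]'

# One template, two rules: substitute 90 for critical, 80 for warning.
critical_expr=${expr_template//"[[ .threshold ]]"/90}
warning_expr=${expr_template//"[[ .threshold ]]"/80}

echo "$critical_expr"   # (1 - cpu_idle_ratio) * 100 > 90
echo "$warning_expr"    # (1 - cpu_idle_ratio) * 100 > 80
```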
Create an Alert Rule from a Template
- Go to Alerting → Alert Rules → New alert rule
- Select the Percona templated alert option
- In Template details, choose your template — the Name, Duration, and Severity fields populate automatically
- Add Filters to scope the rule (e.g., service_name=prod-orders-db). Multiple filters use AND logic.
- Select a Folder for the rule
- Click Save and Exit
Filter label matching: Label names in filters must be exact matches. Use the Explore menu in PMM to browse available labels for your registered services before writing filter expressions.
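Because filters combine with AND, a rule with several filters behaves like a single label selector in the underlying query. A small bash illustration (the label values here are made up):

```shell
#!/bin/bash
# Two rule filters...
filters=('service_name="prod-orders-db"' 'environment="production"')

# ...are equivalent to one selector containing both matchers, comma-joined:
selector=$(IFS=','; echo "{${filters[*]}}")
echo "$selector"   # {service_name="prod-orders-db",environment="production"}
```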
PagerDuty Integration
Not using PagerDuty? Grafana Alerting supports many other contact point types out of the box — Grafana IRM, OpsGenie, Slack, Microsoft Teams, webhooks, and more. The alert template format and notification policy concepts covered in this section apply to all of them. Only the contact point configuration (Steps 1–3 below) differs per platform.
Step 1: Get the PagerDuty Integration Key
- Log into PagerDuty and open the target service
- Go to the Integrations tab → Add Integration
- Select Events API v2
- Name it (e.g., "PMM Alerts") and click Add Integration
- Copy the Integration Key — you'll paste it into PMM in the next step
Step 2: Create the PagerDuty Notification Template
PMM uses Grafana's notification templating. Create a template named pagerduty to control the alert title sent to PagerDuty:
- Go to Alerting → Contact Points → Notification templates
- Set the template name to pagerduty
- Paste the following in the template body:
{{ define "pagerduty" }}
{{ index .CommonLabels "alertname"}} - {{ index .CommonAnnotations "summary"}}
{{ end }}
This template surfaces the alert name and summary annotation as the PagerDuty incident title, making incidents immediately readable in the PagerDuty timeline.
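As an illustration, for the NodeCpuUtilization rule defined earlier, the rendered title joins the alert name and the summary annotation (the node name below is made up):

```shell
#!/bin/bash
# What the "pagerduty" notification template renders for one alert group:
# alertname from CommonLabels, summary from CommonAnnotations, joined by " - ".
alertname="NodeCpuUtilization"
summary="CRITICAL - CPU Usage - prod-orders-node"
title="${alertname} - ${summary}"
echo "$title"   # NodeCpuUtilization - CRITICAL - CPU Usage - prod-orders-node
```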
Step 3: Create the PagerDuty Contact Point
- Go to Alerting → Contact Points → New contact point
- Set:
  - Name: pagerduty
  - Type: PagerDuty
  - Integration Key: paste the key from Step 1
  - Severity: leave at default or set as appropriate
- In the Summary field, enter: {{ template "pagerduty" . }}
- Click Save contact point
Notification Policy
Notification policies define which alerts route to which contact points. The root policy applies a default contact point to all alerts. Specific policies let you override routing based on label matchers.
- Go to Alerting → Notification policies → New specific policy
- In Matching labels, add label matchers to scope the policy. Leave it empty to match all alerts.
- Select pagerduty as the contact point
- Optional settings:
- Continue matching subsequent sibling nodes — useful for sending to a catch-all contact point in addition to the matched one
- Override grouping — disables root policy grouping for this branch
- Override general timings — sets a custom group wait interval for new alert groups
- Mute timing — suppresses notifications on a recurring schedule (e.g., weekends for low-severity alerts)
- Save the policy
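To double-check the resulting routing tree without clicking through the UI, recent Grafana versions expose it through the provisioning API, which PMM proxies under /graph. A command fragment, not a tested script — it assumes an admin-role PMM API key and network access to the PMM server, and the endpoint path may vary by Grafana version:

```shell
# Sketch: dump the notification routing tree (root policy plus specific policies).
curl -k -s -H "Authorization: Bearer $API_KEY" \
  "https://localhost:8443/graph/api/v1/provisioning/policies" | jq .
```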
Add CloudWatch Data Sources via IAM Role Assumption
CloudWatch data sources power the infrastructure-level dashboards (CPU, disk I/O, storage) and can serve as the data source for alert rules targeting RDS metrics. You need one data source per target account; the script below automates creating them.
Prerequisites: AWS Configuration
The PMM EC2 instance must have its AWS profiles configured to assume the correct roles. Verify the configuration:
cat ~/.aws/config
Expected structure:
[profile default]
sso_start_url = https://your-org.awsapps.com/start
sso_account_id = <pmm-account-id>
sso_region = us-east-1
sso_role_name = <your-sso-role>
[profile account-a]
role_arn = arn:aws:iam::<account-id>:role/<admin-role>
source_profile = default
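Before relying on these profiles, confirm that role assumption actually works for each one. A command fragment (profile names match the config above):

```shell
# The returned Arn should reference the assumed role in the target account,
# e.g. arn:aws:sts::<account-id>:assumed-role/<admin-role>/...
aws sts get-caller-identity --profile account-a
```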
Connect to PMM via SSM Port Forwarding
The PMM API is accessible only within the VPC. Use SSM to tunnel the API port to your local machine:
aws ssm start-session \
--profile <your-profile> \
--target <pmm-instance-id> \
--region us-east-1 \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["443"],"localPortNumber":["8443"]}'
Bulk-Add CloudWatch Data Sources
The script below adds one CloudWatch data source per account, using cross-account role assumption. It checks for existing data sources to avoid duplicates on re-runs:
#!/bin/bash
PMM_URL="https://localhost:8443"
API_KEY="<your_pmm_api_key>"
ROLE_NAME="AmazonRDSforPMMrole"
REGION="us-east-1"
account_names=(
"account-a"
"account-b"
"account-c"
"account-d"
)
account_ids=(
111111111111
222222222222
333333333333
444444444444
)
existing_data=$(curl -k -s -H "Authorization: Bearer $API_KEY" "$PMM_URL/graph/api/datasources")
for i in "${!account_ids[@]}"; do
name="${account_names[$i]}"
account_id="${account_ids[$i]}"
arn="arn:aws:iam::${account_id}:role/${ROLE_NAME}"
exists=$(echo "$existing_data" | jq -r ".[] | select(.type==\"cloudwatch\") | .jsonData.assumeRoleArn" | grep -F "$arn")
if [[ -n "$exists" ]]; then
echo "Skipping $name — already exists with ARN: $arn"
continue
fi
response=$(curl -k -s -X POST "$PMM_URL/graph/api/datasources" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d @- <<EOF
{
"name": "$name",
"type": "cloudwatch",
"access": "proxy",
"isDefault": false,
"jsonData": {
"authType": "default",
"defaultRegion": "$REGION",
"assumeRoleArn": "$arn"
}
}
EOF
)
echo "$response"
echo "Created data source for $name ($account_id)"
done
Run the script from the machine where the SSM tunnel is active.
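A quick way to verify the result is to list the CloudWatch data sources back out of the API. A command fragment, using the same API key and tunnel as the script above:

```shell
# Print each CloudWatch data source and the role it assumes.
curl -k -s -H "Authorization: Bearer $API_KEY" \
  "https://localhost:8443/graph/api/datasources" \
  | jq -r '.[] | select(.type=="cloudwatch") | "\(.name)\t\(.jsonData.assumeRoleArn)"'
```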
Managing Silences with amtool
amtool is the Alertmanager CLI tool. It allows you to query active alerts and manage silences from the command line — useful during maintenance windows or when acknowledging known flapping alerts without opening the PMM UI.
Install amtool
Requires Go 1.22 or later on the PMM node:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
Add the Go bin directory to PATH in .bashrc:
export PATH=$PATH:/root/go/bin
Configure amtool
Create an API key in PMM (Configuration → API Keys → Add API Key) with:
- Name: amtool
- Role: Editor
- Time to live: leave empty (no expiry)
Create the configuration directory and files:
mkdir -p ~/.config/amtool/
~/.config/amtool/config.yml:
alertmanager.url: "https://localhost/graph/api/alertmanager/grafana/"
version-check: false
http.config.file: "/root/.config/amtool/http.conf"
~/.config/amtool/http.conf:
authorization:
type: Bearer
credentials: <your_amtool_api_key>
tls_config:
insecure_skip_verify: True
Remote access: If configuring amtool on a laptop rather than the PMM node, replace localhost in alertmanager.url with the PMM server's IP address or DNS name.
Common amtool Commands
View active alerts:
amtool alert
Example output:
Alertname Starts At Summary State
MySQLPrimaryReadOnly Alerting Rule 2025-05-01 11:49:00 UTC CRITICAL - MySQL Primary ReadOnly - prod-orders-replica-db active
Silence an alert for a specific service:
amtool silence add service_name=<service-name>-mysql alertname=<alertname> --duration=16h --comment "maintenance window"
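Matchers also accept Alertmanager's regex operator (=~), which is handy when one node fires several different alerts during maintenance. A hypothetical example (the node name is made up):

```shell
# Silence every alert carrying this node_name label, whatever the alertname,
# for the duration of a maintenance window.
amtool silence add node_name="prod-orders-node" alertname=~".+" \
  --duration=4h --comment "node maintenance"
```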
View active silences:
amtool silence query
Expire a specific silence by ID:
amtool silence expire <ID>
Expire all active silences at once:
amtool silence expire $(amtool silence query -q)
Continue the Series
- ← Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch (You Are Here)
- Part 5: Backup, Recovery, and Node Reconfiguration →