Cross-Account Database Monitoring with PMM and AWS Transit Gateway — 5-Part Series
- Part 1: Architecture and TGW Setup
- Part 2: IAM Roles and Users
- Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch (You Are Here)
- Part 5: Backup, Recovery, and Node Reconfiguration
With PMM collecting metrics from every RDS instance and EC2 node across all accounts, the next step is turning those metrics into actionable alerts. PMM's alerting stack is built on Grafana Alerting and Alertmanager, which means you can use MetricsQL (backward-compatible with PromQL) for alert expressions and integrate with any webhook-based notification target. This part covers the full alerting configuration: custom alert templates, a PagerDuty contact point with a custom notification template, notification routing policies, adding CloudWatch data sources via cross-account role assumption, and managing silences with amtool.
Alert Templates
PMM alert templates are YAML files that define the expression and parameters for an alert rule. They serve as a reusable base: you create one template and derive multiple rules from it (for example, separate rules for warning and critical thresholds, or for production versus staging).
Template Format
A custom template is defined by the following fields:
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier. No spaces or special characters. |
| version | Yes | Template format version (use 1). |
| summary | Yes | Human-readable description. |
| expr | Yes | MetricsQL query string with parameter placeholders. |
| params | No | Parameter definitions (name, type, range, default value). |
| for | Yes | How long the expression must be true before firing. |
| severity | Yes | Default alert severity: critical, warning, or notice. |
| labels | No | Additional labels attached to fired alerts (e.g., runbook URL). |
| annotations | No | Additional annotations (e.g., summary, description text). |
Template Example: High CPU Load
templates:
  - name: NodeCpuUtilization
    version: 1
    summary: NodeCpuUtilization
    expr: >
      (1 - sum(rate(node_cpu_seconds_total{mode='idle'}[1m])) by (node_name)
      / count(node_cpu_seconds_total{mode='idle'}) by (node_name)) * 100
      > [[ .threshold ]]
    params:
      - name: threshold
        summary: A percentage from configured maximum
        unit: '%'
        type: float
        range: [0, 100]
        value: 90
    for: 5m
    severity: critical
    labels:
      runbook: https://localhost/runbooks/#NodeCpuUtilization
      service: generic
    annotations:
      description: Check if a process is not using too much CPU
      summary: "CRITICAL - CPU Usage - {{ $labels.node_name }}"
The [[ .threshold ]] placeholder is replaced with the configured parameter value at rule creation time. This lets you reuse the same template for a 90% critical threshold and an 80% warning threshold without duplicating the expression.
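You can mimic the substitution in plain bash to see what two rules derived from one template look like. This is an illustration only: the expression is abbreviated, and PMM performs the real substitution internally with Go templates.

```shell
#!/bin/bash
# Abbreviated template expression with the PMM parameter placeholder.
expr_template='(1 - cpu_idle_ratio) * 100 > [[ .threshold ]]'

# One template, two rules: substitute 90 for critical, 80 for warning.
critical_expr=${expr_template//"[[ .threshold ]]"/90}
warning_expr=${expr_template//"[[ .threshold ]]"/80}

echo "$critical_expr"   # (1 - cpu_idle_ratio) * 100 > 90
echo "$warning_expr"    # (1 - cpu_idle_ratio) * 100 > 80
```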
Create an Alert Rule from a Template
- Go to Alerting → Alert Rules → New alert rule
- Select the Percona templated alert option
- In Template details, choose your template — the Name, Duration, and Severity fields populate automatically
- Add Filters to scope the rule (e.g., service_name=prod-orders-db). Multiple filters use AND logic.
- Select a Folder for the rule
- Click Save and Exit
Filter label matching: Label names in filters must be exact matches. Use the Explore menu in PMM to browse available labels for your registered services before writing filter expressions.
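Because filters combine with AND, a rule with several filters behaves like a single label selector in the underlying query. A small bash illustration (the label values here are made up):

```shell
#!/bin/bash
# Two rule filters...
filters=('service_name="prod-orders-db"' 'environment="production"')

# ...are equivalent to one selector containing both matchers, comma-joined:
selector=$(IFS=','; echo "{${filters[*]}}")
echo "$selector"   # {service_name="prod-orders-db",environment="production"}
```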
PagerDuty Integration
Not using PagerDuty? Grafana Alerting supports many other contact point types out of the box — Grafana IRM, OpsGenie, Slack, Microsoft Teams, webhooks, and more. The alert template format and notification policy concepts covered in this section apply to all of them. Only the contact point configuration (Steps 1–3 below) differs per platform.
Step 1: Get the PagerDuty Integration Key
- Log into PagerDuty and open the target service
- Go to the Integrations tab → Add Integration
- Select Events API v2
- Name it (e.g., "PMM Alerts") and click Add Integration
- Copy the Integration Key — you'll paste it into PMM in the next step
Step 2: Create the PagerDuty Notification Template
PMM uses Grafana's notification templating. Create a template named pagerduty to control the alert title sent to PagerDuty:
- Go to Alerting → Contact Points → Notification templates
- Set the template name to pagerduty
- Paste the following in the template body:
{{ define "pagerduty" }}
{{ index .CommonLabels "alertname"}} - {{ index .CommonAnnotations "summary"}}
{{ end }}
This template surfaces the alert name and summary annotation as the PagerDuty incident title, making incidents immediately readable in the PagerDuty timeline.
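As an illustration, for the NodeCpuUtilization rule defined earlier, the rendered title joins the alert name and the summary annotation (the node name below is made up):

```shell
#!/bin/bash
# What the "pagerduty" notification template renders for one alert group:
# alertname from CommonLabels, summary from CommonAnnotations, joined by " - ".
alertname="NodeCpuUtilization"
summary="CRITICAL - CPU Usage - prod-orders-node"
title="${alertname} - ${summary}"
echo "$title"   # NodeCpuUtilization - CRITICAL - CPU Usage - prod-orders-node
```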
Step 3: Create the PagerDuty Contact Point
- Go to Alerting → Contact Points → New contact point
- Set:
  - Name: pagerduty
  - Type: PagerDuty
  - Integration Key: paste the key from Step 1
  - Severity: leave at default or set as appropriate
- In the Summary field, enter: {{ template "pagerduty" . }}
- Click Save contact point
Notification Policy
Notification policies define which alerts route to which contact points. The root policy applies a default contact point to all alerts. Specific policies let you override routing based on label matchers.
- Go to Alerting → Notification policies → New specific policy
- In Matching labels, add label matchers to scope the policy. Leave it empty to match all alerts.
- Select pagerduty as the contact point
- Optional settings:
- Continue matching subsequent sibling nodes — useful for sending to a catch-all contact point in addition to the matched one
- Override grouping — disables root policy grouping for this branch
- Override general timings — sets a custom group wait interval for new alert groups
- Mute timing — suppresses notifications on a recurring schedule (e.g., weekends for low-severity alerts)
- Save the policy
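To double-check the resulting routing tree without clicking through the UI, recent Grafana versions expose it through the provisioning API, which PMM proxies under /graph. A command fragment, not a tested script — it assumes an admin-role PMM API key and network access to the PMM server, and the endpoint path may vary by Grafana version:

```shell
# Sketch: dump the notification routing tree (root policy plus specific policies).
curl -k -s -H "Authorization: Bearer $API_KEY" \
  "https://localhost:8443/graph/api/v1/provisioning/policies" | jq .
```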
Add CloudWatch Data Sources via IAM Role Assumption
CloudWatch data sources power the infrastructure-level dashboards (CPU, disk I/O, storage) and can serve as the data source for alert rules targeting RDS metrics. You need one data source per target account; the script below automates creating them.
Prerequisites: AWS Configuration
The PMM EC2 instance must have its AWS profiles configured to assume the correct roles. Verify the configuration:
cat ~/.aws/config
Expected structure:
[profile default]
sso_start_url = https://your-org.awsapps.com/start
sso_account_id = <pmm-account-id>
sso_region = us-east-1
sso_role_name = <your-sso-role>
[profile account-a]
role_arn = arn:aws:iam::<account-id>:role/<admin-role>
source_profile = default
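Before relying on these profiles, confirm that role assumption actually works for each one. A command fragment (profile names match the config above):

```shell
# The returned Arn should reference the assumed role in the target account,
# e.g. arn:aws:sts::<account-id>:assumed-role/<admin-role>/...
aws sts get-caller-identity --profile account-a
```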
Connect to PMM via SSM Port Forwarding
The PMM API is accessible only within the VPC. Use SSM to tunnel the API port to your local machine:
aws ssm start-session \
--profile <your-profile> \
--target <pmm-instance-id> \
--region us-east-1 \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["443"],"localPortNumber":["8443"]}'
Bulk-Add CloudWatch Data Sources
The script below adds one CloudWatch data source per account, using cross-account role assumption. It checks for existing data sources to avoid duplicates on re-runs:
#!/bin/bash
PMM_URL="https://localhost:8443"
API_KEY="<your_pmm_api_key>"
ROLE_NAME="AmazonRDSforPMMrole"
REGION="us-east-1"
account_names=(
"account-a"
"account-b"
"account-c"
"account-d"
)
account_ids=(
111111111111
222222222222
333333333333
444444444444
)
existing_data=$(curl -k -s -H "Authorization: Bearer $API_KEY" "$PMM_URL/graph/api/datasources")
for i in "${!account_ids[@]}"; do
name="${account_names[$i]}"
account_id="${account_ids[$i]}"
arn="arn:aws:iam::${account_id}:role/${ROLE_NAME}"
exists=$(echo "$existing_data" | jq -r ".[] | select(.type==\"cloudwatch\") | .jsonData.assumeRoleArn" | grep -F "$arn")
if [[ -n "$exists" ]]; then
echo "Skipping $name — already exists with ARN: $arn"
continue
fi
response=$(curl -k -s -X POST "$PMM_URL/graph/api/datasources" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d @- <<EOF
{
"name": "$name",
"type": "cloudwatch",
"access": "proxy",
"isDefault": false,
"jsonData": {
"authType": "default",
"defaultRegion": "$REGION",
"assumeRoleArn": "$arn"
}
}
EOF
)
echo "$response"
echo "Created data source for $name ($account_id)"
done
Run the script from the machine where the SSM tunnel is active.
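A quick way to verify the result is to list the CloudWatch data sources back out of the API. A command fragment, using the same API key and tunnel as the script above:

```shell
# Print each CloudWatch data source and the role it assumes.
curl -k -s -H "Authorization: Bearer $API_KEY" \
  "https://localhost:8443/graph/api/datasources" \
  | jq -r '.[] | select(.type=="cloudwatch") | "\(.name)\t\(.jsonData.assumeRoleArn)"'
```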
Managing Silences with amtool
amtool is the Alertmanager CLI tool. It allows you to query active alerts and manage silences from the command line — useful during maintenance windows or when acknowledging known flapping alerts without opening the PMM UI.
Install amtool
Requires Go 1.22 or later on the PMM node:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
Add the Go bin directory to PATH in .bashrc:
export PATH=$PATH:/root/go/bin
Configure amtool
Create an API key in PMM (Configuration → API Keys → Add API Key) with:
- Name: amtool
- Role: Editor
- Time to live: leave empty (no expiry)
Create the configuration directory and files:
mkdir -p ~/.config/amtool/
~/.config/amtool/config.yml:
alertmanager.url: "https://localhost/graph/api/alertmanager/grafana/"
version-check: false
http.config.file: "/root/.config/amtool/http.conf"
~/.config/amtool/http.conf:
authorization:
type: Bearer
credentials: <your_amtool_api_key>
tls_config:
insecure_skip_verify: True
Remote access: If configuring amtool on a laptop rather than the PMM node, replace localhost in alertmanager.url with the PMM server's IP address or DNS name.
Common amtool Commands
View active alerts:
amtool alert
Example output:
Alertname Starts At Summary State
MySQLPrimaryReadOnly Alerting Rule 2025-05-01 11:49:00 UTC CRITICAL - MySQL Primary ReadOnly - prod-orders-replica-db active
Silence an alert for a specific service:
amtool silence add service_name=<service-name>-mysql alertname=<alertname> --duration=16h --comment "maintenance window"
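Matchers also accept Alertmanager's regex operator (=~), which is handy when one node fires several different alerts during maintenance. A hypothetical example (the node name is made up):

```shell
# Silence every alert carrying this node_name label, whatever the alertname,
# for the duration of a maintenance window.
amtool silence add node_name="prod-orders-node" alertname=~".+" \
  --duration=4h --comment "node maintenance"
```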
View active silences:
amtool silence query
Expire a specific silence by ID:
amtool silence expire <ID>
Expire all active silences at once:
amtool silence expire $(amtool silence query -q)
Continue the Series
- ← Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch (You Are Here)
- Part 5: Backup, Recovery, and Node Reconfiguration →