Cross-Account Database Monitoring with PMM and AWS Transit Gateway — 5-Part Series
- Part 1: Architecture and TGW Setup
- Part 2: IAM Roles and Users
- Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch
- Part 5: Backup, Recovery, and Node Reconfiguration (You Are Here)
The PMM EC2 instance is a single point of observability for your entire database fleet. If it fails without a recovery plan, you lose monitoring and alerting across every account until you rebuild from scratch — which means re-registering all services, re-adding CloudWatch data sources, and reconfiguring every PMM client agent. A weekly AWS Backup policy eliminates most of that work and reduces recovery time to the duration of a snapshot restore plus a few reconfiguration steps.
This final part covers the AWS Backup configuration, the step-by-step recovery procedure, and how to reconfigure PMM client agents on both x86_64 and aarch64 nodes after a PMM server restore where the IP address changes.
Backup Architecture
The backup configuration manages four AWS resources:
| Resource | Name | Purpose |
|---|---|---|
| Backup Vault | aws-pmm01-backup-vault | Stores EBS snapshots taken by the backup plan |
| Backup Plan | aws-pmm01-weekly-backup | Defines schedule, windows, and retention policy |
| Backup Selection | aws-pmm01-backup-selection | Targets the specific EC2 instance to back up |
| IAM Role | AwsBackupServiceRole | Grants AWS Backup permission to create and manage snapshots |
Backup Vault
The backup vault aws-pmm01-backup-vault stores all recovery points. Tag it consistently with your environment tagging standard:
| Tag Key | Value |
|---|---|
| environment | prod |
| owner | platform |
| service | backup |
| component | ec2 |
Backup Plan
The plan aws-pmm01-weekly-backup runs on a weekly schedule with the following configuration:
| Setting | Value | Rationale |
|---|---|---|
| Schedule | Sundays at 5:00 AM UTC | Off-peak for most EU/US timezones; minimizes I/O impact on running containers |
| Start window | 60 minutes | Time AWS Backup has to begin the job before it's marked failed |
| Completion window | 120 minutes | Maximum allowed duration for a backup job |
| Retention period | 30 days | Keeps four weekly recovery points available at any time |
| Continuous backup | Disabled | Point-in-time recovery isn't required for the PMM EC2 |
IAM Configuration for AWS Backup
Service Role: AwsBackupServiceRole
AWS Backup requires a service role to create and manage EBS snapshots and perform EC2 restores. The role uses the following configuration:
- Path: `/ops/`
- Permissions boundary: `arn:aws:iam::<account_id>:policy/<your-permissions-boundary>`
Policies Attached
- AWS managed policy `AWSBackupServiceRolePolicyForBackup`: grants AWS Backup the baseline permissions to create backup jobs and access backup storage
- Custom policy `backup-service-policy` covering:
  - EC2 operations: create/delete snapshots, create/delete volumes, create/modify tags, start/stop/reboot instances, volume operations, snapshot operations
  - Backup operations: full access to AWS Backup operations and backup storage
Terraform Reference
The backup vault, plan, IAM role, and selection described in this guide are fully managed as Terraform in production. The snippets below cover the key resources — adapt variable names and values to your environment.
Vault and Backup Plan
```hcl
resource "aws_backup_vault" "pmm" {
  name = "aws-pmm01-backup-vault"
  tags = {
    environment = "prod"
    owner       = "platform"
    service     = "backup"
    component   = "ec2"
  }
}

resource "aws_backup_plan" "pmm_weekly" {
  name = "aws-pmm01-weekly-backup"
  rule {
    rule_name         = "weekly-sunday-5am-utc"
    target_vault_name = aws_backup_vault.pmm.name
    schedule          = "cron(0 5 ? * SUN *)"
    start_window      = 60  # minutes before the job is marked failed
    completion_window = 120 # maximum allowed job duration in minutes
    lifecycle {
      delete_after = 30 # retain 4 weekly recovery points
    }
  }
}
```
IAM Role for AWS Backup
```hcl
resource "aws_iam_role" "backup_service" {
  name = "AwsBackupServiceRole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "backup.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "backup_service_managed" {
  role       = aws_iam_role.backup_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

resource "aws_iam_role_policy" "backup_service_custom" {
  name = "backup-service-policy"
  role = aws_iam_role.backup_service.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ec2:CreateSnapshot", "ec2:DeleteSnapshot",
        "ec2:CreateVolume", "ec2:DeleteVolume",
        "ec2:CreateTags", "ec2:ModifySnapshotAttribute",
        "ec2:DescribeVolumes", "ec2:DescribeSnapshots",
        "ec2:StartInstances", "ec2:StopInstances", "ec2:RebootInstances"
      ]
      Resource = "*"
    }]
  })
}
```
Backup Selection
```hcl
resource "aws_backup_selection" "pmm" {
  name         = "aws-pmm01-backup-selection"
  plan_id      = aws_backup_plan.pmm_weekly.id
  iam_role_arn = aws_iam_role.backup_service.arn
  resources    = [var.pmm_instance_arn]
}
```
If you'd like the complete module including the full variable definitions, get in touch and we're happy to share it.
Recovery Procedure
The following steps restore the PMM EC2 instance from a backup vault recovery point. Follow them in order — skipping steps, particularly the subnet selection and IAM role attachment, will leave the restored instance unable to communicate via TGW or access CloudWatch.
Step 1: Disable the PMM Service in PagerDuty
Before restoring, mute the PMM service in PagerDuty so that stale firing alerts from the old instance state don't page your on-call team during the recovery window.
Step 2: Open the AWS Backup Console
Navigate to AWS Backup → Backup vaults → aws-pmm01-backup-vault. Select the most recent recovery point and click Restore.
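If you prefer the CLI, the newest recovery point in the vault can be resolved directly (a sketch using the AWS CLI; it assumes credentials for the monitoring account are configured):

```shell
# Print the ARN of the most recent recovery point in the PMM backup vault.
aws backup list-recovery-points-by-backup-vault \
  --backup-vault-name aws-pmm01-backup-vault \
  --query 'sort_by(RecoveryPoints, &CreationDate)[-1].RecoveryPointArn' \
  --output text
```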
Step 3: Configure Restore Options
Subnet selection is critical: Select the same subnet as the original PMM EC2 instance. The Transit Gateway attachments and route tables created in Part 1 are scoped to specific subnets. Restoring to a different subnet breaks TGW connectivity and requires updating the TGW attachment and all route tables.
Additional restore settings:
- IAM role for restore: select `AwsBackupServiceRole`
- IAM role for the restored instance: do not select one at this point; the backup service role does not have permission to attach an instance role during the restore. Attach the correct IAM role after the restore completes.
- All other options can remain at their default values.
Step 4: Monitor Restoration Progress
Monitor the job in AWS Backup → Jobs → Restore jobs until the job status shows Completed.
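The same status is available from the CLI if you are scripting the runbook (a sketch; assumes the default region is where the restore is running):

```shell
# List restore jobs with their current status and the restored resource ARN.
aws backup list-restore-jobs \
  --query 'RestoreJobs[].[RestoreJobId,Status,CreatedResourceArn]' \
  --output table
```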
Step 5: Attach the IAM Role to the Restored Instance
Once the instance is running:
- Go to EC2 → Instances → Actions → Security → Modify IAM role
- Select the IAM role `pmm-ec2-instance-role`
- Apply the change
- Reboot the instance to ensure the instance metadata reflects the updated role
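The console clicks above map to two CLI calls (a sketch; the instance ID is a placeholder, and it assumes an instance profile with the same name as the role exists):

```shell
# Attach the instance profile to the restored instance, then reboot
# so the instance metadata reflects the new role.
aws ec2 associate-iam-instance-profile \
  --instance-id <new-instance-id> \
  --iam-instance-profile Name=pmm-ec2-instance-role
aws ec2 reboot-instances --instance-ids <new-instance-id>
```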
Step 6: Remove Stale PagerDuty Configuration from PMM UI
Log into the PMM UI on the restored instance and remove the existing PagerDuty contact point and notification policy. The restored instance retains the previous configuration, but PagerDuty state should be re-validated before re-enabling alerting.
Step 7: Update the PMM Instance ID in Terraform
The restored instance has a new EC2 instance ID. If your Terraform configuration references the PMM instance ID directly — for example, in the TGW attachment module or in any resource that targets the instance by ID — update it to the new value, run a plan, and merge the change. This is required to keep your infrastructure-as-code in sync and to ensure future Terraform applies don't attempt to recreate resources against a stale instance ID.
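One way to take this step off the runbook entirely is to resolve the instance by tag instead of hardcoding its ID (a Terraform sketch; the `Name` tag value is an assumption):

```hcl
# Look up the PMM instance by tag so a restored instance with a new ID
# is picked up on the next plan without editing the configuration.
data "aws_instance" "pmm" {
  filter {
    name   = "tag:Name"
    values = ["aws-pmm01"]
  }
  filter {
    name   = "instance-state-name"
    values = ["running"]
  }
}
```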
Step 8: Reconfigure PMM Client Agents on EC2 Nodes
If the restored instance received a new private IP address, every PMM client agent in the fleet needs to be reconfigured. See the full procedure in the next section.
Step 9: Re-enable the PMM Service in PagerDuty
After verifying that the PMM inventory shows all services as up (check Inventory → Services → Status), re-enable the PagerDuty service and restore the contact point and notification policy in the PMM UI.
Reconfiguring PMM Client Agents After a PMM IP Change
When the PMM server IP changes after a restore, each monitored EC2 node needs its PMM client agent reconfigured. The procedure differs between x86_64 and aarch64 architectures.
The first step is the same for all nodes: remove the stale node entry from the PMM inventory to prevent duplicate node registration errors.
Step 1: Delete the node from PMM Inventory
Go to PMM → Inventory → Nodes, locate the node, click Delete, enable Force mode, and confirm. Force mode removes the node and all associated services in one operation.
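With many nodes, the same removal can be scripted against PMM's inventory API rather than clicked through the UI (a sketch against the PMM v2 API; the node ID and credentials are placeholders):

```shell
# Force-remove a node and all of its associated services from the PMM inventory.
curl -ks -u admin:<your_pmm_password> \
  -X POST https://<new-pmm-server-ip>/v1/inventory/Nodes/Remove \
  -H 'Content-Type: application/json' \
  -d '{"node_id": "<node-id>", "force": true}'
```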
Reconfigure x86_64 Nodes (e.g., masterdb02)
```shell
# Re-register the node against the new PMM server address
pmm-admin config \
  --server-insecure-tls \
  --server-url=https://admin:<your_pmm_password>@<new-pmm-server-ip>:443 \
  <node-ip> generic <node-name> \
  --force

# Re-add the MySQL service
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB
```
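Before moving on to the next node, it's worth confirming the agent actually reconnected:

```shell
# Confirm the agent is connected to the new server and the service is listed.
pmm-admin status
pmm-admin list
```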
Reconfigure aarch64 Nodes (e.g., ec2-go-db-0)
```shell
# Point the agent at the new PMM server and re-register the node
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name> \
  --force

# Restart the agent and verify registration
systemctl start pmm-agent@percona
systemctl status pmm-agent@percona
pmm-admin list

# Re-add the MySQL service
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB
```
Reconfigure aarch64 ProxySQL Nodes
```shell
# Point the agent at the new PMM server and re-register the node
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name>

# Re-add the ProxySQL service via the local admin interface
pmm-admin add proxysql \
  --username=admin \
  --password=<your_proxysql_admin_password> \
  --service-name=<node-name> \
  --host=127.0.0.1 \
  --port=6032
```
Order matters: Always delete the node from the PMM inventory before running `pmm-admin config` or `pmm-agent setup`. Attempting to re-register a node that still exists in the inventory produces a duplicate node error even with `--force`.
Recovery Checklist
- PagerDuty PMM service disabled before restore
- Recovery point selected from `aws-pmm01-backup-vault`
- Restore subnet matches original PMM instance subnet
- IAM role `pmm-ec2-instance-role` attached post-restore
- Instance rebooted after IAM role attachment
- PMM UI accessible and healthy
- Stale PagerDuty configuration removed from PMM UI
- Terraform updated with new EC2 instance ID and merged
- All EC2 PMM client agents reconfigured with new PMM server IP
- PMM Inventory shows all services as Up
- PagerDuty contact point and notification policy restored in PMM
- PagerDuty PMM service re-enabled
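The agent-side items in the checklist can be swept across the fleet in one pass (a sketch; the hostnames and SSH access are assumptions):

```shell
# Check agent connectivity on every monitored node (hostnames are examples).
for host in masterdb02 ec2-go-db-0; do
  echo "== ${host} =="
  ssh "${host}" "pmm-admin status"
done
```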
Series Complete
- Part 1: Architecture and TGW Setup
- Part 2: IAM Roles and Users
- Part 3: Installing PMM Server and Registering Services
- Part 4: Alerting with PagerDuty and CloudWatch
- Part 5: Backup, Recovery, and Node Reconfiguration (You Are Here)