Cross-Account Database Monitoring with PMM and AWS Transit Gateway — Part 5: Backup, Recovery, and Node Reconfiguration

Configure weekly AWS Backup for the PMM EC2 instance, walk through the full recovery procedure after a failure, and reconfigure PMM client agents on x86_64 and aarch64 nodes after a PMM server IP change.

Cross-Account Database Monitoring with PMM and AWS Transit Gateway — 5-Part Series

  1. Part 1: Architecture and TGW Setup
  2. Part 2: IAM Roles and Users
  3. Part 3: Installing PMM Server and Registering Services
  4. Part 4: Alerting with PagerDuty and CloudWatch
  5. Part 5: Backup, Recovery, and Node Reconfiguration (You Are Here)

The PMM EC2 instance is a single point of observability for your entire database fleet. If it fails without a recovery plan, you lose monitoring and alerting across every account until you rebuild from scratch — which means re-registering all services, re-adding CloudWatch data sources, and reconfiguring every PMM client agent. A weekly AWS Backup policy eliminates most of that work and reduces recovery time to the duration of a snapshot restore plus a few reconfiguration steps.

This final part covers the AWS Backup configuration, the step-by-step recovery procedure, and how to reconfigure PMM client agents on both x86_64 and aarch64 nodes after a PMM server restore where the IP address changes.

Backup Architecture

The backup configuration manages four AWS resources:

| Resource | Name | Purpose |
|---|---|---|
| Backup Vault | aws-pmm01-backup-vault | Stores EBS snapshots taken by the backup plan |
| Backup Plan | aws-pmm01-weekly-backup | Defines schedule, windows, and retention policy |
| Backup Selection | aws-pmm01-backup-selection | Targets the specific EC2 instance to back up |
| IAM Role | AwsBackupServiceRole | Grants AWS Backup permission to create and manage snapshots |

Backup Vault

The backup vault aws-pmm01-backup-vault stores all recovery points. Tag it consistently with your environment tagging standard:

| Tag Key | Value |
|---|---|
| environment | prod |
| owner | platform |
| service | backup |
| component | ec2 |
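
The vault and its tags can also be created from the CLI. A minimal sketch, assuming default AWS credentials and region:

aws backup create-backup-vault \
  --backup-vault-name aws-pmm01-backup-vault \
  --backup-vault-tags environment=prod,owner=platform,service=backup,component=ec2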

Backup Plan

The plan aws-pmm01-weekly-backup runs on a weekly schedule with the following configuration:

| Setting | Value | Rationale |
|---|---|---|
| Schedule | Sundays at 5:00 AM UTC | Off-peak for most EU/US timezones; minimizes I/O impact on running containers |
| Start window | 60 minutes | Time AWS Backup has to begin the job before it's marked failed |
| Completion window | 120 minutes | Maximum allowed duration for a backup job |
| Retention period | 30 days | Keeps four weekly recovery points available at any time |
| Continuous backup | Disabled | Point-in-time recovery isn't required for the PMM EC2 |
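
Once the plan has fired at least once, it's worth confirming that jobs complete on schedule. A quick CLI check, assuming default AWS credentials and region:

aws backup list-backup-jobs \
  --by-backup-vault-name aws-pmm01-backup-vault \
  --query 'BackupJobs[].{Created:CreationDate,State:State}' \
  --output table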

IAM Configuration for AWS Backup

Service Role: AwsBackupServiceRole

AWS Backup requires a service role to create and manage EBS snapshots and perform EC2 restores. The role uses the following configuration:

Policies Attached

  1. AWS managed policy: AWSBackupServiceRolePolicyForBackup — grants AWS Backup the baseline permissions to create backup jobs and access backup storage

  2. Custom policy (backup-service-policy) covering:

    • EC2 operations: create/delete snapshots, create/delete volumes, create/modify tags, start/stop/reboot instances, volume operations, snapshot operations
    • Backup operations: full access to AWS Backup operations and backup storage

Terraform Reference

The backup vault, plan, IAM role, and selection described in this guide are fully managed with Terraform in production. The snippets below cover the key resources; adapt variable names and values to your environment.

Vault and Backup Plan

resource "aws_backup_vault" "pmm" {
  name = "pmm-backup-vault"

  tags = {
    environment = "prod"
    service     = "backup"
    component   = "ec2"
  }
}

resource "aws_backup_plan" "pmm_weekly" {
  name = "pmm-weekly-backup"

  rule {
    rule_name         = "weekly-sunday-5am-utc"
    target_vault_name = aws_backup_vault.pmm.name
    schedule          = "cron(0 5 ? * SUN *)"
    start_window      = 60   # minutes before job is marked failed
    completion_window = 120  # maximum allowed job duration in minutes

    lifecycle {
      delete_after = 30 # retain 4 weekly recovery points
    }
  }
}

IAM Role for AWS Backup

resource "aws_iam_role" "backup_service" {
  name = "AwsBackupServiceRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "backup.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "backup_service_managed" {
  role       = aws_iam_role.backup_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

resource "aws_iam_role_policy" "backup_service_custom" {
  name = "backup-service-policy"
  role = aws_iam_role.backup_service.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ec2:CreateSnapshot", "ec2:DeleteSnapshot",
        "ec2:CreateVolume", "ec2:DeleteVolume",
        "ec2:CreateTags", "ec2:ModifySnapshotAttribute",
        "ec2:DescribeVolumes", "ec2:DescribeSnapshots",
        "ec2:StartInstances", "ec2:StopInstances", "ec2:RebootInstances"
      ]
      Resource = "*"
    }]
  })
}

Backup Selection

resource "aws_backup_selection" "pmm" {
  name         = "pmm-backup-selection"
  plan_id      = aws_backup_plan.pmm_weekly.id
  iam_role_arn = aws_iam_role.backup_service.arn

  resources = [var.pmm_instance_arn]
}

If you'd like the complete module including the full variable definitions, get in touch and we're happy to share it.

Recovery Procedure

The following steps restore the PMM EC2 instance from a backup vault recovery point. Follow them in order — skipping steps, particularly the subnet selection and IAM role attachment, will leave the restored instance unable to communicate via TGW or access CloudWatch.

Step 1: Disable the PMM Service in PagerDuty

Before restoring, disable the PMM service in PagerDuty so that stale alerts firing from the old instance state don't page your on-call team during the recovery window.

Step 2: Open the AWS Backup Console

Navigate to AWS Backup → Backup vaults → aws-pmm01-backup-vault. Select the most recent recovery point and click Restore.
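
If console access is unavailable, the recovery points can be listed from the CLI instead. A sketch, assuming default AWS credentials and region:

aws backup list-recovery-points-by-backup-vault \
  --backup-vault-name aws-pmm01-backup-vault \
  --query 'RecoveryPoints[].{Created:CreationDate,Status:Status,Arn:RecoveryPointArn}' \
  --output table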

Step 3: Configure Restore Options

Subnet selection is critical: Select the same subnet as the original PMM EC2 instance. The Transit Gateway attachments and route tables created in Part 1 are scoped to specific subnets. Restoring to a different subnet breaks TGW connectivity and requires updating the TGW attachment and all route tables.
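
If the original instance is still queryable in EC2 (terminated instances remain visible for a short while), you can confirm its subnet before restoring. The instance ID below is a placeholder:

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SubnetId' \
  --output text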

For the remaining restore settings, such as instance type and security groups, match the original instance. The IAM instance profile is attached separately in Step 5.

Step 4: Monitor Restoration Progress

Monitor the job in AWS Backup → Jobs → Restore jobs until the job status shows Completed.
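
The same job can be watched from the CLI. A sketch, assuming default AWS credentials and region:

aws backup list-restore-jobs \
  --query 'RestoreJobs[].{Status:Status,Done:PercentDone,Resource:CreatedResourceArn}' \
  --output table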

Step 5: Attach the IAM Role to the Restored Instance

Once the instance is running:

  1. Go to EC2 → Instances → Actions → Security → Modify IAM role
  2. Select the IAM role: pmm-ec2-instance-role
  3. Apply the change
  4. Reboot the instance to ensure the instance metadata reflects the updated role
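
The same change can be made from the CLI. A sketch with a placeholder instance ID, assuming the instance profile shares the role's name:

# Attach the instance profile to the restored instance
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=pmm-ec2-instance-role

# Reboot so the instance metadata reflects the updated role
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0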

Step 6: Remove Stale PagerDuty Configuration from PMM UI

Log into the PMM UI on the restored instance and remove the existing PagerDuty contact point and notification policy. The restored instance retains the previous configuration, but PagerDuty state should be re-validated before re-enabling alerting.

Step 7: Update the PMM Instance ID in Terraform

The restored instance has a new EC2 instance ID. If your Terraform configuration references the PMM instance ID directly — for example, in the TGW attachment module or in any resource that targets the instance by ID — update it to the new value, run a plan, and merge the change. This is required to keep your infrastructure-as-code in sync and to ensure future Terraform applies don't attempt to recreate resources against a stale instance ID.
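
A minimal sketch of that workflow, with a placeholder instance ID and the assumption that the ID is referenced somewhere in your Terraform tree:

# Find every reference to the stale instance ID (placeholder shown)
grep -rn "i-0123456789abcdef0" .

# After updating it to the new ID, confirm the plan is clean before merging
terraform plan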

Step 8: Reconfigure PMM Client Agents on EC2 Nodes

If the restored instance received a new private IP address, every PMM client agent in the fleet needs to be reconfigured. See the full procedure in the next section.

Step 9: Re-enable the PMM Service in PagerDuty

After verifying that the PMM inventory shows all services as up (check Inventory → Services → Status), re-enable the PagerDuty service and restore the contact point and notification policy in the PMM UI.
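
If you'd rather script the verification, PMM v2 also exposes the inventory over its API. A minimal sketch, assuming the admin credentials used throughout this series; adjust if your build differs:

curl -k -u admin:<your_pmm_password> \
  -X POST https://<new-pmm-server-ip>/v1/inventory/Services/List \
  -H 'Content-Type: application/json' \
  -d '{}'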

Reconfiguring PMM Client Agents After a PMM IP Change

When the PMM server IP changes after a restore, each monitored EC2 node needs its PMM client agent reconfigured. The procedure differs between x86_64 and aarch64 architectures.

The first step is the same for all nodes: remove the stale node entry from the PMM inventory to prevent duplicate node registration errors.

Step 1: Delete the node from PMM Inventory

Go to PMM → Inventory → Nodes, locate the node, click Delete, enable Force mode, and confirm. Force mode removes the node and all associated services in one operation.
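
If you prefer the command line, pmm-admin can remove inventory entries as well. A sketch, run from any host whose pmm-admin is registered against the PMM server:

# Find the node ID of the stale entry
pmm-admin inventory list nodes

# Remove the node and its services (the equivalent of Force mode in the UI)
pmm-admin inventory remove node <node-id> --force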

Reconfigure x86_64 Nodes (e.g., masterdb02)

# Re-register the node against the new PMM server (placeholders in angle brackets)
pmm-admin config \
  --server-insecure-tls \
  --server-url=https://admin:<your_pmm_password>@<new-pmm-server-ip>:443 \
  <node-ip> generic <node-name> \
  --force

# Re-add the MySQL service once the node is registered
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB
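
After re-registering, it's worth confirming that the agent is pointed at the new server before moving on:

pmm-admin status   # the PMM Server line should show the new IP
pmm-admin list     # the MySQL service should reappear shortly after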

Reconfigure aarch64 Nodes (e.g., ec2-go-db-0)

# Re-run agent setup with the new PMM server address (placeholders in angle brackets)
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name> \
  --force

# Start the agent and confirm it connects to the new server
systemctl start pmm-agent@percona
systemctl status pmm-agent@percona
pmm-admin list

# Re-add the MySQL service
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB

Reconfigure aarch64 ProxySQL Nodes

# Re-run agent setup with the new PMM server address (placeholders in angle brackets)
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name>
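
Before adding the service, make sure the agent is running; assuming the same pmm-agent@percona unit as the other aarch64 nodes:

systemctl start pmm-agent@percona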

# Register the local ProxySQL admin interface (6032 is ProxySQL's default admin port)
pmm-admin add proxysql \
  --username=admin \
  --password=<your_proxysql_admin_password> \
  --service-name=<node-name> \
  --host=127.0.0.1 \
  --port=6032

Order matters: Always delete the node from the PMM inventory before running pmm-admin config or pmm-agent setup. Attempting to re-register a node that still exists in the inventory produces a duplicate node error even with --force.

Recovery Checklist

  1. PagerDuty service disabled before the restore
  2. Latest recovery point restored from aws-pmm01-backup-vault into the original subnet
  3. Restore job shows Completed under AWS Backup → Jobs → Restore jobs
  4. pmm-ec2-instance-role attached to the restored instance and the instance rebooted
  5. Stale PagerDuty contact point and notification policy removed from the PMM UI
  6. New instance ID updated in Terraform and a clean plan merged
  7. PMM client agents reconfigured on every x86_64 and aarch64 node
  8. All services up in Inventory → Services → Status, then PagerDuty re-enabled

Mario — ReliaDB

ReliaDB is a specialist DBA team for PostgreSQL and MySQL performance, high availability, and cloud database optimization. More about ReliaDB →