Cross-Account Database Monitoring with PMM and AWS Transit Gateway — Part 5: Backup, Recovery, and Node Reconfiguration

Configure weekly AWS Backup for the PMM EC2 instance, walk through the full recovery procedure after a failure, and reconfigure PMM client agents on x86_64 and aarch64 nodes after a PMM server IP change.

Cross-Account Database Monitoring with PMM and AWS Transit Gateway — 5-Part Series

  1. Part 1: Architecture and TGW Setup
  2. Part 2: IAM Roles and Users
  3. Part 3: Installing PMM Server and Registering Services
  4. Part 4: Alerting with PagerDuty and CloudWatch
  5. Part 5: Backup, Recovery, and Node Reconfiguration (You Are Here)

The PMM EC2 instance is a single point of observability for your entire database fleet. If it fails without a recovery plan, you lose monitoring and alerting across every account until you rebuild from scratch — which means re-registering all services, re-adding CloudWatch data sources, and reconfiguring every PMM client agent. A weekly AWS Backup policy eliminates most of that work and reduces recovery time to the duration of a snapshot restore plus a few reconfiguration steps.

This final part covers the AWS Backup configuration, the step-by-step recovery procedure, and how to reconfigure PMM client agents on both x86_64 and aarch64 nodes after a PMM server restore where the IP address changes.

Backup Architecture

The backup configuration manages four AWS resources:

| Resource | Name | Purpose |
|---|---|---|
| Backup Vault | aws-pmm01-backup-vault | Stores EBS snapshots taken by the backup plan |
| Backup Plan | aws-pmm01-weekly-backup | Defines schedule, windows, and retention policy |
| Backup Selection | aws-pmm01-backup-selection | Targets the specific EC2 instance to back up |
| IAM Role | AwsBackupServiceRole | Grants AWS Backup permission to create and manage snapshots |

Backup Vault

The backup vault aws-pmm01-backup-vault stores all recovery points. Tag it consistently with your environment tagging standard:

| Tag Key | Value |
|---|---|
| environment | prod |
| owner | platform |
| service | backup |
| component | ec2 |
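
The vault and its tags can also be created from the CLI. A minimal sketch, assuming default AWS credentials and region:

aws backup create-backup-vault \
  --backup-vault-name aws-pmm01-backup-vault \
  --backup-vault-tags environment=prod,owner=platform,service=backup,component=ec2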

Backup Plan

The plan aws-pmm01-weekly-backup runs on a weekly schedule with the following configuration:

| Setting | Value | Rationale |
|---|---|---|
| Schedule | Sundays at 5:00 AM UTC | Off-peak for most EU/US timezones; minimizes I/O impact on running containers |
| Start window | 60 minutes | Time AWS Backup has to begin the job before it's marked failed |
| Completion window | 120 minutes | Maximum allowed duration for a backup job |
| Retention period | 30 days | Keeps four weekly recovery points available at any time |
| Continuous backup | Disabled | Point-in-time recovery isn't required for the PMM EC2 |
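
Once the plan has fired at least once, it's worth confirming that jobs complete on schedule. A quick CLI check, assuming default AWS credentials and region:

aws backup list-backup-jobs \
  --by-backup-vault-name aws-pmm01-backup-vault \
  --query 'BackupJobs[].{Created:CreationDate,State:State}' \
  --output table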

IAM Configuration for AWS Backup

Service Role: AwsBackupServiceRole

AWS Backup requires a service role to create and manage EBS snapshots and perform EC2 restores. The role uses the following configuration:

Policies Attached

  1. AWS managed policy: AWSBackupServiceRolePolicyForBackup — grants AWS Backup the baseline permissions to create backup jobs and access backup storage

  2. Custom policy (backup-service-policy) covering:

    • EC2 operations: create/delete snapshots, create/delete volumes, create/modify tags, start/stop/reboot instances, volume operations, snapshot operations
    • Backup operations: full access to AWS Backup operations and backup storage

Terraform Reference

The backup vault, plan, IAM role, and selection described in this guide are fully managed with Terraform in production. The snippets below cover the key resources; adapt variable names and values to your environment.

Vault and Backup Plan

resource "aws_backup_vault" "pmm" {
  name = "pmm-backup-vault"

  tags = {
    environment = "prod"
    service     = "backup"
    component   = "ec2"
  }
}

resource "aws_backup_plan" "pmm_weekly" {
  name = "pmm-weekly-backup"

  rule {
    rule_name         = "weekly-sunday-5am-utc"
    target_vault_name = aws_backup_vault.pmm.name
    schedule          = "cron(0 5 ? * SUN *)"
    start_window      = 60   # minutes before job is marked failed
    completion_window = 120  # maximum allowed job duration in minutes

    lifecycle {
      delete_after = 30 # retain 4 weekly recovery points
    }
  }
}

IAM Role for AWS Backup

resource "aws_iam_role" "backup_service" {
  name = "AwsBackupServiceRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "backup.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "backup_service_managed" {
  role       = aws_iam_role.backup_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

resource "aws_iam_role_policy" "backup_service_custom" {
  name = "backup-service-policy"
  role = aws_iam_role.backup_service.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ec2:CreateSnapshot", "ec2:DeleteSnapshot",
        "ec2:CreateVolume", "ec2:DeleteVolume",
        "ec2:CreateTags", "ec2:ModifySnapshotAttribute",
        "ec2:DescribeVolumes", "ec2:DescribeSnapshots",
        "ec2:StartInstances", "ec2:StopInstances", "ec2:RebootInstances"
      ]
      Resource = "*"
    }]
  })
}

Backup Selection

resource "aws_backup_selection" "pmm" {
  name         = "pmm-backup-selection"
  plan_id      = aws_backup_plan.pmm_weekly.id
  iam_role_arn = aws_iam_role.backup_service.arn

  resources = [var.pmm_instance_arn]
}

If you'd like the complete module including the full variable definitions, get in touch and we're happy to share it.

Recovery Procedure

The following steps restore the PMM EC2 instance from a backup vault recovery point. Follow them in order — skipping steps, particularly the subnet selection and IAM role attachment, will leave the restored instance unable to communicate via TGW or access CloudWatch.

Step 1: Disable the PMM Service in PagerDuty

Before restoring, disable the PMM service in PagerDuty so that stale alerts firing from the old instance state don't page your on-call team during the recovery window.

Step 2: Open the AWS Backup Console

Navigate to AWS Backup → Backup vaults → aws-pmm01-backup-vault. Select the most recent recovery point and click Restore.
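
If console access is unavailable, the recovery points can be listed from the CLI instead. A sketch, assuming default AWS credentials and region:

aws backup list-recovery-points-by-backup-vault \
  --backup-vault-name aws-pmm01-backup-vault \
  --query 'RecoveryPoints[].{Created:CreationDate,Status:Status,Arn:RecoveryPointArn}' \
  --output table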

Step 3: Configure Restore Options

Subnet selection is critical: Select the same subnet as the original PMM EC2 instance. The Transit Gateway attachments and route tables created in Part 1 are scoped to specific subnets. Restoring to a different subnet breaks TGW connectivity and requires updating the TGW attachment and all route tables.
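
If the original instance is still queryable in EC2 (terminated instances remain visible for a short while), you can confirm its subnet before restoring. The instance ID below is a placeholder:

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SubnetId' \
  --output text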

For the remaining restore settings, such as instance type and security groups, match the original instance. The IAM instance profile is attached separately in Step 5.

Step 4: Monitor Restoration Progress

Monitor the job in AWS Backup → Jobs → Restore jobs until the job status shows Completed.
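
The same job can be watched from the CLI. A sketch, assuming default AWS credentials and region:

aws backup list-restore-jobs \
  --query 'RestoreJobs[].{Status:Status,Done:PercentDone,Resource:CreatedResourceArn}' \
  --output table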

Step 5: Attach the IAM Role to the Restored Instance

Once the instance is running:

  1. Go to EC2 → Instances → Actions → Security → Modify IAM role
  2. Select the IAM role: pmm-ec2-instance-role
  3. Apply the change
  4. Reboot the instance to ensure the instance metadata reflects the updated role
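
The same change can be made from the CLI. A sketch with a placeholder instance ID, assuming the instance profile shares the role's name:

# Attach the instance profile to the restored instance
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=pmm-ec2-instance-role

# Reboot so the instance metadata reflects the updated role
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0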

Step 6: Remove Stale PagerDuty Configuration from PMM UI

Log into the PMM UI on the restored instance and remove the existing PagerDuty contact point and notification policy. The restored instance retains the previous configuration, but PagerDuty state should be re-validated before re-enabling alerting.

Step 7: Update the PMM Instance ID in Terraform

The restored instance has a new EC2 instance ID. If your Terraform configuration references the PMM instance ID directly — for example, in the TGW attachment module or in any resource that targets the instance by ID — update it to the new value, run a plan, and merge the change. This is required to keep your infrastructure-as-code in sync and to ensure future Terraform applies don't attempt to recreate resources against a stale instance ID.
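
A minimal sketch of that workflow, with a placeholder instance ID and the assumption that the ID is referenced somewhere in your Terraform tree:

# Find every reference to the stale instance ID (placeholder shown)
grep -rn "i-0123456789abcdef0" .

# After updating it to the new ID, confirm the plan is clean before merging
terraform plan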

Step 8: Reconfigure PMM Client Agents on EC2 Nodes

If the restored instance received a new private IP address, every PMM client agent in the fleet needs to be reconfigured. See the full procedure in the next section.

Step 9: Re-enable the PMM Service in PagerDuty

After verifying that the PMM inventory shows all services as up (check Inventory → Services → Status), re-enable the PagerDuty service and restore the contact point and notification policy in the PMM UI.
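
If you'd rather script the verification, PMM v2 also exposes the inventory over its API. A minimal sketch, assuming the admin credentials used throughout this series; adjust if your build differs:

curl -k -u admin:<your_pmm_password> \
  -X POST https://<new-pmm-server-ip>/v1/inventory/Services/List \
  -H 'Content-Type: application/json' \
  -d '{}'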

Reconfiguring PMM Client Agents After a PMM IP Change

When the PMM server IP changes after a restore, each monitored EC2 node needs its PMM client agent reconfigured. The procedure differs between x86_64 and aarch64 architectures.

The first step is the same for all nodes: remove the stale node entry from the PMM inventory to prevent duplicate node registration errors.

Step 1: Delete the node from PMM Inventory

Go to PMM → Inventory → Nodes, locate the node, click Delete, enable Force mode, and confirm. Force mode removes the node and all associated services in one operation.
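
If you prefer the command line, pmm-admin can remove inventory entries as well. A sketch, run from any host whose pmm-admin is registered against the PMM server:

# Find the node ID of the stale entry
pmm-admin inventory list nodes

# Remove the node and its services (the equivalent of Force mode in the UI)
pmm-admin inventory remove node <node-id> --force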

Reconfigure x86_64 Nodes (e.g., masterdb02)

# Re-register the node against the new PMM server (placeholders in angle brackets)
pmm-admin config \
  --server-insecure-tls \
  --server-url=https://admin:<your_pmm_password>@<new-pmm-server-ip>:443 \
  <node-ip> generic <node-name> \
  --force

# Re-add the MySQL service once the node is registered
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB
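
After re-registering, it's worth confirming that the agent is pointed at the new server before moving on:

pmm-admin status   # the PMM Server line should show the new IP
pmm-admin list     # the MySQL service should reappear shortly after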

Reconfigure aarch64 Nodes (e.g., ec2-go-db-0)

# Re-run agent setup with the new PMM server address (placeholders in angle brackets)
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name> \
  --force

# Start the agent and confirm it connects to the new server
systemctl start pmm-agent@percona
systemctl status pmm-agent@percona
pmm-admin list

# Re-add the MySQL service
pmm-admin add mysql \
  --username=pmm \
  --password=<your_password> \
  --size-slow-logs=1GiB

Reconfigure aarch64 ProxySQL Nodes

# Re-run agent setup with the new PMM server address (placeholders in angle brackets)
pmm-agent setup \
  --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  --server-address=<new-pmm-server-ip> \
  --server-insecure-tls \
  --server-username=admin \
  --server-password=<your_pmm_password> \
  <node-ip> generic <node-name>
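
Before adding the service, make sure the agent is running; assuming the same pmm-agent@percona unit as the other aarch64 nodes:

systemctl start pmm-agent@percona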

# Register the local ProxySQL admin interface (6032 is ProxySQL's default admin port)
pmm-admin add proxysql \
  --username=admin \
  --password=<your_proxysql_admin_password> \
  --service-name=<node-name> \
  --host=127.0.0.1 \
  --port=6032

Order matters: Always delete the node from the PMM inventory before running pmm-admin config or pmm-agent setup. Attempting to re-register a node that still exists in the inventory produces a duplicate node error even with --force.

Recovery Checklist

  1. PagerDuty service disabled before the restore
  2. Latest recovery point restored from aws-pmm01-backup-vault into the original subnet
  3. Restore job shows Completed under AWS Backup → Jobs → Restore jobs
  4. pmm-ec2-instance-role attached to the restored instance and the instance rebooted
  5. Stale PagerDuty contact point and notification policy removed from the PMM UI
  6. New instance ID updated in Terraform and a clean plan merged
  7. PMM client agents reconfigured on every x86_64 and aarch64 node
  8. All services up in Inventory → Services → Status, then PagerDuty re-enabled

Mario — ReliaDB

ReliaDB is a specialist DBA team for PostgreSQL and MySQL performance, high availability, and cloud database optimization. More about ReliaDB →