Automating Disaster Recovery with Persistent Volume Claim (PVC) Sync in Azure Kubernetes Service

Umang Patel

December 12, 2024

Introduction

Disaster Recovery (DR) is an essential strategy in modern Kubernetes-based architectures to ensure business continuity and safeguard data integrity. This implementation demonstrates how to synchronize 100 Persistent Volume Claims (PVCs), distributed across namespaces in an Azure Kubernetes Service (AKS) cluster, from the primary region (UK South) to the secondary region (UK West). Leveraging Azure DevOps (ADO) pipelines, Azure Storage, and automation scripts, we achieve a scalable, secure, and efficient DR solution.

This guide explores the architecture, implementation, challenges encountered, and the strategies used to overcome them using standard DevOps and cloud-native practices.

Challenges and Solutions

1. Scaling DR for Multiple Namespaces

Challenge: Managing 100 clients, each deployed in its own namespace, with unique PVCs and data requirements.

Solution:

Implemented a mapping file (pvc_mappings.yaml) to create a one-to-one relationship between source and target PVCs.
Ensured flexibility to add new namespaces and PVCs dynamically by updating the mappings.

2. Automating Regular Data Syncs

Challenge: Ensuring consistent replication of PVC data to the secondary region with minimal latency and overhead.

Solution:

Designed a cron-based Azure DevOps pipeline that triggers the sync process every 15 minutes, which bounds the Recovery Point Objective (RPO) at roughly one sync interval plus the duration of a sync run.
Used Azure Blob Storage with read-access geo-zone-redundant storage (RA-GZRS) for backend durability and performance.

3. Ensuring Secure Data Transfers

Challenge: Protecting sensitive data during replication across regions.

Solution:

Retrieved Shared Access Signature (SAS) tokens securely from Azure Key Vault within the pipeline.
Scoped SAS tokens with least privilege, ensuring secure access for specific operations.
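
As a rough illustration of the retrieval step, the sketch below reads a secret with the Azure CLI command `az keyvault secret show` and tidies the result before use. The vault name `dr-keyvault`, the secret name `pvc-sync-sas`, and the helper names `fetch_sas_token` and `normalize_sas` are assumptions made for this example; the article does not state the real ones.

```shell
#!/bin/sh
set -eu

# Hypothetical vault and secret names; the article does not state them.
VAULT_NAME="${VAULT_NAME:-dr-keyvault}"
SECRET_NAME="${SECRET_NAME:-pvc-sync-sas}"

fetch_sas_token() {
  # Ask Key Vault for the secret value only, as plain text.
  az keyvault secret show \
    --vault-name "$VAULT_NAME" \
    --name "$SECRET_NAME" \
    --query value \
    --output tsv
}

normalize_sas() {
  # Strip one leading "?" (some tools include it) and reject an empty token.
  token="${1#\?}"
  if [ -z "$token" ]; then
    echo "SAS token is empty" >&2
    return 1
  fi
  printf '%s\n' "$token"
}

# Typical use inside a pipeline step:
#   SAS_TOKEN=$(normalize_sas "$(fetch_sas_token)")
```

Keeping the token in a variable rather than echoing it helps avoid leaking it into pipeline logs.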

4. Reliable and Scalable Sync Mechanism

Challenge: Avoiding data discrepancies, incomplete transfers, or failures during sync.

Solution:

Utilized azcopy, a high-performance data transfer tool, to synchronize blob data efficiently.
Incorporated robust error handling and retries in the sync script.
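
The article does not show its retry logic, so the following is only a sketch of one common approach: a generic wrapper that re-runs a command with exponential backoff. The function name `retry` and its argument order are assumptions for this example.

```shell
#!/bin/sh
set -eu

retry() {
  # Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $1" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # exponential backoff
  done
}

# Example (URLs are placeholders supplied by the caller):
#   retry 3 5 azcopy sync "$SRC_URL" "$DST_URL" --recursive
```

Because `azcopy sync` only transfers files that differ between source and destination, re-running it after a partial failure resumes the work rather than starting over.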

Architecture and Workflow

1. High-Level Architecture

The DR setup involves:

Primary Region (UK South): Hosts the production AKS cluster with PVCs backed by Azure Storage.
Secondary Region (UK West): Hosts the DR AKS cluster with PVCs mapped to the secondary storage account.
Azure DevOps Pipeline: Automates data sync between primary and secondary PVCs.

2. Workflow

Pipeline Triggers: The pipeline runs on:
- Changes pushed to the main branch.
- A cron schedule (every 15 minutes).
SAS Token Retrieval: The pipeline fetches an SAS token from Azure Key Vault.
PVC Mapping: Reads the PVC mapping file (pvc_mappings.yaml) for source and target PVCs.
Data Sync: Executes azcopy sync commands to replicate data between storage accounts.
Monitoring: Logs sync status and retries failed operations if necessary.

Implementation Details

1. Pipeline Configuration

The Azure DevOps pipeline (azure-pipelines.yml) automates the sync process with the following key sections:

Triggers and Scheduling: Defines a branch-based trigger on main and a 15-minute sync interval using a cron schedule.
Secure SAS Token Retrieval: Uses the Azure Key Vault task to fetch the SAS token securely.
Sync Execution: Executes the sync-pvcs.sh script, passing the SAS token as an argument.
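
To make these three sections concrete, the helper below scaffolds a pipeline file that combines them. It is a sketch, not the original azure-pipelines.yml: the service connection `dr-service-connection`, the vault `dr-keyvault`, the secret `pvc-sync-sas`, and the helper name `write_pipeline` are all hypothetical.

```shell
#!/bin/sh
set -eu

write_pipeline() {
  # $1 = output path. The quoted 'EOF' keeps $(...) literal so that
  # Azure DevOps, not the shell, expands the secret variable.
  cat > "$1" <<'EOF'
trigger:
  branches:
    include:
      - main

schedules:
  - cron: "*/15 * * * *"          # every 15 minutes, evaluated in UTC
    displayName: PVC sync
    branches:
      include:
        - main
    always: true                  # run even when the repo has not changed

pool:
  vmImage: ubuntu-latest

steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: dr-service-connection   # hypothetical
      KeyVaultName: dr-keyvault                  # hypothetical
      SecretsFilter: pvc-sync-sas                # hypothetical secret name
  - script: ./sync-pvcs.sh "$(pvc-sync-sas)"
    displayName: Sync PVC data
EOF
}

# Example: write_pipeline ./azure-pipelines.yml
```

Setting `always: true` matters for a DR sync: without it, scheduled runs are skipped when the repository has not changed since the last run, even though the PVC data has.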

2. PVC Mapping File

The pvc_mappings.yaml file serves as a configuration layer that maps each source PVC (primary region) to its target PVC (secondary region).

Example YAML Structure:

Scalability: New mappings can be added without modifying the script logic.
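
One possible shape for such a file is sketched below, together with a small parser. The field names `namespace`, `source_pvc`, and `target_pvc`, the sample values, and both helper functions are assumptions for illustration. The awk parser only handles this flat layout with fields in this order; a real script would more likely use a YAML-aware tool.

```shell
#!/bin/sh
set -eu

write_sample_mappings() {
  # $1 = output path for a hypothetical pvc_mappings.yaml
  cat > "$1" <<'EOF'
mappings:
  - namespace: client-a
    source_pvc: pvc-src-client-a
    target_pvc: pvc-dst-client-a
  - namespace: client-b
    source_pvc: pvc-src-client-b
    target_pvc: pvc-dst-client-b
EOF
}

list_mappings() {
  # Print one "namespace source target" line per mapping entry.
  awk '
    $1 == "-" && $2 == "namespace:" { ns = $3 }
    $1 == "source_pvc:"             { src = $2 }
    $1 == "target_pvc:"             { print ns, src, $2 }
  ' "$1"
}

# Example:
#   write_sample_mappings ./pvc_mappings.yaml
#   list_mappings ./pvc_mappings.yaml
```

Adding a client then means appending one more three-line entry to the file; the parsing and sync logic stay unchanged.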

3. Sync Script

The sync-pvcs.sh script orchestrates the data sync process:

Script Logic:
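
A minimal sketch of what this logic might look like, under several assumptions: each PVC is backed by a blob container, the script receives "source target" container pairs on standard input, and the storage account names `primaryuksouth` and `secondaryukwest` are placeholders. The article passes a single SAS token; this sketch takes one per account because a SAS is signed for a specific storage account.

```shell
#!/bin/sh
set -eu

# Hypothetical account names; the article does not give them.
SRC_ACCOUNT="${SRC_ACCOUNT:-primaryuksouth}"
DST_ACCOUNT="${DST_ACCOUNT:-secondaryukwest}"
DRY_RUN="${DRY_RUN:-1}"   # defaults to printing instead of copying

blob_url() {
  # $1 = account, $2 = container, $3 = SAS token without a leading "?"
  printf 'https://%s.blob.core.windows.net/%s?%s\n' "$1" "$2" "$3"
}

sync_pair() {
  # $1 = source container, $2 = target container, $3 = source SAS, $4 = target SAS
  src_url=$(blob_url "$SRC_ACCOUNT" "$1" "$3")
  dst_url=$(blob_url "$DST_ACCOUNT" "$2" "$4")
  if [ "$DRY_RUN" = "1" ]; then
    # Log container names only, so tokens never reach the pipeline log.
    echo "DRY RUN: azcopy sync $1 -> $2"
  else
    # </dev/null stops azcopy from consuming the list of pairs on stdin.
    azcopy sync "$src_url" "$dst_url" --recursive </dev/null
  fi
}

sync_all() {
  # $1 = source SAS, $2 = target SAS; reads "source target" pairs from stdin.
  failures=0
  while read -r src dst; do
    [ -n "$src" ] || continue
    if ! sync_pair "$src" "$dst" "$1" "$2"; then
      echo "sync failed: $src -> $dst" >&2
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Example: printf 'pvc-src-client-a pvc-dst-client-a\n' | sync_all "$SRC_SAS" "$DST_SAS"
```

A failed pair is counted rather than aborting the loop, so one broken client does not block replication for the others; the non-zero exit status at the end still marks the pipeline run as failed.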

Validation and Monitoring

1. Testing the DR Setup

Deploy test PVCs with sample data in the primary cluster.
Manually trigger the pipeline and validate data replication in the secondary region.

2. Monitoring Tools

Pipeline Logs: Track sync job status and resolve failures.
Azure Monitor: Set up alerts for sync anomalies or storage health issues.

3. Data Consistency Checks

Use checksum or hash comparisons to validate data integrity between source and target PVCs.
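
One way to run such a comparison is sketched below. It assumes the source and target data are both reachable as local directories (for example, PVCs mounted into a validation pod at hypothetical paths) and that `sha256sum` is available; the function names are illustrative.

```shell
#!/bin/sh
set -eu

manifest() {
  # Print "<sha256>  ./relative/path" for every file under $1, sorted by path.
  (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

trees_match() {
  # Succeeds only when both directories exist and hold the same files
  # with the same content hashes.
  [ -d "$1" ] && [ -d "$2" ] || return 1
  [ "$(manifest "$1")" = "$(manifest "$2")" ]
}

# Example with hypothetical mount points:
#   trees_match /mnt/primary/client-a /mnt/secondary/client-a \
#     || echo "client-a: source and target differ" >&2
```

Comparing manifests of relative paths catches missing or extra files as well as changed content.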

Technical Highlights

Infrastructure as Code (IaC)

All configurations (pipeline, PVC mappings) are defined declaratively.

Cloud-Native Tools

Azure Storage RA-GZRS: Robust geo-redundancy.
AzCopy: High-performance blob data transfers.

Security Best Practices

SAS token scoped with minimal permissions.
Tokens fetched securely using Azure Key Vault.

Conclusion

This solution demonstrates a scalable and robust DR mechanism for Kubernetes workloads using AKS and Azure DevOps. With automated PVC syncing every 15 minutes, secure data handling, and regular monitoring, it limits both downtime and data loss in the event of a regional outage. This architecture can be further extended for larger-scale Kubernetes deployments or integrated with other cloud services for enhanced DR capabilities.

Contact Us

Thank you for reading our guide on "Automating Disaster Recovery with Persistent Volume Claim (PVC) Sync in Azure Kubernetes Service". We hope you found it insightful and valuable. If you have any questions, need further assistance, or are looking for expert support in setting up and managing your Azure infrastructure, our team is here to help!

Reach out to us for Your Azure Infrastructure Needs:

🌐 Website: https://www.prometheanz.com
