
Business Continuity and Disaster Recovery (BC/DR)

Purpose

  • To prepare the company for service outages caused by factors beyond its control (e.g., natural disasters, man-made events).
  • To restore services as fully as possible in the shortest possible time.

Scope

This policy covers all business-critical company information systems. It applies to all employees of the company and to all relevant external parties, including but not limited to company contractors.

Policy

In the event of a major disruption to production services, or a disaster affecting the availability and/or security of the company office or hosting facilities for more than 24 hours, senior managers and executive staff shall determine mitigation actions based on this plan.

A disaster recovery test, including a test of backup restoration processes, shall be performed on an annual basis. In the case of an information security event or incident, refer to the Incident Response Plan.

Cloud Infrastructure and Dependencies

Primary Dependencies

  • AWS: Database hosting (RDS), application infrastructure, file storage (S3)
  • Vercel: Frontend application hosting and CDN
  • Google Workspace: Identity management, email, productivity tools
  • GitHub: Source code repositories, CI/CD pipelines
  • Domain/DNS: Primary domain registration and DNS management

High Availability Design

  • Multi-AZ database deployment in AWS for automatic failover
  • CDN distribution for static assets via Vercel
  • Automated backups to multiple AWS regions
  • Load balancing across availability zones
  • Infrastructure as Code for rapid redeployment

Single Points of Failure

  • Domain registrar/DNS provider
  • Primary AWS account access
  • GitHub organization access
  • Google Workspace administrative access
  • Key personnel availability (CTO/CEO)

Communications and Escalation

Primary Channels (in order of preference):

  1. Slack #emergency channel (if available)
  2. Personal phone numbers (maintained in 1Password)
  3. Personal email addresses (backup contact list)
  4. SMS/text messaging
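
The ordered fallback above can be read as a simple try-in-order loop. The following is a minimal sketch only: `notify_with_fallback` and the sender callables are hypothetical names, and real senders would wrap whatever Slack, phone, email, or SMS tooling the company actually uses.

```python
from typing import Callable, List, Optional, Tuple

# A sender takes a message and returns True if delivery succeeded.
Sender = Callable[[str], bool]


def notify_with_fallback(
    message: str, channels: List[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each channel in order of preference; return the first that succeeds."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat an unreachable channel as a failed attempt and fall through.
            continue
    return None


def slack_down(_msg: str) -> bool:
    # Stand-in for a Slack outage (hypothetical sender).
    raise ConnectionError("Slack unavailable")


channels = [
    ("slack-emergency", slack_down),
    ("personal-phone", lambda msg: True),
    ("personal-email", lambda msg: True),
    ("sms", lambda msg: True),
]
used = notify_with_fallback("Level 2 incident declared", channels)
```

In this example the Slack sender fails, so the function falls through to the second channel. A return value of `None` would mean every channel failed and a backup communicator needs to step in manually.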

Communication Tree:

  • CTO → CEO → Engineering Team → Contractors
  • Parallel notification via automated monitoring alerts
  • Customer communication via status page and email

Emergency Contact Information:

  • Maintain current personal contact info in shared secure location
  • Test communication channels quarterly
  • Designate backup communicators for each channel

Incident Classification and Response

Disaster Declaration Criteria

Level 1 - Minor Incident (No BC/DR activation):

  • Single service degradation < 2 hours
  • Partial feature unavailability with workarounds available
  • No customer data impact
  • Localized performance issues

Level 2 - Major Incident (Partial BC/DR activation):

  • Multiple-service outage lasting 2-8 hours
  • Customer-facing functionality significantly impacted
  • Potential loss of < 1 hour of data
  • Regional vendor outages affecting operations

Level 3 - Disaster (Full BC/DR activation):

  • Complete service outage > 8 hours
  • Data center/region failure
  • Potential loss of > 1 hour of data
  • Security breach affecting customer data
  • Key personnel unavailability during critical operations
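
The quantitative parts of these criteria can be expressed as a small classifier, for example to pre-fill a severity in an alerting tool. This is a sketch under stated assumptions, not a replacement for human judgment: the `Incident` fields and the `classify` function are hypothetical, qualitative criteria (such as "significantly impacted") are left out, and anything that exceeds the Level 1 bounds is escalated to at least Level 2 because the policy does not define those boundary cases.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    outage_hours: float              # observed or projected outage duration
    services_affected: int           # number of services degraded or down
    data_loss_hours: float = 0.0     # projected window of lost data
    complete_outage: bool = False    # every customer-facing service is down
    region_failure: bool = False     # data center or region failure
    customer_data_breach: bool = False
    key_personnel_unavailable: bool = False  # during critical operations


def classify(incident: Incident) -> int:
    """Return 1, 2, or 3 following the declaration criteria above."""
    # Level 3: any single disaster criterion is enough.
    if (
        (incident.complete_outage and incident.outage_hours > 8)
        or incident.region_failure
        or incident.data_loss_hours > 1
        or incident.customer_data_breach
        or incident.key_personnel_unavailable
    ):
        return 3
    # Level 2 (assumption): anything beyond the Level 1 bounds.
    if (
        incident.outage_hours >= 2
        or incident.services_affected > 1
        or incident.data_loss_hours > 0
    ):
        return 2
    return 1
```

For example, `classify(Incident(outage_hours=3, services_affected=2))` returns `2`, while any incident with `customer_data_breach=True` returns `3` regardless of duration.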

Activation Authority

  • Level 1: Any engineer can declare and manage response
  • Level 2: CTO or CEO must approve activation
  • Level 3: CEO must approve (CTO can declare if CEO unavailable)
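
The approval rules above are simple enough to encode as a check, for instance in an incident-management bot. The sketch below is illustrative: `can_activate` and the role strings are hypothetical, and it assumes the CTO and CEO may also declare a Level 1 incident, which the policy implies but does not state.

```python
def can_activate(level: int, role: str, ceo_available: bool = True) -> bool:
    """Return True if `role` may approve activation at the given level."""
    role = role.lower()
    if level == 1:
        # Any engineer can declare; executives are assumed to qualify as well.
        return role in {"engineer", "cto", "ceo"}
    if level == 2:
        return role in {"cto", "ceo"}
    if level == 3:
        # The CTO may declare only when the CEO is unavailable.
        return role == "ceo" or (role == "cto" and not ceo_available)
    raise ValueError(f"unknown incident level: {level}")
```

For example, `can_activate(3, "cto")` is `False`, but `can_activate(3, "cto", ceo_available=False)` is `True`.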

Roles and Responsibilities

Chief Technology Officer (CTO):

  • Lead all BC/DR efforts and technical recovery operations
  • Coordinate with hosting providers and technical vendors
  • Make technical decisions about system recovery priorities
  • Communicate technical status to CEO and team

Chief Executive Officer (CEO):

  • Overall incident command and business decision authority
  • External communications with customers, investors, and partners
  • Resource allocation and vendor escalation decisions
  • Media and regulatory communication if required

Engineering Team:

  • Execute technical recovery procedures under CTO direction
  • Provide technical expertise for system restoration
  • Assist with damage assessment and recovery planning
  • Maintain communication with department heads during recovery

Contractors:

  • Support recovery efforts within their area of expertise
  • Follow direction from CTO and engineering team
  • Maintain availability during declared disaster periods
  • Assist with non-critical system recovery as needed

Continuity of Critical Services

Customer-Facing Services

  • Strategy: Rely on cloud provider SLAs and multi-AZ deployment
  • Backup Plan: Rapid deployment to alternative cloud regions
  • Communication: Automated status page updates and email notifications

Identity and Authentication

  • Strategy: Leverage Google Workspace distributed infrastructure
  • Backup Plan: Emergency admin access procedures documented
  • Recovery: Service account access for critical system recovery

Business Operations

  • Finance/Legal/HR: All vendor-hosted SaaS applications with built-in redundancy
  • Sales/Marketing: CRM and marketing tools hosted by vendors
  • Development: Distributed team with remote access capabilities

Plan Activation

Automatic Activation Triggers

  • Complete loss of primary AWS region for > 2 hours
  • Database failure with inability to restore within RTO
  • Multiple critical vendor failures simultaneously
  • Security incident requiring immediate isolation of systems
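
A monitoring hook could evaluate these triggers mechanically. The following is a minimal sketch with hypothetical parameter names. It assumes "multiple" means two or more vendors, and it takes the 2-hour database RTO from Appendix B as its default.

```python
def should_auto_activate(
    primary_region_down_hours: float = 0.0,
    db_failed: bool = False,
    db_restore_estimate_hours: float = 0.0,
    db_rto_hours: float = 2.0,  # Database (Customer Data) RTO from Appendix B
    failed_critical_vendors: int = 0,
    isolation_required: bool = False,
) -> bool:
    """Return True if any automatic activation trigger is met."""
    return (
        primary_region_down_hours > 2
        or (db_failed and db_restore_estimate_hours > db_rto_hours)
        or failed_critical_vendors >= 2
        or isolation_required
    )
```

For example, `should_auto_activate(db_failed=True, db_restore_estimate_hours=3)` returns `True` because the estimated restore time exceeds the database RTO.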

Manual Activation Criteria

  • Natural disaster affecting primary infrastructure region
  • Cyber attack requiring system isolation and recovery
  • Key personnel unavailability during critical business periods
  • Vendor bankruptcy or sudden service termination

Appendix A – Business Continuity Procedures and Scenarios

Disaster Recovery Procedures

Disaster recovery procedures are broken up into stages.

1.) Notification and Activation Phase

In this phase, it is determined that this plan should be activated and the initial steps below are taken to notify internal and external stakeholders.

Immediate Response (0-30 minutes)

  1. Incident Detection: Automated monitoring or manual detection
  2. Initial Assessment: CTO evaluates severity and impact
  3. Plan Activation: Decision to activate BC/DR plan
  4. Team Notification: Emergency communication to all team members
  5. Customer Communication: Initial status page update if customer-facing

Notification Sequence

  1. Monitoring system alerts CTO via multiple channels
  2. CTO assesses situation and determines response level
  3. CTO notifies CEO with initial impact assessment
  4. CEO approves plan activation for Level 2/3 incidents (the CTO may approve Level 2, or Level 3 if the CEO is unavailable, per Activation Authority)
  5. Team notification via emergency communication channels
  6. Customer notification via status page and email if applicable

Damage Assessment (30-60 minutes)

  1. System inventory of affected and functioning systems
  2. Data integrity verification for critical databases
  3. Vendor status confirmation for all critical dependencies
  4. Recovery time estimation based on available resources
  5. Resource requirement assessment for recovery operations

2.) Recovery Phase

This phase of the plan outlines the steps to recover the company's systems to acceptable levels. The goal is to get company systems and applications back to a full, operational production state.

Immediate Recovery Actions (0-4 hours)

  1. Implement workarounds for critical functionality if possible
  2. Activate backup systems in alternative regions if needed
  3. Restore from backups if data loss has occurred
  4. Redirect traffic to functioning systems
  5. Scale resources to handle recovery load

System Recovery Sequence

  1. Database recovery - highest priority for data integrity
  2. Authentication services - enable user access
  3. Core API services - restore basic functionality
  4. Customer-facing applications - restore user interfaces
  5. Administrative tools - enable business operations
  6. Development environments - restore development capabilities
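
The priority order above can be enforced in a runbook script so that systems are never restored out of sequence. This is a sketch only: the step names and `run_recovery` are hypothetical, and stopping at the first failed step is an assumption (made because later tiers are presumed to depend on earlier ones), not something the policy requires.

```python
from typing import Callable, Dict, List

# Recovery tiers in the priority order listed above (names are illustrative).
RECOVERY_ORDER = [
    "database",
    "authentication",
    "core-api",
    "customer-facing-apps",
    "admin-tools",
    "dev-environments",
]


def run_recovery(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the supplied recovery steps in priority order.

    Returns the names of the steps that completed. Systems with no step
    supplied are treated as unaffected and skipped.
    """
    completed: List[str] = []
    for name in RECOVERY_ORDER:
        step = steps.get(name)
        if step is None:
            continue
        if not step():
            break  # assumption: do not proceed past a failed tier
        completed.append(name)
    return completed


steps = {
    "core-api": lambda: True,
    "database": lambda: True,
    "customer-facing-apps": lambda: False,  # simulated failure
    "admin-tools": lambda: True,
}
completed = run_recovery(steps)
```

Here `completed` is `["database", "core-api"]`: the database runs first even though it was listed second in `steps`, and nothing after the failed customer-facing tier is attempted.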

Recovery Validation

  1. Functionality testing of all restored systems
  2. Security control verification for all recovered systems
  3. Data integrity verification through automated and manual checks
  4. Performance testing to ensure acceptable response times
  5. User acceptance testing with limited internal users

3.) Re-Establishment Phase

In this phase, company systems are moved back to the original hosting provider. If this is deemed impossible given the nature of the disaster, the alternative site is to be converted to a primary site on an ongoing basis.

Service Stabilization

  1. Monitor system performance for stability indicators
  2. Gradually increase load to validate capacity
  3. Enable all features systematically with testing
  4. Resume normal operations when all systems validated
  5. Document lessons learned throughout the process

Original Infrastructure Recovery (if applicable)

  1. Assess original infrastructure for restoration feasibility
  2. Plan migration back if cost-effective and strategic
  3. Execute migration during low-traffic periods
  4. Validate functionality after migration completion
  5. Update DNS records to point to restored infrastructure

Plan Deactivation

  1. Confirm all systems operating at full capacity
  2. Complete incident documentation with timeline and actions
  3. Conduct post-incident review with all stakeholders
  4. Update BC/DR plan based on lessons learned
  5. Return to normal operations with enhanced monitoring

Business Continuity Scenarios

Scenario 1: AWS Region Failure

Impact: Complete loss of primary infrastructure

Response:

  1. Activate backup region within 2 hours
  2. Restore database from latest backup
  3. Update DNS to point to backup region
  4. Communicate estimated recovery time to customers
  5. Monitor new region performance and capacity

Scenario 2: Google Workspace Outage

Impact: Loss of email, authentication, and productivity tools

Response:

  1. Use emergency admin accounts for critical system access
  2. Implement alternative communication channels (personal phones/email)
  3. Use service accounts for automated system operations
  4. Communicate via status page and alternative channels
  5. Document manual workarounds for business operations

Scenario 3: GitHub Unavailability

Impact: Cannot deploy code changes or access repositories

Response:

  1. Use local repository copies for emergency fixes
  2. Deploy critical fixes directly to production if necessary
  3. Document all manual changes for later synchronization
  4. Use GitLab backup repositories if the outage is extended
  5. Restore full CI/CD pipeline when GitHub recovers

Scenario 4: Key Personnel Unavailability

Impact: Loss of critical technical or business knowledge

Response:

  1. Activate cross-training procedures and documentation
  2. Use emergency access procedures for critical systems
  3. Engage backup personnel (contractors, advisors)
  4. Prioritize essential operations only
  5. Implement temporary leadership structure

Scenario 5: Cyber Security Incident

Impact: Potential data breach requiring system isolation

Response:

  1. Immediately isolate affected systems
  2. Activate incident response plan in parallel
  3. Preserve evidence for forensic analysis
  4. Restore from clean backups to separate environment
  5. Implement additional security controls before reconnection

Appendix B – RTOs/RPOs

Critical System Recovery Objectives

| System | Business Impact | RTO | RPO | Owner |
|---|---|---|---|---|
| Customer Portal | Customer cannot apply | 4 hours | 1 hour | CTO |
| Company Portal | HMs cannot post jobs | 8 hours | 4 hours | CTO |
| Database (Customer Data) | Complete service loss | 2 hours | 15 min | CTO |
| API Services | All functionality down | 4 hours | 1 hour | CTO |
| Authentication (Google) | No user access | 2 hours | N/A | CTO |
| Public Website | Marketing impact only | 24 hours | 4 hours | CTO |
| Development Environment | Development stops | 24 hours | 8 hours | CTO |
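
During a disaster recovery test or post-incident review, the objectives above can be checked against measured results. The sketch below copies the RTO and RPO values from the table; the `OBJECTIVES` mapping and `met_objectives` function are hypothetical names, and an RPO of `None` stands for "N/A".

```python
from datetime import timedelta
from typing import Dict, Optional, Tuple

# system -> (RTO, RPO); values copied from the recovery objectives above.
OBJECTIVES: Dict[str, Tuple[timedelta, Optional[timedelta]]] = {
    "Customer Portal": (timedelta(hours=4), timedelta(hours=1)),
    "Company Portal": (timedelta(hours=8), timedelta(hours=4)),
    "Database (Customer Data)": (timedelta(hours=2), timedelta(minutes=15)),
    "API Services": (timedelta(hours=4), timedelta(hours=1)),
    "Authentication (Google)": (timedelta(hours=2), None),
    "Public Website": (timedelta(hours=24), timedelta(hours=4)),
    "Development Environment": (timedelta(hours=24), timedelta(hours=8)),
}


def met_objectives(
    system: str, downtime: timedelta, data_loss: timedelta
) -> Tuple[bool, bool]:
    """Return (rto_met, rpo_met) for a measured recovery."""
    rto, rpo = OBJECTIVES[system]
    rto_met = downtime <= rto
    # A system with no RPO (N/A) trivially meets it.
    rpo_met = rpo is None or data_loss <= rpo
    return rto_met, rpo_met
```

For example, `met_objectives("Customer Portal", timedelta(hours=5), timedelta(minutes=30))` returns `(False, True)`: the portal was down longer than its 4-hour RTO, but data loss stayed within the 1-hour RPO.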

Document History

| Version | Date | Description | Written by | Approved by |
|---|---|---|---|---|
| 1.0.0 | 6/13/25 | | Dominick Pham | Adam Boender |