
Business Continuity and Disaster Recovery (BC/DR)

Purpose

To prepare the company for service outages caused by factors beyond company control (e.g., natural disasters, man-made events).
To restore services to the widest extent possible in a minimum time frame.

Scope

This policy covers all business-critical company information systems. It applies to all employees of the company and all relevant external parties, including but not limited to company contractors.

Policy

In the event of a major disruption to production services and a disaster affecting the availability and/or security of the company office or hosting facilities for more than 24 hours, senior managers and executive staff shall determine mitigation actions based on this plan.

A disaster recovery test, including a test of backup restoration processes, shall be performed on an annual basis. In the case of an information security event or incident, refer to the Incident Response Plan.

Cloud Infrastructure and Dependencies

Primary Dependencies

AWS: Database hosting (RDS), application infrastructure, file storage (S3)
Vercel: Frontend application hosting and CDN
Google Workspace: Identity management, email, productivity tools
GitHub: Source code repositories, CI/CD pipelines
Domain/DNS: Primary domain registration and DNS management

High Availability Design

Multi-AZ database deployment in AWS for automatic failover
CDN distribution for static assets via Vercel
Automated backups to multiple AWS regions
Load balancing across availability zones
Infrastructure as Code for rapid redeployment

Single Points of Failure

Domain registrar/DNS provider
Primary AWS account access
GitHub organization access
Google Workspace administrative access
Key personnel availability (CTO/CEO)

Communications and Escalation

Primary Channels (in order of preference):

Slack #emergency channel (if available)
Personal phone numbers (maintained in 1Password)
Personal email addresses (backup contact list)
SMS/text messaging
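
The preference order above amounts to a simple fallback rule: use the first channel that is currently reachable. The following is a minimal Python sketch of that rule. The channel keys and the `is_available` mapping are hypothetical labels chosen for illustration and do not refer to any real integration.

```python
# Channels in the order of preference listed above.
# The keys are illustrative labels, not identifiers from a real system.
PRIMARY_CHANNELS = [
    "slack_emergency",  # Slack #emergency channel
    "personal_phone",   # numbers maintained in 1Password
    "personal_email",   # backup contact list
    "sms",              # SMS/text messaging
]


def pick_channel(is_available):
    """Return the most-preferred channel marked available, or None."""
    for channel in PRIMARY_CHANNELS:
        # Channels missing from the mapping are treated as unavailable.
        if is_available.get(channel, False):
            return channel
    return None
```

For example, if Slack is down but phones and SMS work, `pick_channel({"slack_emergency": False, "personal_phone": True, "sms": True})` selects the phone list.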

Communication Tree:

CTO → CEO → Engineer → Contractors
Parallel notification via automated monitoring alerts
Customer communication via status page and email

Emergency Contact Information:

Maintain current personal contact info in shared secure location
Test communication channels quarterly
Designate backup communicators for each channel

Incident Classification and Response

Disaster Declaration Criteria

Level 1 - Minor Incident (No BC/DR activation):

Single service degradation < 2 hours
Partial feature unavailability with workarounds available
No customer data impact
Localized performance issues

Level 2 - Major Incident (Partial BC/DR activation):

Multiple-service outage lasting 2-8 hours
Customer-facing functionality significantly impacted
Potential loss of less than 1 hour of data
Regional vendor outages affecting operations

Level 3 - Disaster (Full BC/DR activation):

Complete service outage lasting > 8 hours
Data center/region failure
Potential loss of more than 1 hour of data
Security breach affecting customer data
Key personnel unavailability during critical operations
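
The measurable criteria above can be expressed as a small classifier. The Python sketch below is illustrative only and rests on assumptions the policy does not state: an incident takes the highest level that any single criterion reaches, boundary values (exactly 8 hours of outage, exactly 1 hour of data loss) fall into Level 2, and the qualitative criteria (such as key personnel unavailability) are left to human judgment. The `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    outage_hours: float                 # expected or elapsed outage duration
    data_loss_hours: float = 0.0        # window of potentially lost data
    region_failure: bool = False        # data center/region failure
    customer_data_breach: bool = False  # security breach affecting customer data


def classify(incident: Incident) -> int:
    """Return 1, 2, or 3: the highest level any single criterion reaches."""
    level = 1
    # Level 2: outage of 2 hours or more, or any potential data loss.
    if incident.outage_hours >= 2 or incident.data_loss_hours > 0:
        level = 2
    # Level 3: any one of the disaster criteria.
    if (incident.outage_hours > 8
            or incident.data_loss_hours > 1
            or incident.region_failure
            or incident.customer_data_breach):
        level = 3
    return level
```

Taking the maximum across criteria is a conservative choice: a short outage that still loses more than an hour of data is treated as a Level 3 disaster rather than a minor incident.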

Activation Authority

Level 1: Any engineer can declare and manage response
Level 2: CTO or CEO must approve activation
Level 3: CEO must approve (CTO can declare if CEO unavailable)
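
The approval rules above can be written as a lookup with one exception for Level 3. This is a hedged Python sketch, not an official access-control implementation. The role strings are hypothetical, and it assumes that the CTO and CEO may also act at Level 1 and that "CTO can declare" at Level 3 is equivalent to approval when the CEO is unavailable.

```python
# Roles allowed to approve activation at each level, per the list above.
# Role names are illustrative; Level 1 including "cto" and "ceo" is an assumption.
APPROVERS = {
    1: {"engineer", "cto", "ceo"},
    2: {"cto", "ceo"},
    3: {"ceo"},
}


def can_approve(role: str, level: int, ceo_available: bool = True) -> bool:
    """Return True if `role` may approve activation at `level`."""
    # Level 3 fallback: the CTO may act only when the CEO is unavailable.
    if level == 3 and role == "cto" and not ceo_available:
        return True
    return role in APPROVERS[level]
```

With these rules, `can_approve("cto", 3)` is False while the CEO is reachable and becomes True only when `ceo_available=False` is passed.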

Roles and Responsibilities

Chief Technology Officer (CTO):

Lead all BC/DR efforts and technical recovery operations
Coordinate with hosting providers and technical vendors
Make technical decisions about system recovery priorities
Communicate technical status to CEO and team

Chief Executive Officer (CEO):

Overall incident command and business decision authority
External communications with customers, investors, and partners
Resource allocation and vendor escalation decisions
Media and regulatory communication if required

Engineering Team:

Execute technical recovery procedures under CTO direction
Provide technical expertise for system restoration
Assist with damage assessment and recovery planning
Maintain communication with department heads during recovery

Contractors:

Support recovery efforts within their area of expertise
Follow direction from CTO and engineering team
Maintain availability during declared disaster periods
Assist with non-critical system recovery as needed

Continuity of Critical Services

Customer-Facing Services

Strategy: Rely on cloud provider SLAs and multi-AZ deployment
Backup Plan: Rapid deployment to alternative cloud regions
Communication: Automated status page updates and email notifications

Identity and Authentication

Strategy: Leverage Google Workspace distributed infrastructure
Backup Plan: Emergency admin access procedures documented
Recovery: Service account access for critical system recovery

Business Operations

Finance/Legal/HR: All vendor-hosted SaaS applications with built-in redundancy
Sales/Marketing: CRM and marketing tools hosted by vendors
Development: Distributed team with remote access capabilities

Plan Activation

Automatic Activation Triggers

Complete loss of primary AWS region for > 2 hours
Database failure with inability to restore within RTO
Multiple critical vendor failures simultaneously
Security incident requiring immediate isolation of systems

Manual Activation Criteria

Natural disaster affecting primary infrastructure region
Cyber attack requiring system isolation and recovery
Key personnel unavailability during critical business periods
Vendor bankruptcy or sudden service termination

Appendix A – Business Continuity Procedures and Scenarios

Disaster Recovery Procedures

Disaster recovery procedures are broken up into stages.

1.) Notification and Activation Phase

In this phase, it is determined that this plan should be activated and the initial steps below are taken to notify internal and external stakeholders.

Immediate Response (0-30 minutes)

Incident Detection: Automated monitoring or manual detection
Initial Assessment: CTO evaluates severity and impact
Plan Activation: Decision to activate BC/DR plan
Team Notification: Emergency communication to all team members
Customer Communication: Initial status page update if customer-facing

Notification Sequence

Monitoring system alerts CTO via multiple channels
CTO assesses situation and determines response level
CTO notifies CEO with initial impact assessment
CEO (or CTO, where permitted under Activation Authority) approves plan activation for Level 2/3 incidents
Team notification via emergency communication channels
Customer notification via status page and email if applicable

Damage Assessment (30-60 minutes)

System inventory of affected and functioning systems
Data integrity verification for critical databases
Vendor status confirmation for all critical dependencies
Recovery time estimation based on available resources
Resource requirement assessment for recovery operations

2.) Recovery Phase

This phase of the plan outlines the steps to recover the company's systems to acceptable levels. The goal is to get company systems and applications back to a full, operational production state.

Immediate Recovery Actions (0-4 hours)

Implement workarounds for critical functionality if possible
Activate backup systems in alternative regions if needed
Restore from backups if data loss has occurred
Redirect traffic to functioning systems
Scale resources to handle recovery load

System Recovery Sequence

Database recovery - highest priority for data integrity
Authentication services - enable user access
Core API services - restore basic functionality
Customer-facing applications - restore user interfaces
Administrative tools - enable business operations
Development environments - restore development capabilities
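
During an incident, only some of these systems may be affected, and the sequence above determines the order in which they are restored. The Python sketch below shows one way to turn a set of affected systems into an ordered recovery plan. The short system keys are hypothetical labels for the six entries in the list.

```python
# Recovery priority, highest first, following the sequence above.
# The keys are illustrative labels for the listed systems.
RECOVERY_ORDER = [
    "database",
    "authentication",
    "core_api",
    "customer_apps",
    "admin_tools",
    "dev_environments",
]


def recovery_plan(affected):
    """Return the affected systems sorted into recovery order."""
    unknown = set(affected) - set(RECOVERY_ORDER)
    if unknown:
        # Fail loudly rather than silently dropping an unrecognized system.
        raise ValueError(f"unknown systems: {sorted(unknown)}")
    return [system for system in RECOVERY_ORDER if system in affected]
```

For instance, `recovery_plan({"admin_tools", "database", "core_api"})` yields the database first, then the core API, then the administrative tools.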

Recovery Validation

Functionality testing of all restored systems
Security control verification for all recovered systems
Data integrity verification through automated and manual checks
Performance testing to ensure acceptable response times
User acceptance testing with limited internal users

3.) Re-Establishment Phase

In this phase, company systems are moved back to the original hosting provider. If this is deemed impossible given the nature of the disaster, the alternative site is to be converted to a primary site on an ongoing basis.

Service Stabilization

Monitor system performance for stability indicators
Gradually increase load to validate capacity
Enable all features systematically with testing
Resume normal operations when all systems validated
Document lessons learned throughout the process

Original Infrastructure Recovery (if applicable)

Assess original infrastructure for restoration feasibility
Plan migration back if cost-effective and strategic
Execute migration during low-traffic periods
Validate functionality after migration completion
Update DNS records to point to restored infrastructure

Plan Deactivation

Confirm all systems operating at full capacity
Complete incident documentation with timeline and actions
Conduct post-incident review with all stakeholders
Update BC/DR plan based on lessons learned
Return to normal operations with enhanced monitoring

Business Continuity Scenarios

Scenario 1: AWS Region Failure

Impact: Complete loss of primary infrastructure

Response:

Activate backup region within 2 hours
Restore database from latest backup
Update DNS to point to backup region
Communicate estimated recovery time to customers
Monitor new region performance and capacity

Scenario 2: Google Workspace Outage

Impact: Loss of email, authentication, and productivity tools

Response:

Use emergency admin accounts for critical system access
Implement alternative communication channels (personal phones/email)
Use service accounts for automated system operations
Communicate via status page and alternative channels
Document manual workarounds for business operations

Scenario 3: GitHub Unavailability

Impact: Cannot deploy code changes or access repositories

Response:

Use local repository copies for emergency fixes
Deploy critical fixes directly to production if necessary
Document all manual changes for later synchronization
Use GitLab backup repositories if the outage is extended
Restore full CI/CD pipeline when GitHub recovers

Scenario 4: Key Personnel Unavailability

Impact: Loss of critical technical or business knowledge

Response:

Activate cross-training procedures and documentation
Use emergency access procedures for critical systems
Engage backup personnel (contractors, advisors)
Prioritize essential operations only
Implement temporary leadership structure

Scenario 5: Cyber Security Incident

Impact: Potential data breach requiring system isolation

Response:

Immediately isolate affected systems
Activate incident response plan in parallel
Preserve evidence for forensic analysis
Restore from clean backups to separate environment
Implement additional security controls before reconnection

Appendix B – RTOs/RPOs

Critical System Recovery Objectives

System	Business Impact	RTO	RPO	Owner
Customer Portal	Customer cannot apply	4 hours	1 hour	CTO
Company Portal	HMs cannot post jobs	8 hours	4 hours	CTO
Database (Customer Data)	Complete service loss	2 hours	15 min	CTO
API Services	All functionality down	4 hours	1 hour	CTO
Authentication (Google)	No user access	2 hours	N/A	CTO
Public Website	Marketing impact only	24 hours	4 hours	CTO
Development Environment	Development stops	24 hours	8 hours	CTO
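
The objectives in the table can be checked mechanically during a recovery or an annual test. The Python sketch below copies the RTO and RPO values from the table and compares them against a measured downtime and the age of the most recent usable backup. The function name and the returned keys are hypothetical, and how downtime and backup age are measured is left as an assumption.

```python
from datetime import timedelta

# (RTO, RPO) per system, copied from the table above; None means N/A.
OBJECTIVES = {
    "Customer Portal": (timedelta(hours=4), timedelta(hours=1)),
    "Company Portal": (timedelta(hours=8), timedelta(hours=4)),
    "Database (Customer Data)": (timedelta(hours=2), timedelta(minutes=15)),
    "API Services": (timedelta(hours=4), timedelta(hours=1)),
    "Authentication (Google)": (timedelta(hours=2), None),
    "Public Website": (timedelta(hours=24), timedelta(hours=4)),
    "Development Environment": (timedelta(hours=24), timedelta(hours=8)),
}


def check_objectives(system, downtime, backup_age):
    """Report whether the RTO and RPO for `system` are still met."""
    rto, rpo = OBJECTIVES[system]
    return {
        "rto_met": downtime <= rto,
        # A system with no RPO (N/A) cannot breach it.
        "rpo_met": rpo is None or backup_age <= rpo,
    }
```

As an example, a database that has been down for 1 hour with a 20-minute-old backup is still within its 2-hour RTO but has already exceeded its 15-minute RPO.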

Document History

Version	Date	Description	Written by	Approved by
1.0.0	6/13/25		Dominick Pham	Adam Boender
