Business Continuity and Disaster Recovery (BC/DR)
Purpose
- To prepare the company for service outages caused by factors beyond its control (e.g., natural disasters, man-made events).
- To restore services to the widest extent possible in the shortest possible time frame.
Scope
This policy covers all business-critical company information systems. It applies to all employees of the company and all relevant external parties, including but not limited to company contractors.
Policy
In the event of a major disruption to production services and a disaster affecting the availability and/or security of the company office or hosting facilities for more than 24 hours, senior managers and executive staff shall determine mitigation actions based on this plan.
A disaster recovery test, including a test of backup restoration processes, shall be performed on an annual basis. In the case of an information security event or incident, refer to the Incident Response Plan.
Cloud Infrastructure and Dependencies
Primary Dependencies
- AWS: Database hosting (RDS), application infrastructure, file storage (S3)
- Vercel: Frontend application hosting and CDN
- Google Workspace: Identity management, email, productivity tools
- GitHub: Source code repositories, CI/CD pipelines
- Domain/DNS: Primary domain registration and DNS management
High Availability Design
- Multi-AZ database deployment in AWS for automatic failover
- CDN distribution for static assets via Vercel
- Automated backups to multiple AWS regions
- Load balancing across availability zones
- Infrastructure as Code for rapid redeployment
Single Points of Failure
- Domain registrar/DNS provider
- Primary AWS account access
- GitHub organization access
- Google Workspace administrative access
- Key personnel availability (CTO/CEO)
Communications and Escalation
Primary Channels (in order of preference):
- Slack #emergency channel (if available)
- Personal phone numbers (maintained in 1Password)
- Personal email addresses (backup contact list)
- SMS/text messaging
Communication Tree:
- CTO → CEO → Engineer → Contractors
- Parallel notification via automated monitoring alerts
- Customer communication via status page and email
Emergency Contact Information:
- Maintain current personal contact info in shared secure location
- Test communication channels quarterly
- Designate backup communicators for each channel
Incident Classification and Response
Disaster Declaration Criteria
Level 1 - Minor Incident (No BC/DR activation):
- Single service degradation < 2 hours
- Partial feature unavailability with workarounds available
- No customer data impact
- Localized performance issues
Level 2 - Major Incident (Partial BC/DR activation):
- Outage of multiple services lasting 2-8 hours
- Customer-facing functionality significantly impacted
- Potential data loss < 1 hour
- Regional vendor outages affecting operations
Level 3 - Disaster (Full BC/DR activation):
- Complete service outage > 8 hours
- Data center/region failure
- Potential data loss > 1 hour
- Security breach affecting customer data
- Key personnel unavailability during critical operations
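The declaration criteria above can be sketched as a small classification function. This is an illustrative sketch only, not an official tool: the `Incident` fields and the `classify` name are hypothetical, and boundary values the policy leaves undefined (for example, exactly 1 hour of potential data loss, or exactly 8 hours of outage) are assumed here to fall into Level 2.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    MINOR = 1     # no BC/DR activation
    MAJOR = 2     # partial BC/DR activation
    DISASTER = 3  # full BC/DR activation


@dataclass
class Incident:
    # All field names are hypothetical; they mirror the criteria listed above.
    outage_hours: float = 0.0        # observed or expected outage duration
    data_loss_hours: float = 0.0     # potential data loss window
    multiple_services: bool = False
    regional_vendor_outage: bool = False
    region_failure: bool = False
    customer_data_breach: bool = False
    key_personnel_unavailable: bool = False


def classify(incident: Incident) -> Level:
    """Return the highest level whose criteria the incident meets."""
    if (
        incident.outage_hours > 8
        or incident.data_loss_hours > 1
        or incident.region_failure
        or incident.customer_data_breach
        or incident.key_personnel_unavailable
    ):
        return Level.DISASTER
    if (
        incident.outage_hours >= 2
        or incident.data_loss_hours > 0  # Level 1 requires no data impact
        or incident.multiple_services
        or incident.regional_vendor_outage
    ):
        return Level.MAJOR
    return Level.MINOR
```

Checking Level 3 first means an incident that meets criteria at several levels is always assigned the most severe one.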
Activation Authority
- Level 1: Any engineer can declare and manage response
- Level 2: CTO or CEO must approve activation
- Level 3: CEO must approve (CTO can declare if CEO unavailable)
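As a rough illustration, the activation authority rules could be encoded as a single check. The function name, role strings, and the `ceo_available` flag are hypothetical; the sketch also assumes that the CTO and CEO may handle a Level 1 incident just as any engineer can, which the policy implies but does not state.

```python
def can_activate(level: int, role: str, ceo_available: bool = True) -> bool:
    """Return True if `role` may declare or approve activation at `level`."""
    role = role.upper()
    if level == 1:
        # Assumption: CTO and CEO count as able to act wherever an engineer can.
        return role in {"ENGINEER", "CTO", "CEO"}
    if level == 2:
        return role in {"CTO", "CEO"}
    if level == 3:
        # The CTO may declare only when the CEO is unavailable.
        return role == "CEO" or (role == "CTO" and not ceo_available)
    raise ValueError(f"unknown incident level: {level}")
```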
Roles and Responsibilities
Chief Technology Officer (CTO):
- Lead all BC/DR efforts and technical recovery operations
- Coordinate with hosting providers and technical vendors
- Make technical decisions about system recovery priorities
- Communicate technical status to CEO and team
Chief Executive Officer (CEO):
- Overall incident command and business decision authority
- External communications with customers, investors, and partners
- Resource allocation and vendor escalation decisions
- Media and regulatory communication if required
Engineering Team:
- Execute technical recovery procedures under CTO direction
- Provide technical expertise for system restoration
- Assist with damage assessment and recovery planning
- Maintain communication with department heads during recovery
Contractors:
- Support recovery efforts within their area of expertise
- Follow direction from CTO and engineering team
- Maintain availability during declared disaster periods
- Assist with non-critical system recovery as needed
Continuity of Critical Services
Customer-Facing Services
- Strategy: Rely on cloud provider SLAs and multi-AZ deployment
- Backup Plan: Rapid deployment to alternative cloud regions
- Communication: Automated status page updates and email notifications
Identity and Authentication
- Strategy: Leverage Google Workspace distributed infrastructure
- Backup Plan: Emergency admin access procedures documented
- Recovery: Service account access for critical system recovery
Business Operations
- Finance/Legal/HR: All vendor-hosted SaaS applications with built-in redundancy
- Sales/Marketing: CRM and marketing tools hosted by vendors
- Development: Distributed team with remote access capabilities
Plan Activation
Automatic Activation Triggers
- Complete loss of primary AWS region for > 2 hours
- Database failure with inability to restore within RTO
- Multiple critical vendor failures simultaneously
- Security incident requiring immediate isolation of systems
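The automatic triggers above lend themselves to a simple predicate that a monitoring job could evaluate. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, "multiple" vendor failures is read as two or more, and the default database RTO of 2 hours is taken from Appendix B.

```python
def should_auto_activate(
    region_down_hours: float = 0.0,
    db_restore_estimate_hours: float = 0.0,
    db_rto_hours: float = 2.0,          # database RTO from Appendix B
    failed_critical_vendors: int = 0,
    isolation_required: bool = False,
) -> bool:
    """Return True if any automatic activation trigger is met."""
    return (
        region_down_hours > 2                          # primary region lost > 2 hours
        or db_restore_estimate_hours > db_rto_hours    # cannot restore within RTO
        or failed_critical_vendors >= 2                # assumed meaning of "multiple"
        or isolation_required                          # security incident needing isolation
    )
```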
Manual Activation Criteria
- Natural disaster affecting primary infrastructure region
- Cyber attack requiring system isolation and recovery
- Key personnel unavailability during critical business periods
- Vendor bankruptcy or sudden service termination
Appendix A – Business Continuity Procedures and Scenarios
Disaster Recovery Procedures
Disaster recovery procedures are divided into three phases.
1.) Notification and Activation Phase
In this phase, it is determined that this plan should be activated and the initial steps below are taken to notify internal and external stakeholders.
Immediate Response (0-30 minutes)
- Incident Detection: Automated monitoring or manual detection
- Initial Assessment: CTO evaluates severity and impact
- Plan Activation: Decision to activate BC/DR plan
- Team Notification: Emergency communication to all team members
- Customer Communication: Initial status page update if customer-facing
Notification Sequence
- Monitoring system alerts CTO via multiple channels
- CTO assesses situation and determines response level
- CTO notifies CEO with initial impact assessment
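- CEO approves plan activation for Level 2/3 incidents (per Activation Authority, the CTO may also approve Level 2 and may declare Level 3 if the CEO is unavailable)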
- Team notification via emergency communication channels
- Customer notification via status page and email if applicable
Damage Assessment (30-60 minutes)
- System inventory of affected and functioning systems
- Data integrity verification for critical databases
- Vendor status confirmation for all critical dependencies
- Recovery time estimation based on available resources
- Resource requirement assessment for recovery operations
2.) Recovery Phase
This phase of the plan outlines the steps to recover the company's systems to acceptable levels. The goal is to get company systems and applications back to a full, operational production state.
Immediate Recovery Actions (0-4 hours)
- Implement workarounds for critical functionality if possible
- Activate backup systems in alternative regions if needed
- Restore from backups if data loss has occurred
- Redirect traffic to functioning systems
- Scale resources to handle recovery load
System Recovery Sequence
- Database recovery - highest priority for data integrity
- Authentication services - enable user access
- Core API services - restore basic functionality
- Customer-facing applications - restore user interfaces
- Administrative tools - enable business operations
- Development environments - restore development capabilities
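The recovery sequence above is a strict priority order, which could be captured in a runbook script along the following lines. This is a sketch, not the company's actual tooling: the step keys and the `run_recovery` helper are hypothetical, and stopping at the first failed step is an assumed design choice, made so that later services are not restored on top of an unrecovered dependency.

```python
from typing import Callable, Dict, List

# Priority order taken from the System Recovery Sequence above.
RECOVERY_ORDER: List[str] = [
    "database",
    "authentication",
    "core_api",
    "customer_apps",
    "admin_tools",
    "dev_environments",
]


def run_recovery(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the supplied recovery steps in priority order.

    Each step returns True on success. Steps not supplied are skipped.
    Returns the names of the systems restored before the first failure.
    """
    restored: List[str] = []
    for name in RECOVERY_ORDER:
        step = steps.get(name)
        if step is None:
            continue
        if not step():
            break  # assumed policy: do not proceed past a failed dependency
        restored.append(name)
    return restored
```

For example, if the database step succeeds and the authentication step fails, `run_recovery` returns `["database"]` and the core API step is never attempted.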
Recovery Validation
- Functionality testing of all restored systems
- Security control verification for all recovered systems
- Data integrity verification through automated and manual checks
- Performance testing to ensure acceptable response times
- User acceptance testing with limited internal users
3.) Re-Establishment Phase
In this phase, company systems are moved back to the original hosting provider. If this is deemed impossible given the nature of the disaster, the alternative site is to be converted to a primary site on an ongoing basis.
Service Stabilization
- Monitor system performance for stability indicators
- Gradually increase load to validate capacity
- Enable all features systematically with testing
- Resume normal operations when all systems validated
- Document lessons learned throughout the process
Original Infrastructure Recovery (if applicable)
- Assess original infrastructure for restoration feasibility
- Plan migration back if cost-effective and strategic
- Execute migration during low-traffic periods
- Validate functionality after migration completion
- Update DNS records to point to restored infrastructure
Plan Deactivation
- Confirm all systems operating at full capacity
- Complete incident documentation with timeline and actions
- Conduct post-incident review with all stakeholders
- Update BC/DR plan based on lessons learned
- Return to normal operations with enhanced monitoring
Business Continuity Scenarios
Scenario 1: AWS Region Failure
Impact: Complete loss of primary infrastructure
Response:
- Activate backup region within 2 hours
- Restore database from latest backup
- Update DNS to point to backup region
- Communicate estimated recovery time to customers
- Monitor new region performance and capacity
Scenario 2: Google Workspace Outage
Impact: Loss of email, authentication, and productivity tools
Response:
- Use emergency admin accounts for critical system access
- Implement alternative communication channels (personal phones/email)
- Use service accounts for automated system operations
- Communicate via status page and alternative channels
- Document manual workarounds for business operations
Scenario 3: GitHub Unavailability
Impact: Cannot deploy code changes or access repositories
Response:
- Use local repository copies for emergency fixes
- Deploy critical fixes directly to production if necessary
- Document all manual changes for later synchronization
- Use GitLab backup repositories if the outage is extended
- Restore full CI/CD pipeline when GitHub recovers
Scenario 4: Key Personnel Unavailability
Impact: Loss of critical technical or business knowledge
Response:
- Activate cross-training procedures and documentation
- Use emergency access procedures for critical systems
- Engage backup personnel (contractors, advisors)
- Prioritize essential operations only
- Implement temporary leadership structure
Scenario 5: Cyber Security Incident
Impact: Potential data breach requiring system isolation
Response:
- Immediately isolate affected systems
- Activate incident response plan in parallel
- Preserve evidence for forensic analysis
- Restore from clean backups to separate environment
- Implement additional security controls before reconnection
Appendix B – RTOs/RPOs
Critical System Recovery Objectives
| System | Business Impact | RTO | RPO | Owner |
|---|---|---|---|---|
| Customer Portal | Customers cannot apply | 4 hours | 1 hour | CTO |
| Company Portal | HMs cannot post jobs | 8 hours | 4 hours | CTO |
| Database (Customer Data) | Complete service loss | 2 hours | 15 min | CTO |
| API Services | All functionality down | 4 hours | 1 hour | CTO |
| Authentication (Google) | No user access | 2 hours | N/A | CTO |
| Public Website | Marketing impact only | 24 hours | 4 hours | CTO |
| Development Environment | Development stops | 24 hours | 8 hours | CTO |
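During an incident or the annual recovery test, the objectives in this table can be checked mechanically. The sketch below encodes the table values as data and reports which objectives were exceeded; the `OBJECTIVES` mapping and `breached_objectives` function are hypothetical names, and an RPO of N/A is represented as `None`.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Tuple

# (RTO, RPO) per system, copied from the table above. None means N/A.
OBJECTIVES: Dict[str, Tuple[timedelta, Optional[timedelta]]] = {
    "Customer Portal": (timedelta(hours=4), timedelta(hours=1)),
    "Company Portal": (timedelta(hours=8), timedelta(hours=4)),
    "Database (Customer Data)": (timedelta(hours=2), timedelta(minutes=15)),
    "API Services": (timedelta(hours=4), timedelta(hours=1)),
    "Authentication (Google)": (timedelta(hours=2), None),
    "Public Website": (timedelta(hours=24), timedelta(hours=4)),
    "Development Environment": (timedelta(hours=24), timedelta(hours=8)),
}


def breached_objectives(
    system: str,
    downtime: timedelta,
    data_loss: timedelta = timedelta(0),
) -> List[str]:
    """Return which of "RTO" and "RPO" were exceeded for `system`."""
    rto, rpo = OBJECTIVES[system]
    breached: List[str] = []
    if downtime > rto:
        breached.append("RTO")
    if rpo is not None and data_loss > rpo:
        breached.append("RPO")
    return breached
```

For instance, a 3-hour database outage with 10 minutes of lost data exceeds the 2-hour RTO but stays within the 15-minute RPO, so the function returns `["RTO"]`.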
Document History
| Version | Date | Description | Written by | Approved by |
|---|---|---|---|---|
| 1.0.0 | 6/13/25 | | Dominick Pham | Adam Boender |