Disaster Recovery and Business Continuity Specialist IT Services Cape Town South Africa.
Business Continuity and Disaster Recovery services, Cape Town, South Africa
BUSINESS CONTINUITY – DISASTER PLANNING – BUSINESS IMPACT ANALYSIS – RISK PROFILING – FULL FAILOVER SOLUTION DESIGN – COMPLETE DATA CENTRE/SITE DESIGN – OFFICE MOVES – SYSTEMS INTEGRATION – ON-SITE TECHNICAL SUPPORT CONSULTANCY
Fourspiral Technologies • Tel: +27 (0)21 - 4242957 • Fax: +27 (0)21 - 4242956 • E-mail:
BUSINESS CONTINUITY & DISASTER RECOVERY
Philosophy and Overview:
Fourspiral’s philosophy is to identify and define the scope of key processes involving staff, systems, communications and interfaces to other companies – these are then modeled and documented. Tracing how data flows, what path it takes, and how both human and automated processes carry that data and turn it into meaningful information highlights task and system bottlenecks and risk areas. Designing robust, heavy-duty failover and disaster plans to suit the business’s Recovery Time and Recovery Point Objectives (RTO and RPO) is predicated on fully understanding each element of the key business processes required for the business to operate effectively.
RTO and RPO (the terms are explained below) determine the timeframes within which critical processes – which have already been modeled – must be recovered to stay within the business’s risk appetite. Aggressive scenario testing is an integral part of refining a DR plan, so that the business can deal with the emergency at its primary site and then instigate recovery to its backup site.

Many years of experience in high-stress, production-critical environments form the basis for practical, intuitive DR and BC solutions that make sense to BOTH business and technical staff. Polarized perspectives do not function well in critical situations – everyone in the company needs to understand what their role will be in a contingent situation. Tested and proven DR plans give focus to executives, managers and technical staff alike: their roles and recovery objectives, customer communication and expectation management are almost as important to an organization's reputation as achieving speedy technical recovery.
Financial versus Reputational Risk
It is important to remember that ‘risk’ is a composite term which generically encompasses many factors from all aspects of life pertinent to a business and its operation. Reputational risk arises where customers and shareholders perceive that an incident or event led to unnecessary financial loss through lack of due diligence in a risk area or, worse, through negligence in applying the correct controls and procedures that could easily have prevented the loss. The informed consensus on risk is that financial loss caused by poor control of systems, procedures and internal controls can hurt a company’s reputation in a serious way; it is considered potentially more serious than a standard outage that temporarily affected trading, with a corresponding financial loss, but that was handled well and positively within the bounds of the company’s response policy.
Negligence, lack of control or insufficient due care in a regulated environment can bring large indirect financial losses where investors or shareholders lose confidence and trust in crucial control and continuity elements within a given business unit. This can materially affect the investment capital on a company’s balance sheet. An indirect slight on one’s reputation can be harder hitting than a purely monetary loss that can be written off on a balance sheet.
These statements are especially true of companies operating in the financial services sector.

Categories of Disaster
Categories of disaster will be defined. These can range from a localized power outage at a small site, to a fire at head office, to a database deletion or compromise after an unauthorised network penetration.
Worse situations, where regional disasters have been declared, may require major businesses to instantiate services to contingent regional offices thousands of kilometers away.
Each of these continuity scenarios can be invoked for key systems and operating centres alone, or inclusive of user-base environments such as call centres, order processing departments and the like.
Special consideration needs to be given to services such as traditional analogue voice circuit termination, and to how and when Primary Rate ISDN lines will be logically moved to a contingent site.
What happens frequently in a DR scenario is that dependencies on B2B interfaces that form part of the business critical workflow for key processes are not factored into recovery plans: Fourspiral considers and tests for these from the outset.
Structuring the right contingent environment to correspond to the level of risk appetite
Thresholds (or appetites) that key business stakeholders set down are fundamental to mitigating financial or reputational loss. These risk thresholds are determined in a Business (or Enterprise) Impact Analysis, which logically presents the outcomes and impacts of myriad threat conditions or scenarios. The criticality (per service or business unit) and the likelihood of these events occurring together generate acceptable risk thresholds. Risk tolerance thresholds allow for informed continuity spending and planning. Once budgets and staff resources are known, BC and DR invocation actions and responsibilities can be documented and given to nominated senior technical and business staff to own, follow through and test.
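As an illustration of how the criticality and likelihood described above can be combined into a simple risk score during an Impact Analysis, the short Python sketch below rates a few hypothetical services against an assumed appetite threshold; the service names, 1-to-5 scales and threshold are illustrative assumptions, not a prescribed methodology.

  # Illustrative BIA risk-scoring sketch (hypothetical services and scales).
  # Criticality and likelihood are rated 1 (low) to 5 (high); their product
  # gives a simple score that can be compared against the agreed risk appetite.
  services = {
      "Order processing": {"criticality": 5, "likelihood": 3},
      "Internal e-mail":  {"criticality": 3, "likelihood": 2},
      "Reporting / MIS":  {"criticality": 2, "likelihood": 2},
  }
  RISK_APPETITE = 9  # scores above this threshold demand a documented, tested failover plan

  for name, rating in services.items():
      score = rating["criticality"] * rating["likelihood"]
      status = "ABOVE appetite: plan & test" if score > RISK_APPETITE else "within appetite"
      print(f"{name:20s} score={score:2d}  {status}")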
Service or operational owners (departmental or operational managers) own their component continuity plans. The component plans roll up into one master Enterprise Continuity Plan (ECP), which would typically be assigned to a senior operations manager (with a secondary owner for resilience) to own, understand, maintain and take responsibility for, becoming the business’s expert in running continuity operations at a high level.
Typically the role of a specialist DR company can be categorised into three potential areas:
  1. Provide consultancy and advice to key staff members within an organisation.
  2. Work together with IT staff to develop, implement and test DR & BC procedures, then hand ownership to the organization to maintain updated environments, documentation and BC plans.
  3. Take responsibility for an outsourced and managed BC solution – so that our clients are effectively mitigating a known risk through ongoing preparation, monitoring and alerting, with integrated device/software failover where required.

A potential problem with maintaining a strong, robust failover environment is that senior executive management must be persuaded of its merits and must ultimately own responsibility for the business’s wellbeing in a critical business continuity situation:

Points of interest to consider:
a) It is not unusual in the corporate world for stakeholder involvement and mandate to be lacking – buy-in must come from the highest business and board level. The reason for this has traditionally been that capital expenditure and monthly costs for contingent lines contribute negatively to the quarterly bottom line.
Executive management must fully understand the direct relationship between capex outlay for failover architecture, procedures and preparation, versus the potential revenue and reputational loss when a disaster hits. Good business continuity and preparation can be likened to insurance payments made every month: you hate paying them, but when your house burns down you thank the universe that you offset the risk of losing everything by contributing a relatively small percentage of your income.
b) Confidence in recovery and continuity procedures will not be strong if aggressive enough testing has not been conducted – simulated testing must be realistic and taken seriously. If there is doubt at senior executive levels of an organization as to its capacity to respond to serious incidents affecting its business operations, then this doubt will be magnified at the technical levels required to bear the burden in an invocation scenario. Very often companies sit back in self-satisfaction under the illusion that they are prepared; engaging an external specialist company to ratify a given approach can be of great benefit. Quote to bear in mind: ‘If it ain’t seriously tested, it won’t work.’
c) Automated monitoring is not in place to alert on failure and trigger failover to the contingent instance (a minimal probe sketch is shown after these points).
d) Change management does not include promotions of changes and deployments into the DR environment – this is especially true of very subtle optimization upgrades or parameter tuning (for example on databases) that is not mirrored onto the contingent environment.
e) Connectivity to 3rd party vendors or suppliers is not integrated or tested in DR exercises or invocations.
f) Hardware dependencies – these can very often be eliminated through the use of virtual machine clustered environments layered on top of redundant blade enclosures, for instance. In itself this approach can offer significant cost and ease-of-management advantages, and many companies are now switching to these environments for their day-to-day production operations.
Fourspiral highly recommends the use of the following products:
VMware ESX product suite – click here: hyperlink
Case study available on the VMware site – click here: hyperlink
Coupled with Enclosure & Blade array from Dell Corporation:
Dell Blade Servers
Dell Blade Servers_more

NB: ‘Your IT guys are going to love you if you suggest or even mention VMware.’
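Relating to point (c) above, the following is a minimal sketch of an automated availability probe that raises an alert when the primary stops responding, so that failover to the contingent instance can be invoked; the host name, port, intervals and the alerting action are placeholder assumptions and would be replaced by whatever monitoring tooling the business already runs.

  # Minimal availability probe (illustrative only): if the primary service stops
  # answering on its service port for several consecutive checks, raise an alert
  # so that failover to the contingent instance can be triggered.
  import socket
  import time

  PRIMARY = ("primary.example.local", 443)  # placeholder host and port
  CHECK_INTERVAL = 30                       # seconds between probes
  FAILURES_BEFORE_ALERT = 3

  def port_open(host, port, timeout=5):
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  failures = 0
  while True:
      if port_open(*PRIMARY):
          failures = 0
      else:
          failures += 1
          if failures >= FAILURES_BEFORE_ALERT:
              print("ALERT: primary unreachable, invoke failover procedure")
              break
      time.sleep(CHECK_INTERVAL)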
RTO & RPO Definitions (courtesy of www.drj.com)
RECOVERY POINT OBJECTIVE (RPO): The point in time to which systems and data must be recovered after an outage as determined by the business unit [or department].
RECOVERY TIME OBJECTIVE (RTO): The period of time within which systems, applications, or functions must be recovered after an outage (e.g. one business day). RTOs are often used as the basis for the development of recovery strategies, and as a determinant as to whether or not to implement the recovery strategies during a disaster situation. SIMILAR TERMS: Maximum Allowable Downtime.
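As a simple worked example of the RPO definition (all figures assumed for illustration): if the last consistent copy of the data was taken at 02:00 and an outage begins at 06:30, the exposure is 4.5 hours of data; against an RPO of one hour that regime fails, while against an RPO of one business day it passes. The small Python sketch below performs the same check.

  # Worked RPO check with assumed times (illustration only).
  from datetime import datetime, timedelta

  last_good_copy = datetime(2024, 1, 10, 2, 0)   # last consistent backup or replica
  outage_start   = datetime(2024, 1, 10, 6, 30)  # moment the primary was lost
  rpo            = timedelta(hours=1)            # agreed Recovery Point Objective

  exposure = outage_start - last_good_copy       # data that would be lost
  print("Data exposure:", exposure,
        "(RPO met)" if exposure <= rpo else "(RPO breached)")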
Additional Information:
Fourspiral staff have been responsible for the design, maintenance and recovery of infrastructure serving business processes generating turnover in excess of €2 billion per annum in Europe.
Within South Africa, Fourspiral has held responsibility for BC and DR continuity for clients with turnover in excess of R200 million per annum.
Additional services
Timed and audited DR tests to ensure recovery objectives and timeframes required by the business and its auditors have been met.
Technical audits such as SAS 70 (IT & corporate audits).
Implementing more in-depth technical recommendations stipulated after audit assertions or recommendations are made to a client by KPMG. We have often found a lack of clarity in the nexus between procedural recommendations and what should actually be deployed technically, not only to remain compliant but, more importantly, to have a failsafe, robust failover environment for services, systems, servers and staff.
Fourspiral can assist in sourcing staff or co-ordinating local resources for the delivery of specialised continuity projects for companies with locations in areas of Africa that pose challenges in terms of infrastructure and communications.
Continued upkeep of DR process documentation as changes are implemented through a company’s Change Management (CM) processes. FS can be responsible for maintaining the currency of electronic documentation, plans and procedures against ever-changing technical environments.
Assistance in the following areas can also be offered:
Incident Management
Technical system recovery
User-base bring-up and relocation
Interface services between business crisis management team and the user-base.
Interface with all third parties to ensure communications, systems, business objectives are defined.
Special rapid-deployment microwave communications available for ultra-quick WAN/internet/inter-site connectivity.
This level of assistance is offered with the help of key partners: FS works with its partners to ensure the best service in any given area – networking, communications, hardware support and expert technical knowledge.

CASE STUDIES
Case Study #5: Real-time replication to a hot site for a leading investment bank
Name: Live real-time continuity & DR site established 30 km away for an investment trading bank: Chase Manhattan Bank, Dublin, Ireland (now J.P. Morgan Chase)
Brief: Provide a secure solution to replicate all systems responsible for facilitating trades and share/stock investments. Aggressively simulate different ‘disaster’ scenarios once every six months to pass internal audit requirements.
Objective: To ensure that, in the event of the loss of the primary site, all functionality for trade settlement, investment, staff approval, sign-off and MIS would be available.
Assumptions:
That the entire primary facility was not available, no systems were recoverable, no paper records were available, and there were no staff injuries – all staff were available, and all PCs had to be re-imaged with the current secure corporate desktop environment. The corporate enterprise WAN was still available. Primary connectivity replication over tunnelled Ethernet LANE over an SDH ring was fully functional at the time of the ‘disaster’.
Regional and national connectivity unaffected – international internet and private WAN circuits not affected.
Mainframes available over corporate WAN (situated in UK)
The DR facility was a real-time recovery facility for servers and communications, and a warm standby for desktop environment recovery.
Relocation of a user-base of 220 was catered for; the RPO was set at eight hours. Recovery to 80% of normal trading capacity.
Failover services included:
  • Access to multiple mainframe environments and applications: contingent WAN link brought up, firewall rule-base updated for access
  • Real-time access to Lotus Notes databases for historical data and custom Lotus applications. E-mail and e-fax repositories replicated in real time to the recovery site – no recovery necessary.
  • Novell Directory Services, file and print services – already in situ, as were Layer 2 and 3 switching and routing.
  • Databases of lesser priority available in the state they were in when DR was invoked
Result: Key departments and services recovered and available in order of priority. Objectives and RPO achieved within tolerances.

Case Study #6:
Name: Server farm DR failover from West Coast to East Coast in the US
Delivery: Ensure that, in the event of a systems failure in the live (Linux) production farm, all systems would fail over to the contingent East Coast server farm.
Objective: Ensure that all procedures and infrastructure monitoring were in place to ensure cutover to the contingent site.
Failover methods: Multiple modes of failure defined, with known RPO and RTO for each. Different modes of failure warrant different DR recovery procedures and notifications (ongoing situation updates) to internal and external client bases.
Scenario 1: The main production database server fails and is unrecoverable – fail over to a contingent instance within the server farm.
Scenario 2A: More than two servers fail in the farm – full systems contingency to the East Coast site.
Scenario 2B: Business-to-business back-end dependencies on holding company systems (separate from the above server farm).
Active monitoring
MySQL4 real-time slave database mirror (for an interesting article, see: data integrity), with an entirely mirrored server farm at a geographically separate site brought live by a DNS re-point. (A minimal replication-lag check is sketched after the result below.)
Assumptions:
Result: Audit testing by KPMG every six months proved conclusively that the hot standby site performed well against RPO objectives.
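By way of illustration of the kind of check used before re-pointing DNS to the contingent farm, the sketch below reads the replication lag from a slave with a modern Python client (pymysql); the host, credentials and lag threshold are assumptions, and the original MySQL 4 environment would have used era-appropriate scripts rather than this library.

  # Illustrative replication-lag check before failing over to a slave
  # (assumed credentials and threshold; requires the pymysql package).
  import pymysql

  MAX_LAG_SECONDS = 60  # how far behind the master the slave may be before failover

  conn = pymysql.connect(host="slave.example.local", user="monitor",
                         password="secret",
                         cursorclass=pymysql.cursors.DictCursor)
  with conn.cursor() as cur:
      cur.execute("SHOW SLAVE STATUS")
      status = cur.fetchone()
  conn.close()

  lag = status.get("Seconds_Behind_Master") if status else None
  if lag is not None and lag <= MAX_LAG_SECONDS:
      print("Slave is current enough: safe to re-point DNS to the contingent farm")
  else:
      print("Slave lagging or replication broken: escalate before failover")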

Case Study #7: Live incident & invocation of a regional Business Continuity Plan
Motivation: Thawte USA office about to be hit by a hurricane
Objective: Relocation of Thawte USA order processing, redirected from the East to the West Coast due to Hurricane Isabel on September 17th, 2003; the hurricane resulted in a regional disaster warning for the East Coast states bordering the ocean.
The Thawte office was located in Raleigh, North Carolina. The Raleigh office focused primarily on processing customer certificates for issuance.
With the office closed down because it was in the projected hurricane path, processing functions for US customers were split between the Thawte office in Cape Town, South Africa, and the holding company’s office in Mountain View, California.
Customer queries via the web site and contact-number redirection were facilitated so that orders could be handled seamlessly at other locations. Redirection of distribution groups across the Exchange backbone routed e-mail queries to agents in an office on another continent.
That this could happen on three hours’ notice is a tribute to the procedural and system architecture that allowed it to take place.
Result:
No customer impact; the Thawte Raleigh office survived the hurricane and reopened three days later, with full business functionality returned to it as if nothing had happened.
This is a true example of Business Continuity across the world in action.

Of Interest in this area:
Confidence in the data integrity latent within the slave DB replica is crucial. It is imperative that the data be frozen, referentially intact and solid at the time the disaster occurs:
An article by IBM on one of its new product offerings: GDPS/PPRC is interesting – check it out at: hyperlink
According to IBM, their GDPS offering:
‘ is a multi-site or single-site end to end application availability solution that provides the capability to manage remote copy configuration and storage subsystems (including IBM TotalStorage Enterprise Storage Server), to automate Parallel Sysplex operation tasks and perform failure recovery from a single point of control.
… This prevents the logical contamination of the secondary copy of data that would occur if any storage subsystem mirroring were to continue after a failure that prevents some, but not all secondary volumes from being updated.’

Ensuring data integrity on a replicated or slave database is technically demanding and requires a combination of in-depth DBA knowledge and intervention, advanced transaction-sensing technology, and scripts or procedures to allow for non-repudiation of data integrity.
When you get this element of disaster recovery right, you have surmounted one of the major challenges associated with real-time data/database slave-replica recovery.
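One simple way of gaining that confidence is to compare table checksums (or row counts) between the primary and the replica at a quiescent point. The sketch below illustrates the idea using MySQL's CHECKSUM TABLE statement; the hosts, credentials and table names are assumptions for illustration.

  # Illustrative integrity spot-check: compare CHECKSUM TABLE results between
  # the primary and the replica for a list of key tables (assumed hosts, credentials).
  import pymysql

  TABLES = ["orders", "trades", "customers"]  # hypothetical key tables

  def checksums(host):
      conn = pymysql.connect(host=host, user="monitor", password="secret",
                             database="prod")
      results = {}
      with conn.cursor() as cur:
          for table in TABLES:
              cur.execute(f"CHECKSUM TABLE {table}")
              results[table] = cur.fetchone()[1]  # row is (table_name, checksum)
      conn.close()
      return results

  primary = checksums("primary.example.local")
  replica = checksums("slave.example.local")
  for table in TABLES:
      match = "OK" if primary[table] == replica[table] else "MISMATCH, investigate"
      print(f"{table:12s} {match}")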
Note:
Other databases, such as Oracle, have specialist toolsets associated with standby instances that make failover fully transparent.
This is highly effective but comes at a cost that exceeds open-source solutions by many orders of magnitude.
The question returns full circle to how much risk a company is willing to bear. As a company’s risk appetite reduces towards zero, the protection expenditure required to mitigate outage impact rises markedly.


Motto:
'Plan, plan, plan – train hard, expect the worst, and you’ll be surprised at how you grow and what your team can achieve.’
B. Mc Mahon
