February 27, 2013

How much is Enough?

Filed under: Disaster Recovery — houtkin @ 11:56 am

You can only do so much to ensure that your solution will work, but it must work to support our business. We must keep in mind that 1) the scenario can never be accurately planned for (we are not fortune tellers, and we cannot control disasters or how they manifest); and 2) businesses and their priorities change, which impacts the technical side of our work.

What I have found, however, is that there are some fundamentals that, if considered when designing, implementing and validating the solution, give it a consistent level of integrity and help to answer the question in a positive way.

1. A clearly written and agreed-to definition of what a successful disaster recovery event means to the business. This should be reviewed twice a year — even management experiences re-organizations.

2. A clear understanding of what the business defines as mission critical and the technology that supports these services and applications.

3. Enterprise architecture. The ability for the dr solution to integrate into the existing architecture or adhere to architectural precepts will help to cut down on some of the risk that the solution will not work.

4. The architecture of each application and its components - including any problems integrating into the overall architectural environment and the extra steps required to ensure that they can fail over together in order to meet the RTO/RPO.

5. An extremely detailed failover plan - minute by minute and technology by technology, including all processes and procedures to fail over and fall back - that becomes the fundamental training source and guideline for a disaster. This must be reviewed quarterly and tested, both as walkthroughs and through real testing, multiple times during the year. It should also be audited once by an outside organization and updated for all new or upgraded technology integrated into the environment.

6. The skillset of staff required to support the dr solution and how often they are trained in the process - including a thought to outsourcing for the event if staff are not available.

7. A business continuity plan for the IT department - to ensure that disaster recovery teams with primary/secondary responsibilities are identified and practiced. Remember that IT representatives are people too and need to be considered in planning - in the same way that the business is.

8. All third-party software/carrier/infrastructure contracts are up-to-date and define roles/responsibilities for systems/technology/applications during a dr event - including how the third parties plan to handle their own dr event, as well as notification plans so you are aware of their issues before they become a problem for your business.

Most importantly, one thing that I learned in 9/11 was that until you have your approved dr solution in place, you have to identify temporary solutions - that are agreed-to by the business. You cannot build out technology overnight. However, you can have agreements for temporary solutions should an event occur while your overall solution is being built.


February 20, 2013


Filed under: Disaster Recovery — houtkin @ 11:17 am

A successful deployment of VMware Site Recovery Manager (SRM) depends on a thorough understanding of the business RTO/RPO requirements; the data classification standing of applications; a technical preparatory analysis of the environment; storage and backup policy and administrative procedures; the storage/replication design; and whether there is a need to perform physical-to-virtual migrations.

Work with VMware to identify the full scope of the SRM deployment, the architectural design that maps to the enterprise, licensing and your company’s purchasing agreement with VMware. Also, consider SRM future growth, as the architecture changes after a threshold is met.

1. SRM is an engineering solution and should be fostered/owned by both server and storage engineering.

2. Identify an SRM owner and ensure that they are trained before and during the deployment.

3. Choose an integrator to support the project manager and the SRM owner. They will help design SRM for use in testing and disaster recovery.

4. If you do not have a replication solution in place or if you cannot fall back from the DR site, you may only be able to fail over from prod to dr using SRM.

5. A data classification policy specified by the business will facilitate the deployment of SRM. This should: 1) classify the application and data by criticality based on direction from the business; and 2) dictate the storage design to some level, which in turn finds its way into the replication solution and schedule.

6. Identify what applications and application data will be configured into SRM. Here you can use the data classification policy and application qualifications, a business’s set of applications, or a particular business process and the applications/data used to manifest it.

7. If the goal is to configure all applications with physical server dependencies into SRM, a review of all of the applications resident on the physical servers is required to identify:
a. re-configuration needs of the application;
b. if the application can be migrated to virtual;
c. whether any applications require additional licenses and/or upgrades to be able to migrate from P-to-V.

P-to-V migrations are a sub-project to the SRM deployment and need to include all application owners who are responsible for the application through the complete migration and validation process.

8. If you plan on performing P-to-V migrations to accommodate your new SRM deployment, understand any additional ESX servers that you may require and the number of licenses to cover your complete solution.

9. The storage design should be analyzed to ensure:

a. The related data for these applications is on the same frame and not spread over various frames.
b. The related data is not spread over various vendor products.
c. The replication solution and schedule work and are in sync with the storage design and data classification requirements; e.g. do they meet the RTO and RPO?
d. There is enough storage to handle data requirements as a result of data classification requirements and its configuration in SRM.
e. Backup policies / administrative processes exist and can be tweaked as SRM is configured and tweaked.
f. Storage administration policies and administrative processes exist and can be tweaked as SRM is configured and tweaked.

10. A very strong testing and validation program with proven scripts owned by each respective technology layer.

11. Before you schedule any configuration of SRM, or any P-to-V migrations needed to configure applications into SRM, understand the business schedule so that month/quarter/year-end activities on the applications do not impact the technical plan. Change management is therefore a very important aspect of the project methodology.
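
As a rough illustration of item 9c, here is a minimal Python sketch that checks a replication schedule against the RPO for each data classification tier. The tier names, the RPO values and the application names are all made up for the example, and it assumes scheduled (asynchronous) replication, where the worst-case data loss is roughly one replication interval plus the time a cycle takes to finish copying. Your storage vendor's behavior may differ, so treat this as a starting point only.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    application: str
    tier: str              # data classification tier (hypothetical labels)
    interval_minutes: int  # time between the start of replication cycles
    transfer_minutes: int  # time one cycle needs to finish copying


# Hypothetical RPO targets per classification tier, in minutes.
RPO_MINUTES = {"critical": 15, "important": 240, "standard": 1440}


def worst_case_data_loss(plan):
    """Approximate upper bound on data loss for scheduled replication."""
    return plan.interval_minutes + plan.transfer_minutes


def rpo_violations(plans):
    """Return (application, worst_case, rpo) for each plan that misses its RPO."""
    misses = []
    for plan in plans:
        rpo = RPO_MINUTES[plan.tier]
        loss = worst_case_data_loss(plan)
        if loss > rpo:
            misses.append((plan.application, loss, rpo))
    return misses


# Example data only; none of these applications come from a real environment.
plans = [
    ReplicationPlan("trading", "critical", 10, 3),
    ReplicationPlan("payroll", "important", 360, 30),
    ReplicationPlan("intranet", "standard", 720, 60),
]

for app, loss, rpo in rpo_violations(plans):
    print(f"{app}: worst-case loss {loss} min exceeds RPO {rpo} min")
```

With this sample data only payroll is flagged, because a six-hour schedule cannot meet a four-hour RPO. The point is that the data classification drives the RPO, and the RPO in turn drives the replication schedule, not the other way around.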

More tomorrow.

February 18, 2013

High-Level Framework: System/Technology/Application Recovery

Filed under: Disaster Recovery — houtkin @ 8:47 am

In the perfect dr world, all technology/systems/applications should go through 4 levels of testing before they go into production - and have an architectural/design document, as-built design, operations model and failover process - if you are lucky enough to have the staff and bandwidth to do this work. Reality dictates that this is not always available, but we cannot get away with thinking we can recover an application/system/technology without understanding the basics: the business requirement for and use of the application/technology/system and its criticality to the business; the enterprise architecture; the architecture of the technology and how it integrates into the enterprise architectural precepts; and then how to successfully recover the system/application/technology.

So, the basics for a framework are an understanding of:
1. The business process that is manifested through the technology/system/application;
2. The RTO of the business process and the system/technology/application;
3. The application/system/technology architecture and how it integrates into the overall architecture of the technical environment;
4. The operational model - and how the system/technology/application is maintained.

Recovery does not necessarily mean a failover unless the time to recover surpasses the RTO agreed-to with the business. All system recovery requires:
1. architectural design document;
2. as-built document;
3. operations model and related processes/procedures;
4. recovery processes/procedures;
5. testing scripts for both infrastructure (server, os, database) and application levels.
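
The rule stated above, that recovery only becomes a failover when the time to recover would surpass the RTO, can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the sketch deliberately ignores the time the failover itself takes, which you would want to add in a real decision.

```python
def recovery_action(estimated_recovery_minutes, rto_minutes, elapsed_minutes=0):
    """Choose between recovering in place and failing over.

    elapsed_minutes is how long the outage has already lasted when the
    decision is made; the RTO clock is assumed to start with the outage.
    """
    if estimated_recovery_minutes < 0 or elapsed_minutes < 0 or rto_minutes <= 0:
        raise ValueError("times must be non-negative and the RTO positive")
    projected_outage = elapsed_minutes + estimated_recovery_minutes
    if projected_outage > rto_minutes:
        return "fail over"
    return "recover in place"


# A 45-minute repair against a 4-hour RTO stays in place ...
print(recovery_action(45, 240))
# ... but a 200-minute repair that starts an hour into the outage does not.
print(recovery_action(200, 240, elapsed_minutes=60))
```

The estimate of the recovery time is the weak point here. It is only as good as the recovery procedures and test results behind it, which is why the documents in the list above matter.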

Other considerations:
-What is recovered: application/technology/system AND data? If so, what is the RPO of the data and can your recovery methodology meet that expectation?
-What up/down-stream technical dependencies are impacted by the outage and then the recovery of the technology/system/application?
-What core infrastructure comprises the application/system/technology and what application-level procedures require failover or not? In other words, based on what “goes down”, what is the technical path of least resistance to meet the RTO?
-What skillset is required to recover the application at the various levels (infrastructure/database/application)?
-Recovery methodology: do you recover in isolation and then integrate into production, etc.
-Security requirements during recovery and integration back into production; e.g. access control; vulnerability, etc.
-What is the recovery SLA with the vendor, if it is a third-party or managed system/technology/application?
-What is the agreed-to scope of work with the vendor, if it is a third-party or managed system/technology/application?
-What policies are in place (or not) to handle recovery?
-Governance - who determines that the application/system/technology has been fully recovered?
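
To make the up/down-stream dependency question concrete, here is a small Python sketch that works out what is impacted by a failed component and a safe order in which to recover things. The component names and the dependency map are invented for the example, and it assumes you already have the dependencies documented (for instance from the as-built documents). It uses the standard library's graphlib module, available from Python 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
DEPENDS_ON = {
    "web-frontend": {"app-server"},
    "app-server": {"database", "auth-service"},
    "reporting": {"database"},
    "database": {"storage"},
    "auth-service": set(),
    "storage": set(),
}


def recovery_order(depends_on):
    """Order components so each is recovered after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def impacted_by(failed, depends_on):
    """Return every component that depends, directly or indirectly, on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for component, deps in depends_on.items():
            if current in deps and component not in impacted:
                impacted.add(component)
                frontier.append(component)
    return impacted


print("Impacted by a database outage:", sorted(impacted_by("database", DEPENDS_ON)))
print("One valid recovery order:", recovery_order(DEPENDS_ON))
```

If the map contains a circular dependency, TopologicalSorter raises a CycleError, which is itself worth knowing before an incident rather than during one.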

Off the cuff - this is a baseline idea from the technical side.

February 15, 2013

Acknowledgement - Great Leadership

Filed under: Technical Project Management, Disaster Recovery — houtkin @ 12:13 pm

I had the great pleasure of attending a webinar presented by Judith Umlas called the “Power of Acknowledgement” (http://www.iil.com/poa/about.asp).  In addition to being gracious and attentive to each member of her audience, Judith delivered a message that is extremely important not only for general management but also for project and disaster recovery managers.

What impact does acknowledgement have on participants in technology programs and projects?  Judging by the responses in the webinar and my own experience, it can make or break the project.  We are all human and we all have bad days.  But making the attempt to really understand each individual in the project and disaster team helps us to better understand how people will react when an incident occurs or during the critical aspects of a project.  It is savings in the bank during those bad days or critical moments.

There are many ways to acknowledge what each member of the team brings to the table: 1) writing thank you notes to their management (for annual staff reviews); 2) understanding their schedule and responsibilities to their primary technology team before setting deliverables - especially those in the critical path; 3) working with their manager to take personal lives into consideration; 4) giving them the opportunity to present to the team; 5) acknowledging team members’ approach to their deliverable or to the overall deliverable; 6) identifying when team members go beyond the call of duty or think of solutions “outside-of-the-box”; and 7) not passing on management pressures to members of the team.

Everyone has something they love about their work. Acknowledging this to them through direct communication, or to the team through these various means, makes the difference.

Understand that you may not be appreciated by your management for recognizing the team or individuals in the team.  In some corporate cultures, it is frowned upon - for whatever reason.  However, a simple query before hiring or upon starting the engagement will tell you whether management approves of team/staff acknowledgement and how they prefer to handle it. Either way, I will never understand how acknowledging members of the team could be taken as a negative action.  So be forewarned, and ask whether providing acknowledgement to the team/participants is permitted and how acknowledgement is made - and if the answer is “That is not done here,” you may want to re-think your engagement.

Another important aspect of providing acknowledgement to individuals is that it makes clear that you live by “Giving credit where credit is due.”  Someone once told me that they are acknowledged by their management and peers for how talented they are as a result of the team and individuals working for them.  Part of that is not taking credit for the good work that is performed by members of staff, the team or project.  It is clear, anyway, what your capabilities are, and it will not take too many thought cycles for people to realize what you can or cannot provide by way of technical, engineering solutions.  So, be honest right away - and acknowledge the source of the great idea.

I learned about team and individual acknowledgement during 9/11 - while managing the rebuild of back office operations of an impacted financial institution.  Acknowledgement is particularly important during and after disaster incidents - as this may be the only life-line you have to maintain staff loyalty and presence.  After an incident, no matter how catastrophic, if people do not feel accepted through acknowledgement, they will leave - so fed-up with the stress of the incident that sometimes they cannot deal with going back to a company that treated them so badly.  Staff retention is one of the biggest risks after an incident.  Staff acknowledgement can mitigate this risk.

Remember that acknowledgement is free and takes little effort.  Good people show you who they are by their existence.  Change your world and take the risk of opening yourself up to something that feels good for both you and your team — and that yields results in ways that you cannot imagine.

