hadrblog

March 5, 2013

Lessons Learned - Sandy

Filed under: Disaster Recovery — houtkin @ 6:15 am

I had the pleasure of attending the latest conference of the Contingency Planning Exchange last week. The agenda was focused on lessons learned for representatives of various business, municipal and government entities. Net/net, the basic lessons learned, including some concepts that were new to me, were:

1. Use of the transit strike map
The idea here was that, since it was a city-wide event, the team would use the transit strike maps that go into effect at the time of an incident. No one anticipated that even these maps would be of no use because areas of the city were flooded.

LESSON: Meet with the city transit organization after an event to identify their updated transit recommendations/maps/tools for further planning.

2. Solar Phone Chargers

LESSON: No electricity results in a loss of the ability to communicate. Check this out. It's a great idea.

3. Communications: push, SMS, email and voice
An emergency communications company identified that their modes of communication, in order from most successful to best effort, were: push, SMS, email and then voice.

LESSON: Meet with your notification provider and ask them for the statistics captured on their service during Sandy. Consider the results for use with your team, company and business, and then reconfigure before the next event. Inform your user base so that they know what to expect, and test, test, test.
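
The channel ordering above can be sketched as a simple fallback loop. This is only an illustration and not any provider's API: the notify function and the per-channel send callables are hypothetical stand-ins for whatever your notification service exposes, and the default order is simply the one reported at the conference.

```python
from typing import Callable, Dict, List, Optional

# Channel order reported at the conference, most successful first.
CHANNEL_ORDER: List[str] = ["push", "sms", "email", "voice"]


def notify(message: str,
           senders: Dict[str, Callable[[str], bool]],
           order: List[str] = CHANNEL_ORDER) -> Optional[str]:
    """Try each channel in order and return the first one that succeeds.

    `senders` maps a channel name to a hypothetical function that returns
    True when delivery is confirmed. Returns None if every channel fails
    or none is configured.
    """
    for channel in order:
        send = senders.get(channel)
        if send is None:
            continue  # channel not configured for this recipient
        try:
            if send(message):
                return channel
        except Exception:
            # Treat any provider error as a failed attempt and move on.
            continue
    return None
```

After an event, the order list is the piece you would reconfigure based on your provider's own statistics.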

4. Staff anxiety
Some entities identified a growing anxiety amongst “non-critical” staff who may not have been asked to come into the workplace or engage in work-related contingency processes.

LESSON: Re-brand the concepts of “critical” and “non-critical” according to staff and identify, if possible, use of staff who may be closer and able to come to work. Educate.

5. Staff support and volunteering
Some companies organized teams of their staff who lived closer to those who may have been impacted, and provided money, food, places to stay and general hands-on support to ensure the accountability and availability of staff after an event. This was almost altruistic but very important for maintaining the company vision of the importance of staff. Staff retention is a very real concern after an incident. This support decreases the percentage of staff who leave.

LESSON: Look at your company’s vision and precepts and consider the opportunity of creating teams of volunteers in various areas as well as processes and procedures to help impacted staff. This also helps the staff anxiety concept identified in #4, above.

6. Use of VDI in support of your company’s resiliency objective.
One company used VDI to reduce the time needed to build alternative workgroups in place of buildings that were impacted.

LESSON: Look at the VDI solution for the desktop as a resiliency solution. In this case, the impact could potentially be felt in the data center; in the ability to procure workstations as part of the solution; in the workstation/laptop/notebook requirements for use with VDI; in which build to utilize, if your company uses several; and in which sites already have WAN connections in place. Clearly, there are many considerations and this may not be appropriate for your company. But it is a great resiliency solution.

8. Manual procedures and non-digital tools.
Although this discussion drew some laughs from the audience, the old ways are still the best when you consider the loss of electricity, technology, etc. After a regional loss of power, the phone systems supporting the land-line networks have 4-5 hours before their generators lose power. What differentiates land-line from cellular is, simply, power. If you have a phone that is powered through the land-line RJ14 connection, you can send/receive calls for as long as the phone company powers this network. So, some old-fashioned phones could be of use.

LESSON: Look at the logistics of your plan, the mission critical business process and ways that they can continue without technology; e.g. forms, non-feature phones, trade books, etc. To this day, I always keep old business forms at the workgroup site in case nothing else is available. I also work through passwords, etc. with the other side so that they know who they are talking to. It’s a bit of work, a bit of thinking like a movie script-writer, but sneaker-net has been known to keep businesses in business.

That’s all for today.

February 27, 2013

How much is Enough?

Filed under: Disaster Recovery — houtkin @ 11:56 am

You can only do so much to ensure that your solution will work, but it must work to support our business. We must keep in mind that 1) the scenario can never be accurately planned for (we are not fortune tellers, and we cannot control disasters and how they manifest); and 2) businesses and their priorities change, impacting the technical side of our work.

What I have found, however, is that there are some fundamentals, that if considered in designing, implementing and validating the solution, can reach a consistent level of integrity that helps in answering your question - in a positive way.

1. A clearly written and agreed-to definition of what a successful disaster recovery event means to the business. This should be reviewed twice a year — even management experiences re-organizations.

2. A clear understanding of what the business defines as mission critical and the technology that supports these services and applications/technology.

3. Enterprise architecture. The ability for the dr solution to integrate into the existing architecture or adhere to architectural precepts will help to cut down on some of the risk that the solution will not work.

4. The architecture of each application and its components - problems integrating into the overall architectural environment and those extra steps that are required to ensure that they can failover together in order to meet the RTO/RPO.

5. An extremely detailed failover plan - minute by minute and technology by technology — including all processes and procedures to fail over and fall-back - that becomes the fundamental training source and guideline for a disaster. This must be reviewed quarterly and tested as walkthroughs and through real testing multiple times during the year. As well, it should be audited once by an outside organization and upgraded for all new technology/upgraded technology integrated into the environment.

6. The skillset of staff required to support the dr solution and how often they are trained in the process - including a thought to outsourcing for the event if staff are not available.

7. A business continuity plan for the IT department - to ensure that disaster recovery teams with primary/secondary responsibilities are identified and practiced. Remember that IT representatives are people too and need to be considered in planning - in the same way that the business is.

8. All third-party software/carrier/infrastructure contracts are up-to-date and define roles/responsibilities for systems/technology/applications during a dr event - and how they plan to handle their own dr event as well as notification plans so you are aware of their issues before they become a problem for your business.

Most importantly, one thing that I learned in 9/11 was that until you have your approved dr solution in place, you have to identify temporary solutions - that are agreed-to by the business. You cannot build out technology overnight. However, you can have agreements for temporary solutions should an event occur while your overall solution is being built.

A

February 20, 2013

SRM

Filed under: Disaster Recovery — houtkin @ 11:17 am

A successful deployment of SRM is dependent on a thorough understanding of the business RTO/RPO requirements, data classification standing of applications, a technical preparatory analysis of the environment, storage and backup policy and administrative procedures; storage/replication design and whether there is a need to perform physical to virtual migrations.

Work with VMware to identify the full scope of the SRM deployment, the architectural design that maps to the enterprise, licensing and your company’s purchasing agreement with VMware. Also, consider SRM future growth, as the architecture changes after a threshold is met.

1. SRM is an engineering solution and should be fostered/owned by both server and storage engineering.

2. Identify an SRM owner and ensure that they are trained before and during the deployment.

3. Choose an integrator to support the project manager and the SRM owner. They will help design SRM for use in testing and disaster recovery.

4. If you do not have a replication solution in place or if you cannot fall back from the DR site, you may only be able to failover from prod-to-dr using SRM.

5. A data classification policy specified by the business will facilitate the deployment of SRM. This should classify the application and data by criticality based on direction from the business. The storage design would then be dictated to some level by the data classification requirements as identified by the business - and this would find its way into the replication solution and schedule.

6. Identify what applications and application data will be configured into SRM. Here you can use the data classification policy and application qualifications, a business’ set of applications, or a particular business process and the applications/data used to carry it out.

7. If the goal is to configure all applications with physical server dependencies into SRM, a review of all of the applications resident on the physical servers is required to identify:
a. re-configuration needs of the application;
b. if the application can be migrated to virtual;
c. whether any applications require additional licenses and / or upgrades to be able to migrate from P-to-V.
P-to-V migrations are a sub-project to the SRM deployment and need to include all application-owners who are responsible for the application through the complete migration and validation process.

8. If you plan on performing P-to-V migrations to accommodate your new SRM deployment, understand any additional ESX servers that you may require and the number of licenses to cover your complete solution.

9. The storage design should be analyzed to ensure:

a. The related data for these applications are on the same frame and not spread over various frames.
b. The related data is not spread over various vendor products.
c. The replication solution and schedule works and is in synch with the storage design and data classification requirements; e.g. does it meet the RTO and RPO?
d. There is enough storage to handle data requirements as a result of data classification requirements and its configuration in SRM.
e. Backup policies / administrative processes exist and can be tweaked as SRM is configured and tweaked.
f. Storage administration policies and administrative processes exist and can be tweaked as SRM is configured and tweaked.
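
Checks a, b and c above lend themselves to a small audit script. The sketch below is a hedged illustration only: the Volume record, its field names and the storage_findings function are all hypothetical, it does not call SRM or any storage vendor's tooling, and it makes the simplifying assumption that worst-case data loss is about one full replication interval.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Volume:
    """One replicated volume; all fields are illustrative."""
    app: str                  # application that owns the data
    frame: str                # storage frame (array) holding the volume
    vendor: str               # storage vendor of that frame
    replication_minutes: int  # interval between replication cycles


def storage_findings(volumes: List[Volume],
                     rpo_minutes: Dict[str, int]) -> List[str]:
    """Return human-readable findings for checks a, b and c."""
    findings: List[str] = []
    by_app: Dict[str, List[Volume]] = {}
    for vol in volumes:
        by_app.setdefault(vol.app, []).append(vol)

    for app, vols in sorted(by_app.items()):
        frames = sorted({v.frame for v in vols})
        vendors = sorted({v.vendor for v in vols})
        if len(frames) > 1:
            findings.append(f"{app}: data spread over frames {frames}")
        if len(vendors) > 1:
            findings.append(f"{app}: data spread over vendors {vendors}")
        # Assumption: worst-case loss is roughly one replication interval.
        worst = max(v.replication_minutes for v in vols)
        rpo = rpo_minutes.get(app)
        if rpo is not None and worst > rpo:
            findings.append(
                f"{app}: replication every {worst} min exceeds RPO of {rpo} min")
    return findings
```

An empty list means none of the three checks flagged a problem for the volumes supplied; it says nothing about capacity or the backup and administration policies in d, e and f.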

10. A very strong testing and validation program with proven scripts owned by each respective technology layer.

11. Before you schedule any configuration of SRM, or any P-to-V migrations needed to configure applications into SRM, understand the business schedule to avoid impact to the technical plan as a result of month/quarter/year-end activities on the applications. So, change management is a very important aspect of the project methodology.
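
As a rough sketch of the scheduling point in item 11, the helper below flags dates that fall in a period-end freeze. The three-day freeze and both function names are assumptions for illustration; your change management calendar is the real source of truth. Because quarter-end and year-end always fall on the last days of a month, a month-end check covers all three.

```python
import calendar
from datetime import date, timedelta


def in_period_end_freeze(day: date, freeze_days: int = 3) -> bool:
    """True if `day` is within the last `freeze_days` days of its month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day > last_day - freeze_days


def next_open_date(day: date, freeze_days: int = 3) -> date:
    """Return `day` itself, or the first later date outside the freeze."""
    while in_period_end_freeze(day, freeze_days):
        day += timedelta(days=1)
    return day
```

For example, with the default freeze, work proposed for February 27, 2013 would be pushed to March 1, 2013.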

More tomorrow.

February 18, 2013

High-Level Framework: System/Technology/Application Recovery

Filed under: Disaster Recovery — houtkin @ 8:47 am

In the perfect dr world, all technology/systems/applications should go through 4 levels of testing before they go into production - and have an architectural / design document, as-built design, operations model and failover process - if you are lucky enough to have the staff and bandwidth to do this work. Reality dictates that this is not always available, but we cannot get away with thinking we can recover an application/system/technology without understanding the basics: the business requirement / use of this application/technology/system and its criticality to the business; the enterprise architecture; the architecture of the technology and how it integrates into the enterprise architectural precepts; and then how to successfully recover the system/application/technology.

So, the basics for a framework are an understanding of:
1. The business process that is manifested through the technology/system/application;
2. The RTO of the business process and the system/technology/application;
3. The application/system/technology architecture and how it integrates into the overall architecture of the technical environment;
4. The operational model - and how the system/technology/application is maintained.

Recovery does not necessarily mean a failover unless the time to recover surpasses the RTO agreed-to with the business. All system recovery requires:
1. architectural design document;
2. as-built document;
3. operations model and related processes/procedures;
4. recovery processes/procedures;
5. testing script for both infrastructure (server, os, database) and application-levels.
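
The rule that recovery only becomes a failover when the time to recover would surpass the RTO can be written down as a small decision helper. This is a minimal sketch under stated assumptions: the names are hypothetical, the repair estimate comes from whoever is working the incident, and the time the failover itself takes is deliberately ignored.

```python
from enum import Enum


class Action(Enum):
    RECOVER_IN_PLACE = "recover in place"
    FAIL_OVER = "fail over"


def choose_action(elapsed_minutes: float,
                  estimated_repair_minutes: float,
                  rto_minutes: float) -> Action:
    """Fail over only when repairing in place would surpass the RTO.

    Time already lost plus the estimated repair time is compared with
    the RTO agreed-to with the business.
    """
    if elapsed_minutes + estimated_repair_minutes > rto_minutes:
        return Action.FAIL_OVER
    return Action.RECOVER_IN_PLACE
```

In practice you would subtract the expected failover duration from the RTO before comparing, since a failover started at the last minute also misses the target.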

Other considerations:
-What is recovered: application/technology/system AND data? If so, what is the RPO of the data and can your recovery methodology meet that expectation?
-What up/down-stream technical dependencies are impacted by the outage and then the recovery of the technology/system/application?
-What core infrastructure comprises the application/system/technology and what application-level procedures require failover or not? In other words, based on what “goes down”, what is the technical path of least resistance to meet the RTO?
-What skillset is required to recover the application at the various levels (infrastructure/database/application)?
-Recovery methodology: do you recover in isolation and then integrate into production, etc.?
-Security requirements during recovery and integration back into production; e.g. access control, vulnerability, etc.
-What is the recovery sla with the vendor, if a third-party or managed system/technology/application?
-What is the agreed-to scope of work with the vendor, if a third-party or managed system/technology/application?
-What policies are in place (or not) to handle recovery?
-Governance - who determines that the application/system/technology has been fully recovered?
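
The up/down-stream dependency question above is, in effect, an ordering problem: whatever a component relies on has to be recovered first. A minimal sketch using the Python standard library (graphlib, Python 3.9 or later) follows. The component names in the usage note are made up for illustration, and recovery_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return components so that each one follows everything it depends on.

    `depends_on` maps a component to the set of upstream components it
    needs. Raises graphlib.CycleError if the dependencies are circular,
    which is itself a finding worth documenting.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, given {"app": {"database", "auth"}, "database": {"storage"}, "auth": {"network"}, "storage": {"network"}}, the result lists network before storage and auth, storage before database, and app last. Several valid orders can exist, so treat the output as one acceptable sequence rather than the only one.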

Off the cuff - this is a baseline idea from the technical side.

February 15, 2013

Acknowledgement - Great Leadership

Filed under: Technical Project Management, Disaster Recovery — houtkin @ 12:13 pm

I had the great pleasure of attending a webinar presented by Judith Umlas called the “Power of Acknowledgement” (http://www.iil.com/poa/about.asp).  In addition to being gracious and attentive to each member of her audience, Judith’s message is extremely important for not only general management but project and disaster recovery managers.

What impact does acknowledgement have on participants in technology programs and projects?  By the responses in the webinar and my own experience, it can make or break the project.  We are all human and we all have bad days.  But making the attempt to really understand each individual in the project and disaster team helps us to better understand how people will react when an incident occurs or during the critical aspects of a project.  It is savings in the bank during those bad days or critical moments.

There are many ways to acknowledge what each member of the team brings to the table: 1) writing thank you notes to their management (for annual staff reviews); 2) understanding their schedule and responsibilities to their primary technology team before setting deliverables - especially those in the critical path; 3) working with their manager to take personal lives into consideration; 4) giving them the opportunity to present to the team; 5) acknowledging team members’ approach to their deliverable or to the overall deliverable; 6) identifying when team members go beyond the call of duty or think of solutions “outside-of-the-box”; and 7) not passing on management pressures to members of the team.

Everyone has something they love about their work. Acknowledging this to them through direct communication or to the team through these various means, makes the difference.

Understand that you may not be appreciated by your management for recognizing the team or individuals in the team.  In some corporate cultures, it is frowned upon - for whatever reason.  However, a simple query before hiring or upon starting the engagement will tell you whether management approves of team/staff acknowledgement or how they prefer to handle this. But, either way, I will never understand how acknowledging members of the team could be taken as a negative action.  So be forewarned, and ask whether providing acknowledgement to the team/participants is permitted and how acknowledgment is made - and if the answer is “That is not done here,” you may want to re-think your engagement.

Another important aspect of providing acknowledgement to individuals is that it is clear that you live by, “Giving credit where credit is due.”  Someone once told me that they are acknowledged by their management and peers - by how talented they are as a result of the team and individuals working for them.  Part of that is not taking credit for the good work that is performed by members of staff, the team or project.  It is clear, anyway, what your capabilities are and it will not take too many thought cycles for people to realize what you can or cannot provide by the way of technical, engineering solutions.  So, be honest right away - and acknowledge the source of the great idea.

I learned about team and individual acknowledgement during 9/11 - while managing the rebuild of back office operations of an impacted financial institution.  Acknowledgement is particularly important during and after disaster incidents - as this may be the only life-line you have to maintain staff loyalty and presence.  After an incident, no matter how catastrophic, if people do not feel accepted through acknowledgement, they will leave - so fed-up with the stress of the incident that sometimes they cannot deal with going back to a company that treated them so badly.  Staff retention is one of the biggest risks after an incident.  Staff acknowledgement can mitigate this risk.

Remember that acknowledgement is free and takes little effort.  Good people show you who they are by their existence.  Change your world and take the risk of opening yourself up to something that feels good for both you and your team — and that yields results in ways that you cannot imagine.

Andrea
