Channel: Ask the Core Team

Microsoft Professional Advisory Services


I am sure many of you are aware that Microsoft provides several options for our customers in terms of support services. The Support website provides information about our support offerings. We have Consumer support, Professional support and various levels of Premier Support. There are even several Self Support options available. These solutions are primarily focused on break-fix scenarios. What if you do not have something that is broken that needs fixing but instead would like some help implementing one of Microsoft’s technologies? We can help with that as well. This kind of help can be provided via Advisory type services.

If you are a small company, or even just an individual, and usually obtain support on a pay-per-incident basis, it is difficult to obtain advisory services. This is where Pro Advisory services can assist. Microsoft now offers Professional Advisory Services, paid for on an hourly basis, without requiring a Premier contract or an engagement through Microsoft Consulting Services. The service is still in pilot, and only covers specific scenarios, but more are being added all the time. Each group has its own supported scenarios, and there are too many to list here. Here is a list of what the CORE Team has to offer at this point:

2276908 Windows Server 2008 R2 - RDWeb Access and RemoteApp Configuration (http://support.microsoft.com/kb/2276908)

2276905 Windows Server 2008 R2 - Microsoft VDI Configuration (http://support.microsoft.com/kb/2276905)

2276880 Windows 2008 Session Broker Load Balancing (http://support.microsoft.com/kb/2276880)

2276874 Windows Server 2008 R2 RD Web Single Sign On (http://support.microsoft.com/kb/2276874)

2275811 TS Web Access And RemoteApp Configuration (http://support.microsoft.com/kb/2275811)

2275629 Windows Server 2003 Server Print Queue Migration (http://support.microsoft.com/kb/2275629)

2253278 Windows Server 2008 R2 RD Connection Broker (http://support.microsoft.com/kb/2253278)

2253250 Windows Server 2008 R2 Hyper-V Installation (http://support.microsoft.com/kb/2253250)

982909 Windows Server 2003 Server Cluster Disaster Recovery Planning (http://support.microsoft.com/kb/982909)

982908 Windows Server 2008 or Windows Server 2008 R2 Failover Cluster Disaster Recovery Planning (http://support.microsoft.com/kb/982908)

982872 Windows Server 2008 R2 RD Web Single Sign On (http://support.microsoft.com/kb/982872)

980643 Windows 2008 R2 Cluster Installation with Hyper-V (http://support.microsoft.com/kb/980643)

980459 Windows 2008 R2 Cluster Installation (http://support.microsoft.com/kb/980459)

979130 Windows 7 Deployment Activation Guidance (http://support.microsoft.com/kb/979130)

979129 Demonstration of Microsoft Deployment Toolkit With Q&A (http://support.microsoft.com/kb/979129)

978867 Windows 7 Deployment Question and Answer (http://support.microsoft.com/kb/978867)

974386 Platform Application Compatibility (http://support.microsoft.com/kb/974386)

What can you expect from Microsoft Professional Advisory services? The process is pretty straightforward:

1. Expect to be contacted by a Support Engineer who specializes in the technology area you are interested in.

2. The Support Engineer will review the Professional Advisory Services offering with you as it applies to the scenario you selected, ensuring you both understand the scope of the work involved before an official support incident is created and work begins.

3. The Support Engineer will carefully track the time involved in providing the solution so you will not be overcharged.

4. Once the work has been completed, and both you and the Support Engineer agree the solution has been provided, a summary will be provided and the case will be closed.

If you are interested in seeing other technology offerings that are available, navigate to http://support.microsoft.com and search on the keyword ‘kbProAdvisory’ to browse the current offerings.


Hope this helps.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Troubleshooting ‘Redirected Access’ on a Cluster Shared Volume (CSV)


Cluster Shared Volumes (CSV) is a new feature implemented in Windows Server 2008 R2 to assist with new scale-up and scale-out scenarios.  CSV provides a scalable, fault-tolerant solution for clustered applications that require NTFS file system access from anywhere in the cluster.  In Windows Server 2008 R2, CSV is only supported for use by the Hyper-V role.

The purpose of this blog is to provide some basic troubleshooting steps that can be executed to address CSV volumes that show a Redirected Access status in Failover Cluster Manager.  It is not my intention to cover the Cluster Shared Volumes feature.  For more information on Cluster Shared Volumes consult TechNet.

Before diving into some troubleshooting techniques that can be used to resolve Redirected Access issues on Cluster Shared Volumes, let’s list some of the basic requirements for CSV as this may help resolve other issues not specifically related to Redirected Access.

  • Disks that will be used in the CSV namespace must be MBR or GPT with an NTFS partition. 
  • The drive letter for the system disk must be the same on all nodes in the cluster.
  • The NTLM protocol must be enabled on all nodes in the cluster.
  • Only the in-box cluster “Physical Disk” resource type can be added to the CSV namespace.  No third party storage resource types are supported.
  • Pass-through disk configurations cannot be used in the CSV namespace.
  • All networks enabled for cluster communications must have Client for Microsoft Networks and File and Printer Sharing for Microsoft Networks protocols enabled.
  • All nodes in the cluster must share the same IP subnets between them as CSV network traffic cannot be routed.  For multi-site clusters, this means stretched VLANs must be used.

Let’s start off by looking at the CSV namespace in a Failover Cluster when all things appear to be ‘normal.’  In Figure 1,  all CSV volumes show Online in the Failover Cluster Management interface.

[Figure 1]

Looking at a CSV volume from the perspective of a highly available Virtual Machine group (Figure 2), the Virtual Machine is Online on one node of the cluster (R2-NODE1), while the CSV volume hosting the Virtual Machine files is Online on another node (R2-NODE2), demonstrating how CSV completely disassociates the Virtual Machine resources (Virtual Machine and Virtual Machine Configuration) from the storage hosting them.

[Figure 2]

When all things are working normally (no backups in progress, etc.) in a Failover Cluster with respect to CSV, the vast majority of all storage I/O is Direct I/O, meaning each node hosting virtual machines is writing directly (via Fibre Channel, iSCSI, or SAS connectivity) to the CSV volume supporting the files associated with those virtual machines.  A CSV volume showing a Redirected Access status indicates that all I/O to that volume, from the perspective of a particular node in the cluster, is being redirected over the CSV network to another node in the cluster which still has direct access to the storage supporting the CSV volume.  This is, for all intents and purposes, a ‘recovery’ mode.

This functionality prevents the loss of all connectivity to storage.  Instead, all storage-related I/O is redirected over the CSV network.  This is very powerful technology as it prevents a total loss of connectivity, thereby allowing virtual machine workloads to continue functioning.  It provides the cluster administrator an opportunity to evaluate the situation and live migrate workloads to other nodes in the cluster not experiencing connectivity issues.  All this happens behind the scenes without users knowing what is going on.  The end result may be slower performance (depending on the speed of the network interconnect, for example, 10 Gb vs. 1 Gb) since we are no longer using direct, local, block-level access to storage.  We are, instead, using remote file system access over the network via SMB.

There are basically four reasons a CSV volume may be in a Redirected Access mode. 

  • The user intentionally places the CSV Volume in Redirected Access mode.
  • There is a storage connectivity failure for a node, in which case all I/O is redirected over a cluster network designated for CSV traffic to another node.
  • A backup of a CSV volume is in progress or has failed.
  • An incompatible filter driver is installed on the node.

Let’s take a look at a CSV volume in Redirected Access mode (Figure 3).

[Figure 3]

When a CSV volume is placed in Redirected Access mode, a Warning message (Event ID 5136) is registered in the System Event log (Figure 4).

[Figure 4]

For additional information on event messages that pertain specifically to Cluster Shared Volumes please consult TechNet.


Let’s look at each one of the four reasons I mentioned and propose some troubleshooting steps that can help resolve the issue.

1.  User intentionally places a CSV volume in Redirected Access mode:  Users can manually place a CSV volume in Redirected Access mode by selecting the CSV volume, right-clicking the resource, selecting More Actions, and then selecting Turn on redirected access for this Cluster shared volume (Figure 5).

[Figure 5]

Therefore, the first troubleshooting step should be to try turning off Redirected Access mode in the Failover Cluster Management interface.
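Before reaching for the GUI, the current state can also be confirmed with the cluster PowerShell module. This is a minimal sketch for Windows Server 2008 R2, and it assumes the SharedVolumeInfo property exposes the RedirectedAccess flag (as it does in this release):

Import-Module FailoverClusters
# Show each CSV volume path and whether its I/O is currently redirected
Get-ClusterSharedVolume | Select-Object -ExpandProperty SharedVolumeInfo |
    Format-Table FriendlyVolumeName, RedirectedAccess -AutoSize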

2.  There is a storage connectivity issue:  When a node loses connectivity to attached storage that is supporting a CSV volume, the cluster implements a recovery mode by redirecting storage I/O to another node in the cluster over a network that CSV can use.  The status of the cluster Physical Disk resource associated with the CSV volume is Redirected Access, and all storage I/O for the associated virtual machine(s) being hosted on that volume is redirected over the network to another node in the cluster that has direct access to the CSV volume.  This is by far the number one reason CSV volumes are placed in Redirected Access mode. Troubleshoot this as you would any other loss of storage connectivity on a server.  Involve the storage vendor as needed.  Since this is a cluster, the cluster validation process can also be used as part of the troubleshooting process to test storage connectivity.

Look for the following event ID in the system event log.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          10/8/2010 6:16:39 PM
Event ID:      5121
Task Category: Cluster Shared Volume
Level:         Error
Keywords:
User:          SYSTEM
Computer:      Node1.cluster.com
Description:   Cluster Shared Volume 'DATA-LUN1' ('DATA-LUN1') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
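As mentioned, the cluster validation process can be focused on just the storage tests. A minimal sketch using the node names from the earlier figures (keep in mind that validation only runs storage tests against disks that are offline or otherwise not in use by the cluster):

Import-Module FailoverClusters
# Run only the storage category of validation tests against both nodes
Test-Cluster -Node R2-NODE1,R2-NODE2 -Include Storage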

3.  A backup of a CSV volume fails:  When a backup is initiated on a CSV volume, the volume is placed in Redirected Access mode.  The type of backup being executed determines how long a CSV volume stays in redirected mode. If a software backup is being executed, the CSV volume remains in redirected mode until the backup completes.  If hardware snapshots are being used as part of the backup process, the amount of time a CSV volume stays in redirected mode will be very short.  For a backup scenario, the CSV volume status is slightly modified.  The status actually shows as Backup in progress, Redirected Access  (Figure 6) to allow you to better understand why the volume was placed in Redirected Access mode. When the backup application completes the backup of the volume, the cluster must be properly notified so the volume can be brought out of redirected mode.

[Figure 6]

A couple of things can happen here.  Before proceeding down this road, ensure a backup really is not in progress. The first consideration is that the backup completed, but the application did not properly notify the cluster of the completion so the volume could be brought out of redirected mode.  The proper call that needs to be made by the backup application is ClusterClearBackupStateForSharedVolume, which is documented on MSDN.  If that is the case, you should be able to clear the Backup in progress, Redirected Access status by simulating a failure on the CSV volume using the cluster PowerShell cmdlet Test-ClusterResourceFailure.  Using the CSV volume shown in Figure 6, an example would be –

Test-ClusterResourceFailure “35 GB Disk”

If this clears the redirected status, then the backup application vendor needs to be notified so they can fix their application.

The second consideration concerns a backup that fails, but the application did not properly notify the cluster of the failure, so the cluster still thinks the backup is in progress. If a backup fails, and the failure occurs before a snapshot of the volume being backed up is created, the status of the CSV volume should reset by itself after a 30-minute time delay.  If, however, during the backup, a software snapshot was actually created (assuming the application creates software snapshots as part of the backup process), then we need to use a slightly different approach.

To determine if any volume shadow copies exist on a CSV volume, use the vssadmin command line utility and run vssadmin list shadows (Figure 7).

[Figure 7]

Figure 7 shows there is a shadow copy that exists on the CSV volume that is in Redirected Access mode. Use the vssadmin utility to delete the shadow copy (Figure 8).  Once that completes, the CSV volume should come Online normally.  If not, change the Coordinator node by moving the volume to another node in the cluster and verify the volume comes Online.

[Figure 8]
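For reference, the deletion shown in Figure 8 can be run from an elevated prompt. The ID below is a placeholder; substitute the Shadow Copy ID reported by the list command:

# List existing shadow copies on the system
vssadmin list shadows
# Delete the stale shadow copy by its ID (replace the placeholder with the actual GUID)
vssadmin delete shadows /Shadow={ShadowCopyId}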

4.  An incompatible filter driver is installed in the cluster:  The last item in the list has to do with filter drivers introduced by third-party applications that may be running on a cluster node and are incompatible with CSV.  When these filter drivers are detected by the cluster, the CSV volume is placed in redirected mode to help prevent potential data corruption on a CSV volume.  When this occurs, an Event ID 5125 Warning message is registered in the System Event Log.  Here is a sample message –

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          06/23/2010 04:18:12 AM
Event ID:      5125
Task Category: Cluster Shared Volume
Level:         Warning
User:          NT AUTHORITY\SYSTEM
Computer:      <node_name>
Description:   Cluster Shared Volume 'Volume2' ('Cluster Disk 6') has identified one or more active filter drivers on this device stack that could interfere with CSV operations. I/O access will be redirected to the storage device over the network through another Cluster node. This may result in degraded performance. Please contact the filter driver vendor to verify interoperability with Cluster Shared Volumes.  Active filter drivers found: <filter_driver_1>,<filter_driver_2>,<filter_driver_3>

The cluster log will record warning messages similar to these –

7c8:088.06/10[06:26:07.394](000000) WARN  [DCM] filter <filter_name> found at unsafe altitude <altitude_numeric>
7c8:088.06/10[06:26:07.394](000000) WARN  [DCM] filter <filter_name>  found at unsafe altitude <altitude_numeric>
7c8:088.06/10[06:26:07.394](000000) WARN  [DCM] filter <filter_name>   found at unsafe altitude <altitude_numeric>

Event ID 5125 is specific to a file system filter driver.  If, instead, an incompatible volume filter driver were detected, an Event ID 5126 would be registered.  For more information on the difference between file and volume filter drivers, consult MSDN.

Note:  Specific filter driver names and altitudes have been intentionally left out.  The information can be decoded by downloading the ‘File System Minifilter Allocated Altitudes’ spreadsheet posted on the Windows Hardware Developer Central public website.

Additionally, the fltmc.exe command line utility can be run to enumerate filter drivers.  An example is shown in Figure 9.

[Figure 9]
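fltmc.exe must be run from an elevated prompt. A short sketch of the commands that help map which filters are attached to which volumes (the volume path is illustrative):

# List all loaded filter drivers and their altitudes
fltmc filters
# List the filter instances attached to a specific volume
fltmc instances -v C:\ClusterStorage\Volume1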

Once the third-party filter driver has been identified, the application should be removed and/or the vendor contacted to report the problem.  Problems involving third-party filter drivers are rarely seen but still need to be considered.

UPDATE 4/9: A Hotfix has been released to address an issue where filter drivers can cause the 'redirected access' issue:

FIXED: Cluster Shared Volumes (CSV) in redirected access mode after installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1

Hopefully, I have provided information here that will get you started down the right path to resolving issues that involve CSV volumes running in a Redirected Access mode.

Thanks!

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


CNO Blog Series: Increasing Awareness around the Cluster Name Object (CNO)


I am starting a 'CNO Blog Series', which will consist of blogs written by the CORE team cluster engineers and will focus primarily on the Cluster Name Object (CNO). The CNO is the computer object in Active Directory associated with the Cluster Name; it is used as a common identity in the cluster. If you have been working with Failover Clusters since Windows Server 2008, you should be very familiar with the CNO and the role it plays with respect to the cluster security model. Looking over the CORE Team blog site, there have already been some blogs written that focus primarily on the CNO.

With the release of Windows Server 2012, there have been several enhancements added to the Failover Clustering feature that provide for better integration with Active Directory. The Product Team blog (http://blogs.msdn.com/b/clustering/), has a post that discusses creating Windows Server 2012 Failover Clusters in more restrictive Active Directory environments. That blog discusses some of the changes that have been made in the product that directly involve the CNO.

On to today's blog - increasing awareness around the Cluster Name Object (CNO)….

Beginning with Windows Server 2008, when a cluster is created, the computer object associated with the CNO, unless pre-staged in some other container, is placed, by default, in the Computers container. Windows Server 2012 Failover Clusters give cluster administrators more control over the computer object representing the CNO. The Product Group's blog mentioned earlier details new functionality in Windows Server 2012, which includes:

  • Using Distinguished Names when creating the cluster to manually control CNO placement
  • New default behavior where a CNO is placed in the same container as the computer objects for the nodes in the cluster
  • The Virtual Computer Objects (VCOs) created by a CNO are placed in the same container as the CNO

Having more control over cluster computer object placement, while desirable, requires a bit more 'awareness' on the part of a cluster administrator. This 'awareness' involves knowing that a CNO placed in a non-default location may not, by default, have the rights it needs for other cluster operations such as creating other cluster computer objects (VCOs). The first indication of a problem may be when a Role is made highly available in the cluster and that Role requires a Client Access Point (CAP). After the Role creation process completes, and the Network Name associated with the CAP attempts to come Online, it fails with an Event ID 1194.

Log Name: System
Source:   Microsoft-Windows-FailoverClustering
Event ID: 1194
Level:    Error

This event reports that a computer object associated with a cluster Network Name resource could not be created. The error message itself provides good troubleshooting guidance to help resolve the issue -

[screenshot]

In this case, it is simply a matter of modifying the security on the AD container so the CNO is allowed to Create Computer Objects. Once this setting is in place, the Network Name comes online without issue. Additionally, the CNO is also given another critical right: the right to change the password for any VCO it creates.
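For those who prefer the command line, the same permission can be granted with dsacls. A minimal sketch, assuming an illustrative CNO account (HPVCLU03$) and a custom Clusters OU in the contoso.com domain:

# Grant the CNO the right to create child computer objects in the Clusters OU
dsacls "OU=Clusters,DC=contoso,DC=com" /G "CONTOSO\HPVCLU03$:CC;computer"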

If Active Directory is properly configured (more on that in a bit), the VCO, along with the CNO, can also be protected from accidental deletion.


Protecting Cluster Computer Objects

A call often handled by our support engineers involves the accidental, or semi-intentional, deletion of the computer objects associated with Failover Clusters. There are a variety of reasons this happens, but we will not go into those here. Suffice it to say, things function more smoothly if the computer objects associated with a cluster are protected.

I mentioned new functionality in Windows Server 2012 Failover Clusters where cluster objects are strategically placed in targeted Active Directory containers (OUs) automatically. Using this methodology also makes it easier to discern which objects are associated with a Failover Cluster. As you can see in this screenshot of a custom OU (Clusters) that I created in my domain, the objects associated with the cluster carry the description Failover cluster virtual network name account. The cluster nodes, which are located in the same OU, are traditional computer objects, which do not carry this description.

[screenshot]

Examining the properties of one of these accounts using the Attribute Editor, one can see it is clearly an attribute (Description field) of the computer object.

[screenshot]

Properly protecting cluster computer objects (from accidental deletion) requires Domain Administrator intervention. This can be either a 'proactive' or a 'reactive' intervention. A proactive intervention requires that a Domain Administrator set a Deny ACE (Access Control Entry) for Delete all child objects for the Everyone group on the container where the cluster computer objects will be located.


A reactive intervention occurs after a CNO is placed in the designated container. At this point, the Domain Administrator has a choice. He can either:

1. Set the Deny ACE for Delete all child objects on the container, or

2. Check the Protect object from accidental deletion checkbox on the CNO computer object (which would then set the correct Deny ACE on the container)

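The reactive option in step 2 can also be scripted with the Active Directory PowerShell module (RSAT). A minimal sketch, using an illustrative distinguished name for the CNO:

Import-Module ActiveDirectory
# Protect the CNO computer object from accidental deletion
Set-ADObject -Identity "CN=HPVCLU03,OU=Clusters,DC=contoso,DC=com" -ProtectedFromAccidentalDeletion $true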

Let us step through a scenario from a recent case I worked for one of our customers deploying a new Windows Server 2012 Failover Cluster.

Customer Case Study

In this case, a customer was deploying a 2-Node Windows Server 2012 Hyper-V Failover Cluster dedicated to supporting virtualized workloads. The cluster creation process was completed without issue and the Cluster Core Resources group could move freely between the nodes without any resource failures. The customer had already created four highly available virtual machines, some of which were already in production. The customer wanted to test live migration for the virtual machines. When he attempted to execute a live migration for a virtual machine, it failed immediately on the source cluster node. He attempted a quick migration and that succeeded.

Reviewing the cluster logs obtained from the customer, I found the live migration error in the cluster log of the source cluster node. The live migration failure was registered with an error code of 1326.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RES] Virtual Machine <Virtual Machine MRS1SAPPBW31>: Live migration of 'Virtual Machine MRS1SAPPBW31' failed.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RHS] Resource Virtual Machine MRS1SAPPBW31 has cancelled offline with error code 1326.

00000aa8.00001cf4::2012/09/18-17:50:16.301 INFO [RCM] HandleMonitorReply: OFFLINERESOURCE for 'Virtual Machine MRS1SAPPBW31', gen(0) result 0/1326.

The error code resolved to - 'The user name or password is incorrect'.

Examining the rest of the cluster log indicated the CNO could not log on to the domain controller to obtain necessary tokens. This failure was also causing a failure registering with DNS (customer is using Microsoft dynamic DNS).

00001228.00001a20::2012/09/18-17:43:00.466 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 0)

00001228.00001a20::2012/09/18-17:43:00.550 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 1)

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NNLIB] Logon failed for user HPVCLU03$ (Error 1326), DC \\<FQDN_of_DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Obtaining Windows Token for Name: HPVCLU03, SamName: HPVCLU03$, Type: Singleton, Result: 1326, LastDC: \\<FQDN_of _DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Slow Operation, FinishWithReply: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: InternalReplyHandler with event: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: End of Slow Operation, state: Error/Idle, prevWorkState: Idle

00001228.00001a8c::2012/09/18-17:43:00.550 WARN [RES] Network Name <Cluster Name>: Identity: Get Token Request, currently doesn't have a token!

00001228.00001a8c::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NN] got sync reply: 0

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Obtaining token threw exception, error 6

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Failed DNS registration with error 6 for Name: HPVCLU03 (Type: Singleton)

Examination of the DNS zone verified there was no A-Record for the cluster name.

At this point, we logged into the domain controller the cluster was communicating with and tried to locate the CNO using the Active Directory Users and Computers (ADUC) snap-in. When the computer object was not found in the Computers container, a full search of Active Directory revealed it was located in a nested OU structure four levels deep. Coincidentally, it was located with the cluster node computer accounts, which is the expected new behavior beginning with Windows Server 2012 Failover Clusters, as previously described. It was clear, however, that the cluster administrator was not aware of this new behavior.

At this point, it appeared to be a case of the CNO account password being out of sync in the domain. I had the customer execute the following process:

  1. Temporarily move the CNO account into the Computers container
  2. Log into one of the cluster nodes with a domain account that had the Reset Password right in the domain
  3. Simulate failures for the cluster Network Name resource until it was in a permanent failed state
  4. Once the resource was in a Failed state, right-click on the resource, choose More Actions and then click Repair
  5. The previous action caused the password for the CNO to be reset in the domain
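For reference, step 3 can be driven from the cluster PowerShell module; a minimal sketch (the core Network Name resource is typically named 'Cluster Name', and the Repair action itself remains a Failover Cluster Manager operation):

Import-Module FailoverClusters
# Simulate a failure on the Network Name resource; repeat until the resource
# reaches a permanent Failed state, then use More Actions > Repair in the GUI
Test-ClusterResourceFailure "Cluster Name"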

After executing the procedure, the cluster name came back online, and the customer noticed an automatic registration in DNS. He then executed a live migration for a virtual machine and it worked flawlessly. He also checked and verified the dNSHostName attribute on the computer object was now correctly populated. Issue resolved. Case closed.

Moral of the story - not only do cluster administrators need to become familiar with the new functionality in Windows Server 2012 Failover Clusters (and there is a lot of it), but they should also realize that the CNO can have an impact in areas that are not necessarily obvious.

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Logon Failures Involving Virtual Machines in Windows Server 2012


Welcome back to the CORE Team blog. The General Availability date for Windows 8 and Windows Server 2012 has come and gone, and we here on the CORE Team expect more of you will be diving in and taking part in all of the excitement around these new products. To make sure you have a great experience, we endeavor, whenever possible, to make you aware of situations that may temporarily 'inconvenience' you. That is the purpose of this blog.

We have recently encountered several instances where a specific group policy configuration can affect the proper functioning of a Windows Server 2012 Hyper-V virtualization solution. This is due primarily to changes made in Windows Server 2012 Hyper-V functionality that will also be explained here.

The two scenarios we have seen thus far are:

  • Virtual machines failing to start
  • Virtual machines failing to live migrate

In both of these scenarios, the problem is the result of a logon failure. Here is an example of a pop-up error message you may see when a virtual machine fails to start.

[screenshot]

The critical piece of the error message is, "Logon failure: the user has not been granted the requested logon type at this computer (0x80070569.)"

In the second scenario where a virtual machine fails to live migrate, an Event ID 21502 error message is registered in the Hyper-V-High-Availability log. The critical piece of the error message is, "Failed to create Planned Virtual Machine at migration destination. Logon failure: the user has not been granted the requested logon type at this computer (0x80070569.)"

Investigation of these events revealed that a custom Group Policy was modifying the user accounts that are allowed to Log on as a Service on each Hyper-V server.

In Windows Server 2012, a special security group, NT VIRTUAL MACHINE\Virtual Machines is created when the Hyper-V Role is installed. Members of this group require the right to Create Symbolic Links (SeCreateSymbolicLinkPrivilege) and to Log on as a Service (SeServiceLogonRight). The SID associated with the group is S-1-5-83-0. The security group is maintained by the Hyper-V Management Service (VMMS). To ensure members of the NT VIRTUAL MACHINE\Virtual Machines security group maintain the rights they need, VMMS registers with Group Policy in order to update the local security policy whenever Group Policy is refreshed.
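To check whether a policy has stripped these rights on a given host, the effective user-rights assignments can be exported and inspected. A minimal sketch from an elevated prompt (the output path is illustrative); the SeServiceLogonRight line should include the group's SID, *S-1-5-83-0:

# Export the effective user-rights assignments to a file
secedit /export /cfg C:\Temp\rights.inf /areas USER_RIGHTS
# Check which accounts hold the Log on as a Service right
Select-String -Path C:\Temp\rights.inf -Pattern "SeServiceLogonRight"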

The NT VIRTUAL MACHINE\Virtual Machines group did not exist in previous versions of Hyper-V. As each virtual machine is started on a Hyper-V server, its account (Virtual Machine ID (VM_ID)) is added to the NT VIRTUAL MACHINE\Virtual Machines group and VMMS creates a Virtual Machine Worker Process (vmwp.exe). Examples of these processes are visible in Task Manager:

 

[screenshot]

 

The VM_ID is the virtual machine account that is used to gain access to its own resources and prevent other virtual machines from gaining access to those same resources. As an example, if I run the following PowerShell command, it is easy to see the rights given to the virtual machine account to one of its resources (a virtual hard disk in this case):

 

Get-Acl -Path "E:\Virtual Machines\Contoso-FS1\Virtual Hard Disks\contoso-fs1.vhdx" | FL AccessToString

AccessToString : NT VIRTUAL MACHINE\E57917F3-31C3-456E-B1BA-5E45B4CC7E0C Allow Write, Read, Synchronize

BUILTIN\Administrators Allow FullControl

NT AUTHORITY\SYSTEM Allow FullControl

NT AUTHORITY\Authenticated Users Allow Modify, Synchronize

BUILTIN\Users Allow ReadAndExecute, Synchronize

 

Since the VMWP is an extension of VMMS, VMMS performs a service logon to create an access token that is used to run the VMWP. In order for this to work, the NT VIRTUAL MACHINE\Virtual Machines security group must be granted the Log on as a Service right. In previous versions of Hyper-V, the VMWP ran in the context of a different account, NETWORK SERVICE, which is an account defined by SYSTEM.

Windows Server 2008 R2 SP1 Hyper-V Server


To find out more information about the NETWORK SERVICE account, review this MSDN resource (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684272(v=vs.85).aspx).

The error message, previously mentioned, refers to a 'user' not being granted a 'logon type'. That user, again as seen in Task Manager, is the Virtual Machine ID (VM_ID), and the logon type is 'Log on as a Service.'


 

Now that we understand the new changes, what needs to be done? A detailed Knowledge Base (KB) article was written in cooperation with the Directory Services team that provides additional details.

KB2779204
Starting or Live Migrating Hyper-V virtual machines may fail with error 0x80070569 on Windows Server 2012-based computers
http://support.microsoft.com/kb/2779204

Briefly, one of two things must happen:

  1. Hyper-V Administrators need to work with their Domain Administrators to review Group Policies to see if any involve specific user accounts being granted the Log on as a Service right, and, if so, have the policy modified appropriately
  2. Create an OU in Active Directory, place all Hyper-V servers in that OU, and block policy inheritance

Note: Option (2) is recommended by the Hyper-V Product Team

Tip: Administrators can temporarily, but quickly, recover from this error by opening an elevated command prompt and running gpupdate /force, which forces a group policy refresh

Before we wrap up, I would like to restate one of Microsoft's long-standing 'best practices' with respect to Hyper-V servers: the only Roles or Features that should ever be installed on a Hyper-V server are the Hyper-V Role and those additional Roles or Features that directly support virtualization. The classic example is Hyper-V Failover Clusters, where the Hyper-V Role and the Failover Clustering Feature complement each other by providing highly available virtualized workloads, which are the foundation of Microsoft's Cloud Strategy. If this 'best practice' is followed, no user rights modifications that could impact virtualization services should be needed.

I hope this has been helpful.

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought…..(Part 1)


Just when you thought you had things figured out - in the words of the legendary Bob Dylan, "the times they are a-changin." With the release of Windows Server 2012, Microsoft introduces a load of new features, which, in some cases, translates into doing some of the same things in different ways. Up to now, highly available virtualized workloads meant multi-node Hyper-V Failover Clusters configured with Cluster Shared Volumes (CSV) hosting virtual machines. In Windows Server 2012 Hyper-V, the rules have changed. Now, virtual machine files can be stored on SMB 3.0 file shares hosted in standalone Windows Server 2012 File Servers, or in Windows Server 2012 Scale-Out File Servers.

This multi-part blog will walk through a new scenario, one that we may start seeing more and more as IT Professionals realize they can capitalize on their high-speed networking infrastructure investment while at the same time saving themselves a little money. The scenario involves both Windows Server 2012 Hyper-V Failover Clusters and Windows Server 2012 Scale-Out File Servers.

In this multi-part blog, I will cover the following:

  • Setting up a Windows Server 2012 Hyper-V Failover Cluster with no shared storage
  • Setting up a Windows Server 2012 Failover Cluster with the Scale-Out File Services Role
  • Configuring an SMB Share that supports Application Data with Continuous Availability in the Scale-Out File Server
  • Deploying virtual machines in the Hyper-V Failover Cluster while using the Scale-Out File Server SMB 3.0 shares to host the virtual machine files

To demonstrate the scenario, I created a 3-Node Windows Server 2012 Hyper-V Failover Cluster with no shared storage and a 2-Node Windows Server 2012 Failover Cluster connected to iSCSI storage to provide the shared storage for the Scale-Out File Server Role.

Create a 3-Node Windows Server 2012 Hyper-V Failover Cluster

First, create the 3-Node Hyper-V Failover Cluster. Since the cluster will not be connected to storage, and since it is always a 'best practice', from a Quorum calculation perspective, to keep the number of votes in the cluster at an odd number, I chose a 3-Node cluster. I could have just as easily configured a 2-Node cluster and manually modified the Quorum Model to Node and File Share Witness. To support this Quorum Model, the Scale-Out File Server could be configured with a General Purpose file share to support the File Share Witness resource.

Recommendation: Since the cluster is not connected to storage, you do not have to run the storage tests in the cluster validation process.

In the interest of highlighting some of the other new features in Windows Server 2012 Failover Clustering, I created the cluster using a Distinguished Name format which provides greater control over the placement of cluster computer objects in a custom Organization Unit (OU) I created in Active Directory. It is recommended that you configure the OU to protect the Failover Cluster computer objects from 'accidental' deletion prior to creating the cluster. To accomplish this, implement a custom Access Control Entry (ACE) on the OU to deny Everyone the right to Delete all child objects.


This configuration on the container automatically checks the Protect object from accidental deletion checkbox on cluster computer objects when they are created.


Specify a Distinguished Name for the Cluster Name when creating the cluster (Create Cluster Wizard).
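The same creation can be done with PowerShell. A minimal sketch with illustrative node names and OU; the distinguished name passed as the cluster name is what controls CNO placement, and -NoStorage matches this cluster having no shared storage:

Import-Module FailoverClusters
# Create the cluster, placing the CNO in the custom Clusters OU via a distinguished name
New-Cluster -Name "CN=FABRIKAM-HVCLU,OU=Clusters,DC=fabrikam,DC=com" `
    -Node Fabrikam-N21,Fabrikam-N22,Fabrikam-N23 -NoStorage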


The Create Cluster report reflects the Active Directory path (container) where the CNO computer object is located.


Create a 2-Node Windows Server 2012 Scale-Out File Server

Configure a 2-Node Windows Server 2012 Failover Cluster to provide Scale-Out File Services to the virtual machines hosted by the 3-Node Hyper-V Failover Cluster.

Note: To read about Scale-Out File Services access the TechNet content here - http://technet.microsoft.com/en-us/library/hh831349.aspx

The Scale-Out File Services cluster requires storage to support the Cluster Shared Volumes (CSV) that will host the virtual machine files. To ensure the entire configuration is supported, run a complete cluster validation process, including the storage tests, before creating the cluster. Be sure to create the cluster with sufficient storage to support a Node and Disk Majority Quorum Model (Witness disk required) and the CSV volumes to host the virtual machine files.

 

Note: While a single CSV volume supports multiple virtual machines, a 'best practice' is to place virtual machines across several CSV volumes to distribute the I/O to the backend storage. Additionally, consider enabling CSV caching (scenario dependent). To find out more about CSV Caching, review the Product Team blog on the topic - http://blogs.msdn.com/b/clustering/archive/2012/03/22/10286676.aspx


With the cluster up and running, configure the Scale-Out File Server Role by following these steps:

  1. In Failover Cluster Manager, in the left-hand pane, right-click on Roles and choose Configure Role to start the High Availability Wizard
  2. Review the Before You Begin screen and click Next
  3. In the Select Role screen, choose File Server and click Next
  4. For the File Server Type, choose Scale-Out File Server for application data and click Next
  5. Provide a properly formatted NetBIOS name for the Client Access Point and click Next
  6. Review the Confirmation screen information and click Next
  7. Verify the wizard completes and the Role comes Online properly in Failover Cluster Manager
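These steps can also be collapsed into a single PowerShell cmdlet; a minimal sketch with an illustrative CAP name:

Import-Module FailoverClusters
# Configure the Scale-Out File Server Role with a Client Access Point named FABRIKAM-SOFS
Add-ClusterScaleOutFileServerRole -Name FABRIKAM-SOFS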

A properly configured Scale-Out File Server Role should look something like this -

[screenshot]

What happens if the Scale-Out File Server Role fails to start? Check the Cluster Events and you may find an Event ID: 1194 indicating a Network Name Resource failure occurred.


The Event Details section provides information for proper corrective action. In this case, since we are placing the cluster computer objects in a custom OU, we need to give the Scale-Out File Server CNO the right to Create Computer Objects. Once this is accomplished, and Active Directory replication has occurred, the Scale-Out File Server Role should start properly. Verify the Role comes online on all nodes in the cluster.

To review what we have accomplished:

  • Active Directory is configured properly to protect the accidental deletion of cluster computer objects
  • A 3-Node Hyper-V Failover Cluster has been created and validated
  • A 2-Node Scale-Out File Server Failover Cluster has been created and validated
  • The Scale-Out File Server CNO permissions have been properly configured on a custom OU

Well CORE Blog fans, that wraps it up for Part 1. Stay tuned for Part 2 where we will:

  • Configure SMB 3.0 shares on the Scale-Out File Server
  • Configure highly available virtual machines in the Hyper-V Failover Cluster using the SMB shares on the Scale-Out File Server Cluster
  • Demonstrate Live Migration of virtual machines in the Hyper-V Failover Cluster

Thanks, and come back soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought… (Part 2)


In Part 1, I covered configuring the Hyper-V Failover Cluster and the Scale-Out File Server solution. In Part 2, I will cover:

  • Creating the file shares in the Scale-Out File Server
  • Creating a virtual machine to use the SMB 3.0 shares in the Scale-Out File Server
  • Verifying we can Live Migrate the virtual machines in the Hyper-V Failover Cluster

Creating the File Share

Execute the following steps to create a file share in the Scale-Out File Server

  1. In Failover Cluster Manager, right-click on the Scale-Out File Server role in the center pane and choose Add File Share. This starts the New Share Wizard
  2. In the Select Profile screen, choose SMB Share - Applications and click Next
  3. For the Share Location, choose one of the CSV Volumes and click Next
  4. Provide a Share Name, verify the path information and click Next
  5. In the Other Settings screen, Enable Continuous Availability is checked by default. Click Next
    Note: Some selections are greyed-out. This is because they are not supported for this share profile in a Failover Cluster
  6. In the Permissions screen, click Customize Permissions. In the Advanced Security Settings screen, note the default NTFS and Share permissions, then add the computer accounts of the Hyper-V Failover Cluster nodes to the NTFS permissions for the share and ensure they have Full Control. If the permissions listing does not include the cluster administrator(s), add the account (or Security Group) and give it Full Control. Click Apply when finished

Complete configuring the file shares.
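The shares can also be created with PowerShell. A minimal sketch, with illustrative folder, share, and account names (Continuous Availability is requested explicitly here to mirror the wizard default):

# Create the folder on a CSV volume, then share it for application data
New-Item -Path C:\ClusterStorage\Volume1\Shares\VMS1 -ItemType Directory
New-SmbShare -Name VMS1 -Path C:\ClusterStorage\Volume1\Shares\VMS1 `
    -ContinuouslyAvailable:$true `
    -FullAccess "FABRIKAM\Fabrikam-N21$","FABRIKAM\Fabrikam-N22$","FABRIKAM\Fabrikam-N23$"
# NTFS permissions still need to match; grant each node Full Control, for example:
icacls C:\ClusterStorage\Volume1\Shares\VMS1 /grant "FABRIKAM\Fabrikam-N21$:(OI)(CI)F"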


As a test, connect to each of the shares from the Hyper-V Failover Cluster and verify you can write to each location before proceeding to the next step.

Creating a Virtual Machine to use an SMB 3.0 Share

Execute the following steps to create a new virtual machine

  1. On one of the nodes in the Hyper-V Cluster, open Failover Cluster Manager
  2. In the left-hand pane, click on Roles and then in the right-hand Actions pane click on Virtual Machines and choose New Virtual Machine
  3. Choose one of the cluster nodes to be the target for the virtual machine and click OK
  4. This starts the New Virtual Machine Wizard. Review the Before You Begin screen and click Next
  5. In the Specify Name and Location screen, provide a name for the virtual machine, enter a UNC path to a share on the Scale-Out File Server, and then click Next

  6. Configure memory settings and click Next
  7. Configure network settings and click Next
  8. In the Connect Virtual Hard Disk screen, make a selection and click Next
  9. Review the Summary screen and click Finish
  10. Verify the process completes successfully and click Finish

Testing Live Migration

Once all the virtual machines are created, you may want to test Live Migration. Depending on how many simultaneous live migrations you want to support, you may have to modify the Live Migration settings on each of the Hyper-V Failover Cluster nodes. The default is to allow two simultaneous live migrations. Here is a little PowerShell script you can run to take care of the settings for all the nodes in the cluster -

$Cred = Get-Credential

Invoke-Command -Computername Fabrikam-N21,Fabrikam-N22,Fabrikam-N23 -Credential $Cred -scriptblock {Set-VMHost -MaximumVirtualMachineMigrations 6}

In my cluster, I have all the virtual machines running on the same node -

[screenshot]

I will use a new feature in Windows Server 2012 Failover Clusters, multi-select, and select all of the virtual machines and live migrate them to another node in the cluster -

[screenshot]

Since there are only four virtual machines and the maximum number of live migrations is equal to six, all will migrate.


If I were to rerun my script and change the maximum back to two, then two migrations would be queued until at least one of the in-progress migrations completes.


You can use the Get-SmbSession PowerShell cmdlet on any node in the Scale-Out File Server to determine the number of sessions. For illustration purposes, I have all virtual machines running on the same Hyper-V Failover Cluster node (Fabrikam-N21) and the CSV volumes are running on the same node in the Scale-Out File Server (Fabrikam-N1) -

[screenshot]

Distributing the virtual machines across the multi-node Hyper-V Failover Cluster (Fabrikam-N21, Fabrikam-N22, and Fabrikam-N23) is reflected on the Scale-Out File Server -
[screenshot]

Finally, I re-distribute the CSV volumes across the Scale-Out File Server nodes as shown here -

[screenshot]

This is reflected in the Get-SmbSession PowerShell cmdlet output -

[screenshot]

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Working with multiple network adapters in a virtual machine


Thanks for coming back to the CORE Team blog site. This blog will address working with multiple network adapters in a virtual machine. Many of you out there may not be interested in this because you work with virtual machines that only use a single network adapter. However, for those of us that frequently work with virtualized Failover Clusters, virtualized iSCSI Target Servers or even virtualized RRAS servers, we find ourselves in a position where virtual machines require more than one network adapter. I hope that the information here will provide some needed relief for you.

We all know that we cannot 'hot-add' network adapters to running virtual machines. The choices we have are to either configure all the network adapters before starting the virtual machine, or configure them one at a time, which requires the virtual machine to be shut down first. As an example, here is what I typically have to deal with when configuring nodes in a Failover Cluster. I require three networks: one for Public access, one for Cluster-only communications, and one for connectivity to the shared storage provided by an iSCSI target.

[screenshot]

I think we can all agree that the information displayed is insufficient to assist with the configuration of each of the networks. Let us address each scenario individually.

 

Windows Server 2012

The great thing about Windows Server 2012 is that there is lots of PowerShell help available to assist. We will use PowerShell to work through configuring the networks in the virtual machine. As shown above, here is the starting point -

[screenshot]

One thing that I like to do is to make sure the Hyper-V virtual switch configuration makes sense for what I am doing. My virtual switch names are Public, Cluster and iSCSI because they make sense and meet my needs. Using that information, I use the Get-VMNetworkAdapter cmdlet to get the information I will need.

Get-VMNetworkAdapter -VMName 2012-Test | ft -Autosize Name,SwitchName,MacAddress,IPAddresses


 

 

 

In the virtual machine, I use the Get-NetAdapter cmdlet to get additional information I will need.

Get-NetAdapter | ft -Autosize Name,InterfaceDescription,ifIndex,MacAddress


 

 

 

 

Using the MacAddress information, I can sort out the 'players.'


Using the Get-NetAdapter and Rename-NetAdapter cmdlets, change the name of the connections in the virtual machine.

Get-NetAdapter -Name 'Ethernet' | Rename-NetAdapter -NewName Cluster

Get-NetAdapter -Name 'Ethernet 2' | Rename-NetAdapter -NewName ISCSI

Get-NetAdapter -Name 'Ethernet 3' | Rename-NetAdapter -NewName Public


 

 

 

 

Once the names of the adapters are changed, it is time to configure the IP addressing. To accomplish this, use the New-NetIPAddress cmdlet. 

New-NetIPAddress -InterfaceIndex 13 -IPAddress 1.0.0.3 -PrefixLength 8 -DefaultGateway 1.0.0.10

New-NetIPAddress -InterfaceIndex 14 -IPAddress 192.168.0.3 -PrefixLength 24

New-NetIPAddress -InterfaceIndex 15 -IPAddress 172.16.0.3 -PrefixLength 16

If name resolution is required, configure a DNS server address on the Public interface.

Set-DnsClientServerAddress -InterfaceIndex 13 -ServerAddresses ("1.0.0.110","1.0.0.100")

 

To verify the new IP addresses, in the Hyper-V Host, re-run the Get-VMNetworkAdapter cmdlet or in the virtual machine run ipconfig /all.

This completes the configuration of the network adapters and it was accomplished without having to reboot the virtual machine.

 

Windows Server 2008 R2

Windows Server 2008 R2 includes PowerShell as well, but it does not come close to being as useful, or as powerful, as the PowerShell functionality found in Windows Server 2012. To complete the very same process as was executed in Windows Server 2012 requires a little different strategy. As shown above, here is the starting point.


Use the Get-VMNetworkAdapter cmdlet to get the information I will need.

Get-VMNetworkAdapter -VMName Contoso-FS2 | ft -Autosize Name,SwitchName,MacAddress,IPAddresses


 

 

 

Windows Server 2008 R2 comes with PowerShell Version 2.0 installed. Use PowerShell to obtain the network adapter information we need. 

Get-WmiObject -query "select * from Win32_NetworkAdapter where name like 'Microsoft Hyper-V Network Adapter%'" | FL Name,MACAddress


 

 

 

 

 

Using the MacAddress information, I can sort out the 'players.'


Next, use the netsh command to finish the configuration. First, rename the adapters.

Netsh interface set interface name="Local Area Connection" NewName="Public"

Netsh interface set interface name="Local Area Connection 2" NewName="Cluster"

Netsh interface set interface name="Local Area Connection 3" NewName="ISCSI"

 

Use the netsh interface show interface command sequence to show the new names for the interfaces.

Set the IP Address configuration (set the Default Gateway on Public) on each interface and verify using ipconfig /all

Netsh interface ip set address name="Public" static 1.0.0.4 255.0.0.0 1.0.0.10 1

Netsh interface ip set address name="Cluster" static 172.16.0.4 255.255.0.0

Netsh interface ip set address name="ISCSI" static 192.168.0.4 255.255.255.0

 

If name resolution is required, configure a DNS server.

Netsh interface ip set dnsservers name="Public" static 1.0.0.110 primary

 

This completes the configuration for the Windows Server 2008 R2 virtual machine network adapters. Again, the configuration was accomplished without rebooting the virtual machine.

I would also like to acknowledge the help from my teammate - Sean Dwyer. Thanks, and come back again soon.


Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Sean Dwyer
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Adding a Pass-through Disk to a Highly Available Virtual Machine


This blog discusses the proper way to make a configuration change to a highly available virtual machine in a Windows Server 2008 (RTM) Failover Cluster. I will demonstrate how to add a Pass-through disk to a highly available virtual machine by attaching it to a SCSI Controller. You can also use an IDE controller, but I chose to use a SCSI controller because it is not available by default and as we walk through the process, you get the added benefit of seeing how to add it (Note: SCSI controllers require the installation of Integration Components in the Guest).

There are several reasons why Pass-through disks are an attractive option. The main reason is that you can bypass the Hyper-V server file system and gain faster, direct access to the disk from inside a virtual machine. To accomplish this, the disk must be Offline from the operating system's perspective. There are tradeoffs, however, some of which include having to locate the virtual machine configuration file somewhere else, losing the ability to take snapshots, and not being able to use dynamically expanding disks or configure differencing disks. For more information on the storage options for Hyper-V, you can review Jose Barreto's blog on the subject.

I start off with a highly available VM running the Windows Server 2008 operating system. The VM is using a VHD for its boot disk attached to an IDE controller.


Note: Hyper-V virtual machines can only boot from storage attached to IDE controllers.

In the Disk Management snap-in, I can see the disk I am using to support the boot disk for the VM and also the LUN I will be adding as the Pass-through disk attached to a SCSI controller. The disk I will use for the Pass-through disk is Offline (it must be Offline or it cannot be added to the VM).


Note: A new disk must be brought Online and Initialized before it can be used. This process writes a disk signature to the disk so the cluster can use it. Once the disk has been initialized, it can be placed Offline again. No partitioning is required, as that will be accomplished inside the virtual machine.
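If you prefer to script the preparation, the same Online/Initialize/Offline sequence can be done with a diskpart script. A minimal sketch; the disk number is illustrative, so confirm it with list disk first:

rem Save as prep-disk.txt and run from an elevated prompt: diskpart /s prep-disk.txt
select disk 4
online disk
attributes disk clear readonly
rem Initializing as MBR writes the disk signature the cluster needs
convert mbr
offline disk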

In the virtual machine, only the boot disk is currently visible in the Disk Management interface.


Since I will be modifying the configuration of the virtual machine, I first need to shut it down. In the Failover Cluster Management snap-in, right-click on the Virtual Machine resource and choose Shut Down.


Leave the Virtual Machine Configuration resource Online or you will not be able to access the machine settings in the Hyper-V Management snap-in.


In the Hyper-V Management Snap-in, start the Add Hardware wizard and choose to add a SCSI Controller as that is the type of interface the Pass-through disk will be attached to.


As part of the wizard, choose to add a hard drive to the SCSI Controller.


Since this will be a Pass-through disk, select the correct disk from the drop down list under Physical hard disk.


Complete the configuration and Start the Virtual Machine in the Failover Cluster Management interface.

With the virtual machine started, open the Disk Management snap-in. I can now see the new disk that was added.


However, I still do not see the new disk added to the virtual machine configuration in Failover Cluster Manager.


Since I made a change to a virtual machine that is under the control of the cluster, I need to inform the cluster service that a change has been made. I accomplish this by running the Refresh virtual machine configuration action in the right-hand pane.


The virtual machine is Saved as part of the process.


Once the refresh is completed, review the report that is generated to see if it was successful.


Examine the details of the report to see what changes were made.


Once I complete the review of the report and inspect the Failover Cluster Management snap-in, I see the new disk added to the group and it is Online.


Restart the virtual machine from the Failover Cluster Management snap-in and complete the configuration of the new storage in the virtual machine.

clip_image032

Once the partitioning and formatting of the volume is complete, refresh the display in the Failover Cluster Management snap-in and the information is updated for the new storage.

clip_image034

Back in the Disk Management snap-in, the disk now shows a Reserved status meaning it is under the control of the cluster (just like the boot disk).

clip_image036

All that remains is to test failover to other nodes in the cluster to ensure the new configuration comes Online successfully.

So, what is important to remember here?

  • When making changes to any highly available virtual machine, always run the Refresh virtual machine configuration action in the Failover Cluster Management snap-in before attempting a failover to another node in the cluster, and ensure the generated report is free of errors (on Windows Server 2008 R2 this can also be scripted; see the sketch after this list).
  • Do not take the Virtual Machine Configuration resource Offline or you will not be able to make any changes to the VM as it will be removed from display in the Hyper-V Management snap-in.
  • Do not add the Pass-through disk as a cluster physical disk resource before modifying the virtual machine configuration. Let the Refresh process take care of all of that for you.
  • The disk must be in an Offline status in Disk Manager before it can be added to the virtual machine configuration as a Pass-through disk.
  • Finally, there is one anomaly when executing this process. If the VM you modified is not the only VM on a LUN, and you later add another VM on the same LUN and make it highly available, the disk corresponding to the Pass-through disk will also be added as a dependency of the new Virtual Machine resource, simply because it is already in the group. The dependencies will have to be corrected manually by editing the properties of the Virtual Machine resource. This is a known issue and will not be fixed.
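On Windows Server 2008 R2, the refresh can also be scripted with the FailoverClusters PowerShell module (Windows Server 2008 RTM has no PowerShell equivalent, so use the snap-in action there). A minimal sketch; the virtual machine name is hypothetical:

Import-Module FailoverClusters
# Refresh the cluster's copy of the VM configuration after out-of-band changes.
# The name is the virtual machine as it appears in Failover Cluster Manager.
Update-ClusterVirtualMachineConfiguration -Name "MyHAVM"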

Thanks again for your attention, and I hope this information helps.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Maximizing Limited Storage Resources in a Windows Server 2008 Failover Cluster


We have encountered scenarios where customers implementing Windows Server 2008 Failover Clusters want to make quite a few services and applications highly available but are not able to purchase additional storage to facilitate this. Or it may be that another business unit within their organization has higher priority for existing storage assets, and their 'lower' priority cluster will just have to make do with the storage that has already been allocated. What is a cluster administrator to do under such circumstances?

In this blog, as an example, I will show you how to configure a highly available File Server Group so that it can also be used to support highly available Print Services. I am using File and Print Services because that is the most common scenario we see in customer environments. To implement this configuration we will be using the concepts explained in KB947050: Advanced resource configuration in Windows Server failover clusters.

Note: Even though these procedures are fully supported, the preference is to use dedicated storage for each highly available service or application. This allows the built-in wizard-based processes to be used, thus ensuring the correct configuration. It is recommended that customers review their current storage utilization within each cluster and discuss with their local Storage Team whether it would be possible to recover any excess storage space on current LUNs, which could then be used to create new LUNs.

The starting point will be an already configured File Server application (CONTOSO-FS1) providing some shared folders for a couple of business groups within an organization.

clip_image002

We will be using the storage in this File Server group to also host the files needed for a highly available Print Server. The procedures we will use will be executed outside of the normal wizard-based process for configuring a highly available Print Server. When using the normal wizard-based process, a highly available Print Server application looks something like this:

clip_image004

The resources in the group include a Client Access Point (CAP) (CONTOSO-PS1), a Print Spooler resource, and a piece of storage for storing print jobs and printer drivers. Additionally, since this was configured using the wizard-based process, we also have the correct group type configured (group type 101), so we will have access to the Print Management interface via the Failover Cluster Management snap-in:

clip_image006

The process we will use will not provide direct access to the Print Management snap-in from inside Failover Cluster Manager and, we will have to take that into account. Let's get started....

Since we are outside of the normal wizard-based process, the cluster will not be able to verify the installation of any prerequisite Roles and\or Features, so the first thing is to ensure we install the Print Server Role on all nodes in the cluster.

clip_image008

Once the Print Server role is installed, the Print Management snap-in is listed under Administrative Tools.

clip_image010

The Print Management snap-in provides management capabilities for print services running either in the context of the local node or a highly available Print Server using a configured Client Access Point (CAP).

clip_image012

With the Print Server role installed, we can move forward with the manual configuration of the resources we will need to place in the highly available File Server group.

The Print Spooler resource requires dependencies on a Network Name and a Physical Disk resource. Both of these resources already exist and are Online in the group, so we could use them. However, we will choose to create a new Client Access Point (CAP) (CONTOSO-PS2) so users can connect using another Network Name. In the Actions pane on the right, select the Add a resource action and choose Client Access Point.

clip_image014

This starts a wizard where we create a NetBIOS name and IP Address (IP address information would not be requested if using DHCP).

clip_image016

Once the resource is created, bring it Online.

Next, create a Print Spooler resource.

clip_image018

The Print Spooler resource is created and placed in an Offline state. The resource is Offline because additional configuration is required before it can be brought Online. If the resource were brought Online at this point, it would fail, and the resulting failover would take the whole group down, disrupting any user connections to the shared folders (more on this later).

clip_image020

To complete the configuration, right-click the resource and select Properties. On the General tab, you can change the display name for the resource (optional), but you must enter a path to a folder on the storage where spooled print jobs and printer drivers can be stored. I created a Spooler directory on the storage, so I enter that path information.

clip_image022

Next, select the Dependencies tab and add dependencies for the storage and the CAP.

clip_image024

Verify the default setting for the Policies and Advanced Policies tabs and then click OK. Bring the Print Spooler resource Online.

clip_image026

Test failover to other nodes in the cluster to be sure you have high availability.
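For reference, the manual steps above can also be driven with cluster.exe once the Client Access Point (CONTOSO-PS2) exists. A rough sketch using this walkthrough's names; the resource display name, group name, disk resource name, and spool folder path are all specific to my example and must be adjusted:

# Create the Print Spooler resource in the existing File Server group.
cluster res "Print Spooler" /create /group:"CONTOSO-FS1" /type:"Print Spooler"
# Add the required dependencies on the CAP and the shared disk.
cluster res "Print Spooler" /adddep:"CONTOSO-PS2"
cluster res "Print Spooler" /adddep:"Cluster Disk 1"
# Point the spooler at a folder on the shared storage, then bring it Online.
cluster res "Print Spooler" /priv "DefaultSpoolDirectory=F:\Spooler"
cluster res "Print Spooler" /online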

Earlier, I mentioned we would not be able to manage highly available printers in the Failover Cluster Management snap-in when we configure a Print Spooler resource using the method above. If we were to open the Print Management snap-in, accessible in Administrative Tools, we would only see the local cluster node listed.

clip_image028

We have to manually add the new CAP we created for the Print Spooler resource.

clip_image030

Once we complete this action, both the local and the highly available Print Server will be visible in the management interface.

clip_image032

Looking at the final result, where we have a single grouping of resources consisting of both File Server and Print Server resources, we need to consider what happens in the event of a failure.

clip_image034

This is a critical question because, if the default behavior is in effect ("if restart is unsuccessful, fail over all resources in this service or application"), then all the resources in the group will be taken Offline, moved to another node in the cluster, and brought back Online.

clip_image036

In this scenario, if the Print Spooler resource were to fail and could not be restarted on the node that owns it, the entire group, which includes the File Server resource and its associated shared folders, would be taken Offline and moved to another node in the cluster. This interrupts all client connections to the shared folders until the resources are brought back Online on another node and the connections re-established. A decision may be required here: perhaps it is more important that access to the File Server be maintained while the Print Spooler is in a Failed state. To accomplish this, a cluster administrator would uncheck the box on both the Print Spooler resource and its associated Client Access Point so that failures of either will not result in a failover. You would not do this for the disk resource, because that is the single resource shared by both services and you want it to fail over. (A command-line equivalent follows the screenshot below.)

clip_image038
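That checkbox maps to the RestartAction common property on each resource: 2, the default, restarts the resource and then fails over the group if restarts are unsuccessful; 1 restarts the resource but never triggers a failover; 0 does not restart it at all. A sketch using this example's resource names (verify the behavior in a test environment before relying on it):

cluster res "Print Spooler" /prop RestartAction=1
cluster res "CONTOSO-PS2" /prop RestartAction=1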

To ensure the modified settings result in the desired behavior, we can simulate failures on the modified resources and observe the results. Here, I am simulating a failure on the Client Access Point for the Print Server.

clip_image040

After a single successful restart of the resource, execute another simulated failure and the resource will go into a Failed state but will not force a failover of the group.

clip_image042
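For scripted testing, the same failure can be simulated from the command line; again, the resource name is from this example:

cluster res "CONTOSO-PS2" /fail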

So there you have it: a method for maximizing existing storage resources in a cluster. Before we wrap up, I want to emphasize again that the preferred method is to have dedicated storage for each highly available application or service and not to try to multi-purpose current storage.

I hope this information proves to be useful to someone and keep the cards and letters coming.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Recovering a Deleted Cluster Name Object (CNO) in a Windows Server 2008 Failover Cluster


Greetings once again from the support trenches here on the CORE team.  I want to talk a bit about a Windows Server 2008 Failover Cluster issue that appears to be on the rise.  What we are seeing is the Computer Object for the Cluster Name (a.k.a. the Cluster Name Object (CNO)) being removed from Active Directory, resulting in the Cluster Name no longer being able to function properly.  This does not happen automatically.  It requires some sort of human interaction, either someone consciously going into AD and deleting the object or running a script (process) that deletes it.  However it is being done, it appears to us that the implications are not fully understood, and there is no quick recovery.  In this blog, I hope to provide information that will help prevent this scenario from happening in your organization.  Along the way, I want to provide some 'value-add' information by discussing how the cluster computer objects relate to each other.

The first step to preventing this from happening in your organization is to be sure there is a clear understanding of the cluster security model in Windows Server 2008.  Rather than spend a whole lot of time and space here rehashing what is already publicly available, I refer you to the following:

KB 947049: Description of the Failover Cluster Security Model in Windows Server 2008.

Failover Cluster Step-by-Step Guide: Configuring Accounts in Active Directory

After reviewing the materials, you should have an understanding of how security works in Windows Server 2008 Failover Clusters and an appreciation for the importance of not removing (or disabling) the Computer Objects created in Active Directory by the cluster.  By default, the Computer Objects created by the cluster are all placed in the Computers container.  They can be relocated to another OU, or even pre-staged in an OU before the cluster is created.  If pre-staging, be sure to review the requirements in the Step-by-Step Guide already mentioned.  As an example (Figure 1), I created a Cluster OU and moved the cluster nodes and their associated objects into it.

clip_image002

Figure 1

You may want to consider implementing a similar practice in your organization, as it groups the cluster objects together, thereby reinforcing the idea that this grouping of objects is 'special' in some way.

Before moving forward and discussing the actual recovery process, I want to spend a little time reviewing the cluster 'family tree' to help you gain an understanding of how cluster objects are related.  To illustrate, I will use a cluster named W2K8-CLUS (Figure 2) in the CONTOSO domain.

clip_image004

Figure 2


 

This cluster is located in the Cluster OU shown in Figure 1.  Using Regedit.exe, I open the cluster registry hive and inspect the properties for the cluster.  I can see the name of the cluster and the resource GUID for the Cluster Name.

clip_image006

Figure 3

Expanding the Resource GUID corresponding to the Cluster Name, I inspect additional properties for the resource.  Selecting the Parameters entry displays the ObjectGUID for the cluster Computer Object in Active directory (Figure 4).

clip_image008

Figure 4
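If you would rather not browse the registry by hand, a short PowerShell sketch can walk the cluster hive on a node and print the Parameters values for the Cluster Name resource (value names are as observed here; treat this as illustrative):

# Resource keys are GUID-named, so match on the Name value instead.
Get-ChildItem HKLM:\Cluster\Resources | ForEach-Object {
    if ((Get-ItemProperty $_.PSPath).Name -eq 'Cluster Name') {
        Get-ItemProperty "$($_.PSPath)\Parameters" | Format-List Name, ObjectGUID
    }
}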


 

In Figure 5, we see the attribute in Active Directory (you must enable Advanced Features in Active Directory Users and Computers before the Attribute Editor tab is visible).  You can also use ADSIEdit to view the same information.

clip_image010

Figure 5

The Cluster Name Object (CNO) functions as the primary security context for the cluster.  The CNO is responsible for creating any additional Computer Objects (Virtual Computer Objects (VCOs)) associated with the cluster.  These Computer Objects represent Network Name resources in a cluster.  A Network Name resource is created as part of a Client Access Point (CAP).  Each Computer Object created by a cluster CNO contains an Access Control Entry (ACE) for the CNO in the Access Control List (ACL) for the object.  The CNO is also responsible for synchronizing the password for each VCO in the domain.  The VCOs associated with a particular CNO can be determined either by manually inspecting the ACL of each VCO in AD or from the information in the cluster registry.


 

Opening the cluster registry hive and inspecting the properties of the Cluster Name resource, we can see an entry called ObjectGUIDS.  This is a listing of each Computer Object created by the CNO in Active Directory.  In Figure 6, I have four Computer Objects in Active Directory associated with this cluster.

clip_image012

Figure 6

One of them is a Computer Object (VCO) associated with the CAP representing a highly available Print Server (CONTOSO-PS1) in this cluster (Figure 7).

clip_image014

Figure 7

Well, there you have it…the cluster family tree.

So, what happens if the Cluster Name Object is deleted from Active Directory?  A few important things –

·         The Cluster Name, if Online, will stay Online, but it will fail to come Online again if the resource is cycled (it will be placed in a Failed state).  This prevents connecting to the cluster remotely to administer it.

·         The security context for the cluster is lost.  This prevents the passwords for all associated VCOs from being synchronized within the domain.  Also, any user, service or other process needing permission to access cluster objects will fail to be authenticated.

·         No more CAPs can be created in the cluster.

Besides the items listed above, there are other indications of problems.  The Cluster Name resource in the Cluster Core Resources group will be in a Failed state.  Attempts to bring the resource Online will generate a pop-up error (Figure 8).

clip_image016

Figure 8

A FailoverClustering error (Event ID 1207) will be registered in the System event log (Figure 9).

clip_image018

Figure 9

The cluster log will report a failure to locate the CNO Computer Object in Active Directory (Figure 10).

clip_image020

Figure 10

It is, therefore, very important that the CNO's Computer Object in the domain not be deleted.

How does one recover from this?  The supported ways to recover an Active Directory object that has been accidentally, or intentionally, deleted are described in the following articles and will not be covered in detail here –

KB840001: How to restore deleted user accounts and their group memberships in Active Directory

TechNet   Content -   Recovering Active Directory Domain Services

Additionally, there are third-party solutions that can be used to protect Active Directory objects and\or recover them if deleted. Finally, as a last-ditch effort, when there is no other alternative, there is a free utility called ADRestore (32-bit only) that can be used to recover the Computer Object associated with the CNO.  Please review the following information before deciding to use this utility –

Microsoft Supportability Newsletter– Using ADRestore tool to restore deleted objects

Any of these methods can be used, but they may end up being time consuming, expensive, or both.

Once the Computer Object has been recovered from Active Directory, the Repair Active Directory object action can be used to restore functionality in the cluster (Figure 11).

clip_image022

Figure 11

Note:  The logged-on user performing the Repair action must have rights to administer the cluster and must have the Reset Passwords right in the domain.

I personally believe ‘an ounce of prevention is worth a pound of cure.’ To that end, my top recommendation is to implement the steps outlined in the section Preventing unwanted deletions in the TechNet Content already mentioned above.  Beginning with Windows Server 2008, objects in Active Directory, such as the Computer Object shown here (Figure 12), can be protected from accidental deletion by simply checking a box – Protect object from accidental deletion.

clip_image024

Figure 12
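Checking the box simply adds a Deny ACE for Everyone covering Delete and Delete Subtree on the object, so the same guard can be applied from the command line, which is handy when protecting many objects at once. A sketch using dsacls and the distinguished name from this example (adjust the DN for your environment):

dsacls "CN=W2K8-CLUS,OU=Cluster,DC=contoso,DC=com" /D "EVERYONE:SDDT"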

With this ‘guard’ in place, when an object is selected for deletion, the first pop-up is presented (Figure 13).

clip_image026

Figure 13

If Yes is selected, the next error is presented to the user (Figure 14), thus preventing the deletion.

clip_image028

Figure 14

If this isn’t enough, there is more help coming in Windows Server 2008 R2.  Active Directory Domain Services in Windows Server 2008 R2 will include an optional feature called the Active Directory Recycle Bin.  This feature is not enabled by default and must be enabled explicitly.  Details about the feature can be found on TechNet –

TechNet Content – Active Directory Recycle Bin Step-by-Step Guide

That about wraps it up for this installment.  As usual, we hope this information is useful.  Come back and visit.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q3)


It is time to update everyone on the issues our support engineers have been seeing for Hyper-V for the past quarter.  The issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q3 have not changed much from Q1\Q2.  Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether.  There will probably be one more blog for the Q4 results.  Additionally, I would like to mention that we are highly recommending the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #2

After the latest updates off Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to change the state of the virtual machine 'vmname'.
'vmname' failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine 'vmname'.
'VMName' failed to initialize.
An attempt to read or update the virtual machine configuration failed.
'VMName' failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Issue #3

After the Hyper-V role is installed, a customer creates a virtual machine but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS.  In some cases, the server may need to be physically shut down (powered off completely, not just restarted) in order for the new BIOS settings to take effect.
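Before rebooting into the BIOS, it is worth confirming what the boot configuration says about the hypervisor. A quick check with in-box tools (no special configuration assumed):

# Should report hypervisorlaunchtype Auto on a working Hyper-V host.
bcdedit /enum | findstr /i hypervisorlaunchtype
# The Hyper-V-Hypervisor channel under Applications and Services Logs
# also records whether the hypervisor actually started at boot.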

Virtual Devices\Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached.

Resolution: Perform the steps documented in KB969266.

Issue #3

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free up disk space to allow the merge to complete.

Issue #2

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.

Issue #3

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Integration Components

Issue #1

A Windows 2000 (SP4) virtual machine with the Integration Components installed may shut down slowly.

Cause:  This problem is caused by a bug in the Windows Software Trace Pre-Processor (WPP) tracing macro (outside of Hyper-V).

Resolution:  KB959781 documents the workarounds for this issue on Server 2008.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

·         The requested operation cannot be performed on a file with a user-mapped section open. (0x800704C8)

·         ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).

·         The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can be caused by antivirus software that is installed in the parent partition and the real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution:  Perform the steps documented in KB961804.

Issue #2

Creating or starting a virtual machine fails with the following error:

'General access denied error' (0x80070005).

Cause:  This issue can be caused by the Intel IPMI driver.

Resolution:  Perform the steps documented in KB969556.

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.
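A quick sketch for spotting the volume that is out of space; WMI is used so it works on Windows Server 2008 without any extra modules:

Get-WmiObject Win32_LogicalDisk -Filter "DriveType=3" |
    Select-Object DeviceID,
        @{n='FreeGB';e={[math]::Round($_.FreeSpace/1GB,1)}},
        @{n='SizeGB';e={[math]::Round($_.Size/1GB,1)}}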

High Availability (Failover Clustering)

Issue #1

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is now available which covers how to configure Hyper-V on a Failover Cluster.

Issue #2

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  When virtual machine settings are changed on a VM that’s on a Failover Cluster, you must select the ‘Refresh virtual machine configuration’ option before the VM is moved to another node.  There is a blog that discusses this.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to backup a Hyper-V virtual machine:

·         If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

·         The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is now available to address issues with backing up and restoring Hyper-V virtual machines.
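To check for the writer and, if it is missing, restart the management service as described above, a minimal sketch (vmms is the short name of the Hyper-V Virtual Machine Management service; restarting it does not stop running virtual machines, but verify that in your own environment first):

vssadmin list writers | findstr /i "Hyper-V"
# If the writer is absent, restart the management service to re-register it.
Restart-Service vmms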

Issue #2

How to backup virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

Network connectivity issues

Cause: NIC teaming software

Resolution: Remove the NIC teaming software. Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Issue #3

Customers inquiring if Hyper-V supports NIC Teaming.

Resolution: Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Import/Export

Issue #1

Importing a virtual machine may fail with the following error:

A Server error occurred while attempting to import the virtual machine. Failed to import the virtual machine from import directory <Directory Path>. Error: One or more arguments are invalid (0x80070057).

Resolution: Perform the steps documented in KB968968.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

·         An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

·         A negative ping time is displayed when you use the ping command.

·         Perfmon shows high disk queue lengths

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q4)

$
0
0

It is time for the final installment of a year-long segment on the top issues in Hyper-V.  It is fitting, since Windows Server 2008 R2 has finally been released, and we can look forward to tracking\reporting any issues we may find in the new version of Hyper-V.  As always, the issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q4 have not changed much from Q1\Q2\Q3.  Hopefully, the more people read our updates, the fewer occurrences we will see of some of these, and eventually they will disappear altogether (if you have been following this blog series, you will notice some already have).  Additionally, we continue to highly recommend the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

After the Hyper-V role is installed, a customer creates a virtual machine, but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS. In some cases, the server needs to be physically shut down in order for the new BIOS settings to take effect.

Issue #2

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #3

After the latest updates off Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to change the state of the virtual machine 'vmname'.
'vmname' failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine 'vmname'.
'VMName' failed to initialize.
An attempt to read or update the virtual machine configuration failed.
'VMName' failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Virtual Devices or Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Corrupted virtual hard disk (VHD) file.

Cause: The most common cause was a power outage or the server wasn’t shutdown properly.

Resolution: Restore the VHD file from backup.

Issue #3

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached (driver issue - Storvsp.sys).

Resolution: Perform the steps documented in KB969266.

Issue #4

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.

Issue #2

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Issue #3

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free disk space to allow the merge to complete or move the .VHD and .AVHD file(s) to a volume with sufficient disk space and manually merge the snapshots.

Integration Components

Issue #1

On Windows Server 2008, when you attempt to install the Integration Components in a Hyper-V virtual machine running Windows Vista Service Pack 2, the installation may fail with the following error:

An error has occurred: One of the update processes returned error code 1.

Cause: This issue occurs if the management operating system (parent partition) that has the Hyper-V role installed does not have Service Pack 2 installed. If you have a virtual machine that’s running Windows Vista Service Pack 2, you need to use the Vmguest.iso from Service Pack 2 to install the Integration Components.

Resolution: Perform the steps documented in KB974503.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

  • The requested operation cannot be performed on a file with a user-mapped section open. ( 0x800704C8 )
  • ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).
  • The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can be caused by antivirus software that is installed in the parent partition and the real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution: Perform the steps documented in KB961804.

Issue #2

Customer has multiple Hyper-V servers and virtual machines are getting duplicate MAC addresses.

Resolution: Configure the Hyper-V servers to use unique MAC address ranges by modifying the MinimumMacAddress and MaximumMacAddress registry values on each Hyper-V server. This issue is documented on TechNet: http://technet.microsoft.com/en-us/library/dd582198(WS.10).aspx. On Server 2008 R2, the MAC address ranges can be configured in the UI.
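The values live under the Virtualization key mentioned in that TechNet article; a sketch for inspecting them on a host (both are REG_BINARY, so they display as byte arrays):

Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization' |
    Select-Object MinimumMacAddress, MaximumMacAddress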

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.

High Availability (Failover Clustering)

Issue #1

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  We have a KB article (KB 2000016) which discusses this issue for Windows 2008. On Windows 2008 R2, the experience has improved. If the virtual machine settings are modified within the Failover Cluster Management console, changes that are made to the VM will be saved to the Cluster (i.e. synchronized across all nodes in the cluster). If you make changes to the VM using the Hyper-V Manager Console, you must select the refresh virtual machine configuration option before the VM is moved to another node. This issue is documented in the Windows Server 2008 R2 help file. There is also a blog that discusses this.

Issue #2

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is available which covers how to configure Hyper-V on a Failover Cluster.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to backup a Hyper-V virtual machine:

·         If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

·         The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is available to address issues with backing up and restoring Hyper-V virtual machines.

Issue #2

How to backup virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

After the customer configured a virtual machine to use a VLAN ID, the virtual machine is unable to access the network.

Cause: The VLAN ID used by the virtual machine didn’t match the VLAN ID configured on the network switch.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Issue #3

How to configure a virtual machine to use a VLAN.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

·         An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

·         A negative ping time is displayed when you use the ping command.

·         Perfmon shows high disk queue lengths

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.

As always, we hope this has been informative for you.

BTW – Did I mention we are strongly recommending installing Windows Server 2008 SP2 on all Hyper-V servers?  Have a good one!

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Failover Cluster Validation Firewall Error in Windows Server 2008 R2


An issue involving a firewall configuration error in the cluster validation process just surfaced here in Microsoft Support, so I thought I would post a quick blog to not only inform our readership but to ‘nip this in the bud’ before we start seeing more.

After running a Windows Server 2008 R2 Failover Cluster validation report, you may see the following error –

“An error occurred while executing the test.  There was an error verifying the firewall configuration.  An item with the same key has already been added”

The error, as is, does not provide a clear direction to take when troubleshooting.  Thanks to the efforts of the Cluster Product Group, the source of the issue was identified, and a quick data collection process can be executed to help determine the root cause.

The firewall configuration error is reported if any of the network adapters across the cluster nodes being validated have the same Globally Unique Identifier (GUID).  This can be determined by running the following WMI query on each node in the cluster and comparing the results.  I chose to run the query inside PowerShell to display sample data in a formatted list –

Get-WmiObject Win32_NetworkAdapter | fl Name,GUID

clip_image002

The sample output above shows the information associated with the three physical network adapters that exist in one of the nodes in my cluster.  After the data is gathered from each node in the cluster, you just need to compare it and identify the duplicate GUID information.
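To avoid eyeballing the output, the comparison can be scripted from a single machine. A sketch with hypothetical node names; any GUID that groups more than one adapter is the culprit:

$nodes = 'NODE1','NODE2'   # hypothetical; list every node in the cluster
$adapters = foreach ($node in $nodes) {
    Get-WmiObject Win32_NetworkAdapter -ComputerName $node |
        Select-Object @{n='Node';e={$node}}, Name, GUID
}
# Ignore adapters with no GUID, then flag GUIDs that appear more than once.
$adapters | Where-Object { $_.GUID } | Group-Object GUID |
    Where-Object { $_.Count -gt 1 } | ForEach-Object { $_.Group }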

The next logical question is, “How does one end up in this predicament?”  In the cases we have encountered thus far, the cluster nodes were being deployed in an unsupported manner: in each case, an ‘image’ was being used to deploy the nodes, and we discovered that the operating system image had not been properly prepared (for example, by running Sysprep) before being deployed.

Hopefully this information will be useful and will help avoid further occurrences of this issue.  Thanks again and please come back.

Additional References:

Failover Cluster Step-by-Step Guide: Validating hardware for a Failover Cluster

KB 943984:  The Microsoft Policy for Windows Server 2008 Failover Clusters

Deployment Tools Technical Reference

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters


In this blog, I would like to explore some of the inner workings of the Resource Hosting Subsystem (RHS), which is responsible for monitoring the health of the various cluster resources being provided as part of highly available services in a Failover Cluster.  A Windows Server 2008 Failover Cluster can provide high availability using a variety of resources, some of which are included as part of the Failover Cluster feature and others as part of ‘cluster-aware’ applications like SQL and Exchange.  Resources are designed to work together and are typically organized in Resource Groups (Figure 1).  For example, a group of resources supporting a highly available File Server may consist of one or more of the following types of resources: Client Access Point (IP Address(es) + Network Name resource), Physical Disk (storage), and a File Server.  A highly available SQL instance could contain the following resources: Client Access Point (IP Address + Network Name resource), Physical Disk (storage), SQL Server, and SQL Server Agent.  Cluster resources are supported by special ‘plugins’, or resource Dynamic Link Libraries (DLLs), that include coding to allow them to properly integrate and interoperate with the cluster service.

image

Figure 1

A Windows Server 2008 Failover Cluster is capable of hosting an unlimited number of resources.  The management of these resources is the responsibility of the Resource Control Manager (RCM) and the Resource Host Subsystem (RHS) which provide this functionality as part of the Cluster Service itself (Figure 2). 

image

Figure 2

The Resource Control Manager (RCM) is part of the overall cluster architecture and is responsible for implementing failover mechanisms and policies for the cluster service as well as establishing and maintaining the dependency tree (Figure 3) for each resource (e.g. a File Server resource requires a dependency on a Client Access Point and a Storage resource). 

image

Figure 3

The Resource Control Manager maintains the state for individual resources (Online, Offline, Failed, Online Pending, and Offline Pending) as well as for Resource Groups (Online, Offline, Partial Online, and Failed).  The Resource Control Manager can execute the following actions on a group of resources – Move, Failover and Failback.  Which action is executed depends on several factors including the current ‘health’ of resources in the group, administrative actions taken on the group (e.g. Move Group), or the current policies in effect for the group.  Here is an example (Figure 4) of Failover and Failback Group Policies –

image

Figure 4

Individual resources have policies (Figure 5) that apply to them as well.

imageimage

Figure 5

The Resource Hosting Subsystem (RHS) is responsible for initially hosting all resources that come Online in the cluster in one default process – rhs.exe (Resource Host Monitoring process) (Figure 6).

image

Figure 6

Note:  The rhs.exe *32 process supports 32-bit resource DLLs running in the cluster.

In previous versions of Microsoft clustering, this was called the Resource Monitor process (resrcmon.exe) (Figure 7).

image

Figure 7

There is one exception to this rule, implemented in the Windows Server 2008 R2 Failover Clustering feature.  In Windows Server 2008 R2, the Cluster Group (which consists of the Cluster Network Name resource, one or more associated IP Address resources, and a ‘witness’ resource) and the Available Storage group are considered ‘critical’ cluster resource groupings and are hosted in an rhs.exe process separate from all other cluster resources.

The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly.  This is accomplished by executing IsAlive and LooksAlive checks, which are specific to the type of resource.  Examples of these are documented in the following KB article –

KB 914458: Behavior of the LooksAlive and IsAlive functions for the resources that are included in the Windows Server Clustering component of Windows Server 2003.

How often health checks are conducted is determined by the specific resource DLL or by a policy set by the cluster administrator (an example of this policy is shown in Figure 5).  Should a resource fail to respond to a lightweight LooksAlive check, a more in-depth IsAlive check is conducted.  If a resource fails an IsAlive check, additional policies are executed until it is determined that the resource cannot run on a particular node in the cluster.  At that point, RHS notifies the Resource Control Manager, which reports the resource as Failed to the cluster service, and a failover is executed to move the Resource Group to another node in the cluster, provided the default policy (Figure 8) is in effect.

image

Figure 8

There are times when a cluster administrator will choose not to implement the default policy shown in Figure 8 for specific ‘non-critical’ resources.  This reduces instability in the cluster which could adversely impact clients connected to highly available service(s). 

The IsAlive and LooksAlive health monitoring function is but a small part of what can be done with cluster resources.  Figure 9 shows a listing of additional Resource DLL Entry-Point functions. 

image

Figure 9

Note:  Information on the Failover Cluster APIs can be found on MSDN.

Failure of an IsAlive call into a resource is but one way resources can become unavailable in the cluster.  Other ways include:

  • Deadlocks in a resource DLL
  • Crashes in a resource DLL
  • RHS process itself terminates in the cluster
  • Cluster service fails on the node
  • Operating system failures (e.g. resource exhaustion)

Most of us who have been working with clusters for a long period of time understand what happens if a resource fails a critical health check.  I want to spend a little time discussing resource deadlocks. 

What is a resource ‘deadlock’?  Basically, there are two common causes of instability within a resource DLL: the resource DLL itself crashes (e.g. an access violation in the resource DLL), or the resource fails to respond to a command in a timely fashion.  Every time a call is made into a resource, a timer is started.  If a response is not received within a specific (configurable) period of time, the resource is considered deadlocked; the RHS process hosting that resource is terminated, and the resource is placed in a newly created RHS process, thereby isolating it from all the other resources running in the default rhs.exe process.  When a deadlock happens, the Failover Cluster service registers an event in the cluster log.  Here is an example of a deadlock occurring in the ‘Cluster Name’ resource –

000008c8.00002528::2009/06/17-20:07:57.900 WARN  [RCM] ResourceControl(GET_NETWORK_NAME) to Network Name (email) returned 5910.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR   [RHS] RhsCall::DeadlockMonitor: Call LOOKSALIVE timed out for resource 'Cluster Name'.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR   [RHS] Resource Cluster Name handling deadlock. Cleaning current operation and terminating RHS process.

000008c8.00001cc4::2009/06/17-20:07:58.009 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster Name', gen(0) result 4.

000008c8.00001cc4::2009/06/17-20:07:58.009 WARN  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'Cluster Name' has crashed or deadlocked; marking it to run in a separate monitor.

 Figure 10

Entries are also made in the Windows System Event Log.  Here is an example –

06/17/2009 04:07:58 PM  Error         Server1.contoso.com. 1230    Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM                Cluster resource 'Cluster Name' (resource type '', DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.

06/17/2009 04:07:58 PM  Critical      Server1.contoso.com. 1146    Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM                The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor.


Figure 11

Information on these specific Failover Cluster error messages can be found on TechNet.  The information for the two events shown in Figure 11 is shown in Figure 12.


image

Figure 12

In Windows Server 2008 R2, RHS events are registered with Windows Error Reporting.  These events can be viewed in the Action Center under Control Panel.  All RHS issues will be listed under the category ‘Failover Cluster Resource Host Subsystem.’

Examining the properties of a cluster resource highlights some of the information we have been discussing.  Figure 13 points out some of the pertinent properties of a resource.

image

Figure 13

MonitorProcessID:  Indicates the Process Identifier (PID), visible in Task Manager, of the rhs.exe process associated with this resource.  If multiple resources have been placed in their own RHS processes, it can be difficult to discern which process is associated with which resource; examining the properties of the specific resource can help.

Note:  The Process ID is not displayed by default in Task Manager.  You need to add the column to the display by selecting View in the menu bar, choosing Select Columns from the drop-down list, and checking the box for PID (Process Identifier).

SeparateMonitor:  Indicates if the resource has been placed in a separate monitor (0:No, 1:Yes).

IsAlivePollInterval:  Default, as shown, meaning the resource uses the default interval for this specific resource type.

LooksAlivePollInterval:  Default, as shown, meaning the resource uses the default interval for this specific resource type.

DeadlockTimeout:  Default setting of 5 minutes.
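These properties can also be read, and some of them set, with cluster.exe. A short sketch using the Cluster Name resource from this example:

cluster res "Cluster Name" /prop
# Proactively isolate a suspect resource in its own RHS process:
cluster res "Cluster Name" /prop SeparateMonitor=1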

Resource deadlock detection was actually introduced in Windows Server 2003 clusters; however, it was not turned on by default.  Figure 14 illustrates this.

image

Figure 14

Deadlock detection is turned on by default in Windows Server 2008 (RTM + R2) and cannot be disabled.

So, what is the moral of this story?  It is important to understand that cluster resource deadlocks are a symptom of a larger problem.  The deadlock itself is not the problem; the cluster is the victim of a problem that exists either internal to the cluster node itself or somewhere external to the cluster.  Applying a logical troubleshooting methodology can help pinpoint where the problem may exist.  But to do that requires a few pieces of knowledge –

  1. Identification of the specific resource that is deadlocked.
  2. What is the entry point that is failing?
  3. What is the entry point trying to do?

Using the example provided in Figures 10 and 11, we can see there was a deadlock in the Cluster Name resource during a LooksAlive entry point call.  Understanding what is evaluated during a LooksAlive check for a Network Name resource may help identify the problem, which could end up being local to the node or could involve connectivity to a DNS server on the network.  Referring back to KB 914458, the cluster resource DLL (ClusRes.dll) is responsible for Network Name resource health checking (IsAlive\LooksAlive tests).  Some of the tests that are conducted include:

·         Determining if the Network Name (NetBIOS name) is still registered on the network stack on the node.  Opening a command prompt on a node and running an nbtstat -n command to view the local NetBIOS name table will show the registrations for cluster Network Name resources.  Here is an example of a Network Name supporting a Client Access Point for a File Server –

image

    Inspecting the Parameter data for the resource in the cluster registry hive confirms the information –

image

  • Determine the result of a DNS registration attempt (dynamic DNS is required for this test).
  • If the Require DNS property is set and registration fails, then the IsAlive\LooksAlive test fails.

If all DNS registrations fail and the NetBIOS name is no longer registered locally on the node, the Network Name is no longer considered reachable and the resource is placed in a Failed state. Recovery processes are initiated by the cluster service on the local node first.  If local recovery fails, the Group containing the Failed Network Name resource could be moved to another node in the cluster.

What are some things that can be done to help avoid, or at least mitigate, situations where a deadlock may occur?  While not set in stone, here are some of my personal recommendations:

  1. Make sure the operating system (OS) is running with the latest service pack plus any post-service pack updates that pertain to Failover Cluster, networking or storage connectivity.
  2. If running highly available Microsoft applications like SQL or Exchange, ensure they are updated as well.
  3. Consult with the storage vendor and ensure the shared storage is updated and configured correctly to work in a Microsoft Failover Cluster.  Most storage vendors maintain a current support matrix.
  4. Ensure there are reliable and redundant communications paths between all nodes in the cluster.
  5. Ensure there is reliable connectivity between all nodes in the cluster and Active Directory.
  6. Document all third-party products that are running in the cluster and ensure they are fully updated. Third-party products that interact with storage or network connectivity are always potential suspects.
  7. Use the cluster validation process to help troubleshoot issues seen in a cluster.
  8. If you are a Cluster Administrator, you must be aware of all changes being implemented in the corporate infrastructure to determine potential impacts on highly available services.

Hopefully, you will find this information useful.  Thanks again and please come back.

Additional References:

http://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Using Pass-Through Disks in Conjunction with Clustered Shared Volumes (CSV) in Windows Server 2008 R2 Failover Clusters


This blog is essentially an update, or follow-up if you will, to the original blog I wrote for Windows Server 2008 Failover Clustering.  With the release of Windows Server 2008 R2 comes the ability to Live Migrate highly available virtual machines between nodes in a Failover Cluster.  As if that were not enough to get customers excited about the product, we also include a feature called Cluster Shared Volumes (CSV), which is designed to work in conjunction with making virtual machines highly available in Windows Server 2008 R2 Failover Clustering using the Live Migration functionality.  Some users are a little confused about CSV and whether they must use CSV with Live Migration, or whether they can still take advantage of pass-through disks in a virtual machine configuration.  I am here to tell you that you can use both simultaneously if desired, and that is what we will discuss here.

The ultimate goal is to arrive at the configuration shown here, where a highly available virtual machine is using a CSV volume for storing configuration files and the virtual hard disk (VHD) supporting the base OS, while still taking advantage of a pass-through disk for data storage.

clip_image002

We start off with the assumption that a virtual machine has already been made highly available using a CSV volume to store the configuration file and the base OS VHD, and that Live Migration has already been tested and is working properly.

clip_image004

Next, we prepare a new LUN that has been presented to the cluster to be used as a pass-through disk.  In the Disk Management interface we can see the LUN has been presented and is Offline.

clip_image006

Since this is a LUN that the cluster has never seen before, we need to bring the disk Online first.

clip_image008

Once the disk is Online, we need to initialize the disk, which writes a signature so the operating system can identify it.

clip_image010

Note:  The cluster uses the disk signature as one attribute for uniquely identifying storage that it controls.  How do we know the cluster has control of a disk?

Bonus material:  The disk shows as Reserved when a cluster has control of it.

clip_image012

Once the disk has been initialized, take the disk Offline.  There is no need to partition the drive as that will be done by the OS running in the virtual machine.

clip_image014
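
For those who prefer the command line, the Online/Initialize/Offline sequence can be scripted with diskpart.  This is only a sketch; the disk number is hypothetical, so run list disk first and confirm you have the correct LUN before touching it:

    # Feed a diskpart script from PowerShell; disk 2 is an assumed number.
    # 'convert mbr' writes the signature (initialization) on a raw disk.
    "select disk 2",
    "online disk",
    "attributes disk clear readonly",
    "convert mbr",
    "offline disk" | diskpart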

In order for the disk to be used by a highly available virtual machine, it must be under control of the cluster.  In the Failover Cluster Management snap-in, add the disk to the cluster.

clip_image016

Select the disk that is Offline.

clip_image018

Once added to the cluster, the new storage is placed in the Available Storage group and is brought Online.

clip_image020
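
The same step can be scripted with the R2 Failover Clustering cmdlets.  A sketch, assuming the new LUN is the only disk the cluster can see but does not yet control:

    # Add available storage to the cluster; it lands in Available Storage
    Import-Module FailoverClusters
    Get-ClusterAvailableDisk | Add-ClusterDisk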

Bonus material:  Can a pass-through disk be used in the CSV namespace?

If we try to add the disk to the CSV namespace –

clip_image022

 

The process will not complete –

clip_image024

Reviewing the error –

clip_image026

So, the answer is that pass-through disks cannot be added to a CSV namespace because the drive must be Offline in preparation for being configured in a VM.  If the disk is Offline, the partition information cannot be read, and CSV requires an NTFS partition.

If you try to configure the virtual machine to use a disk that is not under the control of the cluster, you will see this pop-up when the virtual machine configuration is refreshed by the cluster.

clip_image028

Viewing the Details, the error is –

clip_image030

The reason is also explained –

clip_image032

With the disk added as a cluster resource and Online in the Available Storage group, access the settings for the running virtual machine by right-clicking the virtual machine resource and selecting Settings (or by selecting Settings in the lower right-hand Actions Pane).

clip_image034

In R2, we can Hot-Add a hard disk to a pre-existing SCSI controller (if you wanted to use an IDE controller, the virtual machine would have to be shut down).  Execute the task by selecting the SCSI Controller and then selecting Hard Drive and Add.

clip_image036

Make sure to select the correct disk from the drop down list.

clip_image038

The refresh of the virtual machine should complete successfully and the new disk will be added to the group containing the Virtual Machine.

clip_image040

 

An examination of the resource dependencies for the virtual machine resource now includes the new disk that was added.

clip_image042
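
The dependency tree can also be checked from PowerShell.  A sketch with a hypothetical resource name:

    # Show the dependency expression for the virtual machine resource
    Get-ClusterResource "Virtual Machine DEMO-VM" | Get-ClusterResourceDependency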

At this point, test Live Migration of the group to ensure all resources will come Online on other nodes in the cluster.

clip_image044
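
The same test can be driven from PowerShell using the R2 cmdlets (the group and node names here are hypothetical):

    # Live migrate the virtual machine group to another node
    Move-ClusterVirtualMachineRole -Name "DEMO-VM" -Node "NODE2"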

When Live Migration completes, we have achieved the desired configuration.

clip_image046

The final step is to prepare the new storage in the VM itself.

clip_image048

That wraps it up for this blog.  I hope you have found this information useful.  Come back and see us.

Additional references:

Configuring Pass-through Disks in Hyper-V

Microsoft Cluster Team Blog

Virtualization TechNet Center

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Windows Server 2008 R2 Live Migration – “The devil may be in the networking details.”


Windows Server 2008 R2 has been publicly available now for only a short period of time, but we are already seeing a good adoption rate for the new Live Migration functionality as well as the new Cluster Shared Volumes (CSV) feature. I personally have worked enough issues now where Live Migration is failing that I felt a short blog on what process I have followed to work through these may have some value.

It is important to mention right up front that there is information publicly available on the Microsoft TechNet site that discusses Live Migration and Cluster Shared Volumes. This content also includes some troubleshooting information. I acknowledge that a lot of people do not like to sit in front of a computer monitor and read a lot of text to try and figure out how to resolve an issue. I am one of those people. Having said that, let’s dive in.

It has been my experience thus far that issues preventing Live Migration from succeeding have to do with proper network configuration. In this blog, I will address the main network-related configuration items that need to be reviewed in order to be sure Live Migration has the best chance of succeeding. I begin with an initial set of assumptions: the R2 Hyper-V Failover Cluster has been properly configured and all validation tests have passed without failure; the highly available VM(s) have been created using cluster shared storage; and the virtual machine(s) are able to start on at least one node in the cluster.

I start off by identifying the virtual machines that will not Live Migrate between nodes in the cluster. While it should not be necessary in Windows Server 2008 R2, I recommend first running a ‘refresh’ process on each virtual machine experiencing an issue with Live Migration. I say it should not be necessary because a lot of work was done by the Product Group to more tightly integrate the Failover Cluster Management interface with Hyper-V. Beginning with R2, virtual machine configuration and management can be done using the Failover Cluster Management interface. Here is a sample of some of the actions that can be executed using the Actions Pane in Failover Cluster Manager.

clip_image002

If virtual machine configuration and management is accomplished using the Failover Cluster Management interface, any configuration changes made to a virtual machine should be automatically synchronized across all nodes in the cluster. To ensure this has happened, I begin by selecting each virtual machine resource individually and executing a Refresh virtual machine configuration process as shown here –

clip_image004

The process generates a report when it completes. The desired result is shown here –

clip_image006

If the process completes with a Warning or Failure, examine the contents of the report, fix the issue(s) reported, and run the process again until it completes successfully.
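
For reference, the same refresh can be run from PowerShell on R2.  A sketch with a hypothetical virtual machine name:

    # Synchronize the clustered VM configuration across all nodes
    Import-Module FailoverClusters
    Update-ClusterVirtualMachineConfiguration -Name "DEMO-VM"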

If the refresh process completes without Failure, try to Quick Migrate the virtual machine to each node in the cluster to see if it succeeds.

clip_image008

If a Quick Migration completes successfully, that confirms the Hyper-V Virtual Networks are configured correctly on each node and the processors in the Hyper-V servers themselves are compatible. The most common problem with the Hyper-V Virtual Network configuration is that the naming convention used is not the same on every node in the cluster. To determine this, open the Hyper-V Management snap-in, select the Virtual Network Manager in the Actions pane and examine the settings.

clip_image010

The information shown below (as seen in my cluster) must be the same across all the nodes in the cluster (which means each node must be checked). This includes not only spelling but ‘case’ as well (i.e. PUBLIC is not the same as Public) –

clip_image012
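
Checking every node by hand gets tedious in larger clusters. Here is a sketch that pulls the virtual network names from each node over WMI; the node names are hypothetical, and this relies on the root\virtualization namespace used by pre-2012 Hyper-V:

    # Compare Hyper-V virtual network names (including case) across nodes
    $nodes = "NODE1","NODE2"
    foreach ($node in $nodes) {
        Get-WmiObject -ComputerName $node -Namespace root\virtualization `
            -Class Msvm_VirtualSwitch |
            Select-Object @{Name="Node";Expression={$node}}, ElementName
    }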

It is important to be able to successfully Quick Migrate all virtual machines that cannot be Live Migrated before moving forward in this process. If the virtual machine can Quick Migrate between all nodes in the cluster, we can begin taking a closer look at the networking piece.

Start verifying the network configuration on each node in the cluster by first making sure the network card binding order is correct. In each cluster node, the Network Interface Card (NIC) supporting access to the largest routable network should be listed first. The binding order can be accessed using the Network and Sharing Center, Change adapter settings. In the Menu bar, select Advanced and from the drop down list choose Advanced Settings. An example from one of my cluster nodes is shown here where the NIC (PUBLIC-HYPERV) that has access to the largest routable network is listed first.

clip_image014

Note: You may also want to review all the network connections that are listed and Disable those that are not being used by either the Hyper-V server itself or the virtual machines.

On each NIC in the cluster, ensure Client for Microsoft Networks and File and Printer Sharing for Microsoft Networks are enabled (i.e. checked). This is a requirement for CSV, which requires SMB (Server Message Block).

clip_image016

Note: Here is where people usually get into trouble because they are familiar with clusters and have been working with them for a very long time, maybe even as far back as the NT 4.0 days. Because of that, they have developed a habit for configuring cluster networking that is basically outlined in KB 258750. This article does not apply to Windows Server 2008.

Note: If CSV is configured, all cluster nodes must reside on the same non-routable network. CSV (specifically for re-directed I/O) is not supported if cluster nodes reside on separate, routed networks.

Next, verify the local security policy and ensure NTLM security is not being restricted by a local or domain level policy. This can be determined by Start > Run > gpedit.msc > Computer Configuration > Windows Settings > Security Settings > Local Policies > Security Options. The default settings are shown here –

clip_image018

In the virtual machine resource properties in the Failover Cluster Management snap-in, set the Network for Live Migration ordering such that the highest speed network that is enabled for cluster communications and is not a Public network is listed first. Here is an example from my cluster. I have three networks defined in my cluster –

clip_image020

The Public network is used for client access, management of the cluster, and for cluster communications. It is configured with a Default Gateway and has the highest metric defined in the cluster for a network the cluster is allowed to use for its own internal communications. In this example, since I am also using iSCSI, the iSCSI network has been excluded from cluster use. The corresponding listing on the virtual machine resource in the Network for live migration tab looks like this –

clip_image022

Here, I have unchecked the iSCSI network as I do not want Live Migration traffic being sent over the same network that is supporting the storage connection. The Cluster network is totally dedicated to cluster communications only so I have moved that to the top as I want that to be my primary Live Migration network.

Note: Once the live migration network priorities have been set on one virtual machine, they will apply to all virtual machines in the cluster (i.e. it is a Global setting).
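
To see how the cluster ranks its networks, the metrics can be inspected with PowerShell.  The network with the lowest metric that is enabled for cluster use is preferred for internal cluster traffic:

    # Review cluster network roles and metrics
    Get-ClusterNetwork | Format-Table Name, Role, Metric, AutoMetric -AutoSize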

Once all the configuration checks have been verified and changes made on all nodes in the cluster, execute a Live Migration and see if it completes successfully.

Bonus material:

There are configurations that can be put in place that can help live migrations run faster and CSV perform better. One thing that can be done is to disable NetBIOS on the NIC supporting the primary network used by CSV for redirected I/O. This should be a dedicated network and should not be supporting any traffic other than internal cluster communications, redirected I/O for CSV and\or live migration traffic.

clip_image024
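
The same change can be scripted through WMI.  A sketch only; the adapter here is selected by a hypothetical static IP address on the CSV network, so adjust the filter to match your environment:

    # Disable NetBIOS over TCP/IP (2 = disable) on the CSV network adapter
    $nic = Get-WmiObject Win32_NetworkAdapterConfiguration |
           Where-Object { $_.IPAddress -contains "10.0.0.31" }
    $nic.SetTcpipNetbios(2)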

Additionally, on the same network interface supporting live migration, you can enable larger packet sizes to be transmitted between all the connected nodes in the cluster.

clip_image026

If, after making all the changes discussed here, live migration is still not succeeding, then perhaps it is time to open a case with one of our support engineers.

Thanks again for your time, and I hope you have found this information useful. Come back again.

Additional resources:

Using Live Migration with Cluster Shared Volumes in Windows Server 2008 R2

High Availability Product Team Blog

Hyper-V and Virtualization on Microsoft TechNet

Windows Server 2008 R2 Hyper-V Forum

Windows Server 2008 R2 High Availability Forum

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Be Sure to Plan Carefully When Virtualizing Your Infrastructure


There is a lot of excitement around Microsoft virtualization technologies these days and rightfully so.  One of the ‘hottest’ areas right now appears to be making virtual machines highly available using Windows Server 2008 R2 Failover Clusters so end users can take maximum advantage of Live Migration and Cluster Shared Volumes (CSV).  This configuration not only saves a lot of money but also provides business continuity in the event of an unforeseen failure in  the environment.

While I could spend time extolling the virtues of our virtualization technologies, I am really here to discuss what can happen if one were to get too ‘overzealous’ and not use common sense and a sound plan for implementing the solution correctly.  As with many of the blogs you read here on the CORE blog site, they have been written because of experiences we have had with our customers.  This one is no different.

So, what happens when a customer decides they love Microsoft virtualization and high availability technologies so much that they want to virtualize their entire infrastructure?  And suppose they want to be sure it is highly available, so they create a multi-node Failover Cluster to host the virtual machines.  When the customer completes the project, they are very proud of what they have done, because now they can retire their old hardware and save tons of money on power and cooling costs in their datacenter.  Everyone is happy and celebrations abound.  And then it happens: someone decides they need to shut down the cluster(s); the reason does not matter.  After a while, when they decide it is OK to bring the cluster(s) back online, they cannot.  Oh, and one more thing: the clusters are running on Windows Server 2008 R2 CORE.  Trust me, this is a true story and has already happened more than once, hence the impetus behind this blog.

If the predicament is not immediately obvious, and it should be for cluster veterans, I will tell you that the cluster service will fail to start because it cannot contact a Domain Controller somewhere in Active Directory.  And this is because all of the Domain Controllers and DNS servers (critical infrastructure servers) have been virtualized and are, in fact, virtual machines currently supported by the very cluster that is trying to start up.  Clearly, this is a case of having all of one's eggs in one basket – not good.

How did we fix this?  It was not a quick fix.  In a nutshell, the Support Engineer had the customer determine which storage LUN was hosting the VM files for one of the virtualized Domain Controller\DNS servers.  Then, the LUN was mapped to a standalone server so the VHD file could be copied off to another standalone Hyper-V server, where a new VM could be created and placed in service.  Once this was accomplished, the cluster could be started.

How can this type of scenario be avoided? 

1. Develop a solid, well-thought-out migration plan.  Ensure the planning team includes people who understand how all the technologies function in a virtualized environment.

Note:  Please review KB 888794: Considerations when hosting Active Directory domain controllers in virtual hosting environments.

2. Have at least one physical Domain Controller\DNS server available in the environment.

3. If #2 is not an option, distribute the virtualized infrastructure servers across multiple Hyper-V clusters and hope they will not all be Offline at the same time.

4. Plan to have one or more Hyper-V servers running in a WORKGROUP configuration.  Hyper-V servers do not have to be joined to an Active Directory domain.  Then distribute some of the virtualized infrastructure servers across these servers.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 1)


The Windows Server 2008 Failover Clustering feature provides high availability for services and applications. To ensure applications and services remain highly available, it is imperative the cluster service running on each node in the cluster function at the highest level possible. Providing redundant and reliable communications connectivity among all the nodes in a cluster plays a large role in ensuring the smooth functioning of the cluster. Configuring proper communications connectivity within a failover cluster not only provides access to highly available services required by clients but also guarantees the connectivity the cluster requires for its own internal communications needs. The sections that follow discuss Windows Server 2008 Failover Clustering networking features, functionality and recommended processes for the proper configuration and implementation of network connectivity within a cluster.

The following sections provide the information needed to understand failover cluster networking and to properly implement it.

Windows Server 2008 Failover Cluster networking features

Windows Server 2008 Failover Clustering introduces new networking capabilities that are a major shift away from the way things have been done in legacy clusters (Windows 2000\2003 and NT 4.0). Some of these take advantage of the new networking features that are included as part of the operating system and others are a result of feedback that has been received from customers. The new features include:

  • A new cluster network driver architecture
  • The ability to locate cluster nodes on different, routed networks in support of multi-site clusters
  • Support for DHCP assigned IP addresses
  • Improvements to the cluster health monitoring (heartbeat) mechanism
  • Support for IPv6

New cluster network driver architecture

The legacy cluster network driver (clusnet.sys) has been replaced with a new NDIS level driver called the Microsoft Failover Cluster Virtual Adapter (netft.sys). Whereas the legacy cluster network driver was listed as a Non-Plug and Play Driver, the new fault tolerant adapter actually appears as a network adapter when hidden devices are displayed in the Device Manager snap-in (Figure 1).

image

Figure 1: Device Manager Snap-in

The driver information is shown in Figure 2.

image

Figure 2: Microsoft Failover Cluster Virtual Adapter driver

The cluster adapter is also listed in the output of an ipconfig /all command on each node (Figure 3).

image

Figure 3: Microsoft Failover Cluster Virtual Adapter configuration information

The Failover Cluster Virtual Adapter is assigned a Media Access Control (MAC) address that is based on the MAC address of the first enumerated (by NDIS) physical NIC in the cluster node (Figure 4) and uses an APIPA (Automatic Private Internet Protocol Addressing) address.

image

Figure 4: Microsoft Failover Cluster Virtual Adapter MAC address

The goal of the new driver model is to sustain TCP/IP connectivity between two or more systems despite the failure of any component in the network path. This goal can be achieved provided at least one alternate physical path is available. In other words, a network component failure (NIC, router, switch, hub, etc…) should not cause inter-node cluster communications to break down, and communication should continue making progress in a timely manner (i.e. it may have a slower response but it will still exist) as long as an alternate physical route (link) is still available. If cluster communications cannot proceed on one network, the switchover to another cluster-enabled network is automatic. This is one of the primary reasons that each cluster node must have multiple network adapters available to support cluster communications and each one should be connected to different switches.

The failover cluster virtual adapter is implemented as an NDIS miniport adapter that pairs an internally constructed virtual route with each network found in a cluster node. The physical network adapters are exposed at the IP layer on each node. The NETFT driver transfers packets (cluster communications) on the virtual adapter by tunneling through the best available route in its internal routing table (Figure 5).

image

Figure 5: NetFT traffic flow diagram

Here is an example to illustrate this concept. A 2-Node cluster is connected to three networks that each node has in common (Public, Cluster and iSCSI). The output of an ipconfig /all command from one of the nodes is shown in Figure 6.

image

Figure 6: Example Cluster Node IP configuration

Note: Do not be concerned with the name ‘Microsoft Virtual Machine Bus Network Adapter’ as these examples were derived from cluster nodes running as Guests in Hyper-V.

The Microsoft Failover Cluster Virtual Adapter configuration information for each node is shown in Figure 7. Keep in mind, the default port for cluster communication is still TCP\UDP: 3343.

image

Figure 7: Node Failover Cluster Virtual Adapter configuration information
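
A quick way to confirm the cluster service endpoints on a node is to filter netstat output for port 3343:

    # Show endpoints for cluster communications
    netstat -ano | Select-String ":3343"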

When the cluster service starts, and a node either Forms or Joins a cluster, NETFT, along with other components, is responsible for determining the node’s network configuration and connectivity with other nodes in the cluster. One of the first actions is establishing connectivity with the Microsoft Failover Cluster Virtual Adapter on all nodes in the cluster. Figure 8 shows an example of this in the cluster log.

image

Figure 8: Microsoft Failover Cluster Virtual Adapter information exchange

Note: You can see in Figure 8 that the endpoint pairs consist of both IPv4 and IPv6 addresses. The NETFT adapter prefers to use IPv6 and therefore will choose the IPv6 addresses for each end point to use.

As the cluster service startup continues, and the node either Forms or Joins a cluster, routing information is added to NETFT. Using the three networks mentioned previously, Figure 9 shows each route being added to a cluster.

image

Route between 1.0.0.31 and 1.0.0.32

image

Route between 192.168.0.31 and 192.168.0.32

image

Route between 172.16.0.31 and 172.16.0.32

Figure 9: Routes discovered and added to NETFT

Each ‘real’ route is added to the ‘virtual’ routes associated with the virtual adapter (NETFT). Again, note the preference for NETFT to use IPv6 as the protocol of choice.

The capability to place cluster nodes on different, routed networks in support of Multi-Site Clusters

Beginning with Windows Server 2008 failover clustering, individual cluster nodes can be located on separate, routed networks. This requires that resources that depend on IP Address resources (i.e., Network Name resources) implement OR logic, since it is unlikely that every cluster node will have a direct local connection to every network the cluster is aware of. This facilitates IP Address, and hence Network Name, resources coming online when services\applications fail over to remote nodes. Here is an example (Figure 10) of the dependencies for the cluster name on a machine connected to two different networks.

image

Figure 10: Cluster Network Name resource with an OR dependency

IP addresses associated with a Network Name resource that come online will be dynamically registered in DNS (if configured for dynamic updates). This is the default behavior. If the preferred behavior is to register all IP addresses that a Network Name depends on, then a private property of the Network Name resource must be modified. This private property is called RegisterAllProvidersIP (Figure 11). If this property is set to 1, all IP addresses will be registered in DNS, and the DNS server will return the list of IP addresses associated with the A-Record to the client.

image

Figure 11: Parameters for a Network Name resource
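
On Windows Server 2008 R2, this private property can be read and changed with the Failover Clustering cmdlets.  A sketch using a hypothetical Network Name resource; the change takes effect the next time the resource is brought Online:

    # View, then enable, RegisterAllProvidersIP on a Network Name resource
    Get-ClusterResource "FS-CAP" | Get-ClusterParameter RegisterAllProvidersIP
    Get-ClusterResource "FS-CAP" | Set-ClusterParameter RegisterAllProvidersIP 1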

Since cluster nodes can be located on different, routed networks, and the communication mechanisms have been changed to use reliable session protocols implemented over UDP (unicast), the networking requirements for Geographically Dispersed (Multi-Site) Clusters have changed. In previous versions of Microsoft clustering, all cluster nodes had to be located on the same network. This required ‘stretched’ VLANs be implemented when configuring multi-site clusters. Beginning with Windows Server 2008, this requirement is no longer necessary in all scenarios.

Support for DHCP assigned IP addresses

Beginning with Windows Server 2008 Failover Clustering, cluster IP address resources can obtain their addressing from DHCP servers as well as via static entries. If the cluster nodes themselves have at least one NIC that is configured to obtain an IP address from a DHCP server, then the default behavior will be to obtain an IP address automatically for all cluster IP address resources. The new 'wizard-based' processes in Failover Clustering understand the network configuration and will only ask for static addressing information when required. If the cluster node has statically assigned IP addresses, the cluster IP address resources will have to be configured with static IP addresses as well. Cluster IP address resource IP assignment follows the configuration of the physical node and each specific interface on the node. Even if the nodes are configured to obtain their IP addresses from a DHCP server, individual IP address resources can be changed to static addresses (Figure 12).

image

Figure 12: Changing DHCP assigned to Static IP address


Improvements to the cluster ‘heartbeat’ mechanism

The cluster ‘heartbeat’, or health checking mechanism, has changed in Windows Server 2008. While still using port 3343, it is no longer a broadcast communication. It is now unicast in nature and uses a Request-Reply type process. This provides for higher security and more reliable packet accountability. Using the Microsoft Network Monitor protocol analyzer to capture communications between nodes in a cluster, the ‘heartbeat’ mechanism can be seen (Figure 13).

image

Figure 13: Network Monitor capture

A typical frame is shown in Figure 14.

image

Figure 14: Heartbeat frame from a Network Monitor capture

There are properties of the cluster that address the heartbeat mechanism; these include SameSubnetDelay, CrossSubnetDelay, SameSubnetThreshold, and CrossSubnetThreshold (Figure 16).

image

Figure 16: Properties affecting the cluster heartbeat mechanism

The default configuration (shown here) means the cluster service will wait 5.0 seconds before considering a cluster node to be unreachable and regrouping to update the view of the cluster (one heartbeat sent every second for five seconds). The limits on these settings are shown in Figure 17. Make changes to the appropriate settings depending on the scenario. The CrossSubnetDelay and CrossSubnetThreshold settings are typically used in multi-site scenarios where WAN links may exhibit higher than normal latency.

image

Figure 17: Heartbeat Configuration Settings

These settings allow for the heartbeat mechanism to be more ‘tolerant’ of networking delays. Modifying these settings, while a worthwhile test as part of a troubleshooting procedure (discussed later), should not be used as a substitute for identifying and correcting network connection delays.
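
On R2, these properties can be inspected and adjusted from PowerShell.  A sketch only; the value shown is an example, not a recommendation:

    # View the current heartbeat settings
    Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

    # Example: allow 2 seconds between same-subnet heartbeats
    (Get-Cluster).SameSubnetDelay = 2000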

Support for IPv6

Since the Windows Server 2008 OS supports IPv6, the cluster service needs to support this functionality as well. This includes being able to support IPv6 IP Address resources and IPv4 IP Address resources either alone or in combination in a cluster. Clustering also supports IPv6 Tunnel Addresses. As previously noted, inter-node cluster communications by default use IPv6. For more information on IPv6, please review the following:

Microsoft Internet Protocol Version 6

In the next segment, I will discuss Implementing networks in support of Failover Clusters (Part 2).  See ya then.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 2)


In Part 1, I discussed Windows Server 2008 Failover Cluster networking features.  In this segment, I will discuss implementing networks in a Failover Cluster.

Implementing networks in support of Failover Clusters

The main consideration when designing Failover Cluster networks is to ensure there is built-in redundancy for cluster communications.  This is typically accomplished by having a minimum of two physical Network Interface Cards (NICs) installed in each node that will be part of the cluster.  These cards must be supported by two separate and distinct buses (e.g. two PCI NICs).  Many people think a single multi-port NIC meets this requirement – it does not, as this configuration creates a single point of failure for all cluster communications.  The best configuration would be two multi-port NICs running on separate buses, with fault tolerance implemented by way of NIC Teaming software (provided by 3rd party vendors) and physically connected to separate network switches.

Note:  NIC Teaming is not supported on iSCSI connections.  Please review the iSCSI Cluster Support: Frequently Asked Questions.  The appropriate fault-tolerant mechanism for iSCSI connectivity would be multi-path software. Please review the Microsoft Multi-path I/O: Frequently Asked Questions.

There are two primary design scenarios when planning for Failover Cluster network connectivity.  In the first scenario (and the most common), all nodes in the cluster are located on the same networks.  In the second scenario, nodes in the cluster are located on separate and distinct routed networks (this is very common in multi-site cluster implementations).  Figure 18 shows an example of the second scenario.

clip_image002

Figure 18:  Multi-site cluster (network connectivity only)

Note:  Even though it is supported to locate cluster nodes on separate, routed networks, it is still supported to connect nodes in a multi-site cluster using stretched Virtual Local Area Networks (VLAN).  This configuration places the nodes on the same network(s).

It is important in any cluster that no NICs on the same node are configured to be on the same subnet.  This is because the cluster network driver uses the subnet to identify networks, and it will use the first NIC detected and ignore any other NICs configured on the same subnet on the same node.  The cluster validation process will register a Warning if any network interfaces in a cluster node are configured to be on the same network.  The only possible exception to this would be for iSCSI (Internet Small Computer System Interface) connections.  If iSCSI is implemented in a cluster, and MPIO (Multi-Path Input/Output) is being used for fault-tolerant connections to iSCSI storage, then it is possible that the network interfaces could be on the same network.  In this configuration, the iSCSI network in Failover Cluster Manager should be configured so the cluster will not use it for any cluster communications.

Note:  Please consult the iSCSI Cluster Support: Frequently Asked Questions.

As previously mentioned, Windows Server 2008 accommodates cluster nodes being located on separate, routed networks by including new OR logic for IP Address resources.  Figure 19 illustrates this.

clip_image004

Figure 19:  IP Address Resource OR logic

When a Network Name resource is configured with an OR dependency on more than one IP Address resource, this means at least one of the IP Address resources must be able to come Online before the Network Name resource can come Online.  Since a Network Name resource can be associated with more than one IP Address, there is a property of a Network Name resource that can be modified so DNS registrations will occur for all of the IP Addresses.  The property is called RegisterAllProvidersIP (See Figure 20).

clip_image006

Figure 20:  Network Name resource properties

Note:  In Figure 20 above, Failover Cluster PowerShell cmdlets were used to access cluster configuration information.  This is new in Windows Server 2008 R2.  For more information, review the TechNet Cmdlet Reference.

The default registration behavior is to register only the IP Address that can come Online on the node.  Changing the setting to 1 enables the other behavior, which can assist name resolution in a multi-site cluster scenario.
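
On R2, the OR dependency itself can also be built from PowerShell.  A sketch with hypothetical resource names and addresses:

    # Network Name comes Online if either IP Address resource is Online
    Set-ClusterResourceDependency -Resource "FS-CAP" `
        -Dependency "[IP Address 192.168.0.40] or [IP Address 172.16.0.40]"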

Note:  Please review KB 947048 for other things to consider when deploying failover cluster nodes on different, routed subnets (multi-site cluster scenario).

While Failover Clusters require a minimum of two NICs to provide reliable cluster communications, there are scenarios where more NICs may be desired and\or required based on the services or applications that are running in the cluster.  One such scenario we already mentioned – iSCSI connectivity to storage.  The other scenario involves Microsoft’s virtualization technology – Hyper-V.

The integration of Failover Clustering with Hyper-V was introduced in Windows Server 2008 (RTM) in the form of making Virtual Machines highly available in a cluster by being able to move (Failover) the Virtual Machines between the nodes in the cluster using a process called Quick Migration.  In Windows Server 2008 R2, additional capabilities were introduced including Live Migration and Cluster Shared Volumes (CSV).  These features improved the high availability story for Virtual machines, but also introduced new networking requirements.   The inner workings of Hyper-V networking will not be discussed here.  For more information, please download this whitepaper (http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=3fac6d40-d6b5-4658-bc54-62b925ed7eea). 

The networking requirements in a Hyper-V Cluster supporting Live Migration and using Cluster Shared Volumes (CSV) can add up quickly as illustrated in Figure 21.

clip_image008

Figure 21: Hypothetical Networking Requirements

For more information on Live Migration and Cluster Shared Volumes in Windows Server 2008 R2, visit the Microsoft TechNet site.

Using Cluster Shared Volumes in a Failover Cluster in Windows Server 2008 R2

Hyper-V:  Using Live Migration with Cluster Shared Volumes in Windows Server 2008 R2

In the next segment, I will discuss Troubleshooting cluster networking issues (Part 3).  See ya then.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 3)


In Part 2, I discussed implementing networks in a Failover Cluster.  In this final segment, I will discuss troubleshooting cluster networking issues.

Troubleshooting cluster networking issues

As previously stated, it is important that redundant and reliable cluster communications connectivity exist between all nodes in a cluster.  However, there may be times when communications connectivity within a cluster gets disrupted, either because of actual network failures or because of misconfigured network connectivity.  A loss of communications connectivity with a node in a cluster can result in the node being removed from cluster membership.  When a node is removed from cluster membership, it will terminate its cluster service to avoid problems or conflicts as other nodes in the cluster take over the services or applications and resources that were hosted on the node that was removed.  The node will attempt to rejoin the cluster when the cluster service restarts.  This problem can also have broader effects, because the loss of a node in a cluster affects 'quorum'.  Should the number of nodes participating in a cluster fall below a majority, all highly available services will be taken Offline until 'quorum' is re-established.  (The quorum model No Majority: Disk Only is the one exception; however, this model is not recommended.)

Here are some recommended troubleshooting procedures for cluster connectivity issues:

1. Examine the system log on each cluster node and identify any errors reporting a loss of communications connectivity in the cluster or even broader network-related issues.  Here are some example cluster-related error messages you may encounter:

clip_image002

Figure 22:  Cluster Network Connectivity error messages

Source:  http://technet.microsoft.com/en-us/library/cc773562(WS.10).aspx

clip_image004

Figure 23:  Network Connectivity and Configuration error messages

Source:  http://technet.microsoft.com/en-us/library/cc773417(WS.10).aspx

2. If the system logs provide insufficient detail, generate the cluster logs and inspect the contents for more detailed information concerning the loss of network connectivity.

Note: Generate the cluster logs by running this PowerShell cmdlet –

clip_image006
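
In text form, with a hypothetical destination folder:

    # Collect cluster.log from every node into one folder
    Get-ClusterLog -Destination C:\ClusterLogs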

3. Verify the configuration of all networks in the cluster.

4. Verify the configuration of network connectivity devices such as Ethernet switches.

5. Run an abbreviated cluster validation process by selecting only the Network tests (a PowerShell sketch follows this list).

clip_image008

The tests that are executed are shown here:

clip_image010

The desired end result is this:

clip_image012

As an example, here is the section in the validation report that shows the results for the List Network Binding Order test –

clip_image014

Some of the common issues seen with respect to the network validation tests include, but may not be limited to:

  • Multiple NICs on a cluster node configured to be on the same subnet.

  • Excessive latency (usually > 2 seconds) in ping tests between interfaces on cluster nodes.

  • Warning that the firewall has been disabled on one or more nodes.

6. Conduct simple networking tests, such as a 'ping' test, across all networks enabled for cluster communications to verify connectivity between the nodes (a reachability sketch also follows this list).  Use network monitoring tools such as Microsoft's Network Monitor to analyze network traffic between the nodes in the cluster (refer to Figures 13 and 14).

7. Evaluate hardware failures related to networking devices such as Network Interface Cards (NICs), network cabling, or network connectivity devices such as switches and routers as needed.

8. Review the change management log (if one exists in your organization) to determine what, if any, changes were made to the nodes in the cluster that may be related to the disruption in communications connectivity.

9. Consider opening a support incident with Microsoft.  If a node is removed from cluster membership, it means no networks configured on that node could be used to communicate with the other nodes in the cluster.  If multiple networks are configured for cluster use, as recommended, then a loss of cluster membership indicates a problem that affects all the networks or the node's ability to send or receive heartbeat messages.
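
Two quick sketches to go with steps 5 and 6 above.  First, the abbreviated validation run can be started from PowerShell on R2 (the category name matches what the validation wizard displays):

    # Run only the Network category of validation tests
    Test-Cluster -Include "Network"

Second, a minimal reachability check across each cluster-enabled network; the addresses are hypothetical, so substitute the remote node's address on each subnet:

    # Ping the partner node across each cluster-enabled network
    "1.0.0.32","192.168.0.32","172.16.0.32" |
        ForEach-Object { Test-Connection -ComputerName $_ -Count 4 }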

Note:  For additional information on troubleshooting Windows Server 2008, consult TechNet.

Hopefully, the information provided in this three part blog was helpful and will assist in properly configuring network connectivity in Windows Server 2008 Failover Clusters.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
