
What is a Microsoft Failover Cluster Virtual Adapter anyway?


A question often asked is, "What is the Microsoft Cluster Virtual Adapter and what can I do with it?" The typical, and correct, answer is to leave it alone and let it just work for you. While that answer satisfies most, others may want just a little more by way of an explanation, so hopefully this blog will provide that.

The networking model in Windows Server 2008 Failover Clustering was rewritten to accommodate new functionality, including the ability to obtain IP addresses from DHCP servers and to locate Cluster nodes on separate, routed subnets. Additionally, communications went from UDP Broadcast transmissions to UDP Unicast, with a smattering of TCP connections thrown in for good measure. What this all adds up to is more reliable and robust communication connectivity within the Cluster, no matter where the Cluster nodes are located. It no longer matters whether Cluster nodes are located in the same physical rack in the same datacenter or in a server rack in a server room at a remote datacenter at the end of an OC3 WAN connection. This makes the Cluster more tolerant of single points of failure, e.g. a Network Interface Card (NIC), and hence the new driver name, Network Fault-Tolerant (NetFT.sys). The only real minimum requirement is multiple (at least two) redundant communication paths between all nodes in the Cluster. This way, the Cluster network driver (NETFT.SYS) can build a complete routing structure to provide the redundant communication connectivity the Cluster needs to keep applications and services highly available.

Note: Not having at least two networks available for cluster communications will result in a Warning (a violation of a 'best practice') being recorded during the Cluster validation process. This is noted in the hardware requirements under the Network Adapters and Cable section.

To provide some examples of this new functionality without getting too deep into the new networking model, I generated a cluster log from a cluster node so I could illustrate how this new network model is reflected as the cluster service starts (a quick sketch of the command used to generate the log follows the list below). In the cluster log, several entries are associated with NETFT. Some of these include, but may not be limited to, the following:

NETFT - Network Fault-Tolerant driver
TM - Topology Manager (discovers and maintains the cluster network topology, reports failures of any networks or network interfaces, and configures the Microsoft Failover Cluster Virtual Adapter)
IM - Interface Manager (responsible for any network interfaces that are part of a cluster configuration)
NETFTAPI - NetFT Application Programming Interface (API)
FTI - Fault-Tolerant Interface
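If you want to follow along on a cluster of your own, the log excerpts shown below can be produced with the built-in cluster.exe tool. This is only a sketch; the switch names are from memory, so run cluster log /? to confirm the exact syntax on your build, and the folder and cluster name are placeholders.

rem Generate the cluster log on each node and copy the results to one folder:
cluster log /gen /copy:"C:\ClusterLogs"

rem The same command can be pointed at a remote cluster:
cluster /cluster:<ClusterName> log /gen /copy:"C:\ClusterLogs"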

As the cluster service starts, there are events registered indicating NETFT is preparing for communications with other pieces of the cluster architecture -

00000784.000007cc::2009/01/30-14:26:38.199 INFO [NETFT] FTI NetFT event handler ready for events.
00000784.000007b0::2009/01/30-14:26:39.369 INFO [NETFT] Starting NetFT eventing for TM
00000784.000007b0::2009/01/30-14:26:39.369 INFO [NETFT] TM NetFT event handler ready for events.
00000784.000007b0::2009/01/30-14:26:39.369 INFO [CS] Starting IM
00000784.000007b0::2009/01/30-14:26:39.369 INFO [NETFT] Starting NetFT eventing for IM
00000784.000007b0::2009/01/30-14:26:39.369 INFO [NETFT] IM NetFT event handler ready for events.

As connectivity is established with other nodes in the cluster, routes are added -

00000784.00000648::2009/01/30-14:26:39.744 INFO [NETFT] Added route <struct mscs::FaultTolerantRoute>
00000784.00000648::2009/01/30-14:26:39.744 INFO <realLocal>172.16.0.181:~3343~</realLocal>
00000784.00000648::2009/01/30-14:26:39.744 INFO <realRemote>172.16.0.182:~3343~</realRemote>
00000784.00000648::2009/01/30-14:26:39.744 INFO <virtualLocal>fe80::2474:73f1:4b12:8096:~3343~</virtualLocal>
00000784.00000648::2009/01/30-14:26:39.744 INFO <virtualRemote>fe80::8b6:30ea:caa3:8da7:~3343~</virtualRemote>
00000784.00000648::2009/01/30-14:26:39.744 INFO <Delay>1000</Delay>
00000784.00000648::2009/01/30-14:26:39.744 INFO <Threshold>5</Threshold>
00000784.00000648::2009/01/30-14:26:39.744 INFO <Priority>99</Priority>
00000784.00000648::2009/01/30-14:26:39.744 INFO <Attributes>1</Attributes>
00000784.00000648::2009/01/30-14:26:39.744 INFO </struct mscs::FaultTolerantRoute>

Additional events are registered as the routes to the nodes become 'reachable' -

00000784.0000039c::2009/01/30-14:26:39.759 DBG [NETFTAPI] Signaled NetftRemoteReachable event, local address 172.16.0.181:003853 remote address 172.16.0.182:003853
00000784.0000039c::2009/01/30-14:26:39.759 DBG [NETFTAPI] Signaled NetftRemoteReachable event, local address 172.16.0.181:003853 remote address 172.16.0.182:003853
00000784.0000039c::2009/01/30-14:26:39.759 DBG [NETFTAPI] Signaled NetftRemoteReachable event, local address 172.16.0.181:003853 remote address 172.16.0.182:003853
00000784.000004f4::2009/01/30-14:26:39.759 INFO [FTI] Got remote route reachable from netft evm. Setting state to Up for route from 172.16.0.181:~3343~ to 172.16.0.182:~3343~.
00000784.000002f4::2009/01/30-14:26:39.759 INFO [IM] got event: Remote endpoint 172.16.0.182:~3343~ reachable from 172.16.0.181:~3343~
00000784.000002f4::2009/01/30-14:26:39.759 INFO [IM] Marking Route from 172.16.0.181:~3343~ to 172.16.0.182:~3343~ as up
00000784.000001f8::2009/01/30-14:26:39.759 INFO [TM] got event: Remote endpoint 172.16.0.182:~3343~ reachable from 172.16.0.181:~3343~
00000784.00000648::2009/01/30-14:26:39.759 INFO [FTW] NetFT is ready after 0 msecs wait.
00000784.00000648::2009/01/30-14:26:39.759 INFO [FTI] Route is up and NetFT is ready. Connecting to node W2K8-CL2 on virtual IP fe80::8b6:30ea:caa3:8da7%15:~3343~
00000784.0000061c::2009/01/30-14:26:39.759 INFO [CONNECT] fe80::8b6:30ea:caa3:8da7%15:~3343~: Established connection to remote endpoint fe80::8b6:30ea:caa3:8da7%15:~3343~.

A consequence of the changes made to the Cluster networking model is that the Cluster network driver now manifests itself as a network adapter - a hidden adapter, but an adapter nonetheless.

image

While this adapter is hidden from normal view (by default) in Device Manager (you must select “Show hidden devices” to see it), it is plainly visible when listing the network configuration of a Cluster node with the ipconfig /all command (a short command-line sketch follows the screenshot).

image
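If you would rather not scroll through the full ipconfig output, findstr can jump straight to the adapter's Description line. This is just a convenience; the description string is the one shown above.

rem Full configuration, including the hidden cluster adapter:
ipconfig /all

rem Or locate just the adapter's Description line:
ipconfig /all | findstr /C:"Microsoft Failover Cluster Virtual Adapter"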

Like other adapters, the Microsoft Failover Cluster Virtual Adapter has a MAC address and both IPv4 and IPv6 addresses assigned to it. The IPv4 address is an Automatic Private IP Addressing (APIPA) address and the IPv6 address is a non-routable Link-Local address, but that does not matter, as all cluster communications are tunneled through the networks supported by the physical NICs, as shown here using the route information obtained during cluster service startup.

image

The MAC address assigned to the Microsoft Failover Cluster Virtual Adapter is based on the MAC address of one of the physical NICs.

image

The Cluster network driver (netft.sys) is a kernel mode driver and is started and stopped by the Cluster Service.

image

The Cluster network driver has an entry under HKLM\System\CurrentControlSet\Services.

image
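Both of these points are easy to verify from a command prompt. NetFT and ClusSvc are the service names assumed here (they match the driver and service discussed above); a quick sketch:

rem NetFT is a kernel driver (TYPE : 1 KERNEL_DRIVER) controlled by the cluster service:
sc qc netft
sc query netft

rem The Cluster service itself:
sc query clussvc

rem The driver's entry in the Services portion of the registry:
reg query HKLM\SYSTEM\CurrentControlSet\Services\NetFT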

Additionally, there is an entry for the Microsoft Failover Cluster Virtual Adapter in the routing table for each Cluster node. Here are sample outputs for the three sections of the route print command executed on a Cluster node. The first part shows the listing of all the interfaces on the node. Interface 15 is the Microsoft Failover Cluster Virtual Adapter.

image

This next screen shows the IPv4 Route Table which reflects three entries for the Microsoft Failover Cluster Virtual Adapter.

image

And finally, the adapter appears in the IPv6 Route Table (If 15).

image
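To reproduce the three views above on your own node, the standard routing commands are all that is needed (the interface number will almost certainly differ from the 15 shown here):

rem Interface list plus the IPv4 and IPv6 route tables in one pass:
route print

rem Or one address family at a time:
route print -4
route print -6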

So, how can one get in trouble? Here are a few ways:

  1. Disabling the Microsoft Failover Cluster Virtual Adapter.
  2. Sysprepping an installation of Windows Server 2008 with the Failover Cluster feature installed. This will cause an error in the Cluster Validation process.
  3. Modifying any properties of the adapter.

Hopefully, this gives you a better feel for this new functionality in Windows Server 2008 Failover Clusters and, as I stated at the beginning of the blog, the correct answer is to not do anything to the adapter - just let it work for you. Thanks, and we hope this has been helpful.

Chuck Timon and John Marlin
Senior Support Escalation Engineers
Microsoft Enterprise Platforms Support


Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q2)


It is time to update everyone on the types of issues our support engineers have been seeing for Hyper-V. The issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed. I think you will notice that the issues for Q2 have not changed much from Q1. Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation

Resolution: Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs. Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

After the Hyper-V role is installed, the customer creates a virtual machine, but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running.

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS. In some cases, the server needs to be physically shut down in order for the new BIOS settings to take effect.
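The BIOS settings themselves must be changed at the hardware level, but the operating system side can be checked with bcdedit. This is only a sketch of the boot entries involved (hypervisorlaunchtype and nx); it does not replace enabling the settings in the BIOS.

rem Confirm the hypervisor is set to launch at boot (should read Auto):
bcdedit /enum | findstr /i hypervisorlaunchtype

rem Enable it if it has been turned off (takes effect at the next restart):
bcdedit /set hypervisorlaunchtype auto

rem DEP (no-execute) must also be enabled; OptIn is the default policy:
bcdedit /set nx OptIn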

Issue #2

System hangs on restart at "Configuring Updates Stage 3 of 3" after the Hyper-V role is enabled, disabled, or updated.

Cause: This issue can be caused by the HP Network Configuration utility.

Resolution: Perform the steps documented in KB950792.

Issue #3

Customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Virtual Devices\Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components need to be installed.

Resolution: Install the Integration Services by opening the Virtual Machine Connection window and then selecting Insert Integration Services Setup Disk on the Action menu.

Issue #2

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys driver issue.

Resolution: Install hotfix KB957967 to address this issue.

Issue #3

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Storvsp.sys driver issue.

Resolution: If a VM has a SCSI controller with no disks attached, this bugcheck can occur. The resolution is to remove any SCSI controllers which don’t have disks attached. This issue is fixed in SP2.

Issue #4

After you move a Windows Vista or Windows Server 2008 virtual machine from Virtual PC or Virtual Server, the Vmbus device fails to load. When you check the properties of the device in device manager, the device status displays one of the following messages:

This device cannot find enough free resources that it can use. (Code 12).

This device cannot start. (Code 10).

Cause: This issue occurs because the Windows Vista or Windows Server 2008 virtual machine is using the incorrect HAL.

Resolution: Perform the steps documented in KB954282.

Issue #5

Unable to associate the virtual COM port to a physical COM port.

Cause: By design (documented in the help file).


Snapshots

Issue #1

Snapshots were lost

Cause: Parent VHD was expanded. If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution: Restore data from backup.

Issue #2

Snapshots were deleted.

Cause: The most common cause is that the customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

Integration Components

Issue #1

A Windows 2000 (SP4) virtual machine with the Integration Components installed may shut down slowly.

Cause: This problem is caused by a bug in the operating system (outside of Hyper-V).

Resolution: KB959781 documents the workarounds for this issue on Server 2008. The issue is fixed in Windows Server 2008 R2.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause: The most common cause is that Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution: Install SP2 in the Server 2003 VM before you install the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

  • The requested operation cannot be performed on a file with a user-mapped section open. ( 0x800704C8 )
  • ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).
  • The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause: This issue can be caused by antivirus software installed in the parent partition when its real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution: Perform the steps documented in KB961804.

Issue #2

Creating or starting a virtual machine fails with the following error:

'General access denied error' (0x80070005)

Cause: This issue can be caused by the Intel IPMI driver.

Resolution: Perform the steps documented on Intel’s site.

High Availability (Failover Clustering)

Issue #1

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is now available which covers how to configure Hyper-V on a Failover Cluster.

Issue #2

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node.

Cause: The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution: When virtual machine settings are changed on a VM that’s on a Failover Cluster, you must select the refresh virtual machine configuration option before the VM is moved to another node. There is a blog that discusses this.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to back up a Hyper-V virtual machine:

· If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

· The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution: An update (KB959962) is now available to address issues with backing up and restoring Hyper-V virtual machines.
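As a quick sketch of the writer check and the service restart mentioned above (vmms is the short name of the Hyper-V Virtual Machine Management service; restarting it briefly interrupts management operations but does not stop running virtual machines):

rem Is the Hyper-V VSS writer registered and stable?
vssadmin list writers | findstr /i "Hyper-V"

rem If the writer is missing or in a failed state, restart the management service:
net stop vmms
net start vmms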

Virtual Network Manager

Issue #1

How to configure a virtual machine to use a VLAN.

Resolution: VLANs are discussed in the following blogs: http://blogs.msdn.com/virtual_pc_guy/archive/2008/03/10/vlan-settings-and-hyper-v.aspx and http://blogs.msdn.com/adamfazio/archive/2008/11/14/understanding-hyper-v-vlans.aspx

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution: The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Adding a Pass-through Disk to a Highly Available Virtual Machine


This blog discusses the proper way to make a configuration change to a highly available virtual machine in a Windows Server 2008 (RTM) Failover Cluster. I will demonstrate how to add a Pass-through disk to a highly available virtual machine by attaching it to a SCSI Controller. You can also use an IDE controller, but I chose to use a SCSI controller because it is not available by default and as we walk through the process, you get the added benefit of seeing how to add it (Note: SCSI controllers require the installation of Integration Components in the Guest).

There are several reasons why Pass-through disks are an attractive option. The main one is that you bypass the Hyper-V server file system and gain faster, direct access to the disk from inside a virtual machine. Accomplishing this requires the disk to be Offline from the operating system's perspective. There are tradeoffs, however: the virtual machine configuration file has to be located somewhere else, you lose the ability to take snapshots, and you cannot use dynamically expanding disks or configure differencing disks. For more information on the storage options for Hyper-V, you can review Jose Barreto's blog on the subject.

I start off with a highly available VM running the Windows Server 2008 operating system. The VM is using a VHD for its boot disk attached to an IDE controller.

clip_image002

Note: Hyper-V virtual machines can only boot from storage attached to IDE controllers.

In the Disk Management snap-in, I can see the disk I am using to support the boot disk for the VM and also the LUN I will be adding as the Pass-through disk attached to a SCSI controller. The disk I will use for the Pass-through disk is Offline (it must be Offline or it cannot be added to the VM).

clip_image004

Note: A new disk must be brought Online and Initialized before it can be used. This process writes a disk signature to the disk so the cluster can use it. Once the disk has been initialized, it can be placed Offline again. No partitioning is required, as that will be accomplished inside the virtual machine. (A command-line sketch of these steps follows.)
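The same Online/Initialize/Offline sequence can be scripted with diskpart if you prefer the command line. This is a hedged sketch: disk 2 is a placeholder for your LUN, and convert mbr is used here simply to write the signature (use convert gpt for disks larger than 2 TB).

rem Save as prep-passthrough.txt and run:  diskpart /s prep-passthrough.txt
select disk 2
online disk
attributes disk clear readonly
rem convert writes the disk signature (the "Initialize" step in Disk Management)
convert mbr
rem take the LUN Offline again so it can be attached to the VM as a Pass-through disk
offline disk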

In the virtual machine, only the boot disk is currently visible in the Disk Management interface.

clip_image006

Since I will be modifying the configuration of the virtual machine, I first need to shut it down. In the Failover Cluster Management snap-in, right-click on the Virtual Machine resource and choose Shut Down.

clip_image008

Leave the Virtual Machine Configuration resource Online or you will not be able to access the machine settings in the Hyper-V Management snap-in.

clip_image010

In the Hyper-V Management Snap-in, start the Add Hardware wizard and choose to add a SCSI Controller as that is the type of interface the Pass-through disk will be attached to.

clip_image012

As part of the wizard, choose to add a hard drive to the SCSI Controller.

clip_image014

Since this will be a Pass-through disk, select the correct disk from the drop down list under Physical hard disk.

clip_image016

Complete the configuration and Start the Virtual Machine in the Failover Cluster Management interface.

With the virtual machine started, open the Disk Management snap-in. I can now see the new disk that was added.

clip_image018

However, I still do not see the new disk added to the virtual machine configuration in Failover Cluster Manager.

clip_image020

Since I made a change to a virtual machine that is under the control of the cluster, I need to inform the cluster service that a change has been made. I accomplish this by running the Refresh virtual machine configuration action in the right-hand pane.

clip_image022

The virtual machine is Saved as part of the process.

clip_image024

Once the refresh is completed, review the report that is generated to see if it was successful.

clip_image026

Examine the details of the report to see what changes were made.

clip_image028

Once I complete the review of the report and inspect the Failover Cluster Management snap-in, I see the new disk added to the group and it is Online.

clip_image030

Restart the virtual machine from the Failover Cluster Management snap-in and complete the configuration of the new storage in the virtual machine.

clip_image032

Once the partitioning and formatting of the volume is complete, refresh the display in the Failover Cluster Management snap-in and the information is updated for the new storage.

clip_image034

Back in the Disk Management snap-in, the disk now shows a Reserved status meaning it is under the control of the cluster (just like the boot disk).

clip_image036

All that remains is to test failover to other nodes in the cluster to ensure the new configuration comes Online successfully.

So, what is important to remember here?

  • When making changes to any highly available virtual machine, you must always Refresh the virtual machine configuration in the Failover Cluster Management snap-in before attempting a failover to another node in the cluster. Ensure the generated report is free of any errors.
  • Do not take the Virtual Machine Configuration resource Offline; if you do, the VM will be removed from the display in the Hyper-V Management snap-in and you will not be able to make any changes to it.
  • Do not add the Pass-through disk as a cluster physical disk resource before modifying the virtual machine configuration. Let the Refresh process take care of all of that for you.
  • The disk must be in an Offline status in Disk Manager before it can be added to the virtual machine configuration as a Pass-through disk.
  • Finally, there is one anomaly when executing this process. If the VM you modified is not the only VM on a LUN, and you later add another VM to that same LUN and make it highly available, the disk serving as the Pass-through disk will also be added as a 'dependency' of the new VM simply because it is already in the group. The dependencies will have to be corrected manually by editing the properties of the Virtual Machine resource (see the sketch after this list). This is a known issue and will not be fixed.
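For the dependency cleanup described in the last bullet, cluster.exe can list and edit dependencies without opening the GUI. The resource and disk names below are placeholders, and the switch spellings are from memory, so check cluster res /? for the exact forms.

rem List the dependencies of the new virtual machine resource:
cluster res "Virtual Machine NEWVM" /listdep

rem Remove the unintended dependency on the Pass-through disk:
cluster res "Virtual Machine NEWVM" /removedep:"Cluster Disk 2"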

Thanks again for your attention, and I hope this information helps.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Maximizing Limited Storage Resources in a Windows Server 2008 Failover Cluster


We have encountered scenarios where customers implementing Windows Server 2008 Failover Clusters want to make quite a few services and applications highly available, but are not able to purchase additional storage to facilitate this. Or, it may be that another business unit within their organization has a higher priority claim on existing storage assets, and their 'lower' priority cluster will just have to make do with the storage already allocated. What is a cluster administrator to do under such circumstances?

In this blog, as an example, I will show you how to configure a highly available File Server Group so that it can also be used to support highly available Print Services. I am using File and Print Services because that is the most common scenario we see in customer environments. To implement this configuration we will be using the concepts explained in KB947050: Advanced resource configuration in Windows Server failover clusters.

Note: Even though these procedures are fully supported, the preference is to use dedicated storage for each highly available service or application. This allows the built-in wizard-based processes to be used, thus ensuring the correct configuration. It is recommended that customers review their current storage utilization within each cluster and discuss with their local Storage Team whether it is possible to recover any excess storage space on current LUNs, which could then be used to create new LUNs.

The starting point will be an already configured File Server application (CONTOSO-FS1) providing some shared folders for a couple of business groups within an organization.

clip_image002

We will be using the storage in this File Server group to also host the files needed for a highly available Print Server. The procedures we will use are executed outside of the normal wizard-based process for configuring a highly available Print Server. When using the normal wizard-based process, a highly available Print Server application looks something like this:

clip_image004

The resources in the group include a Client Access Point (CAP) (CONTOSO-PS1), a Print Spooler resource, and a piece of storage for storing print jobs and printer drivers. Additionally, since this was configured using the wizard-based process, we also have the correct group type configured [101], so we will have access to the Print Management interface via the Failover Cluster Management snap-in:

clip_image006

The process we will use does not provide direct access to the Print Management snap-in from inside Failover Cluster Manager, and we will have to take that into account. Let's get started.

Since we are outside of the normal wizard-based process, the cluster will not be able to verify the installation of any prerequisite Roles and\or Features, so the first thing is to ensure we install the Print Server Role on all nodes in the cluster.

clip_image008

Once the Print Server role is installed, the Print Management snap-in is listed under Administrative Tools.

clip_image010

The Print Management snap-in provides management capabilities for print services running either in the context of the local node or a highly available Print Server using a configured Client Access Point (CAP).

clip_image012

With the Print Server role installed, we can move forward with the manual configuration of the resources we will need to place in the highly available File Server group.

The Print Spooler resource requires a dependency on a Network Name and Physical disk resource. Both of these resources already exist and are Online in the group so we could use the existing resources. However, we will choose to create a new Client Access Point (CAP) (CONTOSO-PS2) so the users can connect using another Network Name. In the Actions pane to the right, select Add a resource action and select a Client Access Point.

clip_image014

This starts a wizard where we create a NetBIOS name and IP Address (IP address information would not be requested if using DHCP).

clip_image016

Once the resource is created, bring it Online.

Next, create a Print Spooler resource.

clip_image018

The Print Spooler resource is created and placed in an Offline state. The resource is Offline because additional configuration is required before it can be brought Online. If the resource were brought Online at this point, it would fail and would take the whole group down because a failover would be initiated. This would disrupt any user connections to shared folders (more on this later).

clip_image020

To complete the configuration, Right-click on the resource and select Properties. On the General tab you can change the display name for the resource (optional) but you must enter a path to a folder on the storage where spooled print jobs and printer drivers can be stored. I created a Spooler directory on the storage so I enter the path information.

clip_image022

Next, select the Dependencies tab and add dependencies for the storage and the CAP.

clip_image024

Verify the default settings on the Policies and Advanced Policies tabs and then click OK. Bring the Print Spooler resource Online.

clip_image026
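The same resource could also be created and wired up with cluster.exe. This is a sketch that assumes the group, Client Access Point, and disk names used in this article; verify the switch spellings with cluster res /? before relying on them.

rem Create the Print Spooler resource inside the existing File Server group:
cluster res "Print Spooler" /create /group:"CONTOSO-FS1" /type:"Print Spooler"

rem Make it dependent on the new Client Access Point and the shared disk:
cluster res "Print Spooler" /adddep:"CONTOSO-PS2"
cluster res "Print Spooler" /adddep:"Cluster Disk 1"

rem Review the private properties (the spool folder path is set in the GUI in this article):
cluster res "Print Spooler" /priv

rem Bring the resource Online once the configuration is complete:
cluster res "Print Spooler" /online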

Test failover to other nodes in the cluster to be sure you have high availability.

Earlier, I mentioned we would not be able to manage highly available printers in the Failover Cluster Management snap-in when we configure a Print Spooler resource using the method above. If we were to open the Print Management snap-in, accessible in Administrative Tools, we would only see the local cluster node listed.

clip_image028

We have to manually add the new CAP we created for the Print Spooler resource.

clip_image030

Once we complete this action, both the local and the highly available Print Server will be visible in the management interface.

clip_image032

Looking at the final result, where we have a single group of resources consisting of both File Server and Print Server resources, we need to consider what happens in the event of a failure.

clip_image034

This is a critical question that needs to be answered, because if the default behavior is left in place ("if restart is unsuccessful, fail over all resources in this service or application"), then all the resources in the application group will be taken Offline, moved to another node in the cluster, and brought back Online.

clip_image036

In this scenario, if the Print Spooler resource were to fail and could not be restarted on the node that owns it, the entire group, which includes the File Server resource and its associated shared folders, would be taken Offline and moved to another node in the cluster. This interrupts all client connections to the shared folders until they are brought back Online on another node and the connections are re-established. A decision may be required here: perhaps it is more important that access to the File Server be maintained while the Print Spooler is in a Failed state. To accomplish this, a cluster administrator would have to uncheck the box on both the Print Spooler resource and the associated Client Access Point so that failures of either of these do not result in a failover. You obviously would not do this for the disk resource, because it is the single common resource used by both services and you would want it to fail over.

clip_image038

To ensure the modified settings result in the desired behavior, we can simulate failures on the modified resources and observe the results. Here, I am simulating a failure on the Client Access Point for the Print Server.

clip_image040

After a single successful restart of the resource, execute another simulated failure and the resource will go into a Failed state but will not force a failover of the group.

clip_image042
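Both the 'affect the group' checkbox and the simulated failure have cluster.exe equivalents. RestartAction is the common property behind the checkbox; the value mapping shown here (0 = do not restart, 1 = restart but do not affect the group, 2 = restart and fail over the group) is from memory, so confirm it with cluster res "name" /prop before relying on it.

rem Keep failures of the print resources from pulling the whole group over:
cluster res "CONTOSO-PS2" /prop RestartAction=1
cluster res "Print Spooler" /prop RestartAction=1

rem Simulate a failure of the Client Access Point to test the behavior:
cluster res "CONTOSO-PS2" /fail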

So there you have it. A method for maximizing existing resources in a cluster. Before we wrap up, I want to emphasize again that the preferred method would be to have dedicated storage for each highly available application or service and not try to multi-purpose current storage.

I hope this information proves to be useful to someone and keep the cards and letters coming.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Recovering a Deleted Cluster Name Object (CNO) in a Windows Server 2008 Failover Cluster


Greetings once again from the support trenches here on the CORE team.  I want to talk a bit about a Windows Server 2008 Failover Cluster issue that appears to be on the rise.  What we are seeing is the Computer Object for the Cluster Name (a.k.a. the Cluster Name Object, or CNO) being removed from Active Directory, resulting in the Cluster Name no longer being able to function properly.  This does not happen automatically.  It requires some sort of human interaction, either by consciously going into AD and deleting the object or by running some script (process) that deletes it.  However it is being done, it appears to us that the implications are not fully understood, and there is no quick recovery from this.  In this blog, I hope to provide information that will help you avoid this scenario within your organization.  Along the way, I want to provide some 'value-add' information by discussing how the cluster computer objects relate to each other.

The first step to preventing this from happening in your organization is to be sure there is a clear understanding of the cluster security model in Windows Server 2008.  Rather than spend a whole lot of time and space here rehashing what is already publicly available, I refer you to the following:

KB 947049: Description of the Failover Cluster Security Model in Windows Server 2008.

Failover Cluster Step-by-Step Guide:  Configuring Accounts in Active directory

After reviewing the materials, you should have an understanding of how security works in Windows Server 2008 Failover Clusters and an appreciation for the importance of not removing (or disabling) the Computer Objects created in Active Directory by the cluster.  By default, the Computer Objects created by the cluster are all placed in the Computers container.  These can be relocated to another OU, or even pre-staged in an OU before the cluster is created.  If pre-staging, be sure to review the requirements in the Step-by-step Guide already mentioned. As an example (Figure 1),  I created a Cluster OU and moved the cluster nodes and their associated objects into the OU. 

clip_image002

Figure 1

You may want to consider implementing a similar practice in your organization as it groups the cluster objects together thereby reinforcing the idea that this grouping of objects is 'special' in some way. 

Before moving forward and discussing the actual recovery process, I want to spend a little time reviewing the cluster 'family tree' to help you gain an understanding of how cluster objects are related.  To illustrate, I will use a cluster named W2K8-CLUS (Figure 2) in the CONTOSO domain.

clip_image004

Figure 2


 

This cluster is located in the Cluster OU shown in Figure 1.  Using Regedit.exe, I open the cluster registry hive and inspect the properties for the cluster.  I can see the name of the cluster and the resource GUID for the Cluster Name.

clip_image006

Figure 3

Expanding the Resource GUID corresponding to the Cluster Name, I inspect additional properties for the resource.  Selecting the Parameters entry displays the ObjectGUID for the cluster Computer Object in Active directory (Figure 4).

clip_image008

Figure 4


 

In Figure 5, we see the attribute in Active directory (must enable Advanced Features before the Attribute Editor tab is visible).  You can also use ADSIEdit to view the same information.

clip_image010

Figure 5

The Cluster Name Object (CNO) functions as the primary security context for the cluster.  The CNO is responsible for creating any additional Computer Objects (Virtual Computer Objects (VCO)) associated with the cluster.  These Computer Objects represent Network Name resources in a cluster.  A Network Name resource is created as part of a Client Access Point (CAP).  Each Computer Object created by a cluster CNO contains an Access Control Entry (ACE) for the CNO on the Access Control List (ACL) for the object.  The CNO is also responsible for synchronizing the password for each VCO in the domain.  The VCOs associated with a particular CNO can be determined either by manually inspecting the ACL for each VCO in AD, or the information can be obtained in the cluster registry. 


 

Opening the cluster registry hive and inspecting the properties of the Cluster Name resource, we can see an entry called ObjectGUIDS.  This is a listing for each Computer Object created by the CNO in Active directory.  In Figure 6, I have four Computer Objects in Active Directory associated with this cluster.  

clip_image012

Figure 6

One of them is a Computer Object (VCO) associated with the CAP representing a highly available Print Server (CONTOSO-PS1) in this cluster (Figure 7).

clip_image014

Figure 7
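The same relationships can be traced from a command prompt on a cluster node. The resource GUID in the second command is a placeholder for the value you find with the first one, and the computer names are the ones used in this example.

rem The cluster hive is loaded under HKLM\Cluster while the cluster service is running:
reg query HKLM\Cluster\Resources

rem The Parameters key of the Cluster Name resource holds the ObjectGUID(s) shown in Figure 6:
reg query "HKLM\Cluster\Resources\<Cluster Name resource GUID>\Parameters"

rem Locate the CNO and a VCO in the domain by name:
dsquery computer -name W2K8-CLUS
dsquery computer -name CONTOSO-PS1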

Well, there you have it…the cluster family tree.

So, what happens if the Cluster Name Object is deleted from Active Directory?  A few important things –

·         The Cluster Name, if Online, will stay Online, but it will fail to come Online again if the resource is cycled (it will be placed in a Failed state).  This prevents you from connecting to the cluster remotely to administer it.

·         The security context for the cluster is lost.  This prevents the passwords for all associated VCOs from being synchronized within the domain.  Also, any user, service or other process needing permission to access cluster objects will fail to be authenticated.

·         No more CAPs can be created in the cluster.

Besides the items listed above, there are other indications of problems.  The Cluster Name resource in the Cluster Core Resources group will be in a Failed state.  Attempts to bring the resource Online will generate a pop-up error (Figure 8)

clip_image016

Figure 8

A FailoverClustering error (Event ID 1207) will be registered in the System Log (Figure 9).

clip_image018

Figure 9

The cluster log will report a failure to locate the CNO Computer Object in Active Directory (Figure 10)

clip_image020

Figure 10

It is, therefore, very important that the CNO's Computer Object in the domain not be deleted.

How does one recover from this?  The supported way(s) to recover an Active Directory object that has been accidentally, or intentionally, deleted are described in the following articles and will not be covered in detail here–

KB840001: How to restore deleted user accounts and their group memberships in Active Directory

TechNet Content - Recovering Active Directory Domain Services

Additionally, there are 3rd party solutions that can be used to protect Active Directory objects and\or recover them if deleted. Finally, as a last ditch effort, and when there is no other alternative, there is a free utility called ADRestore (32-bit only) that can be used to recover the Computer Object associated with the CNO.  Please review the following information before deciding to use this utility –

Microsoft Supportability Newsletter– Using ADRestore tool to restore deleted objects

Any of these methods can be used, but they may end up being time consuming, expensive, or both.

Once the Computer Object has been recovered from Active Directory, the Repair Active Directory object action can be used to restore functionality in the cluster (Figure 11).

clip_image022

Figure 11

Note:  The logged on user that will perform the Repair action must have rights to administer the cluster and must have the right to Reset Passwords in the domain.

I personally believe ‘an ounce of prevention is worth a pound of cure.’ To that end, my top recommendation is to implement the steps outlined in the section Preventing unwanted deletions in the TechNet Content already mentioned above.  Beginning with Windows Server 2008, objects in Active Directory, such as the Computer Object shown here (Figure 12), can be protected from accidental deletion by simply checking a box – Protect object from accidental deletion.

clip_image024

Figure 12
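Under the covers, the checkbox adds a Deny ACE for Everyone covering Delete and Delete Subtree. The dsacls commands below are a sketch of the equivalent from a command prompt; the distinguished name is a placeholder built from this example, and the permission string should be verified with dsacls /? before use.

rem Show the current ACL on the CNO:
dsacls "CN=W2K8-CLUS,OU=Cluster,DC=contoso,DC=com"

rem Deny Everyone the Delete (SD) and Delete Subtree (DT) rights on the object:
dsacls "CN=W2K8-CLUS,OU=Cluster,DC=contoso,DC=com" /D "Everyone:SDDT"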

With this ‘guard’ in place, when an object is selected for deletion, the first pop-up is presented (Figure 13)

clip_image026

Figure 13

If Yes is selected, the next error is presented to the user (Figure 14) thus preventing deletion.

clip_image028

Figure 14

If this isn’t enough, there is more help coming in Windows Server 2008 R2.  Active Directory Domain Services in Windows Server 2008 R2 will include an optional feature called the Active Directory Recycle Bin.  This feature is not enabled by default and must be added.  Details about the feature can be found on TechNet:

TechNet Content – Active Directory Recycle Bin Step-by-Step Guide

That about wraps it up for this installment.  As usual, we hope this information is useful.  Come back and visit.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q3)


It is time to update everyone on the issues our support engineers have been seeing for Hyper-V for the past quarter.  The issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q3 have not changed much from Q1\Q2.  Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether.  There will probably be one more blog for the Q4 results.  Additionally, I would like to mention that we are highly recommending the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #2

After the latest updates from Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to change the state of the virtual machine 'vmname'.
'vmname' failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine 'vmname'.
'VMName' failed to initialize.
An attempt to read or update the virtual machine configuration failed.
'VMName' failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Issue #3

After the Hyper-V role is installed, a customer creates a virtual machine but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS.  In some cases, the server may need to be physically shut down in order for the new BIOS settings to take effect.

Virtual Devices\Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached.

Resolution: Perform the steps documented in KB969266.

Issue #3

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys driver issue.

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free up disk space to allow the merge to complete.

Issue #2

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.

Issue #3

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Integration Components

Issue #1

A Windows 2000 (SP4) virtual machine with the Integration Components installed may shut down slowly.

Cause:  This problem is caused by a bug in the Windows Software Trace Pre-Processor (WPP) tracing macro (outside of Hyper-V).

Resolution:  KB959781 documents the workarounds for this issue on Server 2008.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

·         The requested operation cannot be performed on a file with a user-mapped section open. (0x800704C8)

·         ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).

·         The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can be caused by antivirus software installed in the parent partition when its real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution:  Perform the steps documented in KB961804.

Issue #2

Creating or starting a virtual machine fails with the following error:

'General access denied error' (0x80070005).

Cause:  This issue can be caused by the Intel IPMI driver.

Resolution:  Perform the steps documented in KB969556.

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.
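Two quick commands help confirm the cause before freeing space; D: is a placeholder for the volume that holds the virtual machine files.

rem How much free space is left on the volume hosting the VHD/AVHD files?
fsutil volume diskfree D:

rem Find the differencing disks (snapshots) that are consuming the space:
dir /s /b D:\*.avhd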

High Availability (Failover Clustering)

Issue #1

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is now available which covers how to configure Hyper-V on a Failover Cluster.

Issue #2

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  When virtual machine settings are changed on a VM that’s on a Failover Cluster, you must select the ‘Refresh virtual machine configuration’ option before the VM is moved to another node.  There is a blog that discusses this.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to back up a Hyper-V virtual machine:

·         If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

·         The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is now available to address issues with backing up and restoring Hyper-V virtual machines.

Issue #2

How to back up virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.
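KB958662 has the authoritative steps. As a rough sketch from memory, there are two pieces: registering the Hyper-V VSS writer with Windows Server Backup (the registry key and writer GUID below should be verified against the article before use) and then backing up the volume that holds the virtual machine files.

rem Register the Hyper-V VSS writer with Windows Server Backup (per KB958662):
reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\WindowsServerBackup\Application Support\{66841CD4-6DED-4F4B-8F17-FD23F8DDC3DE}" /v "Application Identifier" /t REG_SZ /d "Hyper-V" /f

rem Back up the volume that contains the virtual machine files:
wbadmin start backup -backupTarget:E: -include:D: -quiet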

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

Network connectivity issues

Cause: NIC teaming software

Resolution: Remove the NIC teaming software. Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Issue #3

Customers inquiring if Hyper-V supports NIC Teaming.

Resolution: Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Import/Export

Issue #1

Importing a virtual machine may fail with the following error:

A Server error occurred while attempting to import the virtual machine. Failed to import the virtual machine from import directory <Directory Path>. Error: One or more arguments are invalid (0x80070057).

Resolution: Perform the steps documented in KB968968.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

·         An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

·         A negative ping time is displayed when you use the ping command.

·         Perfmon shows high disk queue lengths

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q4)


It is time for the final installment of a year-long segment on the top issues in Hyper-V.  It is appropriate since Windows Server 2008 R2 has finally released, and we can look forward to tracking\reporting any issues we may find in the new version of Hyper-V.  As always, the issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q4 have not changed much from Q1\Q2\Q3.  Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether (if you have been following this blog series, you will notice some already have).   Additionally, we continue to highly recommend the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

After the Hyper-V role is installed, a customer creates a virtual machine, but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS. In some cases, the server needs to be physically shut down in order for the new BIOS settings to take effect.

Issue #2

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #3

After the latest updates from Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to change the state of the virtual machine 'vmname'.
'vmname' failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine 'vmname'.
'VMName' failed to initialize.
An attempt to read or update the virtual machine configuration failed.
'VMName' failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Virtual Devices or Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Corrupted virtual hard disk (VHD) file.

Cause: The most common cause was a power outage or the server wasn’t shut down properly.

Resolution: Restore the VHD file from backup.

Issue #3

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached (driver issue - Storvsp.sys).

Resolution: Perform the steps documented in KB969266.

Issue #4

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys driver issue.

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.

Issue #2

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Issue #3

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free disk space to allow the merge to complete or move the .VHD and .AVHD file(s) to a volume with sufficient disk space and manually merge the snapshots.

Integration Components

Issue #1

On Windows Server 2008, when you attempt to install the Integration Components in a Hyper-V virtual machine running Windows Vista Service Pack 2, the installation may fail with the following error:

An error has occurred: One of the update processes returned error code 1.

Cause: This issue occurs if the management operating system (parent partition) that has the Hyper-V role installed does not have Service Pack 2 installed. If you have a virtual machine that’s running Windows Vista Service Pack 2, you need to use the Vmguest.iso from Service Pack 2 to install the Integration Components.

Resolution: Perform the steps documented in KB974503.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

  • The requested operation cannot be performed on a file with a user-mapped section open. ( 0x800704C8 )
  • ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).
  • The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can be caused by antivirus software that is installed in the parent partition and the real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution: Perform the steps documented in KB961804.

Issue #2

Customer has multiple Hyper-V servers and virtual machines are getting duplicate MAC addresses.

Resolution: Configure the Hyper-V servers to use unique MAC address ranges by modifying the MinimumMacAddress and MaximumMacAddress registry values on each Hyper-V server. This issue is documented on TechNet: http://technet.microsoft.com/en-us/library/dd582198(WS.10).aspx. On Server 2008 R2, the MAC address ranges can be configured in the UI.
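For illustration only, here is a hedged PowerShell sketch of the registry change described in that TechNet article (verify the registry path and value names against the article before making changes, and treat the address range shown as an example only – assign a non-overlapping range to each host):

# Run on each Windows Server 2008 Hyper-V host; back up the registry first.
$key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization'
Get-ItemProperty -Path $key -Name MinimumMacAddress,MaximumMacAddress      # view the current pool
Set-ItemProperty -Path $key -Name MinimumMacAddress -Value ([byte[]](0x00,0x15,0x5D,0x02,0x10,0x00))
Set-ItemProperty -Path $key -Name MaximumMacAddress -Value ([byte[]](0x00,0x15,0x5D,0x02,0x1F,0xFF))
Restart-Service vmms   # restarting the Virtual Machine Management service (or the host) helps ensure the new pool is picked up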

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.

High Availability (Failover Clustering)

Issue #1

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  We have a KB article (KB 2000016) which discusses this issue for Windows 2008. On Windows 2008 R2, the experience has improved. If the virtual machine settings are modified within the Failover Cluster Management console, changes that are made to the VM will be saved to the Cluster (i.e. synchronized across all nodes in the cluster). If you make changes to the VM using the Hyper-V Manager Console, you must select the refresh virtual machine configuration option before the VM is moved to another node. This issue is documented in the Windows Server 2008 R2 help file. There is also a blog that discusses this.

Issue #2

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is available which covers how to configure Hyper-V on a Failover Cluster.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to backup a Hyper-V virtual machine:

·         If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

·         The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is available to address issues with backing up and restoring Hyper-V virtual machines.
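If you suspect the writer is in the unstable state described above, a quick hedged check from an elevated PowerShell prompt on the Hyper-V server looks something like this:

# Is the Hyper-V VSS writer still registered and stable?
vssadmin list writers | Select-String 'Microsoft Hyper-V VSS Writer' -Context 0,4

# If the writer is missing or failed, restarting the Hyper-V Virtual Machine
# Management service returns it to a stable state (a vmms restart should not
# stop running virtual machines, but schedule it carefully all the same).
Restart-Service vmms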

Issue #2

How to backup virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

After the customer configured a virtual machine to use a VLAN ID, the virtual machine is unable to access the network.

Cause: The VLAN ID used by the virtual machine didn’t match the VLAN ID configured on the network switch.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Issue #3

How to configure a virtual machine to use a VLAN.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

·         An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

·         A negative ping time is displayed when you use the ping command.

·         Perfmon shows high disk queue lengths

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.

As always, we hope this has been informative for you.

BTW – Did I mention we are strongly recommending installing Windows Server 2008 SP2 on all Hyper-V servers?  Have a good one!

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Failover Cluster Validation Firewall Error in Windows Server 2008 R2


An issue involving a firewall configuration error in the cluster validation process just surfaced here in Microsoft Support so I thought I would post a quick blog in an effort to not only inform our readership, but to ‘nip this in the bud’ before we start seeing more. 

After running a Windows Server 2008 R2 Failover Cluster validation report, you may see the following error –

“An error occurred while executing the test.  There was an error verifying the firewall configuration.  An item with the same key has already been added”

The error, as is, does not provide a clear direction to take when trying to troubleshoot.  Thanks to the efforts of the Cluster Product Group, the source of the issue was identified and a quick data collection process can be executed to help determine the ‘root’ cause.

The firewall configuration error is reported if any of the network adapters across the cluster nodes being validated have the same Globally Unique Identifier (GUID).  This can be determined by running the following WMI query on each node in the cluster and comparing the results.  I chose to run the query inside PowerShell to display sample data in a formatted list –

Get-WmiObject Win32_NetworkAdapter | fl Name,GUID

clip_image002

The sample output above shows the information associated with the three physical network adapters that exist in one of the nodes in my cluster.  After the data is gathered from each node in the cluster, you just need to compare it and identify the duplicate GUID information.
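When there are more than a couple of nodes, a rough PowerShell sketch like the one below (the node names are placeholders) gathers the adapter data remotely and flags any GUID that appears more than once:

# Collect adapter GUIDs from each node and report duplicates
$nodes = 'NODE1','NODE2','NODE3'                     # substitute your cluster node names
$adapters = foreach ($node in $nodes) {
    Get-WmiObject Win32_NetworkAdapter -ComputerName $node |
        Where-Object { $_.GUID } |
        Select-Object @{n='Node';e={$node}}, Name, GUID
}
$adapters | Group-Object GUID | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Format-Table Node,Name,GUID -AutoSize }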

The next logical question is, “How does one find themselves in this predicament?”  In the cases we have encountered thus far, the cluster nodes were being deployed in an unsupported manner.  In each case an ‘image’ was being used to deploy the nodes.  We discovered that the operating system image was not properly prepared (for example, by running sysprep) before being deployed.

Hopefully this information will be useful and will help avoid further occurrences of this issue.  Thanks again and please come back.

Additional References:

Failover Cluster Step-by-Step Guide: Validating hardware for a Failover Cluster

KB 943984:  The Microsoft Policy for Windows Server 2008 Failover Clusters

Deployment Tools Technical Reference

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters


In this blog, I would like to explore some of the inner workings of the Resource Hosting Subsystem (RHS), which is responsible for monitoring the health of the various cluster resources being provided as part of highly available services in a Failover Cluster.  A Windows Server 2008 Failover Cluster is capable of providing high availability services using a variety of resources, some of which are included as part of the Failover Cluster feature and others of which are provided by ‘cluster-aware’ applications like SQL Server and Exchange.  Resources are designed to work together and are typically organized in Resource Groups (Figure 1).  For example, a group of resources supporting a highly available File Server may consist of one or more of the following types of resources – Client Access Point (IP Address(es) + Network Name resource), Physical Disk (Storage), and a File Server.  A highly available SQL Server instance could contain the following resources – Client Access Point (IP Address + Network Name resource), Physical Disk (Storage), SQL Server and SQL Server Agent.  Cluster resources are supported by special ‘plugins’, or resource Dynamic Link Libraries (DLLs), that include code allowing them to properly integrate and interoperate with the cluster service.

image

Figure 1

A Windows Server 2008 Failover Cluster is capable of hosting an unlimited number of resources.  The management of these resources is the responsibility of the Resource Control Manager (RCM) and the Resource Host Subsystem (RHS) which provide this functionality as part of the Cluster Service itself (Figure 2). 

image

Figure 2

The Resource Control Manager (RCM) is part of the overall cluster architecture and is responsible for implementing failover mechanisms and policies for the cluster service as well as establishing and maintaining the dependency tree (Figure 3) for each resource (e.g. a File Server resource requires a dependency on a Client Access Point and a Storage resource). 

image

Figure 3

The Resource Control Manager maintains the state for individual resources (Online, Offline, Failed, Online Pending, and Offline Pending) as well as for Resource Groups (Online, Offline, Partial Online, and Failed).  The Resource Control Manager can execute the following actions on a group of resources – Move, Failover and Failback.  Which action is executed depends on several factors including the current ‘health’ of resources in the group, administrative actions taken on the group (e.g. Move Group), or the current policies in effect for the group.  Here is an example (Figure 4) of Failover and Failback Group Policies –

image

Figure 4

Individual resources have policies (Figure 5) that apply to them as well.

imageimage

Figure 5

The Resource Hosting Subsystem (RHS) is responsible for initially hosting all resources that come Online in the cluster in one default process – rhs.exe (Resource Host Monitoring process) (Figure 6).

image

Figure 6

Note:  The rhs.exe *32 process supports  32-bit resource DLLs running in the cluster.

 In previous versions of Microsoft clustering, this was called the resource monitor process (resrcmon.exe) (Figure 7).

image

Figure 7

There is one exception to this rule, introduced in the Windows Server 2008 R2 Failover Clustering feature.  In Windows Server 2008 R2, the Cluster Group (which consists of the Cluster Network Name resource, one or more associated IP Address resources, and a ‘witness’ resource) and the Available Storage group are considered ‘critical’ cluster resource groupings and are hosted in an rhs.exe process separate from all the other cluster resources.

The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly.  This is accomplished by executing   IsAlive and LooksAlive processes which are specific to the type of resource.  Examples of these are documented in the following KB article –

KB 914458 -   Behavior of the LooksAlive and IsAlive functions for the resources that are included in the Windows server Clustering component of Windows Server 2003.

How often health checks are conducted is determined by the specific resource DLL or by a policy set by the cluster administrator.  An example of this policy is shown in Figure 5.  Should a resource fail to respond to a low-level LooksAlive check, a more in-depth IsAlive check is conducted.  If a resource fails an IsAlive check, additional policies are executed until such time it is determined that a resource cannot run on a particular node in the cluster.  When that point has been reached, RHS notifies the Resource Control Manager which will report the resource as Failed to the cluster service and a Failover is executed to move the Resource Group to another node in the cluster provided the default policy (Figure 8) is in effect.

image

Figure 8

There are times when a cluster administrator will choose not to implement the default policy shown in Figure 8 for specific ‘non-critical’ resources.  This reduces instability in the cluster which could adversely impact clients connected to highly available service(s). 

The IsAlive and LooksAlive health monitoring function is but a small part of what can be done with cluster resources.  Figure 9 shows a listing of additional Resource DLL Entry-Point functions. 

image

Figure 9

Note:  Information on the Failover Cluster APIs can be found on MSDN.

Failure of an IsAlive call into a resource is but one way resources can become unavailable in the cluster.  Other ways include:

  • Deadlocks in a resource DLL
  • Crashes in a resource DLL
  • RHS process itself terminates in the cluster
  • Cluster service fails on the node
  • Operating system failures (e.g. resource exhaustion)

Most of us who have been working with clusters for a long period of time understand what happens if a resource fails a critical health check.  I want to spend a little time discussing resource deadlocks. 

What is a resource ‘deadlock’?  Basically, there are two common reasons for instability within a resource DLL: the resource DLL itself crashes (e.g. an access violation in the resource DLL), or the resource fails to respond to a command in a timely fashion.  Every time a call is made into a resource, a timer is started.  If a response is not received within a specific (configurable) period of time, the resource is considered deadlocked.  The RHS process hosting that resource is terminated, and the resource is placed in a newly created RHS process, isolating it from all the other resources running in the default rhs.exe process.  When a deadlock happens, the Failover Cluster service registers an event in the cluster log.  Here is an example of a deadlock occurring in the ‘Cluster Name’ resource –

000008c8.00002528::2009/06/17-20:07:57.900 WARN  [RCM] ResourceControl(GET_NETWORK_NAME) to Network Name (email) returned 5910.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR   [RHS] RhsCall::DeadlockMonitor: Call LOOKSALIVE timed out for resource 'Cluster Name'.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR   [RHS] Resource Cluster Name handling deadlock. Cleaning current operation and terminating RHS process.

000008c8.00001cc4::2009/06/17-20:07:58.009 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster Name', gen(0) result 4.

000008c8.00001cc4::2009/06/17-20:07:58.009 WARN  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'Cluster Name' has crashed or deadlocked; marking it to run in a separate monitor.

 Figure 10

Entries are also made in the Windows System Event Log.  Here is an example –

06/17/2009 04:07:58 PM  Error         Server1.contoso.com. 1230    Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM                Cluster resource 'Cluster Name' (resource type '', DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.

06/17/2009 04:07:58 PM  Critical      Server1.contoso.com. 1146    Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM                The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor.


Figure 11

Information on these specific Failover Cluster error messages can be found on TechNet.  The information for the two events shown in Figure 11 is shown in Figure 12.


image

Figure 12

In Windows Server 2008 R2, RHS events are registered with Windows Error Reporting.  These events can be viewed in the Action Center under Control Panel.  All RHS issues will be listed under the category ‘Failover Cluster Resource Host Subsystem.’

Examining the properties of a cluster resource highlights some of the information we have been discussing.  Figure 13 points out some of the pertinent properties of a resource.

image

Figure 13

MonitorProcessID:  Indicates the Process Identifier (PID) in Task Manager of the rhs.exe process associated with this resource.  If multiple resources have been placed in their own RHS process, it can be difficult to discern which process is associated with which resource.  Examining the properties of the specific resource can help.

Note:  The Process ID is not displayed by default in Task Manager.  You need to add the column to the display by selecting View in the menu bar, choosing Select Columns from the drop-down list, and checking the box for PID (Process Identifier).

SeparateMonitor:  Indicates if the resource has been placed in a separate monitor (0:No, 1:Yes).

IsAlivePollInterval:  Default is as shown, indicating it is using the default setting for this specific resource type.

LooksAlivePollInterval:  Default is as shown indicating it is using the default setting for this specific resource type.

DeadlockTimeout:  Default setting indicating 5 minutes.
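These common properties can also be read (and, in the case of SeparateMonitor, set) from a command prompt or PowerShell window on a cluster node.  A hedged example using cluster.exe with the Cluster Name resource:

cluster res "Cluster Name" /prop                       # lists the common properties, including SeparateMonitor and DeadlockTimeout
cluster res "Cluster Name" /prop SeparateMonitor=1     # places the resource in its own RHS monitor process for isolation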

Resource deadlock detection was actually introduced in Windows Server 2003 clusters; however, it was not turned on by default.  Figure 14 illustrates this.

image

Figure 14

Deadlock detection is turned on by default in Windows Server 2008 (RTM + R2) and cannot be disabled.

So, what is the moral of this story?  It is important to understand that cluster resource deadlocks are a symptom of a larger problem.  The deadlock itself is not the problem; the cluster is a victim of a problem that exists either internal to the cluster node itself or somewhere external to the cluster.  Applying a logical troubleshooting methodology can help determine where the problem may exist.  But to do that requires a few pieces of information –

  1. Which specific resource is deadlocked?
  2. Which entry point is failing?
  3. What is that entry point trying to do?

Using the example provided in Figures 10 and 11, we can see there was a deadlock in the Cluster Name resource during a LooksAlive entry point.  Understanding what is being evaluated during a LooksAlive check for a Network Name resource may help identify the problem, which could end up being local to the node or could perhaps involve connectivity to a DNS server on the network.  Referring back to KB 914458, the cluster resource DLL (ClusRes.dll) is responsible for Network Name resource health checking (IsAlive\LooksAlive tests).  Some of the tests that are conducted include:

·         Determining if the Network Name (NetBIOS Name) is still registered on the network stack on the node.  Opening a command prompt on a node and running an nbtstat –n command to view the local NetBIOS name table will show the registrations for cluster Network Name resources.  Here is an example of a Network Name supporting a Client Access Point for a File Server –

image

    Inspecting the Parameter data for the resource in the cluster registry hive confirms the information –

image

  • Determine the result of a DNS registration attempt (dynamic DNS is required for this test).
  • If the Require DNS property is set and registration fails, then the IsAlive\LooksAlive test fails.

If all DNS registrations fail and the NetBIOS name is no longer registered locally on the node, the Network Name is no longer considered reachable and the resource is placed in a Failed state. Recovery processes are initiated by the cluster service on the local node first.  If local recovery fails, the Group containing the Failed Network Name resource could be moved to another node in the cluster.
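A quick, hedged way to run the two checks just described from the node that currently owns the Network Name resource (FILESERVER is only a placeholder name):

nbtstat -n            # is the Network Name still in the local NetBIOS name table?
nslookup FILESERVER   # does the name still resolve against the configured DNS server?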

What are some things that can be done to help avoid, or at least mitigate,  situations where a deadlock may occur?  While not set in stone, here are some of my personal recommendations:

  1. Make sure the operating system (OS) is running with the latest service pack plus any post-service pack updates that pertain to Failover Cluster, networking or storage connectivity.
  2. If running highly available Microsoft applications like SQL or Exchange, ensure they are updated as well.
  3. Consult with the storage vendor and ensure the shared storage is updated and configured correctly to work in a Microsoft Failover Cluster.  Most storage vendors maintain a current support matrix.
  4. Ensure there are reliable and redundant communications paths between all nodes in the cluster.
  5. Ensure there is reliable connectivity between all nodes in the cluster and Active Directory.
  6. Document all third-party products that are running in the cluster and ensure they are fully updated. Third-party products that interact with storage or network connectivity are always potential suspects.
  7. Use the cluster validation process to help troubleshoot issues seen in a cluster.
  8. If you are a Cluster Administrator, you must be aware of all changes being implemented in the corporate infrastructure to determine potential impacts on highly available services.

Hopefully, you will find this information useful.  Thanks again and please come back.

Additional References:

http://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Using Pass-Through Disks in Conjunction with Cluster Shared Volumes (CSV) in Windows Server 2008 R2 Failover Clusters


This blog is essentially an update, or follow-up if you will, to the original blog I wrote for Windows Server 2008 Failover Clustering.  With the release of Windows Server 2008 R2 comes the ability to Live Migrate highly available virtual machines between nodes in a Failover Cluster.  As if that were not enough to get customers excited about the product, we also include a feature called Cluster Shared Volumes (CSV), which is designed to work in conjunction with making virtual machines highly available in Windows Server 2008 R2 Failover Clustering using the Live Migration functionality.  Some users are a little confused about CSV and whether they must use CSV with Live Migration, or whether they can still take advantage of pass-through disks in a virtual machine configuration.  I am here to tell you that you can use both simultaneously if desired, and that is what we will discuss here.

The ultimate goal is to arrive at the configuration shown here where a highly available virtual machine is using a CSV volume for storing configuration files and the virtual hard disk (VHD) supporting the base OS while still taking advantage of a pass-through disk for data storage.

clip_image002

We start off with the assumption that a virtual machine has already been made highly available using a CSV volume to store the configuration file and the base OS vhd and Live Migration has already been tested and is working properly.

clip_image004

Next, we prepare a new LUN that has been presented to the cluster to be used as a pass-through disk.  In the Disk Management interface we can see the LUN has been presented and is Offline.

clip_image006

Since this is a LUN that the cluster has never seen before, we need to bring the disk Online first.

clip_image008

Once the disk is Online, we need to initialize the disk which will write a signature so the operating system can identify it.

clip_image010

Note:  The cluster uses the disk signature as one attribute for uniquely identifying storage that it controls.  How do we know the cluster has control of a disk?

Bonus material – The disk shows as Reserved when a cluster has control of it.

clip_image012

Once the disk has been initialized, take the disk Offline.  There is no need to partition the drive as that will be done by the OS running in the virtual machine.

clip_image014

In order for the disk to be used by a highly available virtual machine, it must be under control of the cluster.  In the Failover Cluster Management snap-in, add the disk to the cluster.

clip_image016

Select the disk that is Offline.

clip_image018

Once added to the cluster, the new storage is placed in the Available Storage group and is brought Online.

clip_image020
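On Windows Server 2008 R2 the same step can be scripted.  Here is a minimal sketch using the Failover Cluster PowerShell module (the disk must already be presented to, and visible from, every node):

Import-Module FailoverClusters
Get-ClusterAvailableDisk                      # lists disks visible to all nodes that are not yet cluster resources
Get-ClusterAvailableDisk | Add-ClusterDisk    # adds them; new disks are placed in the Available Storage group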

Bonus material – Can a pass-through disk be used in the CSV namespace?

If we try to add the disk to the CSV namespace –

clip_image022

 

The process will not complete –

clip_image024

Reviewing the error –

clip_image026

So, the answer is no – pass-through disks cannot be added to a CSV namespace because the drive must be Offline in preparation for being configured in a VM.  If the disk is Offline, the partition information cannot be read, and CSV has a requirement for an NTFS partition.

If you were to try and configure the virtual machine to use a disk that is not under the control of the cluster, you would see this pop-up when the virtual machine configuration is refreshed by the cluster.

clip_image028

Viewing the Details, the error is –

clip_image030

The reason is also explained –

clip_image032

With the disk added as a cluster resource and Online in the Available Storage group, access the settings for the running virtual machine by right-clicking the virtual machine resource and selecting Settings (or selecting Settings in the lower right-hand Actions pane).

clip_image034

In R2, we can hot-add a hard disk to a pre-existing SCSI controller (if you wanted to use an IDE controller, the virtual machine would have to be shut down).  Execute the task by selecting the SCSI Controller, then Hard Drive, and clicking Add.

clip_image036

Make sure to select the correct disk from the drop down list.

clip_image038

The refresh of the virtual machine should complete successfully and the new disk will be added to the group containing the Virtual Machine.

clip_image040

 

An examination of the resource dependencies for the virtual machine resource now includes the new disk that was added.

clip_image042

At this point, test Live Migration of the group to ensure all resources will come Online on other nodes in the cluster.

clip_image044

When Live Migration completes, we have achieved the desired configuration.

clip_image046

The final step is to prepare the new storage in the VM itself.

clip_image048

That wraps it up for this blog.  I hope you have found this information useful.  Come back and see us.

Additional references:

Configuring Pass-through Disks in Hyper-V

Microsoft Cluster Team Blog

Virtualization TechNet Center

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 R2 Live Migration – “The devil may be in the networking details.”


Windows Server 2008 R2 has been publicly available now for only a short period of time, but we are already seeing a good adoption rate for the new Live Migration functionality as well as the new Cluster Shared Volumes (CSV) feature. I personally have worked enough issues now where Live Migration is failing that I felt a short blog on what process I have followed to work through these may have some value.

It is important to mention right up front that there is information publicly available on the Microsoft TechNet site that discusses Live Migration and Cluster Shared Volumes. This content also includes some troubleshooting information. I acknowledge that a lot of people do not like to sit in front of a computer monitor and read a lot of text to try and figure out how to resolve an issue. I am one of those people. Having said that, let’s dive in.

It has been my experience thus far that issues that prevent Live Migration from succeeding have to do with proper network configuration. In this blog, I will address the main network related configuration items that need to be reviewed in order to be sure Live Migration has the best chance of succeeding. I begin with an initial set of assumptions which include the R2 Hyper-V Failover Cluster has been properly configured and all validation tests have passed without failure, the highly available VM(s) have been created using cluster shared storage, and the virtual machine(s) are able to start on at least one node in the cluster.

I start off by identifying the virtual machines that will not Live Migrate between nodes in the cluster. While it should not be necessary in Windows Server 2008 R2, I recommend first running a ‘refresh’ process on each virtual machine experiencing an issue with Live Migration. I say it should not be necessary because a lot of work was done by the Product Group to more tightly integrate the Failover Cluster Management interface with Hyper-V. Beginning with R2, virtual machine configuration and management can be done using the Failover Cluster Management interface. Here is a sample of some of the actions that can be executed using the Actions Pane in Failover Cluster Manager.

clip_image002

If virtual machine configuration and management is accomplished using the Failover Cluster Management interface, any configuration changes made to a virtual machine should be automatically synchronized across all nodes in the cluster. To ensure this has happened, I begin by selecting each virtual machine resource individually and executing a Refresh virtual machine configuration process as shown here –

clip_image004

The process generates a report when it completes. The desired result is shown here –

clip_image006

If the process completes with a Warning or Failure, examine the contents of the report, fix the issue(s) reported, and run the process again until it completes successfully.

If the refresh process completes without Failure, try to Quick Migrate the virtual machine to each node in the cluster to see if it succeeds.

clip_image008
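On Windows Server 2008 R2 the same test can also be driven from PowerShell.  A hedged sketch, where the group name and node name are placeholders for your own:

Import-Module FailoverClusters
Move-ClusterVirtualMachineRole -Name "Virtual Machine VM1" -Node NODE2 -MigrationType Quick
Get-ClusterGroup "Virtual Machine VM1"        # confirm the group is Online on the target node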

If a Quick Migration completes successfully, that confirms the Hyper-V Virtual Networks are configured correctly on each node and the processors in the Hyper-V servers themselves are compatible. The most common problem with the Hyper-V Virtual Network configuration is that the naming convention used is not the same on every node in the cluster. To determine this, open the Hyper-V Management snap-in, select the Virtual Network Manager in the Actions pane and examine the settings.

clip_image010

The information shown below (as seen in my cluster) must be the same across all the nodes in the cluster (which means each node must be checked). This includes not only spelling but ‘case’ as well (i.e. PUBLIC is not the same as Public) –

clip_image012

It is important to be able to successfully Quick Migrate all virtual machines that cannot be Live Migrated before moving forward in this process. If the virtual machine can Quick Migrate between all nodes in the cluster, we can begin taking a closer look at the networking piece.

Start verifying the network configuration on each node in the cluster by first making sure the network card binding order is correct. In each cluster node, the Network Interface Card (NIC) supporting access to the largest routable network should be listed first. The binding order can be accessed using the Network and Sharing Center, Change adapter settings. In the Menu bar, select Advanced and from the drop down list choose Advanced Settings. An example from one of my cluster nodes is shown here where the NIC (PUBLIC-HYPERV) that has access to the largest routable network is listed first.

clip_image014

Note: You may also want to review all the network connections that are listed and Disable those that are not being used by either the Hyper-V server itself or the virtual machines.

On each NIC in the cluster, ensure Client for Microsoft Networks and File and Printer Sharing for Microsoft Networks are enabled (i.e. checked). This is a requirement for CSV, which requires SMB (Server Message Block).

clip_image016

Note: Here is where people usually get into trouble because they are familiar with clusters and have been working with them for a very long time, maybe even as far back as the NT 4.0 days. Because of that, they have developed a habit for configuring cluster networking which is basically outlined in KB 258750. This article does not apply to Windows Server 2008.

Note: If CSV is configured, all cluster nodes must reside on the same non-routable network. CSV (specifically for re-directed I/O) is not supported if cluster nodes reside on separate, routed networks.

Next, verify the local security policy and ensure NTLM security is not being restricted by a local or domain level policy. This can be determined by Start > Run > gpedit.msc > Computer Configuration > Windows Settings > Security Settings > Local Policies > Security Options. The default settings are shown here –

clip_image018

In the virtual machine resource properties in the Failover Cluster Management snap-in, set the Network for Live Migration ordering such that the highest speed network that is enabled for cluster communications and is not a Public network is listed first. Here is an example from my cluster. I have three networks defined in my cluster –

clip_image020

The Public network is used for client access, management for the cluster, and for cluster communications. It is configured with a Default Gateway and has the highest metric defined in the cluster for a network the cluster is allowed to use for its own internal communications. In this example, since I am also using iSCSI, the iSCSI network has been excluded from cluster use. The corresponding listing on the virtual machine resource in the Network for live migration tab looks like this –

clip_image022

Here, I have unchecked the iSCSI network as I do not want Live Migration traffic being sent over the same network that is supporting the storage connection. The Cluster network is totally dedicated to cluster communications only so I have moved that to the top as I want that to be my primary Live Migration network.
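To confirm how each cluster network is configured for cluster use, a quick hedged check with the R2 Failover Cluster PowerShell module looks like this (Role 0 = not used by the cluster, 1 = cluster communication only, 3 = cluster and client traffic; the network name "iSCSI" is whatever name your cluster assigned):

Import-Module FailoverClusters
Get-ClusterNetwork | Format-Table Name,Role,Address,AddressMask -AutoSize
(Get-ClusterNetwork "iSCSI").Role = 0         # example: exclude the iSCSI network from cluster use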

Note: Once the live migration network priorities have been set on one virtual machine, they will apply to all virtual machines in the cluster (i.e. it is a Global setting).

Once all the configuration checks have been verified and changes made on all nodes in the cluster, execute a Live Migration and see if it completes successfully.

Bonus material:

There are configurations that can be put in place that can help live migrations run faster and CSV perform better. One thing that can be done is to disable NetBIOS on the NIC that will be supporting the primary network used by CSV for redirected I/O. This should be a dedicated network and should not be supporting any traffic other than internal cluster communications, redirected I/O for CSV and\or live migration traffic.

clip_image024

Additionally, on the same network interface supporting live migration, you can enable larger packet sizes to be transmitted between all the connected nodes in the cluster.

clip_image026

If, after making all the changes discussed here, live migration is still not succeeding, then perhaps it is time to open a case with one of our support engineers.

Thanks again for your time, and I hope you have found this information useful. Come back again.

Additional resources:

Using Live Migration with Cluster Shared Volumes in Windows Server 2008 R2

High Availability Product Team Blog

Hyper-V and Virtualization on Microsoft TechNet

Windows Server 2008 R2 Hyper-V Forum

Windows Server 2008 R2 High Availability Forum

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Be Sure to Plan Carefully When Virtualizing Your Infrastructure


There is a lot of excitement around Microsoft virtualization technologies these days and rightfully so.  One of the ‘hottest’ areas right now appears to be making virtual machines highly available using Windows Server 2008 R2 Failover Clusters so end users can take maximum advantage of Live Migration and Cluster Shared Volumes (CSV).  This configuration not only saves a lot of money but also provides business continuity in the event of an unforeseen failure in  the environment.

While I could spend time extolling the virtues of our virtualization technologies, I am really here to discuss what can happen if one were to get too ‘overzealous’ and not use common sense and a sound plan for implementing the solution correctly.  As with many of the blogs you read here on the CORE blog site, they have been written because of experiences we have had with our customers.  This one is no different.

So, what happens when a customer decides they love Microsoft virtualization and high availability technologies so much that they want to virtualize their entire infrastructure?  And suppose they want to be sure it is highly available, so they create a multi-node Failover Cluster to host the virtual machines.  When the customer completes the project, they are very proud of what they have done because now they can retire their old hardware and save tons of money on power and cooling costs in their datacenter.  Everyone is happy and celebrations abound.  And then it happens…someone decides they need to shut down the cluster(s), for whatever reason, it does not matter, and, after a while, when they decide it is OK to bring the cluster(s) back online…they cannot.  Oh, and one more thing…the clusters are running on Windows Server 2008 R2 CORE.  Trust me, this is a true story and has already happened more than once, hence the impetus behind this blog.

If the predicament is not immediately obvious, and it should be for cluster veterans, I will tell you that the cluster service will fail to start because it cannot contact a Domain Controller somewhere in Active Directory.  And this is because all of the Domain Controllers and DNS servers (critical infrastructure servers) have been virtualized and are, in fact, virtual machines currently supported by the cluster that is trying to start up.  Clearly, this is a case of having one’s eggs all in one basket – not good.

How did we fix this?  It was not a quick fix.  In a nutshell, what the Support Engineer did was have the customer determine which storage LUN was hosting the VM files for one of the virtualized Domain Controller\DNS servers.  Then, the LUN was mapped to a standalone server so the VHD file could be copied off to another standalone Hyper-V server so a new VM could be created and placed in service.  Once this was accomplished, the cluster could be started.

How can this type of scenario be avoided? 

1. Develop a solid, well-thought-out migration plan.  Ensure the planning team includes people who understand how all the technologies function in a virtualized environment.

Note:  Please review
KB 888794: Considerations when hosting Active Directory domain controllers in virtual hosting environments

2. Have at least one physical Domain Controller\DNS server available in the environment.

3. If #2 is not an option, distribute the virtualized infrastructure servers across multiple Hyper-V clusters and hope they will not all be Offline at the same time.

4. Plan to have one or more Hyper-V servers running in a WORKGROUP configuration.  Hyper-V servers do not have to be joined to an Active Directory domain.  Then distribute some of the virtualized infrastructure servers across these servers.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 1)


The Windows Server 2008 Failover Clustering feature provides high availability for services and applications. To ensure applications and services remain highly available, it is imperative that the cluster service running on each node in the cluster function at the highest level possible. Providing redundant and reliable communications connectivity among all the nodes in a cluster plays a large role in ensuring the smooth functioning of the cluster. Configuring proper communications connectivity within a failover cluster not only provides access to highly available services required by clients but also guarantees the connectivity the cluster requires for its own internal communications needs. The sections that follow discuss Windows Server 2008 Failover Clustering networking features, functionality and recommended processes for the proper configuration and implementation of network connectivity within a cluster.

The following sections provide the information needed to understand failover cluster networking and to properly implement it.

Windows Server 2008 Failover Cluster networking features

Windows Server 2008 Failover Clustering introduces new networking capabilities that are a major shift away from the way things have been done in legacy clusters (Windows 2000\2003 and NT 4.0). Some of these take advantage of the new networking features that are included as part of the operating system and others are a result of feedback that has been received from customers. The new features include:

  • A new cluster network driver architecture
  • The ability to locate cluster nodes on different, routed networks in support of multi-site clusters
  • Support for DHCP assigned IP addresses
  • Improvements to the cluster health monitoring (heartbeat) mechanism
  • Support for IPv6

New cluster network driver architecture

The legacy cluster network driver (clusnet.sys) has been replaced with a new NDIS level driver called the Microsoft Failover Cluster Virtual Adapter (netft.sys). Whereas the legacy cluster network driver was listed as a Non-Plug and Play Driver, the new fault tolerant adapter actually appears as a network adapter when hidden devices are displayed in the Device Manager snap-in (Figure 1).

image

Figure 1: Device Manger Snap-in

The driver information is shown in Figure 2.

image

Figure 2: Microsoft Failover Cluster Virtual Adapter driver

The cluster adapter is also listed in the output of an ipconfig /all command on each node (Figure 3).

image

Figure 3: Microsoft Failover Cluster Virtual Adapter configuration information

The Failover Cluster Virtual Adapter is assigned a Media Access Control (MAC) address that is based on the MAC address of the first enumerated (by NDIS) physical NIC in the cluster node (Figure 4) and uses an APIPA (Automatic Private Internet Protocol Addressing) address.

image

Figure 4: Microsoft Failover Cluster Virtual Adapter MAC address

The goal of the new driver model is to sustain TCP/IP connectivity between two or more systems despite the failure of any component in the network path. This goal can be achieved provided at least one alternate physical path is available. In other words, a network component failure (NIC, router, switch, hub, etc…) should not cause inter-node cluster communications to break down, and communication should continue making progress in a timely manner (i.e. it may have a slower response but it will still exist) as long as an alternate physical route (link) is still available. If cluster communications cannot proceed on one network, the switchover to another cluster-enabled network is automatic. This is one of the primary reasons that each cluster node must have multiple network adapters available to support cluster communications and each one should be connected to different switches.

The failover cluster virtual adapter is implemented as an NDIS miniport adapter that pairs an internally constructed virtual route with each network found in a cluster node. The physical network adapters are exposed at the IP layer on each node. The NETFT driver transfers packets (cluster communications) on the virtual adapter by tunneling through the best available route in its internal routing table (Figure 5).

image

Figure 5: NetFT traffic flow diagram

Here is an example to illustrate this concept. A 2-Node cluster is connected to three networks that each node has in common (Public, Cluster and iSCSI). The output of an ipconfig /all command from one of the nodes is shown in Figure 6.

image

Figure 6: Example Cluster Node IP configuration

Note: Do not be concerned with the name ‘Microsoft Virtual Machine Bus Network Adapter’ as these examples were derived from cluster nodes running as Guests in Hyper-V.

The Microsoft Failover Cluster Virtual Adapter configuration information for each node is shown in Figure 7. Keep in mind, the default port for cluster communication is still TCP\UDP: 3343.

image

Figure 7: Node Failover Cluster Virtual Adapter configuration information

When the cluster service starts, and a node either Forms or Joins a cluster, NETFT, along with other components, is responsible for determining the node’s network configuration and connectivity with other nodes in the cluster. One of the first actions is establishing connectivity with the Microsoft Failover Cluster Virtual Adapter on all nodes in the cluster. Figure 8 shows an example of this in the cluster log.

image

Figure 8: Microsoft Failover Cluster Virtual Adapter information exchange

Note: You can see in Figure 8 that the endpoint pairs consist of both IPv4 and IPv6 addresses. The NETFT adapter prefers to use IPv6 and therefore will choose the IPv6 addresses for each end point to use.

As the cluster service startup continues, and the node either Forms or Joins a cluster, routing information is added to NETFT. Using the three networks mentioned previously, Figure 9 shows each route being added to a cluster.

image

Route between 1.0.0.31 and 1.0.0.32

image

Route between 192.168.0.31 and 192.168.0.32

image

Route between 172.16.0.31 and 172.16.0.32

Figure 9: Routes discovered and added to NETFT

Each ‘real’ route is added to the ‘virtual’ routes associated with the virtual adapter (NETFT). Again, note the preference for NETFT to use IPv6 as the protocol of choice.

The capability to place cluster nodes on different, routed networks in support of Multi-Site Clusters

Beginning with Windows Server 2008 failover clustering, individual cluster nodes can be located on separate, routed networks. This requires that resources that depend on IP Address resources (i.e. Network Name resources) implement an OR logic, since it is unlikely that every cluster node will have a direct local connection to every network the cluster is aware of. This facilitates IP Address, and hence Network Name, resources coming online when services\applications fail over to remote nodes. Here is an example (Figure 10) of the dependencies for the cluster name on a machine connected to two different networks.

image

Figure 10: Cluster Network Name resource with an OR dependency
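On Windows Server 2008 R2, the same dependency expression can be inspected from PowerShell.  A hedged example for the cluster Network Name resource (resource names will vary in your cluster):

Import-Module FailoverClusters
Get-ClusterResource "Cluster Name" | Get-ClusterResourceDependency
# A multi-subnet cluster typically returns an expression of the form
#   [Cluster IP Address] or [Cluster IP Address 2]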

All IP addresses associated with a Network Name resource, which come online, will be dynamically registered in DNS (if configured for dynamic updates). This is the default behavior. If the preferred behavior is to register all IP addresses that a Network Name depends on, then a private property of the Network Name resource must be modified. This private property is called RegisterAllProvidersIP (Figure 11). If this property is set equal to 1, all IP addresses will be registered in DNS and the DNS server will return the list of IP addresses associated with the A-Record to the client.

image

Figure 11: Parameters for a Network Name resource

Since cluster nodes can be located on different, routed networks, and the communication mechanisms have been changed to use reliable session protocols implemented over UDP (unicast), the networking requirements for Geographically Dispersed (Multi-Site) Clusters have changed. In previous versions of Microsoft clustering, all cluster nodes had to be located on the same network. This required ‘stretched’ VLANs be implemented when configuring multi-site clusters. Beginning with Windows Server 2008, this requirement is no longer necessary in all scenarios.

Support for DHCP assigned IP addresses

Beginning with Windows Server 2008 Failover Clustering, cluster IP address resources can obtain their addressing from DHCP servers as well as via static entries. If the cluster nodes themselves have at least one NIC that is configured to obtain an IP addresses from a DHCP server, then the default behavior will be to obtain an IP address automatically for all cluster IP address resources. The new ‘wizard-based’ processes in Failover Clustering understand the network configuration and will only ask for static addressing information when required. If the cluster node has statically assigned IP addresses, the cluster IP address resources will have to be configured with static IP addresses as well. Cluster IP address resource IP assignment follows the configuration of the physical node and each specific interface on the node. Even if the nodes are configured to obtain their IP addresses from a DHCP server, individual IP address resources can be changed to static addresses (Figure 12).

image

Figure 12: Changing DHCP assigned to Static IP address


Improvements to the cluster ‘heartbeat’ mechanism

The cluster ‘heartbeat’, or health checking mechanism, has changed in Windows Server 2008. While still using port 3343, it is no longer a broadcast communication. It is now unicast in nature and uses a Request-Reply type process. This provides for higher security and more reliable packet accountability. Using the Microsoft Network Monitor protocol analyzer to capture communications between nodes in a cluster, the ‘heartbeat’ mechanism can be seen (Figure 13).

image

Figure 13: Network Monitor capture

A typical frame is shown in Figure 14.

image

Figure 14: Heartbeat frame from a Network Monitor capture

There are properties of the cluster that address the heartbeat mechanism; these include SameSubnetDelay, CrossSubnetDelay, SameSubnetThreshold, and CrossSubnetThreshold (Figure 16).

image

Figure 16: Properties affecting the cluster heartbeat mechanism

The default configuration (shown here) means the cluster service will wait 5.0 seconds before considering a cluster node to be unreachable and regrouping to update the view of the cluster (one heartbeat is sent every second for five seconds). The limits on these settings are shown in Figure 17. Make changes to the appropriate settings depending on the scenario. The CrossSubnetDelay and CrossSubnetThreshold settings are typically used in multi-site scenarios where WAN links may exhibit higher than normal latency.

image

Figure 17: Heartbeat Configuration Settings

These settings allow for the heartbeat mechanism to be more ‘tolerant’ of networking delays. Modifying these settings, while a worthwhile test as part of a troubleshooting procedure (discussed later), should not be used as a substitute for identifying and correcting network connection delays.
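As a hedged example, these values can be reviewed (and, where genuinely necessary, adjusted) with the Windows Server 2008 R2 Failover Cluster PowerShell module:

Import-Module FailoverClusters
Get-Cluster | Format-List *SubnetDelay,*SubnetThreshold     # delays are in milliseconds, thresholds are heartbeat counts
(Get-Cluster).CrossSubnetDelay = 2000                       # example only – for multi-site nodes on a higher-latency WAN link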

Support for IPv6

Since the Windows Server 2008 OS will be supporting IPv6, the cluster service needs to support this functionality as well. This includes being able to support IPv6 IP Address resources and IPv4 IP Address resources either alone or in combination in a cluster. Clustering also supports IPv6 Tunnel Addresses. As previously noted, intra-node cluster communications by default use IPv6. For more information on IPv6, please review the following:

Microsoft Internet Protocol Version 6

In the next segment, I will discuss Implementing networks in support of Failover Clusters (Part 2). See ya then.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 2)


In Part 1, I discussed Windows Server 2008 Failover Cluster networking features.  In this segment, I will discuss implementing networks in a Failover Cluster.

Implementing networks in support of Failover Clusters

The main consideration when designing Failover Cluster networks is to ensure there is built-in redundancy for cluster communications.  This is typically accomplished by having a minimum of two physical Network Interface Cards (NICs) installed in each node that will be part of the cluster.  These cards must be supported by two separate and distinct buses (e.g. two PCI NICs).  Many people think a single multi-port NIC card meets this requirement – it does not, as this configuration creates a single point of failure for all cluster communications.  The best configuration would be two multi-port NICs running on separate buses, having fault tolerance implemented by way of NIC Teaming software (provided by 3rd party vendors), and being physically connected to separate network switches.

Note:  NIC Teaming is not supported on iSCSI connections.  Please review the iSCSI Cluster Support: Frequently Asked Questions.  The appropriate fault-tolerant mechanism for iSCSI connectivity would be multi-path software. Please review the Microsoft Multi-path I/O: Frequently Asked Questions.

There are two primary design scenarios when planning for Failover Cluster network connectivity.  In the first scenario (and the most common), all nodes in the cluster are located on the same networks.  In the second scenario, nodes in the cluster are located on separate and distinct routed networks (this is very common in multi-site cluster implementations).  Figure 18 shows an example of the second scenario.

clip_image002

Figure 18:  Multi-site cluster (network connectivity only)

Note:  Even though it is supported to locate cluster nodes on separate, routed networks, it is still supported to connect nodes in a multi-site cluster using stretched Virtual Local Area Networks (VLAN).  This configuration places the nodes on the same network(s).

It is important in any cluster that there are no NICs on the same node that are configured to be on the same subnet.  This is because the cluster network driver uses the subnet to identify networks and will use the first one detected and ignore any other NICs configured on the same subnet on the same node.  The cluster validation process will register a Warning if any network interfaces in a cluster node are configured to be on the same network.  The only possible exception to this would be for iSCSI (Internet Small Computer System Interface) connections.  If iSCSI is implemented in a cluster, and MPIO (Multi-Path Input/Output) is being used for fault-tolerant connections to iSCSI Storage, then it is possible that the network interfaces could be on the same network. In this configuration, the iSCSI network in the Failover Cluster Manager should be configured such that cluster would not use it for any cluster communications.

Note:  Please consult the iSCSI Cluster Support: Frequently Asked Questions.

As previously mentioned, Windows Server 2008 accommodates cluster nodes being located on separate, routed networks by including a new logic, called an OR logic, when it comes to IP Address resources.  Figure 19 illustrates this.

clip_image004

Figure 19:  IP Address Resource OR logic

When a Network Name resource is configured with an OR dependency on more than one IP Address resource, this means at least one of the IP Address resources must be able to come Online before the Network Name resource can come Online.  Since a Network Name resource can be associated with more than one IP Address, there is a property of a Network Name resource that can be modified so DNS registrations will occur for all of the IP Addresses.  The property is called RegisterAllProvidersIP (See Figure 20).

clip_image006

Figure 20:  Network Name resource properties

Note:  In Figure 20 above, Failover Cluster PowerShell cmdlets were used to access cluster configuration information.  This is new in Windows Server 2008 R2.  For more information, review the TechNet Cmdlet  Reference.

The default registration behavior is to register only the IP Address that can come Online on the node.  Implementing this other behavior by modifying the setting to (1) can assist name resolution in a multi-site cluster scenario.
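Continuing with the PowerShell cmdlets mentioned in the Note above, here is a hedged example of viewing and enabling the setting (the resource name is a placeholder for the Network Name resource in question; the resource typically needs to be taken Offline and brought back Online for the change to take effect):

Import-Module FailoverClusters
Get-ClusterResource "FileServer Network Name" | Get-ClusterParameter RegisterAllProvidersIP
Get-ClusterResource "FileServer Network Name" | Set-ClusterParameter RegisterAllProvidersIP 1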

Note:  Please review KB 947048 for other things to consider when deploying failover cluster nodes on different, routed subnets (multi-site cluster scenario).

While Failover Clusters require a minimum of two NICs to provide reliable cluster communications, there are scenarios where more NICs may be desired and/or required based on the services or applications that are running in the cluster.  One such scenario was already mentioned – iSCSI connectivity to storage.  The other scenario involves Microsoft’s virtualization technology – Hyper-V.

The integration of Failover Clustering with Hyper-V was introduced in Windows Server 2008 (RTM) in the form of making virtual machines highly available in a cluster and able to move (fail over) between the nodes in the cluster using a process called Quick Migration.  In Windows Server 2008 R2, additional capabilities were introduced, including Live Migration and Cluster Shared Volumes (CSV).  These features improved the high availability story for virtual machines, but they also introduced new networking requirements.  The inner workings of Hyper-V networking will not be discussed here.  For more information, please download this whitepaper (http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=3fac6d40-d6b5-4658-bc54-62b925ed7eea).

The networking requirements in a Hyper-V Cluster supporting Live Migration and using Cluster Shared Volumes (CSV) can add up quickly as illustrated in Figure 21.

clip_image008

Figure 21: Hypothetical Networking Requirements

For more information on Live Migration and Cluster Shared Volumes in Windows Server 2008 R2, visit the Microsoft TechNet site.

Using Cluster Shared Volumes in a Failover Cluster in Windows Server 2008 R2

Hyper-V:  Using Live Migration with Cluster Shared Volumes in Windows Server 2008 R2

In the next segment I will discuss Troubleshooting cluster networking issues (Part 3).  See ya then.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 3)


In Part 2, I discussed implementing networks in a Failover Cluster.  In this final segment, I will discuss troubleshooting cluster networking issues.

Troubleshooting cluster networking issues

As previously stated, it is important that redundant and reliable cluster communications connectivity exist between all nodes in a cluster.  However, there may be times when communications connectivity within a cluster is disrupted, either because of actual network failures or because of misconfigured network connectivity.  A loss of communications connectivity with a node in a cluster can result in the node being removed from cluster membership.  When a node is removed from cluster membership, it terminates its cluster service to avoid problems or conflicts as other nodes in the cluster take over the services or applications and resources that were hosted on the removed node.  The node will attempt to rejoin the cluster when its cluster service restarts.  This problem can also have broader effects because the loss of a node in a cluster affects ‘quorum’.  Should the number of nodes participating in the cluster fall below a majority, all highly available services will be taken Offline until ‘quorum’ is re-established.  (The quorum model No Majority: Disk Only is the one exception; however, that model is not recommended.)

Here are some recommended troubleshooting procedures for cluster connectivity issues:

1.        Examine the system log on each cluster node and identify any errors reporting a loss of communications connectivity in the cluster or even broader network-related issues.  Here are some example cluster-related error messages you may encounter:

clip_image002

Figure 22:  Cluster Network Connectivity error messages

Source: http://technet.microsoft.com/en-us/library/cc773562(WS.10).aspx

clip_image004

Figure 23:  Network Connectivity and Configuration error messages

Source:  http://technet.microsoft.com/en-us/library/cc773417(WS.10).aspx

2.       If the system logs provide insufficient detail, generate the cluster logs and inspect the contents for more detailed information concerning the loss of network connectivity (see the sketch after this list).

Note: Generate the cluster logs by running this PowerShell cmdlet –

clip_image006

3.       Verify the configuration of all networks in the cluster.

4.       Verify the configuration of network connectivity devices such as Ethernet switches.

5.       Run an abbreviated cluster validation process by selecting only the Network tests (see the sketch after this list).

clip_image008

The tests that are executed are shown here:

clip_image010

The desired end result is this:

clip_image012

As an example, here is the section in the validation report that shows the results for the List Network Binding Order test –

clip_image014

Some of the common issues seen with respect to the network validation tests include, but may not be limited to:

·         Multiple NICs on a cluster node configured to be on the same subnet.

·         Excessive latency (usually > 2 seconds) in ping tests between interfaces on cluster nodes.

·         Warning that the firewall has been disabled on one or more nodes.

6.       Conduct simple networking tests, such as a ‘ping’ test, across all networks enabled for cluster communications to verify connectivity between the nodes.  Use network monitoring tools such as Microsoft’s Network Monitor  to analyze network traffic between the nodes in the cluster (Refer to Figures 13 and 14).

7.       Evaluate hardware failures related to networking devices such as Network Interface Cards (NICs), network cabling, or network connectivity devices such as switches and routers as needed.

8.       Review the change management log (if one exists in your organization) to determine what, if any, changes were made to the nodes in the cluster that may be related to the disruption in communications connectivity.

9.       Consider opening a support incident with Microsoft.  If a node is removed from cluster membership, it means no network configured on that node could be used to communicate with the other nodes in the cluster.  If multiple networks are configured for cluster use, as recommended, then the loss of cluster membership indicates a problem that affects all of the networks, or the node’s ability to send or receive heartbeat messages.

Note:  For additional information on troubleshooting Windows Server 2008, consult TechNet.
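For reference, here is a minimal sketch that pulls together steps 2, 5, and 6 above using the Windows Server 2008 R2 Failover Cluster PowerShell cmdlets; the node names and destination folder are hypothetical examples.

Import-Module FailoverClusters

# Step 2: generate the cluster log on every node and copy the files to a local folder
Get-ClusterLog -Destination C:\Temp

# Step 5: run an abbreviated validation pass containing only the network tests
Test-Cluster -Node Node1, Node2 -Include "Network"

# Step 6: a simple connectivity check against another node in the cluster
Test-Connection -ComputerName Node2 -Count 4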

Hopefully, the information provided in this three-part blog was helpful and will assist in properly configuring network connectivity in Windows Server 2008 Failover Clusters.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Cluster Validation Storage Test ‘List All Disks’ Fails with Status 87


Greetings CORE blog fans!  It has been a while, so I thought it was time for another blog.  In recent weeks, we have seen an issue where the Windows Server 2008 R2 storage validation test List All Disks fails with a Status 87.  Figure 1 is an example of what is displayed in the cluster validation report.

clip_image002

Figure 1:  List All Disks failure in Cluster Validation Report.

This error is also reflected in the ValidateStorage log (Figure 2) located in %systemroot%\Cluster\Reports directory.

000016f4.00001714::01:02:06.180  CreateNtFile: Path \Device\HarddiskVolume2, status 87
000016f4.00001714::01:02:06.180  GetNtldrDiskNumbers: Failed to open device \Device\HarddiskVolume2, status 87
000016f4.00001714::01:02:06.180  GetNtldrDiskNumbers: Exit GetNtldrDiskNumbers: status 87
000016f4.00001714::01:02:06.180  CprepPrepareNodePhase2: Failed to obtain boot disk list, status 87
000016f4.00001714::01:02:06.180  CprepPrepareNodePhase2: Exit CprepPrepareNodePhase2: hr 0x80070057, pulNumDisks 0

Figure 2: ValidateStorage log entry

The decode for these errors is shown in Figure 3.

# for decimal 87 / hex 0x57 :
  ERROR_INVALID_PARAMETER                                   winerror.h
# The parameter is incorrect.

Figure 3:  Error decode

The cause of this failure is, to this point, unknown.  What we do know is that the path called out in the ValidateStorage log (Figure 2 above) always points to the 100-megabyte partition that is created at the beginning of the system disk.  This partition is created by default and is in place to support BitLocker.  The approved workaround is to assign a drive letter to the 100-megabyte partition and re-run the validation process (a sketch of the workaround follows Figure 4); the List All Disks storage test should pass at that point.  There is no adverse impact to assigning a drive letter to this partition.  As a reminder, BitLocker is not supported in a cluster environment.  This is documented in KB 947302.  If an attempt is made to enable BitLocker on a cluster node, the error in Figure 4 is displayed.

clip_image006

Figure 4:  Error when trying to enable BitLocker on a cluster node
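For reference, here is a rough sketch of the workaround from PowerShell using WMI.  The 'System Reserved' volume label and the Z: drive letter are assumptions; the same change can be made in Disk Management or with diskpart.

# Find the 100 MB partition, which normally carries the 'System Reserved' label
$vol = Get-WmiObject Win32_Volume -Filter "Label='System Reserved'"

# Assign an unused drive letter so the validation storage tests can open the volume
$vol.DriveLetter = "Z:"
$vol.Put()

# Re-run cluster validation (or just the storage tests) afterward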

I have an ‘ask’ of our readership.  If anyone reading this blog can ‘on demand’ repro this issue, we want to hear from you.  This goes beyond just telling us, “Yeah, I’ve had that issue myself.”  I am interested in hearing from anyone who has perhaps manipulated a setting in their controller card that can either cause validation to fail in this way or make it pass.  I am interested in hearing from someone who had this failure, changed a setting of some kind, either in software or hardware, and the error went away.  Be sure to provide the details (Make and model of controller, Firmware and driver versioning information, steps to reproduce the issue, etc…)

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Windows Server 2008 Failover Clusters: Networking (Part 4)


The Windows Server 2008 Failover Clustering: Networking three-part blog series has been out for a little while now.  Hopefully, it has been helpful.  Little did I know there would be an opportunity to write another part.  This segment will be short as it covers a very specific scenario – one that we rarely see, but that we have encountered often enough that I felt it was worth writing about.

There are applications written to access resources that are hosted in Microsoft clusters running on Windows Server 2008 (RTM and R2).  The resource could be a File Server, a SQL database, or something else; the point is that the required resource is hosted in a Failover Cluster.  The hope is that applications that need to function in this manner are written properly to locate the required resource.  By that I mean I would expect an application to first query a name server (DNS server) and then use the information obtained to make a proper connection to the required cluster resource.  In a Failover Cluster, that connection point is known as a Client Access Point (CAP).  A CAP consists of a Network Name (NetBIOS) resource and one or more IP Address resources.  The default behavior in a Windows Server 2008 cluster is to dynamically register the CAP information in a DNS server, provided the server is configured to support Dynamic Updates; this occurs when the CAP is brought Online in the cluster.  Some applications, however, are not written this way.  Instead, they make a local connection on a cluster node by binding to the first network adapter listed and then use the IP address configured for that adapter.  The problem is that, in a cluster, the first connection listed in the binding order by default is the Microsoft Failover Cluster Virtual Adapter.  This adapter uses an IP address drawn from the APIPA (Automatic Private IP Addressing) address range, which is non-routable and not registered in DNS.
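By way of illustration, a properly written client would resolve the CAP name in DNS rather than relying on a local adapter.  Here is a minimal sketch from PowerShell; CONTOSO-FS1 is a hypothetical Client Access Point name used purely for illustration.

# Resolve a cluster Client Access Point in DNS (the name is a hypothetical example)
[System.Net.Dns]::GetHostAddresses("CONTOSO-FS1") | ForEach-Object { $_.IPAddressToString }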

To help these types of applications work better, we can use a utility that has been released for public download on the Microsoft MSDN site.  The utility is called ‘nvspbind.’  The first step is to download and install the utility on each cluster node.  The options we will be using are shown in Figure 1.

clip_image002

Figure 1:  Options for nvspbind

First we need to identify the adapter that is the Microsoft Failover Cluster Virtual Adapter by using the nvspbind /n command (Figure 2).  The adapter is ‘Local area connection* 9’. 

clip_image004

Figure 2:  Identify the Microsoft Failover Cluster Virtual Adapter

Next, we use the 'nvspbind /o ms_tcpip' command to determine the binding order for IPv4 (Figure 3).

clip_image005

Figure 3: Listing the bindings for IPv4

We can see here that the adapter is listed at the top of the binding order for IPv4, which is causing the problem for some applications.  We need to move the adapter down in the binding order, so we will use the following command to accomplish that –

C:\nvspbind /- "local area connection* 9" ms_tcpip (Figure 4).

clip_image007

Figure 4:  Moving the adapter down in the binding order for IPv4

Note:  The adapter can be moved further down by using /-- if desired.

Once the adapter has been positioned correctly in the binding order, the application can be tested to see if it now works as desired.

To further highlight the effect of this utility, we can inspect the registry.  First, we need to locate some information for the Microsoft Failover Cluster Virtual Adapter.  Navigate to the following registry key (Figure 5) and locate the adapter –

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}

clip_image009

Figure 5:  Microsoft Failover Cluster Virtual Adapter NetCfgInstanceId

The same information shown in Figure 5 is also displayed in Figure 2.

With the information in hand, navigate to the following registry key (Figure 6) to verify the adapter is no longer listed at the top of the binding order.

clip_image011

Figure 6: HKLM\SYSTEM\CurrentControlSet\services\Tcpip\Linkage
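For reference, the same information can be read from PowerShell; a minimal sketch follows.  The Bind value under the Tcpip Linkage key lists one \Device\{GUID} entry per adapter, in binding order.

# Read the TCP/IP binding order that Figure 6 shows in the registry editor
Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Linkage" |
    Select-Object -ExpandProperty Bind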

That’s about it.  Thanks for your time and, as always, we hope the information here has been useful to you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Help us, help you.


Catchy title, huh. Perhaps not, but it is really what we want you to do. This will be a pretty short blog to get out some information that is important for you to know as it may help resolve a Hyper-V issue quickly, or, better yet, prevent one from happening at all. Inside Microsoft, we have what we call Supportability Program Managers (SPM). They help drive product quality by looking at the types of issues that come through our Customer Support organization. They also look at issues being reported in technology forum posts. They track trends so we can improve the product. In a conversation I had recently with the Hyper-V SPM, I was made aware of a number of issues that were resolved last quarter by simply installing a hotfix. So, here I am. Help us, help you by spending some time checking out these two online resources:

Hyper-V Update List for Windows Server 2008: http://technet.microsoft.com/en-us/library/dd430893(WS.10).aspx

Hyper-V Update List for Windows Server 2008 R2: http://technet.microsoft.com/en-us/library/ff394763(WS.10).aspx

While not updated on a daily basis, these resources should be the first stop when you run into an issue with Hyper-V. We cannot make every fix for the operating system and its components available via Windows Update. Some may require downloading using a link provided in a KB article.

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Working with File Shares in Windows Server 2008 (R2) Failover Clusters


I know what you are thinking, “How hard can it be to work with cluster file shares?”. I would be willing to bet a lot of you have been working with File Server clusters since NT 4.0 days. If you are still working with them today in Windows Server 2008 R2, you know things have changed. In this blog, I hope to give you some insight into a piece of functionality both within Failover Cluster and Explorer that may alter the way you work with file shares in your organization. It may even help finally solve a mystery that has been plaguing some of you for a while now.

I will be working with a 2-Node Windows Server 2008 R2 Failover Cluster (Figure 1).

clip_image002

Figure 1

In the cluster, I created a highly available File Server (CONTOSO-FS1). I created a series of folders, using the Explorer interface, on the storage in the File Server resource group (Figure 2).

clip_image004

Figure 2

I will use these folders to create highly available shares in the CONTOSO-FS1 File Server resource group.

There are three main ways to provision shares in a Failover Cluster using built-in GUI tools.

1. Failover Cluster Management snap-in

2. Share and Storage Manager snap-in

3. Explorer interface

In the Failover Cluster Management interface, the Add a shared folder function is available in the Actions pane (Figure 3).

clip_image006

Figure 3

In the Share and Storage Management interface, the Provision Share function is available in the Actions pane (Figure 4).

clip_image008

Figure 4

In Explorer, you simply Right-Click on the folder and Share with users (or nobody to stop sharing) (Figure 5).

clip_image010

Figure 5

The end result using any of these three methodologies is shared folders appearing in the Failover Cluster Manager snap-in in the CONTOSO-FS1 resource group (Figure 6).

clip_image012

Figure 6

A similar display can be seen in Share and Storage Manager (Figure 7).

clip_image014

Figure 7

Inspecting the cluster registry hive, we can see the shares defined under the appropriate File Server resource (FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))) (Figure 8).

clip_image016

Figure 8

At this point you may be thinking, “So what Chuck. This isn’t rocket science. We know all this stuff.” And, you may be right. Setting up the shares is the easy part, and we provide you with several methods with which to accomplish this, but what happens when you no longer want to share ‘stuff’ anymore? This is where it could get a little interesting.

If you do not want to share a folder anymore, there are correct ways to do this. In the Failover Cluster Management interface, Right-Click on the shared folder and select Stop Sharing (Figure 9).

clip_image018

Figure 9

In the Share and Storage Manager interface, Right-Click on the share and select Stop Sharing (Figure 10).

clip_image020

Figure 10

Finally, in the Explorer interface, Right-Click on the folder and select Share with Nobody (Figure 11).

clip_image022

Figure 11

The unexpected behavior occurs in the Explorer interface if, instead of stopping sharing as shown in Figure 11, the user chooses to Delete the folder (Figure 12).  There can be unintended consequences for that action.

clip_image024

Figure 12

In Explorer, when the folder is selected for deletion, a pop-up Confirmation window is displayed. An example of one is shown in Figure 13.

clip_image026

Figure 13

If Yes is selected, the folder is deleted. In the Failover Cluster Management interface, however, the shared folder that was just deleted in Explorer is still displayed and appears to be Online (Figure 14).

clip_image028

Figure 14

Even the cluster registry hive will show the share present under the File Server resource (Figure 15).

clip_image030

Figure 15

Note: In previous versions of clustering, the cluster service maintained cluster file share information in the registry key HKLM\System\CurrentControlSet\Services\LanmanServer\Shares.

Here is the punch line – the next time the File Server resource is cycled Offline and then back Online (as happens during a failover of the resource group to another node in the cluster), an Error (Event ID 1588) will be registered in the System Event Log (Figure 16).  The error indicates the share cannot be found and therefore cannot be brought Online by the File Server resource.

clip_image032

Figure 16

The cluster log reports a problem as well but it is only a Warning (Figure 17).

00000944.00000688::2010/08/07-18:05:31.183 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed in NetShareGetInfo(CONTOSO-FS1, Pictures), status 2310. Tolerating...

00000944.00000b04::2010/08/07-18:06:31.185 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed in NetShareGetInfo(CONTOSO-FS1, Pictures), status 2310. Tolerating...

00000944.00000590::2010/08/07-18:07:31.190 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed in NetShareGetInfo(CONTOSO-FS1, Pictures), status 2310. Tolerating...

00000944.00000830::2010/08/07-18:08:31.194 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed in NetShareGetInfo(CONTOSO-FS1, Pictures), status 2310. Tolerating...

00000944.00000b48::2010/08/07-18:09:31.197 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed in NetShareGetInfo(CONTOSO-FS1, Pictures), status 2310. Tolerating...

Figure 17

Decoding Status 2310 (Figure 18)

clip_image033

Figure 18
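For reference, status 2310 is the Win32 network error NERR_NetNameNotFound, and the decode shown in Figure 18 can be reproduced from a command prompt or PowerShell window:

# Decode the status value seen in the cluster log entries above
net helpmsg 2310

# Returns: This shared resource does not exist.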

These errors in the System Event Log do not prevent the File Server resource from coming Online and bringing all the other valid shared folders Online (unless it was the last shared folder associated with the File Server resource – see the ‘bonus material’ at the end of the blog).  However, I think you can quickly see that deleting shared folders instead of just stopping them from being shared can, over time, accumulate orphaned entries in the cluster registry hive, and the Event ID 1588 Error messages will continue to be registered for each of the ‘orphaned’ shares.

One way this behavior manifests itself is if a shared folder is created in Failover Cluster Manager or Share and Storage Manager, and is then deleted in Explorer. The Event ID 1588 is registered because the cluster registry hive is not ‘cleaned’ up properly. If the folder is shared in Explorer and then subsequently deleted in Explorer, a different pop-up Warning is displayed (Figure 19).

clip_image034

Figure 19

If folders are not deleted but instead are just stopped from being shared, then the cluster is cleaned up properly and the error should not be registered. If the pop-up in Figure 19 is displayed (as opposed to the pop-up shown in Figure 13), then the share will be properly removed from the Failover Cluster and the cluster registry hive will be properly cleaned up.

Another scenario where an Event ID 1588 can be registered, but which is not the result of the cluster registry hive failing to be cleaned up properly, is one where the System account has been removed from the default security settings for a folder that is shared in a Failover Cluster.

Bonus Material:

What happens if the final shared folder that is associated with a File Server resource is deleted?  At the first LooksAlive/IsAlive check, the File Server resource will fail.  A failover will be initiated, but in the end, the File Server resource will remain in a Failed state.  An Event ID 1587 (Figure 20) could be registered along with the customary Event ID 1069 reporting a cluster resource failure.

clip_image036

Figure 20

The cluster log entry will be different from the previous entry (Figure 17) as shown in the highlighted section below (Figure 21). This time it is not a Warning but an Error ([ERR]) that is seen in the cluster log.

00000720.00000a70::2010/08/10-22:25:13.616 INFO [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Shares 'are being scoped to virtual name CONTOSO-FS1

00000720.00000a70::2010/08/10-22:25:13.616 DBG [RHS] Resource FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk)) called SetResourceStatus: checkpoint 2. Old state OnlinePending, new state OnlinePending

00000720.00000a70::2010/08/10-22:25:13.616 WARN [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed to open path e:\Documents. Error: 2. Maybe a reparse point...

00000720.00000a70::2010/08/10-22:25:13.616 ERR [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed to open path e:\Documents with reparse flag. Error: 2.

00000720.00000a70::2010/08/10-22:25:13.616 ERR [RES] File Server <FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk))>: Failed to online a single share among 1 shares.

00000720.00000a70::2010/08/10-22:25:13.616 DBG [RHS] Resource FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk)) called SetResourceStatus: checkpoint 2. Old state OnlinePending, new state Failed

00000720.00000a70::2010/08/10-22:25:13.616 ERR [RHS] Online for resource FileServer-(CONTOSO-FS1)(Contoso-FS1 (Disk)) failed.

I hope this information has been helpful and perhaps solved a few mysteries out there.

Thanks for your attention and come back.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Using Multiple Client Access Points (CAP) in a Windows Server 2008 (R2) Failover Cluster


Quite a while back I wrote a blog on a new functionality in Windows Server 2008 Failover Clusters called ‘file share scoping’ (http://blogs.technet.com/b/askcore/archive/2009/01/09/file-share-scoping-in-windows-server-2008-failover-clusters.aspx). I was informed recently that our Networking Support Team refers to this blog frequently when working with customers who are migrating to Windows Server 2008 Failover Clusters and discover that CNAME (canonical name) records in DNS, which had been in place to support their Windows Server 2003 File Server clusters, no longer work with Windows Server 2008 Failover Clusters. Users keep asking if there is a way to disable this functionality or if it can be changed by adding a registry key or something similar. At this time, this behavior cannot be disabled, and our Product Team has been made aware of the feedback we have been receiving on this. No official plans have been announced with respect to making any changes in future releases of the Operating System.

While we wait and see what the future holds, I have been asked to write a short blog on how users can better work within the constraints of this functionality. In a File Server Resource Group you typically have a Client Access Point (CAP), a File Server Resource, a Physical Disk resource and some Shared Folders (Figure 1).

clip_image002

Figure 1

Suppose, in a Windows Server 2003 cluster environment, there were several CNAME records created in DNS that pointed to the same File Server Cluster so users from various organizations within a company could access their data files. For example, suppose we had CNAME records for OPS-FS1, Academics-FS1 and Executive-FS1. After completing a migration to a Windows Server 2008 R2 File Server cluster, these CNAME records no longer work and end users can no longer access their data. How can we fix that?

To remedy the situation, create additional CAPs in the File Server resource group that contains the shared folders holding the data the users need to access. Doing this requires stepping outside of the normal wizard-based process that was used to create the original highly available File Server resource group and instead using the procedures described in KB 947050 (a PowerShell sketch appears after Figure 4).

Start by selecting the File Server resource group and in the Right-hand Actions pane select Add a resource (Figure 2).

clip_image004

Figure 2

From the list of available resources, select Client Access Point (Figure 3).

clip_image006

Figure 3

Provide the requested information and complete the wizard. Do this for all required Client Access Points. When completed, bring all the CAPs Online. Here is my result (Figure 4).

clip_image008

Figure 4
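The same Client Access Points can also be created from PowerShell instead of the Add a resource wizard. This is only a rough sketch; the group name, resource names, IP address, and cluster network name below are hypothetical examples, and KB 947050 remains the authoritative procedure.

Import-Module FailoverClusters

# Add a second Network Name resource to the existing File Server group
Add-ClusterResource -Name "OPS-FS1" -Group "CONTOSO-FS1" -ResourceType "Network Name"
Get-ClusterResource "OPS-FS1" | Set-ClusterParameter -Name Name -Value "OPS-FS1"

# Add an IP Address resource for the new Client Access Point
Add-ClusterResource -Name "IP Address OPS-FS1" -Group "CONTOSO-FS1" -ResourceType "IP Address"
Get-ClusterResource "IP Address OPS-FS1" | Set-ClusterParameter -Multiple @{
    "Network" = "Cluster Network 1"; "Address" = "192.168.1.60"; "SubnetMask" = "255.255.255.0" }

# The Network Name must depend on its IP Address resource before it can come Online
Set-ClusterResourceDependency -Resource "OPS-FS1" -Dependency "[IP Address OPS-FS1]"
Start-ClusterResource "IP Address OPS-FS1"
Start-ClusterResource "OPS-FS1"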

At this point, decide which shared folders need to be available to users when each Client Access Point connection is made. Then, create the shared folders in the correct context. Figure 5 shows the selections available when executing the Add shared folder action in the Actions pane.

clip_image010

Figure 5

As an example, in my 2-Node cluster, all folders shown in Figure 1 were shared in the context of CONTOSO-FS1. After adding the additional Client Access Points that were needed, a decision was made that the Academics share was needed in the Academics-FS1 context, the Executive and Archive folders were needed in the Executive-FS1 context and finally the Operations folder was needed in the OPS-FS1 context. When sharing folders in multiple contexts, the display can start getting a little cluttered (Figure 6).

clip_image012

Figure 6

When all File Server resources are Online, all shared folders associated with those resources are displayed. If multiple File Server resources are associated with the same shared folder, multiple entries are displayed (Figure 6). This is in addition to the administrative share for the associated physical disk resource.

To help clarify some of the confusion, modify the Description on the Sharing tab for the Property page of the shared folder to reflect its associated File Server resource (Figure 7).

clip_image014

Figure 7

This provides some organization to what can be a cluttered display (Figure 8).

clip_image015

Figure 8

Additional administrative overhead is incurred here as well because multiple Access Control List (ACL) entries must be maintained on the same set of folders. Depending on the tools used to migrate the data to a Windows Server 2008 Failover Cluster, that information could already be present on the storage and not be an issue.

I hope this helps provide a solution for your organization. See you next time.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

