Exchange 2010 SP1 Hung I/O watchdog thread causes intentional BSOD

The other day I ran into an issue where one of a customer's Exchange 2010 mailbox servers was rebooting continuously after what seemed to be a few minutes of uptime, and I thought I would share the experience and end result. This box was hosting the active copy of their mailbox database, and the passive copy had already been put into a “failed and suspended” state by the system.

Obviously the system did not fail over to the passive copy, and there were upwards of 6,000 transaction logs in the copy queue. I had the customer shut down the guest in vSphere and vMotion it to another host while it was powered down. This seemed to stabilize the box and stop the automatic restarts after a few minutes of uptime. I was then able to grab the memory dump files and analyze them with WinDbg. The bugcheck code reported in the memory.dmp file was:

BugCheck F4, {3, fffffa8004dc9b30, fffffa8004dc9e10, fffff80001bd8f40}

The culprit behind the blue screen was msexchangerepl.exe. That seemed kind of odd, so I looked into any known issues with SP1 or RU3 for SP1, which is what the customer is running. That brought me to a TechNet article explaining the new high availability features in Exchange 2010 SP1.
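The BugCheck line from the dump can be picked apart with a few lines of Python. This is only a sketch for reading the summary line that WinDbg prints; the function name `parse_bugcheck` is my own. Bug check 0xF4 is CRITICAL_OBJECT_TERMINATION, and a first parameter of 3 indicates that the terminated object was a process, which fits with what follows.

```python
import re

# Matches the summary line WinDbg prints, e.g. "BugCheck F4, {3, a, b, c}".
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [parameters]) as integers, or None if the line does not match."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    # The parameters are hexadecimal values separated by commas.
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


line = "BugCheck F4, {3, fffffa8004dc9b30, fffffa8004dc9e10, fffff80001bd8f40}"
code, params = parse_bugcheck(line)
print(hex(code), [hex(p) for p in params])
```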

In Exchange 2010 SP1, ESE has a new feature that detects “hung I/O” with a watchdog thread and logs events that Active Manager will see and respond to accordingly. Active Manager is the component in Exchange 2010 that acts as the brains for choosing which database(s) to activate during best copy selection and for handling recovery from database issues.

If the ESE watchdog sees a database I/O issue that lasts longer than four minutes, the msexchangerepl.exe service will initiate a bug check by terminating wininit.exe.

So it seems that Exchange is trying to protect itself and recover from storage I/O issues by initiating a blue screen and restarting the system. I’m unsure, though, how the integrity of the Exchange database files and logs is ensured by the system before the bug check occurs. It also appears that you can change the behavior of this feature in the registry or in Active Directory.

Some useful links below.

Active Manager technical details – http://technet.microsoft.com/en-us/library/dd776123.aspx

Blog Post on Exchange 2010 Hanging I/O  – http://thoughtsofanidlemind.wordpress.com/2011/02/10/sp1-and-bsod/

ESE on Hung IO at the bottom of the page – http://technet.microsoft.com/en-us/library/ff625233.aspx

DFS replication, DFS Replication File Filters, FSRM Quotas & File Screening = Headaches!

I was recently troubleshooting a DFS issue for a customer that seemed sporadic for end users and made no sense to the administrators on site. DFS was set up with two Windows Server 2008 R2 SP1 file servers as targets, and a domain-based DFS namespace was set up to publish two root paths.

Symptoms: End users were experiencing long delays while trying to save data to their home directory, which was mapped using the DFS namespace. They also experienced the same lag while trying to read from this same directory. Yet it did not happen, and was not reproducible, from the desktop.

Troubleshooting: Dumping the client DFS referral, domain and provider caches showed nothing wrong. Mapping to each of the servers directly also panned out as expected. So DNS and querying the domain controllers for referrals seemed to be fine as far as I could tell. Finally an end user had the issue while I was watching, and I was able to determine that this user was only having the issue when DFS server 2 was the file server it received as its target from Active Directory.

I focused my attention on DFS file server 2 and started looking at the logs. The application log didn’t show anything helpful, but the DFS Replication log in the “crimson channel” did show warnings that the staging directory, the target replication directory, or both did not have enough free space. The warning was raised with event ID 4502.

The staging quota for the replication group was set to a very high value in MB so I was certain that it was not out of disk space. There was plenty of free space on the volume in disk management as well.

I then started looking outside of DFS for what could be causing the space issue for the staging directory. FSRM was in fact installed and in use. Both quotas and file screen templates had been set up on the directory that was being replicated by DFS, which included the staging directory in its default path. For example, D:\Share\dfsrprivate. The quota was set to a hard limit of 5GB, which was in fact causing the free space issue on the staging directory no matter what quota size was set up in DFSR. I believe that when the staging directory is running low on or out of free space, a cleanup process is run to free up more space so that it can continue operations. Digging further, I opened up the DFS Replication debug log, which by default is located in the C:\Windows\debug directory, and noticed many, many access is denied errors. The file types being logged with access is denied were the same file types that the file screen template in FSRM was configured to block.

Then it clicked. Comparing the time stamps of the free space warnings in the DFS Replication log in the crimson channel with the time stamps of the access is denied errors in the debug log, and remembering that the staff and end users described a sporadic issue when saving or reading files, I immediately opened perfmon to check the disk for I/O. Saving and opening = reading and writing to the disk. “Lag” = high disk I/O. The disk was in fact being hit harder than it should have been during normal operations.

Resolution: We removed the file screen template and quota on E:\Share\*, which included the dfsrprivate directory, and restarted both the FSRM and DFS Replication services in the Services applet to generate new logs and events. The warnings about staging directory free space did not resurface. The DFS Replication debug log in C:\Windows\debug was clean, with no errors. Disk I/O ramped up in perfmon as the DFS Replication backlog caught up, but then flattened out. Reproducing the issue was difficult from the beginning, but I could not reproduce it at all after the changes were made and have not heard of the issue since.

I wonder, then, how you would be able to use FSRM file screening with DFS and DFS Replication effectively, since the two technologies don’t seem to be aware of each other. I came across the following blog post, which mentions using DFS file filters to ensure that extensions that are included in the FSRM file screening are also excluded from DFS replication groups. Look for reason #9.

http://blogs.technet.com/b/askds/archive/2007/10/05/top-10-common-causes-of-slow-replication-with-dfsr.aspx

Easy enough! I started plugging some extensions into the DFS Replication group file filters box to test this out. After typing in 254 characters, I could not add any more. The Audio/Video file filter alone in FSRM has a list of extensions longer than this character limit. What’s up with that? Well, as this is a domain-based DFS namespace and configuration information is stored in AD, off to Active Directory we go. Opening ADSI Edit and navigating to the following attribute allows more than 254 characters to be added.

Domain Context

DC=domain,dc=com

CN=System

CN=DFSR-GlobalSettings

CN=ReplicationGroup

CN=Content

CN=FolderName

Right-click CN=FolderName and select Properties. If you navigate to the attribute msDFSR-FileFilter, you can edit it and add more than the 254 characters that the DFS Management GUI allows.
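To get a feel for how quickly a list of screened extensions runs past the GUI limit, here is a rough Python sketch that builds a filter string and measures it. It assumes the filter is a comma-separated list of wildcard patterns and that the default filter is `~*, *.bak, *.tmp`. The function names and the sample extensions are made up for illustration, and the 254 figure is simply what I observed in the GUI.

```python
GUI_LIMIT = 254  # character limit observed in the DFS Management GUI (per this post)


def build_file_filter(extensions, existing=("~*", "*.bak", "*.tmp")):
    """Append '*.ext' patterns to an existing filter list, skipping duplicates."""
    patterns = list(existing)
    for ext in extensions:
        # Accept "mp3", ".mp3" or "*.mp3" and normalise to "*.mp3".
        pattern = "*." + ext.strip().lstrip("*.").lower()
        if pattern not in patterns:
            patterns.append(pattern)
    return ", ".join(patterns)


def exceeds_gui_limit(filter_string, limit=GUI_LIMIT):
    """True when the string is too long to be typed into the GUI box."""
    return len(filter_string) > limit


short_filter = build_file_filter(["mp3", ".avi", "*.wmv", "MP3"])
long_filter = build_file_filter(["ext%d" % i for i in range(100)])
print(short_filter, exceeds_gui_limit(short_filter))
print(len(long_filter), exceeds_gui_limit(long_filter))
```

A string that exceeds the limit would have to be pasted into msDFSR-FileFilter through ADSI Edit rather than typed into the GUI.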

I’m currently in the process of building this exact scenario in my lab environment to confirm my theory but hopefully adding the values in ADSI edit will be recognized and excluded from replication. If anyone ever actually reads this and can test, please post back any findings.

Hope it helps someone, someday.

UDP Push Notification and the Outlook 2003 “Unknown Error” with an Exchange 2010 Mailbox

Outlook 2003 is very much still in use today in the real world. In fact, it is in 90% of the environments I have worked in that are in the process of transitioning to Exchange 2010. During a particular transition, a test user group that was migrated to Exchange 2010 on the back end but still utilizing Outlook 2003 SP3 on the front end in “online / non cached exchange mode”, was experiencing a reproducible “unknown error” message when deleting messages.

Outlook 2010 did not experience the issue and it also was not reproducible via Outlook Web App. Digging around for a bit I came across MS KB Article ID 2009942 that made me think that UDP Push Notifications may be the culprit.

Push notification support was apparently added in Rollup 3 for Exchange 2010 SP1 but MUST be enabled by adding a registry value. In my situation, while this did help with the delay, it did not completely resolve the issue. Hopefully it will be improved in RU5, as I did not see it in the list of RU4 fixes/enhancements. Registry key information is below.

Subkey location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeRPC\ParametersSystem
Subkey name: EnablePushNotifications
Type: REG_DWORD
Value: 1
Note: If this registry key does not exist, or if its value is set to 0, the UDP notification support feature is not enabled.
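If you would rather import the setting than click through regedit, the value above can be written out as a .reg file. The sketch below only renders the file text; the helper name `render_reg_dword` is my own, and the key path is the backslash-separated form of the subkey location listed above.

```python
def render_reg_dword(key_path, value_name, value):
    """Return the text of a .reg file that sets a single REG_DWORD value."""
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{key_path}]\r\n"
        # DWORD values are written as eight hexadecimal digits.
        f'"{value_name}"=dword:{value:08x}\r\n'
    )


KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeRPC\ParametersSystem"
reg_text = render_reg_dword(KEY, "EnablePushNotifications", 1)
print(reg_text)
```

Save the output to a file with a .reg extension and review it before importing it on the server.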

Another workaround is to enable cached Exchange mode in the Outlook client, which was not a viable option for me as the clients were being accessed via terminal services. Outlook 2003 has also been in extended support since 2009, so I don’t see them adding too much to it. “Push” for Outlook 2010; it really is time to update if you’re still on 2003!

Unable to Open Default Mail Folders. You do not have permission to log on.

Working on an Exchange 2003 to 2010 transition I came across the “unable to open default mail folders. You do not have permission to log on” error.

After a bit of troubleshooting, which included verifying OWA access and Autodiscover, confirming that the issue affected all mailboxes (whether migrated or newly created 2010 mailboxes), and recreating the Outlook profile, the issue came down to DNS.

The customer environment was using a fqdn of subdomain.domain.org and an A record for the CAS array name was never placed into the subdomain zone. Once added the issue was resolved immediately.
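A quick way to catch this kind of gap is to confirm that every name the clients depend on actually resolves. The following is a minimal Python sketch; the hostnames are hypothetical, and the stand-in resolver only exists so the example runs without touching real DNS. In practice you would call `missing_a_records` with the default resolver from a client in the affected zone.

```python
import socket


def missing_a_records(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the resolver cannot turn into an address."""
    missing = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


def fake_resolve(name):
    """Stand-in resolver with a zone that lacks the CAS array record."""
    zone = {"mail.subdomain.domain.org": "10.0.0.10"}  # hypothetical record
    if name not in zone:
        raise socket.gaierror("no such host")
    return zone[name]


names = ["mail.subdomain.domain.org", "casarray.subdomain.domain.org"]
unresolved = missing_a_records(names, resolve=fake_resolve)
print(unresolved)
```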

Error 0x8007010B attempting to download the OAB

I found myself troubleshooting a rather difficult scenario the other day and felt that someone might benefit from my pain if I shared.

After resolving some Autodiscover issues in an Exchange 2010 SP1 environment consisting of two WNLB CAS servers, I was left with an error when attempting to download the OAB in Outlook 2010. The Autodiscover issues were due to IIS virtual directory security settings, and once they were resolved, Autodiscover, Availability and the e-mail autoconfiguration test in Outlook were all successful.

Attempting to download the OAB in Outlook 2010 resulted in a 0x8007010B error. After finding this helpful page I learned that this error points to an invalid directory. My autodiscover settings were set to point Outlook clients to the fqdn of my WNLB ip address for my CAS array.

This was resolvable in Explorer, and in IE I was prompted for credentials and was then able to view the oab.xml file as expected.

FDS services were working as I would expect and the OAB directory on both CAS servers was being updated and populated properly based on the schedule set in EMC.

I then thought that maybe the directory being reported as invalid was local to my machine and Outlook related rather than an issue in the Exchange configuration.

On a client that was having OAB download issues, I also noticed that after clicking download address book in Outlook, the default name was “Download Offline Address Lists” which struck me as odd. I always remembered this being “Global Address List” in a default Exchange installation.

I also noticed that on my client machine there was no directory named “Offline Address Books” in C:\Users\%username%\AppData\Local\Microsoft\Outlook

After deleting my Outlook profile and re-creating using the now resolved autodiscover service I was able to successfully download the offline address list and populate the Outlook directory with a folder to house the OAB.

My assumption is that during email profile creation on the local machine, the C:\Users\%username%\AppData\Local\Microsoft\Outlook\Offline Address Books directory is created and populated with the OAB using the URL received from the Autodiscover service. If the initial profile creation is unsuccessful, the Offline Address Books directory will not be created, therefore resulting in the 0x8007010B error when attempting to download the OAB.
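Checking for the folder is easy to script if you need to look at more than one profile. This Python sketch assumes the folder lives under the user's local AppData path as described above (the LOCALAPPDATA environment variable on the client). The function names are my own, and the demo uses a temporary directory in place of a real profile.

```python
import os
import tempfile
from pathlib import Path


def oab_folder(local_appdata=None):
    """Build the expected 'Offline Address Books' path under local AppData."""
    base = Path(local_appdata or os.environ.get("LOCALAPPDATA", ""))
    return base / "Microsoft" / "Outlook" / "Offline Address Books"


def oab_folder_exists(local_appdata=None):
    """True when the Outlook profile already has an OAB folder."""
    return oab_folder(local_appdata).is_dir()


# Demo against a temporary directory standing in for a user's AppData\Local.
with tempfile.TemporaryDirectory() as fake_appdata:
    before = oab_folder_exists(fake_appdata)
    oab_folder(fake_appdata).mkdir(parents=True)
    after = oab_folder_exists(fake_appdata)
print(before, after)
```

If the folder is missing on an affected client, recreating the Outlook profile as I did above is the fix that worked for me.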

Hope this ends up helping someone in the future.

CITRIX lmadmin.exe crash after unexpected shutdown

I have now experienced multiple times that after an unexpected shutdown of a Windows 2008 R2 Standard guest running on Hyper-V 2008 R2 (non-SP1), the CITRIX licensing server service will not start, or will not stay started after manual intervention. In all instances, uninstalling the licensing server and reinstalling it fixed the issue. I did not have to restart my server after removing the software, nor did I have to reimport my license files. Just remove and reinstall.

See the below CITRIX article which references the known issue for Windows 2003 but does not reference Windows 2008.

http://support.citrix.com/article/CTX125194

You can also follow a thread on the forums to get more detail around what other users are experiencing.

http://forums.citrix.com/thread.jspa?threadID=258444&start=0

If anyone reads this and can verify that build 10009 will also resolve the issue for Server 2008, please let me know. Of course, the real fix for me is to determine what is causing the cluster service to crash unexpectedly on one of my cluster nodes!

Service Pack 1 for Windows Server 2008 R2 installation

Documented below is my experience with updating a Windows Server 2008 R2 Standard virtual machine running on a Windows Server 2008 RTM standalone hypervisor to Windows Server 2008 R2 SP1. My main interest in testing the service pack is RemoteFX in my blade center environment. The virtual server being updated has the following roles and features installed.

Remote Desktop Services
RemoteApp Manager
RD Session Host Configuration
Remote Desktop Services Manager
Web Server (IIS)

With that out of the way, on to installation. I am using the same installation .iso image that was used to install Windows 7 SP1, as all versions are included in a single image totaling 1.9GB. Upon execution of setup.exe, I am presented with the Install Windows Server Service Pack notification window. It has identified my system as running Windows Server 2008 R2 but doesn’t seem to care whether it is Standard, Enterprise or Datacenter.

After continuing, I am presented with accepting license terms which was not the case with the Windows 7 SP1 installation from the same image file.

I am again presented with the “preparing your computer” dialog shown below.

After preparation is completed, I am presented with a message stating to “Save all work, close all open programs, and then click Install”. There is also a check box which defaults to being selected that will automatically restart the system.

After clicking Install, the installer begins to download what seems to be a required update. The update is knowledge base article 976902 (http://support.microsoft.com/kb/976902). I did not encounter this when upgrading my client OS to SP1. Either it happened very fast when I ran the Windows 7 SP1 install or it was not required.

After the prerequisite update was downloaded and I can only assume installed, the installer downloaded a cab file just as it did with the Windows 7 service pack and then immediately proceeded to install the service pack.

As with the Windows 7 SP1 installation, before completely shutting down after installation is finished, the system displays a “Configuring Service Pack” screen. Whatever it is running in the background takes about 5 minutes to complete, and then the system finally restarts to, believe it or not, “Configure Service Pack” again.

About 45 minutes in total after starting the installation of the service pack, I was presented with Ctrl + Alt + Del and was able to log in to my system. Upon login I was presented with the “Windows Server 2008 R2 Service Pack 1 is now installed” message. Below is a screenshot of my virtual machine system properties.

As far as the installation process for both Windows 7 and Server 2008 R2 is concerned, I feel it is very painless, and it is kind of nice to have a single image containing everything needed for both my client and server operating systems. Now to try to break some things and test out the new RDS features!

Windows 7 Service Pack 1 is here!

Windows 7 Service Pack 1 was released to manufacturing in February of 2011. Below is my experience with installing the service pack, complete with screenies. My main purpose for installing the service pack so quickly is to be able to test the RemoteFX components of Windows 2008 R2 SP1 RDS. I am currently piloting 2008 R2 RDS to determine if it suits my needs for some application deployment. Another post on that at a later date, but for now on to installing SP1 on my client workstation. Hopefully I don’t break anything, as I have a lot of admin consoles and so on installed on this workstation. Here goes….

The .iso file that I have downloaded is 1.9GB in size on disk. They are getting large these days but it does in fact contain all languages and all versions of the service pack. Digging into the iso via Windows Explorer confirms that all three versions, (ia64, x86 and x64) are included in the image file.

Win7 iso contents

Executing the autorun.exe file contained on the DVD image file, presents me with the following notification window and allows me to either cancel or proceed with the installation of the service pack.

Clicking Next starts the installation process and prepares the computer for the service pack installation. I’m assuming that what is happening here in the background is that the installer is running some prerequisite checks to determine the proper version and language needed to proceed.

After the prep work has been completed, I am presented with a new window telling me to save my work, close all open programs and click Install. Also, note that by default the installer is configured to restart the computer when the installation is completed.

The first thing that takes place as part of the installation is the creation of a system restore point. Immediately afterwards, the installer attempts to download a .cab file for my version of Windows 7 (x64). I would love to know what still needs to be downloaded at this point when I have already downloaded a 1.9GB iso file. I plan to dig!

Immediately following successful download of required files, the installer moves right into the installation of the service pack.

After the installation wizard completed, which took approximately 30 minutes on my system, an automatic shutdown of my system was initiated. Before completely shutting down, I was brought to a Configuring Service Pack screen, which, forgive me, I was only able to grab a screenie of with my Windows Phone 7. This entire configuring process took about 5 minutes to complete, followed by a shutdown. I will replace this image with a nicer one when I run the installation again on a virtual machine.

As my system booted, I was presented with the familiar preparing to configure system notification, followed by a second configuring service pack notice. This also took about 5 minutes to complete, after which I was presented with the Ctrl + Alt + Del login prompt.

After login to my system I was then presented with a notice that Windows 7 Service Pack 1 installation was completed.

Total installation time on my system, which has guts consisting of 4GB of memory and a Core 2 Duo @ 2.93GHz, was just under 1 hour. At first glance everything seems to have gone smoothly. If I find any broken applications or anything else of interest, I will be sure to post.

Onto installation of Service Pack 1 for Windows Server 2008 R2….