Terminal Services: It’s not really PFM (Pure F***ing Magic)

I have been frustrated with my Terminal Services environment because every time I seem to get my problems put to bed, they wake up again meaner than ever. I have approximately 250 TS users with 50 logged on at any given time. We are running Server 2003 R2 Enterprise, and when I initially arrived on the scene we had two TS machines running on a Microsoft Virtual Server platform and a third on a standalone physical box. They were load-balanced via Microsoft NLB Cluster services and would sporadically stop functioning. The only solution at the time was to tear down the NLB cluster and rebuild it. Soon thereafter we left the Microsoft virtual environment in favor of VMware, specifically for Site Recovery Manager and the ability to get VMs restored to our DR facility in fairly short order. So with that I had three very beefy servers geared up as ESX 4.0 hosts. I placed them in my Virtual Center and built two VERY beefy TS machines (first mistake): two 4-core, 8 GB servers with 100 GB of storage each. I then set up a default Microsoft NLB cluster (second mistake) to load balance the two TS servers.

Well, as some of you may have already experienced, it doesn’t quite work that easily. The symptom was that I could not reach the second server; in fact, the second server had trouble reaching the network consistently as well. After some research I found out that it was due to the way NLB handles MAC addresses and the cluster IP, and the way VMware handles RARP flood requests. I am not going to deep dive right now, but you can find out more about it here. The short of it was that I needed to configure the NLB in multicast mode. So I did, and that too didn’t work. So I took it to the next level and disabled RARP transmission as outlined here, and all seemed good… for a while. VMotion was acting up after that, mainly because VMware no longer notified the physical switch that the virtual server had moved. This ruined my plans for dynamic VM resource management across the vSwitch. There had to be a better way.

I dug down deeper and really began homing in on the ARP/MAC and cluster IP issue. I started looking at my Cisco switch for ways to solve my problem, and I found it: I needed to create static ARP and MAC entries in the switches directly connected to the VM hosts. The following commands worked for me (edited, of course):

! static ARP entry mapping the NLB cluster IP to the cluster's multicast MAC
configure terminal
arp 10.0.100.10 03bf.0a00.640a ARPA
! pin that MAC to the switch ports facing the VM hosts
mac-address-table static 03bf.0a00.640a vlan 1 interface Fa0/1 Fa0/15 Fa0/16
end
write memory

  • Where 10.0.100.10 is the NLB cluster IP address
  • 03bf.0a00.640a is the virtual MAC address of the NLB cluster itself (in multicast mode the cluster MAC is simply 03-BF followed by the cluster IP in hex, so 10.0.100.10 becomes 03bf.0a00.640a)
  • vlan 1 is the VLAN that the vSwitch carrying the TS VMs sits on
  • Fa0/1, Fa0/15, and Fa0/16 are the interfaces connecting to the VM hosts

And I had stability at layer 2/3… but of course that was not good enough.

Shortly thereafter I started getting complaints that performance was just too slow. I looked at the summary of the VM and saw that the Consumed Host CPU was minimal and that memory usage was also minimal. It was then that I started thinking virtually, not physically. VMware has an evil habit of waiting until all of a VM’s assigned cores are available before putting a process through. With a 4-core requirement on a relatively busy VM host, it takes a long time to get all 4 cores free to get anything done. So I began to employ the Zerg Rush strategy for TS boxes (hey, it worked in Starcraft): I created a small 2-core, 4 GB RAM TS template and deployed many of them. We have licenses for Server 2003 Enterprise and therefore had a 1-to-3 exchange rate of physical to virtual. I also kept most of them on the same VM host, since similar applications would be competing for 2 cores in a similar way, thus giving preference to none. My performance woes seemed to disappear, but as you can guess… seeming is believing.

There are some shortcomings of Server 2003 NLB that make the tool a bit inadequate. All the servers must be on the same network; affinity is configurable, but restrictions based on connection count are not. NLB has no service-level health checks, so if a cluster member stops serving sessions, NLB will still attempt to route users to it. There is no reporting to speak of, and finally there is a 32-server limit. It is because of these reasons (and because of the goofy way NLB handles ARP) that I decided to go with a third-party load-balancing solution. I ended up choosing loadbalancer.org’s virtual appliance and couldn’t be happier. It uses a loopback adapter on each server, configured with the cluster IP and a high metric, to overcome the ARP issue. I can choose various weighted approaches to load balancing, I get reporting and health checks, and I can use it for a myriad of load-balancing scenarios. It was quick to set up and the servers are good to go; now I can party… I wish.
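For reference, once the Microsoft Loopback Adapter is installed on each Terminal Server, binding the cluster IP to it can be scripted along these lines. This is only a rough sketch: the adapter name "loopback" and the /24 mask are my assumptions here, so check the appliance’s own documentation for the exact mask and metric it expects.

rem Bind the NLB cluster IP to the Microsoft Loopback Adapter so the server
rem answers for the VIP without ARPing for it on the wire (assumes the adapter
rem was renamed "loopback" after installation).
netsh interface ip set address name="loopback" source=static addr=10.0.100.10 mask=255.255.255.0

The interface metric itself gets bumped up in the adapter’s advanced TCP/IP settings so nothing ever tries to route out through the loopback.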

While I now had a solid layer 2/3 foundation and a robust NLB setup bringing redundancy and availability to my environment, I was still hamstrung at layer 7. The Terminal Servers themselves just were not performing adequately for more than a couple of days at a stretch. Every few days I was receiving the following errors:

“Windows – Low on Registry Space – The system has reached the maximum size allowed for the system part of the registry. Additional storage requests will be ignored.”

 

“Windows was unable to load the profile but has logged on with the default profile system. Default – Insufficient system resources exist to complete the requested service.”

I would reset lingering disconnected sessions and eventually reboot the system, and all would be well for a while until the messages came back. Additionally, I noticed that the temporary profiles in C:\Documents and Settings were eating up all my space. So I figured, “Hey! I have a SAN with plenty of space; I’ll just mount an iSCSI drive and put Documents and Settings there.” I know, I’m brilliant, right?

“Documents and Settings is a Windows system folder and is required for Windows to run properly. It cannot be moved or renamed.”

The problem was that all the articles I read were really focused on an unattended install with an unattend.txt file. I already had machines in production and didn’t want to build a new machine and create a new template just to experiment with this plan. So I took the following article and read down to the part about editing this registry path:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList

When I went to that registry path I found the setting to change: I updated the ProfilesDirectory entry to point at the new iSCSI drive I had mounted, then deleted all the non-stock GUIDs (keeping All Users, Administrator, and Default User).

I was not worried because we have roaming profiles for our users; all the profiles on that machine were temporary. I then navigated to the C:\Documents and Settings folder and deleted all the non-stock profiles, copied the All Users, Administrator, and Default User folders to the new location, and after a reboot I was done with that. Testing showed new users getting their profiles created on the new drive.
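If you would rather script the ProfilesDirectory change than click through regedit, a one-liner like this should do it; the E: drive letter is just a placeholder for whatever the iSCSI volume mounts as.

rem Point new profile creation at the iSCSI volume (E: is a placeholder drive letter).
rem ProfilesDirectory is a REG_EXPAND_SZ value under the ProfileList key.
reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList" /v ProfilesDirectory /t REG_EXPAND_SZ /d "E:\Documents and Settings" /f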

As for the registry issue, I dug up this article, which made sense: legacy printers were dragging down my user profiles, leaving relics behind and hogging registry space. I added the PrinterMaskKey subkey under HKEY_USERS\.default\printers and rebooted the server, and that made quick work of the registry error. I also made sure I was on the latest service pack before the reboot. The step-by-step is below.

To enable this hotfix, you must create the PrinterMaskKey registry subkey. To do this, follow these steps:

  • Click Start, click Run, type regedit, and then click OK.
  • Locate the following registry subkey: HKEY_USERS\.default\printers
  • Right-click the subkey you just located, point to New, click Key, type PrinterMaskKey, and then press ENTER.
  • Exit Registry Editor.
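
The same key can be created in one line from a command prompt, which is handy when there is a stack of TS boxes to touch:

rem Create the empty PrinterMaskKey subkey described in the steps above.
reg add "HKU\.default\printers\PrinterMaskKey" /f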

When all was said and done, I wanted to reclaim my space. I ran a defrag and emptied the recycle bin, then downloaded sdelete, extracted sdelete.exe, and saved it to the root of my C: drive. I ran sdelete -c on the server to zero out all the free space on the virtual disk. Finally, I shut down the TS VM and migrated it to another datastore; since the drive is thin provisioned, I was able to get my space back and move on from there. Now I hope I can rest… we will see.
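
For the record, the cleanup boils down to a few console commands; this is roughly the sequence, with sdelete.exe sitting at the root of C: as described above.

rem Defragment the system drive, then zero the free space so the
rem thin-provisioned disk can shrink during the storage migration.
defrag c:
c:\sdelete.exe -c c: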

 


~ by lavazzza on November 25, 2009.
