This is perhaps a slightly "techier" entry than you're used to seeing from me. It's OK. People around me often forget that as an IT consultant, I still have plenty of hands-on work. Today I'm going to talk about a topic which touches nearly every aspect of my work these days: Active Directory Health.
Microsoft Active Directory is a wonderful thing...except when it's abused, neglected, or tinkered with too much. Then it turns into a wounded entity which can drag down or halt your organization's productivity.
I spend a lot of time cleaning up Active Directory for new clients. The methodology is surprisingly simple, so much so that many clients simply don't believe me. There's a limited number of attributes to consider and/or check.
Number and Placement of Domain Controllers
This is a topic about which much has been written. Let's stick to the basics. At a minimum, each forest/domain should have two domain controllers (DCs) for fault tolerance. It is my personal preference that one of these be physical. The other(s) can be virtual.
There are two reasons for my wanting a physical DC, both of which reflect my old-school tendencies regarding disaster recovery and business continuity: DNS Availability and Time Sync.
In terms of DNS Availability, if my hypervisor cluster goes down or is unavailable for any reason, having working DNS available (on a physical DC) makes troubleshooting much easier than if I have to stop and remember (or look up...if I can) IP addresses for my hypervisor cluster and other machines.
Time Synchronization
Active Directory uses Kerberos as its authentication mechanism. Kerberos is notoriously picky about time sync. If two devices' clocks differ by more than the allowed tolerance (five minutes by default in an Active Directory domain), Kerberos will throw authentication errors.
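To make the time sync point concrete, here's a minimal Python sketch of the comparison Kerberos is effectively making. It assumes the default Active Directory tolerance of five minutes (your domain's Kerberos policy may be set differently), and the function names and sample timestamps are my own invention for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 5 minutes is the default Kerberos clock-skew tolerance in an
# Active Directory domain. Check your own Kerberos policy before relying on it.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(client_time: datetime, dc_time: datetime) -> timedelta:
    """Return the absolute difference between two timezone-aware clocks."""
    return abs(client_time - dc_time)


def within_kerberos_tolerance(client_time, dc_time, max_skew=MAX_SKEW) -> bool:
    """True if the two clocks are close enough for Kerberos to be happy."""
    return clock_skew(client_time, dc_time) <= max_skew


# Hypothetical example: a virtual machine whose clock has drifted 7 minutes.
dc_clock = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
vm_clock = dc_clock + timedelta(minutes=7)
print(within_kerberos_tolerance(vm_clock, dc_clock))
```

A machine that fails this check is the kind that starts producing Kerberos errors even though its credentials are fine.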
It has been my experience that virtual machines have a tendency to not be good at tracking time consistently. While it's possible to point a virtual DC responsible for time to an external NTP time source (which, to be fair, has worked for some of our clients), a physical DC provides a hardware-based reference clock for Active Directory.
The physical DC should receive NTP time sync from a highly reliable source. I don't recommend picking a naval observatory or government time server off of a list. Those systems don't provide a guaranteed NTP signal to outside systems. Instead, use the public time server pool at pool.ntp.org, or a hardware-based system such as a GPS time sync appliance.
The time-syncing DC should be the holder of the FSMO (Flexible Single Master Operations) "PDC Emulator" role, which is responsible for providing network time to Active Directory.
Network Communication
The #1 barrier to a healthy Active Directory forest/domain is bad or intermittent network communication to and from domain controllers. If member servers can't reliably communicate with a primary and a secondary DC, AD will not be able to maintain a synchronized directory database. (This goes double for DCs; all DCs must be able to communicate with all other DCs.)
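One quick way to sanity-check communication is to see whether the usual DC ports answer from the machine that's having trouble. The Python sketch below is only a rough first pass: the port list is an illustrative subset I picked (DNS, Kerberos, RPC endpoint mapper, LDAP, SMB), it tests TCP only, and it ignores the dynamic RPC ports that AD also uses. The function names are hypothetical.

```python
import socket

# Assumption: an illustrative subset of the TCP ports a DC normally listens on.
DC_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_dc(host: str, ports=DC_PORTS, timeout: float = 2.0) -> dict:
    """Map each service name to whether its TCP port answered."""
    return {name: port_is_open(host, port, timeout) for port, name in ports.items()}
```

Calling something like check_dc("dc01.example.local") (a made-up hostname) from a member server returns a dictionary of service names and True/False values. Run it a few times over the course of a day; intermittent failures are the ones that hurt the most.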
DNS Server Configuration
It's difficult to believe, but failing to point all of your servers and workstations to the correct DCs for DNS can cause network issues. I see this come into play every time we need to replace domain controllers for a client. Nobody ever remembers to go through every device and change out the DNS servers being used. This includes not just servers, but workstations (via DHCP, GPO, or static configuration), switches, routers, appliances, and firewalls.
Domain Controllers should NOT refer to themselves for DNS. A DC with 127.0.0.1 at the top of its DNS server list may appear to lock up for 5-10 minutes during reboots as various dependent services try to communicate with a DNS service that hasn't started up yet. Each DC should have another DC as the first server in its DNS list. The loopback entry, if you really want to put it in, should be at the very bottom of the list.
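The ordering rule above is easy to check mechanically. Here's a small Python sketch that takes a DC's own addresses and its DNS server list and flags the situations I just described. The function name, warning text, and sample addresses are all made up for illustration; how you collect the DNS list from each DC is up to you.

```python
import ipaddress


def dns_order_problems(own_ips, dns_servers):
    """Return warnings about a DC's DNS client settings (empty list = OK).

    own_ips: addresses assigned to this DC.
    dns_servers: its configured DNS servers, in order of preference.
    """
    own = {ipaddress.ip_address(ip) for ip in own_ips}
    servers = [ipaddress.ip_address(ip) for ip in dns_servers]
    if not servers:
        return ["no DNS servers configured"]

    def is_self(addr):
        # Loopback and the DC's own addresses both mean "ask myself".
        return addr.is_loopback or addr in own

    problems = []
    if is_self(servers[0]):
        problems.append("first DNS server is this DC itself")
    if any(s.is_loopback for s in servers[:-1]):
        problems.append("loopback entry is not at the bottom of the list")
    if all(is_self(s) for s in servers):
        problems.append("no other DC is listed as a DNS server")
    return problems


# Hypothetical DC at 10.0.0.10 with loopback at the top of its list.
print(dns_order_problems(["10.0.0.10"], ["127.0.0.1", "10.0.0.11"]))
```

A list such as 10.0.0.11, 10.0.0.12, 127.0.0.1 on that same DC comes back clean.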
Replication
In situations where DCs have been added and removed over a period of time, the default "automatic" replication connections may not be sufficient. Ensure that every satellite site has a working replication connection to your primary site.
Running DCDIAG
I've lost track of how many times I've walked into a client with Active Directory issues, only to find they've never tried running DCDIAG. The utility is there for a reason, folks. Learn to love DCDIAG; in its basic form, with no switches, it's read-only and won't hurt anything, I promise.
Periodically run DCDIAG and check that all tests pass. I'm not terribly concerned about the tests that check log files failing; those tests only indicate that one or more warnings or errors were found in a given log file. You should still check the log file errors listed as examples, which brings me to...
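If you save DCDIAG's output to a text file, a few lines of Python can separate the failures worth chasing from the log-related ones. This is a sketch under assumptions: it expects result lines shaped like "......................... DC01 passed test Connectivity", and the set of tests I treat as log-related (SystemLog, DFSREvent, FrsEvent, KccEvent) is my own grouping. Check both against the output your version actually produces. DC01 and the sample text are hypothetical.

```python
import re

# Assumption: result lines look like "......... <name> passed test <TestName>".
RESULT_LINE = re.compile(r"\.{3,}\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)")

# Assumption: tests that fail simply because a log contains warnings or errors.
LOG_TESTS = {"SystemLog", "DFSREvent", "FrsEvent", "KccEvent"}


def failed_tests(dcdiag_output: str):
    """Return (name, test) pairs for every 'failed test' line in the output."""
    failures = []
    for line in dcdiag_output.splitlines():
        match = RESULT_LINE.search(line)
        if match and match.group(2) == "failed":
            failures.append((match.group(1), match.group(3)))
    return failures


def split_failures(failures, log_tests=LOG_TESTS):
    """Separate log-related failures from everything else."""
    other = [f for f in failures if f[1] not in log_tests]
    log_related = [f for f in failures if f[1] in log_tests]
    return other, log_related


# Hypothetical saved output from a DC named DC01.
SAMPLE = """\
      Starting test: Connectivity
         ......................... DC01 passed test Connectivity
      Starting test: Replications
         ......................... DC01 failed test Replications
      Starting test: SystemLog
         ......................... DC01 failed test SystemLog
"""
other, log_related = split_failures(failed_tests(SAMPLE))
print("Needs attention:", other)
print("Log-related (go read the event logs):", log_related)
```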
Browsing Event Logs
There's no substitute for checking event logs on every server (and/or workstation) involved in a persistent problem situation. This is pretty basic research.
There are three log files that I always check on each DC and on each involved server to get a feel for what's going on with a given network. In order, they are: System, Security, and Application.
The System log tells me how the server itself is feeling. If a server is having health issues which might affect AD, those issues will show up here first.
The Security log tells me if authentication is working or not working.
The Application log tells me which apps are complaining and why. I look at this VERY closely on DCs to check if someone is trying to use a DC as an application server. Just because a DC *can* run applications, that doesn't mean it *should*. Let a DC be Just A DC. Don't try to make it a file or print server, don't run databases on it, don't make it a vCenter server, just...don't. DCs should run Active Directory, related authentication services such as LDAP or RADIUS, DNS, and DHCP. The fewer services on a set of DCs, the higher the probability of a stable Active Directory network.
With any of the event logs, I usually don't have to browse back more than a couple of months or so to get a feel for the server's pattern of recurring errors and/or warnings. In networks with Active Directory issues, there will always be a pattern.
For example, a lot of Kerberos errors usually indicate a problem with timesync, communication, or both. In most cases, plugging the resulting error codes into your favorite search engine will reveal a number of potential avenues for further investigation.
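Spotting the pattern doesn't have to be done by eye. If you export a log to CSV, a short Python sketch like this one will count which source and event ID combinations keep coming back. I'm assuming the export has "Level", "Source", and "Event ID" columns; adjust the names to match your file. The function name, threshold, and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def recurring_events(csv_text, min_count=3, levels=("Error", "Warning")):
    """Return ((source, event_id), count) pairs seen at least min_count times.

    Assumption: the CSV has 'Level', 'Source', and 'Event ID' columns.
    Results are ordered from most frequent to least frequent.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        (row["Source"], row["Event ID"])
        for row in reader
        if row["Level"] in levels
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical export: one source complaining over and over, plus some noise.
SAMPLE = (
    "Level,Source,Event ID\n"
    + "Error,ExampleSourceA,100\n" * 4
    + "Information,ExampleSourceB,200\n" * 5
    + "Warning,ExampleSourceC,300\n"
)
for (source, event_id), count in recurring_events(SAMPLE):
    print(source, event_id, count)
```

Anything that shows up near the top of that list on more than one server is where I'd start searching.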
None of the above tools is what I'd consider "rocket science." This is all pretty basic, simple stuff. When you use these tools consistently and together, Active Directory cleanup can be much easier than most people think.