Monitoring a production computer network is important, whether you're a small shop or an enterprise installation. The amount of insight you can gain from properly configuring your devices and intelligently extracting operational data is incredibly valuable, but it can take some time to set it all up. Here are some of our basic tips for network professionals who are considering deployment of a network monitoring suite.
Once your network monitoring software is installed and ready to go, it can be tempting to monitor everything under the sun, but consider how much network traffic and processing load those monitors will generate. An overloaded monitoring box can produce false alarms, which defeats the purpose of monitoring your network.
Make a list of everything you'd like to monitor - consider devices with a history of problems as well as mission-critical devices that simply cannot go down. Prioritize the information you wish to capture and make sure you separate the "must-haves" from the "would-be-nices". Start with your high-priority monitors and build up from there.
Be sure to watch the processor, disk, and memory usage on your monitoring server to see if the hardware is keeping up with the software. Also keep an eye on network traffic to make sure data isn't getting bottlenecked at your NIC. If you find your monitoring server is lagging, consider removing some of your peripheral monitors or configuring the software to poll the devices less frequently.
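To get a rough feel for how much work a set of monitors creates, you can add up how often each one polls. The sketch below is only a back-of-the-envelope estimate under our own assumptions: the monitor names and intervals are made up, and it ignores how expensive each individual poll is.

```python
def polls_per_second(monitors):
    """Return the average polls per second for (name, interval_seconds) pairs."""
    return sum(1.0 / interval for _name, interval in monitors)


# Hypothetical monitors and polling intervals, for illustration only.
monitors = [
    ("core switch ping", 30),
    ("file server disk space", 60),
    ("printer toner level", 300),
]

current_load = polls_per_second(monitors)

# Doubling every interval halves the average polling rate.
relaxed_load = polls_per_second([(name, interval * 2) for name, interval in monitors])
```

Comparing the two figures before and after a change gives you a quick sanity check on whether polling less frequently will meaningfully lighten the load.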
Monitoring protocols such as WMI offer a lot of information about targeted servers and workstations, but large numbers of WMI monitors can bog down your network. Consider using Simple Network Management Protocol (SNMP) whenever possible, since it typically uses far fewer resources than WMI. Scripts that check a service directly over TCP/IP are another lightweight alternative to WMI and may allow you to monitor your services just as accurately.
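As an illustration of the TCP/IP approach, here is a minimal sketch of a check that simply tries to open a TCP connection to a service. The function name and timeout are our own choices, not part of any monitoring product, and a successful connection only shows the port is accepting connections, not that the service behind it is healthy.

```python
import socket


def tcp_service_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

You could call this with the host name and port of a mail or web server, for example, and raise an alert when it returns False several times in a row.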
If you decide to deploy WMI monitoring on your network, it can be a hassle keeping track of separate Windows accounts for each device. Maintaining credentials this way can cause confusion, avoidable errors, and false positives. Consider setting up a master domain admin account for your monitoring software - that way you only need to manage one Windows account to do your WMI monitoring.
When you start searching your devices for things to monitor via SNMP, you'll quickly realize the task is nearly impossible without first obtaining the necessary Management Information Base files (or "MIBs") to translate OIDs into human-readable names. Indeed, locating and testing MIB files can be one of the more challenging tasks when it comes to deploying site-wide network monitoring solutions.
Assuming you can locate the files required for your devices, we recommend taking a bit of extra time to establish what we call a "MIB Repository." Simply put, this is a directory on a backup hard drive where you organize and retain all of the MIB files you gather for your devices. There is nothing quite as frustrating as having to go back and find MIBs again when you migrate servers or redeploy a solution at a different site.
Create a folder to contain all of the MIBs, and within that folder create a directory for each manufacturer whose devices you plan to monitor. Inside each manufacturer folder, create a directory for each device type you plan to monitor, as well as a "general" folder for any MIBs that apply to all of that manufacturer's devices.
Keeping a MIB repository in this fashion can help you stay organized, locate monitoring statistics faster, and ensure you only need to hunt down MIBs once.
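The folder layout described above can be created by hand, but if you set up repositories at several sites, a short script keeps them consistent. This is a minimal sketch: the function name, the vendor name, and the device types are placeholders we made up for illustration.

```python
from pathlib import Path


def create_mib_repository(root, layout):
    """Create root/<manufacturer>/<device type> folders, plus a
    'general' folder under each manufacturer. Returns the paths created."""
    root = Path(root)
    created = []
    for manufacturer, device_types in layout.items():
        for folder in ["general", *device_types]:
            path = root / manufacturer / folder
            # exist_ok=True makes the script safe to re-run on an existing repository.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Hypothetical layout: one vendor with two device types.
EXAMPLE_LAYOUT = {"ExampleVendor": ["switches", "routers"]}
```

Calling create_mib_repository with the path to your backup drive and a layout like EXAMPLE_LAYOUT builds the empty structure, ready for you to drop MIB files into.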
In order to monitor a device with SNMP, the device needs an SNMP agent, either one built in to the operating system or one you install yourself. By default, many agents are configured to accept SNMP packets from any host, and the community string is usually set to "public". We need to change both of these settings so that the agent will only talk to our monitoring server and will only respond to a secure passkey of our choosing.
First, you'll want to set the agent to only accept packets from the IP address or host name of your monitoring server - communication from any other device via SNMP will be ignored. This prevents nefarious individuals or programs from sending commands via SNMP to your device, which could enable them to make changes to your device or gather information they could use to hack into your network.
Second, always change the community string so that it's no longer "public." The community string functions as the password or security key needed to send commands to the device via SNMP, and "public" is the default, which is well known to those looking to exploit SNMP. Change it to a complex password that is known only to your networking team.
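To make these two changes easier to check across many devices, here is a small sketch that generates a hard-to-guess community string and flags the two risky defaults discussed above. The function names are our own, and the settings are passed in by hand: the sketch does not read them from a real agent. We also treat "private" as a well-known default alongside "public", which is our addition. Some devices limit the length or character set of community strings, so adjust num_bytes as needed.

```python
import secrets

# Community strings that are widely known defaults (assumption: "private"
# is included here because it is a common default for write access).
DEFAULT_COMMUNITIES = {"public", "private"}


def generate_community_string(num_bytes=24):
    """Return a random URL-safe string to use as a community string."""
    return secrets.token_urlsafe(num_bytes)


def audit_agent_settings(community, allowed_hosts):
    """Return a list of warnings for an agent's community string and host list."""
    warnings = []
    if community.lower() in DEFAULT_COMMUNITIES:
        warnings.append("community string is a well-known default")
    if not allowed_hosts:
        warnings.append("agent accepts SNMP packets from any host")
    return warnings
```

An empty list from audit_agent_settings means neither default is still in place for that device.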
In addition to providing real-time statistics for monitoring, SNMP can send alert messages known as traps. You must configure the device's SNMP agent to send traps to the IP address or host name of your monitoring server. By default, the agent is usually set to send traps to localhost, which probably isn't what you need.
Once you have the traps directed to your network monitoring server, it can seem like you're good to go, but we always recommend testing traps from your target devices to make sure the monitoring server can receive them without issue. Many devices have the ability to generate test traps in order to verify a clear path to the monitoring box, but you may need to use a third-party app to test traps from a Windows-based machine.
We recommend a free program called TrapGen. This small app does just what its name suggests - it allows you to send dummy traps with custom OIDs in order to make sure your alerts will function correctly. Install TrapGen on your target server or workstation and then use the command line utility to generate traps and ensure they're being received by your monitoring software.
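If a test trap never shows up in your monitoring software, it helps to know whether the packet reached the server at all. The sketch below is a bare UDP listener, not a trap decoder: it only reports that a datagram arrived and who sent it. The function names are ours. SNMP traps are normally sent to UDP port 162, which usually requires administrator rights to bind and must not already be in use by your monitoring software.

```python
import socket


def open_trap_socket(bind_addr="0.0.0.0", port=162):
    """Return a UDP socket bound to the given address and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def wait_for_datagram(sock, timeout=30.0):
    """Wait for one datagram; return (sender_ip, payload_length) or None on timeout."""
    sock.settimeout(timeout)
    try:
        payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
    except socket.timeout:
        return None
    return sender_ip, len(payload)
```

Open the socket, send a test trap from the target device, and call wait_for_datagram. If it returns None, look at firewalls and the trap destination configured on the agent before troubleshooting the monitoring software itself.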
With your monitors properly set up and data streaming in from your network, the next step is to automate the system using alerts. Many monitoring solutions provide a way to send emails, SMS messages, or notifications to your technicians so they can get alerts when devices are encountering problems and respond as quickly as possible.
It can be tempting to set up a full array of alerts for problems large and small - staying on top of our network is the goal, after all. But experience tells us that the more alerts you get, the fewer of them you actually pay attention to. It can become notification overload if you're not careful, so think hard about which monitors deserve active alerts that get sent to your smartphone, and which ones you'll simply review manually once a week or so.
Go back to your monitoring priorities from tip #1 and set up alerts for the most critical ones. Determine which personnel need to receive which alerts and limit notifications accordingly. Consider setting up a communal "alert bucket" email address that the team can take turns sifting through to identify legitimate problems. Once the important notifications are identified, remove or reduce alerts from everything else.
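One way to picture this kind of alert routing is as a simple lookup from priority to recipients. The sketch below is illustrative only: the three priority names, the mailbox address, and the function are assumptions of ours, and most monitoring suites will have you configure the equivalent through their own interface.

```python
# Hypothetical shared mailbox the team takes turns reviewing.
ALERT_BUCKET = "alert-bucket@example.com"


def recipients_for(priority, on_call):
    """Return the addresses that should be notified for a given priority.

    critical  -> on-call staff directly, plus the shared bucket
    important -> shared bucket only
    anything else -> nobody; review these monitors manually
    """
    if priority == "critical":
        return [*on_call, ALERT_BUCKET]
    if priority == "important":
        return [ALERT_BUCKET]
    return []
```

Keeping the rule this small makes it easy to see at a glance who gets interrupted and for what.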
One way to help cut down on alert avalanches is by setting up dependencies on your devices. For example, if you have a network switch with a dozen workstations attached to it, you could be looking at thirteen or more alerts if that switch encounters a problem, simply because those computers depend on that switch for connectivity to the larger network. In this case, you can set those computers to be "dependent" on the switch, which will suppress alerts from the dependent devices should the primary device go down.
This situation is known as an "up-dependency", meaning that the secondary devices will only be monitored if the primary device is indeed up. However, you can also set up "down-dependencies", which function the opposite way. With a down-dependency, the secondary devices will only be monitored if the primary device is down. This can be useful for failover devices or other backup equipment that is only online when another device is found to be offline.
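The logic behind up- and down-dependencies can be sketched in a few lines. This is our own simplified model, not how any particular monitoring product implements it: device statuses are plain booleans, and each dependent device names one primary device and a dependency type of "up" or "down".

```python
def should_monitor(dependency_type, primary_is_up):
    """Decide whether a dependent device is monitored right now."""
    if dependency_type == "up":
        return primary_is_up        # only monitored while the primary is up
    if dependency_type == "down":
        return not primary_is_up    # only monitored while the primary is down
    return True                     # unknown type: monitor as usual


def devices_to_alert(statuses, dependencies):
    """Return the names of down devices that should raise an alert.

    statuses:     device name -> True if the device is up
    dependencies: device name -> (primary device name, "up" or "down")
    """
    alerts = []
    for name, is_up in statuses.items():
        if is_up:
            continue
        dependency = dependencies.get(name)
        if dependency is not None:
            primary, dependency_type = dependency
            # A primary with no recorded status is assumed to be up.
            if not should_monitor(dependency_type, statuses.get(primary, True)):
                continue  # alert suppressed by the dependency
        alerts.append(name)
    return alerts
```

In the switch example above, a failed switch with a dozen unreachable workstations set as up-dependent on it produces a single alert for the switch rather than thirteen.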