Server Monitoring Best Practices

As a business, you may be running many on-site or Web-based applications and services: security, data handling, transaction support, load balancing, or the management of large distributed systems. How well these run depends on the condition of your servers, so it’s vital to continuously monitor their health and performance.

Here are some guidelines designed to help you get to grips with server monitoring and the implications that it carries.

Understand Server Monitoring Basics

The basic elements of server monitoring are events, thresholds, notifications, and health.

1. Events

Events are triggered on a system when a condition set by a given program occurs. An example would be when a service starts, or fails to start.

2. Thresholds

A threshold is the point on a scale that must be reached to trigger a response to an event. The response might be an alert, a notification, or a script being run.

Thresholds can be set by an application, or a user.
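As a rough illustration, a threshold and its response could be modelled as below. This is only a sketch, not the API of any particular monitoring product: the `Threshold` class, the `check` function, the `cpu_percent` metric, and the limit of 90 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Threshold:
    """A point on a scale plus the response to run once it is reached."""
    metric: str                             # hypothetical metric name
    limit: float                            # set by an application or a user
    response: Callable[[str, float], str]   # alert, notification, or script


def alert(metric: str, value: float) -> str:
    # Stand-in response; a real one might send a message or run a script.
    return f"ALERT: {metric} at {value}"


def check(threshold: Threshold, value: float) -> Optional[str]:
    """Run the response only when the value reaches the threshold."""
    if value >= threshold.limit:
        return threshold.response(threshold.metric, value)
    return None


cpu = Threshold(metric="cpu_percent", limit=90.0, response=alert)
print(check(cpu, 95.0))  # threshold reached, response runs
print(check(cpu, 40.0))  # below threshold, nothing happens
```

Keeping the limit and the response together in one object makes it easy for either an application default or a user setting to replace them.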

3. Notifications

Notifications are the methods of informing an IT administrator that something (an event or a response) has occurred.

Notifications can take many forms, such as:

  • Alerts in an application
  • E-mail messages
  • Instant Messenger messages
  • Dialog boxes on an IT administrator’s screen
  • Pager text messages
  • Taskbar pop-ups

4. Health

Health describes the set of related measurements defining the state of a variable being monitored.

For instance, the overall health of a file server might be defined by read/write disk access, CPU usage, network performance, and disk fragmentation.
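To make the idea concrete, here is a minimal sketch of how several related measurements might be rolled up into one health state for a file server. The metric names and limits are assumptions made up for the example (all chosen so that a higher number is worse); real values would come from your own environment.

```python
from typing import Dict, List

# Hypothetical upper limits for each measurement (higher is worse).
LIMITS: Dict[str, float] = {
    "disk_latency_ms": 20.0,
    "cpu_percent": 85.0,
    "network_errors_per_min": 5.0,
    "fragmentation_percent": 30.0,
}


def breaches(measurements: Dict[str, float]) -> List[str]:
    """Return the names of measurements that exceed their limit."""
    return sorted(
        name for name, value in measurements.items() if value > LIMITS[name]
    )


def health(measurements: Dict[str, float]) -> str:
    """Summarise the set of measurements as a single health state."""
    failed = breaches(measurements)
    if not failed:
        return "healthy"
    return "unhealthy: " + ", ".join(failed)


sample = {
    "disk_latency_ms": 12.0,
    "cpu_percent": 93.0,
    "network_errors_per_min": 0.0,
    "fragmentation_percent": 41.0,
}
print(health(sample))
```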

Set Clear Objectives

Decide what it is you need to monitor. Identify those events most relevant to detecting potential issues that could adversely affect operations or security.

A checklist might include:

  1. Uptime and performance statistics of your web servers
  2. Web applications supported by your web servers
  3. Performance and user experience of your web pages, as supported by your web server
  4. End-user connections to the server
  5. Measurements of load, traffic, and utilisation
  6. A log of HTTP and HTTPS sessions and transactions
  7. The condition of your server hardware
  8. Virtual Machines (VMs) and host machines running the web server

Fit Solid Foundations

It’s safe to say that most IT administrators appreciate useful data that is clearly presented, so that they can take in a lot of information in a compact, legible area. Take steps to ensure that your monitoring output is easy to read and well presented.

A high-level “dashboard” can serve as a starting point. This should have controls for drilling down into more detail. Navigation around the monitoring tool and access to troubleshooting tools should be as transparent as possible.

It’s also necessary to:

• Identify the top variables to monitor, and set these as default values. Prioritise them in the user interface (UI).

• Provide preconfigured monitoring views that match situations encountered on a day-to-day basis.

• Have a UI that also allows for easy customization.

• Let users and IT managers choose what they want to monitor at any given time, adjust the placement of their tools, and decide the format in which they view the data.

• Keep the UI text consistent, clear, concise, and professional. From the outset, it should state clearly what is being monitored, and what isn’t.

Build, to Scale

Organizations of different sizes naturally have different monitoring needs. In small organizations, IT administrators often look to fix problems after they’ve been identified, so monitoring is generally part of the troubleshooting process. Monitoring applications should intelligently identify problems, and notify the users via e-mail and other means. Keep the monitoring UI simple.

In medium organizations, IT administrators monitor to isolate big and obvious problems. A monitoring system should provide an overview of the system, and explanations to help with the troubleshooting process. Preconfigured views, and the automation of common tasks performed on receiving negative monitoring information (e.g., ping, traceroute), will speed response. Again, keep the monitoring UI simple.

In large organizations and enterprises, IT administrators require more detailed and specific information. Users may be dedicated exclusively to monitoring, and will appreciate dense data, with mechanisms for collaborating. Long-term ease of use will take precedence over ease of learning.

Set Up Red Flags

You should provide a set of “normal” or “recommended” values, as a baseline. This will give context to the information being monitored. The system may give the range of normal values itself, or provide tools for users to calculate their own.

Within the application, make sure that data representing normal performance can be captured. This can be used later, as a baseline for troubleshooting. In any case, users should be able to tell at a glance when a value is out of range, and is then a possible cause for concern. Your monitoring software can assist in this, by setting a standard alert scale, across the application.
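As a sketch of how users might calculate their own range of normal values from captured data, the snippet below uses the mean plus or minus a few standard deviations. That rule is an assumption picked for illustration, not the only reasonable choice, and the function names and sample figures are hypothetical.

```python
import statistics
from typing import List, Tuple


def baseline_range(samples: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive a 'normal' range from data captured during normal operation.

    Assumption: normal means within k population standard deviations
    of the mean of the captured samples.
    """
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return mean - k * spread, mean + k * spread


def out_of_range(value: float, low: float, high: float) -> bool:
    """True when a new reading falls outside the baseline range."""
    return value < low or value > high


# Hypothetical CPU readings (percent) captured while the server was healthy.
normal_cpu = [40.0, 42.0, 38.0, 41.0, 39.0]
low, high = baseline_range(normal_cpu)

print(out_of_range(41.0, low, high))  # a typical reading
print(out_of_range(60.0, low, high))  # well above anything captured
```

Saving the captured samples alongside the computed range lets users recalculate the baseline later if the workload changes.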

In Western cultures, common colors for system alerts are:

  • Red = Severe or Critical
  • Yellow = Warning or Informational
  • Green = Good or OK

For accessibility, colors should be combined with icons, for users with impaired sight. Words that can be read aloud by a screen reader are also appropriate. Limit the use of custom icons in alerts, though, as users may resent having to learn too many new ones, and they may conflict with icons in other applications. Common, recognizable icons are fine, as there’s nothing new to learn.
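One way to keep the alert scale standard across an application is to define it once, with the color, an icon, and a word bundled together. The sketch below assumes a three-level scale like the one above. The text icons, the `classify` cut-offs, and the metric name are hypothetical placeholders.

```python
from enum import Enum


class Severity(Enum):
    """A single alert scale: color, icon, and a word for screen readers."""
    OK = ("green", "[OK]", "Good")
    WARNING = ("yellow", "[!]", "Warning")
    CRITICAL = ("red", "[X]", "Critical")

    def __init__(self, color: str, icon: str, label: str) -> None:
        self.color = color
        self.icon = icon
        self.label = label


def classify(value: float, warn_at: float, critical_at: float) -> Severity:
    """Map a reading onto the scale (assumes higher values are worse)."""
    if value >= critical_at:
        return Severity.CRITICAL
    if value >= warn_at:
        return Severity.WARNING
    return Severity.OK


def render(metric: str, value: float, severity: Severity) -> str:
    """Text form that never relies on color alone."""
    return f"{severity.icon} {severity.label}: {metric} = {value}"


level = classify(95, warn_at=75, critical_at=90)
print(render("cpu_percent", 95, level))
```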

Explain the Language

Don’t assume that your users will understand all the information your monitoring software provides. Help them interpret the data, by providing explanations, in the user interface.

  • Use roll-overs to display specific data points, such as a critical spike in a chart
  • Explain onscreen, how the monitoring view is filtered. For example, some variables or events might be hidden (but not necessarily trouble-free). The filter mechanism, an explanation of the filter, and the data itself should be positioned close together
  • Give easy access to any variables that are excluded in a view
  • State when the last sample of data was captured
  • Reference the data sources
  • Link table, column, and row headings to pop-up explanations of the variables, abbreviations, and acronyms
  • Provide links beside the tables themselves, with pop-up explanations of the entire table

Let Them Know

Alerts should be sent out, to indicate there is a problem with the system. Notifications should be informative enough to give IT administrators a starting point to address the problem. Information which helps the user take action should be displayed near the monitoring information. Probable causes and possible solutions should be prominently displayed.

Likewise, the tools needed for solving common problems should be easily accessible at the notification point.

You should log 24 to 48 hours of data. That way, when a problem arises, users will have enough information available to troubleshoot. Note that some applications need longer periods of monitoring, and some shorter. The log length will be determined by the scope of your day-to-day operations.
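A rolling log with a fixed retention window could look something like this. It is a minimal in-memory sketch that assumes entries arrive in time order. The `RollingLog` class is a hypothetical name, and the 48-hour default simply mirrors the upper end of the figure above.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Tuple


class RollingLog:
    """Keeps only the most recent entries within the retention window."""

    def __init__(self, retention_hours: int = 48) -> None:
        self.retention = timedelta(hours=retention_hours)
        self.entries: Deque[Tuple[datetime, str]] = deque()

    def add(self, timestamp: datetime, message: str) -> None:
        # Assumes timestamps are added in increasing order.
        self.entries.append((timestamp, message))
        cutoff = timestamp - self.retention
        # Drop anything older than the retention window.
        while self.entries and self.entries[0][0] < cutoff:
            self.entries.popleft()


start = datetime(2024, 1, 1, 0, 0)
log = RollingLog(retention_hours=48)
log.add(start, "service started")
log.add(start + timedelta(hours=24), "disk usage warning")
log.add(start + timedelta(hours=49), "service restarted")

# The first entry is now more than 48 hours old and has been dropped.
print([message for _, message in log.entries])
```

Making the retention period a parameter matches the point that some applications need longer or shorter logs.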

Provide multiple channels for notification (e-mail, instant messages, pager text, etc.).
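Sending one alert down several channels can be sketched as a simple fan-out. The channels here are stand-in functions that only collect messages in lists. In practice each would wrap whatever e-mail, messaging, or paging service you use, and the names below are hypothetical.

```python
from typing import Callable, Dict, List

Channel = Callable[[str], None]


def notify_all(message: str, channels: Dict[str, Channel]) -> Dict[str, bool]:
    """Try every channel; one failing channel must not block the others."""
    results: Dict[str, bool] = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = True
        except Exception:
            results[name] = False
    return results


# Stand-in channels for the example.
email_outbox: List[str] = []
im_outbox: List[str] = []


def broken_pager(message: str) -> None:
    raise ConnectionError("pager gateway unreachable")


channels = {
    "email": email_outbox.append,
    "instant_message": im_outbox.append,
    "pager": broken_pager,
}
print(notify_all("web01: disk 95% full", channels))
```

Returning a per-channel result makes it possible to see at a glance which route failed, so the alert itself does not go missing silently.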

Users should be able (and encouraged) to update, document, and share the information needed to start troubleshooting.

Keep Them Informed

Users often need to use monitoring data for further analysis, or for reports. The monitoring application itself should assist, with built-in reporting tools. Performance statistics and an overall summary should be generated at least once a week. Analysis of critical or noteworthy events should be available, on a daily basis.

Allow users to capture and save monitoring data – e.g., the “normal” performance figures used as a baseline for troubleshooting. Users should be able to easily specify what they want recorded (variables, format, duration, etc.). They should also be allowed to log the information they’re monitoring.

There should be a central repository, for all logs from different areas of monitoring. A single UI can then be used, to analyze the data. Export tools (to formats such as .xls, .html, .txt, .csv) should be provided. This will help to facilitate collaboration in reporting and troubleshooting.
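As an example of an export tool, the sketch below writes monitoring records to CSV text using Python’s standard `csv` module. The field names and sample records are invented for the illustration.

```python
import csv
import io
from typing import Dict, List


def export_csv(records: List[Dict[str, object]], fieldnames: List[str]) -> str:
    """Return the records as CSV text, with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical records pulled from the central log repository.
records = [
    {"timestamp": "2024-01-01T00:00", "metric": "cpu_percent", "value": 41},
    {"timestamp": "2024-01-01T00:05", "metric": "cpu_percent", "value": 93},
]
csv_text = export_csv(records, ["timestamp", "metric", "value"])
print(csv_text)
```

The same text can be written to a `.csv` file or handed to a spreadsheet, which is what makes it useful for sharing between people.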

Take Appropriate Measures

The graph type you choose should suit the type of information you are analyzing.

Line graphs are good for displaying one or more variables on a scale, such as time. Ranges, medians, and means can all be shown simultaneously.

Table format makes it easy for users to see precise numbers. Table cells can also contain graphical elements associated with numbers, or symbols to indicate state. The most important information should appear first, or be highlighted so that it can be taken in at a glance.

Histograms or bar graphs allow values at a single point in time to be compared easily. Ranges, medians, and means can all be displayed simultaneously.

Some recommendations:

  • When using a line graph, show as few variables as possible. Five is a safe maximum. This makes the graph easier to read
  • Avoid using stacked bar graphs. It’s better to use a histogram, and put the values in clusters along the same baseline. Alternatively, break them up into separate graphs
  • When using a graph to show percentages that add up to a whole, consider a pie chart
  • Consider providing a details pane; clicking a graph will display details about the graph in the pane
  • Avoid trying to convey too many messages in one graph
  • Never use a graph to display a single data point (a single number)
  • Avoid the use of 3D in your charts; it can be distracting
  • Allow users to easily flip between different views of the same data

Push the Relevant Facts

Displaying too much onscreen makes it harder for administrators to spot the information that is of most value, like critical error messages.

Draw attention to what needs attention most:

  • by placing important items prominently
  • by putting more important information before the less important
  • by using visual signposts, such as text or an icon, to indicate important information

Preconfigured monitoring views will reduce the emphasis on users configuring the system. Allow users to customise the information and highlight what they think is important, so it can be elevated in the UI. Group similar events – and consider having a global overview of the system, visible at all times.

Hide the Redundant

If it hasn’t gone critical, or isn’t affecting anything, your users don’t need to see it. At least, not immediately. If a failure recurs, don’t keep showing the same event, over and over. Try to group similar events into one.
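Grouping repeated events can be as simple as counting identical ones and showing each once with a tally. This sketch treats two events as the same when their source and message match exactly, which is an assumption. Real tools may group more loosely. The server names and messages are made up.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, str]  # (source, message)


def group_events(events: List[Event]) -> List[str]:
    """Collapse repeats into one line per distinct event, with a count."""
    counts = Counter(events)  # keeps first-seen order
    return [
        f"{source}: {message} (x{count})"
        for (source, message), count in counts.items()
    ]


events = [
    ("web01", "disk full"),
    ("web01", "disk full"),
    ("db01", "service stopped"),
    ("web01", "disk full"),
]
for line in group_events(events):
    print(line)
```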

Allow your users to tag certain events as ones they don’t want to view. Let them set thresholds that match their own monitoring criteria. This allows them to create environment-specific standards, and reduces false alarms. Use filters and views, to give users granular control of what they are monitoring.
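Tag-based hiding might work along these lines. The sketch splits events into shown and hidden lists rather than discarding anything, so hidden events stay available. The event structure and tag names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Event = Dict[str, object]


def split_by_mute(
    events: List[Event], muted_tags: Set[str]
) -> Tuple[List[Event], List[Event]]:
    """Separate events the user wants to see from those they have muted."""
    shown: List[Event] = []
    hidden: List[Event] = []
    for event in events:
        if muted_tags & set(event["tags"]):
            hidden.append(event)   # kept, just not displayed by default
        else:
            shown.append(event)
    return shown, hidden


events = [
    {"message": "backup completed", "tags": ["backup", "info"]},
    {"message": "disk 95% full", "tags": ["disk", "critical"]},
    {"message": "nightly job started", "tags": ["info"]},
]
shown, hidden = split_by_mute(events, muted_tags={"info"})
print([e["message"] for e in shown])
print([e["message"] for e in hidden])
```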

Provide the ability to zoom in for more detailed information, or zoom out for aggregated data. Allow users to hide unimportant events, but still have them accessible.
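Zooming out usually means replacing many raw samples with fewer aggregated ones. Here is a minimal sketch that averages fixed-size groups of readings. Averaging is an assumption, and a maximum or a percentile may suit some metrics better. The function name and sample values are hypothetical.

```python
from typing import List


def zoom_out(samples: List[float], bucket_size: int) -> List[float]:
    """Aggregate raw samples into averages of bucket_size readings each.

    The last bucket may contain fewer readings than the others.
    """
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    buckets = (
        samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)
    )
    return [sum(bucket) / len(bucket) for bucket in buckets]


raw = [10.0, 20.0, 30.0, 40.0, 50.0]
print(zoom_out(raw, 1))  # zoomed in: every reading
print(zoom_out(raw, 2))  # zoomed out: averages of pairs
```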

Be Prepared, for the Worst

As well as probable causes, the application should suggest possible solutions, for any problems that occur. Administrators will likely have preferred methods of troubleshooting. But, in diagnostic sciences, it helps to get a second opinion. It’s essential to identify events most indicative of potential operational or security issues. Then, automate the creation of alerts on those events, to notify the appropriate personnel.

Being prepared also means that all data should be backed up and stored off the premises as well as on the network. This protects against the obvious, such as hardware failure or malware attacks, but also against complete disaster, such as a fire at the premises.

And the Best, that Can Happen

With proper monitoring measures in place, you greatly reduce the risk of losses due to poor server performance. This has a corresponding positive effect on your business, especially online services and transactions.

A well-tuned monitoring system will help you identify potential issues, and fix unexpected problems faster, before they can affect your users.

