One of the main responsibilities of IT staff is to keep the company systems up and running, and success is commonly measured as percentage ‘uptime’. For the business, uptime = money made and downtime = money lost, so the aim is to get the uptime figure as close to 100% as possible. That figure is usually based on how one or more specific servers or services respond to tests run by a network monitoring system. The weakness of this approach is that a network might have a server uptime of 99.99% while the average uptime on a company PC is only 85%. It's also possible to fall into the trap of obsessing about uptime when it's not as important as you might think. I take a more pragmatic approach: downtime is only an issue if it affects the users of the system.
For most companies, it's more productive to expand the definition of ‘downtime’ to include ANY reduction in functionality on ANY system during operating hours. Now a truer picture of uptime appears (though the 100% target becomes even harder to reach). This is the approach I prefer to take, looking a little wider at the organisation as a whole rather than exclusively at the servers and their uptime.
Unfortunately, it’s very difficult to measure uptime in such a holistic way, based on every server, switch, router, PC, printer or mobile device. Traditional network monitoring systems fall short of the task because the functionality they would need to test is too complex. Only a person can make an assessment of whether a system is 'up' or 'down'. For example, the PC may be up, Outlook may be running, and the mail server may be fine, but there is a permission error on a public folder so the user can't see what they need to see. For the user, THAT is downtime.
One way to get a better understanding of your total system performance is to use your helpdesk or other issue tracking system to track reported incidents and their impact. If you educate and encourage your users to record every issue they encounter, and you record how long it takes to resolve each problem, then you have a better picture of the real uptime of your network. That picture allows you to focus your resources on ALL causes of downtime, rather than only the traditional priorities (server or component failure, power outages etc.).
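As a rough illustration of how helpdesk records could be turned into an uptime figure, here is a minimal Python sketch. The `Incident` fields, the `real_uptime` function and the sample numbers are all hypothetical, not taken from any particular helpdesk product. It assumes each incident falls entirely within operating hours, and it counts downtime in user-hours (incident duration multiplied by the number of users affected).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    # Hypothetical fields; map them to whatever your helpdesk exports.
    reported: datetime
    resolved: datetime
    users_affected: int


def real_uptime(incidents, total_users, operating_hours):
    """Return user-perceived uptime as a percentage.

    Downtime is counted in user-hours: the duration of each incident
    multiplied by the number of users it affected.
    """
    available = total_users * operating_hours
    lost = sum(
        (i.resolved - i.reported).total_seconds() / 3600 * i.users_affected
        for i in incidents
    )
    return 100 * (1 - lost / available)


incidents = [
    # Permission error on a public folder: 3 hours, 10 users.
    Incident(datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 12, 0), 10),
    # Misconfigured mail client: 2 hours, 1 user.
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 16, 0), 1),
]

# 50 users working a 40-hour week gives 2000 user-hours available.
print(f"{real_uptime(incidents, total_users=50, operating_hours=40):.2f}%")
# prints 98.40%
```

In this made-up week the servers never went down, yet 32 user-hours were lost, which is exactly the kind of downtime a server monitoring figure would not show.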
Taking this approach will identify the causes of downtime that are often missed: poorly configured client software, incorrect permissions, gaps in user training etc. Focus on these issues, and you will increase the real uptime of your systems and ultimately the productivity of staff. It's not as conspicuous or immediately gratifying as a 99.99% uptime figure, but it's ultimately more meaningful!
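To decide where to focus first, one option is to total the time lost per cause and rank the results. The sketch below assumes each helpdesk ticket has been tagged with a cause category and a figure for user-hours lost. The categories, numbers and the `downtime_by_cause` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical incident log: (cause category, user-hours lost).
incident_log = [
    ("permissions", 30.0),
    ("client configuration", 2.0),
    ("server failure", 5.0),
    ("permissions", 12.5),
    ("user training", 4.0),
]


def downtime_by_cause(log):
    """Total the user-hours lost per cause, largest first."""
    totals = defaultdict(float)
    for cause, hours in log:
        totals[cause] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for cause, hours in downtime_by_cause(incident_log):
    print(f"{cause:<22}{hours:>6.1f}")
```

With this sample data, permissions problems come out well ahead of server failure, which would suggest spending time on access reviews before buying more redundant hardware.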