Here are a couple of examples. Recently I found this when looking at a UPS:
Um, it's not September anymore. So what is happening is that Orion is no longer getting data back when polling this node. Am I getting an error? Does the Node have a red flashing box? Am I getting a daily report that something is amiss? No on all counts. This is frustrating beyond my ability to express it.
So, how many nodes have this issue? According to tech support, there is no way to run a report to find out. I suppose I could just pull up all 2000 of my nodes one at a time to check. Seriously.
Last night I had another issue that is in the same vein but somewhat different. There was some amount of service stopping and starting on my main server, and several nodes just stopped being polled. How many? Again, no way to know. They just appear green, but there are no statistics being collected. No Application stats in SAM. No CPU. No Memory. No Disk. No network stats, including latency, which leads me to believe no polling (not even ping) was occurring. Error? Down or unknown node? Nope. All happy green. Nothing wrong here. *SIGH* Rebooting an additional poller fixed the issue, but I was just lucky to stumble upon it before the Thanksgiving holiday.
So there's the rant. Now let's talk solutions:
For UnDP issues, Orion should 1) Change the Node graphic to have a flashing red box, as if an interface were down (or make it a global check box option), and 2) Create a report to show the Top 10/All nodes with UnDP pollers that have not updated in 12 hours or some other arbitrary value.
For the times when Orion mysteriously stops polling, I'm open to suggestions. Maybe each poller should have a separate process that checks all nodes on all pollers to be sure data is populating. Run it every hour, once a day, whatever. It's not complicated: just check whether there is SOMETHING from a node in the last hour or so.
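To show how simple the check could be, here is a minimal Python sketch of the logic. It does not talk to Orion at all: I'm assuming you can already get, by whatever means, the timestamp of the newest sample for each node. The function name, the node names, and the dates are all made up for illustration. The same function covers both ideas above: a 12-hour threshold with a Top N cutoff for the UnDP report, or a 1-hour threshold for the "is anything being collected" check.

```python
from datetime import datetime, timedelta, timezone


def find_stale_nodes(last_seen, now, max_age, top_n=None):
    """Return (node, age) pairs whose newest sample is older than max_age.

    last_seen maps a node name to the timestamp of its most recent sample,
    or to None if nothing has ever been collected for that node.
    An age of None in the result means "no data at all".
    """
    stale = []
    for node, timestamp in last_seen.items():
        age = None if timestamp is None else now - timestamp
        if age is None or age > max_age:
            stale.append((node, age))

    def worst_first(item):
        _, age = item
        if age is None:
            return (0, 0.0)  # nodes with no data sort to the top
        return (1, -age.total_seconds())  # then the oldest data first

    stale.sort(key=worst_first)
    return stale if top_n is None else stale[:top_n]


# Made-up sample data: hypothetical node names and timestamps.
now = datetime(2020, 1, 15, 9, 0, tzinfo=timezone.utc)
last_seen = {
    "ups-01": now - timedelta(days=60),         # stopped updating long ago
    "core-sw-01": now - timedelta(minutes=5),   # healthy
    "app-srv-07": None,                         # never returned anything
    "db-srv-02": now - timedelta(hours=3),      # quiet for a few hours
}

hourly_check = find_stale_nodes(last_seen, now, timedelta(hours=1))
undp_report = find_stale_nodes(last_seen, now, timedelta(hours=12), top_n=10)

for node, age in hourly_check:
    print(node, "no data" if age is None else f"last sample {age} ago")
```

With this sample data the hourly check flags app-srv-07, ups-01, and db-srv-02, while the 12-hour report flags only the first two. The hard part is not this comparison, it is getting the product to expose the last-sample timestamps and raise an alert on the result.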
Am I alone in seeing these issues?