Introduction
Introduction to Really Simple Monitoring and Lokad.Monitoring
Software is hard. Although designing apps for the web (and now for the cloud) has never been easier, you still need to keep an eye on all the apps that you support. At Lokad, we have been designing rather complex mashups of enterprise apps, and we got the feeling that some aspects were not properly addressed by existing monitoring solutions.

This project is multiple things at once:
- a philosophy: why and how should you monitor your apps?
- an XML format: the Really Simple Monitoring format (RSM).
- a web client designed for Windows Azure: Lokad.Monitoring.

The approach and the RSM format are not specific to Windows Azure or Microsoft apps. We believe they apply to nearly every enterprise app (or even complex consumer apps).

Why do you even need monitoring?

Availability and integrity

Cloud computing mostly relieves organizations from dealing with hardware faults, but it does not prevent apps from being misdeployed or misconstrued. The more apps you have, the more painful it becomes to check that everything is running just fine at any given time.

Challenge: monitoring is domain-specific. Although there might be some generic monitoring, such as checking whether an HTTP request gets a 200 OK or not, for any complex app, checking the inner state of the app is what really matters. Yet there is no generic notion of what the correct state of an app is. Only the app developer can figure that out.

Collecting exceptional events

Faults or exceptions of any kind should be fixed. Yet this is easier said than done. In particular, if the developer has to routinely check every single log of every single app ever deployed, problems are likely to stay undetected for a long time.

Challenge: faults travel in packs. Collecting exceptions isn't hard. The problem is that it is so easy to be overwhelmed by them. Human analysis of logs scales very poorly.
It does not take more than a few hundred log entries to exhaust the patience of the sysadmin (aka one of us), which is really not that much.

Collecting key indicators

With the cloud, performance problems become both easier and harder at the same time. Easier because, if the app is correctly designed, adding more machines (scaling out) is THE solution. Harder because, suddenly, performance depends on subtle interactions between the demand and whatever resource allocation policy is set up.

Challenge: performance makes (more) sense over time. When it comes to performance measurement, numbers don't make much sense in isolation. It is usually much more insightful to compare numbers, especially to compare them over time as time series.

Monitoring is a Tax

As illustrated above, monitoring is best when domain-specific. Swamping sysadmins with a big pile of logs is easy and pointless. The point is to expose the few key pieces of information that really matter. Such an approach means that every app should implement some ad hoc monitoring logic: this is the monitoring tax. It does not strictly add value to your app, at least not feature-wise, but it does make the app much more robust when bad things happen.

Really Simple Monitoring

Since monitoring is a tax, the problem quickly becomes: how do we make the monitoring tax as low as possible? The approach adopted by Lokad is called Really Simple Monitoring. It consists in letting every single app expose a single REST endpoint with the following format:

<rsm>
  <messages>
    <message>
      <id>1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated>2010-05-02T18:30:02Z</updated>
      <title>DatabaseNotReachable</title>
      <summary>Can't connect to database.</summary>
      <tags>fault database network</tags>
      <link>http://example.lokad.com/database/errors</link>
    </message>
    <message> ... </message>
  </messages>
  <indicators>
    <indicator>
      <name>/forecast-flow/count</name>
      <started>2010-05-01T03:24:18Z</started>
      <updated>2010-05-02T18:30:02Z</updated>
      <instance>1225c695-cfb8-4ebb-aaaa-80da344efa6a</instance>
      <value>1223</value>
      <tags>forecastflow</tags>
      <link>http://example.lokad.com/forecast-flow/count</link>
    </indicator>
    <indicator> ... </indicator>
  </indicators>
</rsm>

In short, messages represent all dated events such as logs or exceptions. Indicators represent all monitored values describing the state of the app at the time the observation is made.

RSM is a minimal POX format (POX = Plain Old XML) to be naively exposed over HTTP(S). RSM is dumb on purpose. If anything smart has to be done, it is the duty of the monitoring client, not of the monitored app.

Lokad.Monitoring, web client for Azure

This app comes as a package ready to be deployed on Windows Azure. It supports:
- collecting RSM logs.
- creating custom reports with custom rules to decide whether an app is green or not.
- exposing the status of your apps through widgets.
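To illustrate how little a monitoring client needs in order to consume RSM, here is a minimal Python sketch that parses an RSM document and applies one custom rule. It is not how Lokad.Monitoring is implemented. The function names `parse_rsm` and `is_green`, the 24-hour window, the rule "a recent message tagged `fault` turns the app red", and the assumption that indicator values are numeric are all hypothetical choices made for the example. The sample payload reuses the values from the format shown above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Sample RSM payload reusing the values from the format description.
SAMPLE_RSM = """<rsm>
  <messages>
    <message>
      <id>1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated>2010-05-02T18:30:02Z</updated>
      <title>DatabaseNotReachable</title>
      <summary>Can't connect to database.</summary>
      <tags>fault database network</tags>
      <link>http://example.lokad.com/database/errors</link>
    </message>
  </messages>
  <indicators>
    <indicator>
      <name>/forecast-flow/count</name>
      <started>2010-05-01T03:24:18Z</started>
      <updated>2010-05-02T18:30:02Z</updated>
      <instance>1225c695-cfb8-4ebb-aaaa-80da344efa6a</instance>
      <value>1223</value>
      <tags>forecastflow</tags>
      <link>http://example.lokad.com/forecast-flow/count</link>
    </indicator>
  </indicators>
</rsm>"""


def parse_timestamp(text):
    """Parse a UTC timestamp such as 2010-05-02T18:30:02Z (assumed format)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_rsm(xml_text):
    """Return (messages, indicators) as lists of plain dictionaries."""
    root = ET.fromstring(xml_text)
    messages = [
        {
            "id": m.findtext("id"),
            "updated": parse_timestamp(m.findtext("updated")),
            "title": m.findtext("title"),
            "summary": m.findtext("summary"),
            "tags": (m.findtext("tags") or "").split(),
            "link": m.findtext("link"),
        }
        for m in root.findall("./messages/message")
    ]
    indicators = [
        {
            "name": i.findtext("name"),
            "started": parse_timestamp(i.findtext("started")),
            "updated": parse_timestamp(i.findtext("updated")),
            "instance": i.findtext("instance"),
            # Assumption: indicator values are numeric.
            "value": float(i.findtext("value")),
            "tags": (i.findtext("tags") or "").split(),
            "link": i.findtext("link"),
        }
        for i in root.findall("./indicators/indicator")
    ]
    return messages, indicators


def is_green(messages, now, max_age=timedelta(hours=24)):
    """Hypothetical rule: red if any message tagged 'fault' is recent."""
    for message in messages:
        recent = now - message["updated"] <= max_age
        if recent and "fault" in message["tags"]:
            return False
    return True


if __name__ == "__main__":
    messages, indicators = parse_rsm(SAMPLE_RSM)
    print(is_green(messages, datetime.now(timezone.utc)))
```

All of the logic lives in the client, which is consistent with the idea that RSM stays dumb on purpose: the monitored app only publishes messages and indicators, and the decision about what counts as green is made elsewhere.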