|
CalmonDesignDoc
Design document for the Calmon monitoring project
Calmon - a monitoring system
IntroductionThis is a work in progress. Many, many things are yet undefined. You can help. Email comments to jls-calmon@semicomplete.com BackgroundThere are things I like about nagios. There are things I do not like. From my experiences, it solves about half of my monitoring needs, not including the need for other tools: munin, collectd, cacti, monit, etc. Many other monitoring systems (some commercial, some free) are simply forks of Nagios and provide little on top of what nagios already provides. Those that are not direct descendants of Nagios fall to other design problems such as only supporting discovery instead of directly defining nodes or devices (see Source of Truth below). Few address notifications, acknowledgements, and escalations. Fewer still support aggregates or estimations easily. This goes way beyond the problems that any given monitoring tool has a poor dashboard or UI. Monitoring is hard to get right. Monitoring system metrics is important. Monitoring business metrics is also important. A complex computer system will need to monitor things as well as groups of things and aggregate data from multiple sources to correlate failures with causes. A good monitoring system allows you to check for health, test assumptions, correlate data, and automate reactions. There are many monitoring tools ranging from narrow to wide scopes. Some are complete, others are incomplete. None I have found have enough of the features I want in the basic system. Some monitoring tools help you from check to notification, but don't adequately handle common alert methods and don't take feedback (acknowledgements, etc) easily. Goals
Nomenclature
XXX How do we define a group of hosts? Or a group of host:service pairs? ConsiderationsThe following are considerations made which impact the design and feature set of Calmon. Source of TruthYou need a source of truth to define what hosts have which services, and which services are important, and which devices are unimportant, etc. In Nagios, for example, you define truth in the config files by defining hosts, services, commands, contacts, schedules, etc. In OpenNMS, some truth comes from the network through discovery (ping sweeps to find hosts, etc). In other systems you might have a SQL database that has devices, network information, etc. Truth is important for any automated system to reflect the intentions of its users. Truth comes from humans. Truth is how you define the system. There are two main philosophies about where this source of truth comes from: First philosophy is that the network serves as your source of truth. You know that a switch's port 34 belongs to 'host foo' because you put that in the port description. You know that the server foo.example.com is a fileserver because you noted that in /etc/motd, or because nfsd is active. In this philosophy, any truth is knowable by simply asking systems about themselves because you spread your truth across all nodes and devices in your network of datacenters, systems, etc. The second philosophy is to centrally store your source of truth. Any truth queries (What host am I? What packages do I need? What vlans belong on this switch port?) are answerable from this central authority. As above, server foo.example.com is a fileserver because you told the central truth source that it is a fileserver and thus should have nfsd running and configured appropriately. In practice, centralized truth can range from a simple yaml file to an database-backed web app. How to store truth is out of scope for Calmon. Thus, since there are two fairly different philosophies, it is not realistic for Calmon to provide either method of truth storage. Instead, Calmon should provide a way to plug your own source of truth into it, feeding it host, service, test, and other information. Push vs PullIn this context, push means that one networked process sends data to another process; pull means that one networked process requests data of another process. In a monitoring system, this usually describers how tests or checks are invoked and how results are moved around. Fundamentally, the end result is the same: data is transferred. The only decision is then to choose which node initiates the connection. Both push and pull are valid methods. In some situations, one may be required due to firewall rules. In other situations, it make make more sense to use pull if only to centralize where scheduling decisions are made (nagios does this, for example). From this point of view, it seems necessary to support both push and pull for sending data and requests between monitoring nodes. Scaling PerforamanceScale everything. Do it well, or reasonably well. This includes, but is not necessarily limited to:
Scaling UsabilitySmall/simple infrastructures have different needs than large or complex infrastructures. In order to support both, we can treat all infrastructures as large and complex, since small and/or simple is a subset. The way to shield the complexity in configuration for a simple environment is to provide a healthy set of sane defaults. This means we'll need to provide a standard library and apply some default decisions to some of our objects (hosts/devices, contacts, notifications, etc). For any non-trivial number of things monitored, it must be easy to search for things in the system. Rollup and aggregates must be easy to use. That is, it should be easy to configure a thing called 'frontends' that encompasses all your frontends. Further, it should be easy to view this aggregate in the dashboard and query it from the API. Configuration TargetsTBD I want to target the most common system infrastructures in use today. I also want to make it easy/possible to do less common infrastructures. Anyway, I need more descriptions of infrastructures and services and how you would ideally monitor them before I can really pick a common set of targets. Some infrastructure scenarios:
The main differences between cloud and noncloud is, in my mind, simply a difference in your source(s) of truth. On EC2, for example, your list of host and other data is queryable via the API. ArchitectureComponentsMonitoring systems generally include a few active components:
Other components:
Undecided ownership:
AgentThe agent allows you to run tests and do collection on individual nodes. This is similar to Nagios NRPE in idea. It also seems reasonable that an agent could be standalone (a la NRPE) or embedded into any number of applications. Embedded agents would give you invaluable white-box testing in your live production environment. Agents can push test reports or have test reports requested. Pull scenarios make sense for periodic checks orchestrated by a master server. Push scenarios make sense for a local agent to push report information as soon as it has it. For example, you could embed agent functionality into your own software that, upon known-important exceptions or failures, will push reports up to the calmon server which can result in an immediate notification. ServerReceives or collects (push vs pull) data from agents and tests. This data is logged for historical viewing. The server should also manage notifications, silences, escalations, etc. The server can also act like a proxy. Dashboard
APIAn API, probably networked, is required for allowing calmon to be easily used and abused by other tools. API should also provide search features. ProxyProxies act as store-and-forward systems between agents and servers. They are not necessarily needed in all situations. Escalation ManagementVery common to have a situation where you have a primary and secondary on-call person. If the primary does not acknowledge a page within a time threshold, it should escalate to the secondary person. Other cases may eventually want a tertiary. Escalation should continue trying to alert people in the rotation (primary, secondary, etc) until someone acknowledges. Additionally, acknowledging a page doesn't mean you acknowledge a problem. Notification ManagementSometimes problem in a single system (such as database failures) can cause problems in multiple applications, causing massive floods of notifications from monitoring systems. This sucks. Some options are available:
SchemaIn general, we will be checking various things for health. Each check we will call a test. From software engineering unit testing, you have a test that is part of a test suite. Multiple test suites may exist for a given product. In Calmon:
Test results include test output, exit status, duration, and start time. A host is a network-addressable device. It can be a switch, router, firewall, storage node, plain server, blade, etc. ? Given virtual machines and blades, would it make sense to allow hosts to have host children? ? A service, such as 'production master database,' has certain behaviors that need to be verified. Each behavior could be represented as one or more tests. Since some behavior testing is reusable, such as "mysql" we should allow for services to have services. ? Where should we define scheduling (how often to run, what times of day/week to run or not run). Addons, Plugins, ExtensionsCalmon should be easy to extend, hook, and abuse.
API FeaturesHTTP(S) + REST? TBD. Dashboard Features
MiscNomenclature for Nagios users
Nagios calls 'plugins' what I would call any given program. For example, in Nagios you may define a service that runs a nagios plugin 'check_tcp' on port 22. In Calmon, we call that a test that runs a command 'check_tcp' on port 22. Services are much more than single tests, so it makes sense to make Service a collection of tests. | ||||||||||