Enhancement #743: Switching to Prometheus? - DuckCorp Infrastructure - DuckCorp Projects

Added by Marc Dequènes almost 4 years ago. Updated about 2 years ago.

Status: In Progress
Priority: Normal
Assignee: Marc Dequènes
Category: Service :: Supervision
Start date: 2021-11-17
% Done: 60%
Branch: monitoring_prometheus
Entity: DuckCorp

Description

With exporters gaining TLS support there is no obvious major problem left and we can do some testing.

I've started a new playbook and role to experiment and so far it is working well.

Some thoughts, in no particular order:

zabbix: configuring everything through the slow UI is hard
zabbix: certain features are slow to come (#495, native systemd support, LLD web checks…)
prometheus_ansible_role: no service autodetection anymore; I found it very easy to map "features" to inventory groups or variables, and it's now easy to manually disable or force-enable one if needed
prometheus: a nice feature is coming to help split the config, but in the meantime I might be able to use file_sd_configs and avoid passing inventory vars directly into the role to work around the problem
grafana: I would have preferred Grafana to be packaged in Debian, but in the end it's very handy to use their dashboard libraries and avoid spending hours and hours designing every little graph
prometheus: the textfile collector can be an alternative when there is no exporter or when it's not packaged (used for NTP/chrony)
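
To illustrate the file_sd_configs idea: Prometheus can read scrape targets from JSON (or YAML) files referenced under file_sd_configs, so targets could be written to separate files instead of being templated into the main config. The following is only a sketch under assumptions; the group names, host names, ports, the `feature` label and both function names are hypothetical and not taken from the role.

```python
import json
import os
import tempfile


def build_file_sd(groups, port_by_group):
    """Build a file_sd target list from inventory-like groups.

    groups: mapping of group name -> list of host names (hypothetical data).
    port_by_group: mapping of group name -> exporter port (hypothetical data).
    Groups without a known port or without hosts are skipped.
    """
    entries = []
    for group, hosts in sorted(groups.items()):
        port = port_by_group.get(group)
        if port is None or not hosts:
            continue
        entries.append({
            "targets": [f"{host}:{port}" for host in sorted(hosts)],
            # "feature" is an illustrative label name, not a Prometheus convention.
            "labels": {"feature": group},
        })
    return entries


def write_file_sd(path, entries):
    """Write the JSON file via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(entries, handle, indent=2)
    # mkstemp creates the file with mode 0600; make it readable by the
    # (assumed separate) prometheus user.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)


if __name__ == "__main__":
    demo = build_file_sd(
        {"postfix": ["mx2.example.org", "mx1.example.org"], "bind": []},
        {"postfix": 9154, "bind": 9119},  # placeholder ports
    )
    print(json.dumps(demo, indent=2))
```

Writing to a temporary file and renaming it means Prometheus should never pick up a half-written target list when it reloads the file.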

What we have so far:

node: the basics, and all the hardware goodies (temperature etc.) seem to be there too
poller stats
Bind
Postfix
Apache
PostgreSQL
LXD
blackbox, with checks of almost all public service endpoints, including TLS and protocol checks when possible
Prosody, but there is no Grafana dashboard and the number of stats is limited; additional modules called measure_* could complement them but they are not packaged
MySQL
NTP
Nextcloud
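
As an illustration of the textfile collector approach used for NTP/chrony: node_exporter reads `*.prom` files from the directory given with `--collector.textfile.directory`, so a small script run periodically can publish metrics without a dedicated exporter. The sketch below assumes the value has already been obtained elsewhere; the metric name, label and function names are hypothetical and this is not the script used in the role.

```python
import os
import tempfile


def _escape_label(value):
    """Escape a label value for the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, value, metric_type="gauge", labels=None):
    """Render one metric with its HELP and TYPE lines."""
    label_str = ""
    if labels:
        pairs = ",".join(
            f'{key}="{_escape_label(val)}"' for key, val in sorted(labels.items())
        )
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_str} {value}\n"
    )


def write_prom_file(directory, basename, body):
    """Write <basename>.prom atomically so a scrape never sees a partial file."""
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=basename + ".", suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp, 0o644)  # readable by the node_exporter user
    final = os.path.join(directory, basename + ".prom")
    os.replace(tmp, final)
    return final


if __name__ == "__main__":
    # Hypothetical metric name and value, for illustration only.
    print(render_metric("chrony_offset_seconds", "Current clock offset.", 0.25), end="")
```

The temporary file deliberately does not end in `.prom`, so the collector ignores it until the rename has happened.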

I was able to set up several exporters and borrow various alerts from https://awesome-prometheus-alerts.grep.to/. Even though we now have more than before in certain areas, I'd like to check whether we're missing something important compared to our Zabbix installation:

~~time sync is checked but NTPd stats are missing; there is an exporter but it is not packaged~~
no maps; they were cute but also utterly useless
ProFTPd, but I'm not sure it's worth it now
SNMP checks for my internal switches, more out of curiosity
SNMP checks for my printer, but I don't use it very often so it's not critical
OpenLDAP stats, more out of curiosity
~~MDA, this is important~~
~~MySQL, also important~~
~~alerts via mail, IRC and XMPP~~

What I plan to look at:

[WIP] make the role generic and split it from our main repo (and use it at OSCI)
~~generation of alerter contacts and alert methods (Matrix, XMPP, Mail)~~
~~blackbox, maybe replace smokeping? add check for certs, DNSSEC etc~~
[WIP] grafana base config generation
~~MySQL exporter~~
SNMP for my internal switches
could we make certain graphs public? (like pings etc?)
~~Dovecot exporter, but not packaged in Debian~~
~~Nextcloud exporter~~ backported and bumped to 0.5.0 for token auth support
Matrix alert hook, but not packaged in Debian
node exporter maintainers do not want to add systemd service stats but there is a systemd exporter that would help get per-service resource consumption stats
the IRC relay displays only limited info, has no severity coloring, and sometimes disconnects and is unable to reconnect; NinjaBot seems to be a nice alternative
SSH checks on non-standard port (currently Orthos and Nicecity checks only check the gateway…)

Updated by Marc Dequènes almost 4 years ago

Branch set to monitoring_prometheus

Updated by Marc Dequènes almost 4 years ago

Description updated (diff)
Status changed from New to In Progress

Updated by Marc Dequènes almost 4 years ago

% Done changed from 0 to 10

Updated by Pierre-Louis Bonicoli almost 4 years ago

Orthos will need to be able to connect to Nicecity (TCP port 9089), so we hit #711. I propose applying the same patch to the Nicecity firewall configuration; is that OK?

Updated by Marc Dequènes almost 4 years ago

Description updated (diff)

Updated by Marc Dequènes almost 4 years ago

Description updated (diff)

Updated by Marc Dequènes almost 4 years ago

Description updated (diff)

Updated by Marc Dequènes almost 4 years ago

Description updated (diff)
% Done changed from 10 to 50

Zabbix was replaced by Prometheus.

Updated by Marc Dequènes over 3 years ago

Description updated (diff)

Updated by Marc Dequènes over 3 years ago

Description updated (diff)

Updated by Marc Dequènes over 3 years ago

Description updated (diff)

Updated by Marc Dequènes over 3 years ago

Description updated (diff)
% Done changed from 50 to 60

Updated by Marc Dequènes about 2 years ago

Description updated (diff)

Dovecot added OpenMetrics support, and I added the configuration to the role and to our deployment. I found the Dovecot config as well as a Grafana dashboard template in an interesting article. The template required some fixes, and I discovered that exporter_exporter does not support the INFO type, but that's out of the way now.
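
The comment above does not say how the INFO type problem was solved. Purely as an illustration of the kind of mismatch involved: OpenMetrics declares an info family `foo` whose sample is named `foo_info`, while the classic Prometheus text format has no `info` type. One conceivable workaround, assumed here and not confirmed by this ticket, is to redeclare such families as gauges before the text reaches a parser that only knows the classic types. The function name and metric names below are hypothetical.

```python
def downgrade_info_metrics(exposition):
    """Rewrite OpenMetrics 'info' families as classic-format gauges.

    '# TYPE foo info' becomes '# TYPE foo_info gauge', and the matching
    HELP line is renamed the same way. Sample lines are left untouched,
    since info samples already carry the '_info' suffix.
    """
    lines = exposition.splitlines()

    # First pass: collect the names of all families declared as info.
    info_families = set()
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["#", "TYPE"] and parts[3] == "info":
            info_families.add(parts[2])

    # Second pass: rename the HELP/TYPE metadata of those families.
    out = []
    for line in lines:
        parts = line.split(None, 3)
        is_meta = len(parts) >= 3 and parts[0] == "#" and parts[1] in ("HELP", "TYPE")
        if is_meta and parts[2] in info_families:
            name = parts[2] + "_info"
            if parts[1] == "TYPE":
                out.append(f"# TYPE {name} gauge")
            else:
                rest = parts[3] if len(parts) > 3 else ""
                out.append(f"# HELP {name} {rest}".rstrip())
            continue
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "# HELP demo_build Build details.\n"
        "# TYPE demo_build info\n"
        'demo_build_info{version="1.0"} 1\n'
        "# EOF\n"
    )
    print(downgrade_info_metrics(sample), end="")
```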
