Bug #658

closed

Ensure LDAP is started before services using it

Added by Marc Dequènes over 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Category: Service :: IS / AAA / PKI
Start date: 2019-07-14
Due date:
% Done: 100%
Estimated time:
Patch Available:
Confirmed: No
Branch:
Entity: DuckCorp
Security:
Help Needed:

Description

For example, PHP-FPM is started too early on Toushirou. We need to list all the affected services.

As this is specific to our use of LDAP accounts for certain services, I think a DC-specific service file distributed by the dc-ldap role could ensure that all affected services wait until slapd is up (on LDAP servers only).
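As a minimal sketch of what such a service file could look like, the snippet below renders a systemd drop-in that orders a unit after slapd. The drop-in file name `wait-for-ldap.conf`, the helper name `render_ldap_dropin`, and the use of `Wants=` alongside `After=` are assumptions for illustration, not what the dc-ldap role actually ships.

```python
from pathlib import PurePosixPath
from typing import Tuple


def render_ldap_dropin(unit: str, ldap_unit: str = "slapd.service") -> Tuple[str, str]:
    """Return (path, content) of a drop-in that orders `unit` after `ldap_unit`.

    The drop-in file name is hypothetical; only the directory layout
    /etc/systemd/system/<unit>.d/*.conf is standard systemd behaviour.
    """
    path = PurePosixPath("/etc/systemd/system") / f"{unit}.d" / "wait-for-ldap.conf"
    content = (
        "[Unit]\n"
        f"After={ldap_unit}\n"   # start only once the LDAP unit has been started
        f"Wants={ldap_unit}\n"   # pull the LDAP unit into the same transaction
    )
    return str(path), content
```

Calling `render_ldap_dropin("php7.3-fpm.service")` gives the path and content for the PHP-FPM unit mentioned above. Note that `After=` only helps as far as the LDAP unit reports readiness accurately, so this may not be sufficient on its own.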

Actions #1

Updated by Marc Dequènes over 3 years ago

PHP-FPM needs to wait for LDAP:

Aug 18 05:55:50 Toushirou php-fpm7.3[2175]: [18-Aug-2020 05:55:50] ERROR: [pool albums.georgesleyeti.fr] cannot get uid for user 'georgesleyeti'
Aug 18 05:55:50 Toushirou php-fpm7.3[2175]: [18-Aug-2020 05:55:50] ERROR: FPM initialization failed
Aug 18 05:55:50 Toushirou systemd[1]: php7.3-fpm.service: Main process exited, code=exited, status=78/CONFIG
Aug 18 05:55:50 Toushirou systemd[1]: php7.3-fpm.service: Failed with result 'exit-code'.

Actions #2

Updated by Marc Dequènes over 3 years ago

One more:

Aug 18 14:21:45 Elwing systemd[5160]: duck-calibre-server.service: Failed to determine user credentials: No such process
Aug 18 14:21:45 Elwing systemd[5160]: duck-calibre-server.service: Failed at step USER spawning /usr/bin/calibre-server: No such process
Aug 18 14:21:45 Elwing systemd[1]: duck-calibre-server.service: Main process exited, code=exited, status=217/USER
Aug 18 14:21:45 Elwing systemd[1]: duck-calibre-server.service: Failed with result 'exit-code'.

Actions #3

Updated by Marc Dequènes over 3 years ago

  • Status changed from New to Resolved
  • Assignee set to Marc Dequènes
  • % Done changed from 0 to 100

There was a fix for duck-calibre-server.service but that was not sufficient; it should be ok now.

I also added a fix in the httpd_php_fpm role.

Since we rebooted all machines today, that should be all.

Actions #4

Updated by Pierre-Louis Bonicoli about 3 years ago

This issue occurred yesterday on Toushirou:

1. slapd package has been updated by unattended-upgrade (from /var/log/unattended-upgrades/unattended-upgrades-dpkg.log)

Log started: 2021-02-04  06:18:23
apt-listchanges: Reading changelogs...
Preconfiguring packages ...
[...]
Unpacking slapd (2.4.47+dfsg-3+deb10u5) over (2.4.47+dfsg-3+deb10u4) ...
[...]
Setting up slapd (2.4.47+dfsg-3+deb10u5) ...
  Backing up /etc/ldap/slapd.d in /var/backups/slapd-2.4.47+dfsg-3+deb10u4... done.
Processing triggers for systemd (241-7~deb10u5) ...
Processing triggers for man-db (2.8.5-2) ...
Processing triggers for libc-bin (2.28-10) ...
[...]
Restarting services...
 systemctl restart apache2.service clamav-freshclam.service dovecot.service matrix-synapse.service nslcd.service php7.3-fpm.service postfix@-.service proftpd.service tt-rss.service zabbix-agent.service
Job for php7.3-fpm.service failed because the control process exited with error code.
See "systemctl status php7.3-fpm.service" and "journalctl -xe" for details.
[...]

2. php-fpm status
journalctl -u php7.3-fpm.service
-- Logs begin at Tue 2021-02-02 18:55:52 CET, end at Thu 2021-02-04 15:27:15 CET. --
Feb 04 06:18:36 Toushirou systemd[1]: Stopping The PHP 7.3 FastCGI Process Manager...
Feb 04 06:18:36 Toushirou systemd[1]: php7.3-fpm.service: Succeeded.
Feb 04 06:18:36 Toushirou systemd[1]: Stopped The PHP 7.3 FastCGI Process Manager.
Feb 04 06:18:36 Toushirou systemd[1]: php7.3-fpm.service: Consumed 13h 12min 20.075s CPU time.
Feb 04 06:18:36 Toushirou systemd[1]: Starting The PHP 7.3 FastCGI Process Manager...
Feb 04 06:18:37 Toushirou php-fpm7.3[12259]: [04-Feb-2021 06:18:37] ERROR: [pool albums.georgesleyeti.fr] cannot get uid for user 'georgesleyeti'
Feb 04 06:18:37 Toushirou php-fpm7.3[12259]: [04-Feb-2021 06:18:37] ERROR: FPM initialization failed
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Main process exited, code=exited, status=78/CONFIG
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Failed with result 'exit-code'.
Feb 04 06:18:37 Toushirou systemd[1]: Failed to start The PHP 7.3 FastCGI Process Manager.
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Consumed 62ms CPU time.

3. slapd logs
journalctl -u slapd.service 
-- Logs begin at Tue 2021-02-02 18:55:52 CET, end at Thu 2021-02-04 15:55:26 CET. --
Feb 04 06:18:30 Toushirou systemd[1]: Stopping LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Feb 04 06:18:30 Toushirou slapd[841]: daemon: shutdown requested and initiated.
Feb 04 06:18:30 Toushirou slapd[841]: slapd shutdown: waiting for 2 operations/tasks to finish
Feb 04 06:18:30 Toushirou slapd[841]: DIGEST-MD5 common mech free
Feb 04 06:18:30 Toushirou slapd[841]: DIGEST-MD5 common mech free
Feb 04 06:18:30 Toushirou slapd[841]: slapd stopped.
Feb 04 06:18:30 Toushirou slapd[11005]: Stopping OpenLDAP: slapd.
Feb 04 06:18:30 Toushirou systemd[1]: slapd.service: Succeeded.
Feb 04 06:18:30 Toushirou systemd[1]: Stopped LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Feb 04 06:18:30 Toushirou systemd[1]: slapd.service: Consumed 4h 34min 19.908s CPU time.
Feb 04 06:18:30 Toushirou systemd[1]: Starting LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Feb 04 06:18:30 Toushirou slapd[11016]: @(#) $OpenLDAP: slapd  (Jan 22 2021 03:54:40) $
                                                Debian OpenLDAP Maintainers <pkg-openldap-devel@lists.alioth.debian.org>
Feb 04 06:18:30 Toushirou slapd[11017]: slapd starting
Feb 04 06:18:30 Toushirou slapd[11011]: Starting OpenLDAP: slapd.
Feb 04 06:18:30 Toushirou systemd[1]: Started LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Feb 04 06:18:30 Toushirou slapd[11017]: do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:18:30 Toushirou slapd[11017]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrep2: rid=002 (-1) Can't contact LDAP server
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrep2: rid=004 (-1) Can't contact LDAP server
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrepl: rid=002 rc -1 retrying (2 retries left)
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrepl: rid=004 rc -1 retrying (2 retries left)
Feb 04 06:31:38 Toushirou slapd[11017]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:31:38 Toushirou slapd[11017]: do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

@Marc Dequènes shouldn't this issue be reopened?

Actions #5

Updated by Pierre-Louis Bonicoli about 3 years ago

Some workarounds:
  • add an ExecStartPre condition to the php-fpm unit
  • use a cache (for example sssd)
  • patch OpenLDAP so that it calls sd_notify
  • migrate to 389-ds, which uses sd_notify
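The first workaround can be sketched as a small pre-start helper that polls NSS until the pool user resolves. This is only an illustration: the function names, the timeout, and the polling interval are assumptions, and wiring it into the php-fpm unit through systemd's `ExecStartPre=` is left out.

```python
import pwd
import sys
import time


def wait_for_user(name: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll NSS until `name` resolves; return False if `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            pwd.getpwnam(name)  # raises KeyError while the account is unknown
            return True
        except KeyError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def main(argv: list) -> int:
    """Hypothetical CLI entry point: exit code 0 once the user resolves, 1 on timeout."""
    if len(argv) != 2:
        print("usage: wait-for-user USERNAME", file=sys.stderr)
        return 2
    return 0 if wait_for_user(argv[1]) else 1
```

When run as a script with `sys.exit(main(sys.argv))`, a non-zero exit makes systemd fail the start rather than launch php-fpm before its pool users are resolvable. As with any `ExecStartPre=` approach, this only covers the one service it is attached to.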

Actions #6

Updated by Marc Dequènes about 3 years ago

  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 80

Actions #7

Updated by Marc Dequènes almost 3 years ago

I think changing our LDAP server must be part of a larger improvement project, such as switching to FreeIPA. At the moment I have no time for this.

The patch is broken according to upstream, and I fear it would become difficult to maintain anyway.

I think having a cache would benefit all processes accessing the filesystem, and also the shell server, which makes network calls for each filesystem access, so I think it's a very good idea. That does not mean that getting the situation fixed in the LDAP server is not something we should pursue. I really prefer that to the ExecStartPre workaround, which is limited to one service.

Actions #8

Updated by Marc Dequènes almost 3 years ago

  • Category set to Service :: IS / AAA / PKI
  • % Done changed from 80 to 90

I deployed sssd; it was a real pain in the ass but it is working now.

Notes:

For the last two RH explicitly refused to implement it.

Because of the access filter limitation, we cannot replace nslcd and have to use both…

Let's see if all our problems are fixed now.

Actions #9

Updated by Marc Dequènes almost 3 years ago

sssd was also restarted by needrestart, so I excluded this service from automatic restarts.

Actions #10

Updated by Marc Dequènes over 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

We're good.
