Bug #658

Ensure LDAP is started before services using it

Added by Marc Dequènes almost 2 years ago. Updated 19 days ago.

Status: In Progress
Priority: High
Category: Service :: IS / AAA / PKI
Start date: 2019-07-14
Due date:
% Done: 90%
Estimated time:
Patch Available:
Confirmed: No
Branch:
Entity: DuckCorp
Security:
Help Needed:

Description

For example, PHP-FPM is started too early on Toushirou. We need to list all the affected services.

As this is specific to our use of LDAP accounts for certain services, I think a DC-specific service file distributed by the dc-ldap role could ensure all affected services wait until slapd is up (on LDAP servers only).
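
A minimal sketch of what such a role could deploy, assuming a systemd drop-in per affected unit is acceptable. The function name `write_ldap_dropins`, the `DROPIN_ROOT` variable and the file name `dc-ldap.conf` are hypothetical and not taken from the dc-ldap role. Note that `After=` only orders unit start-up; it does not guarantee that slapd already answers queries.

```sh
#!/bin/sh
# Hypothetical helper: write an ordering drop-in for each unit given as argument.
# DROPIN_ROOT (an assumption of this sketch) defaults to /etc/systemd/system;
# point it at a scratch directory to try the function without touching the system.
write_ldap_dropins() {
    root="${DROPIN_ROOT:-/etc/systemd/system}"
    for unit in "$@"; do
        # systemd reads drop-ins from <unit>.d/*.conf
        mkdir -p "$root/$unit.d" || return 1
        # After= only orders the unit after slapd.service when both are started
        printf '[Unit]\nAfter=slapd.service\n' > "$root/$unit.d/dc-ldap.conf" || return 1
    done
}
```

It could be called as `write_ldap_dropins php7.3-fpm.service duck-calibre-server.service`, followed by `systemctl daemon-reload` so that systemd picks up the new files.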

Associated revisions

Revision e4ed7753 (diff)
Added by Marc Dequènes 19 days ago

dc-accounts: use sssd to handle NSS with caching

refs #658

History

#1

Updated by Marc Dequènes 9 months ago

PHP-FPM needs to wait for LDAP:

Aug 18 05:55:50 Toushirou php-fpm7.3[2175]: [18-Aug-2020 05:55:50] ERROR: [pool albums.georgesleyeti.fr] cannot get uid for user 'georgesleyeti'
Aug 18 05:55:50 Toushirou php-fpm7.3[2175]: [18-Aug-2020 05:55:50] ERROR: FPM initialization failed
Aug 18 05:55:50 Toushirou systemd[1]: php7.3-fpm.service: Main process exited, code=exited, status=78/CONFIG
Aug 18 05:55:50 Toushirou systemd[1]: php7.3-fpm.service: Failed with result 'exit-code'.

#2

Updated by Marc Dequènes 9 months ago

One more:

Aug 18 14:21:45 Elwing systemd[5160]: duck-calibre-server.service: Failed to determine user credentials: No such process
Aug 18 14:21:45 Elwing systemd[5160]: duck-calibre-server.service: Failed at step USER spawning /usr/bin/calibre-server: No such process
Aug 18 14:21:45 Elwing systemd[1]: duck-calibre-server.service: Main process exited, code=exited, status=217/USER
Aug 18 14:21:45 Elwing systemd[1]: duck-calibre-server.service: Failed with result 'exit-code'.

#3

Updated by Marc Dequènes 9 months ago

  • Status changed from New to Resolved
  • Assignee set to Marc Dequènes
  • % Done changed from 0 to 100

There was a fix for duck-calibre-server.service but that was not sufficient; it should be ok now.

I also added a fix in the httpd_php_fpm role.

Since we rebooted all machines today, that should be all.

#4

Updated by Pierre-Louis Bonicoli 3 months ago

This issue occurred yesterday on Toushirou:

1. The slapd package was updated by unattended-upgrade (from /var/log/unattended-upgrades/unattended-upgrades-dpkg.log)

Log started: 2021-02-04  06:18:23
apt-listchanges: Reading changelogs...
Preconfiguring packages ...
[...]
Unpacking slapd (2.4.47+dfsg-3+deb10u5) over (2.4.47+dfsg-3+deb10u4) ...
[...]
Setting up slapd (2.4.47+dfsg-3+deb10u5) ...
  Backing up /etc/ldap/slapd.d in /var/backups/slapd-2.4.47+dfsg-3+deb10u4... done.
Processing triggers for systemd (241-7~deb10u5) ...
Processing triggers for man-db (2.8.5-2) ...
Processing triggers for libc-bin (2.28-10) ...
[...]
Restarting services...
 systemctl restart apache2.service clamav-freshclam.service dovecot.service matrix-synapse.service nslcd.service php7.3-fpm.service postfix@-.service proftpd.service tt-rss.service zabbix-agent.service
Job for php7.3-fpm.service failed because the control process exited with error code.
See "systemctl status php7.3-fpm.service" and "journalctl -xe" for details.
[...]

2. php-fpm status
journalctl -u php7.3-fpm.service
-- Logs begin at Tue 2021-02-02 18:55:52 CET, end at Thu 2021-02-04 15:27:15 CET. --
Feb 04 06:18:36 Toushirou systemd[1]: Stopping The PHP 7.3 FastCGI Process Manager...
Feb 04 06:18:36 Toushirou systemd[1]: php7.3-fpm.service: Succeeded.
Feb 04 06:18:36 Toushirou systemd[1]: Stopped The PHP 7.3 FastCGI Process Manager.
Feb 04 06:18:36 Toushirou systemd[1]: php7.3-fpm.service: Consumed 13h 12min 20.075s CPU time.
Feb 04 06:18:36 Toushirou systemd[1]: Starting The PHP 7.3 FastCGI Process Manager...
Feb 04 06:18:37 Toushirou php-fpm7.3[12259]: [04-Feb-2021 06:18:37] ERROR: [pool albums.georgesleyeti.fr] cannot get uid for user 'georgesleyeti'
Feb 04 06:18:37 Toushirou php-fpm7.3[12259]: [04-Feb-2021 06:18:37] ERROR: FPM initialization failed
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Main process exited, code=exited, status=78/CONFIG
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Failed with result 'exit-code'.
Feb 04 06:18:37 Toushirou systemd[1]: Failed to start The PHP 7.3 FastCGI Process Manager.
Feb 04 06:18:37 Toushirou systemd[1]: php7.3-fpm.service: Consumed 62ms CPU time.

3. slapd logs
journalctl -u slapd.service 
-- Logs begin at Tue 2021-02-02 18:55:52 CET, end at Thu 2021-02-04 15:55:26 CET. --
Feb 04 06:18:30 Toushirou systemd[1]: Stopping LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Feb 04 06:18:30 Toushirou slapd[841]: daemon: shutdown requested and initiated.
Feb 04 06:18:30 Toushirou slapd[841]: slapd shutdown: waiting for 2 operations/tasks to finish
Feb 04 06:18:30 Toushirou slapd[841]: DIGEST-MD5 common mech free
Feb 04 06:18:30 Toushirou slapd[841]: DIGEST-MD5 common mech free
Feb 04 06:18:30 Toushirou slapd[841]: slapd stopped.
Feb 04 06:18:30 Toushirou slapd[11005]: Stopping OpenLDAP: slapd.
Feb 04 06:18:30 Toushirou systemd[1]: slapd.service: Succeeded.
Feb 04 06:18:30 Toushirou systemd[1]: Stopped LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Feb 04 06:18:30 Toushirou systemd[1]: slapd.service: Consumed 4h 34min 19.908s CPU time.
Feb 04 06:18:30 Toushirou systemd[1]: Starting LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Feb 04 06:18:30 Toushirou slapd[11016]: @(#) $OpenLDAP: slapd  (Jan 22 2021 03:54:40) $
                                                Debian OpenLDAP Maintainers <pkg-openldap-devel@lists.alioth.debian.org>
Feb 04 06:18:30 Toushirou slapd[11017]: slapd starting
Feb 04 06:18:30 Toushirou slapd[11011]: Starting OpenLDAP: slapd.
Feb 04 06:18:30 Toushirou systemd[1]: Started LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Feb 04 06:18:30 Toushirou slapd[11017]: do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:18:30 Toushirou slapd[11017]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrep2: rid=002 (-1) Can't contact LDAP server
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrep2: rid=004 (-1) Can't contact LDAP server
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrepl: rid=002 rc -1 retrying (2 retries left)
Feb 04 06:31:28 Toushirou slapd[11017]: do_syncrepl: rid=004 rc -1 retrying (2 retries left)
Feb 04 06:31:38 Toushirou slapd[11017]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Feb 04 06:31:38 Toushirou slapd[11017]: do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

Marc Dequènes, shouldn't this issue be reopened?

#5

Updated by Pierre-Louis Bonicoli 3 months ago

Some workarounds:
  • add an ExecStartPre condition to the php-fpm unit
  • use a cache (for example sssd)
  • patch OpenLDAP in order to call sd_notify
  • migrate to 389-ds, which uses sd_notify
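
The first workaround could look roughly like the sketch below: a small script that a unit runs before its main process and that polls until a given account resolves. The function name `wait_for_user` and the default timeout of 30 seconds are assumptions made for illustration; nothing like it is confirmed to exist in our roles.

```sh
#!/bin/sh
# wait_for_user USER [TIMEOUT_SECONDS]
# Poll once per second until USER can be resolved (id goes through NSS,
# hence through LDAP for our accounts), or give up after the timeout.
wait_for_user() {
    user="$1"
    remaining="${2:-30}"   # default timeout is an arbitrary choice
    while ! id -u "$user" >/dev/null 2>&1; do
        if [ "$remaining" -le 0 ]; then
            echo "user '$user' still not resolvable, giving up" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

Installed as a script (for example under a hypothetical /usr/local/bin/wait-for-user that ends with `wait_for_user "$@"`), it could be referenced from a drop-in with `ExecStartPre=/usr/local/bin/wait-for-user georgesleyeti 30`. It only protects the unit it is attached to.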

#6

Updated by Marc Dequènes about 2 months ago

  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 80

#7

Updated by Marc Dequènes 22 days ago

I think changing our LDAP server must be part of a larger improvement project, such as switching to FreeIPA. At the moment I have no time for this.

The patch is broken according to upstream, and I fear it would become difficult to maintain anyway.

I think having a cache would benefit all processes on the filesystem and also the shell server, which does network calls for each fs access, so I think it's a very good idea. That does not mean that getting the situation fixed in the LDAP server is not something we should follow. I really prefer that to the ExecStartPre workaround, which is limited to one service.

#8

Updated by Marc Dequènes 19 days ago

  • Category set to Service :: IS / AAA / PKI
  • % Done changed from 80 to 90

I deployed sssd; it was a real pain in the ass but it is working now.

Notes:

For the last two RH explicitly refused to implement it.

Because of the access filter limitation we cannot replace nslcd and have to use both…

Let's see if all our problems are fixed now.
