Let's Encrypt Renewal Triggers Apache Crash

This issue is summarized here: https://forum.virtualmin.com/t/apache-crash-during-lets-encrypt-renewal/...

Essentially, LE requests and receives approval for a cert. Apache then gets a restart request, but at that point the cert file has been written while the key file has not yet finished writing to the filesystem. httpd detects the cert/key mismatch and the entire server goes down.

14:45:41.573 - LE reports that it is done
14:45:42.307 - Graceful requested
14:45:42.631 - ssl.cert written (ssl.key not yet modified)
14:45:43.047 - apache is reading configs, ssl mismatch, crash
14:45:43.619 - ssl.key written
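
One way to avoid the half-written state in the timeline above is to write each file under a temporary name in the same directory, rename it into place, and request the reload only after both renames have finished. This is a minimal sketch under that assumption, not what Virtualmin or certbot actually do; `write_atomically`, `install_pair` and the `reload_server` callback are hypothetical names.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600; adjust permissions if needed.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def install_pair(cert_path, cert_pem, key_path, key_pem, reload_server):
    """Install the key and the cert, and only then ask the server to reload."""
    write_atomically(key_path, key_pem)
    write_atomically(cert_path, cert_pem)
    reload_server()
```

A reload that some other process requests between the two renames could still see a new key next to an old cert, so this narrows the window without closing it.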

Is it possible to
- specify a slightly longer delay in requesting the restart of httpd, or
- specify the daily schedule that LE checks domains for renewal so that if the issue persists, it manifests during non-peak, non-business hours?
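
As an alternative to a fixed delay, the restart could be held back until the cert and key files have stopped changing. The sketch below is only an illustration of that idea: `wait_until_stable` and its parameters are hypothetical, and nothing in this thread says Virtualmin offers such a setting.

```python
import os
import time


def wait_until_stable(paths, quiet_seconds=2.0, timeout=30.0, poll=0.25):
    """Return True once every path exists, is non-empty, and has not been
    modified for quiet_seconds; return False if timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        now = time.time()
        stable = True
        for path in paths:
            try:
                info = os.stat(path)
            except FileNotFoundError:
                stable = False
                break
            if info.st_size == 0 or now - info.st_mtime < quiet_seconds:
                stable = False
                break
        if stable:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A caller would request the graceful restart only when this returns True for both ssl.cert and ssl.key, and skip or retry otherwise.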

Status: Active

Comments

Submitted by Ilia on Sat, 07/04/2020 - 09:35

Hi,

Thanks for the heads up.

I remember reading about this on the ACME Tiny issue tracker on GitHub, where someone proposed solving it with an artificial delay.

Honestly, I don't understand how that would be possible, since when we run the script we neither cache it nor run it in the background (as a sub-process).

Moreover, I have never been able to reproduce this issue, nor have I encountered it myself.

If you have reliable steps to reproduce it, please share them with us, and it can be easily fixed.

I would recommend using the official certbot client for requesting SSL certificates.

Submitted by Ilia on Sat, 07/04/2020 - 09:36

Notice: I marked this issue as non-private and cross-referenced it to your public post on forums.

The only way I can see this happening is if Apache was restarted at around the same time for some other reason.

Would it be possible to not have LE run during business hours?

Submitted by Ilia on Sat, 07/11/2020 - 04:47

It's neither normal nor expected, as all renewals on our side and on all of our servers work without an issue.

It would be interesting to see the errors from the global Apache log when this happens.

Would it be possible to not have LE run during business hours?

I think the easiest way to achieve your goal, though with only rough success, is to go to Server Configuration/SSL Certificate/Let's Encrypt and simply update the renewal only, at a very early hour, say 4-5 am. Presumably, that might do the trick.

Sorry to resurrect this so late. I can show from the global Apache logs where this is happening. We had the issue where the R3 certs wouldn't auto-renew and patched that; as soon as the renewal ran, httpd crashed. It requested two graceful restarts 30 seconds apart:

[Thu Apr 08 15:02:06.557399 2021] [mpm_prefork:notice] [pid 13568] AH00171: Graceful restart requested, doing restart

[Thu Apr 08 15:02:36.174209 2021] [mpm_prefork:notice] [pid 13568] AH00171: Graceful restart requested, doing restart

The graceful at 15:02:06 tried to renew a cert, and while that was in progress another graceful was triggered, causing the crash (because the ssl.cert file wasn't written yet). Here are the logs from the 15:02:36 graceful:

[Thu Apr 08 15:02:36.174209 2021] [mpm_prefork:notice] [pid 13568] AH00171: Graceful restart requested, doing restart

[Thu Apr 08 15:02:36.223075 2021] [fcgid:emerg] [pid 805] mod_fcgid: server is restarted, pid 805 must exit

[Thu Apr 08 15:02:36.225035 2021] [fcgid:emerg] [pid 805] (22)Invalid argument: mod_fcgid: can't lock process table in PM, pid 805

AH00526: Syntax error on line 4986 of /etc/httpd/conf/httpd.conf: SSLCertificateFile: file '/home/*****/domains/*****/ssl.cert' does not exist or is empty

^^ That file was not done writing until 15:02:37

Context: system_u:object_r:home_root_t:s0

Access: 2021-04-08 15:02:39.332579096 -0400

Modify: 2021-04-08 15:02:37.016591340 -0400

Change: 2021-04-08 15:02:37.370589469 -0400
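
The AH00526 error above could be caught before the graceful is requested by checking that every certificate and key file named in the config exists and is non-empty. Apache's own `apachectl configtest` performs the full syntax check; the sketch below only mirrors this one condition. `missing_or_empty_ssl_files` is a hypothetical helper, and it assumes paths without embedded spaces.

```python
import os
import re

# Matches the two directives of interest, with or without quotes around the path.
DIRECTIVE = re.compile(
    r"^\s*(SSLCertificateFile|SSLCertificateKeyFile)\s+\"?([^\"\s]+)\"?",
    re.IGNORECASE,
)


def missing_or_empty_ssl_files(config_text):
    """Return (line_number, directive, path) for each referenced file that
    does not exist or has zero length."""
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if not match:
            continue  # comments and unrelated directives are skipped
        directive, path = match.group(1), match.group(2)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append((number, directive, path))
    return problems
```

A renewal job could skip or postpone the graceful whenever this list is non-empty, rather than letting httpd exit on the bad config.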

Finally, here's the first line of the LE log, which shows that the cert renewal was attempted at the 15:02:06 graceful:

2021-04-08 15:02:07,658:DEBUG:certbot._internal.main:certbot version: 1.11.0

It's problematic because dozens of our clients use Pingdom and get alerted right away. We just need a way to set a manual delay between renewals, or a hook that lets us define a custom action after a successful cert installation.
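
If such a hook existed, one thing it could do is serialize the work so that a second renewal or restart waits until the first has finished writing its files. The following is only a sketch of that idea using an advisory file lock. `exclusive_lock`, `renew_then_reload` and the lock path are hypothetical, and this thread does not establish whether Virtualmin offers a place to plug this in.

```python
import contextlib
import fcntl


@contextlib.contextmanager
def exclusive_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on lock_path for the duration of the block.

    With blocking=False, BlockingIOError is raised if the lock is already held.
    """
    handle = open(lock_path, "a")
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(handle, flags)
        yield handle
    finally:
        handle.close()  # closing the descriptor releases the lock


def renew_then_reload(lock_path, renew, reload_server):
    """Run the renewal and the reload as one unit that cannot overlap another."""
    with exclusive_lock(lock_path):
        renew()
        reload_server()
```

This only helps if every process that writes the cert files or requests a graceful takes the same lock; a restart triggered from elsewhere would bypass it.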

Does that /home/*****/domains/*****/ssl.cert file from the error message actually exist?

It's unclear if it's just disappearing temporarily during the renewal, or if there's an invalid reference to it in the Apache config.

Submitted by tpnsolutions on Thu, 04/08/2021 - 19:34

Thought I'd chime in on this topic as I recently had to address a ton of issues pertaining to LE. Not sure what officially caused it, but here's what I discovered.

  1. Orphaned LE configurations, which were causing LE to attempt renewals on non-existent domains
  2. SSL wasn't enabled on a domain, but LE was still trying to fetch an SSL cert for the domain

In the end I had to manually remove the certs from LE via the "certbot" CLI, then "enable" and "disable" SSL for each domain, then go to the SSL Certificate page (which seems to appear even when SSL isn't enabled) and finally click "Delete Certificate" on that page.

Once I did this for all affected domains, things went back to normal.

I would be happy to offer up some time to see if this is actually what is causing their issues. It had me really stumped for a while, and I was doing a lot of extra work to keep things running smoothly, all of which was eliminated once the fix was applied.

For the issue of orphaned Let's Encrypt configs, make sure you have automatic renewal turned off in certbot - it's not needed with Virtualmin, as we handle the renewal scheduling.