
PKI Emergency Mistakes

The fixes that make things worse

Emergency Response · When pain hits: right now · 3 mistakes covered

Production is down. Everyone's watching. You need to fix it NOW. These are the moments when the worst PKI decisions get made - not because people are incompetent, but because pressure overrides process. The "temporary" fixes made at 2am become permanent problems that haunt you for years.

Mistake 14: "Let's just disable certificate validation"

What Happens

A certificate issue is causing application failures. Under pressure to restore service, someone adds verify=False or CURLOPT_SSL_VERIFYPEER = 0 to "temporarily" bypass the problem. Production comes back up. Everyone goes back to bed.

Why It Seems Reasonable

"Production is down. We need to fix it NOW. We'll fix the cert properly tomorrow and remove the bypass. Just need to stop the bleeding."

The Reality

"Later" never comes. verify=False stays in the codebase forever. Nobody remembers why it's there. Nobody wants to touch it because "it might break something." Meanwhile, your application is vulnerable to MITM attacks.

Real-World Consequence

2am outage. Engineer adds verify=False to restore service. Files a ticket to "fix the cert issue properly and remove bypass." Ticket sits in backlog for 18 months. Penetration test finds it. Security incident. Audit finding. Remediation project. All because of one line added under pressure.

The Fix

  • Fix the actual certificate issue even under pressure - it's rarely harder than the bypass
  • If a bypass is absolutely necessary, set a 24-hour automated reminder to remove it
  • Code reviews must block verify=False in any production path
  • Security scanning should flag disabled SSL verification as critical
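The last two bullets can be automated with a simple pattern check in CI. A minimal sketch; the patterns, demo file, and directory name are illustrative, not exhaustive:

```shell
# Demo file so the scan below has something to find; in CI you would
# point the grep at your real source tree instead.
mkdir -p demo-src
printf 'resp = requests.get(url, verify=False)  # TODO remove\n' > demo-src/app.py

# Look for common TLS-verification bypasses: Python requests,
# libcurl, and env-var style overrides.
hits=$(grep -rnE 'verify *= *False|CURLOPT_SSL_VERIFYPEER[^0-9]*0|DISABLE_SSL_VERIFY=true' demo-src/ || true)

if [ -n "$hits" ]; then
    echo "Disabled TLS verification found - treat as a blocking finding:"
    echo "$hits"
fi
```

In a real pipeline the `if` branch would fail the build; a check like this is how "code reviews must block verify=False" survives the next 2am incident.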

Warning Signs

  • verify=False or equivalent in production code
  • "We had to do that to fix something, not sure what" in code comments
  • Environment variables like DISABLE_SSL_VERIFY=true

Mistake 15: "Quick, generate a new cert!"

What Happens

Certificate error appears. Team immediately requests a new certificate from the CA. New certificate arrives. Same error. Request another certificate. Same error. Hours pass. Multiple certificates are issued. The problem was never the certificate.

Why It Seems Reasonable

"The cert is broken. We need a new cert. That's what you do when a cert doesn't work - you get a new one. The CA will fix whatever's wrong."

The Reality

The certificate might not be the problem. "New cert" is a reflex, not a diagnosis. Most certificate errors are caused by missing intermediates, wrong installation, expired root certificates, or name mismatches - none of which a new certificate fixes.

Real-World Consequence

Chain validation fails. Team requests new certificate. Same issue. Second new certificate. Same issue. Three hours and three certificates later, someone finally reads the actual error message: "unable to get local issuer certificate." The problem was a missing intermediate certificate. The original cert was fine.

The Fix

  • Diagnose first - read the actual error message before taking any action
  • Test the existing certificate with openssl or online tools before reissuing
  • Only request a new certificate after confirming the certificate itself is the issue
  • Use the Certificate Error Decoder to understand what the error actually means
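Diagnosing with openssl before reissuing might look like the following. The self-signed certificate is generated only so the commands run anywhere; in a real incident you would point them at your actual certificate, CA bundle, and host (all names here are illustrative):

```shell
# Generate a throwaway self-signed cert so the example is self-contained.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=app.example.com" -keyout key.pem -out cert.pem 2>/dev/null

# 1. What is actually in the cert? Check names and validity dates first.
openssl x509 -in cert.pem -noout -subject -issuer -dates

# 2. Does the chain validate against your CA bundle? If this prints
#    "unable to get local issuer certificate", the problem is a missing
#    intermediate - a freshly issued cert will fail the exact same way.
openssl verify -CAfile cert.pem cert.pem

# 3. What is the server actually presenting? (Needs network; commented out.)
# openssl s_client -connect app.example.com:443 -servername app.example.com </dev/null
```

Only when step 2 fails with an error that implicates the certificate itself (expired, bad signature, wrong name) does requesting a new one make sense.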

Warning Signs

  • Multiple certificate reissues for the same problem
  • Team not reading or understanding the actual error message
  • "The new cert has the same problem" being said more than once

Mistake 16: "We can figure out who owns this later"

What Happens

Certificate expires on an unknown service. Nobody knows who owns it. The cert gets renewed to fix the immediate problem. Ownership investigation is deferred. Ticket closed. Twelve months later, same problem, same "who owns this?" conversation.

Why It Seems Reasonable

"Production is down. We need to fix it now. Figuring out who owns this service is a separate project. We'll create a task to investigate ownership after we stabilize."

The Reality

If you don't figure out ownership during the incident, you never will. The urgency to investigate evaporates the moment production is back up. The "ownership investigation" task sits in backlog until the same orphan certificate causes the same problem next year.

Real-World Consequence

Certificate expires on mystery server. Emergency renewal. "We'll figure out ownership later." Same orphan certificate causes same 2am outage exactly one year later. Same Slack thread: "Does anyone know who owns this?" Same emergency renewal. Same deferred ownership task. The cycle continues indefinitely.

The Fix

  • Make ownership determination mandatory before incident closure
  • No "Owner: Unknown" or "Owner: TBD" allowed in certificate inventory
  • Use the incident as leverage - people respond when production is down
  • If owner truly can't be found, document the service for decommissioning review
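The inventory rule can be enforced mechanically. A sketch against a hypothetical CSV inventory; the file name, columns, and rows are made up for illustration (real inventories live in a CLM tool or CMDB):

```shell
# Hypothetical certificate inventory with one orphan and one "TBD" owner.
cat > inventory.csv <<'EOF'
common_name,owner,expires
app.example.com,platform-team,2026-03-01
legacy.example.com,TBD,2025-11-15
mystery.example.com,,2025-12-02
EOF

# Block incident closure (or fail a nightly job) on any certificate
# whose owner field is blank, "TBD", or "Unknown".
awk -F, 'NR > 1 && ($2 == "" || $2 == "TBD" || $2 == "Unknown") {print "No accountable owner: " $1}' inventory.csv
# -> No accountable owner: legacy.example.com
# -> No accountable owner: mystery.example.com
```

Run as a gate, a check like this turns "Owner: TBD" from a backlog item into something that blocks the ticket from closing.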

Warning Signs

  • Same certificates causing repeat incidents year after year
  • "Owner: TBD" or blank owner fields in certificate inventory
  • "Does anyone know who owns this?" appearing in incident channels repeatedly

Key Takeaway

Emergency response creates the worst PKI decisions because pressure overrides process. The "temporary" bypass, the reflexive reissue, the deferred ownership investigation - these all become permanent problems. The best time to fix them is during the incident, when you have attention and urgency. After the incident, nobody cares until next year's outage.