SPF PermError vs Fail: A Postmortem of a BEC Attack

The alert came from Finance at 9:17 AM. A wire transfer request from our CFO's address, but the CFO was on a plane over the Atlantic. Standard procedure for our team is to check the headers first. What we saw was confounding: the spoofed email had passed our Secure Email Gateway (SEG) with a green checkmark next to the SPF authentication result. For a domain protected by a DMARC policy of `p=reject`, this shouldn't be possible.

This is the story of a successful phishing attack that exploited not a zero-day or a user's mistake, but a subtle and widely misunderstood state in the Sender Policy Framework. Our own approved change request had created the vulnerability two days earlier.

The root cause was an SPF `PermError`, a permanent error state that, unlike a definitive `Fail`, many mail gateways interpret as a neutral signal. It’s a quiet failure mode that can silently dismantle your email authentication defenses.

Anatomy of an Attack: From Benign Change to BEC Payload

Every incident has a patient zero. In this case, it wasn't a person, but a TXT record in our DNS zone. The attack began with a routine request that no one would have flagged as a security risk.

The Marketing Request

On a Tuesday afternoon, our Marketing Ops team filed a ticket. They were onboarding a new email service provider (ESP) for a product newsletter and needed to authorize its sending infrastructure. The request was simple: add `include:new-esp.com` to our corporate SPF record. The change was reviewed for syntax, approved by IT, and deployed via our DNS provider's portal within the hour. The ticket was closed, and everyone moved on.

The Attacker's Opening

Forty-eight hours later, an attacker initiated a Business Email Compromise (BEC) campaign. Using a cloud server with a clean IP, they crafted an email spoofing our CFO's exact `From:` address. The email was simple, urgent, and directed to a specific accounts payable clerk: a request to process a new vendor payment for a rush project. The attacker didn't need to compromise an account; they just needed to exploit how receiving mail servers validate domain identity.

When the attacker's server connected to our SEG, the gateway dutifully initiated the SPF check as specified by RFC 7208. It fetched our domain's TXT record, began parsing the mechanisms, and then abruptly stopped. It hadn't found a definitive `pass` or `fail` to match the sending IP. Instead, it produced a `PermError`.

Received-SPF header excerpt from the incident:

Received-SPF: PermError (example.com: permanent error in processing during lookup of cfo@example.com: exceeded max lookups)

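The SEG's policy engine, like many commercial gateways, is configured to avoid dropping mail on ambiguous signals. A `PermError` means the check is broken, not that the sender is illegitimate. So, it treated the result as `Neutral`. With SPF neutralized and no DKIM signature to evaluate, the gateway had no explicit authentication failure on which to enforce our DMARC policy of `p=reject`. Strictly speaking, DMARC (RFC 7489) passes a message only when SPF or DKIM produces a pass aligned with the `From:` domain, so a strict evaluator would have rejected this one; our gateway's fail-open handling did not. The email was delivered.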

Death by a Thousand Lookups: Why Our SPF Record Broke

The phrase 'exceeded max lookups' in that header was the smoking gun. SPF isn't just a simple string; it's a recursive set of instructions for DNS queries. And it comes with a strict, non-negotiable limit.

RFC 7208, section 4.6.4, limits the number of terms that trigger DNS lookups to 10 per SPF evaluation. The terms that count are the `include`, `a`, `mx`, `ptr`, and `exists` mechanisms and the `redirect` modifier; `ip4`, `ip6`, and `all` cost nothing. The most deceptive of these is `include`. This limit exists for a very good reason: to prevent SPF checks from being used as an amplification vector for Denial of Service (DoS) attacks against DNS infrastructure. If an attacker could craft an SPF record that triggered 100 nested lookups, they could amplify a small email send into a massive DNS query storm.
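To make the accounting concrete, here is a minimal Python sketch that picks out the terms in a single SPF record that draw on the lookup budget. The function name `lookup_terms` and the sample record are illustrative assumptions, and the parsing is deliberately simplified (no macro handling, no recursion into included records).

```python
# Mechanisms that cost one DNS lookup each under RFC 7208 section 4.6.4.
# The "redirect" modifier also counts and is handled separately below.
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def lookup_terms(record: str) -> list[str]:
    """Return the terms of one SPF record that consume the lookup budget."""
    costly = []
    for term in record.split()[1:]:  # skip the leading "v=spf1" version tag
        if term.lower().startswith("redirect="):
            costly.append(term)
            continue
        # Drop an optional qualifier (+, -, ~, ?) and any ":domain" or "/cidr" suffix.
        name = term.lstrip("+-~?").split(":", 1)[0].split("/", 1)[0].lower()
        if name in LOOKUP_MECHANISMS:
            costly.append(term)
    return costly


# Hypothetical record, not the one from the incident.
record = "v=spf1 ip4:192.0.2.0/24 mx a:mail.example.com include:_spf.example.net -all"
print(lookup_terms(record))
# ['mx', 'a:mail.example.com', 'include:_spf.example.net']
```

Note that this only counts the top-level record. Each `include` target has its own terms, which is where the real cost hides.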

The Domino Effect of `include`

The danger lies in nested `include` statements. Each `include` points to another domain's SPF record, which may itself contain more lookups. Your one `include` might actually cost you three or four lookups against your limit. Over time, as a company adds services—Office 365, Salesforce, SendGrid, Zendesk, and now our new ESP—the SPF record becomes bloated. Each one adds to the total count.

Our record, prior to the change, was already at nine lookups. It was a time bomb waiting for a trigger. The `include:new-esp.com` statement from Marketing was that trigger. That single line, when resolved, added three more lookups to our chain. The total count jumped from a precarious 9 to an invalid 12. At that moment, our SPF record became functionally useless, failing with a `PermError` for every single receiving mail server in the world.
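The jump from 9 to 12 can be reproduced with a small recursive counter. The sketch below is illustrative only: every record in `ZONE` is made up (the real vendor records are not reproduced here), the child names under `new-esp.com` are hypothetical, and an in-memory dictionary stands in for live DNS.

```python
# Hypothetical SPF records standing in for live DNS. None of these are the
# real records from the incident; they are shaped so the totals match the story.
ZONE = {
    "example.com": "v=spf1 mx include:suite.example include:crm.example include:helpdesk.example -all",
    "suite.example": "v=spf1 include:a.suite.example include:b.suite.example -all",
    "crm.example": "v=spf1 include:a.crm.example a:mail.crm.example -all",
    "helpdesk.example": "v=spf1 include:a.helpdesk.example -all",
    "new-esp.com": "v=spf1 include:a.new-esp.com include:b.new-esp.com -all",
    "a.suite.example": "v=spf1 ip4:192.0.2.0/25 -all",
    "b.suite.example": "v=spf1 ip4:192.0.2.128/25 -all",
    "a.crm.example": "v=spf1 ip4:198.51.100.0/24 -all",
    "a.helpdesk.example": "v=spf1 ip4:203.0.113.0/25 -all",
    "a.new-esp.com": "v=spf1 ip4:203.0.113.128/26 -all",
    "b.new-esp.com": "v=spf1 ip4:203.0.113.192/26 -all",
}


def count_lookups(domain: str, zone: dict[str, str], stack: tuple = ()) -> int:
    """Count lookup-consuming terms reachable from `domain`, following includes."""
    if domain in stack:  # include loop: stop instead of recursing forever
        return 0
    total = 0
    for term in zone.get(domain, "").split()[1:]:  # skip "v=spf1"
        term = term.lstrip("+-~?").lower()
        if term.startswith("redirect="):
            total += 1 + count_lookups(term.split("=", 1)[1], zone, stack + (domain,))
            continue
        name, _, target = term.partition(":")
        name = name.split("/", 1)[0]
        if name == "include":
            # One lookup for the include itself, plus whatever the target costs.
            total += 1 + count_lookups(target, zone, stack + (domain,))
        elif name in {"a", "mx", "ptr", "exists"}:
            total += 1
    return total


# Apply the Marketing change: append the new include before the terminal -all.
NEW_ZONE = {**ZONE, "example.com": ZONE["example.com"].replace(" -all", " include:new-esp.com -all")}

print(count_lookups("example.com", ZONE))      # 9
print(count_lookups("example.com", NEW_ZONE))  # 12
```

In this made-up zone the new `include` costs one lookup for itself and two more for the records it pulls in, which is how a one-line change consumes three slots.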

`PermError` Isn't `Fail`: A Critical Distinction in Gateway Policy

This is the core of the incident, and it's a distinction lost on many IT and security professionals. The difference between an SPF `PermError` and a `Fail` isn't semantic. It's the difference between a gate that's locked and a gate that's been taken off its hinges.

The Certainty of `Fail` vs. the Ambiguity of `PermError`

A `Fail` result (typically produced when the sending IP falls through to a terminal `-all`) is an explicit, authoritative statement from the domain owner. It tells the receiving server, 'If the sending IP doesn't match any of the preceding mechanisms, this email is not authorized.' A DMARC policy of `p=quarantine` or `p=reject` relies on this strong signal to trigger its action. It's a clear binary outcome.

A `PermError`, however, is a procedural failure. The checker is telling the gateway, 'I couldn't complete the evaluation because the domain's SPF record is broken or too complex.' RFC 7208 does not mandate a disposition for this result. It leaves the choice to the receiver and discusses both rejecting the message and delivering it with an annotation. In practice, many SEGs default to the lenient side and treat a `PermError` like `None` or `Neutral`. Why? Because blocking legitimate mail is often seen as a bigger business risk than letting a potential phish through. If a major vendor like Salesforce breaks their SPF record, a gateway that blocks based on that `PermError` could bring a customer's business to a halt. So, they fail open.

The attacker doesn't need to find a flaw in your gateway's logic. They just need to find a flaw in your DNS record and rely on the gateway's risk-averse, standards-compliant behavior. Your `p=reject` policy is rendered inert because it never receives the `fail` signal it needs to activate.
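The gap between the two behaviors can be sketched in a few lines of Python. DMARC itself (RFC 7489) passes a message only when SPF or DKIM yields a pass that is aligned with the `From:` domain; the `lenient_disposition` function below is an assumption about how a fail-open policy engine might behave, not a reconstruction of any vendor's logic, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthResult:
    spf: str            # e.g. "pass", "fail", "softfail", "neutral", "none", "permerror"
    spf_aligned: bool   # does the SPF-authenticated domain align with From:?
    dkim: str           # e.g. "pass", "fail", "none"
    dkim_aligned: bool  # does the DKIM signing domain align with From:?


def dmarc_passes(r: AuthResult) -> bool:
    """DMARC needs at least one aligned pass from SPF or DKIM."""
    return (r.spf == "pass" and r.spf_aligned) or (r.dkim == "pass" and r.dkim_aligned)


def strict_disposition(r: AuthResult, policy: str = "reject") -> str:
    """Fail closed: anything short of an aligned pass gets the domain's policy."""
    return "deliver" if dmarc_passes(r) else policy


def lenient_disposition(r: AuthResult, policy: str = "reject") -> str:
    """Fail open (assumed behavior): act only on an explicit fail result."""
    if dmarc_passes(r):
        return "deliver"
    explicit_fail = r.spf == "fail" or r.dkim == "fail"
    return policy if explicit_fail else "deliver"


# The spoofed message: SPF evaluation errored out and there was no DKIM signature.
spoof = AuthResult(spf="permerror", spf_aligned=False, dkim="none", dkim_aligned=False)
print(strict_disposition(spoof))   # reject
print(lenient_disposition(spoof))  # deliver
```

The same message gets opposite outcomes depending only on how the engine classifies "no usable result", which is why it is worth checking how your own gateway is configured to treat `PermError`.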

Reconstructing the Past: Proving the SPF State at Time-of-Attack

During an incident, proving the root cause is everything. We couldn't just look at our DNS record—we had already fixed it by removing the offending `include` to stop the bleeding. We had to prove what the record looked like two days prior. This is where DNS forensics comes in.

Passive DNS and Change Logs

Our first step was using passive DNS (pDNS) repositories. These services record the DNS responses seen by sensors on recursive resolvers and keep them as historical data. By querying our domain's TXT record history in a threat intelligence platform, we could see a timeline of the record values that had been observed, each with first-seen and last-seen timestamps. The data clearly showed the transition from the old record to the new, 12-lookup record on Tuesday, matching the timestamp of the marketing request.

This pDNS data was then correlated with our internal change management system. The Jira ticket, CR-2024-8119, contained the approval from IT and the confirmation from the engineer who made the change. By cross-referencing the external pDNS evidence with our internal, timestamped audit trail, we built an irrefutable timeline. We could demonstrate exactly when the vulnerability was introduced and prove it was the direct cause of the `PermError` observed during the attack.
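The correlation step can be sketched as follows. Everything in this example is a placeholder: the field names (`rdata`, `first_seen`, `last_seen`, `deployed_at`), the record strings, and all timestamps are hypothetical and do not come from the incident data; only the ticket ID is taken from the text above. Real pDNS providers each have their own export format.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Hypothetical pDNS observations of the domain's SPF TXT record (placeholder values).
OBSERVATIONS = [
    {"rdata": "v=spf1 include:old-set.example -all",
     "first_seen": datetime(2023, 6, 1, 0, 0, tzinfo=UTC),
     "last_seen": datetime(2024, 3, 5, 14, 0, tzinfo=UTC)},
    {"rdata": "v=spf1 include:old-set.example include:new-esp.com -all",
     "first_seen": datetime(2024, 3, 5, 15, 10, tzinfo=UTC),
     "last_seen": datetime(2024, 3, 7, 11, 0, tzinfo=UTC)},
]

# Hypothetical change-log export; the deployment time is a placeholder.
CHANGES = [{"ticket": "CR-2024-8119",
            "deployed_at": datetime(2024, 3, 5, 14, 45, tzinfo=UTC)}]

ATTACK = datetime(2024, 3, 7, 9, 17, tzinfo=UTC)  # placeholder attack time


def record_at(observations, moment):
    """Return the rdata whose observation window covers `moment`, or None."""
    for obs in observations:
        if obs["first_seen"] <= moment <= obs["last_seen"]:
            return obs["rdata"]
    return None


def changes_explaining(observation, changes, tolerance):
    """Changes deployed at most `tolerance` before pDNS first saw the record."""
    return [c for c in changes
            if timedelta(0) <= observation["first_seen"] - c["deployed_at"] <= tolerance]


print(record_at(OBSERVATIONS, ATTACK))
matches = changes_explaining(OBSERVATIONS[1], CHANGES, timedelta(hours=2))
print([c["ticket"] for c in matches])  # ['CR-2024-8119']
```

The tolerance window exists because pDNS sensors only see a new record after caches expire and someone queries it, so the first observation usually trails the deployment by some amount.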

Shifting Left: Building Guardrails for DNS Changes

A postmortem that ends with 'we fixed the record and told people to be more careful' is a failure. The real lesson is to make this specific failure impossible to repeat. The solution is to treat your DNS zone file like application code and apply the same automated testing discipline.

We immediately began scripting a pre-commit check for our DNS infrastructure-as-code repository. Most organizations manage DNS changes manually through a web portal, but the principle can be adapted. The goal is to create an automated gate that validates any proposed SPF change before it can be deployed.

Automated SPF Validation in CI/CD

Our new workflow, integrated into our GitHub Actions pipeline, performs two critical checks on every pull request that modifies a TXT record:

First, it checks for basic syntax validity. Is the string a well-formed SPF record? Second, and most importantly, it performs a full recursive lookup simulation. The script parses the proposed record, resolves every `a`, `mx`, and `include` mechanism recursively, and counts the total number of DNS-querying terms. If the final count exceeds 10, or if any part of the record is syntactically invalid, the action fails. It blocks the merge, posts a comment explaining the failure, and prevents the broken record from ever reaching production.
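A gate along these lines might look like the sketch below. It is not the script from our pipeline: the function names, the simplified term grammar, and the `PUBLISHED` dictionary that stands in for live DNS are all assumptions made for illustration. A production check would resolve TXT records through a real DNS library, handle macros, and treat a missing include target as an error in its own right.

```python
import re

LOOKUP_LIMIT = 10

# Simplified grammar: known mechanisms with an optional qualifier, or name=value modifiers.
MECH = re.compile(
    r"^[+\-~?]?(all|include:\S+|exists:\S+|ptr(:\S+)?|(a|mx)([:/]\S+)?|ip4:\S+|ip6:\S+)$",
    re.IGNORECASE)
MOD = re.compile(r"^[a-z][a-z0-9_.-]*=\S*$", re.IGNORECASE)


def syntax_errors(record):
    """Return a list of problems with the record's basic shape."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return ["record must start with v=spf1"]
    return [f"unrecognized term: {t}" for t in terms[1:]
            if not (MECH.match(t) or MOD.match(t))]


def count_lookups(domain, record, fetch_spf, stack=()):
    """Count lookup-consuming terms, recursing through include and redirect."""
    path = stack + (domain,)
    total = 0
    for term in record.split()[1:]:
        term = term.lstrip("+-~?").lower()
        if term.startswith("redirect="):
            name, target = "redirect", term[len("redirect="):]
        else:
            name, _, target = term.partition(":")
            name = name.split("/", 1)[0]
        if name in ("include", "redirect"):
            total += 1
            if target not in path:  # do not follow include loops
                child = fetch_spf(target) or ""  # missing record counted as empty here
                total += count_lookups(target, child, fetch_spf, path)
        elif name in ("a", "mx", "ptr", "exists"):
            total += 1
    return total


def check(domain, proposed, fetch_spf):
    """Return a list of reasons to block the change; empty means it may merge."""
    problems = syntax_errors(proposed)
    if not problems:
        n = count_lookups(domain, proposed, fetch_spf)
        if n > LOOKUP_LIMIT:
            problems.append(f"{n} DNS lookups exceeds the limit of {LOOKUP_LIMIT}")
    return problems


# Hypothetical vendor records standing in for live DNS.
PUBLISHED = {
    "suite.example": "v=spf1 include:a.suite.example include:b.suite.example include:c.suite.example -all",
    "crm.example": "v=spf1 a mx include:a.crm.example -all",
    "new-esp.com": "v=spf1 include:a.new-esp.com include:b.new-esp.com -all",
}
OLD = "v=spf1 mx a include:suite.example include:crm.example -all"
NEW = "v=spf1 mx a include:suite.example include:crm.example include:new-esp.com -all"

problems = check("example.com", NEW, PUBLISHED.get)
for p in problems:
    print("SPF check failed:", p)
exit_code = 1 if problems else 0  # a CI job would pass this to sys.exit()
```

Passing the resolver in as `fetch_spf` keeps the check testable without network access; in a pipeline the same function can be pointed at real DNS, and the non-zero exit code is what blocks the merge.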

This 'shifts left' the discovery of the problem. Instead of being found by an attacker and our SOC team days later, the lookup limit violation is caught and explained to the developer or operations engineer within seconds of their commit.

No Record is 'Set and Forget'

An SPF record isn't static infrastructure; it's a dynamic security policy that represents your brand's trust on the internet. It evolves with every new third-party service you authorize to send email. Each addition carries the risk of silent failure, either by breaking syntax or, as we saw, by exceeding a hard-coded protocol limit.

For years, our record was fine. Then one small, well-intentioned change was enough to quietly invalidate our entire SPF posture, giving a free pass to an attacker who understood the RFCs better than we did. Continuous monitoring and pre-deployment validation are the only ways to manage this risk effectively.

The takeaway

Your SPF record is only as strong as the sum of its parts. The `include` mechanism is powerful, but it's a blank check written against your 10-lookup budget, and you have no control over how many lookups your vendor decides to use in their own record tomorrow.

While integrating SPF validation into a CI/CD pipeline is the most durable solution, it's not the only one. Regular, automated audits are critical for catching configuration drift before it becomes an incident. Tools that can analyze and visualize the entire SPF lookup chain, like the free validator in MailSleuth.AI, can turn this invisible threat into a clear, actionable metric. How many lookups is your domain hiding?
