A Guide to a Resilient MX Record Backup Configuration

It's 3 AM, and an alert jolts you awake. The primary Secure Email Gateway (SEG) vendor is reporting a widespread outage. You feel a brief moment of calm, remembering the backup MX record you configured years ago. Mail should just fail over to your secondary mail exchanger, right? It will, and that is exactly the problem. By letting mail take that path, you've just unknowingly disabled your most critical security controls.

The concept of a 'backup MX' is a dangerous misnomer. Sending Mail Transfer Agents (MTAs) don't see 'primary' and 'backup'; they see a prioritized list of hosts. As defined in RFC 5321, they try the lowest preference number first. If that host is unreachable, they move to the next lowest, and so on. They will gleefully deliver your sensitive mail to a server with a higher preference number, regardless of its security posture.
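
The selection logic is simple enough to sketch in a few lines of Python. This is an illustration of the ordering rule only, not any particular MTA's implementation, and the hostnames are hypothetical. Note that nothing in it knows or cares which host does the filtering.

```python
import random
from itertools import groupby


def delivery_order(mx_records, rng=random):
    """Return hosts in the order a sending MTA would typically try them.

    mx_records: iterable of (preference, hostname) tuples.
    Lower preference numbers are tried first. Hosts that share a
    preference are shuffled, which RFC 5321 recommends to spread load.
    """
    ordered = []
    by_pref = sorted(mx_records, key=lambda rec: rec[0])
    for _, group in groupby(by_pref, key=lambda rec: rec[0]):
        hosts = [host for _, host in group]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered


# Hypothetical records: a filtering gateway plus an unfiltered "backup".
records = [
    (100, "backup-relay.example.net."),
    (10, "gateway.example.com."),
]
print(delivery_order(records))
# ['gateway.example.com.', 'backup-relay.example.net.']
```

If the first host in that list refuses connections, the sender simply moves on to the second one.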

This isn't a plan for resilience. It's a blueprint for a breach. Real resilience requires a fundamental shift in thinking: from a passive, cold backup to an active, redundant system that maintains your security posture at all times.

Why Your 'Backup MX' Is an Active Point of Failure

Most so-called backup MX records point to a cheap relay service or directly to a cloud mailbox provider's endpoint. This server is rarely, if ever, configured with the same level of scrutiny as the primary SEG. It's a 'cold' site, unpatched, unmonitored, and missing the expensive licenses for sandboxing, URL detonation, and advanced threat intelligence that you pay for on your primary gateway.

When your primary SEG goes down, all inbound mail traffic immediately diverts to this weaker link. The consequences are severe and immediate. Phishing campaigns that your primary gateway would have blocked now sail directly into user inboxes. A Business Email Compromise (BEC) attempt from a lookalike domain, normally flagged by your SEG's header analysis and reputation checks, now looks perfectly legitimate.

Authentication Breakdowns and Reputational Damage

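The problems extend beyond simple filtering gaps. Email authentication protocols are notoriously brittle, and an extra relay in the inbound path can break the chain of trust. Take SPF (RFC 7208): it validates the connecting IP address against the sending domain's record. When your backup MX relays a message onward, the final mail server sees the relay's IP, which will not appear in the sender's SPF record. Unless that server is configured to treat the relay as a trusted hop and evaluate the original connecting IP instead, mail flowing through the backup will fail SPF checks downstream.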

Worse, some simple relays or misconfigured MTAs make minor modifications to the message body or headers, such as adding a footer, which invalidates the DKIM signature (RFC 6376). DMARC (RFC 7489) needs only one of SPF or DKIM to pass in alignment with the From domain, but a relayed and modified message can fail both, and at that point the sender's DMARC policy produces a `fail` verdict. Legitimate emails to your own organization suddenly start getting rejected or quarantined, leading to operational chaos. The system designed for continuity becomes the source of the outage.
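
The DMARC decision itself can be written down compactly. Under RFC 7489 a message passes when at least one mechanism, SPF or DKIM, both passes and aligns with the From domain. The sketch below encodes only that rule; the `AuthResult` type and the two scenarios are illustrative assumptions, and real evaluators also handle policy lookup and relaxed versus strict alignment.

```python
from dataclasses import dataclass


@dataclass
class AuthResult:
    passed: bool   # did the mechanism itself validate?
    aligned: bool  # does its domain align with the From: domain?


def dmarc_verdict(spf: AuthResult, dkim: AuthResult) -> str:
    """Return 'pass' if either mechanism both passes and aligns."""
    if (spf.passed and spf.aligned) or (dkim.passed and dkim.aligned):
        return "pass"
    return "fail"


# A relay changes the connecting IP (SPF fails) but leaves the body intact.
print(dmarc_verdict(AuthResult(False, True), AuthResult(True, True)))   # pass

# The relay also appends a footer, so the DKIM signature no longer verifies.
print(dmarc_verdict(AuthResult(False, True), AuthResult(False, True)))  # fail
```

The second case is the one a careless relay creates: it removes both of the signals DMARC could have relied on.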

Case Study: A Cloud Gateway Outage and the Bypass Effect

Imagine a typical setup: a third-party SEG sits in front of Microsoft 365. The MX records are weighted with a preference of 10 for the SEG, and a 'just-in-case' record with preference 100 points directly to the `your-domain.mail.protection.outlook.com` endpoint.

One Tuesday morning, the SEG provider suffers a major incident. Their ingestion endpoints stop responding. Immediately, sending MTAs around the world see the failure and pivot to the priority 100 record. For the next six hours, 100% of the company's inbound mail flows directly to Microsoft 365, completely bypassing the specialized security layers of the SEG.

The security team is now flying blind. All their custom rules, threat intel feeds, and URL rewriting policies are useless. Their incident response playbooks, which rely on logs from the SEG platform, are irrelevant. Sophisticated malware, which the SEG's sandbox would have caught, is now delivered. The Exchange Online Protection (EOP) layer catches the obvious threats, but it's not tuned for the same targeted attacks the premium SEG was designed to stop. The 'backup' has become a gaping hole in the perimeter.

Pattern 1: Geographic Redundancy with a Single Provider

A better, though still limited, approach is to rely on the built-in geographic redundancy of a major cloud provider. If your mail is hosted in Microsoft 365 or Google Workspace, you are already benefiting from this to some extent.

When you point your MX record to an endpoint like `aspmx.l.google.com` or `contoso-com.mail.protection.outlook.com`, you aren't pointing to a single server. You're pointing to a globally distributed pool of servers that sits behind the provider's own DNS-based load balancing and traffic routing. A sender in Europe will typically be directed to a nearby datacenter such as Dublin or Amsterdam, while a sender in Asia connects to one in Singapore. An outage in a single datacenter won't stop mail flow, because the name stops resolving to the failed site and senders connect to another healthy entry point.

The Vendor Lock-in Risk

This pattern provides resilience against localized hardware or network failures. It's a massive improvement over a single on-premises MTA. However, it does nothing to protect you from a global service degradation, a misconfiguration pushed to all instances by the vendor, or a security vulnerability affecting the platform's core software.

You are entirely dependent on that one vendor's operational security and competence. If their entire mail service suffers a logical failure, your mail flow stops. Dead. There is no failover. For many organizations, this level of risk is acceptable. For those in regulated industries or with zero tolerance for downtime, it's not enough.

Pattern 2: True Redundancy Across Two Active Services

This is the gold standard. True resilience means using two distinct, fully-featured, and actively maintained email security gateways from different vendors. This is not a 'hot-cold' setup; it's 'hot-hot' or, at minimum, 'hot-warm'.

In this architecture, you configure your MX records with two tiers of priority. For instance, you might have Mimecast as your primary and Proofpoint as your secondary. The configuration would look something like this:

yourdomain.com. 3600 IN MX 10 us-mx1.mimecast.com.
yourdomain.com. 3600 IN MX 10 us-mx2.mimecast.com.
yourdomain.com. 3600 IN MX 20 mx1-us1.proofpoint.com.
yourdomain.com. 3600 IN MX 20 mx2-us1.proofpoint.com.

Example multi-vendor MX record configuration

The key is that the Proofpoint service (priority 20) is not a dumb forwarder. It's a fully configured instance with its own set of policies, anti-malware scanning, and filtering rules that mirror the primary Mimecast service. When Mimecast's endpoints fail, mail flows to Proofpoint, which applies a nearly identical level of scrutiny before delivering to the final mailbox. Your security posture is maintained during the outage.

This approach also requires careful configuration of protocols like MTA-STS (RFC 8461): the `mx` entries in the policy file must cover the MX hosts of *both* vendors. If they don't, and the policy is in `enforce` mode, compliant senders will refuse to deliver mail to the uncovered hosts during a failover event.
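
A quick way to catch that mistake is to compare the published MX hosts against the policy text. The sketch below assumes you have already fetched the policy body (it is served over HTTPS from the `mta-sts` subdomain's well-known path). The function names and the sample policy are hypothetical, and the wildcard handling follows the RFC 8461 rule that `*.` stands for exactly one left-most label.

```python
def parse_mx_patterns(policy_text):
    """Collect the value of every 'mx:' line in an MTA-STS policy body."""
    patterns = []
    for line in policy_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "mx":
            patterns.append(value.strip().lower().rstrip("."))
    return patterns


def host_allowed(host, patterns):
    """True if the MX hostname matches at least one policy pattern."""
    host = host.lower().rstrip(".")
    for pattern in patterns:
        if pattern.startswith("*."):
            # The wildcard replaces exactly one left-most label.
            _, _, parent = host.partition(".")
            if parent == pattern[2:]:
                return True
        elif host == pattern:
            return True
    return False


def uncovered_hosts(mx_hosts, policy_text):
    """Return the MX hosts that the policy would not permit."""
    patterns = parse_mx_patterns(policy_text)
    return [h for h in mx_hosts if not host_allowed(h, patterns)]


# Hypothetical policy that was never updated for the second vendor.
policy = """version: STSv1
mode: enforce
mx: *.mimecast.com
max_age: 604800
"""
print(uncovered_hosts(["us-mx1.mimecast.com", "mx1-us1.proofpoint.com"], policy))
# ['mx1-us1.proofpoint.com']
```

Any host in the returned list is one that MTA-STS-aware senders would skip while the policy is enforced.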

Don't Guess, Test: How to Validate Your Failover

You cannot assume your failover configuration works. You must test it. Hope is not a strategy. The good news is that you can simulate a failover without actually taking your primary gateway offline.

First, verify your MX records are published correctly using `dig` or `nslookup`. Look for the different preference numbers.
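
When a domain has more than a couple of records, it helps to group the answer by tier. The sketch below parses text in the two-column form that `dig +short MX yourdomain.com` prints (preference, then hostname). The sample input reuses the illustrative records from earlier in this article rather than real query output, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_preference(dig_short_output):
    """Group 'preference hostname' lines into {preference: [hosts]}."""
    tiers = defaultdict(list)
    for line in dig_short_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # ignore blank lines and anything unexpected
        tiers[int(parts[0])].append(parts[1].rstrip(".").lower())
    # Sort tiers by preference so the first entry is tried first.
    return {pref: sorted(hosts) for pref, hosts in sorted(tiers.items())}


sample = """\
20 mx1-us1.proofpoint.com.
10 us-mx1.mimecast.com.
10 us-mx2.mimecast.com.
20 mx2-us1.proofpoint.com.
"""
for pref, hosts in group_by_preference(sample).items():
    print(pref, hosts)
# 10 ['us-mx1.mimecast.com', 'us-mx2.mimecast.com']
# 20 ['mx1-us1.proofpoint.com', 'mx2-us1.proofpoint.com']
```

Every hostname that shows up in any tier is a path into your organization, so each one needs an owner and a filtering policy.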

Simulating a Delivery with Telnet

Next, you can directly test whether the failover MX host (the one with the higher preference number, and therefore the lower priority) will accept mail for your domain using `telnet`. This classic tool speaks raw SMTP and tells you exactly how the server will behave. Connect to your secondary MX on port 25, the standard SMTP port, from a network that permits outbound connections on that port.

$ telnet mx1-us1.proofpoint.com 25
Trying 207.67.38.1...
Connected to mx1-us1.proofpoint.com.
220 mx1-us1.proofpoint.com ESMTP
HELO mytestpc.local
250 mx1-us1.proofpoint.com Hello mytestpc.local [1.2.3.4]
MAIL FROM:<test@example.org>
250 sender <test@example.org> ok
RCPT TO:<ceo@yourdomain.com>
250 recipient <ceo@yourdomain.com> ok

The critical line is the response to the `RCPT TO` command. If you get a `250 recipient ... ok`, the server is configured correctly to accept mail for your domain. If you get an error like `550 Recipient address rejected`, your failover is broken. The server doesn't recognize itself as a valid gateway for your domain. Finding this during a scheduled test is a minor configuration task. Finding it during a real outage is a crisis.
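
The same probe can be scripted with Python's standard `smtplib` so it can run on a schedule. This is a minimal sketch under a few assumptions: the function names, default sender, and HELO name are placeholders, and `smtp_factory` exists only so the connection class can be swapped out in a test. The session stops after `RCPT TO` and never sends `DATA`, so no message is delivered.

```python
import smtplib


def probe_rcpt(host, recipient, sender="test@example.org",
               helo_name="mytestpc.local", smtp_factory=smtplib.SMTP):
    """Run HELO / MAIL FROM / RCPT TO against host and return (code, text)."""
    # The context manager sends QUIT and closes the socket on exit.
    with smtp_factory(host, 25, timeout=15) as conn:
        conn.helo(helo_name)
        conn.mail(sender)
        code, text = conn.rcpt(recipient)
    if isinstance(text, bytes):
        text = text.decode(errors="replace")
    return code, text


def interpret(code):
    """Translate the RCPT TO reply code into a rough verdict."""
    if 200 <= code < 300:
        return "accepted"
    if 400 <= code < 500:
        return "temporary failure"
    return "rejected"


# Example call (needs outbound access to port 25, so it is left commented out):
# code, text = probe_rcpt("mx1-us1.proofpoint.com", "ceo@yourdomain.com")
# print(code, text, interpret(code))
```

A verdict of "accepted" only shows that the gateway recognizes your domain. It says nothing about which filtering policies it will apply, so that still needs a separate check in the vendor console.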

Beyond MX: The Role of ARC in Complex Mail Flows

In any architecture involving multiple hops—especially a multi-vendor failover setup—one protocol becomes critical: Authenticated Received Chain (ARC), specified in RFC 8617. ARC is designed to preserve email authentication results across the complex paths that legitimate email often takes.

Imagine an email passes SPF and DKIM checks at your primary SEG (Mimecast). Mimecast then adds an `Authentication-Results` header and forwards the email to Office 365. But what happens during a failover? The email goes to your secondary SEG (Proofpoint). Proofpoint might add a subject-line tag or a footer, which breaks the original DKIM signature. When this modified message arrives at Office 365, it fails DMARC.

ARC solves this. Before forwarding the message, both Mimecast and Proofpoint should add a set of three headers: `ARC-Authentication-Results`, `ARC-Message-Signature`, and `ARC-Seal`. Together these record the authentication results seen at their hop and sign them cryptographically. When the final destination (Office 365) sees a broken DKIM signature but a valid ARC chain from a sealer it has been configured to trust, it can look at the `pass` verdict recorded by the SEG and choose to accept the message anyway. Without ARC, your resilient architecture might cause you to reject legitimate mail during a failover.
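
When reviewing a message that came through the failover path, a first sanity check is whether each hop actually added its ARC headers. RFC 8617 defines three of them per hop (`ARC-Authentication-Results`, `ARC-Message-Signature`, `ARC-Seal`), all sharing an instance number in the `i=` tag. The sketch below checks only that structure using Python's standard `email` package. It does not verify any signature, the function names are hypothetical, and the header values in the sample are placeholders.

```python
import re
from email import message_from_string

ARC_HEADERS = ("ARC-Authentication-Results", "ARC-Message-Signature", "ARC-Seal")
# Match the i= tag at the start of the value or right after a semicolon.
_INSTANCE = re.compile(r"(?:^|;)\s*i\s*=\s*(\d+)")


def arc_instances(raw_message):
    """Map each ARC instance number to the set of header names carrying it."""
    msg = message_from_string(raw_message)
    found = {}
    for name in ARC_HEADERS:
        for value in msg.get_all(name, []):
            match = _INSTANCE.search(str(value))
            if match:
                found.setdefault(int(match.group(1)), set()).add(name)
    return found


def arc_chain_is_complete(raw_message):
    """True if instances 1..N exist and each has all three ARC headers."""
    found = arc_instances(raw_message)
    if not found:
        return False
    expected = set(range(1, max(found) + 1))
    return set(found) == expected and all(
        names == set(ARC_HEADERS) for names in found.values()
    )


# Placeholder message: one hop, made-up domain and signature values.
raw = (
    "ARC-Seal: i=1; a=rsa-sha256; d=seg.example; s=sel; cv=none; b=AAAA\n"
    "ARC-Message-Signature: i=1; a=rsa-sha256; d=seg.example; s=sel; "
    "h=from; bh=BBBB; b=CCCC\n"
    "ARC-Authentication-Results: i=1; seg.example; spf=pass; dkim=pass\n"
    "From: sender@example.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(arc_chain_is_complete(raw))  # True
```

A structurally complete chain is necessary but not sufficient: the receiver still has to validate the seals and decide whether it trusts the domain that applied them.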

The takeaway

Stop thinking in terms of 'backups'. Start thinking in terms of active, testable redundancy. A resilient mail architecture isn't a single DNS record with a high preference number; it's a system where every component, primary or secondary, is configured with the same security posture and actively validated. The goal is not just to receive mail during an outage, but to receive it securely.

The difference between a minor incident and a full-blown crisis is whether you test your assumptions. Your failover path is another attack surface. Treat it as such. As mail flows get more complex with multiple SEGs and forwarding rules, understanding the full authentication chain from sender to inbox becomes critical. This is where deep header analysis and DMARC monitoring become indispensable for any SOC.
