Building a Cloud-Native DNS Server
The Problem
.mil domains are tricky. Unlike traditional top-level domains, all .mil domains are delegated by the DISA Network Information Center (NIC). SHE BASH currently serves DoD Platform One, and Platform One services are spread across a myriad of domains that have no means of central management. Changing a few DNS entries can mean going through multiple places and people. Some of these locations and processes have led to DNS outages that disrupted internal and external services. Needless to say, that is a problem. To stabilize DNS we needed a way to centralize our own DNS management and host all services on a single domain, DSO.MIL (DevSecOps is the DSO bit).
Management of these domains is further complicated by decentralized zone management controlled by multiple AWS Route53 accounts and several entities. The DNS root servers are tightly controlled by the Defense Information Systems Agency (DISA). The result of this third-party management is, unfortunately, grotesque inefficiency and a lack of control in the event of downtime.
To further complicate the current state of DNS, our endpoints are split between two separate domains: dsop.io and cdl.af.mil. The second, cdl.af.mil, is not even a Platform One domain, but rather a subdomain of the Air Force at large for Cloud One.
The bottom line is that DNS was ugly. Taking control of DNS offered a path to expedite critical operations and give Platform One the control required to reliably offer enterprise services to mission partners without falling prey to the "classic way of DoD software development", which is unspoken, acceptable unreliability.
Since DNS is a mission critical service, ensuring high availability and resiliency is paramount, but impossible when DNS is out of your immediate control.
All this is to say, my task at Platform One was to help build a cloud-native DNS server. Our solution leveraged CoreDNS running on Kubernetes. As stated, this task seems simple, but in practice it is an arduous undertaking. Like everything else in the Department of Defense, bureaucracy and policy are the hardest barriers when forging new solutions.
Obstacles
To make things slightly more cumbersome, CoreDNS is poorly documented (it really is), and security policies for DNS in DoD are quite specific, with DNSSEC being the primary requirement of DNS as defined by DISA. DISA does not make the implementation requirements for DNSSEC available publicly, but creative Googlers may be able to find something that I cannot publish here. Regardless, our implementation is not as simple as using a commercial DNS provider that offers one-click DNSSEC, as Cloudflare does for shebash.io.
Not only could we not use more modern DNSSEC aids or leverage a common signing key (CSK) to handle zone signing, we were also forced to implement split-key DNSSEC, with separate key signing and zone signing keys (we opened an issue on GitHub about this as we expanded our architecture to a primary and secondary server to handle zone transfers).
Not only must DNSSEC be implemented properly according to DISA policy, but key rotations must also occur frequently (every 45 days for our zone signing key). Improper implementation of DNSSEC is simply not an option unless we forfeit our original position (that DNS is mission critical and downtime is unacceptable).
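To make the 45-day cadence concrete, here is a minimal sketch in Python of how a rotation deadline could be tracked. It is an illustration only, not the tooling used at Platform One; the function names and the example dates are hypothetical.

```python
from datetime import date, timedelta

# 45 days is the zone signing key rotation interval mentioned in this post.
ZSK_LIFETIME_DAYS = 45


def rotation_due(activated: date, lifetime_days: int = ZSK_LIFETIME_DAYS) -> date:
    """Return the date on which a key activated on `activated` must be replaced."""
    return activated + timedelta(days=lifetime_days)


def days_remaining(activated: date, today: date,
                   lifetime_days: int = ZSK_LIFETIME_DAYS) -> int:
    """Days left before rotation is due; a negative value means it is overdue."""
    return (rotation_due(activated, lifetime_days) - today).days


if __name__ == "__main__":
    activated = date(2021, 1, 1)  # hypothetical activation date
    print(rotation_due(activated))                      # 2021-02-15
    print(days_remaining(activated, date(2021, 2, 1)))  # 14
```

A check like this could run on a schedule and raise an alert well before the deadline, so that a missed rotation never becomes the cause of an outage.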
To complicate matters further, but to avoid specifics that need not be hashed out in this post, we had the tricky task of ensuring that our name servers were within an assigned IP range given to us by DISA. As you can undoubtedly infer at this point, DISA plays a large role in DNS for all of DoD, and their policies are where complacency kills innovation, but creativity wins wars.
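The IP range constraint is easy to check mechanically. The sketch below, using Python's standard ipaddress module, flags any name server address that falls outside an assigned block. The CIDR block and addresses are placeholders taken from the RFC 5737 documentation ranges, not real DISA allocations, and the function name is hypothetical.

```python
import ipaddress


def servers_outside_range(assigned_cidr: str, server_ips: list[str]) -> list[str]:
    """Return the server addresses that do NOT fall inside the assigned block."""
    network = ipaddress.ip_network(assigned_cidr)
    return [ip for ip in server_ips if ipaddress.ip_address(ip) not in network]


if __name__ == "__main__":
    # Placeholder values only; substitute the block actually assigned to you.
    assigned = "192.0.2.0/28"
    name_servers = ["192.0.2.5", "192.0.2.6", "198.51.100.7"]
    print(servers_outside_range(assigned, name_servers))  # ['198.51.100.7']
```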
The battle here was a single domain to prove the hypothesis that DNS in the Department of Defense can be securely managed in the cloud without the overbearing oversight of DISA, which in turn would streamline other facets of operations. The war is to scale this capability DoD-wide to optimize programs and truly enable DevSecOps in the modernization efforts of the Department of Defense. Ambitious? Yes. Impossible? Absolutely not.
Architecture
I cannot go into great detail here, so the abridged version is as follows:
- Create a Kubernetes-based CoreDNS authoritative DNS server
- Enable key signing and zone signing within CoreDNS for "on the fly zone signing"
- Get trusted IP space from DISA
- Import IP space into the cloud[1]
- Register new name servers with DISA using the IPs they provided
- Set up DNS zone management with GitOps using Flux
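One detail that tends to matter when zone data lives in Git: secondary servers compare the SOA serial to decide whether to request a zone transfer, so every committed change needs a higher serial. The sketch below assumes the common date-based YYYYMMDDNN serial convention, which this post does not confirm was used; the function name is hypothetical.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next YYYYMMDDNN serial, always strictly greater than `current`."""
    base = int(today.strftime("%Y%m%d")) * 100
    if current < base:
        # First change of the day: start the two-digit counter at 01.
        return base + 1
    # Later change on the same day (or a serial already ahead): just increment.
    return current + 1


if __name__ == "__main__":
    print(next_serial(2021010105, date(2021, 1, 2)))  # 2021010201
    print(next_serial(2021010201, date(2021, 1, 2)))  # 2021010202
```

A step like this could run in a pre-commit hook or CI job so that no zone change is merged without a serial bump.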
There is plenty more detail I could add here. However, rather than spend all month explaining it, and so as not to give away our secret sauce, the above should be plenty to let the imagination fill in the rest.
While the process was painful, the results are sweet. Our mission was clear: architect, engineer, and deploy a cloud-native DNS server to serve DSO.MIL. At the end of the day, we were successful. Many programs in DoD have lofty goals, but no path to execution. They are what I consider "fake agile": a fundamental unwillingness to take on risk to reap the rewards of building something that did not previously exist. Platform One is not fake agile. On the contrary, Platform One is arguably more agile than most of the private sector's DevSecOps initiatives.
I attribute this to the visionary leaders who pour their blood, sweat, and tears into the mission on a daily basis. From the top down and the bottom up. It doesn't matter. Everyone has a voice, and decisions are merit-based regardless of rank or title. When a true meritocracy exists, good ideas cease being nebulous floating goals and begin taking on real-world capabilities. Real work goes into the technology that furthers the mission to modernize legacy systems and, more importantly, legacy practices.
It is a privilege to be part of such an aggressively innovative program, one that sees opportunity for success in past failures.