Beyond Distrust, what can a root program do?
Moving away from the binary trust model for CAs and taking small steps to heal the ecosystem
Foreword: This is a highly opinionated piece. Most of what is being presented here is just one solution to a problem that, so far, has not been solved.
The problem: Today, trust in certificate authorities is a binary model. You either fully trust the CA, or you don’t trust it at all. There is no in-between step, and this is bad for compliance enforcement, and ultimately bad for the users.
A common saying in the public WebPKI world is, “the security of WebPKI is as strong as its weakest link.” I feel like this is an easy cop-out from taking actual steps that can help reduce and mitigate the impact of that weakest link.
Today, the line in the sand that a CA needs to cross to eventually get distrusted is arguably too far. Realistically, unless a CA is actively being malicious, they’re probably not going to get distrusted. If the CA is a larger CA, then we might even be getting into “too big to fail” territory. Meanwhile, I also don’t think punitive measures are effective, or appropriate, if their goal isn’t ultimately to help WebPKI and its users. These measures should also serve as a wake-up call for CAs, while allowing them to recover and have the measures lifted.
Here are a couple of ideas on how and when a root program can start putting limitations on CAs:
Enforced Shorter Certificate Lifetimes
We’ve seen time and time again that CAs consciously choose not to follow the rules that they set for themselves, with various excuses. When a CA doesn’t follow the rules (whether the Baseline Requirements or its own Certification Practice Statement) when issuing a certificate, the certificate in question is considered misissued. A CA is required to revoke a misissued certificate within 24 hours if it’s a serious issue, or otherwise within 5 days of finding out about it.
Now, what if they don’t? What do root programs do when this happens? Generally, nothing. The CA is generally required to open another incident for failing to revoke, but beyond that, there are no consequences for the CA, and the enforcement of the rules simply does not happen. Here are a few recent failure-to-revoke incidents:
Entrust: Failure to revoke EV TLS certificates issued before CPS update
VikingCloud: Delayed revocation of TLS certificates in connection to bug #1883779
Telekom Security: Revocation delay for TLS certificates with basicConstraints not marked as critical
In the last link, I framed a question for the root programs. The answer I received was partially satisfactory. However, I’ve personally yet to see any measure come out of a root program for these failures both as an external observer, and when I was an engineer working on a CA.
It should also be noted that a famous case involving a failure to revoke is “Let's Encrypt: Failure to revoke for Certificate Lifetime Incident.” Despite this failure, we got a pretty extraordinary outcome from this incident, as it was the birthplace of ACME Renewal Information.
Should a CA be distrusted if they fail to revoke on time? Some might say absolutely. In some circumstances, if the CA has shown this behavior consistently, then I’d probably agree with them. But there are some steps we can take before a full distrust.
As of the time of writing, CAs are allowed to issue certificates with a maximum lifetime of 398 days. A root program can place a measure on a CA that significantly reduces the maximum lifetime it is allowed to issue. For example, a graduated timeline could step down to 180 days, 90 days, 30 days, and eventually to the maximum for a short-lived subscriber certificate, which is currently defined as a certificate with a maximum lifetime of 10 days.
This is not a purely punitive measure, but one that recognizes that shorter-lifetime certificates are good for the ecosystem, and that the damage they can do is much more limited since they have a natural expiry date.
Simply put, if a CA demonstrates that they are not able to revoke certificates in time, then their future certificates should have more restrictions on them for some period of time. Once these CAs have had the opportunity to demonstrate their improvement in this space, this restriction can be lifted.
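To make the graduated timeline concrete, here is a minimal sketch of how such a step-down could be expressed. The step values come from the example above; the idea of a numeric "restriction level", and the names `LIFETIME_STEPS_DAYS`, `max_lifetime_days`, and `is_lifetime_allowed`, are my own assumptions for illustration and not something any root program actually uses. It also counts lifetimes in whole days, which is a simplification.

```python
from datetime import date

# Hypothetical step-down schedule based on the example in the text:
# 398 days (unrestricted) -> 180 -> 90 -> 30 -> 10 (short-lived).
LIFETIME_STEPS_DAYS = [398, 180, 90, 30, 10]


def max_lifetime_days(restriction_level: int) -> int:
    """Maximum certificate lifetime for a CA at the given restriction level.

    Level 0 means no restriction. Levels outside the schedule are clamped,
    so a very high level simply lands on the shortest lifetime.
    """
    level = max(0, min(restriction_level, len(LIFETIME_STEPS_DAYS) - 1))
    return LIFETIME_STEPS_DAYS[level]


def is_lifetime_allowed(not_before: date, not_after: date,
                        restriction_level: int) -> bool:
    """Check a certificate's validity period against the CA's current cap.

    Simplification: the lifetime is counted in whole days between the dates.
    """
    lifetime = (not_after - not_before).days
    return lifetime <= max_lifetime_days(restriction_level)
```

A user agent or linter could call `is_lifetime_allowed` with the level currently assigned to the issuing CA. Lifting the restriction is then just a matter of lowering the level again once the CA has shown improvement.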
Restricting CAs from being able to issue EV certificates
There has been a trend with incidents involving Extended Validation certificates in the compliance space:
CAs generally have a much harder time replacing these certificates when incidents happen around them.
The response time to these incidents is significantly worse than the response time to incidents involving DV.
The lack of automation in this space really hurts agility of response.
I don’t think we’re getting rid of EV certificates any time soon, even though, given this trend, EV certs clearly hurt the ecosystem quite a bit without really having any benefits. However, not everyone agrees with me here, and I understand that. I do think that we probably agree on one thing in this space: EV certs do demonstrate a higher level of trust and validation than DV certificates. Given this, the scrutiny and enforcement around EV certs should be significantly stricter than around DV certificates.
If a CA demonstrates that they’re unable to effectively manage incidents involving EV certificates, then why should they be able to continue issuing them, putting everyone else who relies on EV certs at risk? The ability to issue these certs is not a right, it is a privilege. Root programs can, and should, restrict the ability of misbehaving CAs to issue EV certificates. Once the CA proves a period of operational excellence (which, mind you, does not mean that they don’t have any incidents, but that the incidents they do have are managed with a strong and objectively good response), the ability to issue EV certs can be restored for them.
Again, the point of this restriction is not to punish a CA. It is to help mitigate the risk a CA poses to the wider web, and to help the CA improve its operations.
Name Constraints for the CA
In an ideal world, all TLDs would have the same level of importance, and no TLD would be more important than another. Realistically though, this is not the case. Being able to issue a malicious certificate for a `.ninja` domain is pretty bad, but nowhere near comparable to being able to issue a malicious certificate for `.com`.
The inability of a CA to respond to incidents with haste and strong operational excellence is a tell-tale sign that this CA should probably not be able to issue for the entire domain space, and should be restricted from issuing for some TLDs.
This methodology would also be great for introducing new CAs to the WebPKI ecosystem!
Deciding which TLDs should be allowed or restricted is not something that I have enough information to do, but I can think of a few high-level guidelines:
A new CA being introduced should potentially be restricted to ~100 of the most recently added TLDs that have open registration.
A country specific CA should potentially be restricted to the ccTLD of that country for some time, or maybe even permanently if they have no desire to issue beyond that ccTLD.
A misbehaving CA should be restricted from issuing for `.com`, `.net`, and `.org`, and possibly some, or all ccTLDs (assuming it is not a country specific CA).
Root programs can achieve this either through the Name Constraints X.509 extension, or through a dynamic policy system that the user agent for that root program obtains every couple of days and applies.
There are definitely many more options than I could discuss here that root programs can use before the nuclear option of distrusting a CA. I hope that consideration is given to these options, and that the root programs start taking steps to prevent the compliance regime of WebPKI from slipping away too much.
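As a rough illustration of the dynamic policy variant, here is a minimal sketch of the check a user agent might run against a per-CA TLD policy. The `TldPolicy` structure and the functions `tld_of` and `may_issue_for` are hypothetical names of mine, not a real root program format. It borrows only the broad idea from Name Constraints that exclusions win over permissions, and it is not an implementation of X.509 name constraint processing, which works on subtrees during path validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TldPolicy:
    """Hypothetical per-CA policy a user agent could fetch periodically."""
    # Empty "permitted" means every TLD is allowed unless it is excluded.
    permitted: frozenset = frozenset()
    excluded: frozenset = frozenset()


def tld_of(dns_name: str) -> str:
    """Return the last label of a DNS name, lowercased, ignoring a trailing dot."""
    return dns_name.rstrip(".").lower().rsplit(".", 1)[-1]


def may_issue_for(policy: TldPolicy, dns_name: str) -> bool:
    """Decide whether a certificate for dns_name is acceptable under the policy.

    Exclusions take precedence; if a permitted set exists, the TLD must be in it.
    """
    tld = tld_of(dns_name)
    if tld in policy.excluded:
        return False
    return not policy.permitted or tld in policy.permitted


# A misbehaving CA barred from the largest gTLDs (assumed example policy).
misbehaving_ca = TldPolicy(excluded=frozenset({"com", "net", "org"}))
# A country-specific CA limited to its own ccTLD (assumed example policy).
country_ca = TldPolicy(permitted=frozenset({"tr"}))
```

With these example policies, `may_issue_for(misbehaving_ca, "example.ninja")` is accepted while `may_issue_for(misbehaving_ca, "www.example.com")` is rejected, and `country_ca` is only accepted for names ending in `.tr`.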