This piece is Part 3 of a series on Entrust’s historically harmful behavior as a CA. If you haven’t read Parts 1 and 2 yet, you can find the links here:
Similar to Part 1, I’m using a SIRQ (Subjective Incident Response Quality) score to assign a subjective rating to each of these incidents.
In this part, we take a look at a handful of other incidents from Entrust’s history. We’ll notice that, later on, there is a remarkable improvement in their incident response.
Entrust: Invalid data in State/Province Field
Incident Start: 2020-08-10
Incident Reported: 2020-08-12
Full Incident Response: 2020-08-12
Incident Closed: 2021-06-07
In this incident, Entrust was informed by a third party that they had issued certificates with invalid State and Province data.
In their incident report, Entrust claims that they found the impacted certificates and reached out to the subscribers, setting a revocation date five days after the incident was reported to them (2020-08-15).
According to Entrust, there were 397 misissued certificates; 395 of them belonged to a single subscriber, and the remaining two belonged to another subscriber.
Unfortunately, in the same incident report they say that the subscriber with 395 impacted certificates claimed that the revocation deadline wouldn’t be possible for them. Entrust agrees to delay the revocation; however, Entrust does not set a deadline for when the revocation will actually happen.
These types of failures are not uncommon with OV and EV certificates. The unfortunate thing is that Entrust has dealt with similar incidents before [1], [2], [3], [4], and it’s clear that their remediation efforts did not go far enough.
In the report, Entrust claims that they will:
We will be doing a wider check of our certificate population to check for other uses of “NA”, “N/A”, “Not Applicable and we will also check state/province and country combinations based on the current ISO state/province country lists that are available. This will also allow us to check for any other potential issues, such as spelling mistakes.
On 2020-08-13, Entrust claims that:
We ran a full scan on our certificate population and did not find any other certificates where the state was set to "NA", N/A", or Not Applicable.
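As an aside, the kind of scan Entrust committed to here is not particularly hard to build. Below is a minimal sketch in Python (using the cryptography library, and not Entrust’s actual tooling) that flags certificates whose stateOrProvinceName is a placeholder value or isn’t a known subdivision for the certificate’s country; the placeholder set and the tiny ISO 3166-2 sample are my own illustrative assumptions.

```python
# Minimal sketch: flag certificates whose state/province value is a
# placeholder, or doesn't match a known subdivision for the country.
# The placeholder set and subdivision map are illustrative assumptions.
from cryptography import x509
from cryptography.x509.oid import NameOID

PLACEHOLDERS = {"na", "n/a", "not applicable", "none", "-"}

# In practice this would be generated from the full ISO 3166-2 dataset.
ISO_SUBDIVISIONS = {
    "US": {"California", "Texas", "New York"},          # truncated example
    "CA": {"Ontario", "Quebec", "British Columbia"},    # truncated example
}

def check_subject(pem_bytes: bytes) -> list[str]:
    """Return a list of problems found in one certificate's subject."""
    cert = x509.load_pem_x509_certificate(pem_bytes)
    countries = [a.value for a in cert.subject.get_attributes_for_oid(NameOID.COUNTRY_NAME)]
    states = [a.value for a in cert.subject.get_attributes_for_oid(NameOID.STATE_OR_PROVINCE_NAME)]
    problems = []
    for state in states:
        if state.strip().lower() in PLACEHOLDERS:
            problems.append(f"placeholder stateOrProvinceName: {state!r}")
        for country in countries:
            allowed = ISO_SUBDIVISIONS.get(country)
            if allowed is not None and state not in allowed:
                problems.append(f"{state!r} is not a known subdivision of {country}")
    return problems
```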
On 2020-08-18, they say that they’ve set a deadline of September 5th, 2020 to revoke the remaining 395 certificates that all belong to the large organization.
On 2020-09-15, they claim that all the impacted certificates were revoked as of 2020-09-04.
Ryan Sleevi then chimes in, asking whether Entrust has really only root-caused this to human error. He also asks whether false-positive testing is done with the validation staff:
Did I overlook a more detailed analysis here as to the root cause? I see the third case discussed somewhat at length, but this first case, as best I can tell, is mostly attributed to human error? Have I overlooked anything?
Related, can you discuss whether Entrust performs any false-positive testing of secondary reviewers? That is, send approvals to a secondary set of eyes that have known issues, to simulate them making it past the first reviewer, to ensure that they're detected, or to test the effectiveness of existing controls?
To which Entrust responds that this was both a human failure and a system design failure. They also say that Entrust does not do false-positive testing of the secondary reviewers. Entrust then commits to running false-positive tests regularly by January 2021.
On 2020-09-27, Entrust says that they’ve received another report of certificates issued with the wrong state data.
Ryan then chimes in, saying that the review Entrust promised to do wasn’t actually done:
Dathan: It sounds like this new incident is completely unrelated to those explanations offered in Comment #1. It also suggests that a holistic evaluation of existing certificates wasn't done (per your commitment to do exactly that on Comment #1). Given that commitment was made on 2020-08-12 to systemically evaluate and address this, it seems like it both could and would have been accomplished by now, so I'm concerned that Comment #10 highlights third-parties are still finding issues.
Specifically, the commitment made in Comment #1 was
we will also check state/province and country combinations based on the current ISO state/province country lists that are available
I understand that Comment #2 stated:
We ran a full scan on our certificate population and did not find any other certificates where the state was set to "NA", N/A", or Not Applicable.
But that's not all that you committed to.
To which Entrust replies that they have not done the full check. By 2020-10-27, it seems that Entrust has successfully run these scans and is no longer seeing this problem.
Entrust did not stop issuance during this incident response, even though the fixes had not been implemented, and more misissued certificates were found while the incident was open. This is, once again, Entrust deciding to keep issuing certificates while knowing that their system has a bug that is actively misissuing them.
Entrust then posts an update on 2021-05-19 stating that they’ve found two more misissued certificates, roughly nine months after the initial problem was found. It took until 2021-05-13 for Entrust to update their services with the fix for this problem.
SIRQ: 2
Entrust: Late Revocation for Invalid State/Province Issue
Incident Start: 2020-08-12
Incident Reported: 2020-08-12
Full Incident Response: 2020-08-12
Incident Closed: 2020-10-09
Since Entrust failed to revoke the impacted certificates from the previous incident on time, this incident had to be filed. Once more, Entrust puts the interests of their subscriber ahead of the interests of public trust. Compared to previous incidents, however, their revocation wasn’t too untimely.
The action items from this incident do nothing to prevent it from happening again. They read more like a sales pitch than action items:
Educating Customers on Best Practices to Handle Emergency Certificate Replacement Scenarios
Entrust deals with many large international organizations that issue certificates to many departments and sub organizations. Certificates are often managed by a smaller group that delivers certificates to many individuals within the organization. One of the major challenges in managing the regular certificate lifecycle within these larger organizations is tracking who ultimately owns the certificate, where the certificate is deployed, and the type of system or application that requires this certificate. While knowing this information does not solve every issue related to this late revocation problem, such as expediting change controls/outages process and having resources to generate new CSRs/keys and deploy new certificates, it can certainly help to cut down the amount of time to organize certificate replacement. Our certificate management system includes built-in tracking and reporting capabilities that can help customers find the right information they need during an emergency situations to help expedite the communication process. We need to make sure that our customers are leveraging this as much a possible in the event that there is a requirement to quickly replace certificates for any reason.
Emphasizing the Value in Automation
Certificate lifecycle automation is a valuable tool that can be used to speed up the certificate deployment/replacement process and eliminate the possibility of human error. However, there are many challenges to fully automating PKI within large organizations, including cost and environment complexity. As a CA, we have a responsibility in helping our customers achieve higher levels of automation. We do support many third party integrations and certificate lifecycle automation tools though our APIs, along with other automation protocols that our customers can leverage. While we cannot force our customers to use these tools, we can certainly do more when it comes to emphasizing the need for automation, and these events underscore how automation can help organizations to save resources and reduce their risk of running into issues when undergoing high volume certificate replacement.
SIRQ: 1
Entrust: Incorrect keyUsage for ECC certificate
Incident Start: 2020-09-25
Incident Reported: 2020-09-25
Full Incident Response: 2020-09-25
Incident Closed: 2020-10-31
In this incident, Entrust issued an ECC certificate with the `keyUsage` extension having the “Key Encipherment” usage set. They were able to find this problem with post issuance linting.
Unfortunately, on the question of whether they’ve stopped issuance for the time being, Entrust only says "Under investigation".
We also get a glimpse into how this mistake came to be:
The certificate management system is set to issue certificates in the Enterprise and Retail model. The PKI hierarchy is set with different CAs to issue certificates with RSA or ECC subscriber keys. The certificate profiles for this CAs are set to meet the requirements of the BRs, RFC 5280 and RFC 5480. The error occurred when a Retail OV ECC SSL certificate was issued from a OV RSA SSL CA. ECC keys are not supposed to be supported in the Retail model, but in this case the request was not blocked. Unfortunately, all SSL certificate requests in the Retail model are directed to CA configured to issue subscriber certificates with RSA keys.
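For a sense of scale, the consistency check that would have caught this before issuance is tiny. Here’s a minimal sketch in Python (using the cryptography library, not Entrust’s actual issuance stack) of a gate that rejects a certificate whose keyUsage doesn’t agree with its public key type, per RFC 5480’s rule that keyEncipherment must not be asserted for an EC public key:

```python
# Minimal sketch of a pre-issuance sanity check: refuse to publish a
# certificate whose keyUsage doesn't match its public key type.
# Illustrative only; this is not Entrust's actual issuance code.
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa
from cryptography.x509.oid import ExtensionOID

def key_usage_matches_key(cert: x509.Certificate) -> bool:
    ku = cert.extensions.get_extension_for_oid(ExtensionOID.KEY_USAGE).value
    key = cert.public_key()
    if isinstance(key, ec.EllipticCurvePublicKey):
        # RFC 5480: keyEncipherment and dataEncipherment must not be
        # asserted for an EC public key.
        return not (ku.key_encipherment or ku.data_encipherment)
    if isinstance(key, rsa.RSAPublicKey):
        # keyAgreement is not meaningful for RSA subscriber keys.
        return not ku.key_agreement
    return True  # other key types are out of scope for this sketch
```

A certificate like the one described above, with an ECC key and keyEncipherment asserted, fails this check; blocking on that result before signing is exactly the kind of pre-issuance control discussed later in the bug.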
In this report, Entrust does not yet commit to any action items.
Ryan asks how this incident report overlooked existing incidents of this nature. Entrust claims that the linked incidents do not apply to their issue:
(In reply to Ryan Sleevi from comment #1)
From the timeline, this seems to overlook issues like Bug 1647468 , or further back, Bug 1560234, as well as discussion within the IETF for the past year. Could you help factor these in to your timeline?
I don't think that the issues referenced apply to our issue. We were aware of the requirement; however, I cannot put a specific date on when we understood the requirement. Our policy was correct when we created CAs to issue certificate with ECC keys in April 2016. This policy was also implemented in our post issuance linting software.
The current issue is not a misunderstanding of the requirement, but that there is a bug in the enrollment software which sent a certificate request with an ECC key to a CA which was configured to issue subscriber certificates with an RSA key. This issue is currently being investigate and we will post a plan for resolution.
We also planned to address that the issue was not detected with zlint. We found https://github.com/zmap/zlint/pull/479/commits/90dbcc6b3f5623441b09754e219bfb4e71ce9ce0 and https://github.com/zmap/zlint/issues/454. We support this change. However, if the change is not made available, we will implement in our pre-issuance linting.
More details to follow.
After a few back and forths, Ryan Sleevi states:
Thanks Bruce, that's helpful at least to understand the disconnect.
I think the approach I was hoping for, for all CAs, is that:
1. An incident is reported by another CA
2. The (non-affected) CA looks at:
   - The resulting misissuance
   - The root causes
   - The mitigations the CA is proposing/applying
3. The (non-affected) CA examines their own system:
   - Is such a misissuance possible (independent of root causes)?
   - What are the controls to prevent this?
   - How are they tested?
   - Have they been reviewed to make sure documentation and production are aligned?
   - Are there variants of the root cause that can affect the CA?
   - Does the non-affected CA have similar mitigations in place?
Going to these other issues, it seems like had an examination like the above been done, then even though the supposed root causes (lack of awareness) are different, the question would have been: "Would our pre-issuance systems prevent this?", which would have determined that no, not yet. The distinction between keyAgreement and keyEncipherment is, to me at least, largely irrelevant; the issue at play here is a disallowed key usage (which does seem identical between these issues).
In your original report, you said:
ECC keys are not supposed to be supported in the Retail model, but in this case the request was not blocked. Unfortunately, all SSL certificate requests in the Retail model are directed to CA configured to issue subscriber certificates with RSA keys.
So this is why I bring up the other issues. If you had, say, pre-issuance linting, presumably the attempt to use the Retail OV RSA profile with an ECC key would have flagged the pre-issuance lint, which would have discovered that the Retail profiles were sending everything to the RSA profile (at least, as I understand it) and that ECC keys weren't getting blocked.
Does that at least help better explain the relation and relevance of these other issues? The keyEncipherment vs keyAgreement seems largely moot, and this is more generally about using those other incidents (of the keyUsage not matching the subscriber key) to make sure that sufficient controls were in place to ensure that keyUsage, well, matched subscriber keys.
Do you have a timline for a plan for remediation, even an initial WIP one?
To re-state, SSL certificates using ECC keys are not supported in our Retail mode. However, we did find path where the Retail service does not reject a CSR with an ECC key. Since the Retail service only uses CAs which have certificate profiles based on an RSA key, the CA signed a certificate with an ECC key and the incorrect key usage. This error is being corrected.
In addition, the pre-issue linting which uses zlint, did not provide an error and block certificate issuance. However, zlint does provide a Notice that the key usage is incorrect. We will update zlint to the latest version and update our code to provide an error, which will mitigate this error in the future.
Our timeline to have this incident remediated is no later than 30 October 2020.
A few worries I have reading this:
For over a month, anyone who had read this incident could cause Entrust to issue an invalid certificate simply by submitting an ECC CSR through the Retail service.
This “mistake” is in the Key Usage extension, an important extension in an X.509 certificate: it effectively defines what the certificate’s key is allowed to be used for.
Entrust never clarifies how they’re monitoring Bugzilla for other incidents. We’ve seen time and time again that Entrust has difficulty actually applying lessons from other CAs’ incidents to their own CA. (See: Entrust: OCSP response signed with SHA-1)
For some reason, Entrust doesn’t use their post-issuance linter in the pre-issuance phase. I’m really not sure why this is such a sticking point for them. In the previous two parts of this series, we’ve seen this pattern of not applying linters holistically across their issuance stack multiple times.
Why is there a difference between their `Retail` and `Enterprise` stacks, to the point where one is capable of ECC and one isn’t? That’s just so much more complexity to manage. A good incident response would’ve explained the history of how they’ve ended up with two different stacks.
SIRQ: 0
Entrust: Failure to provide a preliminary report within 24 hours
Incident Start: 2020-09-27
Incident Reported: 2020-09-27
Full Incident Response: 2020-10-01
Incident Closed: 2021-01-27
An external user reports that they sent Entrust a Certificate Problem Report and Entrust did not respond with a preliminary report within the 24-hour time frame set by the Baseline Requirements.
Entrust does a good job here detailing the timeline of the incident, and creates action items that will prevent a similar problem in the future:
There are 3 things that we will implement to improve our incident response process and our requirement to provide a preliminary report:
(1) Create a workflow in our CRM tool to enforce the incident response process steps and send out periodic email alerts to the compliance team if the ticket does not proceed to the next step within a certain amount of time. The workflow would allow the agent to transition the issue throughout the various milestones:
1. Certificate problem report received and tagged in our system as a compliance issue (this step already exists)
2. Certificate problem confirmed internally as mis-issuance or key compromise
3. Provide preliminary report to subscriber
4. Provide a preliminary report to the reporter
5. Revoke certificates within the deadline
6. Case closed
(2) As part of this workflow, implement a preliminary report template to make sure that the agent always sends out the right information.
(3) Training on the CRM updates and template for the agents who will be handling intake of certificate problem reports
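The core of item (1) is essentially a state machine with a deadline attached to each step. As a rough illustration of the idea (the step names and time windows below are my own assumptions, not Entrust’s actual CRM configuration), each ticket records when it entered its current step, and anything sitting there past its window gets flagged for escalation:

```python
# Rough sketch of the workflow-with-deadlines idea: tickets that sit in a
# step longer than that step's allowed window get flagged for escalation.
# Step names and windows are illustrative assumptions, not Entrust's process.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STEP_WINDOWS = {
    "report_received": timedelta(hours=2),       # confirm misissuance / key compromise
    "confirmed": timedelta(hours=12),            # send preliminary report to subscriber
    "subscriber_notified": timedelta(hours=24),  # send preliminary report to the reporter
    "reporter_notified": timedelta(days=5),      # revoke within the applicable deadline
}

@dataclass
class Ticket:
    ticket_id: str
    step: str
    entered_step_at: datetime

def overdue(tickets: list[Ticket], now: datetime | None = None) -> list[Ticket]:
    """Return tickets that have sat in their current step past its window."""
    now = now or datetime.now(timezone.utc)
    return [
        t for t in tickets
        if t.step in STEP_WINDOWS and now - t.entered_step_at > STEP_WINDOWS[t.step]
    ]
```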
Unfortunately, however, Entrust fails to provide a timeline for these. Entrust then updates us on 2020-10-08, a week after the incident report, saying that no deadlines have been set yet.
Later, Entrust confirms a target date of January 2021 for the implementation, and they do get the change implemented by then.
SIRQ: 4
Entrust: Subscriber provides private key with CSR
Incident Start: 2020-10-19
Incident Reported: 2020-10-23
Full Incident Response: 2020-10-23
Incident Closed: 2020-12-11
Entrust discovers that a customer included their private key alongside the certificate signing request sent to Entrust. This means the private key is compromised, and the resulting certificate is misissued. Unfortunately, while Entrust was investigating this, they misissued another certificate with the same problem. This means that Entrust, once again, did not turn off issuance when the problem was discovered.
This incident highlights the importance of public incident reporting. Various other CAs comment on the incident, explaining that they were also impacted and, thanks to this report, were able to address it in their own systems.
In fact, Entrust goes above and beyond, sharing fingerprints of the compromised keys so that other CAs can check their own certificate populations:
This is the SHA1 hash of the hex representation of the key modulus for the certificates with compromised keys. This will help other CAs assess if they issued any certificates with a compromised key as indicated in this bug.
For clarification, we posted the Debian Weak Key list style of hash as we assume that the CAs would already be used to ingesting it. This representation is only the last 10 bytes of the hash. I believe that it is discussed here, https://wiki.debian.org/SSLkeys.
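For anyone wanting to reproduce these fingerprints, the format is simple to compute. Here’s a minimal sketch that assumes the Debian openssl-blacklist convention of hashing the line Modulus=<uppercase hex> plus a trailing newline and keeping the last 10 bytes (20 hex characters) of the SHA-1 digest; verify the exact formatting against the hashes Entrust actually posted before relying on it:

```python
# Sketch: compute a Debian-weak-key-style fingerprint for an RSA modulus.
# Assumes the openssl-blacklist convention (SHA-1 over "Modulus=<HEX>\n",
# uppercase hex, keeping the last 10 bytes / 20 hex characters).
# Verify the exact format against the hashes posted on the bug before use.
import hashlib

def debian_style_fingerprint(modulus: int) -> str:
    line = "Modulus={:X}\n".format(modulus)
    digest = hashlib.sha1(line.encode("ascii")).hexdigest()
    return digest[-20:]  # last 10 bytes of the SHA-1 digest, as hex
```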
This is genuinely a great incident report. It makes it easy to understand what the problem was, and Entrust shared samples of what the offending input looks like, along with details to help other CAs search for affected certificates.
Unfortunately, though, Entrust did not stop certificate issuance while they were working on the fix. If fast remediation isn’t possible, it’s better to stop issuance. Beyond that, the title of the incident reads as if it blames the subscriber for what is really a failure of Entrust’s software to properly sanitize subscriber input.
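The input-sanitization side of this is also cheap to gate on. Here’s a minimal, illustrative sketch (my own example, not Entrust’s fix) of a submission check that rejects any enrollment payload containing PEM private-key material:

```python
# Sketch: reject an enrollment submission if it appears to contain
# PEM-encoded private key material alongside the CSR. Illustrative only;
# a real check must also treat any key it has received as compromised.
import re

PRIVATE_KEY_MARKER = re.compile(
    r"-----BEGIN (?:RSA |EC |ENCRYPTED )?PRIVATE KEY-----"
)

def submission_is_safe(payload: str) -> bool:
    """Return False if the submitted text appears to include a private key."""
    return PRIVATE_KEY_MARKER.search(payload) is None
```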
SIRQ: 3
Entrust: Incorrect Business Category Value Discovered in an EV SSL Certificate
Incident Start: 2021-01-05
Incident Reported: 2021-01-06
Full Incident Response: 2021-01-08
Incident Closed: 2021-07-15
In this incident, Entrust discovered that a certificate had been misissued since 2019 with the wrong business category:
On 5 January 2021 at approximately 17:00 UTC, the Entrust Verification team discovered a single EV SSL certificate with an invalid Business Category value. This was discovered during a regular EV business profile re-validation. It appears that during a contact update change on the business profile in 2019, the business category was inadvertently updated by an agent from its correct value (Government entity) to an incorrect value (Private Organization).
What’s concerning here is that this issue was missed for effectively two years: once in 2019, and once again in 2020.
Entrust then continues with how they investigated this problem:
After conducting multiple interviews with senior agents along with agents who were involved with the original mistake on 18 April 2019, we found that the main reason this mistake was first introduced is due to a lack of highlighting changes with the previous verification data when updating a business profile in our Verification system.
The response on the bug is positive:
Thanks Dathan.
I'm actually encouraged to see this report, because I believe it's a sign of Entrust improving on its incident reporting. From it, I believe I was able to get an understanding of what went wrong, and it also appears that your mitigations - looking at the systemic factors such as the human interface - are meaningful to address root causes, as opposed to suggesting it's an operator training issue (with the operators being trained to use the existing interface).
Entrust then commits to a release in July with the expected UI changes. Entrust delivers on this change and shares screenshots showing how the new application makes modifications more obvious.
SIRQ: 5
Entrust: Incorrect Jurisdiction Country Value in an EV Certificate
Incident Start: 2021-03-03
Incident Reported: 2021-03-03
Full Incident Response: 2021-03-04
Incident Closed: 2021-07-15
In this incident, Entrust discovers that they have been issuing certificates for organizations in Botswana (Country Code: BW) with the Jurisdiction Country set to South Africa (Country Code: ZA).
Similar to the previous incident, Entrust claims that their ineffective web UI is ultimately the root cause, compounded by some complexity around the vetting sources used for country codes:
Another point to note is that Entrust issues many certificates in ZA and often uses a QGIS known as “CIPC”. The QGIS that was used for this particular profile in BW is known “CIPA”. It was noted during our investigation that the agent mistook “CIPA” for “CIPC”, which also led to the agent entering the incorrect value of ZA. Also, note that the field for the Jurisdiction Country value in our vetting system is currently a country drop-down that is selected by the agent that is independent of the vetting source being used.
Ultimately, this and the previous incident were probably discovered as a result of the remediation work from Entrust: Invalid data in State/Province Field.
SIRQ: 5
This marks the end of Part 3 of this series. To summarize:
Entrust shows an improvement in their incident response.
There weren’t any incidents requiring mass revocations. The largest revocation involved roughly 400 certificates, and Entrust took about a month to get those revoked.
Entrust, once again, did not stop certificate issuance when they detected that their CA was misissuing certificates.
Where Entrust has historically struggled the most is mass revocations, and this series of incidents did not include any. Based on what we’ve seen both in the past and recently, I believe this analysis would have gone differently if the circumstances of these incidents had been different.
The conclusion I draw from this is that Entrust knows how to do incident response properly. They’re just choosing not to when an incident requires a large number of revocations. This is deeply problematic for Entrust. Ignorance of the BRs and other rules and requirements is not why they’re struggling with incident response; they are struggling because they are actively choosing to put their subscribers ahead of their obligations as a CA.
Out of these seven incidents, Entrust self-discovered five, and two were externally reported.