Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it

Florent Chabaud
Atos Big Data & Cybersecurity, Rue du Gros Caillou, 78340 Les Clayes-sous-Bois, France

Abstract
For decades, Trusted Computing has tried to anchor trust in the hardware, and the presence of Trusted Platform Modules (TPM) in most modern designs is evidence that this approach is now well understood. The default behavior of recent operating systems such as Windows 11 is even to deny booting if this security feature is absent. But this approach is not sufficient in a modern world where one needs to trust remote platforms. To preserve confidence in security, one needs to limit the trusted computing base (TCB) of a system to a level where an assessment remains meaningful. The Trusted Execution Architecture (TEA) is the result of a partnership with ProvenRun to implement such a TCB in Atos servers in a consistent way, from Edge to High Performance Computing. This makes it possible to envision security features based on a common Root-of-Trust known to different platforms, at different scales and levels of interaction.

Keywords
Trusted Computing, Edge Computing, High Performance Computing, Remote Attestation, Operating System

C&ESAR'22: Computer & Electronics Security Application Rendezvous, Nov. 15-16, 2022, Rennes, France
EMAIL: florent.chabaud@atos.net
ORCID: 0000-0002-5007-6025
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
In 1993, the NSA tried to introduce the Clipper Chip to promote key-escrowed encryption [1]. Even though this attempt failed [2] and backfired by promoting open-source encryption [3], it showed the importance of hardware in computer security and paved the way to trusted computing. Soon afterwards, the Trusted Computing Platform Alliance, later renamed the Trusted Computing Group [4], emerged and promoted another piece of hardware, the Trusted Platform Module (TPM), now standardized [5] and embedded in most platforms. This concept is now being revisited by another industry consortium, the Open Compute Project [6], which supplements the TPM with other security chips.
Even if adding security hardware can make sense, it always raises the question of how this new hardware can itself be trusted. Understanding the alleged improvement in terms of security is also important to assess whether the benefit outweighs the added complexity, and in the end, other options can be envisioned.
In this paper, we give a brief survey of the state of the art of hardware trust in section 2. We discuss the rationale of the Atos Trusted Execution Architecture (TEA) and the pros and cons of this software-oriented approach in section 3. We then detail some aspects of the implementation in section 4. Finally, section 5 sketches future innovative security features in Atos HPC architectures, as enabled by the Atos TEA.

2. Hardware Trust State-of-the-Art Overview
2.1. Smart Cards
Long before the NSA tried to promote its Clipper Chip, the idea of using small pieces of hardware to protect secrets arose in several places. Several patents were filed around this invention, but the seminal industrialization patent covered the first portable support with both a processor and a memory, allowing the small piece of plastic to interact cryptographically with its environment in an active way.
Embedding the processor and the memory in a single chip came soon after, inaugurating the reign of smart cards. It is worth recalling that Michel Ugon, a French engineer of the Bull company (later acquired by Atos), was at the core of these inventions [8][9][10].
The wide dissemination of smart cards made them a primary target of a new class of cryptographic attacks: if the secret never leaves the chip, perhaps the way it is used leaks some information about it. A seminal attack of this kind is due to P. Kocher et al., who invented Differential Power Analysis (DPA) [11]; it remains one of the threats a cryptoprocessor must deal with, among other types of side-channel attacks.

2.2. Hardware Security Modules
Hardware Security Modules (HSM) are another example of devices designed to protect secrets. HSMs usually embed features to physically protect their internals and provide tamper evidence at both the physical (labels, screws, ...) and logical (logs, alarms, ...) levels. Certification standards such as Common Criteria [12] or FIPS 140 [13] were developed with HSMs in mind to evaluate the robustness of these security mechanisms. As usual, it is worth noting that being certified is not a guarantee of security. Depending on the security model, a certified HSM can still prove vulnerable to threats that fall outside its protective scope. Interestingly, a recent example proved the need for HSMs to be self-protected against firmware tampering, not only on their cryptoprocessors but also on their application part [14]. Said differently, HSM firmware also needs some Hardware Root-of-Trust!

2.3. Trusted Platform Module (TPM)
The Trusted Computing Group (TCG) promotes Trusted Computing concepts for personal computers around the use of a Trusted Platform Module (TPM). It has now become an international standard [5] for a secure cryptoprocessor providing several security functions:
- Unique device keys: the TPM embeds private keys which are normally certified by its manufacturer.
- Measurement: the TPM securely stores Platform Configuration Registers (PCR) which are obtained by chaining the cryptographic hashes of several memory areas in a specific order. Usually, the memory areas are the successive pieces of code used during the boot sequence, thereby building a chain-of-trust. The PCR values can be verified locally by the operating system to check that the boot sequence was not tampered with.
- Remote attestation: using its unique device keys, the TPM can sign its PCRs to attest remotely that the boot sequence was not tampered with. This signature can be verified against the TPM manufacturer's public key certification infrastructure.
- Key wrapping: using its unique device keys, the TPM can encrypt other cryptographic keys to ensure their secure storage. This guarantees that the locally encrypted keys cannot be decrypted without the TPM.
- Random number generator: the TPM usually embeds a hardware random number generator suitable for cryptographic use.
It is important to understand that the TPM cannot guarantee the security of the CPU boot process by itself. It must be complemented by a bootstrapping process that kicks off the measurements and takes their results into account.

2.4. Trusted Execution Environment (TEE)
Following M. Sabt et al. [15], we take as a definition of a Trusted Execution Environment (TEE) “a tamper-resistant processing environment that runs on a separation kernel”.
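To make the measurement mechanism of section 2.3 concrete, here is a minimal Python sketch, written for this survey rather than taken from any TCG reference code, of how a PCR is built: each boot stage is hashed and folded into the register as H(old PCR || digest), so the final value depends on every stage and on their order (a real TPM performs the same folding in hardware through its TPM2_PCR_Extend command).

import hashlib

PCR_SIZE = 32  # a SHA-256 PCR bank

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # TPM-style extend: new PCR value = H(old PCR || measured digest)
    return hashlib.sha256(pcr + measurement).digest()

def measure_boot_chain(stage_images: list[bytes]) -> bytes:
    # Fold the digest of each successive boot stage into a single PCR value.
    pcr = bytes(PCR_SIZE)  # PCRs start at all zeroes on reset
    for image in stage_images:
        digest = hashlib.sha256(image).digest()  # measure the next stage...
        pcr = pcr_extend(pcr, digest)            # ...and record it before running it
    return pcr

# A verifier that knows the expected stage images can recompute the final value
# and compare it with the (signed) PCR reported by the platform.
golden = measure_boot_chain([b"bootloader", b"kernel", b"initrd"])
assert measure_boot_chain([b"bootloader", b"kernel", b"initrd"]) == golden
assert measure_boot_chain([b"kernel", b"bootloader", b"initrd"]) != golden  # order matters

We now come back to the TEE just defined.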
It aims at providing on a single CPU an isolation between a “normal” kernel and a “trusted” one, protected against software and 116 Proceedings of the 29th C&ESAR (2022) F. Chabaud hardware attacks. Several TEE solutions exist but all of them are based on some hardware technologies such as Intel TXT [16] or ARM TrustZone [17]. The latter is widely used in mobile environments as stated by M. Sabt and al. The security of a TEE solution results from the hardware technology used, but it depends a lot more on the usage of this technology at software level. Even if the TEE is fully isolated at hardware level, its purpose is to exchange data with the normal world and process it in a secured environment. Any vulnerability in the driver which ensure communication between the two worlds can ruin the overall security of the TEE [18][19]. 2.5. Secure Chips Other secure chips have been developed in different industries to ensure firmware integrity. Examples of solution are found in the Set-top-box area where control access system vendors ensure digital rights management (DRM) on video streams through hardware security features. For instance, Nagra On-Chip Security (NOCS) “brings the hardware “root of trust” that ensures platform security” [20]. Another example is the ARM-based cryptographic embedded controller [21] which proposes all the features to implement a TEE. 2.6. Open Compute Project (OCP) The Open Compute Project (OCP) is an organization [6] that shares designs of data center products and best practices among several companies. It leads several projects around datacenter design. As usual in those types of organization, a sponsoring program is in place with different levels [7]. Being able to claim a product is OCP Inspired™ requires at least a Silver subscription. Curiously, the annual fee is decreasing from Silver to Platinum level, but this is compensated by the obligation to contribute to events and overall activity of the project. This may explain why the list of members is roughly split in two between Community members at lowest rates, and Platinum members. Platinum members encompass companies such as Alibaba, AMD, ARM, Cisco, Deutsche Telekom, Google, HPE, Huawei, IBM, Intel, Meta, Microsoft, Nokia, Nvidia, or Schneider Electric, among others. Through its Security project, the objective of the OCP could be summarized as an effort to gather all previous security technologies like TPM and secure chips in an organized standard able to ensure secure computing. Also all documentation is shared according to a Creative Commons license [22] allowing to share and adapt the material. 2.6.1. OCP Platform Security Overview The overall organization of the OCP is somehow fuzzy, but two parallel approaches are identified which should eventually converge: 1. The Datacenter Secure Control Module (DC-SCM) specification [23]. 2. The OCP Platform Security Overview [24]. The main outcome of this last document can be summarized in the excerpted Figure 1. It introduces a new piece of hardware, the Platform Active Root-of-Trust (PA RoT). The role of this PA RoT, which could be ensured by the DC-SCM, is aligned with the NIST Platform Firmware Resiliency Guidelines [25]. This Special Publication was issued in May 2018, and its guidelines have soon become a de facto standard of what a platform needs to implement to improve their resiliency against a variety of known attacks, both at software and/or hardware level. 
It proposes a progressive approach with three different platform security levels: Protected, Recoverable, and Resilient. Proceedings of the 29th C&ESAR (2022) 117 Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it Figure 1: Overview of Secured Platform Architecture according to OCP [24] – CC BY-SA 4.0 license [22] 2.6.2. Attestation of System Components The OCP has issued some requirements and recommendations around attestation of system components [26]. The document goal is to allow a platform (verifier) to build its platform inventory containing a list of all security-relevant devices, whether they support authentication and attestation or not. Attestations are secured by a set of cryptographic keys and protocols which are used in attestation mechanisms. Cryptographic requirements refer to standard NIST documents. Because of these requirements, each device must be provided with a set of cryptographic keys: 1. A Unique Device Secret (UDS) which is used to characterize the attester device. 2. A private authentication key unique to each device. The corresponding certificate is allowed for digital signature usage. This key is intended to be immutable and certified by the provisioner. 3. A private signature key unique to each device. The corresponding certificate is allowed for digital signature and content commitment usages. This key is intended to be updated and certified by the device owner. For the provisioner, the specification also requires some key management infrastructure using HSM to protect: 1. The keys of the Provisioner’s Certificate Authority. 2. The keys of the Updater role. 3. The keys of the Firmware signer role. To be noted that there is a notion of ownership transfer that implies that the Updater and Firmware signer keys can be changed by the Device owner. Also, the requirement implies the existence of a root of trust within each attester, able to perform cryptographic operations, including random number generation with sufficient entropy. References to NIST publications and FIPS 140-3 [13] at level 2 is recommended. Once this is set, attester devices must be capable of communicating their authentication and attestation capabilities to the platform, and platforms must be capable of interrogating potential attester devices and recording their authentication and attestation capabilities. The two references used to implement the corresponding protocols are described in DMTF’s SPDM [27] (see 2.3.9) and Microsoft’s Cerberus [29]. A sample implementation of the DMTF’s SPDM specification is also a reference [28]. 118 Proceedings of the 29th C&ESAR (2022) F. Chabaud 3. Atos’ Approach 3.1. Threat Model When dealing with firmware security, the threat model can drastically impact the level of protection needed. It is indeed a different story to protect the firmware integrity of a server lying in a physically isolated datacenter, or to address the same problem on a smart card which can be easily replaced by a copy. Also of importance is the scope of the intended protection. In our case, and in this paper, we focus on the security of the platform with an agnostic approach of the CPU/GPU components. We aim to ensure some security independently from the existing technologies at OS level. In particular, and as an example, the operating system can still use the TPM when it is present and leverage the CPU technologies to ensure it’s booted in a secured way. This will be further explained in section 4.4. 
But we want to ensure a certain level of security of the platform even if none of these security features is used. So let us first identify the types of attack scenarios one would like to prevent in this context.

3.1.1. Physical RAM Access
The first attack scenario is basic. An attacker with physical access to the server could leverage this access to reprogram the hardware memories and have the platform firmware execute unwanted operations. For instance, in the context of an HSM protecting secret keys, reprogramming the firmware could be a simple way to get a given secret key copied to an external interface, hence compromising it.

3.1.2. Supply Chain Attack
Physical RAM access may be assumed to be limited in time. If physical access is possible for days, for example during shipment, the possibilities for altering the firmware are much greater. Components could be replaced with ones that mimic the behavior of the originals while hiding a backdoor, for instance.

3.1.3. The Persistent Remote Attack
Since critical vulnerabilities will eventually be discovered in the firmware, the security objective is to make the platform able to recover from such an attack and to prevent its persistence. If such an attack can change every piece of data in the platform memories, then it is clear that the attack can persist, since the platform relies on its memories to boot. The use of some immutable data therefore seems mandatory to ensure security.

3.1.4. Rogue Developer
The insider threat remains a possibility for any vendor, whatever the approach to source code control. The effects are the same in case of an intrusion into the development infrastructure. Controlling the code can mitigate this risk only if these controls are not bypassed. A good way to ensure that the code is controlled at least once is to have it signed with a properly protected cryptographic key. This also ensures a resilient posture in case of late discovery of some rogue activity. In this case, the root-of-trust remains the cryptographic keys used to sign the firmware.

3.2. Sovereignty Principle
From a security perspective, a platform MUST use a Hardware Root-of-Trust (HW RoT). It is the only way to ensure some protection against software attacks and to achieve a level of resilience. Without it, any of the above-listed attacks, if successful, would reach a state where trust would be damaged in an irreversible way. On the other hand, if the Hardware RoT exists, the platform can be rebuilt from this basis. This is the approach specified by OCP with its Platform Active Root-of-Trust and promoted by the NIST Platform Firmware Resiliency Guidelines [25].
But Atos also wanted to limit unknown hardware to increase trust and confidence in the solution. Even if some of the proposed PA RoTs are open source, like OpenTitan [31], adding new hardware increases the attack surface and the complexity of the server. One must also take into account the delays needed to qualify and stabilize the new hardware [32]. In the end, trust in the resulting hardware is disputable and will eventually come from wide usage, as the story of the TPM told us. We therefore limited the HW RoT to public cryptographic keys anchored in the silicon, whose corresponding private keys are handled in an HSM developed by Atos: the Trustway Proteccio netHSM [30].
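To illustrate how a public key anchored in silicon can bootstrap such a construction, the sketch below models a boot ROM that verifies a first-stage image against the immutable key, then trusts a key carried inside that verified image to check the following stage. The image layout, the choice of Ed25519, and the helper names are our own simplifying assumptions for illustration, not the actual TEA formats.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

SIG_LEN, KEY_LEN = 64, 32  # Ed25519 signature and raw public key sizes

def split_image(image: bytes):
    # Hypothetical layout: [signature][next-stage public key][code]
    sig = image[:SIG_LEN]
    next_key = image[SIG_LEN:SIG_LEN + KEY_LEN]
    code = image[SIG_LEN + KEY_LEN:]
    return sig, next_key, code

def verify_stage(trusted_key: bytes, image: bytes) -> bytes:
    # Check one boot stage and return the key that will vouch for the next one.
    sig, next_key, code = split_image(image)
    pub = ed25519.Ed25519PublicKey.from_public_bytes(trusted_key)
    pub.verify(sig, next_key + code)  # raises InvalidSignature if tampered
    return next_key

def chain_of_trust_for_detection(rot_public_key: bytes, stages: list[bytes]) -> bool:
    # The silicon-anchored key validates stage 0; each verified stage then
    # carries the key used to validate the following one.
    key = rot_public_key
    try:
        for image in stages:
            key = verify_stage(key, image)
    except InvalidSignature:
        return False  # a real BMC would simply refuse to boot or to power on the CPU
    return True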
This HW RoT is then propagated through a Chain-of-Trust applied to:
1. A secure boot sequence (Chain-of-Trust for Detection, CTD).
2. A secure firmware update (Chain-of-Trust for Upgrade, CTU).
These chains-of-trust will be detailed in section 4.1.

3.3. Baseboard Management Controller
Most modern servers have a Baseboard Management Controller (BMC), which is responsible for powering on the main CPUs and for managing the firmware. From a security perspective, it is already a piece of the platform you need to trust, and it has already been proven that it can be a source of weakness for a server [33].
For the servers developed by Atos, the BMC is hosted in a System-on-Chip (SoC). We decided to leverage this existing hardware and elevate it as the Platform Active Root-of-Trust for the platform. Figure 2 adapts the original OCP figure (see Figure 1) to illustrate the approach. The server's CPUs are therefore seen as symbiont devices relative to the BMC. The advantage of this approach is that this hardware is mandatory in all servers, as it is the interface to the management infrastructure used to power on the platform or upgrade its firmware. It also plays a key role in the overall integrity of the platform, and its security should be hardened.
Figure 2 – Trusted Execution Architecture (TEA) of Atos servers, with the ARM-based BMC acting as PA RoT – adapted from CC BY-SA [22] licensed material by OCP [24]

3.4. Security Implications
The current implementation of the Atos BMC is based on OpenBMC [34], a Linux Foundation collaborative open-source project whose goal is to produce an open-source implementation of the BMC firmware stack. The OpenBMC project already has security in mind, with firmware signature verification during secure boot [35]. But this does not appear sufficient to reach the security level needed for a PA RoT, and even hardening the OpenBMC Linux kernel would not achieve hardware-like security. However, the underlying hardware embeds an ARM core with TrustZone technology [17]. Leveraging this technology makes it possible to achieve a decent level of security, even without a dedicated security component. This is the implementation we will detail in section 4.
Assuming this technology is efficiently implemented, the impacts on the above threat model are the following:
1. Reprogramming the firmware of the server assumes the possibility to reboot the server with a rogue firmware. This is prevented because the BMC verifies the signature of the firmware during the boot sequence, and the cryptographic keys used for this verification are out of reach of a standard physical access.
2. Changing components of the platform is the threat covered by the device/peripheral attestation mechanism. Of course, the security of this mechanism depends on the existence of the PA RoT, which in our case could be replaced by another BMC. This rogue SoC would have to implement a backdoor in a way that resists subsequent firmware upgrades of the BMC using Atos firmware. This seems an acceptable residual risk.
3. The persistent remote attack risk is covered in the same way as the direct reprogramming of the firmware memories. The BMC verifies the signature of the firmware during the boot sequence, and the firmware upgrade feature present in the BMC also verifies the signature of the firmware before authorizing the upgrade.
4.
The public keys anchored in the hardware make possible to recover from a situation of a trapped development as long as the private keys are duly protected. Of course, the use of a non-dedicated hardware for security has some drawbacks. For instance, it is envisioned to implement a firmware TPM in the Atos BMC. This would allow to add this security feature in HPC environment where no TPM is usually implemented for physical space reason. But it cannot be claimed the same level of security, since the TPM chips are usually certified at high level of security (see for instance [36]). For the threat model we described, dedicated to Platform security and Firmware integrity, there is no significant change in the risks. However, for cryptographic storage of user keys, which is one of the key features of a TPM for the end user, the risk assessment would have to be considered accordingly. 4. Atos’s PA RoT Implementation 4.1. Ownership Transfer Preparation Allowing the change of cryptographic keys to the platform owner is difficult to ensure while preserving the overall security, since the purpose of anchoring RoT in the hardware is to prevent software-based attacks which could change the keys used for firmware verification. Even if these keys are public, their integrity is of utmost importance to the security objectives. Signed firmware is used to authenticate firmware before critical functions: 1. CTD: During boot sequence to ensure that the next step of the boot sequence will activate an authenticated binary code. 2. CTU: During updating process to ensure that the firmware image is authenticated before flashing it in RAMs. These two chains are independent and complementary. Three types of keys can therefore be identified to verify the signature of a firmware: 1. At the beginning of the boot sequence, to benefit from Hardware root-of-trust (red key in the Figure 3). 2. During boot sequence where a public key embedded in firmware can be used to pursue the chain-of-trust in a flexible way (orange key in the Figure 3). Proceedings of the 29th C&ESAR (2022) 121 Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it 3. Before flashing, where a public key can be used to verify the signature of the firmware payload. The corresponding public key can be stored either in a hardware secured part of the SOC, or in a firmware provided it is protected by another chain-of-trust (see green public key in the Figure 3 as an example). Figure 3 - Type of firmware signing cryptographic keys Except the root-of-trust key secured at hardware level, all public keys used to verify firmware integrity and/or boot chain integrity must be part of a signed firmware. This ensures that the modification of the verification keys is authenticated provided the key store is properly implemented. It is important to understand that the hardware secure boot is very limited in practice. It only secures the first stage of booting, a little program which cannot exceed a few kilobytes of code (63 K for ARM), because it will be loaded in ARM memory for signature verification. All the other operations of a chain- of-trust for detection (CTD) or chain-of-trust for upgrade (CTU) will exceed this limit and will therefore rely on software-based security. Yet, this also allows ownership transfer provided the customer trusts its vendor, which seems a legitimate hypothesis. 
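As a companion sketch for the upgrade side, under the same illustrative assumptions as the boot-time sketch above (Ed25519 signatures, an ad hoc container layout of our own), the running firmware holds a payload verification key, playing the role of the green key of Figure 3 and itself delivered inside a CTD-verified image, and refuses to flash any update whose signature does not verify.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

SIG_LEN = 64  # Ed25519 signature size

def ctu_check(payload_verification_key: bytes, update_image: bytes) -> bytes:
    # payload_verification_key plays the role of the "green" key: it is itself
    # shipped inside a firmware that the CTD authenticated at boot time.
    signature, payload = update_image[:SIG_LEN], update_image[SIG_LEN:]
    pub = ed25519.Ed25519PublicKey.from_public_bytes(payload_verification_key)
    try:
        pub.verify(signature, payload)
    except InvalidSignature:
        raise RuntimeError("update rejected: nothing is flashed")
    return payload  # only a verified payload is handed over to the flashing routine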
Ownership transfer is indeed possible because at least the CTD key is included in the first-stage boot code, which makes it possible to change this key while preserving the overall security of the scheme. Depending on the needs, the CTD and CTU keys may or may not be owned by the vendor. Consequently, in theory, the chain-of-trust keys can be changed for each owner, provided at least one signature is performed involving the hardware RoT key owned by Atos.

4.2. Firmware Key Management
The root of trust is the initial public key, which must be inserted and secured by hardware security measures. The root-of-trust must be immutable once the hardware security measure is in place. Therefore, the corresponding private key is of critical importance in a production environment. Due to the hardware hardening, it is not possible to update a root-of-trust key by firmware upgrade; the only allowed operation may be to invalidate a compromised key. It is therefore mandatory to anticipate the compromise of such a key by organizational measures, and by generating several backup keys that will be injected in the hardware in case of a compromise (see 4.2.2).

4.2.1. ARM Secure Boot Specification
For intellectual property reasons, we cannot reproduce here the precise way the ARM Secure Boot is implemented. We will therefore just sketch the main points to help understand how the TEA root-of-trust key is managed.
By default, secure boot is not enabled. This is mandatory in the design process, since the initialization of the ARM component needs to boot the CPU: the usual chicken-and-egg situation requires the component to be initially insecure. Several ways to activate secure boot are available. To simplify, let us say that we have:
1. A reversible way to activate hardware secure booting, through jumpers on hardware pins.
2. An irreversible way to activate hardware secure booting, through one-time-programmable (OTP) hardware memories in the SoC.
The first option is intended for development, testing, and qualification. At the end of the production process, the second option is used to set the component in secure mode. Once set in secure mode, only a signed first stage can be used to boot the SoC (see Figure 3). The signature is verified against the keys which have been inserted in the component.
Consequently, it is possible and recommended to introduce the public keys in the ARM core as early as possible in the factory process. This has no immediate impact and no risk of bricking the SoC provided the secure boot is left in reversible mode, and it has the advantage of personalizing the component early in the process, making it more difficult to tamper with (see Table 1).

Table 1: PA-RoT states
State    | DEV Key                  | PROD Keys   | Secure Boot            | Usage
OPEN     | Not present              | Not present | Disabled               | SoC reception
DEV      | Activated                | Activated   | Reversibly activated   | Development
DEV-PROD | Reversibly deactivated   | Activated   | Reversibly activated   | Qualification, Validation
CLOSED   | Irreversibly deactivated | Activated   | Irreversibly activated | Production

4.2.2. Secure Boot Spare Keys
For obvious security reasons, the hardware secure boot public keys are also injected through OTP memories and cannot be changed once the component is in secure mode. For a given component, the same keys will therefore be in use from day one of its production until its end-of-life. If the component is to be used for ten years, the corresponding private keys must be protected during this whole time.
Over such a long period, one must anticipate risks such as key loss, key compromise, identity usurpation in the firmware signing chain, and so on. As a first consequence, an HSM should always be used to protect the hardware secure boot private keys. This is consistent with the sensitivity of keys which cannot be changed in the field if an incident occurs. Secondly, the set of hardware secure boot keys should not be limited to a single key. It is therefore separated into one production key and several spare keys. Any of these keys could sign a firmware recognized by the BMC, but the only one in use is the production key. The spare keys are created just in case something goes wrong with the production key. They must be created at the same time, because their public part is injected in production, and the protection of their private part is of utmost importance.

4.2.3. Private Key Protection
Not all the keys used to ensure trust in the platform deserve the same level of protection. For instance, DEV keys are considered insensitive: they will be deactivated in production and are therefore considered extractable from the HSM. This allows some development to be outsourced without giving access to the HSM signature mechanisms. This is obviously not the case for PROD keys, which are generated, stored, and used in an Atos Trustway Proteccio HSM [30] configured in RGS mode [37]. All production keys are saved in backups protected by a 3-out-of-6 Shamir scheme [38]. Besides, access to the signature function is controlled:
- By logical measures for the chain-of-trust keys.
- By the use of a smart card for the HW RoT production key.
- By the use of a smart card AND the possession of the key backup for the HW RoT spare production keys.

4.3. Trusted Execution Architecture (TEA)
The System-on-Chip (SoC) used in the BMC embeds the TrustZone technology which is part of its ARM core [17]. This is used to host a hardened operating system, TeaCore, provided by ProvenRun to implement a Trusted Execution Environment (TEE) as seen in section 2.4. The TEE is then used to secure the two chains-of-trust related to secure boot and firmware upgrade (see Figure 4). TeaCore is based on a proven operating system, ProvenCore, which has been certified by ANSSI at EAL7 level in a different context [39]. It provides the flexibility to later add new security features such as secure storage of cryptographic keys, a firmware TPM, and/or flash runtime monitoring. Together with the HW RoT key anchored in the silicon of the BMC, TeaCore provides the architecture to implement a full Platform Active Root-of-Trust as proposed by OCP. As of today, the chains of trust for detection and upgrade are implemented. Next steps could introduce attestation mechanisms.
Figure 4 – Trusted Execution Environment of the Chains-of-Trust. The figure shows the Atos BMC Linux in the normal world, accessing, through a TeaCore Secure Services abstraction layer, trusted applications (secure boot, firmware upgrade, cryptography, secure storage) hosted by TeaCore in the TrustZone secure world.

4.4. Full Secure Boot Sequence
We now have all the information needed to understand the full boot sequence of an Atos server implementing the new Trusted Execution Architecture. For this example, we will use the case of a server based on an Intel CPU implementing the TXT technology [16].
This technology implements its own RoT which signs the Initial Boot Block (IBB), the first step of the Intel TXT secure boot sequence. If activated, the Intel RoT will prevent any change of the IBB which is not properly signed. The secure boot sequence of the CPU can also imply the TPM, either a physical one if present, or the firmware implementation by Intel [40], or the one Atos could add in TEA using the TrustZone technology. This boot process will end at Operating System level. In the case of Windows 11, BitLocker can use the TPM and check through the measurements that the boot process was sane. Atos TEA do not interfere with this process. It only adds at the beginning a preliminary verification of the IBB. Since the BMC is responsible for powering on the CPU, it will use its TrustZone to perform a signature verification of the IBB and won’t power-on the CPU in case of a signature error. The whole sequence will therefore start from the ARM secure boot sequence of the BMC and ensure firmware integrity through the different existing mechanisms (see Figure 5). The approach would work the same way for another type of CPU. Figure 5 - Chains-of-Trust for Detection 5. Potential Application in HPC Now that we have a TEE enabled in our servers, from Edge to Enterprise servers, let’s see the type of application we could envision in a High Performance Computing (HPC) environment. 5.1. How an HPC Could Be a Unique Device One drawback of the Platform Active Root-of-Trust approach is that the secure component becomes a single point of failure for the system. This is especially true when comes the definition of the Unique Device Secret (UDS). As introduced in section 2.6.2, the attestation mechanism assumes the implementation of some hierarchical certification where each vendor attests the integrity of its product using some public key infrastructure mechanism. The PA RoT will therefore use all these UDS to control the authenticity of the platform components with some signature mechanism, and the Proceedings of the 29th C&ESAR (2022) 125 Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it verification of the certificates of the public keys. The trusted cryptoprocessor PA RoT is also the privileged place to store the UDS of the platform itself, allowing it to become attester to the remote management infrastructure (verifier). But this approach reaches some limits when it comes to big cloud infrastructure or high-performance computers. As an example, one of the recently deployed Atos HPC platforms counts 300 000 computing cores shared among roughly 4500 CPUs [41]. Which one of those components will host the UDS and identify the HPC in a unique way? 5.2. An HPC Architecture Overview It is not the purpose here to detail the architecture of an HPC installation. Besides, this is an evolving matter. Schematically, an HPC framework will gather nodes of different types exchanging data through an Ethernet or Interconnect fast network. Management Nodes or Rack Management Controllers (RMC) can exist to manage a physical cluster of a hundredth of computing nodes. Each cluster is interconnected with the other clusters to form the overall HPC (see Figure 6). Figure 6 – Cluster architecture From an operational point of view, the access to the computing power is devoted to some login nodes, and one of the management software roles is to schedule the job requests submitted to the login nodes to optimize the computing power. 
Each job will get allocated some computing nodes and storage resources for an amount of time, depending on the pre-requisites of the job request. The whole purpose of the architecture is to avoid latency in messages exchange between the computing nodes and in input/output writing on the storage nodes, while dealing with astronomically high amount of data. In a standard attestation approach, a compute node would have to check the attestation status of the storage node before sending data to it. This would mean data exchange between the nodes consuming the interconnect bandwidth. Even if it may sound marginal, keep in mind that these machines are pushing the specifications at their limits, and are subject to some avalanche effects when unwanted events occur. On the other hand, all the development framework around HPC has already incorporated error events because the scale of the HPC makes plausible to encounter errors when the machine is in use. Hardware faults, hot swaps, are part of the normal use of an HPC computer. This can be a drawback in a classical platform firmware attestation mechanism, but it can also be turned to our advantage. 5.3. The HPC DNA: a Patented Approach Under the hypothesis that the Trusted Execution Architecture is implemented, an immediate benefit arises from it. All nodes will get a TEE through their BMC (see Figure 7). Of course, the same would occur if each node would come with a TPM or any form of PA RoT secured chip. But the truth is that 126 Proceedings of the 29th C&ESAR (2022) F. Chabaud adding some secure element in these nodes is not that easy for physical constraints (power alimentation, cooling, space) while a BMC is mandatory anyway. So now, we have this trusted capacity on all our nodes, and we can leverage it. Figure 7 – Example of the management network of a cluster Like a living body can identify its cells through characteristics determined by the DNA common to all cells, the idea of a local hardware-secured zone keeping some DNA-like secrets shared by all the machine nodes, makes possible for each node of the machine to perform the access controls, without relying on a remote server to determine if a communication is tampered or not. In other words, this generalizes the notion of Unique Device Secret (UDS) to a global platform such as a High-Performance Computer (HPC) or a Cloud-based infrastructure. Any node of the machine will assume that its counterpart possesses the shared secret. It is therefore possible to encrypt communication under this assumption. If the counterpart fails to decrypt it, this will be treated as a glitch or hardware failure using the normal exception mechanisms of HPC development libraries. 5.4. Unique Secret Generation The powering-on of an HPC machine is done in several steps. For instance, a rack will be powered- on before its computing blades can be powered-on. There is no guarantee on the order of the powering- on. Some computing blades can be powered-on before another rack is powered-on. The hypothesis is that all nodes will eventually establish a connection through a management network, without guarantee that all connections are feasible. For simplification, we will assume that a Rack Management Controller exists, which can be seen as a BMC dedicated to the management of all the BMCs of a rack. We will also assume that the sequence of initialization starts an RMC before the BMC it manages. 
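Before detailing how this machine-wide secret is generated, the following sketch illustrates the usage just described: node-to-node messages are protected with an authenticated cipher keyed by the shared "DNA" secret, and an authentication failure is surfaced through the ordinary hardware-fault path rather than as a security alarm. The cipher choice, the message framing, and all names are our own assumptions made for illustration.

import os
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

NONCE_LEN = 12

class HardwareFault(Exception):
    """Surfaced through the same error path as an ordinary node glitch."""

def seal(machine_secret: bytes, payload: bytes, sender_id: bytes) -> bytes:
    # Encrypt and authenticate a message with the machine-wide "DNA" secret.
    nonce = os.urandom(NONCE_LEN)
    return nonce + ChaCha20Poly1305(machine_secret).encrypt(nonce, payload, sender_id)

def open_sealed(machine_secret: bytes, message: bytes, sender_id: bytes) -> bytes:
    # A peer that does not hold the current secret cannot produce a valid message.
    nonce, ciphertext = message[:NONCE_LEN], message[NONCE_LEN:]
    try:
        return ChaCha20Poly1305(machine_secret).decrypt(nonce, ciphertext, sender_id)
    except InvalidTag:
        raise HardwareFault("message rejected: peer does not share the machine secret")

# Both nodes were provisioned with the same 256-bit secret by their RMC.
dna = os.urandom(32)
msg = seal(dna, b"halo exchange, chunk 42", b"node-0017")
assert open_sealed(dna, msg, b"node-0017") == b"halo exchange, chunk 42"

We now return to how the secret itself is generated and propagated.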
The process must therefore ensure the following properties: - If the machine is powered-on, a new secret must be generated by the first powered on RMC. - If the machine is powered-off, which is an unlikely event, a new secret must be generated on next power-on by the first RMC that will be powered on. Proceedings of the 29th C&ESAR (2022) 127 Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it - If several RMCs are powered-on in parallel, a negotiation mechanism must converge towards a single secret. - Any new RMC or BMC that will be powered on must get the secret in a secure way. Generating a secret in the RMC could be a challenge from a cryptographic perspective, since we do not assume a physical random number generator is available on the BMC. It seems feasible as soon as a reliable random noise generator is available. One needs to avoid the repeatability of the boot process which could lead to the generation of the same secret on each boot. To ensure a proper source of noise, the global entropy must reach at least 256 bits. Fortunately, the RMC is gathering a lot of physical sensors information which can be leveraged to assembly enough noise in the random number generator of the device. When a new component is added to the management network, it can be of two kinds: - If it is a BMC, it will get the secret from its RMC. - If it is an RMC, it will negotiate its secret against the other RMC as follows. One cannot know in advance the order of RMC appearance in the management network. And in the life of the machine, some racks may be shut down for maintenance, then reconnected. One wants the secret to stabilize as soon as possible while preserving the history of the used keys. This is a possible application for blockchain technology and decentralized consensus making. If RMCa and RMCb have booted and generated their secrets Sa and Sb, one cannot choose among these secrets. But a cryptographic mechanism can take place to establish a common secret Sab. This secret is timestamped in the blockchain and becomes its first block. Two blocks are then added with Sa and Sb. If RMCc joins later with its secret Sc it will have to adopt the secret Sab. And the block Sc is added to the chain. The block chain length is therefore related to the number of racks whose secrets were changed so far. If two racks exchange two different block chains with the same initial secret Sab, the block chain is reconciled with the missing blocks. If the two initial secrets differ, the longer chain will be privileged. The blocks of the shorter chain will be added. If the chains have the same length the chain with the smaller hash will be kept. If an RMC has to change its secret after negotiation, it has to propagate the new secret to its rack components. The blocks to add in the blockchain indicate the other RMCs to inform of the secret change. 5.5. Security Discussion To prevent the secrets from being compromised when a computing blade is extracted, the corresponding secrets are stored on RAM in an encrypted way. The component extraction powers the component off by design. This ensures that at least a portion of the encrypted key is erased. The encryption mechanism can therefore guarantee the disappearance of the key if a sufficient portion of the key is lost. This security mechanism is prone to cold boot attacks [42] but this kind of scenario is mitigated if the secret is updated regularly. 
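To make the negotiation rules of section 5.4 concrete, here is a minimal sketch of the chain reconciliation between two RMCs. The block layout, the use of the last block's hash for tie-breaking, and the absence of any re-linking of appended blocks are our own simplifying assumptions; the paper only fixes the rules encoded in reconcile().

import hashlib
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    prev_hash: str    # digest of the previous block ("" for the shared genesis block Sab)
    secret_id: str    # identifier of the rack secret recorded by this block
    timestamp: float

    def digest(self) -> str:
        data = f"{self.prev_hash}|{self.secret_id}|{self.timestamp}"
        return hashlib.sha256(data.encode()).hexdigest()

def head_hash(chain: list) -> str:
    return chain[-1].digest() if chain else ""

def append_block(chain: list, secret_id: str) -> list:
    return chain + [Block(head_hash(chain), secret_id, time.time())]

def reconcile(local: list, remote: list) -> list:
    # Same initial secret Sab: keep the local view and add the missing blocks.
    if local and remote and local[0] == remote[0]:
        known = {b.digest() for b in local}
        return local + [b for b in remote if b.digest() not in known]
    # Different initial secrets: the longer chain wins; on equal length the chain
    # whose last block has the smaller hash is kept; the loser's blocks are then
    # appended (hash links are not recomputed in this simplified sketch).
    if len(local) != len(remote):
        winner, loser = (local, remote) if len(local) > len(remote) else (remote, local)
    else:
        winner, loser = (local, remote) if head_hash(local) < head_hash(remote) else (remote, local)
    return winner + loser

# Two RMCs that negotiated the common secret Sab, then generated their own rack secrets:
genesis = Block("", "Sab", time.time())
rmc_a = append_block([genesis], "Sa")
rmc_b = append_block([genesis], "Sb")
merged = reconcile(rmc_a, rmc_b)  # contains Sab, Sa and Sb; both racks converge on it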
Before delivering the secret to a new component of the machine, it is of course important to determine if the new component is sane. It is at this step that remote attestation protocol can be used. The UDS at node level makes perfect sense for this as it is inserted at factory time to build security upon trusted remote attestation of a component (see 2.6.2). This pre-inserted private key should never be exposed outside its security module. Zero-knowledge protocols can use the key to attest the authenticity of the TPM-like feature remotely. This way, a newly inserted component can be checked remotely for sanity before providing the secret. And the private key is needed to decipher the secret, protecting it on first communication. 6. Conclusion Based on well-known concepts of Product Security, Atos has implemented a Trusted Execution Architecture (TEA), common to all its servers. The trust in this implementation is founded on: 1. Public cryptographic root-of-trust keys anchored in silicon. 128 Proceedings of the 29th C&ESAR (2022) F. Chabaud 2. Private keys protected by an RGS certified Atos Trustway Proteccio HSM. 3. The well-known ARM TrustZone technology embedded in the existing BMC component of our platforms. 4. The hardened operating system TeaCore developed by ProvenRun on Atos specification and based on their formally proven and EAL7 certified operating system ProvenCore. This TEA is first used to ensure some Platform Firmware Resiliency through firmware signatures verified at boot time and before any upgrade. Its generalization to all Atos-made platforms makes possible some innovative security features. As an example, we presented an innovative approach to device attestation applicable to High-Performance Computing (HPC) environments which generalizes the notion of Unique Device Secret (UDS) to a global platform such as a HPC or a Cloud-based infrastructure. 7. References [1] Howard S. Dakoff, The Clipper Chip Proposal: Deciphering the Unfounded Fears That Are Wrongfully Derailing Its Implementation, 29 J. Marshall L. Rev. 475 (1996). [2] Y. Frankel and M. Yung. Escrow Encryption Systems Visited: Attacks, Analysis and Designs. Crypto 95 Proceedings, August 1995. [3] Philip Zimmermann - Why I Wrote PGP (June 1991 – updated 1999) URL: http://www.philzimmermann.com/EN/essays/WhyIWrotePGP.html [4] Trusted Computing Group. URL: https://trustedcomputinggroup.org/ [5] Information technology — Trusted Platform Module, International Standards Organization ISO/IEC 11889 series (2009-2015). [6] Open Compute Project. URL: https://www.opencompute.org [7] OCP Membership Tiers. URL: https://www.opencompute.org/membership [8] Michel Ugon. Support d’information portatif muni d’un microprocesseur et d’une mémoire morte programmable, CII-Honeywell-Bull patent FR77.26107. 26/8/1977. [9] Michel Ugon. Portable data carrier including a microprocessor. CII-Honeywell-Bull patent US4.211.919A. 26/8/1977. https://patents.google.com/patent/US4211919A [10] Michel Ugon. Single chip microprocessor with on-chip modifiable memory, Bull CP8 patent US4.382.279. 25/4/1978. https://patents.google.com/patent/US4382279A/en?oq=4.382.279 [11] P. Kocher, J. Jaffe, B. Jun, "Differential Power Analysis" Advances in Cryptology - Crypto 99 Proceedings, Lecture Notes In Computer Science Vol. 1666, M. Wiener, ed., Springer-Verlag, 1999. 
[12] Common Criteria for Information Technology Security (ISO/IEC 15408) https://www.commoncriteriaportal.org/ [13] Security Requirements for Cryptographic Modules, Federal Information Processing Standards Publication 140-3, National Institute of Standards and Technology, March 22, 2019. URL: https://doi.org/10.6028/NIST.FIPS.140-3. [14] Jean-Baptiste Bédrune and Gabriel Campana. Everybody be cool, this is a robbery! SSTIC 2019 Proceedings, June 2019. URL: https://www.sstic.org/media/SSTIC2019/SSTIC- actes/hsm/SSTIC2019-Article-hsm-campana_bedrune_neNSDyL.pdf [15] M. Sabt, M. Achemlal and A. Bouabdallah, "Trusted Execution Environment: What It is, and What It is Not," 2015 IEEE Trustcom/BigDataSE/ISPA, 2015, pp. 57-64, doi: 10.1109/Trustcom.2015.357 [16] Intel Trusted Execution Technology (Intel® TXT) Software Development Guide, rev. 017.3 March 2022 https://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-txt-software- development-guide.pdf [17] ARMLtd, “Arm security technology - building a secure system using trustzone technology,” Rev. C, April 2009. https://developer.arm.com/documentation/PRD29-GENC-009492/c [18] Di Shen. Attacking your “Trusted Core” - Exploiting TrustZone on Android. BlackHat USA 2015. https://www.blackhat.com/docs/us-15/materials/us-15-Shen-Attacking-Your-Trusted-Core- Exploiting-Trustzone-On-Android.pdf Proceedings of the 29th C&ESAR (2022) 129 Setting Hardware Root-of-Trust from Edge to Cloud, and How to Use it [19] Laginimaineb, "Bits, Please!: Full TrustZone exploit for MSM8974" 8 octobre 2015. https://bits- please.blogspot.com/2015/08/full-trustzone-exploit-for-msm8974.html [20] Nagra-Certified Secure Video/Audio Chipsets Surpass 80 Million Mark, 2015. https://dtv.nagra.com/nagra-certified-secure-videoaudio-chipsets-surpass-80-million-mark [21] Microchip CEC1702 Data Sheet, 2019. https://www.microchip.com/en-us/product/CEC1702 [22] Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/ [23] Datacenter Secure Control Module (DC-SCM) Specification (2021). URL: https://www.opencompute.org/documents/ocp-dc-scm-spec-rev-1-0-pdf [24] OCP Platform Security Overview. URL: https://docs.google.com/document/d/1- bfAF86cEKcn1guF-Qj2C2HhMM2oJ2njNGdHxZeetR0/edit# [25] Special Publication 800-193 Platform Firmware Resiliency Guidelines, National Institute of Standards and Technology, May 2018. URL: https://doi.org/10.6028/NIST.SP.800-193 [26] Attestation of System Components v1.0 Requirements and Recommendations (2020) https://www.opencompute.org/documents/attestation-v1-0-20201104-pdf [27] Security Protocol and Data Model (SPDM) Specification https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.0.0.pdf [28] Libspdm, a sample implementation of the DMTF SPDM specification https://github.com/DMTF/libspdm [29] Project Cerberus Firmware Challenge Specification https://github.com/opencomputeproject/Project_Olympus/tree/master/Project_Cerberus [30] Atos Trustway Proteccio netHSM https://atos.net/en/solutions/cyber-security/data-protection-and-governance/hardware-security- module-trustway-proteccio-nethsm [31] Open Titan: the first open source project building a transparent, high-quality reference design and integration guidelines for silicon root of trust (RoT) chips. https://opentitan.org/ [32] Dominic Rizzo. OpenTitan at one year: the open source journey to secure silicon. Google Open Source Blog, 7 December 2020. 
https://opensource.googleblog.com/2020/12/opentitan-at-one- year-open-source.html [33] Fabien Périgaud, Alexandre Gazet and Joffrey Czarny. Backdooring your server through its BMC: the HPE iLO4 case. SSTIC 2018 proceedings. [34] Open BMC: Defining a Standard Baseboard Management Controller Firmware Stack. https://www.openbmc.org/ [35] Joel Stanley. Securing firmware: Secure and Trusted boot in OpenBMC. January 2020, LCA 2020. https://archive.org/details/lca2020-Securing_firmware_Secure_and_Trusted_boot_in_OpenBMC [36] Infineon Techologies AG OPTIGA™ Trusted Platform Module SLB9672_2.0 v15.20.15686.00 Common Criteria Part 3 conformant EAL 4 augmented by ALC_FLR.1 and AVA_VAN.4 Certification Report. BSI-DSZ-CC-1113-2021. 21 May 2021. https://www.commoncriteriaportal.org/files/epfiles/1113a_pdf.pdf [37] Arrêté du 13 juin 2014 portant approbation du référentiel général de sécurité et précisant les modalités de mise en œuvre de la procédure de validation des certificats électroniques, JORF n°0144 du 24 juin 2014. https://www.ssi.gouv.fr/entreprise/reglementation/confiance- numerique/le-referentiel-general-de-securite-rgs/ [38] Adi Shamir. "How to share a secret", Communications of the ACM, 22 (11): 612–613, 1979. [39] ProvenCore secure OS achieves EAL7 Common Criteria certification, 13 September 2019. https://provenrun.com/provencore-secure-os-achieves-eal7-common-criteria-certification/ [40] Darek Fanton. Intel Platform Trust Technology – TPM for the Masses. 6 July 2022. https://www.onlogic.com/company/io-hub/intel-platform-trust-technology-ptt-tpm-for-the- masses/ [41] Patricia Pottier. The NWP systems at Météo-France. 30th ALADIN Wk & HIRLAM ASM 2020. https://www.umr-cnrm.fr/aladin/IMG/pdf/poster-france-wk2020-web.pdf [42] Sergei Skorobogatov. Low temperature data remanence in static RAM. Technical report UCAM- CL-TR-536. University of Cambridge Computer Laboratory, June 2002. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-536.pdf 130 Proceedings of the 29th C&ESAR (2022)