# From New Technologies to New Solutions Exploiting FRAM Memories to Enhance Physical Security

Stéphanie Kerckhof<sup>1</sup>(⊠), François-Xavier Standaert<sup>1</sup>, and Eric Peeters<sup>2</sup>

 <sup>1</sup> ICTEAM/ELEN/Crypto Group, Université Catholique de Louvain, Charleroi, Belgium
 <sup>2</sup> Texas Instruments, Dallas, TX, USA
 {stephanie.kerckhof,fstandae}@uclouvain.be, e-peeters@ti.com

**Abstract.** Ferroelectric RAM (FRAM) is a promising non-volatile memory technology that is now available in low-end microcontrollers. Its main advantages over Flash memories are faster write performances and much larger tolerated number of write/erase cycles. These properties are profitable for the efficient implementation of side-channel countermeasures exploiting pre-computations. In this paper, we illustrate the interest of FRAM-based microcontrollers for physically secure cryptographic hardware with two case studies. First we consider a recent shuffling scheme for the AES algorithm, exploiting randomized program memories. We exhibit significant performance gains over previous results in an Atmel microcontroller, thanks to the fine-grained programmability of FRAM. Next and most importantly, we propose the first working implementation of the "masking with randomized look-up table" countermeasure, applied to reduced versions of the block cipher LED. This implementation provides unconditional security against side-channel attacks (of all orders!) under the assumption that pre-computations can be performed without leakage. It also provides high security levels in cases where this assumption is relaxed (e.g. for context or performance reasons).

#### 1 Introduction

Providing (physical) security against side-channel attacks is a challenging task for cryptographic designers [10]. This is especially true for low-cost embedded devices, with strongly constrained resources. Typical examples of countermeasures in this context include masking [3] and shuffling [9]. But in both cases, the concrete security levels attained by the protected implementations highly depend on hardware assumptions, in particular the amount of noise in the measurements which may not be sufficient, e.g. in 8-bit devices [19,20]. The performance overhead they imply can also be significant [7,15]. As a result, such countermeasures are usually combined in a somewhat heuristic manner, in order to ensure "practical" security against a wide enough category of adversaries [16].

In parallel to these advances, some more recent works have tried to formalize the problem of physical security, in order to extend the guarantees of provable security from algorithms and protocols to implementations. The main challenge in this case is to find relevant restrictions of the adversaries. A typical example is the one of leakage-resilient cryptography, where the assumption is that the information leakage of a single algorithm run is bounded (see, e.g. [5] for an early reference, [17] for a recent one, and many other proposals in between). Alternatively, another line of work is based on the assumption of secure pre-computations, e.g. in order to prevent "continual memory attacks", as formalized by Brakerski et al. [2] and by Dodis et al. [4]. The one-time programs introduced at Crypto 2008 are another (extreme) way to exploit such secure pre-computations [6]: they essentially correspond to a program that can be executed on a single input, whose value can be specified at run time. Nevertheless, the practical relevance of these solutions is still limited by sometimes unrealistic hardware assumptions and (mainly), by large performance overheads.

In this paper, we start from the observation that both for practice-oriented and theory-oriented countermeasures against side-channel analysis, the exploitation of secure pre-computations is highly related to the problem of fast and efficient non-volatile storage. In this context, a significant drawback of the mainstream Flash memories is that write operations are slow and energy-consuming. Furthermore, their number of tolerated write/erase cycles is also limited (from 10 k to 100 k, typically), which may prevent their frequent use for cryptographic operations. Interestingly, the recently available Ferroelectric RAM (FRAM) provides a solution to these issues<sup>1</sup>. As a result, we investigate whether it can be used as a technology enabler to improve the performances and security of protected implementations. For this purpose, we first consider the shuffling countermeasure, and its instantiation for the AES algorithm based on randomized program memories proposed in [20]. We show that FRAM allows significantly improved performances in terms of pre-computation time. Next, we discuss the application of these new memories to the Randomized Look-Up Table (RLUT) countermeasure [18]. It can be viewed as a type of one-time program specialized to side-channel analysis, or as a generic masking scheme that provides unconditional security against side-channel attacks of all orders (i.e. independent of the statistical moment estimated by the adversary). We provide the first working implementation of this solution applied to reduced versions of the block cipher LED [8], and analyze its performances in various settings. In particular, we investigate the contexts of complete and secure pre-computations, and the tradeoffs corresponding to partial (and partially leaking) ones. We reach performances that are close to higher-order masking in the latter case [15], while complete and secure pre-computations also ensures much higher security levels at practically reachable cost. Therefore, our results suggest that FRAM is a promising solution for improving the security of low-cost tokens such as smart cards, especially when some pre-computations can be performed in a safe environment.

Strictly speaking, FRAM is not a new technology as it was introduced as a high-security alternative to Flash memories back in the early 2000 s by Fujitsu. However, FRAM-based smart cards did not make it to mass market at that time, due to excessive manufacturing costs and limited ability to reduce cell transistor size.

The rest of the paper is structured as follows. Section 2 provides the necessary background on FRAM and discusses our security model. Section 3 contains the implementation results of the shuffling with randomized program memory countermeasure, and their comparison with the previous work from Asiacrypt 2012. Section 4 describes the RLUT countermeasures, our proposed implementation and the various tradeoffs it provides. Finally, conclusions are in Sect. 5.

## 2 Background

#### 2.1 FRAM Microcontrollers

Standard solutions for non-volatile storage such as Flash and EEPROM usually suffer from long programming times, as well as a high voltage required to program bit cells with hot carrier injection or Fowler-Nordheim tunneling effect. In addition, the charge pump overhead as well as the high current supply they require make these technologies not ideal for applications where frequent data logging or ultra-low-power write operations are needed (e.g. all RF applications such as e-passport, RF Banking Card, ...). FRAM is a promising alternative that combines the advantages of non-volatile memories with much faster write speed (e.g. 125 ns per 64-bit word for the 130-nm TI FRAM technology exploited in MSP430FR devices), less power (82 uA/MHz active power in the same technology) and infinite (10<sup>15</sup>) write-erase cycle performances.

FRAM stores information through the use of a stable electric dipole found in ferroelectric crystals (insensitive to the magnetic field). The polarization-voltage hysteresis loops for such materials are very similar to the B-H curve of magnetic materials. Exploiting this fact, a FRAM bit cell structure consists of a ferroelectric capacitor containing the crystal. The capacitor is connected to a plate line, bit lines, and a transistor switch to access it. This is also referred to as a 1T-1C memory cell mode (Fig. 1, left). By contrast, 2T-2C memory cell modes (Fig. 1, right) would store the data as 2 opposite values in each 1T-1C cell of its



Fig. 1. FRAM bit cell modes: left: 1T-1C structure, right: 2T-2C structure.

structure (similar to what is found in EEPROM for instance). Reading the data from FRAM occurs by placing a voltage on the plate line. The idea is that for every read operation, one tries to set the cell to a 0 state. If the voltage causes the dipoles inside the capacitor to flip its orientation, then a large charge Q is generated on the bit line. On the contrary, if the orientation of the dipole is already negative prior to applying the voltage to the plate line in a read cycle, then the dipole direction does not flip, and only a small charge Q is induced on the bit line. The difference can be measured by a sense amplifier. An important consequence of this description is that FRAM reads are destructive, and therefore require a refresh process. Nevertheless, this process is automatically completed by the controller and therefore transparent to the user.

In this paper, we used a microcontroller of the MSP430FRxxxx family from Texas Instruments. This type of microcontrollers provides an ultra-low-power 16-bit RISC CPU, and a set of instructions performing operations on either 8 or 16 bits of data. The available non-volatile FRAM memory can have a size of up to 64 kb (microcontrollers with 128- and 256-kilobyte capabilities are already announced). All the developed code was made for a MSP430FR5739 microcontroller, containing 16 kb of FRAM, and was tested using the MSP-EXP430FR5739 experimenter's board and Code Composer Studio 5.3.

#### 2.2 Security Model

The following sections mainly aim at demonstrating the efficiency of FRAM-based cryptographic computations. Yet, since we consider implementations protected against side-channel attacks, it is important to say a few words about the security model we rely on. Both for the shuffling in Sect. 3 and for the RLUT countermeasure in Sect. 4, we can consider two alternatives:

- 1. Secure pre-computations. That is, the permutations used in shuffling and the randomized program used in RLUT are pre-computed without leakage, prior to the execution of the cryptographic algorithm. As a result, the security of the shuffling is exactly the one analyzed at Asiacrypt 2012 [20]. And the security of the RLUT countermeasure is unconditional: even an adversary accessing the (identity) leakage of all the intermediate computations in the target implementation would not recover any information about its key.
- 2. Leaking pre-computations. That is, the permutations used in shuffling and the randomized program used in RLUT are computed online, and leaking information. In this case, the security of both countermeasures is less investigated, and essentially depends on how much information is leaked during precomputation (in fact, the same observation holds for most countermeasures exploiting randomness, e.g. masking). Note however that the randomness used to protect these implementations is generated on-chip, and is never output. Hence, adversaries can only mount SPA attacks against it. As a result, we can informally state that our implementations will remain secure in this context, as long as one can guarantee SPA security for this part of the computation (a similar informal separation between SPA and DPA was used to argue about the security of fresh re-keying schemes, e.g. in [12]).

Note that in terms of security, the main advantage of FRAM is to make the first model more realistic. Indeed, one could (at least theoretically) imagine to implement a randomized program with SRAM memories. But in addition to performances that would most likely be poor in this case, such a solution should anyway be implemented online (hence leaking), since SRAM is volatile.

### 3 Improving Past Results: The Shuffling Case

Shuffling the execution order of independent operations is a possible solution to improve security of cryptographic implementations against side-channel attacks. The goal of shuffling is to distribute the intermediate cipher values over a given period of time, so that an attacker will only be able to observe a chosen intermediate value at a particular moment in time with a certain probability. A typical example of independent operations that can be shuffled is the SubBytes layer in the AES. Indeed, whatever the order in which each of the 16 S-Box's outputs is generated, the result of the SubBytes layer will not be affected.

A previous work on shuffling, proposed 3 implementations of the AES on an Atmel ATMega644p microcontroller [20]: a basic one with double indexing, an optimized one with randomized execution path, and a variant with randomized program memory. In this paper, we focus on this third proposal, for which FRAM technology provides significant improvements (the two first ones lead to essentially similar performances, independent of the non-volatile memory used). Randomizing the program memory corresponds to rewriting the code in a randomized way before each algorithm run. In other words, a pre-computation phase modifies (inside the code) which registers and memory addresses will be used during the execution, which then remains essentially the same as an unshuffled one. While promising in principle, such an instantiation of the shuffling idea faces some limitations when implemented in Flash-based Atmel microcontrollers. First, even when only a few bytes need to be modified, a complete memory page must be erased and rewritten, which takes a lot of time (more or less 4.5 ms each time a page is written or erased). Next, the memory can only be rewritten a limited number of times (namely 10000). Hence, FRAM-based microcontrollers are natural candidates to relax these limitations as we now detail.

Implementing the AES algorithm with randomized program memory requires three main functions. First, a permutation generator must be defined - we used exactly the same implementation as proposed in [20]. Next, it is necessary to have an AES description with well defined sets of 16 independent operations, on which the shuffling can be applied. Although such operations are easily found for SubBytes and AddRoundKey, their specification is more difficult for ShiftRows and MixColumns, for which the 16 bytes are not manipulated independently. This implies that their output cannot be stored at the same location as their input, resulting in the need of 16 additional bytes of temporary storage. Furthermore, the FRAM microcontroller we use has only 12 CPU registers, which is not enough to store a complete AES state. Therefore, each independent operation needs to access the FRAM with absolute addressing, which is more time

| Unprotected AES |                  | Code size | Data size | Cycle count |
|-----------------|------------------|-----------|-----------|-------------|
|                 |                  | 1076      | 52        | 5800        |
| Shuffled AES    | Perm. generation | 194       | 18        | 2240        |
|                 | Code shuffling   | 418       | 0         | 2751        |
|                 | AES execution    | 2404      | 146       | 8479        |
|                 | Total            | 3016      | 164       | 13470       |

Table 1. AES program size (in bytes) and cycle counts in the MPS430FR5739.

consuming than working on registers. As for the implementations of Asiacrypt 2012, dummy key-schedule operations have also been added to the "on-the-fly" key-schedule, in order to obtain enough independent operations for this part of the implementation as well [20]. Eventually, the last function needed is the one randomizing the code before execution. This randomization was achieved by modifying the bytes of instructions referring to the cipher state's or round key's memory addresses. Interestingly, since the code and the data are both stored in the same FRAM memory, modifying some bytes of the code or some bytes of cipher state and round key takes exactly the same amount of time.

Our implementation results are available in Table 1 (and are given for encryption only). For reference, we first implemented an unshuffled version of the AES in the MSP430FR5739 microcontroller. Even if performance comparisons obtained with different technologies always have to be considered with care, it is worth noticing that it is slightly more time-consuming than Atmel ones (e.g. the open source AES Furious requires 3546 cycles to execute [13]). This is mainly a consequence of the limited number of registers available in FRAM microcontrollers, leading to more frequent memory accesses. By contrast and as expected, the pre-computation time required to shuffle the code is strongly reduced, from 18 ms in Atmel devices to 0.19 ms (running the chip at 16 MHz), which corresponds to a ratio of approximately 100. This is the main advantage of our implementation. Note finally the increased data size and cycle count for executing the AES in its shuffled version, which is essentially due to the previously mentioned execution of dummy key-schedules. We conclude that the overhead required to shuffle the AES algorithm based on a randomized program memory is now in line with practical applications constraints.

# 4 Making New Results Possible: The RLUT Case

The previous section described how FRAM memories allow significant speedups for shuffled implementations exploiting randomized program memories. In this section, we show how similar ideas can be used to enable the efficient implementation of the RLUT countermeasure. For this purpose, we first recall the intuition behind this countermeasure, then describe its application to reduced versions of the block cipher LED, and finally discuss implementation results.

#### 4.1 Description of the Countermeasure

We will focus on the protection of a single S-box that is the most challenging part of the countermeasure. Intuitively, it is convenient to start from the firstorder Boolean masking depicted in Fig. 2. In this scheme, a random mask m is first added to the sensitive value x which is then sent trough the combination of a bitwise kev addition ⊕ and S-box S. A correction function C is used (taking both  $x \oplus m \oplus k$  and m as input) in order to produce the output mask q such that  $S(x \oplus m \oplus k) = S(x \oplus k) \oplus q$ . Such an implementation typically gives rise to 4 leakage points denoted as  $L_1$ ,  $L_2$ ,  $L_3$  and  $L_4$  on the figure ( $L_2$  being the combination of two parts). It ideally guarantees that statistical moments of order 2 will have to be estimated by an adversary in order to recover secret information. The word "ideally" here refers to the fact that physical defaults such as glitches can lead to exploitable information in lower-order statistical moments [11]. For example, the leakage point  $L_2$  on the figure corresponds to the manipulation of  $x \oplus m \oplus k$  leading to  $\mathsf{L}_2^a(x \oplus m \oplus k)$ , and m leading to  $\mathsf{L}_2^b(m)$ , in parallel. It implies first-order exploitable information if these two parts of the leakage function are not independent<sup>2</sup>. Boolean masking can be naturally generalized to d shares (d=2) in the example of Fig. 2), leading to an (ideal as well) data complexity increase proportional to  $(\sigma_n^2)^d$ , with  $\sigma_n^2$  the variance of the noise in the leakage samples, as demonstrated by Chari et al. [3].

From this description, a first step towards the RLUT countermeasure is the observation that if the master key is fixed and for n-bit S-boxes, one can replace the computation of  $S(x \oplus k)$  by a pre-computed table of size  $2^n \times n$  (the correction function can be implemented similarly as a table of size  $2^{2n} \times n$ ). It directly leads to the implementation of Fig. 3, which is functionally equivalent to the previous one, but where the key addition has been "included" in a key-dependent permutations  $P_k(x)$ . From the side-channel security point-of-view, it still corresponds to a first-order secure implementation. Next, and in order to provide unconditional security against side-channel attacks of all orders, the main idea is to replace the Boolean masking operation  $x \oplus m$  by an extension to three shares denoted as  $G_i(x,m) = x \oplus m \oplus a_i$ , where  $a_i$  is a n-bit random mask that is pre-computed in a leakage-free environment. These operations, illustrated in Fig. 4, can also be implemented as tables of size  $2^{2n} \times n$ , so that the shares  $a_i$ will never be manipulated during the "online" execution of the algorithm. If the  $G_1$  (resp.  $G_2$ ) function is refreshed before each run of the protected implementation, it guarantees that no information can be extracted from the leakage points  $(L_1, L_2)$  (resp.  $(L_3, L_4)$ ). We additionally need  $G_1$  and  $G_2$  (i.e. their hidden  $a_i$  shares) to be independent, in order to avoid fourth-order leakages taking advantage of the correlation between the tables' inputs and outputs. Eventually, it remains to randomize the permutation  $P_k(x)$  (and the correction function C) in order to completely hide the key, even from identity leakage functions, as represented in Fig. 5. This way, a non-linear S-box can be computed securely. This leads to an implementation in which the operations  $G_1$ ,  $G_2$ , R and RC are

As a typical example,  $L_2 = L_2^a + L_2^b$  would correspond to an ideal implementation, while  $L_2 = L_2^a \cdot L_2^b$  would leak first-order information, as discussed in [19].



Fig. 2. Boolean masking.



Fig. 3. Boolean masking with LUTs.



Fig. 4. Randomized Boolean masking with LUTs.



Fig. 5. Randomized Boolean masking with randomized LUTs.

#### Algorithm 1. Table refreshing.

```
- input: P_k.

1. Pick a_1 \stackrel{R}{\leftarrow} \{0,1\}^n;

2. Pick a_2 \stackrel{R}{\leftarrow} \{0,1\}^n

3. Pick a_3 \stackrel{R}{\leftarrow} \{0,1\}^n;

4. Pre-compute G_1(I,J) = I \oplus J \oplus a_1;

5. Pre-compute R(I) = P_k(I) \oplus a_2;

6. Pre-compute G_2(I,J) = I \oplus J \oplus a_3;

7. Pre-compute R(I,J) = r(I) \oplus P_k(I \oplus J \oplus a_1) \oplus a_3;

- output: G_1, R, G_2, RC.
```

#### **Algorithm 2.** S-box evaluation on input x.

```
- input: G_1, R, RC.

1. Pick m \stackrel{R}{\leftarrow} \{0,1\}^n;

2. Compute G_1(x,m);

3. Compute R(G_1(x,m));

4. Compute R(G_1(x,m),m);

- output: R(G_1(x,m)), RC(G_1(x,m),m).
```

pre-computed according to Algorithm 1, and executed according to Algorithm 2. Extending this S-box computation to a complete cipher is straightforward: we just need independent tables for all the S-boxes. As for the linear operations, they have to be applied independently on the two shares that are explicitly manipulated by the leaking device, just as in standard Boolean masking.

#### 4.2 Application to Reduced LED

The previous section suggests that the RLUT countermeasure has high memory requirements, that strongly depend on the S-box size used in the block cipher to protect. In particular, given a  $N_r$ -round cipher with  $N_s$  S-boxes per round, the implementation of the RLUT countermeasure essentially requires storing:

- A table map that corresponds to all the  $a_i$  shares generated during precomputation, with memory cost estimated as  $(N_s \cdot N_r) \cdot 2 + N_s$  n-bit words (where the factor 2 corresponds to the fact that excepted for the first round, the share  $a_1$  in Algorithm 1 is always provided by the previous round).
- A randomized program that corresponds to the tables R and RC, with memory cost estimated as  $N_r \cdot N_s$  tables of size  $2^n \times n$  and  $2^{2n} \times n$ , respectively.

Note that operations  $G_i$  are never explicitly used during the cipher execution, but for the first round to mask, and last round to unmask after a secure computation is completed. Following these estimations, and as discussed in [18], it is natural to consider a cipher with 4-bit S-boxes for this purpose. In the following, we will consider reduced (16-bit) versions of the LED cipher illustrated in Fig. 6.

While such a cipher is naturally too small for being deployed in actual applications, we use it to refine our model for RLUT performance estimates. As will be discussed in the next sections, scaling to larger number of rounds and block



Fig. 6. Reduced version of the block cipher LED.

size (e.g. the full 64-bit LED cipher) will be possible in soon available 128-and 256-kilobyte versions of our FRAM microcontroller. In the figure, the 4-bit S-boxes of LED are denoted as S, and its linear diffusion layer as MixColumns.

#### 4.3 Implementation in FRAM Microcontrollers

We now describe how to implement reduced (with up to 4 rounds) LED ciphers within the 16 kb of FRAM available in our MSP430FR microcontroller.

The first building block required in a RLUT-masked implementation is a randomness generator (needed to produce the  $a_i$  values of Algorithm 1). For illustration, we used a LFSR with CRC-32 polynomial for this purpose (alternative ways of generating randomness could of be considered, e.g. using a leakage-resilient PRG if leaking pre-computations are considered [5,17]).

Next, the part computing the randomized program can be implemented quite straightforwardly, following the description in Sect. 4.1. The trickiest bit was to efficiently arrange 4-bit outputs into memory bytes, without giving any unnecessary information on the RLUT input values<sup>3</sup>. Using one byte to store two consecutive RLUT outputs was rejected, since accessing one or the other value in the byte would have led to different code behaviors, depending on the LSB bit of the RLUT's input. Instead, we stored the outputs coming from two different RLUTs for the same input value in a single byte. This time, the LSB (resp. MSB) part of one byte will be accessed when an odd (resp. even) word of the state needs to be computed, giving no information on the word's value itself. Based on this strategy, the RLUTs R and RC can be generated efficiently from the cipher key, the S-Box and the table map, that are all stored in memory.

Eventually, the last piece of code concerns the execution of the block cipher itself. Again, the fact that the operations are performed on 4-bit words had to be taken into account while accessing the variables or tables stored in memory. One round of the reduced algorithm is executed by first reading the R and RC tables' outputs, corresponding to the cipher state and mask intermediate values. Then, the MixColumn layer is executed on each of the shares. It is implemented using an Xtime table, as suggested in the specifications of LED [8].

<sup>&</sup>lt;sup>3</sup> This has no impact on the security in case of secure pre-computation, but may increase the information leakage in case of online randomization of the tables.

#### 4.4 Results and Discussion

As described in Sect. 4.2, an estimation of the memory size required to store the table map and randomized program of the RLUT countermeasure can be derived from the number of rounds  $N_r$ , the number of S-Boxes per round  $N_s$  and the S-Box bit-size n. This estimation is illustrated by the dashed line of Fig. 7, in the case where  $N_s = 4$ , n = 4 and  $N_r$  varies from 1 to 4. A plain line representing the actual results we obtained for our implementation is also plotted. The two curves follow the same trend, with the offset separating them corresponding to the code size needed to implement the cipher itself (i.e. excluding the tables for which the memory requirements are growing with  $N_r$  - see the detailed results in Appendix A, Table 2). Interestingly, these results suggest that for any parameters  $N_r$ ,  $N_s$  and n, the memory requirements needed to implement a block cipher protected with the RLUT countermeasure can be quite accurately predicted. For example, such an implementation for a full (64-bit) version of the block cipher LED (corresponding to  $N_r = 32$ ,  $N_s = 16$  and n = 4) would roughly require a memory size of 70 kb, and could therefore be implemented in the soon available 128-kilobyte FRAM microcontrollers.

The question of accurate predictions can also be asked for pre-computation time: estimates for this metric were similarly provided in [18]. Namely, the time needed to generate the RLUTs can be approximated with  $((N_s.N_r).2 + N_s) + (N_s.N_r).2^n + (N_s.N_r).2^{2n}$  "elementary operations" (where the first term corresponds to randomness generation, and the later ones correspond to the refreshing of the R and RC tables). Our actual implementation results directly allow translating these "elementary operations" into a concrete number of clock cycles. The results in Fig. 8 (also reported in Appendix A, Table 3) again confirm a nice correlation with predictions. Namely, the main difference between both curves is a factor 40, which presumably corresponds to the number of cycles needed to perform each elementary operations. Extrapolating these results to the full (64-bit) version of the LED block cipher suggests pre-computation time complexities



Fig. 7. Program size of the LED cipher protected with RLUTs.



Fig. 8. Pre-computation time of the LED cipher protected with RLUTs.

around 140 000 elementary operations, corresponding to  $5\,600\,000$  cycles (i.e. an execution time of  $35\,\mathrm{ms}$  at  $16\,\mathrm{MHz}$ ), which would be acceptable for some applications (and is likely to be improved with technology scaling).

Eventually, it is worth noticing that time and memory complexities could be reduced by exploiting some performance vs. security tradeoffs. A first solution would be an implementation for which only some rounds are masked with RLUTs, while the others use standard masking schemes. For example, one could protect only the first and last 4 rounds of (full, 64-bit) LED, i.e. 8 out of 32, resulting in an approximate reduction of the time and memory complexities down to 25% of their original values. However, this solution may be risky in front of advanced attacks such as algebraic ones, that can exploit the leakage of the middle rounds [14]. Another approach to reduce the pre-computation time consists in refreshing only a fraction of the table map (and randomized program) after each cipher execution, and to perform this refreshing randomly. Interestingly, such a tradeoff would also take advantage of the non-volatile capacities of FRAM memories, since the complete randomized program could be pre-computed offline, while only a part of it would be modified online. It is flexible since the fraction of RLUTs modified per cipher executions could be adapted to the application requirements. As an illustration, modifying 10% of the RLUTs in (full, 64-bit) LED would reduce the pre-computation time to approximately 560 000 cycles, which is getting close to the performances of a third-order masked AES (470 000 cycles in [15]). Combining the two appraoches would of course be possible as well, e.g. by randomizing the first- and last-round tables in priority.

#### 5 Conclusion

Our results put forward that FRAM is a promising technology in the context of side-channel resistant cryptographic hardware, since it enables the efficient implementation of various countermeasures taking advantage of pre-computations. The case of RLUTs is particularly relevant to illustrate this observation.

Indeed, they have never been implemented so far, they will soon be applicable to complete block ciphers, and may lead to high security levels for small embedded devices, independent of hardware assumptions that may be hard to fulfill. Important scopes for further investigations include the evaluation of the security levels obtained in the context of partially leaking pre-computations. In particular, analyzing the online refreshing of partially randomized programs mentioned in Sect. 4.4 would be very useful. Besides, the design of block ciphers that are well suited to implementations with RLUTs (e.g. with light(er) non-linear layers and strong(er) linear ones) is another interesting research avenue.

Acknowledgements. Stéphanie Kerckhof is a PhD student funded by a FRIA grant, Belgium. François-Xavier Standaert is a research associate of the Belgian Fund for Scientific Research (FNRS-F.R.S.). This work has been funded in parts by the Walloon region WIST program project MIPSs, by the European Commission through the ERC project 280141 (acronym CRASH) and by the European ISEC action grant HOME/2010/ISEC/AG/INT-011 B-CCENTRE.

# A RLUT Implementation Results

See Tables 2 and 3.

**Table 2.** Program size of the LED cipher protected with RLUTs (in bytes).

|          | Code size | Data size | Total |
|----------|-----------|-----------|-------|
| 1 Round  | 1180      | 562       | 1742  |
| 2 Rounds | 1180      | 1112      | 2292  |
| 3 Rounds | 1180      | 1662      | 2842  |
| 4 Rounds | 1180      | 2212      | 3392  |

Table 3. Cycle counts of the LED cipher protected with RLUTs.

|          | Randomness generation | RLUTs generation | LED execution | Total  |
|----------|-----------------------|------------------|---------------|--------|
| 1 Round  | 1422                  | 41 910           | 404           | 43 736 |
| 2 Rounds | 2208                  | 83807            | 642           | 86657  |
| 3 Rounds | 2991                  | 125716           | 883           | 129590 |
| 4 Rounds | 3770                  | 167617           | 1118          | 172505 |

#### References

- 51th Annual IEEE Symposium on Foundations of Computer Science FOCS 2010, Las Vegas, Nevada, USA, pp. 23–26. IEEE Computer Society, 23–26 October 2010
- Brakerski, Z., Kalai, Y.T., Katz, J., Vaikuntanathan, V.: Overcoming the hole in the bucket: public-key cryptography resilient to continual memory leakage. In: FOCS [1], pp. 501–510

- 3. Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999)
- 4. Dodis, Y., Haralambiev, K., López-Alt, A., Wichs, D.: Cryptography against continuous memory attacks. In: FOCS [1], pp. 511–520
- Dziembowski, S., Pietrzak, K.: Leakage-resilient cryptography. In: FOCS, pp. 293–302. IEEE Computer Society (2008)
- Goldwasser, S., Kalai, Y.T., Rothblum, G.N.: One-time programs. In: Wagner, D. (ed.) CRYPTO 2008. LNCS, vol. 5157, pp. 39–56. Springer, Heidelberg (2008)
- Grosso, V., Standaert, F.-X., Faust, S.: Masking vs. multiparty computation: how large is the gap for AES? In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 400–416. Springer, Heidelberg (2013)
- 8. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 326–341. Springer, Heidelberg (2011)
- 9. Herbst, C., Oswald, E., Mangard, S.: An AES smart card implementation resistant to power analysis attacks. In: Zhou, J., Yung, M., Bao, F. (eds.) ACNS 2006. LNCS, vol. 3989, pp. 239–252. Springer, Heidelberg (2006)
- 10. Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks Revealing the Secrets of Smart Cards. Springer, New York (2007)
- 11. Mangard, S., Popp, T., Gammel, B.M.: Side-channel leakage of masked CMOS gates. In: Menezes, A. (ed.) CT-RSA 2005. LNCS, vol. 3376, pp. 351–365. Springer, Heidelberg (2005)
- Medwed, M., Standaert, F.-X., Großschädl, J., Regazzoni, F.: Fresh re-keying: security against side-channel and fault attacks for low-cost devices. In: Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp. 279–296. Springer, Heidelberg (2010)
- 13. Poettering, B., Furious, R.: http://point-at-infinity.org/avraes/
- Renauld, M., Standaert, F.-X.: Algebraic side-channel attacks. In: Bao, F., Yung, M., Lin, D., Jing, J. (eds.) Inscrypt 2009. LNCS, vol. 6151, pp. 393–410. Springer, Heidelberg (2010)
- 15. Rivain, M., Prouff, E.: Provably secure higher-order masking of AES. In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 413–427. Springer, Heidelberg (2010)
- Rivain, M., Prouff, E., Doget, J.: Higher-order masking and shuffling for software implementations of block ciphers. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 171–188. Springer, Heidelberg (2009)
- 17. Standaert, F.-X., Pereira, O., Yu, Y.: Leakage-resilient symmetric cryptography under empirically verifiable assumptions. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013, Part I. LNCS, vol. 8042, pp. 335–352. Springer, Heidelberg (2013)
- Standaert, F.-X., Petit, C., Veyrat-Charvillon, N.: Masking with randomized look up tables. In: Naccache, D. (ed.) Cryphtography and Security: From Theory to Applications. LNCS, vol. 6805, pp. 283–299. Springer, Heidelberg (2012)
- Standaert, F.-X., Veyrat-Charvillon, N., Oswald, E., Gierlichs, B., Medwed, M., Kasper, M., Mangard, S.: The world is not enough: another look on second-order DPA. In: Abe, M. (ed.) ASIACRYPT 2010. LNCS, vol. 6477, pp. 112–129. Springer, Heidelberg (2010)
- 20. Veyrat-Charvillon, N., Medwed, M., Kerckhof, S., Standaert, F.-X.: Shuffling against side-channel attacks: a comprehensive study with cautionary note. In: Wang, X., Sako, K. (eds.) ASIACRYPT 2012. LNCS, vol. 7658, pp. 740–757. Springer, Heidelberg (2012)



http://www.springer.com/978-3-319-08301-8

Smart Card Research and Advanced Applications 12th International Conference, CARDIS 2013, Berlin, Germany, November 27-29, 2013. Revised Selected Papers

Francillon, A.; Rohatgi, P. (Eds.) 2014, XII, 271 p. 111 illus., Softcover

ISBN: 978-3-319-08301-8