# An ASIC Implementation of the AES SBoxes* 

Johannes Wolkerstorfer ${ }^{1}$, Elisabeth Oswald ${ }^{1}$, and Mario Lamberger ${ }^{2}$<br>${ }^{1}$ Institute for Applied Information Processing and Communications, Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria Johannes.Wolkerstorfer@iaik.at, http://www.iaik.at<br>2 Department of Mathematics<br>Graz University of Technology, Steyrergasse 30, A-8010 Graz, Austria


#### Abstract

This article presents a hardware implementation of the SBoxes from the Advanced Encryption Standard (AES). The SBoxes substitute an 8 -bit input for an 8 -bit output and are based on arithmetic operations in the finite field $G F\left(2^{8}\right)$. We show that a calculation of this function and its inverse can be done efficiently with combinational logic. This approach has advantages over a straight-forward implementation using read-only memories for table lookups. Most of the functionality is used for both encryption and decryption. The resulting circuit offers low transistor count, has low die-size, is convenient for pipelining, and can be realized easily within a semi-custom design methodology like a standard-cell design. Our standard cell implementation on a $0.6 \mu \mathrm{~m}$ CMOS process requires an area of only $0.108 \mathrm{~mm}^{2}$ and has delay below 15 ns which equals a maximum clock frequency of 70 MHz . These results were achieved without applying any speed optimization techniques like pipelining.


Keywords: Advanced Encryption Standard (AES), finite field arithmetic, inversion, Application Specific Integrated Circuit (ASIC), stan-dard-cell design, Very Large Scale Integration (VLSI), scalability, pipelining.

## 1 Introduction

The Advanced Encryption Standard (AES) is a symmetric encryption algorithm. It will become a FIPS standard in Fall 2001 [1]. AES will replace the DESalgorithm in the coming years since it offers higher levels of security. AES supports key lengths of 128,192 , and 256 bits. It operates on 128 -bit data blocks. The major building blocks of the AES algorithm are the non-linear SBoxes (SubByte-operation) and the MixColumn-operation. Both are based on finite field arithmetic and have an inverse function which is used for decryption.

[^0]```
AESROUND () \{
    SubByte(State);
    ShiftRow(State);
    MixColumn(State);
    AddRoundKey(State, RoundKey);
\}
```

Fig. 1. Round function of the AES-algorithm

The AES-algorithm's operations are performed on a two-dimensional array of bytes called the State. The State consists of four columns and four rows of bytes. For both encryption and decryption, the AES-algorithm uses a round function that is composed of four different transformations which modify the State (see Fig. 1). First, the SubByte-function substitutes all bytes of the State using a lookup-table called SBox. SBox-table entries are calculated by inversion in the finite field $\mathrm{GF}\left(2^{8}\right)$ followed by a short final transformation. Second, the rows of the State are shifted by different offsets (ShiftRow-function). The MixColumnfunction scrambles the columns of the State by multiplying a finite field constant. An addition of the State with the Roundkey - which is derived from the input key - concludes the round function which is executed ten times when 128-bit keys are used. The RoundKey is calculated by operations which are similar to those of the round function and require the SBox functionality too.

The efficiency of an AES hardware implementation in terms of die-size, throughput, and power consumption is mainly determined by the implementation of the MixColumn-operation and the SBoxes. The remaining operations are trivial: ShiftRow is a simple cyclic shift, and AddRoundKey is a XOR-operation of the State and the RoundKey. Up to 20 instances of AES-SBoxes are used to realize hardware for the AES round function. The exact number of SBoxes depends on the architecture's degree of parallelism and is determined by throughput requirements and the desired clock frequency. In case that the AES-module should also decrypt data, it has to be taken into account that the SBoxes used for decryption have a different functionality. The number of SBoxes and their style of implementation has important influence on the size and the speed of an AES hardware. For this reason, V. Rijmen (one of the AES inventors) suggests in [4] an alternative method for the computation of the AES-SBox. It consists essentially of a replacement of the SBox lookup-table by an efficient combinational logic for the computation of the inverse elements in $\operatorname{GF}\left(2^{8}\right)$. Therefore, another representation of the finite field $\operatorname{GF}\left(2^{8}\right)$ is used. This representation leads to an efficient implementation of the finite field arithmetic and was investigated in connection with the implementation of error correcting codes in C. Paar [7], Soljanin et al. [5], and Mastrovito [6]. In contrast to V. Rijmen's original proposal which additionally suggests the optimal normal basis representation of finite field elements (for a definition see [3]) we use the polynomial representation of finite field elements. The benefit of our method is that we have a far more flexible hardware architecture (in comparison to the possible architectures with a straightforward SBox implementation) without the necessity to do complex conversions from one
representation (of finite field elements) to another. The main advantages of our architecture are:

- lower transistor count and die-size than a ROM-based approach,
- a short critical path to achieve a high operational frequency,
- easier implementation within a semi-custom design methodology since all computations can be done with standard-cells,
- flexibility for speed optimization: pipelining techniques can trade throughput for latency.
- suitability for a full-custom implementation: a few leaf-cells using an appropriate logic-style could increase speed or decrease power consumption.

The remainder of this article provides the mathematical background of the finite field arithmetic and the computation of the AES SBoxes in Sect. 2. The building blocks of an SBox and the according formulas are given in Sect. 3. Section 4 presents the implementation.

## 2 Mathematical Background

This article uses the same notation and conventions as the AES specification [1]. All notations and mathematical operations required for the SBox-operation are presented in a condensed form.

Bytes. The basic data unit of AES are Bytes $a=\left\{a_{7}, a_{6}, a_{5}, a_{4}, a_{3}, a_{2}, a_{1}, a_{0}\right\}$ each holding eight bits. A Byte can be interpreted as an element of the GaloisField $\mathrm{GF}\left(2^{8}\right)$ in polynomial representation:

$$
a(x)=\sum_{i=0}^{7} a_{i} x^{i}=a_{7} x^{7}+a_{6} x^{6}+a_{5} x^{5}+a_{4} x^{4}+a_{3} x^{3}+a_{2} x^{2}+a_{1} x+a_{0} .
$$

The coefficients $a_{i}$ of a polynomial $a(x)$ are bits. Bytes can be written in different notations. For example, the binary value $\{01100011\}$ is $\{63\}$ in hexadecimal notation and represents the polynomial $x^{6}+x^{5}+x+1$.

Addition. The addition of two Bytes representing polynomials $a(x), b(x) \in$ $G F\left(2^{8}\right)$ is achieved by adding their corresponding coefficients modulo 2 which is a XOR-operation usually denoted with $\oplus$.

$$
\begin{equation*}
a(x) \oplus b(x)=\sum_{i=0}^{7} a_{i} x^{i} \oplus \sum_{i=0}^{7} b_{i} x^{i}=\sum_{i=0}^{7}\left(a_{i} \oplus b_{i}\right) x^{i} \tag{1}
\end{equation*}
$$

The additive inverse of a Byte is the Byte itself: $-b(x)=b(x)$ and therefore subtraction is identical with addition: $a(x)-b(x)=a(x)+b(x)$.

Multiplication. The multiplication of $a(x), b(x) \in G F\left(2^{8}\right)$ - denoted with $a(x) \otimes b(x)$ - requires an irreducible polynomial of degree 8. For the AESalgorithm it is defined as

$$
m(x)=x^{8}+x^{4}+x^{3}+x+1=1\{00011011\}(\text { bin })=1\{1 b\}(\text { hex })
$$

The multiplication $q(x)=a(x) \cdot b(x)$ in $\mathrm{GF}\left(2^{8}\right)$ is done by multiplying the polynomials $a(x) b(x)$ which yields a polynomial $p(x)$ with degree less than 15. This step is followed by a modular reduction step $q(x)=p(x) \bmod m(x)$ to ensure that the result is an element of $\operatorname{GF}\left(2^{8}\right)$.

A convenient method to multiply in the finite field $\mathrm{GF}\left(2^{8}\right)$ is to generate eight partial products: $P_{i}(x)=a(x) \cdot x^{i}$ and to add those partial products where the according bit $b_{i}$ of the multiplier $b(x)$ is $1: q(x)=\sum_{i=0}^{7} P_{i} b_{i}$. The partial products can be calculated efficiently by iterating a multiplication by $x: P_{i}(x)=$ $P_{i-1}(x) \cdot x \bmod m(x), P_{0}(x)=a(x)$. A multiplication by $x$ is termed xtimes and is given by

$$
\begin{align*}
q(x) & =x \operatorname{times}(a)=a(x) x \bmod m(x)  \tag{2}\\
q_{0} & =a_{7}, \quad q_{1}=a_{0} \oplus a_{7}, \quad q_{2}=a_{1}, \quad q_{3}=a_{2} \oplus a_{7} \\
q_{4} & =a_{3} \oplus a_{7}, \quad q_{5}=a_{4}, \quad q_{6}=a_{5}, \quad q_{7}=a_{6} .
\end{align*}
$$

Xtimes can be implemented by a shift left operation of the input Byte $a$ and a conditional addition of the irreducible polynomial $m(x)$ if the most significant bit $\left(a_{7}\right)$ of $a$ is set. This ensures a Byte as result.

Inversion. The multiplicative inverse $a^{-1}$ of an element $a \in G F\left(2^{8}\right)$ has the property that $\forall a \in G F\left(2^{8}\right) \backslash\{0\}: a \otimes a^{-1}=\{1\}$. Calculating the inverse of a Byte is even more costly than multiplying Bytes. A widely used algorithm for inversion is the extended Euclidean algorithm described in [2]. Unfortunately, this algorithm is not suitable for a hardware implementation.

### 2.1 GF( $\left.2^{8}\right)$ as an Extension of GF( $\left.2^{4}\right)$

Usually, the field $\operatorname{GF}\left(2^{8}\right)$ is seen as a field extension of $\mathrm{GF}(2)$ and therefore its elements can be represented as Bytes. An isomorphic - but for hardware implementations far better suited - representation is to see the field $\operatorname{GF}\left(2^{8}\right)$ as a quadratic extension of the field $\mathrm{GF}\left(2^{4}\right)$. In this case, an element $a \in \operatorname{GF}\left(2^{8}\right)$ is represented as a linear polynomial with coefficients in $\operatorname{GF}\left(2^{4}\right)$,

$$
\begin{equation*}
a \cong a_{h} x+a_{l}, \quad a \in \operatorname{GF}\left(2^{8}\right), \quad a_{h}, a_{l} \in \mathrm{GF}\left(2^{4}\right) \tag{3}
\end{equation*}
$$

and will be denoted by the pair $\left[a_{h}, a_{l}\right]$. Both coefficients of such a polynomial have four bits. All mathematical operations applied to elements of $\mathrm{GF}\left(2^{8}\right)$ can also be computed in this representation which we call two-term polynomials. Two-term polynomials are added by addition of their corresponding coefficients

$$
\begin{equation*}
\left(a_{h} x+a_{l}\right) \oplus\left(b_{h} x+b_{l}\right)=\left(a_{h} \oplus b_{h}\right) x+\left(a_{l} \oplus b_{l}\right) . \tag{4}
\end{equation*}
$$

Multiplication and inversion of two-term polynomials require a modular reduction step to ensure that the result is a two-term polynomial too. The irreducible polynomial needed for the modular reduction is given by

$$
\begin{equation*}
n(x)=x^{2}+\{1\} x+\{e\} . \tag{5}
\end{equation*}
$$

The coefficients of $n(x)$ are elements in $\operatorname{GF}\left(2^{4}\right)$ and are written in hexadecimal notation. Their particular values are chosen to optimize the finite field arithmetic.

The multiplication of two-term polynomials involves multiplication of elements in GF $\left(2^{4}\right)$ which requires an irreducible polynomial of degree 4 which is given by

$$
\begin{equation*}
m_{4}(x)=x^{4}+x+1 \tag{6}
\end{equation*}
$$

Deriving formulas for multiplication in $\operatorname{GF}\left(2^{4}\right)$ is similar to Byte-multiplication. Multiplication in $\operatorname{GF}\left(2^{4}\right)$ is given by

$$
\begin{gathered}
q(x)=a(x) \otimes b(x)=a(x) \cdot b(x) \bmod m_{4}(x), \quad a(x), b(x), q(x) \in \mathrm{GF}\left(2^{4}\right) \quad(7) \\
\hline a_{A}=a_{0} \oplus a_{3}, \quad a_{B}=a_{2} \oplus a_{3} \\
q_{0}=a_{0} b_{0} \oplus a_{3} b_{1} \oplus a_{2} b_{2} \oplus a_{1} b_{3} \quad q_{1}=a_{1} b_{0} \oplus a_{A} b_{1} \oplus a_{B} b_{2} \oplus\left(a_{1} \oplus a_{2}\right) b_{3} \\
q_{2}=a_{2} b_{0} \oplus a_{1} b_{1} \oplus a_{A} b_{2} \oplus a_{B} b_{3} \quad q_{3}=a_{3} b_{0} \oplus a_{2} b_{1} \oplus a_{1} b_{2} \oplus a_{A} b_{3} .
\end{gathered}
$$

Squaring in $\operatorname{GF}\left(2^{4}\right)$ is a special case of multiplication and is given by

$$
\begin{align*}
& q(x)=a(x)^{2} \bmod m_{4}(x), \quad q(x), a(x) \in \operatorname{GF}\left(2^{4}\right)  \tag{8}\\
& q_{0}=a_{0} \oplus a_{2}, \quad q_{1}=a_{2}, \quad q_{2}=a_{1} \oplus a_{3}, \quad q_{3}=a_{3} .
\end{align*}
$$

The inverse $a^{-1}$ of an element $a \in G F\left(2^{4}\right)$ can be derived by solving the equation $a(x) \cdot a^{-1} \bmod m_{4}(x)=1$ as follows

$$
\begin{align*}
q(x)= & a(x)^{-1} \bmod m_{4}(x), \quad q(x), a(x) \in \mathrm{GF}\left(2^{4}\right)  \tag{9}\\
& a_{A}=a_{1} \oplus a_{2} \oplus a_{3} \oplus a_{1} a_{2} a_{3} \\
q_{0}= & a_{A} \oplus a_{0} \oplus a_{0} a_{2} \oplus a_{1} a_{2} \oplus a_{0} a_{1} a_{2} \\
q_{1}= & a_{0} a_{1} \oplus a_{0} a_{2} \oplus a_{1} a_{2} \oplus a_{3} \oplus a_{1} a_{3} \oplus a_{0} a_{1} a_{3} \\
q_{2}= & a_{0} a_{1} \oplus a_{2} \oplus a_{0} a_{2} \oplus a_{3} \oplus a_{0} a_{3} \oplus a_{0} a_{2} a_{3} \\
q_{3}= & a_{A} \oplus a_{0} a_{3} \oplus a_{1} a_{3} \oplus a_{2} a_{3} .
\end{align*}
$$

The concatenation of two bits $a_{i} a_{j}$ in Equations 7 and 9 represents a binary multiplication which is an AND-operation. In contrast to inversion in $\operatorname{GF}\left(2^{8}\right)$, inversion in $\mathrm{GF}\left(2^{4}\right)$ is suitable for a hardware implementation using combinational logic since all Boolean equations depend only on four input bits.

Inversion of Two-Term Polynomials. Inversion of two-term polynomials is the equivalent operation to inversion in $\operatorname{GF}\left(2^{8}\right)$. A multiplication of a two-term polynomial with its inverse yields the 1-element of the field: $\left(a_{h} x+a_{l}\right) \otimes\left(a_{h}^{\prime} x+\right.$
$\left.a_{l}^{\prime}\right)=\{0\} x+\{1\}, \quad a_{h}, a_{l}, a_{h}^{\prime}, a_{l}^{\prime} \in G F\left(2^{4}\right)$. From this definition the formula for inversion can be derived:

$$
\begin{align*}
\left(a_{h} x+a_{l}\right)^{-1}=a_{h}^{\prime} x+a_{l}^{\prime} & =\left(a_{h} \otimes d\right) x+\left(a_{h} \oplus a_{l}\right) \otimes d  \tag{10}\\
d & =\left(\left(a_{h}^{2} \otimes\{e\}\right) \oplus\left(a_{h} \otimes a_{l}\right) \oplus a_{l}^{2}\right)^{-1} .
\end{align*}
$$

Inversion of two-term polynomials involves only operations in $\mathrm{GF}\left(2^{4}\right)$ which are suitable for a hardware implementation using combinational logic. Most of the functionality is used to calculate the term $d$ which is used to calculate both coefficients of the inverted two-term polynomial.

Transition between Representations of GF( $\mathbf{2}^{8}$ ). The finite field GF $\left(2^{8}\right)$ is isomorphic to the finite field $\operatorname{GF}\left(\left(2^{4}\right)^{2}\right)$ which means that for each element in $\operatorname{GF}\left(2^{8}\right)$ there exists exactly one element in $\operatorname{GF}\left(\left(2^{4}\right)^{2}\right)$. The bijection from an element $a \in G F\left(2^{8}\right)$ to a two-term polynomial $a_{h} x+a_{l}$ where $a_{h}, a_{l} \in G F\left(2^{4}\right)$ is given by the function map:

$$
\begin{align*}
a_{h} x+a_{l}= & \operatorname{map}(a), \quad a_{h}, a_{l} \in \mathrm{GF}\left(2^{4}\right), \quad a \in \mathrm{GF}\left(2^{8}\right)  \tag{11}\\
& a_{A}=a_{1} \oplus a_{7}, \quad a_{B}=a_{5} \oplus a_{7}, \quad a_{C}=a_{4} \oplus a_{6} \\
a_{l 0}= & a_{C} \oplus a_{0} \oplus a_{5}, \quad a_{l 1}=a_{1} \oplus a_{2}, \quad a_{l 2}=a_{A}, \quad a_{l 3}=a_{2} \oplus a_{4} \\
a_{h 0}= & a_{C} \oplus a_{5}, \quad a_{h 1}=a_{A} \oplus a_{C}, \quad a_{h 2}=a_{B} \oplus a_{2} \oplus a_{3}, \quad a_{h 3}=a_{B} .
\end{align*}
$$

The inverse transformation ( $\mathrm{map}^{-1}$ ) converts two-term polynomials $a_{h} x+a_{l}$, $a_{h}, a_{l} \in G F\left(2^{4}\right)$ back into elements $a \in G F\left(2^{8}\right)$. It is given by

$$
\begin{align*}
a= & m a p^{-1}\left(a_{h} x+a_{l}\right), \quad a \in \mathrm{GF}\left(2^{8}\right), \quad a_{h}, a_{l} \in \mathrm{GF}\left(2^{4}\right)  \tag{12}\\
& a_{A}=a_{l 1} \oplus a_{h 3}, \quad a_{B}=a_{h 0} \oplus a_{h 1}, \\
a_{0}= & a_{l 0} \oplus a_{h 0}, \quad a_{1}=a_{B} \oplus a_{h 3} \\
a_{2}= & a_{A} \oplus a_{B}, \quad a_{3}=a_{B} \oplus a_{l 1} \oplus a_{h 2} \\
a_{4}= & a_{A} \oplus a_{B} \oplus a_{l 3}, \quad a_{5}=a_{B} \oplus a_{l 2} \\
a_{6}= & a_{A} \oplus a_{l 2} \oplus a_{l 3} \oplus a_{h 0}, \quad a_{7}=a_{B} \oplus a_{l 2} \oplus a_{h 3}
\end{align*}
$$

Both transformations can be derived by following the procedure given in C. Paar's PhD thesis [7]. They differ from the transformations given in [5] because the irreducible polynomial $m(x)$ for $\operatorname{GF}\left(2^{8}\right)$ is different.

## 3 SBox Building Blocks

The SubByte transformation operates independently on each Byte of the State using a substitution table (SBox). An AES-SBox is composed of two transformations:

1. Calculate the multiplicative inverse in the finite field $\operatorname{GF}\left(2^{8}\right)$. The element $\{00\}$ is mapped to itself. Table 1 presents the inversion.

Table 1. Inversion of a Byte $\{x y\} \in G F\left(2^{8}\right)$ in hexadecimal notation.

|  | 0 | 1 | 2 | 23 | 3 | $4{ }^{4} 5$ |  |  |  | 8 | 9 | a | b |  | d | e |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 00 | 01 | 8d | d f6 | f6 cb | b 5 | 527 | 7 bl | d1 | e8 | 4 f | 29 | c0 | b0 | e1 | e5 |  |
|  | 74 | b4 | aa | 4b | 4 b 99 | 92 | 2 b 6 | 605 | $5 f$ | 58 | 3 f | fd | cc | If | 40 | ee |  |
|  | 3 a | 6 | 5 | f1 | f1 | 54 | 4 d | c | c9 | c1 | 0a | 98 | 15 | 30 | 44 | a2 |  |
|  | 2c | 45 | 92 | 6 c | 6c f3 | 339 | 6 | 66 | 42 f | f2 | 35 | 20 | 6 f | 77 | bb | 59 |  |
|  | 1d | fe | 37 | 67 | 67 2d | 2d 3 | 31 f | f5 6 | 69 | a7 | 64 | ab | 13 | 54 | 25 | e9 |  |
|  | ed | 5 c | 05 | 5 ca | ca 4c | 4c 2 | 248 | 87 b | bf | 18 | 3e | 22 | f0 | 51 | ec | 61 |  |
|  | 16 | 5 e | af |  | d3 49 | 9 a | 63 |  |  | f4 | 47 | 91 | df | 33 | 93 | 21 |  |
| 7 | 79 | b7 | 797 | 785 | 8510 | 0 | 5 b | ba 3 | 3c b | b6 | 70 | d0 | 06 | a1 | fa | 81 |  |
| 8 | 83 | 7 | 7 f | 80 | 8096 | 9673 | b | 5 | 56 | 9 b | 9e | 95 | d9 | f7 | 02 | b9 |  |
|  | de | 6a | 32 | 26 d | 6 d d8 | d8 8 | 8a | 7 | 72 | 2a | 14 | 9 f | 88 | f9 | dc | 89 |  |
|  | fb | 7 c | c 2 e | c3 | c3 8f | 8 f b | 6 | 654 | 48 | 26 | c8 | 12 | 4a |  | e7 | d2 |  |
|  | 0c | e0 | 1 f | f ef | ef 11 | 175 | 757 | 787 | 71 | a5 | 8 e | 76 | 3d | d | bc | 86 |  |
|  | 0b | 28 | 8 2f | a3 | 33 da | da d | d4 | e4 0 | Of | a9 | 27 | 53 | 04 | 1b | fc | ac |  |
|  | 7a | 07 | 7 ae | 63 | 63 c 5 | 5 db | db e2 | e2 | ea | 94 | 8b | c4 | d5 | 9c | f8 | 90 |  |
|  | b1 | 0d | d6 | 6 eb | eb c6 | 60 | De c | cf a | ad | 08 | 4 e | d7 | e3 | 5 d | 50 | 1 e |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

2. Apply the affine transformation which is given by Equation 13.

| $q=\operatorname{aff}$ _trans $(a)$ | $q=\mathrm{aff}_{\text {_trans }}{ }^{-1}(a)$ |
| :---: | :---: |
| $a_{A}=a_{0} \oplus a_{1}, a_{B}=a_{2} \oplus a_{3}$, | $a_{A}=a_{0} \oplus a_{5}, a_{B}=a_{1} \oplus a_{4}$, |
| $a_{C}=a_{4} \oplus a_{5}, a_{D}=a_{6} \oplus a_{7}$ | $a_{C}=a_{2} \oplus a_{7}, a_{D}=a_{3} \oplus a_{6}$ |
| $q_{0}=\overline{a_{0}} \oplus a_{C} \oplus a_{D}$ | $q_{0}=\overline{a_{5}} \oplus a_{C}$ |
| $q_{1}=\overline{a_{5}} \oplus a_{A} \oplus a_{D}$ | $q_{1}=a_{0} \oplus a_{D}$ |
| $q_{2}=a_{2} \oplus a_{A} \oplus a_{D}$ | $q_{2}=\overline{a_{7}} \oplus a_{B}$ |
| $q_{3}=a_{7} \oplus a_{A} \oplus a_{B}$ | $q_{3}=a_{2} \oplus a_{A}$ |
| $q_{4}=a_{4} \oplus a_{A} \oplus a_{B}$ | $q_{4}=a_{1} \oplus a_{D}$ |
| $q_{5}=\overline{a_{1}} \oplus a_{B} \oplus a_{C}$ | $q_{5}=a_{4} \oplus a_{C}$ |
| $q_{6}=\overline{a_{6}} \oplus a_{B} \oplus a_{C}$ | $q_{6}=a_{3} \oplus a_{A}$ |
| $q_{7}=a_{3} \oplus a_{C} \oplus a_{D}$. | $q_{7}=a_{6} \oplus a_{B}$. |

Overlined bits in Equation 13 and 14 denote inverted bits. Decryption requires the inverse function of SubByte (InvSubByte) which reverses the SBoxoperation by applying the inverse affine transformation first (Equation 14). Then, the multiplicative inverse in the finite field $\operatorname{GF}\left(2^{8}\right)$ is calculated.

Inversion in the finite field $\operatorname{GF}\left(2^{8}\right)$ is needed to calculate the SubBytefunction as well as InvSubByte. It makes sense, to merge the encryption SBox with the decryption SBox in order to reuse the finite field inversion circuit for decryption. Figure 2 depicts this approach. The control signal enc switches between encryption and decryption. If encryption is chosen (enc=1), the inverse affine transformation (aff_trans ${ }^{-1}$ ) is bypassed and the input $a$ is directly fed


Fig. 2. Architecture of the AES-SBox
into the inversion circuit. The output of the inversion circuit is modified by the affine transformation block which calculates the result of the SubByte-function. During decryption (enc=0), the inverse affine transformation is active and the affine transformation is bypassed to calculate InvSubByte. The delay for encryption and decryption is essentially the same because the circuit's complexity for the affine transformation and its inverse are equal.

The circuit for inversion of elements in $\operatorname{GF}\left(2^{8}\right)$ covers most of the SBox functionality. In our approach the inversion is calculated with combinational logic and is based on Equation 10 which operates in $\operatorname{GF}\left(2^{4}\right)$. The operations occurring in this equation correspond to the function blocks shown in Fig. 2. Furthermore, this function block has to convert data from $\operatorname{GF}\left(2^{8}\right)$ to two-term polynomials and vice versa. The blocks map resp. map ${ }^{-1}$ provide this functionality based on Equation 11 and 12. Addition of elements in $\mathrm{GF}\left(2^{4}\right)$ is accomplished by a bitwise XOR-operation. Squaring relies on Equation 9. Multiplication of an element in $\mathrm{GF}\left(2^{4}\right)$ with the constant $\{e\}$ is given by

$$
\begin{align*}
q= & a \otimes\{e\}  \tag{15}\\
& a_{A}=a_{0} \oplus a_{1}, a_{B}=a_{2} \oplus a_{3} \\
q_{0}= & a_{1} \oplus a_{B} \\
q_{1}= & a_{A} \\
q_{2}= & a_{A} \oplus a_{2} \\
q_{3}= & a_{A} \oplus a_{B} .
\end{align*}
$$

Multiplication and inversion in $\operatorname{GF}\left(2^{4}\right)$ are the most complex function blocks and rely on Equations 7 and 9.

## 4 Implementation

Our implementation is based on the architecture described in Sect. 3. It has a maskable affine transformation block, a maskable inverse affine transformation block and a block which calculates the inverse in the finite field GF $\left(2^{8}\right)$. Most of the functionality can be implemented with XOR-gates. Additionally, inverters, AND-gates, and 2-to-1 multiplexers are required, but they can be neglected for performance analysis purposes.

Table 2. Complexity of the SBox

| block | $d_{\text {XOR }}$ | XORs | instances | sum XORs |
| :--- | ---: | ---: | ---: | ---: |
| aff_trans | 3 | 16 | 1 | 16 |
| aff_trans $^{-1}$ | 2 | 12 | 1 | 12 |
| map $^{2}$ | 2 | 11 | 1 | 11 |
| map $^{-1}$ | 3 | 15 | 1 | 15 |
| $\oplus$ | 1 | 4 | 3 | 12 |
| $\wedge^{2}$ | 1 | 2 | 2 | 4 |
| $\otimes\{e\}$ | 2 | 5 | 1 | 5 |
| $\otimes$ | 2 | 12 | 3 | 36 |
| $\wedge^{-1}$ | 3 | 12 | 1 | 12 |
|  | Max 15 |  |  | Sum 123 |

Table 3. Pipelining of the SBox

| stages | flipflops | frequency | area |
| :---: | :---: | :---: | :---: |
| 0 | 0 | $100 \%$ | $100 \%$ |
| 1 | 12 | $178 \%$ | $111 \%$ |
| 3 | 28 | $205 \%$ | $151 \%$ |

Table 2 lists the resources of all blocks - measured in number of XORs. The overall amount of gates are 123 XOR-gates with two inputs, 16 2-to- 1 multiplexers and a dozen of inverters and AND-gates. If XOR-gates with three inputs are available, the number of gates can be reduced. The delay of the blocks is measured in numbers of XOR-gates in series $\left(d_{X O R}\right)$. The critical path for encryption is composed of 15 XOR-gates in series. For decryption it is 14 because the inverse of the affine transformation has lower complexity. XOR-gates with three inputs will shorten the critical path and improve performance.

A feature that can be exploited to gain higher throughput is pipelining. Pipelining is a technique which subdivides the critical path by insertion of storing elements (flipflops). Subdividing the SBox functionality into a number of stages


Fig. 3. Layout of the AES-SBox
is easy to accomplish since flipflops can be inserted nearly anywhere when SBoxes are implemented with combinational logic. Pipelining introduces latency but the additional clock cycles are made up by an increased clock frequency as shown in Table 3. This technique will offer best results if the number of SBox instances used for an AES implementation is kept low (e.g. 4), otherwise latency will consume more time than it is saved by shortening the critical path.

Our implementation of the AES-SBox which combines the SubByte-function and InvSubByte-function from the AES algorithm is a standard-cell circuit on a $0.6 \mu \mathrm{~m}$ CMOS process from AMS using two metal layers. It has an area of $0.108 \mathrm{~mm}^{2}$ and contains 1624 transistors ( 406 NAND equivalents) when no pipelining is used. A layout is shown in Fig. 3. Layout simulation with typical mean parameters considering parasitics yields a delay below 14.2 ns which equals a maximum clock frequency of 70 MHz for both encryption and decryption. A pipelined version with one stage is slightly bigger $\left(0.120 \mathrm{~mm}^{2}\right)$, contains nearly 2000 transistors ( 500 NAND equivalents) and yields a delay below 8 ns $(125 \mathrm{MHz})$. Further increasing the number of pipeline stages will not improve throughput that strong and has a worse ratio of performance benefit to area penalty. It should be considered when the maximum clock frequency is of utmost interest.

For comparison, a lookup-table based approach requires a 4 k -bit ROM to store the 256 8-bit entries of SubByte and InvSubByte. An implementation on the same process technology uses about $0.200 \mathrm{~mm}^{2}$ and has an estimated maximum frequency of 100 MHz [9].

For an SBox implementation using a full-custom design methodology we suggest to use a differential logic style [8]. The logic functions of an SBox based on combinational logic are dominated by XOR-gates and differential XOR-gates offer good performance and have a moderate transistor count. We assume that a full-custom design could halve the required chip area because an SBbox is a small module and does not require the excessive driving capability offered by
standard cells. Output transistors of gates can be dimensioned smaller and this will in turn make it possible to scale all transistors down without deteriorating performance. At least three leaf cells are needed: a XOR-gate, an AND-gate and an inverter. To develop these cells will be an rewarding task if the resulting performance gain and area saving is pictured.

## 5 Related Work

Several AES-implementations have been presented recently. For comparison with our work, only ASIC circuits [11] or circuits exploiting a more efficient finite field arithmetic are of interest [10]. The approach followed in IBM implementation [10] is also based on a conversion of elements in $\mathrm{GF}\left(2^{8}\right)$ into two-term polynomials. In contrast to our approach, they calculate the whole round function in this representation. Therefore, they choose the conversion function $\operatorname{map}()$ in a way that minimizes the overall gate count. Our primary focus on choosing $\operatorname{map}()$ was to minimize the critical path of the complete SBox and secondarily to keep the gate count low. Our $\operatorname{map}()$-function has a shorter critical path compared to the IBM implementation, the critical path of $\operatorname{map}^{-1}()$ is identical.

## 6 Conclusion

This article presented an ASIC implementation of the SBoxes from the Advanced Encryption Standard (AES). It is based on finite field arithmetic rather than using lookup-tables. This approach offers higher flexibility. Area requirements can be traded for the maximum clock frequency. The architecture can be easily implemented within a standard-cell design methodology because it completely relies on combinational logic. It is also well suited for a full-custom implementation since it uses only a few leaf cells. We implemented the AES-SBox on $0.6 \mu \mathrm{~m}$ CMOS process with standard-cells. The most promising configuration of design parameters we found is a single stage pipeline architecture. It has a silicon area of $0.12 \mathrm{~mm}^{2}$ and a maximum clock frequency of 125 MHz . This configuration has a latency of one clock cycle like a ROM based approach. In comparison to ROMs, it offers better performance on smaller area and can even be improved by exploiting better suited logic styles like differential logic.

## References

1. NIST, Advanced Encryption Standard (AES), FIPS PUBS 197, National Institute of Standards and Technology, November 2001.
2. A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography, CRC Press, New York, 1997.
3. R. Lidl and H. Niederreiter, Introduction to finite fields and their applications, Cambridge University Press, Cambridge, 1986.
4. V. Rijmen, Efficient Implementation of the Rijndael SBox, http://www.esat.kuleuven.ac.be/~rijmen/rijndael/.
5. E. Soljanin, R. Urbanke, An Efficient Architecture for Implementation of a Multiplier and Inverter in $\operatorname{GF}\left(2^{8}\right)$, Lucent Technologies.
6. E. D. Mastrovito, VLSI Architectures for Computations in Galois Fields, PhD thesis, Linköping University, Linköping, Sweden, 1991.
7. C. Paar, Efficient VLSI Architectures for Bit Parallel Computation in Galois Fields, PhD thesis, Universität Essen, 1994.
8. J. B. Kuo, J. H. Lou, Low-Voltage VLSI Circuits, John Wiley, New York, Jan. 1999.
9. AMS, Memory Compiler for Diffusion Programmable ROM in $0.6 \mu m$ CMOS, http://www.amsint.com/databooks/.
10. A. Rudra, P. Dubey, C. Jutla, V. Kumar, J. Rao, P. Rohatgi, Efficient Rijndael Encryption Implementation with Composite Field Arithmetic, Proceedings of Workshop on Cryptographic Hardware and Embedded Systems, France, 2001, to be published in Springer LNCS.
11. I. Verbauwhede, H. Kuo, Architectural Optimization for a 1.82 Gbits/sec VLSI Implementation of the AES Rijndael Algorithm, Proceedings of Workshop on Cryptographic Hardware and Embedded Systems, France, 2001, to be published in Springer LNCS.

[^0]:    * The work described originates from the European Commission funded Project Secure Terminal IC (SETIC) established under contract IST-2000-25167 resp. Crypto Module with USB Interface (USB_CRYPT) established under contract IST-2000-25169 in the Information Society Technologies (IST) Program.

