SHARCS 2012: Accepted papers

SHARCS 2012
Special-Purpose Hardware for Attacking Cryptographic Systems

Introduction

How to participate:

Schedule

Travel

Registration

How to contribute:

Call for papers

Submission

Accepted papers

How to sponsor/exhibit:

Call for exhibitors

Platinum sponsor:

Sponsor:

Accepted papers

UPDATE: The workshop record is now online.

The following submitted papers have been accepted:

Alex Biryukov and Johann Großschädl
CAESAR: Cryptanalysis of the Full AES Using GPU-Like Hardware
Abstract. The block cipher Rijndael has undergone more than ten years of extensive cryptanalysis since its submission as a candidate for the Advanced Encryption Standard (AES) in April 1998. To date, most of the publicly-known cryptanalytic results are based on reduced-round variants of the AES (respectively Rijndael) algorithm. Among the few exceptions that target the full AES are the Related-Key Cryptanalysis (RKC) introduced at ASIACRYPT 2009 and attacks exploiting Time-Memory-Key (TMK) trade-offs such as demonstrated at SAC 2005. However, all these attacks are generally considered infeasible in practice due to their high complexity (i.e. 2^99.5 AES operations for RKC, 2^80 for TMK). In this paper, we evaluate the cost of cryptanalytic attacks on the full AES when using special-purpose hardware in the form of multi-core AES processors that are designed in a similar way as modern Graphics Processing Units (GPUs) such as the NVIDIA GT200b. Using today's VLSI technology would allow for the implementation of a GPU-like processor reaching a throughput of up to 10^12 AES operations per second. An organization able to spend one trillion US$ for designing and building a supercomputer based on such processors could theoretically break the full AES in a time frame of as little as one year when using RKC, or in merely one month when performing a TMK attack. We also analyze different time-cost trade-offs and assess the implications of progress in VLSI technology under the assumption that Moore's law will continue to hold for the next ten years. These assessments raise some concerns about the long-term security of the AES.
Andrey Bogdanov, Elif Bilge Kavun, Christof Paar, Christian Rechberger, and Tolga Yalcin
Better than brute-force—optimized hardware architecture for efficient biclique attacks on AES-128
Abstract. Biclique cryptanalysis was recently used to claim the first attack on AES-128 without related keys. However, compared to a simple brute force attack, the method is more complicated, and its data requirements are much higher (2^88 chosen ciphertexts). Still, the theoretical speed gain over brute force search was small.
We show in this paper for the first time a practical implementation of a biclique attack on AES, using a parallel FPGA machine and ASICs, with a practical data complexity of only 16 chosen plaintexts, and still gaining almost a factor 2 over optimized brute force search.
Martijn Sprengers, Lejla Batina
Speeding up GPU-based password cracking
Abstract. Recent advances in the graphics processing unit (GPU) hardware challenge the way we look at secure password storage. GPUs have proven to be suitable for cryptographic operations and provide a significant speedup in performance compared to traditional central processing units (CPUs). This research presents a proof of concept for the impact of launching an exhaustive search attack on the MD5-crypt password hashing scheme using modern GPUs. We show that it is possible to achieve a performance of 880 000 password hashes per second, using various optimization techniques. For our implementation, executed on a standard GPU, we obtain a speed-up of a factor of 30 when compared with equally priced CPU hardware. With this performance increase, `complex' passwords with a length of 8 characters are now becoming feasible to crack even with inexpensive hardware.
Masaya Yasuda, Takeshi Shimoyama, Tetsuya Izu, Jun Kogure
On the strength comparison of ECC and RSA
Abstract. At present, the RSA cryptosystem is most widely used in public key cryptography. On the other hand, elliptic curve cryptography (ECC) has recently received much attention since smaller ECC key sizes provide the same security level as RSA. Although there are a lot of previous works that analyze the security of ECC and RSA, the comparison of strengths varies depending on analysis. The aim of this paper is once again to compare the security strengths, considering state-of-art of theory and experiments. In this paper, we compare the computing power required to solve the elliptic curve discrete logarithm problem (ECDLP) and the integer factorization problem (IFP), respectively, and estimate the sizes of the problems that provide the same level of security.
Lyndon Judge, Patrick Schaumont
A Flexible Hardware ECDLP Engine in Bluespec
Abstract. A parallel hardware implementation of the Pollard rho algorithm is a complex design requiring multiple layers of design hierarchy. In this contribution, we investigate an ECDLP design for a secp112r1 curve, and we discuss the design difficulties of a traditional hardware design method based on Verilog. The lack of flexible control structures, and the bottom-up style of hardware design leads to a rigid architecture, incapable of supporting quick architecture changes and design space exploration. We then present the same ECDLP implementation using Bluespec, a rule-based hardware description language. The language features of Bluespec support effective design space exploration. We demonstrate this by comparing several secp112r1 ECDLP variants, with non-trivial modification of parameters. We investigate the speed area tradeoff resulting from various design parameters and demonstrate performance from 53K to 2.87M iterations per second for Xilinx FPGA slice counts varying from 4,407 to 35,232 slices. We emphasize that these numbers were obtained by measuring the performance of a prototype, rather than estimating them from synthesis tool output. We conclude that the design of complex cryptanalytic hardware can greatly benefit from better hardware design methodologies, and we would like to advocate the importance of this aspect.
Ryan Henry, Ian Goldberg
Solving Discrete Logarithms in Smooth-Order Groups with CUDA
Abstract. This paper chronicles our experiences using CUDA to implement a parallelized variant of Pollard's rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but smooth order using commodity GPUs. We first discuss some key design constraints imposed by modern GPU architectures and the CUDA framework, and then explain how we were able to implement efficient arbitrary-precision modular multiplication within these constraints. Our implementation can execute roughly 51.9 million 768-bit modular multiplications per second—or a whopping 840 million 192-bit modular multiplications per second—on a single Nvidia Tesla M2050 GPU card, which is a notable improvement over all previous results on comparable hardware. We leverage this fast modular multiplication in our implementation of the parallel rho algorithm, which can solve discrete logarithms modulo a 1536-bit RSA number with a 2^55-smooth totient in less than two minutes. We conclude the paper by discussing implications to discrete logarithm-based cryptosystems, and by pointing out how efficient implementations of parallel rho (or related algorithms) lead to trapdoor discrete logarithm groups; we also point out two potential cryptographic applications for the latter. Our code is written in C for CUDA and PTX; it is open source and freely available for download online.
Chen-Mou Cheng, Tung Chou, Ruben Niederhagen, Bo-Yin Yang
Solving Quadratic Equations with XL on Parallel Architectures
Abstract. Solving a system of multivariate quadratic equations (MQ) is a hard problem whose complexity estimates are relevant to many cryptographic scenarios. In some cases it is required in the best known attack; sometimes it is a generic attack (such as for the multivariate PKCs), and some of the time it determines a provable level of security (such as for the QUAD stream ciphers).
Under some reasonable-looking assumptions, the best way to solve generic MQ systems is the XL algorithm implemented with a sparse matrix solver such as Wiedemann. Knowing how fast one can implement this attack gives us a good idea of how future cryptosystems related to MQ can be broken, similar to how implementations of General Number Field Sieve that factors smaller RSA numbers gives us more insight into the security of actual RSA-based cryptosystems.
This paper describes such an implementation of XL with Block Wiedemann. We are able to solve in 2.5 days, on a US$6000 computer, a system with 30 variables and 60 equations over F_16 (a computation of about 2^57 F_16-multiplications). This is something that we do not expect that F_4/F_5 would accomplish due to its much higher space usage. The software can be easily adapted to other small fields including F_2. More importantly, it scales nicely for small clusters, NUMA machines, and a combination of both.
Itai Dinur, Tim Güneysu, Christof Paar, Adi Shamir, Ralf Zimmermann
Experimentally Verifying a Complex Algebraic Attack on the Grain-128 Cipher Using Dedicated Reconfigurable Hardware
Abstract. In this work, we describe the first single-key attack on the full version of Grain-128 that can recover arbitrary keys. Our attack is based on a new version of a cube tester, which is a factor of about 2^38 faster than exhaustive search. To practically verify our results, we implemented the attack on the reconfigurable hardware cluster RIVYERA and tested the main components of the attack for dozens of random keys. Our experiments successfully demonstrated the correctness and expected complexity of the attack by finding a very significant bias in our new cube tester for about 7.5% of the tested keys. This is the first time that the main components of a complex analytical attack against a digital full-size cipher were successfully implemented using special-purpose hardware, truly exploiting the reconfigurable nature of an FPGA-based cryptanalytical device.
Daniel J. Bernstein, Hsieh-Chung Chen, Chen-Mou Cheng, Tanja Lange, Ruben Niederhagen, Peter Schwabe, Bo-Yin Yang
Usable assembly language for GPUs: a success story
Abstract. The NVIDIA compilers nvcc and ptxas leave the programmer with only very limited control over register allocation, register spills, instruction selection, and instruction scheduling. In theory a programmer can gain control by writing an entire kernel in van der Laan's cudasm assembly language, but this requires tedious, error-prone tracking of register assignments.
This paper introduces a higher-level assembly language, qhasm-cudasm, that allows much faster programming while providing the same amount of control over the GPU. This language has been used successfully to build a 90000-machine-instruction kernel for a computation described in detail in the paper, the largest public cryptanalytic project in history. The best GTX 295 speed that has been obtained for this computation with nvcc and ptxas is 25 million iterations per second; the best GTX 295 speed that has been obtained with qhasm-cudasm is 63 million iterations per second.
Nicolas T. Courtois, Daniel Hulme, Theodosis Mourouzis
Solving Circuit Optimisation Problems in Cryptography and Cryptanalysis
Abstract. In this paper we look at the problem of compact representation and optimization of some small circuits such as S-boxes in cryptographic algorithms. Our ambition is to show that it is a lot more than an essential software task in industrial hardware implementations of standard cryptographic algorithms [20, 25, 13, 8]. We show that that it can be done in a new way and should lead to many exciting new applications in particular in cryptanalysis.
The starting point is the notion of Multiplicative Complexity recently used to find very good optimizations of the AES S-box [5, 8, 7]. We have developed a method and software to optimize the multiplicative complexity and also the linear components using SAT solvers. We produce a compact implementation of two block ciphers PRESENT and GOST known for their exceptionally low hardware cost. We cover all the numerous variants of GOST and have released an open source bitslice implementation of PRESENT which is now the best publicly available [1]. We explain why our methodology is suitable and should be used in implementations aiming at preventing side channel attacks on cryptographic chips such as DPA. We show that our method can give better results that with standard and gate-level logical optimization tools such as the famous Berkeley Logic Friday software. Moreover, most of our results are optimal and cannot be improved which is a rare achievement in complexity.
This kind of optimizations can be seen as an essential ingredient and key step underlying a number of recent attacks on symmetric ciphers such as in [17, 16], and any further optimization effort should result in either improved attack speeds or a proof of optimality. For example we have been able to obtain a single-key attack on the full 32-round GOST block cipher which is faster than brute force.

Version

This is version 2012.03.25 of the accepted.html web page.