UPDATE: The workshop record is now online.
The following submitted papers have been accepted:
Alex Biryukov and Johann Großschädl
CAESAR: Cryptanalysis of the Full AES Using GPU-Like Hardware
Abstract. The block cipher Rijndael has undergone more than ten years of extensive
since its submission as a candidate for the Advanced Encryption Standard (AES) in April 1998.
date, most of the publicly-known cryptanalytic results are based on reduced-round variants of
AES (respectively Rijndael) algorithm. Among the few exceptions that target the full AES are
Related-Key Cryptanalysis (RKC) introduced at ASIACRYPT 2009 and attacks exploiting
Time-Memory-Key (TMK) trade-offs such as demonstrated at SAC 2005. However, all these attacks are
generally considered infeasible in practice due to their high complexity (i.e. 2^99.5 AES
for RKC, 2^80 for TMK). In this paper, we evaluate the cost of cryptanalytic attacks on the
when using special-purpose hardware in the form of multi-core AES processors that are designed
a similar way as modern Graphics Processing Units (GPUs) such as the NVIDIA GT200b. Using
today's VLSI technology would allow for the implementation of a GPU-like processor reaching a
throughput of up to 10^12 AES operations per second. An organization able to spend one trillion
for designing and building a supercomputer based on such processors could theoretically break
full AES in a time frame of as little as one year when using RKC, or in merely one month when
performing a TMK attack. We also analyze different time-cost trade-offs and assess the
of progress in VLSI technology under the assumption that Moore's law will continue to hold for
next ten years. These assessments raise some concerns about the long-term security of the AES.
Andrey Bogdanov, Elif Bilge Kavun, Christof Paar, Christian Rechberger, and Tolga Yalcin
Better than brute-force—optimized hardware architecture for efficient biclique attacks on AES-128
Biclique cryptanalysis was recently used to claim the first attack on
AES-128 without related keys. However, compared to a simple brute force
attack, the method is more complicated, and its data requirements are
much higher (2^88 chosen ciphertexts). Still, the theoretical speed gain
over brute force search was small.
We show in this paper for the first time a practical implementation of a
biclique attack on AES, using a parallel FPGA machine and ASICs, with a
practical data complexity of only 16 chosen plaintexts, and still
gaining almost a factor 2 over optimized brute force search.
Martijn Sprengers, Lejla Batina
Speeding up GPU-based password cracking
Abstract. Recent advances in the graphics processing unit (GPU) hardware
challenge the way we look at secure password storage. GPUs have
proven to be suitable for cryptographic operations and provide a significant
speedup in performance compared to traditional central processing
units (CPUs). This research presents a proof of concept for the impact of
launching an exhaustive search attack on the MD5-crypt password hashing
scheme using modern GPUs. We show that it is possible to achieve
a performance of 880 000 password hashes per second, using various
optimization techniques. For our implementation, executed on a standard
GPU, we obtain a speed-up of a factor of 30 when compared with equally
priced CPU hardware. With this performance increase, `complex' passwords
with a length of 8 characters are now becoming feasible to crack
even with inexpensive hardware.
Masaya Yasuda, Takeshi Shimoyama, Tetsuya Izu, Jun Kogure
On the strength comparison of ECC and RSA
Abstract. At present, the RSA cryptosystem is most widely used in public key cryptography. On
the other hand, elliptic curve cryptography (ECC) has recently received much attention since
ECC key sizes provide the same security level as RSA. Although there are a lot of previous
analyze the security of ECC and RSA, the comparison of strengths varies depending on analysis.
aim of this paper is once again to compare the security strengths, considering state-of-art of
and experiments. In this paper, we compare the computing power required to solve the elliptic
discrete logarithm problem (ECDLP) and the integer factorization problem (IFP), respectively,
estimate the sizes of the problems that provide the same level of security.
Lyndon Judge, Patrick Schaumont
A Flexible Hardware ECDLP Engine in Bluespec
Abstract. A parallel hardware implementation of the Pollard rho
algorithm is a complex design requiring multiple layers of design hierarchy.
In this contribution, we investigate an ECDLP design for a secp112r1
curve, and we discuss the design difficulties of a traditional hardware
design method based on Verilog. The lack of flexible control structures,
and the bottom-up style of hardware design leads to a rigid architecture,
incapable of supporting quick architecture changes and design space
exploration. We then present the same ECDLP implementation using
Bluespec, a rule-based hardware description language. The language features
of Bluespec support effective design space exploration. We demonstrate
this by comparing several secp112r1 ECDLP variants, with non-trivial
modification of parameters. We investigate the speed area tradeoff
resulting from various design parameters and demonstrate performance
from 53K to 2.87M iterations per second for Xilinx FPGA slice counts
varying from 4,407 to 35,232 slices. We emphasize that these numbers
were obtained by measuring the performance of a prototype, rather than
estimating them from synthesis tool output. We conclude that the design
of complex cryptanalytic hardware can greatly benefit from better
hardware design methodologies, and we would like to advocate the
importance of this aspect.
Ryan Henry, Ian Goldberg
Solving Discrete Logarithms in Smooth-Order Groups with CUDA
This paper chronicles our experiences using CUDA to implement a parallelized variant of
rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but
smooth order using commodity GPUs.
We first discuss some key design constraints imposed by modern GPU
architectures and the CUDA framework, and then explain how we were able to implement efficient
arbitrary-precision modular multiplication within these constraints. Our implementation can
roughly 51.9 million 768-bit modular multiplications per second—or a whopping 840 million
modular multiplications per second—on a single Nvidia Tesla M2050 GPU card, which is a
improvement over all previous results on comparable hardware. We leverage this fast modular
multiplication in our implementation of the parallel rho algorithm, which can solve discrete
a 1536-bit RSA number with a 2^55-smooth totient in less than two minutes. We conclude the
discussing implications to discrete logarithm-based cryptosystems, and by pointing out how
efficient implementations of parallel rho (or related algorithms) lead
to trapdoor discrete logarithm
groups; we also
point out two potential cryptographic applications for the latter. Our code is written in C
for CUDA and
PTX; it is open source and freely available for download online.
Chen-Mou Cheng, Tung Chou, Ruben Niederhagen, Bo-Yin Yang
Solving Quadratic Equations with XL on Parallel Architectures
Solving a system of multivariate quadratic equations (MQ) is a hard problem whose
estimates are relevant to many cryptographic scenarios. In some cases it is required in the
attack; sometimes it is a generic attack (such as for the multivariate PKCs), and some of the
determines a provable level of security (such as for the QUAD stream ciphers).
Under some reasonable-looking assumptions, the best way to solve generic MQ systems is the
XL algorithm implemented with a sparse matrix solver such as Wiedemann. Knowing how fast one
can implement this attack gives us a good idea of how future cryptosystems related to MQ can
broken, similar to how implementations of General Number Field Sieve that factors smaller RSA
numbers gives us more insight into the security of actual RSA-based cryptosystems.
This paper describes such an implementation of XL with Block Wiedemann. We are able to
solve in 2.5 days, on a US$6000 computer, a system with 30 variables and 60 equations over F_16
(a computation of about 2^57 F_16-multiplications).
This is something that we do not expect that
F_4/F_5 would accomplish due to its much higher space usage.
The software can be easily adapted to other small fields including F_2.
More importantly, it scales nicely for small clusters, NUMA machines,
and a combination of both.
Itai Dinur, Tim Güneysu, Christof Paar, Adi Shamir, Ralf Zimmermann
Experimentally Verifying a Complex Algebraic Attack on the
Grain-128 Cipher Using Dedicated Reconfigurable Hardware
Abstract. In this work, we describe the first single-key attack on the full version of Grain-128
that can recover arbitrary keys.
Our attack is based on a new version of a cube tester, which
is a factor of about 2^38 faster than exhaustive search. To practically verify our results, we
the attack on the reconfigurable hardware cluster RIVYERA and tested the main components of
the attack for dozens of random keys. Our experiments successfully demonstrated the
and expected complexity of the attack by finding a very significant bias in our new cube tester
about 7.5% of the tested keys. This is the first time that the main components
of a complex analytical attack against a
digital full-size cipher were successfully implemented using
special-purpose hardware, truly exploiting the reconfigurable nature of
an FPGA-based cryptanalytical device.
Daniel J. Bernstein, Hsieh-Chung Chen, Chen-Mou Cheng, Tanja Lange, Ruben Niederhagen, Peter Schwabe, Bo-Yin Yang
Usable assembly language for GPUs: a success story
The NVIDIA compilers nvcc and ptxas leave the
programmer with only very limited control over register allocation,
register spills, instruction selection, and instruction scheduling.
In theory a programmer can gain control by writing an entire
kernel in van der Laan's cudasm assembly language, but this
requires tedious, error-prone tracking of register assignments.
This paper introduces a higher-level assembly language,
qhasm-cudasm, that allows much faster programming while providing the same amount of control over the GPU. This language
has been used successfully to build a 90000-machine-instruction
kernel for a computation described in detail in the paper, the
largest public cryptanalytic project in history. The best GTX
295 speed that has been obtained for this computation with nvcc
and ptxas is 25 million iterations per second; the best GTX 295
speed that has been obtained with qhasm-cudasm is 63 million
iterations per second.
Nicolas T. Courtois, Daniel Hulme, Theodosis Mourouzis
Solving Circuit Optimisation Problems in Cryptography and Cryptanalysis
Abstract. In this paper we look at the problem of compact representation
and optimization of some small circuits such as S-boxes in cryptographic
algorithms. Our ambition is to show that it is a lot more than an
essential software task in industrial hardware implementations of
standard cryptographic algorithms [20, 25, 13, 8]. We show that that it can
be done in a new way and should lead to many exciting new applications
in particular in cryptanalysis.
The starting point is the notion of Multiplicative Complexity recently
used to find very good optimizations of the AES S-box [5, 8, 7]. We have
developed a method and software to optimize the multiplicative
complexity and also the linear components using SAT solvers. We produce
a compact implementation of two block ciphers PRESENT and GOST
known for their exceptionally low hardware cost. We cover all the
numerous variants of GOST and have released an open source bitslice
implementation of PRESENT which is now the best publicly available .
We explain why our methodology is suitable and should be used in
implementations aiming at preventing side channel attacks on cryptographic
chips such as DPA. We show that our method can give better results
that with standard and gate-level logical optimization tools such as the
famous Berkeley Logic Friday software. Moreover, most of our results
are optimal and cannot be improved which is a rare achievement in
This kind of optimizations can be seen as an essential ingredient and key
step underlying a number of recent attacks on symmetric ciphers such
as in [17, 16], and any further optimization effort should result in either
improved attack speeds or a proof of optimality. For example we have
been able to obtain a single-key attack on the full 32-round GOST block
cipher which is faster than brute force.
This is version 2012.03.25 of the accepted.html web page.