   
Platinum sponsor: 


Sponsor: 


Sponsor: 


Sponsor: 



Accepted papers
UPDATE: The workshop record is now online.
The following submitted papers have been accepted:

Alex Biryukov and Johann Großschädl
CAESAR: Cryptanalysis of the Full AES Using GPULike Hardware
Abstract. The block cipher Rijndael has undergone more than ten years of extensive
cryptanalysis
since its submission as a candidate for the Advanced Encryption Standard (AES) in April 1998.
To
date, most of the publiclyknown cryptanalytic results are based on reducedround variants of
the
AES (respectively Rijndael) algorithm. Among the few exceptions that target the full AES are
the
RelatedKey Cryptanalysis (RKC) introduced at ASIACRYPT 2009 and attacks exploiting
TimeMemoryKey (TMK) tradeoffs such as demonstrated at SAC 2005. However, all these attacks are
generally considered infeasible in practice due to their high complexity (i.e. 2^99.5 AES
operations
for RKC, 2^80 for TMK). In this paper, we evaluate the cost of cryptanalytic attacks on the
full AES
when using specialpurpose hardware in the form of multicore AES processors that are designed
in
a similar way as modern Graphics Processing Units (GPUs) such as the NVIDIA GT200b. Using
today's VLSI technology would allow for the implementation of a GPUlike processor reaching a
throughput of up to 10^12 AES operations per second. An organization able to spend one trillion
US$
for designing and building a supercomputer based on such processors could theoretically break
the
full AES in a time frame of as little as one year when using RKC, or in merely one month when
performing a TMK attack. We also analyze different timecost tradeoffs and assess the
implications
of progress in VLSI technology under the assumption that Moore's law will continue to hold for
the
next ten years. These assessments raise some concerns about the longterm security of the AES.

Andrey Bogdanov, Elif Bilge Kavun, Christof Paar, Christian Rechberger, and Tolga Yalcin
Better than bruteforce—optimized hardware architecture for efficient biclique attacks on AES128
Abstract.
Biclique cryptanalysis was recently used to claim the first attack on
AES128 without related keys. However, compared to a simple brute force
attack, the method is more complicated, and its data requirements are
much higher (2^88 chosen ciphertexts). Still, the theoretical speed gain
over brute force search was small.
We show in this paper for the first time a practical implementation of a
biclique attack on AES, using a parallel FPGA machine and ASICs, with a
practical data complexity of only 16 chosen plaintexts, and still
gaining almost a factor 2 over optimized brute force search.

Martijn Sprengers, Lejla Batina
Speeding up GPUbased password cracking
Abstract. Recent advances in the graphics processing unit (GPU) hardware
challenge the way we look at secure password storage. GPUs have
proven to be suitable for cryptographic operations and provide a significant
speedup in performance compared to traditional central processing
units (CPUs). This research presents a proof of concept for the impact of
launching an exhaustive search attack on the MD5crypt password hashing
scheme using modern GPUs. We show that it is possible to achieve
a performance of 880 000 password hashes per second, using various
optimization techniques. For our implementation, executed on a standard
GPU, we obtain a speedup of a factor of 30 when compared with equally
priced CPU hardware. With this performance increase, `complex' passwords
with a length of 8 characters are now becoming feasible to crack
even with inexpensive hardware.

Masaya Yasuda, Takeshi Shimoyama, Tetsuya Izu, Jun Kogure
On the strength comparison of ECC and RSA
Abstract. At present, the RSA cryptosystem is most widely used in public key cryptography. On
the other hand, elliptic curve cryptography (ECC) has recently received much attention since
smaller
ECC key sizes provide the same security level as RSA. Although there are a lot of previous
works that
analyze the security of ECC and RSA, the comparison of strengths varies depending on analysis.
The
aim of this paper is once again to compare the security strengths, considering stateofart of
theory
and experiments. In this paper, we compare the computing power required to solve the elliptic
curve
discrete logarithm problem (ECDLP) and the integer factorization problem (IFP), respectively,
and
estimate the sizes of the problems that provide the same level of security.

Lyndon Judge, Patrick Schaumont
A Flexible Hardware ECDLP Engine in Bluespec
Abstract. A parallel hardware implementation of the Pollard rho
algorithm is a complex design requiring multiple layers of design hierarchy.
In this contribution, we investigate an ECDLP design for a secp112r1
curve, and we discuss the design difficulties of a traditional hardware
design method based on Verilog. The lack of flexible control structures,
and the bottomup style of hardware design leads to a rigid architecture,
incapable of supporting quick architecture changes and design space
exploration. We then present the same ECDLP implementation using
Bluespec, a rulebased hardware description language. The language features
of Bluespec support effective design space exploration. We demonstrate
this by comparing several secp112r1 ECDLP variants, with nontrivial
modification of parameters. We investigate the speed area tradeoff
resulting from various design parameters and demonstrate performance
from 53K to 2.87M iterations per second for Xilinx FPGA slice counts
varying from 4,407 to 35,232 slices. We emphasize that these numbers
were obtained by measuring the performance of a prototype, rather than
estimating them from synthesis tool output. We conclude that the design
of complex cryptanalytic hardware can greatly benefit from better
hardware design methodologies, and we would like to advocate the
importance of this aspect.

Ryan Henry, Ian Goldberg
Solving Discrete Logarithms in SmoothOrder Groups with CUDA
Abstract.
This paper chronicles our experiences using CUDA to implement a parallelized variant of
Pollard's
rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but
smooth order using commodity GPUs.
We first discuss some key design constraints imposed by modern GPU
architectures and the CUDA framework, and then explain how we were able to implement efficient
arbitraryprecision modular multiplication within these constraints. Our implementation can
execute
roughly 51.9 million 768bit modular multiplications per second—or a whopping 840 million
192bit
modular multiplications per second—on a single Nvidia Tesla M2050 GPU card, which is a
notable
improvement over all previous results on comparable hardware. We leverage this fast modular
multiplication in our implementation of the parallel rho algorithm, which can solve discrete
logarithms modulo
a 1536bit RSA number with a 2^55smooth totient in less than two minutes. We conclude the
paper by
discussing implications to discrete logarithmbased cryptosystems, and by pointing out how
efficient implementations of parallel rho (or related algorithms) lead
to trapdoor discrete logarithm
groups; we also
point out two potential cryptographic applications for the latter. Our code is written in C
for CUDA and
PTX; it is open source and freely available for download online.

ChenMou Cheng, Tung Chou, Ruben Niederhagen, BoYin Yang
Solving Quadratic Equations with XL on Parallel Architectures
Abstract.
Solving a system of multivariate quadratic equations (MQ) is a hard problem whose
complexity
estimates are relevant to many cryptographic scenarios. In some cases it is required in the
best known
attack; sometimes it is a generic attack (such as for the multivariate PKCs), and some of the
time it
determines a provable level of security (such as for the QUAD stream ciphers).
Under some reasonablelooking assumptions, the best way to solve generic MQ systems is the
XL algorithm implemented with a sparse matrix solver such as Wiedemann. Knowing how fast one
can implement this attack gives us a good idea of how future cryptosystems related to MQ can
be
broken, similar to how implementations of General Number Field Sieve that factors smaller RSA
numbers gives us more insight into the security of actual RSAbased cryptosystems.
This paper describes such an implementation of XL with Block Wiedemann. We are able to
solve in 2.5 days, on a US$6000 computer, a system with 30 variables and 60 equations over F_16
(a computation of about 2^57 F_16multiplications).
This is something that we do not expect that
F_4/F_5 would accomplish due to its much higher space usage.
The software can be easily adapted to other small fields including F_2.
More importantly, it scales nicely for small clusters, NUMA machines,
and a combination of both.

Itai Dinur, Tim Güneysu, Christof Paar, Adi Shamir, Ralf Zimmermann
Experimentally Verifying a Complex Algebraic Attack on the
Grain128 Cipher Using Dedicated Reconfigurable Hardware
Abstract. In this work, we describe the first singlekey attack on the full version of Grain128
that can recover arbitrary keys.
Our attack is based on a new version of a cube tester, which
is a factor of about 2^38 faster than exhaustive search. To practically verify our results, we
implemented
the attack on the reconfigurable hardware cluster RIVYERA and tested the main components of
the attack for dozens of random keys. Our experiments successfully demonstrated the
correctness
and expected complexity of the attack by finding a very significant bias in our new cube tester
for
about 7.5% of the tested keys. This is the first time that the main components
of a complex analytical attack against a
digital fullsize cipher were successfully implemented using
specialpurpose hardware, truly exploiting the reconfigurable nature of
an FPGAbased cryptanalytical device.

Daniel J. Bernstein, HsiehChung Chen, ChenMou Cheng, Tanja Lange, Ruben Niederhagen, Peter Schwabe, BoYin Yang
Usable assembly language for GPUs: a success story
Abstract.
The NVIDIA compilers nvcc and ptxas leave the
programmer with only very limited control over register allocation,
register spills, instruction selection, and instruction scheduling.
In theory a programmer can gain control by writing an entire
kernel in van der Laan's cudasm assembly language, but this
requires tedious, errorprone tracking of register assignments.
This paper introduces a higherlevel assembly language,
qhasmcudasm, that allows much faster programming while providing the same amount of control over the GPU. This language
has been used successfully to build a 90000machineinstruction
kernel for a computation described in detail in the paper, the
largest public cryptanalytic project in history. The best GTX
295 speed that has been obtained for this computation with nvcc
and ptxas is 25 million iterations per second; the best GTX 295
speed that has been obtained with qhasmcudasm is 63 million
iterations per second.

Nicolas T. Courtois, Daniel Hulme, Theodosis Mourouzis
Solving Circuit Optimisation Problems in Cryptography and Cryptanalysis
Abstract. In this paper we look at the problem of compact representation
and optimization of some small circuits such as Sboxes in cryptographic
algorithms. Our ambition is to show that it is a lot more than an
essential software task in industrial hardware implementations of
standard cryptographic algorithms [20, 25, 13, 8]. We show that that it can
be done in a new way and should lead to many exciting new applications
in particular in cryptanalysis.
The starting point is the notion of Multiplicative Complexity recently
used to find very good optimizations of the AES Sbox [5, 8, 7]. We have
developed a method and software to optimize the multiplicative
complexity and also the linear components using SAT solvers. We produce
a compact implementation of two block ciphers PRESENT and GOST
known for their exceptionally low hardware cost. We cover all the
numerous variants of GOST and have released an open source bitslice
implementation of PRESENT which is now the best publicly available [1].
We explain why our methodology is suitable and should be used in
implementations aiming at preventing side channel attacks on cryptographic
chips such as DPA. We show that our method can give better results
that with standard and gatelevel logical optimization tools such as the
famous Berkeley Logic Friday software. Moreover, most of our results
are optimal and cannot be improved which is a rare achievement in
complexity.
This kind of optimizations can be seen as an essential ingredient and key
step underlying a number of recent attacks on symmetric ciphers such
as in [17, 16], and any further optimization effort should result in either
improved attack speeds or a proof of optimality. For example we have
been able to obtain a singlekey attack on the full 32round GOST block
cipher which is faster than brute force.
Version
This is version 2012.03.25 of the accepted.html web page.
