
For FullText PDF, please login, if you are a member of IEICE,
or go to Pay Per View on menu list, if you are a nonmember of IEICE.

Fast Montgomery Modular Multiplication and Squaring on Embedded Processors
Yang LI Jinlin WANG Xuewen ZENG Xiaozhou YE
Publication
IEICE TRANSACTIONS on Communications
Vol.E100B
No.5
pp.680690 Publication Date: 2017/05/01
Online ISSN: 17451345 Type of Manuscript: PAPER Category: Fundamental Theories for Communications Keyword: Montgomery modular multiplication, embedded processors, hybrid multiplication, lazy doubling, coarsely integrated,
Full Text: PDF(3.5MB) >>Buy this Article
Summary:
Montgomery modular multiplication is one of the most efficient algorithms for modular multiplication of large integers. On resourceconstraint embedded processors, memoryaccess operations play an important role as arithmetic operations in the modular multiplication. To improve the efficiency of Montgomery modular multiplication on embedded processors, this paper concentrates on reducing the memoryaccess operations through adding a few working registers. We first revisit previous popular Montgomery modular multiplication algorithms, and then present improved algorithms for Montgomery modular multiplication and squaring for arbitrary prime fields. The algorithms adopt the general ideas of hybrid multiplication algorithm proposed by Gura and lazy doubling algorithm proposed by Lee. By careful optimization and redesign, we propose novel implementations for Montgomery multiplication and squaring called coarsely integrated product and operand hybrid scanning algorithm (CIPOHS) and coarsely integrated lazy doubling algorithm (CILD). Then, we implement the algorithms on general MIPS64 processor and OCTEON CN6645 processor equipped with specific multiplyadd instructions. Experiments show that CIPOHS and CILD offer the best performance both on the general MIPS64 and OCTEON CN6645 processors. But the proposed algorithms have obvious advantages for the processors with specific multiplyadd instructions such as OCTEON CN6645. When the modulus is 2048 bits, the CIPOHS and CILD outperform the CIOS algorithm by a factor of 47% and 58%, respectively.

