IJSTR

International Journal of Scientific & Technology Research

Home About Us Scope Editorial Board Blog/Latest News Contact Us
0.2
2019CiteScore
 
10th percentile
Powered by  Scopus
Scopus coverage:
Nov 2018 to May 2020

CALL FOR PAPERS
AUTHORS
DOWNLOADS
CONTACT

IJSTR >> Volume 9 - Issue 6, June 2020 Edition



International Journal of Scientific & Technology Research  
International Journal of Scientific & Technology Research

Website: http://www.ijstr.org

ISSN 2277-8616



GPU-Accelerated Optimization Of Block Lanczos Solver For Sparse Linear System

[Full Text]

 

AUTHOR(S)

Prashant Verma, Kapil Sharma

 

KEYWORDS

block lanczos, cryptography, Graphics Processing Unit, GPU-Direct P2P, MIMD, parallel-processing, RDMA.

 

ABSTRACT

Solving large and sparse system of linear equations has been extensively used for several crypt-analytic techniques. Block Lanczos and Block Wiedemann algorithms are well known for solving large sparse systems. However, the time complexity of such popular method makes it reluctant and hence, the concept of parallelism is made compulsory for such methods. This paper introduced an optimization of Block Lanczos method over finite field using accelerated GPUs. Here we consider GF (2) finite field. The parallel solver for optimization of Block Lanczos is implemented using NVIDIA Compute Unified Device Architecture (CUDA) and Message Passing Interface (MPI) to take advantage of multi-level parallelism on multi-node and multi-GPU systems. CUDA-aware MPI has been extensively used to leverage GPU-Direct Remote Direct Memory Access (RDMA) and GPU-Direct Point to Point (P2P) for optimized inter and intra node communication. The proposed solver for optimization of Block Lanczos explored the bandwidth of memory bandwidth on a single Tesla, multi Tesla K40 and multi Tesla P100 GPU nodes. The parallel efficiency is also achieved on DGX system with 8 P100 and 1 V100 Tesla GPU’s respectively.

 

REFERENCES

[1] Qi Wang, Xiubin Fan, Hongyan Zang, Yu Wang, The Space Complexity Analysis in the General Number Field Sieve Integer Factorization, Theoretical Computer Science, Volume 630, 2016, Pages 76-94,ISSN0304-3975, https://doi.org/10.1016/j.tcs.2016.03.028.
[2] B. Sengupta, A. Das, Use of SIMD-based data parallelism to speed up sieving in integer-factoring algorithms, IACR Cryptology ePrint Archive (2015) 44.
[3] P. Giorgi, R. Lebreton, Online order basis algorithm and its impact on the block Wiedemann algorithm, in: Proc. 39th Int. Symp. Symbolic and Algebraic Computation (ISSAC’14), ACM, 2014, pp. 202–209.
[4] A. G. Huang, Parallel Block Wiedemann-based GNFS algorithm for integer factorization, Master thesis, St. Francis Xavier University, Canada (2010).
[5] T. Zhou, J. Jiang, Performance modeling of hyper-scale custom machine for the principal steps in block Wiedemann algorithm, The J. Supercomputing (2016) 1–23.
[6] “Top500 list - November 2017,” https://www.top500.org/list/2019/11/.
[7] “Summit: Oak ridge national laboratory’s next high-performance supercomputer,” https://www.olcf.ornl.gov/olcfresources/computesystems/summit.
[8] I. Flesch, A new parallel approach to the Block Lanczos algorithm for finding null spaces over GF (2), Master thesis, Utrecht University, the Netherlands (2006).
[9] E. Thom´e, A modified block Lanczos algorithm with fewer vectors, arXiv preprint arXiv:1604.02277.
[10] http://en.wikipedia.org/wiki/Lanczos algorithm (2009). Intel Corporation, Technical Report.
[11] Laurence T. Yang, Ying Huang, Jun Feng, Qiwen Pan, Chunsheng Zhu, An improved parallel block Lanczos algorithm over GF (2) for integer factorization, Information Sciences, Volume 379, 2017, Pages 257-273, ISSN 0020-0255, https://doi.org/10.1016/j.ins.2016.09.052.
[12] T. L. Xu Block Lanczos-based parallel GNFS algorithm for integer factorization, Master thesis, St. Francis Xavier University, Canada (2007).
[13] L. T. Yang, L. Xu, S.S. Yeo, S. Hussain, An integrated parallel GNFS algorithm for integer factorization based on Linbox Montgomery block Lanczos method over GF (2), Computers & Mathematics with Applications 60 (2) (2010) 338–346.
[14] K. Koc and S. N. Arachchige, “A fast algorithm for gaussian elimination over gf (2) and its implementation on the gapp.” J. of Parallel and Distributed Computing., vol. 13, no. 1, pp. 118–122, 1991.
[15] D. Parkinson and M. Wunderlich, “A compact algorithm for gaussian elimination over gf (2) implemented on highly parallel computers,” Parallel Computing, vol. 1, no. 1, pp. 65–73, Aug. 1984.
[16] A. Bogdanov, M. C. Mertens, C. Paar, J. Pelzl, and A. Rupp, “A parallel hardware architecture for fast gaussian elimination over gf (2),” 14th IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 237–248, 2006.
[17] M. R. Albrecht, G. V. Bard, and C. Pernet, “Efficient dense gaussian elimination over the finite field with two elements,” CoRR, vol. abs/1111.6549, 2011.
[18] “M4ri library,” https://github.com/malb/m4ri.
[19] W. Bosma, J. Cannon, and C. Playoust, “The magma algebra system i: The user language,” Journal of Symbolic Computation, vol. 24, no. 3-4, pp. 235–265, Oct. 1997.
[20] Aydin Buluc¸, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (SPAA ’09). Association for Computing Machinery, New York, NY, USA, 233–244.
DOI: https://doi.org/10.1145/1583991.1584053
[21] N. L. Zamarashkin and D. A. Zheltkov, “Gpu based acceleration of parallel block lancoz solver,” Lobachevskii Journal of Mathematics, vol. 39, no. 4, pp. 596–602, May 2018.
[22] “Gpu acceleration of dense matrix and block operations for lanczos method for systems over large prime finite field,” in Supercomputing. RuSCDays, ser. Communications in Computer and Information Science, vol. 793. Springer, 2017, pp.14–26.
[23] I. Gupta, P. Verma, V. Deshpande, N. Vydyanathan and B. Sharma, “GPU-Accelerated Scalable Solver for Large Linear Systems over Finite Fields,” 2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC), Solan Himachal Pradesh, India, 2018, pp. 324-329, doi: 10.1109/PDGC.2018.8745743.
[24] B. Vastenhouw, R. H. Bisseling, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, SIAM Review 47 (1) (2004) 67–95.
[25] An introduction to cuda-aware mpi”. https://devblogs.nvidiacom/parallelforall/introduce-cudaaware-mpi.
[26] “FAQ: Running cuda-aware open MPI” https://www.open-mpi.org/faq/?category=runcuda.
[27] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008.
[28] M. P. Forum, “Mpi: A message-passing interface standard,” Knoxville, TN, USA, Tech. Rep., 1994.
[29] “Nvidia GPUDirect,” https://developer.nvidia.com/gpudirect.
[30] C. Rea˜no and F. Silla, “Performance Evaluation of the NVIDIA Pascal GPU Architecture: Early Experiences,” 2016 IEEE 18th International Conference on High Performance Computing and Communications; Sydney, NSW, 2016, pp. 1234-1235, doi: 10.1109/HPCCSmartCity- DSS.2016.0173.