Speeding up the Multiplication Algorithm for Large Integers

Multiplication is one of the basic operations that influence the performance of many computer applications such as cryptography. The main challenge of the multiplication operation is its cost compared to other basic operations such as addition and subtraction, especially when the size of the numbers is large. In this work, we investigate the use of the window strategy for multiplying a sequence of large integers to design an efficient sequential algorithm, in order to reduce the number of bit-multiplication operations involved in multiplying a sequence of large integers. In our implementation, several parameters are considered and measured for their effect on the proposed algorithm and on the best-known sequential algorithm in the literature. These parameters are the size of the sequence of integers, the size of the integers, the size of the window, and the distribution of the data.


I. INTRODUCTION
Computer arithmetic plays an essential role in every layer of computing and is an important consideration when developing computer solutions for many problems such as cryptography, image processing, and numerical computations. In computer arithmetic, we use different operations such as addition, subtraction, multiplication, and division to achieve the goal of computation. Among these, the operation that has particular significance for many applications is the multiplication operation [1]. The multiplication operation is important mainly for three reasons. First, the time cost of performing the multiplication operation is greater than that of other operations such as addition and subtraction. For example, given two integer numbers of n bits each, the addition and multiplication of the two numbers require O(n) and O(n^2) bit operations, respectively, using the Naïve method [1].
This means that there is a significant difference between the costs of the two operations. Second, many primitive and essential arithmetic operations, such as division, squaring, multiplicative inverse, and modulo operations, are based on the multiplication operation. Therefore, the running time of the multiplication operation affects these operations. Third, several complex applications in computer science, such as cryptography and digital signal processing, are based on a huge number of multiplication operations [2][3][4][5][6]. For example, in the RSA and El-Gamal public-key cryptosystems, the multiplication operation is essential. So, a more efficient multiplication method would speed up the computation process in complex applications.
For the above reasons, different strategies have been suggested to reduce the total number of operations required for multiplication. Two main research directions have been followed to improve the efficiency of the multiplication operation on a data set that consists of m integer numbers, each of size n. The first direction has been to reduce the cost of multiplying two numbers, x and y, of size n each and thereby decrease the total cost of the multiplication operations for a data set. The second direction has been to reduce the cost of multiplying the data set by proposing an efficient strategy to multiply the m numbers. Regarding the first research direction, many methods have been proposed to reduce the time complexity of multiplying two integers in both sequential [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23] and parallel computation [24][25][26][27][28][29]. In the case of sequential computation, several techniques have been proposed, such as the Naïve multiplication algorithm [1,15], Karatsuba's algorithm, the Toom-Cook multiplication algorithm [20], and fast Fourier transform-based algorithms [18]. The time complexity of the Naïve multiplication method is O(n^2), whereas Karatsuba's algorithm uses a divide-and-conquer strategy to multiply the two integers in O(n^k), where k = log_2 3 ≈ 1.585. The Toom-Cook multiplication method is a generalization of Karatsuba's algorithm using r-way multiplication and has a cost of O(n^(1 + O(1/√(log n)))). In contrast, Schönhage and Strassen utilize the fast Fourier transform to reduce the time complexity of multiplication. They propose two algorithms, of which the better one runs in O(n log n log log n) time and uses arithmetic modulo operations. Two further algorithms have been proposed to reduce the running time of the Schönhage-Strassen algorithm to O(n log n · 2^(O(log* n))), where log* n = minimum{i : log^(i) n ≤ 2} and log^(0) n = n.
The first algorithm is based on arithmetic over complex numbers [12], while the other is based on modular arithmetic [10].
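To make the divide-and-conquer idea behind Karatsuba's algorithm concrete, here is a minimal Python sketch (not the paper's implementation, which uses C++ and GMP); the decimal splitting point `half` and the base-case threshold are simplifications for illustration:

```python
def karatsuba(x, y):
    # Recursive multiplication of two nonnegative integers in
    # O(n^log2(3)) digit-multiplications: three recursive products
    # (ac, bd, and (a+b)(c+d)) replace the four of the Naive method.
    if x < 10 or y < 10:
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    p = 10 ** half
    a, b = divmod(x, p)   # x = a*p + b
    c, d = divmod(y, p)   # y = c*p + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    mid = karatsuba(a + b, c + d) - ac - bd   # equals a*d + b*c
    return ac * p * p + mid * p + bd
```

The key saving is that `mid` is obtained from one product of half-size numbers plus additions, rather than two separate products.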
On the other hand, many different attempts have been made to parallelize the multiplication problem using different parallel models [24][25][26][27][28][29]. Most of these attempts have been based on the shared memory model, where the processors communicate through shared memory. Also, some research studies have focused on implementing parallel algorithms on real machines, such as FPGAs, GPUs, and multicore systems [25][26][27]. In the case of the second research direction, a few algorithms have been proposed to reduce the time complexity of the multiplication of m integers in both sequential and parallel computation. In the case of sequential computation, the best-known algorithm is the Naïve method, which scans the sequence of integer numbers and multiplies one number in each iteration. In the case of parallel computation, a few strategies [30] have been proposed that use a shared memory model and are implemented on specific real machines such as multicore systems [31].
In this paper, we are interested in contributing to the second research direction, focusing on sequential computation, because, despite the progress that has been made toward developing an effective strategy, there is still room for improvement. Here, we present an improved, efficient sequential algorithm to multiply a large number of integers, each of large size. A comparison of the proposed algorithm and the best-known sequential algorithm indicates that, from a practical perspective, the proposed algorithm performs better. Our improved algorithm speeds up the best-known sequential algorithm when m and n are large.

II. THE OPTIMAL ALGORITHM
In this section, first we describe the optimal sequential algorithm that is used to multiply m numbers, X = (x_0, x_1, …, x_(m−1)). Then we give an example to illustrate the number of multiplication operations required to multiply m numbers.

A. The Optimal Multiplication Algorithm
The optimal multiplication algorithm, denoted as OM, multiplies m numbers by sequentially scanning the input array in m − 1 iterations. In each iteration i, the algorithm multiplies the number x_i with M, where i ≥ 1, M is initially equal to x_0, and finally M = ∏_{i=0}^{m−1} x_i. In the light of the above, the algorithm requires O(m · t_mul) time. The term m is derived from the m − 1 iterations, where O(m) = O(m − 1), while the term t_mul is the total number of operations required to multiply two numbers. The value of t_mul depends on the size of the two numbers. If we have two numbers of size n each, x_i = (x_i(n−1), x_i(n−2), …, x_i1, x_i0) and x_j = (x_j(n−1), x_j(n−2), …, x_j1, x_j0), then multiplying x_i and x_j requires O(n^2) multiplication operations using the Naïve method, while the best-known time complexity for t_mul is O(n log n · 2^(O(log* n))) [10,12]. To differentiate between the multiplication operation for the two integers x_i and x_j and the multiplication operation for two of their digits/bits, we name the second type of operation the digit-multiplication or bit-multiplication operation. So, the multiplication of x_i and x_j requires O(n^2) digit-multiplication or bit-multiplication operations. The main problem when multiplying m numbers, each of size n bits, is the size of the result of the multiplication, which increases with the value of m. For example, when m = 4, initially M = x_0, and in the first iteration the algorithm computes M = x_0 × x_1, of size 2n, using n^2 bit-multiplication operations in the worst case. The second iteration involves multiplying M = x_0 × x_1 of size 2n with x_2 of size n to produce M = x_0 × x_1 × x_2 of size 3n using 2n^2 bit-multiplication operations. In the last iteration, the algorithm computes M = x_0 × x_1 × x_2 × x_3 of size 4n using 3n^2 bit-multiplication operations.
Hence, the overall number of bit-multiplication operations to compute M = x_0 × x_1 × x_2 × x_3 is n^2 + 2n^2 + 3n^2 = 6n^2 in the worst case. In general, the multiplication of m integer numbers, of size n bits each, requires n^2 + 2n^2 + 3n^2 + ⋯ + (m − 1)n^2 bit-multiplication operations, which is equal to n^2 · m(m − 1)/2 in the worst case.
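The worst-case count above can be checked with a short script; `om_bit_mults` is a hypothetical helper of ours that sums the per-iteration costs under the schoolbook (Naïve) model:

```python
def om_bit_mults(m, n):
    # Iteration i multiplies M (of size i*n bits) by an n-bit number,
    # costing i * n^2 bit-multiplications under the schoolbook model.
    return sum(i * n * n for i in range(1, m))

# Matches the closed form n^2 * m * (m - 1) / 2:
assert om_bit_mults(4, 1) == 6                       # the m = 4 example above
assert om_bit_mults(20, 32) == 32**2 * 20 * 19 // 2
```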

B. Illustrative Example
An illustrative example is given to explain the total number of digit-multiplication operations required to multiply m numbers using the sequential OM algorithm. The notation t* represents the total number of digit-multiplication operations. For simplicity, let us assume that we have an array of m = 20 integer numbers and that the value of each element in the array is 5, i.e.:

X = (5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)

The execution of the optimal sequential algorithm on X is described below, where each subfigure shows three pieces of information. The first piece of information is the number of digit-multiplication operations performed in the current iteration i, denoted by t*(i). For example, in Figure 1(c), the algorithm multiplies 5 with the digits 5, 2, and 1, so the number of digit-multiplication operations is 3, denoted by t*(3). The second piece of information is the total number of digit-multiplication operations from the start to the current iteration, denoted by t*. For example, in Figure 1(c), the total number of digit-multiplication operations is equal to the number of digit-multiplication operations for the current iteration, i = 3, plus the total number of digit-multiplication operations for the previous iterations. Therefore, t* = t* + t*(3) = 3 + 3 = 6. The third and last piece of information in each subfigure consists of the iteration number, i, and the value of M, which is equal to ∏_{j=0}^{i} x_j. Figures 1-4 display the complete execution of the OM algorithm on X for iterations 1-5, 6-10, 11-14, and 15-19, respectively.

III. THE MODIFIED ALGORITHM
In this section, first the idea and the steps of the suggested strategy for speeding up the multiplication of m numbers, each of size n, where m is a large value, are introduced. Then two examples are given to illustrate the proposed approach by applying it to the array discussed in Section II.

A. The Algorithm
The proposed algorithm, denoted as BM0, is based on dividing the input array into blocks (windows) of size b each. For each block, the algorithm computes the product of the block's elements in an auxiliary variable temp using b − 1 multiplications and then updates the final result M by multiplying it with temp using one multiplication. As a result, the proposed algorithm executes m multiplication operations, while the OM algorithm executes m − 1 multiplication operations.
In order to reduce the number of multiplications in the proposed algorithm, BM0, from m to m − 1, we apply the following modifications. First, we compute the multiplication of the first block directly in the final result M. Hence, the total number of multiplications for the first block is b − 1, whereas the first block of the initially proposed algorithm required b multiplications: b − 1 to compute temp and one to update M. Second, the remaining elements of the array are divided into ⌊(m − b)/b⌋ blocks, and then the same steps are performed as in the initial version of the proposed algorithm. The modified algorithm is denoted as BM. Now, we compute the number of multiplications required to compute M for the BM algorithm. The first block requires b − 1 multiplication operations (lines 1-3). Each of the other blocks requires b − 1 multiplication operations (lines 7-12), and updating the value of M for each block requires only one multiplication. So, each iteration of the do-while loop requires b multiplication operations. Therefore, the total number of multiplications for the do-while loop is (⌊m/b⌋ − 1) × b, where the term ⌊m/b⌋ − 1 represents the number of windows in the do-while loop. Additionally, the total number of the remaining elements is m − ⌊m/b⌋ × b, and each requires one multiplication. Hence, the total number of multiplications for the BM algorithm is (b − 1) + (⌊m/b⌋ − 1) × b + (m − ⌊m/b⌋ × b) = m − 1. Hence, the running time and memory consumption of the BM algorithm are O(m) and O(1), respectively.

B. Illustrative Example
In this section, we illustrate how the proposed BM algorithm reduces the total number of digit-multiplication operations by applying it to the same array as in Section II.B, using two values of b: 5 and 10. We also show how different block-size values affect the total number of digit-multiplication operations.

First, let us assume that the size of the block is b = 5. Initially, the algorithm assigns 1 to M and executes lines 2-3. In the first iteration, the algorithm computes the multiplication of the first five numbers as M = 3125. The total number of digit-multiplication operations to compute M is nine, t* = 9, similar to Figure 1(a)-(d). This step is shown in Figure 5(a), where the information in the figure refers to the block number instead of the iteration number. The algorithm then computes the multiplication of the second block in temp = 3125 using nine digit-multiplication operations. Then, the algorithm updates the value of M by multiplying it with temp, using 16 digit-multiplication operations. Therefore, the total number of digit-multiplication operations until the end of this step is t* = 9 + 16 + 9 = 34, see Figure 5.

In the second example, illustrated in Figure 6, when b = 10, the BM algorithm computes the first block in M = 5^10 using 36 digit-multiplication operations, which is similar to Figures 1(a)-(e) and 2(a)-(d). The second block is calculated in two steps. The first step involves computing the auxiliary variable temp = 5^10 using 36 digit-multiplication operations. The second step involves multiplying M with temp using 49 digit-multiplication operations to produce M = 5^20. So, the total number of digit-multiplication operations is t* = 36 + 49 + 36 = 121. Hence, the BM algorithm requires fewer digit-multiplication operations when b = 10. In other words, the BM algorithm performs better when b = 10 than when b = 5 or b = 1. Also, the BM algorithm performs better when b = 5 than when b = 1.
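The counts in this example can be reproduced with a short sketch that charges p × q digit-multiplications for multiplying a p-digit number by a q-digit number, as in the figures. The helper names are ours, and the full-array totals for b = 1 and b = 5 are computed by the sketch rather than quoted from the text:

```python
def cost(a, b):
    # Cost model from the example: multiplying a p-digit number by a
    # q-digit number takes p * q digit-multiplication operations.
    return len(str(a)) * len(str(b))

def block_product(xs):
    # Multiply a block left to right, returning (product, digit-mult count).
    prod, ops = xs[0], 0
    for x in xs[1:]:
        ops += cost(prod, x)
        prod *= x
    return prod, ops

def bm_digit_mults(xs, b):
    M, ops = block_product(xs[:b])        # first block goes into M
    i = b
    while i + b <= len(xs):
        temp, t = block_product(xs[i:i + b])
        ops += t + cost(M, temp)          # block product, then update M
        M *= temp
        i += b
    for x in xs[i:]:                      # leftover elements, one at a time
        ops += cost(M, x)
        M *= x
    return M, ops

X = [5] * 20
# b = 1 makes every block a single element, i.e. the OM behavior.
```

On this array the sketch gives 142 digit-multiplications for b = 1, 124 for b = 5, and 121 for b = 10; the b = 10 total of 121 and the intermediate b = 5 count of 34 after two blocks match the figures described above.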

IV. EXPERIMENTAL STUDY
In this section, we study the BM algorithm experimentally to answer the following research questions: 1. Does the value of b have an effect on the running time of the BM algorithm, and if so, is this effect significant?
2. Do certain parameters (the size of the sequence of integers, the size of the integers, the size of the window, and the distribution of the data) have an effect on the performance of the BM algorithm?
3. Does the BM algorithm have a faster running time as compared to the OM algorithm?
The above questions are addressed in the following subsections. Subsection A contains a brief description of the platform (hardware and software) and the methodology used to test and measure the running time of both the BM and OM algorithms. Subsection B contains the answers to the first and second questions. Subsection C answers the third question.

A. Methodology
The experimental studies were conducted on a machine with a 2.4 GHz processor, 32 GB of memory, and a Windows operating system. Both algorithms, OM and BM, are implemented in C++ using the GMP (GNU Multiple Precision Arithmetic) library. The GMP library is used to achieve two objectives. First, when we multiply many integer numbers of size 16 bits or more, the result is a number greater than 64 bits in size, and such numbers cannot be represented by the built-in integer types in C++. Second, some of the applications that use the OM algorithm, such as cryptography, require data of sizes greater than 64 bits. The experimental studies are based on the following four parameters: • The first parameter is the size of the integer number, n, measured in bits. In the experiments, we used n = 32, 64, 128, 256, and 512.
• The second parameter is the size of the sequence of integers, m. In the experiments, we used m = 1/4k, 1/2k, 1k, 2k, 4k, 8k, 16k, and 32k.
• The third parameter is the size of the block, b, used in the computation. In the experiments, we used different block sizes: b = 25, 50, …, 500, except when m ≤ 512, where the last value of b is 250.
• The fourth parameter is the data range, R, used to generate the elements of the array. In the experiments, we use two types of data range. The first data range is R1 = [2^(n−1), 2^n − 1], which means that all data in the array have a size of exactly n bits, i.e., 2^(n−1) ≤ x_i < 2^n. The second data range is R2 = [0, 2^n − 1], which means that all data in the array have a size of at most n bits, i.e., 0 ≤ x_i < 2^n.
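The two ranges can be sampled as in the following Python sketch; the paper's actual generator is not shown, so the function name and interface here are illustrative:

```python
import random

def gen_instance(m, n, use_r1=True):
    # R1 = [2^(n-1), 2^n - 1]: every element has exactly n bits.
    # R2 = [0,       2^n - 1]: every element has at most n bits.
    lo = 1 << (n - 1) if use_r1 else 0
    hi = (1 << n) - 1
    return [random.randint(lo, hi) for _ in range(m)]
```

Sampling from R1 guarantees every operand is a full n-bit number, while R2 allows shorter operands, which is what makes the R2 instances slightly cheaper to multiply.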
The running time for each of the algorithms (BM and OM) is measured by taking the average time for 50 instances for fixed parameter values. The running time of the algorithms is measured in seconds.

B. Behavior of the BM Algorithm
In this subsection, we study the effect of changing the value of b on the running time of the BM algorithm using different values of m, n, and R. To achieve this goal, we first generate a data set by determining the values of m, n, and R. Then we generate the data for one instance, D = (m, n, R). For example, we set m = 4k, n = 512, and R = R1. After that, we execute the BM algorithm with the different values of b described in Section IV.A, and the running time of the BM algorithm is measured for each value of b. Next, we repeat the previous steps using different values of D = (m, n, R). Following that, the same procedure is applied to the BM algorithm using R2 and different values of m and n. For one such instance, the measured running times for ten successive values of b were 0.021, 0.018, 0.017, 0.017, 0.017, 0.018, 0.020, 0.020, 0.022, and 0.024 seconds, respectively; the running time first decreases and then increases again as b grows. Usually, this case occurs when n and m are small or when b is close in value to m. This means that not all values of b lead to an improvement in the running time of the BM algorithm, especially when m and n are small.
Third, it is clear that between two successive values of b, the improvement is small. However, when the difference between the two values of b is large, within a certain limit, the improvement is significant. For example, when m = 16k, n = 128, and b = 75 and 100, the running times of the BM algorithm are 3.99 and 3.68 seconds, respectively, while the running times for b = 75 and 250 are 3.99 and 2.35 seconds, respectively.
Fourth, the running time of the BM algorithm on R2 is slightly faster than on R1. The reason for this is that all the numbers generated from R1 have exactly n bits, while the numbers generated from R2 have at most n bits.

C. Comparison of OM and BM Algorithms
In this subsection, the running times of the OM algorithm and the BM algorithm are compared based on two values. The first value is the minimum running time obtained from the experimental results for the BM algorithm using different values of b; this value is denoted as BM_min. The second value is the average running time calculated from all running times of the BM algorithm using different values of b, denoted as BM_Avg. Therefore, two running-time values for the BM algorithm are considered: BM_min and BM_Avg. Table I lists the running times of both algorithms, OM and BM, when the data range for the elements of the array is taken from R1. Several observations arise from the analysis of the results in Table I; the corresponding percentages of improvement are given in Table II. Referring to Table II, it is clear that the percentage of improvement achieved by the BM algorithm in the case of BM_min is better than in the case of BM_Avg, for fixed n and m, because the running time of the BM algorithm in the case of BM_min is lower than in the case of BM_Avg.
Note that, as m and n increase, the running times of the BM algorithm in the case of R1 approach those in the case of R2, so we omit the comparison between the BM and OM algorithms in the case of R2.
V. CONCLUSION
In this paper, we studied the multiplication operation because it has a higher time cost than other basic arithmetic operations and therefore has a significant influence on the performance of many applications, such as cryptography and digital signal processing. To reduce the cost of this operation, we developed an efficient algorithm for sequential computation tasks that require the multiplication of a sequence of big integers. The algorithm is based on using the window strategy to reduce the cost of multiplying the sequence of big integers. The algorithm has the same time complexity as the best-known sequential algorithm but performs fewer digit-multiplication operations. In our experimental studies of the performance of the proposed BM algorithm, we considered four parameters with different values: sequence size (m = 1/4k, 1/2k, 1k, 2k, 4k, 8k, 16k, and 32k), integer size (n = 32, 64, 128, 256, and 512), block size (b = 25, 50, …, 500), and data range (fixed and varied sizes). The results showed the effectiveness of the proposed algorithm as compared to the best-known sequential algorithm: the percentage of improvement achieved by the proposed algorithm was 90% when m and n were large.
There are several future directions related to this study. First, the study of the behavior of the BM algorithm when n ≥ 1k. Second, the study of the effect of larger values of b on the BM algorithm. Third, the question of whether a relation exists between the values of b and m. Fourth, how to use the GPU to speed up the running time of the BM algorithm. Fifth, how to speed up modular multi-exponentiation using the BM algorithm and modular exponentiation.