An Algorithm to Optimize Frequent Pattern Mining in Parallel and Distributed Environment

Anshu Singla; Parul Gandhi

doi:10.48084/etasr.9830

Authors

Anshu Singla Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad, Haryana, India | KCCITM, Greater Noida, India
Parul Gandhi Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad, Haryana, India

Volume: 15 | Issue: 3 | Pages: 22252-22256 | June 2025 | https://doi.org/10.48084/etasr.9830

Received: 4 December 2024 | Revised: 25 December 2024, 14 January 2025, 1 February 2025, 3 February 2025 | Accepted: 5 February 2025 | Online: 24 March 2025

Corresponding author: Anshu Singla

Abstract

Frequent Pattern Mining (FPM) is an important data mining task that identifies recurrent patterns or correlations in datasets. The main purpose of FPM algorithms is to find sets of items that frequently appear together in transactional or relational databases. This study presents the Parallel and Distributed Recursive Elimination (PDReLim) algorithm, a novel FPM technique designed for parallel computing to improve efficiency over existing parallel FPM algorithms. PDReLim recursively eliminates infrequent items on each node while exploiting the capabilities of parallel and distributed systems or clusters. Its performance was evaluated on three well-known datasets from the UCI repository, namely Chess, Mushroom, and Connect, with a focus on the lowest support thresholds, which cause computational bottlenecks for many FPM algorithms. PDReLim is implemented in PySpark because Spark outperforms standard MapReduce for iterative algorithms. The implementation uses Spark features such as the Resilient Distributed Dataset (RDD) data structure, in-memory processing, and shared variables to handle large databases efficiently. The results show that PDReLim was significantly faster than PApriori, PFP-Growth, and PFP-Max.
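The general idea of counting item supports per partition, dropping globally infrequent items, and then mining the reduced transactions recursively can be illustrated with a minimal single-process sketch. This is not the authors' PDReLim implementation: the function names (`count_items`, `prune_infrequent`, `mine`) and the toy data are hypothetical, plain Python lists stand in for Spark partitions, and the recursion is a simple projection-based enumeration chosen for brevity. In a PySpark setting, the set of frequent items would plausibly be shared with the workers as a broadcast variable.

```python
from collections import Counter
from functools import reduce


def count_items(partition):
    """Map step: count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def prune_infrequent(partitions, min_support):
    """Merge per-partition counts and drop globally infrequent items everywhere."""
    # Reduce step: sum the local counters into global item supports.
    global_counts = reduce(lambda a, b: a + b, map(count_items, partitions), Counter())
    frequent = {item for item, n in global_counts.items() if n >= min_support}
    # Each surviving transaction becomes a sorted list of distinct frequent items.
    pruned = [[sorted(set(t) & frequent) for t in part] for part in partitions]
    return pruned, frequent


def mine(transactions, min_support, prefix=()):
    """Recursively enumerate frequent itemsets by projecting on each frequent item."""
    counts = Counter(item for t in transactions for item in t)
    results = {}
    for item in sorted(counts):
        if counts[item] < min_support:
            continue  # infrequent in this projection, so it is eliminated
        itemset = prefix + (item,)
        results[itemset] = counts[item]
        # Conditional database: transactions containing `item`, restricted to
        # items that sort after it, so each itemset is generated only once.
        projected = [[x for x in t if x > item] for t in transactions if item in t]
        results.update(mine(projected, min_support, itemset))
    return results


# Hypothetical toy data: two "partitions" of transactions.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c", "d"]],
]
pruned, frequent = prune_infrequent(partitions, min_support=2)
all_transactions = [t for part in pruned for t in part]
itemsets = mine(all_transactions, min_support=2)
for itemset, support in sorted(itemsets.items()):
    print(itemset, support)
```

With a minimum support of 2, item `d` is removed during pruning, and the itemset `("a", "b", "c")` is not reported because it occurs in only one transaction.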

Keywords:

PySpark, Frequent Pattern Mining (FPM), parallel FPM, Spark, association rule mining, Apriori, Eclat
