Advantages of Giraph over Hadoop in Graph Processing

C. L. Vidal-Silva, E. Madariaga, T. Pham, J. M. Rubio, L. A. Urzua, L. Carter, F. Johnson

Abstract


This article presents a comparison of the computing performance of the MapReduce tool Hadoop and Giraph on large-scale graphs. The main ideas of MapReduce and bulk synchronous parallel (BSP) are reviewed as big data computing approaches to highlight their applicability in large-scale graph processing. This paper reviews the execution performance of Hadoop and Giraph on the PageRank algorithm to classify web pages according to their relevance, and on a few other algorithms to find the minimum spanning tree in a graph with the primary goal of finding the most efficient computing approach to work on large-scale graphs. Experimental results show that the use of Giraph for processing large-size graphs reduces the execution time by 25% in comparison with the results obtained using the Hadoop for the same experiments. Giraph represents the optimal option thanks to its in-memory computing approach that avoids secondary memory direct interaction.


Keywords


Giraph; Hadoop; Graph; Big Data; Big Graph

Full Text:

PDF

References


I. Yaqoob, I. A. T. Hashem, A. Gani, S. Mokhtar, E. Ahmed, N. B. Anuar, A. V. Vasilakos, “Big data”, International Journal of Information Management, Vol. 36, No. 6, pp. 1231–1247, 2016

A. K. Wahi, V. Ahuja, “The internet of things-new value streams for customers”, International Journal of Information Technology and Management, Vol. 16, No. 4, pp. 360–375, 2017

P. Zikopoulos, C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGrawHill Osborne Media, 2011

T. White, Hadoop: The Definitive Guide, O’Reilly Media, Inc., 2015

J. Dean, S. Ghemawat, “Mapreduce: Simplified data processing on large clusters”, 6th Conference on Symposium on Operating Systems Design & Implementation, San Francisco, December 6-8, 2004

J. Dean, S. Ghemawat, “Mapreduce: Simplified data processing on large clusters”, Communications of the ACM, Vol. 51, No. 1, pp. 107–113, 2008

M. Aurelio, B. Fagnani, G. Lotz, Dynamic Graph Computations using Parallel Distributed Computing Solutions, Science without Borders, 3-Months Project Report, Queen Mary, University of London, 2013

R. Shaposhnik, C. Martella, D. Logothetis, Practical Graph Analytics with Apache Giraph, Apress, 2015

S. Sakr, F. M. Orakzai, I. Abdelaziz, Z. Khayyat, Large-Scale Graph Processing Using Apache Giraph, Springer, 2017

G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, “Pregel: A system for large-scale graph processing”, 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, USA, June 6-11, 2010

S. Valenzuela, C. Vidal, “Evaluacion de hadoop y giraph para el procesamiento de grafos”, Jornadas Chilenas de la Computacin, XXVII Encuentro Chileno de Computacin. Santiago, Chile, 2015 (in Spanish)

S. Karanth, Mastering Hadoop. Packt Publishing, 2015

S. Ghemawat, H. Gobioff, S. T. Leung, “The Google file system”, Nineteenth ACM Symposium on Operating Systems Principles, Bolton Landing, USA, October 19-22, 2003

S. Brin, L. Page, “The anatomy of a large-scale hypertextual web search engine”, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp. 107–117, 1998.

L. G. Valiant, “A bridging model for parallel computation”, Communications of the ACM, Vol. 33, No. 8, pp. 103–111, 1990

M. Han, K. Daudjee, K. Ammar, M. T. Ozsu, X. Wang, T. Jin, “An experimental comparison of pregel-like graph processing systems”, Proceedings of the VLDB Endowment, Vol. 7, No. 12, pp. 1047–1058, 2014

S. Salihoglu, J. Shin, V. Khanna, B. Q. Truong, J. Widom, “Graft: A debugging tool for apache giraph”, 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, May 31-June 4, 2015

M. Han, K. Daudjee, “Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems”, Proceedings of the VLDB Endowment, Vol. 8, No. 9, pp. 950–961, 2015

J. Lin, C. Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool, 2010

M. Held, R. M. Karp, “The traveling-salesman problem and minimum spanning trees: Part ii”, Mathematical Programming, Vol. 1, No. 1, pp. 6–25, 1971

R. C. Prim, “Shortest connection networks and some generalizations”, Bell System Technical Journal, Vol. 36, No. 6, pp. 1389–1401, 1957

M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica, “Apache spark: A unified engine for big data processing”, Communications of the ACM, Vol. 59, No. 11, pp. 56–65, 2016

E. Friedman, K. Tzoumas, Introduction to Apache Flink: Stream Processing for Real Time and Beyond, O’Reilly Media, 2016

S. Papp, The Definitive Guide to Apache Flink: Next Generation Data Processing, Apress, 2016




eISSN: 1792-8036     pISSN: 2241-4487