On the Current State of Scholarly Retrieval Systems

S. Khalid, S. Khusro, I. Ullah, G. Dawson-Amoah


The enormous growth of the scholarly literature makes its retrieval challenging. To address this challenge, researchers and practitioners have developed several solutions. These include indexing solutions (e.g., ResearchGate, Directory of Open Access Journals (DOAJ), Digital Bibliography & Library Project (DBLP)), research paper repositories (e.g., arXiv.org, Zenodo), scholarly retrieval systems (e.g., Google Scholar, Microsoft Academic Search, Semantic Scholar), digital libraries, and publisher websites. Among these, scholarly retrieval systems, the main focus of this article, employ efficient information retrieval techniques and other search tactics. However, they still fall short of fully meeting users' information needs. This brief review paper attempts to identify the main reasons behind this shortfall by reporting the current state of scholarly retrieval systems. The findings suggest that existing scholarly retrieval systems should differentiate scholarly users from ordinary users and identify their needs. Citation network analysis should be made an essential part of the retrieval system to improve search precision and accuracy. The paper also identifies several research challenges and opportunities that may lead to better scholarly retrieval systems.


information retrieval; scholarly search; scholarly users; citation networks


eISSN: 1792-8036     pISSN: 2241-4487