On The Current State of Scholarly Retrieval Systems

The enormous growth of scholarly literature makes its retrieval challenging. To address this challenge, researchers and practitioners have developed several solutions. These include indexing solutions (e.g., ResearchGate, Directory of Open Access Journals (DOAJ), Digital Bibliography & Library Project (DBLP)), research paper repositories (e.g., arXiv.org, Zenodo), scholarly retrieval systems (e.g., Google Scholar, Microsoft Academic Search, Semantic Scholar), digital libraries, and publisher websites. Among these, scholarly retrieval systems, the main focus of this article, employ efficient information retrieval techniques and other search tactics. However, they still fall short of fully meeting users' information needs. This brief review paper attempts to identify the main reasons behind this shortfall by reporting the current state of scholarly retrieval systems. The findings suggest that existing scholarly retrieval systems should differentiate scholarly users from ordinary users and identify their needs, and that citation network analysis should be made an essential part of the retrieval system to improve search precision and accuracy. The paper also identifies several research challenges and opportunities that may lead to better scholarly retrieval systems.

Keywords: information retrieval; scholarly search; scholarly users; citation networks


I. INTRODUCTION
A scholarly retrieval system is a sophisticated software system that performs crawling, indexing, searching, and ranking to make scholarly data (research publications and related information, including authors, publishers, citations, etc.) available to searchers. Several scholarly retrieval systems, including Google Scholar, Microsoft Academic Search, CiteSeerX, and the Chinese Baidu Academic [1], are frequently used by modern-day online searchers. The retrieved scholarly documents include journal articles, conference proceedings, books, dissertations, technical reports, and patents. While some of these documents are freely accessible to all members of the public, access to others is limited to subscribers. The academic web is growing, but there seems to be no definite agreement on its size. One estimate puts the number of scholarly documents at 120 million, of which 25% are freely accessible [2]. Google Scholar has indexed nearly 160 million scholarly documents [3], and Microsoft Academic Search nearly 209.79 million [4]. The number of scholarly documents increases at an annual rate of over 1 million [5]. Processing such a huge collection of research publications and finding relevant papers in it is therefore challenging. Researchers are working on ways to support scholarly search and make it more accessible. Their efforts have resulted in several indexing solutions, publication repositories, digital libraries, research paper recommender systems, and scholarly retrieval systems. This paper aims to report on the current state of scholarly retrieval systems by identifying the commonalities and differences between web and scholarly users, surveying the search techniques of the available scholarly retrieval systems, and understanding the potential role of citation network analysis in retrieving relevant research publications.

II. THE CURRENT STATE OF SCHOLARLY RETRIEVAL SYSTEMS
Scholarly retrieval solutions take the user's search query as input and assess its relevance to publications using different ranking features [6][7][8][9][10]. As a complementary tool to academic search, a research paper recommender system employs different filtering algorithms to find and recommend relevant papers based on users' implicit and explicit feedback as well as the content of the documents. In some cases, search and recommendation are combined in a hybrid manner, where keywords are first used to find an initial list of search results and recommendations are then applied to refine the search [11]. The two architectures are closely related, and most of the techniques used for scholarly retrieval systems also apply to scholarly recommender systems. Recommender systems are covered in several recent papers [11][12][13].

A. The Structure of Scholarly Documents
Unlike on the general web, the unit of information retrieved by a scholarly search system is a research article, which is retrieved based either on its full content or on some specific parts of it. A scholarly publication can be a journal article, conference paper, technical report, pre-print, thesis/dissertation, or book. This paper considers research articles only, excluding theses/dissertations, technical reports, and books. A research paper has a well-defined structure and well-organized content to which writers are customarily constrained. Usually, the author follows the Author Guidelines or Instructions for Authors, which specify the length, format, in-text citations, references, artwork, tables, etc., before submission or after the manuscript is accepted for publication. The manuscript text is mainly unstructured [5], but it is sometimes considered semi-structured or even structured. Generally, research articles consist of a header, main content, a bibliography, algorithms, tables, figures, mathematical equations, and so on [14]. The header contains the title, the authors with their emails and affiliations, the abstract, the publication year, the venue (journal, conference, etc.), the volume and issue number, the number of pages, etc.
Figures and tables convey results and other structured information in a compact and practical way. Algorithms present, step by step, how a computational problem is solved. Mathematical computations are usually written in the form of equations. The bibliography (also known as references, out-cites, or notes) is the collection of cited publications listed at the end of the research article. It plays a vital role in assessing the quality of the manuscript, helps the reader learn more by following these links, and facilitates the creation of citation networks. Extracting and using all the essential components can enhance the ranking of scholarly retrieval systems [15][16][17]. Using tools such as OCR++ [18][19][20], Apache Tika [21], GROBID for header extraction [22], PDFFigures for table and figure extraction [23] and algorithm extraction [24], and ParsCit for citation extraction [25], documents can be parsed into sections such as title, abstract, body text, authors, venue, and references to optimize retrieval. The metadata, including title, author (name, email, affiliation), heading and section mapping, footnotes, figure and table headings, URLs, citations, and references, can be extracted and stored in a usable format such as XML or JSON [18]. The extraction and storage of figures can also play a vital role in the retrieval of relevant papers [26]. For an efficient scholarly retrieval system, it is therefore essential to consider the structure and associated metadata of scholarly documents in search, ranking, and recommendation [27].

B. Users of Scholarly Retrieval Systems
Scholarly users differ from typical web searchers [28]. They have different search patterns: in general web search, activity peaks at the weekend and drops during weekdays, whereas in scholarly search, activity peaks during weekdays and (mostly) drops at weekends [29]. Academic searchers use scholarly retrieval platforms, whereas web searchers mostly use general web search engines. Table I summarizes the types of scholarly users and their requirements, based on [30].

1) Readers/Authors
Readers and authors usually search for and read scholarly documents to find novelties in the literature, either for learning or for developing new approaches to solve a problem. Because of the massive size of the academic web, it is unrealistic for a researcher to read every article related to the research subject [31]. To avoid overwhelming researchers with information, it is essential to provide a result set that includes the documents most relevant to a given query.

2) Reviewers
Reviewers assess the quality of a submitted paper to ensure that it meets the established standards. A critical aspect of a reviewer's job is to evaluate the citations used. For any given topic, there usually exists a set of core scholarly documents that need to be referenced in any new work because they establish the theoretical foundations of the topic. It is essential to ensure that all relevant information is made available to a reviewer during the evaluation process, which may help in reviewing the submitted manuscript more efficiently.

3) Editors
The core mandate of editors is to evaluate the scope of a submitted scholarly document to ensure that it fits the platform (journal, conference proceedings, etc.) it is intended for. They also evaluate the number and quality of self-references in situations where they are used.

4) Evaluators
Evaluators belong to a category of users who usually carry out research aimed at determining the contributions of the author to the body of knowledge of a specific field of study.

5) Event Organizers
Event organizers are generally interested in getting to know the potential participants by using the citation network. The participants could include researchers working on insightful solutions in the given domain, students, authors of previously published relevant scholarly documents, and any others who might have an interest in the event.

C. Approaches to Scholarly Search and Retrieval
The following sections discuss the approaches that scholarly search systems use to mitigate the problem of information overload for academic searchers.

1) Citation Graph-Based Approaches
The network of references forms the citation graph, in which the citing paper cites the cited ones. Several approaches exploit the citation graph [32][33][34][35][36][37]. For example, Sofia Search [37] builds a citation graph by starting from an initial set of papers and following the links to citing and cited papers until the desired number of candidate papers is found. It mimics how a human identifies candidate publications from the citation graph. From the citation graph of the seed papers, the approach generates a list of relevant papers. However, given the growth rate of research papers, the usefulness of Sofia Search is limited. First, it needs seed papers and a lower bound. Second, not all in-links and out-links are equally relevant [38][39][40]. Another representative citation graph technique is bibliographic coupling, which considers only the out-links of a paper [41]. The similarity between $P_1$ and $P_2$ is computed from the overlap between $O_{P_1}$ and $O_{P_2}$, the sets of out-links of $P_1$ and $P_2$, respectively. The similarity between the two papers is equal to 0 when both sets $O_{P_1}$ and $O_{P_2}$ are empty. Bibliographic coupling has worked well for the classification of scholarly documents [43], plagiarism detection [44,45], and finding similar legal judgments [46]. However, it is limited in retrieving relevant papers for two reasons: a) it misses important papers that are not present in the out-cites, and b) it cannot consider the in-cites of the papers. The citation context is used in [47] for retrieving relevant literature. However, extracting the citation context is challenging because the full text is often unavailable, and the context is almost never able to reveal the main subject of the paper reliably [48,49]. Several popular academic search engines, including Google Scholar, PubMed, and CiteSeer, use the links between academic articles provided by citation networks for document ranking.
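As a minimal sketch of bibliographic coupling, the Python function below scores two papers by the overlap of their out-links. The Jaccard-style normalization (shared references divided by all distinct references) is an assumption made here for illustration; it is consistent with the score being 0 when both sets are empty, but other normalizations are possible. The function name and the paper identifiers are hypothetical.

```python
def bibliographic_coupling(out_links_p1, out_links_p2):
    """Jaccard-style bibliographic coupling between two papers.

    Each argument is an iterable with the identifiers of the papers
    that a paper cites (its out-links). The normalization is an
    illustrative assumption, not a prescribed formula.
    """
    o1, o2 = set(out_links_p1), set(out_links_p2)
    union = o1 | o2
    if not union:
        # Both reference lists are empty: the similarity is defined as 0.
        return 0.0
    return len(o1 & o2) / len(union)


# Hypothetical reference lists of two papers
p1_refs = ["A", "B", "C"]
p2_refs = ["B", "C", "D"]
score = bibliographic_coupling(p1_refs, p2_refs)  # 2 shared out of 4 distinct
```

Because only out-links enter the computation, two papers that are frequently cited together but share no references still receive a score of 0, which illustrates limitation b) above.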

2) Content-Based Approaches
Content-based methods process the textual content of the papers, which can be the title, abstract, introduction, keywords, and body of the articles. These methods weigh the article's influence by the frequency and position of the terms in the article [50]. Many techniques rely on term weights to estimate the relevance of articles. The most widely used approach is the vector space model (VSM), which represents each article as a vector of term weights and measures relevance using a similarity measure, such as cosine similarity, between the query and document vectors. Many retrieval systems and applications use the VSM (e.g., [51]), even though cosine similarity does not perform well in many situations [52,53]. Latent semantic analysis (LSA) improves the vector representation of a scholarly article by means of singular value decomposition (SVD) [54,55]. However, LSA performs comparatively poorly for the efficient retrieval of scholarly articles [53]. Many scholarly retrieval systems prefer BM25, which is among the best ranking techniques for scholarly retrieval [56]. To better meet the requirements of different scholarly users, researchers have also adopted hybrid approaches that combine content- and citation-based approaches, discussed below.
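The vector space model can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the method of any particular system: documents are pre-tokenized lists of words, term weights are TF-IDF with a smoothed IDF (one of several common variants), and all function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vectors, idf): one sparse TF-IDF vector per tokenized doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an illustrative choice) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (order, scores): doc indices sorted by similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: c * idf[t] for t, c in Counter(query_tokens).items() if t in idf}
    scores = [cosine(q, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical three-document collection
docs = [
    "citation network analysis for scholarly search".split(),
    "deep learning for image recognition".split(),
    "ranking scholarly documents with citation counts".split(),
]
order, scores = rank("citation scholarly search".split(), docs)
```

The first document matches all three query terms and is ranked first, the third matches two and comes second, and the second shares no term with the query and scores 0. A BM25 ranker would replace the weighting and similarity steps while keeping the same overall pipeline.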

3) Hybrid Approaches
The hybrid approaches [43,[57][58][59] combine the best of the citation graph-based and content-based techniques to compute the relevance of documents to the search query. The proximity of citations helps in locating related articles [60,61]. Two articles may be similar to each other if many articles cite them in close proximity. However, not all papers are publicly available in full text, so these locations cannot always be determined, and proximity cannot guarantee the exact subject [48,49]. The context passage around the citation indicates the main content of the cited paper [62], although the cited paper may focus on a subject different from that of the context [62,63]. Context passages are used for several other purposes in the literature, such as inter-article similarity estimation [64], disambiguation of named entities [65], topic-based retrieval [9,47], identification of biomedical articles [50], and newspaper citations in scholarly search [66]. However, extracting the context passage is challenging because the full text is often unavailable, which makes it difficult to determine the subject of the paper efficiently [48,49]. Many popular scholarly search engines, such as Semantic Scholar, appear to use hybrid approaches for ranking documents. Much research has been done on the effectiveness of academic search engines. Some authors use graph-based approaches to improve the effectiveness of academic search [67][68][69][70][71], while bearing in mind that a citation graph is usually sparse and noisy [68]. The solution in [72] supports scholarly search using key-queries [73] and query covers [74] to enhance the effectiveness of academic search. However, this approach takes a research article as input for its key-phrase selection and weighting methods, which results in suboptimal ranking. Due to the massive expansion of research paper repositories, scholarly search is a very active and challenging domain for both researchers and developers. Although several approaches have been proposed in the literature to address the requirements of scholars, we are still far from an ideal academic search engine that meets the heterogeneous needs of different categories of scholarly users with minimum effort. Further research is required to address the requirements of academic searchers.

D. Ranking Algorithms for Scholarly Search
There is no universal ranking algorithm that scholarly retrieval systems use to rank documents in response to user queries. In most cases, scholarly retrieval models are quadruples {D, Q, F, R(q, d)} [75]. D is the set of representations of the documents in the collection being searched. Q is the set of logical views of the user's information needs. F is a framework for modeling document representations, queries, and their relationships. R(q, d) is a ranking function that orders the documents with respect to the query terms. Due to the advancements in IR, numerous technologies and techniques are used to enhance scholarly retrieval systems. For instance, Semantic Scholar uses semantic technologies to locate relevant documents. The AceMap [76] academic search system analyzes big scholarly document datasets using the "map" approach. Google Scholar and DBLP use text-based methods for navigation. These scholarly retrieval systems use different ranking algorithms that place matching results in order of relevance. Some systems let the user choose the ranking factor (publication date, number of citations, author or journal name and reputation, or relevance of the document based on some predefined criteria). The factor selected by the user is given more weight in determining the relevance of documents. Other systems, such as Google Scholar, do not allow users to adjust the weighting of ranking factors.
In most scholarly retrieval systems, the relevance of a document is measured by considering different document elements, for instance, how often the search term occurs in the document and in which field (i.e., title, abstract, body, etc.). Commonly, the more often the search term occurs in a document, or the more important the field in which it occurs, the more relevant the document is considered. For example, an occurrence of the term in the title is weighted more heavily than an occurrence in the abstract, and so on. Each term occurrence contributes to the total ranking weight of the document according to its position. Some of the document fields that may be weighted differently by scholarly search systems are shown in Table II. Because the necessary data are unavailable, the ranking algorithms and attributes of all available scholarly retrieval systems were not considered. We briefly consider the ranking mechanism of Google Scholar, the most widely used scholarly search engine. It takes into account multiple factors, such as relevance, citation count, author name, and name of the publisher [77], when generating results for a user query. In assessing the relevance of a given document to a query, Google Scholar gives higher weight to the title. The citation count also plays a vital role in the ranking, and documents with relatively many citations are therefore likely to be placed near the top of the result list. Author and journal or conference names can also affect the ranking of documents, i.e., when a query contains an author or journal/conference name, matching documents are likely to be positioned at the top of the search results list. For example, most of the top results of a search for "information retrieval" are likely to be articles about various IR topics from the Information Retrieval Journal. Google Scholar also considers the publication date and some other attributes in ranking [77]. As research paper repositories are growing rapidly, it is essential that ranking algorithms consider all the relevant components of the scholarly domain, including citation networks, which are discussed in detail below.
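The idea of weighting term occurrences by field and combining them with a citation signal can be sketched as follows. This is only an illustration: the field weights, the citation weight, the logarithmic damping of the citation count, and the sample documents are all hypothetical assumptions, and the sketch does not reproduce the proprietary ranking of Google Scholar or any other system.

```python
import math
import re

# Hypothetical field weights: a match in the title counts more than one
# in the abstract, which counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}
CITATION_WEIGHT = 0.5  # hypothetical weight of the citation signal


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_score(query_terms, doc):
    """Sum the weighted occurrences of the query terms over all fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(doc.get(field, ""))
        score += weight * sum(tokens.count(t) for t in query_terms)
    return score


def rank_score(query, doc):
    """Combine field-weighted term matches with a damped citation count."""
    text_score = field_score(tokenize(query), doc)
    if text_score == 0.0:
        return 0.0  # documents without any matching term get no boost
    return text_score + CITATION_WEIGHT * math.log1p(doc.get("citations", 0))


# Hypothetical documents with equal citation counts
doc_a = {"title": "Information retrieval evaluation",
         "abstract": "We study ranking.", "citations": 10}
doc_b = {"title": "Ranking methods",
         "abstract": "A survey of information retrieval.", "citations": 10}
doc_c = {"title": "Protein folding", "abstract": "", "citations": 500}

query = "information retrieval"
score_a = rank_score(query, doc_a)
score_b = rank_score(query, doc_b)
score_c = rank_score(query, doc_c)
```

With equal citation counts, `doc_a` outranks `doc_b` because its matches occur in the title rather than the abstract, while the highly cited but unrelated `doc_c` scores 0.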

III. SCHOLARLY CITATION NETWORKS
In the scholarly domain, the citation network is a significant and critical part of scholarly retrieval models. A citation is a link from one scholarly document to another. When a document uses a text excerpt, an idea, a concept, a figure, an algorithm, etc. from another scholarly document, it usually refers to that document [78][79][80]. Citations are necessary because they help create links between publications and authors, give credit to authors, promote reusability and productivity, and provide a roadmap to discovery. Professional associations encourage scholars to replicate findings and results, improve research standards, and give due credit to other scholars by citing their work when it is relevant and related. In most cases, the impact of an author, institution, journal, or even a country in a particular field of study is measured in terms of citation count. For instance, a document with many in-cites is considered more influential. Academic policymakers apply Google's PageRank and its variants to citation networks to quantify the impact of scholarly texts [81,82]. Likewise, to credit authors, academic networks use the status of the citing authors to distinguish high-status authors in co-authorship networks [83,84]. Citation networks also provide a way to identify prestigious journals [85]. A journal is said to be prestigious if it has been cited by other prestigious journals and has numerous highly cited works. Institutions and countries are assessed using the same criteria [86]. Citations are used in different ways in the retrieval models of many well-known scholarly search engines, such as Google Scholar, for example through citation count, bibliographic coupling [41], co-citation, and citation context [47] in ranking results.
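The recursive notion of prestige described above (a work is important if important works cite it) is what PageRank computes. The following sketch is a plain power-iteration implementation over a toy citation graph; the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from papers without references, and the paper identifiers are illustrative assumptions rather than the configuration of any cited system.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a citation graph.

    `citations` maps each paper to the list of papers it cites.
    Returns a dict mapping each paper to its rank; ranks sum to 1.
    """
    papers = set(citations) | {p for cited in citations.values() for p in cited}
    if not papers:
        return {}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Papers without references spread their rank evenly over all papers.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in papers}
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        rank = new_rank
    return rank


# Hypothetical graph: A and B cite C, and C cites D.
graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(graph)
```

In this toy graph, D receives a single citation but ranks above C, which receives two, because D's only citation comes from the well-cited paper C. A plain citation count would order them the other way round.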
Citation networks form a complex graph. Consider a paper network in which the nodes are the scholarly documents and the edges are the citations between the papers, i.e., G(P, C), where P is the set of nodes (papers) and C is the set of edges (citations, i.e., in-cites and out-cites). It is a large, complex graph that can have several sub-graphs, including but not limited to paper graphs, author graphs, collaboration graphs, and semantic graphs. In a collaboration graph, an edge (X, Y) exists if person X worked with person Y. In a semantic graph, an edge (X, Y) exists if word X is associated with word Y. The insightful use of all the sub-graphs in the scholarly network and their associated metadata can play a significant role in the efficiency of scholarly retrieval systems. For example, the authors of [87] extracted the metadata from scholarly documents with the aim of creating a knowledge base of each scholarly article for efficient document retrieval. Citation networks can play a vital role in the systematic retrieval of scholarly literature. Much research has been carried out on how useful information can be extracted from scholarly citation networks and used for better information retrieval. CitNetExplorer [88] analyzes and visualizes citation networks to support citation-based scientific literature retrieval [89,90]. The tool is helpful in finding the relevant papers on a specific topic when preparing a review article. The author of [91] extends the co-citation network by incorporating satellite documents. Co-citation is a relationship between two scholarly papers that are cited together by a third scholarly document. When co-citation linkage identifies related documents, more suitable search terms can be obtained from them. Such terms may not have been included in the original seed document.
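To make the graph G(P, C) and the co-citation relationship concrete, the sketch below stores the citation graph as a mapping from each citing paper to its out-cites, derives the in-cites by inverting that mapping, and counts how many documents cite each pair of papers together. The representation, the function names, and the paper identifiers are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def in_cites(citations):
    """Invert the out-cite lists to obtain the in-cites of each paper."""
    incoming = {}
    for citing, cited in citations.items():
        for paper in set(cited):
            incoming.setdefault(paper, set()).add(citing)
    return incoming


def cocitation_counts(citations):
    """Count, for each pair of papers, how many documents cite both.

    `citations` maps each citing paper to the papers it cites.
    Pairs are stored as alphabetically ordered tuples.
    """
    counts = Counter()
    for cited in citations.values():
        # sorted(set(...)) drops duplicate references and fixes pair order.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citation graph: three citing papers and their references
graph = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["B", "C"],
}
incoming = in_cites(graph)
pairs = cocitation_counts(graph)
```

Here A and B are co-cited by P1 and P2, and B and C by P2 and P3, so both pairs have a co-citation count of 2, whereas A and C are co-cited only by P2.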
Despite their numerous benefits, existing citation network approaches consider all citations of a given document to be equally significant. This can lead to situations where inaccurate information is deemed relevant because several authors have cited it. For example, a paper titled "A vector space model for information retrieval", alleged to have been published in 1975, is considered the most commonly cited paper by Gerard Salton even though it does not exist [92]. The paper "Read before you cite!" suggests that authors read just 20% of the works they cite [93]. Other authors have concluded that 25% of references are redundant, 40% serve only to meet minimum standards [94], and 62.7% merely concern definitions, tools, etc., without being attributed to a specific function [38][39][40]. All this shows that, to improve the quality of citation-based applications, citations should not be regarded as equally significant. Several researchers have worked on mitigating the challenges associated with citation networks [31], including:
• Scattering: There is no single authoritative place that keeps a record of the entire academic citation network. Due to the distributed nature of the academic web, different search platforms have different citation metrics and analytics. This can pose a significant challenge when using citation networks to credit a paper, author, journal, institution, country, etc. [95].
• Uncertainty: The relationship between citations is not always available in repositories. For instance, in the ACM digital library, 18.5% of publications have no citing details, while 55.6% lack any cited information [96]. Therefore, different papers receive the same ranking score. Handling the erroneous and missing citation metadata (in-cites and out-cites) of scholarly documents is a massive challenge for academic retrieval systems [97].
• Restriction: As discussed earlier, the academic web is not entirely freely available [2]. Since the citations used in a document are part of that document, the unavailability of such a document can be a profound challenge to accurate academic information retrieval.
• Integrating scholarly metrics and analytics: A citation network is a handy assessment tool for distinguishing different scholarly units [90,95]. It is beneficial when determining the impact of papers, authors, co-authors, conferences, journals, institutions, projects, countries, etc., in a particular field of study. However, due to the challenges associated with citations, it may not always produce optimal results.
• Accessibility: Due to the growth rate of academic literature, it is challenging to locate relevant papers. More and more documents are published on a daily basis [97]. For instance, PLOS ONE alone published 30,000 documents in 2014, an average of 85 documents per day [97]. These publications result in the addition of billions of nodes to the already existing citation networks. Web of Science, for example, accumulates about 1 billion citations per year. This makes citation networks harder to access [97].
• Complex Graph: A citation network is a complex graph with non-trivial features such as instantaneous network evolution, complex nested topology, multiple nodes/edges, and a large-scale growth rate. These features make some of the algorithms needed to obtain optimal results inapplicable [98].

IV. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
Academic search is a fascinating research area. Several academic search engines exist today, with Google Scholar being the dominant one. However, locating relevant documents is still challenging due to the rapid growth of research paper repositories. In this regard, much research is going on in the field of scholarly document ranking, retrieval, recommendation, and the proper exploitation of citation networks. We are still far from an efficient scholarly retrieval system. The reasons are many, including:
• Disambiguating authors: When adding publications to a Google Scholar profile, a user is shown several articles identified as possibly written by that user, which indicates that authors are not yet reliably disambiguated. For accurate disambiguation, email addresses, affiliations, city, country, and field of expertise could be exploited together with efficient classifiers and machine learning algorithms.
• The limited use of semantic web technologies, especially ontologies and linked open data, constrains popular scholarly retrieval systems, where reasoning and machine understandability could bring fruitful results.
• Semantic web technologies together with natural language processing could be employed to identify and categorize in-citations in order to differentiate between relevant papers and ones that were cited as self-citations or merely to increase the number of references in the bibliography section.
• User studies are required to understand how users interact with scholarly retrieval systems and what their information needs are, so that more user-friendly solutions can be produced. Given the high growth rate of the academic web, it is necessary to develop tools that recognize and emphasize users' needs.
• The use of citation networks in scholarly retrieval and in assessing the impact of scholarly works has achieved many fruitful results. However, efforts are required to exploit them to their fullest in building the scholarly reputation of authors, research publications, and journals so that users can better judge the quality of a publication. In this regard, the challenges mentioned above need the attention of researchers and practitioners.
• The performance of an academic search engine can be improved by gaining in-depth insight into citation networks (i.e., paper network, author network, collaboration network, and text network) and inferring the most influential citations. The relationship between citations, authors, and publications can also be computed for each document to rank documents efficiently.
• The ranking algorithms of scholarly search engines differ, and many ranking factors are by nature ambiguous and difficult to formalize. Most of the algorithms are proprietary, making it difficult to understand how they work. Therefore, detailed empirical studies are required to understand their ranking techniques and to devise solutions that are free, open-source, and reproducible.
This review paper is an attempt to draw the attention of researchers and practitioners to the many ways in which a more efficient scholarly retrieval system could be developed. It emphasizes mitigating the information overload that researchers, especially newcomers, currently face while trying to access the most relevant papers. For a more efficient solution, it is essential first to understand users' information needs, then to develop approaches in the light of these needs, and finally to exploit citation networks and modern IR, machine learning, and semantic web technologies so that search engines can better understand the content and provide timely access to the desired content.