Using Association Rules to Enrich Arabic Ontology

In this article, we propose the use of a minimal generic base of association rules between terms to automatically enrich an existing domain ontology. Initially, non-redundant association rules between terms are extracted from an Arabic corpus. Then, the candidate terms are placed by matching the concepts of the initial ontology against the premises of the association rules, using three distance measures that we define.
Keywords-ontology; automatic enrichment; association rules


I. INTRODUCTION
An ontology is a tool for representing knowledge and reasoning that organizes a set of concepts in a specific field, as well as the relations between these concepts [1][2][3]. Ontologies are regularly subject to updates and changes. Performing these updates manually is an expensive and time-consuming task, as it mobilizes experts in the field to identify and classify new vocabulary items in the ontology. To accelerate this process of evolution and adaptation and to remove subjectivity, recent research has focused on semi-automatic and automatic ontology enrichment techniques. The majority of approaches, often based on statistical or linguistic tools, focus on adding new concepts and/or relationships between them. The ontology enrichment process can be divided into two stages: the search for new concepts and relations, and the placement of these concepts and relations within the ontology [3]. The general process is depicted in Figure 1. Several works have focused on this process of enrichment of ontologies, addressing one or more of its stages:
• Extraction of representative terms in a specialized field.
• Identification of lexical relations between terms.
• Placement of new terms in an existing ontology.
In these works, the term ontology takes several meanings, such as thesaurus, taxonomy or, more generally, controlled vocabulary. The work dealing with the extraction of candidate terms in the ontology enrichment process is based on statistical and syntactic methods. Statistical methods select terms according to their distribution in the corpus [1][2][3], as well as other measures such as mutual information, "the probability of the appearance of the word A knowing that the word B has appeared", or measures calculating the probability of occurrence of a set of terms [4][5][6]. These different propositions make it possible to identify new ontology elements, but do not allow their placement in the ontology without human intervention. Syntactic methods aim at determining the grammatical function of a word or a group of words within a sentence. They are based on the hypothesis that grammatical dependencies reflect semantic dependencies. These techniques lead to the proposition of new concepts, linked by relations that are not yet semantically identified.
Regarding the identification of concepts and relationships and their placement in the ontology, the extraction of association rules (ARs) is one of the major techniques proposed by the data mining community. Many other works propose the use of frequent correlations that can exist between the terms of a corpus. These approaches most often consist of extracting ARs between candidate terms previously identified by statistical or syntactic tools [7]. At the end of the process, the authors obtain a set of ARs, each describing the existence of a relationship between two concepts [8][9][10].
Fig. 1. General process of ontology enrichment.
In this paper, we propose a methodology for building a conceptual network formed by the combination of two types of knowledge: knowledge present in the initial domain-specific ontological structure, represented by semantic links, and knowledge derived from the minimal generic base of association rules (ARs) between terms, which essentially represents correlations assessed by statistical measures.

A. Existing Approaches to Discover Candidate Concepts
We distinguish two types of methods for the discovery of candidate concepts:
• Statistical methods: they select the terms according to their distribution in the corpus [1][2][3], as well as other more complex measures such as mutual information, tf-idf, etc., or the use of statistical distributions of terms [4]. These different propositions make it possible to identify new ontology elements, but do not make it possible to place them in the ontology without tedious human intervention [12].
• Syntactic methods: they aim to determine the grammatical function of a word or group of words within a sentence. They are based on the hypothesis that grammatical dependencies reflect semantic dependencies [13]. In a sentence, they define the verb (V) as the relation that links the subject (S) to the complement (C). They thus have the disadvantage of identifying only the relationships labeled by verbs. Some approaches also use syntactic patterns [12]. The extracted terms indicate the new candidate concepts for enrichment, but also the existence of relations between them. However, these relationships are not labeled semantically. Moreover, no measure is calculated to evaluate the newly added relations semantically.

B. Existing Approaches to the Concept Placement in Ontology
After the discovery of the candidate terms, it is essential to detect the relations between these new terms and those which link them to the initial ontology. In [2], the authors propose a statistical approach based on the frequent co-occurrence of candidate terms with terms of the initial ontology. The major drawback of this work is that it does not allow the precise addition of new concepts and relations in the ontological structure [14]. Other approaches in the literature suggest using data mining techniques [10,11,13]. The work in [4,15] is based on a classification method that brings the candidate terms contained in the texts closer to the concepts present in the ontology. The principle is similar to that of the approaches in [1,15], which group terms by a clustering method according to their number of occurrences within the corpus. However, these methods do not detect the relations between the candidate terms, so these new terms can be added only by human intervention.
In addition, several studies propose the use of frequent correlations that exist between the terms of a corpus. These approaches consist of extracting association rules [7] between candidate terms [8][9][10]. At the end of the search process, a set of ARs between terms is generated. Each rule expresses the existence of a relationship between two concepts of the domain. This enrichment process remains semi-automatic because, on the one hand, the number of derived ARs is very large and, on the other hand, human intervention is necessary to semantically define the discovered relations and to name them.

III. ASSOCIATION RULES
Association rule mining is a well-known knowledge discovery technique for finding associations between items in a transaction database. Its definition varies according to three main currents: the author of [16] defines rules of statistical implication to help educationalists find relationships between the acquisition of basic notions in class; the authors of [17] are more interested in an ordered representation of concepts with informative implications; and the authors of [18] favor the optimized extraction of ARs from large databases. Subsequently, these forms have been extended in several directions. Binary properties are no longer required, and ARs can now be built over numerical properties [19,20]. To limit the growth of rule extraction time, more efficient algorithms have been proposed [21]. The semantics of the rules have been refined through many quality indices [22], which help users choose the rules most appropriate to their needs. Navigation and querying with an appropriate language have been developed [23] to facilitate the exploration of the rule set.
ARs express conditional relationships between the attributes of a database. They take the form of an implication A→B, where A and B are itemsets. A is called the antecedent and B the consequent of the rule, which provides information about the existing relations between A and B. A rule expresses how objects or items are related to each other and how they can be grouped together. The first step of association rule mining is to find the frequent itemsets, which serve as candidates. A rule is then evaluated by two statistical measures called support and confidence. The support Sup(A→B) is defined as the relative frequency of transactions in the data set D that contain both itemsets A and B, i.e., Sup(A→B) = |{t ∈ D : A ∪ B ⊆ t}| / |D|.
The confidence Conf(A→B) measures the reliability of the inference made by the rule, i.e., the proportion of transactions containing A that also contain B: Conf(A→B) = Sup(A→B) / Sup(A).
Then, the important association rules are filtered from the candidate itemsets. A rule r is valid only if Sup(A→B) > minsup and Conf(A→B) > minconf, where minsup is the support threshold and minconf is the confidence threshold. These two values are specified by the user.
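The support and confidence measures and the validity test above can be illustrated with a minimal Python sketch. The function names and the toy transactions are assumptions introduced for illustration only; they do not come from the cited works.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Conf(A -> B) = Sup(A u B) / Sup(A); defined as 0.0 when A never occurs."""
    a = frozenset(antecedent)
    sup_a = support(a, transactions)
    if sup_a == 0.0:
        return 0.0
    return support(a | frozenset(consequent), transactions) / sup_a


def is_valid(antecedent: Iterable[str], consequent: Iterable[str],
             transactions: List[Transaction],
             minsup: float, minconf: float) -> bool:
    """A rule is kept only if both measures strictly exceed their thresholds."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return (support(a | b, transactions) > minsup
            and confidence(a, b, transactions) > minconf)


# Hypothetical toy data: four transactions over the items a, b, c.
toy_transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a"}),
    frozenset({"c"}),
]

sup_ab = support({"a", "b"}, toy_transactions)        # 2 of 4 transactions
conf_ab = confidence({"a"}, {"b"}, toy_transactions)  # 2 of the 3 containing a
```

In this toy data the rule a→b passes thresholds such as minsup = 0.4 and minconf = 0.6, but would be rejected with minconf = 0.7.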

A. Process for Association Rules Extraction
The process of extracting association rules consists of several phases, ranging from data selection and preparation to the interpretation of results (Figure 2):
• Data selection and preparation (cleaning): In this phase, the data used for the extraction of the association rules are selected from the database and transformed into an extraction context. This phase is necessary to be able to apply rule extraction algorithms to different kinds of data from different sources, to concentrate the search on the useful data, and to minimize extraction time [24]. To obtain significant rules, the morphological analysis of each word must follow the order described in [25] and shown in Table I.
• Generation of association rules: This phase is carried out from the frequent itemsets generated previously. In general, the generation of association rules is done directly, without access to the extraction context, and the cost of this phase in execution time is therefore low compared to the cost of extracting frequent itemsets.
• Visualization and interpretation: This phase consists in the visualization of the association rules extracted from the context and their interpretation. The domain expert can thus judge their relevance and usefulness.
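The frequent itemset extraction and rule generation phases can be sketched as follows. This is a simplified level-wise search for illustration, not the algorithm of any cited work; the function names, the strict thresholds, and the toy context are assumptions. Note that `generate_rules` reads only the supports already stored with the frequent itemsets, which is why this phase needs no further access to the extraction context.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def frequent_itemsets(transactions: List[Itemset], minsup: float) -> Dict[Itemset, float]:
    """Level-wise search returning {itemset: support} for support > minsup."""
    n = len(transactions)
    if n == 0:
        return {}
    items = sorted({i for t in transactions for i in t})
    frequent: Dict[Itemset, float] = {}
    current = [frozenset([i]) for i in items]
    while current:
        # Count each candidate against the extraction context.
        level = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup > minsup:
                level[cand] = sup
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == len(a) + 1})
    return frequent


def generate_rules(frequent: Dict[Itemset, float],
                   minconf: float) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Derive rules (antecedent, consequent, support, confidence) from stored supports."""
    rules = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                a = frozenset(ante)
                # Every subset of a frequent itemset is frequent, so its support is stored.
                conf = sup / frequent[a]
                if conf > minconf:
                    rules.append((a, itemset - a, sup, conf))
    return rules


# Hypothetical toy context: each transaction is the set of terms of one document.
toy_context = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a"}),
    frozenset({"c"}),
]
toy_frequent = frequent_itemsets(toy_context, minsup=0.4)
toy_rules = generate_rules(toy_frequent, minconf=0.6)
```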

IV. PROPOSED APPROACH
This stage consists in bringing the terms that appear in the premises of the candidate rules of the association rule base closer to our initial ontology, denoted O. These terms are identified as candidate concepts for enrichment.
• Definition 2: A termset is a non-empty set of terms denoted by $(t_1, t_2, \ldots, t_k)$. An association rule R is evaluated by two statistical metrics, namely support and confidence [18]. The support of the association rule $R: T_i \rightarrow T_j$, denoted by Supp(R), expresses the frequency with which the two termsets $T_i$ and $T_j$ co-occur in the corpus C. The confidence of R, denoted by Conf(R), expresses the conditional probability that a document contains the termset $T_j$, given that it contains the termset $T_i$. An association rule is valid if its confidence is greater than or equal to the minimum confidence threshold, denoted minconf.

A. Extraction of the Ontology and Creation of the Generic Base
We use the GEN-MGB algorithm [26] to extract the minimal generic base (MGB) of non-redundant ARs. This base is characterized by its significant compactness, i.e., it contains a minimal core of ARs from which all the redundant valid rules can be deduced by means of a complete and valid axiomatic system [26,27]. Considering the text extraction context K, we adapt the definition of the MGB given in [26] to the problem of ontology enrichment. We recall that non-redundant ARs have only one domain term in the premise [28].
We then use a semi-automatic tool such as Protégé 2000 [29] to construct the ontology O from the CO corpus. It is validated downstream by a domain expert. The semantic links between the concepts of O are evaluated with the similarity measure proposed in [30], which takes into account both the depth of the concepts in the concept hierarchy and the structure of that hierarchy. Thus, the similarity between two concepts $c_1$ and $c_2$ of the ontology O is calculated as [30]:
$$\mathrm{SimWu}(c_1, c_2) = \frac{2 \times \mathrm{depth}(c)}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)} \tag{4}$$
where $\mathrm{depth}(c_i)$ corresponds to the depth level of the concept $c_i$ and $c$ represents the most specific concept that generalizes $c_1$ and $c_2$ in O.
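The SimWu measure can be sketched in Python over a small tree-shaped hierarchy. The child-to-parent dictionary, the toy concept names, and the convention that the root has depth 1 are assumptions made for this illustration; the source does not fix a depth convention.

```python
from typing import Dict, List


def path_to_root(concept: str, parent: Dict[str, str]) -> List[str]:
    """Concepts from `concept` up to the root, both included."""
    path = [concept]
    while concept in parent:  # the root has no entry in `parent`
        concept = parent[concept]
        path.append(concept)
    return path


def depth(concept: str, parent: Dict[str, str]) -> int:
    """Depth level of a concept, with the root at depth 1 (assumed convention)."""
    return len(path_to_root(concept, parent))


def most_specific_subsumer(c1: str, c2: str, parent: Dict[str, str]) -> str:
    """Most specific concept that generalizes both c1 and c2."""
    ancestors_c1 = set(path_to_root(c1, parent))
    for c in path_to_root(c2, parent):  # walks upward, so the first hit is the deepest
        if c in ancestors_c1:
            return c
    raise ValueError("concepts do not share a root")


def sim_wu(c1: str, c2: str, parent: Dict[str, str]) -> float:
    """SimWu(c1, c2) = 2 * depth(c) / (depth(c1) + depth(c2))."""
    c = most_specific_subsumer(c1, c2, parent)
    return 2 * depth(c, parent) / (depth(c1, parent) + depth(c2, parent))


# Hypothetical toy hierarchy (child -> parent).
toy_parent = {
    "animal": "thing",
    "plant": "thing",
    "cat": "animal",
    "dog": "animal",
}

sim_cat_dog = sim_wu("cat", "dog", toy_parent)      # subsumer "animal" at depth 2
sim_cat_plant = sim_wu("cat", "plant", toy_parent)  # subsumer "thing" at depth 1
```

Under this convention, a concept compared with itself has similarity 1, and siblings deep in the hierarchy score higher than concepts that meet only at the root.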

B. Adopted Approach for the Ontology Enrichment
The enrichment process we propose is iterative and includes the following steps:

1) Calculation of the candidate concepts for the enrichment
For each concept $c_i$ of the ontology O, we compute the set of candidate concepts to be connected to $c_i$. This set includes the terms appearing in the conclusions of the valid association rules whose premise is $c_i$, as well as those of the redundant rules [31].
According to the example shown in Figure 4, the candidate concepts for enrichment related to the concept $c_1$ are $\{c_{10}, c_{12}, c_5, c_{15}\}$.
Fig. 4. Example of calculating candidate concepts.
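The computation of candidate concepts can be sketched as a simple grouping of rule conclusions by premise. The sketch assumes that the rule list has already been expanded to include the redundant rules derivable from the generic base, which is not shown here. The rules and confidence values below are invented toy data, chosen only so that the result for $c_1$ matches the set quoted above; they are not taken from Figure 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Rule = Tuple[str, str, float]  # (premise term, conclusion term, confidence)


def candidate_concepts(ontology_concepts: Set[str], rules: Iterable[Rule],
                       minconf: float) -> Dict[str, Set[str]]:
    """Map each ontology concept to the conclusions of valid rules it is the premise of."""
    candidates: Dict[str, Set[str]] = defaultdict(set)
    for premise, conclusion, conf in rules:
        # Keep only valid rules whose premise is a concept of the ontology.
        if premise in ontology_concepts and conf >= minconf and conclusion != premise:
            candidates[premise].add(conclusion)
    return dict(candidates)


# Hypothetical ontology concepts and rules (confidences are made up).
toy_ontology = {"c1", "c5", "c7", "c15"}
toy_rules = [
    ("c1", "c10", 0.80),
    ("c1", "c12", 0.70),
    ("c1", "c5", 0.90),
    ("c1", "c15", 0.75),
    ("c7", "c15", 0.60),
    ("c1", "c20", 0.20),  # below minconf, ignored
    ("c99", "c3", 0.90),  # premise not in the ontology, ignored
]
toy_candidates = candidate_concepts(toy_ontology, toy_rules, minconf=0.5)
```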

2) Placement of the new concepts
This step consists in placing the candidate concepts while preserving the coherence of the concepts and of the relations already established in the initial ontology. This avoids adding redundant relations when a concept is a candidate to be related to several concepts of the ontology O [32]. Figure 5 shows the addition of the new concepts $c_{10}$ and $c_{11}$ and the displacement of $c_{15}$, because $\mathrm{Conf}(c_1 \Rightarrow c_{15}) > \mathrm{Conf}(c_7 \Rightarrow c_{15})$.
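The placement rule suggested by this example, attaching a candidate to the ontology concept whose rule has the highest confidence, can be sketched as follows. This is one possible reading of the step, not the authors' implementation: the function name and the toy confidences are assumptions, ties keep the first rule seen, and the coherence checks on existing relations are omitted.

```python
from typing import Dict, Iterable, Set, Tuple

Rule = Tuple[str, str, float]  # (premise term, conclusion term, confidence)


def place_concepts(ontology_concepts: Set[str], rules: Iterable[Rule],
                   minconf: float) -> Dict[str, Tuple[str, float]]:
    """For each candidate, keep a single link: the premise with the highest confidence."""
    best: Dict[str, Tuple[str, float]] = {}
    for premise, conclusion, conf in rules:
        if premise not in ontology_concepts or conf < minconf:
            continue
        # Replace the current attachment only when a strictly stronger rule appears.
        if conclusion not in best or conf > best[conclusion][1]:
            best[conclusion] = (premise, conf)
    return best


# Hypothetical rules; c15 is a candidate for both c1 and c7.
toy_ontology = {"c1", "c7", "c15"}
toy_rules = [
    ("c7", "c15", 0.60),
    ("c1", "c10", 0.80),
    ("c1", "c11", 0.70),
    ("c1", "c15", 0.75),
]
toy_placement = place_concepts(toy_ontology, toy_rules, minconf=0.5)
```

With these toy values, c15 ends up attached to c1 rather than c7, which mirrors the displacement described for Figure 5, and each candidate receives exactly one link, so no redundant relation is added.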

3) Calculating the neighborhood of $c_i$ and distance measurements
We define the notion of neighborhood of a concept of the ontology O as follows:
• Definition 3: The neighborhood of a concept is the set of concepts connected to it in the ontology by one or more valid association rules [32].
The relations between $c_i$ and its neighbors are evaluated on the basis of a statistical metric that we call the measure of distance between $c_i$ and its neighborhood, denoted by $\mathrm{Dist}_{O}^{MGB}$. It is computed from the confidence of the association rules involved in the ontology enrichment and from the similarity measures calculated between the concepts of the initial ontological structure [33]. The measure of distance that we define is calculated according to three possible cases [34]:
• Case 1: If the two concepts $c_i$ and $c_j$ come from the base C, then $\mathrm{Dist}_{O}^{C}(c_i, c_j) = \mathrm{Conf}(R: c_i \Rightarrow c_j)$.
V. CONCLUSION
Various ontology enrichment techniques have been proposed in the literature. Their limitations come from the fact that they do not allow the entire enrichment process to be carried out without the intervention of the domain expert. In this article, we presented an automatic ontology enrichment process based on a generic base of association rules. The originality of our approach is that it exploits the maximum number of concepts for enrichment without resorting to a priori knowledge. Its advantage is that it allows the learning of the distance represented by any relation of the enriched ontology.

• Definition 1: An ontology is a quadruplet $O = (C_D, \leq_C, R, \leq_R)$, where $C_D$ is the set of concepts of the domain, $\leq_C$ is the partial order defined on $C_D$, $R$ is the set of relations defined on $C_D \times C_D$, and $\leq_R$ is the partial order relation defined on $R$. We consider that a formal extraction context is a triplet $K = (D, T, R)$, where $D$ represents a finite set of documents from the corpus C, $T$ is a finite set of terms, and $R \subseteq D \times T$ is a binary relation. Each pair $(d, t) \in R$ means that the document $d \in D$ contains the term $t \in T$.

Fig. 3. Process of enrichment of the ontology by using association rules.

• Case 2: If the two concepts $c_i$ and $c_j$ belong initially to the ontology O, then $\mathrm{Dist}_{O}^{C}(c_i, c_j) = \mathrm{SimWu}(c_i, c_j)$, where SimWu is the similarity measure of [30] defined in Section IV.A.
• Case 3: If $c_i$ is a concept added to the ontology O and it is related to the concept $c_k$ of the initial ontology O such that $\mathrm{Dist}_{O}^{MGB}(c_k, c_i) = \mathrm{Conf}(R: c_k \Rightarrow c_i) = \beta$, then any concept $c_x$ of the ontology O in relation with $c_k$ such that $\mathrm{SimWu}(c_k, c_x) = \alpha$ is also in relation with $c_i$. In this case, the distance measure is mixed, i.e., $\mathrm{Dist}_{O}^{MGB}(c_i, c_x) = \alpha \times \beta$.
The three cases are illustrated in Figure 3. Thanks to this enrichment technique, we are able to add new concepts and relationships.
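The three cases of the distance measure can be combined in one small Python function. This is only a sketch under stated assumptions: the similarity between initial concepts is passed in as a function, rule confidences are looked up in a dictionary keyed by (premise, conclusion), and `attached_to` records, for each added concept, the initial concept $c_k$ it was linked to. All names and numeric values are hypothetical.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def dist(ci: str, cj: str,
         initial: Set[str],
         conf: Dict[Tuple[str, str], float],
         sim: Callable[[str, str], float],
         attached_to: Dict[str, str]) -> Optional[float]:
    """Distance between two concepts of the enriched ontology (None if undefined)."""
    # Case 2: both concepts belong to the initial ontology -> similarity measure.
    if ci in initial and cj in initial:
        return sim(ci, cj)
    # Case 1: the pair is directly linked by a rule -> confidence of that rule.
    if (ci, cj) in conf:
        return conf[(ci, cj)]
    # Case 3: one added concept and one initial concept -> mixed measure alpha * beta.
    if ci not in initial and cj in initial:
        added, old = ci, cj
    elif cj not in initial and ci in initial:
        added, old = cj, ci
    else:
        return None
    ck = attached_to.get(added)
    if ck is None or (ck, added) not in conf:
        return None
    beta = conf[(ck, added)]  # Conf(R: ck => added)
    alpha = sim(ck, old)      # SimWu(ck, old)
    return alpha * beta


# Hypothetical toy values.
toy_initial = {"c1", "c2"}
toy_sim_table = {frozenset({"c1", "c2"}): 0.5}


def toy_sim(a: str, b: str) -> float:
    """Symmetric lookup of precomputed similarities; identical concepts score 1."""
    return 1.0 if a == b else toy_sim_table[frozenset({a, b})]


toy_conf = {("c1", "c10"): 0.8}
toy_attached = {"c10": "c1"}

d_case1 = dist("c1", "c10", toy_initial, toy_conf, toy_sim, toy_attached)
d_case2 = dist("c1", "c2", toy_initial, toy_conf, toy_sim, toy_attached)
d_case3 = dist("c10", "c2", toy_initial, toy_conf, toy_sim, toy_attached)
```

Because the similarity of a concept with itself is 1 in this sketch, asking for the distance between an added concept and its own attachment point through the mixed case gives back the rule confidence, so the cases agree where they overlap.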

TABLE I. REPRESENTATIVE SCHEMA OF AN ARABIC WORD STRUCTURE