A modular framework for ontology learning from text in Portuguese

Research on ontology learning has been carried out in many knowledge areas, especially in Artificial Intelligence. Semi-automatic or automatic ontology learning can contribute to the field of knowledge representation. Many semi-automatic approaches to ontology learning from texts have been proposed, most of them using natural language processing techniques. This paper describes the construction of a computational framework for semi-automated ontology learning from texts in Portuguese. Axioms are not treated in this paper. The work described here builds on Philipp Cimiano's proposal, combined with mechanisms for text standardization, natural language processing, identification of taxonomic relations and techniques for structuring ontologies. A case study in the public security domain was also carried out, showing the benefits of the developed computational framework. The result of this case study is an ontology for that area.


INTRODUCTION
One of the main challenges of Computer Science is to turn computers into machines that can learn by themselves. To do so, computers must be given capacities that allow them to simulate, in some way, the human learning process. Some researchers have conducted studies in the field of computational learning with ontologies. Ontologies can store knowledge in their structure. They can be visualized as graphs of concepts and relations, where concepts are the nodes and relations are the edges between nodes (Wong et al., 2012). One of the biggest problems faced by ontology engineers is precisely the ontology learning problem (Baségio & de Lima, 2006).
In Computer Science, ontologies serve as metadata schemas, providing a concept-controlled vocabulary (Maedche & Staab, 2001). They can be composed of concepts, relations, concept and assertion instances, and should be comprehensible to agents and other computational entities (Gruber, 1995). Effective man/machine communication through ontologies requires the use of a specific language (Wong et al., 2012).
Ontology structuring is an important step in knowledge-based system development, as it allows knowledge formalization and sharing between humans and computer systems (Lopes et al., 2012). The use of ontologies is important to reduce the ambiguity problem existing in texts and to serve as a concept dictionary within a given domain (Maedche & Staab, 2001).
The first stage of an ontology learning process is the choice of the corpus to be used as a data source. The quality and richness of the corpus are fundamental to the performance of the information extraction process (Lopes et al., 2009).
Generally, three approaches to the ontology learning process are considered: fully automated, not automated (manual) and semi-automated. The first can be considered utopian, the second is inefficient and the third seems to be the best option (Cao et al., 2012). To semi-automate the ontology learning process, it is necessary to define a source of information from which knowledge will be extracted (Ghisi et al., 2012). According to Cimiano (2006), the term ontology learning was originally used by Maedche & Staab (2001) to describe the process of acquiring knowledge from data. Here, different subjects act in a complementary way, working with different types of data (unstructured, semi-structured or fully structured) (Maedche & Staab, 2001). However, there are gaps in providing integrated solutions to support ontology learning from texts. Philipp Cimiano proposed to organize the ontology learning process in several steps (Cimiano, 2006). This article presents a computational framework (hereafter simply "framework") to help software engineers introduce ontology learning capabilities in their applications. We do not propose new approaches to learn ontologies. Our purpose is to integrate third-party solutions into a single tool, following Cimiano's process. The framework thus works like a production line, where at each step a suitable solution can be used. If a better solution for a step is developed, it can easily be added to the framework to replace the solution previously in use.
To show the framework's usefulness, we conducted an experiment in the public security context. In this work, we only considered ontology learning from text; so, for simplicity, when we write "ontology learning", we mean "ontology learning from text".
The rest of this paper is organized as follows: Section 2 presents some related works; Section 3 presents the developed framework; Section 4 describes the framework evaluation and the results obtained. Finally, Section 5 presents conclusions and suggestions for future work.

RELATED WORKS
Baségio & de Lima (2006) proposed a semi-automated approach to ontology learning from texts, working especially on the phases of concept and taxonomic relation extraction for Brazilian Portuguese. In their work, each step (relevant terms, compound terms and Hearst-pattern-based relations) was validated manually by specialists. Junior (2008) proposed a Java plug-in, integrated with Protégé, to help in semi-automated ontology learning from text in Brazilian Portuguese. Motta (2009) proposed a semi-automated process for ontology learning from text in Portuguese. Zahra (2009) proposed a web application for the semi-automatic extraction of ontological structures from text in Portuguese. Gonçalves et al. (2011) proposed an application for ontology learning from text, focused on the identification of concepts and relations; they presented a graph-based approach to identify concepts and a concept analysis to obtain relations. Lopes et al. (2012) proposed an application to extract relevant terms from a corpus in Portuguese. They developed a Java tool that incorporates mechanisms for identifying relevant terms, morphosyntactic labeling, agreement, term frequency, conceptual markings and hierarchy of concepts.

THE PROPOSED FRAMEWORK
A framework is a set of cooperating classes that makes a software design reusable (Gamma et al., 1995). Thus, object orientation can be used in its structure, aiming at the reuse of software development artifacts. Being an incomplete system, a framework can be adapted to implement complete applications in a given domain, reducing the effort of deploying applications (Fayad et al., 1999).
In this section, we detail the framework design stages and architecture. The framework takes textual documents as input, performs the tasks required for ontology learning and produces a textual document, encoded in OWL, as output. To view the generated ontological structure, the Protégé tool can be used. The framework developed in this work was called Sabença. It has a modular architecture, as shown in Figure 1, and third-party components have also been integrated into it. Each module is described in the following.
The Hearst patterns were originally prepared for use in English, so an adaptation was required for Portuguese; in this work, we chose the adaptation proposed by Baségio & de Lima (2006). After applying the Hearst method, hypernymy and hyponymy relations between terms were discovered, and a single file containing the found relations was generated. For this, we used the Apache JENA 2.12 component (Jena, 2014). Complex terms are constructed from n-gram generation. We did not find any scientific study that used n-grams of level 5 or higher, so we considered 4-level n-grams sufficient to find complex terms that are valuable in the ontology learning process. A single file containing the found 4-grams was then generated. After n-gram generation, complex terms are found, and a single file containing the found compound terms is generated. To find taxonomic relations, relevant terms, compound terms and the Hearst method relations are combined into a single taxonomic structure; the found relations are stored in a single file. Exporter Module: the exporter module builds ontological structures in the OWL language. In this module, the taxonomic relations generated in the previous module are used. An ontological structure is generated containing simple terms, compound terms and Hearst-pattern-based relations. Finally, two files are generated to save the ontology produced by the framework: "ontology.owl" and "ontology.rdf".
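For illustration only, the sketch below shows how an exporter step could write a small two-level taxonomy with Apache Jena and serialize it to "ontology.owl". The class names, namespace URI and terms are hypothetical, not the actual Sabença implementation; the import paths shown are those of recent Jena releases (Jena 2.x used the com.hp.hpl.jena packages).

```java
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;

import java.io.FileOutputStream;

// Minimal sketch of an exporter step: builds a two-level taxonomy
// and serializes it as OWL (RDF/XML). URIs and terms are illustrative only.
public class OwlExportSketch {
    public static void main(String[] args) throws Exception {
        String ns = "http://example.org/ontologia-seguranca#"; // hypothetical namespace
        OntModel model = ModelFactory.createOntologyModel();

        OntClass policia = model.createClass(ns + "policia");
        OntClass policiaMilitar = model.createClass(ns + "policia_militar");
        policiaMilitar.addSuperClass(policia); // taxonomic (is-a) relation

        try (FileOutputStream out = new FileOutputStream("ontology.owl")) {
            model.write(out, "RDF/XML"); // the framework also writes an RDF serialization
        }
    }
}
```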

Replacing Modules in the Framework: the Sabença framework uses dependency injection to maintain low coupling and high cohesion between classes and components and to allow flexible modification of modules. Thus, modules can be replaced by others, or new modules can be added easily. Dependency injection is a design pattern in which a class receives the objects it depends on from an external source instead of creating them itself. Using dependency injection makes it easier to replace existing framework resources with others. The framework depends on the property file "sabenca.properties"; this configuration file should be modified accordingly in order to use other components.
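As a rough illustration of how a properties-driven module swap might look, the sketch below reads a module class name from a configuration file and instantiates it by reflection. The interface, property key and class name are hypothetical and do not reflect the actual contents of "sabenca.properties".

```java
import java.io.FileInputStream;
import java.util.Properties;

// Hypothetical contract that each framework step would implement.
interface Module {
    void execute();
}

// Minimal sketch: the module implementation to use is read from a properties
// file, so replacing a step means editing one configuration entry only.
public class ModuleLoaderSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new FileInputStream("sabenca.properties"));

        // e.g. exporter.module=br.example.sabenca.OwlExporter (illustrative value)
        String className = props.getProperty("exporter.module");
        Module exporter = (Module) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        exporter.execute();
    }
}
```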
We understand that some phases of the ontology learning process require human intervention. Table 1 shows all the stages implemented, based on Philipp Cimiano's proposal (Cimiano, 2006).

EVALUATION AND RESULTS
According to Wong et al. (2012), validation can be grouped into several categories: contextual approach, coverage approach, comparison-based approach, evaluative approach, structural approach and functional approach. An evaluative approach, in which domain experts evaluate the ontology layers [terms, concepts, relations and, optionally, axioms], can also be used. In this work, we have evaluated simple terms, compound terms and taxonomic relations. We have followed Janez Brank's recommendation to evaluate the different layers of the ontology separately, rather than trying to evaluate the ontology directly as a whole (Brank et al., 2005). With this, we can determine the relevance of the terms for the domain and whether they are conceptually correct (Wong et al., 2012).
To validate the developed framework, a case study was conducted in the public security area. A manual validation was done by experts in the area. Each expert marked the terms they considered correct, and the terms marked as correct by all experts were selected. After that, the values generated by the framework were manually replaced by those selected by the experts, and these terms were used in the generation of the ontological structure.
The evaluation process took place by sending each expert an email with a worksheet attached. The evaluation worksheet contained three tabs (Hearst patterns, relevant terms and compound terms) with two columns (terms and correct). Each expert should mark an "X" in the column "correct" for each corresponding value only if they understood that the term or pattern belongs to the public security domain. The body of the email contained instructions on how to complete the evaluation worksheet.
The validated data served as input to the ontology structuring process. For this, we analyzed the information marked by each expert and selected the terms that were common to all evaluated worksheets. Later, we manually replaced the values generated by the framework with the results selected in the evaluation process. After that, the ontological structure was generated.
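A minimal sketch of the selection rule described above (keep only the terms marked correct by every expert) is given below. Worksheet parsing is omitted and the term sets are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: keeps only the terms that appear in every expert's "correct" list.
public class ExpertAgreementSketch {
    static Set<String> termsApprovedByAll(List<Set<String>> perExpertApprovals) {
        Set<String> agreed = new HashSet<>(perExpertApprovals.get(0));
        for (Set<String> approvals : perExpertApprovals) {
            agreed.retainAll(approvals); // intersection across experts
        }
        return agreed;
    }

    public static void main(String[] args) {
        List<Set<String>> approvals = new ArrayList<>();
        approvals.add(new HashSet<>(List.of("polícia militar", "segurança pública", "marketing")));
        approvals.add(new HashSet<>(List.of("polícia militar", "segurança pública")));
        System.out.println(termsApprovedByAll(approvals)); // only terms marked by all experts
    }
}
```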

Textual Document Acquisition
To conduct experiments in our case study, textual documents (.PDF, .DOC and .DOCX) were selected from the library of the military police academy of the state of Goiás, Brazil, and from the Revista Brasileira de Estudos em Segurança Pública - REBESP (the Brazilian Journal of Studies on Public Security, available at http://revista.ssp.go.gov.br/index.php/rebesp, last seen in May 2017), composing a 152-document corpus totaling 3958 pages. This collection contained scientific articles and monographs on public security. Librarians of this police academy manually selected the documents.
In the case study, we observed that the selected documents contained terms from other areas not related to public security, such as education, marketing and management.

Data analysis
The framework only converts unprotected documents. During importation, 8 protected files were found; thus, only 144 documents were converted.

Term identification
After conversion, the morphosyntactic labeling task is started. Morphosyntactic labeling is the main stage of the ontology learning process, and its results influence the subsequent modules. 1283064 terms were found in this phase. To eliminate non-relevant domain terms, a list of stopwords was used. The stopword list was automatically generated after labeling the terms. At the end, 5934 terms were found to be non-relevant for the domain. To define relevant terms, all terms in the stopword list were removed. Then, terms were grouped by nouns. After that, terms were weighted using the TF-IDF method. The weighting resulted in 20076 extracted terms.
The TF-IDF measure returns a very large number of non-relevant terms to be presented to the ontology engineer, and there is no standard pruning rule for selecting them (Baségio & de Lima, 2006). Therefore, the list of terms was analyzed and sorted in descending order of TF-IDF index. Terms with a TF-IDF index of 0.0 were identified and characterized as not relevant to the domain. To select relevant terms, a minimum TF-IDF value of 0.1 was used, which reduced the list. Thus, 314 relevant terms for the domain were obtained.
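For reference, a minimal sketch of TF-IDF weighting with threshold pruning as described above is shown below; the 0.1 cutoff is the one used in the case study, while the document contents and smoothing choice are illustrative assumptions.

```java
import java.util.Collections;
import java.util.List;

// Sketch of TF-IDF scoring over a tiny corpus and pruning below a threshold.
public class TfIdfSketch {
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm)); // smoothed IDF
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("polícia", "militar", "segurança"),
                List.of("segurança", "pública"),
                List.of("educação", "gestão"));
        double threshold = 0.1; // minimum TF-IDF used in the case study
        for (String term : List.of("polícia", "segurança")) {
            double score = tfIdf(term, corpus.get(0), corpus);
            if (score >= threshold) {
                System.out.println(term + " kept (TF-IDF = " + score + ")");
            }
        }
    }
}
```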

Compound term identification
The Markov method was used to identify compound terms. In this paper, only the compound terms contained in the relevant term list extracted in the previous section were selected: 1516 relevant compound terms were extracted out of the 18678 compound terms generated. Table 2 shows the number of relevant compound terms, classified by rule. The column "Rules" shows the rules used to define compound terms; e.g., "polícia de primeiro mundo" ("first-world police") is selected by the rule "sub + prp + sub + adj", where "sub" = noun, "prp" = preposition and "adj" = adjective. The column "Qty" shows the number of terms found for each rule.
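As a simple illustration, the sketch below checks whether the POS-tag sequence of a candidate term matches the rule cited above ("sub + prp + sub + adj"). The tags for the candidate are assumed to come from the labeling step; the helper class and example tagging are hypothetical.

```java
import java.util.List;

// Sketch: checks whether a POS-tag sequence matches the rule "sub + prp + sub + adj",
// the rule given above for compound terms such as "polícia de primeiro mundo".
public class CompoundRuleSketch {
    static boolean matchesRule(List<String> tags, List<String> rule) {
        return tags.equals(rule);
    }

    public static void main(String[] args) {
        List<String> rule = List.of("sub", "prp", "sub", "adj");
        // POS tags of a 4-gram candidate as produced by the labeling step (illustrative)
        List<String> candidate = List.of("sub", "prp", "sub", "adj");
        System.out.println(matchesRule(candidate, rule)); // true: keep as compound term
    }
}
```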

Hearst Patterns
Hearst patterns were used to find hypernymy and hyponymy relations between terms. In this work, the pattern rules were applied, resulting in 104 taxonomic relations. The number of relations found by each Hearst pattern can be seen in Table 3. The column "Pattern" shows the Hearst patterns adapted by Baségio & de Lima (2006); the column "Qty" shows the number of relations found by applying the Hearst patterns to the corpus documents.
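As an illustration, a Portuguese counterpart of the "such as" pattern ("NP tais como NP") could be matched with a regular expression like the sketch below. The actual patterns used are those adapted by Baségio & de Lima (2006); this regex and the example sentence are simplifications for single-word noun phrases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of applying one Hearst-style pattern in Portuguese:
// "<hypernym> tais como <hyponym>" yields a hyponymy relation.
public class HearstPatternSketch {
    private static final Pattern TAIS_COMO =
            Pattern.compile("(\\p{L}+) tais como (\\p{L}+)");

    public static void main(String[] args) {
        String sentence = "crimes tais como roubo foram registrados"; // illustrative sentence
        Matcher m = TAIS_COMO.matcher(sentence);
        if (m.find()) {
            // group(2) is a hyponym of group(1)
            System.out.println(m.group(2) + " is-a " + m.group(1));
        }
    }
}
```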

Domain ontology structuring
To identify taxonomic relations from the phrase core of the compound terms, the relations between each compound term and the relevant terms are used (Baségio & de Lima, 2006). Therefore, the relations between compound terms and relevant terms are searched first. Each compound term found receives a <TYPE> annotation indicating its taxonomic relation.
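For example, following the phrase-core heuristic described above, a compound term whose head noun is itself a relevant term can be attached as its hyponym. The sketch below illustrates this with illustrative term lists; it prints a simple pair instead of the framework's <TYPE> markup.

```java
import java.util.List;

// Sketch of the phrase-core heuristic: if the head noun of a compound term
// is itself a relevant term, the compound term is taken as its hyponym.
public class HeadRelationSketch {
    public static void main(String[] args) {
        List<String> relevantTerms = List.of("polícia", "segurança");
        List<String> compoundTerms = List.of("polícia militar", "segurança pública");

        for (String compound : compoundTerms) {
            String head = compound.split(" ")[0]; // head noun of the compound term
            if (relevantTerms.contains(head)) {
                System.out.println(compound + " is-a " + head); // candidate taxonomic relation
            }
        }
    }
}
```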
Next, the Hearst pattern results are added to the existing taxonomic relations. Finally, the taxonomic relations are exported to OWL Lite and to the RDF N-Triples format. The exported result is stored in an OWL file, which can be viewed using the Protégé tool. The resulting ontology is composed of 2 levels and has 194 entities. We limited the number of levels of the ontology only to generate it faster on the computer where we performed the experiments; this is not a limitation of the framework.

Table 3. Number of taxonomic relations found by each Hearst pattern.
Pattern                                      Qty
SUB as (SUB,)* (ou-e) SUB                     33
SUB such as (SUB,)* (ou-e) SUB                 3
such SUB as (SUB,)* (ou-e) SUB                 0
SUB, SUB*, or another(s) SUB                   6
SUB, SUB*, and another(s) SUB                 59
SUB, including SUB,* (ou-e) SUB                0
SUB, especially SUB,* (ou-e) SUB               0
SUB, mainly SUB,* (ou-e) SUB                   1
SUB, particularly SUB,* (ou-e) SUB             0
SUB, in particular SUB,* (ou-e) SUB            2
SUB, in particular SUB,* (ou-e) SUB            0
SUB, in a special way SUB,* (ou-e) SUB         0
SUB, especially SUB,* (ou-e) SUB               0

CONCLUSION
An annotated corpus was generated from textual documents manually selected by librarians. The corpus used in this work has more than 1 million words extracted from textual documents. In this work, various frameworks available for text conversion, labeling, stemming, proper name identification and ontology language writing were tested. Most of these frameworks were designed to operate only in English, and only tools that produce satisfactory results for the Portuguese language were chosen. The tools used had an impact on the quality of the results. We found that identifying taxonomic relations from compound terms with 4-level n-grams produced better results than the Hearst patterns. The five implemented stages of ontology learning allowed us to structure an ontology in Portuguese. However, we understand that ontologies are representations of shared realities and, as the knowledge acquired from reality changes, the structure of ontologies should change as well. Therefore, there are no complete ontologies.
The Sabença framework allows the inclusion of third-party tools and the customization of the developed classes. In this version 0.1, we use third-party tools that perform the steps necessary to implement ontology learning from texts in Portuguese. We were careful to choose the tools best suited to this objective, but we did not use third-party tools in all the stages of the ontology learning process; some stages had to be implemented by us.
The main contribution of this work is to facilitate the development of applications that require the use of ontologies. The framework was developed mainly from Philipp Cimiano's work.
The case study we conducted produced results indicating that the developed framework can facilitate the process of building domain ontologies from texts. The Sabença framework can be improved by constructing new modules to support extracting text from audio and video, to handle other languages and to recognize non-taxonomic relations. It would also be interesting to build a new module to merge ontologies.