J A R I T Ö Y L I AdSchema – a Schema for Semistructured Data AC TA WA S A E N S I A No. 157 Computer Science 5 U N I V E R S I TA S WA S A E N S I S 2 0 0 6 Reviewers Professor Eljas Soisalon-Soininen Helsinki University of Technology Department of Computer Science and Engineering P.O. Box 5400, FI-02015 HUT Professor Martti Penttonen University of Kuopio Department of Computer Science P.O.Box 1627, FI-70211 Kuopio ACTA WASAENSIA ACKNOWLEDGEMENTS The development of this thesis was a long process during which I have met many extraordinary people. Although in the cover of this thesis is only one name, many are the names of people who, one way or other nurtured it, and to whom I am pleased to devote my most sincere and deep gratefulness. At first, I would like to thank my supervisor Professor Matti Linna, for his great support and wise guidance throughout the process, steering but without imposing, suggesting but keeping freedom intact. I would also like to thank Professor Merja Wanne for the insightful discussions we had and her indispensable help at the beginning of this work. I also like to thank Professor Jarmo Alander for his support and thoughtful feedback of this work. I would also like to thank Professor Martti Penttonen and Professor Eljas Soisalon- Soininen for reviewing the thesis and their valuable comments and remarks concerning the work. With gratefulness, I would like to acknowledge here, that the research work behind this thesis was financed by Research Institute for Technology and Vaasan yliopistoseura. I would like to thank all my colleagues who have supported or contributed to this work by encouragement, advice, suggestions, discussions, etc. especially Anja, Pekka, Hannu K., Kimmo, Markus, Timo, Johanna and many others. Finally, my deepest gratitude to my loving wife Tuula who has loved me and encouraged me all these years and to our children Jani, Hannakatrin, Ville, Elina-Sofia and Johannes for being there. ACTA WASAENSIA CONTENTS ACKNOWLEDGEMENTS 3 ABSTRACT 6 1. INTRODUCTION 7 1.1. Background of the Research 7 1.2. Research Objectives and Contributions 9 1.3. Structure of Thesis 11 2. DISCOVERING THE DATA STRUCTURE 12 3. FOUNDATIONAL THEORIES 17 3.1. Representative Objects 18 3.2. DataGuide 24 4. ADAJACENCY RELATION SYSTEM 30 5. SCHEMA PROPOSAL FOR SEMISTRUCTURED DATA 35 5.1. Definition of Semistructured Data 35 5.2. Determining of the Degree of a Data Structure 42 5.3. General Query 45 5.4. Algorithms 46 5.4.1. Modeling of Relational Data 46 5.4.2. Modeling of an Object Exchange Model 53 5.5. Definition of an Adjacency Schema 61 5.6. Examples 67 5.7 Creating an AdSchema from Semistructured Data 70 6. CONCLUSIONS AND FUTURE WORK 76 6.1. Conclusions 76 6.2. Summary of Contributions 77 6.3. Future Research 78 REFERENCES 80 6 ACTA WASAENSIA ABSTRACT Töyli, Jari (2005). AdSchema – a Schema for Semistructured Data. Acta Wasaensia No. 157, 85 p. Data that do not have a fixed schema cannot be searched efficiently because the forming of a query is difficult without a schema or some information about the structure. The quantity of such data has increased since the middle of the 90’s, especially on the Internet. There is much research conducted in this area and consequently there are several proposals of how to model such data, how to search such data and also how to present the structure of such data. Semistructured data, as it is called, is still a subject of intensive investigations. In our investigations we have found out that schemas of such data still need enhancements. It has turned out that the schemas can be bigger than the original source database or, if the size is limited, some accuracy is lost. In this work we first propose a schema to be used to present the structure of such data and secondly an algorithm with which we can construct such a schema. Our earlier investigations showed that we can use the theory of Adjacency Relation Systems to model semistructured data and also that the same theory can be used, with some minor modifications, as a foundation for a schema for semistructured data. In this work we apply the theory of Adjacency Relation Systems and propose a new way to present the schema of semistructured data. This approach differs in some parts substantially from those presented by other researchers. Our proposal is not based on merging of similar parts of the schema but on the interrelationships between adjacent types. By this technique we can prevent the schema from growing too large and avoid loosing of accuracy of the schema. Our proposal has also the advantage that the produced schema has the same kind of characteristics as a strong DataGuide. Our work gives also an accurate definition of semistructured data and introduces an algorithm for calculating the degree of semistructured data. Moreover, we present a general query by which we can search data from such data structure. Jari Töyli, Department of Computer Science, University of Vaasa, P.O. Box 700, FI- 65101 Vaasa, Finland, email jari.toyli@uwasa.fi. Keywords: Semistructured Data, DataGuide, Schema Construction, Algorithms, Interrelationship, Degree of Semistructured Data. ACTA WASAENSIA 7 1. INTRODUCTION 1.1. Background of the Research In structured database systems the schema has two purposes. First it describes the structure or type of the data and secondly it describes some constraints on the information system. This type of data can easily be represented and accessed by conventional database systems, like relational and object oriented database systems. However, today there exist more and more data that cannot be constrained by a schema because the data has variations in its structure. The variations typically consist of missing data, duplicated fields, or minor changes in representation. This kind of data arises among others on the Internet, and it is called semistructured data. Semistructured data (Abiteboul 1997; Buneman 1997; Abiteboul, Buneman & Suciu 2000) has been investigated since the middle of the 90’s. During that time the research has mainly focused on developing data models and query languages for semistructured data. There is one proposal that has gained more attention than any other. This model, based on objects, is called Object Exchange Model (OEM) (McHugh, Abiteboul, Goldman, Quass & Widom 1997). In another model, proposed by Buneman, Davidson, Hillebrand & Suciu (1996) and Buneman, Fernandez & Suciu (2000), the node labels are absent and the data is carried entirely on the edge labels. This model has not been named, although it is based on a sound theory. In addition to these two data models there are also other proposals that have a close affinity to these models. The first one, ACeDB (Thierry-Mieg & Durbin, 1992) has been developed as a database for genetic data and the second, is the XML (World Wide Web Consortium, 2004). There is a great deal of research on 8 ACTA WASAENSIA XML in the context of semistructured data, e.g. Goldman, McHugh & Widom (1999) and Broekstra, Fluit, & van Harmelen (2000). Because of the absence of a fixed schema, conventional SQL cannot be used to access semistructured data. However, there are new query languages developed for semistructured data. Some of them are meant to be used with XML data (Broekstra et al. 2000), and others with object data, like Lorel. Lorel has been developed in the context with the Lore system, and it is based on the object query language (OQL). Still another language is called UnQL, and it is a value based query language (Buneman et al. 2000). The main purpose of this work is to present a new proposal for a schema of semistructured data and an algorithm for its construction. Besides, this work contains a definition of semistructured data, and an algorithm for defining the degree of semistructured data. We present also a general query for semistructured data, which includes all the special queries presented in Wanne (1998). Schema proposals for semistructured data are one of the research areas which have been widely investigated during the last decade. DataGuide (Goldman, Widom 1997; Goldman 2000) is the most promising proposal of a schema for semistructured data, but as it has some drawback, the research on that subject is still active. There are also other proposals e.g. Wang & Liu (1997) and Nestorov, Abiteboul & Motwani (1998) of how to discover the inherent structure of semistructured data, but these proposals are all in a stage of development, and so far there is no prominent solution to the problem. The drawbacks in the proposals of a schema for semistructured data represented so far are significant to the extent that there is a need of a new proposal. The following problems are the most distinctive ones: ACTA WASAENSIA 9 • There can be more than one schema for a semistructured database. • Some accuracy of the schema can be lost when one generates an approximate schema (i.e. merge similar parts) of the source database. • The size of the schema may be bigger than the size of the original database. As long as these problems are unsolved the queries on semistructured data are inefficient, which in turn means higher costs. Researches have developed new query languages that use novel techniques and there are also more efficient methods to discover the embedded structure of the data from the underlying structure. However, the results are still unsatisfactory in some areas. In our work we focus on the latter technique and propose a method in which we can complete the underlying data structure with new relations that are derived from the known relations. The completion of the data structure applies the idea of relations between adjacent elements, and adjacency defining sets (Wanne 1998, Wanne & Linna 1999). 1.2. Research Objectives and Contributions According to the research conducted on schema discovery and extraction we conclude that schema development is a vital area that is worth to be explored more. Our research hypotheses are the following: A schema for semistructured data can be constructed and completed according to the theories of the adjacency relation system. The size of the present schemas for semistructured data can be reduced and its accuracy can be increased with the help of the Adjacency Schema. 10 ACTA WASAENSIA In order to test our hypothesis we state some research questions which will guide our work. There are four research questions that are the following: 1. Is it possible to apply the theory of the adjacency relation system to construct a schema on semistructured data? 2. Can we develop an algorithm with which the schema can be built up? 3. Does the Adjacency Schema fulfil the same requirements imposed on the present schemas for semistructured data? 4. Is the Adjacency Schema minimal and more accurate than the existing schemas? The results of Töyli (2002) and Töyli, Wanne & Linna (2002a, 2002b) showed that semistructured data can be presented with the adjacency model. In this work we will go further ahead in our investigations and propose a schema for semistructured data. Our further research is based on the theory of the adjacency relation system (Wanne 1998), which we have somewhat modified, as well as the theory of representative objects (Nestorov, Ullman, Wiener & Chawathe 1997a). According to the hypothesis and research questions stated above, the main contributions of this work can be stated as follows: 1. A new schema proposal for semistructured data. 2. An accurate definition of semistructured data. 3. An algorithm for calculating the degree of an incomplete data structure. 4. A general query for a given relation combination. 5. An algorithm for modelling semistructured data. 6. An algorithm for modelling relational data. 7. An algorithm for the construction of a schema for semistructured data. ACTA WASAENSIA 11 1.3. Structure of Thesis The rest of this thesis is organized as follows. Section 2 is a literature review, in which we present the most recent research findings in the area of this study. The literature review is restricted to cover the areas of schema extraction and discovery, and the most significant papers written on schemas of semistructured data until today. In Section 3 we present the theories of representative objects and DataGuide, which set down the ground for a schema of semistructured data. We also give a short presentation of other research conducted in this area concerning structure discovering. In Section 4 we present the theory of Adjacency Relation Systems (ARS), which has been the basis when modelling semistructured data (Töyli 2002), and also in this case when we present our proposal of a schema for semistructured data. This theory is fundamental to our work, although we have made some minor changes to some of the definitions. In Section 5 we present our schema proposal for semistructured data. First we give the definition of semistructured data, an algorithm for determining the degree of a data structure, and define a general query. After that we present two algorithms for the modelling of relational and semistructured data and one for the construction of a schema for semistructured data. We also introduce the concept of an Adjacency Schema (AdSchema) and give some examples. Finally, in Section 6 we give some conclusions of the results, and outline some proposals of future work. 12 ACTA WASAENSIA 2. DISCOVERING THE DATA STRUCTURE Schema discovery, especially in context with semistructured data, has been under intensive investigations during the last few years. In order to get some understanding of the research conducted in this area we present some work that deal with this specific subject. There is much research that does not propose any new schema, but which is very important for the development of the schema construction. Actually, there is a proposition called DataGuide (Goldman et al. 1997) which can be used as a schema for semistructured data. It is a directed graph, where every edge of the graph is represented with a formula. However, DataGuide has been developed since then and a more accurate version of it has been introduced. The most significant problems in context with a schema for semistructured data are the size and the accuracy of the schema. The more accurate schemas we want the larger the schema becomes and vice versa. We have to balance between these two aspects. In addition to DataGuide there are other propositions in which the researchers try to discover (Wang et al., 1997; Wang & Liu, 1998; Cong, Yi, Liu & Wang, 2002; Wang & Liu, 2000b) or extract (Wang, Yu & Wong, 2000; Hacid, Soualmia & Toumani, 2000; Nestorov et al., 1998) schema information from the data, or more generally, they try to discover a typical or frequent structure on the data. The data will then be classified or clustered into sets of objects. The results of these investigations are promising, but more research is needed because the accuracy and conciseness of the schema is still incomplete. All the above-mentioned papers consider object models like Object Exchange Model (OEM). The same idea and model have been considered in Nestorov et al. (1997a) in which the authors present the concept of representative objects that ACTA WASAENSIA 13 facilitates the querying and browsing of semistructured data. In Nestorov et al. (1997b) the authors consider the problem of identifying some underlying structure in large collections of semistructured data. Buneman et al. (1997) present a proposal of a schema for semistructured data in which both data and schema are represented as edge-labeled graphs. It differs from the OEM model in that the labels are associated with edges, not nodes. The notion of conformance between a graph database and a graph schema are introduced. Also a “deterministic” subclass of schemas is investigated. In Wang et al. (1997) the authors consider schema discovery. The basic idea is to generate a structured layer of typical object structures of semistructured data (e.g. of an OEM database). The main concept in this process is a so-called tree- expression, which is a partial (labelled) tree representation of an object, i.e. a generalization of an object. When generalizing an object some information may be lost. The generalization problem is considered in Wang et al., (2000b). The concept of the tree expression is further developed in Wang et al. (1998a) and Wang et al. (2000a), where the authors represent a new concept called a k- tree-expression. A k-tree expression is a tree expression of k leaf nodes, and it is constructed by “cluing” a sequence of k paths that are not prefixes of each other. In other words, a k-tree expression is the “prefix tree” of the k paths that preserves their order in the sequence. The authors also present an algorithm for searching all k-tree expressions. The simple paths of k-tree expressions are then traversed in depth-first order (Goodrich & Tamassia 1998). In Wang et al. (1998a) and Wang et al. (2000a) the authors noticed that the use of wildcards on paths can cause over generation of them, and also that the algorithm cannot discover typical substructures which exist in the lower part or in the middle part of semistructured objects when the upper part of objects has 14 ACTA WASAENSIA a different number of labels. The authors in Cong et al. (2002) propose a new and more effective method to compute the frequencies of the substructures (with a wild card or not). This approach contains a new feature in which the semistructured objects are represented with paths and corresponding tidlists, where the tidlist is a set of transaction objects or trees that contain a path as a tuple in the database. Representation of objects with labelled paths saves a lot of space. The authors also propose an effective method by which the over generation of paths with wild cards can be removed. Finally, they propose an adapted mining algorithm for the substructure discovery problem. The algorithm proposed in Cong et al. (2002) is more effective than the one developed and improved in Wang et al., (1997, 1998a, 2000a). However, as long as wildcards are used we must balance between efficiency and accuracy. The number of wildcards affects restrictively to accuracy and increasingly to the search space, so we would need a method that does not use wildcards at all, or at least the number of them is minor. The same problem of schema extraction has been considered in Wang et al. (2000b) and in Nestorov et al. (1998) but this time the authors have used a slightly different method. In Wang et al. (2000b) the method has been to use description logics and in Nestorov et al. (1998) the method is based on the semantics of monadic datalog. However, both of these methods use the same approach of merging of similar object classes. These classes proved to be too large so the authors in both papers propose approximation of the generated groups. Here we give only a short presentation of the subject, but a more detailed description of the methods can be found in Wang et al. (2000b) and in Nestorov et al. (1998). As a motivation for our work we can mention that the authors faced the same kind of problem as we have encountered before, i.e. inaccuracy when grouping or merging objects. ACTA WASAENSIA 15 In Wang et al. (2000b) the authors propose a construction of an approximated graph schema by clustering objects with similar incoming and outgoing edge patterns. This approach differs from the DataGuide in that the DataGuide combines objects with similar type definitions. The main idea is to predict the sets of incoming and outgoing edges of a vertex of a data graph that also appears as a vertex of the schema graph. Another principle is to minimize the size of the schema graph by minimizing the number of appearances of each label. The approximate graph schema is small in size, cheap to construct and maintain, and there is no predetermined user defined threshold. However, there may be duplicates and false paths in the schema graph, which means that it is not fully accurate. In Nestorov et al. (1997a) the authors investigated the typing problem of semistructured data. This problem will be considered more detailed in Section 3 and is bypassed here. The idea of a concise representation of the inherent schema of semistructured data has been refined in Nestorov et al. (1997b). The authors present an algorithm for constructing a type hierarchy for semistructured data source. The classification is highly dependent on the choice of the so-called threshold value. If the threshold is too low, the result set is too large and on the other hand if the threshold is too high, the accuracy becomes very low. Once again, there is a trade-off between the accuracy and the size. It is also problematic to choose the value of the threshold. In Nestorov et al. (1998) the authors present an algorithm for approximate typing of semistructured data, i.e. an object does not have to fit its type definition precisely. The size of perfect typing was a problem in Nestorov et al. 16 ACTA WASAENSIA (1997b), because the resulting data sets may be imperfect with respect to the typing, i.e. the data sets may contain edges that should not be present or lack some information (missing edges). Nestorov et al. (1998) presents a technique for computing an approximate typing of an appropriate size, which allows the data set to be imperfect with respect to the typing and they focus on the quality of the typing result rather than the time performance. In Nestorov et al. (1998) the authors have used the semantics of monadic datalog programs. Another approach, presented in Hacid et al. (2000), use description logics for discovering the (partial) implicit structure of semistructured data. They propose an approach for approximate typing of semistructured data. The extracted schema is expressed in the form of terminological axioms in a small description logic. A more detailed description of the approach is given in Hacid et al. (2000). Constraints on semistructured data are not easy to express. In Buneman et al. (1997) there is a proposal of how to present schema for such data. In that proposal the database is presented as an edge-labeled graph, with a corresponding schema in which the edges are labeled with unary formulas. A database conforms to a schema if there is a correspondence between the edges in the source database and the schema. A graph database may conform to several graph schemas. So there exists a schema to which all databases conform. It is also possible to store multiple schemas for the same data, in order to help optimize queries. Since there is a natural ordering on graphs schemas, it is possible to take the least upper bound of a set of schemas and combine all their constraints into a single schema. This approach is one of the first proposals for a schema for semistructured data, and has had a very strong implication in the research of schemas for semistructured data. ACTA WASAENSIA 17 3. FOUNDATIONAL THEORIES In this section we present the theories behind representative objects (Nestorov et al. 1997a) and DataGuides (Goldman et al. 1997, Goldman 2000). We consider here the theory of representative objects because it lays the foundation for the DataGuide. The DataGuide, in turn, is considered because it is the de facto standard of schemas for semistructured data and as far as we know, the only implementation of a schema for semistructured data. In Figure 3.1 we have a part of an OEM database, representing a premiership document. It serves as a basis for multiple examples through this section. Figure 3.1. The premiership object (Nestorov et al. 1997a).                                             !   "    #$     "         %   "  &  !  '            !  '   %  &           "  "     '   %      &  "    ""      $      ()   ***+ $, (-$ . /, (  , ( , (0 *  ,(  **-, (&, (  , (1 $, ( , (&, (, (- 2  , (', (0  3  , (#4*5  *****  , (2 , (6 , (3   , 18 ACTA WASAENSIA 3.1. Representative Objects The theoretical foundation of dynamically generated structural summaries has been given in (Nestorov et al. 1997a). The authors present the representative object concept, which is also called as a DataGuide. The motivation behind a representative object is not only to create a concise description of the structure of the data, but also to facilitate querying and browsing of semistructured data. Representative objects are used with an object model called Object Exchange Model (OEM), initially introduced in Papakonstantinou et al. (1995). In an OEM model every node represents an object identification, and the edges with their labels represent object references. One important concept used with the OEM model is a simple path expression that is a dot separated sequence of labels of the graph (Nestorov et al. 1997a). Definition 3.1. Let il be a label (of object references) for ni ,,1= , 0≥n . Then nlllpe 21 ⋅= is a simple path expression of length n . Another concept is a data path that is a sequence of alternating objects and labels, separated by commas. A data path starts and ends with an object. For every two consecutive objects the first object has an object reference to the second object and there is a label between the objects (Nestorov et al. 1997a). Definition 3.2. Let io be an object for ni ,,0 = , il be a label for ni ,,1= , and )()(, 1−∈ iii ovalueoidentifierl for ni ,,1= , 0≥n . Then nn ,o,l,,l,o,lop 2110= is a data path, of length n . Here are some additional terms in order to help to understand the concepts of simple path expressions and data paths: ACTA WASAENSIA 19 • A data path nn ollop ,,,, 10 = originates from 0o or is rooted at the object 0o . • An object io is within an object o if there exists a data path originating from o and ending with io . • A data path p is within an object o if p originates from an object within o . • A data path nn ollop ,,,, 10 = is an instance of the simple path expression nlllpe ... 21 = . Example 3.1. To illustrate the above terms consider the premiership object of Figure 3.1. • The simple path expression Player.Number has two instance data paths within the premiership object, namely obj(&1),Player,obj(&5), Number.obj(&13) and obj(&1),Player,obj(&14),Number,obj(&18). • Consider the following two data paths obj(&1),Player,obj(&14). FormerClub,obj(&24) and obj(&24),Player.obj(&28),FormerClub,obj(&1). Thus obj(&1) is within obj(&24) and obj(&24) is within obj(&1). i.e. there is a cycle within the premiership object. With a continuation function represented in Nestorov et al. (1997a) we can find out the labels of the object references on a given path, and we can also find out an answer if we can continue our navigation from the current node, i.e. the node is not an atomic node. The continuation function is substantial for representative objects. Definition 3.3. Let o be an object in OEM and nlllpe ... 21 = a simple path expression, 0≥n . Then we define the concept on(o,pe)continuati as follows. 20 ACTA WASAENSIA • continuation(o,pe) ⊇ { ∃|l a data path 111 ,,,,,,, += nnn olololop  that is an instance of pe.l}. • continuation(o,pe) ⊇ { ∃⊥| a data path nn ololop ,,,,, 11 = that is an instance of pe and no is an atomic object} If the simple path expression is of length 0, then the continuation of o is the set of the labels of all outgoing edges from o . However, if the simple path expression is pe , then the continuation of o is the set of all the labels of the outgoing edges of the last node of the path of length 1+n starting at o plus ⊥ if any of the last nodes is atomic. Example 3.2. Consider the premiership object from Figure 3.1. The following examples illustrate Definition 3.3. • continuation(premiership,ε ) = {Club} • continuation(premiership, Club) = {Name, Player, Stadium, Captain} • continuation(premiership, Club.Player.Name) = {First, Last, Nickname, ⊥ } In Definition 3.3 we consider only data paths originating from an object that is the first argument of the continuation function. There are also situations when it is convenient to relax this restriction by e.g. limiting the length of the simple path expression or by allowing the data path to be within a given object. Definition 3.4. Let o be an object, 1≥k , and let pe be a simple path expression of length n , kn ≤≤0 . Then we define (o,pe)oncontinuati k as follows ACTA WASAENSIA 21 • If kn = then o continuationk(o,pe) ⊇ { ∃|l a data path p within o , not necessarily rooted at o , that is an instance of pe.l} o continuationk(o,pe) ⊇ { ∃⊥| a data path nn ololop ,,,,, 110 = within o , that is an instance of pe and no is an atomic object}. • Otherwise ( kif n < ) continuationk(o,pe) = continuation(o,pe). Example 3.3. Consider the premiership object in Figure 3.1. The following examples illustrate Definition 3.4. • Continuation1(premiership, Name) = {Official, Nickname, First, Last, ⊥ } • Continuation2(premiership, Club) = {Name, Player, Stadium, Captain} • Continuation2(premiership, Player.Name) = {First, Last, Nickname, ⊥ } There are two types of representative objects. First, we have a full representative object (FRO), which is restricted to a particular object, and secondly, if the length of the simple path expression is given, then we have a degree-k representative object (k-RO). The k-ROs are less complex than FROs and can be used to approximate FROs. Formally, the definition is as follows. Definition 3.5. Let o be an object. Then the function continuationo(o,pe) = continuation(o,pe), where pe is a simple path expression, is a full representative object (FRO) for o . When the continuation function of Definition 3.5 is replaced by the koncontinuati function we get the following definition for a degree-k representative object: 22 ACTA WASAENSIA Definition 3.6. Let o be an object and 1>k . Then (o,pe)oncontinuati ko = (o,pe)oncontinuati k , where pe is a simple path expression, is a degree-k representative object (k-RO) for o . k-ROs take usually less space than FROs and may be constructed faster. However, k-ROs only approximately support the schema discovery. If the length of the simple path expression pe is less than k, where k is the degree of the k-RO, the exact value of (pe)oncontinuati o can be computed, and the k-RO provides the same support as a FRO. However, if the length of pe is at least k then the approximation of (pe)oncontinuati o provided by the k-RO must be used. A description of the method of computing an approximation of (pe)oncontinuati o from k-RO is given in Nestorov et al. (1997a). When a FRO is implemented in an OEM graph, the on(o,pe)continuati is computed. The FRO consist of an object and an algorithm for computing the function ooncontinuati . The object is explored in a breadth first manner in order to find the instance data paths of pe . If every such data path only the last object in the data path is considered. Finally, we get the continuation of pe , which is the set of all the different labels of object references of those objects and possibly ⊥ if any of those objects is atomic. The algorithm is presented in Nestorov et al. (1997a). Formally, the implementation of FROs in OEM graph is as follows: Definition 3.7. Let 1o and 2o be objects in OEM. Then 1o is a full representative object in OEM for 2o if for any simple path expression pe we have (pe)oncontinuati(pe)oncontinuati oo 21 = . ACTA WASAENSIA 23 Usually there are many FROs for a given o , including the object itself. One of those FROs is minimal, i.e. there is one FRO for which the computation of the continuation function takes minimal time in proportion to the size of the object set. Definition 3.8. Let oR be a FRO (in OEM) for o . Then oR is a minimal FRO if any simple path expression nlllpe ... 21 = , 0≥n , has at most one instance data path originating from oR and ending with a complex object, and at most one instance data path originating from oR and ending with an atomic object. In brief, the construction of a minimal FRO can be done by first creating a nondeterministic finite automaton (NFA) of a given object o , then determinizing and minimizing that NFA, and finally constructing a minimal FRO from that deterministic finite automaton (DFA). The construction of a NFA from an object in OEM is the following: the objects of the OEM graph correspond to the states and the object references and their labels correspond to the transitions and their respective letters. There is also a function that maps all the objects within o to the corresponding unique automaton states. All the objects of the OEM are either complex or atomic. There are also two sets of which the first one contains all the objects within o , and the other one contains all the different labels of the object references within o . The determinization and minimization of the NFA is a well studied problem, so the readers are suggested to turn to Hopcroft & Ullman (1979) for a more detailed description. After the determinization and minimization of the NFA, the next phase is to construct the minimal FRO from the DFA. In the transformation process from a DFA to an OEM object there is one major concern which is the consideration of 24 ACTA WASAENSIA the states that have two different associations of objects, i.e. complex and atomic objects. The construction of a minimal FRO from the given DFA is associated with two functions, one called atomic_obj for mapping of a state to its corresponding atomic object, and another called complex_obj for mapping of a state to its corresponding complex object. The rest of the construction work can be found in Nestorow et al. (1997a). A proof of the correctness can also be found at the same source. 3.2. DataGuide A first example of the functionality of the theory of full representative objects has been presented in a system called Lore (McHugh et al. 1997). In that system an overall description of the incomplete structure of an OEM database is presented with the help of a so-called DataGuide. A DataGuide has the same meaning as metadata in conventional database management systems, like relational and object-oriented systems. DataGuides are stored as objects and they can be accessed by applications and query processors. Because there is no structure in the data, the DataGuide conforms to data rather than forcing data to conform to the DataGuide. The DataGuide is specially developed to be used with the Lore system. However, DataGuides can be used with any graph-based data models. Graph-structured databases and other related theoretical research is presented in Buneman et al. (1997). The basic theory of a DataGuide is presented in Goldman et al. (1997) and Goldman (2000). There are also two other concepts, which are extensions of the primary DataGuide. These are strong DataGuide (Goldman 2000) and approximate DataGuide Goldman et al. (1999). A strong DataGuide supports ACTA WASAENSIA 25 annotations, i.e. a statement about the set of objects in the database reachable by a particular path. An approximate DataGuide, in turn, relaxes some aspects of the definition of a DataGuide. This inaccuracy is possible because all DataGuide paths do not need to exist in the original database. We have already given some definitions that are crucial for a schema for an OEM database, like simple path expression and data path. In Figure 3.2 we have a sample of the OEM database of Figure 3.1. We can observe, for example, path expressions Club.Name and Club.Player.Name both starting at object &0.                          "     '         "  "  ()   ***+ $, (-$ . /, (, (- 2  , (',         !       Figure 3.2. Sample of the premiership object of Figure 3.1. In Figure 3.2 Club.&1.Name.&2 and Club.&20.Player.&22.Name.&23 are examples of data paths of the object &0. We can also see an instance of the data path Club.Name, i.e. Club.&1.Name.&2. For a DataGuide we can define the so-called target set: 26 ACTA WASAENSIA Definition 3.9. In an OEM object s , a target set of a label path l is a set t of oids such that { }l.o .l..o.l.oo|lt n of instancepath data a is2211 = . That is, a target set t is the set of all objects that can be reached by traversing a given label path l of s . We write )(lTt s= . We say that l reaches any element of t , and likewise each element of t is reachable via l (Goldman 2000). For example, the target set of Club.Name in Figure 3.2 is {&2, &21, &25}. Two different label paths may share the same target set. In Figure 3.2 the label paths Club.Player.FormerClub and Club have the same target set i.e. {&1}. There are two properties to which a DataGuide must conform. The first one is conciseness and the second is accuracy. The third property of a DataGuide is that it should be convenient, i.e. a DataGuide must itself be an OEM object. A DataGuide is concise if every unique label path of a source database is described exactly once, regardless of the number of them in the source database. Accuracy means that there is no label path in the DataGuide that does not exist in the original source database. A database that will be summarized is called as the source database, and the given source database is assumed to be identified by its root object. So, the formal definition of a DataGuide is the following: Definition 3.10. A DataGuide for an OEM source object s is an OEM object d such that every label path of s has exactly one data path instance in d , and every label path of d is a label path of s (Goldman 2000). Figure 3.3 shows a minimal FRO, i.e. a DataGuide of the source database of Figure 3.1. Using this DataGuide we can check whether a given data path of length n exists in the original source database by examining only n objects from the root object in the DataGuide. In our example we need only to examine the outgoing edges of objects &35, &36 to find out that the label path Club.Player ACTA WASAENSIA 27 exists in the database. Similarly, we can follow any single instance of a label path l in a DataGuide and reach some object o , so that the labels of the outgoing edges of that particular object o represents all possible labels that could ever follow l in the source database. So the three different labels of the edges emanating from the object 40& , represent all possible labels that ever follow Captain in the source database. In Figure 3.3 there are no atomic values, because a DataGuide is a structural summary of the source database (Nestorov et al. 1997a).   #$                                 ppremiershiR     &35 &36 &39&38&37 &40 &41 &42 &43 &44 &45 &46 &47 &48 &49 &50 &51 &52 &53 &54 Figure 3.3. The minimal FRO (DataGuide) for the premiership object. (Nestorov et al. 1997a). Goldman (2000) points out two main drawbacks of DataGuides. The first one is related to the annotations, i.e. a property of a set of objects that comprise the target set. The problem with annotations is that, when multiple label paths reach the same object, we cannot say to which label path the annotation is 28 ACTA WASAENSIA applied. This problem has been solved by strong DataGuide which is also introduced in Goldman (2000). The second problem is related to the computation of a DataGuide. The computation of a DataGuide can be very expensive if there exists cycles in the structure of the source database. In Goldman et al. (1999) the authors propose a solution to this problem in a form of an Approximate DataGuide (ADG). An approximate DataGuide is like any other DataGuides, but it allows some degree of inaccuracy. The idea behind an approximate DataGuide is to identify “similar” parts of the DataGuide and merge them. There are two main approaches to the approximation: Object Matching and Role Matching. According to the original definition of a DataGuide all the data paths of the source database must be found in the corresponding schema. The approximate DataGuide relaxes this rule, i.e. there may be some paths in the schema that do not exist in the source database. When an approximate DataGuide is created with the object matching method, (Goldman et al. 1999) the target set of the objects is considered. Every path of the DataGuide points to some target set, but in object matching we allow DataGuide paths to point to the same object when their target sets are “almost” similar. Two target sets X and Y are defined to be similar when ( )YXYX ,max/∩ is above a threshold θ . Role matching is the second method for the creation of an approximate DataGuide. In this method DataGuide objects are merged based on the label paths i.e. the roles. If the function ),( 21 ppM returns True for label paths 1p and ACTA WASAENSIA 29 2p , then the paths 1p and 2p point to the same ADG object. This merging process, in turn, can be done with two different ways. In suffix matching the function ),( 21 ppM returns True if and only if the last labels of 1p and 2p are equal. The other method, called Path-Cycle Matching, addresses the problem caused by cyclic databases. Cycles in a database cause the paths to grow too large before reaching an identical target set. The solution for this kind of problem is to add a heuristic into the algorithm. If a specific label can be seen more than once on the path starting from the root, we can assume that this is a “semantic” cycle and the paths can be merged. The path-cycle matching function ),( 21 ppM returns True if and only if 1p is a prefix of 2p (or 2p is a prefix of 1p ) and the last labels of 1p and 2p are the same (Goldman et al. 1999). The choice between a DataGuide and an approximate DataGuide is the same as if we have to choose between accuracy and efficiency. If we want to have an accurate but maybe inefficient schema we use a DataGuide, but if some inaccuracy is allowed then we choose an approximate DataGuide. It is also possible to combine some of the techniques. 30 ACTA WASAENSIA 4. ADAJACENCY RELATION SYSTEM In traditional database management systems, the data is formally separated from the specification of its schema, or structure. The main purpose of the schema is to serve the query processor, but it also helps the user to understand the structure of the database. As we already know, a semistructured database does not have a schema, or if there is a schema, it is incomplete or irregular. Since a schema can prominently facilitate the query performance and help to avoid exhaustive searches, it is very important that all the information about the structure of the semistructured data can be discovered. Schema discovery from semistructured data has been under intensive investigation since the middle of the 90’s. In Section 2 we presented some examples and results of such investigations. We also noticed that there were some shortcomings, like inaccuracy after merging parts of the schema and large schemas caused by cyclic structures. An early work of applying adjacencies has presented in Ni & Bloor (1994). Their work considers boundary data structures, which normally are built on relations between faces, edges and vertices. The novel idea of adjacencies between these entities is furthermore developed in Wanne (1998) and Wanne et al. (1999). In the following chapters we will briefly present those concepts of the Adjacency Relation System (ARS) that are necessary to understand our proposal of a schema for semistructured data. The same concepts are somewhat modified later in this work in order to make them suit better to our purposes. Our work on a schema for semistructured data is based on adjacencies between edges. In our proposal we consider an adjacency as an interrelationship between two consecutive edges, and it is defined with the help of the node ACTA WASAENSIA 31 between them, whereas in Wanne et al. (1999) the adjacency is defined between elements of different types. Formally an Adjacency Relation System is (Wanne et al. 1999) as follows: Definition 4.1. The adjacency relation system (ARS) is a pair ( RA, ) where },,,{ 21 nAAA =A , 1≥n , is a set containing pairwise disjoint finite nonempty sets and }},,2,1{,|{ njiRij ∈=R is a set of relations, where each ijR is a relation on ji AA × . If ijm Ryxyxyx ∈),(,),,(),,( 21  are all the pairs of relation ijR having x as the first component, then each element ),,2,1( mkyk = is said to be adjacent with the element x . Furthermore, denote by )(xAd j the set },,,{ 21 myyy  (Wanne et al. 1999). We assume that the elements of each set iA , ni ,,2,1 = , represent entities of a certain type iT . The adjacency between elements can also be defined with the help of the so-called adjacency defining sets ( τ ) as follows. Associate with each index pair },,2,1{, nji ∈ a set },{},,2,1{ jinK −⊆  of indices and a set of entity types }|{~ KkTT kij ∈= . Elements iAx ∈ , jAy ∈ , where },,2,1{, nji ∈ and yx ≠ , are said to be adjacent with respect to a set of entity types φ≠∈= }|{~ KkTT kij if for each Kk ∈ there is an element kAz ∈ such that )(zAdx i∈ and )(zAdy j∈ . The set ijT ~ is called the adjacency defining set between the elements of iA and jA . For an adjacency relation system associated with adjacency defining sets (ARST), we use the notion ),,( τRA , where τ refers to the set of adjacency defining sets. 32 ACTA WASAENSIA Definition 4.2. An ARST ),,( τRA is said to be unique if for each pair },,2,1{, nji ∈ of integers such that the corresponding adjacency defining set ijT ~ is nonempty and for all elements iAx ∈ , jAy ∈ , x and y are adjacent if and only if they are adjacent with respect to ijT ~ . Consider (Figure 4.1) a set of relations },,,,,,,,{ 332313232221131211 RRRRRRRRR=R , where )},(),,{( 122111 xxxxR = , )},(),,(),,(),,{( 2222122112 xyyxxyyxR = , )},(),,(),,(),,{( 2222211213 xzzxxzzxR = , )},(),,{( 122133 zzzzR = , φ===== 3231232221 RRRRR . and the adjacency defining sets }{ 211 ~ TT = , }{ 133 ~ TT = , φ======= 32233113211222 TTTTTTT . We can see that elements 1x and 2x , 121 , Axx ∈ , are adjacent with each other as well as 1z and 2z , 321 , Azz ∈ . The adjacency is defined via the relation 11R for 1x and 2x , and via the relation 33R for 1z and 2z , and on the other hand, also with respect to the adjacency defining sets ~ 11T and ~ 33T . If there are no other non- empty adjacency defining sets, the ARST considered is unique. ACTA WASAENSIA 33 )( 11 TA )( 22 TA )( 33 TA 1x 2x 1y 2y 3y 1z 2z Figure 4.1. Directed graph illustrating the relations presented above. We can use the notation ji TT → for a relation type to indicate that the relations ijR are defined on ji AA × . Furthermore, we can consider sets of relation types }),(|{ SjiTT ji ∈→ , where },,2,1{},,2,1{ nnS  ×⊆ . For an ARST ),,( τRA , denote by SR | the restriction of R on S , i.e. }),(|{| SjiRS ij ∈∈= RR (Linna et al. 2003). Definition 4.3. A relation sr TT → is determined uniquely by the relation combination }{ S|(i,j)TT ji ∈→ if for any unique ARST ),,( τRA there is no other unique ARST ( ,, 'RA ) such that |S|S 'RR = but 'rsrs RR ≠ (Wanne et al. 1999). One of the problems with DataGuides is their size. In this work we consider also that problem, because even if the relation combination is valid, it should be as small as possible. A relation combination is valid and minimal when the number of relations in the combination is as small as possible. Definition 4.4. Given the entity types nTT ,,1  , a set },,2,1{},,2,1{ nnS  ×⊆ , and adjacency defining sets τ . A relation combination }),(|{ SjiTT ji ∈→ is 34 ACTA WASAENSIA said to be valid if for any unique ARST ),,( τRA there is no other unique ARST ( ,, 'RA ) such that '|S|S RR = . Otherwise the relation combination is said to be non-valid. Let us consider an ARST associated with planar graphs. We have the entities V (vertices), E (edges) and F (faces). Denote e.g. VT =1 , ET =2 and FT =3 . Relation combination of two relations, where the entity E occurs twice, i.e. the combination },{ FEEV →→ and its symmetric variants },{ EFEV →→ },{ FEVE →→ },{ EFVE →→ are minimal valid combinations (Wanne et al. 1999). Next we will give a theorems of Wanne et al. (1999) that will be referenced later on in this work. Theorem 4.1. Let ),, τRA( be a unique ARST with n types nTT ,,1  , such that 0 ~ ≠iiT for ni ,,2,1 = . A relation combination },,,,,,{ 2211 rr kjkikjkikjki TTTTTTTTTTTT →→→→→→  ji ≠ , rkkji ,,, 1 ≠ , determines uniquely the relations ji TT → and ij TT → if and only if φ≠ijT ~ and },,{ 1 ~ rkkij TTT ⊆ (Wanne et al. 1999). ACTA WASAENSIA 35 5. SCHEMA PROPOSAL FOR SEMISTRUCTURED DATA In structured database systems the main purpose of the schema is to describe the structure of the data. This in turn is a prerequisite that the query languages can access data efficiently. In this Section we will give our definition of semistructured data, show how to determine the degree of given data structure, and finally present our proposal of a schema for semistructured data. 5.1. Definition of Semistructured Data According to Wanne et al. (1999) some relations of a given data structure are derivable from a set of stored relations. Each relation determined uniquely by a given relation combination is said to be derivable from the given relation combination. If a relation combination determines uniquely all other relations then it is said to be valid. This means that we have all relations of a data structure and the degree of the structure is one (1). Example 5.1. Consider an ARST ),,( τRA , where the relations are as follows (Figure 5.1): )},(),,{( 122111 xxxxR = , φ=22R , )},(),,{( 122133 zzzzR = , )},(),,(),,(),,{( 3222211112 yxyxyxyxR = , )},(),,(),,(),,{( 2322121121 xyxyxyxyR = , )},(),,{( 221213 zxzxR = , )},(),,{( 222131 xzxzR = , )},(),,(),,{( 13122123 zyzyzyR = , 36 ACTA WASAENSIA )},(),,(),,{( 12312132 yzyzyzR = , and the adjacency defining sets are }{},{ 133 ~ 211 ~ TTTT == , φ======= 32 ~ 23 ~ 31 ~ 13 ~ 21 ~ 12 ~ 22 ~ TTTTTTT . x1 x2 )1(1 TA )2(2 TA )3(3 TA y1 y2 y3 z1 z2 Figure 5.1. Undirected graph illustrating the relations of Example 5.1. Let )}3,2(),2,1{(=S , then },{}),(|{| 2312 RRSjiRS ij =∈∈= RR . Let ( τ,, 'RA ) be another ARST where 'R contains the same relations as R except the relation on 22 AA × which is defined by )},(),,{( 1221'22 yyyyR = . As in Example 5.1 it can be deduced that the ARST ( τ,, 'RA ) is unique. Since SS || 'RR = , the relation combination },{ 3221 TTTT →→ is not valid. ACTA WASAENSIA 37 Now, e.g. the relation 31 TT → is not uniquely determined. Namely, if ( τ,, 'RA ) is another ARST, where 'R contains the same relations as R except the relations on 31 AA × and on 13 AA × , which are defined by )},(),,{( 2111'13 zxzxR = and )},(),,{( 1211'31 xzxzR = . Now we will define the notion of the degree of the data structure (Linna et al. 2003). Let C be a subset of all possible relations of the form 'TT → , where T and 'T are types. Denote by C the number of relations in C . Let 'C be a relation combination such that 'CC ∪ is valid and valid}is |min{' ii CCCC ∪= . Denote ')min( CC = . In other words, )min(C is the minimum number of relations, which is needed to complete C to a valid relation combination. Denote by )(CD the set of all relations, which can be derived uniquely from C . Obviously, )(CDC ⊆ . The degree of a relation combination C is defined as the ratio lations of all rethe number D(C) Deg(C) = . According to the definition, it holds that 1)(0 ≤≤ CDeg for each relation combination C . Definition 5.1. Let C be the relation combination describing the given data structure. The data structure is said to be 38 ACTA WASAENSIA ( i ) structured, if 1)( =CDeg ; ( ii ) unstructured, if 0)( =CDeg ; ( iii ) semistructured, if 1)(0 << CDeg . Note that obviously, 1)( =CDeg if and only if C is valid. In the following example (Wanne 1998) it is shown (Figure 5.2) that if we know the relations VF → and EV → , i.e. },{ EVVFC →→= , then we can always derive the relations FV → , VE → and VV → . In Wanne (1998) it is also shown that relations EFFEEEFF →→→→ ,,, can not be derived. This means that ,,,,{)( VEFVEVVFCD →→→→= }VV → . Thus the degree of C is 56.09/5)( ==CDeg . V E F Figure 5.2. Three entities ),,( FVE and nine adjacent relations. If e.g. },{ FEEVC →→= , it can be shown that )(CD contains all the nine relations and thus 19/9)( ==CDeg , and the relation combination C is valid. If on the other hand },,{ EEVFEVC →→→= , it can be shown that },,,,,{)( VVFVVEEEVFEVDC →→→→→→= . ACTA WASAENSIA 39 Thus the degree of C in this case is 67.09/6)( ==CDeg . In Figure 5.3 (Goldman 2000) we have a typical situation in which an OEM source database is depicted with two DataGuides. As we have found out in the previous sections it is fully possible that there are one, two or more DataGuides of one and the same source database. When we merge the lower parts of the DataGuides of Figure 5.3a and Figure 5.3b, i.e. the paths C.D, we loose some accuracy. B BA C C C D D D (a) 1 2 3 4 5 6 7 8 9 10 A C C DD B (b) 11 12 13 14 15 16 17 A B C D (c ) 18 19 20 21 Figure 5.3. A source and two DataGuides (Goldman 2000). If we wish to keep additional information in a DataGuide, like annotations e.g. statistical odds that an object reachable via l has any outgoing edges with a specific label (Goldman 2000), then we cannot say to which label path the annotation is applied. Referring to Figure 5.3c, we see that label paths A.C and B.C both reach the same object. Thus, if we store an annotation on object 20, we cannot know if the annotation applies to label path A.C, label path B.C or both. The DataGuide of Figure 5.3c is minimal. In the DataGuide in Figure 5.3b, however, we have two distinct objects for the two label paths, so we can 40 ACTA WASAENSIA correctly separate the annotation, but the DataGuide in Figure 5.3b is not any more minimal. This is a situation in which the use of AdSchema, defined in detail later on, is a clear advantage. When the source database of Figure 5.3a is depicted with the AdSchema we will have a schema that is minimal, and the problem concerning annotations can be avoided. Besides, the AdSchema conforms initially to the same requirements as a strong DataGuide. By using AdSchema we can increase efficiency because the schema need not to be rebuilt in order to make it minimal. An AdSchema of the database of Figure 5.3 is depicted in Figure 5.4. The interrelationship between the types B and D is left out because the type C is an interrelationship defining type. A B C D Figure 5.4. An AdSchema of the source database of Figure 5.3. The graphs of Figure 5.3 can also be seen from another point of view. We can consider the number of the relations, i.e. the degree of the structure, so the three graphs of Figure 5.3 can be interpreted as follows. In Figure 5.3a we have 12 relations altogether. The relations are BA → , CA → , CB → , DC → , DA → and DB → , and their symmetric relations. Of those relations BA → is an empty relation (as well as its symmetric version). The relations DA → and DB → have an interrelationship defining type, which in this case is the type C. If the self-relations, i.e. AA → , BB → , CC → and DD → had been included then we should have had 16 relations altogether, but because the self-relations are insignificant to the result they are left out. ACTA WASAENSIA 41 In Figure 5.3b we have the same 12 relations. The number of relations is the same because each relation is depicted only once, and in Figure 5.3a there are two paths that have the same edge labels. Also here the self-relations are omitted. In both of these cases (Figures 5.3a and 5.3b) the degree of the data structure is 0.112/12)( ==CDeg , which means that the structure of the data is quite high. In Figure 5.3c the number of relations is smaller and so is the degree of C . The known relations in Figure 5.3c are BA → , CA → , CB → , DC → , DA → and DB → . Here again, we consider also the symmetric variants of each of the relations. So the total number of relations here is 12. However, in this case there are a few unknown relations. The relations DA → and DB → are not known as well as the relations CA → , CB → . The degree of the data structure is 33,012/4)( ==CDeg . 1 2 3 4 5 6 7 8 9 10 1311 12 A A A A B B B BC C C C A B C (a) (b) Figure 5.5. An example of the sizes of the source database (a) and the corresponding AdSchema (b). In Figure 5.5a there is a sample OEM database. We can see that there are four edges emanating from the root node labelled with a letter A. From each of the nodes 2, 3, 4 and 5 there are two outgoing edges labelled with letters B and C. The relations in that graph are BA → , CA → , CB → , AB → , AC → , BC → , AA → , BB → and CC → , so we have 9 relations altogether. Of these 42 ACTA WASAENSIA relations we omit the self-relations. The degree of the data structure is 0.16/6)( ==CDeg . In Figure 5.5b we have the graph of Figure 5.5a presented with an AdSchema. The schema is simple and there are only three types with two interrelationships between them. The total number of relations is the same as in Figure 5.5a, but only two of them are needed because the rest of them can be derived. The schema is minimal and there is no confusion about interpretation of it. As a conclusion of these examples we can say that, in most cases, if a semistructured database is presented with an AdSchema the size of the new schema is smaller than the original database, and it is also minimal. The AdSchema also fulfils the requirements of a strong DataGuide. As we can derive some relations from the known ones, we do not need to store all relations we know, and still we can keep the same level of accuracy as if all the relations were stored. 5.2. Determining of the Degree of a Data Structure First we give an algorithm for the determining of the degree of a given data structure C , given as a set of relations. Let entity types nTT ,,1  , a set },,2,1{},,2,1{ nnS  ×⊆ and adjacency defining sets τ be given. Consider the relation combination }),(|{ SjiTTC ji ∈→= . By the symmetry, we may assume that for each pair Sji ∈),( also Sij ∈),( . The Degree Testing Algorithm is the following: Step 0. Let the pairs Sji ∈),( be marked pairs and all the other pairs unmarked pairs. ACTA WASAENSIA 43 Step 1. Search for each unmarked pair Sji ∉),( , such that there exists an integer jik ,≠ for which Ski ∈),( and Skj ∈),( . If such a pair and an integer are found, goto Step 2. Otherwise goto Step 3. Step 2. Find all integers rkk ,1 , 1≥r , such that Skjki ss ∈),(),,( for rs ≤≤1 . If for at least one s , rs ≤≤1 , we have ijk TT s ~ ∈ and },,,{ 21 ~ rkkkij TTTT ⊆ , then include relations ji TT → and ij TT → in the relation combination, i.e. add the pairs ),( ji and ),( ij to the set S and goto Step 1. If there is no such integer s , then goto Step 1. Step 3. Let 1C be the set of all new relations derived by Step 2. The degree of the given data structure C is 2 1 n CC Deg(C) += and the set of all obtained relations is 1CC ∪ . Theorem 5.1. The determining of the degree of semistructured data is a decidable question. Proof: We will show that the above algorithm decides the question. The purpose of Steps 1 and 2 is to decide if the present relation combination determines uniquely at least one new relation. If in Step 1 no pair Sji ∈),( , ji ≠ , such that there exists an integer jik ,≠ for which Ski ∈),( and Skj ∈),( is found, then, by Theorem 4.1, no further relation can be derived from the 44 ACTA WASAENSIA obtained relations. Thus we know the set 1C of all new relations derivable from the set C and can go to Step 3 and calculate the degree of C . If in Step 1 at least one pair Sji ∉),( , ji ≠ , satisfying the given property was found and the assumptions of Step 2 are satisfied then the new relations ji TT → and ij TT → are derivable by Theorem 4.1 and they are added to the set of previous relations S . The procedure, i.e. Step 1, is then repeated for finding all possible new relations. If, conversely, the assumptions of Theorem 4.1 are not satisfied in Step 2, make the pair ( ji, ) marked and go to Step 1. This proves that the algorithm is working correctly. Illustrate the Degree Testing Algorithm by the following example. Let VT =1 , ET =2 and FT =3 be the types of vertices, edges and faces in the case of planar graphs. In the usual interpretation of planar graphs }{11 ETT VV ~~ == },{22 FVTT EE ~~ == (winged-edge interpretation) }{33 ETT FF ~~ == φ==== 2112 ~ EV ~ VE ~~ TTTT φ==== 3223 ~ FE ~ EF ~~ TTTT }{3113 ETTTT ~ FV ~ VF ~~ ==== Let now the relation combination be },{ VFEVC →→= . By the symmetry, the relations VE → and FV → belong also to )(CD . Thus we may assume at this stage that )}1,3(),3,1(),1,2(),2,1{(=S . ACTA WASAENSIA 45 Consider in Step 1 the pair S∈)1,1( , i.e. the relation VV → . Since S∈)2,1( , i.e. CEV ∈→ , and S∈)1,2( , i.e. CVE ∈→ go to Step 2 with 2=k . There are two integers sk such that ),1( sk , Sk s ∈),1( , namely 21 =k corresponding to E and 31 =k corresponding to F . Since },{ VETE VV ⊂∈ , i.e. },{ 211 11 ~ kkk TTTT ⊂∈ , we can include the relation VV → in the relation combination and go back to Step 1. For the unmarked pair )2,2( there is an integer 1=k such that Sk ∈),2( . There is only one integer sk , namely 1=sk , such that that Sk s ∈),2( . Now we have },{ ~ FVTV EE =∈ , i.e. 22 ~ TT sk ∈ but }.{},{22 ~ skTFVT ⊆= Similar deduction holds also for the pair )3,3( . Thus the relations EE → and FF → do not belong to )(CD . Consider finally the unmarked pair )3,2( , i.e. the pair ),( FE . In this case for 1=k we have S∈)1,3(),1,2( , i.e. SVFVE ∈→→ , , and we can go to Step 2. Since φ=== EFij TTT ~ 23 ~~ , the pair )3,2( , i.e. the relation FE → (and EF → ), does not belong to )(CD . Since there are no other candidates to )(CD left, we can deduce that },,{1 VVFVVEC →→→= and so 56,09 5 3 32)( 22 1 == + = + = n CC CDeg . The Degree Testing Algorithm enables the definition of a general query based on the given data structure, i.e. on the given relation combination C . We now give a definition for the general query. 5.3. General Query Given an ARST ),,( τRA , a relation combination C and an element x of type X , find the set xY of all elements of type Y such that 46 ACTA WASAENSIA (i) the relation )(CDYX ∈→ and (ii) ABRyx ∈),( for all xYy ∈ , where A∈BA, and elements of A are of type X and elements of B are of type Y . Note that in (i) the set )(CD of relations can be determined by the preceding algorithm. It can be shown that this general definition of a query includes e.g. all the special queries introduced in Ni et al. (1994), namely the queries such as direct query, inverse query, self queries, direct transitive query, inverse transitive query, and indirect transitive queries. 5.4. Algorithms In the following sections we will present three algorithms. The first two algorithms are based on the algorithms of Töyli (2002), whereas the third algorithm is developed in this work. The aim of the first two algorithms is to show that it is possible to model data of different structures with the adjacency model. The third algorithm creates a new schema of a source database, which is a semistructured database. 5.4.1. Modeling of Relational Data In Buneman et al. (1996) the authors show that relational databases are easily encoded as trees. In this subsection we show that relational databases can also be encoded with the adjacency model. We will go through the process in which we convert a relational model into the adjacency model. In order to succeed with this process we expect that the data in the relational model is normalized, at least to the 3rd normal form. The process during which we proceed from the ACTA WASAENSIA 47 relational model into the adjacency model is quite straightforward. It consists of nine steps, which are given here: Step 1. Select an attribute (column) from a table. Step 2. Create a corresponding type. Step 3. Repeat steps 1 and 2 until all the distinct tables and attributes are handled. Step 4. Remove those types which have been created of the secondary keys, if any. Step 5. Take a primary key and define it as an adjacency defining type. Step 6. Repeat step 5 until all the distinct primary keys are defined as adjacency defining types. Step 7. Create an adjacency between an element of a type and an element of its corresponding adjacency defining type (i.e. between an attribute and the primary key of a table). Step 8. Repeat step 7 until all the adjacencies are created between the elements and their adjacency defining sets. Step 9. Create all other wanted adjacencies between the types. In Step 1, if there are multiple tables in our relational database, we select first one table from which we start the conversion process. Actually, it does not matter which table we select, but it could be a good advice to choose the one which have the most attributes. Next we will choose one attribute of that table and proceed to the next step. In Step 2 we create a new type of the attribute we have just selected in step one. When we name a new type, it is advisable to use the column name of that attribute as a type name. Often the column name is representational of the content of that particular attribute, and on the other side we can avoid conflicting names between the types. The type and the length of the attributes 48 ACTA WASAENSIA are insignificant, because the Adjacency Model treats all data equally, i.e. the type names and the adjacencies between them are the only significant matters. In Step 3 we repeat the two previous steps. However, as soon as all the attributes of one table are treated, we take another table under consideration. As long as we repeat steps one and two we do not need to be concerned if there are more than one type with the same name, because these extra types are removed in the next step. Now the primary phase of the conversion process is completed. After all the tables and all the attributes are considered we enter the next step. We have as many types as we had columns in the original database. However, some of them describe one and the same attribute. In Step 4 we remove all those duplicate attributes that have been created of the secondary keys of the tables, because they all have a corresponding type created of the primary key. Finally, there should not be duplicate types supposing the source database is at least in 3rd normal form. In the theory of the Adjacency Relation Systems (Wanne 1998) there is one concept called an adjacency defining set. In this fifth Step we consider these sets. Every distinct attribute of a primary key, sometimes the primary key consists of two or more attributes, can be defined as an adjacency defining type. So we will go through all types and check if it is a part of a primary key or not. If it is, then that particular type can be used to define adjacency between two or more types of the same table. If the same type is used as a secondary key in another table of the source database, then it can define adjacencies also between types created of attributes of two or more tables. In Step 6 we repeat the procedure of step five and select another type which is a part of a primary key of any table of the source database. This new type will ACTA WASAENSIA 49 also be defined as an adjacency defining type. This step will be repeated until all adjacency defining types are found and declared. In the previous steps we have concentrated on the types of the adjacency model. Next we will consider the elements of the types. In the relational model adjacencies are defined with relationships between the tables of the database. In the adjacency model the adjacency is defined with the help of the adjacency defining types. Moreover, the adjacency is created on the element level, not on the type level. In Step 7 we draw a line, i.e. a relation between an element of a type and an element of its corresponding adjacency defining type. We must remember here that we are not connecting elements based on their values, but according to the adjacency they have as attributes in the same table. The attributes, which have been declared as secondary keys in the source database, connect the types taken from different tables. In Step 8 we repeat the procedure of Step 7 until all the adjacencies between the elements of the different types are created. When we are finished with our work all elements of every distinct type are in relation with some other element. Some elements are adjacent with two or more elements, others has only one adjacent element. In the first case we consider elements of adjacency defining types and in the other case elements of the ordinary types, i.e. adjacencies between the attributes within a table. In Step 9 we considered all those adjacencies that are for some reason not considered yet. Before we go on with another algorithm for semistructured data we will give an example of how the algorithm presented above works. The example database 50 ACTA WASAENSIA contains three tables with ten attributes, so our adjacency model will contain only eight types, because the secondary keys are removed. In Figure 5.6 the three tables are Order, Bill and Customer. Each of the tables have records (rows) with a number of attributes (columns) and relationships which are composed with the primary keys (Ord_No, Bill_No and Cust_No) and the secondary keys (Bill_No and Cust_No). None of the attributes have null values. Order Ord_No Bill_No Sum Date 10 100 330,00 31.10.2000 11 101 20,00 1.11.2000 20 110 1 000,00 15.11.2000 Bill Bill_No Cust_No Delivery 100 10 yes 101 10 yes 110 15 no Customer Cust_No Adress Phone 10 Street 1 12 345 670 11 Road 66 33 377 880 15 Avenue 99 55 555 555 Relations Figure 5.6. Customer Order System – an example of a relational database. As mentioned above, we start by picking up attributes from the tables of the relational model and create types of them one at a time. In our example we have the attributes Ord_No, Bill_No, Sum and Date in the Order table and Bill_No, Cust_No etc. in the tables Bill and Customer. After the first three steps we have ten types, some of which are duplicates. ACTA WASAENSIA 51 In step four we eliminate those duplicate types which are declared as secondary keys, in this case the types Bill_No and Cust_No. So, at the end, we have only eight types left, Ord_No, Bill_No, Sum, Date, Cust_No, Delivery, Address and Phone. The next steps, steps five and six, check the primary keys and define them as adjacency defining types. There are three of them, i.e. Ord_No, Bill_No and Cust_No. After that it is quite simple to create the adjacencies between the elements of the adjacency defining types and the elements of those types which are adjacent with them, like the elements of Ord_No/Bill_No, Ord_No/Sum and Ord_No/Date. The same thing will be done with the other adjacency defining sets and their corresponding types. Now we have managed all the steps except the last one. In the last step, step nine, we create the rest of the adjacencies, if there are any. In most of the cases we have succeeded to create all the adjacencies, because the relations between the tables are mostly implemented with the help of the primary and secondary keys. When we eliminated the secondary keys we had only the primary keys left, and with them we, in the adjacency model, could create the adjacencies necessary for our purposes. If there are any non-key relations in the relational model, we will add them into our representation. In Figure 5.7 we have presented the relational database of Figure 5.6 with the adjacency model. There are only eight types left: Ord_No, Bill_No, Sum, Delivery etc., which represent the attributes of the tables. Each type contains elements, which in turn, correspond the values of those attributes, like Order_No: 10, 11, 20; Delivery: yes, no etc.. The relationships between the elements of different types are represented with directed edges; for example the upper part of Figure 5.7 represents the relationships ( NoOrdDateR _, , NoOrdNoBillR _,_ and NoOrdSumR _, ) of the attributes of the table Order of Figure 5.6. 52 ACTA WASAENSIA Bill_No Sum Ord_No Phone Date Cust_No Delivery Adress 10 11 20 100 101 110 330,00 20,00 1 000,0031.10.2000 15.11.2000 10 15 Yes Street 1 Road 66 Avenue 99 12345670 33377880 55555555 11 Figure 5.7. The relational database of Figure 5.15 represented with the Adjacency Model. The types Bill_No, Sum and Date are adjacent with each other with respect to the adjacency defining type, which in this case, is the Ord_No. So we can denote the adjacency defining sets with NoBillSumT _, ~ = }{ _ NoOrdT , NoBillDateT _, ~ = }{ _ ~ NoOrdT and DateSumT , ~ = }{ _ NoOrdT . Additionally, we can see that the elements of the type Bill_No are adjacent with the elements of the type Ord_No (e.g. )10(100 _ NoBillAd∈ ) and so do the elements of the types Sum and Date (e.g. )10(00.330 SumAd∈ , )10(2001.10.31 DateAd∈ ). These adjacencies are defined by the adjacency defining sets, not with respect to the values of the corresponding types. ACTA WASAENSIA 53 The adjacencies, which we have shown above, were implemented between the types taken from one and the same table. It is equally possible to define adjacencies between types taken from two different tables. In our example the type Bill_No is an attribute of two tables, Order and Bill, so we can define the adjacency between the tables in the adjacency model, e.g. )100(10 _ NoCustAd∈ and )10(100 _ NoBillAd∈ . The rest of Figure 5.7 can be represented equally applying the definitions above. The algorithm we have presented above is quite straightforward, because the underlying data model is based on a fixed schema. However, the amount of data that do not have a fixed schema increases rapidly. Our next algorithm is developed to manage such data. The algorithm reads a tree or a graph-like structure and creates an adjacency model of it. With this algorithm we want to show that semistructured data can be modelled with the adjacency model. 5.4.2. Modeling of an Object Exchange Model In this Section we will present our algorithm that converts an Object Change Model (McHugh 1997) to an adjacency model. We have chosen this model because it is most widely used, and the DataGuide schema is also based on this same model. The conversion process from the OEM model into the adjacency model differs in some aspects from the process of the relational model. The main principle is that every edge of the source database corresponds to one type of the adjacency model, and every node of the source database corresponds to a relation between two elements of two different types. The conversion between the models proceeds as follows: Step 1. If there is an incoming edge into the root node of the graph, create a new type of it. Step 2. Select an edge from the graph. 54 ACTA WASAENSIA Step 3. Create a new type of the edge. Step 4. Repeat steps 2 and 3 until all distinct edges are handled. Step 5. Select a type, which has two or more distinct types that are adjacent with it, and define it as an adjacency defining type. Step 6. Repeat step 5 until all such types are handled. Step 7. Create an adjacency between an element of an atomic type (a type which does not have other types that are adjacent with it) and an element of its corresponding adjacency defining type. Step 8. Repeat step 7 until all the elements of the atomic types are handled. Step 9. Draw the adjacencies between the elements of the adjacency defining types according to the original graph. Very often the root node has an incoming edge. This edge is considered first in Step 1, where we create a new type of it. Most often that particular edge has a name already and we can use it, but if it is not named then we can use what ever name we want. However, we must think out the name well so that there would not be any conflicts between the names of the other edge of the semistructured database. If there is no incoming edge, we can directly move on to the next step. In Step 2 we select one of the edges of the semistructured database. If we make the selection for the first time we take an edge emanating from the root node, otherwise we can select any of the remaining edges we have not visited yet. Our algorithm uses the depth-first search, so we know exactly those edges that are already picked up. In Step 3 we create a new type of the label of the last selected edge. In the conversion process from the relational model to the adjacency model duplicate ACTA WASAENSIA 55 types were allowed and removed later, but because our algorithm keeps track of the edges handled, we will not face the same problem here. Step 4 repeats the three previous steps and uses the DFS to manage all the edges of the source database. Because we use a commonly proven search method in our algorithm even the cycles will not cause any problem. In Steps 5 and 6 we define the adjacency defining types. As a rule of thumb we can say that all those edges of the semistructured database that end to a node that have outgoing edges can be defined as such. These two steps will be repeated until all such types are defined. Here we end the first phase of our algorithm. All types are created and all adjacency defining types are defined. In the next three steps we finish the conversion process by drawing the relations, i.e. the adjacencies between the elements of the types. When we consider the Object Exchange Model, the elements of the types consist of the object identifiers of the nodes to which the incoming edges are connected. Basically, if we have three edges with the same name, then we have only one type with that name and three object identifiers as elements within that type. In Steps 7 and 8 we create the relations, i.e. an adjacencies, between elements of the different types. In this and the next step we consider only adjacencies between atomic types and adjacency defining types. The elements, between which the relation is created, can be determined according to the inner nodes of the source database. The incoming edge to the inner node is defined as an adjacency defining type, and the object identifier of that node is the starting element. The outgoing edge, which is an atomic type in the adjacency model, points to a node, and the object identifier of that node is the ending element of 56 ACTA WASAENSIA the relation. When we have concluded these two steps, we have created all the adjacencies between the adjacency defining types and the atomic types. In Step 9 we create the adjacencies between the adjacency defining types. These types are inner nodes in the source database. The relation in the adjacency model is drawn from a type which is created of the label of the incoming edge of the inner node, and the relation ends to a type which is created from the label of the outgoing edge of the same inner node. The relation is, of course, an adjacency between the elements of the named types. Finally, when all the types are considered the semistructured model is converted to an adjacency model. The number of types in our model is less (in general) than the number of edge labels in the source database, because the edges with the same label are considered as one type. Next we will take some examples of the conversion process. We will consider here two different examples and give a brief explanation of the process. We will start with an OEM database (Figure 5.8) presented by McHugh, Widom, Abiteboul, Luo & Rajaraman (1998). In Figure 5.9 we have the same database represented with the adjacency model. In Figure 5.8 we can see that there is a named input edge, which is called DB, so the first type is created of it. After that we go through the whole graph with the depth first search and select all distinct edges, and create corresponding types of their labels, like Movie, Actor, Title etc. Next we define the adjacency defining types. In this first example the adjacency defining types include e.g. the types Movie, Actor and Price. Next we create the adjacencies from the elements of the adjacency defining types to the elements of their corresponding adjacent types like the relations between Name - Actor, Currency - Price, etc. Finally we create the adjacencies between the adjacency defining sets (Movie – Actor, Movie – Title ACTA WASAENSIA 57 etc.). All special cases, if any, are handled separately after we have finished the standard procedure. In Figure 5.8 the edge with the label References between the two Movie types is a special case which will be considered later in this example. Figure 5.8. An OEM database (McHugh et al. 1998). In Figure 5.9 there are thirteen distinct types with adjacency defining types and ordinary types. If we take the type Actor, we can see that some of the elements of the types Name and Character are adjacent with each other with respect to the type Actor ( }{ ~ ActorcterName,Chara TT = ). The elements of the types Year, Title, Writer and Price are adjacent with each other with respect to the type Movie, so we can write Movie ( }{, ~ MovieYearActor TT = , }{, ~ MovieTitleActor TT = etc.). For example Harrison Ford is the name (&17) of the Actor (&6) who plays in Movie (&2). We can also see that the Character (&18) Deccard is played by an Actor (&6) whose Name is Harrison Ford (&17), because the adjacency of these two types is defined by the type Actor. The adjacency between the elements of the types Name and Actor, e.g. )6(&17& NameAd∈ is defined directly by the relation without any adjacency defining type. 58 ACTA WASAENSIA &3 &4&2 &16&6 &13 Movie &14 &5 &9 Actor Title &10 Writer &25&17 &21 Name &26&18 &22 Character &15 &8 &11 Price &7 &12 Year &20 Currency &19 Amount &23 Original &24 AKA &1 DB Figure 5.9. An OEM database of Figure 5.8 modelled by the Adjacency Model. In Figure 5.9 the types Title and Price have elements with two different expressions. First, they both have elements that do not have adjacencies with other elements and secondly, they have elements with adjacencies with other elements. For example, the element &5 of the type Title do not have adjacent elements, whereas the element &14 has two adjacent elements (i.e. )14(&24& AKAAd∈ and )14(&23& OriginalAd∈ ). This kind of structures make semistructured data difficult to model with the common data models, but as we can see we can present structures like this with the adjacency model. ACTA WASAENSIA 59 The edge labelled References in Figure 5.8 is not depicted as a distinct type in the adjacency model, because there is a more natural way to represent it. It can be represented with only one type by defining a relation between the two elements &2 and &3 of the type Movie. With this method we can reduce the number of distinct types. Our next example deals with the semistructured data graph of Figure 5.10. It represents an OEM database, which contains information about the Stanford Database Group (McHugh et. al 1997). In principle it has the same structure as the OEM database of Figure 5.8 but there is one significant difference. The graph contains a cycle between the two nodes &4 and &5 with edges labelled Member and Project. Figure 5.10. An OEM database with a cycle between the nodes &4 and &5 (McHugh et al. 1997). In the adjacency model of Figure 5.11 the adjacency is defined with relations between the elements ojectMemberR Pr, and MemberojectR ,Pr . In this case we can use also an adjacency defining type, i.e. the DBGroup. Both of the types Member and 60 ACTA WASAENSIA Project have an element (&4 and &5) that is adjacent with an element (&1) of the adjacency defining type DBGroup, i.e. }{Pr ~ DBGroupojectMember, TT = . &5 &6&3 &4&2 Member Project &12 Name &13 Age &14 Office &19 Building &20 Room &1 DBGroup &15 Title &16 Figure 5.11. The cycles of Figure 5.10 presented with the Adjacency Model. Since )1(&6,&5& Pr ojectAd∈ in Figure 5.11, the elements &5 and &6 are adjacent with each other with respect to ojectojectT Pr,Pr ~ . This kind of adjacency is typical in the context of semistructured data when we have a distinct root node (here DBGroup). The same thing can be found between the types Project and Member, where )4(&6,&5& Pr ojectAd∈ . ACTA WASAENSIA 61 5.5. Definition of an Adjacency Schema In this Section we present our proposal of a schema for semistructured data. Although our work follows the theory of Wanne (1998) and Wanne et al. (1999), there is one exception in that our proposal concerns the adjacency between the types and not between the individual elements of the types, which is the case in Wanne (1998). This means new formulations of some of the definitions introduced in Wanne (1998) and Wanne et al. (1999). These will be presented in this section. As a consequence of the fact that we consider a schema for a database we do not need to show the atomic objects, i.e. the values of the source database. This means that the schema becomes more simplified. The values of the atomic objects can be stored in a common data store, like a relational or object database, and hence can be accessed by e.g. with SQL. In Figure 5.12a there is a part of a semistructured database. The database graph consists of nodes and labeled edges. The nodes do not have any object identifications. In Figure 5.12b the same graph is depicted with an Adjacency Schema, which consists of types and interrelationships between them. Intuitively, in Figure 5.12a, the meaning of a relationship between two consecutive objects is defined with the help of a labelled edge, whereas in an Adjacency Schema the relationship between two interrelated types is defined with respect to a node of a database graph. Definition 5.1. An Adjacency Schema (AdSchema) is a pair ( RT A , ), where { }nA TTT ,,, 21 =T , 1≥n , is a set of types, and }},,2,1{|{ niRi ∈=R , where each ))(,( iii TIrTR = , and )( iTIr denotes a subset of AT . 62 ACTA WASAENSIA 3T nT 1T 2T  1T 2T 3T nT a b Figure 5.12. A part of a data graph depicted with an Adjacency Schema. Definition 5.2. If R∈= }),,,{,( 21 miiiii TTTTR  , then each type ),,2,1( mkTk = is said to be interrelated to the type iT . Furthermore, denote by )( iTIr the set },,,{ 21 miii TTT  . Since it is possible that jii = for some mj ,,1 = , the relation iR means that a type iT is interrelated to itself. This is a special case in a data graph where the start and end node of an edge are the same. We use the notion ji TT → to show the interrelationship between types, like iT and jT . If we have semistructured data that contains elements of type iT and jT and if the elements of these two types are adjacent with each other, then we can denote that by ji TT → . This means that )( ij TIrT ∈ and the interrelationship of the elements can be represented also as a pair of elements (Wanne 1998). All known approaches, including approximate DataGuides (Goldman 2000), consider directed graphs as schemas for semistructured databases. In our approach we consider symmetric relations between the types and so there is no predefined order between the types of the schema. This makes it possible that we can traverse the data structure in both directions. ACTA WASAENSIA 63 Definition 5.3. An Adjacency Schema ( RT A , ) is said to be symmetric if for each pair of types it holds that )( ji TIrT ∈ if and only if )( ij TIrT ∈ . The number of the nodes in a semistructured database graph can vary from only a few nodes to a large number of them. The number of nodes is not a problem, but it would be more convenient if some of the relations could be left out of the schema, i.e. they could be derived from the known ones. We can use the interrelationship between some particular types when the relation is defined with respect to a third type. Definition 5.4. In the case of a symmetric Adjacency Schema, the types Aji TT T∈, are said to be interrelated with respect to a type kT , AkT T∈ , },{},,2,1{ jink −∈  , if )( ik TIrT ∈ and )( jk TIrT ∈ . The type kT is called an interrelationship defining type. In Figure 5.13 the types 1T and 3T are interrelated to each other with respect to the type 2T , in notation }{ 213 ~ TT = . If ji = the type is adjacent with itself. In Figure 5.13 we can see a schema with j types and the interrelationships between them. As we can see there are many interrelationships drawn between the types, but many of them are unnecessary because they can be derived from the already known interrelationships, i.e. there is a minimal set of relations that need to be stored. In Goldman (2000) there is a concept called a data path, which is a sequence of one or more dot-separated labels nlll ,..,. 21  starting from an object o . In our work the corresponding concept is realized with transitively interrelated types. 64 ACTA WASAENSIA Transitively interrelated types are types that are adjacent with each other with respect to a set of consecutive intermediate types. . . . 1+iT 2+iT 3+iTiT jT Figure 5.13. A valid structure containing interrelationships between a set of types. Definition 5.5. Two types iT and jT , ji ≠ are transitively interrelated if there are one or more consecutive intermediate types between them, such that iT and kT are interrelated with respect to a type 1−kT for jiik ,,3,2 ++= . Consider the case of Figure 5.14. It has the same set of types as is shown in Figure 5.13 but the number of interrelationships is lesser. The interrelationship types are as in the previous definition. iT jT . . . 2+iT 3+iT1+iT Figure 5.14. A minimal set of interrelationships of the types of Figure 5.13. The fact that some of the interrelationships can be derived tends to decrease the size of AdSchema. So, we need to find the optimal combination of the ACTA WASAENSIA 65 interrelationships to form the base set from which the remaining relations can be derived. If we can find a relation combination that is minimal and valid, then the rest of the interrelationships can be derived from the known interrelationships and the database schema can be reduced without loosing its accuracy. In Wanne (1998) the validity of a relation combination is defined between the elements of adjacent types but here we just notice that if a relation combination is valid, then it uniquely defines the interrelationships of all types. In Figure 5.15 we have a schema with six relations, },,,,, 654321 RRRRR{R=R , i.e. }),{,( 4211 TTTR = , }),,{,( 43122 TTTTR = , }){,( 233 TTR = , }),,,{,( 652144 TTTTTR = , }){,( 455 TTR = and }){,( 466 TTR = . The combination of three relations 21, RR and 4R form a valid combination. This means that the rest of the relations can be derived with the help of these relations. 1T 2T 3T 4T 5T 6T Figure 5.15. An example of a valid schema. 66 ACTA WASAENSIA When we think of the relation combinations that should be stored we sometimes have to select between two or more alternatives. In Figure 5.16 we have such a situation. There are three different relation combinations from which we can choose. First we have the relations { }421 ,, RRRR = , where }){,( 121 TTR = , }){,( 322 TTR = and }){,( 424 TTR = , the second relation combination is },, '4'3'2' RR{RR = where the relations are }){,( 31'2 TTR = , }){,( 23'3 TTR = and }){,( 42'4 TTR = , and the third },, ''4''3''1'' RR{RR = , where the relations are }){,( 21''1 TTR = , }){,( 31''3 TTR = and }){,( 42''4 TTR = . The choice between these relation combinations must be done with respect to the efficiency aspect. The most efficient is the relation combination R i.e. { }421 ,, RRRR = . With this solution the interrelationship between the types 1T and 3T can be left out of the schema, because that particular interrelationship can be defined with respect to the type 2T , which is an interrelationship defining type. 1T 2T 3T4T Figure 5.16. Selection problem of the minimal valid relation combination. This relation combination is the most efficient because if we start from 2T we can retrieve all the types 1T , 3T and 4T with one direct access. ACTA WASAENSIA 67 5.6. Examples In this Section we give some examples of how different parts of a semistructured database can be depicted with an AdSchema. In Figure 3.1 we have a semistructured database called premiership (Nestorov et al. 1997) some parts of which we presented with the help of an AdSchema. First, in Figure 5.17a we have the first part of the premiership object starting from the root node &0. It presents a data path &0.Club.&1.Name.&2.Offical.&3. This data path can be constructed with an AdSchema depicted in Figure 5.17b consisting of three types Club, Name and Official and the relations between them.     &0 &1 &2 &3   a     b   Figure 5.17. A portion of a graph of Figure 2.1 depicted with an AdSchema Next we will extend Figure 5.17a with a new type called Player and its successor Nationality (Figure 5.18a). The second data path is &0.Club.&1.Player.&14. Nationality.&19. Player is the second edge emanating from the edge Club, which means that we have to insert the type Player into the schema as well as its successor Nationality (Figure 5.18b). 68 ACTA WASAENSIA      &0 &1 &2 &14 &3 &19    a      b    Figure 5.18. Adjacency of types Name and Player defined with respect to a third type Club. The types Club, Name and Player are interrelationship defining types, which means that we do not need to draw an interrelationship between the types Name and Player. Besides, we need not to have the interrelationships between the types Club and Nationality or between the types Club and Official either because they are also interrelated with each other with respect to the types Player and Name, respectively. In Figure 5.19a the edge Name has two different expressions. In the first case it is a complex object and in the second case it is an atomic object. Both of these expressions are allowed in an OEM graph. This deviation does not pose any problem in AdSchema, because we have only one type for each occurrence of the Name object. The corresponding AdSchema is depicted in Figure 5.19b, in which we can see that the types Club and Name are again interrelationship defining types as in Figure 5.18b, although the situation is somewhat different. This is one characteristic of the AdSchema that makes it possible to increase the degree of the data structure, i.e. some relations between the types can be derived from the existing, stored interrelationships. ACTA WASAENSIA 69           &0 &1 &25&14 &15 &16 &17 &24   a      b    Figure 5.19. Two different occurrences of the edge Name depicted with an AdSchema. In Figure 5.19b the interrelationships }){,(3 PlayerNameR = and }){,(4 NamePlayerR = can be derived with the help of the interrelationships }){lub,(1 NameCR = and }){lub,(2 PlayerCR = . We can also use the symmetric variants of the named interrelationships 1R and 2R . Also the interrelationship between the types First and Last can be derived based on the same definition. Next we will consider cycles. In Figure 5.20b we can see how a cycle of a premiership object (Nestorov et al. 1997) is implemeted with the AdSchema. We can see that the types Club and FormerClub are interrelated to each other with respect to the type Player. There is also a interrelationship between the types Player and FormerClub, but we need not to show it in the AdSchema. The reason is the interrelationship defining set Player. It can be used to derive the missing interrelationship between the types Club and FormerClub. 70 ACTA WASAENSIA         &0 &1 &14   &24 &28       a         b Figure 5.20. Representing cycles with the AdSchema. 5.7 Creating an AdSchema from Semistructured Data In this Section we present an algorithm for schema creation for semistructured data. The original source database is read in a depth-first fashion. Every label path is considered and all distinct edges are read and new types are created from them. Equally all the interrelationships between the types are created of the nodes of the source database. We use a hash table where all the object and edges are stored. The hash table is used in order to prevent multiple types with same name to be inserted into our AdSchema. We also create a starting type (usually the incoming edge to the root node) because in this way we can avoid disconnected types. Next we go through the steps of the algorithm (Figure 5.21). The first six lines represent the initialization phase of our algorithm. On line 1 we have the name of the algorithm. Line 2 contains the input file to our algorithm. It is a semistructured database and will have the object identifier of the root node o of the source database as a starting point. On line 3 the output file is introduced. It is the AdSchema. On line 5 a hash table is declared . The purpose of the hash table is to map the objects/edges of the source database to ACTA WASAENSIA 71 the types of the AdSchema. On line 6 a global type called ads is created. This variable contains the first type that will be inserted into the new AdSchema. Usually it is an edge that comes into the root node. Figure 5.21. An algorithm to create an Adjacency Schema. Line 8 starts the making of the AdSchema. The method MakeAdSchema contains the object identifier of the root node as a parameter. On line 9, the first new type, i.e. the label of the first edge is assigned into the ads variable. Line 10 makes the first insertion into the hash table with the method Insert() of the object targetHash. The insertion contains the object identifier of the root node and the label of the first edge. After that, on line 11, we call the recursive part of that algorithm called RecursiveMake(). 72 ACTA WASAENSIA The effective work is done on lines 14 through 26. On these lines all the nodes and edges of the source database are stored in the hash table, and above all, the types and the interrelationships between them are appended to the current schema. Line 14 is the header of the RecursiveMake method. It contains as parameters the object identifier and the name of the edge. When we call this method for the first time, the object identifier is the identifier of the root node and the name of the edge is the label of the incoming edge to the root node. On line 15 we save all the object identifier and edge label pairs of the children of each object in t1. On lines 16 through 26 we have a for-each loop that searches through the hash table and checks if the given object identifiers are already stored there. In a semistructured database there can be several edges with the same edge label but with different object identifiers. On line 16 we consider unique edge labels but take into consideration all the object identifiers. Line 16 starts the for-each loop. On line 17 the object identifiers of the edge label are picked up and stored to the variable t2, and on line 18 we search the hash table for the object identifiers of the variable t2. The used method of the object targetHash is called Lookup(). The result of the search is stored to the variable d2. Lines 19 through 26 check if the given objects are found in the hash table or not. If an object identifier is found we add on line 20 an edge from the second last inserted type to the last inserted type. On the other hand, if we cannot find the search object from the hash table (line 21), we add a new type to the schema (line 22). This new type will then be added to the hash table on line 23. Also the set of object identifiers assigned to that edge are added to the hash table. After that, on line 24, a new interrelationship is added between the two last added types. Finally, a new recursive call is made on line 25, and all the rest of the types and the interrelationships between them are added to the schema. ACTA WASAENSIA 73 Next we will consider an example of how our algorithm works. In Figure 5.22 there is a part of an OEM database of restaurants and bars (Goldman 2000). There are two Restaurants with some detailed information and one Bar without any additional information. The Owner of one restaurant is the same as the Manager of the other restaurant. The leaf nodes have some data. Nam e BarRestaurant Entree Restaurant Entree Entree Nam ePhone Owner Manager 1 2 3 4 5 8 9 10 116 7 Chill’s Burger 555-1234 Sm ith Darbar Lam b Curry Beef Curry Rose & Crown Figure 5.22. A sample OEM database of restaurants and bars (Goldman 2000). In Figure 5.23 we have the corresponding AdSchema of Figure 5.22. When we create an AdSchema of Figure 5.22 we can see that there are as many types in it as there are distinct edges in DataGuide. In addition, our AdSchema has an extra type (Entry) which does not exist in the DataGuide or in the original source database. In our algorithm the source database with an object identifier o contains the input data. The output of the algorithm is the schema of Figure 5.23. First, in the method MakeAdSchema(), we assign the type Entry as a new type to the variable ads. We insert the type Entry and the object identifier 1 to the hash table and start the RecursiveMake() method. 74 ACTA WASAENSIA -  7 )  2         Figure 5.23. An Adjacency Schema of the sample OEM database of Figure 5.22. When we enter to the method RecursiveMake() we assign the pairs {(Restaurant,2),(Restaurant,3)} to the variable p. After that, for each unique label in p, we assign the object identifiers to the variable t2. t2 will contain the set {2,3}. Next we will use the method Lookup() of the object targetHash to lookup if the objects are already in the hash table. The result will be assigned to the variable d2. If the objects are not found, then d2 will be assigned to the next new type, which in this case will be Restaurant. The object identifiers and the new type will then be added to the hash table. This will again be done with the Insert() method. A new interrelationship is added between the Entry type and the Restaurant type, and the RecursiveMake() method will be called again with new parameters. In this second call of the method t1 contains the object identifiers {2,3} and d1 contains the type Restaurant. The variable p will now contain the following pairs: {(Name,5),(Entree,6),(Phone,7),(Owner,8),(Manager, 8),(Name,9),(Entree,10),(Entree,11)}. Now, for each unique label, we go through the same procedure. We start with label Name and assign the object identifiers (5,9) to variable t2, try to find them in the hash table and if not found, we create a new type, add the type and the object identifiers to the hash table and finally ACTA WASAENSIA 75 create a new relation between the second last and the last types. This procedure will go on until all unique edges and the interrelationships between them are added to the AdSchema. This example is a very simple example and its only purpose is to show that the algorithm works. The only speciality is that the two edges Owner and Manager present one and the same person. In the source database this means that the corresponding edges end to the same node. In the AdSchema this can be presented without any specific structural problem. 76 ACTA WASAENSIA 6. CONCLUSIONS AND FUTURE WORK 6.1. Conclusions Our research hypotheses stated that it is possible to construct a schema for semistructured data based on the theory of the Adjacency Relation Systems, and that the size of it can be decreased without loosing much of its accuracy. As we have finished our work and given answers to the research questions we can state the following. The first research question about the adequacy of the theory of the Adjacency Relation Systems to be used as a foundation for an Adjacency Schema was answered positively. This means that the theory is solid and can be used with some smaller changes also with a schema construction for semistructured data. It also turned out, as we stated in the second research question, that it is possible to develop an algorithm that generates the schema from the original source database. The third question about the fulfillment of the requirements imposed for a schema for semistructured data is partly answered, but this question is still open. We need empirical results to answer this question in full, so in future this is one of the areas we will concentrate. The fourth research question about the minimal and accurate schema is also partly open. The schema creation algorithm constructs a schema that contains only distinct types which means that the schema is minimal. In most cases it also shares the idea of a strong DataGuide in that there is only one “label path” in the adjacency schema to each of the types. ACTA WASAENSIA 77 6.2. Summary of Contributions The main contributions of this work are the schema proposal for semistructured data, an accurate definition of semistructured data and an algorithm for calculating its degree as well as a general query for a given relation combination. We also contribute with algorithms for the modelling of relation and semistructured data as well as with an algorithm for construction of an adjacency schema for semistructured data. The accurate definition of the semistructured data helps us to determine the degree of the semistructured data. It states clearly that the degree is between zero and one, and that it can be completed up to one, which means that the structure is valid. The degree calculating algorithm connects our work to the theory of the Adjacency Relation Systems. It calculates the degree from a set of given relations and their adjacency defining sets. The calculated degree can furthermore be increased if we use the known relations to derive new ones. The definition of semistructured data and the calculation algorithm show that the theory of Adjacency Relation Systems could be applied in our work. The general query presented in this work is a first attempt to develop a query that could use the structure of the adjacency schema to find the information wanted by a user. However, the general query has not been developed to the point in which the whole benefit of it could be used. The main contribution of this paper is the schema proposal for semistructured data. This proposal follows the same principles imposed for the existing schemas, like DataGuide. The results we have obtained in our investigations are very encouraging. We have shown that the adjacency schema can be used as a schema for semistructured data, however the test examples could be more 78 ACTA WASAENSIA numerous. This means that more extensive conclusions should not be drawn before more support for our investigations can be shown. The schema construction algorithm presented in this work is important because it shows that the theory behind the adjacency schema can be utilized in practice. This statement gives us a good base for our future work in this area. The other algorithms presented here support the same thing although they consider the modelling of semistructured data. If we compare our results with the results obtained by other researchers there are two points that are worth mentioning. First we can say that as far as we know there has not been an exact definition of semistructured data which is general enough. In our work we give the limits between which semistructured data should lie and a motivation for such a convention. Secondly is the schema for semistructured data. As we have described in the earlier sections of this work there is one schema for semistructured data that has reached the position of a de facto standard. Compared with that schema it is hard to come up with new proposals. However, we have presented our proposal because we believe that there is potential in it. There are aspects in the current schemas that need improvements and when we have developed the theory of the adjacency schema we have emphasized these facts. 6.3. Future Research As it has tuned out in the earlier sections there are some subjects that need more and deeper investigation. The presented schema proposal is our first attempt to give a structure for semistructured data but we lack some experimental results of the effectiveness of our proposal. It will be one of the first things to be considered in future. ACTA WASAENSIA 79 Another thing that has not been considered in our work is the modification of the schema. The schema must be consistent even when the source database changes. These changes in the structure of the schema must be handled with the construction algorithm. In this stage our algorithm does not handle changes in the structure of the source database so the second subject to be covered in future is the algorithm. Furthermore, the algorithm should also be able to handle different versions of semistructured data which is not the case right now. The algorithm must be made more flexible so that it can handle cycles and other exceptions in the structure of the source database. Finally, the subject of queries has not been considered very much in this work. In the adjacency relation systems there are queries whose appropriateness should be tested in this context. As the queries are relatively simple there should not be much work to adapt them to be used with the adjacency schema. However, modification of these queries will be on the list of our future work. 80 ACTA WASAENSIA REFERENCES Abiteboul, S. (1997). Querying Semi-Structured Data. In Proceedings of International Conference on Database Theory [online] [cited: 2005-06-01], 1–18. Available: ftp://ftp.inria.fr/INRIA/Projects/gemo/gemo/ GemoReport-103.ps.gz. Abiteboul, S., P. Buneman & D. Suciu (2000). Data on the Web. From Relations to Semistructured Data and XML. San Francisco, USA: Morgan Kaufman Publishers. 258 p. ISBN 1-55860-622-X. Broekstra, J., C. Fluit, & F. van Harmelen (2000). The State of the Art on Representation and Query Languages for Semistructured Data. On-To- Knowledge (EU-IST-199910132) [online] [cited: 2005-06-01]. Available: http://www.ontoknowledge.org/downl/del8.pdf. Buneman, P. (1997). Semistructured Data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems [online] [cited: 2005-06-01], 117–121. Available: http://db. cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz. Buneman, P., S. Davidson, G. Hillebrand & D. Suciu (1996). A Query Language and Optimization Techniques for Unstructured Data. In Proceedings of ACM-SIGMOD International Conference on Management of Data [online] [cited: 2005-06-01], 505–516. Available: http://db.cis.upenn. edu/DL/96/UnQL/UnQL.ps.gz. Buneman, P., S. Davidson, M. Fernandez & D. Suciu (1997). Adding Structure to Unstructured Data. In Proceedings of the International Conference on Database Theory [online] [cited: 2005-06-01], 336–350. Available: http: //www.cis.upenn.edu/~db/sue/ICDT96.pdf. Buneman, P., M. Fernandez & D. Suciu (2000). UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion. In Very Large Data Bases, VLDB Journal [online] [cited: 2005-06-01]. Available: http://db.cis.upenn.edu/DL/vldbjournal00.ps.gz. Cong, G., L. Yi, B. Liu & K. Wang (2002). Discovering Frequent Substructures from Hierarchical Semi-structured Data. In Proceedings of the Second SIAM International Conference on Data Mining, SDM. Hyatt Regency, Crystal City, Arlington, VA, USA [online] [cited: 2005-06-01]. Available: http://www.cs.sfu.ca/~wangk/pub/siam02.pdf. ACTA WASAENSIA 81 Goldman, R. (2000). Integrated Query and Search of Databases, XML, and the Web. Thesis. Stanford University, Department of Computer Science [online] [cited: 2005-06-01]. Available: http://infolab.stanford.edu/ ~widom/theses/goldman.ps. Goldman, R. & J. Widom (1997). DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of the Twenty-Third International Conference on Very Large Data Bases [online] [cited: 2005-06-01], 436–445. Available: http://www-db.stanford.edu /pub/papers/dataguide_vldb97.ps. Goldman, R., J. McHugh & J. Widom (1999). From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. In Proceedings of the Second International Workshop on the Web and Databases (WebDB'99) [online] [cited: 2006-06-01], 25–30. Available: http://www-db.stanford.edu/pub/papers/xml.ps. Goldman, R. & J. Widom (1999). Approximate DataGuides. In Proceedings of the Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats. Jerusalem, Israel [online] [cited: 2005-06-02]. Available: http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc= 1999-56&format=pdf&compression=&name=1999-56.pdf. Goodrich, M. & R. Tamassia (1998). Data Structures and Algorithms in Java. New York etc.: John Wiley & Sons, Inc. 738 p. ISBN 0-471-19308-9. Hacid, M-S., F. Soualmia & F. Toumani (2000). Schema Extraction for Semistructured Data. In Proceedings of the 2000 International Workshop on Description Logics (DL'2000). Aachen, Germany [online] [cited: 2005-06-01], 133–142. Available: http://www710.univ-lyon1.fr/ ~dbkrr/publications.htm. Hopcroft, J. E. & J. D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Menlo Park, California etc.: Addison-Wesley. 418 p. ISBN 0-201-02988-X. Linna, M., M. Wanne & J. Töyli (2003). Completion of the Incomplete Data Structure. In Proceedings of the Second IASTED International Conference on Information and Knowledge Sharing, Scottsdale, AZ, USA [online] [cited: 2005-06-02]. Available: http://www.uwasa.fi/~jt/IKS_2003. pdf. 82 ACTA WASAENSIA McHugh, J., S. Abiteboul, R. Goldman, D. Quass & J. Widom (1997). Lore: A Database Management System for Semistructured Data. SIGMOD Record 26:3 [online] [cited: 2005-06-01], 54–66. Available: http:// www-db.stanford.edu/pub/papers/lore97.ps. McHugh, J., J. Widom, S. Abiteboul, Q. Luo & A. Rajaraman (1998). Indexing Semistructured Data. Technical Report, the Stanford Database Group [online] [cited: 2005-06-03]. Available: http://www-db.stanford.edu /lore/pubs/semiindexing98.pdf. Nestorov, S., J. Ullman, J. Wiener & S. Chawathe (1997a). Representative Objects: Concise Representations of Semistructured, Hierarchical Data. In Proceedings of the 13th International Conference on Data Engineering (ICDE'97), Birmingham, England [online] [cited: 2005-06- 01]. Available: http://www-db.stanford.edu/pub/papers/represen tative-object.ps. Nestorov, S., S. Abiteboul & R. Motwani (1997b). Inferring Structure in Semistructured Data. ACM SIGMOD Record 26:4 [online] [cited: 2005- 06-01], 39–43. Available: ftp://ftp.inria.fr/INRIA/Projects/gemo/ gemo/GemoReport-122.ps.gz. Nestorov, S., S. Abiteboul & R. Motwani (1998). Extracting Schema from Semistructured Data. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data 27:2 [online] [cited: 2005-06-01]. Available: ftp://ftp.inria.fr/INRIA/Projects/gemo/ gemo/GemoReport-152.ps.gz. Ni, X & S. Bloor (1994). Performance Evaluation of Boundary Data Structures. IEEE Computer Graphics and Applications 14:6 [online] [cited: 2005-06- 02], 66–77. Available: http://ieeexplore.ieee.org/iel1/38/7784/0032 9098.pdf?tp=&arnumber=329098&isnumber=7784. Papakonstantinou, Y., H. Garcia-Molina & J. Widom (1995). Object Exchange Across Heterogeneous Information Sources. In Proceedings of the International Conference on Data Engineering (ICDE), Taipei, Taiwan [online] [cited: 2005-06-02]. Available: http://dbpubs.stanford.edu/ pub/showDoc.Fulltext?lang=en&doc=1995-6&format=pdf&compre ssion=&name=1995-6.pdf Thierry-Mieg, J. & R. Durbin (1992). Syntactic Definitions for the ACeDB Data Base Manager. Technical Report, MRC-LMB xx.92, MRC Laboratory for Molecular Biology, Cambridge, UK. ACTA WASAENSIA 83 Töyli, J. (2002). Modeling Semistructured Data by the Adjacency Model. Licentiate thesis, University of Vaasa [online] [cited: 2005-06-02]. Available: http://www.uwasa.fi/~jt/lisuri.pdf. Töyli, J., M. Linna & M. Wanne (2002a). Modeling Relational Data by the Adjacency Model. In Proceedings of the Fourth International Conference on Enterprise Information Systems. Universidad de Castilla-La Mancha, Ciudad Real - Spain [online] [cited: 2005-06-02], 296–301. Available: http://www.uwasa.fi/~jt/ICEIS_2002.pdf. Töyli, J., M. Linna & M. Wanne (2002b). Modeling Semistructured Data by the Adjacency Model. In Proceedings of the Fifth Joint Conference on Knowledge-Based Software Engineering. University of Maribor, Maribor - Slovenia [online] [cited: 2005-06-02], 282–290. Available: http://www.uwasa.fi/~jt/JCKBSE_2002.pdf. Wang, K. & H.Q. Liu (1997). Schema Discovery for Semistructured Data. In Proceeding of the International Conference on Knowledge Discovery and Data Mining. Newport Beach, AAAI [online] [cited: 2005-06-01], 271– 274. Available: http://www.cs.sfu.ca/~wangk/pub/kdd1.ps. Wang, K. & H.Q. Liu (1998a). Discovering Typical Structures of Documents: a Road Map Approach. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval [online] [cited: 2005-06-01], 146–154. Available: http:// www.cs.sfu.ca/~wangk/pub/sigir.pdf. Wang, K. & H.Q. Liu (2000a). Discovering Structural Association of Semistructured Data. IEEE Transactions on Knowledge and Data Engineering, 12:3 [online] [cited: 2005-06-01], 353–371. Available: http://www.cs.sfu.ca/~wangk/pub/tkde98.pdf. Wang, Q.Y., J. X. Yu & K. F. Wong (2000b). Approximate Graph Schema Extraction for Semi-Structured Data. In Proceedings of Seventh International Conference on Extending Database Technology (EDBT). 302–316. Germany: Springer, Konstanz. Wanne, M. (1998). Adjacency Relation Systems. Acta Wasaensia 60. Computer Science 1. University of Vaasa, Vaasa. Wanne, M. & M. Linna (1999). A General Model for Adjacency. Fundamenta Informaticae 38, 39–50. 84 ACTA WASAENSIA World Wide Web Consortium (2004). Extensible markup Language (XML) 1.1. [online] [cited:2006-01-24] Available: http://www.w3.org/TR/ 2004/REC-xml11-20040204/.