OLE ORM Logicbased English A Language for Redesigning

OLE: ORM Logic-based English A Language for Redesigning and Migrating Databases Herman Balsters University of Groningen The Netherlands September, 2012

Situation • Designers (technical and non-technical) of information systems want to write down constraints and derivation rules to describe their business context • They want a language that is easy to read and write by both designers and domain experts: hence, the constraints and derivation rules should be able to be validated by nontechnical domain experts • The language has to be expressive enough to capture typical business rules and constraints • The language should be precise enough to be completely unambiguous and eventually translate to a technical platform (e. g. , SQL)

In our case: a fact-based language for re-engineering, restructuring, and migrating databases 1. A user (of our method) is a person who wants to deliver a re-engineered and re-structured database, serving as a basis for, e. g. , a database migration 2. Axiom: There is no sense in migrating data of which the semantics are not known 3. The semantics of each database table from a given database is reconstructed by offering a corresponding natural language sentence capturing the intended meaning of the table heading. 4. The first offering of this sentence can be in informal semantics/syntax 5. This sentence is rewritten to a structured format, coined as OLE: ORM Logic-based English. This format is mandatory and fixed for the user. 6. OLE is a sugared form of Sorted Logic, and is easy to read and write by non-technical domain experts (hence enabling validation of the OLE statements). 7. Why take logic as a basis? : ORM diagrams are directly related to firstorder predicate logic (cf. Ph. D- thesis Halpin, and other recent work by Halpin)

OLE continued. . 8. Objects are categorized as entities and values, and entities are offered reference modes. 9. Elementary facts in OLE may be added with the following relevant constraints: • uniqueness constraints • mandatory constraints • external uniqueness constraints • subtyping 10. Entities can also be identified by using compound reference schemes (So far, OLE looks like CSDP, but then for textual modeling) 11. OLE allows for ad hoc specifications of constraints (what you can express as a constraint in logic, you can express in OLE) 12. OLE allows for ad hoc specifications of derivation rules (what you can express as a derivation rule in logic, you can express in OLE) 13. OLE-specs eventually translate to SQL

Comparable approaches • Common Logic Controlled English (CLCE, John Sowa): Adopted by the OMG, ISO standard • The Attempto project (ACE, Kuhn): Texts in ACE are used to represent knowledge and rules, e. g. within the context of the Web • Constellation Query Langauge (CQL, Clifford Heath): Texts in CQL can be seen as an alternative for ORM diagrams to model the Uo. D • Formal ORM Language 2 (FORML 2, Terry Halpin and Jan Pieter Wijbenga): highly expressive natural-language based specification language for derivation rules in ORM Some differences with OLE • CLCE and ACE are not primarily directed at data modelling, but is a general requirements specification language, based on sugared variants of first-order predicate logic • CLCE attempts to offer UML class diagrams a more formal and complete basis (cf. FUML) • CQL is not based on logic • FORML 2 is not based on logic, but more on how humans actually communicate in natural language

OLE and its correspondence to Sorted Logic with variables (SORT): SORT: = Unary. Predicate(Term) | Binary. Predicate(Term, Term) | Ternary. Predicate(Term, Term) | Term=Term | ~(SORT) | SORT{&|v|→|↔}SORT | {∀|∃| ∃0. . *|∃0. . 1|∃1|∃0} Var: Type (SORT)| ∀Var 1, Var 2: Type(Var 1 Binary. Predicate Var 2 ↔ SORT)| ∀Var 1: Entity. Type, Var 2: Type (Var 1. Role. Name = Var 2 ↔ SORT | (SORT). Term: = Var | Object| Entity. Rolename | Entity. Refmode | Base. Relation. Attribute Rolename: = Name Ref. Mode: = Name Object: = Entity | Value Type: = Value. Type | Entity. Type | Base. Relation Entity. Type: = Name Value. Type: = Name

OLE: grammar spec (not complete, and in non-variable format) OLE: = Term Unary. Predicate | Term Binary. Predicate Term | Term Ternary. Predicate. First. Part Term Ternary. Predicate. Second. Part Term | Term isa Term | Term=Term | not OLE | OLE {and|or} OLE | if OLE then OLE | for each Type {and Type}*, OLE | there is {some | possibly some | at most one | exactly one |no} Type {and Type}*, such that OLE | for each Type 1 and Type 2, Type 1 Binary. Predicate Type 2 iff OLE | for each Entity. Type and Type, the Rolename of Entity. Type is Type iff OLE |(OLE). Term: = {that} Type | the Rolename of Entity. Type | Entity. Type. Refmode | Base. Relation. Attribute | Constant Rolename: = Name Type: = Value. Type | Entity. Type | Base. Relation Entity. Type: = Name Ref. Mode: = Name Value. Type: = Name

OLE examples 1. for each Project, there is exactly one Description, such that Project has Description 2. for each Student and Project, there is at most one Student. Project, such that Student. Project involves Student and Student. Project involves Project 3. for each Student. Project, there is exactly one Mentor, such that Student. Project has Mentor

Migrating Databases Consider the following table STUDENT(nr, mentor, project. Description) (informally) defined by: "Student is identified by number and has mentor for the project described by project. Description" Notice: definition involves soft semantics, a composite fact, and non-normalized data How do we re-engineer and re-structure this table into an acceptable format? Acceptable format = • Normalized (no redundancy) • Correct and complete semantics (including all relevant constraints) Our example is not acceptable because –as a fact type- it violates the (n-1)-rule (consider the non-key dependency project → project. Description)

OLE Re. Designer method Step 1: List the elementary facts for the target model • Student has Mentor for Project • Project is described by Project. Description Step 2: For each elementary fact, list all entities and values, along with reference modes and basic types • Student is Entity (and is referred to by nr (of type integer)) • Mentor is Entity (and is referred to by name (of type integer)) • Project is Entity (and is referred to by code (of type varchar)) • Description is Entity (and is referred to by name (of type varchar)) Step 3: For each elementary fact, list all uniqueness constraints • for each Student and Project, there is exactly one Mentor, such that Student has Mentor for Project • for each Project, there is exactly one Description, such that Project has Description

Putting this information into the NORMA tool, will yield the following proper ORM model (re-engineering the schema of the original Student base table): Model M 1

Canonical ORM format (equivalent to the original proper ORM model) yields: Model M 2 Canonical ORM (Halpin, NORMA), can be completely derived from the original source table populations, and is a possible model on which to base an actual migration.

The importance of Co. Reference-ORM • It is the model format that is as close as possible to the table format we eventually want in the implementation of the target model. • The canonical model can be called the binary break-down of the target model. • However, it is still completely conceptual: semantics are given by fully validated elementary fact statements. (This in contrast to the corresponding relational view: table headings often do not even offer a clue about the table’s underlying fact types!) • The canonical model offers fully re-engineered, re-structured and validated semantics of the original source database. • The canonical model is therefore used as the model in which we define the derivation rules defining the target database • These derivation rules (in OLE) offer what we call the binary build-up of the restructured database (i. e. , the target database) Hence, migration is performed by binary break-down followed by binary build-up

We have the following rules to populate the fact types of M 2 (assuming a population of the source model M 1) : (1) for each Student and Project and Mentor, if Student has Mentor for Project then there is some Student. Project , such that Student. Project involves Student and Student. Project involves Project (2) for each Student. Project and Student and Project, if Student. Project involves Student and Student. Project involves Project then there is some Mentor, such that Student has Mentor for Project • Rule(1) tells us that the entity type Student. Project is at least populated by taking instances from the fact type Student has Mentor for Project from Model M 1 • Rule(2) tells us that the entity type Student. Project is at most populated by instances taken from the fact type Student has Mentor for Project from Model M 1

We can now prove the following properties (proof is based on the previous rules (1, 2) and the uc’s taken from model M 1): (i) for each Student and Project, there is at most one Student. Project, such that Student. Project involves Student and Student. Project involves Project (ii) for each Student. Project, there is exactly one Student, such that Student. Project involves Student (iii) for each Student. Project, there is exactly one Project, such that Student. Project involves Project Properties (i), (ii), and (iii) now allow us to introduce the following (derived) compound reference scheme for Student. Project: Student. Project is Entity (and is referred to by Student and Project, where Student. Project involves Student and Student. Project involves Project)

• This results in the following intermediate target database as suggested by NORMA (Rmap) • This target database still has to be transformed into its final “view” format: all attributes occurring in the 2 tables are to be declared as derived

How do migrate our original base tables to a Re-engineered/Restructured database? Here, the original source table STUDENT(nr, mentor, project. Description) serves as the input base relation from which we derive the population of the canonical ORM model and, subsequentky, the target database We proceed by using Derivation Rules in OLE …

(1) for each Nr, Name, and Code, there is some Student. Project, such that the student. Nr of Student. Project is that Nr and the mentor. Name of Student. Project is that Name and the project. Code of Student. Project is that Code iff there is some STUDENT, such that STUDENT. nr = Nr and STUDENT. project = Code and STUDENT. mentor = Nam (2) for each Nr, Name, String and Code, there is some Project, such that the project. Code of Project is that Code and the Description of Project is that Name iff there is some STUDENT, such that STUDENT. nr = Nr and STUDENT. project = Code and STUDENT. mentor = Name and STUDENT. project. Description = that String

• We note that definition (1) respects the (derived) compound reference scheme for Student. Project, proved in the previous section. • We note that in (1, 2) we have employed a general scheme for defining views. • Say that: - we wish to define a view V, with attributes x 1, . . , xn (of domain type X 1, . . , Xn) - this view is to be defined in terms of a set of base relations B 1, . . , Bm - Ψ denotes the predicate describing the definition of V then the general formula (in Sorted Logic) stating that V is defined in terms of Ψ is given by ∀x 1: X 1, . . , ∀xn: Xn [∃v: V v(x 1, . . , xn) ⇔ ∃b 1: B 1, . . , ∃bm: Bm Ψ(b 1, . . , bm, x 1, . . , xn)]

The OLE rules (1, 2) result in the following target table definitions (generated by the ORM Re. Designer tool) Define the target table Student. Project with column names Student. Nr, project. Code, mentor. Name as: rows consisting of STUDENT. nr, STUDENT. project, STUDENT. mentor for each STUDENT Define the target table Project with column names project. Code, description as: rows consisting of STUDENT. project, STUDENT. project. Description for each STUDENT

This results in the following SQL (generated by the ORM Re. Designer tool): CREATE VIEW `Project` (`project. Code`, `description. Name`) AS SELECT `STUDENT`. `Project`, `STUDENT`. `Project. Description` FROM `STUDENT`; CREATE VIEW `Student. Project` (`student. Nr`, `project. Code`, `mentor. Name`) AS SELECT `STUDENT`. `nr`, `STUDENT`. `Project`, `STUDENT`. `Mentor` FROM `STUDENT`;

Some Notes on OLE and the Re. Designer • OLE has been offered a full BNF language spec, which was implemented in the ORMRe. Designer tool (tech. report Ru. G) • OLE contains at least the expressive power of Sorted Logic (tech. report Ru. G) • OLE can handle many of ORM's fact type constructions and constraints • We are currently working on a suitable syntax for set comparison constraints (and aggregate functions) --------------------------------------------------- • In this paper, we have taken only a simple base relation as a case study; we have, however, applied OLE and the ORMRe. Designer to larger case studies (e. g. , migrating patient medical records from two different hospitals to a common EPD (Electronic Patient Dossier) database). • Both OLE and the ORMRe. Designer proved in practice (Master students in a course given on Data Migration, and professional EPD-designers) to be reasonably simple to use, both by the non-technical domain expert (involved in the validation of the OLE-specifications) and professional designer (using the tool to migrate databases).

Conclusions • We have tried to show fact-based modeling, in particular ORM and its representation in (sugared) Sorted Logic, can help in reengineering (relational) databases. • We reconstruct the semantics of the source database by offering a set of natural-language sentences capturing the conceptual structure and constraints of the source. • These sentences are rewritten in a structured natural language format, coined as OLE: ORM Logic-based English. • OLE is used to define the mappings from the original source to a reengineered target database. • We have discussed the ORM Re. Designer: a semi-automatic tool, based on OLE and NORMA, available as a research prototype, used for reengineering and migrating relational databases.

Future work • The ORMRe. Designer tool is a research protoype, and, hence, needs improvement to perform on an actual professional scale • OLE needs to be extended with language features to also support more advanced data modeling (e. g. set comparison constraints) and advanced support for businesss rules (e. g. aggregate functions) • Non green-field situations: We are currently working on a combination of BPMN and OLE, to offer a procedure for deriving target data models from target process models (method and tool)

Acknowledgement • Terry Halpin for in-depth duscussions on using Logic in ORM • Excellent feedback/comments from one of the reviewers which greatly improved an important part of the paper