Data Classification

Introduction

The data classification module provides functionality to classify the physical structures of a target system against the terms defined in the business glossary. For example, it can be used to identify data of interest within a database, such as email addresses, tax codes or other PII. Data is identified through classification rules, which may be specific to the domain of interest.

The data classification process is divided into four different phases:

  1. Crawling: loading the metadata that describes the physical structures of the target system, in particular the names of tables and fields and the field types. This phase also includes:
     - Data sampling: samples are extracted from the target system. They are used in the rule evaluation phase to assign each field a specific term of the business glossary.
     - Construction of dictionaries: a dictionary can be defined through a query and collects all the data that the query returns (for example, all the employee names from the “EMPLOYEE” table).
  2. Rule instantiation: loading the classification rules for the system’s physical structures. Rules fall into two categories: those that classify on the basis of record values and those that classify on the basis of the metadata of the physical structures.
  3. Rule evaluation: the classification rules are evaluated against the physical metadata or the extracted samples. The result of each evaluation is a score, expressed as a percentage, that estimates the probability that the match is correct.
  4. Assignment generation: every classification rule evaluation generates an assignment between the physical structure and the business glossary term defined by the rule. Assignments are possible matches between the physical layer and the logical layer. Each assignment has a score, derived from the evaluations, that indicates the probability that the match is correct. Assignments must be validated after they are created, either manually by a user or automatically through previously defined score thresholds.
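
The rule evaluation and assignment validation phases can be illustrated with a minimal Python sketch. It is not the module's actual implementation: the `Column` and `Rule` types, the email pattern, the scoring formulas and the threshold values (`accept_at`, `reject_below`) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Column:
    """Hypothetical view of one crawled field plus its sampled values."""
    table: str
    name: str
    data_type: str
    samples: Sequence[str]


@dataclass
class Rule:
    """Hypothetical rule: a glossary term and a scoring function (0-100)."""
    term: str
    score: Callable[[Column], float]


# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def email_value_score(column: Column) -> float:
    """Value-based rule: percentage of sampled values that look like emails."""
    if not column.samples:
        return 0.0
    hits = sum(1 for value in column.samples if EMAIL_RE.fullmatch(value))
    return 100.0 * hits / len(column.samples)


def email_name_score(column: Column) -> float:
    """Metadata-based rule: looks only at the column name."""
    return 100.0 if re.search(r"e?mail", column.name, re.IGNORECASE) else 0.0


def validate(score: float, accept_at: float = 90.0, reject_below: float = 30.0) -> str:
    """Threshold-based validation; both thresholds are assumed values."""
    if score >= accept_at:
        return "accepted"
    if score < reject_below:
        return "rejected"
    return "pending"  # left for manual validation by a user


def classify(columns: Sequence[Column], rules: Sequence[Rule]) -> list:
    """Evaluate every rule on every column and emit candidate assignments."""
    assignments = []
    for column in columns:
        for rule in rules:
            score = rule.score(column)
            if score > 0:
                assignments.append({
                    "asset": f"{column.table}.{column.name}",
                    "term": rule.term,
                    "score": score,
                    "status": validate(score),
                })
    return assignments
```

In this sketch, a column whose samples are three email addresses and one `n/a` scores 75 on the value-based rule, so it is left pending for manual review; a column whose samples all match is accepted automatically.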

Definitions

Term        Definition
Dictionary  The list of all permissible values of a business glossary term.
Rule        A pattern for associating physical metadata with a business glossary item.
Evaluation  The result of evaluating a rule on a physical asset (e.g. a table or a column).
Assignment  A summary of one or more evaluations that defines the relationship between an object of the data catalog and an element of the business glossary.
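
The relationship between these concepts can be sketched as a small Python data model. This is only an illustration under stated assumptions: the class and field names are hypothetical, and summarizing several evaluations by keeping the highest score is an assumed policy, not one documented for the module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dictionary:
    """Permissible values of one business glossary term."""
    term: str
    values: frozenset

    def match_rate(self, samples) -> float:
        """Percentage of sampled values found in the dictionary."""
        if not samples:
            return 0.0
        hits = sum(1 for sample in samples if sample in self.values)
        return 100.0 * hits / len(samples)


@dataclass(frozen=True)
class Evaluation:
    """Result of one rule evaluated on one physical asset."""
    asset: str   # e.g. "EMPLOYEE.FIRST_NAME"
    term: str    # business glossary term proposed by the rule
    rule: str    # name of the rule that produced the score
    score: float


@dataclass(frozen=True)
class Assignment:
    """Summary of the evaluations that link one asset to one term."""
    asset: str
    term: str
    score: float
    evaluations: tuple


def summarize(evaluations) -> list:
    """Group evaluations by (asset, term) into assignments.

    Assumption: the assignment score is the highest evaluation score.
    """
    grouped = {}
    for evaluation in evaluations:
        key = (evaluation.asset, evaluation.term)
        grouped.setdefault(key, []).append(evaluation)
    return [
        Assignment(asset, term, max(e.score for e in group), tuple(group))
        for (asset, term), group in grouped.items()
    ]
```

A dictionary built from the employee names could, for example, be evaluated against the samples of a column through `match_rate`, and the resulting score stored in an `Evaluation` alongside those produced by other rules before `summarize` merges them.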