Classification Rules And Dictionaries

In Blindata you can define rules that associate a physical field with an element of the business glossary, as well as dictionaries that collect all the permissible values of a business glossary term.

To access the section dedicated to dictionaries and rules, click on the “Classification” button on the left menu of Blindata, as highlighted in the following figure.

Classification menu

The following paragraphs illustrate how to define rules and dictionaries in Blindata.

How to define a classification rule

When you open the “Classification” section, the list of all the rules defined within the platform is shown.

Classification menu

To add a new rule, click on the button highlighted in the figure and fill in the creation form. The information requested is listed below:

  • Name (mandatory): the name of the rule
  • Data Category Name: the name of the Data Category to which the rule should refer
  • Logical Field Name: the name of the Logical Field to which the rule should refer
  • Description: an optional description of the rule
  • Rule Type: the type of rule, which determines how it is implemented (further details in the next paragraph)
  • Rule Content (mandatory): the actual body of the rule; its format depends on the Rule Type
  • Weight (mandatory): the estimated probability that the rule is correct, i.e. its reliability (for example 0.8 for 80%)
  • Threshold: available depending on the type of rule; the fraction of samples that must satisfy the rule for it to be considered in the evaluation phase (for example 0.9 for 90%)
  • Production: indicates whether the rule is actually active and therefore usable during the classification process
  • Associated Scope: tags defined on a rule; one or more scopes can be used to tag rules and then to select them when the classification process is executed

Data Category Name and Logical Field Name are mutually required.

An easily reproducible example is a rule for finding the tax code of a natural person, shown in the following image:

Classification rule example

All rule types based on sample extraction (DICTIONARY_DATA, REGEX_DATA, VALUE_DATA) require a threshold beyond which the evaluation of the rule is considered positive. These rules compare the extracted samples with the content of the rule: if the percentage of samples that satisfy the rule condition exceeds the threshold value, the rule is considered positive.

In the example shown, the threshold has a value of 0.9. This means that 90% of the extracted samples must comply with the rule in order to generate an evaluation. The evaluation has a weight of 0.8, i.e. an 80% probability of being correct.
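The interplay of threshold and weight can be sketched in a few lines of Python. This is only an illustration, not Blindata's implementation: the function name evaluate_sample_rule, the tax code pattern (a 16-character Italian codice fiscale layout) and the sample values are assumptions, and so is the choice to treat a ratio equal to the threshold as positive.

```python
import re
from typing import Callable, Iterable, Optional

# Assumed pattern for a 16-character tax code; the rule content shown in
# the figure may differ.
TAX_CODE = re.compile(r"[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]")


def evaluate_sample_rule(
    samples: Iterable[str],
    matches: Callable[[str], bool],
    threshold: float,
    weight: float,
) -> Optional[float]:
    """Return the rule weight if enough samples match, otherwise None."""
    values = list(samples)
    if not values:
        return None  # no samples, so no evaluation is generated
    ratio = sum(1 for v in values if matches(v)) / len(values)
    # Assumption: a ratio equal to the threshold counts as positive.
    return weight if ratio >= threshold else None


# Hypothetical samples: nine well-formed codes and one placeholder value.
samples = ["RSSMRA85T10A562S"] * 9 + ["n/a"]
score = evaluate_sample_rule(
    samples,
    lambda v: TAX_CODE.fullmatch(v) is not None,
    threshold=0.9,
    weight=0.8,
)
print(score)
```

With nine matching samples out of ten, the ratio reaches the 0.9 threshold and the sketch returns the weight; with fewer matches it returns None, meaning the rule produces no evaluation.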

Types of classification rules

As mentioned above, each rule has a type. The available rule types are the following:

DICTIONARY_DATA: this type of rule checks the percentage of samples contained within a given dictionary. When choosing this type of rule, you must specify the name of the dictionary from which to take the terms to compare.

Classification rule DICTIONARY_DATA

REGEX_DATA: this type of rule uses a regular expression to match a certain type of data. In the example above, we can see a rule of this type, i.e. the rule to search for valid tax codes.

Classification rule REGEX_DATA

VALUE_DATA: this type of rule is very similar to “DICTIONARY_DATA”. The main difference is that, instead of a dictionary, the rule content holds a predefined list of values that the user must define. In this case too, a threshold must be defined, and the rule checks the percentage of extracted samples that correspond to the defined values.

Classification rule VALUE_DATA
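Both DICTIONARY_DATA and VALUE_DATA come down to a membership check on the extracted samples. The sketch below illustrates the idea under stated assumptions: match_ratio, the list of values, the samples and the 0.75 threshold are all hypothetical, and exact, case-sensitive comparison is assumed.

```python
from typing import Collection, Iterable


def match_ratio(samples: Iterable[str], allowed: Collection[str]) -> float:
    """Fraction of samples that appear among the allowed values."""
    values = list(samples)
    if not values:
        return 0.0
    allowed_set = set(allowed)  # set lookup keeps the check fast
    return sum(1 for v in values if v in allowed_set) / len(values)


# Hypothetical VALUE_DATA content: a fixed list typed into the rule.
# For DICTIONARY_DATA the allowed values would come from a dictionary instead.
rule_values = ["M", "F"]
samples = ["M", "F", "F", "M", "unknown"]

ratio = match_ratio(samples, rule_values)
rule_is_positive = ratio >= 0.75  # hypothetical threshold
print(ratio, rule_is_positive)
```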

TABLE_METADATA: this rule uses metadata of the physical structure. It checks if the rule content is contained in the metadata value, for example EMPL to target physical structures containing employees’ data.

Classification rule TABLE_METADATA

FIELD_METADATA: same as TABLE_METADATA but on physical columns. The rule is activated if the column name matches the value contained in the rule.

Classification rule FIELD_METADATA

TYPE_METADATA: the rule is activated if the target column is of the type defined in the rule (for example varchar, timestamp, integer, etc.).

Classification rule TYPE_METADATA
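The three metadata checks described above need no sample extraction. A minimal sketch follows, assuming case-insensitive comparison and a substring check for both TABLE_METADATA and FIELD_METADATA; the function names and the table, column and type values are hypothetical, and the platform's actual matching semantics may differ.

```python
def metadata_contains(rule_content: str, metadata_value: str) -> bool:
    """TABLE_METADATA / FIELD_METADATA: rule content contained in the name."""
    # Assumption: the comparison ignores case.
    return rule_content.casefold() in metadata_value.casefold()


def type_matches(rule_type: str, column_type: str) -> bool:
    """TYPE_METADATA: the column type equals the type named in the rule."""
    return rule_type.strip().casefold() == column_type.strip().casefold()


table_hit = metadata_contains("EMPL", "HR_EMPLOYEES")   # table-level check
field_hit = metadata_contains("EMAIL", "customer_email")  # column-level check
type_hit = type_matches("varchar", "VARCHAR")
type_miss = type_matches("integer", "timestamp")
print(table_hit, field_hit, type_hit, type_miss)
```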

REGEX_FIELD_METADATA: this rule is triggered if the target column name matches the defined regular expression.

Classification rule REGEX_FIELD_METADATA

REGEX_FIELD_DESCRIPTION_METADATA: this rule is triggered if the target column description matches the defined regular expression. You can use this rule to analyze comments on fields, or other properties that can be customized with the crawling options of the classification job.

REGEX_TABLE_METADATA: this rule is triggered if the target table name matches the defined regular expression.

Classification rule REGEX_TABLE_METADATA

REGEX_TABLE_DESCRIPTION_METADATA: this rule is triggered if the target table description matches the defined regular expression. You can use this rule to analyze comments on tables, or other properties that can be customized with the crawling options of the classification job.
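The four regular-expression metadata rules differ only in the text they inspect: column name, column description, table name or table description. The sketch below uses Python's re module to illustrate this; the helper name, the pattern and the metadata values are hypothetical, and matching anywhere in the text (re.search) is an assumption about how the expression is applied.

```python
import re
from typing import Optional


def regex_metadata_matches(pattern: str, text: Optional[str]) -> bool:
    """True if the pattern is found in a name or description."""
    if not text:
        return False  # a missing description can never trigger the rule
    # Assumption: the expression may match anywhere in the text.
    return re.search(pattern, text) is not None


# Hypothetical rule content: case-insensitive match on "email" variants.
pattern = r"(?i)e[-_]?mail"

name_hit = regex_metadata_matches(pattern, "CUSTOMER_EMAIL")
description_hit = regex_metadata_matches(
    pattern, "Primary e-mail address of the customer"
)
no_description = regex_metadata_matches(pattern, None)
print(name_hit, description_hit, no_description)
```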

How to define a dictionary

A dictionary defines the set of all allowed values of a business glossary term. As with rules, to define a dictionary, click the item at the top right as shown in the figure.

Classification dictionary menu location

The list of existing dictionaries is shown here. To define a new dictionary, click on the button at the bottom right as shown in the previous figure. The fields to fill in are:

  • Name (mandatory): the name of the dictionary
  • Entity Name (mandatory): the table from which to extract the data
  • Query (mandatory): the query used to extract the data
  • Associated Scopes: the tags associated with the dictionary; they can correspond to domains or reference systems

A simple example is a dictionary that collects all customer emails, like the one shown in the figure. In this case, the dictionary is built from the extraction query shown in the figure, which runs on the “Customer” table.

Classification dictionary example
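To make the idea of a query-backed dictionary concrete, the sketch below builds the set of allowed values from an in-memory SQLite table. It is only an illustration: the table layout, the sample rows and the extraction query are assumptions, and the query shown in the figure may be written differently.

```python
import sqlite3

# In-memory stand-in for the source system holding the "Customer" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Customer (email) VALUES (?)",
    [
        ("anna@example.com",),
        ("luca@example.com",),
        ("anna@example.com",),  # duplicate value
        (None,),                # missing value
    ],
)

# Hypothetical extraction query: distinct, non-null emails.
query = "SELECT DISTINCT email FROM Customer WHERE email IS NOT NULL"
dictionary_values = {row[0] for row in conn.execute(query)}
conn.close()

print(sorted(dictionary_values))
```

A DICTIONARY_DATA rule would then compare the extracted samples of a physical field against a set of values like this one.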