Analyze SQL statements

What is a Query Bucket?

A Bucket groups a set of Statements. Each Bucket is assigned a list of systems, from which the Catalog used by the Parser is generated. A Bucket can be configured to automatically generate DataFlows from the Statements inserted into it. In addition, it contains pre-processing settings that let you substitute portions of the text of SQL statements using regular expressions. This is useful when Statements are written in SQL dialects not fully supported by the Parser.
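For example, a pre-processing rule can strip a dialect-specific clause before the analysis. The following is a minimal, hypothetical sketch (the rule, the tables, and the QUALIFY clause are illustrative, not product defaults):

    Regular expression:  (?i)\bQUALIFY\b.*
    Replacement text:    (empty string)

    -- Original statement, written in a dialect with a QUALIFY clause:
    INSERT INTO dw.customers_latest (customer_id, email)
    SELECT customer_id, email
    FROM staging.customers
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1

    -- Statement actually analyzed after pre-processing: the unsupported
    -- clause is gone, while the tables and columns relevant for lineage remain.
    INSERT INTO dw.customers_latest (customer_id, email)
    SELECT customer_id, email
    FROM staging.customers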

Bucket statistics are calculated from the status of the Statements it contains. The completion percentage of a Bucket is the share of Statements that no longer require user intervention; Statements that still require it are counted as “pending”, which includes the ANALYZED and PARTIALLY_ANALYZED statuses. For example, a Bucket with 10 Statements, 4 of which are still pending, would be 60% complete. Statistics are shown both on the Bucket list and on the detail page.

“Bucket Statistics image”

How to create a Query Bucket

A new Bucket can be added in the “query parser” subsection of the “data flow” section.

“Bucket Creation Editor”

The editor defines:

  • Name: the unique name for this Bucket.
  • Description: a description of the Bucket and its intended usage.
  • Systems: the systems on which the Statements contained in the Bucket operate. These are needed to load the correct set of metadata from the Catalog.
  • Strategy for the automatic creation of DataFlows, which can be:
    • None.
    • At the Physical Entity level.
    • At the Physical Field level.
  • SQL Statement pre-processing settings: a list of regular expressions, each with its replacement text.
  • Team: the team to which the Bucket is assigned.

Statement Analysis

A SQL statement is analyzed in order to reconstruct the Lineage it produces. From the Lineage it is possible to extract DataFlows, and from all the saved DataFlows you can trace the data lineage of the entire system. A Statement is analyzed by the Parser when it is inserted into a Bucket or when the text of the SQL statement is modified.
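As a minimal sketch of what the Parser extracts (all table and column names here are hypothetical), consider the following Statement:

    INSERT INTO sales_dwh.fact_orders (order_id, customer_id, amount)
    SELECT o.id, o.customer_id, o.total
    FROM crm.orders o
    WHERE o.status = 'COMPLETED'

From it the Parser can reconstruct a Lineage from crm.orders to sales_dwh.fact_orders and, at field level, from o.id to order_id, o.customer_id to customer_id, and o.total to amount; each of these links can then be saved as a DataFlow.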

A Statement can have one of the following statuses:

  • PENDING: the Statement has not yet been analyzed by the Parser.
  • ANALYZED: the Parser was able to correctly analyze the Statement and to retrieve from the Catalog all the tables used inside the SQL statement.
  • PARTIALLY_ANALYZED: the Parser was able to correctly analyze the Statement, but it failed to retrieve from the Catalog all the tables used inside the SQL statement.
  • ERROR: the Parser was unable to properly analyze the Statement, due to either a syntax error in the SQL statement or an error during the analysis.
  • IGNORE: the Statement will not be analyzed by the system. A Statement can be put in this state by a user’s action, or automatically when its submission produces the same result as one already produced by another Statement.
  • CONFIRMED: one or more DataFlows were generated from the Statement. Once in this state, the Statement will no longer be re-analyzed.
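For instance (again with hypothetical names), a Statement such as

    INSERT INTO dw.orders_copy
    SELECT * FROM legacy.orders

would end up PARTIALLY_ANALYZED if legacy.orders is not present in the Catalog of the Bucket’s systems, even though the syntax itself parses correctly.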

If the Statement is in the ANALYZED or PARTIALLY_ANALYZED status, the “type” field also indicates the type of SQL statement that the Parser was able to recognize.

“Analyzed view example”

Using the refresh button on the Statement detail page, you can re-analyze a Statement regardless of its current status (except for a SELECT or SELECT_INTO Statement that has been linked to a target table).

Target destination of a SELECT statement

An entity in a system may have been generated from data in another system. If the extraction is done using a SQL statement, this link can be represented by connecting a SELECT statement with a target table outside the system. Every time a SELECT Statement is correctly analyzed by the Parser (ANALYZED or PARTIALLY_ANALYZED status), a warning is shown with a button that opens a modal for linking a target table.

“Select with missing target example”

From the modal it is possible to choose the target table among the Physical Entities.

“Select target selection”

You can then connect the output columns of the SQL statement with the Physical Fields of the target table through one of the following options, illustrated in the sketch below:

  • The name of the column.
  • The index (ordinal position) of the column.
  • Manually choosing the Physical Field for each column.
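As a hypothetical illustration (all names are invented), suppose the following SELECT extracts data that another system loads into a table reporting.active_customers:

    SELECT c.id, c.email
    FROM crm.customers c
    WHERE c.active = 1

Linking by index would map output column 1 (c.id) to the first Physical Field of the target table and column 2 (c.email) to the second; linking by column name would match the output names id and email against the target field names; and the same mapping can always be built by choosing each Physical Field by hand.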

“Select target options”

Once the table is connected, the Lineage is updated and the related DataFlows can be generated.

“Select target lineage”

Lineage between entities not involved in SQL statements

An entity in a system may have been imported from another system without using SQL statements (e.g. CSV dump, file transfer, ETL job). In this case, the link between the two Physical Entities is represented either through a single DataFlow, or through a set of DataFlows between their Physical Fields. To generate the DataFlows, simply go to the detail page of one of the two Physical Entities and access the section dedicated to DataFlows.

“Manual Connection Panel”

From here, to establish the direction of the connection, one of the two icons located above the buttons, “INCOMING” and “OUTCOMING”, must be selected. Each opens a modal in which you can choose the connection level (a single DataFlow between the two Physical Entities, or multiple DataFlows between their Physical Fields).

“Manual Connection Editor”

If the link at the Physical Field level is chosen, DataFlows can be generated using two different strategies (illustrated in the sketch below):

  • Through Ordinal Position of Physical Fields.
  • Through Column Name of Physical Fields.

Finally, it is always possible to manually choose individual links between fields.
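The difference between the two strategies can be seen with a hypothetical pair of entities (all names are invented for illustration):

    -- Source entity, e.g. loaded into the target system via a CSV dump:
    CREATE TABLE staging.products (
        product_id  INT,
        descr       VARCHAR(100),
        unit_price  DECIMAL(10,2)
    )

    -- Target entity in the destination system:
    CREATE TABLE dwh.dim_product (
        prod_key    INT,
        descr       VARCHAR(100),
        price       DECIMAL(10,2)
    )

With the Column Name strategy only descr is linked to descr, since it is the only name the two entities share; with the Ordinal Position strategy three DataFlows are generated (product_id to prod_key, descr to descr, unit_price to price), because the fields are paired by position.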

DataFlow creation strategies

Once the Parser has managed to extract the Lineage from a Statement, the data lineage information can be saved within the data catalog through the DataFlows.

Starting from a Statement, it is possible to generate DataFlows if the following conditions are satisfied:

  • Statement’s status is ANALYZED or PARTIALLY_ANALYZED.
  • The source Physical Entity and the target Physical Entity are inside the Catalog.
  • The Lineage obtained from the analysis of the Statement contains at least one path between the source Physical Entity and the destination one.

Creation can be either manual or automatic:

  • Manual creation of DataFlows: to create a DataFlow manually, go to the “Data Flows Preview” section of the Statement’s detail page. Here, a list of all the DataFlow previews that can be generated starting from the analyzed Statement is shown. Previews are calculated either at the level of Physical Fields or at the level of Physical Entities.

    “Manual creation option”

    This behavior can be selected via the “Physical Entity Level” switch. To save the previews, just click on the appropriate add button and confirm. Previews that generated a DataFlow are indicated with a blue check.

  • Automatic creation: DataFlows can be created at the same time a new Statement is inserted into a Bucket. This feature can be enabled in the Bucket settings by editing the “Automatic DataFlow Creation” item.

    “Automatic creation option”

    As with manual creation, it is possible to choose whether to generate DataFlows at the level of Physical Entities or at the level of Physical Fields.

Refreshing Statements analysis

The refresh operation allows you to reset the operations performed on the Statement and rerun the analysis. This includes deleting DataFlows and removing any associated target table (for SELECT statements). The operations performed during the refresh are as follows:

  1. Rollback and deletion of the created DataFlows.
  2. Deletion of the full lineage data.
  3. Parsing of the Statement and generation of a new full lineage.
  4. Automatic DataFlow creation, if enabled on the Bucket.

“statement refresh button”

On the detail page of a Bucket, it is possible to apply a massive refresh to all the Statements it contains. The procedure can take several minutes to complete, and it is necessary to keep the application open while it runs.

“statement massive refresh button”

Lineage Visualization

The Lineage can be viewed on the Statement detail page, in the “Full Lineage” section. It is available only when the analysis was successful (ANALYZED or PARTIALLY_ANALYZED status).

“full lineage chart”

The columns found in the Catalog are grouped by the name of the table they belong to and are shown in blue, while those not found are shown in orange and grouped by the probable names of the tables they belong to.

The visualization of the complete data lineage is located in the “Data Flows” sub-section of the Physical Entity detail page.

“Dataflow chart buttons”

You can choose whether to view the data lineage via flow chart or via graph.

“Dataflow chart image”

“Dataflow graph image”