Database Profiler

Examines database data to determine its data type and statistical composition, and outputs a detailed JSON description that can be used for further analysis.

For more information on data types, see Data types.

The output analysis from the node is provided in three formats:

  • The analysis output pin provides data profile information which you can use to analyze the profile metrics of the database, and consists of a row for each field that the node has profiled, with the properties of each field contained in separate columns.
  • The detail output pin provides a JSON description containing the field properties that you can use for further analysis, and consists of a row for each field that the node has profiled.
  • The errors output pin provides details of any analysis errors that may take place. See the FailureBehavior property for more details.

Configuring the Database Profiler node

To configure the Database Profiler node and connect to an external database:

  1. Configure the required properties to connect to your database. If you are connecting to a database that uses a driver which is pre-installed with Data360 Analyze, follow the Simple database connection steps. Or, if you are connecting to a database that uses a driver which is not shipped with Data360 Analyze, see Advanced database connection.

    Tip: For a list of the drivers which are included with your installation, see Acquiring data from a database.

  2. If you are working with a large database, it is recommended that you filter your queries as appropriate, using the options under Properties > Configure > Filtering in the right-hand panel.

Logical Type detection

You can add your own Logical types - also known as semantic types - to those detected by default by the Database Profiler node. To do this, provide a JSON specification in a file that you upload to the server, and use the path to that file as the value of the LogicalTypeDefinitions property on the node.

For example, to detect possible UK National Insurance number values in a data field, you could use the following JSON:

[
	{
		"qualifier": "UK.NI",
		"regExpReturned": "\\s*[a-zA-Z]{2}(\\s*\\d\\s*){6}[a-zA-Z]\\s*",
		"threshold": 98,
		"baseType": "STRING"
	}
]

When you supply input data to the Database Profiler node, types can be identified by using the JSON when the following criteria are met:

  • If a string field contains values that are matched by the regular expression defined in the regExpReturned property in the JSON,
  • AND, if the percentage of values for a field that are identified as the specified type is equal to or greater than the value specified by threshold. The value is given as a percentage, so in this example, a minimum of 98% of the values must be detected as a match.
  • When a match is found, the output analysis identifies the Type Qualifier for the field as the value given by the qualifier property - in this case UK.NI - and the value of the Validation column for the field is the regular expression defined in the regExp property in the JSON.
  • Properties

    DbUser

    Optionally specify the database username to connect to the database.

    DbPassword

    Optionally specify the password used to connect to the database.

    DbType

    Select a database type to connect to.

    If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

    DbName

    Specify the name of the database to connect to.

    Note: If you're connecting to MS Access, enter the file path to your MS Access database. If you're connecting to an Oracle database, enter the system identifier (SID).

    If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

    DbHost

    Specify the database host to connect to e.g. localhost. Leave this property blank if connecting to MS Access.

    If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

    DbUrl

    Optionally specify the driver's database URL.

    If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required.

    Otherwise, specify the URL in the following format: jdbc:subprotocol:subname. For example, jdbc:sqlserver://localhost:1433;databaseName=CustomerData. Provided by the driver's vendor.

    See the help topic Acquiring data from a database for more information on configuring this property.

    DbDriver

    Optionally specify the class name of your JDBC driver.

    If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required.

    Otherwise, specify the class name of your JDBC driver, as provided by the driver's vendor.

    See the help topic Acquiring data from a database for more information on configuring this property.

    DbOptions

    Optionally specify database connection properties e.g. key=val&key=val2 or key=value entry per line.

    DbDriverClasspath

    Optionally specify the database driver classpath.

    If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required.

    Otherwise, specify the classpath with each classpath entry on a new line.

    Each classpath entry must be the location on the file system of either a single jar file, or a directory containing jars or class files.

    DbDriverNativeLibraryDirectories

    List of directories to search in for native libraries e.g. DLLs and shared libraries (.so).

    The directories will be used by the JVM search for any native libraries needed by the node or any of it dependencies. More specifically, the paths will be searched for any call to System.load(...) and System.loadLibrary(...).

    Additionally, the paths will be used when searching for the pre-load native libraries specified by the DbDriverNativeLibraries property.

    The directories must be absolute paths, or resolve to absolute paths via substitution. The set of allowable directories is also limited by the server side property ls.brain.node.jdbc.dbDriverNativeLibraryWhitelistPaths.

    Any directory which does not satisfy these conditions will be ignored. Set the LogLevel property to DEBUG to trace library resolution.

    DbDriverNativeLibraries

    A list of names, paths, or patterns of native libraries to pre-load into the JVM prior to loading the driver.

    The specified entries will be searched for in the directories listed in the DbDriverNativeLibraryDirectories property.

    Note that this will only load the libraries in the JVM memory space, and will not register them for JNI binding. It is assumed that the driver itself will make the appropriate calls to System.load() and System.loadLibrary().

    The primary use of this property is to load transitive dependencies for native libraries that would have normally be required to be specified via an environment variable e.g. PATH or LD_LIBRARY_PATH. The values can be in the following forms:

    • A library name which will be matched using the platforms specific naming library naming scheme, e.g. for Linux, "foo" would match "libfoo.so", for windows, it would match "foo.dll".
    • An absolute path.
    • A glob pattern of the form: glob:
    • A regex pattern of the form: regex:

    For both glob and regex patterns, the pattern will be matched against the entire absolute path of the library files.

    ProfileMode

    Choose from:

    • List Only - List the Tables and Columns discovered.
    • Count Only - List the Tables and Columns discovered along with the number of rows per table.
    • Profile - Execute a profile on all Tables & Columns discovered.

    The default value is List Only.

    CatalogFilter

    Optionally specify a list (comma separated) of names to constrain the Catalogs retrieved.

    Note: For the CatalogFilter property, you must type the name of a Catalog and an exact match is performed. For the other filter properties, you can type either a name, or you can type a string followed by the % character.

    Note: You can either use CatalogFilter or ExcludedCatalogs, but not both.

    SchemaFilter

    Optionally specify a list (comma separated) of filters to constrain the Schemas retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.

    Note: You can either use SchemaFilter or ExcludedSchemas, but not both.

    ExcludedCatalogs

    Optionally specify a list (comma separated) of catalogs that will be ignored.

    Note: You can either use CatalogFilter or ExcludedCatalogs, but not both.

    ExcludedSchemas

    Optionally specify a list (comma separated) of schemas that will be ignored.

    Note: You can either use SchemaFilter or ExcludedSchemas, but not both.

    TableViewFilter

    Optionally specify a list (comma separated) of filters to constrain the Tables/Views retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.

    ExcludeSystemCatalogs

    Optionally specify whether to exclude the system predefined Catalogs or not.

    The default value is True.

    ExcludeSystemSchemas

    Optionally specify whether to exclude the system predefined Schemas or not.

    The default value is True.

    DefaultToUnicode

    Optionally specify whether string columns should default to Unicode type.

    If set to False, string columns default to string type.

    The default value is True.

    UnicodeBinding

    Optionally specify how Unicode strings should be passed to the JDBC driver. Choose from:

    • CHARACTER - Bind the string using the mechanism for CHAR and VARCHAR types.
    • NATIONAL CHARACTER - Bind the string using the mechanism for NCHAR and NVARCHAR types.

    SqlDateType

    Optionally select a datetype in which to map a SQL DATE. Choose from:

    • date
    • datetime

    Sample Size

    Optionally specify the size of the sample set of the input data to analyze.

    The default is to process 100k rows.

    Maximum Cardinality

    Set the maximum cardinality that will be tracked. This value represents the number of discrete values for a field that will be tracked in order to analyze a field.

    The default value is 12000.

    JSON Detail Level

    Control the level of detail related to valid, invalid, and shape sets provided in the JSON output.

    • 0 - just counts.
    • 1 - first 100 rows.
    • 2 - all rows.

    The default value is 1.

    Locale

    Optionally specify the Locale to be used for analyzing the input stream.

    Locale must be specified in the format <language code>[_<country code>[_<variant code>]]

    For example, to set the locale to English, the Locale property should be set to en.

    For US English, the Locale property should be set to en_US.

    For Austrian German, the Locale property should be set to de_AT.

    The default value is the default Locale under which the server is running.

    LogicalTypeDefinitions

    Optionally provide the path to a JSON specification for additional Logical Types to be detected. For details, see Logical Type detection.

    Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

    FailureBehavior

    Optionally specify what to do when an error is detected. Choose from:

    • Error - Report error and stop further processing.
    • Log - Log a warning message (the default).
    • Ignore - Ignore the error.

    Inputs and outputs

    Inputs: 1 optional.

    Outputs: analysis, details, error.