Join

Matches records from two data sources to create an enriched data set with values from both inputs whenever the specified field is found to exist in both sources.

 

The Join node works in a similar way to the Merge and Lookup nodes. A key difference between the Join and Merge nodes is that the Join node outputs the results of a specific join type, whereas the Merge node outputs the results of a left, right and inner join to three separate output pins. The Lookup node is best used when you are enriching one, larger data set, with a second, smaller data set.

 

To configure the Join node:

  1. Choose a join type.

     

    The default join type is an inner join i.e. only the matches are output (data that is found in both the left and right data sets).

    If you want to change the join type, select an option from the list in the JoinMode property. For more details on the different join types, see Join types.

     

  2. Select the related fields on which you want to base the join.

    Note: The fields on which you base the join must be of a similar type. You can compare the number types int, long and double against each other, and fields of string data type can be compared to Unicode fields.

     

    Use the Match Keys Grid to select the names of the fields on which you want to base the join. You can add multiple rows if you want to base the join on more than one set of fields.

    From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list.

    The output records are sorted by the matching criteria. You have the option to change the sort order to Sort Descending (high to low).

    For more details on these options, see Match keys property.

     

    Alternatively, for more advanced techniques, select the Advanced tab, where you can use Python script to configure the node.

     

    You can choose not to enter any join keys, meaning that the node will output a full set of all possible matches and all records will match each other. This is not the default behavior of the node, therefore in this case you must also set the NoJoinKeyBehavior property to Log or Ignore. Note that with large data sets, this can take a very long time and produce very large amounts of output data.

    Tip: The Join node sorts input data by default. If the inputs are already sorted, you can improve the performance of the node by setting the optional SortLeftInput and SortRightInput properties to False.

     

  3. Select which fields you want to include in the output.

     

    By default, all fields from both inputs are included in the output. You can modify this behavior by excluding fields in the Field List grid.

    If there are any field name conflicts, where the same case-insensitive field name appears in both inputs, the node will fail. To prevent this type of error, use the Field List grid to either exclude or rename one of the fields.

    Excluding a field

    A simple way to resolve a field name conflict is to untick one of the conflicting field names in the Field List grid. It is common practice to exclude the right match key field, regardless of field name, to avoid duplication of values in the output.

    Renaming a field

    If you want to output both of the conflicting fields, you can rename one of them by typing a new name in the New Name column of the Field List grid.

    Alternatively, you can type a new pattern at the data Fields or lookup Fields level to rename all fields from that input, for example *_right.

    Note: If you have applied an Exclude pattern, you cannot rename the fields by pattern. If a set of fields has an existing rename pattern and is set to Exclude, the rename pattern is cleared.

    For more details on adding new patterns, and including, excluding and renaming fields, see Field list.

Example

You want to combine the following two data sources and output all records from the left data set and any corresponding matches from the right data set.

 

Data set one:

Order ID

int

Country

string

Customer ID

int

10030 UK 1
10031 USA 2
10032 Ireland 3

 

Data set two:

Order ID

int

Customer Name

string

Product ID

int

10030 Frank Peters 712
10031 Sarah Lyons 164
10032 James Curtis 7812

 

The "Order ID" field is common to both data sets, so this is the field on which we will base the join.

 

  1. Run the two input nodes and connect them to a Join node.
  2. In the JoinMode property, select Left.
  3. In the Match Keys Grid, select "Order ID" for both the left and right fields.
  4. In the Field List grid, untick the "Order ID" field from the right Fields list.
  5. Run the Join node.

The merged output is as follows:

Order ID

int

Country

string

Customer ID

int

Customer Name

string

Product ID

int

10030 UK 1 Frank Peters 712
10031 USA 2 Sarah Lyons 164
10032 Ireland 3 James Curtis 7812

Properties

JoinMode

Optionally specify the type of join which is to be performed by the node. Choose from:

  • Inner - Only handle records which match in both inputs.
  • Left - Handle records appearing in the left input and matched records.
  • Left Orphans - Only handle records which appear in the left input but not the right input.
  • Right - Handle records appearing in the right input and matched records.
  • Right Orphans - Only handle records which appear in the right input but not the left input.
  • Full Outer - Handle all records in the left and right input.
  • All Orphans - Handle all non-matched records in the left and right input.

The default value is Inner.

JoinType

Caution: This property has been deprecated and will not be supported in a future release. It is recommended that you use the JoinMode property which provides similar functionality.

Optionally specify the type of join which is to be performed by the node. Choose from:

  • Inner - Only handle records which match in both inputs.
  • Left - Only handle records which appear in the left input but not the right input.
  • Left Inner - Handle records appearing in the left input and matched records.
  • Right - Only handle records which appear in the right input but not the left input.
  • Right Inner - Handle records appearing in the right input and matched records.
  • Full Outer - Handle all non-matched records in the left and right input

The default value is Inner.

Match Keys

The simplest way to merge your data is by using the Match Keys Grid to select the fields on which you want to base the join. Select or type a field from the left data set, then select or type a field from the right data set.

From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).

You can add multiple rows if you want to base the join on more than one set of fields.

Alternatively, for more advanced techniques, select the Advanced tab.

ConfigureFields

Specify which fields to include in the node outputs using the Field List grid, or use the Advanced tab to specify the Python script to configure the node outputs and output fields.

This script is executed once in order to configure the output metadata prior to any records being processed.

The details of any mappings from input field to output field are stored in the mapper for the output which can be used in the ProcessRecords property.

If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g in1, out1).

The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs).

As this script is executed prior to reading any records, the inputs and outputs references refer specifically to the input and output metadata.

ProcessRecords

Specify the Python script to process the results of the merge operation and write output records.

This script is executed for every join result that is supported by the join types specified in the JoinType property

This can occur when a match is identified in the merge, or a record is seen which only appears on either the left or right input.

When a match is detected, joinResult.match will evaluate to True and the matching input records from the left and right input will be provided to the script.

When a record is only seen in the left input, joinResult.left will evaluate to True and the input record from the left input will be provided to the script.

When a record is only seen in the right input, joinResult.right will evaluate to True and the input record from the right input will be provided to the script.

Note that the script is only called for the types of joins supported by those specified in the JoinType property.

The default script attempts to map all of the input fields onto the corresponding output fields based on the mapping setup via the ConfigureFields property.

If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g in1, out1).

The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs).

The input and output references in this property refer to the input and output records as opposed to the metadata in the ConfigureFields property.

SortLeftInput

Optionally specify whether the first input will be sorted on the Left Key.

The default value is True whenever a Left Key is specified.

SortRightInput

Optionally specify whether the second input will be sorted on the Right Key.

The default value is True whenever a Right Key is specified.

VerifyInputsSorted

Optionally specify whether or not the node should validate that the inputs are correctly sorted during the merge operation.

The default value is False provided that the Left Key and Right Key are not specified or SortLeftInput and SortRightInput evaluate to True otherwise, the default value is True.

BufferSize

Optionally specify the maximum number of records from each input which can be held in memory.

The node only needs to hold records from both inputs in memory when there are large numbers of records on both inputs which each match the same key.

Generally, this property should not need to be set unless there are a lot of records from both inputs matching a given key and the records are too wide for 5,000 records from both inputs to be stored in memory, or if the records have a highly variable width in bytes - e.g. on most records a field is null, but occasionally is very large.

If no value is specified, the BufferSize will be dynamically calculated based on the width of the incoming records, where initially 5,000 records from each input may be loaded into memory.

NoJoinKeyBehavior

Optionally specify what to do if no join key is specified. Choose from:

  • Error - The node fails.
  • Log - The node will continue processing. An explanatory message is written to the log.
  • Ignore - The node will continue processing. Nothing is written to the log.

The default value is Error.

Example data flows

A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.

In the Directory under the /Data360 Samples/Node Examples/ folder, you will find "Joining and Blending Data", which shows examples of how to use this node.

Note: Upgrades will overwrite all data flows in the workspace. If you want to make changes to one of the sample data flows, we recommend you create a copy and save it elsewhere, using Save as...

Inputs and outputs

Inputs: left, right.

Outputs: out1.