tMap allows you to perform many operations – you can join data sources, filter records, transform data and manage output paths. This component, unlike tJoin, which allows you to join only to two objects, allows you to join data from many sources.
In this training, I will show:
- what types of joins you can do in the tMap component,
- how to join data sources using tMap,
- how to create output for records that do not meet the conditions for joining.
Types of data joins in tMap
The tMap component allows two types of joins: Inner Join andI. Let’s quickly remind them what they are about.
If we join two tables using INNER JOIN, as a result we get only the records appearing in both tables. This is great illustrated by the figure below:
LEFT OUTER JOIN, also known as LEFT JOIN, as the result gives all records appearing in table A (first table, main) and only those records from table B that meet the join condition. Let’s look at the following figure:
Default join settings
Talend DI during joins of objects performed in the tMap component allows you to configure two important parameters: Lookup Model and Match Model.
The Lookup Model is set to Load once by default. This means that before performing the first join of the main table (for example A) with another table (B), Talend reads all records from table B and stores them on disk or in memory during the whole time of joining with table A (depending on settings Store temp data). Table B is only read once. Reload at each row means that table B will be reloaded for each record in the main table. These types of joins are used if table B is updated in real time and we want to join table A with the latest versions of table B records, or if table B contains a lot of records that should not be stored in memory because of the possibility of overflow. The third type of Lookup Model is Reload at each row (cache), which, as its name implies, prevents us from storing read data on disk. All read records are cached and if there are any changes in table B, only they will be topped up.
The Match Model is set by default as Unique match, which means that if there are duplicates for the join condition in Table B, we will get a unique result as the output and it will be the last record encountered in the table for this join condition. So you can call his action Last match. First match, if it finds duplicates in table B, it will load only the first record. However, All matches gives all possible joins from Table B for table A, so we can get duplicates during this type of join. There is one more option – All rows, which consists in making the Cartesian product of the sets. All rows from table B are joined with all records of main table A.
Joining data sources – example
In the example jobie we will join customer data (CUSTOMER_SRC) from the previous training with the data on organizations that you can download below:
Create a new process in Talend Data Integration and drag and drop to the designer canvas the CUSTOMER_SRC table created in the previous training. Then, after creating an ORGANIZATION object, also drag it to the designer. By dragging objects from the repository, we have already set all database parameters and queries.
Then, add the tMap component to the canvas and connect it to the CUSTOMER_SRC object, which will be our main table. Also link the ORGANIZATION tables as the source for the tMap component. The process should now look like this:
As the main table, Talend always marks the one that was first connected to tMap. The second and each subsequent table will be marked as lookup.
Before we set the join condition, let’s add two output objects – tLogRow. One of them will display data matched according to our condition, while the other will display data not matched. I called my paths main and reject, respectively.
Let’s set the tMap component, double click on it to enter the settings. From the first table, select the ORG_ID row and drag it to the table you are connecting to. For me, the main table is CUSTOMER_SRC (row1), while the table with which I join is ORGANIZATION (row2). If you want, you can change the path name (row1, row2) in the main process view by double clicking on it).
Go to tMap Setting and select Join Model, then change it to Inner Join.
In the next step, drag all the attributes from row1 and ORG_NAME from row2 to the output main. In the settings for output reject, change Catch lookup inner join reject to true. Records that will not match during INNER JOIN will be loaded with the reject output. Click OK to exit the tMap settings.
tLogRow displays all records in a new line by default, and the columns are separated by the “|” character. Having two tLogRow components as the target, it can be a bit embarrassing to see to which output the records belong to. The result will be something like this:
To quickly see which records have not joined with the ORGANIZATION table, we can change the result display settings in tLogRow. To do this, select the Component View -> Mode -> Table tab
Now the result is definitely more readable.
If you enjoyed this post please add the comment below or share this post on your Facebook, Twitter, LinkedIn or another social media webpage.
Thanks in advanced!