The SELF JOIN
The JOIN operator can be used to combine any pair of tables, including combining the table to itself. This is the "self join". Self joins can use any JOIN operator.
For instance, check this classical example of returning an employee's boss (based on
Table1). In this example, we consider that the value in
field2 is in fact the boss' code number, therefore related to
key1.
SELECT t1.key1, t1.field1 as Name,
t1.field2, mirror.field1 as Boss
FROM Table1 t1
LEFT JOIN Table1 mirror ON t1.field2 = mirror.key1;
And this is the output to this query.
key1 | Name | field2 | Boss |
3 | Erik | 8 | Harry |
4 | John | 3 | Erik |
6 | Mark | 3 | Erik |
7 | Peter | 8 | Harry |
8 | Harry | 0 | null |
In this example, the last record shows that Harry has no boss, or in other words, he is the #1 in the company's hierarchy.
Excluding the Intersection of the Sets
Checking the previous Venn diagrams I just showed above, one may come to a simple question: what if I need to get all records from
Table1 except for those that match with records in
Table2. Well, this is pretty useful in day-to-day business, but obviously we don't need a special JOIN operator to do it.
Observe the result sets above and you will see you only need to add a WHERE clause to your SQL statement, looking for records that have a NULL value for
Table2's key. So, the result set we are looking is the red area shown in the Venn diagram below Figure.
Non-matching records from Table1.
We can write a LEFT JOIN for this query, for instance:
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.key2 = t2.key2
WHERE t2.key2 IS NULL;
And, finally, the result set will be:
key1 | Name | T1Key | T2Key | City |
6 | Mark | 7 | null | null |
7 | Peter | 8 | null | null |
When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed.
One Word about Execution Plans
These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed.
This is as true for SQL Server as any other RDBMS.
Having a basic understanding of how SQL works is important for any DBA or developer. This will help you to get things done. In fast and reliable way. If you are curious about this, just take a look in the execution plan for the query above, shown in below figure.
The execution plan to a query using LEFT JOIN
Joins and Indexes
Take a look again at the Execution Plan of that query. Notice it used the clustered indexes of both tables. Using indexes is the best way to make your query run faster. But you have to pay attention to some details.
When we write our queries, we expect the SQL Server Query Optimizer to use the table indexes to boost your query performance. We can also help the Query Optimizer choose the indexed fields to be part of your query.
For instance, when using the JOIN operator, the ideal approach is to base the join condition over indexed fields. Checking again the Execution Plan, we notice that the clustered index on
Table2 was used. This index was automatically built on
key2 when this table was created, as
key2 is the primary key to that table.
On the other hand,
Table1 had no index on field
key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying
key2 using the only available index. This was the table clustered index, based on
key1, the primary key on
Table1. You see the query optimizer is really a smart tool. But you would help it a lot creating a new index (a non-clustered one) on
key2.
Remembering a bit about referential integrity, you see
key2 should be a foreign key on
Table1, because it is related to another field in other table (which is
Table2.key2).
Personally I believe foreign keys should exist in all real-world database models. And it is a good idea to create non-clustered indexes on all foreign keys. You will always run lots of queries, and also use the JOIN operator, based on your primary and foreign keys.
(IMPORTANT: SQL Server will automatically create a clustered index on primary keys. But, by default, it does nothing with foreign keys. So make sure you have the proper settings on your database).
Non-equal Comparisons
When we write SQL statements using the JOIN operator, we usually compare if one field in one table is equal to another field in the other table. But this is not the mandatory syntax. We could use any logical operator, like different than (<>), greater than (>), less than (<) and so on.
Although this fancy stuff might give you the impression that SQL gives so much power, I feel this is more like a cosmetic feature. Consider this example. See Table 1 above , where we have 5 records. Now let's consider the following SQL statement.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.key2 <= t2.key2
WHERE t1.key1 = 3 ;
Notice this uses an inner join and we are specifically picking one single record from
Table1, the one where
key1 is equal to 3. The only problem is that there are 6 records and
Table2 that satisfy the join condition. Take a look in the output to this query.
key1 | Name | T1Key | T2Key | City |
3 | Erik | 1 | 1 | New York |
3 | Erik | 1 | 2 | Sao Paulo |
3 | Erik | 1 | 4 | Paris |
3 | Erik | 1 | 5 | London |
3 | Erik | 1 | 6 | Rome |
3 | Erik | 1 | 9 | Madrid |
The problem with non-equal joins is that they usually duplicate records. And this is not something you will need in a regular basis. Anyway, now you know you can do it.
Multiple JOINs
SQL JOINs are always about putting together a pair of tables and finding related objects that obey a given rule (usually, but not limited to, equal values). We can join multiple tables. For instance, to combine 3 tables, you will need 2 joins. And a new join will be necessary for each new table. If you use a join in each step, to combine N tables, you will use N-1 joins.
One important thing is that SQL allows you to use different types of joins in the same statement.
But DBAs and developers have to be careful on joining too many tables. Several times, I have seen situations where queries demanded 10, 20 tables or even more. For performance reasons, it is not a good idea to do a single query to put all data together. The query optimizer will do a better job if you break your query it into several smaller, simpler queries.
Now consider we have a third table, called
Table3, shown bellow.
Table3
key3 | field1 |
1 | Engineer |
2 | Surgeon |
3 | DBA |
4 | Lawyer |
5 | Teacher |
6 | Actor |
Now let's write a statement to bring the employee's name, the city where he lives and what is his profession. This will demand us to join all 3 tables. Just remember that joins are written in pairs. So first we will join
Table1 to
Table2. And then we will join
Table1 and
Table3. The resulting script is shown below.
SELECT t1.key1, t1.field1 as Employee,
t2.key2, t2.field1 as City,
t3.key3, t3.field1 as Profession
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.key2 = t2.key2
INNER JOIN Table3 t3 ON t1.key3 = t3.key3;
As we are running only INNER JOINs, we will have only records that match the combination of the 3 tables. See the output below.
key1 | Name | key2 | City | key3 | Profession |
3 | Erik | 1 | New York | 6 | Actor |
4 | John | 4 | Paris | 4 | Lawyer |
6 | Harry | 9 | Madrid | 2 | Surgeon |
Beyond the SELECT statements
The use of JOIN operators is not restricted to SELECT statements. In T-SQL, you can use joins in INSERT, DELETE and UPDATE statements as well. But keep in mind that in most RDBMS's we have nowadays, joins are not supported on DELETE and UPDATE statements. So I would advise you to restrict the use of joins to SELECT and INSERT statements only, even in SQL Server code. This is important if you want to keep your scripts more easily portable to different platforms.
Source:
SQL Server Central