Wednesday, May 2, 2012

TRANSACT-SQL: JOIN...ON condition or WHERE condition - which is better?

A colleague suggested I write about this topic as she often wonders about it, but doesn't really understand the implications and has never had a chance to test it.

Basically, the question is whether or not it's better to add a condition in a query in the WHERE clause at the end of the query or to make it part of an ON clause in the JOIN when writing queries in Transact-SQL.

The very short answer is: It depends.

The short answer is:
  • INNER JOIN: It really doesn't matter and is a case of personal preference more than anything else.
  • OUTER JOIN (LEFT, RIGHT, or FULL): There's a logical difference that in some cases will cause result sets to differ, so it depends on what you're trying to achieve with your query.
  • CROSS JOIN: Has no ON clause and therefore isn't relevant for this topic.
Now comes the long answer.

Let's look at INNER JOIN first.

By means of example, say you have two tables called Manufacturer and Product respectively. You want to run a query that returns a list of all products with the manufacturer's name next to it. This is fairly easy to achieve like so:
SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID

Now, suppose that you want to add a condition to this. For instance, you only want a list of products for Manufacturers from a specific country, which you specify in a string literal. This could be done in one of two ways.
Option 1, adding a WHERE clause:
SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID WHERE Manufacturer.Country = 'Country1'

or Option 2, adding a condition to the JOIN clause:

SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID AND Manufacturer.Country = 'Country1'

The difference is subtle and the result set will be exactly the same. So which is better?

The answer is neither, or both. At least in terms of performance there is no difference between these queries. If you look at the query execution plan for these two queries, you will find that they are exactly the same. Basically SQL server will try to optimize the query regardless of where you put the condition.

To prove the point, I created some random data in the Product and Manufacturer tables, using 500 manufacturers, 150 countries and 1000000 products and then ran the query for one country using both methods. I did this twenty times for each query and the results were very closely matched. This wasn't a very well controlled test environment (I'm simply running it on my laptop with lots of other background services busy all the time), so saw a fair bit of variance and initially it looked like the option 2 might be faster, but over time the two averages converged. Most significantly the fastest times for each option were identical down to the last millisecond (i.e. on the odd occasion that nothing else was consuming processing power on my laptop, they were really identical).

Personally, my preference is generally to put these kinds of conditions in the ON clause of the join. This makes more logical sense to me as it feels like I'm limiting the result set earlier in the logical process and therefore I find this easier to read, understand and maintain. However, in practice it actually makes no difference, so it really is personal preference. There might be a good argument for putting all your conditions in one place (the WHERE clause) so you can easily see all the limitations that the code is putting on the result set even when dealing with multiple joins.

Now lets look at OUTER JOINS. I'm not going to go into all three types of outer join as for purposes of this article they will all behave similarly, so let's stick to the one that is most frequently used, LEFT JOIN.

With a left join, the two options are logically quite different. Looking at the above two queries, a left join will deliver vastly different results in the two respective options.

Option 1 will produce the exact same result set as the inner join does. This is partly because the Product table is first in the query and the foreign key constraint on Product.ManufacturerID and the unique key on Manufacturer.ID ensures that there will be no NULL countries nor multiple rows for a single product (i.e. every Product row will successfully be matched to a unique row in Manufacturer) and partly because the WHERE clause narrows the selection down to a specific country only.

However, option 2 will produce a completely different result. In this instance, the result set will be a complete list of products. Rows where the country condition is met will appear in the result set with a manufacturer name, but rows that do not meet this condition will still appear in the list, only with a NULL value instead of an actual manufacturer name.

In conclusion, for LEFT JOIN (and the same goes for RIGHT and FULL OUTER JOIN), the decision rests on what you're trying to achieve. Whether you want rows not meeting the condition to appear in the result set with NULL values for the secondary table or whether you prefer if these values are not there.

Note: Testing for this article was done on SQL Server 2008 R2 developer edition.

1 comment:

  1. If anyone is interested, the scripts to generate the tables and data that I used in testing this are available here

    ReplyDelete