The main point of Relational
Databases is to reduce the amount of duplicated data in tables by “normalising”
the data – that is, reduce duplication and redundant data. For
instance, if we were creating a table of employees and their departments
in a business – rather than list the same department over and over
again, we could split out the departments into their own table and reference
that table in the employees table as below:
Employees Table
Employee_ID | First_Name | Last_Name | Department_ID |
1 | Joe | Bloggs | 1 |
2 | Fred | Jones | 2 |
3 | Bill | Smith | 3 |
4 | Ann | Jones | 2 |
5 | Sally | Dee | 1 |
6 | Jim | Lee | 2 |
Departments Table
Department_ID | Department_Name |
1 | Sales |
2 | Tech Support |
3 | Finance |
The problem then comes
when you want to list all employees and their department. In order
to do this we will need to JOIN the 2 tables in the SQL statement to
get the relevant data back. This is the statement we will use:
First_Name, Last_Name, Department_Name
FROM
Employees INNER JOIN Departments ON
Employee.Department_ID = Department.Department_ID
Which will give us the
following
First_Name | Last_Name | Department_Name |
Joe | Bloggs | Sales |
Sally | Dee | Sales |
Fred | Jones | Tech Support |
Ann | Jones | Tech Support |
Jim | Lee | Tech Support |
Bill | Smith | Finance |
Things to note from the
statement:
- We have specified
an INNER JOIN. An inner join returns only those rows from the
listed tables that match the criteria specified in the ON portion of
the JOIN clause. In this case we have said where Department_ID
in the Employees Table matches the Department_ID in the Departments
table.
- Although here
we have specified that it is an INNER JOIN, this is the default type
of join so you can just specify JOIN and get the same results.
- The JOIN statement
is in the FROM clause. This is part of the ANSI SQL-92 standard
for how to format Transact SQL statements. It is also quicker
and easier for SQL to evaluate joins in the FROM clause before it gets
to the WHERE clause.
- It is not required
but it is considered good practice to write all keywords in upper case.
Although Query Analyzer will colour-code keywords, this practice will
help to more easily identify and separate them.
There were also no duplicate
column names in our query so we just listed the columns we wanted and
listed the full table names in the ON portion of the JOIN clause.
If there were duplicate column names in the query we would have had
to have specified to SQL Server which table we wanted the column data
to be displayed from just like in the ON portion otherwise SQL will
complain that it doesn’t know which table you want the data from.
If we had to do this,
it would make our SQL statement look like:
Employees.First_Name,
Employees.Last_Name,
Departments.Department_Name
FROM
Employees INNER JOIN Departments ON
Employee.Department_ID = Department.Department_ID
Which looks unwieldy enough
when there are just a few columns – can you imagine a statement containing
30 columns and 4 JOINS written out like this?
Fortunately, SQL allows
us to use Table Aliases to replace the table names in most of the statement,
so we can rewrite our statement as below:
e.First_Name,
e.Last_Name,
d.Department_Name
FROM
Employees e JOIN Departments d ON
e.Department_ID = d.Department_ID
As you can see, we have
now replaced the table names with the letters ‘e’ and ‘d’ respectively.
We told SQL to recognise these as the table names in the FROM clause
when we said:
Employees e JOIN Departments
d
It did not have to be
a single letter it is just easier (and quicker to type!). Indeed
you will find that, once you are writing complex queries involving a
lot of tables, you will have to start using more characters. When
using aliases, try to use something that will identify the table –
in our case it was easy as we only had 2 tables so using the first letter
of each was enough to identify them - If our 2 tables were called Employees
and Entitlements we would have had to use more than one letter!
One important note to remember is that, once you have used a table alias
you must then use it everywhere else in the statement in place of the
table name.
Now lets get a bit more
complex. Instead of listing employees and what department they
are in, we want to list all the departments and which employees
work in them. Given the tables above and the knowledge we have
gained so far we could write a statement such as:
d.Department_Name,
e.First_Name,
e.Last_Name
FROM
Departments d JOIN Employees e ON
d.Department_ID = e.Department_ID
ORDER BY
Department_Name
* Note I have thrown in
an ORDER BY clause to order the listing by department – nothing to
do with joins it just makes the output neater
Which would give us:
Department_Name | First_Name | Last_Name |
Finance | Bill | Smith |
Tech Support | Jim | Lee |
Tech Support | Fred | Jones |
Tech Support | Ann | Jones |
Sales | Sally | Dee |
Sales | Joe | Bloggs |
Which is great, and just
what we wanted. But now, suppose our business expands and adds
some more departments so that the departments table looks like this:
Department_ID | Department_Name |
1 | Sales |
2 | Tech Support |
3 | Finance |
4 | Marketing |
5 | Legal |
If we run our statement
again, we will get exactly the same output! Where are our missing
departments? Because an inner join will only give us those rows
that match from both tables we don’t see the Legal department
or the Marketing department.
In order to see all the
departments regardless of whether an employee is assigned to that department
we will have to use an OUTER JOIN. An outer join returns all rows
from the outer table and those rows that match from the other (inner)
table.
There are 2 types of outer
join, a LEFT OUTER JOIN and a RIGHT OUTER JOIN. They do the same
thing except one takes the left most listed table as the outer table
and the other takes the right most listed table as the outer table.
It just allows you to be more flexible in how you can write your joins
and can become invaluable when you have to join 3 or more tables.
All we have to do to our
statement is change JOIN to read LEFT OUTER JOIN as below:
d.Department_Name,
e.First_Name,
e.Last_Name
FROM
Departments d LEFT OUTER JOIN Employees e
ON d.Department_ID = e.Department_ID
ORDER BY
Department_Name
And we will get the output
we wanted as below:
Department_Name | First_Name | Last_Name |
Legal | NULL | NULL |
Finance | Bill | Smith |
Tech Support | Jim | Lee |
Tech Support | Fred | Jones |
Tech Support | Ann | Jones |
Sales | Sally | Dee |
Sales | Joe | Bloggs |
Marketing | NULL | NULL |
Some things to note about
the statement and the output:
- The keyword
OUTER is optional so you can just use LEFT JOIN (or RIGHT JOIN).
- You can see
that we now have all the departments listed and where there was no matching
Department_ID in the Employees table we have the entry NULL against
First_Name and Last_Name from that table.
Now lets expand our employees
table. It is clear from looking at it that we have people working
in the same departments so lets promote some of them. We will
record this fact by adding a Manager_ID column. This will be populated
with an Employee_ID whenever a person has a manager, to indicate who
that manager is. This will leave our table looking like this:
Employee_ID | First_Name | Last_Name | Department_ID | Manager_ID |
1 | Joe | Bloggs | 1 | NULL |
2 | Fred | Jones | 2 | NULL |
3 | Bill | Smith | 3 | NULL |
4 | Ann | Jones | 2 | 2 |
5 | Sally | Dee | 1 | 1 |
6 | Jim | Lee | 2 | 2 |
This shows that Joe, Fred
and Bill have no managers (Indicated by the NULL in the column for this
value). Ann and Jim are managed by Fred and Sally is managed by
Joe.
We now need to list all
employees and the name of their manager. We know from above that,
as not everyone has a manager, we will have to use an OUTER JOIN but
we do not have another table to join to. We will have to use a
SELF JOIN – that is, join the table to itself - and this is where
table aliases really come into play. The statement we will use
is this:
e.First_Name,
e.Last_Name,
e2.First_Name AS Manager_Name
FROM
Employees e LEFT JOIN Employees e2 ON
e.manager_id = e2.employee_id
First_Name | Last_Name | Manager_Name |
Joe | Bloggs | NULL |
Fred | Jones | NULL |
Bill | Smith | NULL |
Ann | Jones | Fred |
Sally | Dee | Joe |
Jim | Lee | Fred |
Things to note about the
statement:
- Basically we
have used table aliases to create a “copy” of the employees table
and called it e2. Now we can use our copy of the employees table
just as if it was another table.
- You can also
see in the statement that I have aliased the column name as well so
that the column title makes more sense – you can do this in any SQL
statement if you feel that the actual column name doesn’t make much
sense. For instance, if we had named our columns Emp_FName and
Emp_LName it doesn’t look too good on a report so we could alias them
to First Name and Last Name. You can have spaces in these column
names as log as you enclose the name in quotes like below:
e.First_Name AS ‘First Name’,
e.Last_Name AS ‘Last Name’,
e2.First_Name AS ‘Manager Name’
FROM
Employees e LEFT JOIN Employees e2 ON
e.manager_id = e2.employee_id
First Name | Last Name | Manager Name |
Joe | Bloggs | NULL |
* As a test, how would
you rewrite our managers query above so that it listed managers and
their employees? My solution at the end of the article.
Up until now we have only
been dealing with 2 tables. What happens when you need to join
3 (or more) tables?
Our employees are hardworking
so lets allow them to work some overtime. To do this we will add
2 tables – an Overtime table and an Overtime_Allowed table.
Why 2 tables? Well,
we will be restricting who can take overtime based on department and
to comply with normalisation rules, rather then put all that information
into one table with lots of duplication we will create 1 table for the
overtime definition and then what is called a “link” table to contain
information about both departments and overtime. The new tables
are below:
Overtime Table
OTime_ID | OTime_Name | Pay_Scale |
1 | Weekday | 0 |
2 | Weekend | 1.5 |
3 | Bank Holiday | 2 |
Overtime_Allowed
Table
Dept_ID | OTime_ID | OTime_Allowed |
1 | 1 | Y |
1 | 2 | N |
1 | 3 | N |
2 | 1 | Y |
2 | 2 | Y |
2 | 3 | Y |
3 | 1 | Y |
3 | 2 | Y |
3 | 3 | N |
4 | 1 | Y |
4 | 2 | Y |
4 | 3 | N |
5 | 1 | Y |
5 | 2 | Y |
5 | 3 | Y |
Hopefully you can now
see the point of the 2 tables.
** NOTE – in reality
the OTime_Allowed column would be data type BIT (a 1 or a 0 to indicate
TRUE or FALSE or YES and NO in our case) I have made it a CHAR
field to keep the statement simple so we don’t get bogged down in understanding
the syntax of converting a BIT field into a ‘Yes’ or ‘No’ when the results
are displayed.
So now suppose we want
to list all departments and their overtime allowances. In order
to do this we will have to join the Departments, Overtime and Overtime_Allowed
tables.
When joining 3 or more
tables it is important to understand how SQL processes the joins so
that you can order them correctly to get the results you need.
I like to think of multi-table joins as like the game “six degrees
of Kevin Bacon” - you take each table in turn and relate it to the
next until finally you have linked Tom Cruise to Kevin Bacon in as few
hops as possible. But maybe that’s just me!
SQL processes joins in
much the same way - It takes the first two tables listed and creates
a result set for them. It then takes the next table and compares
it to the temporary table it has created containing the previous result-set
and, based on the conditions in the ON clause, creates a result set
and so on until it creates the final table it displays.
An example is needed:
d.Department_Name
AS Department,
o.OTime_Name AS ‘Overtime Allowed’
FROM
Departments d
JOIN Overtime_Allowed oa ON d.Department_ID = oa.Dept_ID
JOIN Overtime o ON oa.OTime_ID = o.OTime_ID
WHERE
oa.OTime_Allowed = ‘Y’
ORDER BY
d.Department_Name
Department | Overtime Allowed |
Finance | Weekend |
Finance | Weekday |
Legal | Weekend |
Legal | Weekday |
Legal | Bank Holiday |
Marketing | Weekend |
Marketing | Weekday |
Sales | Weekday |
Tech Support | Bank Holiday |
Tech Support | Weekend |
Tech Support | Weekday |
Here you can see that
we used the link table to get our results by joining Departments to
Overtime via Overtime_Allowed.
We have now covered the
2 main types of join – INNER JOIN and OUTER JOIN and
their various uses such as SELF JOIN. We have also covered
joining more than 2 tables. And still there is more to talk about!
There are 2 other types
of joins that we haven’t yet covered – CROSS JOIN and FULL OUTER JOIN.
These are less commonly used joins (ie – I’ve never had course to
use them!) but can come in handy.
Cross join:
A cross join on 2 tables
will return every possible combination of rows and columns in those
2 tables. This is called the Cartesian product of the 2 tables.
This means that if we were to join our Employees and Departments tables
in a cross join we would get 30 rows returned as Employees has 6 rows
and Departments has 5 (6*5=30)
When you specify a cross
join you can either use the CROSS JOIN keywords or separate the table
names with a comma:
First_Name,
Last_Name,
Department_Name
FROM
Employees CROSS JOIN Departments
First_Name,
Last_Name,
Department_Name
FROM
Employees,
Departments
Both would give the results
shown below:
First Name | Last Name | Manager Name |
Joe | Bloggs | Sales |
Fred | Jones | Sales |
Bill | Smith | Sales |
Ann | Jones | Sales |
Sally | Dee | Sales |
Jim | Lee | Sales |
Joe | Bloggs | Tech Support |
Fred | Jones | Tech Support |
Bill | Smith | Tech Support |
Ann | Jones | Tech Support |
Sally | Dee | Tech Support |
Jim | Lee | Tech Support |
Joe | Bloggs | Finance |
Fred | Jones | Finance |
Bill | Smith | Finance |
Ann | Jones | Finance |
Sally | Dee | Finance |
Jim | Lee | Finance |
Joe | Bloggs | Marketing |
Fred | Jones | Marketing |
Bill | Smith | Marketing |
Ann | Jones | Marketing |
Sally | Dee | Marketing |
Jim | Lee | Marketing |
Joe | Bloggs | Legal |
Fred | Jones | Legal |
Bill | Smith | Legal |
Ann | Jones | Legal |
Sally | Dee | Legal |
Jim | Lee | Legal |
Full outer join:
A full outer join on 2
tables would return all rows from both tables with Null entries in the
left columns where things don’t match with the right and vice versa.
For example, suppose a
temp worker joins the company – temp workers have no department so
the Department_ID field in Employees would be NULL. We already
have 2 departments that have no workers that we added earlier so a FULL
OUTER JOIN between Employees and Departments would then give us:
First_Name,
Last_Name,
Department_Name
FROM
Employees FULL OUTER JOIN Departments ON
Employee.Department_ID = Department.Department_ID
First_Name | Last_Name | Department_Name |
John | Deed | NULL |
Joe | Bloggs | Sales |
Sally | Dee | Sales |
Fred | Jones | Tech Support |
Ann | Jones | Tech Support |
Jim | Lee | Tech Support |
Bill | Smith | Finance |
NULL | NULL | Marketing |
NULL | NULL | Legal |
So there you go.
INNER JOIN, OUTER JOIN, SELF JOIN, CROSS JOIN and FULL OUTER JOIN.
Hopefully you now understand all the various ways SQL allows you to
join tables and you can make use of this to normalise your database
and still display the data you need.
Before we go though, some
final points on using joins:
- The columns
which are used in join statement have to be either the same data type
or compatible data types. Compatible data types means that SQL
Server can perform an implicit conversion on the data type when it comes
to compare the columns.
- The performance
of queries using joins can, in general, be increased if all columns
in a join are indexed.
The only thing that remains
is to see if you rewrote the managers and employees query in the same
way I did – my solution below:
e.First_Name AS Manager_First,
e.Last_Name AS Manager_Last,
e2.First_Name AS Employee
FROM
Employees e JOIN Employees e2 ON
e.Employee_ID = e2.Manager_ID
ORDER BY
e.First_Name
[…] Database “normalising” The main point of Relational Databases is to reduce the amount of duplicated data in tables by “normalising” the data – that is, reduce duplication and redundant data. […]
1 comment:
Very helpful information. Would need basic SQL knowledge to follow.
Post a Comment