Check Out Dave's New game for iPhone and iPod Touch: Smiled Out!

SQL

Difference Between Union and Union All

Posted in SQL on July 29th, 2010 by Dave Andrews – Be the first to comment

Sometimes it can be confusing what the difference between ‘union’ in a SQL statement and ‘union all’ means. The basic definition is that ‘union’ will combine a given set (query results) with the values of a second set that don’t already exist. ‘Union All’ will simply append the second set to the end of the first set, which includes all the records of the first set and all the records of the second set.

Let’s see an example.

1
2
3
4
5
SELECT 1 MyValue
 
UNION
 
SELECT 1 MyValue

Here we are combining two sets of data which each contain a single record, 1. If you execute this command, you will get back a single row which contains the value 1 in the MyValue column. This is because the new value we union in already existed in the first data set, so it did not include it.

MyValue
1

Let’s modify our query to use Union All.

1
2
3
4
5
SELECT 1 MyValue
 
UNION ALL
 
SELECT 1 MyValue

We now get back a set with two records, the 1 from the first set AND the 1 from the second set.

MyValue
1
1

Inserting Results of a Query into a Table

Posted in SQL on July 25th, 2010 by Dave Andrews – Be the first to comment

Sometimes its necessary to populate an existing table with the results of a query. This can be done very easily with an INSERT statement combined with a SELECT.

A typical insert query looks like this.

1
2
3
4
5
6
7
8
9
10
INSERT INTO Person 
     (Name, 
      Address, 
      Phone
     )
VALUES 
     ('Test Guy', 
      '123 Test St', 
      '555-5555'
     )

This works fine for just one row. But what if we had another table which contained the information we needed, and had 100000 rows that we wanted to copy over? Let’s just combine our INSERT with a SELECT!

1
2
3
4
5
6
7
8
9
10
11
INSERT INTO Person 
     (Name, 
      Address, 
      Phone
     )
SELECT
     Name,
     Address,
     Phone
FROM
     TableThatHasValues

It’s that easy! The SELECT query will execute and the results of that select query will be piped into the INSERT command, inserting every single row.

Selecting data into a new table

Posted in SQL on July 23rd, 2010 by Dave Andrews – Be the first to comment

If you have a series of data that you want to insert into a new table, or into a temporary table, it’s very simple. This can be used to create a simple backup of records, or to take a very long and complex query and save its results for future use. If you save into a table, that table won’t have any relationships that are in the existing table, just something to keep in mind.

Let’s say we have a query like this.

1
2
3
4
5
6
SELECT
     *
FROM
     Animals a
     INNER JOIN Animal_Details ad
          ON a.AnimalID = ad.AnimalID

This table will join together two tables and pull in the details of whatever animals are in the animals table. I just made this up quickly. Let’s have SQL save that to a new table.

1
2
3
4
5
6
7
8
SELECT
     *
INTO 
     AnimalDetailsCombined
FROM
     Animals a
     INNER JOIN Animal_Details ad
          ON a.AnimalID = ad.AnimalID

This query will create a brand new table called AnimalDetailsCombined and save the results of that query into the table. If you wanted to save to a temporary table, which would be deleted once the connection is closed, just add a hashtag (#) in front of the table name.

1
2
3
4
5
6
7
8
SELECT
     *
INTO 
     #AnimalDetailsCombined
FROM
     Animals a
     INNER JOIN Animal_Details ad
          ON a.AnimalID = ad.AnimalID

You now have a temporary table called #AnimalDetailsCombined with is the result of the query.

Using SQL To Accrue Values (Overtime in Time Entries)

Posted in SQL on March 15th, 2010 by Dave Andrews – 11 Comments

I’ve made it an exercise recently to write SQL code that will determine overtime hours and regular hours given a list of time entries. I wanted to do this through a single (well, nested) SQL query and to not use cursors to iterate through the entries.

The problem is that we have a table of time entries, the date the entries were made, and the employee who made them and we need to accrue the hours that employee worked in a given week. This is simple to do with iterators in code or cursors in T-SQL, but what fun would that be?

I’m going to simplify the problem by only looking at a single week of data. In a larger solution which looks over a larger range of dates such as a month you will need a subquery which assigns each week a number and group by that week number. In my full blown implementation of this, I used row_number() over (order by startdate) given a list of weeks with a startdate and enddate. But that is for a later post.

Let’s look at our week’s worth of data, which has 2 employees.

Table TimeEntries

TimeEntryID EmployeeID DateWorked Hours
1 1 2010-01-04 8
2 1 2010-01-05 10
3 1 2010-01-06 7.5
4 1 2010-01-07 10
5 1 2010-01-08 10
6 1 2010-01-09 4
7 2 2010-01-04 10
8 2 2010-01-05 10
9 2 2010-01-06 10
10 2 2010-01-07 10

As you can infer from the above, employee 1 worked overtime (more than 40 hours) starting on the 8th, and employee 2 worked no overtime, since the employee worked exactly 40 hours in that week.

Problem 1: Multiple Records in a Day The first problem we must overcome is calculating the total number of hours worked in a given day. This is because we can have an employee who puts in 4 hours in the morning and 4 hours in the afternoon. If we don’t have this in a single record, it will be difficult down the line when our complicated joins and subqueries come in.

So we build this query which will sum up the hours in a given day. We will use this in a couple of places.

1
2
3
4
5
6
7
8
9
SELECT
	te.EmployeeId,
	te.DateWorked,
	SUM(te.Hours) TotalHoursInDay
FROM
	TimeEntries te
GROUP BY
	te.EmployeeId,
	te.DateWorked

Very simple, we’re just summing up the hours per day. We’re going to use this query in several places.

Problem 2: Finding out which days are in overtime. The next thing we need to overcome is to come up with a means of determining which days should be in overtime. This is where “accrual” of sorts comes into play. We join each day’s record to every record before it. We then group that record and sum up the hours previous to the given day.

Let’s look at the query.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
SELECT
	te.EmployeeId,
	te.TotalHoursInDay,
	te.DateWorked,
	sum(previous.TotalHoursInDay) HoursUpToThisPoint,
	case when te.TotalHoursInDay + SUM(previous.TotalHoursInDay) > 40 then 'Y'
		 else 'N'
	end InOvertime
FROM
	(
		SELECT
			te.EmployeeId,
			te.DateWorked,
			SUM(te.Hours) TotalHoursInDay
		FROM
			TimeEntries te
		GROUP BY
			te.EmployeeId,
			te.DateWorked
	) te
	LEFT JOIN
	(
		SELECT
			te.EmployeeId,
			te.DateWorked,
			SUM(te.Hours) TotalHoursInDay
		FROM
			TimeEntries te
		GROUP BY
			te.EmployeeId,
			te.DateWorked
	) previous
		ON	te.EmployeeId = previous.EmployeeId AND
			te.DateWorked > previous.DateWorked
GROUP BY
	te.EmployeeId,
	te.TotalHoursInDay,
	te.DateWorked

The first subquery, te, creates a list of each day for each employee and the total hours for that day. This is to get the total current hours for the employee for the given day. We join that to the second subquery, previous. The Previous query generates a total number for all days PRIOR to the given day. (The logic comes from our join condition, te.DateWorked > previous.DateWorked)

The InOvertime flag will be used later on to determine overtime hours. It lets us know when the employee has crossed the threshold. That could probably be taken out, but it simplifies the logic later on.

So we get these results:

EmployeeID TotalHoursInDay DateWorked HoursUpToThisPoint InOvertime
1 8 2010-01-04 NULL N
1 10 2010-01-05 8 N
1 7.5 2010-01-06 18 N
1 10 2010-01-07 25.5 N
1 10 2010-01-08 35.5 Y
1 4 2010-01-08 45.5 Y
2 10 2010-01-04 NULL N
2 10 2010-01-05 10 N
2 10 2010-01-06 20 N
2 10 2010-01-07 30 N

As you can see, this subquery determines for us the total number of hours up until the given day, and when combined with the hours worked in the given day we know if the employee is in overtime or not.

Let’s put one more wrapper around this which will total up the hours and tell us how many hours were regular hours and how many were overtime.

Here is our final query.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
SELECT
	EmployeeId,
	sum(case when InOvertime = 'N' then TotalHoursInDay
	         when InOvertime = 'Y' AND HoursUpToThisPoint < 40 then  40 - HoursUpToThisPoint
		 else 0
	end) RegularHours,
	sum(case when InOvertime = 'Y' then
		 case when HoursUpToThisPoint > 40 then TotalHoursInDay
			  else HoursUpToThisPoint + TotalHoursInDay - 40
		 end
		 else 0
	end) OTHours
FROM
(
	SELECT
		te.EmployeeId,
		te.TotalHoursInDay,
		te.DateWorked,
		sum(previous.TotalHoursInDay) HoursUpToThisPoint,
		case when te.TotalHoursInDay + SUM(previous.TotalHoursInDay) > 40 then 'Y'
			 else 'N'
		end InOvertime
	FROM
		(
			SELECT
				te.EmployeeId,
				te.DateWorked,
				SUM(te.Hours) TotalHoursInDay
			FROM
				TimeEntries te
			GROUP BY
				te.EmployeeId,
				te.DateWorked
		) te
		LEFT JOIN
		(
			SELECT
				te.EmployeeId,
				te.DateWorked,
				SUM(te.Hours) TotalHoursInDay
			FROM
				TimeEntries te
			GROUP BY
				te.EmployeeId,
				te.DateWorked
		) previous
			ON	te.EmployeeId = previous.EmployeeId AND
				te.DateWorked > previous.DateWorked
	GROUP BY
		te.EmployeeId,
		te.TotalHoursInDay,
		te.DateWorked
) HoursQuery
GROUP BY
	HoursQuery.EmployeeId
EmployeeId RegularHours OTHours
1 40 9.5
2 40 0

What we have done here is add some logic that acts on whether or not the a given time entry is in overtime. We group by the employee so we can aggregate hours.

If the record is in overtime, it might mean the employee is already over 40 hours or it might mean that the current record puts the employee over 40 hours. So we have to add a check that will take the difference as regular hours, since the hours that take the employee up to 40 are still regular hours.

The second check is to calculate overtime. If the time entry puts the employee over 40, then take the amount over 40 as overtime. If the employee is already over 40, then the entire entry is overtime.

I hope this post helps you have the mindset of doing as much as possible with your queries, rather than writing code to calculate values which can easily be queried.

Using COALESCE() in Linq-To-SQL

Posted in CSharp, SQL on November 23rd, 2009 by Dave Andrews – 2 Comments

I had a situation today where I wanted to return a user’s name using Linq. Sure, I could have easily written some code for this, but for the sake of keeping my code simple and readable I wanted Linq to handle the building of the name.

Users Table

UserID FirstName LastName
1 David Andrews
2 Bob NULL

The Linq seemed easy:

1
2
3
4
5
6
7
8
9
10
public static string GetNameOfUser(int userID)
        {
            DatabaseDataContext dc = new DatabaseDataContext();
 
            var user = from u in dc.Users
                       where u.UserID == userID
                       select u.FirstName + " " + u.LastName;
 
            return (user.Count() == 0) ? "Unknown User" : user.First();
        }

However my results were not correct. Bob was left out because he did not have a last name value (it’s NULL in the database.) You will get results like this if you run that for each user (one at a time, but I put them in a table here.)

Results

Expected Value Received
David Andrews David Andrews
Bob <Empty String>

In SQL it is very common to use the COALESCE() function to handle problems with null values. But how can I do that in Linq? I didn’t want to write extra code to check for nulls and concatenate the values myself.

The answer is simple: ??. Yes, that’s right. Two question marks is the “coalesce” operator in Linq. It’s a good idea to wrap what you are testing in coalesce inside parenthesis as well.

What follows is my new function.

1
2
3
4
5
6
7
8
9
10
public static string GetNameOfUser(int userID)
        {
            DatabaseDataContext dc = new DatabaseDataContext();
 
            var user = from u in dc.Users
                       where u.UserID == userID
                       select (u.FirstName ?? "") + " " + (u.LastName ?? "");
 
            return (user.Count() == 0) ? "Unknown User" : user.First();
        }

As you can see, I’ve added ?? “” after both FirstName and LastName. This translates to:

coalesce(FirstName, ”)
coalesce(LastName, ”)

Which means that if FirstName is null you will get ”, which adds correctly in SQL to another string. Same goes for LastName.

Here are my new results:

Results

Expected Value Received
David Andrews David Andrews
Bob Bob

Join on Multiple Columns in Linq To SQL

Posted in CSharp, SQL on November 20th, 2009 by Dave Andrews – Be the first to comment

Today I had to create a Linq-To-SQL query which joined two tables. These tables had compound primary keys (table simplified for example):

Table Referrers

ServerID (PK) ReferrerID (PK) Value
1 1 My value 1
1 2 My value 2
1 3 My value 3

Table ReferrerInfo

ServerID (PK) ReferrerID (PK) Value
1 1 More info… 1
1 2 More info… 2
1 3 More Info… 3

So I wanted to join these together using a Linq-To-SQL query. The first thought that occured to me was to use a join:

1
2
3
var referrers = from r in Referrers
               join ri in ReferrerInfo on r.ServerConnectionID equals ri.ServerConnectionID ..... ??????
               select r;

So, apparently in C# you cannot do multiple columns in your join. I have to join on both ServerConnectionID and ReferrerID.

This page: http://www.onedotnetway.com/linq-to-sql-join-on-multiple-conditions/ explains one method to do so, but unfortunately that Linq query only works in VB.

So, I ended up writing a Linq-To-SQL that just uses the old style of joining, by using the where clause!

1
2
3
4
5
var referrers = from r in Referrers
                from ri in Referrer_Info
                where r.ServerConnectionID == ri.ServerConnectionID &&
                   r.ReferrerID == ri.ReferrerID
                select r;

This successfully executes my multi-column join.

Using Row_Number() to Enumerate and Partition Records in SQL Server

Posted in SQL on October 30th, 2009 by Dave Andrews – Be the first to comment

I had a situation recently where I had a table full of people records, where the people were divided into families. The business logic that needed to be followed was that I had to assign a “Twin Code” to each record. This meant that for each family in the database, if two or more members were born on the same day they should be treated as twins. The twins should be assigned a number enumerating them in order of birth. If the member was not a twin, they should just receive the twin code of 1.

Here’s an example table:

PersonID FamilyID FirstName LastName DateOfBirth
1 1 Joe Johnson 2000-10-23 13:00:00
2 1 Jim Johnson 2001-12-15 05:45:00
3 2 Karly Matthews 2000-05-20 04:00:00
4 2 Kacy Matthews 2000-05-20 04:02:00
5 2 Tom Matthews 2001-09-15 11:52:00

There are lots of ways to achieve the desired result, but the simplest is to just use a simple SELECT statement combined with the ROW_NUMBER() function with a couple parameters as to how to number the rows!

ROW_NUMBER() provides you with the number of a row in a given recordset, where you provide details on how to number the records. For example, if I just had to number the records above based solely upon the date of birth (ignoring families) then I would use this query:

1
2
3
4
5
6
7
8
9
10
11
SELECT
     [PersonID]
    ,[FamilyID]
    ,[FirstName]
    ,[LastName]
    ,[DateOfBirth]
    ,ROW_NUMBER() over (ORDER BY DateOfBirth) AS Number
FROM
	People
ORDER BY 
	PersonID

This just tells the ROW_NUMBER() function to order its numbering ascending by DateOfBirth. Notice that I apply an order myself later on in the query, which is different than the row_number() order. I would get these results:

PersonID FamilyID FirstName LastName DateOfBirth Number
1 1 Joe Johnson 2000-10-23 13:00:00 3
2 1 Jim Johnson 2001-12-15 05:45:00 5
3 2 Karly Matthews 2000-05-20 04:00:00 1
4 2 Kacy Matthews 2000-05-20 04:02:00 2
5 2 Tom Matthews 2001-09-15 11:52:00 4

The number field that is assigned to each record is in the order of DateOfBirth.

Ordering my numbering by DateOfBirth is just half of the picture. I also need to “group” the records by the FamilyID. This is where a clause in T-SQL that you might not be very familiar with comes into play: “PARTITION BY”. The PARTITION BY clause allows us to group the results within the call to ROW_NUMBER() without grouping them ourselves via a GROUP BY. It just tells the ROW_NUMBER what groupings to use when it does its counting.

Here is our final SQL statement, which achieves the business logic we wanted to implement.

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT 
       [PersonID]
      ,[FamilyID]
      ,[FirstName]
      ,[LastName]
      ,[DateOfBirth]
      ,ROW_NUMBER() over(PARTITION BY FamilyID, 
                         CONVERT(NVARCHAR(25), DateOfBirth, 111) 
                         ORDER BY DateOfBirth ASC) TwinCode
 
  FROM [People]
ORDER BY
	PersonID

In the ROW_NUMBER function above, I am doing several things. I’m grouping on FamilyID, and also grouping on a converted DateOfBirth. I convert the DateOfBirth to an nvarchar using the 111 conversion code, because that gets results like ’2009/10/11′ and ’2009/10/12′ which can easily be grouped by to achieve distinct dates.

Grouping on the Family, DateOfBirth, and then sorting by DateOfBirth ascending achieves the desired result for the ROW_NUMBER. Here are the results of the query:

PersonID FamilyID FirstName LastName DateOfBirth TwinCode
1 1 Joe Johnson 2000-10-23 13:00:00 1
2 1 Jim Johnson 2001-12-15 05:45:00 1
3 2 Karly Matthews 2000-05-20 04:00:00 1
4 2 Kacy Matthews 2000-05-20 04:02:00 2
5 2 Tom Matthews 2001-09-15 11:52:00 1

As you can see, the two people who qualify as twins above (Karly and Kacy) are enumerated correctly, with Karly receiving a 1 and Kacy receiving a 2. All the records that are not twins properly receive a 1.

SQL Server – Using Computed or Calculated Columns in Tables

Posted in SQL on October 21st, 2009 by Dave Andrews – Be the first to comment

You are probably used to writing views to access data in your tables when there is some sort of computation that must be made in a field. But did you know that you can have your tables make computations themselves, without running through a view?

This can be done with Computed (or Calculated) Columns. These columns are table-level expressions that can operate on the other fields in a given record.

Let’s create a table which uses computed columns. I am going to create a table called Programmers which allows me to store a programmer’s first name, last name, middle initial, and date of birth. The table will include two computed columns: one which combines the elements of the name into a FullName field, and a second column which tells me the programmer’s age. This will all be achieved directly in the table, without the use of a view.

First, let’s create the table. Here is the query to create all of the fields except the computed ones.

1
2
3
4
5
6
7
CREATE TABLE Programmers
(
	ProgrammerID INT IDENTITY(1,1) NOT NULL,
	FirstName NVARCHAR(30),
	LastName NVARCHAR(30),
	MiddleInit NCHAR(1),
	DateOfBirth DATETIME,

Now, let’s create our first computed column. The syntax is simple. Just begin with the name of the column, and then in parenthesis define the expression which will calculate the value of the column.

Let’s begin with the FullName calculation. Just add the LastName, a comma, FirstName, and MiddleInit, and then trim white space off the right to handle a missing initial.

1
2
3
        FullName AS (rtrim(coalesce(LastName, '') + ', ' +
            coalesce(FirstName, '') + ' ' + 
            coalesce(MiddleInit, ''))),

There we have it, our FullName calculation. Each field is encapsulated in coalesce to handle NULL values properly.

The last calculation will be the age. The age is simply the difference in years of the current date from the birth date.

1
2
	Age AS (datediff(year, DateOfBirth, getdate()))
)

We also add a following parenthesis to close out our “CREATE TABLE” statement.

Now lets insert some test data. I added a few records with some NULLs for good testing measure. All standard stuff here. Notice we are not inserting the FullName or Age values.

1
2
3
4
5
6
INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth)
     VALUES ('David', 'Andrews', 'C', '1984-09-20')
INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth) 
     VALUES ('Billy', 'Jenkins', NULL, '1990-01-20')
INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth) 
     VALUES ('Robert', 'Anderson', 'K', NULL)

Now lets test out our fields, using nothing more than a SELECT.

1
SELECT * FROM Programmers

We will get the following results:

ProgrammerID FirstName LastName MiddleInit DateOfBirth FullName Age
1 David Andrews C 1984-09-20 00:00:00.000 Andrews, David C 25
2 Billy Jenkins NULL 1990-01-20 00:00:00.000 Jenkins, Billy 19
3 Robert Anderson K NULL Anderson, Robert K NULL

I emphasized the calculated fields above. Our query did not calculate them, they were considered to be just a part of the table since they are calculated fields.

One thing to keep in mind about calculated fields is that they are difficult to modify. You have to DROP the field and then ADD it back with the same name. This can change the order of fields in your query if you use SELECT *. It can also affect any triggers you may have which rely on the fields being in a certain order.

Also keep in mind any overhead that calculated fields may produce. It’s a good idea to use them for absolutely basic, atomic information, such as what I presented above. Complex calculations could become taxing to your queries.

Understanding SQL Joins – The Left Join

Posted in SQL on October 21st, 2009 by Dave Andrews – Be the first to comment

SQL joins are a crucial part of anything more than the simplest of queries. Many programmers who do not fully comprehend SQL joins end up writing bloated software which will pull information from one table, store it, then run another query to get the information they want to join on. They then use code to process the join. It is much more efficient to have the SQL server handle the joining and processing of these records for you than to create custom code which ties tables together.

Let’s use these tables as an example.

AnimalTypes table

AnimalTypeID Name
1 Dog
2 Cat
3 Turtle
4 Ferret

Animals table

AnimalID AnimalTypeID Name Age
1 1 Dusty 5
2 3 Jonesey 2
3 2 Bonnie 1
4 3 Fiddler 3
5 1 Marci 1
6 5 Tails 2

Here we have a table called AnimalTypes which contains 4 types of animals, and a table called Animals which lists out animals and what type they are.

Let’s say we want to write a program which will display all the animals in our animals table, as well as the Name of the Type of the animal, not the ID of the type. If an animal has an unknown type, we want to display “UNKNOWN.” Here is the bad process that we want to avoid.

SELECT * FROM AnimalTypes
SELECT * FROM Animals

The bad program will then loop through all the animals and print out the Name field in the AnimalTypes results which corresponds to the given type ID. If one does not exist in the AnimalTypes results, the code could then print “UNKNOWN.”

It also might occur to you that you can write a query like this one, which would also be incorrect.

SELECT
      *
FROM
      Animals, AnimalTypes
WHERE
      Animals.AnimalTypeID = AnimalTypes.AnimalTypeID

This is the basis of an inner join. It can be written as an inner join query, but I wrote it in the manner above as an example. This query will return results, however you will only get animals who have a corresponding correct AnimalType. Our program wants to display all animals, regardless of whether or not they have a valid type selected. Using the tables from above, you would get these results:

AnimalID AnimalTypeID Name Age AnimalTypeID Name
1 1 Dusty 5 1 Dog
2 3 Jonesey 2 3 Turtle
3 2 Bonnie 1 2 Cat
4 3 Fiddler 3 3 Turtle
5 1 Marci 1 1 Dog

If you notice, the animal Tails is missing from the list. This is because Tails’ AnimalTypeID of 5 did not have a corresponding AnimalTypeID in the AnimalTypes table. In a query such as this, both values have to exist in both tables. This is not the case in our data, even though we want all animals to display.

What we need to use in this case is a left join or a left outer join. The left join will take each record from the left-hand side of the join, and tie it to records on the right-hand side. If the tie does not exist in the right-hand table, then the left-hand fields will still exist in the results, with NULL values for the right-hand.

Here is the query using a left join.

1
2
3
4
5
6
SELECT 
     * 
FROM 
     Animals
     LEFT JOIN AnimalTypes
          ON Animals.AnimalTypeID = AnimalTypes.AnimalTypeID

What you get here is every record from Animals, joined to the corresponding AnimalType record. In the case of Tails, we won’t get any AnimalType information.

AnimalID AnimalTypeID Name Age AnimalTypeID Name
1 1 Dusty 5 1 Dog
2 3 Jonesey 2 3 Turtle
3 2 Bonnie 1 2 Cat
4 3 Fiddler 3 3 Turtle
5 1 Marci 1 1 Dog
6 5 Tails 2 NULL NULL

A very common use of this join is in financial transactions, where you might be missing the category of the transaction but still want the amount included.

Conditional Sums in SQL Queries

Posted in SQL on September 29th, 2009 by Dave Andrews – 1 Comment

One thing that you may encounter frequently is a situation where you want to count the records in a table which are of a certain type. You want to do this as efficiently as possible, instead of running a new count query for each type. Let’s use the below table for reference.

Name Type Age
Buffy Dog 10
Dusty Dog 13
Clyde Dog 2
Bonnie Cat 1
Milky Ferret 1

We have above a table of animals. How can we determine the count of each type of animal? The simplest (and most inefficient) is to run a count() query for each type of animal. This would mean we have 3 queries to run, as there are 3 types of animals.

1
2
3
4
5
6
7
8
9
declare @numdogs int
declare @numcats int
declare @numferrets int
 
SELECT @numdogs = COUNT(*) FROM animals WHERE type = 'Dog'
SELECT @numcats = COUNT(*) FROM animals WHERE type = 'Cat'
SELECT @numferrets = COUNT(*) FROM animals WHERE type = 'Ferret'
 
SELECT @numdogs AS NumDogs, @numcats AS NumCats, @numferrets AS NumFerrets

We will get results from that query like this.

NumDogs NumCats NumFerrets
3 1 1

As you can imagine, this is terribly inefficient. A good answer in this case is to use a “conditional” sum in the SQL query. So rather than running COUNT(*) we will be running SUM(…). We will combine the SUM() call with a CASE statement, so that it counts correctly. This is the trick I hope to teach you. You can place CASE … END statements inside of aggregate functions.

Throw away all of our temporary variables, the 4 select statements involved, and let’s replace them with this single select.

1
2
3
4
5
6
SELECT 
	SUM(CASE WHEN type='Dog' then 1 else 0 end) AS NumDogs,
	SUM(CASE WHEN type='Cat' then 1 else 0 end) AS NumCats,
	SUM(CASE WHEN type='Ferret' then 1 else 0 end) AS NumFerrets
FROM 
	animals

We get these results:

NumDogs NumCats NumFerrets
3 1 1

Which is exactly what we got previously, but using a much more efficient method. The trick was to insert a CASE … END statement inside the SUM. The case will return a 1 when the type is the type we are looking for, and a 0 otherwise. The sum will then execute to add up all of those 0s and 1s. A 0 will not add anything, so we will effectively end up counting the values which match the type we are looking for.