Archive for October, 2009

Using Row_Number() to Enumerate and Partition Records in SQL Server

Posted in SQL on October 30th, 2009 by Dave Andrews

I recently had a table full of person records, with the people divided into families. The business logic required me to assign a “Twin Code” to each record: within each family, any members born on the same day are treated as twins, and the twins are numbered in order of birth. Anyone who is not a twin simply receives a twin code of 1.

Here’s an example table:

PersonID  FamilyID  FirstName  LastName  DateOfBirth
1         1         Joe        Johnson   2000-10-23 13:00:00
2         1         Jim        Johnson   2001-12-15 05:45:00
3         2         Karly      Matthews  2000-05-20 04:00:00
4         2         Kacy       Matthews  2000-05-20 04:02:00
5         2         Tom        Matthews  2001-09-15 11:52:00

There are lots of ways to achieve the desired result, but the simplest is a plain SELECT statement combined with the ROW_NUMBER() function, given a couple of instructions on how to number the rows.

ROW_NUMBER() provides you with the number of a row in a given recordset, where you provide details on how to number the records. For example, if I just had to number the records above based solely upon the date of birth (ignoring families) then I would use this query:

SELECT
     [PersonID]
    ,[FamilyID]
    ,[FirstName]
    ,[LastName]
    ,[DateOfBirth]
    ,ROW_NUMBER() over (ORDER BY DateOfBirth) AS Number
FROM
	People
ORDER BY 
	PersonID

This just tells the ROW_NUMBER() function to number the rows in ascending order of DateOfBirth. Notice that the ORDER BY at the end of the query sorts the output by PersonID, which is independent of the ROW_NUMBER() ordering. I would get these results:

PersonID  FamilyID  FirstName  LastName  DateOfBirth          Number
1         1         Joe        Johnson   2000-10-23 13:00:00  3
2         1         Jim        Johnson   2001-12-15 05:45:00  5
3         2         Karly      Matthews  2000-05-20 04:00:00  1
4         2         Kacy       Matthews  2000-05-20 04:02:00  2
5         2         Tom        Matthews  2001-09-15 11:52:00  4

The number field that is assigned to each record is in the order of DateOfBirth.

Ordering my numbering by DateOfBirth is just half of the picture. I also need to “group” the records by FamilyID. This is where a T-SQL clause that you might not be very familiar with comes into play: PARTITION BY. The PARTITION BY clause lets us group the rows within the call to ROW_NUMBER() without collapsing them ourselves via a GROUP BY. It tells ROW_NUMBER() which groupings to use when it does its counting, and the numbering restarts at 1 for each group.
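As an intermediate step, here is a minimal sketch that partitions by FamilyID alone. It assumes the People table shown earlier; the column alias NumberInFamily is just an illustrative name.

```sql
-- Assumes the People table from the example above.
-- NumberInFamily is a hypothetical alias used only for this illustration.
SELECT
     PersonID
    ,FamilyID
    ,FirstName
    ,DateOfBirth
    ,ROW_NUMBER() OVER (PARTITION BY FamilyID ORDER BY DateOfBirth) AS NumberInFamily
FROM
    People
ORDER BY
    PersonID
```

With the sample data, the Johnsons would be numbered 1 (Joe) and 2 (Jim), and the Matthews family 1 (Karly), 2 (Kacy) and 3 (Tom). The numbering restarts for each family, but it does not yet single out the twins.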

Here is our final SQL statement, which achieves the business logic we wanted to implement.

SELECT 
       [PersonID]
      ,[FamilyID]
      ,[FirstName]
      ,[LastName]
      ,[DateOfBirth]
      ,ROW_NUMBER() over(PARTITION BY FamilyID, 
                         CONVERT(NVARCHAR(25), DateOfBirth, 111) 
                         ORDER BY DateOfBirth ASC) TwinCode
 
  FROM [People]
ORDER BY
	PersonID

In the ROW_NUMBER() function above, I am doing several things. I’m grouping on FamilyID, and also grouping on a converted DateOfBirth. I convert the DateOfBirth to an nvarchar using conversion style 111, because that produces values like '2009/10/11' and '2009/10/12', which drop the time of day and can easily be grouped on to get distinct dates.

Grouping on the Family, DateOfBirth, and then sorting by DateOfBirth ascending achieves the desired result for the ROW_NUMBER. Here are the results of the query:

PersonID  FamilyID  FirstName  LastName  DateOfBirth          TwinCode
1         1         Joe        Johnson   2000-10-23 13:00:00  1
2         1         Jim        Johnson   2001-12-15 05:45:00  1
3         2         Karly      Matthews  2000-05-20 04:00:00  1
4         2         Kacy       Matthews  2000-05-20 04:02:00  2
5         2         Tom        Matthews  2001-09-15 11:52:00  1

As you can see, the two people who qualify as twins above (Karly and Kacy) are enumerated correctly, with Karly receiving a 1 and Kacy receiving a 2. All the records that are not twins properly receive a 1.
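If you want to try this without creating a real table, the following is a rough, self-contained sketch. The table variable @People is a stand-in I made up for the People table. It also swaps in two variations that are not part of the query above: the time of day is stripped with DATEADD/DATEDIFF instead of a string conversion (on SQL Server 2008 and later, CAST(DateOfBirth AS DATE) should do the same job), and PersonID is added as a tie-breaker so the numbering stays deterministic if two records ever share the exact same DateOfBirth.

```sql
-- @People is a hypothetical table variable standing in for the People table.
DECLARE @People TABLE
(
    PersonID INT,
    FamilyID INT,
    FirstName NVARCHAR(30),
    LastName NVARCHAR(30),
    DateOfBirth DATETIME
);

INSERT INTO @People (PersonID, FamilyID, FirstName, LastName, DateOfBirth)
     VALUES (1, 1, 'Joe', 'Johnson', '2000-10-23T13:00:00');
INSERT INTO @People (PersonID, FamilyID, FirstName, LastName, DateOfBirth)
     VALUES (2, 1, 'Jim', 'Johnson', '2001-12-15T05:45:00');
INSERT INTO @People (PersonID, FamilyID, FirstName, LastName, DateOfBirth)
     VALUES (3, 2, 'Karly', 'Matthews', '2000-05-20T04:00:00');
INSERT INTO @People (PersonID, FamilyID, FirstName, LastName, DateOfBirth)
     VALUES (4, 2, 'Kacy', 'Matthews', '2000-05-20T04:02:00');
INSERT INTO @People (PersonID, FamilyID, FirstName, LastName, DateOfBirth)
     VALUES (5, 2, 'Tom', 'Matthews', '2001-09-15T11:52:00');

SELECT
     PersonID
    ,FamilyID
    ,FirstName
    ,LastName
    ,DateOfBirth
    ,ROW_NUMBER() OVER (
         -- DATEADD/DATEDIFF truncates the datetime to midnight of the same day
         PARTITION BY FamilyID, DATEADD(day, DATEDIFF(day, 0, DateOfBirth), 0)
         -- PersonID breaks ties between identical DateOfBirth values
         ORDER BY DateOfBirth ASC, PersonID ASC) AS TwinCode
FROM
    @People
ORDER BY
    PersonID;
```

With this sample data the TwinCode values should match the results above: 1, 1, 1, 2, 1.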

SQL Server – Using Computed or Calculated Columns in Tables

Posted in SQL on October 21st, 2009 by Dave Andrews

You are probably used to writing views to access data in your tables when there is some sort of computation that must be made in a field. But did you know that you can have your tables make computations themselves, without running through a view?

This can be done with Computed (or Calculated) Columns. These columns are table-level expressions that can operate on the other fields in a given record.

Let’s create a table which uses computed columns. I am going to create a table called Programmers which allows me to store a programmer’s first name, last name, middle initial, and date of birth. The table will include two computed columns: one which combines the elements of the name into a FullName field, and a second column which tells me the programmer’s age. This will all be achieved directly in the table, without the use of a view.

First, let’s create the table. Here is the query to create all of the fields except the computed ones.

CREATE TABLE Programmers
(
	ProgrammerID INT IDENTITY(1,1) NOT NULL,
	FirstName NVARCHAR(30),
	LastName NVARCHAR(30),
	MiddleInit NCHAR(1),
	DateOfBirth DATETIME,

Now, let’s create our first computed column. The syntax is simple. Just begin with the name of the column, follow it with the AS keyword, and then define in parentheses the expression which will calculate the value of the column.

Let’s begin with the FullName calculation. Just add the LastName, a comma, FirstName, and MiddleInit, and then trim white space off the right to handle a missing initial.

        FullName AS (rtrim(coalesce(LastName, '') + ', ' +
            coalesce(FirstName, '') + ' ' + 
            coalesce(MiddleInit, ''))),

There we have it, our FullName calculation. Each field is wrapped in coalesce so that a NULL in any one field does not turn the whole concatenated result into NULL.
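As a side check outside the CREATE TABLE statement, this small sketch shows why the coalesce calls matter. It assumes the default SQL Server behaviour where concatenating a NULL yields NULL; the variable @MiddleInit and the literal names are made up for the illustration.

```sql
-- @MiddleInit is a hypothetical variable standing in for a missing middle initial.
DECLARE @MiddleInit NCHAR(1);
SET @MiddleInit = NULL;

SELECT
     -- Concatenating the NULL directly makes the whole string NULL
     'Jenkins' + ', ' + 'Billy' + ' ' + @MiddleInit AS WithoutCoalesce
     -- coalesce swaps the NULL for an empty string; rtrim drops the trailing space
    ,RTRIM('Jenkins' + ', ' + 'Billy' + ' ' + COALESCE(@MiddleInit, '')) AS WithCoalesce;
```

Under that assumption, WithoutCoalesce comes back as NULL and WithCoalesce as 'Jenkins, Billy'.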

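The last calculation will be the age. Here the age is taken as the difference in years between the birth date and the current date. Keep in mind that datediff(year, ...) counts the calendar-year boundaries crossed, so it reads one year high for anyone whose birthday has not yet come around this year.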

	Age AS (datediff(year, DateOfBirth, getdate()))
)

We also add a closing parenthesis to finish our “CREATE TABLE” statement.
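If you need the age to be exact around birthdays, one possible adjustment is sketched below. This is not the expression used in the table above. It subtracts one whenever this year's birthday is still ahead of the reference date. The variables are hypothetical sample values, with @Today standing in for getdate() so the result is repeatable.

```sql
-- Hypothetical sample values; @Today stands in for getdate().
DECLARE @DateOfBirth DATETIME, @Today DATETIME;
SET @DateOfBirth = '19841231';
SET @Today = '20091021';

SELECT
     -- Counts year boundaries crossed: 2009 - 1984 = 25
     DATEDIFF(year, @DateOfBirth, @Today) AS YearBoundaries
     -- Subtract 1 if the birthday in the current year has not happened yet: 24
    ,DATEDIFF(year, @DateOfBirth, @Today)
        - CASE
              WHEN DATEADD(year, DATEDIFF(year, @DateOfBirth, @Today), @DateOfBirth) > @Today
              THEN 1
              ELSE 0
          END AS AgeInYears;
```

The same CASE expression could be used in the computed column by replacing the variables with DateOfBirth and getdate(). Note that DATEADD moves a February 29 birthday to February 28 in non-leap years, which may or may not be the convention you want.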

Now let’s insert some test data. I added a few records with some NULLs for good testing measure. All standard stuff here. Notice we are not inserting the FullName or Age values.

INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth)
     VALUES ('David', 'Andrews', 'C', '1984-09-20')
INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth) 
     VALUES ('Billy', 'Jenkins', NULL, '1990-01-20')
INSERT INTO Programmers(FirstName, LastName, MiddleInit, DateOfBirth) 
     VALUES ('Robert', 'Anderson', 'K', NULL)

Now let’s test out our fields, using nothing more than a SELECT.

SELECT * FROM Programmers

We will get the following results:

ProgrammerID  FirstName  LastName  MiddleInit  DateOfBirth              FullName            Age
1             David      Andrews   C           1984-09-20 00:00:00.000  Andrews, David C    25
2             Billy      Jenkins   NULL        1990-01-20 00:00:00.000  Jenkins, Billy      19
3             Robert     Anderson  K           NULL                     Anderson, Robert K  NULL

The last two columns, FullName and Age, are the calculated fields. Our query did not calculate them; they are simply part of the table.

One thing to keep in mind about calculated fields is that they are difficult to modify. You cannot change the expression in place; you have to DROP the field and then ADD it back with the same name. The re-added field goes to the end of the table, which can change the order of fields in your results if you use SELECT *. It can also affect any triggers you may have which rely on the fields being in a certain order.
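The drop-and-add step could look roughly like this. It assumes the Programmers table from above; the new expression (first name first) is only an illustration of a change you might want to make.

```sql
-- Assumes the Programmers table created earlier in this post.
-- Remove the existing computed column...
ALTER TABLE Programmers
    DROP COLUMN FullName;

-- ...then add it back under the same name with the new expression.
-- The "first name first" format here is a made-up example change.
ALTER TABLE Programmers
    ADD FullName AS (rtrim(coalesce(FirstName, '') + ' ' +
                           coalesce(LastName, '')));
```

After running this, a SELECT * would list FullName after Age rather than before it.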

Also keep in mind any overhead that calculated fields may produce. It’s a good idea to use them for basic, atomic information, such as what I presented above. Complex calculations could slow down every query that reads them.
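One option for reducing that overhead is to mark a computed column as PERSISTED, so SQL Server stores the value and updates it when the underlying fields change, rather than recalculating it on every read. The sketch below assumes the Programmers table from above; the column name FullNameStored is hypothetical and chosen so it does not clash with the existing FullName.

```sql
-- Assumes the Programmers table created earlier in this post.
-- FullNameStored is a hypothetical column name used only for this illustration.
ALTER TABLE Programmers
    ADD FullNameStored AS (rtrim(coalesce(LastName, '') + ', ' +
                                 coalesce(FirstName, '') + ' ' +
                                 coalesce(MiddleInit, ''))) PERSISTED;
```

This only works for deterministic expressions. The Age column could not be persisted this way, because getdate() returns a different value each time it is called.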

Understanding SQL Joins – The Left Join

Posted in SQL on October 21st, 2009 by Dave Andrews

SQL joins are a crucial part of anything more than the simplest of queries. Many programmers who do not fully comprehend SQL joins end up writing bloated software which will pull information from one table, store it, then run another query to get the information they want to join on. They then use code to process the join. It is much more efficient to have the SQL server handle the joining and processing of these records for you than to create custom code which ties tables together.

Let’s use these tables as an example.

AnimalTypes table

AnimalTypeID  Name
1             Dog
2             Cat
3             Turtle
4             Ferret

Animals table

AnimalID  AnimalTypeID  Name     Age
1         1             Dusty    5
2         3             Jonesey  2
3         2             Bonnie   1
4         3             Fiddler  3
5         1             Marci    1
6         5             Tails    2

Here we have a table called AnimalTypes which contains 4 types of animals, and a table called Animals which lists out animals and what type they are.

Let’s say we want to write a program which will display all the animals in our animals table, as well as the Name of the Type of the animal, not the ID of the type. If an animal has an unknown type, we want to display “UNKNOWN.” Here is the bad process that we want to avoid.

SELECT * FROM AnimalTypes
SELECT * FROM Animals

The bad program will then loop through all the animals and print out the Name field in the AnimalTypes results which corresponds to the given type ID. If one does not exist in the AnimalTypes results, the code could then print “UNKNOWN.”

It also might occur to you that you can write a query like this one, which would also be incorrect.

SELECT
      *
FROM
      Animals, AnimalTypes
WHERE
      Animals.AnimalTypeID = AnimalTypes.AnimalTypeID

This is the basis of an inner join. It can be written with explicit inner join syntax, but I wrote it in the manner above as an example. This query will return results; however, you will only get animals that have a matching AnimalType. Our program wants to display all animals, regardless of whether or not they have a valid type selected. Using the tables from above, you would get these results:

AnimalID  AnimalTypeID  Name     Age  AnimalTypeID  Name
1         1             Dusty    5    1             Dog
2         3             Jonesey  2    3             Turtle
3         2             Bonnie   1    2             Cat
4         3             Fiddler  3    3             Turtle
5         1             Marci    1    1             Dog

Notice that the animal Tails is missing from the list. This is because Tails’ AnimalTypeID of 5 does not have a corresponding AnimalTypeID in the AnimalTypes table. In a query such as this, the value has to exist in both tables. This is not the case in our data, even though we want all animals to display.
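For reference, the comma-style query above can be sketched with explicit inner join syntax as follows. It assumes the same Animals and AnimalTypes tables and should return the same five rows, still without Tails.

```sql
-- Assumes the Animals and AnimalTypes tables from the example above.
-- Equivalent to listing both tables in FROM and matching them in WHERE.
SELECT
     *
FROM
     Animals
     INNER JOIN AnimalTypes
          ON Animals.AnimalTypeID = AnimalTypes.AnimalTypeID
```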

What we need to use in this case is a left join, also called a left outer join. The left join takes each record from the left-hand side of the join and ties it to the matching records on the right-hand side. If no match exists in the right-hand table, the left-hand record still appears in the results, with NULL values in the right-hand fields.

Here is the query using a left join.

SELECT 
     * 
FROM 
     Animals
     LEFT JOIN AnimalTypes
          ON Animals.AnimalTypeID = AnimalTypes.AnimalTypeID

What you get here is every record from Animals, joined to the corresponding AnimalType record. In the case of Tails, we won’t get any AnimalType information.

AnimalID  AnimalTypeID  Name     Age  AnimalTypeID  Name
1         1             Dusty    5    1             Dog
2         3             Jonesey  2    3             Turtle
3         2             Bonnie   1    2             Cat
4         3             Fiddler  3    3             Turtle
5         1             Marci    1    1             Dog
6         5             Tails    2    NULL          NULL
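To finish the original requirement of showing “UNKNOWN” for an animal with no matching type, one way is to wrap the type name in COALESCE. This is a sketch against the same Animals and AnimalTypes tables; the alias AnimalTypeName is an illustrative name of my own.

```sql
-- Assumes the Animals and AnimalTypes tables from the example above.
-- AnimalTypeName is a hypothetical alias used only for this illustration.
SELECT
     Animals.AnimalID
    ,Animals.Name
    ,Animals.Age
     -- The left join leaves AnimalTypes.Name as NULL when there is no match
    ,COALESCE(AnimalTypes.Name, 'UNKNOWN') AS AnimalTypeName
FROM
     Animals
     LEFT JOIN AnimalTypes
          ON Animals.AnimalTypeID = AnimalTypes.AnimalTypeID
ORDER BY
     Animals.AnimalID
```

With the sample data, Tails would come back with an AnimalTypeName of UNKNOWN, while the other five animals keep their real type names.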

A very common use of this join is in financial transactions, where you might be missing the category of the transaction but still want the amount included.
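As a rough illustration of that financial case, here is a self-contained sketch. The table variables, column names, and amounts are all invented for the example.

```sql
-- @Categories and @Transactions are hypothetical tables with made-up data.
DECLARE @Categories TABLE (CategoryID INT, Name NVARCHAR(30));
DECLARE @Transactions TABLE (TransactionID INT, CategoryID INT, Amount DECIMAL(10, 2));

INSERT INTO @Categories (CategoryID, Name) VALUES (1, 'Groceries');
INSERT INTO @Categories (CategoryID, Name) VALUES (2, 'Rent');

INSERT INTO @Transactions (TransactionID, CategoryID, Amount) VALUES (1, 1, 42.50);
INSERT INTO @Transactions (TransactionID, CategoryID, Amount) VALUES (2, 2, 900.00);
INSERT INTO @Transactions (TransactionID, CategoryID, Amount) VALUES (3, NULL, 15.00);
INSERT INTO @Transactions (TransactionID, CategoryID, Amount) VALUES (4, 1, 7.50);

SELECT
     -- Transactions without a matching category are reported under one label
     COALESCE(c.Name, 'Uncategorized') AS Category
    ,SUM(t.Amount) AS Total
FROM
     @Transactions t
     LEFT JOIN @Categories c
          ON t.CategoryID = c.CategoryID
GROUP BY
     COALESCE(c.Name, 'Uncategorized')
ORDER BY
     Category;
```

Here the totals should be 50.00 for Groceries, 900.00 for Rent, and 15.00 for Uncategorized. With an inner join, the 15.00 transaction would silently drop out of the report.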