Welcome!

Hi there! Hopefully, you've come for one or more of my pearls of wisdom, or at least a good laugh :)

Hit the main page for the latest from me, or the archives for the older (but not necessarily better) stuff.

Saturday, 2 July 2011

Project Euler, Problem 1

So kicking off with Problem 1 from Project Euler:

"Add all the natural numbers below one thousand that are multiples of 3 or 5."

It's a straight-forward question. But how best to attempt it in T-SQL?

A programming mindset oriented towards procedural languages such as C and the associated family might simply use a WHILE loop to iterate through the numbers from 1 to 999, check each one if they are divisible by 3 or 5, and increment a counting variable when a given number is so.

That's fine, and T-SQL provides you with the tools to do so, but T-SQL's main aim is the manipulation of sets in a declarative fashion, not a procedural one.

So, the question became in my mind a series of sets...


Part one, "One sheep, two sheep, three sheep..."
Firstly, build the set of numbers from 1 to 999. There are several ways that you might choose to accomplish this, but none quite so neat as a concept I first saw in Itzik Ben-Gan's "Inside Microsoft SQL Server 2008: T-SQL Querying" (ISBN-13: 978-0735626034).

His approach makes use of Common Table Expressions (CTEs), a feature introduced in SQL Server 2005, that allows you to define a query that would otherwise have been defined in a sub-query before the body of your main query, then reference it by name within the main query like it was any other table in your database, like so:

Sub-query version -
SELECT  table_1.a,
   table_1.b,
   sub_query.c
FROM  table_1
INNER JOIN (SELECT c, b FROM table_2) AS sub_query
  ON sub_query.b = table_1.b

CTE version -
WITH sub_query AS
(
 SELECT c, b
 FROM table_2
)
SELECT  table_1.a,
   table_1.b,
   sub_query.c
FROM  table_1
INNER JOIN sub_query
  ON  sub_query.b = table_1.b

Now, admittedly the example I've given doesn't really help you understand the power of the feature, but it does show the difference between the two.

In my mind, a CTE's strengths are that:
  1. When using the same sub-query multiple times in a single query, you have a higher associated cost if you need to change it due to a bug or feature change, as you have to edit every instance of the sub-query correctly. This is error prone and not trivial for complex queries with multiple instances of a sub-query in it. When using CTEs you only need to edit the definition of the sub-query in the CTE once, and all uses of that CTE in your query are updated.
  2. You can define multiple CTEs before a given query. When you do this, each CTE can be referenced by the main query AND each CTE can also reference any other CTE placed before it at the start of the main query.
Ben-Gan's concept uses this latter principle to great effect. The basic idea is to firstly define a two-tuple query in the first CTE. You then use the prior-referencing principle to CROSS JOIN that first CTE with itself to build a set of four tuples. Make another CTE that CROSS JOINs that CTE with itself to build a set of 16 tuples, and so on.

In a small number of small queries, you will have effectively defined a set with millions of tuples in it. Now, those tuples just hold multiple copies of whatever you put in the two-tuple CTE at the start, so on their own they don't mean much of anything.

But, what if we were to combine them with another feature added in SQL Server 2005, the ROW_NUMBER() function?

ROW_NUMBER() is a function that will effectively output a unique number for every tuple you call it across in a set. Now, ordinarily you would do this to provide a way to rank similar tuples next to one another based on the relative orders of particular values in the tuples. But we have multiple copies of the same thing, so where do we get the ranking from here?

It's really a trick question! The syntax of the ROW_NUMBER() function (and other windowed aggregate functions) allows you to use a sub-query in the OVER clause used to define the ordering you want. Given that we know all our rows are the same, so we don't care how it orders the row numbers, we can use a (SELECT NULL) instead of a column name as our ordering clause. SQL Server then spits out a completely arbitrary set of numbers in a set to whatever limit we define. Overall, this part of the full query looks like this:
WITH N0 AS
 (
  SELECT 0 AS N
  UNION ALL
  SELECT 0
 ),
 N1 AS
 (
  SELECT  B.N
  FROM  N0 AS A
  CROSS JOIN N0 AS B
 ),
 N2 AS
 (
  SELECT  B.N
  FROM  N1 AS A
  CROSS JOIN N1 AS B
 ),
 N3 AS
 (
  SELECT  B.N
  FROM  N2 AS A
   JOIN N2 AS B
 ),
 N4 AS
 (
  SELECT  B.N
  FROM  N3 AS A
  CROSS JOIN N3 AS B
 )
 SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N
 FROM N4


Part two, tying off the loose ends...
Now that we have a quick way of building a set of numbers of whatever size we want, we can look at addressing the question posed. As a quick recap, this problem is looking for the sum of all numbers between 1 and 999 that can be divided by 3 or 5.

So how do we test that a given number in our set is divisible by 3 or 5?

We could perform those divisions on each number, and test that the output is a whole number, but that's a little round-the-houses. Instead, it's simpler to use the modulus operator, % in T-SQL. This operator provides the remainder left by dividing one number into another. If one number divides into another entirely, then the remainder is zero, so our test can be written as:
number % 3 = 0 OR number % 5 =0

Pulling this together with the number generator above and performing a SUM() on the numbers found, we're left with this query:

WITH N0 AS
 (
  SELECT 0 AS N
  UNION ALL
  SELECT 0
 ),
 N1 AS
 (
  SELECT  B.N
  FROM  N0 AS A
  CROSS JOIN N0 AS B
 ),
 N2 AS
 (
  SELECT  B.N
  FROM  N1 AS A
  CROSS JOIN N1 AS B
 ),
 N3 AS
 (
  SELECT  B.N
  FROM  N2 AS A
  CROSS JOIN N2 AS B
 ),
 N4 AS
 (
  SELECT  B.N
  FROM  N3 AS A
  CROSS JOIN N3 AS B
 ),
 Numbers AS
 (
  SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N
  FROM N4
 )
 SELECT  SUM(N) AS Result
 FROM  Numbers
 WHERE  N < 1000
   AND (
     N % 3 = 0
    OR N % 5 = 0
    );
 
On my home PC, this takes about 4ms to get the answer of 233168.