Welcome!

Hi there! Hopefully, you've come for one or more of my pearls of wisdom, or at least a good laugh :)

Hit the main page for the latest from me, or the archives for the older (but not necessarily better) stuff.

Monday, 4 July 2011

Project Euler, Problem 2

Moving straight on to the second of the problems posed by Project Euler:

"By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms."

Again, my mind started thinking in terms of the sets that are involved in the question:
  1. The set of numbers that make us the Fibonacci sequence.
  2. The set of numbers less than or equal to four million.
  3. The set of even numbers.
The problem is looking for the sum of numbers that form the intersection between these sets.

In my last post, we covered using the modulus operator to determine if one number is divisible by another, so that's a straight-forward idea for us to implement.

Applying a condition such as the limit of four million given to us is similarly simple.

But how do we calculate the Fibonacci sequence in T-SQL?

Fibonacci? Shmibonacci!
Let's first look at the Fibonacci sequence. The first few terms of the sequence are:
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89...

Hopefully, you can see that each term after the first two is simply the sum of the previous two terms in the sequence, i.e. Fib(n) = Fib(n - 1) + Fib(n - 2).

Such sequences are sometimes referred to as recursive sequences because of this self-referencing way of calculating the next value in the sequence.

So, the question now becomes: "How do I perform recursion in T-SQL?" Well, just like in the first problem, you could use a cursor or WHILE loop to perform the recursion in a similar fashion to the iteration that we did in that problem. But again, I feel that there is a better solutiona available to us.

When discussing the first Euler problem, I introduced the Common Table Expression, or CTE, as a way of quickly defining sub-queries that can be re-used in a main query. However, I did not mention a really fun aspect of them: self-referencing. Not only can you define multiple CTEs, and reference another, prior CTE in a later CTE, but you can also reference the same CTE within itself if you wish.

Here's a piece of typical syntax:
WITH Recursion AS
(
 SELECT  table_1.a,
    table_1.b
 FROM  table_1
 UNION ALL
 SELECT  table_1.a,
    table_1.b
 FROM  table_1
 INNER JOIN Recursion
   ON Recursion.b = table_1.a
 WHERE  table_1.b IS NOT NULL
)
SELECT *
FROM Recursion
There are a few important things to note with this approach:
  1. Note the two sub-queries involved. The top half is known as an anchor member and the lower half is the recursive member.
  2. You can have multiple anchor members in a recursive CTE, but only one recursive member.
  3. All of the set-based operators (UNION, UNION ALL, EXCEPT and INTERSECT) can be used to separate anchor members from each other.
  4. Only the UNION ALL set-based operator can be used to separate the last anchor member from the recursive member.
  5. By definition, the recursive member is the only member allowed to reference the CTE it is part of, and even then it can only be referenced once; you cannot use aliasing to make two or more references to the CTE within a recursive member.
  6. To prevent infinite loops caused by queries that have no inherent stopping point, recursive CTEs by default will only recurse to 100 levels deep. Whilst this may be enough for most purposes, sometimes you just need to go deeper! You can use the MAXRECURSION query hint to override this depth limit. It has a maximum of 32767 or, if you are very, very confident that you won't cause an infinite recursion, you can disable the safety feature by setting MAXRECURSION to 0.
So, with this knowledge of recursion using CTEs, how do we solve this problem? This was my attempt:

WITH FibonacciGenerator AS
 (
  SELECT 1  AS n,
    0  AS [F(n-1)],
    1  AS [F(n)]
  UNION ALL
  SELECT n + 1,
    [F(n-1)].[F(n)],
    [F(n-1)].[F(n-1)] + [F(n-1)].[F(n)]
  FROM FibonacciGenerator AS [F(n-1)]
  WHERE [F(n-1)].[F(n-1)] + [F(n-1)].[F(n)] < 4000000
 )
 SELECT SUM(FibonacciGenerator.[F(n)]) AS Result
 FROM FibonacciGenerator
 WHERE FibonacciGenerator.[F(n)] % 2 = 0;
where n is the term in the sequence, F(n) is the Fibonacci value for n, and F(n-1) is the previous Fibonacci term to that one.

It renders the result of 4613732 in 1ms on my PC.

Saturday, 2 July 2011

Project Euler, Problem 1

So kicking off with Problem 1 from Project Euler:

"Add all the natural numbers below one thousand that are multiples of 3 or 5."

It's a straight-forward question. But how best to attempt it in T-SQL?

A programming mindset oriented towards procedural languages such as C and the associated family might simply use a WHILE loop to iterate through the numbers from 1 to 999, check each one if they are divisible by 3 or 5, and increment a counting variable when a given number is so.

That's fine, and T-SQL provides you with the tools to do so, but T-SQL's main aim is the manipulation of sets in a declarative fashion, not a procedural one.

So, the question became in my mind a series of sets...


Part one, "One sheep, two sheep, three sheep..."
Firstly, build the set of numbers from 1 to 999. There are several ways that you might choose to accomplish this, but none quite so neat as a concept I first saw in Itzik Ben-Gan's "Inside Microsoft SQL Server 2008: T-SQL Querying" (ISBN-13: 978-0735626034).

His approach makes use of Common Table Expressions (CTEs), a feature introduced in SQL Server 2005, that allows you to define a query that would otherwise have been defined in a sub-query before the body of your main query, then reference it by name within the main query like it was any other table in your database, like so:

Sub-query version -
SELECT  table_1.a,
   table_1.b,
   sub_query.c
FROM  table_1
INNER JOIN (SELECT c, b FROM table_2) AS sub_query
  ON sub_query.b = table_1.b

CTE version -
WITH sub_query AS
(
 SELECT c, b
 FROM table_2
)
SELECT  table_1.a,
   table_1.b,
   sub_query.c
FROM  table_1
INNER JOIN sub_query
  ON  sub_query.b = table_1.b

Now, admittedly the example I've given doesn't really help you understand the power of the feature, but it does show the difference between the two.

In my mind, a CTE's strengths are that:
  1. When using the same sub-query multiple times in a single query, you have a higher associated cost if you need to change it due to a bug or feature change, as you have to edit every instance of the sub-query correctly. This is error prone and not trivial for complex queries with multiple instances of a sub-query in it. When using CTEs you only need to edit the definition of the sub-query in the CTE once, and all uses of that CTE in your query are updated.
  2. You can define multiple CTEs before a given query. When you do this, each CTE can be referenced by the main query AND each CTE can also reference any other CTE placed before it at the start of the main query.
Ben-Gan's concept uses this latter principle to great effect. The basic idea is to firstly define a two-tuple query in the first CTE. You then use the prior-referencing principle to CROSS JOIN that first CTE with itself to build a set of four tuples. Make another CTE that CROSS JOINs that CTE with itself to build a set of 16 tuples, and so on.

In a small number of small queries, you will have effectively defined a set with millions of tuples in it. Now, those tuples just hold multiple copies of whatever you put in the two-tuple CTE at the start, so on their own they don't mean much of anything.

But, what if we were to combine them with another feature added in SQL Server 2005, the ROW_NUMBER() function?

ROW_NUMBER() is a function that will effectively output a unique number for every tuple you call it across in a set. Now, ordinarily you would do this to provide a way to rank similar tuples next to one another based on the relative orders of particular values in the tuples. But we have multiple copies of the same thing, so where do we get the ranking from here?

It's really a trick question! The syntax of the ROW_NUMBER() function (and other windowed aggregate functions) allows you to use a sub-query in the OVER clause used to define the ordering you want. Given that we know all our rows are the same, so we don't care how it orders the row numbers, we can use a (SELECT NULL) instead of a column name as our ordering clause. SQL Server then spits out a completely arbitrary set of numbers in a set to whatever limit we define. Overall, this part of the full query looks like this:
WITH N0 AS
 (
  SELECT 0 AS N
  UNION ALL
  SELECT 0
 ),
 N1 AS
 (
  SELECT  B.N
  FROM  N0 AS A
  CROSS JOIN N0 AS B
 ),
 N2 AS
 (
  SELECT  B.N
  FROM  N1 AS A
  CROSS JOIN N1 AS B
 ),
 N3 AS
 (
  SELECT  B.N
  FROM  N2 AS A
   JOIN N2 AS B
 ),
 N4 AS
 (
  SELECT  B.N
  FROM  N3 AS A
  CROSS JOIN N3 AS B
 )
 SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N
 FROM N4


Part two, tying off the loose ends...
Now that we have a quick way of building a set of numbers of whatever size we want, we can look at addressing the question posed. As a quick recap, this problem is looking for the sum of all numbers between 1 and 999 that can be divided by 3 or 5.

So how do we test that a given number in our set is divisible by 3 or 5?

We could perform those divisions on each number, and test that the output is a whole number, but that's a little round-the-houses. Instead, it's simpler to use the modulus operator, % in T-SQL. This operator provides the remainder left by dividing one number into another. If one number divides into another entirely, then the remainder is zero, so our test can be written as:
number % 3 = 0 OR number % 5 =0

Pulling this together with the number generator above and performing a SUM() on the numbers found, we're left with this query:

WITH N0 AS
 (
  SELECT 0 AS N
  UNION ALL
  SELECT 0
 ),
 N1 AS
 (
  SELECT  B.N
  FROM  N0 AS A
  CROSS JOIN N0 AS B
 ),
 N2 AS
 (
  SELECT  B.N
  FROM  N1 AS A
  CROSS JOIN N1 AS B
 ),
 N3 AS
 (
  SELECT  B.N
  FROM  N2 AS A
  CROSS JOIN N2 AS B
 ),
 N4 AS
 (
  SELECT  B.N
  FROM  N3 AS A
  CROSS JOIN N3 AS B
 ),
 Numbers AS
 (
  SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N
  FROM N4
 )
 SELECT  SUM(N) AS Result
 FROM  Numbers
 WHERE  N < 1000
   AND (
     N % 3 = 0
    OR N % 5 = 0
    );
 
On my home PC, this takes about 4ms to get the answer of 233168.

Friday, 1 July 2011

Toying with SQL

For some time now, I've been dabbling in Transact-SQL, Microsoft's interpretation of Structured Query Language, and SQL Server. Along the way I've learned a lot about both, for good and bad :)

I've learned a lot from reading around the subject. There are a lot of good authors out there that know a lot about SQL Server between them all, but I've always felt that I've learned best by doing, not reading.

To that end, I've been using the Project Euler problems to give me a way to exercise what I've read about.

Project Euler is all about applying a combination of mathematical and programming knowledge to answer stated mathematical problems. You're free to use whatever programming language you wish, one of the good reasons for using the site. By answering a problem successfully, you gain access to forums where other successful Eulerians give insight into their programs in their choice of language.

In this way you can gain an understanding of both other ways you might approach a given problem, and how other languages operate to achieve the same goal.

My intention over the next few posts is to discuss the problems I have solved in T-SQL so far. Both the solutions themselves and the techniques I've used to put them together will be fair game.

Hopefully, you'll see something useful amongst it all!