Inserting a batch of random numbers

Let’s say you want to insert (or, for our example, update) a whole set of random numbers into a table.

You may try this:

UPDATE table SET rnum = RAND()

…only you find, to your amazement, that SQL Server has put the same number into each row.

Hmm.

In SQL Server, when rand() is called multiple times in the same query (e.g. for multiple rows in an update statement), it usually returns the same number.

Two problems:

* Firstly, the rand() function returns a number between 0 and 1.
* Secondly, when rand() is called multiple times in the same query (e.g. for multiple rows in an update statement), it usually returns the same number (which I suspect your algorithm above is trying to solve, by splitting it into multiple calls)

My favourite way around this problem is to use a function that’s guaranteed to return a unique value each time, like NEWID(), convert it to binary, and use it as the seed.

UPDATE table SET rnum = RAND(convert(binary(16),NEWID()))

This works because NEWID() is guaranteed to return a new GUID (a globally unique 16-byte number) each time it’s invoked. We must convert this to binary before using it, as RAND() won’t accept GUIDs as its seed.

So, although RAND() ordinarily gives the same random value for each row in an update, we get over the problem with RAND by giving it a different seed for each row using a function that gives a different result for each row.

Cascading Parameter Values in SSRS

In SQL Server Reporting Services (SSRS), you might want your choice in the first parameter to limit the choices available in the second (and subsequent) parameters.

This works automagically if you order your parameters and datasets correctly

  • First, set up a primary (report) dataset, then a dataset for each parameter dropdown. Code the WHERE clause in the datasets to make the dependencies correct across parameter variables
  • Secondly, order your parameters in the Report | Parameters menu so that the first variable/parameter you want the user to fill in is at the top, and make the second dataset depend on that parameter. Follow this ordering through the parameters; the final parameter(s) should be the one(s) the actual report dataset depends on.
  • Repeat for subsequent parameters

This will work if your WHERE clause in the second and subsequent datasets have variables that SSRS knows are populated from earlier parameters.

As an example, I have three datasets from the venerable pubs database.

pubslist is used to populate the @p parameter, and looks like this:

 select pub_id, pub_name from publishers 

titleslist populates the @t parameter, and looks like this:

  select title_id, title from titles where pub_id = @p

Finally, reportdataset looks like this:

  select title, price, ytd_sales from titles where title_id = @t

The order of the parameters in the Report | Report Parameters menu is crucial; because the datasets must be executed in the order shown above, and the @t parameter is in a dataset that relies on the @p parameter being set first, we move @p to the top of the list.

Now, SSRS evaluates the dataset needed to fill the dropdown for the first parameter with labels. It relies on a dataset that doesn’t need a parameter, so can be produced immediately.

Then, having got that parameter value, it can populate the second parameter’s dropdown. That in turn results in the report being produced.

Tables, tables, tables

So, what’s a temporary table? or an in-memory table? or a pivot table?

An in-memory table is a table in some platforms that’s stored entirely in memory. These don’t really exist in MS SQL, although you could say that it is a table that’s been entirely cached, and so doesn’t result in any physical (hard disk) reads when queried. In earlier versions, the DBCC PINTABLE command allowed the “pinning” of tables in memory, but this was deprecated in SQL Server 2005.

Often, a table-valued variable, @tablename, might be stored in memory (although this is not guaranteed), and declared in a batch or function, with no persistence.

A temporary table is a table that will be automatically dropped when it’s no longer needed, usually when the creating session is terminated. In MS SQL, they begin with a # (or two hashes if they’re global temporary tables, shared between multiple sessions), and are often created with a SELECT INTO #TEMPTABLE … style query.

A pivot table is a special form of query where the values in several rows are summarised, “pivoted” on an axis, and become columns, where the summary data then becomes the rows. Frequently this happens where you’ve rows sorted on dates; these may then be “pivoted” so you end up with a column for January, one for February, one for March, etc.

SQL Server: Database Diagrams

A little feature that some people don’t know about but would find very very useful: database diagrams in SQL Server. If a picture paints a thousand words, then this feature will help you pick up on the structure of data much more easily.

Within SQL Server, there’s the “Database Diagram” container in each database, in Management Studio.

You can create a new diagram in there, and include all (or some) of your tables, so showing where the relationships are, and what the tables are.

You could also create several different diagrams, each showing a subset of tables and relationships useful to a particular use case.

GROUP BY and single-row values

So, you’ve constructed a SQL statement with a GROUP BY clause, and you’re getting this message:

Column col is invalid in the select list because it is not
contained in either an aggregate function or the
GROUP BY clause.

Bear in mind that the result of a GROUP BY statement, or of a statement where one or columns uses an aggregate function, has each row containing a summary of other rows.

This means that if you try to include a column in your select clause that isn’t a summary (this includes values by which you’re grouping), then the server is going to have difficulties returning it; remember, it’s going to return one row per group, and any value that can’t be reduced to a single row per group will fail.

This gets interesting even when you’re hoping to return a single value per group, for example with CASE.

If the CASE expression you’re using is dependent on values of individual rows rather than summaries, you’ll also get the above message; you may only reference non-aggregate values (i.e. single-row values) in an aggregate function, or in the WHERE clause, so the solution would involve placing your CASE inside an aggregate function, in this case your SUM.

The same applies for the HAVING clause, as it is effectively selecting rows in the grouped resultset based on values in each group.

AWE (Address Windowing Extensions) in SQL Server

Memory is cheap these days; it’s not unusual to find laptops with 4GB or more (and yes, I know this post won’t date well!). This is a bit of a problem with some operating systems, though: 4GB is the maximum memory that many system are capable of handling. Yes, that’s right. Anything over 4GB is wasted on many of today’s operating systems, because of a limit imposed by 32 bit processors on how much memory they may address[1].

This becomes rather a problem on SQL Server installations, where having a lot of available memory is quite beneficial.

It’s worth mentioning at this point that 64-bit systems don’t suffer from the same memory limitations as 32-bit systems, so much of this article is irrelevant if your system is running on a 64-bit edition of Windows. The issue we’re talking about here is that 32 bit systems can ordinarily access only 4GB of RAM in total, which under default conditions means that your SQL Server instance can only access 2GB of useful RAM for its own needs, and putting one of these in your development laptop is quite pointless.

However, it is possible to grant access to more memory, but it does depend on a number of conditions.

Physical Address Extension

Since Windows 2000, Microsoft have supported Physical Address Extension (PAE) in a kernel, although the support requires using a non-default kernel. To use PAE, both the hardware and the operating system must support it.

Various 32 bit Windows servers support from 8-128GB, depending on edition. Windows 2000 Advanced and Datacenter Servers, Windows Server 2003 Enterprise and Datacenter Editions, and Windows Server 2008 Enterprise and Datacenter Edition support more than 4GB in 32 bit editions, by using PAE, with Windows 2003 Datacenter hitting the top score of 128GB.

Address Windowing Extensions

Address Windowing Extensions (AWE) is an API in Windows, used by SQL Server to support larger amounts of physical memory. It shouldn’t surprise you by now that AWE relies on PAE, which in turn has the other prerequisites as described above, so in order to enable AWE in SQL Server (using sp_configure), PAE must be supported by the underlying operating system.

The benefits are obvious; SQL Server will have access to more physical memory, which will make its job a lot easier. Your buffer cache will love you, as more and more data can be pulled out of memory rather than causing page hits on your oh-so busy hard disk. The drawbacks are slightly less obvious; memory granted to SQL Server using AWE is non-paged, which means it’s allocated until SQL Server is shut down. This will cause problems is SQL Server is sharing its host with other instances or other AWE-aware services. So keep your maximum memory settings tuned, and enjoy the headroom!


Footnote:
32 bits is 4 bytes (of 8 bits each). Just like a 3-digit number can store up to 1000 different values — 0-999 — an 8-bit number can store from 0-256. A 6-digit number can store 1000 x 1000, which is 1,000,000 different values, 0 – 999,999. Similarly, a 16-bit number can store 256 x 256, which is 65,536 different values. 32 bits can store 65,536 x 65,536 values, which is a number comfortably over 4 billion. This is where we get the 4GB upper limit for most 32 bit systems; they simply can’t point to the location of memory outside this range.

SQL Server Compatibility Levels – to Upgrade or Not to Upgrade?

When you upgrade from one SQL Server type to another (most of my experience has been with 2000 -> 2005), one of the options available is to stick with the existing compatibility level, rather than upgrade.

SQL Server 2000 defaults to “Compatibility Level 80”, with other options including 65, 70. These represent the features and compatibility options for version 8.0 (SQL Server 2000), 6.5 and 7.0 respectively. In other words, it’s possible to coerce a database into thinking it’s running with features consistent with an earlier release of SQL Server.

When upgrading to 2005, it’s possible to retain compatibility level 80, but the recommended route is to switch to 90. But should you?

There are two ways to look at this:

* If you want to change your code (or install some collaborating SQL2005/2008 servers) in the future, your compatibility mode will affect that decision
* If you haven’t run the Upgrade Adviser, you may not be able to upgrade; compatibility level 90 disables certain syntaxes and features that were available in 80.

So, when upgrading, you don’t have to increase your compatibility level, but then you’ve to consider your future with the server too; you might be happy to run on the latest SQL Server instance, but with old-school database settings and syntaxes. On the other hand, collaboration with other servers and applications, or maintenance by future programmers and DBAs, will probably require the switch to a higher compatibility level.

Note: it’s entirely possible to have several databases with different compatibility levels on a single instance.

Tricks with ORDER BY

The ORDER BY clause in SQL Server is a bit more flexible than many people give it credit for.

In a training course I’m giving at the moment, we’ve just gone over a few of the lesser known uses, along with the widely used ones:

  • The obvious: ordering by a column name
    ORDER BY advance
  • Less obvious: ordering by an expression on a column
    ORDER BY id % 2 -- even IDs first, then odd
  • Ordering by multiple fields
    ORDER BY id % 2 desc, price
           -- odd orders first by even/odd IDs, then by prices within each
  • Ordering by visible column position
    SELECT id, customerid, price * quantity as "gross revenue"
    FROM orders
    ORDER BY 2
  • Ordering by column aliases
    SELECT id, customerid, price * quantity as "gross revenue"
    FROM orders
    ORDER BY "gross revenue"

The last one is particularly interesting, because column aliases can’t be used in other places e.g. in the where clause or from clause.

Running a command on a remote server with SQL Server

Another StackOverflow question: How do I run an OS command on a remote server, from within SQL Server? I answered as follows.

Something like rsh might do – have your local SQL instance invoke xp_cmdshell with an rsh command (or a batch file), connecting to the remote server and executing the command.

If you have SQL on the remote machine, it’d be far easier; call a stored procedure on that machine from the local machine, and let that stored procedure do the work (might need fiddling with proxy accounts and “execute as”).

A third option would be to create a .NET stored procedure, and do your socket programming or remote invocation from there – it runs in-process, so you get similar same performance and security benefits you’d get from writing it in T-SQL rather than hopping out to a cmdshell with all the mess that entails.

The rsh solution won the day.

Having said that, the rsh solution is not tied to SQL Server — rsh can be invoked from anything capable of running shell commands locally, even a batch file, or a Perl script.

Rsh is a feature Microsoft added to Windows a few versions ago, although a variety of it has been available in resource kits since NT 4.0. In simplest terms, it provides a remote command line, not unlike telnetting into a Unix server, only you get the Windows CMD prompt. Rsh allows the invoker to invoke a single command remotely, using command-line options, and without the need to write any further server-side code. This makes it suited to the quick-and-dirty route taken by the recipient of the above answer.

Data Warehousing: To Primary Key or not to Primary Key?

Constraints are a necessary part of the tug-and-thrust of database design, but once in a while, when designing a data warehouse, someone comes up with the question “Do I need to map those constraints to my DW?” The simple answer is “Yes”, but it’s also a somewhat naive answer that doesn’t take into account the needs of a DW.

For your primary key, consider using a surrogate or alternate key; you’ll need to cater for slowly changing dimensions, e.g. if you’re doing a report over the last 5 years on average sales per married/unmarried salesperson, you’ll want to register the fact that someone was unmarried for 2 years, then married for the last 3. This means your data warehouse will have two dimension-table rows for the same person. Following the OLTP structure for that will be hard, to say the least. Note that it’s not enough to create a composite key on the original primary key and e.g. marital status — the same person may also have other fields that change occasionally, such as address (zipcode), department, job title, or even gender!

As for other constraints such as default value or check constraints, we need to consider how a DW is populated, and what the arrival of data to the DW means. The DW is invariably populated from an OLTP (relational) system, the same system that typically has all the original constraints. Ideally, this means the data is already clean by the time the DW gets it. It also means that the performance hit required by checking these constraints in writing to the database has already been taken. And OLTP system is typically optimised for writes and reads, particularly the large tables that change frequently and are the subject of most of the storage in a DW.

On the other hand, Data Warehouses are massively optimised for reads (assuming you’re populating as a batch, with reasonably high latency), and constraints don’t really factor in read operations. You can typically get around any constraint issues with your DW populating job, if they haven’t already been dealt with in the OLTP system, and deal with nulls etc. at the analysis/reporting tools if necessary. It’s far more important to make sure default values fit with your conceptual data model, and don’t introduce issues at the DW client tools.