Inserting a batch of random numbers

Let’s say you want to insert (or, for our example, update) a whole set of random numbers into a table.

You may try this:

UPDATE table SET rnum = RAND()

…only you find, to your amazement, that SQL Server has put the same number into each row.

Hmm.

In SQL Server, when rand() is called multiple times in the same query (e.g. for multiple rows in an update statement), it usually returns the same number.

Two problems:

* Firstly, the rand() function returns a number between 0 and 1.
* Secondly, when rand() is called multiple times in the same query (e.g. for multiple rows in an update statement), it usually returns the same number (which I suspect your algorithm above is trying to solve, by splitting it into multiple calls)

My favourite way around this problem is to use a function that’s guaranteed to return a unique value each time, like NEWID(), convert it to binary, and use it as the seed.

UPDATE table SET rnum = RAND(convert(binary(16),NEWID()))

This works because NEWID() is guaranteed to return a new GUID (a globally unique 16-byte number) each time it’s invoked. We must convert this to binary before using it, as RAND() won’t accept GUIDs as its seed.

So, although RAND() ordinarily gives the same random value for each row in an update, we get over the problem with RAND by giving it a different seed for each row using a function that gives a different result for each row.

Advertisement

MySQL: Mass email change

It’s not unheard of for a company to change e-mail domain in mid-thrust; maybe it’s been bought out, or rebranded, or the parent company has spun it off to its own brand.

Only you’ve got hundreds of employees, each one with their own email address, and your MySQL database is in dire need of updating to reflect this.

To get around this, you’ll need to replace the relevant part of each email string within an update statement, grabbing the hostname substring (after the ‘@’) with a REPLACE, and replacing it.

UPDATE table SET email=REPLACE(email,'OLDHOST.com', 'newhost.com');

Note: REPLACE() is case-sensitive, so if needs be, you can use LOWER(email) inside the REPLACE function if you need to catch all case possibilities, as below:

UPDATE table SET email=REPLACE(LOWER(email),'oldhost.com', 'newhost.com');

This will also convert all your email addresses to lowercase, so be aware of that.

Tables, tables, tables

So, what’s a temporary table? or an in-memory table? or a pivot table?

An in-memory table is a table in some platforms that’s stored entirely in memory. These don’t really exist in MS SQL, although you could say that it is a table that’s been entirely cached, and so doesn’t result in any physical (hard disk) reads when queried. In earlier versions, the DBCC PINTABLE command allowed the “pinning” of tables in memory, but this was deprecated in SQL Server 2005.

Often, a table-valued variable, @tablename, might be stored in memory (although this is not guaranteed), and declared in a batch or function, with no persistence.

A temporary table is a table that will be automatically dropped when it’s no longer needed, usually when the creating session is terminated. In MS SQL, they begin with a # (or two hashes if they’re global temporary tables, shared between multiple sessions), and are often created with a SELECT INTO #TEMPTABLE … style query.

A pivot table is a special form of query where the values in several rows are summarised, “pivoted” on an axis, and become columns, where the summary data then becomes the rows. Frequently this happens where you’ve rows sorted on dates; these may then be “pivoted” so you end up with a column for January, one for February, one for March, etc.

GROUP BY and single-row values

So, you’ve constructed a SQL statement with a GROUP BY clause, and you’re getting this message:

Column col is invalid in the select list because it is not
contained in either an aggregate function or the
GROUP BY clause.

Bear in mind that the result of a GROUP BY statement, or of a statement where one or columns uses an aggregate function, has each row containing a summary of other rows.

This means that if you try to include a column in your select clause that isn’t a summary (this includes values by which you’re grouping), then the server is going to have difficulties returning it; remember, it’s going to return one row per group, and any value that can’t be reduced to a single row per group will fail.

This gets interesting even when you’re hoping to return a single value per group, for example with CASE.

If the CASE expression you’re using is dependent on values of individual rows rather than summaries, you’ll also get the above message; you may only reference non-aggregate values (i.e. single-row values) in an aggregate function, or in the WHERE clause, so the solution would involve placing your CASE inside an aggregate function, in this case your SUM.

The same applies for the HAVING clause, as it is effectively selecting rows in the grouped resultset based on values in each group.

Choosing Between Database Platforms: MS SQL and MySQL, with a twist

Another StackOverflow question: Given a site using PHP and VB.NET, and a choice between Microsoft SQL Server and MySQL, which coupling would be easier to maintain?

VB.NET and Microsoft SQL Server are an obvious marriage, as are PHP and MySQL, but as both development languages are in use, we are faced with choosing between two odd combinations:

  • VB.NET and MySQL
  • or PHP and MS SQL

My response? I’d go with MySQL (possibly controversially!), although SQL Server is by far the superior platform, all things considered. If you’re comparing them as similar alternatives, you’re probably not going to use any of the features of MS SQL that make it the better platform, and so it’s not worth the extra hassle.

In summary, here’s why:

  • PHP’s support for MySQL is second to none (given the following caveat)
  • PHP’s support for SQL Server is suboptimal; Microsoft provide a PHP driver, and there are other techniques, but PHP’s simply not quite as database-agnostic as VB.NET
  • VB.NET, although it loves SQL Server, will happily talk to any OLEDB provider (e.g. an ODBC connection) with no problems whatsoever, and MySQL’s ODBC support is pretty mature.

SQL: Why nulls slow down some queries

Someone asked a question on StackOverflow today about why null values slow down certain queries. This is roughly what I answered.

The main issue with null values and performance is to do with forward lookups.

If you insert a row into a table, with null values, it’s placed in the natural page that it belongs to. Any query looking for that record will find it in the appropriate place. Easy so far….

…but let’s say the page fills up, and now that row is cuddled in amongst the other rows. Still going well…

…until the row is updated, and the null value now contains something. The row’s size has increased beyond the space available to it, so the DB engine has to do something about it.

The fastest thing for the server to do is to move the row off that page into another, and to replace the row’s entry with a forward pointer. Unfortunately, this requires an extra lookup when a query is performed: one to find the natural location of the row, and one to find its current location.

So, the short answer to the question of performance is yes, making those fields non-nullable will help search performance. This is especially true if it often happens that the null fields in records you search on are updated to non-null.

Of course, there are other penalties (notably I/O, although to a tiny extent index depth) associated with larger datasets, and then you have application issues with disallowing nulls in fields that conceptually require them, but hey, that’s another problem!