MySQL: Mass email change

It’s not unheard of for a company to change e-mail domain in mid-thrust; maybe it’s been bought out, or rebranded, or the parent company has spun it off to its own brand.

Only you’ve got hundreds of employees, each one with their own email address, and your MySQL database is in dire need of updating to reflect this.

To get around this, you’ll need to replace the relevant part of each email string within an update statement, grabbing the hostname substring (after the ‘@’) with a REPLACE, and replacing it.

UPDATE table SET email=REPLACE(email,'OLDHOST.com', 'newhost.com');

Note: REPLACE() is case-sensitive, so if needs be, you can use LOWER(email) inside the REPLACE function if you need to catch all case possibilities, as below:

UPDATE table SET email=REPLACE(LOWER(email),'oldhost.com', 'newhost.com');

This will also convert all your email addresses to lowercase, so be aware of that.

Tables, tables, tables

So, what’s a temporary table? or an in-memory table? or a pivot table?

An in-memory table is a table in some platforms that’s stored entirely in memory. These don’t really exist in MS SQL, although you could say that it is a table that’s been entirely cached, and so doesn’t result in any physical (hard disk) reads when queried. In earlier versions, the DBCC PINTABLE command allowed the “pinning” of tables in memory, but this was deprecated in SQL Server 2005.

Often, a table-valued variable, @tablename, might be stored in memory (although this is not guaranteed), and declared in a batch or function, with no persistence.

A temporary table is a table that will be automatically dropped when it’s no longer needed, usually when the creating session is terminated. In MS SQL, they begin with a # (or two hashes if they’re global temporary tables, shared between multiple sessions), and are often created with a SELECT INTO #TEMPTABLE … style query.

A pivot table is a special form of query where the values in several rows are summarised, “pivoted” on an axis, and become columns, where the summary data then becomes the rows. Frequently this happens where you’ve rows sorted on dates; these may then be “pivoted” so you end up with a column for January, one for February, one for March, etc.

GROUP BY and single-row values

So, you’ve constructed a SQL statement with a GROUP BY clause, and you’re getting this message:

Column col is invalid in the select list because it is not
contained in either an aggregate function or the
GROUP BY clause.

Bear in mind that the result of a GROUP BY statement, or of a statement where one or columns uses an aggregate function, has each row containing a summary of other rows.

This means that if you try to include a column in your select clause that isn’t a summary (this includes values by which you’re grouping), then the server is going to have difficulties returning it; remember, it’s going to return one row per group, and any value that can’t be reduced to a single row per group will fail.

This gets interesting even when you’re hoping to return a single value per group, for example with CASE.

If the CASE expression you’re using is dependent on values of individual rows rather than summaries, you’ll also get the above message; you may only reference non-aggregate values (i.e. single-row values) in an aggregate function, or in the WHERE clause, so the solution would involve placing your CASE inside an aggregate function, in this case your SUM.

The same applies for the HAVING clause, as it is effectively selecting rows in the grouped resultset based on values in each group.

Representing Hierarchical data in PHP

I’ve come across a lot of people having a problem when representing data from their databases; hierarchical data provides us with a particular problem when displaying the relationships and containers required.

An example of hierarchical data

As an example, let’s say you have forums within categories. The category “Web Programming” may have forums like “PHP”, “JavaScript” and so on, so we want to list each forum under each category. The problem here is that you’ll have:

  • a table for categories
  • one for forums (with a foreign key for categories)
  • one for posts, with a foreign key for forums
  • and probably one for comments.

This fits quite neatly into a hierarchy: each comment belongs to a post, which belongs to a forum, which belongs to a category.

Using multiple resultsets

If you want to display the Categories with Forums under them in your page, one way to do this is by setting up multiple resultsets, each focusing on one level of the hierarchy. To achieve this, you’d need to get a resultset for the categories first, and then iterate over this list with another loop for your forum list.

The following code is an example of iterating using nested queries:

$cat_rs = mysql_query("select id, name from categories");
while($cat_row = mysql_fetch_array($cat_rs)){
     // print category name from $cat_row[1]
     $forum_rs = mysql_query("select name... "
                          . "from forums "
                          . "where cat_id = '" . $cat_row[0] ."'");
     while($forum_row = mysql_fetch_array($forum_rs)){
        //print forum stuff
     }
}

Although this matches the hierarchical nature of the data, and is pretty straightforward and intuitive to understand, it is somewhat inefficient; it issues a select statement for each category, plus one to list the categories in the first place. If there are many categories, the page might take some time to load as all queries are issued and processed.

Using a SQL JOIN: a single resultset

Another way to solve the problem would be to retrieve all relevant data in a single resultset, and use PHP to iterate over the rows, deciding on when to print each individual section.

A join will give you a single recordset with the category in one column, and the forum in another, giving you many rows for a single category.

$join_rs = mysql_query("select c.id, c.name, f.name,... "
                        . "from categories c inner join forums f "
                        . "on c.id = f.cat_id "
                        . "order by c.id"); // this line is crucial
$current_category = "";
while($join_row = mysql_fetch_array($join_rs)){
	if($join_row[0] != $current_category){
                //set up new category headings here
                $current_category = $join_row[0];
        }
        //print the forum stuff in the current category
}

The PHP is a bit more complex, because you have to close divs or tables correctly, within the loop. The order by line in the above example is vital, because without it, the categories may be dotted throughout the resultset. As we want all forum data related to a single category to appear together, we want rows belonging to that category to be adjacent in the resultset. The order by clause achieves this.

Despite the complexity, there are still benefits to this approach. The earlier example was simple and intuitive, using nested queries for each parent element. On the other hand, in this example we’ve only issued one select statement, so the page is likely to load more quickly, given that the number of round trips to MySQL will be substantially fewer.

Choosing Between Database Platforms: MS SQL and MySQL, with a twist

Another StackOverflow question: Given a site using PHP and VB.NET, and a choice between Microsoft SQL Server and MySQL, which coupling would be easier to maintain?

VB.NET and Microsoft SQL Server are an obvious marriage, as are PHP and MySQL, but as both development languages are in use, we are faced with choosing between two odd combinations:

  • VB.NET and MySQL
  • or PHP and MS SQL

My response? I’d go with MySQL (possibly controversially!), although SQL Server is by far the superior platform, all things considered. If you’re comparing them as similar alternatives, you’re probably not going to use any of the features of MS SQL that make it the better platform, and so it’s not worth the extra hassle.

In summary, here’s why:

  • PHP’s support for MySQL is second to none (given the following caveat)
  • PHP’s support for SQL Server is suboptimal; Microsoft provide a PHP driver, and there are other techniques, but PHP’s simply not quite as database-agnostic as VB.NET
  • VB.NET, although it loves SQL Server, will happily talk to any OLEDB provider (e.g. an ODBC connection) with no problems whatsoever, and MySQL’s ODBC support is pretty mature.

SQL: Why nulls slow down some queries

Someone asked a question on StackOverflow today about why null values slow down certain queries. This is roughly what I answered.

The main issue with null values and performance is to do with forward lookups.

If you insert a row into a table, with null values, it’s placed in the natural page that it belongs to. Any query looking for that record will find it in the appropriate place. Easy so far….

…but let’s say the page fills up, and now that row is cuddled in amongst the other rows. Still going well…

…until the row is updated, and the null value now contains something. The row’s size has increased beyond the space available to it, so the DB engine has to do something about it.

The fastest thing for the server to do is to move the row off that page into another, and to replace the row’s entry with a forward pointer. Unfortunately, this requires an extra lookup when a query is performed: one to find the natural location of the row, and one to find its current location.

So, the short answer to the question of performance is yes, making those fields non-nullable will help search performance. This is especially true if it often happens that the null fields in records you search on are updated to non-null.

Of course, there are other penalties (notably I/O, although to a tiny extent index depth) associated with larger datasets, and then you have application issues with disallowing nulls in fields that conceptually require them, but hey, that’s another problem!