Archive for the 'SQL' Category

Putting Freebase in a Star Schema

What’s Freebase?

Freebase is a open database of things that exist in the world:  things like people,  places,  songs and television shows.   As of the January 2009 dump,  Freebase contained about 241 million facts,  and it’s growing all the time.  You can browse it via the web and even edit it,  much like Wikipedia.  Freebase also has an API that lets programs add data and make queries using a language called MQL.  Freebase is complementary to DBpedia and other sources of information.  Although it takes a different approach to the semantic web than systems based on RDF standards,  it interoperates with them via  linked data.

The January 2009 Freebase dump is about 500 MB in size.  Inside a bzip-compressed files,  you’ll find something that’s similar in spirit to a Turtle RDF file,  but is in a simpler format and represents facts as a collection of four values rather than just three.

Your Own Personal Freebase

To start exploring and extracting from Freebase,  I wanted to load the database into a star schema in a mysql database — an architecture similar to some RDF stores,  such as ARC.  The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in a 18 GB collection of database files and indexes.

This is sufficient for my immediate purposes,  but future versions of Freebase promise to be much larger:  this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms. Continue Reading »

What do you do when you’ve caught an exception?

Abort, Retry, Ignore

This article is a follow up to “Don’t Catch Exceptions“, which advocates that exceptions should (in general) be passed up to a “unit of work”, that is, a fairly coarse-grained activity which can reasonably be failed, retried or ignored. A unit of work could be:

  • an entire program, for a command-line script,
  • a single web request in a web application,
  • the delivery of an e-mail message
  • the handling of a single input record in a batch loading application,
  • rendering a single frame in a media player or a video game, or
  • an event handler in a GUI program

The code around the unit of work may look something like

[01] try {
[02]   DoUnitOfWork()
[03] } catch(Exception e) {
[04]    ... examine exception and decide what to do ...
[05] }

For the most part, the code inside DoUnitOfWork() and the functions it calls tries to throw exceptions upward rather than catch them.

To handle errors correctly, you need to answer a few questions, such as

  • Was this error caused by a corrupted application state?
  • Did this error cause the application state to be corrupted?
  • Was this error caused by invalid input?
  • What do we tell the user, the developers and the system administrator?
  • Could this operation succeed if it was retried?
  • Is there something else we could do?

Although it’s good to depend on existing exception hierarchies (at least you won’t introduce new problems), the way that exceptions are defined and thrown inside the work unit should help the code on line [04] make a decision about what to do — such practices are the subject of a future article, which subscribers to our RSS feed will be the first to read.

Continue Reading »

Dropping Unique Constraints in SQL Server

The problem

I got started with relational databases with mysql,  so I’m in the habit of making database changes with SQL scripts,  rather than using a GUI.  Microsoft SQL Server requires that we specify the name of a unique constraint when we want to drop it.  If you’re thinking ahead,  you can specify a name when you create the constraint;  if you don’t,  SQL Server will make up an unpredictable name,  so you can’t write a simple script to drop the constraint.

A Solution

In the spirit of “How To Drop A Primary Key in SQL Server“,  here’s a stored procedure that queries the data dictionary to find the names of any unique constraint on a specific table and column and drop them:

CREATE PROCEDURE [dbo].[DropUniqueConstraint]
    @tableName NVarchar(255),
    @columnName NVarchar(255)
    SET @IdxNames = CURSOR FOR
        select from sysindexkeys,syscolumns,sysindexes
                syscolumns.[id] = OBJECT_ID(N'[dbo].['+@tableName+N']')
                AND sysindexkeys.[id] = OBJECT_ID(N'[dbo].['+@tableName+N']')
                AND sysindexes.[id] = OBJECT_ID(N'[dbo].['+@tableName+N']')
                AND sysindexkeys.colid=syscolumns.colid
                AND sysindexes.[indid]=sysindexkeys.[indid]
                AND (
                    SELECT COUNT(*) FROM sysindexkeys AS si2
                    AND si2.indid=sysindexes.indid
    OPEN @IdxNames
    DECLARE @IdxName Nvarchar(255)
    FETCH NEXT FROM @IdxNames INTO @IdxName    

        DECLARE @dropSql Nvarchar(4000)

        SET @dropSql=
            N'ALTER TABLE  ['+@tableName+ N']
                DROP CONSTRAINT ['+@IdxName+ N']'
        FETCH NEXT FROM @IdxNames
        INTO @IdxName
CLOSE @IdxNames

Usage is straightforward:

EXEC [dbo].[DropUniqueConstraint]

This script has a limitation:  it only drops unique constraints that act on a single column,  not constraints that act on multiple columns.   It is smart enough,  however,  to not drop multiple-column constraints in case one of them involves @columnName.

Feedback from SQL stored procedure wizards would be mostly certainly welcome.

kick it on

How to Drop A Primary Key in Microsoft SQL Server

One of my pet peeves with Microsoft’s SQL Server is that it requires you to know the name of constraints and indexes that you’re dropping. This isn’t a problem when you’re working with the management studio, because it looks at the database metadata and generates code like

ALTER TABLE [dbo].[someTable]  DROP CONSTRAINT [PK__someTabl__3214EC07271AA44F]

It’s less convenient, however, when you’re writing a set of migration scripts in SQL to implement changes that you make over the database in time. Specifically, if you create the table twice in two different databases, the hexadecimal string in the name of the key will be different — the ALTER TABLE statement will fail when you try to drop the index later, since the name of the key won’t match.

Here’s a stored procedure that looks up the name of the primary key in the system catalog and uses dynamic SQL to drop the index:

CREATE PROCEDURE [dbo].[DropPrimaryKey]
    @tableName Varchar(255)
       Drop the primary key on @TableName

       Version 1.1
       June 9, 2008

    DECLARE @pkName Varchar(255)

    SET @pkName= (
        SELECT [name] FROM sysobjects
            WHERE [xtype] = 'PK'
            AND [parent_obj] = OBJECT_ID(N'[dbo].['+@tableName+N']')
    DECLARE @dropSql varchar(4000)

    SET @dropSql=
        'ALTER TABLE [dbo].['+@tableName+']
            DROP CONSTRAINT ['+@PkName+']'
Once you've loaded this stored procedure,  you can write
EXEC [dbo].[DropPrimaryKey] @TableName='someTable'

It’s just that simple. Similar stored procedures can be written to convert fields from NOT NULL to NULL and do other operation which required a named constraint.

Tame SQL With Multiline Quotes in C# and PHP

For years people have talked about an “impedance mismatch” between relational databases and object-oriented languages. Big answers have been proposed to this problem, such as object-relational mapping, but we’ve yet to see a big answer that’s entirely satisfactory — this article is one of a series that tries to break the problem into a set of smaller problems that have simple answers.

Many modern languages, such as Java, Perl, PHP and C#, are derived from C. In C, string literals are written in double quotes, and a set of double quotes can’t span a line break. One advantage of this is that the compiler can give clearer error messages when you forget the close a string. Some C-like languages, such as Java and C# are like C in that double quotes can’t span multiple lines. Other languages, such as Perl and PHP, allow it. C# adds a new multiline quote operator, @”" which allows quotes to span multiple lines and removes the special treatment that C-like languages give to the backslash character (\).

The at-quote is particularly useful on the Windows platform because the backslash used as a path separator interacts badly with the behavior of quotes in C, C++ and many other languages. You need to write

String path="C:\\Program Files\\My Program\\Some File.txt";

With at-quote, you can write

String path=@"C:\Program Files\My Program\Some File.txt";

which makes the choice of path separator in MS-DOS seem like less of a bad decision.

Now, one really great thing about SQL is that the database can optimize complicated queries that have many joins and subselects — however, it’s not unusual to see people write something like

command.CommandText = "SELECT firstName,lastName,(SELECT COUNT(*) FROM comments WHERE postedBy
=userId AND flag_spam='n') AS commentCount,userName,city,state,gender,birthdate FROM user,user
Location,userDemographics WHERE user.userId=userLocation.userId AND user.userId=userDemographi
cs.user_id AND (SELECT COUNT(*) FROM friendsReferred WHERE referringUser=userId)>10 ORDER BY c
ommentCount HAVING commentCount>3";

(line-wrapped C# example) Complex queries like this can be excellent, because they can run thousands of times faster than loops written in procedural or OO languages which can fire off hundreds of queries, but a query written like the one above is a bit unfair to the person who has to maintain it.

With multiline quotes, you can continue the indentation of your code into your SQL:

command.CommandText = @"
      ,(SELECT COUNT(*) FROM comments
         WHERE postedBy=userId AND flag_spam='n') AS commentCount
   FROM user,userLocation,userDemographics
      AND user.userId=userDemographics.user_id
      AND (SELECT COUNT(*) FROM friendsReferred
         WHERE referringUser=userId)>10
   ORDER BY commentCount
   HAVING commentCount>3

Although this might not be quite quite as cool as LINQ, it works with Mysql, Microsoft Access or any database you need to connect to — and it works in many other languages, such as PHP (in which you could use ordinary double quotes.) In languages like Java that don’t support multiline quotes, you can always write

String query =
   " SELECT"
  +"    firstName"
  +"   ,lastName"
  +"   ,(SELECT COUNT(*) FROM comments"
  +"      WHERE postedBy=userId AND flag_spam='n') AS commentCount"
  +"   ,userName"
  +"   ,city"
  +"   ,state"
  +"   ,gender"
  +"   ,birthdate"
  +" FROM user,userLocation,userDemographics"
  +"    WHERE"
  +"       user.userId=userLocation.userId"
  +"       AND user.userId=userDemographics.user_id"
  +"       AND (SELECT COUNT(*) FROM friendsReferred"
  +"          WHERE referringUser=userId)>10"
  +" ORDER BY commentCount"
  +" HAVING commentCount>3";

but that’s a bit more cumbersome and error prone.