Monday, January 09, 2006

XML and SQL Server 2000 don't get along too well

(I've had this post as a draft since January; it's time to get over the blogging laziness)

I remember when SQL Server 2000 came out, the main ad campaigns were all about its XML support. At the time I didn't know what the heck XML was, or why all the buzz about it.
I corrected part of my ignorance soon after. Note: only part of my ignorance...
Anyway, apart from retrieving resultsets in XML format, with who knows what schema, I hadn't used XML in SQL Server much, and I thought *that* XML support was all a programmer gal could wish for.
Oh, I was wrong!
My last project involved a great deal of importing XML into SQL Server 2000, and it was indeed interesting.
I noticed a few things I'd like to blog about, if only to keep a reminder for future projects.
  1. SQL Server 2000 won't import XML through a DTS package unless you have the "proper mapping" between your XML schema and your database schema, and I was never able to tickle that proper mapping out of it. My first idea was inferring an .xsd file from the XML (I did that with a Visual Studio console app, loading the XML into a DataSet and inferring the schema; see the sketch after this list), but the inferred schema never seemed to count as the proper mapping. See this article: Importing XML into sql server 2000.
  2. What about Excel? In theory you can import XML into Excel, associate an .xsd file with your spreadsheet, and have it all imported into SQL Server 2000. Voila! It should be just a couple of clicks. Don't run that fast: Excel won't import complex XML, that is, XML with a complex schema where child nodes have more child nodes (complex nested types representing one-to-many relationships). I even tried splitting the XML to import one complex type, or table, at a time, and realized I had to save all of my .xls files as .csv (comma-separated text) to have them imported properly.
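Here is roughly what that schema-inference console app looked like. It's only a sketch: the file names are made up, and the .xsd it produces is exactly the mapping DTS kept turning its nose up at.

    // Sketch of the schema-inference console app (file names are hypothetical).
    using System;
    using System.Data;

    class InferSchemaFromXml
    {
        static void Main()
        {
            // Let ADO.NET guess a relational shape for the document.
            DataSet ds = new DataSet();
            ds.ReadXml("source.xml", XmlReadMode.InferSchema);

            // Save the inferred schema to try as the DTS / Excel mapping.
            ds.WriteXmlSchema("source.xsd");

            foreach (DataTable table in ds.Tables)
                Console.WriteLine("Inferred table: " + table.TableName);
        }
    }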
OK, so we ended up writing a .NET utility that uses ADO.NET: it loads the XML and synchronizes the DataSet with the table in SQL Server 2000 (there's a sketch of it below). Now let's check whether I can retrieve the XML back out of the database, to be able to work with smaller chunks of data in XML format.
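The utility boiled down to something like the following. The connection string, the Orders table and orders.xml are all made-up names, and the real thing had error handling and explicit column mappings on top; this just shows the ReadXml-then-SqlDataAdapter.Update pattern.

    // Sketch of the ADO.NET import utility; table, file and connection names are hypothetical.
    using System;
    using System.Data;
    using System.Data.SqlClient;

    class LoadXmlIntoSqlServer
    {
        static void Main()
        {
            string connStr = "server=(local);database=Staging;trusted_connection=yes";

            // Each nested element in the document becomes a DataTable.
            DataSet xmlData = new DataSet();
            xmlData.ReadXml("orders.xml", XmlReadMode.InferSchema);

            using (SqlConnection conn = new SqlConnection(connStr))
            {
                conn.Open();

                // Let the command builder generate the INSERT from the target table's schema.
                SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM Orders", conn);
                SqlCommandBuilder builder = new SqlCommandBuilder(adapter);

                DataSet dbData = new DataSet();
                adapter.FillSchema(dbData, SchemaType.Mapped, "Orders");
                DataTable target = dbData.Tables["Orders"];

                // Copy matching columns from the XML rows into new (Added) rows,
                // then push them all to SQL Server in one Update call.
                foreach (DataRow src in xmlData.Tables["Orders"].Rows)
                {
                    DataRow dest = target.NewRow();
                    foreach (DataColumn col in target.Columns)
                        if (xmlData.Tables["Orders"].Columns.Contains(col.ColumnName))
                            dest[col.ColumnName] = src[col.ColumnName];
                    target.Rows.Add(dest);
                }

                adapter.Update(dbData, "Orders");
            }
        }
    }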
  1. This part of the project was also fun indeed. The T-SQL FOR XML clause does not seem to work when some of the columns in your table are defined as text: it truncates the resultset past a maximum length (see the sketch after this list).
  2. SQL Server 2000 is also unable to retrieve the XML with the schema you would like. Apart from the data truncation, you have to apply some serious XSLT to the output afterwards to get it complying with the schema you want.
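For what it's worth, here is one way to wire the retrieval up from .NET (a sketch with made-up table, column, file and stylesheet names): SqlCommand.ExecuteXmlReader streams the FOR XML result back in one piece instead of the chunked rows Query Analyzer shows, and an XSLT stylesheet then reshapes the output into the schema the consumer expects. It won't magically cure the text-column truncation, though.

    // Sketch: stream a FOR XML result and reshape it with XSLT (.NET 1.1-era APIs).
    using System;
    using System.Data.SqlClient;
    using System.IO;
    using System.Xml;
    using System.Xml.Xsl;

    class ExportOrdersAsXml
    {
        static void Main()
        {
            string connStr = "server=(local);database=Staging;trusted_connection=yes";

            using (SqlConnection conn = new SqlConnection(connStr))
            {
                conn.Open();

                // ExecuteXmlReader hands the XML back as a stream instead of the
                // chunked nvarchar rows you get in Query Analyzer.
                SqlCommand cmd = new SqlCommand(
                    "SELECT OrderId, CustomerName, Notes FROM Orders FOR XML AUTO, ELEMENTS",
                    conn);
                XmlReader reader = cmd.ExecuteXmlReader();

                // SQL Server 2000's FOR XML returns a fragment with no root node,
                // so wrap one around it here.
                XmlDocument doc = new XmlDocument();
                XmlElement root = doc.CreateElement("OrderExport");
                doc.AppendChild(root);
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    XmlNode node = doc.ReadNode(reader);
                    if (node == null) break;
                    root.AppendChild(node);
                }

                // Reshape the raw FOR XML output into the schema we actually need.
                XslTransform xslt = new XslTransform();
                xslt.Load("reshape-orders.xslt");
                using (StreamWriter output = new StreamWriter("orders-out.xml"))
                {
                    xslt.Transform(doc, null, output, null);
                }
            }
        }
    }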
Hopefully SQL Server 2005 will answer some of our prayers. There is an xml data type, so in the worst-case scenario you can load the whole document into a column in a temporary table (I know it's not good practice to keep a huge document in a table column), apply XQuery to it, and retrieve just the chunk of information you actually want to import into your tables. DTS packages for importing XML should be better too; old SQL Server 2000 relied on a scripting language for everything, with the known drawback that scripting languages are typeless.
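The load-then-slice idea could look something like this (a sketch only, assuming SQL Server 2005 and .NET 2.0, with a made-up ImportStaging table that has an xml column called Doc, and made-up element names in the XQuery):

    // Sketch: stage a whole document in an xml column, then use XQuery to pull
    // out just the chunk to import. Table, column, file and element names are
    // hypothetical.
    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Xml;

    class StageAndSliceXml
    {
        static void Main()
        {
            string connStr = "server=(local);database=Staging;trusted_connection=yes";

            using (SqlConnection conn = new SqlConnection(connStr))
            {
                conn.Open();

                // Load the whole document into the xml column of a staging table,
                // e.g. CREATE TABLE ImportStaging (Id int IDENTITY(1,1) PRIMARY KEY, Doc xml)
                SqlCommand insert = new SqlCommand(
                    "INSERT INTO ImportStaging (Doc) VALUES (@doc)", conn);
                insert.Parameters.Add("@doc", SqlDbType.Xml).Value =
                    File.ReadAllText("orders.xml");
                insert.ExecuteNonQuery();

                // XQuery out only the piece we care about.
                SqlCommand query = new SqlCommand(
                    "SELECT Doc.query('/OrderExport/Orders/Order') FROM ImportStaging",
                    conn);
                using (XmlReader chunk = query.ExecuteXmlReader())
                {
                    while (chunk.MoveToContent() == XmlNodeType.Element)
                        Console.WriteLine(chunk.ReadOuterXml());
                }
            }
        }
    }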

Having said all this, I might blog about SQL Server 2005 in a future post (I really like CTEs, common table expressions)... if I get over this blogging laziness.
That's all folks!

Thursday, January 05, 2006

Here I am still ranting about database design and PK selection

Following the thread from the previous post (and mostly to keep this as a personal reminder), I came across this article:
Identity and Primary Keys @ sqlteam.com
The idea behind the article is to encourage database developers to use natural keys as the primary keys in their entity designs. I have seen so many designs like that, and I have been very much against them. I wanted to double-check, so I posted a thread on sql-server-l at groups.ittoolbox.com.

Some of the brief comments I got:
Natural key is not good as a primary key at all. It is usually much longer
than int identity(1,1) 4 bytes.

Natural key is very cumbersome to use as a foreign key.

I am amazed that this became a debate here.

Ahem, it might be a newbie question alright!

I also got another comment on how a uniqueidentifier field cannot guarantee uniqueness:

I have seen plenty of people create databases where the only unique identifier was a surrogate key, such as a sql server id column. These people often show up here asking how to delete duplicates. (haven't you seen a ton of them here Kalman?) Duplicates? How can there be any duplicates if you have a sql server ID column as a primary key? LOL

This DBA proposed using natural keys as PKs and got this reply back:

Take my advice all: int identity(1,1) is the best primary key. You can put a
unique constraint or unique index on the natural key column.

Back in the glory days of sybase it was a pain to construct your own
identity(1,1) through the use of the max function.

But the best of all is this comment, which was posted on the original article. Original comment here (page two of the comments, nickname quazibubble). I know the comment is a little bit rude, but I liked how clear the explanation is:

A solid, mature OLTP database architecture has entities with a single identity or GUID PK defined, usually as a clustered index, that by definition is totally meaningless except for building relational structures behind the scenes. That is its ONLY purpose. Think of it as an undercover system column that has nothing to do with the entity it belongs to; you just happen to be able to see it as a developer. You'll use design systems in the future that don't even give YOU, the developer, access to this column or its values, but it will be there (they've already existed as CASE tools and their ancestors, as well as their simpler, more useful progeny). Your job should be designing ENTITIES AND RELATIONSHIPS, not PKs and FKs. If you find yourself pondering what to use for a PK (beyond the question of "should it be an identity or GUID?"), you are doing something terribly wrong and will be costing a whole team a great deal of money and headache, even long after you are fired.

Furthermore, if a PK is not meaningless--e.g. it is any attribute remotely helping to add human meaning to the entity (SSN, network login, etc.), there WILL be cascading updates throughout the entire system, many people are going to be very pissed, and you will look like a jacka**. Users make typos, people change their names. Any data that was entered by or derived from a human at any point is subject to change. To be perfectly compliant with the forms of normalization, the value in a PK should not even be used as an account number. But giving a system-defined value that later becomes meaningful to a human is not nearly as harmful as using an existing value of meaning for a PK.


End of reminder