This post is about useful TSQL queries on to run on first contact with a new system. We’re going to drill down into the nitty gritty, bricks and blocks of a database – datatypes. Selecting the correct datatype at design time for EVERY column in each table of your database is fundamental to achieving good performance later on when the database is deployed and has been running in production for a while. Having to alter a table’s Primary Key from integer to bigint when there are a 1000M+ rows in the table is not a pretty sight. Other common issues are when a company grows internationally and suddenly the Decimal(9,2) column doesn’t stretch so far when storing amounts in Indonesian Rupiah and the varchar(50) product name column isn’t so helpful when trying to store Thai or Turkish characters. These are important however the topic of this post is CONSISTENCY, if we are going to store ProductID let’s try and ensure that we have an agreed definition (read datatype) so that our JOINs and COMPARISONS can take advantage of the indexes present. Furthermore when Updating a column in one table based upon data in another we do NOT want to run the risk of data truncation or lossy transformations (i.e. Product Description nvarchar(50) in one table storing spanish converted to varchar(50) in another).

For more details of how implicit conversions can impact query performance even when convert_implicit is not present in the Query Plan I refer you to this post by Paul White :-

http://sqlblog.com/blogs/paul_white/archive/2011/07/19/join-performance-implicit-conversions-and-residuals.aspx

So here is a query which tells you how many different datatypes are defined for each unique column name – ideally the answer is one. To go off on a tangent another way we could get multiple datatypes for the same column name is if we have been inattentive when naming our columns and we have re-used the same name for two different entities, but I hope you’ll agree this is also a problem we would like our attention drawn to.

I am going to work from the INFORMATION_SCHEMA system views, in particular TABLES and COLUMNS as I can get to the information I need and I get great forwards and backwards compatibility over different SQL Server versions.

The core of this query runs against the INFORMATION_SCHEMA.COLUMNS view and using a CASE over the different datatypes constructs the familiar concise datatype descriptions with numeric and decimal having n digits before and after the decimal point, varchar and nvarchar having a length etc. All we need to do then is to GROUP BY column name and concise datatype and using the HAVING clause filter out any column name with more than 1 datatype (these are the column names we are interested in). The results of this inner sub select are used to filter a similar outer query that returns all the table names and concise datatypes for the target column names as it will be a manual intervention by the database administrator to, using their skill, judgement and artistic sensibilities, decide which is actually the correct datatype for this particular column name.

If we are in luck there are a couple of other places where we might pick up information about columns which could be relevant. Firstly, SQL Server has an implementation of User Defined Datatypes whereby a base datatype, length (& precision if appropriate) plus nullability can be bound to a type name and then this user defined type used in when defining columns in a table. To follow the Books on Line example Social Security Number could be defined as char(11) NNN-NN-NNNN.

but User Defined Datatype implementation is somewhat tricky and hasn’t really had any TLC from Microsoft for many a long year, I refer you to this blog by Bart Duncan

http://blogs.msdn.com/b/bartd/archive/2010/08/25/t-sql-udts-what-are-they-good-for.aspx

The other potential source of information is the extended properties, if your predecessor is nice then every table and column will have an extended property defined which acts as a comment describing the purpose of this column, so potentially very help as we can store a much longer text than the column name. Furthermore this “documentation” is stored within the database and will travel with it through backups and restores etc, so is much harder to lose than the Word Doc stored on the departmental file server.

Full TSQL code is below

The first result set just lists out any User Defined Datatypes that are present in the current database, often there won´t be any rows in this resultset.

The second result set produces some simple metrics about column names and datatypes.

·         Most_datatypes_for_one_colname – a count of the maximum number of different datatypes for a single column name.

·         Avg_ncols_per_colname – the average number of columns (different tables) per column name, i.e. on average how many time do I re-use a columnname, in a relational database we expect to find some matching columns.

·         Avg_ndatatypes_per_colname – the average number of datatypes per column name, the closer this is to 1 the more consistent the database. Every unique column name has one and only one datatype.

·         Total_number_of_columns – how many columns are there in all the tables in this database.

·         Pc_UDT – the percentage of all columns which are bound to a user defined datatype (may be a good thing if this is the standard in your organization but see the link by Bart Duncan before you make this decision)

·         Pc_ExtProperty – the percentage of all columns which have had an extended property added, could be helpful as long as the comment is sensible.

sample results from the AdventureWorks2016CTP3 database

 

most datatypes for one colname

avg ndatatypes per colname

avg ncols per colname

total number of columns

pc UDT

pc ExtProperty

3

1.08

2.852

713

5.19

65.78

 

Please feel free to collect these metrics from your databases and either add them to the blog comments or mail them through. I’ll summarise and write a new blog post on the findings.

 

The final resultset lists out those column names having more than one datatype, here is an abridged set from the Adventureworks2016CTP3 database
(downloaded from codeplex.com).

AccountNumber on the Customer table is defined with varchar(10) while all other definitions are nvarchar(15) – 3 are via the UDT whereas 2 are just defined with the native SQL datatype. If you are going to implement UDTs you really need to implement them everywhere. Any attempt to join Customer and SalesOrderHeader tables on AccountNumber will force the conversion of the Customer table rows to nvarchar – nasty. On the plus side we do have extended properties – but I don´t think we could conclusively state that all AccountNumber entities are one and the same thing.

With the columnName Quantity we can see that in ProductInventory it is defined as a smallint (max value 32767) but in the Shoppingcartitem it is an Int (max value 2,147,483,647), let’s hope the shopper has a good credit card limit.

The final example is Title and here we seem to have a case of the same name for two different things – the title of a document or the title of a person (Mr, Mrs etc) – hopefully no over enthusiastic data analyst will join the 2 tables on these fields.

 

table_name

column_name

DataType

FQ_UserDefinedType

ExtPropertyValue

Customer

AccountNumber

varchar(10)

 -

Unique number identifying the customer assigned by the accounting system.

SalesOrder_json

AccountNumber

nvarchar(15)

AdventureWorks2016CTP3.dbo.AccountNumber

 -

SalesOrderHeader

AccountNumber

nvarchar(15)

AdventureWorks2016CTP3.dbo.AccountNumber

Financial accounting number reference.

SalesOrderHeader_inmem

AccountNumber

nvarchar(15)

 -

 -

SalesOrderHeader_ondisk

AccountNumber

nvarchar(15)

 -

 -

Vendor

AccountNumber

nvarchar(15)

AdventureWorks2016CTP3.dbo.AccountNumber

Vendor account (identification) number.

ProductInventory

Quantity

smallint

 -

Quantity of products in the inventory location.

ShoppingCartItem

Quantity

int

 -

Product quantity ordered.

TransactionHistory

Quantity

int

 -

Product quantity.

TransactionHistoryArchive

Quantity

int

 -

Product quantity.

Document

Title

nvarchar(50)

 -

Title of the document.

Person

Title

nvarchar(8)

 -

A courtesy title. For example, Mr. or Ms.

Person_json

Title

nvarchar(8)

 -

 -

Person_Temporal

Title

nvarchar(8)

 -

 -

Person_Temporal_History

Title

nvarchar(8)

 -

 -

 

 


 

USE [AdventureWorks2016CTP3]

/*

Software license

You can use this “software” but you can’t sell it.

Please keep this credit with the code.

© Stephen Morris

Quadtree Consulting AB 2016.

www.quadtree-ab.com

Stephen.morris@quadtree-ab.com

*/

 

--List out any User Defined Datatypes

SELECT * FROM sys.types 

WHERE is_user_defined = 1

 

IF OBJECT_ID('tempdb..#ExcludeCols') IS NOT NULL
BEGIN
Drop
Table #ExcludeCols
END

--Use a temp table to exclude certain troublesome column names, i.e. Description

create table #ExcludeCols (id int identity(1,1), colname sysname)
--You can extend the column names you will ignore by adding extra rows

--to the temp table.
insert #ExcludeCols (colname) values ('DESCRIPTION'), ('CODE'), ('NAME')


 

SELECT

   MAX(X.DataTypeCount) as most_datatypes_for_one_colname,

   AVG(X.DataTypeCount) as avg_ndatatypes_per_colname,

   AVG(X.number_of_columns_with_this_name) as avg_ncols_per_colname,

   SUM(number_of_columns_with_this_name) as total_number_of_columns,

   CONVERT(DECIMAL(9,2), (SUM(IsUDT) / SUM(number_of_columns_with_this_name) * 100)) as pc_UDT,

   CONVERT(DECIMAL(9,2), (SUM(IsExtendedProperty) / SUM(number_of_columns_with_this_name) * 100)) as pc_ExtProperty

FROM

(

   SELECT

   column_name,

   Convert(decimal(9,2), COUNT(*)) as number_of_columns_with_this_name,

   Convert(decimal(9,2), COUNT ( Distinct

   (

   data_type +

   CASE

      WHEN data_type in ('numeric', 'decimal', 'money', 'smallmoney')

      THEN '('+convert(varchar(10),numeric_precision)+'.'+convert(varchar(10),numeric_precision_radix)+')'

      WHEN data_type in ('char', 'varchar', 'binary', 'varbinary')

      THEN '('+convert(varchar(10), character_maximum_length) + ')'

      WHEN data_type in ('nchar', 'nvarchar', 'sysname')

      THEN '('+ CASE WHEN character_maximum_length = -1 THEN 'MAX' ELSE convert(varchar(10), character_maximum_length/2) END + ')'

      ELSE ''

   END))) AS DataTypeCount,

   Sum(CASE WHEN IsNull(ep.value, '-') = '-' THEN 0 ELSE 1 END) as IsExtendedProperty,

   Sum(CASE WHEN IsNull(domain_catalog,'') +'.'+IsNull(domain_schema,'') + '.' + IsNull(domain_name,'') = '..' THEN 0 ELSE 1 END) as IsUDT

   FROM information_schema.columns c

   JOIN information_schema.tables t ON

      c.TABLE_CATALOG = t.TABLE_CATALOG AND

      c.TABLE_SCHEMA = t.TABLE_SCHEMA AND

      c.TABLE_NAME = t.TABLE_NAME AND

      t.TABLE_TYPE = 'base table'

   LEFT OUTER JOIN sys.extended_properties ep ON

      ep.major_id = OBJECT_ID(c.TABLE_CATALOG +'.'+c.TABLE_SCHEMA +'.'+c.TABLE_NAME) AND

      ep.minor_id = COLUMNPROPERTY(OBJECT_ID(c.TABLE_CATALOG +'.'+c.TABLE_SCHEMA +'.'+c.TABLE_NAME), c.COLUMN_NAME, 'Columnid') AND

      ep.class = 1

   WHERE column_name not in (select colname from #ExcludeCols)

   group by column_name

) as X

 


--Non matching column datatypes

SELECT

   c.table_name,

   c.column_name,

   c.data_type +

   CASE

      WHEN data_type in ('numeric', 'decimal', 'money', 'smallmoney')

      THEN '('+convert(varchar(10),numeric_precision)+'.'+convert(varchar(10),numeric_precision_radix)+')'

      WHEN data_type in ('char', 'nchar', 'varchar', 'nvarchar', 'binary', 'varbinary')

      THEN '('+convert(varchar(10), character_maximum_length) + ')'

      ELSE ''

   END AS DataType,

   CASE

      WHEN IsNull(domain_catalog,'') +'.'+IsNull(domain_schema,'') + '.' + IsNull(domain_name,'') = '..'

      THEN ' - '

      ELSE IsNull(domain_catalog,'') +'.'+IsNull(domain_schema,'') + '.' + IsNull(domain_name,'')

   END as FQ_UserDefinedType,

   IsNull(ep.value, ' - ') as ExtPropertyValue

FROM INFORMATION_SCHEMA.COLUMNS c

JOIN INFORMATION_SCHEMA.TABLES t ON

   c.TABLE_CATALOG = t.TABLE_CATALOG AND

   c.TABLE_SCHEMA = t.TABLE_SCHEMA AND

   c.TABLE_NAME = t.TABLE_NAME AND

   t.TABLE_TYPE = 'base table'

LEFT OUTER JOIN SYS.EXTENDED_PROPERTIES ep ON

   ep.major_id = OBJECT_ID(c.TABLE_CATALOG +'.'+c.TABLE_SCHEMA +'.'+c.TABLE_NAME) AND

   ep.minor_id = COLUMNPROPERTY(OBJECT_ID(c.TABLE_CATALOG +'.'+c.TABLE_SCHEMA +'.'+c.TABLE_NAME), c.COLUMN_NAME, 'Columnid') AND

   ep.class = 1

WHERE

   column_name not in (select colname from #ExcludeCols) AND

   column_name in

   (

   SELECT

      column_name

   FROM

      (

      SELECT

      column_name,

      data_type +

      CASE

         WHEN c.data_type in ('numeric', 'decimal')

         THEN '('+convert(varchar(10),numeric_precision)+'.'
                
+convert(varchar(10),numeric_precision_radix)+')'

         WHEN data_type in ('char', 'nchar', 'varchar', 'nvarchar')

         THEN '('+convert(varchar(10), character_maximum_length) + ')'

         ELSE ''

      END AS DataType,

      count(*) AS colcount

      FROM INFORMATION_SCHEMA.COLUMNS c

      JOIN INFORMATION_SCHEMA.TABLES t ON

         c.TABLE_CATALOG = t.TABLE_CATALOG AND

         c.TABLE_SCHEMA = t.TABLE_SCHEMA AND

         c.TABLE_NAME = t.TABLE_NAME AND

         t.TABLE_TYPE = 'base table'

      GROUP BY column_name,

      data_type +

      CASE

         WHEN data_type in ('numeric', 'decimal')

         THEN '('+convert(varchar(10),numeric_precision)+'.'
                
+convert(varchar(10),numeric_precision_radix)+')'

         WHEN data_type in ('char', 'nchar', 'varchar', 'nvarchar')

         THEN '('+convert(varchar(10), character_maximum_length) + ')'

         ELSE ''

      END

      ) AS x

   WHERE column_name not in (SELECT colname FROM #ExcludeCols)

   GROUP BY column_name

   HAVING count(*) > 1

   )

ORDER BY

   column_name,

   table_name