Thursday, November 22, 2012

varchar vs nvarchar (and text, ntext, char, nchar)

Everyone who's ever created a table in SQL has probably at some point wanted to add a field to store text values and wondered about the difference between all the various options. These are numerous. The most commonly used are varchar, nvarchar, char, nchar, text, and ntext. You could also store text in binary, nbinary, or image columns. Actually you could store text in almost any data type if you're willing to put yourself through the hassle of very complicated conversions.  Let's stick with the original six for now: varchar, nvarchar, char, nchar, text, ntext.

The n prefix denotes unicode (more about this later), so ignoring that, we're dealing with three data types here: varchar, char, and text. Varchar is by far the most commonly used and the rest of this article will focus on this type, but in short these three can be differentiated as follows:

  • char: Fixed length string. In other words, char(4) will always have four characters. Thus if you want to store "A" it will become "   A" (A padded with 3 spaces). This is not used very often, but can be useful in certain scenarios. 
  • text: This stores the data in a blob (i.e. not in the row data of the SQL table). This is useful when you expect to store huge text values as there's no limit on the size of the data. However, this type is being deprecated and is being replaced (sort of) by varchar(max).
  • varchar: Up to 8000 characters this stores text in the row data (similar to char) but not always of fixed length. In other words if you create a varchar(4) column and store a value "A" in it, this will use just one byte to store the A. AA will use two bytes, AAA will use 3 and so on. This implies that the length of various rows in the table will vary from one row to the next. varchar always uses an additional 2 bytes for storage, essentially for meta data so that SQL server knows how much data to expect in each specific instance. That means if you know you will always need exactly 4 characters, char is much more efficient in terms of storage requirements and processing overhead, since it doesn't need the additional 2 bytes of meta data nor any processing power to interpret that meta data. Having said that, in the majority of modern applications, this type of gain from using char instead of varchar is trivial. In the olden days when memory, processor cycles, and storage space were more limited, this was a serious consideration though. In previous versions of SQL Server, varchar could only go up to 8000 characters, but as of SQL 2005 varchar(max) will allow you to store any amount of data. This works like varchar described above up to the 8000 character limit and once data size goes over this it switches over to a blob to store the data and behaves more like the text data type (presumably this is also why the text data type is no longer necessary and being deprecated).
Now, on to varchar vs nvarchar. These two types are essentially the same, except that varchar is non-Unicode and nvarchar is Unicode. What does this mean?

Unicode is a standard for encoding text that was specifically designed to allow for storage of a larger number of different characters and character sets to cater for the requirements of different languages. The wikipedia article for Unicode describes this in more detail.

In SQL Server, this means nvarchar supports multiple languages. If your data is only going to be in English, this is not necessary. If there's any chance that your data will contain other language data, especially languages using non-English character sets, like Arabic, Russian, etc., etc. then you will need to use nvarchar. The cost of using nvarchar is that it almost doubles the amount of storage used to store the data as it uses two bytes for each character of data, where varchar uses only one byte per character. Both have the same overhead of 2 characters, so effectively varchar will use 3 bytes to store "A" and nvarchar will use 4 bytes for the same thing. To store "QWERTY", six characters of data, varchar will use 8 bytes (2+6) and nvarchar will use 14 bytes (2+6*2). 

My recommendation: If you're absolutely 100% sure that you will not need non-English data, and you will not need it in the foreseeable future either, then use varchar. In all other cases, use nvarchar.

No comments:

Post a Comment