SQL SERVER - Loading XML Data

August 02, 2010

   

Introduction

With XML becoming more and more the rage, DBAs are seeing many more requests to deal with XML data files. The question is: how do you load the data in the XML file into a table in SQL Server?
 
The XML File - For the purposes of this article, let's use a small XML file that is readily available on the Internet. W3Schools.com has such a sample file at http://www.w3schools.com/XML/cd_catalog.xml. As you might surmise by the name of the file, it contains brief information about a CD catalog. Please download and save this file to your computer. I am going to assume that this is being saved in the C:\SQL folder, and that you are keeping the file name the same (cd_catalog.xml). If you change any of this, make appropriate changes to the code below.
The structure of the XML file is as follows. Note that the file contains multiple <CD> elements.

<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
</CATALOG>

The Database

Let's just keep this short and simple. We will load this file into one table, with a structure identical to that of the XML file.

CREATE TABLE dbo.CD_Info (
  Title        varchar(100),
  Artist       varchar(100),
  Country      varchar(25),
  Company      varchar(100),
  Price        numeric(5,2),
  YearReleased smallint);

Load the file into a staging table

We will be utilizing the OpenRowset function to bulk load the file into a staging table. First, we will create a staging table with one column of the XML data type. We will then load the file into this column. Note that use of the OpenRowset function requires a table alias.

DECLARE @CD TABLE (XMLData XML);
INSERT INTO @CD
SELECT * FROM OPENROWSET(BULK N'C:\SQL\cd_catalog.xml', SINGLE_BLOB) rs;

Retrieve the data from the staging table

Next we will retrieve the data from the staging table and "shred" the XML into its columns. One key thing to note about XML: regardless of the case sensitivity of your server/database, XML is always case sensitive. In the sample file, all of the XML tags are in upper case, so the element names in our queries need to be in upper case also.

This command will retrieve each element of each CD in the sample file:

SELECT Title = x.data.value('TITLE[1]','varchar(100)'),
       Artist = x.data.value('ARTIST[1]','varchar(100)'),
       Country = x.data.value('COUNTRY[1]','varchar(25)'),
       Company = x.data.value('COMPANY[1]','varchar(100)'),
       Price = x.data.value('PRICE[1]','numeric(5,2)'),
       YearReleased = x.data.value('YEAR[1]','smallint')
FROM @CD t
       CROSS APPLY t.XMLData.nodes('/CATALOG/CD') x(data);

Note in this command the nodes and value functions. These are XML functions. The nodes function specifies the path to the data that you will be using, and it requires an alias. You then apply the value function to the alias to retrieve the specific data that you want. In the value function, you need to supply the element as a singleton expression, and the data type.
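To see nodes() and value() in isolation, here is a minimal, self-contained sketch run against an XML variable (the two-row sample data here is made up for illustration):

```sql
-- Minimal sketch of nodes()/value(); this sample XML is made up.
DECLARE @x XML = N'<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD>
  <CD><TITLE>Hide Your Heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>
</CATALOG>';

-- nodes() returns one row per /CATALOG/CD element; value() then shreds
-- each singleton expression into a relational column.
SELECT Title  = c.data.value('TITLE[1]',  'varchar(100)'),
       Artist = c.data.value('ARTIST[1]', 'varchar(100)')
FROM @x.nodes('/CATALOG/CD') c(data);
```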

The CROSS APPLY will apply each row in the staging table to the XML nodes function. For more information on how the CROSS APPLY works, I refer you to these articles:
Understanding and Using Apply - Part 1
Understanding and Using Apply - Part 2

Inserting the data into the CD_Info table

At this point, inserting the data is very simple: just put an insert statement in front of the above select statement.

DECLARE @CD TABLE (XMLData XML);
INSERT INTO @CD
SELECT * FROM OPENROWSET(BULK N'C:\SQL\cd_catalog.xml', SINGLE_BLOB) rs;
INSERT INTO dbo.CD_Info (Title, Artist, Country, Company, Price, YearReleased)
SELECT Title = x.data.value('TITLE[1]','varchar(100)'),
       Artist = x.data.value('ARTIST[1]','varchar(100)'),
       Country = x.data.value('COUNTRY[1]','varchar(25)'),
       Company = x.data.value('COMPANY[1]','varchar(100)'),
       Price = x.data.value('PRICE[1]','numeric(5,2)'),
       YearReleased = x.data.value('YEAR[1]','smallint')
FROM @CD t
       CROSS APPLY t.XMLData.nodes('/CATALOG/CD') x(data);
SELECT * FROM dbo.CD_Info;
       

SQL SERVER - JOINS - III


       

The SELF JOIN

The JOIN operator can be used to combine any pair of tables, including combining the table to itself. This is the "self join". Self joins can use any JOIN operator.
For instance, consider this classic example of returning an employee's boss (based on Table1). In this example, we assume that the value in field2 is the boss's employee number, and therefore relates back to key1.

SELECT t1.key1, t1.field1 as Name, 
   t1.field2, mirror.field1 as Boss
FROM Table1 t1
   LEFT JOIN Table1 mirror ON t1.field2 = mirror.key1;
And this is the output of this query.

key1 Name field2 Boss
3 Erik 8 Harry
4 John 3 Erik
6 Mark 3 Erik
7 Peter 8 Harry
8 Harry 0 null

In this example, the last record shows that Harry has no boss, or in other words, he is the #1 in the company's hierarchy.

Excluding the Intersection of the Sets

Checking the Venn diagrams I showed previously, one may come to a simple question: what if I need to get all records from Table1 except those that match records in Table2? Well, this is pretty useful in day-to-day business, but obviously we don't need a special JOIN operator to do it.
Observe the result sets above and you will see you only need to add a WHERE clause to your SQL statement, looking for records that have a NULL value for Table2's key. So, the result set we are looking for is the red area shown in the Venn diagram in the figure below.


Non-matching records from Table1.

We can write a LEFT JOIN for this query, for instance:

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key, 
   t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
   LEFT JOIN Table2 t2 ON t1.key2 = t2.key2 
WHERE t2.key2 IS NULL;
And, finally, the result set will be:

key1 Name T1Key T2Key City
6 Mark 7 null null
7 Peter 8 null null

When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values; otherwise the result set may include unwanted records. That's why I suggested using the second table's key, and more specifically its primary key. Since primary keys don't accept NULL values, they assure our result set will be just what we needed.
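As a side note, the same anti-join can be written with NOT EXISTS, which sidesteps the nullable-column concern entirely because no Table2 column has to appear in the result. A sketch (not from the original article):

```sql
-- Anti-join via NOT EXISTS: no NULL-check on a Table2 column is needed.
SELECT t1.key1, t1.field1 AS Name, t1.key2 AS T1Key
FROM Table1 t1
WHERE NOT EXISTS (SELECT 1
                  FROM Table2 t2
                  WHERE t2.key2 = t1.key2);
```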

One Word about Execution Plans

These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of a SQL query first calculates the result set for the FROM clause and the JOIN operator (if any), and only then executes the WHERE clause.
This is as true for SQL Server as for any other RDBMS.
Having a basic understanding of how SQL works is important for any DBA or developer; it will help you get things done in a fast and reliable way. If you are curious about this, just take a look at the execution plan for the query above, shown in the figure below.


The execution plan for a query using a LEFT JOIN

Joins and Indexes

Take a look again at the Execution Plan of that query. Notice it used the clustered indexes of both tables. Using indexes is the best way to make your query run faster. But you have to pay attention to some details.
When we write our queries, we expect the SQL Server Query Optimizer to use the table indexes to boost query performance. We can also help the Query Optimizer by making indexed fields part of the query.
For instance, when using the JOIN operator, the ideal approach is to base the join condition over indexed fields. Checking again the Execution Plan, we notice that the clustered index on Table2 was used. This index was automatically built on key2 when this table was created, as key2 is the primary key to that table.
On the other hand, Table1 had no index on field key2. Because of that, the query optimizer improved the performance of querying key2 using the only available index: the table's clustered index, based on key1, the primary key of Table1. You see the query optimizer is really a smart tool, but you would help it a lot by creating a new (non-clustered) index on key2.
Remembering a bit about referential integrity, you see key2 should be a foreign key on Table1, because it is related to a field in another table (Table2.key2).
Personally I believe foreign keys should exist in all real-world database models. And it is a good idea to create non-clustered indexes on all foreign keys. You will always run lots of queries, and also use the JOIN operator, based on your primary and foreign keys.
(IMPORTANT: SQL Server will automatically create a clustered index on primary keys. But, by default, it does nothing with foreign keys. So make sure you have the proper settings on your database).
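Assuming the sample tables from the previous articles, declaring the relationship and indexing it might look like the sketch below (the constraint and index names are made up):

```sql
-- Hypothetical names. Declare the foreign key, then index it ourselves,
-- since SQL Server does not create an index on foreign keys automatically.
ALTER TABLE Table1 ADD CONSTRAINT FK_Table1_Table2
    FOREIGN KEY (key2) REFERENCES Table2 (key2);

CREATE NONCLUSTERED INDEX IX_Table1_key2 ON Table1 (key2);
```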

Non-equal Comparisons

When we write SQL statements using the JOIN operator, we usually compare whether one field in one table is equal to another field in the other table. But this is not the mandatory syntax. We could use any comparison operator, such as not equal to (<>), greater than (>), less than (<) and so on.
Although this fancy stuff might give you the impression that SQL gives you a lot of power, I feel this is more of a cosmetic feature. Consider this example. See Table1 above, where we have 5 records. Now let's consider the following SQL statement.

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key, 
   t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
   INNER JOIN Table2 t2 ON t1.key2 <= t2.key2
WHERE t1.key1 = 3 ;
Notice this uses an inner join and we are specifically picking one single record from Table1, the one where key1 is equal to 3. The only problem is that there are 6 records in Table2 that satisfy the join condition. Take a look at the output of this query.

key1 Name T1Key T2Key City
3 Erik 1 1 New York
3 Erik 1 2 Sao Paulo
3 Erik 1 4 Paris
3 Erik 1 5 London
3 Erik 1 6 Rome
3 Erik 1 9 Madrid
The problem with non-equal joins is that they usually duplicate records, and this is not something you will need on a regular basis. Anyway, now you know you can do it.

Multiple JOINs

SQL JOINs are always about putting together a pair of tables and finding related objects that obey a given rule (usually, but not limited to, equal values). We can join multiple tables. For instance, to combine 3 tables, you will need 2 joins. And a new join will be necessary for each new table. If you use a join in each step, to combine N tables, you will use N-1 joins.
One important thing is that SQL allows you to use different types of joins in the same statement.
But DBAs and developers have to be careful about joining too many tables. Several times, I have seen situations where queries demanded 10, 20 tables or even more. For performance reasons, it is not a good idea to do a single query to put all the data together. The query optimizer will do a better job if you break your query into several smaller, simpler queries.
Now consider we have a third table, called Table3, shown below.

Table3

key3 field1
1 Engineer
2 Surgeon
3 DBA
4 Lawyer
5 Teacher
6 Actor

Now let's write a statement to bring back the employee's name, the city where he lives and his profession. This requires joining all 3 tables. Just remember that joins are written in pairs. So first we will join Table1 to Table2, and then we will join Table1 and Table3. The resulting script is shown below.

SELECT t1.key1, t1.field1 as Employee, 
     t2.key2, t2.field1 as City,
     t3.key3, t3.field1 as Profession
FROM Table1 t1
     INNER JOIN Table2 t2 ON t1.key2 = t2.key2
     INNER JOIN Table3 t3 ON t1.key3 = t3.key3;

As we are running only INNER JOINs, we will have only records that match the combination of the 3 tables. See the output below.

key1 Employee key2 City key3 Profession
3 Erik 1 New York 6 Actor
4 John 4 Paris 4 Lawyer
8 Harry 9 Madrid 2 Surgeon
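Coming back to the earlier advice about breaking very large multi-table queries apart: as a sketch of that idea (the temp table name #EmpCity is made up), the 3-table join above could be staged in two smaller steps:

```sql
-- Stage the first join in a temp table, then join the remaining table
-- against the smaller intermediate result.
SELECT t1.key1, t1.field1 AS Employee, t2.field1 AS City, t1.key3
INTO #EmpCity
FROM Table1 t1
     INNER JOIN Table2 t2 ON t1.key2 = t2.key2;

SELECT e.key1, e.Employee, e.City, t3.field1 AS Profession
FROM #EmpCity e
     INNER JOIN Table3 t3 ON e.key3 = t3.key3;

DROP TABLE #EmpCity;
```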

 

Beyond the SELECT statements

The use of JOIN operators is not restricted to SELECT statements. In T-SQL, you can use joins in INSERT, DELETE and UPDATE statements as well. But keep in mind that many other RDBMSs do not support joins in DELETE and UPDATE statements. So I would advise you to restrict the use of joins to SELECT and INSERT statements, even in SQL Server code, if you want to keep your scripts easily portable to different platforms.
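For completeness, here is a sketch of T-SQL's proprietary UPDATE ... FROM form with a join, using the sample tables (the WHERE condition is made up); this is exactly the kind of statement that does not port easily:

```sql
-- T-SQL-specific UPDATE with a join; not portable to most other RDBMSs.
UPDATE t1
SET    t1.field2 = 0
FROM   Table1 t1
       INNER JOIN Table2 t2 ON t1.key2 = t2.key2
WHERE  t2.field3 = 'N';
```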

Source: SQL Server Central    

SQL SERVER - JOINS - II


    

 The LEFT JOIN

Also known as LEFT OUTER JOIN, this is a particular case of the FULL JOIN. It brings all requested data from the table that appears to the left of the JOIN operator, plus the data from the right table which intersects with the first one. The Venn diagram in the figure below illustrates the LEFT JOIN of two tables.

 Representing the LEFT JOIN

See the syntax below,

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key, 
   t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
   LEFT JOIN Table2 t2 ON t1.key2 = t2.key2 ;
The result set of this statement will be:

key1 Name T1Key T2Key City
3 Erik 1 1 New York
4 John 4 4 Paris
6 Mark 7 null null
7 Peter 8 null null
8 Harry 9 9 Madrid

The third and fourth records (key1 equal to 6 and 7) show NULL values in the last fields because there is no information to be brought from the second table. This means we have a value in field key2 in Table1 with no corresponding value in Table2. We could have avoided this "data inconsistency" if we had a foreign key on field key2 in Table1.

The RIGHT JOIN

Also known as RIGHT OUTER JOIN, this is another particular case of the FULL JOIN. It brings all requested data from the table that appears to the right of the JOIN operator, plus the data from the left table which intersects with the right one. The Venn diagram for the RIGHT JOIN of two tables is in the figure below.

 Representing the RIGHT JOIN

As you can see, syntax is very similar.

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
   t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
   RIGHT JOIN Table2 t2 ON t1.key2 = t2.key2 ;
The result set of this statement will be:

key1 Name T1Key T2Key City
null null null 0 Bangalore
3 Erik 1 1 New York
null null null 2 Sao Paulo
4 John 4 4 Paris
null null null 5 London
null null null 6 Rome
8 Harry 9 9 Madrid

Observe now that the records with key1 equal to 6 and 7 no longer appear in the result set, because they have no corresponding record in the right table. There are 4 records showing NULL values in the first fields, because they have no match in the left table.

The CROSS JOIN

A CROSS JOIN is in fact a Cartesian product. Using CROSS JOIN generates exactly the same output as listing the two tables (separated by a comma) in the FROM clause without any JOIN at all. This means we will get a huge result set, where each record of Table1 is repeated for each record in Table2. If Table1 has N1 records and Table2 has N2 records, the output will have N1 times N2 records.
I don't believe there is any way to represent this output in a Venn diagram. I guess it would be a three dimensional image. If this is really the case, the diagram would be more confusing than explanatory.

The syntax for a CROSS JOIN will be:

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key, 
 t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
 CROSS JOIN Table2 t2 ;

As Table1 has 5 records and Table2 has another 7, the output for this query will have 35 records (5 x 7).


Quite honestly, I can't remember at this very moment a single real-life situation where I needed to generate a Cartesian product of two tables. But whenever you need it, CROSS JOIN is there.
Besides, you should be concerned about performance. Say you accidentally run, on your production server, a query with a CROSS JOIN over two tables with 1 million records each. That query would produce a trillion-row result set, which is surely something that will give you a headache: your server will probably start showing performance problems, as the query may run for a long time consuming a considerable amount of server resources.


 In the next article, we will see other joins.
      

SQL SERVER - JOINS - I

       
The first thing we learn to do with SQL is writing a SELECT statement to get data from one table. This kind of statement seems to be straightforward and very close to the language we speak.
But real-world queries are often much more sophisticated than those simple SELECT statements.
First of all, usually the data we need is split into several different tables. This is a natural consequence of data normalization, which is an essential feature of any well designed database model. And SQL gives you the power to put that data together.

In the past, DBAs and developers used to put all necessary tables and/or views in the FROM clause and then use the WHERE clause to define how the records from each table would combine with the other records.

But we have had a standard for bringing this data together for a long time now: the JOIN operator (ANSI SQL-92). Unfortunately, some details about JOIN operators remain obscure for many people.
Below I will show the different join syntaxes supported by T-SQL (that is, SQL Server 2008). I will outline a few concepts I believe shouldn't be forgotten each time we combine the data from two or more tables.

Getting Started: 1 Table, no Join

When you have only one object to query, the syntax will be quite simple and no join will be used. The statement will be the good old "SELECT fields FROM object" plus any other optional clause you might want to use (that is, WHERE, GROUP BY, HAVING or ORDER BY).
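As a quick sketch (using the sample Table1 defined later in this article), such a statement might look like:

```sql
-- One table, no join: just SELECT ... FROM plus optional clauses.
SELECT key1, field1
FROM Table1
WHERE field2 > 0
ORDER BY field1;
```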

One thing that end users don't know is that we DBAs usually hide lots of complex joins under one nice and easy-to-use view. This is done for several reasons, ranging from data security to database performance. For instance, DBAs can give permissions for the end users to access one single view instead of several production tables, obviously increasing data security. Or considering performance, DBAs can create a view using the right parameters to join the records from several tables, correctly using database indexes and thus boosting query performance.
All in all, joins might be there in the database even when the end users don't see them.
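A sketch of such a view, with a made-up name and based on the sample tables defined later in this article, could be:

```sql
-- Hypothetical view hiding a join behind a simple object for end users.
CREATE VIEW dbo.vEmployeeCity
AS
SELECT t1.key1, t1.field1 AS Name, t2.field1 AS City
FROM Table1 t1
     INNER JOIN Table2 t2 ON t1.key2 = t2.key2;
```

End users then simply run SELECT * FROM dbo.vEmployeeCity, never seeing the join underneath.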

The Logic Behind Joining Tables

When I started working with SQL, I learned there were several types of joins. But it took me some time to understand what exactly I was doing when I brought those tables together. Maybe because people are so scared of mathematics, it is not frequently said that the whole idea behind joining tables is about Set Theory. Despite the fancy name, the concept is so simple we are taught it in elementary school.

The drawing in Figure 1 is quite similar to the ones found in kids' books from First Grade. The idea is to find correspondent objects in the different sets. Well, this is precisely what we do with SQL JOIN's!

  
Combining objects from different sets


Once you understand the analogy, things will start to make sense.
Consider that the 2 sets in the above figure are tables and the numbers we see are the keys we will use to join the tables. So in each set, instead of representing the whole records, we are only seeing the key fields from each table. The result set of this combination will be determined by the type of join we consider, and this is the topic I will show now. To illustrate our examples, consider we have 2 tables, shown below:

Table1:

CREATE TABLE Table1 (key1 INT NOT NULL,
                     field1 VARCHAR(50),
                     field2 INT,
                     key2 INT NOT NULL,
                     key3 INT NOT NULL,
                     PRIMARY KEY (key1));

key1 field1 field2 key2 key3
3 Erik 8 1 6
4 John 3 4 4
6 Mark 3 7 1
7 Peter 6 8 5
8 Harry 0 9 2

 

Table2:

CREATE TABLE Table2 (key2 INT NOT NULL,
                     field1 VARCHAR(50),
                     field2 VARCHAR(1),
                     field3 VARCHAR(1),
                     PRIMARY KEY (key2));

key2 field1 field2 field3
1 New York A N
2 Sao Paulo B N
4 Paris C Y
5 London C Y
6 Rome C Y
9 Madrid C Y
0 Bangalore D N


You will notice this script does not fully implement referential integrity. I intentionally left the tables without foreign keys to better explain the functionality of the different types of joins, but I did so for didactic purposes only. Foreign keys are extremely useful to guarantee data consistency and they should not be left out of any real-world database.
Well, now we are ready to go. Let's check the types of joins we can use in T-SQL, the correspondent syntax and the result set that each one will generate.

The Inner Join

This is the most common join we use in SQL. It returns the intersection of two sets. Or, in terms of tables, it brings only the records from both tables that match a given criterion.
We can see in Figure 2 the Venn diagram illustrating the inner join of two tables. The result set of the operation is the area in red.

 Representing the INNER JOIN

Now check the syntax to combine the data from Table1 and Table2 using an INNER JOIN.

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key, 
   t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
   INNER JOIN Table2 t2 ON t1.key2 = t2.key2; 

The result set of this statement will be:

key1 Name T1Key T2Key City
3 Erik 1 1 New York
4 John 4 4 Paris
8 Harry 9 9 Madrid

Notice it returned only the data from records which have the same value for key2 on both Table1 and Table2.
Opposed to the INNER JOIN, there is also the OUTER JOIN. There are 3 types of OUTER JOINs, named full, left and right. We will look at each one in detail below.

The FULL JOIN

This is also known as the FULL OUTER JOIN (the reserved word OUTER is optional). FULL JOINs work like the union of two sets. Now we have in Figure 3 the Venn diagram illustrating the FULL JOIN of two tables. The result set of the operation is again the area in red.


 Representing the FULL JOIN

The syntax is almost exactly the same we saw before.

SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
    t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
    FULL JOIN Table2 t2 ON t1.key2 = t2.key2 ;

The result set of this statement will be:

key1 Name T1Key T2Key City
3 Erik 1 1 New York
4 John 4 4 Paris
6 Mark 7 null null
7 Peter 8 null null
8 Harry 9 9 Madrid
null null null 2 Sao Paulo
null null null 5 London
null null null 6 Rome
null null null 0 Bangalore

The FULL JOIN returns all records from Table1 and Table2, without duplicating data.

In the next article, we will see other joins.

Source: SQL Server Central


SQL SERVER - Replace Multiple Spaces with One

 

Introduction

Replacing multiple spaces with a single space is an old problem. If you Google the problem, you find that most folks still resort to While Loops in functions or maybe even a Tally table or (ugh!) XML in a function to solve this seemingly complex problem. The truth is that you don't need the RBAR (Row-By-Agonizing-Row) of a User Defined Function at all.
This article explains the "puzzle solving" behind a common set based method to replace unknown numbers of adjacent spaces with a single space by establishing and replacing simple patterns in a set based fashion.

The Problem

You have a column of data that looks something like the following.

-- Create and populate a test table.
-- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in    it.  '
 UNION ALL
 SELECT 'So                     does                      this!'
 UNION ALL
 SELECT 'As                                does                        this'
 UNION ALL
 SELECT 'This, that, and the other  thing.'
 UNION ALL
 SELECT 'This needs no repair.'

The goal is to convert the sections of multiple spaces of unknown length to a single space each, as quickly as possible. There are some additional requirements: you can't make the data any bigger during the process, because it might exceed VARCHAR(8000) (or whatever size the column is), and you may not be able to use VARCHAR(MAX) because you're on SQL Server 2000.

The Method Explained

I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....
O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO
Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.

So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces. When we replace all pairs of space "OO" with "OX", we get the following
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

STEP 2 is to replace all occurrences of "XO" with NOTHING...
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

... and that leaves us with ...
O
OX
O
OX
O
OX
O
OX

STEP 3 is to replace "X" with NOTHING...
O
OX
O
OX
O
OX
O
OX

... and that leaves us with just single spaces everywhere...
O
O
O
O
O
O
O
O

Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some "unlikely character", like a special non-printable, almost non-typeable character such as ASCII 7 (the "Bell" character). It's just a matter of 3 nested REPLACE functions to handle ANY number of spaces and accomplish the puzzle we solved above. It can all be done in a single set-based query without loops or even UDFs.
Be careful which "unlikely character" you pick, though. We'll talk more about that when we get to the "Unlikely Characters and Collation" section of this article further below.

The Code

Ok... now that you know how it works, here's the code that accomplishes the 3 steps as 3 nested REPLACE's. I've included the test table I previously covered in "The Problem" section of this article just to make things easy to run. Notice that I've also added an LTRIM/RTRIM to take out any leading or trailing spaces, as well.

-- Create and populate a test table.
-- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' 
 UNION ALL
 SELECT 'So                     does                      this!' 
 UNION ALL
 SELECT 'As                                does                        this' 
 UNION ALL
 SELECT 'This, that, and the other  thing.' 
 UNION ALL
 SELECT 'This needs no repair.'

 -- Reduce each group of multiple spaces to a single space
 -- for a whole table without functions, loops, or other
 -- forms of slow RBAR.  In the following example, CHAR(7)
 -- is the "unlikely" character that "X" was used for in 
 -- the explanation.
 SELECT REPLACE(
            REPLACE(
                REPLACE(
                    LTRIM(RTRIM(OriginalString))
                ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model
            ,CHAR(7)+' ','')        --Changes the XO model to nothing
        ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing
 FROM @Demo
 WHERE CHARINDEX('  ',OriginalString) > 0;
Results from Code Above,
CleanString
This has multiple unknown spaces in it.
So does this!
As does this
This, that, and the other thing.

 

Unlikely Characters and Collation

Just a quick note on "unlikely characters". You do have to be really careful about what you select as an "unlikely character" for the "X" of the "OX" model we previously discussed. For example, if you have the not-so-uncommon collation of Latin1_General_CI_AI on a column and you've chosen the "unlikely character" of Thorn (þ), you could end up deleting a whole lot more than you bargained for. According to Wikipedia, "þ" (the Thorn character) still survives as 30th character of the Icelandic alphabet and appears in other alphabets, as well. Further, in many languages, it has been replaced by the "th" digraph and, in certain collations like the Latin1_General_CI_AI collation, the "þ" character and "th" are treated as equals.

For example,

-- Create and populate a test table.
-- *** NOTICE THE COLLATION SETTING ON THE STRING COLUMN. ***
DECLARE @Demo TABLE(OriginalString VARCHAR(8000) COLLATE Latin1_General_CI_AI)
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.  ' 
 UNION ALL
 SELECT ' So                     does                      this!   ' 
 UNION ALL
 SELECT '   As                                does                                this' 
 UNION ALL
 SELECT 'This, that, and the other thing.'

--===== This uses a "thorn" character as the "X" of the "OX" model 
 SELECT LTRIM(RTRIM(
            REPLACE(REPLACE(REPLACE(OriginalString,'  ',' þ'),'þ ',''),'þ','')
        ))
 FROM @Demo

Because of the equivalent treatment of the Thorn (þ) character and the "th" digraph, you end up with a most undesirable result.

Results from Code Above,
CleanString
is has multiple unknown spaces in it.
So does is!
As does is
is, at, and e oer ing.

... which is nothing like what we wanted.

There's also a danger in selecting the wrong "control character" (ASCII characters 0 through 31) as the "unlikely character". I chose CHAR(7) which is a very benign character in today's world of electronics. It was designed to literally ring the bell on old Tele-Type machines and is just passed through by today's electronics.
If you choose CHAR(0), you've just chosen the "NULL" character, and anything that follows its appearance will simply disappear. Using the current test table in this article, you'd get output that looks like the following,

Results from Code Above,
CleanString
This
So
As
This,that,andtheotherthing.

If you choose CHAR(1), you've just chosen the ASCII "Start of Header" character, which is still in use today. When certain electronics see this character, it can cause some very strange behavior, usually resulting in failure of your code.

Another good "unlikely character" is CHAR(8), which is the ASCII "Backspace" character. It's normally never included in any type of assembled text nowadays. CHAR(127) also works well because it's a leftover for "Delete" from the paper-tape world: it actually punches all the holes in a paper tape (7 of them) to quite literally delete a character.
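For example, here is the same three-step REPLACE with CHAR(8) swapped in as the "unlikely" character, shown inline against a literal string for brevity:

```sql
-- Same OX/XO technique, using CHAR(8) (Backspace) as the marker.
SELECT REPLACE(
           REPLACE(
               REPLACE(LTRIM(RTRIM('  a   b  c ')), '  ', ' ' + CHAR(8)),
           CHAR(8) + ' ', ''),
       CHAR(8), '') AS CleanString;   -- yields 'a b c'
```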
    

Introduction to Indexes - Part VI


  

How is a nonclustered index used?

 

Seek

For SQL to do a seek on a nonclustered index, the query must have a SARGable1 predicate referencing the index key or a left-based subset of the index key. In addition to this, the index must either be covering or return a sufficiently small number of rows that the required lookups to the clustered index/heap (to retrieve the remainder of the columns) are not considered too expensive by the optimiser. If the required lookups are considered too expensive then the index will not be used.
It usually works out that the number of rows where the optimiser decides that key/RID lookups are too expensive is somewhere around 0.5%-1% of the total rows in the table.
(1) - SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek.
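To illustrate with a hypothetical Orders table that has a nonclustered index on OrderDate:

```sql
-- SARGable: the bare column can be compared against the index keys,
-- so an index seek is possible.
SELECT OrderID FROM Orders
WHERE OrderDate >= '20100101' AND OrderDate < '20100201';

-- Not SARGable: wrapping the column in a function hides it from the
-- index, typically forcing a scan instead.
SELECT OrderID FROM Orders
WHERE YEAR(OrderDate) = 2010;
```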

Scan

An index scan is a read of all of the leaf pages in the nonclustered index. A scan of a nonclustered index is generally more efficient than a scan of the clustered index, because nonclustered indexes are generally smaller.
A nonclustered index scan usually indicates that the index contains all the columns required by the query but they are in the wrong order, that the query predicates are not SARGable, or that there are no query predicates.

 

Update

When a change is made to a column that is either a key column or an include column of a nonclustered index, that index will be modified as part of the update statement. Indexes are never left out of sync with the underlying table data.
Inserts and deletes will affect all nonclustered indexes on a table.

 

Considerations for selecting nonclustered indexes

The decision as to what columns should be indexed should be based on the queries that are run against the table. There’s no point in indexing a column that is never used in a query.

 

Selectivity

In general, a nonclustered index should be selective. That is, the values in the column should be fairly unique and queries that filter on it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations and if a nonclustered index is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of the lookups aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the query would be seeking on) insufficiently selective, then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.
It is important to note that this does not just apply to the leading column. There are scenarios where a very unselective column can be used as the leading column, with the other columns in the index making it selective enough to be used.

 

Single column vs. Multi-column indexes

In general, wider nonclustered indexes are more useful than single column nonclustered indexes. This is because it is very unusual for SQL to use multiple nonclustered indexes on the same table to evaluate a query. An index defined over more than one column is referred to as a composite index.
Consider a hypothetical table (Table1) that has three indexes on it: one on Col1, one on Col2 and one on Col3. If a query with predicates on all three of those columns is executed against that table, it is exceedingly unlikely that SQL will seek on all three indexes and intersect the results. It is far more likely that one of the indexes will be used: the predicate against the column within the selected index is evaluated, then the rest of the columns are retrieved (from the clustered index), and then the remaining two predicates are evaluated.
Whereas, if there were a single index defined with all three columns, the matching rows could be located with a single seek operation.
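In index-creation terms, the two alternatives look like this (index names are hypothetical):

```sql
-- Three single-column indexes: at most one is likely to be used per query,
-- with the remaining predicates applied after a lookup.
CREATE INDEX idx_Table1_Col1 ON Table1 (Col1);
CREATE INDEX idx_Table1_Col2 ON Table1 (Col2);
CREATE INDEX idx_Table1_Col3 ON Table1 (Col3);

-- One composite index: predicates on Col1, Col2 and Col3
-- can all be resolved with a single seek.
CREATE INDEX idx_Table1_Composite ON Table1 (Col1, Col2, Col3);
```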

Three separate indexes:

[Figure: seeks against three single-column indexes, each located separately]

One composite index:

[Figure: a single composite index satisfying all three predicates with one seek]

Column order

The order of the columns in an index key should be chosen based on several factors: the selectivity of the columns, what percentage of queries will filter on each column, and whether the queries will filter with equality or inequality matches.
If the query predicate is based on multiple columns and there is an index where the combination of columns forms a left-based subset of the index key, then retrieving the rows is a single seek operation. If the query predicate is based on multiple columns and there is an index that contains those columns as part of the index key, but they do not all form a left-based subset of the index key, then retrieving those rows is a 2-step process; seek on the columns that do form a left-based subset and then filter out rows that don’t match the other conditions.

Let’s look at a quick example.
Imagine we have a hypothetical table named Table1 and that there is an index on that table with the index key consisting of three columns – Col1, Col2 and Col3.

CREATE INDEX idx_Table1_Demo
      ON Table1 (Col1, Col2, Col3)

Let’s say further that there are two queries against that table.

SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col2 = @Var2
This query can be evaluated by a single seek operation against the two columns.



SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col3 = @Var3
This query cannot be evaluated by a single seek operation against the two columns. Rather, it has to be done as a seek on one column, Col1, and the rows returned from that seek are then checked to see whether they match the second predicate.


The other thing that has to be considered is the type of predicates that the queries will be using.

It is often said that the most selective column should be made the leading column of an index. This is true, but it must be considered in light of the other factors. There is little point in making a column the leading column of an index if only 5% of the queries that use that index filter on that column. It would mean that only those queries could seek on the index; the other 95% would either scan or use a different index (if one existed).
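As a sketch of why the type of predicate matters (hypothetical index and variables, in the style of the queries above), placing the equality column first keeps the whole predicate seekable:

```sql
-- With the key order (Col1, Col2), the equality on Col1 narrows the seek
-- to one slice of the index, and the range on Col2 is then read as a
-- contiguous portion of that slice. With the order reversed, the range
-- predicate would fragment the seek.
CREATE INDEX idx_Table1_EqFirst ON Table1 (Col1, Col2);

SELECT Col1
FROM Table1
WHERE Col1 = @Var1
  AND Col2 BETWEEN @Var2 AND @Var3;
```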
     
Related Posts:

Introduction to Indexes - Part I
Introduction to Indexes - Part II
Introduction to Indexes - Part III
Introduction to Indexes - Part IV
Introduction to Indexes - Part V
Introduction to Indexes - Part VI
     

Introduction to Indexes - Part V

   
In this article, we will take a closer look at nonclustered indexes, how SQL uses nonclustered indexes and some of the recommendations for selecting useful nonclustered indexes.

What is a nonclustered index?

A nonclustered index is the second type of index in SQL Server. It has the same b-tree structure as the clustered index (see the previous article). Unlike the clustered index, a nonclustered index does not contain the entire data row at the leaf level. Rather, the nonclustered index contains just the columns defined in the index, and a pointer to the actual data row. See figure 1 for a high-level architecture. Assume that the nonclustered index depicted there is based off the clustered index depicted in the previous part.


When the underlying table is a heap (has no clustered index), the pointer to the data row is the RID, an 8-byte structure composed of the File ID, Page Number and Slot Index, i.e. the actual location of the row within the data file.
When the underlying table has a clustered index, the pointer to the actual data row is the clustering key. As mentioned in the previous article, this has important considerations for the selection of a clustering key.
There can be multiple nonclustered indexes on a table. In SQL 2005 and earlier, there was a limit of 249 nonclustered indexes per table. In SQL 2008, with the introduction of filtered indexes, the limit has been increased to 999.

Include columns

Columns specified as include columns are stored at the leaf level of the nonclustered index, but not at the intermediate or root levels. They are not part of the index key and do not count towards the 16 column/900 byte limit on index keys. Any data type other than the old LOB types (TEXT, NTEXT, IMAGE) is allowed for include columns, although columns defined as varbinary(MAX) FILESTREAM are not permitted as include columns. The ability to specify include columns was added in SQL 2005.
Include columns are useful in that they make columns available in the index without contributing to the size of the index key; hence they make it easier to create covering indexes (see next section) than could be done without them.
Typically columns that appear only in the select clause of a query and not within the WHERE, FROM or GROUP BY are candidates for INCLUDE columns.
If LOB columns are specified as include columns, the entire contents of the LOB columns are duplicated. It’s not just a pointer to the existing LOB pages added to the nonclustered index.

Covering indexes

Covering is not a property of an index alone; it is not possible to say, without knowing the query, whether an index is covering. Instead, covering describes an index and a query together.
An index is said to be covering for a specific query if that index contains within it all of the columns necessary for the query. The columns may be part of the key or they may be include columns. This means that a query that has a covering index can be evaluated completely from that index, without needing to go to the base table (heap or cluster) to fetch additional columns.
It has been said before that creating a covering index for a query is just about the best way to speed a query up. This is true, but it is not always possible or desirable to cover every query. Covering every query may lead to unacceptably large indexes, or unacceptably large numbers of indexes, with all the attendant problems (as mentioned in part 1).
It is possible (and often desirable) for an index to be covering for more than one query.
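A minimal sketch (the index name is hypothetical; the query is the ProductNumber lookup used elsewhere in this series): keying on ProductNumber and carrying Name as an include column makes the index covering for this query, since ProductID, as the clustering key, is present in every nonclustered index anyway.

```sql
CREATE INDEX idx_Product_ProductNumber
    ON Production.Product (ProductNumber)
    INCLUDE (Name);

-- Covered: all three columns are available in the index,
-- so no key lookup to the clustered index is needed.
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHERE ProductNumber = 'HN-1224';
```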

Filtered indexes

Filtered indexes are a new feature in SQL 2008 and they allow for an index to contain only some of the rows in the table. Prior to this, an index would always have, at the leaf level, the same number of rows as the table. With filtered indexes, an index can be based on just a subset of the rows.
When creating a filtered index, a predicate is specified as part of the index creation statement. There are several limitations on this predicate. The comparison cannot reference a computed column, a column that is declared as a user-defined type, a spatial column or a hierarchyid column.
A clustered index cannot be filtered as it is the actual table.
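A sketch of the syntax (the table and column names here are hypothetical), indexing only the rows still awaiting shipment:

```sql
-- SQL 2008+: the WHERE clause limits the index to a subset of rows,
-- keeping it small and cheap to maintain.
CREATE INDEX idx_Orders_Unshipped
    ON dbo.Orders (OrderDate)
    WHERE ShippedDate IS NULL;
```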

In the next article, we will see how a non-clustered index is used.

Related Posts:

Introduction to Indexes - Part I
Introduction to Indexes - Part II
Introduction to Indexes - Part III
Introduction to Indexes - Part IV
Introduction to Indexes - Part V
Introduction to Indexes - Part VI
   

Introduction to Indexes - Part IV

on August 01, 2010

    

Considerations for selecting the clustering key


There are two main schools of thought on what makes a good clustering key. One school says to put the clustered index on the column or set of columns that would be most useful for queries, either frequently run queries or ones reading large ranges of data. The other school says to use the clustered index primarily to organise the table and leave data access to the nonclustered indexes.
I hold to the second school, so the rest of this article will be written from that perspective.
There are four main attributes that are desirable for a clustering key. The clustering key should be:
  • Narrow
  • Unique
  • Unchanging
  • Ever-increasing

Narrow

The width of the clustering key affects the depth of the clustered index and hence its efficiency. A clustered index with a very deep b-tree requires queries to read more intermediate pages to locate rows. These extra page reads make deeper clustered indexes less efficient for data access, especially for lookups. Additionally, more pages means more space used on disk and more space used in memory when the index is in the data cache.
The width of the clustering key does not, however, only affect the clustered index. The clustering key, being the rows’ address, is located in every single nonclustered index. Hence a wide clustering key increases the size of all nonclustered indexes, reducing their efficiency as well.

Unique

The clustered index has to be unique, as it's used as the address for a row. If the clustered index is not defined as unique, SQL makes it unique by adding a hidden 4-byte integer column. This makes the clustered index wider than it needs to be and makes the nonclustered indexes wider than they should be.
This attribute needs to be considered in relation to the others. If making the cluster unique requires adding several wide columns then it may be better to keep the index narrower and leave it as not unique.

Unchanging

The clustering key defines where a row is found, which page it is in. If the value of the clustering key is changed, the row must be moved from the page where it currently is to a new location. This means that the update is essentially a delete-insert combination.

Ever-increasing

An ever-increasing column is one where every value inserted is higher than all values currently in the table. This is important for a clustered index as, if inserts occurred all over the table, there would be a high number of page splits and the index would become rapidly fragmented.
Since the clustered index is the largest index on a table (as it contains the entire data row) and because it tends to be used for scans more than other indexes, the clustered index is the one that is affected the most when fragmented and the one that, generally, is the most expensive to rebuild.
For this reason, an ever-increasing clustering key is usually a good choice, especially for tables that are subject to frequent inserts.
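Putting the four attributes together, an integer IDENTITY column is the classic choice; a sketch with hypothetical names:

```sql
CREATE TABLE dbo.Orders (
    -- Narrow (4 bytes), unique, never updated, and ever-increasing.
    OrderID   int IDENTITY(1,1) NOT NULL,
    OrderDate datetime NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);
```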

In the next part, we will look at nonclustered indexes.

Related Posts:

Introduction to Indexes - Part I
Introduction to Indexes - Part II
Introduction to Indexes - Part III
Introduction to Indexes - Part IV
Introduction to Indexes - Part V
Introduction to Indexes - Part VI
   

Introduction to Indexes - Part III

   
In the previous article I looked at the very basics of what indexes are, what types exist in SQL and how they’re used. In this article I’m going to take a closer look at clustered indexes, what makes them different from nonclustered indexes, how SQL uses clustered indexes and some of the recommendations for selecting the clustered index key.

What is a clustered index?

A clustered index is an index where the leaf level of the index contains the actual data rows of the table. Like any other index, a clustered index is defined on one or more columns – the index key. The key columns for a clustered index are often referred to as the clustering key.
The clustered index can be looked at as a balanced tree (b-tree) structure built on top of a table, with the rows in the table stored logically in the order of the clustering key (see figure 1). The clustered index essentially is the table and hence there cannot be more than one on a table.


The clustering key is, in some ways, the 'address' of a data row. It can be used, through the tree structure of the clustered index, to locate a row of data. Because it is used to locate a data row, the clustering key must be unique. If, when the clustered index is created, it is defined as unique (either by specifying the UNIQUE keyword as part of the create index statement or by defining the clustered index as part of a primary key or unique constraint), then everything is fine. What if the clustered index is not defined as unique? In that case, SQL automatically adds another column to the index: a 4-byte integer column called the uniquifier. This column is hidden; it is not displayed in the table design and it cannot be queried directly.


Difference between a clustered index and a heap

A table that has no clustered index is referred to as a heap. Whereas a clustered index has the data stored logically in the order of the index key, a heap has no ordering of rows or pages.
When a row is added to a table with a clustered index, that row has a defined page that it must go on to, according to the value of the clustered index key. Using the index depicted in Figure 1 as an example, if a new row is added with a value of 4 for the clustered index key, that row must go onto the second leaf page. If that page is full, a new page gets allocated, joined into the index and some of the rows from the full page are moved to the new one. This is called a page split, it can be an expensive operation and it leads to fragmentation.
When a row is added to a table without a clustered index (a heap) there’s no specific place that the row has to go. A heap has no defined order. The row will be added wherever there’s a space large enough to put the row.

How a clustered index is used

There are three ways that a clustered index can be used by a query to locate data – seek, scan and lookup.

Seek

For the clustered index to be used in a seek operation, the query must contain a SARGable1 predicate based on the clustering key or part of the clustering key.
A simple example of a clustered index seek is as follows (based on the AdventureWorks database):
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHERE ProductID = 380
ProductID is the clustering key for the table Production.Product and the filter on ProductID is SARGable.
 
(1) - SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek.

Scan

A scan of the clustered index is equivalent to a table scan. It's a read of all of the data pages in the table.
Because the clustered index is the table, it can be seen as the default access path for queries to retrieve rows. If a query does not have a SARGable predicate1 based on the clustering key, and there are no nonclustered indexes appropriate for the query (what makes a nonclustered index appropriate is discussed in detail in a later part), then a scan of the clustered index will be done to retrieve the data for the query. This is a full scan of all the data pages, i.e. a table scan.
In this example a clustered index scan is done because there are no indexes on either of the columns in the where clause.
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHERE Color = 'Blue'
  AND DaysToManufacture BETWEEN 2 AND 4

Lookups

A lookup occurs when a nonclustered index was used to locate rows for a query, but the nonclustered index did not contain all of the columns required for the query. To fetch the remaining columns, a lookup is done to the clustered index.
A lookup is equivalent to a single row clustered index seek. Lookups are always done one row at a time. For this reason, they are very expensive operations, especially when lots of rows are involved.
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHERE ProductNumber = 'HN-1224'


In the next part, we will see the considerations for selecting the clustering key.

Related Posts:

Introduction to Indexes - Part I
Introduction to Indexes - Part II
Introduction to Indexes - Part III
Introduction to Indexes - Part IV
Introduction to Indexes - Part V
Introduction to Indexes - Part VI