SQL SERVER - Replace Multiple Spaces with One

on August 02, 2010
 

Introduction

Replacing multiple spaces with a single space is an old problem. If you Google the problem, you find that most folks still resort to While Loops in functions or maybe even a Tally table or (ugh!) XML in a function to solve this seemingly complex problem. The truth is that you don't need the RBAR of a User Defined Function at all.
This article explains the "puzzle solving" behind a common set based method to replace unknown numbers of adjacent spaces with a single space by establishing and replacing simple patterns in a set based fashion.

The Problem

You have a column of data that looks something like the following.

-- Create and populate a test table.
-- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in    it.  ' 
UNION ALL
 SELECT 'So                     does                      this!' 
UNION ALL
 SELECT 'As                                does                        this' 
UNION ALL
 SELECT 'This, that, and the other  thing.' 
 UNION ALL
 SELECT 'This needs no repair.'

The goal is to convert the sections of multiple spaces of unknown length to a single space each as quickly as possible. There are some additional requirements. You can't make the data any bigger during the process because it might get too big for VARCHAR(8000) (or whatever size the column is) or you can't use VARCHAR(MAX) because you're using SQL Server 2000.

The Method Explained

I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....
O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO
Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.

So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces. When we replace all pairs of space "OO" with "OX", we get the following
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

STEP 2 is to replace all occurrences of "XO" with NOTHING...
O
OX
OXOOXOX
OXOXOOXOXOX
OXOXOXOOXOXOXOX

... and that leaves us with ...
O
OX
O
OX
O
OX
O
OX

STEP 3 is to replace "X" with NOTHING...
O
OXO
OXO
OXO
OX

... and that leaves us with just singles spaces everywhere...
O
O
O
O
O
O
O
O

Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some "unlikely character" like a special non printable, almost non type-able character like ASCII 7 (the "Bell" character). It's just a matter of 3 nested REPLACE functions to handle ANY number of spaces to accomplish the puzzle we solved above. It can all be done in a single set-based query without loops or even UDF's.
Be careful which "unlikely character" you pick, though. We'll talk more about that when we get to the "Unlikely Characters and Collation" section of this article further below.

The Code

Ok... now that you know how it works, here's the code that accomplishes the 3 steps as 3 nested REPLACE's. I've included the test table I previously covered in "The Problem" section of this article just to make things easy to run. Notice that I've also added an LTRIM/RTRIM to take out any leading or trailing spaces, as well.

-- Create and populate a test table.
-- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' 
 UNION ALL
 SELECT 'So                     does                      this!' 
 UNION ALL
 SELECT 'As                                does                        this' 
 UNION ALL
 SELECT 'This, that, and the other  thing.' 
 UNION ALL
 SELECT 'This needs no repair.'

 -- Reduce each group of multiple spaces to a single space
 -- for a whole table without functions, loops, or other
 -- forms of slow RBAR.  In the following example, CHAR(7)
 -- is the "unlikely" character that "X" was used for in 
 -- the explanation.
 SELECT REPLACE(
            REPLACE(
                REPLACE(
                    LTRIM(RTRIM(OriginalString))
                ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model
            ,CHAR(7)+' ','')        --Changes the XO model to nothing
        ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing
 FROM @Demo
 WHERE CHARINDEX('  ',OriginalString) > 0
Results from Code Above,
CleanString
This has multiple unknown spaces in it.
So does this!
As does this
This, that, and the other thing.

 

Unlikely Characters and Collation

Just a quick note on "unlikely characters". You do have to be really careful about what you select as an "unlikely character" for the "X" of the "OX" model we previously discussed. For example, if you have the not-so-uncommon collation of Latin1_General_CI_AI on a column and you've chosen the "unlikely character" of Thorn (þ), you could end up deleting a whole lot more than you bargained for. According to Wikipedia, "þ" (the Thorn character) still survives as 30th character of the Icelandic alphabet and appears in other alphabets, as well. Further, in many languages, it has been replaced by the "th" digraph and, in certain collations like the Latin1_General_CI_AI collation, the "þ" character and "th" are treated as equals.

For example,

-- Create and populate a test table.
-- *** NOTICE THE COLLATION SETTING ON THE STRING COLUMN. ***
DECLARE @Demo TABLE(OriginalString VARCHAR(8000) COLLATE Latin1_General_CI_AI)
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.  ' 
 UNION ALL
 SELECT ' So                     does                      this!   ' 
 UNION ALL
 SELECT '   As                                does                                this' 
 UNION ALL
 SELECT 'This, that, and the other thing.'

--===== This uses a "thorn" character as the "X" of the "OX" model 
 SELECT LTRIM(RTRIM(
            REPLACE(REPLACE(REPLACE(OriginalString,'  ',' þ'),'þ ',''),'þ','')
        ))
 FROM @Demo

Because of the equivalent treatment of the Thorn (þ) character and the "th" digraph, you end up with a most undesirable result.

Results from Code Above,
CleanString
is has multiple unknown spaces in it.
So does is!
As does is
is, at, and e oer ing.

... which is nothing like what we wanted.

There's also a danger in selecting the wrong "control character" (ASCII characters 0 through 31) as the "unlikely character". I chose CHAR(7) which is a very benign character in today's world of electronics. It was designed to literally ring the bell on old Tele-Type machines and is just passed through by today's electronics.
If you chose CHAR(0), you've just chosen the "NULL" character and anything that follows it's appearance will simply disappear. Using the current test table in this article, you'd get an output that looks like the following,

Results from Code Above,
CleanString
This
So
As
This,that,andtheotherthing.

If you chose CHAR(1), you've just chosen the ASCII "Start of Header" character which is still in use today. When certain electronics see this character, it can cause some very strange behavior usually resulting in failure of your code.

Another good "unlikely character" is CHAR(8) which is the ASCII "Backspace" character. It's normally never included in any type of assembled text now adays. CHAR(127) also works well because it's a left over for "Delete" from the paper tape world. It actually punches all the holes in a paper tape (7 of them) to quite literally delete a character.
    

1 comments:

Post a Comment