Optimizing SQL Strategies


How many of you have come across SQL code that, for whatever reason, wasn’t written as well as it could have been? I’m not talking about diving into the inner workings of your database’s engine and milking every possible nanosecond off the execution plan.

 

I’m talking about code where you see something like the following. This represents about half of the update statements (the other half is left out for brevity). The stg1Table table being updated had several million records and no indexes. Can anyone guess, before reading any further, why my jaw dropped when I saw this?

 -- beginning of the updates

UPDATE stg1Table
SET    AccType = 'RE'
WHERE CHARINDEX('Romeo', CatchPhrase) > 0

UPDATE stg1Table
SET    AccType = 'TL'
WHERE CHARINDEX('Tango', CatchPhrase) > 0

UPDATE stg1Table
SET    AccType = 'VA'
WHERE CHARINDEX('Victor', CatchPhrase) > 0

UPDATE stg1Table
SET    AccType = 'WC'
WHERE CHARINDEX('Whiskey', CatchPhrase) > 0

UPDATE Stg1Table
SET    AccType = 'BT'
WHERE CHARINDEX('Bravo', CatchPhrase) > 0

UPDATE Stg1Table
SET   AccType = 'SC'
WHERE CHARINDEX('Sierra', CatchPhrase) > 0

UPDATE Stg1Table
SET    AccType = 'EC'
WHERE CHARINDEX('Echo', CatchPhrase) > 0

 --  : more similar updates omitted for brevity
 --  : This was unbelievable!

I constantly come across SQL code where performance was obviously never considered. So this curious consultant decided to document some basic SQL optimization strategies in the hope of helping some SQL coders “lift their game”. :-)

With that, here we go in no particular order:

 

Whenever possible, DELETE first, UPDATE second, INSERT third

I came across a stored procedure operating on a table with approximately 26 columns and several million records. There was no index on the table other than the primary key, which was just an IDENTITY column. The stored procedure ran approximately once per week. With each run, here’s what it was coded to do:

  1. INSERT several thousand records marking them as “new”.
  2. DELETE “expired” records
  3. UPDATE “old” records

The three sets of records were mutually exclusive. That is, any records that were “new” wouldn’t be affected by the UPDATE or DELETE operations.

Ask yourself why this isn’t the best approach. Where is the inefficiency? Answer: in the order of operations.

By INSERTing several thousand new records first, the database engine has that many more records to search through when finding the ones to DELETE and UPDATE later. If the INSERT were performed last, the engine would never have to scan through the newly INSERTed records at all.

Therefore, you should always DELETE first whenever possible, as this presents the database engine with fewer records to search when performing the subsequent UPDATEs.

Next, UPDATE current records before performing an INSERT whenever possible because, again, that’s fewer records the database will have to scan through.

Finally, perform your INSERT last whenever possible.
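Putting the three steps in their proper order, the weekly procedure would look something like this sketch (the table and column names here are hypothetical stand-ins, not the client’s actual schema):

```sql
-- Hypothetical reordering of the weekly job described above.

-- 1. DELETE first: expired records are gone before anything else
--    has to scan the table.
DELETE FROM WeeklyTable
WHERE Status = 'expired'

-- 2. UPDATE second: only the surviving records get scanned.
UPDATE WeeklyTable
SET Status = 'old'
WHERE LoadDate < DATEADD(WEEK, -1, GETDATE())

-- 3. INSERT last: the several thousand new records are never
--    scanned by the DELETE or the UPDATE.
INSERT INTO WeeklyTable (Status, LoadDate)
SELECT 'new', GETDATE()
FROM StagingSource
```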

 

Use @TableVars for small amounts of table data instead of #temp tables

They’re small. Fast. And for a handful of rows they typically avoid the overhead and disk I/O that come with a #temp table.

Here’s the perfect example I came across in numerous stored procedures for one client:

CREATE TABLE #DaysOfWeek
(
	[Day] char(3)
)
INSERT INTO #DaysOfWeek([Day]) VALUES ('Sun'),('Tue'),('Fri')

 -- 
 --  a few queries with joins on #DaysOfWeek
 -- 

DROP TABLE #DaysOfWeek

I promptly modified the code as follows, which measurably improved the stored procedure’s overall runtime:

 -- Replaced the #table with @table! Why waste time on
 -- disk I/O for 3 records?? Seriously!
DECLARE @DaysOfWeek TABLE 
(
	[Day] char(3)
)
INSERT INTO @DaysOfWeek([Day]) VALUES ('Sun'),('Tue'),('Fri')

 

Eliminate Joins On Massive Tables When Most Columns Aren’t Used

I was modifying a stored procedure that updated fact tables in a data warehouse in Database1. This stored procedure had several joins to a significantly larger staging table (3 dozen columns, several million records, no indexes on the columns being joined on or selected) in Database2 on the same server.

In this scenario, I managed to shave several minutes off the stored procedure’s runtime by creating a smaller #temptable from the joined table in Database2, using only the 3 columns that were actually needed. I then created a NONCLUSTERED index on the one column in the #temptable that appeared in the WHERE clause.

Here’s the old code:

 -- This was the original code.
 -- Several queries similar to this where only 
 -- the 3 columns from [Database2].[dbo].DB2Table1
 -- are used out of the 36.
INSERT INTO OurDataTable
SELECT DISTINCT 
	CAST(J.id as int) as id, 
	cast(J.code as CHAR(1)) as code 
FROM [Database2].[dbo].DB2Table1 AS J -- has 36 columns and millions of rows 
LEFT OUTER JOIN [Database1].[dbo].DB1Table1 AS D ON J.id = D.id 
WHERE D.R_Man IS NULL 
	AND J.branch_code NOT in('44','61','63','64')

The new code thus becomes:

 -- Added for performance gains. 
 -- Only get the 3 columns we're concerned with.
 -- Takes just a few seconds
select 
	CAST(Id as int) as Id, 
	cast(Code as char(1)) as Code, 
	cast(Branch_Code as varchar(2)) as Branch_Code
into #tempTable
FROM [Database2].[dbo].DB2Table1 AS J 

 -- Create an index on the one column
 -- that's used in the WHERE clauses
CREATE NONCLUSTERED
	INDEX [IX_tempTable] ON [#tempTable]([Branch_Code])
	WITH (FILLFACTOR = 100, STATISTICS_NORECOMPUTE = ON)

 -- Now do what we need to do joining
 -- on the smaller #temp table. 
 -- It runs significantly faster.
INSERT INTO OurDataTable
SELECT DISTINCT 
	CAST(J.id as int) id, 
	cast(J.code as CHAR(1)) as code 
FROM #tempTable AS J -- for performance. Smaller table with only the columns we need + an index
LEFT OUTER JOIN [Database1].[dbo].DB1Table1 AS D ON J.id = D.id 
WHERE D.R_Man IS NULL 
	AND J.branch_code NOT in('44','61','63','64')

See what I’ve done? It made a HUGE difference in the run time!

As an exercise for the reader: you can make the SQL execute even faster by adding a WHERE clause to the SELECT statement when creating the #tempTable.
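To sketch that exercise (the filter shown is an assumption lifted from the WHERE clause above), pushing the branch-code filter into the SELECT...INTO means the excluded rows are never copied into the #temp table, or indexed, in the first place:

```sql
-- Exercise sketch: filter while building #tempTable so the excluded
-- branch codes never get copied or indexed at all.
SELECT 
	CAST(Id as int) as Id, 
	cast(Code as char(1)) as Code, 
	cast(Branch_Code as varchar(2)) as Branch_Code
INTO #tempTable
FROM [Database2].[dbo].DB2Table1
WHERE Branch_Code NOT IN ('44','61','63','64')
```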

 

If a #temptable is used only once, consider putting its query directly in the join to increase parallelism

Conversely, I’ve found #temptables in code where they weren’t advantageous at all. Disk I/O and resources were wasted, and the database server’s ability to implement parallelism was limited.

Here’s an example of code I rewrote:

 -- They created a #temptable
SELECT ColA, ColB, ColC, ColD, Col4, Col5
INTO #table2
FROM [Database2].[dbo].SomeTable

 -- Lots of SQL code here, then 

 --  a join query. Only place where #table2 is used. 
SELECT a.Col1, a.Col2, a.Col3, b.ColA, b.ColB, b.ColC, b.ColD
FROM Table1 a 
inner join #table2 b on a.Col5 = b.Col5 and a.Col4 = b.Col4
inner join Table3 c on a.Col1 = c.Col1 and a.Col3 = c.Col3

 -- Some more miscellaneous SQL was here

DROP TABLE #table2

 -- -- -- -- -- -- -- -- -- -- -- -- -- -- 

 -- So instead of doing a single query to create #table2, 
 -- which forces the code to execute in serial,
 -- and also wasting time with the associated explicit Disk I/O,
 -- rewrite the query as follows to increase the server's ability
 -- to implement parallelism:
SELECT a.Col1, a.Col2, a.Col3, b.ColA, b.ColB, b.ColC, b.ColD
FROM Table1 a 
inner join 
(
	 -- While the server is performing this query for this join...
	SELECT ColA, ColB, ColC, ColD, Col4, Col5
	FROM [Database2].[dbo].SomeTable
) b on a.Col5 = b.Col5 and a.Col4 = b.Col4
 -- ...it can perform this join between Table1 and Table3
inner join Table3 c on a.Col1 = c.Col1 and a.Col3 = c.Col3

 

Indexes on #temptables

This trick has to be evaluated on a case-by-case basis, because you have to weigh the time difference between:

  1. Creating and loading the #temptable; then performing the queries AND
  2. Creating and loading the #temptable; creating the index on the necessary columns; then performing the queries.

Sometimes the time and associated cost for the server to perform #1 is practically the same as performing #2. This happens especially when the #temptable has a relatively low number of records: it takes just as much time to scan the table as it would to create the index and then seek into it.

My general rule of thumb for deciding whether the index is worth it:

  1. The #temptable has at least 100,000 records in it, AND
  2. The #temptable is used in several JOINs.
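When in doubt, measure both variants. Here’s a rough sketch using SQL Server’s built-in timing statistics (the table and column names are made up for illustration):

```sql
SET STATISTICS TIME ON

-- Variant #1: load the #temptable and query it without an index.
SELECT Col1, Col2 INTO #noIndex FROM SomeLargeTable
SELECT a.Col1 FROM Table1 a INNER JOIN #noIndex b ON a.Col1 = b.Col1

-- Variant #2: same load, but create the index before the joins.
SELECT Col1, Col2 INTO #withIndex FROM SomeLargeTable
CREATE NONCLUSTERED INDEX IX_withIndex ON #withIndex (Col1)
SELECT a.Col1 FROM Table1 a INNER JOIN #withIndex b ON a.Col1 = b.Col1

SET STATISTICS TIME OFF
-- Compare the elapsed times in the Messages tab.
-- If they're practically identical, skip the index.
```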

 

Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present

Unless absolutely necessary, that is. Why? LEFT, SUBSTRING, and similar functions are just that – functions. Wrapping an indexed column in one of them prevents the database engine from taking advantage of the index, whereas a LIKE pattern with no leading wildcard can still use an index seek.

 -- Do not do this if there is an index on Column3
SELECT Column1, Column2
FROM Table1
WHERE LEFT(Column3, 1) = 'a'

 -- Do not do this either if there is an index on Column3
SELECT Column1, Column2
FROM Table1
WHERE SUBSTRING(Column3, 1, 1) = 'a'

 -- Instead, use LIKE
SELECT Column1, Column2
FROM Table1
WHERE Column3 LIKE 'a%'

As a hint, if you’re going for your SQL Server MCSE Data Platform certification, questions similar to the above scenario were on the SQL Server 2012 exams, so you should know this. :-)

 

Don’t Sort Data in a Table Contrary to a Clustered Index

Obviously pay this no mind if the clustered index is solely on an identity column. Otherwise, it generally just wastes time.

Let’s say you have two columns in a table forming a clustered index as follows:

CREATE CLUSTERED INDEX IX_TheIndex 
ON TestTable 
(
      Name DESC,
      DateAdded ASC
)

And if you’re lucky like me (let me roll my eyes for you at that sarcasm), you find an insert statement similar to the following:

INSERT INTO TestTable
SELECT Name, DateAdded, AnotherColumn, AndAnotherColumn, StillAnotherColumn
FROM SomeOtherTable
ORDER BY Name, DateAdded

--The above ORDER BY clause is pointless because the 
--SQL engine can determine, when generating an execution
--plan, that the clustered index will be used in whole or in part.
--So give me 3 good reasons why, during the insert, the
--Name and DateAdded columns should be sorted in ASC order
--when the index sorts Name in DESC order?
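The fix is trivial: drop the ORDER BY and let the clustered index dictate where the rows land.

```sql
-- Same insert, minus the pointless sort.
INSERT INTO TestTable
SELECT Name, DateAdded, AnotherColumn, AndAnotherColumn, StillAnotherColumn
FROM SomeOtherTable
```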

 

Filter by Subqueries in JOINs instead of WHERE clauses

The majority of the time, filtering data with a subquery in the WHERE clause is slower than filtering by JOINing to that subquery. I believe this is because the SQL Server engine can optimize and parallelize JOIN clauses more efficiently than WHERE clauses.

--Most of the time, a query structured like this:
SELECT a.Col1, b.Col2
FROM  Table1 a
INNER JOIN Table2 b on a.Col4 = b.Col4
INNER JOIN
(
      select Col3
      from Table3
) c on c.Col3 = a.Col4

--executes faster than a query structured like this:
SELECT  a.Col1, b.Col2
FROM  Table1 a
INNER JOIN Table2 b on a.Col4 = b.Col4
WHERE a.Col4 in
(
      select Col3
      from Table3
)

 

Combine multiple UPDATE statements

From my opening example. Yes, I seriously came across this in production code:

 -- beginning of updates
 -- Each UPDATE requires an entire table scan! 
UPDATE stg1Table
SET AccType = 'RE'
WHERE CHARINDEX('Romeo', CatchPhrase) > 0

UPDATE stg1Table
SET AccType = 'TL'
WHERE CHARINDEX('Tango', CatchPhrase) > 0

UPDATE stg1Table
SET AccType = 'VA'
WHERE CHARINDEX('Victor', CatchPhrase) > 0

UPDATE stg1Table
SET AccType = 'WC'
WHERE CHARINDEX('Whiskey', CatchPhrase) > 0

UPDATE Stg1Table
SET AccType = 'BT'
WHERE CHARINDEX('Bravo', CatchPhrase) > 0

UPDATE Stg1Table
SET AccType = 'SC'
WHERE CHARINDEX('Sierra', CatchPhrase) > 0

UPDATE Stg1Table
SET AccType = 'EC'
WHERE CHARINDEX('Echo', CatchPhrase) > 0
 -- and yet more similar update statements were here 

Rewritten by doing some simple optimization SQL magic:

 -- Rewrite the code; just do the UPDATE once!
UPDATE stg1Table
SET AccType = CASE
    WHEN CatchPhrase LIKE '%Romeo%' THEN 'RE'
    WHEN CatchPhrase LIKE '%Tango%' THEN 'TL'
    WHEN CatchPhrase LIKE '%Victor%' THEN 'VA'
    WHEN CatchPhrase LIKE '%Whiskey%' THEN 'WC'
    WHEN CatchPhrase LIKE '%Bravo%' THEN 'BT'
    WHEN CatchPhrase LIKE '%Sierra%' THEN 'SC'
    WHEN CatchPhrase LIKE '%Echo%' THEN 'EC'
    -- remaining WHEN clauses 
END

’nuf said. :-)

I hope you enjoyed this article!

Do you have any SQL optimizing tips and techniques? If so, leave a comment and share with your fellow readers! Let’s make the SQL world a better, happier place for everyone. :-)

  • Paul

    Dave – Thanks for the post, however, unless I’m mistaken, I believe your version of the “Combine multiple UPDATE statements” would actually produce different results to the original. If the “CatchPhrase” field for a row had all of the words you list within its value, your output would leave AccType as “RE” following execution whereas the original code would have left it as “EC”. The order of the case statement WHEN clauses should really be reversed.

    You should also have an “else” case option/line to set the AccType to the same AccType if you are updating all records without using a filter (as they will be set to NULL) but as a byproduct of that, more costs for updating those rows/values would be incurred, which again, wouldn’t be performed in the original code as records that don’t fit any of the queries wouldn’t be updated.

    On a final note, you haven’t made reference to CTEs as an alternative to sub queries and filters which I personally find very useful (especially so for recursive queries), and possibly MERGE in comparison with the INSERTS, UPDATES and DELETES – Both which your readers may find useful.

    • FireMystdl

      Hi Paul:

      Thanks for your feedback. You are right with the comments in your first paragraph. What I failed to mention though is the values in that production system would only have one of those dozen or so fixed words. Thus the example you cited would never (well, perhaps I should say ‘should never’ ;-) ) occur. Otherwise, yes, I totally agree with you.

      Yes, CTEs may be useful. Unfortunately, I’m not as adept at thinking in CTE solutions. If you’re game and want to post sample code for my readers, that would be great!

  • Delilah

    Hey Dave, those are excellent tips which sometimes get overlooked. Keep up the good work!!!