Optimizing SQL Strategies
How many of you have come across SQL code that, for whatever reason, wasn’t written as good as it could have been? I’m not talking about going into the inner workings of your database’s engine and milking every possibly nanosecond off the execution plan.
I’m talking about code where you see something more like the following. This code represents about half the update statements (the other half left out for brevity). The stg1Table table being updated had several million records with no indexes. Can anyone guess before reading any further why my jaw dropped when I saw this?
-- beginning of the updates UPDATE stg1Table SET AccType = 'RE' WHERE CHARINDEX('Romeo', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'TL' WHERE CHARINDEX('Tango', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'VA' WHERE CHARINDEX('Victor', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'WC' WHERE CHARINDEX('Whiskey', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'BT' WHERE CHARINDEX('Bravo', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'SC' WHERE CHARINDEX('Sierra', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'EC' WHERE CHARINDEX('Echo', CatchPhrase) > 0 -- : more similar updates omitted for brevity -- : This was unbelieveable!
I constantly come across SQL code where performance is obviously not considered. So that’s when this curious consultant decided to document some basic Optimization SQL Strategies in hopes of helping some SQL coders “lift their game”.
With that, here we go in no particular order:
Whenever possible, DELETE first, UPDATE second, INSERT third
I came across a stored procedure that was operating on a table with approximately 26 columns and several million records. There was no index on the table other than the primary key which was just an IDENTITY column. The stored procedure was run approximately once per week. With each run, here’s what it was coded to do:
- INSERT several thousand records marking them as “new”.
- DELETE “expired” records
- UPDATE “old” records
The records were mutually exclusive of each other. That is, any records that were “new” wouldn’t be affected by the UPDATE or DELETE operation.
Ask yourself why this isn’t the best approach. Where is the inefficiency? Answer: in the order of operations.
By INSERTing several thousand new records first, that means the database engine has to search through that many more records to find those to DELETE and UPDATE later. If the INSERT were performed last, that obviously saves the database engine from having to scan through all the potential new records that are being INSERTed.
Therefore, you should always DELETE first whenever possible as this will present the database engine with less records to search when performing subsequent UPDATEs.
Next, UPDATE current records before performing an INSERT whenever possible because again, that’s less records the database will have to scan through.
Finally, perform your INSERT last whenever possible.
Use @TableVars for small amounts of table data instead of #temp tables
They’re small. Fast. Require zero disk I/O.
Here’s the perfect example I came across in numerous stored procedures for one client:
CREATE TABLE #DaysOfWeek ( [Day] char(3) ) INSERT INTO #DaysOfWeek([Day]) VALUES ('Sun'),('Tue'),('Fri') -- -- a few queries with joins on #DaysOfWeek -- DROP TABLE #DaysOfWeek
I promptly modified the code as follows, which actually increased the overall stored procedure runtime performance:
-- Replaced the #table with @table! Why waste time on -- disk I/O for 3 records?? Seriously! DECLARE @DaysOfWeek TABLE ( [Day] char(3) ) INSERT INTO @DaysOfWeek([Day]) VALUES ('Sun'),('Tue'),('Fri')
Eliminate Joins On Massive Tables When Most Columns Aren’t Used
I was modifying a stored procedure which was updating fact tables in a data warehouse in Database1. This stored procedure had several joins to a significantly larger, staging table (3 dozen columns, several million records, no indexes on the columns being joined on or selected) in Database2 on the same server.
In this scenario, I managed to shave several minutes off the stored procedure runtime by creating a smaller #temptable from the joined table in Database2, using only the 3 columns that were utilized. I then created a CLUSTERED index on the one column in the #temptable that was contained in the WHERE clause.
Here’s the old code:
-- This was the original code. -- Several queries similar to this where only -- the 3 columns from [Database2].[dbo].DB2Table1 -- are used out of the 36. INSERT INTO OurDataTable SELECT DISTINCT CAST(J.id as int) as id, cast(J.code as CHAR(1)) as code FROM [Database2].[dbo].DB2Table1 AS J –has 36 columns and millions of rows LEFT OUTER JOIN [Database1].[dbo].DB1Table1 AS D ON J.id = D.id WHERE D.R_Man IS NULL AND J.branch_code NOT in('44','61','63','64')
The new code thus becomes:
-- Added for performance gains. -- Only get the 3 columns we're concerned with. -- Takes just a few seconds select CAST(Id as int) as Id, cast(Code as varchar(1)) as Code, cast(Branch_Code as varchar(2)) as Branch_Code into #tempTable FROM [Database2].[dbo].DB2Table1 AS J -- Create an index on the one column -- that's used in the WHERE clauses CREATE NONCLUSTERED INDEX [UIX_tempTable] ON [#tempTable]([Branch_Code]) WITH FILLFACTOR = 100 ,STATISTICS_NORECOMPUTE ON [PRIMARY] -- Now do what we need to do joining -- on the smaller #temp table. -- It runs significantly faster. INSERT INTO OurDataTable SELECT DISTINCT CAST(J.id as int) id, cast(J.code as CHAR(1)) as code FROM #tempTable AS J -- for performance. Smaller table with only the columns we need + an index LEFT OUTER JOIN [Database1].[dbo].DB1Table1 AS D ON J.id = D.id WHERE D.R_Man IS NULL AND J.branch_code NOT in('44','61','63','64')
See what I’ve done? It made a HUGE difference in the run time!
As an exercise to the reader, you can get the SQL code to execute faster if you add a WHERE clause to the SELECT statement when creating the #tempTable.
If a #temptable is used once, consider putting it in a join in the query to increase parallelism
Conversely, I’ve found #temptables in code where it hasn’t been advantageous. More disk I/O and resources were wasted, as well as limiting the database server’s ability to implement parallelism.
Here’s an example of code I rewrote:
-- They created a #temptable SELECT ColA, ColB, ColC, ColD, Col4, Col5 INTO #table2 FROM [Database2].[dbo].SomeTable -- Lots of SQL code here, then -- a join query. Only place where #table2 is used. SELECT a.Col1, a.Col2, a.Col3, b.ColA, b.ColB, b.ColC, b.ColD FROM Table1 a inner join #table2 b on a.Col5 = b.Col5 and a.Col4 = b.Col4 inner join Table3 c on a.Col1 = c.Col1 and a.Col3 = c.Col3 -- Some more miscellaneous SQL was here DROP TABLE #table2 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- So instead of doing a single query to create #table2, -- which forces the code to execute in serial, -- and also wasting time with the associated explicit Disk I/O, -- rewrite the query as follows to increase the server's ability -- to implement parallelism: SELECT a.Col1, a.Col2, a.Col3, b.ColA, b.ColB, b.ColC, b.ColD FROM Table1 a inner join ( -- While the server is performing this query for this join... SELECT ColA, ColB, ColC, ColD, Col4, Col5 FROM [Database2].[dbo].SomeTable ) b on a.Col5 = b.Col5 and a.Col4 = b.Col4 -- ...it can perform this join between Table1 and Table3 inner join Table3 c on a.Col1 = c.Col1 and a.Col3 = c.Col3
Indexes on #temptables
This trick has to be determined on a case-by-case basis because basically you have to weigh up the time difference between:
- Creating and loading the #temptable; then performing the queries AND
- Creating and loading the #temptable; creating the index on the necessary columns; then performing the queries.
Sometimes the time and associated costs for the server to perform #1 is practically the same as performing #2. This especially happens when the #temptable as a relatively low number of records it takes just as much time to scan the table as it would to create then scan the index for the values.
My general rule of thumb for considering whether or not to do this is:
- The #temptable has at least 100,000 records in it
- The #temptable is in several JOINs.
Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present
Unless absolutely necessary. Why? LEFT, SUBSTRING, and similar functions are just that – functions. They do not allow the database engine to take advantage of indexes.
-- Do not do this if there is an index on Column3 SELECT Column1, Column2 FROM Table1 WHERE LEFT(Column3, 1) = 'a' -- Do not do this either if there is an index on Column3 SELECT Column1, Column2 FROM Table1 WHERE SUBSTRING(Column3, 1, 1) = 'a' -- Instead, use LIKE SELECT Column1, Column2 FROM Table1 WHERE Column3 LIKE 'a%'
As a hint, if you’re going for your SQL Server MCSE Data Platform certification, questions similar to the above scenario were on the SQL Server 2012 exams, so you should know this.
Don’t Sort Data in a Table Contrary to a Clustered Index
Obviously pay this no mind if the clustered index is solely on an identity column. Otherwise, it generally just wastes time.
Let’s say you have two columns in a table forming a clustered index as follows:
CREATE CLUSTERED INDEX IX_TheIndex ON TestTable ( Name Desc, DateAdded ASC )
And if you’re lucky like me (let me roll my eyes for you at that sarcasm), you find an insert statement similar to the following:
INSERT INTO TestTable SELECT Name, DateAdded, AnotherColumn, AndAnotherColumn, StillAnotherColumn FROM SomeOtherTable ORDER BY Name, DateAdded --The above ORDER BY clause is pointless because the --SQL engine could determine, when generating an execution --plan, that the clustered index could be used in whole or part. --So give me 3 good reasons why, during the insert, the --Name and DateAdded columns are sorted in ASC order?
Filter by Subqueries in JOINs instead of WHERE clauses
The majority of the time, filtering data by using a subquery in the WHERE clause is slower than filtering by JOINing to a subquery. I believe this is because an SQL server engine can optimize and implement parallelism within JOIN clauses more efficiently than in WHERE clauses.
--Most of the time, a query structured like this: SELECT a.Col1, b.Col2 FROM Table1 a INNER JOIN Table2 b on a.Col4 = b.Col4 INNER JOIN ( select Col3 from Table3 ) c on c.Col3 = a.Col4 --executes faster than a query structured like this: SELECT a.Col1, b.Col2 FROM Table1 a INNER JOIN Table2 b on a.Col4 = b.Col4 WHERE a.Col4 in ( select Col3 from Table3 )
Combine multiple UPDATE statements
From my opening example. Yes, I seriously came across this production code:
-- beginning of updates -- Each UPDATE requires an entire table scan! UPDATE stg1Table SET AccType = 'RE' WHERE CHARINDEX('Romeo', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'TL' WHERE CHARINDEX('Tango', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'VA' WHERE CHARINDEX('Victor', CatchPhrase) > 0 UPDATE stg1Table SET AccType = 'WC' WHERE CHARINDEX('Whiskey', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'BT' WHERE CHARINDEX('Bravo', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'SC' WHERE CHARINDEX('Sierra', CatchPhrase) > 0 UPDATE Stg1Table SET AccType = 'EC' WHERE CHARINDEX('Echo', CatchPhrase) > 0 -- and yet more similar update statements were here
Rewritten by doing some simple optimization SQL magic:
-- Rewrite the code; just do the UPDATE once! UPDATE stg1Table Set AccType = Case (CatchPhrase) When LIKE '%Romeo%' Then 'RE' When LIKE '%Tango%' Then 'TL' When LIKE '%Victor%' Then 'VA' When LIKE '%Whiskey%' Then 'WC' When LIKE '%Bravo%' Then 'BT' When LIKE '%Sierra%' Then 'SC' When LIKE '%Echo%' Then 'EC' -- remaining When clauses End
I hope you enjoyed this article!
Do you have any SQL optimizing tips and techniques? If so, leave a comment and share with your fellow readers! Let’s make the SQL world a better, happier place for everyone.