Fastest way to compare multiple column values
I was recently working on a project with stored procedures that performed a significant number of column comparisons in a MERGE statement. This was turning into a real performance bottleneck (relatively speaking), as the entire row of data had to be updated if any one of the numerous fields had been modified.
Here’s the code in question. Pay particular attention to all the comparisons being performed:
MERGE [dbo].[The_Table] a
USING
(
    SELECT [Account_No]
        -- rest of columns here
    FROM [#The_TableStaging]
) b
-- the primary keys on the table
ON  b.[Account_No]     = a.[Account_No]
AND b.[Transaction_Id] = a.[Transaction_Id]
WHEN MATCHED AND
(
    -- Surely there must be a better, faster way to do this?
    COALESCE(b.[Account_Type],'')        != COALESCE(a.[Account_Type],'')
    OR COALESCE(b.[Transaction_Type],'') != COALESCE(a.[Transaction_Type],'')
    OR COALESCE(b.[Branch_Code],'')      != COALESCE(a.[Branch_Code],'')
    OR COALESCE(b.[Pay_Method_Cd],'')    != COALESCE(a.[Pay_Method_Cd],'')
    OR COALESCE(b.[Trans_Date],'')       != COALESCE(a.[Trans_Date],'')
    OR COALESCE(b.[Effective_Date],'')   != COALESCE(a.[Effective_Date],'')
    OR COALESCE(b.[Amount], CAST(0 AS DECIMAL(38, 12))) != COALESCE(a.[Amount], CAST(0 AS DECIMAL(38, 12)))
    OR COALESCE(b.[Fund_Code], 0)        != COALESCE(a.[Fund_Code], 0)
    OR COALESCE(b.[Batch_No], 0)         != COALESCE(a.[Batch_No], 0)
    OR COALESCE(b.[Updated_Date],'')     != COALESCE(a.[Updated_Date],'')
    OR COALESCE(b.[Updated_Time],'')     != COALESCE(a.[Updated_Time],'')
    OR COALESCE(b.[Updated_Userid],'')   != COALESCE(a.[Updated_Userid],'')
    OR COALESCE(b.[Archive_Ind],'')      != COALESCE(a.[Archive_Ind],'')
    OR COALESCE(b.[Posted_Tran], 0)      != COALESCE(a.[Posted_Tran], 0)
    OR COALESCE(b.[Updated_Session], 0)  != COALESCE(a.[Updated_Session], 0)
    -- ...the column list continues...
)
THEN UPDATE SET
    [Account_Type] = b.[Account_Type]
    -- etc etc etc
WHEN NOT MATCHED BY TARGET THEN
    INSERT
    (
        -- column list
    )
    VALUES
    (
        -- b.<column list>
    )
WHEN NOT MATCHED BY SOURCE AND @FullLoadInd = 1 THEN
    DELETE
-- so on and so forth
Seeing all those column comparisons, this curious consultant started wondering: what’s the fastest way to compare multiple column values?
The Alternatives
The research began. Besides the technique above (probably the most common, as it’s pretty straightforward), here are a few other ways to do the same thing:
1. Nothing. Leave the code as is.
2. Alter the tables to add a computed BINARY_CHECKSUM column that incorporates all the table’s columns, then compare the checksum columns. (The column definition to add to the tables is shown just below.)
3. Implement the STUFF method.
4. Use CONCAT to concatenate all the values into a single value for comparison.
5. Use HASHBYTES with CONCAT to hash all the values into a single value.
6. Perform an on-the-fly BINARY_CHECKSUM comparison. This is case-sensitive.
7. Perform an on-the-fly normal CHECKSUM comparison. This is case-insensitive.

The full code snippet for each technique is listed further down, after the test setup.
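For option 2, this is the computed column definition added to each table (the same definition appears, commented out, in the test setup script below):

ALTER TABLE [AAA_DAVE_TEST_Source]
ADD ChecksumCol AS BINARY_CHECKSUM([Account_No], [Transaction_Id], [Account_Type],
    [Transaction_Type], [Branch_Code], [Pay_Method_Cd], [Trans_Date], [Effective_Date],
    [Amount], [Fund_Code], [Batch_No], [Updated_Date], [Updated_Time], [Updated_Userid],
    [Archive_Ind], [Posted_Tran], [Updated_Session])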
Setting Things Up for the Speed Test
For testing purposes, the SQL code was executed on a Windows Server 2012 machine with 128 GB of memory and a 16-core CPU rated at 2.54 GHz, running Microsoft SQL Server 2014.
To ensure SQL Server didn’t keep any queries (or anything for that matter) cached, the following code was run before each test:
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
DBCC FREESESSIONCACHE
GO
DBCC FREEPROCCACHE
GO
DBCC FREESYSTEMCACHE('ALL')
GO
Two tables are created and populated with 20 million rows each, using a subset of columns from an actual table that has over 100 million records. The subset of columns has the same schema as the original table, and the primary keys are the same.
Here is the code that’s common across every test:
IF NOT EXISTS (SELECT * FROM sys.objects
               WHERE object_id = OBJECT_ID(N'[AAA_DAVE_TEST_Source]') AND type IN (N'U'))
BEGIN
    CREATE TABLE [AAA_DAVE_TEST_Source](
        [Account_No] [int] NOT NULL,
        [Transaction_Id] [int] NOT NULL,
        [Account_Type] [char](2) NOT NULL,
        [Transaction_Type] [char](4) NOT NULL,
        [Branch_Code] [char](2) NOT NULL,
        [Pay_Method_Cd] [char](3) NULL,
        [Trans_Date] [datetime] NOT NULL,
        [Effective_Date] [datetime] NULL,
        [Amount] [decimal](18, 2) NOT NULL,
        [Fund_Code] [int] NOT NULL,
        [Batch_No] [int] NULL,
        [Updated_Date] [datetime] NOT NULL,
        [Updated_Time] [char](8) NULL,
        [Updated_Userid] [char](8) NOT NULL,
        [Archive_Ind] [char](1) NULL,
        [Posted_Tran] [int] NULL,
        [Updated_Session] [int] NULL,
        CONSTRAINT [PK_AAA_DAVE_TEST_Source] PRIMARY KEY CLUSTERED
        (
            [Account_No] ASC,
            [Transaction_Id] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    -- Only used for the test with the checksum column added
    --ALTER TABLE [AAA_DAVE_TEST_Source]
    --ADD ChecksumCol AS BINARY_CHECKSUM([Account_No], [Transaction_Id], [Account_Type],
    --    [Transaction_Type], [Branch_Code], [Pay_Method_Cd], [Trans_Date], [Effective_Date],
    --    [Amount], [Fund_Code], [Batch_No], [Updated_Date], [Updated_Time], [Updated_Userid],
    --    [Archive_Ind], [Posted_Tran], [Updated_Session])
END
ELSE
BEGIN
    TRUNCATE TABLE AAA_DAVE_TEST_Source
END

IF NOT EXISTS (SELECT * FROM sys.objects
               WHERE object_id = OBJECT_ID(N'[AAA_DAVE_TEST_Target]') AND type IN (N'U'))
BEGIN
    CREATE TABLE [AAA_DAVE_TEST_Target](
        [Account_No] [int] NOT NULL,
        [Transaction_Id] [int] NOT NULL,
        [Account_Type] [char](2) NOT NULL,
        [Transaction_Type] [char](4) NOT NULL,
        [Branch_Code] [char](2) NOT NULL,
        [Pay_Method_Cd] [char](3) NULL,
        [Trans_Date] [datetime] NOT NULL,
        [Effective_Date] [datetime] NULL,
        [Amount] [decimal](18, 2) NOT NULL,
        [Fund_Code] [int] NOT NULL,
        [Batch_No] [int] NULL,
        [Updated_Date] [datetime] NOT NULL,
        [Updated_Time] [char](8) NULL,
        [Updated_Userid] [char](8) NOT NULL,
        [Archive_Ind] [char](1) NULL,
        [Posted_Tran] [int] NULL,
        [Updated_Session] [int] NULL,
        CONSTRAINT [PK_AAA_DAVE_TEST_Target] PRIMARY KEY CLUSTERED
        (
            [Account_No] ASC,
            [Transaction_Id] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    -- Only used for the test with the checksum column added
    --ALTER TABLE [AAA_DAVE_TEST_Target]
    --ADD ChecksumCol AS BINARY_CHECKSUM([Account_No], [Transaction_Id], [Account_Type],
    --    [Transaction_Type], [Branch_Code], [Pay_Method_Cd], [Trans_Date], [Effective_Date],
    --    [Amount], [Fund_Code], [Batch_No], [Updated_Date], [Updated_Time], [Updated_Userid],
    --    [Archive_Ind], [Posted_Tran], [Updated_Session])
END
ELSE
BEGIN
    TRUNCATE TABLE AAA_DAVE_TEST_Target
END

-- insert 20,000,000 records into the first table
INSERT INTO AAA_DAVE_TEST_Source
SELECT TOP 20000000 * FROM dbo.The_Table

-- insert 20,000,000 records into the second table
INSERT INTO AAA_DAVE_TEST_Target
SELECT TOP 20000000 * FROM dbo.The_Table

-- ensure there are differences so an update has to be performed
UPDATE AAA_DAVE_TEST_Target
SET Fund_Code = 0

DECLARE @START_TIME datetime
DECLARE @END_TIME datetime

SELECT 'Starting Test'
SET @START_TIME = GETDATE()

UPDATE a
SET  [Account_Type]     = b.[Account_Type]
    ,[Transaction_Type] = b.[Transaction_Type]
    ,[Branch_Code]      = b.[Branch_Code]
    ,[Pay_Method_Cd]    = b.[Pay_Method_Cd]
    ,[Trans_Date]       = b.[Trans_Date]
    ,[Effective_Date]   = b.[Effective_Date]
    ,[Amount]           = b.[Amount]
    ,[Fund_Code]        = b.[Fund_Code]
    ,[Batch_No]         = b.[Batch_No]
    ,[Updated_Date]     = b.[Updated_Date]
    ,[Updated_Time]     = b.[Updated_Time]
    ,[Updated_Userid]   = b.[Updated_Userid]
    ,[Archive_Ind]      = b.[Archive_Ind]
    ,[Posted_Tran]      = b.[Posted_Tran]
    ,[Updated_Session]  = b.[Updated_Session]
FROM AAA_DAVE_TEST_Target a
INNER JOIN [AAA_DAVE_TEST_Source] b
    ON  b.[Account_No]     = a.[Account_No]
    AND b.[Transaction_Id] = a.[Transaction_Id]
WHERE
(
    -- [ PLACEHOLDER ]
    -- This is the code that varies between techniques.
    -- See below for the code snippets.
)

SET @END_TIME = GETDATE()

SELECT 'Finished', @END_TIME,
       'Time To Run: ' + CAST(DATEDIFF(SECOND, @START_TIME, @END_TIME) AS varchar) + ' seconds'

-- drop the tables to also ensure nothing's cached and
-- no statistics are kept which might affect results
DROP TABLE AAA_DAVE_TEST_Source
DROP TABLE AAA_DAVE_TEST_Target
Here are the full code snippets that go in the WHERE clause’s [ PLACEHOLDER ] section for each technique:
1: Original
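This is the same COALESCE chain used in the original MERGE, abbreviated here; the full version has one COALESCE pair per non-key column:

COALESCE(b.[Account_Type],'')        != COALESCE(a.[Account_Type],'')
OR COALESCE(b.[Transaction_Type],'') != COALESCE(a.[Transaction_Type],'')
OR COALESCE(b.[Branch_Code],'')      != COALESCE(a.[Branch_Code],'')
-- ...and so on for every remaining column, as in the MERGE above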
2: Checksum Column
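With the computed ChecksumCol added to both tables (via the ALTER TABLE statements shown commented out in the setup script), the entire comparison collapses to a single predicate:

b.ChecksumCol != a.ChecksumCol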
3: STUFF
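A sketch of the STUFF pattern: prefix each value with a delimiter, concatenate, then use STUFF to strip the leading delimiter before comparing. Non-character columns need an explicit CONVERT; the exact conversions and lengths here are assumptions:

STUFF(   COALESCE('|' + b.[Account_Type], '')
       + COALESCE('|' + b.[Transaction_Type], '')
       + COALESCE('|' + CONVERT(varchar(30), b.[Amount]), '')
       -- ...remaining columns...
       , 1, 1, '')
!=
STUFF(   COALESCE('|' + a.[Account_Type], '')
       + COALESCE('|' + a.[Transaction_Type], '')
       + COALESCE('|' + CONVERT(varchar(30), a.[Amount]), '')
       -- ...remaining columns...
       , 1, 1, '')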
4: CONCAT
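A sketch of the CONCAT comparison. CONCAT implicitly converts each argument to a string and treats NULLs as empty strings; the '|' delimiter (an assumption here) guards against adjacent values running together:

CONCAT(b.[Account_Type], '|', b.[Transaction_Type], '|', b.[Branch_Code], '|',
       b.[Pay_Method_Cd], '|', b.[Trans_Date], '|', b.[Amount] /* ...remaining columns... */)
!=
CONCAT(a.[Account_Type], '|', a.[Transaction_Type], '|', a.[Branch_Code], '|',
       a.[Pay_Method_Cd], '|', a.[Trans_Date], '|', a.[Amount] /* ...remaining columns... */)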
5: HASHBYTES
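A sketch of hashing the concatenated values. The SHA2_256 algorithm is an assumption, and note that HASHBYTES caps its input at 8,000 bytes prior to SQL Server 2016:

HASHBYTES('SHA2_256', CONCAT(b.[Account_Type], '|', b.[Transaction_Type], '|',
          b.[Branch_Code] /* ...remaining columns... */))
!=
HASHBYTES('SHA2_256', CONCAT(a.[Account_Type], '|', a.[Transaction_Type], '|',
          a.[Branch_Code] /* ...remaining columns... */))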
6: BINARY_CHECKSUM
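The on-the-fly version passes the column list straight to BINARY_CHECKSUM on each side of the comparison:

BINARY_CHECKSUM(b.[Account_Type], b.[Transaction_Type], b.[Branch_Code], b.[Pay_Method_Cd],
                b.[Trans_Date], b.[Effective_Date], b.[Amount], b.[Fund_Code], b.[Batch_No],
                b.[Updated_Date], b.[Updated_Time], b.[Updated_Userid], b.[Archive_Ind],
                b.[Posted_Tran], b.[Updated_Session])
!=
BINARY_CHECKSUM(a.[Account_Type], a.[Transaction_Type], a.[Branch_Code], a.[Pay_Method_Cd],
                a.[Trans_Date], a.[Effective_Date], a.[Amount], a.[Fund_Code], a.[Batch_No],
                a.[Updated_Date], a.[Updated_Time], a.[Updated_Userid], a.[Archive_Ind],
                a.[Posted_Tran], a.[Updated_Session])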
7: CHECKSUM
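Identical to the previous snippet, but using the case-insensitive CHECKSUM function:

CHECKSUM(b.[Account_Type], b.[Transaction_Type], b.[Branch_Code] /* ...remaining columns... */)
!=
CHECKSUM(a.[Account_Type], a.[Transaction_Type], a.[Branch_Code] /* ...remaining columns... */)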
The Results Are In!
Here are the results, in seconds, as to how long each technique took to execute.
The winners: the checksum column and the on-the-fly BINARY_CHECKSUM, tied at 84 seconds per run.
#                     Time to execute, in seconds, over 3 runs
1: Original           95, 96, 95
2: Checksum Column    84, 84, 84
3: STUFF              247, 240, 244
4: CONCAT             157, 157, 155
5: HASHBYTES          254, 253, 253
6: BINARY_CHECKSUM    84, 84, 84
7: CHECKSUM           88, 88, 87
The Way To Go
I didn’t know what to expect when running the tests, so I made no guesses as to which technique might run the fastest.
I was surprised by two things:
- The persisted checksum column and the on-the-fly BINARY_CHECKSUM took exactly the same time to complete
- Both of the above ran in the same amount of time across all 3 individual runs. There was no fluctuation.
For me, the way to go will be using BINARY_CHECKSUM in the stored procedures currently being developed, because we can’t modify the 500+ tables that would otherwise need a checksum column added.
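Applied back to the original MERGE, the whole WHEN MATCHED block shrinks to a single comparison. A sketch, with the column list abbreviated:

WHEN MATCHED AND
(
    BINARY_CHECKSUM(b.[Account_Type], b.[Transaction_Type] /* ...remaining columns... */)
    != BINARY_CHECKSUM(a.[Account_Type], a.[Transaction_Type] /* ...remaining columns... */)
)
THEN UPDATE SET
    [Account_Type] = b.[Account_Type]
    -- etc etc etc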
As for everyone else, I’ve left you the SQL code above, so feel free to use it as a basis for conducting your own performance benchmarks.
If anyone else has any tricks or techniques for comparing multiple columns quickly in a MERGE or WHERE clause, definitely leave a comment below and share the knowledge! 🙂