Indexing Strategy for JSON Value in Sql Server 2016

This article explains the strategy for Indexing JSON value in Sql Server 2016 with examples, as we don’t have any special indexes like the XML indexes which we had for the XML value stored in the XML data type column.

[ALSO READ]:

If we are storing JSON data in a table column, then we may come across a scenario where we may need to retrieve only the records with specific JSON property value. For example if we are storing Customer address and phone details in JSON structure in a Varchar/Nvarchar column, then we may get a request to get all the customers whose phone member matches to the input value. To handle such use cases, creating an index on the JSON column is not the correct approach and of no use. As it indexes the complete JSON value like any other value in a Varchar/NVarchar column and we are looking for particular JSON Property value which is not at the beginning of the JSON string. And also it takes lot of additional storage space as the complete JSON value is indexed. So, creating such indexes is of no use.

But for JSON we do have an alternative way of Indexing JSON Property. This article explains how we can index a JSON property from the JSON string by example and also explains the storage implications if any.

Let us create a sample Database with a Customer table with 200,000 records as shown in the following image by the following script:

Customer Table
Script:

CREATE DATABASE SqlHintsJSONDemo
GO
USE SqlHintsJSONDemo
GO
CREATE TABLE dbo.Customer( 
	Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
	Name NVARCHAR(50), Detail NVARCHAR(MAX))
GO
INSERT INTO dbo.Customer ( Name, Detail )
SELECT TOP 200000 NEWID(),
 REPLACE('{"Address":{"State":"KA","Country":"India"},
 "Phone":"@Phone"}',
 '@Phone', 100000000-ROW_NUMBER() OVER (ORDER BY SC1.object_id))
FROM SYS.all_columns SC1
		CROSS JOIN SYS.all_columns SC2

Below is the Customer table storage usage details at this point of time.

Initial SpaceUsed by the Customer table

Enable the IO and TIME statistics using the following script, to measure the performance of the queries:

--Enable Statistics
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

Let us execute the below statement to get the details of the Customer whose phone number is 99890000

SELECT *
FROM dbo.Customer
WHERE JSON_VALUE(Detail,'$.Phone') = '99890000'

RESULT:
Get Customer Detail using JSON_VALUE function

From the above result it is clear that fetching the customer data with this approach is resulting in higher number of IO and CPU time.

Let us now explore the solution for the above problem. The solution to this problem is create a non-persisted computed column where this column value will be the value of the Phone property from the JSON string and then index it. Let us understand this by performing these steps on the Customer table created above:

First add a non-persisted computed column PhoneNumber to the Customer table and the value of this computed column is the Phone property value extracted using the JSON_VALUE function from the Detail column having Address and Phone information stored in the JSON format.

ALTER TABLE dbo.Customer
ADD PhoneNumber AS JSON_VALUE(Detail,'$.Phone')

Let us verify is there any storage implication of adding a computed column

SpaceUsed by the Customer table after adding computed column

From the results it is clear that a non-persisted computed column doesn’t take any additional storage space. Non-persisted computed column value is computed/evaluated at run-time.

Let us now try to get the details of the Customer whose phone number is 99890000 by using the computed column PhoneNumber.

SELECT *
FROM dbo.Customer
WHERE PhoneNumber  = '99890000'

RESULT:
Get Customer Detail using Computed column

From the result it is clear that the computed column didn’t improve the IO and Time taken to execute the query.

Let us now create an Index on the Computed column PhoneNumber by using the following script:

CREATE INDEX IX_Customer_PhoneNumber
 ON dbo.Customer(PhoneNumber)

When you create this index you may get the below warning. The reason for the warning is Sql Server doesn’t know what will be the length of the value which will be extracted by the JSON_VALUE function. And the maximum length of the value that can be returned by JSON_VALUE function is NVARCHAR(4000).

Warning! The maximum key length for a nonclustered index is 1700 bytes.
The index ‘IX_Customer_PhoneNumber’ has maximum length of 8000 bytes.
For some combination of large values, the insert/update operation will fail.

Let us verify the Customer Table storage detail after adding index on the computed column:

SpaceUsed by the Customer table after adding index on computed column

From the result it is clear that there is no change the storage used for the table, it has taken only the extra storage for the index. So the computed column is still not persisted only in the index tree the computed value is persisted. Which is similar to creating index on any other table column.

Now let us verify whether we have any improvement in the performance of the query to retrieve the Customer details based on the Phone JSON property value (i.e. by the indexed computed column PhoneNumber).

SELECT *
FROM dbo.Customer
WHERE PhoneNumber  = '99890000'

RESULT:
Get Customer Detail using Indexed Computed column

From the result it is clear that adding index on the computed column has drastically improved the IO and CPU time.

We can execute the below statement to disable the IO and TIME statistics, which we have enabled in the beginning of this article.

--Disable statistics
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO

Conclusion: If we have a need to retrieve record from table based on a particular property value in the JSON string column, in such scenario creating a computed column whose value is extracted by the JSON_VALUE function from the JSON string column and indexing is the optimal approach. It takes extra storage for the index just like indexing any other column. And the computed column is still not persisted even after creating an index on it. And while creating index it takes some extra time to evaluate and extract the computed value to store in the index pages, again this is noting different when create a index on a computed column.

[ALSO READ]: