Relational databases like MySQL, Postgres, or MSSQL were designed to store two-dimensional (tabular) data, not multi-dimensional data. If our data has relationships, we have two options: duplication and/or normalization. Also, even if they are not as effective as document-based databases, most relational databases support document columns such as JSON. (By the way, my favorite one is Postgres by far.)
Let's look at MySQL.
JSON support in MySQL is not great, but it's better than nothing. Basically, MySQL has a column type called JSON that allows JSON data to be stored. If you want, you can select just some of these JSON fields, use them in WHERE clauses to filter results, and so on. But when it comes to large datasets, performance becomes horrible. Unfortunately, MySQL does not support adding an index to a JSON field directly. What we can do to gain performance is create a generated column and add an index to that.
So, if you are happy with MySQL and your only need is to run some queries on JSON fields faster, creating MySQL generated columns and indexing them might be a good solution.
Let me say my last words at the start: this method works, but "it works" does not mean it should be the first choice. My opinions on deciding whether to use this technique:
- If you can, create another column with an index and store the data in there.
- If you have many JSON fields on the same or different tables, MySQL might be the wrong choice for your stack. Consider using Postgres or a document-oriented alternative like MongoDB.
- If all you need is to run some queries on JSON fields quickly and you don't want to migrate to another database right now, this technique will probably help you.
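The first option above can be sketched in SQL. The table and column names here are hypothetical, since no concrete schema has been introduced yet; the idea is simply to keep the frequently-filtered field in a regular, indexed column alongside the JSON:

```sql
-- Hypothetical table: "events" with a JSON "payload" column.
-- Keep the field we filter on in its own regular column.
ALTER TABLE events ADD COLUMN country VARCHAR(40) NOT NULL DEFAULT '';
CREATE INDEX events_country_index ON events (country);

-- Write the value into both places on insert:
INSERT INTO events (payload, country)
VALUES ('{"country": "Turkey"}', 'Turkey');
```

The cost of this option is that the application must keep the two copies in sync on every write.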
What Is a MySQL Generated Column?
MySQL added a feature called Generated Column in version 5.7. It's called a generated column because its value is computed from other columns. For example, if you have first_name and last_name columns, you can add a full_name column by concatenating them. For this scenario, creating a generated column is as simple as running this query:
ALTER TABLE users ADD COLUMN `full_name` varchar(255) GENERATED ALWAYS AS (CONCAT(first_name, ' ', last_name));
After running the query, the table acts as if it has a full_name column. Queries like SELECT full_name FROM users
or SELECT * FROM users WHERE full_name = 'Emincan Ozcan'
are now valid.
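One detail worth knowing: MySQL generated columns come in two flavors, VIRTUAL and STORED. A VIRTUAL column (the default, and the kind used later in this article) is computed when the row is read and takes no table storage; a STORED column is computed on INSERT/UPDATE and written to disk. Both kinds can be indexed:

```sql
-- VIRTUAL: computed on read, no extra storage in the table
ALTER TABLE users ADD COLUMN `full_name_v` varchar(255)
  GENERATED ALWAYS AS (CONCAT(first_name, ' ', last_name)) VIRTUAL;

-- STORED: computed on write and persisted to disk
ALTER TABLE users ADD COLUMN `full_name_s` varchar(255)
  GENERATED ALWAYS AS (CONCAT(first_name, ' ', last_name)) STORED;
```

(The `full_name_v` / `full_name_s` names are just for illustration.)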
Using a Virtual Column & Index for a MySQL JSON Column
As an example, let's say we have a table named hits
with a JSON column named user_data
whose fields are country, ip_address, and device
. So the data we store looks like this:
{
"country": "Turkey",
"ip_address": "78.173.68.165",
"device": "desktop"
}
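To follow along, a minimal version of the table can be created like this. The table and column names come from the example above; the id column and exact types are my assumptions:

```sql
-- Minimal schema for the example (id column is an assumption)
CREATE TABLE hits (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_data JSON NOT NULL
);

INSERT INTO hits (user_data) VALUES
  ('{"country": "Turkey", "ip_address": "78.173.68.165", "device": "desktop"}');
```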
Let's assume we need to find the count of hits from Turkey. We can query MySQL as follows:
SELECT count(*) FROM hits WHERE user_data->>"$.country" = "Turkey"
If the table has relatively few rows, this query performs well.
But as the row count grows, it becomes very slow. In my test environment, I seeded MySQL with 10 million rows and the average query time was 5.7 seconds. (I ran the query many times; this value is the average.)
Now, let's create a virtual column and name it country_virtual.
ALTER TABLE hits ADD COLUMN `country_virtual` VARCHAR(40) GENERATED ALWAYS AS (`user_data` ->> '$.country') VIRTUAL;
So, we have a country_virtual column. Let's run the following query to find the count of hits from Turkey:
SELECT count(*) FROM hits WHERE country_virtual = "Turkey"
The good news is that the query works. The bad news is that it takes even more time than the previous query that used the JSON field directly: 6.9 seconds. So, let's do some magic by adding an index.
CREATE INDEX `country_virtual_index` ON `hits`(`country_virtual`);
Now, we can find the count of hits from Turkey with the following query.
SELECT count(*) FROM hits WHERE country_virtual = "Turkey"
In my testing environment, the average query time dropped to 280 ms, much faster than the earlier 5.7 seconds!
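To verify that the speedup really comes from the index rather than, say, caching, you can inspect the query plan. With the index in place, EXPLAIN should report a lookup on country_virtual_index instead of a full table scan:

```sql
-- Check that the optimizer actually uses the index
EXPLAIN SELECT count(*) FROM hits WHERE country_virtual = 'Turkey';
```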
(I ran the tests on a mysql:8 Docker container. My testing environment is not rigorous; there are lots of variables that might affect this performance: total dataset size, data types, MySQL configuration, hardware speed, etc. For example, I also ran tests using an unsigned integer column instead, and the result went from 5.6 s to 24 ms. If you plan to use this technique in production, I recommend testing it in a production-like environment with production-like data.)
Implementation In Laravel
Generating a virtual column and adding an index to it is easy with Laravel migrations. For our example:
// creating the json column
$table->json("user_data");

// creating the virtual column and adding an index
$table->string('country_virtual')
    ->virtualAs("(`user_data` ->> '$.country')")
    ->index();
Then, using the Eloquent model, we can write:
$hits = App\Models\Hit::where("country_virtual", "Turkey")->count();
Since I already said my final words at the start (about how to decide), I'm done here. Hope this was helpful! 👋🏻