database indexes and optimization

Previously, I discussed database schema normalization. Next, I would like to add indexes and optimize a schema for high-volume production usage.

Consider the following schema:

CREATE TABLE foobar_users (
    user_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_groups (
    group_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    groupname VARCHAR(20) NOT NULL
);

CREATE TABLE foobar_users_groups (
    user_id INT UNSIGNED NOT NULL,
    group_id INT UNSIGNED NOT NULL,
    PRIMARY KEY(user_id, group_id)
);

Each table shown is in Sixth Normal Form (6NF) and contain no indexes other than the primary key. The schema represents a typical user/group mapping where there is a many-to-many relationship between users and groups (i.e., a user can belong to multiple groups, and a group contains multiple users).

I've randomly generated and loaded data into the tables. There is roughly twice as many entries in the mapping table as the domain tables:

mysql> SELECT COUNT(*) FROM foobar_users;
+----------+
| COUNT(*) |
+----------+
|   121005 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT COUNT(*) FROM foobar_users_groups;
+----------+
| COUNT(*) |
+----------+
|   240999 |
+----------+
1 row in set (0.00 sec)

Common queries will be noticeably slower with this amount of data. For example, getting a list of groups for a specific user can take over one second. Using the EXPLAIN command we can find out why, e.g.,

mysql> EXPLAIN -- fetch groups with 'bbob' as member
    -> SELECT g.group_id, g.groupname
    -> FROM foobar_users u, foobar_groups g, foobar_users_groups ug
    -> WHERE u.user_id = ug.user_id
    ->   AND g.group_id = ug.group_id
    ->   AND u.username = 'bbob';
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | rows   | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+
|  1 | SIMPLE      | ug    | index  | PRIMARY       | PRIMARY | 8       | 240999 | Using index |
|  1 | SIMPLE      | g     | eq_ref | PRIMARY       | PRIMARY | 4       |      1 |             |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY       | PRIMARY | 4       |      1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+

The culprit is the full table scan on foobar_users_groups, which is perhaps the worst thing we can do since it's the largest table with almost a quarter million rows. In the EXPLAIN output, if you look in the column identified as rows, you'll see that 240999 rows were scanned.

A natural join version of this same query performs just as poorly:

mysql> EXPLAIN -- natural join version
    -> SELECT group_id, groupname
    -> FROM foobar_users
    -> NATURAL JOIN foobar_users_groups
    -> NATURAL JOIN foobar_groups
    -> WHERE username = 'bbob';
+----+-------------+---------------------+--------+---------------+---------+---------+--------+-------------+
| id | select_type | table               | type   | possible_keys | key     | key_len | rows   | Extra       |
+----+-------------+---------------------+--------+---------------+---------+---------+--------+-------------+
|  1 | SIMPLE      | foobar_users_groups | index  | PRIMARY       | PRIMARY | 8       | 240999 | Using index |
|  1 | SIMPLE      | foobar_groups       | eq_ref | PRIMARY       | PRIMARY | 4       |      1 |             |
|  1 | SIMPLE      | foobar_users        | eq_ref | PRIMARY       | PRIMARY | 4       |      1 | Using where |
+----+-------------+---------------------+--------+---------------+---------+---------+--------+-------------+

Selecting which users belong to a specific group performs the same full table scan of foobar_users_groups. If we were to load millions of users and groups the performance would be unacceptable by any reasonable expectation.

mysql> EXPLAIN -- users in the MVD group
    -> SELECT u.user_id, u.username
    -> FROM foobar_users u, foobar_groups g, foobar_users_groups ug
    -> WHERE u.user_id = ug.user_id
    ->   AND g.group_id = ug.group_id
    ->   AND g.groupname = 'MVD';
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | rows   | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+
|  1 | SIMPLE      | ug    | index  | PRIMARY       | PRIMARY | 8       | 240999 | Using index |
|  1 | SIMPLE      | g     | eq_ref | PRIMARY       | PRIMARY | 4       |      1 | Using where |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY       | PRIMARY | 4       |      1 |             |
+----+-------------+-------+--------+---------------+---------+---------+--------+-------------+

Indexes

I want to add indexes to these tables to prevent full table scans. In fact, one of the main problems is that we're starting with a username and groupname and neither of those are indexed, only the user_id and group_id have indexes (from the primary key). Since these tables are fully normalized, there is no reason that username and groupname were not given the UNIQUE identifier in the create table statements, e.g.,

CREATE TABLE foobar_users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(20) NOT NULL UNIQUE
);

This minor change would not only have enforced a data integrity constraint but also would have created an index on username that would have alleviated the query performance issues.

Rather than re-create the table, we can add indexes to existing tables, as follows

mysql> CREATE UNIQUE INDEX username ON foobar_users(username);
Query OK, 121005 rows affected (1.72 sec)
Records: 121005  Duplicates: 0  Warnings: 0

mysql> CREATE UNIQUE INDEX groupname ON foobar_groups(groupname);
Query OK, 121005 rows affected (1.83 sec)
Records: 121005  Duplicates: 0  Warnings: 0

With these indexes in place, the query performance is noticeably improved, and the EXPLAIN command will show why:

mysql> EXPLAIN -- fetch groups with 'bbob' as member
    -> SELECT g.group_id, g.groupname
    -> FROM foobar_users u, foobar_groups g, foobar_users_groups ug
    -> WHERE u.user_id = ug.user_id
    ->   AND g.group_id = ug.group_id
    ->   AND u.username = 'bbob';
+----+-------------+-------+--------+------------------+----------+---------+------+-------------+
| id | select_type | table | type   | possible_keys    | key      | key_len | rows | Extra       |
+----+-------------+-------+--------+------------------+----------+---------+------+-------------+
|  1 | SIMPLE      | u     | const  | PRIMARY,username | username | 62      |    1 |             |
|  1 | SIMPLE      | ug    | ref    | PRIMARY          | PRIMARY  | 4       |   12 | Using index |
|  1 | SIMPLE      | g     | eq_ref | PRIMARY          | PRIMARY  | 4       |    1 |             |
+----+-------------+-------+--------+------------------+----------+---------+------+-------------+

The query is no longer scanning entire tables, in fact, the largest estimated row count is from the foobar_users_groups mapping table-- EXPLAIN estimates it needs to scan 12 rows, which corresponds correctly with the number of groups that will be returned.

However, querying users in a specific group remains slow, consider the following:

mysql> EXPLAIN -- users in the MVD group
    -> SELECT u.user_id, u.username
    -> FROM foobar_users u, foobar_groups g, foobar_users_groups ug
    -> WHERE u.user_id = ug.user_id
    ->   AND g.group_id = ug.group_id
    ->   AND g.groupname = 'MVD';
+----+-------------+-------+--------+-------------------+-----------+---------+--------+-------------+
| id | select_type | table | type   | possible_keys     | key       | key_len | rows   | Extra       |
+----+-------------+-------+--------+-------------------+-----------+---------+--------+-------------+
|  1 | SIMPLE      | g     | const  | PRIMARY,groupname | groupname | 62      |      1 |             |
|  1 | SIMPLE      | u     | ALL    | PRIMARY           | NULL      | NULL    | 121005 |             |
|  1 | SIMPLE      | ug    | eq_ref | PRIMARY           | PRIMARY   | 8       |      1 | Using index |
+----+-------------+-------+--------+-------------------+-----------+---------+--------+-------------+

The query is no longer doing a full table scan of foobar_users_groups, but it is doing a full table scan of the foobar_users table. The problem is, the primary key on foobar_users_groups is a composite key consisting of (user_id, group_id). The query (as written) is scanning all of the user_id's that might be part of that composite (user_id, group_id).

Adding a separate group_id index on the foobar_users_groups table will solve this, i.e.,

mysql> CREATE INDEX group_id ON foobar_users_groups(group_id);
Query OK, 240999 rows affected (1.03 sec)
Records: 240999  Duplicates: 0  Warnings: 0

mysql> EXPLAIN -- users in the MVD group
    -> SELECT u.user_id, u.username
    -> FROM foobar_users u, foobar_groups g, foobar_users_groups ug
    -> WHERE u.user_id = ug.user_id
    ->   AND g.group_id = ug.group_id
    ->   AND g.groupname = 'MVD';
+----+-------------+-------+--------+-------------------+-----------+---------+------+-------+
| id | select_type | table | type   | possible_keys     | key       | key_len | rows | Extra |
+----+-------------+-------+--------+-------------------+-----------+---------+------+-------+
|  1 | SIMPLE      | g     | const  | PRIMARY,groupname | groupname | 62      |    1 |       |
|  1 | SIMPLE      | ug    | ref    | PRIMARY,group_id  | group_id  | 4       |    8 |       |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY           | PRIMARY   | 4       |    1 |       |
+----+-------------+-------+--------+-------------------+-----------+---------+------+-------+
This entry was posted in data arch., mysql. Bookmark the permalink.