Quantcast
Viewing all articles
Browse latest Browse all 14

Symmetric indices to make JOINs faster

I am frequently asked how to increase the performance of Rails, and here's a great starting point. This advice generalizes to just about any database or platform that relies on B-Tree indices. If your using MySQL out of the box, then this definitely applies. Consider the following three models which are a very basic "has and belongs to many" setup:

http://gist.github.com/191818

So as you can see, a user can be in many groups and a group can have many users, all by way of the memberships join relationship. The two example use cases I'm going to work with are:
  1. Given a user, what groups is he/she in?, and
  2. Given a group, who are the members?
Both of these are pretty typical, but can yield surprisingly different results from the database's perspective. If we try and get this data from the console we either start with a user and navigate to group, or the reverse. Here's the MySQL EXPLAIN output from the console ( I recommend viewing the RAW output unless I can figure out how to make github display it correctly):

http://gist.github.com/191820

Totally understanding the output of the EXPLAIN syntax is well outside the scope of this post, but we're going to need to cover the basics. The first thing you should notice is the word ALL in the type column and the NULL in possible_keys. This indicates that MySQL's query optimizer has no index to read from and will be forced to perform a table scan to return the result. In general, this will kill your performance. Note the rows value of 100. This value will be whatever the size of your table is. If you have 500,000 records, then the database will check all 500,000 rows. It's worth noting that for small datasets you'll see ALL and a possible_keys value. This means the optimizer believes that scanning the table will be faster than actually loading the index into memory. This is generally fine. So let's go ahead and add a composite index on [user_id,group_id]. The SQL is:
ALTER TABLE memberships ADD INDEX test_index(user_id,group_id)
Now let's repeat the previous queries.

http://gist.github.com/191821

This is where I see most people stop when it comes to performance optimization. Note though that these tables aren't the same! If you join from users to groups (the first query) you see a massive speedup. Only one row is consulted (instead of 100) and it's in the index. A further benefit we see in both queries is the "Using index" in the Extra column. This means that MySQL can determine the query result without ever checking the actual table because all required info is in the index (ie no extra disk hits). Unfortunately, joining from groups to users (second query) still (sorta) sucks. It says index instead of ALL, but that just means it will have to scan the entire index rather than scan the entire table on disk. This is a marginal improvement at best, so 50% of our use cases still suck. Here's the explanation: B-Tree indices are unidimensional structures. That means that the interior nodes of the index tree are strongly ordered, and thus cannot be arbitrarily accessed out of order. If that doesn't make any sense, it means that joining from users to groups is not an equivalent operation to joining groups to users because of the ordering of the index elements. So let's cleanup the second use case by adding an additional index, exactly like the first, except the order of the elements is reversed. Here's the SQL:
ALTER TABLE memberships ADD INDEX test_index1(group_id,user_id)
Now we have symmetric indices. Let's run our queries again with our second index in place:

http://gist.github.com/191823

Voila! Now it doesn't matter which way we join the tables because we have an index that is correctly ordered based on the directionality of the join. You can even see that the optimizer selects a different index (key) depending on which direction you join the tables, exactly as expected. Also, both queries now only require consulting the exact number of rows necessary and won't involve any further disk hits as both queries can be satisfied with data available entirely within the index.

Permalink | Leave a comment  »


Viewing all articles
Browse latest Browse all 14

Trending Articles