MySQL: ORDER BY RAND() – a case study of alternatives

I finally had the time to translate my post from german to english – so here we go 🙂

Das Problem

Many of you reading this post might already stopped by this problem – you just want a few random rows from your database and most of the times you might have used „ORDER BY RAND()“ and not thought about, what’s happening there. Just add all the columns you want to select, add JOINS, WHERE, etc. and add ORDER BY RAND() and LIMIT.

For small applications this might even scale well, but what happens in large applications with hundreds, thousands or millions of rows in a table?

I want to look a little closer into this. While developing my deprecated rhCMS (a content management system), I asked myself the same question.

Problem Analysis

First of all you got to ask „What happens with ORDER BY RAND() in a Query?“ and understand it.

The MySQL Database will query for all the rows you requested and filter out those that don’t match from WHERE, JOIN conditions etc. Then all rows will be in memory or in a temporary table. Then the RAND() will be replaced with a random number and attached to each row. After this, the set can be sorted ascending or descending. And then the LIMIT will be used. Using temporary tables is really bad because this causes a lot of problems and is not performant at all.

Solutions

Let’s take a look how we can solve this problem. First of all I want to show you a PHP solution:

A PHP solution

Query for the total number of rows:

<?php echo $mysql_hl->highlight('SELECT COUNT(*) FROM table'); ?>

and then query for a random number between 1 and the number of rows:

<?php echo $php_hl->highlight('<?php
  $random_id = mt_rand(1, [ COUNT(*) - Ergebnis ]);
?>'); ?>

Afterwards you can query for the ID:

<?php echo $mysql_hl->highlight('SELECT colA, colB FROM table WHERE idCol = ...'); ?>

But as you probably already asked yourself:

  1. What happens when I delete rows?
  2. What if I want more than one row?

The first question seems rather easy, right? Just query as long as you don’t get a row. But this isn’t the best solution!

And how about multiple rows? The easiest way could be put another loop around it and query for random IDs the number of times you want. This seems to be acceptable but it still has some weaknesses. Let’s have a look at a MySQL based solution:

MySQL solution

The trick is to create a MySQL PROCEDURE that fills a temporary table with rows containing random IDs that are returned to the user. But then we would have the same problem: What if the ID didn’t exist or it exists twice in our temporary table?

The final solution looks like this:

  1. Create a temporary table to save random IDs and make the ID a UNIQUE KEY.
  2. Query the maximum ID
  3. Put the IDs into the table
  4. Return the dataset

You might ask: What happens if I insert an ID twice? The solution is simple: The error handler will recursively call our procedure. This way the UNIQUE KEY will make sure no ID is inserted twice and another ID is selected in this case.

And this is what the code looks like:

<?php

echo $mysql_hl->highlight("DROP PROCEDURE IF EXISTS proc_get_random_post_id;

DELIMITER //

CREATE PROCEDURE proc_get_random_post_id (IN cnt INT)
BEGIN
  DECLARE max_id INTEGER;

  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    CALL proc_get_random_post_id(cnt);
  END;

  IF cnt >= 1 THEN
    DROP TEMPORARY TABLE IF EXISTS random;
    CREATE TEMPORARY TABLE random ( post_id INTEGER(11) unsigned, UNIQUE KEY(post_id) );

    insert_loop : LOOP
      IF cnt < 1 THEN
        LEAVE insert_loop;
      END IF;

      SELECT MAX(post_id) FROM posts INTO max_id;

      INSERT INTO random
        SELECT
         p.post_id
        FROM
          posts AS p
        JOIN
          (SELECT (RAND() *  max_id) AS random_id) AS r
        WHERE
          p.post_id >= r.random_id
        ORDER BY
          p.post_id
        LIMIT
          1;

      SET cnt = cnt - 1;
    END LOOP insert_loop;

    SELECT post_id FROM random;
  END IF;
END//

DELIMITER ;");
?>

Looks pretty much to get a few rows, huh? But is this really better than a PHP based solution? Let’s have a look at some benchmarks:

A problem could be the maximum recursion depth that is 0 by default. So make sure you increase it:

<?php echo $mysql_hl->highlight('SET max_sp_resursion_depth = 2;'); ?>

Note: „2“ means the number of collisions here. So make sure to increase this value for larger tables.

With a  few rows (~ 25000) the procedure wasn’t that much larger but with ~230.000 rows the procedure was almost 10 times faster! And then I queried a forum with ~4500000 posts for a random post id. The results looks like this:

The first query with the following code took about 72 seconds:

<?php echo $mysql_hl->highlight('SELECT post_id FROM posts ORDER BY RAND() LIMIT 20'); ?>
mysql&gt; SELECT post_id FROM posts ORDER BY RAND() LIMIT 20;
+---------+
| post_id |
+---------+
| 2013034 |
| 3579000 |
| 3316421 |
| 4093120 |
| 2359496 |
| 1885511 |
| 3835027 |
| 1046873 |
| 1185652 |
| 4331911 |
| 4890819 |
|  240006 |
| 2437795 |
| 1712680 |
|  868551 |
| 2096902 |
| 3083704 |
|  856132 |
| 2190370 |
| 4862958 |
+---------+
20 rows in set (52.66 sec)

The second one 35 seconds and the average was at about 25 seconds.

Note: When querying an ID with another column (e.g. a DATETIME field) the query only took about 5 seconds. And now let’s have a look into our procedure:

<?php echo $mysql_hl->highlight('CALL proc_get_random_post_id(20);'); ?>
mysql&gt; CALL proc_get_random_post_id(20);
  +---------+
  | post_id |
  +---------+
  |  354319 |
  |  436352 |
  |  762503 |
  |  924832 |
  | 1395933 |
  | 1515758 |
  | 1633515 |
  | 1731376 |
  | 1789180 |
  | 2512748 |
  | 2549002 |
  | 2854479 |
  | 2863860 |
  | 3103881 |
  | 3233293 |
  | 4337101 |
  | 4560893 |
  | 4596010 |
  | 4609112 |
  | 4960802 |
  +---------+
  20 rows in set (0.04 sec)

This small piece of code now calls our procedure in the background and only takes about 0,034 seconds!

This small example shows you how to quickly query for rows. And it is very easy to extend this to query for even more columns.

Another PHP solution

Someone asked me for a pure PHP solution and that’s what I came up with:

<?php echo $php_hl->highlight('<?php
  /**
   * Default Random ID Class
   *
   * @author Robert Hartung
   * @copyright www.roberthartung.de, 2014
   */

  class randomIdList
  {
    /**
     * MySQLI instance
     *
     * @var mysqli
     */

    protected $mysqli;

    /**
     * Maximum duplicate key errors
     *
     * @var integer
     */

    private $max_errors = 10;

    /**
     * Random ID Count
     *
     * @var integer
     */

    private $count;

    /**
     * Table to select from
     *
     * @var string
     */

    private $table;

    /**
     * Column that contains the ID
     *
     * @var string
     */

    private $column_id;

    /**
     * Constructor
     */

    public function __construct(mysqli $mysqli, $table, $column_id)
    {
      $this->mysqli = $mysqli;
      $this->table = $table;
      $this->column_id = $column_id;
    }

    /**
     * Returns a result set of random IDs
     *
     * @param boolean $return Query for the results?
     *
     * @return mysqli_result
     */

    public function getIDs($count, $return = true)
    {
      $this->count = $count;

      /* Do some preparing stuff */

      $this->mysqli->multi_query("DROP TEMPORARY TABLE IF EXISTS random;
      CREATE TEMPORARY TABLE random ( random_id INTEGER(11) unsigned, UNIQUE KEY(random_id) );
      SELECT @max_id := MAX(post_id) FROM pmcms_forum_posts");

      /* Fetch the results otherwise: Command out of sync error */

      while($this->mysqli->more_results())
      {
        $this->mysqli->next_result();
        $this->mysqli->use_result();
      }

      /* Prepare the statement so the query needs to be parsed only once! */

      $stmnt = $this->mysqli->prepare("INSERT INTO random
          SELECT
           ".$this->column_id."
          FROM
            ".$this->table." AS p
          JOIN
            (SELECT (RAND() * @max_id) AS random_id) AS r
          WHERE
            ".$this->column_id." >= r.random_id
          ORDER BY
            ".$this->column_id."
          LIMIT
            1");

      /* Get Random IDs */

      while($this->count > 0 && $this->max_errors > 0)
      {
         $stmnt->execute();

         /* if there was an error - go on */
         if($this->mysqli->errno === 1062)
         {
           $this->max_errors -= 1;
         }
         else
         {
           $this->count -= 1;
         }
      }

      /* If requested, return the result otherwise null*/

      if($return)
        return $this->mysqli->query("SELECT random_id FROM random");

      return null;
    }
  }
?>');
?>

The class expects a MySQLi instance as a parameter as well as the table name and your ID’s column name. Using the method getIDs($count, [ $return = true ]) you can ask for rows. The function uses the same concept as our MySQL procedure

  1. Create a temporary table
  2. Get maximum
  3. Find random IDs
  4. Return resulös

This object oriented solution has another benefit: You don’t have to create a procedure for every table in your database. You could easily extend this class to query for more than this column:

<?php echo $php_hl->highlight('<?php
  /**
   * Example Class for Posts
   *
   * @author Robert Hartung
   * @copyright www.roberthartung.de, 2014
   */

  final class randomPosts extends randomIdList
  {
    public function __construct(mysqli $mysqli)
    {
      parent::__construct($mysqli, \'forum_posts\', \'post_id\');
    }

    /**
     * Returns a different result set
     *
     * @return mysqli_result
     */

    public function getIDs($count)
    {
      parent::getIDs($count, false);

      return $this->mysqli->query("SELECT post_id FROM random JOIN forum_posts ON post_id = random_id");
    }

    public function getRandomPosts($count)
    {
      parent::getIDs($count, false);

      return $this->mysqli->query("SELECT post_id, post_datetime, post_topic_id FROM random JOIN forum_posts ON post_id = random_id");
    }
  }
?>');
?>

We could use this class like this:

<?php echo $php_hl->highlight('<?php
  /** Basic */

  $random = new randomIdList($mysqli, \'pmcms_forum_posts\', \'post_id\');
  $qry_random = $random->getIDs(5);

  while($number = $qry_random->fetch_assoc())
  {
    /* Do something */
  }

  /** Basic 2 */

  $qry_random = $random->getIDs(10);

  while($number = $qry_random->fetch_assoc())
  {
    /* Do something */
  }

  /** Inherited class */

  $random = new randomPosts($mysqli);
  $qry_random = $random->getIDs(5);

  while($number = $qry_random->fetch_assoc())
  {
    /* Do something */
  }

  $qry_random = $random->getRandomPosts(5);

  while($post = $qry_random->fetch_assoc())
  {
    /* Do something */
  }
?>');
?>

This should give you a good idea on how to avoid using MySQL’s ORDER BY RAND(). I appreciate any kind of feedback.

2 Gedanken zu „MySQL: ORDER BY RAND() – a case study of alternatives

  1. This has been very helpful; but I’m noticing that i often don’t get the desired count when executing, this occurs when the SELECT set has limited rows, maybe less than 25. Is there a way to resolve this?

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.

Please enter the captcha *