Curve sort

In this post I want to present an integer sorting algorithm I invented (at least independently from any researcher) a week ago. It is a counting sort that adapts to data. As you remember, the conventional counting sort creates an integer array C in the beginning of actual sorting and initializes each component to zero. After that, it iterates over the input range a_1, a_2, \dots, a_n, and for each a_i it increments the array component C[a_i - \min a_i], thus counting the number of occurrences of a_i. The second step is to iterate C from beginning to end, and at each a_i, insert C[a_i - \min a_i] copies of a_i in the input array. Now, the limitation of this algorithm should be obvious. First, the “width” of the data w = \max a_i - \min a_i + 1 must be known in advance. If it is not, the counting sort must sacrifice a single pass in order to determine the width of the input range. This, however, does not solve the next problem: if w = \omega(n), the sort will degrade to superlinear running time, or, even worse, run out of memory while allocating C.

One way to address the problematic issues of counting sort is to drop the counter array C and use a (doubly-linked) list. In that list, each node records the integer it is counting, the actual counter, and two links to possible previous and next nodes. The list is sorted by integer keys its nodes store. Now, whenever processing a new element, we search for a node containing that very same element, and once found, increments the counter. If the algorithm notices the absence of an appropriate node, it creates the node and inserts it in its proper location so that the list remains sorted.

As the length of the list is at most n nodes, inserting each element requires in the worst case \Theta(n) time, and as there is n elements in the input range, the worst case running time is obviously \Theta(n^2). Notwithstanding, the algorithm can do better: suppose k = |\{ a_1, a_2, \dots, a_n \}| is the amount of distinct integers. Now the length of the list is bounded from above by k, which leads to running time \mathcal{O}(nk). What comes to the adaptive nature of the sort, relatively large k does not inevitably imply that it will run slowly. In fact, we did an optimization: we store the reference to the list node last inserted or updated. As we proceed to the next array element, we start moving in the list from the previously updated node, and if the next node is “close” to the previous, we will get to the new node faster. So at this point, we get a conjecture.

Conjecture
The “smoother” is the curve obtained from (i, a_i), the faster the sort will run.

My experiments support the conjecture. I generated an array of 5e6 (5000000) integers containing roughly 100 000 distinct integers. For that array, the plot (i, a_i) was a sine wave with amplitude roughly 50000 and period length around 628000 array components. My counting sort consistently runs in under 100 milliseconds, which turned out to be around five times faster than Arrays.sort(int[]) of Java Development Kit 1.8. Yet, as the worst case time is quadratic, I do not hope it to cope with linearithmic time sorts in the average case, so I profiled it against an optimized insertion sort called “straight insertion sort” that spares a lot of assignment operations. Regardless, my bulk demonstration reports insertion sort demanding around 20 seconds to sort all the arrays, whereas my sort requires only around two and a half seconds.

Experiment code

Curvesort.java

package net.coderodde.util.sorting;

/**
 * This class implements an adaptive counting sort that adapts to the input.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6
 */
public final class Curvesort {

    /**
     * Sorts the entire input integer array.
     * 
     * @param array the integer array to sort.
     */
    public static void sort(int[] array) {
        sort(array, 0, array.length);
    }

    /**
     * Sorts the range {@code array[fromIndex], array[fromIndex + 1], ...,
     * array[toIndex - 2], array[toIndex - 1]}.
     * 
     * @param array     the array containing the range to sort.
     * @param fromIndex the starting, inclusive range index.
     * @param toIndex   the ending, exclusive range index.
     */
    public static void sort(int[] array, int fromIndex, int toIndex) {
        if (toIndex - fromIndex < 2) {
            return;
        }
        
        Curvesort sort = new Curvesort(array, fromIndex, toIndex);
        sort.count();
        sort.buildRange();
    }
        
    /**
     * The node containing the least integer so far.
     */
    private Node head;
    
    /**
     * The node containing the largest integer so far.
     */
    private Node tail;
    
    /**
     * The node updated or created at previous array component.
     */
    private Node previous;
    
    /**
     * The actual array containing the range to sort.
     */
    private final int[] array;
    
    /**
     * The starting, inclusive index of the range to sort.
     */
    private final int fromIndex;
    
    /**
     * The ending, exclusive index of the range to sort.
     */
    private final int toIndex;
    
    /**
     * This field caches the previous array component.
     */
    private int lastElement;
    
    /**
     * Constructs the state needed for sorting.
     * 
     * @param array     the array containing the range to sort.
     * @param fromIndex the starting index of the range to sort.
     * @param toIndex   the ending index of the range to sort.
     */
    private Curvesort(int[] array, int fromIndex, int toIndex) {
        this.lastElement = array[fromIndex];
        this.array = array;
        this.fromIndex = fromIndex;
        this.toIndex = toIndex;
        this.head = this.tail = this.previous = new Node(lastElement);
    }
    
    /**
     * If {@code currentElement} is less than {@code lastElement}, inserts it in
     * its proper location.
     * 
     * @param currentElement the current element to insert.
     */
    private void findAndUpdateSmallerNode(int currentElement) {
        Node tmp = previous.prev;

        // Go down the node chain towards the nodes with smaller keys.
        while (tmp != null && tmp.element > currentElement) {
            tmp = tmp.prev;
        }

        if (tmp == null) {
            // 'currentElement' is the new minimum. Create new head node and put
            // the integer in it.
            Node newnode = new Node(currentElement);
            newnode.next = head;
            head.prev = newnode;
            head = newnode;
            previous = newnode;
        } else if (tmp.element == currentElement) {
            // The node containing 'currentElement' exists. Just increment the
            // counter.
            tmp.count++;
            previous = tmp;
        } else {
            // Insert a new node between 'tmp' and 'tmp.next'.
            Node newnode = new Node(currentElement);
            newnode.prev = tmp;
            newnode.next = tmp.next;
            newnode.prev.next = newnode;
            newnode.next.prev = newnode;
            previous = newnode;
        }
    }
    
    /**
     * If {@code currentElement} is larger than {@code lastElement}, inserts it
     * in its proper location.
     * 
     * @param currentElement the current element to insert.
     */
    private void findAndUpdateLargerNode(int currentElement) {
        Node tmp = previous.next;

        // Go up the chain towards the nodes with larger keys.
        while (tmp != null && tmp.element < currentElement) {
            tmp = tmp.next;
        }

        // 'currentElement' is the new maximum. Create new tail node and put the
        // integer in it.
        if (tmp == null) {
            Node newnode = new Node(currentElement);
            newnode.prev = tail;
            tail.next = newnode;
            tail = newnode;
            previous = newnode;
        } else if (tmp.element == currentElement) {
            // The node containing 'currentElement' exists. Just increment the 
            // counter.
            tmp.count++;
            previous = tmp;
        } else {
            // Insert a new node between 'tmp.prev' and 'tmp'.
            Node newnode = new Node(currentElement);
            newnode.prev = tmp.prev;
            newnode.next = tmp;
            tmp.prev.next = newnode;
            tmp.prev = newnode;
            previous = newnode;
        }
    }
    
    /**
     * Constructs the sorted chain of integers with their counts.
     */
    private void count() {
        for (int i = fromIndex + 1; i < toIndex; ++i) {
            int currentElement = array[i];
            
            if (currentElement < lastElement) {
                findAndUpdateSmallerNode(currentElement);
            } else if (currentElement > lastElement) {
                findAndUpdateLargerNode(currentElement);
            } else {
                previous.count++;
            }
            
            lastElement = currentElement;
        }
    }
    
    /**
     * Builds the range being sorted.
     */
    private void buildRange() {
        int index = fromIndex;
        
        for (Node node = head; node != null; node = node.next) {
            int element = node.element;
            int count = node.count;
            
            for (int i = 0; i < count; ++i) {
                array[index++] = element;
            }
        }
    }

    /**
     * Implements a node in a doubly linked list containing the integer and its
     * count.
     */
    private static final class Node {

        Node(int element) {
            this.element = element;
            this.count = 1;
        }

        Node prev;
        Node next;
        int element;
        int count;
    }
}

Insertionsort.java

package net.coderodde.util.sorting;

/**
 * This class provides a static method for sorting integer arrays using 
 * insertion sort.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6
 */
public class Insertionsort {

    /**
     * Sorts the entire input integer array.
     * 
     * @param array the integer array to sort.
     */
    public static void sort(int[] array) {
        sort(array, 0, array.length);
    }

    /**
     * Sorts the range {@code array[fromIndex], array[fromIndex + 1], ...,
     * array[toIndex - 2], array[toIndex - 1]}.
     * 
     * @param array     the array containing the range to sort.
     * @param fromIndex the starting, inclusive range index.
     * @param toIndex   the ending, exclusive range index.
     */
    public static void sort(int[] array, int fromIndex, int toIndex) {
        for (int i = fromIndex + 1; i < toIndex; ++i) {
            int element = array[i];
            int j = i;

            for (; j > fromIndex && array[j - 1] > element; --j) {
                array[j] = array[j - 1];
            }

            array[j] = element;
        }
    }    
}

Demo.java

import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;
import net.coderodde.util.sorting.Curvesort;
import net.coderodde.util.sorting.Insertionsort;

public class Demo {

    /**
     * The number of iterations for each array type.
     */
    private static final int OPERATION_COUNT = 30;

    /**
     * The maximum length of the array to profile against.
     */
    private static final int LENGTH = 40000;

    /**
     * The assumed console window width in characters.
     */
    private static final int CONSOLE_WIDTH = 80;

    public static void main(final String... args) {
        System.out.println(title("Smooth data test against Arrays.sort"));
        
        testSin();
        
        System.out.println(title("Bulk test"));
        
        long seed = System.currentTimeMillis();
        Random random = new Random(seed);

        int[] array1;
        int[] array2;

        long totalMySort = 0L;
        long totalInsertionsort = 0L;

        System.out.println("Seed: " + seed);
        System.out.println(title("Random arrays"));

        //// RANDOM ARRAYS ////
        for (int op = 0; op < OPERATION_COUNT; ++op) {
            int maxValue = 20 + 20 * op;
            System.out.println("Max value: " + maxValue);

            array1 = getRandomIntegerArray(LENGTH, maxValue, random);
            array2 = array1.clone();

            int fromIndex = random.nextInt(LENGTH / 20);
            int toIndex = LENGTH - random.nextInt(LENGTH / 20);

            long startTime = System.currentTimeMillis();
            Curvesort.sort(array1, fromIndex, toIndex);
            long endTime = System.currentTimeMillis();
            long duration = endTime - startTime;

            System.out.println("Counting sort in " + duration 
                                                   + " milliseconds.");

            totalMySort += duration;

            startTime = System.currentTimeMillis();
            Insertionsort.sort(array2, fromIndex, toIndex);
            endTime = System.currentTimeMillis();
            duration = endTime - startTime;

            System.out.println("Insertion sort in " + duration
                                                    + " milliseconds.");
            System.out.println(bar());
            totalInsertionsort += duration;

            if (!Arrays.equals(array1, array2)) {
                throw new RuntimeException("Sorts did not agree.");
            }
        }

        System.out.println();
        System.out.println(title("Presorted arrays"));

        //// PRESORTED ARRAYS ////
        for (int op = 0; op < OPERATION_COUNT; ++op) {
            int runAmount = 20 + 20 * op;
            System.out.println("Run amount: " + runAmount);

            array1 = getPresortedIntegerArray(LENGTH, runAmount, random);
            array2 = array1.clone();

            int fromIndex = random.nextInt(LENGTH / 20);
            int toIndex = LENGTH - random.nextInt(LENGTH / 20);

            long startTime = System.currentTimeMillis();
            Curvesort.sort(array1, fromIndex, toIndex);
            long endTime = System.currentTimeMillis();
            long duration = endTime - startTime;

            System.out.println("Counting sort in " + duration 
                                                   + " milliseconds.");

            totalMySort += duration;

            startTime = System.currentTimeMillis();
            Insertionsort.sort(array2, fromIndex, toIndex);
            endTime = System.currentTimeMillis();
            duration = endTime - startTime;

            System.out.println("Insertion sort in " + duration
                                                    + " milliseconds.");
            System.out.println(bar());
            totalInsertionsort += duration;

            if (!Arrays.equals(array1, array2)) {
                throw new RuntimeException("Sorts did not agree.");
            }
        }

        System.out.println("Total insertion sort time: " + 
                           totalInsertionsort + " milliseconds.");
        System.out.println("Total counting sort time:  " + 
                           totalMySort + " milliseconds.");
    }

    private static int[] getRandomIntegerArray(int size, 
                                               int maxValue, 
                                               Random random) {
        return IntStream.range(0, size)
                        .map((i) -> random.nextInt(maxValue))
                        .toArray();
    }

    private static int[] getPresortedIntegerArray(int size, 
                                                  int runs, 
                                                  Random random) {
        int[] ret = getRandomIntegerArray(size, size, random);
        int chunkSize = size / runs + 1;
        int chunkId = 0;

        for (; chunkId < size / chunkSize; chunkId++) {
            Arrays.sort(ret, 
                        chunkSize * chunkId, 
                        Math.min(size, (chunkId + 1) * chunkSize));
        }

        return ret;
    }

    private static String title(String s) {
        int textWithSpacesLength = s.length() + 2;
        int leftBarLength = (CONSOLE_WIDTH - textWithSpacesLength) / 2;
        int rightBarLength = CONSOLE_WIDTH - leftBarLength 
                                           - textWithSpacesLength;

        StringBuilder sb = new StringBuilder(CONSOLE_WIDTH);

        for (int i = 0; i < leftBarLength; ++i) {
            sb.append('-');
        }

        sb.append(' ').append(s).append(' ');

        for (int i = 0; i < rightBarLength; ++i) {
            sb.append('-');
        }

        return sb.toString();
    }

    private static String bar() {
        StringBuilder sb = new StringBuilder(CONSOLE_WIDTH);

        for (int i = 0; i < CONSOLE_WIDTH; ++i) {
            sb.append('-');
        }

        return sb.toString();
    }
    
    private static void testSin() {
        int[] array1 = new int[5000000];
        
        for (int i = 0; i < array1.length; ++i) {
            array1[i] = (int)(50000 * Math.sin(1.0 * i / 50000.0));
        }
        
        int[] array2 = array1.clone();
        
        long startTime = System.currentTimeMillis();
        Curvesort.sort(array1, 10, array1.length - 10);
        long endTime = System.currentTimeMillis();
        
        System.out.println("Counting sort in " + (endTime - startTime) +
                           " milliseconds.");
        
        startTime = System.currentTimeMillis();
        Arrays.sort(array2, 10, array2.length - 10);
        endTime = System.currentTimeMillis();
        
        System.out.println("Arrays.sort in " + (endTime - startTime) + 
                           " milliseconds.");
        
        System.out.println("Equal: " + Arrays.equals(array1, array2));
    }
}

The typical output might look like:


--------------------- Smooth data test against Arrays.sort ---------------------
Counting sort in 87 milliseconds.
Arrays.sort in 498 milliseconds.
Equal: true
---------------------------------- Bulk test -----------------------------------
Seed: 1476129336289
-------------------------------- Random arrays ---------------------------------
Max value: 20
Counting sort in 5 milliseconds.
Insertion sort in 303 milliseconds.
--------------------------------------------------------------------------------
Max value: 40
Counting sort in 3 milliseconds.
Insertion sort in 318 milliseconds.
--------------------------------------------------------------------------------
Max value: 60
Counting sort in 5 milliseconds.
Insertion sort in 338 milliseconds.
--------------------------------------------------------------------------------
Max value: 80
Counting sort in 3 milliseconds.
Insertion sort in 348 milliseconds.
--------------------------------------------------------------------------------
Max value: 100
Counting sort in 3 milliseconds.
Insertion sort in 334 milliseconds.
--------------------------------------------------------------------------------
Max value: 120
Counting sort in 4 milliseconds.
Insertion sort in 328 milliseconds.
--------------------------------------------------------------------------------
Max value: 140
Counting sort in 5 milliseconds.
Insertion sort in 328 milliseconds.
--------------------------------------------------------------------------------
Max value: 160
Counting sort in 5 milliseconds.
Insertion sort in 329 milliseconds.
--------------------------------------------------------------------------------
Max value: 180
Counting sort in 6 milliseconds.
Insertion sort in 347 milliseconds.
--------------------------------------------------------------------------------
Max value: 200
Counting sort in 6 milliseconds.
Insertion sort in 307 milliseconds.
--------------------------------------------------------------------------------
Max value: 220
Counting sort in 7 milliseconds.
Insertion sort in 334 milliseconds.
--------------------------------------------------------------------------------
Max value: 240
Counting sort in 8 milliseconds.
Insertion sort in 395 milliseconds.
--------------------------------------------------------------------------------
Max value: 260
Counting sort in 9 milliseconds.
Insertion sort in 342 milliseconds.
--------------------------------------------------------------------------------
Max value: 280
Counting sort in 9 milliseconds.
Insertion sort in 352 milliseconds.
--------------------------------------------------------------------------------
Max value: 300
Counting sort in 9 milliseconds.
Insertion sort in 346 milliseconds.
--------------------------------------------------------------------------------
Max value: 320
Counting sort in 10 milliseconds.
Insertion sort in 359 milliseconds.
--------------------------------------------------------------------------------
Max value: 340
Counting sort in 11 milliseconds.
Insertion sort in 327 milliseconds.
--------------------------------------------------------------------------------
Max value: 360
Counting sort in 10 milliseconds.
Insertion sort in 311 milliseconds.
--------------------------------------------------------------------------------
Max value: 380
Counting sort in 12 milliseconds.
Insertion sort in 309 milliseconds.
--------------------------------------------------------------------------------
Max value: 400
Counting sort in 11 milliseconds.
Insertion sort in 317 milliseconds.
--------------------------------------------------------------------------------
Max value: 420
Counting sort in 13 milliseconds.
Insertion sort in 355 milliseconds.
--------------------------------------------------------------------------------
Max value: 440
Counting sort in 12 milliseconds.
Insertion sort in 314 milliseconds.
--------------------------------------------------------------------------------
Max value: 460
Counting sort in 14 milliseconds.
Insertion sort in 346 milliseconds.
--------------------------------------------------------------------------------
Max value: 480
Counting sort in 13 milliseconds.
Insertion sort in 322 milliseconds.
--------------------------------------------------------------------------------
Max value: 500
Counting sort in 17 milliseconds.
Insertion sort in 345 milliseconds.
--------------------------------------------------------------------------------
Max value: 520
Counting sort in 14 milliseconds.
Insertion sort in 320 milliseconds.
--------------------------------------------------------------------------------
Max value: 540
Counting sort in 17 milliseconds.
Insertion sort in 326 milliseconds.
--------------------------------------------------------------------------------
Max value: 560
Counting sort in 16 milliseconds.
Insertion sort in 307 milliseconds.
--------------------------------------------------------------------------------
Max value: 580
Counting sort in 18 milliseconds.
Insertion sort in 337 milliseconds.
--------------------------------------------------------------------------------
Max value: 600
Counting sort in 19 milliseconds.
Insertion sort in 305 milliseconds.
--------------------------------------------------------------------------------

------------------------------- Presorted arrays -------------------------------
Run amount: 20
Counting sort in 36 milliseconds.
Insertion sort in 322 milliseconds.
--------------------------------------------------------------------------------
Run amount: 40
Counting sort in 35 milliseconds.
Insertion sort in 328 milliseconds.
--------------------------------------------------------------------------------
Run amount: 60
Counting sort in 13 milliseconds.
Insertion sort in 407 milliseconds.
--------------------------------------------------------------------------------
Run amount: 80
Counting sort in 19 milliseconds.
Insertion sort in 320 milliseconds.
--------------------------------------------------------------------------------
Run amount: 100
Counting sort in 28 milliseconds.
Insertion sort in 355 milliseconds.
--------------------------------------------------------------------------------
Run amount: 120
Counting sort in 22 milliseconds.
Insertion sort in 327 milliseconds.
--------------------------------------------------------------------------------
Run amount: 140
Counting sort in 33 milliseconds.
Insertion sort in 323 milliseconds.
--------------------------------------------------------------------------------
Run amount: 160
Counting sort in 35 milliseconds.
Insertion sort in 318 milliseconds.
--------------------------------------------------------------------------------
Run amount: 180
Counting sort in 39 milliseconds.
Insertion sort in 319 milliseconds.
--------------------------------------------------------------------------------
Run amount: 200
Counting sort in 46 milliseconds.
Insertion sort in 329 milliseconds.
--------------------------------------------------------------------------------
Run amount: 220
Counting sort in 53 milliseconds.
Insertion sort in 327 milliseconds.
--------------------------------------------------------------------------------
Run amount: 240
Counting sort in 38 milliseconds.
Insertion sort in 313 milliseconds.
--------------------------------------------------------------------------------
Run amount: 260
Counting sort in 43 milliseconds.
Insertion sort in 315 milliseconds.
--------------------------------------------------------------------------------
Run amount: 280
Counting sort in 63 milliseconds.
Insertion sort in 305 milliseconds.
--------------------------------------------------------------------------------
Run amount: 300
Counting sort in 67 milliseconds.
Insertion sort in 323 milliseconds.
--------------------------------------------------------------------------------
Run amount: 320
Counting sort in 65 milliseconds.
Insertion sort in 371 milliseconds.
--------------------------------------------------------------------------------
Run amount: 340
Counting sort in 73 milliseconds.
Insertion sort in 377 milliseconds.
--------------------------------------------------------------------------------
Run amount: 360
Counting sort in 89 milliseconds.
Insertion sort in 321 milliseconds.
--------------------------------------------------------------------------------
Run amount: 380
Counting sort in 89 milliseconds.
Insertion sort in 327 milliseconds.
--------------------------------------------------------------------------------
Run amount: 400
Counting sort in 73 milliseconds.
Insertion sort in 343 milliseconds.
--------------------------------------------------------------------------------
Run amount: 420
Counting sort in 73 milliseconds.
Insertion sort in 317 milliseconds.
--------------------------------------------------------------------------------
Run amount: 440
Counting sort in 119 milliseconds.
Insertion sort in 344 milliseconds.
--------------------------------------------------------------------------------
Run amount: 460
Counting sort in 111 milliseconds.
Insertion sort in 306 milliseconds.
--------------------------------------------------------------------------------
Run amount: 480
Counting sort in 87 milliseconds.
Insertion sort in 325 milliseconds.
--------------------------------------------------------------------------------
Run amount: 500
Counting sort in 92 milliseconds.
Insertion sort in 329 milliseconds.
--------------------------------------------------------------------------------
Run amount: 520
Counting sort in 129 milliseconds.
Insertion sort in 342 milliseconds.
--------------------------------------------------------------------------------
Run amount: 540
Counting sort in 125 milliseconds.
Insertion sort in 318 milliseconds.
--------------------------------------------------------------------------------
Run amount: 560
Counting sort in 109 milliseconds.
Insertion sort in 345 milliseconds.
--------------------------------------------------------------------------------
Run amount: 580
Counting sort in 81 milliseconds.
Insertion sort in 361 milliseconds.
--------------------------------------------------------------------------------
Run amount: 600
Counting sort in 172 milliseconds.
Insertion sort in 339 milliseconds.
--------------------------------------------------------------------------------
Total insertion sort time: 19945 milliseconds.
Total counting sort time:  2351 milliseconds.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s