Keeping vectors efficient

If you are a programmer, you probably know such data structures as java.util.ArrayList of Java and/or std::vector of C++. Both are simple arrays under the hood, and whenever there is no room for appending a new element, the implementation creates a larger array, copies the contents of the current array to it, deletes the current array, sets the newly created array as current, and, finally, appends the element normally. Basically, as long as there is room for appending, the running time of such an operation is constant. Obviously, if there is no room, the running time is $\Theta(n)$. This is by no means informative, and we would like to understand the efficiency of such a data structure better. Now, let amortized analysis come to rescue. The idea behind amortized analysis is that you sum the costs of operations appearing in a sequence, and divide the sum by the number of operations in the sequence. Formally, if the cost sequence of an operation sequence is $\langle c_1, c_2, \dots, c_n \rangle$, the amortized running time is simply the average of them:

$\displaystyle \frac{1}{n} \sum_{i = 1}^n c_i$.

Now, we have two schemes for expanding the underlying array:

1. arithmetic: choose a positive integer $d$ and every time you have to expand the array, make it $d$ array components longer;
2. geometric: choose a real number $q > 1$, and every time you have to expand the array of length $N$, make it $\lfloor qN \rfloor$ array components long. (If $N_0$ is the initial array length, we require $\lfloor qN_0 \rfloor > N_0$.)

Arithmetic scheme

Suppose the initial array capacity is $m$ and $d > 0$ is given. Also we are given a sequence of appending operations of length $n > m > 0$. Now the total work is

\displaystyle \begin{aligned} C(n, m, d) &= \overbrace{\sum_{k = 0}^{\lceil \frac{n - m}{d} \rceil}(kd + m)}^A - \overbrace{\Bigg( m + \Bigg\lceil \frac{n - m}{d} \Bigg\rceil d - n \Bigg)}^R \\ &= \sum_{k=0}^{\lceil \frac{n-m}{d} \rceil}(kd + m) + n - m - \Bigg\lceil \frac{n - m}{d} \Bigg\rceil d \\ &= d \sum_{k = 1}^{\lceil \frac{n - m}{d} \rceil} k + m \Bigg( \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + 1 \Bigg) + n - m - d \Bigg\lceil \frac{n - m}{d} \Bigg\rceil \\ &= \frac{d}{2} \Bigg[ \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + 1\Bigg] \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + m \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + n - d \Bigg\lceil \frac{n - m}{d} \Bigg\rceil \\ &= \frac{d}{2} \Bigg\lceil \frac{n - m}{d} \Bigg\rceil^2 + \Bigg(m - \frac{d}{2}\Bigg) \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + n. \end{aligned}

Above, $A$ is the total work of expanding and filling the array sufficiently many times in order to accommodate $n$ elements, and $R$ denotes the number of elements we could have put before expanding once again.

Now, let us show that the amortized cost is linear in $n$. We start from the lower bound:

\displaystyle \begin{aligned} \frac{1}{n} C(n, m, d) &= \frac{1}{n} \Bigg[ \frac{d}{2} \Bigg\lceil \frac{n - m}{d} \Bigg\rceil^2 + \Bigg( m - \frac{d}{2} \Bigg) \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + n \Bigg] \\ & \geq \frac{d}{2n} \Bigg( \frac{n - m}{d} \Bigg)^2 + \frac{1}{n} \Bigg( m - \frac{d}{2} \Bigg) \Bigg( \frac{n - m}{d} \Bigg) + 1 \\ &= \frac{d}{2n} \Bigg( \frac{n^2 - 2nm + m^2}{d^2} \Bigg) + \frac{1}{nd} \Bigg( m - \frac{d}{2} \Bigg) \Bigg(n - m\Bigg) + 1 \\ &= \frac{1}{2nd} \Bigg(n^2 - 2nm + m^2\Bigg) + \frac{1}{nd}\Bigg( nm - m^2 - \frac{nd}{2} + \frac{md}{2} \Bigg) + 1 \\ &= \frac{1}{2nd} \Bigg( n^2 - 2nm + m^2 \Bigg) + \frac{1}{2nd} \Bigg( 2nm - 2m^2 - nd + md \Bigg) + 1 \\ &= \frac{1}{2nd} \Bigg(n^2 - m^2 - nd + md + 2nd\Bigg) \\ &= \frac{1}{2nd} \Bigg( n^2 - m^2 + nd + md \Bigg) \\ &\geq \frac{n^2 - m^2}{2nd} \\ &= \frac{n}{2d} - \frac{m^2}{2nd} \\ &\overset{n > m}{>} \frac{n}{2d} - \frac{mn}{2nd}\\ &= \frac{n}{2d} - \frac{m}{2d} > 0. \end{aligned}

Next, we need to show that there exist $c_1 > 0$ and $n_0^L \in \mathbb{N} \setminus \{0\}$ such that $c_1 n \leq \frac{n}{2d} - \frac{m}{2d}$ for all $n \geq n_0^L$. Let us choose $c_1 = \frac{1}{3d}$. Now we have

\displaystyle \begin{aligned} \frac{1}{3d}n &\leq \frac{1}{2d}n - \frac{1}{2d}m \\ \frac{1}{3}n &\leq \frac{1}{2}n - \frac{1}{2}m \\ \frac{1}{6}n &\geq \frac{1}{2} m \\ n &\geq 3m = n_0^L. \end{aligned}

Since $m$ is a constant, so is $n_0^L$. Next, let us derive an upper bound:

\displaystyle \begin{aligned} \frac{1}{n}C(n, m, d) &= \frac{1}{n} \Bigg[ \frac{d}{2} \Bigg\lceil \frac{n - m}{d} \Bigg\rceil^2 + \Bigg(m - \frac{d}{2}\Bigg) \Bigg\lceil \frac{n - m}{d} \Bigg\rceil + n\Bigg] \\ &\leq \frac{1}{n} \Bigg[ \frac{d}{2} \Bigg\lceil \frac{n}{d} \Bigg\rceil^2 + m \Bigg\lceil \frac{n}{d} \Bigg\rceil + n \Bigg] \\ &\leq \frac{1}{n} \Bigg[ \frac{d}{2} \Bigg( \frac{n}{d} + 1 \Bigg)^2 + m \Bigg( \frac{n}{d} + 1 \Bigg) + n \Bigg] \\ &= \frac{1}{n} \Bigg[ \frac{d}{2} \Bigg( \frac{n^2}{d^2} + \frac{2n}{d} + 1 \Bigg) + \frac{nm}{d} + m + n \Bigg] \\ &= \frac{1}{n} \Bigg[ \frac{n^2}{2d} + n + \frac{d}{2} + \frac{2nm}{2d} + m + n \Bigg] \\ &= \frac{n}{2d} + 1 + \frac{d}{2n} + \frac{m}{d} + \frac{m}{n} + 1 \\ &= \frac{n}{2d} + 2 + \overbrace{\frac{d}{2n}}^{< 1} + \frac{m}{d} + \overbrace{\frac{m}{n}}^{<1} \\ &< \frac{n}{2d} + 4 + \frac{m}{d} \\ &< \frac{n}{2d} + 4 + \frac{n}{d} \\ &= \frac{n}{2d} + \frac{2n}{2d} + 4 \\ &= \frac{3n}{2d} + 4. \end{aligned}

At this point we choose $c_2 = 2$, which leads to

\displaystyle \begin{aligned} \frac{3n}{2d} + 4 &\leq 2n \\ 3n + 8d &\leq 4dn \\ 4dn - 3n &\geq 8d \\ (4d-3)n &\geq 8d \\ n &\geq \frac{8d}{4d - 3} = n_0^R. \end{aligned}

If we choose $n_0 = \max(n_0^L, n_0^R)$, we have that $c_1 n \leq \frac{1}{n}C(n,m,d) \leq c_2 n$ for all $n \geq n_0$, which is the very definition of $\Theta$, and so, we have that $\frac{1}{n}C(n,m,d) = \Theta(n)$.

Next, we will show that appending with geometric strategy runs in $\Theta(1)$ amortized time. If we manage to show that the upper bound is constant, we can omit proving the lower bound. Suppose we start with the empty array of capacity $m > 0$. Also, we are given the expansion factor $q > 1$. As a slight technicality, we must choose $q$ such that $\lfloor qm \rfloor > m$; otherwise, the very first expansion will have no effect, and, thus, by induction, none of them. Assuming the same unit costs for copying and appending elements like in the arithmetic strategy, the total work of appending $n > m$ elements is

$\displaystyle W(n, m, q) = m + mq + mq^2 + \cdots + mq^k.$

We require $k$ to be the smallest integer such that $mq^k \geq n$, which leads us to the following inequalities:

\displaystyle \begin{aligned} mq^k &\geq n \\ q^k &\geq \frac{n}{m} \\ \log_q q^k &\geq \log_q \Bigg( \frac{n}{m} \Bigg) \\ k &\geq \log_q n - \log_q m. \end{aligned}

Since $k$ is required to be the smallest integer satisfying the above inequality, we can set $k = \lceil \log_q n - \log_q m \rceil$. Also,

\displaystyle \begin{aligned} qW(n, m, q) = mq + mq^2 + \cdots + mq^{k + 1} &\Longrightarrow W(n, m, q) - q W(n, m, q) = m(1 - q^{k + 1}) \\ &\Longrightarrow W(n, m, q) = m \frac{1 - q^{k + 1}}{1 - q}. \end{aligned}

Since $k = \lceil \log_q n - \log_q m \rceil$, we have

\displaystyle \begin{aligned} W(n, m, q) &= m \frac{1 - q^{\lceil \log_q n - \log_q m \rceil + 1}}{1 - q} \\ &\leq m \frac{1 - q \cdot q^{\lceil \log_q n \rceil}}{1 - q} \\ &\leq m \frac{1 - q \cdot q^{\log_q n + 1}}{1 - q} \\ &= m \frac{1 - q^2n}{1 - q}. \end{aligned}

Now we have that

\displaystyle \begin{aligned} \frac{1}{n} W(n,m,q) &\leq \frac{1}{n} \Bigg[ m \frac{1 - q^2 n}{1 - q} \Bigg] \\ &= \frac{1}{n} \Bigg[ \frac{m}{1 - q} - \frac{nmq^2}{1 - q} \Bigg] \\ &= \frac{m}{(1 - q)n} - \frac{mq^2}{1 - q} \\ &\leq \frac{m}{1 - q} - \frac{mq^2}{1 - q} \\ &= \frac{m(1 - q^2)}{1 - q}, \end{aligned}

which is constant since $m$ and $q$ are constants. Hence, appending an element to an array-based data structure using geometric array expansion strategy runs in amortized constant time.