Sunday, November 13, 2016

Basic Algorithms: The Duplicate Number

As part of learning to code, it is important to learn how to think about certain problems, specifically simple algorithm problems.

The Problem

You have an array of numbers from 1 to n-1, one of the numbers is duplicated, meaning the array has a length of n. For example

[1,2,3,3,4,5,6,7], where the duplicates is 3.

The array is not necessarily sorted. Find and return the duplicated number.

Another sample array might be

[4,2,3,2,1,5,6]. with 2 being a duplicate

The Naive Solution

A naive solution, that is, one not optimized for memory or computational savings, would likely do something along the lines of,

1. Create a comparison array, which is a duplicate of the first array
2. Delete the first number from the comparison array.
3. Compare the first number of the main array to each value in the comparison array
4. If it matches any numbers in the comparison array, it is our duplicate, return it
5. If it doesn't, delete that number from the first array.
6. Repeat for each number until you have the duplicate
This is obviously not terribly efficient, with a space complexity of O(n), as it requires a new array to be created (two if you want to back up the main array), and a time complexity of O(n^2) as you iterate over a number of arrays. It is not a good solution, but it would work.

The code would look something like this

The Optimized Solution

This can actually be solved with a time complexity of O(n) and a space complexity of O(1). Given that the array has a known length, n, and that the digits only go up to n-1, we can figure out what we would expect the sum of an array of no duplicates to look like, and then subtract the sum of the original array from the theoretical array. This requires iterating over the first array only once to sum it.

In testing, using an array of 1,000,000, digits the optimal solution took about 0.1 second to run, (on an average of 10 trials) while the naive version took 45 seconds.  The 10,000,000 trial took one second in the optimized version, and 414(!) seconds (almost 7 minutes) in the naive version. In the 100,000,000 number trial, the optimized version took roughly 10 seconds. I did not run the naive version as the projected time to run would be 414 seconds squared, which is roughly two days! This should highlight the importance of efficient search algorithms in managing large data sets.

The important takeaway here is that, by thinking of the constraints of the problem, it is possible to arrive at a better solution. With algorithm questions, the elegant solutions most often rely on finding the way the limitations of the data set can be leveraged.