Integration-Centric: Finding Missing Value

Originally publised on May 25, 2011

Often times it is required to find a missing number in large set, say of 32-bit integers (n), in random order. For example, integer numbers used as keys for some records and the missing key(s) could be reused for new allocations or to find missing data/holes in the data-set.

So, what is optimal way to find the missing value. Well, it depends. We have to further define the constraints. What is the size of data-set, how much virtual memory do we have - can the numbers be loaded into memory? What if we are constrained by the amount of memory but we have plenty of disk space. How about the case where we have some (limited) memory to partially fit in the input in memory. The optimal solution, as you might have guessed, depends on these constraints and this blog present various techniques that can be used to find the missing value for each of these cases. Mainly, following solutions for following three cases will be discussed. Also, we will note the running time for each.

Plenty of virtual memory
Very little virtual memory but plenty of disk space
Some memory

Plenty of Virtual Memory

Solution 1

We can define a bit vector of size equal to n + 1 all initialized to 0. Then by looping through the input numbers, for a given number set the corresponding position in the bit set to 1. Finally, scanning the bit set for a value that is not set to 1 will give us the missing number. This solution of course assume that plenty of virtual memory available to represent the input set in bit array. The running time of this algorithm is O(n), not counting the time it takes to build the bit set.

Solution 2

Sort the input array and then using binary search find the missing value.

Solution 3

Another possible solution for the case where only one number is missing in a range of numbers would be to take sum of numbers (can be obtained using arithmetic series - S = n/2(a1 + an)) and then subtract each number from the sum of the range. The result would be missing number. We of course need to be careful that the sum does not overflow the maximum allowed for the data-type.

Very Little Virtual Memory

When we are limited by virtual memory and cannot load the input array into memory, we cannot do either bit check or sort, hence these solutions will not work. Assuming we have plenty of disk space, a solution similar to binary search (without the sort step) can be used to find missing number(s).

We will successively divide the input into two files half the size of input and write each to the disk. The trick is to divide the input into two halves based on most significant bit (MSB). One file will have all numbers with MSB value of 1 and other file with numbers starting with 0 as MSB. Each successive iteration will reduce the input array into half and the MSB for dividing the file will also move one step to right (towards least significant bit or LSB).

Since we know the maximum capacity for a given number of bits, the max elements for each half is known. Finding out the size of each file and knowing the total capacity, we can pick the half which has number of items less than the capacity and discard the other file. Note that it is possible that the other half also have missing numbers (the size less than the capacity), but since we are interested in finding a missing value (not all missing values), it does not matter which file (if the size of both is less than the capacity) we choose. Repeat the process over the selected file (now approx half the size) now considering the next MSB. Continue doing this till we have reached all but the last bit (LSB). At this point the input will have only one (or none) element remaining. Knowing the the possible values that can be represented for already found MSB, and the original item(s) remaining (for this range), we can easily find the missing number(s).

Also, note that in case there are range of missing values adjacent to each other, the solution may converge earlier and there is no need to traverse from MSB all the way to LSB.

Lets understand the above solution using simple example. Assume 4-bit integer (range 0 - 16) with missing value of 12 (1100). Here is the input set

input (integer) {5 7 15 8 1 9 4 0 3 10 13 6 14 2 11}
input (binary) {0101 0111 1111 1000 0001 1001 0100 0000 0011 1010 1101 0110 1110 0010 1011}

In the first pass, we serially read the input and divide it into two arrays based on most significant bit and write them to the disk

first file (bit 1xxx) {1111 1000 1001 1010 1101 1110 1011}
second file( bit 0xxx) {0101 0111 0001 0100 0000 0011 0110 0010}

Thus the two files sizes obtained are 7 (with MSB as 1) and 8 (MSB of 0). The max capacity of each being 8 (the total capacity ignoring the MSB).

Thus looking at the size of two files, it is clear that the first file contains the missing number. At this point we can discard the second file and repeat the above process for the first file using the second most significant bit to obtain two files:

first file (bit 11xx) {1111 1101 1110}
second file (bit 10xx) {1000 1001 1010 1011}

Above clearly suggests that the missing number must be in the range 1100 - 1111. Discarding the second file and repeating the same using the third most significant bit, we obtain following:

first file (bit 111x) {1111 1110}
second file (bit 110x) {1101}

This we conclude that the missing value is 1100 (the only remaining value for second file above).

In conclusion, the using the above mechanism we successively divided the input array into half. Thus for array of size n, we can find the missing value using log(n) passes, with each pass reducing the input size to n, n/2, n/4.....(geometric series sum = n/(1-1/2)= 2n) producing running time proportional to n.

Refer code at the end for implementation of above idea.

Some Memory

Where we do have some memory to work with, the solution could be hybrid of above. Check first few significant bit to reduce the problem size (input array) so that it is able to fit in the memory. Then you can employ any of the solution listed in Case 1 above. For example, for an input size of 2³²(approx 4 billion values), checking 10 most significant bit can reduce the input to the order of 1000 (approx 4 million values) which might fit into memory.

  

/**
 * Copyright (c) 2010-2020 Malkit S. Bhasin. All rights reserved.
 * 
 * All source code and material on this Blog site is the copyright of Malkit S.
 * Bhasin, 2010 and is protected under copyright laws of the United States. This
 * source code may not be hosted on any other site without my express, prior,
 * written permission. Application to host any of the material elsewhere can be
 * made by contacting me at mbhasin at gmail dot com
 * 
 * I have made every effort and taken great care in making sure that the source
 * code and other content included on my web site is technically accurate, but I
 * disclaim any and all responsibility for any loss, damage or destruction of
 * data or any other property which may arise from relying on it. I will in no
 * case be liable for any monetary damages arising from such loss, damage or
 * destruction.
 * 
 * I further grant you ("Licensee") a non-exclusive, royalty free, license to
 * use, modify and redistribute this software in source and binary code form,
 * provided that i) this copyright notice and license appear on all copies of
 * the software;
 * 
 * As with any code, ensure to test this code in a development environment
 * before attempting to run it in production.
 */

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * The following program finds missing a value in input array. It takes an input
 * array and the bit size of the maximum value (the number of bits needed to
 * represent the represent max value) and returns missing number, if any. The
 * algorithm fully loads the input in memory and require additional memory of
 * size equal to input, however can be modified to work in limited memory
 * environment (refer comments in the end).
 * 
 * Also, note the following: 1. In certain cases, the implementation might
 * return two contiguous missing values (as the case might be). 2. The utility
 * does not return all the missing values, but at least one value (assuming list
 * has missing values). 3. The bitSize parameter can be derived from input, but
 * that would require recursing the input to find the maximum value.
 * 
 * The implementation works by inspecting the Most Significant bit and work like
 * binary search in that the input arrays is successively reduced to two halves
 * during iteration and then selects the one that is guaranteed to have at least
 * one missing value.
 * 
 * The running time (asymptotic) of this algorithm is O(log(n)) which is much
 * better than sort followed by binary search (total of O(nlog(n) + log(n)).
 * 
 * Another value of this algorithm is that for the case where the virtual memory
 * is limited and the input array cannot be fit into memory, modification of
 * this algorithm can be used to read the input stream and write the reduced
 * files back to file system with each step reducing the input size by half.
 * 
 * @author Malkit S. Bhasin
 */
public class MissingNumber {

 /**
  * Returns a missing value (if any) in the input array. If none is found,
  * empty set is returned.
  * 
  * @param input
  *            array of integers
  * @param bitSize
  *            the size of maximum value in input array
  * @return
  */
 static Set findMissing(int[] input, int bitSize) {
  if (bitSize < 0) {
   throw new IllegalArgumentException(
     "bitSize value cannot be nagative");
  }

  // convert the input array into list
  List list = asList(input);

  StringBuilder result = new StringBuilder(bitSize);

  // this represents the maximum capacity of half array
  // during successive iteration
  int maxCapacity = -1;

  // binary representation of int value
  String binary = null;

  // traversing from most significant bit (msb) to least significant bit
  // (lsb)
  for (int i = 0; i < bitSize; i++) {
   // temp arrays to store values beginning with 1 bit and 0 bit
   // (for current bit index -> i)
   List ones = new ArrayList();
   List zeros = new ArrayList();

   // Calculate max number of values that can fit in 'ones' and 'zeros'
   // array
   maxCapacity = (int) Math.pow(2, bitSize - (i + 1));

   for (int j = 0; j < list.size(); j++) {
    binary = toBinaryString(list.get(j), bitSize);
    if (binary.charAt(i) == '1') {
     ones.add(list.get(j));
    } else {
     zeros.add(list.get(j));
    }
   }

   if (ones.size() < maxCapacity) {
    // this indicate that there is definitely at least one missing
    // value
    // in this array (ones array). Note that this does not mean
    // that there is no missing value in zeros array. Since we just
    // to find a missing value, we dont have to check other array..
    // As we progress through the msb to lsb, we collect the bits
    // in result which will be used later to find the missing value
    result.append("1");
    list = ones;
   } else if (zeros.size() < maxCapacity) {
    // indicate that the missing value in zeros array
    result.append("0");
    list = zeros;
   } else {
    // no missing element.. return empty set
    return new HashSet();
   }

   // this means that successive reduction of input array
   // has resulted in array of size 1 or even 0 (the case
   // where both the elements are missing - see test3 below..)
   if (list.size() <= 1) {
    return getMissingInt(list, result.toString(), bitSize);
   }
  }
  return new HashSet();
 }

 /**
  * Here the passed list is the values for a narrow range which has other
  * value(s) missing. The logic works by find all the values that will fit in
  * this range then striking off the values that are already present in this
  * range (as identified by list).
  * 
  * @param list
  * @param result
  * @param bitSize
  * @return
  */
 private static Set getMissingInt(List list,
   String result, int bitSize) {
  Set resultSet = getValuesForRange(result, bitSize);
  for (Integer i : list) {
   resultSet.remove(i);
  }
  return resultSet;
 }

 /**
  * Returns all possible values for a given partially filled most significant
  * bits. The method is passed binary string containing most significant bits
  * for total bitSize. For example, if the result value is "110" and bitSize
  * is 4, this means that the possible values that can fit in are "1100" and
  * "1101" and these will be returned as result.
  * 
  * @param result
  * @param bitSize
  * @return
  */
 private static Set getValuesForRange(String result, int bitSize) {
  String start = padRight(result, bitSize).replace(' ', '0');
  String end = padRight(result, bitSize).replace(' ', '1');
  int startInt = Integer.valueOf(start, 2);
  int endInt = Integer.valueOf(end, 2);
  Set resultSet = new HashSet();
  for (int i = startInt; i <= endInt; i++) {
   resultSet.add(i);
  }
  return resultSet;
 }

 private static List asList(int[] input) {
  List list = new ArrayList();
  for (int index = 0; index < input.length; index++) {
   list.add(input[index]);
  }
  return list;
 }

 private static String padRight(String s, int n) {
  return String.format("%-" + n + "s", s);
 }

 /**
  * Converts the input integer to equivalent binary string representation.
  * Also, the returned value is left padded with 0's.
  * 
  * @param value
  * @param bitSize
  * @return
  */
 private static String toBinaryString(int value, int bitSize) {
  String binary = Integer.toBinaryString(value);
  binary = (String.format("%" + bitSize + "s", binary)).replace(' ', '0');
  return binary;
 }

 private static void printResult(String testName, Set result) {
  System.out.printf("%6s: %-10s %n", testName, result);
 }

 /**
  * @param args
  */
 public static void main(String[] args) {
  // the following test has one element (2) missing, 2 bits are sufficient
  // to
  // represent largest value (3)
  int[] input1 = { 0, 1, 3 };
  printResult("test1", findMissing(input1, 2));

  // the following test has one element - 12 missing
  int[] input2 = { 5, 7, 15, 8, 1, 9, 4, 0, 3, 10, 13, 6, 14, 2, 11 };
  printResult("test2", findMissing(input2, 4));

  // the list has two elements - 12, 13 missing
  int[] input3 = { 5, 7, 15, 8, 1, 9, 4, 0, 3, 10, 6, 14, 2, 11 };
  printResult("test3", findMissing(input3, 4));

  // the list has maximum value of 31, which means it needs at least 5
  // bits. The missing value is 15
  int[] input4 = { 5, 7, 8, 1, 9, 4, 0, 3, 10, 13, 6, 14, 2, 11, 12, 16,
    17, 19, 18, 21, 20, 23, 22, 25, 24, 27, 26, 29, 28, 31, 30 };
  printResult("test4", findMissing(input4, 5));

  // here all the values from 0 - 28 are present, but the input array
  // needs 5 bits (because
  // maximum value is 28). The utility returns missing value(s) that can
  // fit in this
  // bit range (29, 30, 31). This is by design. If strictly values between
  // the
  // lower and upper values of input are needed, the implementation can
  // scan through
  // the input to find the min and max and return missing values between
  // this range.
  int[] input5 = { 5, 7, 8, 1, 9, 4, 0, 3, 10, 13, 6, 14, 2, 11, 12, 16,
    17, 19, 18, 21, 20, 23, 22, 25, 24, 27, 26, 28 };
  printResult("test5", findMissing(input5, 5));
 }
}

Integration-Centric

Wednesday, July 18, 2012

Finding Missing Value