
I have a set of fixed integers $S = \{c_1, \dots, c_N \}$. I want to find a single integer $D$, greater than a certain threshold $T$, i.e. $D > T \geq 0$, such that dividing each $c_i$ by $D$ leaves a remainder $r_i = c_i \text{ mod } D \geq 0$, and the sum of these remainders is minimized.

In other words, this is my problem

\begin{equation} \begin{aligned} D^* \quad = \text{argmin}_D& \sum_i c_i \text{ mod } D \\ \textrm{subject to} &\quad D > T \end{aligned} \end{equation}

If the integers have a common divisor greater than $T$, this problem is easy. If the integers are coprime, however, it is not clear how to solve it.
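(To illustrate the easy case: if the elements share a common divisor $g > T$, then $D = g$ gives remainder sum $0$. A quick check, with illustrative numbers of my own:)

```python
from functools import reduce
from math import gcd

# Illustrative numbers (not from the question): every element is a
# multiple of 12, so D = 12 leaves remainder 0 everywhere.
S = [12, 36, 60]
T = 5

g = reduce(gcd, S)
if g > T:
    print(g, sum(c % g for c in S))  # 12 0
```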

The set size $|S| = N$ can be around $10000$, and each element can be in the tens of thousands.

I was thinking about solving it with a genetic algorithm (GA), but that is rather slow. I want to know whether there is any other way to solve this problem.

nbro
    I would actually think that this is one of the problems that GAs are fast at solving; being a purely mathematical problem, with mono-dimensional solutions, that can be bounded in a finite space. – Alvin Sartor Dec 22 '19 at 09:03
  • Hello. This is an old post, but how did you solve this problem in the end? You may want to provide a formal answer below, if you found a good solution to this problem. – nbro Jan 23 '21 at 17:38
  • I am almost certain the answer here is just T (being the lower bound on N). The reason for this is factors are prime, and therefore mostly random. As N becomes larger, the possibility for a larger remainder also increases, but the possibility of N being a divisor (or even close divisor) remains the same. This is all speculation, but I'm only saying this based on some simulations I did where the sum of each N was strictly increasing with N. This excludes edge cases with very small sample sizes (like [10,20,30] and T=9), but once a sample size is large enough, it seems $D^*=T$ – Recessive Jul 07 '21 at 00:31
  • Disregard "strictly increasing". Some values are less than others even in large sample sizes, but only slightly. It's more like "mostly increasing, with outliers that are a few percent smaller than the previous value" – Recessive Jul 09 '21 at 06:24

1 Answer


A genetic algorithm (GA) is not the best approach here:

  1. A GA is a stochastic method and therefore never guarantees finding the best possible solution.
  2. Your solution (the GA individual) is a single integer. GAs are best suited to solutions modeled as arrays (like our genomes), so that mutation and crossover are meaningful.

Good Approach:

The set |S|=N can be around 10000, and each element also has a value in tens of thousands.

It certainly looks like a lot, but for a computer that's not so much. We can start with brute force and incrementally refine it for better performance:

Brute Force:

You can try every single integer D from T+1 up to max(S) and keep track of the best candidate.

# Fully functional python solution:
import random
import time

# Let's start by initializing a random threshold T
T = random.randint(1, 100)
# and S, a list of 10,000 random integers from T to 10,000
S = [random.randint(T, 10000) for i in range(10000)]


def LowestRemainder(List, Threshold):
    # Store the best D and the lowest remainder sum found so far
    Best_D_SoFar = Threshold+1
    lowestRemainderSoFar = sum(List)
    for d in range(Threshold+1, max(List)+1):
        remainder = sum(c%d for c in List)
        if remainder < lowestRemainderSoFar:
            lowestRemainderSoFar = remainder
            Best_D_SoFar = d
    return Best_D_SoFar, lowestRemainderSoFar

print("Looking for D > %d that minimizes the sum of remainders in a list containing %d integers" % (T, len(S)))
# Keep track of time
start_time = time.time()
# call the function
D, R = LowestRemainder(S, T)
# stop the clock
elapsed_time = time.time() - start_time
print("Found D =", D, "and lowest remainder =", R, "in", elapsed_time, "seconds")

I've tested it a few times on my machine and it always runs in less than 4 seconds (for the full 10,000-element list with random integers up to 10,000).

But that's just the brute-force baseline. Let's make it more efficient by interrupting the inner loop early:

    # Remove this:
    # remainder = sum(c%d for c in List)

    # Add this instead:
    remainder = 0
    for c in List:
        remainder += c%d
        if remainder > lowestRemainderSoFar:
            break

And done!

It now takes less than 0.2 seconds (for the whole 10,000-element array with random integers up to 10,000).
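Putting the early exit into one self-contained function, a sketch (the names here are mine, not from the snippets above; the `for`/`else` preserves the original update logic):

```python
def lowest_remainder(values, threshold):
    # Brute force with early exit: stop summing remainders for a
    # candidate d as soon as the partial sum reaches the best so far.
    best_d = threshold + 1
    best_sum = sum(values)  # any d > max(values) yields exactly this sum
    for d in range(threshold + 1, max(values) + 1):
        total = 0
        for c in values:
            total += c % d
            if total >= best_sum:
                break  # d cannot beat the best candidate anymore
        else:
            # inner loop finished without breaking, so total < best_sum
            best_sum, best_d = total, d
    return best_d, best_sum
```

For the worked example at the end of this answer, `lowest_remainder([10, 19, 28], 3)` returns `(9, 3)`.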


Extra thoughts:

As the best solution is probably a low number (as discussed below), if the search takes too long you can simply interrupt it and still have a good solution (not guaranteed to be optimal), just like with a GA (and probably a better one, since the scan exploits the most promising small candidates first instead of exploring the whole space).

Factorization is our friend:

Suppose a candidate solution D with remainder sum R.

Now factorize it, finding the list of primes that compose D: [p1,p2,…].

Now divide out one prime p, producing D' = D/p. Since D' divides D, we have c mod D >= c mod D' for any c, any D, and any p.

That means you never need to explore any multiple of a D that was already evaluated: the multiple's remainder sum is at least as large.

That drastically reduces the number of candidates to try!
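A sketch of that skip rule (the function and variable names are mine; a production version would track divisors more cleverly than rescanning the whole evaluated list):

```python
def lowest_remainder_skip_multiples(values, threshold):
    # Skip any candidate d that is a multiple of an already-evaluated
    # candidate e: e divides d, so c % e <= c % d for every c, and
    # therefore d can never beat e.
    evaluated = []
    best_d, best_sum = None, float("inf")
    for d in range(threshold + 1, max(values) + 1):
        if any(d % e == 0 for e in evaluated):
            continue  # a divisor of d was already tested and is at least as good
        evaluated.append(d)
        total = sum(c % d for c in values)
        if total < best_sum:
            best_d, best_sum = d, total
    return best_d, best_sum
```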

No threshold scenario:

Using this factorization principle, in the simpler scenario where T = 1 (if T < 1, then D = 1 is trivially optimal, with remainder sum 0), the solution is guaranteed to be a prime number: every composite candidate has a prime divisor whose remainder sum is at least as small. So you only need to test the primes up to max(S).
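A sketch of the prime-only search (helper names are mine; the sieve is a plain Sieve of Eratosthenes):

```python
def primes_up_to(n):
    # Simple Sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def lowest_remainder_no_threshold(values):
    # With T = 1, only primes need checking: any composite d has a
    # prime divisor p with c % p <= c % d for every c.
    best_sum, best_d = min(
        (sum(c % p for c in values), p) for p in primes_up_to(max(values))
    )
    return best_d, best_sum
```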

With threshold scenario

It makes the bookkeeping a little harder, but the same principle applies: you only need to test candidates that are not multiples of an already-evaluated candidate, since such a divisor is itself greater than T and at least as good.

Upper limit:

Once you hit the first candidate d larger than the smallest element of S (d > min(S)), then that element's remainder is min(S) itself.

If the best remainder sum so far is already smaller than min(S), we can skip all remaining candidates.

This logic is cumulative: once the candidate d is larger than the n smallest elements of S, the sum of remainders is at least the sum of those n elements. So the loop could be something like:

if sum(s for s in S if s < d) > lowestRemainderSoFar:
    return Best_D_SoFar
else:
    d = next_coprime()

In a list with thousands of elements, there's a high chance that the smallest element appears early in the scan. And even when the stop condition is not yet satisfied, the bound only accumulates more elements as d grows.
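A runnable version of that stopping test, summing (not merely counting) the elements already smaller than d (the function name is mine):

```python
def can_stop(values, d, best_sum_so_far):
    # Every element smaller than d contributes itself as a remainder,
    # for this candidate and for every larger one, so their sum is a
    # lower bound on all remaining candidates' remainder sums.
    return sum(s for s in values if s < d) > best_sum_so_far
```

In the worked example below, `can_stop([10, 19, 28], 11, 3)` is `True`, which is exactly why the search stops at d = 11.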

In summary:

Start with the baseline and tighten the search bounds until your code runs in the desired time.

Example:

Inputs:

  • S = [10, 19, 28]
  • T = 3

Process: test every d > T, skipping multiples and stopping at the upper limit:

  • d = 4 --> remainders = [2,3,0], sum = 5
  • d = 5 --> remainders = [0,4,3], sum = 7
  • d = 6 --> remainders = [4,1,4], sum = 9
  • d = 7 --> remainders = [3,5,0], sum = 8
  • d = 8 --> multiple of 4 (skip)
  • d = 9 --> remainders = [1,1,1], sum = 3
  • d = 10 --> multiple of 5 (skip)
  • d = 11 --> 11 > min(S) --> remainders = [10,x,x] > 3. Done! So D* = 9, with remainder sum 3.
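This walk-through can be reproduced with a short sketch that combines the multiple-skipping rule and the upper-limit stop (the function name `solve` is my own):

```python
def solve(values, threshold):
    # Combines the multiple-skipping rule and the upper-limit stop.
    evaluated = []
    best_d, best_sum = None, float("inf")
    for d in range(threshold + 1, max(values) + 1):
        if sum(s for s in values if s < d) > best_sum:
            break  # upper limit: no larger d can win
        if any(d % e == 0 for e in evaluated):
            continue  # multiple of an evaluated candidate: never better
        evaluated.append(d)
        total = sum(c % d for c in values)
        if total < best_sum:
            best_d, best_sum = d, total
    return best_d, best_sum

print(solve([10, 19, 28], 3))  # (9, 3)
```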
Andre Goulart