Genetic Algorithm is not the best approach here:
- GA is a stochastic methods and therefore will never guarantee the best possible solution.
- Your solution (GA individual) is modeled as a simple integer. GA is optimal to find a solution modeled as an array (like our genomes), so it can do mutation and crossover.
Good Approach:
The set |S|=N can be around 10000, and each element also has a value in tens of thousands.
It certainly looks a lot, but for a computer, that's not so much. We can start with a brute-force and incrementally simplify it for better performance:
Brute Force:
You can try every single integer from D>T until D<max(S) and store the best candidate.
# Fully functional python solution:
import random
import time
# Let's start initializing a random T
T = random.randint(1, 100)
# And a S, with 10,000 integers fom T to 10,000
S = [random.randint(T, 10000) for i in range(10000)]
def LowestRemainder(List, Threshold):
# Storing the all time lowest D and remainder
Best_D_SoFar = Threshold+1
lowestRemainderSoFar = sum(List)
for d in range(Threshold+1, max(List)+1):
remainder = sum(c%d for c in List)
if remainder < lowestRemainderSoFar:
lowestRemainderSoFar = remainder
Best_D_SoFar = d
return Best_D_SoFar, lowestRemainderSoFar
print("Looking for D > %d, that minimizes the sum of all remainder in a list containing %d integers" % (T, len(S)))
# Keep track of time
start_time = time.time()
# call the function
D, R = LowestRemainder(S, T)
# stop the clock
elapsed_time = time.time() - start_time
print("Found D =", D, "and lowest remainder =", R, "in", elapsed_time, "seconds")
I've tested a few times on my machine and it always runs in less than 4 seconds (for the maximum
But that's the brute-force baseline. So let's make it more efficient, by early interrupting the inner loop:
# Remove this:
# remainder = sum(c%d for c in List)
# Add this instead:
remainder = 0
for c in List:
remainder += c%d
if remainder > lowestRemainderSoFar:
break
And done!
It now takes less than 0.2 seconds
(for the whole 10,000 sized array with random integers up to 10,000:)
Extra thoughts:
As the best solution is probably a low number (as discussed later), if it takes too long to finish, you can simply interrupt and you'll still have a good solution (not guaranteed to be optimal) just like with GA (and probably better, as you'll quickly exploit the best candidates, instead of exploring the space).
Factorization is our friend:
Suppose a solution D, with a score S.
Now factorize the solution, finding the list of primes that composes D: [p1,p2,…].
Now remove some p, generating a D' and you will notice that the c mod(D) >= c mod(D')
for any c, any D, and any p.
That means, you don't need to explore any multiple of D, if D was already evaluated.
That will drastically drop the amount of trials!
No threshold scenario:
Using this factorization principle, in the simpler scenario where T=1 (if T<1, than D=1 is the best solution), we guarantee that the solution is a prime number! Also, you don't need to search until the biggest number max(S)
, as the sqrt(max(S))
will be enough.
With threshold scenario
It makes it a little harder to program, but still you just need to loop through co-primes.
Upper limit:
Once you hit the first candidate (d) larger than the smallest S (d>min(S)
), than this remainder will be min(S).
If the best solution so far is already better than min(S), we can skip all the next candidates.
This logic is cumulative. Once the candidate d is larger than the N lowest S, the sum of remainders will be at least the sum of all N lowest S. So, our loop could be something like:
if(sum([s<d for s in S]) > lowest):
return d
else:
d = next_coprime()
In a list with thousand of elements, there's a high chance that the smallest element is found early. And even if it does not satisfy the finish condition, it will only accumulate more elements as d grows.
In summary:
Start with a baseline and restrict the boundaries until your code runs in the desirable time.
Example:
Inputs:
Process: Testing all d>T, skip the multiples and stop on upper limit:
- d = 4 --> remainders = [2,3,0] = 5
- d = 5 --> remainders = [0,4,3] = 7
- d = 6 --> remainders = [4,1,4] = 9
- d = 7 --> remainders = [3,5,0] = 8
- d = 8 --> is multiple of 4 (skip)
- d = 9 --> remainders = [1,1,1] = 3
- d = 10--> is multiple of 5 (skip)
- d = 11--> 11>min(S) --> remainders = [10,x,x] > 3. DONE!