# Santa’s stolen sleigh – Kaggle’s optimization competition

**This year’s annual Kaggle’s optimization competition, sponsored once again by FICO, was “Santa’s Stolen Sleigh” which featured a variant of a vehicle routing problem. Our team master.exploder@deepsense.io claimed 3rd place overall, as well as the Xpress Prize, awarded for best use of the FICO Xpress Optimization Suite.**

The goal of the challenge was to organize Santa’s gift-delivery schedule in a way that minimizes the reindeer’s total workload.

All the gifts are stored in a warehouse at the North Pole, where Santa loads them onto his sleigh. He has to take into account that each gift weighs a certain amount, and the sleigh cannot carry more than 1,000 kg at a time. Santa then delivers the batch of gifts and comes back for the next one. Given that the total weight of all gifts was about 1.4*10^{6} kg, Santa has to make at least 1,400 trips to deliver all the gifts.

While it is very natural to ask for a set of trips with the shortest total distance traveled, Santa has a much more important goal in mind – to minimize the reindeer’s workload. This is defined as the integral of weight carried over the distance traveled. In simple terms, for every meter traveled with x kg of sleigh weight, the weariness increases by x [kg*m]. Note that sleigh itself weighs 10 kg, which is rather small compared to the 1,000 kg of maximum load, but does affect the problem quite significantly.

Kaggle’s Wendy Kan has published a very interesting and well-written post on how this challenge was designed.

## The data

The problem described above is NP-hard. This means that depending on the actual instance to be solved, very different approaches should be used. For small or highly structured sets of data, one might hope for a fast exact solution. On the other hand, for generic and large sets of data, the best one can hope for is a good, but suboptimal solution.

The challenge data consisted of 100,000 gifts with more or less random weights and destinations selected randomly all over the world. For data that large, it is likely not possible to find an optimal solution within reasonable time. However, given the problem’s similarity to the much studied Travelling Salesman Problem, one can hope for some very good heuristics.

## The basic approach

It is generally agreed, that the best methods for approaching an Kaggle’s optimization competition like this are based on local search. The idea is to start with an easy to obtain solution (e.g. a random or greedy one) and then keep performing improvements. The way this is usually implemented is as follows: For every solution (also called a *state*), we define a set of *close* solutions, also called its *neighbors*. We then either scan all neighbors looking for one that is better than the current state, or perhaps sample this set. The word *local* might suggest that the differences between neighboring solutions are restricted spatially, but that is often not the case. Instead, the word *local* points at the locality structure of the state space induced by the neighborhoods.

The most naive implementation of local search would be to exhaustively scan the whole neighborhood of the current solution at each step and then proceed with either the first or the best-improving move found. Rather aptly, this approach is usually called *hill climbing*, and it has a rather glaring weakness. Just as always going uphill does not necessarily lead to reaching the highest peak, hill climbing has a tendency to get stuck in so-called *local optima*. These are solutions that, while not optimal, are still better than all of their neighbors.

One could devote thousands of pages to describing the different approaches to handling the problem of local optima in local search algorithms (to name a few: simulated annealing, threshold acceptance, tabu search, etc.). An obvious solution would be to simply increase the size of the neighborhoods, i.e. look further away. While this sometimes works surprisingly well (e.g. state of the art TSP solvers based on Lin-Kernighan heuristic do exactly that), it does not improve things substantially in general. A more subtle idea is to sometimes move to a state that is worse than the current. The most popular approach based on this idea is *simulated annealing*.

In simulated annealing (SA), one maintains an additional parameter T, called the temperature. This parameter is initially set to a large value and gradually decreased. SA randomly chooses a neighbor of the current state. If it improves upon the current state, it is always accepted, otherwise SA proceeds in a probabilistic fashion. When T is large, the move is still quite likely to be accepted. On the other hand, when T is very small, SA essentially reverts back to hill climbing. The way this behaviour is achieved is by accepting a worsening move with probability exp((old_objective-new_objective)/T).

As an example, consider a move that increases the objective function by 1. For T=100, this move would be accepted with probability exp(-1/100), which is around 99/100. On the other hand, for T=1/100, this move would be accepted with probability exp(-100), i.e. almost never.

From our experience, simulated annealing is the most effective approach to handling problems like this one, and that is the approach we decided to go with.

## Choosing the right moves

Deciding on the shape of the neighborhoods, i.e. what kind of moves are allowed for a given state, is very often the key design decision in implementing SA. We kept making small adjustments to our list of moves until the very end of the competition, and here is what we arrived at:

- swap positions of two gifts in the same route,
- move a gift to an arbitrary position within the same route,
- move a gift to an arbitrary position in another route,
- swap positions of two gifts from two different routes,
- double shift: move one gift from route r
_{1}to route r_{2}and another gift from route r_{2}to route r_{1}, - swap suffixes of two routes,
- three-way suffix swap: given three routes r
_{1}, r_{2}, r_{3}split each of them into two parts r_{i}=pref_{i}suf_{i}, where pref_{i}is a prefix of r_{i}and suf_{i}is a suffix of r_{i}, then create new routes r’_{1}=pref_{1}suf_{2}, r’_{2}=pref_{2}suf_{3}, r’_{3}=pref_{3}suf_{1}.

It is worth noting that all of the moves on this list are non-greedy ones. We do not use moves like “Remove an item from one route and place it *optimally* in another route”. There are pros and cons to such moves. On one hand, they lead to much faster convergence towards good solutions, they also naturally prevent divergence in higher temperatures. On the other hand, by using such moves one risks getting stuck at local optima too quickly – despite using SA. This is always a delicate balance, and it is very hard to reason out in a rigorous fashion. In this particular case, more greedy moves might have actually been better. As it turned out, the top two teams used them, and it seems to be the biggest difference between our solution and theirs.

## Where to begin?

Usually, one initializes SA with a random solution. However, in our case, this does not seem very appealing. The state space is huge, and it has thousands of parameters. To get from a random solution to a good one would require a very large number of local moves. There is also another reason for not using a random initial solution. It is quite easy to notice that in a good solution, Santa’s trips will be more or less vertical. This property can be exploited algorithmically (details in a later section), but only if it holds for most trips.

For the reasons given above we decided to use a non-random initialization. We first split the gifts into two categories: ones on the South Pole, and the remaining ones. Then, we processed each category separately. The main reason for doing this is that “everything is close on the South Pole”, so the verticality property holds to a much smaller extent there.

For each of the two categories we formed vertical stripes of gifts of varying width (wider at the South Pole), use SA to find good solutions for each stripe, and then glue the solutions together using dynamic programming. The SA we use here is a very crude and quick one, which is fine since the number of gifts in each stripe is relatively small.

## How to make it fast?

One problem with performing local search on a massive state space, like the one we are dealing with, is that it requires a huge number of iterations to converge on a really good solution. There are several ways to achieve that goal, and we used all of them:

**Use an efficient implementation:**We used a very efficient implementation in C++. We observed that even at high temperature, almost all moves are rejected (this is in part because our moves are not greedy). We used this observation in our implementation: actually performing a move is rather slow, but evaluating it works lightning fast.**Use better hardware:**We had access to one of the deepsense.ai machine learning team’s multi-core Xeon machines. Making SA multi-threaded is actually a rather non-trivial task. In the end, we decided to go with the following approach: once a second, we semi-randomly split the trips into buckets (this relies on verticality of the trips since we try to assign to each bucket a set of trips that are close to each other horizontally) and each thread only optimizes a single bucket, and thus avoids conflicts with other threads.**Run your program for a very long time:**We did. Our longest single SA runs would take as long as two days straight.

## Cooling schedule

One issue we ignored so far is choosing the cooling schedule, i.e. deciding how the temperature changes over time. The most commonly used schedule is the exponential one, in which after each iteration we multiply T by some constant 0<c<1. The larger c is, the longer the whole process takes, and most of the time this is the only consideration.

The target value of T is usually chosen so that SA in the target temperature essentially reduces to hill climbing. Choosing the initial value of T is a whole different story. In general, the higher this value, the better. Of course, a higher value also means a longer running time. However, in our case, if we set it too high, the solution starts to diverge from the initial one and has trouble converging back within reasonable time. Because of this, we needed to empirically choose the value that was as high as possible without causing any divergence problems.

## Using MIP

One problem with local search algorithms, as opposed to exact algorithms, is that it is not clear at what point they are strong enough to produce good quality solutions. When solving very large TSP instances, Mixed Integer Program (MIP) solvers were used to this end (see e.g. http://www.math.uwaterloo.ca/tsp/index.html). We attempted to use MIPs in a similar fashion. The basic idea was to model gift transportation as a flow problem, in which:

- all gifts are sinks,
- the North Pole is the only source,
- each node is limited to a single outgoing edge (this works for modelling a single trip, to model k trips, we allow at most k outgoing edges at the North Pole).

We also added many extra constraints to the integer program to make it easier to solve. This method worked really well for single trips and allowed us to prove that our SA optimizes them perfectly. For two trips or more, however, the resulting MIP proved too difficult for the solver. Our findings in this area netted us the Xpress prize.