The world seems to be full of great data and as a data scientist, engineer, and researcher, I am in awe of what humankind has accomplished in the data science field. Despite all that we have accomplished, as a human being I started to ask, “are we doing great things with all this great data?”

So I set myself on a new and exciting path, looking for ways to give back while still doing what I love. I started volunteering for an organization called DataKind: a global nonprofit that plans and organizes humanitarian data projects with nonprofits, NGOs and government agencies. DataKind’s mission is simple: harness the power of data science in the service of humanity. They engage pro bono data scientists and leading social change organizations on projects that address critical humanitarian problems and drive the conversation about how data science can be applied to solve the world’s biggest challenges. As a DataKind volunteer, I have worked on challenging projects and collaborated with smart, fun people. My time there has been some of the most rewarding of my career.

My DataKind team was partnered with SOIL: a nonprofit based in Haiti that works to simultaneously improve Haitian’s access to dignified sanitation and their soil quality using a compostable toilet system. Each week SOIL visits their clients to collect the waste gathered using their provided toilets. The waste is treated and transformed to create rich compost that improves farmers’ yields and environmental regeneration.

The Challenge

Each week, SOIL has to visit a set of clients. They produce a schedule detailing the number of trips required for their vehicles and the order to best visit their clients. While they have invested time into finding efficient approaches, their hope is that data can help make them more efficient and keep them that way as things change (e.g. adding customers or vehicles, modifying expected loads, changing road conditions).

The technical problem we set out to solve was a variation on the traveling salesman problem called the capacitated vehicle routing problem. In a nutshell, we have a fleet of vehicles, each with limited capacity, and a group of customers to be visited once a week. At each visit the driver collects a load from the customer and places it on the vehicle. At the end of a route, the vehicles return to the depot where they are unloaded and made available for another route.

The figure above illustrates a graphical depiction of the problem. Each marker, called a “node” in geekspeak, represents a customer. The blue value beside the node is the load to be collected by the driver. The lines connecting the nodes, called “edges” in geekspeak, represent the routes between them. In this model, the values of the edges (the orange numbers) are the distances between the customers. Unfortunately, it is not computationally feasible to simply check every possible ordering of customers and pick the best one. We solve this problem by formulating the model shown and using an optimization problem to find the best answer.

The full problem in precise geekspeak is:

Given a fleet of vehicles with limited capacity, the ability to unload vehicles at a depot and the graph above, what order of visits to the customers and the depot gives the route that minimizes the total distance travelled?

The answer to this problem has significant impact for our client. Nonprofits tend to operate on small budgets and the savings from fuel and general vehicle wear and tear is critical. Equally important to answering this question is doing so affordably—both in not having to pay for expensive software licenses and in building the system without needing impossible amounts of volunteer developers’ time. To this end, we employ various open-source software and data. In no way should this discount the efforts made on this project. A huge part of good engineering is in identifying and connecting the correct existing tools. A lot of work went into building a great system and the system would not have been possible without the open-source community supporting the great tools they create. I’ll be emphasizing this project’s open-source components throughout.

The Project

The approach has three major stages:

  1. Clean the data
  2. Formulate the model and run the optimization algorithm to get a solution
  3. Visualize the routes to check them, possibly edit them, and give them to drivers

Data Cleaning

The first step was to take a list of client-provided customers and prepare it for optimization. Initially, we are given a list of customers and a few attributes including GPS coordinates and the expected load to be collected. We did some minimal data cleaning including a check to make sure the customer’s contract was still active, the latitude and longitude coordinates were logical for the area and the load values were populated.

Modeling and Optimization

The Model

After data cleaning comes the real fun: the model! In short, we have to take a bunch of nodes and calculate all the information that composes the graph (geekspeak for the nodes and edges connecting them) shown previously. First, we find the distances and times of the routes (edges) the vehicles are to traverse. That’s not as simple drawing a straight line between the two nodes and measuring the distance (as was done in the figure shown above). What if there was a mountain between the two points? Or maybe there is a magic portal causing the distance to be essentially zero! Ok, there’s probably no magic portal but we had to compute the distance between each node pair as experienced by the vehicle. In order to do this we have to recruit the help of a routing engine. SOIL stated that OpenStreetMap (open-source!) tended to be the most accurate mapping for their needs. This in turn led us to use OSRM, an open-source routing machine able to quickly formulate routes and compute distances and times between points based on OpenStreetMap maps. We used OSRM to produce distance and time matrices. The distance/time matrices are N x N matrices, produced for N nodes, where each row (or column) contains the distance/time from one node to all other nodes.

Now we introduce an interesting wrinkle. Our clients have different kinds of vehicles capable of doing different things. I’m not talking about different capacities (because they also have that). The different kinds of vehicles were more along the lines of big trucks vs. motorcycles. Trucks might move faster down a freeway or be able to handle certain terrains better but motorcycles can travel through alleys or dirt paths or make u-turns in the middle of the street. These details might heavily influence the suggested routes and therefore impact the distances between nodes! Fortunately OSRM allows for selection and customization of vehicle profiles. So, we modified the stock profiles to better represent the client’s vehicles and produced new distance/time matrices for each vehicle type.

A Modeling Challenge: “Snapped” Locations

One of the challenges encountered in building the model was the difference in “snapped” locations of the nodes per vehicle. When you request information (e.g. distance or route) from OSRM about two GPS coordinates, it finds the nearest as-the-crow-flies point on a road accessible by that vehicle. From a mapping perspective, that makes total sense. From the point of view of an operations optimization problem, it’s potentially problematic.

In the figure above, the person is at the Smithsonian National Zoological Park in Washington DC, USA trying to get to the Female African Lions exhibit. The figure shows the OSRM-generated routes for a person in a car and a person on foot. From OSRM’s perspective, the person travels 330 meters (4 minutes) by foot and 250 meters (1 minute) by car between these points. At first glance, driving is the clear choice. However, the car isn’t allowed to drive into the zoo because these are pedestrian paths. So, the car distance reported is for a snapped location on a road that was the closest physical point to the exhibit. If the user got out of the car at the snapped location and continued to their destination on foot, they would need to walk an additional 413 meters (6 minutes). That’s more than the walking distance from our original starting point!

Without providing the model information about the differences in snapped locations, we would have chosen a far less ideal solution. One method to address this solution is to calibrate the distances by adding the difference between the snapped and actual GPS coordinates (and multiplying by 2 for the round-trip back to the car).

There are several issues with this approach. If you started with the most obvious method and added the as-the-crow-flies distance, you are potentially overlooking the feasibility of that route and introducing significant error. In the example above, people are not permitted to walk in the forested area between the snapped node and the actual node. Another method would be to add the distance between snapped and actual node as generated by the OSRM foot profile. This would better simulate the difference, but it’s not clear that you would have chosen the snapped location as a starting point. In this example, the original location was a better starting point. In order to make this method work, you would have to search the area for the ideal starting point (accessible by the vehicle) to walk from. This started to exceed the scope of the project and ultimately we decided to leave the vehicle profile selection as a configurable parameter per region.

It turns out that for this project, a relatively simple heuristic lets us assign vehicles to areas and avoid this problem. The smaller and more-accessible vehicles are too slow to be optimal for larger routes and are only ideal for a known subset of densely populated customers away from car-accessible roads.


Finally we are ready for the optimization step. The optimization step formulates the route assignments for each vehicle. A route assignment is a round-trip, starting and ending at the depot, visiting a list of customers in a prescribed order. Vehicle routing optimization is hard (NP-hard actually), so we used a open-source reputable optimization suite built just for this kind of problem: Google OR-Tools. For this step we set up different aspects of the model such as information about the fleet of vehicles (for each vehicle declaring the capacity and the vehicle-type-based distance matrix), node demands (i.e., loads) and identifying the depot node. Google OR-Tools takes the information and produces the order for customer visits.

Model/Optimization Solver Tuning

There were several iterations of model/solver tuning (as is always the case with optimization tools). In many cases, seemingly unintuitive routes may indeed turn out to be better! However, there are other scenarios that are more obviously incorrect. One example of an obviously incorrect solution is a driver driving down a side road to visit a node, coming back out to the main road to complete a different portion of a route and then revisiting that side road again later. Initially we did experience some of these peculiarities, but they were ultimately resolved with model/solver tuning. We tuned the following parameters to improve our optimization performance:

  • Maximum Solver Time Allowance: Google OR-Tools maximum solver time allowance parameter helps the solver to continue looking for better solutions as long as it stays within a time limit. There is a trade-off here because more time allows the solver to search for better solutions but too much time makes the software slow. Our investigation showed that after about 2 minutes for ~100 node blocks, the solution didn’t seem to improve much.
  • Cost Function Scaling: When we were optimizing for minimum travel time, nodes that were close together didn’t always seem to be in the correct order. We initially used minutes for our time matrix. Incidentally, Google OR-Tools routing solver only works on integer cost functions and several nodes were less than a minute from each other. We changed the time cost function units to seconds to avoid rounding errors and improve performance.
  • Local Search Algorithm: Google OR-Tools offers a handful of different local search algorithms. Local search algorithms allow the solver to escape local optima. We found that the guided local search algorithm worked better than Google OR-Tools automatic selection.
  • Vehicle Profile Customization: Several iterations were devoted to further customization of the OSRM vehicle profiles to ensure the vehicles followed the same practices as our nonprofit’s drivers.

Routing and Visualization

The final task was to transform the optimal customer order from Google OR-Tools into a usable format. Here we enlist the help of OSRM once again to define the turn-by-turn routes between the nodes. We visualize the routes on maps using the Python open-source library Folium.

Routing and visualization serves a couple functions. The first, and most obvious, is to provide the drivers with maps to follow the new, more efficient routes. Another function of routing and visualization is human/optimization co-existence for managing solutions. Visualization allows the nonprofit’s scheduling team to assess the optimization performance and instills confidence in the tool. It brings the human back into the loop and occasionally shows instances where the human may still just have more knowledge (e.g., certain roads might be inaccessible due to weather). To this end, we implemented a feature whereby the user could manually edit the solution and produce exactly the routes and maps they wanted for the driver.


  • We built a tool optimizing routing operations for a great cause. That tool will continue to improve SOIL’s operations through more efficient routing. It also automates some routing/scheduling efforts by quickly producing new solutions with changing conditions (e.g., adding/removing customers, modifications to vehicle fleet, changing depot/focal point locations).
  • Open source is a truly powerful thing. For this project, we used open source mapping data (OpenStreetMap), routing tools (Open Source Routing Machine), optimization tools (Google OR-Tools) and visualization (Folium). It made our project affordable and thus possible.
  • Routing optimization tools require lots and lots (and lots) of tuning. Data scientists have access to so many amazing tools (e.g., Google OR-Tools), but much of good data science and engineering work is iteratively customizing those tools to use to their full potential.


I want to thank my DataKind team and SOIL for the amazing experience. With great data comes great responsibility. We took that responsibility and ran with it, improving one small corner of the world through the work we love best.

Categories: Data Science