Problem statement: I want to predict future prices of trips based on historical pricing data.

I'm looking for an algorithm that has the following features:

  • Unsupervised algorithm
  • Limit the amount of preprocessing required
  • The algorithm should handle categorical data out of the box
  • The algorithm should handle "list of words" data out of the box, i.e. semi-fixed-length lists of tokens. I'm not sure if I'm using the correct terminology, but I expand on this below
  • Should be a distributed (or at least out-of-core) algorithm, as I have a lot of data to process and don't want to load the entire dataset into memory
  • My data currently lives in AWS S3, so it would be good if the library works out of the box with S3 and/or filesystem data

My data consists of, among other things:

  • Data fields which can easily be turned into numbers, e.g. price per person, date, duration of trip, total mileage of trip, hotel rating, etc.

And, more interestingly, some categorical data and what I call "list of words" data:

  • Categorical data like the trip supplier companies (company A, company B, etc.), the type of room (1 queen vs 2 twins, etc.), states, regions, etc.
  • List of words data, i.e. the itinerary for each trip (e.g. Los Angeles, San Francisco, Portland, Seattle)

I am currently using XGBoost to perform the price predictions. The algorithm requires all fields to be numeric, so some preprocessing is needed to get there. I handle the categorical data with one-hot encoding. That isn't too difficult, but there are two issues. First, because of the sheer number of possible categories, one-hot encoding produces several hundred columns (albeit as a sparse matrix). Second, the set of categories may grow over time, so before any training I have to pre-scan the data to determine the full set of one-hot columns.
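For concreteness, a minimal pure-Python sketch of the two-pass (pre-scan, then encode) workflow described above; the column and category names are made up for illustration:

```python
# Sketch of the two-pass workflow: scan all rows to discover the category
# vocabulary, then one-hot encode. Column/category names are illustrative.

def build_vocab(rows, column):
    """First pass: collect every category seen in `column`."""
    return sorted({row[column] for row in rows})

def one_hot(row, column, vocab):
    """Second pass: encode one row as a 0/1 vector over the vocabulary."""
    return [1 if row[column] == cat else 0 for cat in vocab]

rows = [
    {"supplier": "company A", "price": 1200},
    {"supplier": "company B", "price": 950},
    {"supplier": "company A", "price": 1100},
]
vocab = build_vocab(rows, "supplier")
encoded = [one_hot(r, "supplier", vocab) for r in rows]
# Any new supplier appearing later forces a fresh pre-scan and new columns,
# which is exactly the maintenance burden described above.
```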

The bigger issue for me is the "list of words" data, i.e. the itinerary per trip. As with the categorical data, I use one-hot encoding, and it does work. However, the number of distinct cities across all itineraries means the encoding produces several thousand columns, and the set of cities may also grow over time. The resulting matrix is sparse, but several thousand columns are still annoying to work with, and again I have to pre-scan the itinerary data to determine all the possible cities before encoding. I would like to vectorize the list of words instead (perhaps with something like Doc2Vec), which seems a natural fit for itineraries. However, vectorizing the itineraries before feeding them into XGBoost is quite a bit of preprocessing work, so it would be nice if an algorithm had something like this built in!
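To illustrate why the itinerary encoding balloons, here is a minimal sketch of the multi-hot encoding I'm describing, under the same pre-scan assumption; the city names are just the example itinerary from above:

```python
# Multi-hot encoding for variable-length itineraries: one column per city
# ever seen, 1 if the trip visits that city. City lists are illustrative.

def build_city_vocab(itineraries):
    """Pre-scan every itinerary to find all possible cities."""
    return sorted({city for trip in itineraries for city in trip})

def multi_hot(trip, vocab):
    """Encode one itinerary as a 0/1 vector; note this loses visit order."""
    visited = set(trip)
    return [1 if city in visited else 0 for city in vocab]

itineraries = [
    ["Los Angeles", "San Francisco", "Portland", "Seattle"],
    ["San Francisco", "Seattle"],
]
vocab = build_city_vocab(itineraries)
matrix = [multi_hot(trip, vocab) for trip in itineraries]
# With thousands of real cities, `vocab` grows to thousands of mostly-zero
# columns, and the encoding discards the order of stops entirely.
```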

Does anyone have recommendations based on the requirements I've listed above? I've looked at CatBoost and it looks promising. Are there any known gotchas or limitations with CatBoost given my requirements, especially for the "list of words" data? Are there any other algorithms out there that might be a fit?
