What ML algorithm should I use that suits this data?

Question

Suppose I have some data and want to know whether education level and IQ affect earnings, so I fit a regression model that predicts earnings from IQ and education level. My confusion is: what if the relationship is neither linear nor polynomial? What if the data is a mess but still contains patterns that a linear plane can't capture? How do I figure out whether the independent variables, when plotted, will form a line or a polynomial curve?

I mean, with one dependent and one independent variable it's easy because you can plot it and see, but with multiple independent variables, how do I figure out whether the relationship is linear or something more complicated? How do I figure out whether I should use a regression model?

Let's say I want to predict a store's daily revenue based on the day of the week, the weather and the number of people who arrived in the city. My data would look something like this:

+-----------+---------+----------------+---------+
| DAY       | WEATHER | PEOPLE ARRIVED | REVENUE |
+-----------+---------+----------------+---------+
| Monday    | Sunny   | 1115           | $500    |
+-----------+---------+----------------+---------+
| Tuesday   | Cloudy  | 808            | $250    |
+-----------+---------+----------------+---------+
| Wednesday | Sunny   | 450            | $300    |
+-----------+---------+----------------+---------+

I'm a bit confused about which ML algorithm I should use in such a scenario. I can represent the days of the week as numbers (Monday - 1, Tuesday - 2, Wednesday - 3, etc.) and the weather likewise (Sunny - 1, Cloudy - 2, Normal - 3, etc.), but would a regression model work? I'm skeptical because I'm not sure there's a linear relationship between the variables, and I'm not sure a hyperplane can give an accurate representation of what's going on.

score 1 · Accepted Answer · answered Jan 02 '21 at 17:15

As part of your exploration, fit several models of increasing complexity. Start with a simple linear model and work up to multi-layer neural networks (with non-linear activations, of course). If the nonlinear models perform better on data they were not trained on, that suggests your data do not follow a linear hyperplane.
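
To make the comparison concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the exact workflow described above: the data is synthetic (y is roughly x squared), and a k-nearest-neighbour regressor stands in for the nonlinear end of the spectrum in place of a neural network. The helper names (fit_line, fit_knn, mse) are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: average the targets of the k closest points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic, deliberately nonlinear data (an assumption for illustration).
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]

# Hold out the last 50 points; both models are judged only on those.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

linear_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"linear MSE: {linear_mse:.3f}, k-NN MSE: {knn_mse:.3f}")
```

On this data the nonlinear model has a much lower held-out error than the straight line, which is the signal to look for. If the two errors were close, the simpler linear model would be the reasonable choice.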

Also check this out for recent trends: https://machinelearningmastery.com/auto-sklearn-for-automated-machine-learning-in-python/

score 1 · Answer 2 · answered Jan 23 '23 at 02:22

There is a model selection technique called K-Fold Cross Validation for exactly this situation. It divides your dataset into K separate pieces (folds). In each iteration the model is trained on all folds but one and evaluated on the held-out fold, so every fold serves as the evaluation set exactly once.

Each iteration yields an error value e for its held-out fold. Summing these values and dividing by the number of folds gives the model error.

The model error represents the model's performance on that specific dataset. To see which model suits your data better, I suggest comparing their model errors.
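
The procedure can be sketched in plain Python as follows. This is a hand-rolled illustration rather than any library's implementation: the function names (k_fold_indices, cross_val_error, fit_mean, fit_line) are hypothetical, the data is synthetic, and mean squared error is assumed as the per-fold error e.

```python
import random

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(fit, xs, ys, k=5):
    """Return (model error, per-fold errors) using mean squared error per fold."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        # Train on every fold except the held-out one.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        # e for this fold: error measured only on the held-out points.
        e = sum((model(xs[i]) - ys[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(e)
    return sum(fold_errors) / k, fold_errors

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Second candidate: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Synthetic data in random order (shuffle real data before contiguous folds).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

mean_err, _ = cross_val_error(fit_mean, xs, ys)
line_err, line_folds = cross_val_error(fit_line, xs, ys)
print(f"mean model: {mean_err:.3f}, line model: {line_err:.3f}")
```

The candidate with the lower averaged error is the better fit for this dataset. In practice a library routine would normally replace the hand-written loop; the point here is only to show where each e comes from and how the values are combined.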

score 0 · Answer 3 · answered Aug 05 '20 at 15:42

A regression model will definitely work on that problem.

If you get a shape error, you only need to reshape the predictor variables (day, weather, people arrived) into the array layout your library expects. Otherwise you can simply apply linear regression, SVM, etc. to get your output with good accuracy.
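
As a rough sketch of preparing the example table for such a model: the snippet below builds a numeric feature matrix from the three rows in the question. It uses one-hot encoding for the day and weather columns, which is a substitution on my part rather than something stated in this answer. Integer codes such as Sunny = 1, Cloudy = 2 would tell a linear model that the categories are ordered and evenly spaced, which is usually not intended. The one_hot helper is hypothetical.

```python
# Rows copied from the table in the question: (day, weather, people, revenue).
rows = [
    ("Monday", "Sunny", 1115, 500),
    ("Tuesday", "Cloudy", 808, 250),
    ("Wednesday", "Sunny", 450, 300),
]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

# Category lists are derived from the data and sorted for a stable column order.
days = sorted({day for day, *_ in rows})
weathers = sorted({weather for _, weather, *_ in rows})

# X has one row per sample and one column per feature (2D); y is a flat list (1D).
X = [one_hot(day, days) + one_hot(weather, weathers) + [people]
     for day, weather, people, _ in rows]
y = [revenue for *_, revenue in rows]

for features, target in zip(X, y):
    print(features, "->", target)
```

Libraries such as scikit-learn generally expect the features in this two-dimensional layout (samples by features) and the target as a one-dimensional array, which is the usual source of shape errors.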
