Questions tagged [scikit-learn]

For questions related to the Python package scikit-learn (or sklearn).

26 questions
5
votes
2 answers

Why isn't my decision tree classifier able to solve the XOR problem properly?

I was trying to solve an XOR problem, and the dataset looks like the one in the image. I plotted the tree and got this result: As I understand it, the tree should have depth 2 and four leaves. The first comparison is annoying, because it is close to…
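As a point of comparison for the expectation stated in this excerpt (depth 2, four leaves), here is a minimal sketch on the four noise-free XOR corner points. The data is hypothetical and is not the asker's dataset; with noisy or clustered points the greedy splitter can pick different thresholds and produce a deeper tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Four noise-free XOR points (hypothetical data, not the asker's dataset).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# No single split reduces impurity on XOR, but the tree still splits
# and separates the classes at the second level.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

print(export_text(tree, feature_names=["x1", "x2"]))
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
print("training accuracy:", tree.score(X, y))
```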
4
votes
0 answers

When computing the ROC-AUC score for multi-class classification problems, when should we use One-vs-Rest and One-vs-One?

The sklearn documentation for the function roc_auc_score states that the parameter multi_class can take the value 'ovr' (which stands for One-vs-Rest) or 'ovo' (which stands for One-vs-One). These values are only applicable for multi-class…
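To make the two options concrete, the sketch below computes both scores on the Iris dataset. The dataset and the logistic regression model are assumptions chosen for illustration; the example only shows how the parameter is passed and does not settle which strategy to prefer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Multi-class AUC needs class probabilities and an explicit multi_class value.
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")
auc_ovo = roc_auc_score(y_test, proba, multi_class="ovo")
print("One-vs-Rest AUC:", auc_ovr)
print("One-vs-One AUC:", auc_ovo)
```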
4
votes
2 answers

Can ML be used to curve fit data based on dataset of example fits?

Say I have x,y data connected by a function with some additional parameters (a,b,c): $$ y = f(x ; a, b, c) $$ Now given a set of data points (x and y) I want to determine a,b,c. If I know the model for $f$, this is a simple curve fitting problem.…
argentum2f
2
votes
0 answers

How does matrix factorization help with recommendations when it converges to the initial user-item matrix?

We can say that matrix factorization of a matrix $R$, in general, means finding two matrices $P$ and $Q$ such that $R \approx P Q^{T}$, with some constraints on $P$ and $Q$. Looking at some matrix factorization algorithms on the internet like…
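One common setup, sketched here under assumptions, fits $P$ and $Q$ only on the observed entries of $R$ with a small rank, so the product $P Q^{T}$ also yields values for the unobserved entries. The synthetic ratings, the mask, the rank `k`, and the hyperparameters `lr` and `reg` are all hypothetical choices for illustration, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 5, 2

# Hypothetical low-rank "ratings" and a mask of which entries are observed.
R = rng.random((n_users, k)) @ rng.random((n_items, k)).T
observed = rng.random((n_users, n_items)) < 0.7

# Small random initialization of the two factor matrices.
P = 0.1 * rng.standard_normal((n_users, k))
Q = 0.1 * rng.standard_normal((n_items, k))
lr, reg = 0.02, 0.01


def observed_mse(P, Q):
    """Mean squared error on the observed entries only."""
    err = (R - P @ Q.T)[observed]
    return float(np.mean(err ** 2))


mse_before = observed_mse(P, Q)
for _ in range(5000):
    # The error is zeroed on unobserved entries, so they never drive the fit.
    E = observed * (R - P @ Q.T)
    P, Q = P + lr * (E @ Q - reg * P), Q + lr * (E.T @ P - reg * Q)
mse_after = observed_mse(P, Q)

R_hat = P @ Q.T  # dense matrix: also contains predictions for missing entries
print("observed MSE before/after:", mse_before, mse_after)
print("predictions for unobserved entries:", R_hat[~observed])
```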
KindNewbie
2
votes
0 answers

Suitable deep learning algorithms for spatial / geometric data

I have a task of classifying spatial data from a geographic information system. More precisely, I need a way to filter out unnecessary line segments from the CAD system before loading into the GIS (see the attached picture, colors for illustrative…
2
votes
1 answer

Is it compulsory to normalize the dataset if doing so can negatively impact the performance of binary logistic regression?

I am using a raw data set with 4 feature variables (Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, and Cigarette count) to do a binomial classification (find stroke likelihood) using the logistic regression algorithm. I made sure…
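A hedged way to check whether scaling helps or hurts is to cross-validate both variants on the same folds. The sketch below uses synthetic data with four features on different scales as a stand-in; it is an assumption for illustration and says nothing about the asker's medical data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features, rescaled to very different ranges.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([200.0, 120.0, 80.0, 20.0])

raw_model = LogisticRegression(max_iter=5000)
# Scaling inside a pipeline is fit on each training fold only.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

raw_scores = cross_val_score(raw_model, X, y, cv=5)
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)
print("raw accuracy per fold:   ", raw_scores)
print("scaled accuracy per fold:", scaled_scores)
```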
1
vote
0 answers

Using ML to uncover procedural logic

The game Elite Dangerous has a procedurally generated galaxy of some 400 billion star systems. Each star system in the game can be uniquely identified by a 64-bit number (id64) which is used as a seed for building the system but can also be decoded…
1
vote
1 answer

Unexpected behaviour on using class weights in loss

I’m working on a classification problem (500 classes). My NN has 3 fully connected layers, followed by an LSTM layer. I use nn.CrossEntropyLoss() as my loss function. To tackle the problem of class imbalance, I use sklearn’s class_weight while…
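For reference, this is roughly what sklearn's "balanced" class weights look like on a tiny imbalanced label vector. The labels are hypothetical; the weights follow `n_samples / (n_classes * count_per_class)`. Passing them to a loss function is left as a comment because the asker's training code is not shown.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2.
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y)

# "balanced" gives rarer classes proportionally larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))

# These weights could then be converted to a float tensor and passed as the
# `weight` argument of nn.CrossEntropyLoss (not done here to avoid a torch
# dependency in this sketch).
```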
1
vote
1 answer

Why does sklearn perceptron converge for linearly inseparable data points?

I learned that the perceptron algorithm only converges if the dataset is linearly separable. I am implementing this algorithm using scikit-learn. The blue and orange points are from the training set, while red and green are from the test set.…
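One plausible reading, offered as a sketch rather than a diagnosis of the asker's plot: sklearn's Perceptron stops when its `max_iter` or `tol` criterion is met, so `fit` returns even when no separating line exists. The XOR points below are a hypothetical example of linearly inseparable data.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# XOR is not linearly separable (hypothetical data for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

# fit() terminates because of the stopping criteria, not because the
# training data was separated.
print("epochs run:", clf.n_iter_)
print("training accuracy:", clf.score(X, y))
```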
1
vote
1 answer

How can I interpret the value returned by score(X) method of sklearn.neighbors.KernelDensity?

For sklearn.neighbors.KernelDensity, the description of its score(X) method in the sklearn KDE documentation says: "Compute the log-likelihood of each sample under the model." For the 'gaussian' kernel, I have implemented hyper-parameter tuning for the…
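A small sketch of how the two scoring methods relate, assuming synthetic one-dimensional data and an arbitrary bandwidth: `score_samples(X)` returns one log-density per sample, and `score(X)` returns their sum, so it grows in magnitude with the number of samples.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))  # hypothetical 1-D sample

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

per_sample = kde.score_samples(X)  # log-density of each sample
total = kde.score(X)               # single number: total log-likelihood

print("sum of per-sample log-densities:", per_sample.sum())
print("score(X):", total)
print("mean log-likelihood per sample:", total / len(X))
```

Dividing by the number of samples gives a per-sample value that is easier to compare across datasets of different sizes.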
1
vote
1 answer

Interpretation of feature selection based on the model

The description of feature selection based on a random forest uses trees without pruning. Do I need to use tree pruning? The thing is, if I don't prune the trees, the forest will overfit. Below in the picture is the importance of features based on 500…
1
vote
0 answers

How can I split the data into training and validation sets such that entries with a certain value are kept together?

I have the following kind of data frame. These are just examples:
A 1 Normal
A 2 Normal
A 3 Stress
B 1 Normal
B 2 Stress
B 3 Stress
C 1 Normal
C 2 Normal
C 3 Normal
I want to do 5-fold cross-validation and splitting the data using skf =…
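One option for keeping rows with the same identifier together is a group-aware splitter such as GroupKFold. The sketch below reuses the nine example rows from the excerpt and assumes the first column is the grouping key. With only three groups, at most three folds are possible, so `n_splits=3` is used here instead of 5.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# The example rows from the excerpt; the first column is assumed to be the group.
groups = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
X = np.array([[1], [2], [3], [1], [2], [3], [1], [2], [3]])
y = np.array(["Normal", "Normal", "Stress", "Normal", "Stress",
              "Stress", "Normal", "Normal", "Normal"])

gkf = GroupKFold(n_splits=3)  # cannot exceed the number of distinct groups
folds = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    val_groups = set(groups[val_idx])
    folds.append((train_groups, val_groups))
    print("train groups:", sorted(train_groups), "| validation groups:", sorted(val_groups))
```

If class balance per fold also matters, sklearn provides StratifiedGroupKFold, which accepts the same `groups` argument.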
1
vote
0 answers

How can I use gradient boosting with multiple features?

I'm trying to use gradient boosting and I'm using sklearn's GradientBoostingClassifier class. My problem is that I have a data frame with 5 columns and I want to use these columns as features. I want to use them continuously. I mean I want each…
0
votes
1 answer

cross_val_score of sklearn and LinearRegression scoring method

cross_val_score (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) uses the estimator’s default scorer (if available) and LinearRegression (the estimator I use -…
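To illustrate the default-scorer behaviour mentioned in this excerpt: for LinearRegression the default `score` method is $R^2$, so omitting `scoring` should match `scoring="r2"`. The synthetic regression data below is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# No `scoring` argument: falls back to the estimator's own score(), i.e. R^2.
default_scores = cross_val_score(model, X, y, cv=5)
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# Error metrics are exposed negated so that "higher is better" still holds.
neg_mse_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

print("default:", default_scores)
print("r2:     ", r2_scores)
print("neg MSE:", neg_mse_scores)
```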
0
votes
1 answer

Can I implement a sklearn model inside a PyTorch nn.Module?

I am making a custom PyTorch model that, at some point, clusters a latent space that was created by another, previous routine of the model (Autoencoder). In a bit more detail, my model is a regular Autoencoder, but in every training step, I want to…