
I am trying to perform a binary classification of tweets using machine learning.

The usual way of doing this seems to be to put each hand-classified tweet's words into a big vector, then use those vectors as input to an algorithm, which predicts the class of new tweets based on this data.

Is there a standard method, or algorithm, that can include other input, such as the location of a tweet, in this process?

I could just add tweet location at the end of the vector, I suppose (see the sketch below), but that would give it a very small weighting.
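For concreteness, this is roughly what I mean (a minimal sketch using scikit-learn and SciPy; the tweets and coordinates are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy data: hand-classified tweets plus a latitude/longitude per tweet
tweets = ["great game tonight", "traffic is terrible downtown"]
locations = np.array([[40.7, -74.0], [51.5, -0.1]])

# Bag-of-words matrix: one row per tweet, one column per vocabulary word
# (around 5,000 columns on a realistic corpus)
X_words = CountVectorizer().fit_transform(tweets)

# "Adding location at the end of the vector":
# ~5,000 word columns followed by 2 location columns
X = hstack([X_words, csr_matrix(locations)])
```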

Any pointers are much appreciated.


1 Answer


Data pre-processing and feature extraction are by far the most important parts of any machine learning pipeline. They are even more important than the model you choose to do the classification.

Unfortunately, pre-processing and feature extraction are completely different for each type of data. You need to play around with the data yourself to find out what works best for its nature; with experience you start noticing patterns across data types. For example, as you are doing, building a word vector is an effective means of feature extraction for text-based data.
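As an illustration (a minimal sketch with scikit-learn; the example tweets are invented), TF-IDF weighting is a common refinement of the raw word-count vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example corpus of hand-classified tweets
tweets = ["great game tonight", "traffic is terrible downtown", "what a great city"]

# One row per tweet, one column per vocabulary word, weighted by
# term frequency-inverse document frequency instead of raw counts
vectorizer = TfidfVectorizer()
X_words = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out())
print(X_words.toarray())
```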

"I could just add tweet location at the end of the vector I suppose, but that would give it a very small weighting."

This is not true for any machine learning algorithm I would choose. Your model should not assign weights to inputs based on their position in the feature vector; weights should be assigned based on the explained variance (information gain) each feature provides.
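To make that concrete, here is a minimal sketch (scikit-learn; the data is invented) in which a tree-based model picks splits by informativeness, so the two location columns at the very end of the vector can still carry most of the weight if they are predictive:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy data: tweets, per-tweet coordinates, and binary labels
tweets = ["great game tonight", "traffic is terrible",
          "love this city", "awful weather today"]
locations = np.array([[40.7, -74.0], [51.5, -0.1],
                      [40.7, -74.0], [51.5, -0.1]])
labels = [1, 0, 1, 0]

X_words = CountVectorizer().fit_transform(tweets)
X = hstack([X_words, csr_matrix(locations)])  # word columns + 2 location columns

# Trees split on whichever feature is most informative,
# regardless of where it sits in the vector
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
print(clf.feature_importances_[-2:])  # importance of the two location columns
```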


After you do your pre-processing and feature extraction, you can further refine your feature set with some common methods available in standard libraries, such as principal component analysis (PCA) and linear discriminant analysis (LDA).
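A hedged sketch of that last step (scikit-learn; the feature matrix is random stand-in data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Random stand-in for a dense feature matrix (word features + location) and labels
rng = np.random.RandomState(0)
X = rng.rand(20, 10)
y = rng.randint(0, 2, 20)

# PCA: unsupervised, keeps the directions of largest variance
X_pca = PCA(n_components=5).fit_transform(X)

# LDA: supervised, projects onto the directions that best separate the classes
# (at most n_classes - 1 components, so one component for binary labels)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# Note: PCA in scikit-learn needs dense input; for a large sparse word
# matrix, TruncatedSVD is the usual substitute.
```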

  • Thanks, Jah, much appreciated. W.r.t. "Your model should not assign weights to inputs based on their position in the feature vector": I didn't mean because it is last in the array, but because, in a 5,000-word array, say, it would be only one factor out of 5,001. This is what I am looking for an example of: combining a big array with, e.g., location. – schoon Jul 12 '17 at 06:10
  • It should still be fine. If that variable explains most of the variance, then it will be given more weight. Of course, if your feature space is too high-dimensional for the limited amount of data you have, then you will have problems, but you can use feature reduction techniques to get rid of features that are not pertinent to your decision boundary. – JahKnows Jul 12 '17 at 13:11