
I've written a program to analyse a given piece of text from a website and make conclusive classifications as to its validity. The code vectorizes the description (taken from the HTML of a given webpage in real time) and uses a few values derived from it as features for its decisions. There are some more features, like the domain of the website and counts of some keywords I've explicitly included.

The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but no set of hyperparameters seems to exceed the random forest's accuracy. I have around 2000 data points available for training.

Is there a classifier that tends to work best for projects like this? Does anyone have suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)

Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?

What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.

Arnav Das

2 Answers


Accuracy depends on various factors; it might not always be the algorithm. For example, cleaner data with a poor algorithm might still give better results, and vice versa.

What preprocessing techniques are you using? This preprocessing techniques article is a good starting point for HTML data. And by vectorising I assume you mean word2vec; use a pre-trained word2vec model, like Google's, which is trained on a lot of data (about 100 billion words).
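As a rough illustration, loading those pre-trained vectors with gensim might look like the sketch below (the file path is an assumption; the GoogleNews vectors have to be downloaded separately):

```python
# Sketch: load Google's pre-trained word2vec vectors with gensim and
# look up one word vector. The file path is illustrative; the
# GoogleNews-vectors-negative300.bin file must be downloaded first.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

vec = w2v["website"]   # 300-dimensional numpy vector for one word
print(vec.shape)       # (300,)
```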

An LSTM performs well whenever the intent (and word order) of a sentence is important. Check out this. "Ram hit Vijay" and "Vijay hit Ram" might mean the same thing to most bag-of-words algorithms, Naive Bayes being one example.
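To make that concrete, here is a small sketch showing that a plain bag-of-words representation (the kind Naive Bayes is usually fed) gives both sentences exactly the same vector:

```python
# Sketch: a bag-of-words representation cannot distinguish sentences
# that differ only in word order.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Ram hit Vijay", "Vijay hit Ram"]
X = CountVectorizer().fit_transform(sentences).toarray()
print(X)                      # the two rows are identical
print((X[0] == X[1]).all())   # True
```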

codeblooded

First of all, there are multiple factors that determine how well a model will work: the amount of data, the source of the data, hyperparameters, model type, training time, and so on. All of these affect the accuracy. However, no classifier works best in general. It all depends on these factors, and no single model can satisfy all of them, at least for now.

To improve the accuracy, you first need to make those factors as close to ideal as possible so that the classification can reach a higher accuracy.

First of all, how much data do you have? If you are using HTML webpages, you probably need at least 10,000 samples; with at least that amount of data you should be fine with respect to overfitting. You also need to clean the data. One way to do this is to tokenize it: tokenization basically means splitting the text into words and building a dictionary out of them, so that each word is encoded as a specific number and every occurrence of the same word gets the same encoding. You are using raw HTML as input, which has a lot of unnecessary information, tags, and so on; you can try removing those, or completely strip all HTML tags if they are not required. The key to cleaning the data is to extract the pieces of information that are important and necessary for the model to work.
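A minimal sketch of that cleaning and encoding idea, assuming BeautifulSoup for stripping tags (the vocabulary scheme and helper names are illustrative):

```python
# Sketch: strip HTML tags, tokenize, and encode each word as an integer.
# BeautifulSoup is one way to get the visible text; the vocabulary and
# helper names here are illustrative.
from bs4 import BeautifulSoup

def clean_html(html):
    """Drop tags and keep only the visible text."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def build_vocab(texts):
    """Map every distinct token to a unique integer id (0 = unknown)."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn a text into a list of integer token ids."""
    return [vocab.get(token, 0) for token in text.lower().split()]

docs = [clean_html("<html><body><p>Example page text.</p></body></html>")]
vocab = build_vocab(docs)
print(encode(docs[0], vocab))   # e.g. [1, 2, 3]
```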

Then, you should explore the model. For an NLP (Natural Language Processing) task, your best bet is to choose an RNN (Recurrent Neural Network). This type of network has memory cells that help with text data, because text often has long-range links within a paragraph; for example, one sentence may use a "she" that refers to a person mentioned two sentences before, and if you just feed every single word encoding into an MLP, the network has no such memory with which to learn long-term connections in the text. An RNN is also time-dependent, meaning it processes the tokens one by one in order. This makes the text more intuitive for the network, as text is designed to be read forward, not all at once.

Your current method is to vectorize the HTML first, then feed it into a random forest classifier. A random forest classifier works great, but it does not scale well when there is more data: its accuracy stays mostly the same as the data grows, while the accuracy of a deep neural network keeps increasing with the amount of data. However, a deep neural network requires a large amount of data to start with. If you do not have too much data (< 10,000 samples), the random forest should remain your method of choice; if you plan to add more data, or if you already have more, you should try a deep-learning-based method.
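One rough way to check which regime you are in is a learning curve, i.e. how cross-validated accuracy changes as the training set grows. The sketch below uses synthetic data as a stand-in for your real feature matrix:

```python
# Sketch: learning curve for a random forest. The synthetic dataset
# from make_classification stands in for the real features/labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=300, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} training samples -> {score:.3f} mean CV accuracy")
```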

For a deep-learning-based method, ULMFiT is a great model to try. It uses an LSTM (Long Short-Term Memory) network, which is a type of RNN, with language-model pretraining and several other techniques to increase accuracy. You can try it with the fast.ai implementation: https://nlp.fast.ai/
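A hedged sketch of what that could look like with the fastai library (the DataFrame, column names, and file name are assumptions; this only shows the general shape of the API):

```python
# Sketch: fine-tune a ULMFiT-style AWD-LSTM classifier with fastai.
# Assumes a CSV with a "text" column (the cleaned description) and a
# "label" column (the validity class); both names are illustrative.
import pandas as pd
from fastai.text.all import (TextDataLoaders, text_classifier_learner,
                             AWD_LSTM, accuracy)

df = pd.read_csv("descriptions.csv")

dls = TextDataLoaders.from_df(df, text_col="text", label_col="label",
                              valid_pct=0.2)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy)
learn.fine_tune(4)   # a few epochs of fine-tuning as a starting point
```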

If you wish to try a method that you can practically implement yourself, you could use a plain LSTM with one-hot (or integer-indexed) encodings as input. However, don't use word2vec for preprocessing, since your input data is HTML code: the word2vec model is meant for normal English text, not HTML tags and markup. Moreover, a custom encoding will work better, because the encoding itself can be trained during the training process.
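A minimal sketch of that idea in PyTorch, using a trainable embedding layer (the usual stand-in for one-hot input); the vocabulary size, dimensions, and dummy batch are all illustrative:

```python
# Sketch: a plain LSTM classifier with an embedding layer that is
# trained together with the rest of the network (no word2vec/GloVe).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # learned encoding
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):              # (batch, seq_len) integer ids
        embedded = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])             # class scores per example

model = LSTMClassifier(vocab_size=5000)
dummy_batch = torch.randint(0, 5000, (8, 40))  # 8 sequences of 40 token ids
print(model(dummy_batch).shape)                # torch.Size([8, 2])
```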

Hope this helps.

Clement
  • What do you mean by custom encodings? Also, I'm only using a small part of the HTML. Actually, the program searches the HTML for the 'description' tag, and then it vectorizes any of the words that it can find in the GloVe vector space. If it can't find a word, I omit it. I am not sure how to utilise the HTML in a better way. Is there any way I could use the text on the page? I couldn't find any single tag that holds all the text, so it seemed complicated. Also, I'm using 2400 datapoints (2100 training and 300 testing). I'm still new to Deep Learning, so I just used a standard classifier. – Arnav Das Jan 04 '20 at 14:10
  • By custom encoding I mean making a dictionary of all the tokens, assigning each of them an individual number, converting every word in the input text to its number, and then feeding the one-hot encoding of that number to the network – Clement Jan 04 '20 at 14:58
  • But if you are using the description tag, you can use word2vec or GloVe; still, a custom encoding will work better, as the encoding is trained as well when the gradients are backpropagated – Clement Jan 04 '20 at 15:00
  • Try https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html for the custom embedding/encoding I mentioned. Hope it helps. – Clement Jan 04 '20 at 15:02
  • What I am doing is taking all the words in the description, embedding them into vectors, and adding up all of the vectors to get a vector for the whole description. I am putting this description vector in my model as an input. Should I be doing anything differently? – Arnav Das Jan 07 '20 at 15:13
  • How did you embed the words to vectors? Are you using word2vec? – Clement Jan 07 '20 at 15:40
  • Yes, using the GloVe embeddings to create 300-dimensional vectors, and then adding up all the word vectors. The reason I did this is that my model needs a fixed-shape input, so I couldn't think of any other way than to add them up and get a fixed 300-length vector. – Arnav Das Jan 07 '20 at 15:46
  • You can check out my answer to your new question. Hope I can help you. – Clement Jan 07 '20 at 15:48