I have a log file of the format,
Index, Date, Timestamp, Module, App, Context, Session, Verbosity level, Description
The log file can be considered as a master log, which consists of individual logs from several modules constituting a distributed system. The individual log can be identified using the corresponding Module+App+Context tags. The verbosity level(Info, Warn, Error, …) and the descriptions(system generated + print statements added by developers) contain further information on the log events necessary for debugging. I need to perform an unsupervised anomaly detection with the log file as input. The output should be the functionality and timestamp of the identified anomalies.
Since the log is mostly textual, I plan to use NLP algorithm (Bag of words/TF-IDF) to convert the data into word vectors and then perform a generative learning method to identify the normal pattern. Can someone suggest if my approach in the right direction? Which headers of the log file would be relevant for the word-vector representation and further analysis?