How should I select the features for predicting diseases (in particular when patients specify their health issues)?

Question

My aim is to train a model for predicting diseases. Now, according to this Wikipedia article, diseases are classified based on the following criteria in general:

Causes (of the disease)
Pathogenesis (the mechanism by which the disease progresses)
Age
Gender
Symptoms (of the disease)
Damage (caused by the disease)
Organ type (e.g. heart disease, liver disease, etc.)

Are these features used universally, i.e. for predicting all types of diseases? I don't think so: there can be other relevant attributes as well, for example travel history in the case of coronavirus.

So, are there better features for predicting diseases? And which of the features above are the most useful when patients describe their health issues themselves?

You are not asking an AI question but a subject domain question. I suggest you ask the same question on a disease research site. When you build your classifier you can determine which features are dominant. For example, with random forests see this article: [https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e). — Brian O'Donnell, Mar 17 '20 at 12:34
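To make the comment's point about finding the dominant features more concrete, here is a minimal sketch that ranks categorical features by information gain. This is the split criterion used in entropy-based decision trees, so it serves here only as a simple, dependency-free stand-in for the random-forest importances the comment mentions. It is not the code from the linked article. The records, the feature names and the function names are all invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Reduction in label entropy after splitting the rows on one categorical feature."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_features(rows, features, target):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {f: information_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records, invented purely for illustration (not real patient data).
PATIENTS = [
    {"fever": "yes", "recent_travel": "yes", "sex": "F", "disease": "infected"},
    {"fever": "yes", "recent_travel": "yes", "sex": "M", "disease": "infected"},
    {"fever": "yes", "recent_travel": "no", "sex": "F", "disease": "infected"},
    {"fever": "yes", "recent_travel": "yes", "sex": "M", "disease": "infected"},
    {"fever": "no", "recent_travel": "no", "sex": "F", "disease": "healthy"},
    {"fever": "no", "recent_travel": "no", "sex": "M", "disease": "healthy"},
    {"fever": "no", "recent_travel": "yes", "sex": "F", "disease": "healthy"},
    {"fever": "no", "recent_travel": "no", "sex": "M", "disease": "healthy"},
]

for name, gain in rank_features(PATIENTS, ["sex", "recent_travel", "fever"], "disease"):
    print(f"{name}: {gain:.3f} bits")
```

In this made-up data, fever separates the two classes perfectly, recent travel helps a little, and sex carries no information. On real data the ranking would of course depend on the disease, which is why the domain question still matters.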

score -1 · Answer 1 · answered Aug 28 '20 at 07:21

In medical prognosis, some variables come up again and again, such as Age, Sex, Ascites, Hepato, Spider and Status of the disease, but which ones matter depends on the disease. You'll commonly encounter these variables whether you're doing regression or classification.

Also, if you're reading radiology reports to get the input for the model, you have to deal with jargon. The same symptom can be written in various ways that all point towards the same prognosis, i.e. there can be synonyms for labels. The well-known CheXpert paper has more information on how to do information extraction from radiology reports.
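A minimal sketch of the synonym problem, assuming a small hand-written lookup table: several phrasings are mapped to one canonical label. The table and the function name `extract_labels` are hypothetical and only illustrate the idea. This is not the labeler described in the CheXpert paper, and a real vocabulary would need clinical curation.

```python
import re

# Hypothetical synonym table, invented for illustration; a real system would
# need a clinically curated vocabulary.
SYNONYMS = {
    "Cardiomegaly": ["cardiomegaly", "enlarged heart", "enlarged cardiac silhouette"],
    "Pleural Effusion": ["pleural effusion", "pleural fluid"],
}


def extract_labels(report):
    """Return the sorted canonical labels whose synonyms appear in the report text."""
    text = report.lower()
    found = set()
    for label, phrases in SYNONYMS.items():
        for phrase in phrases:
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found.add(label)
                break
    return sorted(found)


print(extract_labels("Enlarged cardiac silhouette with small pleural fluid collection."))
```

Note that this sketch ignores negation ("no pleural effusion" would still match) and uncertainty ("possible cardiomegaly"), both of which matter a lot in real reports.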

score -2 · Answer 2 · answered Aug 28 '20 at 03:11

To start from scratch and keep the approach simple, we have to analyze the input text (the clinical narration) for the following:

Is the input a word, a group of words or a sentence?

Is the input a meaningful sentence? By meaningful, I mean grammatically correct.

Does the word, group of words or sentence contain symptoms or health issues?

Does the sentence contain data about the person’s age and gender?

Does the sentence contain data about the person’s diet, medical history, work routine, travel history or contact with an ill person?
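Some of the checks above can be sketched with simple keyword and regular-expression matching. This is only a rough illustration under strong assumptions: the keyword lists, the function name `analyze_input` and the word/phrase/sentence heuristic are all made up here, and the grammaticality check is not attempted at all.

```python
import re

# Hypothetical keyword lists, invented for illustration only.
SYMPTOM_TERMS = ["fever", "cough", "headache", "chest pain", "shortness of breath"]
GENDER_TERMS = {"male": "male", "man": "male", "female": "female", "woman": "female"}
TRAVEL_TERMS = {"travel", "travelled", "traveled", "trip", "abroad"}

# Matches e.g. "45 years old", "45-year-old", "45 yr old".
AGE_PATTERN = re.compile(r"\b(\d{1,3})[\s-]*(?:years?|yrs?)[\s-]*old\b")


def analyze_input(text):
    """Extract a few coarse attributes from a patient's free-text statement."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)

    # Crude heuristic: one token is a word, closing punctuation marks a sentence.
    if len(tokens) == 1:
        kind = "word"
    elif text.strip().endswith((".", "?", "!")):
        kind = "sentence"
    else:
        kind = "phrase"

    age_match = AGE_PATTERN.search(lowered)
    # Token-level lookup so that "female" is not mistaken for "male".
    gender = next((GENDER_TERMS[t] for t in tokens if t in GENDER_TERMS), None)
    symptoms = [
        term for term in SYMPTOM_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]

    return {
        "kind": kind,
        "age": int(age_match.group(1)) if age_match else None,
        "gender": gender,
        "symptoms": symptoms,
        "travel_mentioned": bool(TRAVEL_TERMS & set(tokens)),
    }


print(analyze_input(
    "I am a 45 year old woman with fever and a dry cough since I travelled abroad last week."
))
```

Diet, medical history, work routine and contact with ill persons could be handled with further keyword lists in the same way, though a trained clinical NLP model would likely be more robust than fixed lists.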

If there are any other attributes that one should look for, I would be keen to hear about them from subject matter experts.
