Finding the right questions to increase accuracy in classification

Question

Let's say I have a list of 100k medical cases from my hospital. Each row is a patient with symptoms (such as fever, funny smell, pain, etc.), and my labels are medical conditions such as head trauma, cancer, etc.

A patient comes in and says "I have a fever", and I need to predict his medical condition from the symptoms. From my data set I know that fever and vomiting together go with condition X, so I would like to ask him whether he is vomiting to increase the certainty of my classification.

What is the best algorithmic approach to finding the right question (generating questions from my historical data set)? I thought about trying active learning on the features, but I am not sure that is the right direction.

score 2 · Accepted Answer · answered Jul 12 '18 at 19:40

The problem you're trying to address can, in some sense, be viewed as a Feature Selection problem. If you look for literature using only those words, you're not going to find what you're looking for though. In general, "Feature Selection" simply refers to the problem where you already have a large number of features and are deciding which ones to keep and which ones to throw away (because they're not informative, or because you don't have the processing power to train with all features, for example).

I'd recommend looking around for a combination of "Feature Selection" and "Cost-Sensitive". This is because, in your case, there are costs associated with selecting features; values may be costly to obtain for some features. Searching for this combination leads to publications which look to be interesting for you, such as:

I cannot personally vouch for any of those techniques since I've never used them, but those papers certainly look relevant for your problem.
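To make the idea of "cost-sensitive" selection concrete, here is a minimal sketch that ranks features by how informative they are per unit of acquisition cost. It is not taken from any of the papers; the scoring rule (mutual information divided by cost), the function names, the toy data and the cost values are all assumptions chosen for illustration.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete 1-D arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x = np.mean(x == xv)
                p_y = np.mean(y == yv)
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi


def rank_by_gain_per_cost(X, y, costs):
    """Sort feature indices by mutual information per unit cost, best first."""
    scores = np.array(
        [mutual_information(X[:, j], y) / costs[j] for j in range(X.shape[1])]
    )
    return np.argsort(-scores), scores


# Hypothetical toy data: 6 patients, 3 yes/no features, binary label.
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])

# Assumed acquisition costs: the third feature is five times as expensive.
costs = np.array([1.0, 1.0, 5.0])

order, scores = rank_by_gain_per_cost(X, y, costs)
print("features, best first:", order.tolist())
print("information per unit cost:", np.round(scores, 4).tolist())
```

In this toy data the second and third features carry the same amount of information about the label, so the cheaper one ranks higher. The published methods are more involved than this filter, but the trade-off they optimize has the same shape.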

When you're looking around for more literature, terms like "cost", "cost-based", maybe "budgeted" are crucial to include. If you don't include those, you're just going to get papers on problems like:

Feature Selection: given a set of features/columns, which ones am I going to use across all samples/instances/rows?
Feature Extraction: given data (typically without clear human-defined features, like images, sound, etc.), how am I going to extract relevant features from this?
Active Learning: given a bunch of samples without labels but feature values already assigned, which one would I like an oracle/human expert/etc. to have a look at so that they can tell me what the true label is?

None of those kinds of problems really appear to be relevant in your case. Active Learning may be somewhat interesting in that it is about trying to figure out which rows would be valuable to learn from, whereas your problem is about which columns would be valuable to learn from. There does seem to be a connection there: Active Learning techniques might to some extent be able to inspire techniques for your problem, but only inspire; they likely won't be directly applicable without additional work.

Thank you @Dennis for the detailed answer. I have read the material you added about cost-sensitive feature selection. Although it seems very related, I'm wondering why the number of papers on this subject is so small. Is there a chance that there are other "names" or methods for dealing with this issue? — Latent, Jul 15 '18 at 08:32
Also, since there is no cost to asking the patient a question (let's say I can ask a couple of questions for free), it is hard for me to think of a cost that can be calculated here. So the ordering of the features to ask about should not be cost related but more prediction-interval related. I mean, if you have a fever the classifier might say condition X with P=0.3 and condition Y with P=0.2, but if I knew that you are also vomiting I could be 0.8 sure that it is condition X, so I ask the user whether he is vomiting. — Latent, Jul 15 '18 at 08:36
@Uri By searching for "active learning features" I was able to also find https://users.cs.duke.edu/~amink/publications/manuscripts/hartemink05.icml.pdf which seems interesting. Still, they don't appear to give a new, easily-searchable name for this problem setting, and with that kind of search query you also find lots of less relevant results. As for defining costs for asking the patient questions, you could, for example, view the `expected gain in accuracy` as a "negative cost" (or just a "reward") for asking a question. Then you can minimize those "negative costs" / maximize "rewards" — Dennis Soemers, Jul 15 '18 at 08:56
That would turn it into some sort of Reinforcement Learning problem (or Multi-Armed Bandit problem), which I believe would be most closely related to the second link in my answer. I may be "biased" towards thinking of such approaches though, since I'm mostly in RL myself; it may not be the best way to think about the problem. It is a way of thinking about the problem that I personally find interesting though — Dennis Soemers, Jul 15 '18 at 09:01
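The "expected gain in accuracy as a reward" idea from the comments can be sketched as a greedy one-step rule: for every symptom not yet asked about, compute how much the classifier's top posterior probability is expected to rise once the answer is known, then ask the question with the largest expected rise. This is only an illustration under assumptions that are not in the thread: a Bernoulli naive Bayes model with Laplace smoothing, yes/no symptoms, and made-up toy data. The helper names are hypothetical.

```python
import numpy as np


def fit_naive_bayes(X, y, n_classes, alpha=1.0):
    """Estimate class priors and P(symptom = yes | class) with Laplace smoothing."""
    counts = np.array([np.sum(y == c) for c in range(n_classes)], dtype=float)
    prior = (counts + alpha) / (counts + alpha).sum()
    likelihood = np.array([
        (X[y == c].sum(axis=0) + alpha) / (counts[c] + 2 * alpha)
        for c in range(n_classes)
    ])
    return prior, likelihood


def posterior(prior, likelihood, answers):
    """Class posterior given answers, a dict {symptom_index: 0 or 1}."""
    p = prior.copy()
    for j, a in answers.items():
        p = p * (likelihood[:, j] if a == 1 else 1.0 - likelihood[:, j])
    return p / p.sum()


def next_question(prior, likelihood, answers):
    """Pick the unasked symptom with the largest expected gain in top posterior."""
    current = posterior(prior, likelihood, answers)
    best_j, best_gain = None, -np.inf
    for j in range(likelihood.shape[1]):
        if j in answers:
            continue
        # Predictive probability that the patient answers "yes" to symptom j.
        p_yes = float(current @ likelihood[:, j])
        expected_conf = 0.0
        for a, p_a in ((1, p_yes), (0, 1.0 - p_yes)):
            expected_conf += p_a * posterior(prior, likelihood, {**answers, j: a}).max()
        gain = expected_conf - current.max()
        if gain > best_gain:
            best_j, best_gain = j, gain
    return best_j, best_gain


# Hypothetical toy data. Columns: fever, vomiting, cough. Labels: 0 = X, 1 = Y.
X = np.array([
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],  # condition X
    [1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 0, 1],  # condition Y
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

prior, likelihood = fit_naive_bayes(X, y, n_classes=2)
question, gain = next_question(prior, likelihood, answers={0: 1})  # patient reported fever
print("ask about symptom index:", question, "expected gain:", round(gain, 3))
```

With this toy data, fever alone leaves X and Y equally likely, cough is uninformative, and vomiting separates the two conditions, so the rule chooses to ask about vomiting. Swapping the top-posterior score for entropy reduction (information gain) is a common alternative, and looking more than one question ahead is where the Reinforcement Learning framing becomes relevant.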

score 1 · Answer 2 · answered Jul 12 '18 at 18:39

Feature Extraction

Patterson and Gibson's Deep Learning: A Practitioner's Approach, O'Reilly, 2017 states, "Convolutional Neural Networks (CNNs) ... consistently top image classification competitions," which is consistent with our experience in the lab. If your data is multi-dimensional in that pain is on a scale from one to ten, fever is in degrees, and smell can be a result of blood components which can be quantified in lab reports, you can have a hypercube that can be treated just as frames in a movie can. Movie learning is in ℝ⁴, the third being frame index and the fourth being sample index. With subjective pain, digital thermometer temperature, and three blood component concentrations, you have {P, T, C₁, C₂, C₃} and learning in ℝ⁶ for your CNN design.

Selecting Input Channels

Asking 100 questions and taking 10 blood panels is probably prohibitive. So you will need to stuff all the data from limited questioning and panels into a hypercube and find an approach that can similarly extract features from sparse input data. The weights leading from the input to the feature layers will then identify the questions from which the most important features can be extracted. Searching scholarly articles for "Feature extraction sparse data" will turn up a large number of options.

Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms, B Zheng, SW Yoon, SS Lam - Expert Systems with Applications, 2014 - Elsevier may be particularly interesting, given the common domain.

Outcomes Analysis

The above is a limited approach because the loop is not closed. Only if the outcomes of treatment are used to produce labels or a real time (over the course of months or years) reinforcement will the system produce an optimization that is meaningful. Unsupervised learning for this particular problem is not likely to produce any significant improvement in treatment efficacy.

Thanks @FauChristian for your answer. My issue is more in the categorical domain, since I can only hear the symptom from the patient and ask him about other symptoms. No blood test or fever measurement is available, only yes/no answers: "Fever?" - "No", "Vomiting?" - "Yes", etc. — Latent, Jul 15 '18 at 07:49
