How can I approach this problem of producing a 3-bit binary string given a sequence of letters?

Question

Suppose I have the following dataset:

... ...
... ...
AABBB  7.027  5.338  5.335  8.122  5.537  6.408
ABBBA  5.338  5.335  5.659  5.537  5.241  7.043
BBBAA  5.335  5.659  6.954  5.241  8.470  8.474
BBAAA  5.659  6.954  5.954  8.470  9.266  9.334
BAACA  6.954  5.954  6.117  9.266  9.243 12.200
AABAA  5.954  6.117  6.180  9.243  8.688 11.842
ACAAA  6.117  6.180  5.393  8.688  5.073  7.722
ABAAC  6.180  5.393  6.795  5.073  8.719  7.854
BAACC  5.393  6.795  5.796  8.719  9.196  9.705
... ...
... ...

Apparently, the feature values represent a string pattern consisting of only three letters: A, B, and C.

I have to design a neural network that detects these patterns and outputs a binary representation of the strings, where each letter is encoded in 3-bit binary (one-hot encoding).

My first question is: what kind of problem is this, and why?

My next question is: how should I approach solving it?

Is the length of the strings consistent? If it isn't, use an RNN. Otherwise, use crinix's answer. – Recessive, Jul 13 '21 at 07:33
@Recessive, is it a multi-label classification problem? Or is it a multinomial logistic regression problem? – user366312, Jul 13 '21 at 07:42

score 2 · Answer 1 · edited Jul 14 '21 at 00:30

If you're trying to predict the string pattern from the numerical features, and the string pattern has a fixed length, you can one-hot encode each letter and then concatenate the encodings (the resulting array is no longer one-hot). For example, AABBC becomes:

[1,0,0,1,0,0,0,1,0,0,1,0,0,0,1] <- Use this for training
[A,B,C,A,B,C,A,B,C,A,B,C,A,B,C]
[A,_,_,A,_,_,_,B,_,_,B,_,_,_,C]
AABBC

where each triplet represents a single letter. You can then train a network with a cross-entropy loss. This formulation is multi-task learning: the network predicts several things (here, one letter per position) simultaneously. It is a classification problem.
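As a rough sketch of the encoding described above, assuming fixed-length strings over the letters A, B, and C only: the helper names `encode`, `decode`, and `per_position_cross_entropy` are made up for illustration and are not from any library. The loss treats each triplet as its own 3-class softmax and sums the cross-entropies, which is one plausible way to "train with cross-entropy" on this target; a real network would compute this inside a framework such as PyTorch or Keras.

```python
import math

ALPHABET = "ABC"  # assumption: these are the only letters that occur


def encode(s):
    """Concatenate one one-hot triplet per letter ('AABBC' -> 15 values)."""
    vec = []
    for letter in s:
        triplet = [0] * len(ALPHABET)
        triplet[ALPHABET.index(letter)] = 1
        vec.extend(triplet)
    return vec


def decode(vec):
    """Pick the largest entry of each triplet and map it back to a letter."""
    k = len(ALPHABET)
    letters = []
    for start in range(0, len(vec), k):
        triplet = vec[start:start + k]
        letters.append(ALPHABET[triplet.index(max(triplet))])
    return "".join(letters)


def per_position_cross_entropy(logits, target):
    """Sum of softmax cross-entropies, one per letter position.

    logits: raw network outputs, one value per entry of the target
    target: vector produced by encode()
    """
    k = len(ALPHABET)
    total = 0.0
    for start in range(0, len(logits), k):
        z = logits[start:start + k]
        m = max(z)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        true_idx = target[start:start + k].index(1)
        total += log_norm - z[true_idx]
    return total


target = encode("AABBC")
print(target)          # [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(decode(target))  # AABBC
```

At prediction time, `decode` can be applied directly to the network's 15 outputs, since it only needs the largest value in each triplet.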

answered Jul 13 '21 at 05:04 by meliksahturker · edited Jul 14 '21 at 00:30 by Recessive

Is it a Multi-label Classification problem? Or, is it a multinomial logistic regression problem? – user366312 Jul 13 '21 at 07:43
It is not multinomial logistic regression or multi-label classification, which are the same in their essence, where there are more than 2 classes to predict. It can be said to be multi-multinomial logistic regression. Its proper name is multi-task learning. – meliksahturker Jul 13 '21 at 10:55
Is multi-task learning synonymous with multi-output learning? – user366312 Jul 15 '21 at 12:35
@user366312 I can't find a definition for multi-output learning anywhere, but maybe you mean multi-instance learning? In which case, no, they are not equivalent. I think what may be confusing you is one small detail - you should split each character classifier into its own cost function. So on the final layer, you'll have 5 classifiers, one for each of characters 1, 2, 3, 4 and 5 in the sequence. Then you can one-hot encode the target for each classifier and use a simple softmax cross-entropy setup. – Recessive Jul 17 '21 at 02:14
@Recessive, [Scikit-learn 84:Supervised Learning 62: Multiclass & Multioutput](https://youtu.be/-wqM_ODs-8s?list=PLXovS_5EZGh4_ThQVgO2boGf31Dqs5vzm) – user366312 Jul 17 '21 at 12:50
