
Most tutorials only teach us to split the whole dataset into three parts: a training set, a development set, and a test set. In industry, however, we work in a way that resembles test-driven development, and what matters most is how we build our test set. We are not handed a large corpus up front; instead, we start by discussing how to build the test set.

The most resource-efficient method is to simply sample (simple random sampling) cases from the log and have them labeled, so that the sample represents the population. If we are concerned that some groups are more important than others, we do stratified sampling instead.
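To make the two options concrete, here is a minimal sketch (the log columns, segment names, and sampling fractions below are assumptions for illustration, not something from our actual setup):

```python
import pandas as pd

# Hypothetical request log; the column names are made up for this sketch.
log = pd.DataFrame({
    "query": [f"query_{i}" for i in range(10_000)],
    "segment": ["head"] * 9_000 + ["tail"] * 1_000,  # e.g. frequent vs. rare intents
})

# Simple random sampling: every logged case has the same chance of being picked.
simple_sample = log.sample(n=500, random_state=42)

# Stratified sampling: draw a chosen fraction from each group, e.g. oversample
# the rare "tail" segment so it is well represented in the labeled test set.
fractions = {"head": 0.02, "tail": 0.20}
stratified_sample = (
    log.groupby("segment", group_keys=False)
       .apply(lambda g: g.sample(frac=fractions[g.name], random_state=42))
)

print(simple_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
```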

Are there any better sampling methods?

What should we do when we are releasing a new feature and cannot find any cases of that feature in the user log?

Lerner Zhang

1 Answer


I am not sure whether this solves your problem at hand, but one approach you could look into is k-fold Cross-Validation (CV). In this approach, you split your combined training, development, and test data into $k$ randomized, equally sized partitions. Afterwards, you train and evaluate your model $k$ times. In the $i^{th}$ iteration, you train your model on all but the $i^{th}$ partition; once training on those $k-1$ partitions is done, you evaluate the model on the $i^{th}$ partition. You repeat this for all $i \in \{1, 2, \dots, k\}$. To be clear, you keep the initially randomized partitions fixed during k-fold CV. Then, you take the average performance over all $k$ test runs to assess the quality of your model. Afterwards, you could train your model on all of the training, development, and test data and deliver the resulting model as your final one. In the most extreme case, you would perform Leave-One-Out CV, where you set $k$ equal to the number of data points at your disposal. That is the most expensive approach, but it yields the most accurate performance estimates. For more information, see this website.
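A minimal sketch of that procedure with scikit-learn (the dataset, model, and metric below are placeholders chosen purely for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)  # fixed randomized partitions

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on the k-1 partitions, evaluate on the held-out i-th partition.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy over {k} folds: {np.mean(scores):.3f}")

# Afterwards, one could retrain on all available data and ship that model.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```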

Generally, using that approach, you don't waste any data by reserving it exclusively for development/testing. Also, it might be important to mention that this approach is compatible with other sampling techniques as well. For example, in the $i^{th}$ iteration of your CV algorithm, you could apply stratified sampling to your $k-1$ partitions used for training during that ($i^{th}$) iteration.
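One off-the-shelf way to combine the two ideas is scikit-learn's StratifiedKFold, which builds the folds themselves so that each one roughly preserves the class proportions (again just a sketch on placeholder data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold keeps roughly the same class proportions as the full dataset,
# so rare classes show up in both the training and the test partitions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(f"mean accuracy: {scores.mean():.3f}")
```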

I am not entirely sure whether I get the second part of your question right.

If it is about how to introduce the new feature to an existing model later on, I would say the following. When it comes to introducing new features, I think you are pretty much out of luck with respect to recycling your old model. Of course, assuming that adding new features to the existing model is technically possible at all, there might be types of models that allow continual learning under certain circumstances. But in the worst case that might cause catastrophic forgetting, since adding new features changes the distribution of your underlying training data, which not all models can deal with. An example of this is adding more diverse training images for a given Convolutional Neural Net (CNN), which the CNN then has to learn to map to an already existing set of classes. In other cases, introducing new features might not even be technically possible, e.g. if it would require adding new input (or output) nodes to an existing model.

However, if the second part of your question asks how to fill gaps in your older data caused by missing values, there are different strategies you could try for imputing the missing data, some of which are briefly mentioned here.
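For example, one of the simplest strategies is mean imputation; here is a minimal sketch with scikit-learn's SimpleImputer (the toy matrix is made up, and median, constant, or model-based imputers are common alternatives):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix where the second feature is missing for some older rows.
X = np.array([
    [1.0, np.nan],
    [2.0, 3.0],
    [4.0, 5.0],
    [6.0, np.nan],
])

# Replace each missing entry with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```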

Daniel B.
  • Your description of k-fold CV seems good to me, apart from one thing that is not clear to me: in k-fold cross-validation, as far as I remember, you have only 2 datasets (a training dataset and a test dataset), so I guess the validation dataset doesn't come into play here. Given that you read that article, you may currently be more familiar with the details of k-fold CV, so tell me if I'm wrong. Anyway, I'm not sure whether the question was about assessing the quality (e.g. the generalization) of the model or about how to train the model given a fixed dataset. – nbro Jan 23 '21 at 23:07
  • The question seems to be more about "Which sampling techniques should I use to build a test dataset". I guess the question arises because the OP may not know how to build the test dataset (to test, I suppose, the generalization ability of the model) given a stream of data or something like that (they mention a "log"). It seems that they are also concerned with the work that needs to be done to label the data from the log, so, in this regard, your second paragraph seems to address that. So, maybe you could edit your answer to address the question in the title more directly. – nbro Jan 23 '21 at 23:08
  • In any case, I could be wrong about what the main question here is or what the main issues of the OP are. – nbro Jan 23 '21 at 23:10
  • If Lerner clarifies what exactly he means, I could of course delete the second part if it's not relevant. Regarding the first part, I only mentioned the dev/validation dataset since it was mentioned in the original question as well. I just wanted to state that actually *all* available data is used for performing k-fold CV. – Daniel B. Jan 23 '21 at 23:17
  • Ok, let's then wait for some clarification. – nbro Jan 23 '21 at 23:26