How does Knowledge Distillation help Federated Learning?

Question

As I understand it, in a typical FL setup there is a global server that interacts with various client devices. The server and each client hold an ML model. The clients update their models locally and then send their weights to the server, where they are averaged.
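To make that understanding concrete, here is a minimal sketch of the server-side averaging step. It assumes each client's model is flattened into a plain list of floats and that the average is weighted by local dataset size (as in FedAvg); the function name `federated_average` and the toy numbers are my own, not from the paper.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each client by its dataset size.

    client_weights: list of equal-length lists of floats (one per client).
    client_sizes:   number of local training samples per client (assumption:
                    size-weighted averaging; a plain mean is the special case
                    where all sizes are equal).
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Toy example: two clients, two parameters each.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
global_weights = federated_average(clients, sizes)
print(global_weights)  # [2.5, 3.5]
```

The point to notice is that the payload each client uploads is as large as the model itself.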

The paper "Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data" contains the following passage: "To rectify this, each device in FD stores per-label mean logit vectors, and periodically uploads these local-average logit vectors to a server. For each label, the uploaded local-average logit vectors from all devices are averaged, resulting in a global-average logit vector per label."
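Here is a minimal sketch of how I read that passage, assuming each device has already run its local model over its own samples and kept the output logit vector for each one. The helper names (`local_mean_logits`, `global_mean_logits`), the unweighted average across devices, and the toy numbers are my assumptions, not the paper's code.

```python
def local_mean_logits(logits, labels):
    """Per-label mean of the logit vectors a device's model produced.

    logits: list of logit vectors, one per local sample.
    labels: ground-truth label of each sample.
    Returns {label: mean logit vector over samples with that label}.
    """
    sums, counts = {}, {}
    for vec, y in zip(logits, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], vec)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def global_mean_logits(per_device):
    """Server side: average the uploaded per-label vectors across devices.

    Assumption: a plain mean over the devices that reported each label.
    """
    all_labels = set().union(*[d.keys() for d in per_device])
    result = {}
    for y in all_labels:
        vecs = [d[y] for d in per_device if y in d]
        result[y] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return result


# Toy example: two devices, two classes.
device_a = local_mean_logits([[2.0, 0.0], [4.0, 2.0], [0.0, 3.0]], [0, 0, 1])
device_b = local_mean_logits([[1.0, 1.0], [0.0, 5.0]], [0, 1])
global_logits = global_mean_logits([device_a, device_b])
print(global_logits[0])  # [2.0, 1.0]
print(global_logits[1])  # [0.0, 4.0]
```

Under this reading, each device uploads one vector of length num_classes per label, so the payload depends on the number of classes rather than on the model size.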

I am really lost as to what one can do with the "mean logit vectors" of a label. To me, that sounds like saying that a dataset consists of 2 labels, with the first label appearing 40% of the time and the second 60%. How does this help with prediction? Perhaps my understanding is wrong here.
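One possible reading, which I have not verified against the paper, is that a per-label mean logit vector acts as a "teacher" output in ordinary knowledge distillation: while training on a sample with label y, the local model is also pulled towards the averaged output for y. The sketch below shows that generic idea only; the function names, the weighting factor `gamma`, the softmax applied to the teacher vector, and the numbers are all my assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


def distillation_loss(student_logits, label, teacher_logits_by_label, gamma=0.5):
    """Hard-label loss plus a distillation term towards the teacher vector.

    teacher_logits_by_label: {label: averaged logit vector} (hypothetical input).
    gamma: weight of the distillation term (hypothetical hyperparameter).
    """
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[label])  # usual cross-entropy with the true label
    teacher_probs = softmax(teacher_logits_by_label[label])
    soft = cross_entropy(teacher_probs, student_probs)  # pull towards the teacher
    return hard + gamma * soft


teacher = {0: [2.0, 1.0], 1: [0.0, 4.0]}
agreeing = distillation_loss([2.0, 1.0], 0, teacher)
disagreeing = distillation_loss([1.0, 2.0], 0, teacher)
print(agreeing < disagreeing)  # True
```

If this reading is right, the vector is not a label frequency: it summarises how the models score every class on samples of one particular label.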

