Today’s paper: Emerging properties in self-supervised vision transformers by Mathilde Caron et al. Let’s get the dinosaur out of the room: the name DINO refers to self-distillation with no labels. The self-distillation part refers to self-supervised learning in a student-teacher setup as is often seen for distillation. However, the catch is that in contrast to normal distillation setups where a previously trained teacher network is training a student network, here they work without labels and without pre-training the teacher.
Today’s paper: Rethinking ‘Batch’ in BatchNorm by Wu & Johnson BatchNorm is a critical building block in modern convolutional neural networks. Its unique property of operating on “batches” instead of individual samples introduces significantly different behaviors from most other operations in deep learning. As a result, it leads to many hidden caveats that can negatively impact model’s performance in subtle ways. This is a citation from the paper’s abstract and the emphasis is mine which caught my attention.
Label noise in digital Pathology In the field of digital pathology and other health related deep learning applications, label noise is an important challenge to consider during training. It’s inherent to the medical fields as the problems are extremely challenging even for trained experts, so there is high intra- as well as inter-observer variability. This blog post dives into the idea of the paper P-DIFF: Learning Classifier with Noisy Labels based on Probability Difference Distributions which is authored by researchers of Microsoft in China.
Label noise introduction Training machine learning models requires a lot of data. Often, it is quite costly to obtain sufficient data for your problem. Sometimes, you might even need domain experts which don’t have much time and are expensive. One option that you can look into is getting cheaper, lower quality data, i.e. have less experienced people annotate data. This usually has the side effect of your labels becoming more noisy.
In many neural network architectures like MobileNets, depthwise separable convolutions are used instead of regular convolutions. They have been shown to yield similar performance while being much more efficient in terms of using much less parameters and less floating point operations (FLOPs). Today, we will take a look at the difference of depthwise separable convolutions to standard convolutions and will analyze where the efficiency comes from. Short recap: standard convolution In standard convolutions, we are analyzing an input map of height H and width W comprised of C channels.
Follow me on twitter! Follow @mpaepper