Gidi Shperber

A different kind of (deep) learning: part 2

Updated: Oct 7, 2019

Self Supervised learning: generative approaches.



Intro

In the previous post, we discussed some self-supervised learning papers, along with some attempts to strive towards the “holy grail”: exploiting the almost unlimited number of unannotated images available everywhere in order to generalize to other tasks, and hopefully get closer to the currently unmet benchmark of ImageNet pre-training.

Surprisingly, or perhaps not so surprisingly, we got some extra tailwind from Yann LeCun, who devoted a few minutes of his NeurIPS talk (“The Next Step Towards Artificial Intelligence”) to self-supervised learning. He described self-supervised learning as “the body of the cake”, where the topping is supervised learning and the cherry is reinforcement learning (because of the sparsity of the reward in RL). LeCun also borrowed some slides from our favorite self-supervised researcher, Alyosha Efros, so I’m happy we share interests.

Additionally, a few readers have pointed out that there is also a prominent body of self-supervised work dealing with videos, which is true. However, videos will be discussed in a later post, since this one covers another topic: generative models.


What do generative models have to do with self supervision

In his talks about self supervision, Efros (yes, he is going to be dominant in this post as well) frequently discusses the difficulty in finding the right loss function for self supervised tasks.

In the previous post we examined the special classification loss used for the colorization task, and emphasized how hard it is to find the right loss function for such tasks.

In the talk, Efros described the method for finding such loss functions. He called it “graduate student descent”. In other words, there is a lot of trial and error in finding a good loss function for these models. So is there a better, more universal way of finding them?

Additionally, there is the colorization “Turing test”: to evaluate the results, researchers asked Mechanical Turk workers to tell real photos from fake ones. So, ideally, we would like some kind of mechanism that can tell these two types of images apart.

If you were into deep learning back in 2014, you probably remember that when Ian Goodfellow presented his groundbreaking GAN work for the first time, the community was very excited about the promising generative abilities, but many researchers were skeptical about the purpose of this work. To them, it was merely a toy, at least until some significant progress was made.

The self-supervised researchers had some different thoughts: The GAN, in their eyes, was potentially a custom loss function for the self-supervised tasks.

Let’s think about it for a second: in the colorization work, we used the standard deep learning paradigm to predict a color for each pixel. Can we use the power of the GAN discriminator as a custom loss? If so, it will require structuring the problem in a different way.

We know that a GAN, in its essence, generates images from a completely random distribution. What if we could make it generate a colored image given a black and white image, using the discriminator to evaluate the result?

This requires some change in paradigm: generating images from something other than a completely random distribution was done by the conditional GAN: adding a feature to the generator makes it produce some subset of the target space, e.g., a specific digit from the MNIST dataset. But if we can use a scalar (a digit) as the “condition”, we can use a vector as well. And if we can use a vector, we can also use a tensor. And an image is merely a tensor, isn’t it?

So here is the idea: train a conditional-GAN-like network in which the condition (as well as the input to the generator) is a black and white image, constraining the output to be a colored version of it.
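To make this concrete, here is a minimal sketch in PyTorch (layer sizes and module names are made up for illustration, not taken from any specific paper) of how such a conditional setup could be wired: the generator only ever sees the grayscale image, and the discriminator judges (grayscale, color) pairs, effectively acting as a learned loss function.

```python
# A minimal sketch of a conditional-GAN-style colorizer (illustrative only).
import torch
import torch.nn as nn

class ColorizerG(nn.Module):
    """Generator: maps a 1-channel grayscale image to a 3-channel color image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.ReLU(),            # encoder
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(), # decoder
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, gray):
        return self.net(gray)

class PairD(nn.Module):
    """Discriminator: judges a (grayscale, color) pair, concatenated along channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + 3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),   # patch-wise real/fake scores
        )

    def forward(self, gray, color):
        return self.net(torch.cat([gray, color], dim=1))

# The "custom loss": the discriminator scores how plausible the colorization looks.
gray = torch.randn(8, 1, 64, 64)          # batch of grayscale inputs
real_color = torch.randn(8, 3, 64, 64)    # their true color versions
G, D = ColorizerG(), PairD()
fake_color = G(gray)
d_real = D(gray, real_color)              # pushed towards "real" when training D
d_fake = D(gray, fake_color)              # pushed towards "fake" when training D
```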


Pix2pix

Phillip Isola, a student of Efros who was also involved in the previously discussed colorization work, took on this task in the paper “Image-to-Image Translation with Conditional Adversarial Networks”, nicknamed Pix2pix. This required some serious tweaking of the GAN architecture: first, using an encoder-decoder architecture for the generator. Second, the discriminator can’t just get randomly paired images from the dataset and the generator; it should be fed with strict pairs, the black and white input together with either the original RGB image or the colorization generated from it. The discriminator architecture and training schedule are also different from the standard ones. You can read a nice explanation about it in this post. It is evident that a significant amount of hard work was put into this paper.
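One detail worth highlighting: the adversarial term alone is not enough, and the pix2pix objective also adds an L1 reconstruction term that keeps the output close to the ground-truth pairing. Here is a rough sketch of the generator-side loss (function and argument names are mine, not the paper’s code; the weight of 100 follows the paper’s reported setting):

```python
# Sketch of a pix2pix-style generator objective: adversarial term + L1 reconstruction.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_scores, fake_color, real_color, lambda_l1=100.0):
    # Adversarial part: the generator wants the discriminator to call its output "real".
    adv = F.binary_cross_entropy_with_logits(
        d_fake_scores, torch.ones_like(d_fake_scores))
    # Reconstruction part: stay close to the ground-truth paired image.
    rec = F.l1_loss(fake_color, real_color)
    return adv + lambda_l1 * rec
```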

But Isola went one step further. He probably said to himself: well, if I succeeded in building a colorizing GAN, which learns from pairs of images, why can’t I apply it to different kinds of pairs? What about:

- Google Maps and Google Earth image pairs

- Building facade labels and actual building facades

- Edges and objects

And so on. And it all worked.

This became one of the most interesting deep learning works of recent years (which, for deep learning, means ever), and it triggered something that Efros called “Twitter-driven research”: since the code of the paper was readily available on GitHub, many people trained it on various pairings of images and reached some highly creative results.

You may also find more by looking for #pix2pix on Twitter. Efros said that these projects amazed him, brought many new ideas, and took their research many steps forward.

OK, we got a bit carried away. All this GAN excitement opened a variety of new options. However, on the route to innovation a small detail was lost: the self-supervised paradigm, which intended to use a self-supervised model for transfer learning, was neglected along the way and wasn’t even mentioned in this paper. Perhaps the generator architecture was too different to try this, or perhaps the importance of the generative results overshadowed the potential of yet another partially successful self-supervised attempt.


BiGAN

Well, the pix2pix work is a “natural successor” of the colorization and context works from the first post. But there was another work that actually did try to apply transfer learning on a GAN network: the bi-directional GAN, or BiGAN.

BiGAN presented a new concept (back then, in 2016): alongside the standard GAN architecture, an encoder is added (see below). This encoder is intended to learn the inverse of the generator, and can serve different purposes.

The work takes a very interesting approach: it keeps the standard GAN architecture, but instead of presenting the discriminator with x (the real image) and G(z) (the generated image, where z is the random input to the generator), the discriminator is fed with two pairs: (x, E(x)) and (G(z), z). Here E(x) is the output of the encoder, which tries to recover the latent code for a real image. Interestingly, no knowledge is shared directly between the generator and the encoder.

This is a bit tricky to grasp, so you can read further details in the paper, including formal (and intuitive) explanations of why the encoder and generator must learn to invert one another in order to fool the discriminator.
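To illustrate the pairing, here is a tiny sketch of what the discriminator actually receives (fully connected layers just for readability; the real BiGAN uses convolutional networks):

```python
# Illustrative BiGAN-style pairing: the discriminator sees joint (image, latent) pairs.
import torch
import torch.nn as nn

z_dim, feat_dim = 128, 256

E = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, z_dim))        # encoder: x -> E(x)
G = nn.Sequential(nn.Linear(z_dim, 3 * 32 * 32), nn.Tanh())           # generator: z -> G(z)
D = nn.Sequential(nn.Linear(3 * 32 * 32 + z_dim, feat_dim), nn.LeakyReLU(0.2),
                  nn.Linear(feat_dim, 1))                              # joint discriminator

x = torch.randn(16, 3, 32, 32)      # real images
z = torch.randn(16, z_dim)          # random latent codes

real_pair = torch.cat([x.flatten(1), E(x)], dim=1)   # (x, E(x))
fake_pair = torch.cat([G(z), z], dim=1)              # (G(z), z)
d_real, d_fake = D(real_pair), D(fake_pair)          # D must tell which pairing is "real"
```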

Although there is no conditional element here (z is not a label), the encoder can be used for classification (and hence detection and segmentation, after switching some layers) with transfer learning. The results may be described as “reasonable”.
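The transfer recipe itself is the usual one. A minimal sketch, assuming the encoder has already been trained with the BiGAN objective: freeze it and train only a small supervised head on top.

```python
# Sketch: reuse the trained encoder as a frozen feature extractor for a downstream classifier.
import torch
import torch.nn as nn

num_classes = 10
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stands in for the trained E
for p in encoder.parameters():
    p.requires_grad = False              # freeze the self-supervised features

head = nn.Linear(128, num_classes)       # only this small head is trained on labeled data
logits = head(encoder(torch.randn(4, 3, 32, 32)))
```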

If you feel that similar ideas appear in the BiGAN and Pix2pix works (I should mention that BiGAN came out earlier), it is not a coincidence. The successor to pix2pix, CycleGAN, was a combination of the two: it allowed training such networks without “paired” training images, significantly expanded the range of transferable objects, and produced the famous zebra-to-horse transfer (and back).


Cross channel encoder

So we’ve seen that GANs have great potential (somehow yet to be fulfilled) in self-supervised learning. But what about their older, currently less popular counterpart, the autoencoder?

Indeed, autoencoders have reached good results on some tasks, but they tend to suffer from information loss through their layers.

They did have some success in self supervised learning, though.

In our discussion of colorization in the previous post, we mentioned that colorization is actually a cross-channel encoder: it uses some channels to predict others. What if we give real autoencoders a chance here, and define the channels a bit differently?

More specifically, by trying to reconstruct one half of the image from the other half. The following work, named “split-brain”, does exactly this: it defines the task of bisecting an image diagonally and using the autoencoder to predict one half from the other.

Going further, every image can be used twice, with each half used to predict the other.

Seeing this method work reasonably well for the diagonal bisection, the researchers went back to color, predicting back and forth: color from black and white and vice versa, specifically working in Lab color space (predicting the ab color channels from the L lightness channel, and L from ab).
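Here is a small sketch of the cross-channel idea in Lab space (illustrative only: the actual split-brain work predicts quantized channel distributions with a classification loss, while this toy version uses plain regression):

```python
# Toy cross-channel encoder in Lab space: predict ab from L, and L from ab.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_net(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_ch, 3, padding=1),
    )

l_to_ab = conv_net(1, 2)   # predict ab (color) from L (lightness)
ab_to_l = conv_net(2, 1)   # predict L from ab

lab = torch.randn(8, 3, 64, 64)      # stand-in for a batch of Lab images
L, ab = lab[:, :1], lab[:, 1:]

loss = F.mse_loss(l_to_ab(L), ab) + F.mse_loss(ab_to_l(ab), L)
loss.backward()                      # every image supervises both directions "for free"
```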


Summary and Evaluation

Following the previous post, I got some questions regarding the actual transfer learning results of the self-supervised models. As said there, the important feature is generalization to different tasks, e.g. detection and segmentation. I intentionally don’t put too much emphasis on these results, since they are quite fragile, and they consistently trail their goal, ImageNet pre-training, by roughly 10%.

The table above is taken from the rotation work, which is, perhaps surprisingly, the current “state of the art” in self-supervised transfer learning. Most of the other papers discussed here appear in it as well. However, this should be taken with a grain of salt, since:

- The rotation paper was not highly rated by reviewers and

- There are some data-presentation tricks in many of these papers that made them seem leading at the time of publication, but not really useful in real life.


So once again, it seems there is a lot of potential, especially in the custom loss function idea, but the results are not “there” yet. However, we still have some reasons to be optimistic: fortunately, the visual signal is not limited to images, but is also found in… videos. Videos add the important dimension of time, which in turn adds an immense number of new possible tasks, paradigms and options, and eventually, some real results(!). This and more will be discussed in the next post, so stay tuned.

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @shgidi


Series links:

- Self Supervised learning: generative approaches (this post)

- Self supervised learning from videos (future post)
