SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Appendix 정리 안 됨

연관 포스트:

Abstract
contrastive self-supervised learning without specialized architectures or a memory bank

predictive task에 data augmentation이 중요함

representation과 contrastive loss를 계산 중간에 learnable nonlinear transformation을 넣으면 representation이 더 좋아짐

contrastive learning은 supervised learning보다 큰 batch size, 더 긴 training step에 더 이득

1. Introduction

supervision 없이 학습하는데 주류는 2 가지: Generative or Discriminative
Generative
- 만들어내거나 input space에 pixel을 모델링
- pixel-level generation은 expensive하고 representation learning에 필수적인 건 아닐 수 있음
Discriminative
- objective function으로 representative
- supervised learning과 비슷하지만, pretext task를 수행하여 input과 label이 unlabeled dataset으로부터 나오도록 학습
SimCLR
- data augmentation의 다양한 조합이 contrastive prediction task에 중요함을 보임
- supervised보다 unsupervised contrastive learning이 data augmentation에 더 영향을 많이 받음
- representation과 contrastive loss를 계산 중간에 learnable nonlinear transformation을 넣으면 representation이 더 좋아짐
- normalized된 embeddings과 적절한 temerature parameter $\tau$ 가 contrastive cross entropy loss에 유리
- 더 큰 batch size와 긴 training이 supervised보다 contrastive learning에 더 유리
- supervised와 contrastive learning 둘 다 deeper & wider networks에 더 잘 학습됨

2. Method

1) The Contrastive Learning Framework

latent space에서 같은 data를 다르게 augmentation한 view들의 agreement 최대화

4가지 요소로 구성
1. stochastic data augmentation module
  - 같은 example에서 random하게 생성된 두 개의 상호연관된 2개의 view들($\tilde{x}_i$, $\tilde{x}_j$)은 positive pair
  - 3 가지 simple augmentation 순서대로 적용 : random cropping(and resize to original size) > random color distortions > random Gaussian blur
  - radom crop & color distortion이 가장 결정적이었음 (3.Data Augmentation for Contrastive Representative Learning에 더 자세히 설명)
2. base encoder $f(\cdot)$
  - augmented data에서 representation vector 추출
  - ResNet 사용
  - $h_i=f(\tilde{x}_i)=ResNet(\tilde{x}_i)$
  - $h_i \in \mathbb{R}^d$는 average pooling layer의 output
3. projection head $g(\cdot)$
  - contrastive loss를 적용할 space로 representations mapping
  - hidden layer 1개 사용
  - $z_i=g(h_i)=W^{(2)}\sigma(W^{(1)}h_i)$
  - $\sigma$ = ReLU
  - $h_i$에서보다 $z_i$에서 contrastive loss 계산하는게 더 좋음 (4.Architectures for Encoder and Head)
4. contrastive loss function
mini-batch 구성
- N 개의 examples를 random sampling한 후에, augmented pair로 만들어서 최종적으로 2N 개의 data points로 구성된 mini-batch를 듦
- positive pair만 주고, 나머지 $2(N-1)$개는 negative examples
- NT-Xent: Normalized Temperature-scaled Cross Entropy Loss
\[\mathcal{L}_{CL}=-\sum_{i,j}^Nlog\frac{exp(sim(\mathbf{z}_i, \mathbf{z}_j)/ \tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k \neq i]} exp(sim(\mathbf{z}_i, \mathbf{z}_j)/ \tau)}\]
- $\mathbf{1}_{[k \neq i]}\in 0,\ 1$ : $k \neq i$일 때만 1 아니면 0
- $\tau$ : temperature parameter (default 0.5)
- $\mathbf{z}$ : encoded representation
- $N$ : mini-batch size
- $(i,\ j)$: positive pairs
- $sim(\mathbf{u}, \mathbf{v})$ : $sim(\mathbf{u}, \mathbf{v}) = \mathbf{u}^T\mathbf{v} / \Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert$ cosine similarity, dot product between two $\mathcal{l}_2$ normalized vector $\mathbf{u}$ and $\mathbf{v}$

2) Training with Large Batch Size

memory bank 사용 안함
Batch size N = 256 ~ 8192
N = 8192면, 하나의 positive pair에 대해 16382개 negative example 존재
batch size가 커지면 standard SGD/Momentum with linear learning rate scaling을 쓰면 학습이 불안정해짐
- 모든 batch size에 대해 LARS optimizer 사용
Global Batch Normalization
- BN은 local한 하나의 device에 대한 mean/variance
- positive pair는 같은 device에 존재하기에 다른 device에서 information leakage 발생 가능
- (1) 모든 devices에 대한 BN 계산 « SimCLR에서는 이 방법 사용
- (2) devices 전체적으로 data shuffling
- (3) BN 대신 layer norm 사용

3) Evaluation Protocol

Dataset and Metrics
- ImageNetILSVRC-2012 dataset / CIFAR-10 사용
- represetation 테스트를 하기 위해 freeze한 base network 위에 linear classifier 추가

Default setting

방법	내용
augmentation	random cropping(and resize to original size) > random color distortions > random Gaussian blur
base encoder	ResNet-50
projection head	2-layer MLP to 128-dimensional latent space
loss	NT-Xent
optimizer	LARS with lr 4.8 $(=0.3\times batch_size/256)$
weight decay	$10^{-6}$
linear warmup	for 10 epochs
scheduler	cosinedecay without restarts

3.Data Augmentation for Contrastive Representative Learning

1) Conposition of data augmentation operation is crucial

Conposition of data augmentation operation is crucial for learning good representation
spatial/geomatric 혹은 appearance transformation이 존재하고 아래 이미지들이 SimCLR에서 비교 및 분석한 실험들

이미지 크기가 다르기 때문에 항상 crop » crop의 영향을 알 수 없다는 문제
- crop에 대한 영향을 알기 위해, 항상 random crop하고 같은 해상도로 resize
- 그 후, 하나의 branch에만 target augmentation 진행
- performance는 안 좋아지지만 각각에 대한 영향 파악 가능

윗줄(figure 4 이미지)과 아랫줄은 각각 다른 이미지

대각선은 augmentation 한 개만 함:
contrastive task에서 positive pair를 찾을 수 있지만, representation 결과가 안 좋음
하지만 2개 이상으로 사용했을 때 문제는 어려워지지만, representation은 좋아짐
조합 중에서 random cropping + color distortion 결과가 압도적으로 좋았음
- random cropping을 해도 pathch들 간의 color histogram은 유사(figure 6에서 (a)의 각 줄의 분포 유사)
- color distortion만으로 image가 다름 판단 가능
cropping과 color distortion은 generalize feature 학습에 중요

2) Stronger data augmentation than supervised learning

Contrastive learning needs stronger data augmentation than supervised learning
supervised와 다르게 color augmentation을 더 강하게 해줄수록 linear evalution 성능이 좋아짐
AutoAug: augmentation policy인데, unsupervised에서는 성능이 더 떨어짐
supervised에서 accuracy에 도움이 안 된 augmentation일지라도 contrastive learning에서는 중요할 수도 있음

4.Architectures for Encoder and Head

1) Bigger Models

Unsupervised constrastive learning benefits more from bigger models than its supervised counterpart
supervised보다 unsupervised contrastive learning이 model 크기에 더 민감하게 반응함

2) Nonlinear projection head

(1) identity mapping(None) (2) linear projection(Linear)
(3) default nonlinear with one additional hidden layer(+ReLU)(Non-linear)

결과

A nonlinear projection head improves the representataion quailty of the layer before it
projection head $g(\mathbf{h})$ 의 중요성
None < linear < Non-linear 순
projection dimension에 관계없이 projection head를 쓰면 결과 값은 비슷
non-linear projection을 사용하면, $\mathbf{h}$ 이 $\mathbf{z}=g(\mathbf{h})$ 보다 더 좋은 결과(> 10%)를 보임
projection head 이전의 representation이 이후 layer의 representation보다 더 특징값을 잘 나타냄을 의미
- non-linear layer 이전의 특징값을 사용해야하는 이유

분석

가설
- contrastive loss에 의해 information loss가 일어나기 때문일 것임
- projection head $g(\cdot)$ 가 color나 object orientation과 같은 downstream task에 필요한 정보 제거 할 수 있음
- 하지만 그 덕분에 projection head $g(\cdot)$ 이전의 $\mathbf{h}$ 에 더 많은 정보가 담길 수 있음
증명용 실험
- $\mathbf{h}$ 와 $g(\mathbf{h})$ 이 pre-training할 때 적용된 transformation을 예측할 수 있는지
- $g(h)=W^{(2)}\sigma(W^{(1)}h)$, input & output dimension 2048로 동일
- $\mathbf{h}$ 에 transformation에 대한 정보가 더 많음을 확인 가능
Appendix B.4. Understanding the Non-linear Projection Head
- Figure B.3: linear projection matrix $\mathbf{z}=W\mathbf{h}$에 쓰인 $W \in \mathbb{R}^{2048\times 2048}$의 eigenvalues distribution
  >> 큰 eigenvalue가 엄청 적음 » low rank » 이미지 흐릿
- Figure B.4: $\mathbf{h}$ 와 $\mathbf{z}=g(\mathbf{h})$ 의 random 10 class의 t-sne
  >> $\mathbf{h}$ 가 $\mathbf{z}$ 보다 class들을 더 잘 분리함

5. Loss Functions and Batch Size

1) NT-Xent and temperature

Normalized cross entropy loss with adjustable temperature works better than alternatives
(1) gradient 비교하면, $\mathcal{l}_2$ normalization과 temperature를 같이 쓰면, 다른 example들을 효율적으로 weigth 할 수 있고,
적절한 temperature는 hard negatives에 대해 model이 학습하는 것을 도와줌
(2) cross-entropy와 다르게, negative의 상대적인 hardness로 weight하지 않음
loss function에 semi-hard negative mining 적용해야함
>> semi-hard negative mining 사용
전체 loss가 아닌 positive examples과 멀지만, loss margin 내에 있어 거리가 가까운 sample에 대해서 gradient 계산

<loss 비교>

loss의 정확한 비교를 위해 모든 loss function에 대해 $\mathcal{l}_2$ normalization 진행하고 가장 좋은 결과가 나오도록 hyperparameters 조정
NT_Xent가 가장 좋은 결과

<$\mathcal{l}_2$ normalization의 중요성과 temperature $\tau$ 비교>

normalization 안 하고, temperature $\tau$ scale이 적절하지 않으면 결과가 상당히 안 좋음
normalization하면 contrastive task 정확도는 높지만, representation은 linear evaluation보다 안 좋음

2) Larger batch size and Longer training

batch size 영향력

Contrastive learning benefits (more) from larger batch sizes and longer training
train epoch 수가 적으면 batch size가 클수록 좋음
하지만 training 시간이 길수록 batch size에 대한 영향력 감소
supervised learning에서는 batch가 큰 것에 대한 의미가 적지만,
contrastive learning에서는 큰 batch는 더 많은 negative example을 제공하고 수렴 속도도 빠름
학습을 더 오래하는 것도 똑같은 이유로 더 유리

6. Comparison with State-of-the-art

Wide Residual Networks ResNet-50의 3가지의 hidden layer widths($\times 1, \times 2, \times 4$)
즉, 각 convolution filter 수가 $\times 1, \times 2, \times 4$
SimCLR은 1000 epoch만큼 학습

<Linear Evaluation>

다른 self-supervised model과 비교

<Semi-supervised Learning>

sampling 1% or 10% of ILSVRC-12 training dataset

ILSVRC-12 training dataset의 1% 혹은 10% sampling
regularization 없이 전체 base model fine-tuning
전체 ImageNet dataset으로 SimCLR ResNet-50(2 $\times$, 4 $\times$) fine-tuning하면, 처음부터 학습했을 때보다 최대 2% point정도 더 정확도가 높았음

<Transfer Learning>

fixed feature extractor & linear evaluation

Twitter Facebook LinkedIn