CLAR: Contrastive Learning of Auditory Representations

CLAR: Contrastive Learning of Auditory Representations
CLAR github

연관 포스트:

Abstract

다양한 data augmentations

time-frequency audio feature > raw signals in learned representations

supervised & contrastive losses(semi-supervised) > self-supervised pre-training $\rightarrow$ supervised fine-tuning

converge faster with significantly better representations

1. Introduction

supervised learning
- input signal과 class mapping
- generalizability limitation
  - expensive to label
  - skewed towards ont partivular damain (e.g. speech, music, etc…)
vision self-supervised model SimCLR based model for audio

1) Self-Supervised Contrastive Learning

input pair의 similarity / dissimilarity 학습
SimCLR이 다른 SSL 모델뿐만 아니라 supervised보다 성능이 좋았음(vision)
같은 data의 2개의 augmentated view는 argreement 최대화
다른 data와는 contrastive loss 최대화
NT-Xent: Normalized Temperature-scaled Cross Entropy Loss
\[\mathcal{L}_{CL}=-\sum_{i,j}^Nlog\frac{exp(sim(\mathbf{z}_i, \mathbf{z}_j)/ \tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k \neq i]} exp(sim(\mathbf{z}_i, \mathbf{z}_j)/ \tau)}\]
- $\mathbf{1}_{[k \neq i]}\in 0,\ 1$ : $k \neq i$일 때만 1 아니면 0
- $\tau$ : temperature parameter (default 0.5)
- $\mathbf{z}$ : encoded representation
- $N$ : mini-batch size
- $(i,\ j)$: positive pairs
- $sim(\mathbf{u}, \mathbf{v})$ : $sim(\mathbf{u}, \mathbf{v}) = \mathbf{u}^T\mathbf{v} / \Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert$ cosine similarity between two normalized vector $\mathbf{u}$ and $\mathbf{v}$

2) Supervised Contrastive Learning

Cross-Entropy Loss이 가장 흔하게 쓰이는 방법
hyper-parameter, noisy labels에 민감히고, margin도 약함

SupCon에서 문제를 해결하고자함

NT-Xent	SupCon
같은 class의 set와 남은 다른 class와 비교	같은 이미지에서 augment한 이미지와 batch 내의 다른 이미지들과 비교
SSL	supervised

CLAR에서는 supervised와 self-supervised 방식 힘께 사용
efficient representation & faster training

3. Methods

1) Audio Pre-processing

raw audio signal과 time-frequency audio features로 실험 진행
둘 다 16 kHz로 down-sampling 한 후에, 길이가 같아지도록 zero-padding 혹은 오른쪽 넘은 부분 clipping
time-frequency audio features
- 16ms windows and 8 ms stride STFT magnitude & phase
- 128 frequency bins equally spaced on the Mel scale(128 mel filter)
- STFT magnitude & mel spectrogram log-power $f(S)=10log_{10}\vert S\vert^2$
- channel 축으로 STFT magnitude, phase, mel spectrogram concat » $3\times F \times T$
- multi-domain audio signal(speech, environmental, music, etc) + ResNet 구조 유지
- GPU 1D convolutional Neural Network로 만듦 » nnaudio toolbox

2) Training/Evaluating Protocol

(1) Data Augmentation

각 sample을 random augmentation해서 2개의 random views 생성
5.Augmentation에서 적용된 augmentation 설명
6.Data Augmentation For Contrastive Learning에서 각 augmentation 영향 확인

(2) Encoder

data sample을 representational vector로 mapping
adaptive average pooling 사용
SimCLR training protocol로 1D & 2D ResNet18 학습
time-frequency features: ResNet18 random initalization
raw audio signal: ResNet18의 모든 연산(conv, max-pooling, batch norm) 2D에서 1D로 바꿈
output vector는 512 dimension vector

(3) Projection Head

encodered output vector를 contrastive/supervised loss 계산이 가능한 space로 mapping
fully connected layers + ReLU
output vector는 128 dimension
supervised: contrastive loss 계산에 사용되었던 최종 layer의 output 크기를 class 개수로 바꿈
label이 있는 data는 cross entropy loss만 계산하는게 아니라,
그 전 layer에서 contrastive loss도 함께 계산

(4) Evaluation Head

각각 다른 training 방법을 사용해서 encoder를 학습시킨 후,
projection head를 evaluation head가 대체
encoder를 freeze하고 그 위에 linear classifier를 학습
test accuracy 비교용
supervised, SSL과 비교하기 위해 full labeled data로 학습

(5) 전체적으로

batch size: 1024 (memory 문제로 512로 줄인 실험도 있음)
optimizer: Layer-wise Adaptive rate Scaling (LARS)
weight decay: $10^{-4}$
linear warmup for first 10 epochs
lr scheduler: cosine decay schedule without restarts
global batch normalization
random initalization

3) Datasets

(1) Speech Commands

2,618 speakers, 105,829 audio
16 kHz, single channel (mono)
35개 단어 중 하나 말하는 데이터셋
약 1초
실험은 데이터는 한 자리 숫자 말하는 데이터셋만 사용(~20k sample)

(2) NSynth

305,979개의 4초 audio
다양한 악기로 구성, 각 악기가 한 음 연주
3초 동안 연주/누르고 있고, 마지막 1초는 decay
label: 악기군(musical instrument family): 11 class / pitch: 128 class

(3) Environmental Sound Classification (ESC-10/50)

10/50은 label 개수
ESC-50: 2000개의 5초 환경음(각 class마다 40개)
ESC-10은 ESC-50의 일부분
dataset 제작자들이 5 fold로 만들었지만,
위 실험에서는 앞 4개는 training으로 남은 1개를 test로 사용

4. CLAR Framework

	Supervised	Contrastive
focus	여러 class에서 sample 식별에 집중	pair sample similar/disimilarity
constraints	latent space에서의 제약 조건 없음	negative view는 멀게 positive view는 가깝게
advantage	optimize 간단(training 시간 단축)	큰 batch size와 더 긴 학습시 좋음

Self-supervised Contrastive Learning에서는 위 두 개의 장점을 합치고자 함
- self-supervised 방식으로 pre-training한 후, supervised fine-tuning 진행
- catastrophic forgetting: 새로운 데이터를 받아들이면서 기존에 학습했던 내용을 잊어버리는 현상. 특히 작은 network에서 더 큰 문제가 됨
- two stage로 진행하기에 training이 더 어려워짐
CLAR에서는 fine-tuning stage 사용하지 않고 contrastive learning과 supervised learning을 함께 사용하여 학습 진행

\[L= \mathcal{L}_{CL}+\mathcal{L}_{CE}\]

	$\mathcal{L}_{CL}$	$\mathcal{L}_{CE}$
name	Contrastive loss	Categorical Cross-Entropy loss
when	label 유무에 관계없이 항상	label이 있을 때만, 없으면 0
where	projection head 마지막 fc layer 전에서	projection head 마지막 fc layer에서
sampling	-	statified(계층) sampling labeled/unlabeled 비율 유지하면서 sampling 이유 (1)사용한 dataset의 크기가 작음 (2) batch size가 큼(1024)

5. Augmentations

raw audio에 적용하는 6가지 augmentation
spectrogram에 직접적으로 영향을 주는 augmentation 없음
augmentation 할 지 / 안 할 지, 하면 얼만큼 할 건 지 uniform distribution에서 random하게 선택

1) Frequency Transformation

Pitch Shift(PS)
- pitch 올리거나 내리거나
- [-15, 15] semitones
Noise Injection
- noise의 intensity는 Signal-to-Noise Ratio random하게 선택
- White noise: intensity만
- Mixed noise: white, brown, pink 1/3 확률로 고르기

2) Temporal Transformation

Fade in/out(FD)
- fade 정도: linear, log, exp 1/3 확률로 고르기
- fade 크기: (max) audio_length / 2
Time Masking(TM)
- 일정 부분(segment)을 normal noise 혹은 constant로 바꿈
- random location
- random size: (max) audio_length / 8
Time Shift(TS)
- roll-over backwards or forwards
- degree: [0, audio_length / 2]
Time Stretching(TST)
- audio sample faster / slower speec
- phase vocoder 사용
- STFT » stretching with a phase vocoder » inverse STFT
- 원래 길이와 맞추기 위해서 cropping하거나 down-sample - rate > 1: speed up - rate < 1: slow down - rate range: [0.5, 1.5]

6. Data Augmentations For Contrastive Learning

대각 성분은 augmentation 1개만 사용
row: 1st augmentation
column: 2nd augmentation
각각 마지막 줄은 평균 결과
1D Model top 3: fade in/out, time stretching, pitch shifting
2D Model top 3: fade in/out, time masking, time shifting
전체적으로는 1D(68.6 $\pm$ 0.82) > 2D(67.0 $\pm$ 1.36)
- 2D에서 time masking이 1st augmentation일 때 worst
하지만 2D에서 best 결과 나옴(89.3%)

7. Raw Signal vs Time-Frequency Features

6. Data Augmentations For Contrastive Learning에서 가장 좋았던 augmentation을 사용해서 학습한 encoder freeze
2D model(Fade in/out + Time Masking)한 결과가 전체적으로 결과가 좋음
augmentation을 늘리는 것도 좋지 않음
- 2D model에서 Time Shifting까지 했을 때, 결과가 더 안 좋음

8. CLAR vs Supervised vs Self-Supervised

7. Raw Signal vs Time-Frequency Features에서 가장 정확도가 높았던 SC-10 dataset으로 다른 방식과 비교 진행
CLAR evaluation head의 10 epoch마다 확인 후, 1000 epoch까지 학습
CLAR는 semi-supervised하게 학습도 가능하기에
labeled data를 100%, 20%, 10%, 1% 비율로 학습

1) CLAR improves learned representations

top 1 accuracy
같은 epoch 수 만큼 학습시켰을 때, CLAR이 가장 좋은 결과
supervised model은 label 비율이 100%에서 1%가 되면 손실이 대략 65% point
CLAR은 19% point만 손실
CLAR의 label 비율을 줄일 수록 self-supervised보다 accuracy가 작아지는데, 효율적인 representation을 찾아내기보다 적은 수의 label에 overfitting한 결과로 볼 수도 있음

2) CLAR improves the speed of learning representations

self-supervised보다 수렴 속도가 빠름
특히 100% labeled data면 self-supervised보다 빠르고 정확도도 높음
labeled data 비율을 줄일수록 supervised와 차이가 점점 커짐
- encoder의 latent representation을 개선하는 공통적인 representation이 self-supervised와 supervised에 있음
학습 알고리즘이 Categorical Cross-Entropy loss를 먼저 optimize한 다음에, Contrastive loss를 천천히 optimize

Twitter Facebook LinkedIn

CLAR: Contrastive Learning of Auditory Representations

1. Introduction

1) Self-Supervised Contrastive Learning

2) Supervised Contrastive Learning

3. Methods

1) Audio Pre-processing