Wav2Letter

논문: Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

1. Introduction

simple end-to-end model for speech recognition
convolution acoustic model + graph decoding
force alignment 없이 바로 transcribed speech 출력
CTC보다 간결한 방식 사용
acoustic model: 사람이 label한 phonetic transcription 사용 없이 문자로 나오게 학습됨

2. Architecture

ConvNet Acoustic Model

1d convolutional neural network, pooling layer 대신 striding 사용
activation fuction으로는 hyperbolic tangents나 ReLU 사용 (성능 비슷)
input feature에 따라 구조가 바뀜
- MFCC-based: 이미 input을 생성할 때 큰 stride를 사용했기에 네트워크에는 stride 적게 필요
- power spectrum-base & rawwave-base: stride 있는 convolution 위치보다 네트워크 전체적인 stride가 더 중요
  input sequence가 길기에, 첫 번째 layer에 strided convolution들을 배치하는게 좋음
마지막 layer는 letter dictionary($d_y=\vert \mathcal{L} \vert$)에 있는 각 letter의 점수 출력

위 이미지는 raw wave를 활용했을 때의 network 구조
- 빨간색 상자: convolution with strides
- 회색 상자: convolution with kernel width($kw$) of 1 (=fully connected layer)
- MFCC & power spectrum-base 모델은 첫번째 layer 없음
전채 네트워크는 kernel size가 31280, stride가 320인 non-linear convolution으로 볼 수 있음
input sampling rate가 16 kHz이기에 label score들은 1955 ms window와 20 ms step을 사용하여 생성됨

Inferring Segmentation with AutoSegCriterion

대부분 크기가 큰 데이터셋들은 각 오디오 파일에 대한 text transcript 제공
하지만 네트워크를 제대로 학습시키기 위해선 transcription 안의 각 문자(letter)에 대한 segmentation이 필요하기에 여러 방법 제안됨
HMM/GMM model: iterative EM procedure
1. Estimation: letter transcription과 input sequence의 joint probability 최대화하는 모델로, 가장 좋은 segmentation 추론
2. Maximization: 이미 추측한 (fixed) segmentation을 기준으로 frame-level criterion 최적화
HMM/NN system: MMI or MBR
1. MMI: acoustic sequence와 word sequences의 mutual information 최대화
2. MBR: Minimum Bayse Risk criterion
NN only: transcription segmentation을 추론하며 동시에 전체적으로 맞는 transcription 점수를 높이는 방법으로 학습. CTC가 대표적임

CTC

Connectionist Temporal Classification criterion
Baidu Deep Speech architecture에서 제안됨
frame-level에서 normalized된 network의 output probablity가 점수를 매긴다고 추정
transcription을 만드는 “모든” letter의 sequence 고려
각 letter 사이에 삽입될 특별한 “blank” state 존재
1. gletter 사이의 필요없는 arbage frame 표시
2. 연속적으로 똑같은 letter가 나올 때 분리용 (ex) add

$\mathcal{G}_{ctc}(\theta,\ T)$: T frames로 구성된 주어진 transcription $\theta$의 unfolded graph (Fig. 2(b))
$\pi =\pi_1, \cdots,\pi_T \in \mathcal{G}_{ctc}(\theta,\ T)$: graph에서 유효한 모든 letter sequences
매 time step $t$에서 각 graph의 node는 acoustic model output에 해당하는 log-probability letter($f_t(\cdot)$라고 표시) 대입
ctc의 목적은 $\mathcal{G}_{ctc}(\theta,\ T)$의 path의 “전체적”인 점수를 높이는 것:

$CTC(\theta,\ T) = -logadd_{\pi\in\mathcal{G}_{ctc}(\theta,\ T)}\sum_{t=1}^T f_{\pi_t}(x)\\$ $logadd(a, b) = exp(log\ a + log\ b)$
logadd는 max($\cdot$)의 부드러운 버전
- max: 위 수식을 max로 변경 시, Viterbi 알고리즘(hidden Markov 모형 등에서 관측된 사건들의 순서를 야기한 가장 가능성 높은 은닉 상태들의 순서를 찾기 위한 동적 계획법)으로 구할 수 있음
  model belief에 따라 가장 좋은 경로의 점수 최대화
- logadd: 비슷한 점수를 가진 경로는 전체적인 점수가 같고, 점수가 큰 경로는 점수가 낮은 경로보다 더 큰 전반적인 weight 갖음
  acuostic model이 log-probability $f_t(\cdot)$의 normalized된 점수를 출력하기에, 점수를 maximizing할 때 네트워크가 발산하지 않음

ASG

Auto Segmentation Criterion
CTC와 다른점
1. “blank” label 없음
  - graph가 단순해지는 장점 + 실제로 blank label로 grabage frame 표현하는 것에 대한 이득이 없음
  - letter 반복은 새로운 레이블로 대체 가능 (ex) “caterpillar” > “caterpil2ar”
  - decoder 단순화
2. node에 un-normalized 점수 사용
  - graph edge에 transition score 삽입하여 다른 language model 적용 가능
  - letter보다 high-level representation으로 발전시킬 수 있음
  - “label-bias” 문제 경감을 위해 normalized transition 피함
  - 제안된 방법은 acoustic model과 함께 학습되는 transition scalar 사용
3. frame이 아닌 global normalization
  - 2번 조건을 사용 시에 틀린 transcription이 낮은 confidence를 갖도록 하기 위함

(b): $\mathcal{G}_{asg}(\theta,\ T)$, T frames로 구성된 주어진 transcription $\theta$의 unfolded graph
(c): $\mathcal{G}_{full}(\theta,\ T)$, T frames에 대한 fully connected graph (ahems letter의 sequence 나타냄)
아래의 수식 최소화:
\[ASG(\theta,\ T) = -logadd_{\pi\in\mathcal{G}_{asg}(\theta,\ T)}\sum_{t=1}^T (f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x)) \\ + logadd_{\pi\in\mathcal{G}_{full}(\theta,\ T)}\sum_{t=1}^T (f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x))\]
- $g_{i,j}(\cdot)$: label i에서 j로의 transition score
- 좌측: letter sequence가 맞는 transcription가 되도록 함
- 우측: 모든 letter의 sequence들의 가능성 낮추기 (demote)

Beam-Search Decoder

beam threholding, histogram pruning, language model smearing으로 구성된 간단한 beam-search하는 one-pass decoder 사용
KenLM 사용 + input으로 un-normalized acoustic 점수 (transitions and emissions from the acoustic model) 사용
아래 수식을 최소화:
\[\mathcal{L}(\theta) = logadd_{\pi\in\mathcal{G}_{asg}(\theta,\ T)}\sum_{t=1}^T (f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x))\\ + \alpha\ log\ P_{lm}(\theta) + \beta\vert\theta\vert\]
- $P_{lm}(\theta)$: transcription $\theta$가 주어졌을 때, language model의 확률
- $\alpha$: language model 가중치 하이퍼-파라미터
- $\beta$: word insertion penalty 하이퍼-파라미터

3. Experiments

Data

LibriSpeech: 1000 시간 사용, 16 kHz
vocabulary $\mathcal{L}$: 30 graphemes (alphabet + apostrophe + silence + two special “repetition”)
평가지표: letter-error-rates (LERs), word-error-rates (WERs)
WERs는 본 논문의 decoder와 LibriSpeech가 제동하는 표준 4-gram language model을 사용하여 계산
MFCC: 13 coefficients, 25 ms sliding window, 10 ms stride + 1차 미분 + 2차 미분
Power spectrum: 257 componenets, 25 ms window, 10 ms stride
각 input들을 normalized하여 사용 (mean 0, std 1)

Results

Table 1
- (a): LER에서 ASG에서 더 좋은 성능 보임
- (b): 짧은 sequence에서 CTC가 더 빠르게 계산
- (c): 긴 sequence에서 ASG가 더 빠르게 계산
- Baidu GPU CTC는 더 큰 단어들로 구성된 언어에 적함 (중국 한자 5000개)
Figure 4: training size + data augumentation (shift + stretching) 영향
- (a): augumentation은 적은 수의 학습 데이터에서 효과적이나 학습 데이터 사이즈의 크기가 커질 수록 차이가 적어짐
- (b): Deep Speech 1 & 2에 비해 작은 학습 데이터를 사용했어도 비교할 만한 성능을 보임

1000 시간으로 학습
네트워크 구조의 전체적인 stride는 320으로, 매 20 ms마다 label 생성
output의 정확도(precision)을 정제함으로서 1% 성능 향상이 가능
input sequence를 shift하고 반복해서 네트워크에 넣어주면 됨
Table 2에서는 10 ms shift한 결과
MFCC보다 power spectrum, raw wave의 성능이 떨어지나 Fig. 4에서 보듯이 충분한 데이터가 있으면 차이가 작아질 것으로 생각됨

Twitter Facebook LinkedIn

Wav2Letter

1. Introduction

2. Architecture

ConvNet Acoustic Model

Inferring Segmentation with AutoSegCriterion

CTC

ASG

Beam-Search Decoder

3. Experiments

Data

Results

공유하기

댓글남기기

참고

Interspeech 2024 관심 논문 리스트

Generative Agents: Interactive Simulacra of Human Behavior

Diffusion Models in Vision: A Survey

ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection