A list of papers I found interesting from Interspeech 2024. [Interspeech 2024 Archive]

📋 Paper List

Speech Features

VC: Voice Conversion

SVC: Singing Voice Conversion

TTS & Speech Synthesis

  • Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [demo]
    • Session: Zero-shot TTS
    • Keywords: TTS, LLM, emotion
    • Interest: ⭐⭐⭐
    • Notes:
      • Reflects the emotion of the source audio!
      • Uses a pretrained LLM to improve semantic-token performance
      • A must-skim
  • DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness
    • Session: Zero-shot TTS
    • Keywords: TTS, HuBERT, teacher-student EMA model, noise augmentation
    • Interest: ⭐⭐⭐
    • Notes:
      • Improves voice cloning performance with a DINO loss (see the EMA sketch after this entry)
        “substantial improvements in naturalness and speaker similarity in both clean and especially real-life noisy scenarios, outperforming traditional AAM-Softmax-based training methods”
      • Since HuBERT embeddings capture whether noise is present, noisy data can be used for training without noise labels
      • Uses the pretrained CAM++ speaker verification model
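
Not from the paper — a minimal sketch of the teacher-student EMA setup with a DINO-style loss, as the keywords describe. The encoders, temperatures, and momentum are assumptions, and DINO's output centering is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, m=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)

def dino_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution, computed on two views of the same utterance
    (e.g., clean vs. noise-augmented)."""
    t = F.softmax(teacher_out / tau_t, dim=-1).detach()  # teacher gets no grad
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```

Only the student receives gradients; presumably feeding the noise-augmented view through the student is what buys the noise robustness.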
  • Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion
    • Session: Corpora-based Approaches in Automatic Emotion Recognition
    • Keywords: emotion, domain adaptation
    • Interest: ⭐⭐⭐
    • Notes:
      • Want to look more closely at the binning approach
        • In my graduate thesis I split values into 3 bins; this paper uses 5 and 10
      • The base paper also seems worth checking (a minimal kNN-VC sketch follows this entry)
        “We implement our idea using the K-nearest neighbors-voice conversion strategy [19], which is a recently proposed approach that achieves impressive results in VC despite its simplicity”
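
For the reference [19] the quote points to (kNN-VC), the core of the method fits in a few lines: each source frame's SSL feature is replaced by the mean of its nearest neighbors in a pool of target-speaker features, and the result is vocoded. A minimal sketch; note the original matches WavLM features with cosine distance, while this uses plain Euclidean `cdist`:

```python
import torch

def knn_vc(src_feats, tgt_pool, k=4):
    """k-nearest-neighbors voice conversion on SSL features.
    src_feats: (T, D) features of the source utterance.
    tgt_pool:  (N, D) features pooled from target-speaker speech."""
    dists = torch.cdist(src_feats, tgt_pool)     # (T, N) pairwise distances
    idx = dists.topk(k, largest=False).indices   # k nearest targets per frame
    return tgt_pool[idx].mean(dim=1)             # (T, D) -> feed a vocoder
```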
  • GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, dataset
    • Interest: ⭐
    • Notes:
      • Uses Glottalization, Tenseness, and Resonance labels
        • Glottalization: control of air flow due to the tension of the glottis (i.e., throat)
        • Tenseness: tense vowels in pronunciation involve tension in the tip and root of the tongue, while lax vowels are the opposite.
        • Resonance: integration of articulatory phonetics with vocal register insight (presumably chest voice vs. head voice)
  • TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, expressive, text, prompt
    • Interest: ⭐⭐⭐
    • Notes:
      • Extracts speaking style from the text alone, with no reference audio
      • The demo includes Korean
      • Emotional expressiveness seems weaker for unseen speakers (probably unavoidable with only 4 training speakers; curious how it would do with more!)
      • Impressive that this came out of just two 2080 Ti GPUs..
  • Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, expressive, LM
    • Interest: ⭐⭐
    • Notes:
      • Can synthesize fillers like "um~" and laughter in several variants
      • LM-based TTS model; the acoustic decoder is based on VALL-E
  • Text-aware and Context-aware Expressive Audiobook Speech Synthesis [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, LM, text
    • Interest: ⭐⭐
    • Notes:
      • Considers not only the text but also the surrounding context
      • Reads less stiffly than other models (the demo is in Chinese, so my judgment is limited)
  • Controlling Emotion in Text-to-Speech with Natural Language Prompts [toolkit]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, text, prompt
    • Interest: ⭐⭐⭐
    • Notes:
      • Uses emotionally loaded text as the prompt (e.g., (neutral) "Understood." / (happy) "Really?!")
      • Contributions:
        1. an architecture that allows for separate modeling of a speaker’s voice and the prosody of an utterance, using a natural language prompt for the latter
        2. a training strategy to learn a strongly generalized prompt conditioning
        3. a pipeline that allows users to generate speech with fitting prosody without manually selecting the emotion by simply using the text to be read as the prompt
  • Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion
    • Interest: ⭐
    • Notes: uses the difference between the base model and a model fine-tuned on each emotion as an emotion vector (see the sketch below)
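
A minimal sketch of the weight-space arithmetic the note describes, in the style of task vectors; `base` and `finetuned` are assumed to be state dicts of the same architecture:

```python
def emotion_vector(base, finetuned):
    """Emotion vector = fine-tuned weights minus base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_emotion(base, emo_vec, scale=1.0):
    """Interpolate in weight space; scale plausibly controls intensity."""
    return {k: base[k] + scale * emo_vec[k] for k in base}

# e.g., model.load_state_dict(apply_emotion(base_sd, v_happy, scale=0.8))
```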
  • EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion
    • Interest: ⭐⭐
    • Notes: I wanted to try something like an emotion sphere during my master's, so I'm curious how this paper formulates it! (rough sketch below)
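
My rough reading of the title, not verified against the paper: arousal-valence-dominance coordinates are mapped to spherical coordinates so that the radius can encode emotion intensity and the angles the emotion style. A hedged sketch:

```python
import numpy as np

def avd_to_spherical(a, v, d):
    """Map an (arousal, valence, dominance) point, assumed centered on the
    neutral emotion, to (r, theta, phi): r ~ intensity, angles ~ style."""
    r = np.sqrt(a**2 + v**2 + d**2)
    theta = np.arccos(d / r) if r > 0 else 0.0  # polar angle
    phi = np.arctan2(v, a)                      # azimuth in the a-v plane
    return r, theta, phi
```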
  • Word-level Text Markup for Prosody Control in Speech Synthesis [code] [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody
    • Interest: ⭐⭐
    • Notes: prosodic markup; learns prosody in an unsupervised way and makes it controllable
  • Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody
    • Interest: ⭐⭐
    • Notes:
      • Replaces the deterministic duration predictor (DET) of existing non-autoregressive TTS with probabilistic duration modeling (an OT-CFM-based duration model, FM) and compares them (training-step sketch after this entry)
        “We explore the effects of replacing the MSE-based duration predictor in existing NAR TTS approaches with a log-domain duration model based on conditional flow matching”
      • Systems used for comparison:
        • a deterministic acoustic model (FastSpeech 2)
        • an advanced deep generative acoustic model (Matcha-TTS)
        • a probabilistic end-to-end TTS model (VITS)
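
A minimal sketch of one OT-CFM training step for a log-domain duration model, matching the quoted description; the `model` signature and shapes are assumptions:

```python
import torch

def cfm_duration_loss(model, log_dur, cond, sigma_min=1e-4):
    """Flow-matching loss for log phoneme durations.
    log_dur: (B, N) target log durations; cond: text encoder outputs."""
    x1 = log_dur
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.size(0), 1)                  # uniform flow time
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # OT interpolation path
    u = x1 - (1 - sigma_min) * x0                  # target vector field
    return ((model(xt, t, cond) - u) ** 2).mean()  # regress the field
```

At inference an ODE solver integrates the learned field from noise to log-durations, which is where the stochasticity useful for spontaneous speech comes from.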
  • Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody, duration
    • Interest: ⭐⭐⭐
    • Notes (decoding sketch follows this entry):
      • “designed to precisely control the length of generated speech while maintaining speech quality at different speech rates”
      • “a novel duration model based on MaskGIT to enhance the diversity and quality of the phoneme durations”
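
The second quote was garbled in my notes (fixed to “MaskGIT” above). For context, a sketch of MaskGIT-style iterative parallel decoding applied to duration tokens; the `model` API, the schedule, and any total-duration conditioning inside `cond` are assumptions:

```python
import math
import torch

def maskgit_durations(model, cond, N, steps=8, mask_id=0):
    """MaskGIT-style parallel decoding for N duration tokens: fill every
    masked slot, then re-mask the least confident on a cosine schedule."""
    tokens = torch.full((N,), mask_id)
    for s in range(1, steps + 1):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens, cond)               # (N, vocab); assumed API
        probs, pred = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(masked, pred, tokens)
        conf = torch.where(masked, probs, torch.full_like(probs, float("inf")))
        n_mask = int(N * math.cos(math.pi / 2 * s / steps))
        if n_mask > 0:                             # least confident redone
            remask = conf.topk(n_mask, largest=False).indices
            tokens[remask] = mask_id
    return tokens
```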
  • Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody, diffusion
    • Interest: ⭐⭐
    • Notes:
      • Tackles the problem of synthesized speech not resembling the reference's intonation
      • Contributions:
        1. Speaker timbre is a global attribute: a speaker encoder extracts a global speaker embedding (input: mel spectrograms)
        2. Diffusion model as a pitch predictor: matches the diversity of speech prosody by leveraging diffusion's natural advantage in generating diverse content
        3. Prosody shows both global consistency and local variations: prosody is modeled hierarchically (frame-level, phoneme-level, and word-level) to improve the prosody of synthesized speech
  • Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS
    • Interest: ⭐
    • Notes:
      • Lets you choose between a natural speaking style and a hyperarticulated, clearly enunciated one
      • The core idea seems worth a closer look
  • Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, codec
    • Interest: ⭐⭐
    • Notes:
      • Single-codebook codec; compression and reconstruction on the mel spectrogram
      • “Single-Codec performs compression and reconstruction on Mel Spectrogram instead of the raw waveform, enabling efficient compression of speech information while preserving important details, as stated in Tortoise-TTS”
  • ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, cross-lingual, code-switching
    • Interest: ⭐⭐⭐
    • Notes:
      • From Hyundai Motor Company
      • Can code-switch between English and Korean within a single sentence (cross-lingual and code-mixed speech with high naturalness); the idea behind this part deserves a closer look
  • Multi-modal Adversarial Training for Zero-Shot Voice Cloning
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS
    • Interest: ⭐
    • Notes:
      • From Zoom~
      • “GAN-based, FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning”
      • “Multi-feature Generative Adversarial Training pipeline which uses our discriminator to enhance both acoustic and prosodic features for natural and expressive TTS”
  • Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, markup, expressive
    • Interest: ⭐⭐⭐
    • Notes:
      • Uses a pre-trained Voicebox to generate speech controlled in the three ways below:
        • Punctuation: It’s good!
        • Emphasis: It’s good
        • Laughter: It’s good [laughter]
      • “efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules”
  • Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS
    • Interest: ⭐⭐⭐
    • Notes (GLA recurrence sketch after this entry):
      • neural codec language model
        “In contrast with previous TTS codec LM model that leverages decoder-only (GPT) transformers, Small-E relies on encoder-decoder architecture”
      • Can be easily pretrained and finetuned on midrange GPUs
      • Trained on long context
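
The gated linear attention in the title is a linear-time attention variant with a matrix-valued recurrent state and per-dimension decay gates. A minimal recurrent-form sketch (the paper relies on a much faster chunked formulation):

```python
import torch

def gated_linear_attention(q, k, v, alpha):
    """Recurrent form of gated linear attention for one head.
    q, k, alpha: (T, d_k) with gates alpha in (0, 1); v: (T, d_v)."""
    S = torch.zeros(k.shape[1], v.shape[1])   # matrix-valued state
    outs = []
    for t in range(q.shape[0]):
        # decay the state per key dimension, then add the new kv outer product
        S = alpha[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
        outs.append(q[t] @ S)                 # read the state with the query
    return torch.stack(outs)                  # (T, d_v)
```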
  • Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [demo] [NVIDIA blog]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, duration, LLM
    • Interest: ⭐⭐⭐
    • Notes:
      • NVIDIA, T5-TTS (T5: a text-to-text model)
      • “first attempt at synthesizing multi-codebook neural audio codecs with an encoder-decoder architecture”
      • Guides the cross-attention heads to learn a monotonic alignment (prior sketch after this entry)
      • Handles consecutively repeated words and sentences remarkably naturally
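
A common way to push cross-attention toward a monotonic text-speech alignment is to bias the attention logits with a near-diagonal prior during early training; the exact mechanism in T5-TTS may differ, so treat this as a hedged sketch:

```python
import torch

def diagonal_attention_prior(T_dec, T_enc, sigma=0.1):
    """Log-prior that concentrates cross-attention near the diagonal,
    i.e., near a monotonic text-to-speech alignment."""
    dec = torch.arange(T_dec).float().unsqueeze(1) / max(T_dec - 1, 1)
    enc = torch.arange(T_enc).float().unsqueeze(0) / max(T_enc - 1, 1)
    return -((dec - enc) ** 2) / (2 * sigma**2)   # (T_dec, T_enc)

# hypothetical use inside attention, annealed away as training progresses:
# scores = q @ k.transpose(-1, -2) / d**0.5 + w * diagonal_attention_prior(Td, Te)
```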
  • Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment [demo (dead link)]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS
    • Interest: ⭐⭐
    • Notes: can generate natural long-form speech from sentence-level recordings alone
  • (paper title not recorded) [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, text, speech editing
    • Interest: ⭐⭐⭐
    • Notes:
      • Text-based Speech Editing
      • Acoustic and Prosody Consistency Losses
        • Acoustic: quantify the smooth transition between the editing region and the adjacent context
        • Prosody: for capturing the prosody feature from the predicted masked region while also analyzing the overall prosody characteristics present in the original speech
  • High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model [데모]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, text
    • Interest: ⭐⭐⭐
    • Notes:
      • Quality is excellent, and the controlled audio also sounds natural
      • Interpreting: text-to-semantic token stage
        • k-means clustering on wav2vec 2.0 features (token-extraction sketch after this entry)
        • mainly focuses on phonetic information, but also carries some prosodic information such as speech rate and overall pitch contour
      • Speaking: semantic-to-acoustic token stage (HiFi-Codec)
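
A minimal sketch of the “Interpreting” stage's token extraction: k-means over wav2vec 2.0 hidden states. The torchaudio pipeline, layer index, and codebook size here are assumptions, not the paper's configuration:

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("utt.wav")                 # assumed mono, 16 kHz
with torch.inference_mode():
    feats, _ = model.extract_features(wav)           # per-layer hidden states
    hidden = feats[6][0]                             # (T, D); layer choice is a guess

kmeans = KMeans(n_clusters=512).fit(hidden.numpy())  # codebook size assumed
semantic_tokens = kmeans.predict(hidden.numpy())     # (T,) discrete token ids
```

In practice the k-means codebook would be fit on features from a whole corpus, then reused to tokenize each utterance.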
  • (paper title not recorded) [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, vision, text
    • Interest: ⭐
    • Notes:
      • “generate speech and co-verbal facial movements from text, animating a virtual avatar”
      • “The proposed model generates mel-spectrograms and facial features (head, eyes, jaw and lip movements) to drive the virtual avatar’s action units”

Speech Emotion Recognition

Audio Captioning

Etc
