A list of papers I found interesting from Interspeech 2024. [Interspeech 2024 Archive]

📋 Paper List

Speech Features

VC: Voice Conversion

SVC: Singing Voice Conversion

TTS & Speech Synthesis

  • Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [demo]
    • Session: Zero-shot TTS
    • Keywords: TTS, LLM, emotion
    • Interest: ⭐⭐⭐
    • Notes:
      • Reflects the emotion of the source audio!
      • Uses a pretrained LLM to improve semantic-token performance
      • A must-skim
  • DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness
    • Session: Zero-shot TTS
    • Keywords: TTS, HuBERT, teacher-student EMA model, noise augmentation
    • Interest: ⭐⭐⭐
    • Notes:
      • Improves voice cloning performance with a DINO loss (see the EMA sketch after this entry)
        “substantial improvements in naturalness and speaker similarity in both clean and especially real-life noisy scenarios, outperforming traditional AAM-Softmax-based training methods”
      • Since HuBERT embeddings capture whether noise is present, noisy data can be used for training without noise labels
      • Uses the pretrained CAM++ speaker verification model
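
Not from the paper — a minimal sketch of the teacher-student EMA setup with a DINO-style loss, as the keywords describe. The encoders, temperatures, and momentum are assumptions, and DINO's output centering is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, m=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)

def dino_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution, computed on two views of the same utterance
    (e.g., clean vs. noise-augmented)."""
    t = F.softmax(teacher_out / tau_t, dim=-1).detach()  # teacher gets no grad
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```

Only the student receives gradients; presumably feeding the noise-augmented view through the student is what buys the noise robustness.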
  • Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion
    • Session: Corpora-based Approaches in Automatic Emotion Recognition
    • Keywords: emotion, domain adaptation
    • Interest: ⭐⭐⭐
    • Notes:
      • Want to look more closely at the binning approach
        • In my graduate thesis I split values into 3 bins; this paper uses 5 and 10
      • The base paper also seems worth checking (a minimal kNN-VC sketch follows this entry)
        “We implement our idea using the K-nearest neighbors-voice conversion strategy [19], which is a recently proposed approach that achieves impressive results in VC despite its simplicity”
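
For the reference [19] the quote points to (kNN-VC), the core of the method fits in a few lines: each source frame's SSL feature is replaced by the mean of its nearest neighbors in a pool of target-speaker features, and the result is vocoded. A minimal sketch; note the original matches WavLM features with cosine distance, while this uses plain Euclidean `cdist`:

```python
import torch

def knn_vc(src_feats, tgt_pool, k=4):
    """k-nearest-neighbors voice conversion on SSL features.
    src_feats: (T, D) features of the source utterance.
    tgt_pool:  (N, D) features pooled from target-speaker speech."""
    dists = torch.cdist(src_feats, tgt_pool)     # (T, N) pairwise distances
    idx = dists.topk(k, largest=False).indices   # k nearest targets per frame
    return tgt_pool[idx].mean(dim=1)             # (T, D) -> feed a vocoder
```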
  • GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, dataset
    • Interest: ⭐
    • Notes:
      • Uses Glottalization, Tenseness, and Resonance labels
        • Glottalization: control of air flow due to the tension of the glottis (i.e., throat)
        • Tenseness: tense vowels in pronunciation involve tension in the tip and root of the tongue, while lax vowels are the opposite.
        • Resonance: integration of articulatory phonetics with vocal register insight (presumably chest voice vs. head voice)
  • TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, expressive, text, prompt
    • Interest: ⭐⭐⭐
    • Notes:
      • Extracts speaking style from the text alone, with no reference audio
      • The demo includes Korean
      • Emotional expressiveness seems weaker for unseen speakers (probably unavoidable with only 4 training speakers; curious how it would do with more!)
      • Impressive that this came out of just two 2080 Ti GPUs..
  • Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, expressive, LM
    • Interest: ⭐⭐
    • Notes:
      • Can synthesize fillers like "um~" and laughter in several variants
      • LM-based TTS model; the acoustic decoder is based on VALL-E
  • Text-aware and Context-aware Expressive Audiobook Speech Synthesis [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, LM, text
    • Interest: ⭐⭐
    • Notes:
      • Considers not only the text but also the surrounding context
      • Reads less stiffly than other models (the demo is in Chinese, so my judgment is limited)
  • Controlling Emotion in Text-to-Speech with Natural Language Prompts [toolkit]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion, text, prompt
    • Interest: ⭐⭐⭐
    • Notes:
      • Uses emotionally loaded text as the prompt (e.g., (neutral) "Understood." / (happy) "Really?!")
      • Contributions:
        1. an architecture that allows for separate modeling of a speaker’s voice and the prosody of an utterance, using a natural language prompt for the latter
        2. a training strategy to learn a strongly generalized prompt conditioning
        3. a pipeline that allows users to generate speech with fitting prosody without manually selecting the emotion by simply using the text to be read as the prompt
  • Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion
    • Interest: ⭐
    • Notes: uses the difference between the base model and a model fine-tuned on each emotion as an emotion vector (see the sketch below)
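
A minimal sketch of the weight-space arithmetic the note describes, in the style of task vectors; `base` and `finetuned` are assumed to be state dicts of the same architecture:

```python
def emotion_vector(base, finetuned):
    """Emotion vector = fine-tuned weights minus base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_emotion(base, emo_vec, scale=1.0):
    """Interpolate in weight space; scale plausibly controls intensity."""
    return {k: base[k] + scale * emo_vec[k] for k in base}

# e.g., model.load_state_dict(apply_emotion(base_sd, v_happy, scale=0.8))
```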
  • EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [demo]
    • Session: Speech Synthesis: Expressivity and Emotion
    • Keywords: TTS, emotion
    • Interest: ⭐⭐
    • Notes: I wanted to try something like an emotion sphere during my master's, so I'm curious how this paper formulates it! (rough sketch below)
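
My rough reading of the title, not verified against the paper: arousal-valence-dominance coordinates are mapped to spherical coordinates so that the radius can encode emotion intensity and the angles the emotion style. A hedged sketch:

```python
import numpy as np

def avd_to_spherical(a, v, d):
    """Map an (arousal, valence, dominance) point, assumed centered on the
    neutral emotion, to (r, theta, phi): r ~ intensity, angles ~ style."""
    r = np.sqrt(a**2 + v**2 + d**2)
    theta = np.arccos(d / r) if r > 0 else 0.0  # polar angle
    phi = np.arctan2(v, a)                      # azimuth in the a-v plane
    return r, theta, phi
```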
  • Word-level Text Markup for Prosody Control in Speech Synthesis [code] [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody
    • Interest: ⭐⭐
    • Notes: prosodic markup; learns prosody in an unsupervised way and makes it controllable
  • Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody
    • Interest: ⭐⭐
    • Notes:
      • Replaces the deterministic duration predictor (DET) of existing non-autoregressive TTS with probabilistic duration modeling (an OT-CFM-based duration model, FM) and compares them (training-step sketch after this entry)
        “We explore the effects of replacing the MSE-based duration predictor in existing NAR TTS approaches with a log-domain duration model based on conditional flow matching”
      • Systems used for comparison:
        • a deterministic acoustic model (FastSpeech 2)
        • an advanced deep generative acoustic model (Matcha-TTS)
        • a probabilistic end-to-end TTS model (VITS)
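
A minimal sketch of one OT-CFM training step for a log-domain duration model, matching the quoted description; the `model` signature and shapes are assumptions:

```python
import torch

def cfm_duration_loss(model, log_dur, cond, sigma_min=1e-4):
    """Flow-matching loss for log phoneme durations.
    log_dur: (B, N) target log durations; cond: text encoder outputs."""
    x1 = log_dur
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.size(0), 1)                  # uniform flow time
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # OT interpolation path
    u = x1 - (1 - sigma_min) * x0                  # target vector field
    return ((model(xt, t, cond) - u) ** 2).mean()  # regress the field
```

At inference an ODE solver integrates the learned field from noise to log-durations, which is where the stochasticity useful for spontaneous speech comes from.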
  • Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody, duration
    • Interest: ⭐⭐⭐
    • Notes (decoding sketch follows this entry):
      • “designed to precisely control the length of generated speech while maintaining speech quality at different speech rates”
      • “a novel duration model based on MaskGIT to enhance the diversity and quality of the phoneme durations”
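
The second quote was garbled in my notes (fixed to “MaskGIT” above). For context, a sketch of MaskGIT-style iterative parallel decoding applied to duration tokens; the `model` API, the schedule, and any total-duration conditioning inside `cond` are assumptions:

```python
import math
import torch

def maskgit_durations(model, cond, N, steps=8, mask_id=0):
    """MaskGIT-style parallel decoding for N duration tokens: fill every
    masked slot, then re-mask the least confident on a cosine schedule."""
    tokens = torch.full((N,), mask_id)
    for s in range(1, steps + 1):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens, cond)               # (N, vocab); assumed API
        probs, pred = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(masked, pred, tokens)
        conf = torch.where(masked, probs, torch.full_like(probs, float("inf")))
        n_mask = int(N * math.cos(math.pi / 2 * s / steps))
        if n_mask > 0:                             # least confident redone
            remask = conf.topk(n_mask, largest=False).indices
            tokens[remask] = mask_id
    return tokens
```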
  • Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling [demo]
    • Session: Speech Synthesis: Prosody
    • Keywords: TTS, prosody, diffusion
    • Interest: ⭐⭐
    • Notes:
      • Tackles the problem of synthesized speech not resembling the reference's intonation
      • Contributions:
        1. Speaker timbre is a global attribute: a speaker encoder extracts a global speaker embedding (input: mel spectrograms)
        2. Diffusion model as a pitch predictor: matches the diversity of speech prosody by leveraging diffusion's natural advantage in generating diverse content
        3. Prosody shows both global consistency and local variations: prosody is modeled hierarchically (frame-level, phoneme-level, and word-level) to improve the prosody of synthesized speech
  • Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS
    • Interest: ⭐
    • Notes:
      • Lets you choose between a natural speaking style and a hyperarticulated, clearly enunciated one
      • The core idea seems worth a closer look
  • Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, codec
    • Interest: ⭐⭐
    • Notes:
      • Single-codebook codec; compression and reconstruction on the mel spectrogram
      • “Single-Codec performs compression and reconstruction on Mel Spectrogram instead of the raw waveform, enabling efficient compression of speech information while preserving important details, as stated in Tortoise-TTS”
  • ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis [demo]
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, cross-lingual, code-switching
    • Interest: ⭐⭐⭐
    • Notes:
      • From Hyundai Motor Company
      • Can code-switch between English and Korean within a single sentence (cross-lingual and code-mixed speech with high naturalness); the idea behind this part deserves a closer look
  • Multi-modal Adversarial Training for Zero-Shot Voice Cloning
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS
    • Interest: ⭐
    • Notes:
      • From Zoom~
      • “GAN-based, FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning”
      • “Multi-feature Generative Adversarial Training pipeline which uses our discriminator to enhance both acoustic and prosodic features for natural and expressive TTS”
  • Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
    • Session: Speech Synthesis: Paradigms and Methods 1
    • Keywords: TTS, markup, expressive
    • Interest: ⭐⭐⭐
    • Notes:
      • Uses a pre-trained Voicebox to generate speech controlled in the three ways below:
        • Punctuation: It’s good!
        • Emphasis: It’s good
        • Laughter: It’s good [laughter]
      • “efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules”
  • Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS
    • Interest: ⭐⭐⭐
    • Notes (GLA recurrence sketch after this entry):
      • neural codec language model
        “In contrast with previous TTS codec LM model that leverages decoder-only (GPT) transformers, Small-E relies on encoder-decoder architecture”
      • Can be easily pretrained and finetuned on midrange GPUs
      • Trained on long context
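
The gated linear attention in the title is a linear-time attention variant with a matrix-valued recurrent state and per-dimension decay gates. A minimal recurrent-form sketch (the paper relies on a much faster chunked formulation):

```python
import torch

def gated_linear_attention(q, k, v, alpha):
    """Recurrent form of gated linear attention for one head.
    q, k, alpha: (T, d_k) with gates alpha in (0, 1); v: (T, d_v)."""
    S = torch.zeros(k.shape[1], v.shape[1])   # matrix-valued state
    outs = []
    for t in range(q.shape[0]):
        # decay the state per key dimension, then add the new kv outer product
        S = alpha[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
        outs.append(q[t] @ S)                 # read the state with the query
    return torch.stack(outs)                  # (T, d_v)
```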
  • Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [demo] [NVIDIA blog]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, duration, LLM
    • Interest: ⭐⭐⭐
    • Notes:
      • NVIDIA, T5-TTS (T5: a text-to-text model)
      • “first attempt at synthesizing multi-codebook neural audio codecs with an encoder-decoder architecture”
      • Guides the cross-attention heads to learn a monotonic alignment (prior sketch after this entry)
      • Handles consecutively repeated words and sentences remarkably naturally
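
A common way to push cross-attention toward a monotonic text-speech alignment is to bias the attention logits with a near-diagonal prior during early training; the exact mechanism in T5-TTS may differ, so treat this as a hedged sketch:

```python
import torch

def diagonal_attention_prior(T_dec, T_enc, sigma=0.1):
    """Log-prior that concentrates cross-attention near the diagonal,
    i.e., near a monotonic text-to-speech alignment."""
    dec = torch.arange(T_dec).float().unsqueeze(1) / max(T_dec - 1, 1)
    enc = torch.arange(T_enc).float().unsqueeze(0) / max(T_enc - 1, 1)
    return -((dec - enc) ** 2) / (2 * sigma**2)   # (T_dec, T_enc)

# hypothetical use inside attention, annealed away as training progresses:
# scores = q @ k.transpose(-1, -2) / d**0.5 + w * diagonal_attention_prior(Td, Te)
```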
  • Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment [demo (dead link)]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS
    • Interest: ⭐⭐
    • Notes: can generate natural long-form speech from sentence-level recordings alone
  • (paper title not recorded) [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, text, speech editing
    • Interest: ⭐⭐⭐
    • Notes:
      • Text-based Speech Editing
      • Acoustic and Prosody Consistency Losses
        • Acoustic: quantify the smooth transition between the editing region and the adjacent context
        • Prosody: for capturing the prosody feature from the predicted masked region while also analyzing the overall prosody characteristics present in the original speech
  • High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model [데모]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, text
    • Interest: ⭐⭐⭐
    • Notes:
      • Quality is excellent, and the controlled audio also sounds natural
      • Interpreting: text-to-semantic token stage
        • k-means clustering on wav2vec 2.0 features (token-extraction sketch after this entry)
        • mainly focuses on phonetic information, but also carries some prosodic information such as speech rate and overall pitch contour
      • Speaking: semantic-to-acoustic token stage (HiFi-Codec)
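
A minimal sketch of the “Interpreting” stage's token extraction: k-means over wav2vec 2.0 hidden states. The torchaudio pipeline, layer index, and codebook size here are assumptions, not the paper's configuration:

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("utt.wav")                 # assumed mono, 16 kHz
with torch.inference_mode():
    feats, _ = model.extract_features(wav)           # per-layer hidden states
    hidden = feats[6][0]                             # (T, D); layer choice is a guess

kmeans = KMeans(n_clusters=512).fit(hidden.numpy())  # codebook size assumed
semantic_tokens = kmeans.predict(hidden.numpy())     # (T,) discrete token ids
```

In practice the k-means codebook would be fit on features from a whole corpus, then reused to tokenize each utterance.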
  • (paper title not recorded) [code] [demo]
    • Session: Speech Synthesis: Paradigms and Methods 2
    • Keywords: TTS, vision, text
    • Interest: ⭐
    • Notes:
      • “generate speech and co-verbal facial movements from text, animating a virtual avatar”
      • “The proposed model generates mel-spectrograms and facial features (head, eyes, jaw and lip movements) to drive the virtual avatar’s action units”

Speech Emotion Recognition

Audio Captioning

Etc
