Interspeech 2024: Papers of Interest
A list of papers I found interesting from Interspeech 2024. [Interspeech 2024 Archive]
📋 Paper List
Speech Features
- YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation [code]
  - Session: Speech and Audio Analysis and Representations
  - Keywords: F0 estimation
  - Interest: ⭐⭐
  - Notes: F0 SOTA; since F0 is often a key ingredient in speech generation, this seems worth checking out
VC: Voice Conversion
- Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals [code] [demo]
  - Session: Speech Synthesis: Voice Conversion 1
  - Keywords: VC, spatial audio
  - Interest: ⭐
  - Notes:
    - Can convert only the desired target voice in a multi-speaker setting
    - Uses multi-channel input to account for spatial cues
    - Some robotic artifacts are audible in the demo
- Neural Codec Language Models for Disentangled and Textless Voice Conversion [code (not yet updated)]
- Fine-Grained and Interpretable Neural Speech Editing [code] [demo]
  - Session: Speech Synthesis: Voice Conversion 1
  - Keywords: speech editing
  - Interest: ⭐⭐
  - Notes:
    - Speech editing works well; pitch shifting and time stretching are supported
    - More diverse demos for the VC part would have been nice
- FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation [demo]
  - Session: Speech Synthesis: Voice Conversion 1
  - Keywords: VC, diffusion
  - Interest: ⭐⭐
  - Notes: reconstruction in a single diffusion step
- DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion [demo]
  - Session: Speech Synthesis: Voice Conversion 1
  - Keywords: VC, real-time
  - Interest: ⭐
  - Notes: can convert voices even from very short audio
- Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity [demo]
  - Session: Speech Synthesis: Voice Conversion 1
  - Keywords: VC, emotion
  - Interest: ⭐
  - Notes: emotional-intensity-aware; I should read the Google paper first..!
- Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC
  - Interest: ⭐⭐
  - Notes:
    - The adaptive global response normalization (AGRN) method looked interesting
    - Especially the ablation studies!
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, cross-lingual, training-inference mismatch
  - Interest: ⭐⭐⭐
  - Notes:
    - Performance looks very strong; such a shame there is no code..
    - Can perform VC from only 3 seconds of speech
    - Proposes a method for the training-inference mismatch problem: a "teacher-guided refinement process to form a dual-mode (conversion mode and reconstruction mode) training strategy with the original reconstruction process"
- Residual Speaker Representation for One-Shot Voice Conversion [code] [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, timbre
  - Interest: ⭐
  - Notes:
    - Timbre is controllable
    - Improved robustness to unseen speakers
    - Uses layer-wise error modeling to improve performance
- Disentangling prosody and timbre embeddings via voice conversion
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, disentanglement, prosody, timbre
  - Interest: ⭐
  - Notes: decomposes speech into prosody and timbre; uses RVC
- PRVAE-VC2: Non-Parallel Voice Conversion by Distillation of Speech Representations [demo (playback broken)]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, knowledge distillation
  - Interest: ⭐
  - Notes: knowledge distillation
- HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, text, prompt
  - Interest: ⭐⭐⭐
  - Notes:
    - Supports text and audio prompts
    - Timbre similarity seems slightly low, but the idea looks promising
- DreamVoice: Text-Guided Voice Conversion [HuggingFace] [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, dataset, diffusion, text, prompt
  - Interest: ⭐⭐⭐
  - Notes:
    - Text-guided VC
    - Can be combined with other voice conversion models
    - Also provides a dataset
    - Diffusion probabilistic models + classifier-free guidance (see the sketch below)
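For context on the classifier-free guidance noted above, here is a minimal sketch of one guided denoising step; it is a generic CFG recipe, not DreamVoice's actual code, and the `model` interface and `guidance_scale` value are assumptions.

```python
import torch

def cfg_noise_estimate(model, x_t, t, cond, guidance_scale=3.0):
    """Classifier-free guidance (sketch): combine conditional and
    unconditional noise predictions from one diffusion model.

    model(x_t, t, cond) predicts the noise; cond=None stands for the
    unconditional branch (the condition randomly dropped in training).
    """
    eps_cond = model(x_t, t, cond)      # prediction with the text/voice prompt
    eps_uncond = model(x_t, t, None)    # prediction without any prompt
    # Push the estimate away from the unconditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```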
- Hear Your Face: Face-based voice conversion with F0 estimation [code] [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, F0, pitch
  - Interest: ⭐
  - Notes:
    - Performs voice conversion from a photo of a face..!
    - Builds on the So-VITS-SVC idea
    - Good demo quality
- Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: VC, pitch, F0, voiced-unvoiced, prosody, knowledge distillation, streaming
  - Interest: ⭐
  - Notes:
    - A shame the demo sentences are not more varied
    - "The three dimensional prosody feature consists of z-scored log-F0 and energy and a binary voiced-unvoiced flag" (see the sketch below)
SVC: Singing Voice Conversion
- LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance [demo]
  - Session: Speech Synthesis: Voice Conversion 2
  - Keywords: SVC, diffusion
  - Interest: ⭐
  - Notes: uses a pre-trained So-VITS-SVC
- MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance [demo]
  - Session: Speech Synthesis: Singing Voice Synthesis
  - Keywords: SVC, diffusion
  - Interest: ⭐
  - Notes:
    - Can synthesize singing voices from a speech dataset
- Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis [demo]
  - Session: Speech Synthesis: Singing Voice Synthesis
  - Keywords: SVC, pitch
  - Interest: ⭐⭐⭐
  - Notes:
    - Good quality!
    - VITS + music score
    - The compared VISinger2 (code) also looks worth reading
    - "owing to deterministic pitch conditioning, they do not fully address the one-to-many problem"
    - "integrates variational autoencoders for the periodic and aperiodic components"
    - "eliminates the dependency on an external aligner by estimating the phoneme alignment through a monotonic alignment search within note boundaries"
    - Pitch augmentation (see the sketch below)
      - "we apply the smoothed pitch augmentation method to ensure that the latent variables capture both wide and narrow pitch variations"
      - "For pitch augmentation, we extracted the smoothed F0 using a median filter with a kernel size of 13."
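A minimal sketch of the smoothed pitch augmentation quoted above: median-filter the F0 contour with kernel size 13, as the paper states, and train on the smoothed contour alongside the raw one. How exactly the two contours are combined during training is not spelled out here, so that part is left to the caller.

```python
import numpy as np
from scipy.signal import medfilt

def smoothed_pitch(f0, kernel_size=13):
    """Smoothed F0 via a median filter (kernel size 13 per the paper);
    exposing the model to both raw and smoothed contours lets the latent
    variables capture wide and narrow pitch variations."""
    return medfilt(np.asarray(f0, dtype=np.float64), kernel_size=kernel_size)

# Usage sketch: augment a training example with its smoothed pitch target.
# f0_aug = smoothed_pitch(f0)
```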
- X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning [demo]
  - Session: Speech Synthesis: Singing Voice Synthesis
  - Keywords: SVC, multi-lingual
  - Interest: ⭐⭐⭐
  - Notes:
    - Netmarble~!
    - Can generate a song that switches naturally between Japanese, Chinese, and Korean within a single sample
    - Separates language and speaker: Mix-LN transformer ("Mix-LN mixes the feature statistics of the speaker embedding, which confuses the model by the mismatched speaker information")
    - CFM-based decoder
      - Uses conditional flow matching (Matcha-TTS, P-Flow)
    - Generates similar-sounding voices even without explicit pitch prediction
    - Zero-shot operation, richer expressiveness, and the use of speech data are left as future work
TTS & Speech Synthesis
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [demo]
  - Session: Zero-shot TTS
  - Keywords: TTS, LLM, emotion
  - Interest: ⭐⭐⭐
  - Notes:
    - Reflects the emotion of the source audio!
    - Uses a pretrained LLM to improve semantic token quality
    - A must-skim
- DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness
  - Session: Zero-shot TTS
  - Keywords: TTS, HuBERT, teacher-student EMA model, noise augmentation
  - Interest: ⭐⭐⭐
  - Notes:
    - Uses a DINO loss to improve voice cloning (see the sketch below)
    - "substantial improvements in naturalness and speaker similarity in both clean and especially real-life noisy scenarios, outperforming traditional AAM-Softmax-based training methods"
    - Because HuBERT embeddings capture whether noise is present, noisy data can be used for training without noise labels
    - Uses the pretrained CAM++ speaker verification model
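For reference, a minimal sketch of the DINO-style teacher-student recipe (an EMA teacher plus a cross-entropy between the sharpened teacher distribution and the student distribution). The momentum, temperatures, and centering term are generic DINO defaults, not values from this paper.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution (no gradient flows to the teacher)."""
    p_t = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    log_p_s = F.log_softmax(student_out / t_student, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()
```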
- Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion
  - Session: Corpora-based Approaches in Automatic Emotion Recognition
  - Keywords: emotion, domain adaptation
  - Interest: ⭐⭐⭐
  - Notes:
    - I want to look more closely at the bin-based method
    - In my master's thesis I split values into 3 bins; this paper uses 5 and 10
    - The underlying paper also needs a look (see the kNN-VC sketch below): "We implement our idea using the K-nearest neighbors-voice conversion strategy [19], which is a recently proposed approach that achieves impressive results in VC despite its simplicity"
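A minimal sketch of the kNN voice conversion strategy the quote refers to: each source frame's self-supervised feature is replaced by the average of its k nearest neighbors among target-speaker features, keeping content while swapping timbre. The shapes and k are illustrative assumptions.

```python
import torch

def knn_vc(src_feats, tgt_feats, k=4):
    """kNN voice conversion sketch.

    src_feats: (T_src, D) self-supervised features of the source utterance
    tgt_feats: (N_tgt, D) pooled features from the target speaker's speech
    Returns (T_src, D) features that keep source content but target timbre;
    a vocoder then turns them back into a waveform.
    """
    # Cosine distance between every source frame and every target frame.
    src = torch.nn.functional.normalize(src_feats, dim=-1)
    tgt = torch.nn.functional.normalize(tgt_feats, dim=-1)
    dists = 1.0 - src @ tgt.T                     # (T_src, N_tgt)
    idx = dists.topk(k, largest=False).indices    # k nearest target frames
    return tgt_feats[idx].mean(dim=1)             # average the neighbors
```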
- GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, emotion, dataset
  - Interest: ⭐
  - Notes:
    - Uses Glottalization, Tenseness, and Resonance labels
      - Glottalization: control of air flow due to the tension of the glottis (i.e., throat)
      - Tenseness: tense vowels in pronunciation involve tension in the tip and root of the tongue, while lax vowels are the opposite
      - Resonance: integration of articulatory phonetics with vocal register insight (presumably chest voice vs. head voice)
- TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, expressive, text, prompt
  - Interest: ⭐⭐⭐
  - Notes:
    - Extracts the speaking style from the text itself, without a reference speech
    - The demo includes Korean
    - Emotion expression seems to weaken for unseen speakers (probably unavoidable with only 4 training speakers; curious what the results would look like with more speakers!)
    - Amazing that this was trained on just two 2080 Tis..
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, expressive, LM
  - Interest: ⭐⭐
  - Notes:
    - Can synthesize "umm~" and laughter in multiple variations
    - LM-based TTS model; the acoustic decoder is based on VALL-E
- Text-aware and Context-aware Expressive Audiobook Speech Synthesis [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, emotion, LM, text
  - Interest: ⭐⭐
  - Notes:
    - Considers not only the text but also the surrounding context
    - Sounds less stiff than other models (the demo is in Chinese, which limits my judgment)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [toolkit]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, emotion, text, prompt
  - Interest: ⭐⭐⭐
  - Notes:
    - Uses emotionally loaded text as the prompt (e.g., (neutral) "I see." / (happy) "Really!")
    - Contributions:
      - an architecture that allows for separate modeling of a speaker's voice and the prosody of an utterance, using a natural language prompt for the latter
      - a training strategy to learn a strongly generalized prompt conditioning
      - a pipeline that allows users to generate speech with fitting prosody without manually selecting the emotion by simply using the text to be read as the prompt
- Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, emotion
  - Interest: ⭐
  - Notes: uses the difference between the base model and a model fine-tuned on each emotion as an emotion vector (see the sketch below)
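A minimal sketch of that weight-space arithmetic on PyTorch state dicts: the emotion vector is the fine-tuned-minus-base weight difference, scaled and added back at load time. The `strength` knob is my assumption for intensity control.

```python
import torch

def emotion_vector(base_sd, finetuned_sd):
    """Emotion vector = fine-tuned weights minus base weights."""
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}

def apply_emotion(base_sd, emo_vec, strength=1.0):
    """Move the base model along the emotion direction; strength
    interpolates (0 = neutral base, 1 = fully fine-tuned emotion)."""
    return {k: base_sd[k] + strength * emo_vec[k] for k in base_sd}

# Usage sketch: blend "happy" at 60% intensity into the base TTS model.
# model.load_state_dict(apply_emotion(base_sd, emotion_vector(base_sd, happy_sd), 0.6))
```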
- EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [demo]
  - Session: Speech Synthesis: Expressivity and Emotion
  - Keywords: TTS, emotion
  - Interest: ⭐⭐
  - Notes: during my master's I wanted to try something like an emotion sphere, so I'm curious about the method this paper proposes!
- Word-level Text Markup for Prosody Control in Speech Synthesis [code] [demo]
  - Session: Speech Synthesis: Prosody
  - Keywords: TTS, prosody
  - Interest: ⭐⭐
  - Notes: prosodic markup; learns prosody in an unsupervised way and makes it controllable
- Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech [demo]
  - Session: Speech Synthesis: Prosody
  - Keywords: TTS, prosody
  - Interest: ⭐⭐
  - Notes:
    - Replaces the deterministic duration predictor (DET) of existing non-autoregressive TTS with probabilistic duration modeling (an OT-CFM-based duration model, FM) and compares the two (see the sketch below)
    - "We explore the effects of replacing the MSE-based duration predictor in existing NAR TTS approaches with a log-domain duration model based on conditional flow matching"
    - Models used for comparison:
      - a deterministic acoustic model (FastSpeech 2)
      - an advanced deep generative acoustic model (Matcha-TTS)
      - a probabilistic end-to-end TTS model (VITS)
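A minimal sketch of training a log-domain duration model with conditional flow matching, in the spirit of the quote above: sample a point on the straight-line (OT) path between noise and the log-duration target and regress the constant velocity. The network interface and sigma_min are generic CFM conventions, not the paper's code.

```python
import torch

def cfm_duration_loss(v_net, log_dur, cond, sigma_min=1e-4):
    """OT-CFM loss for a log-domain duration predictor (sketch).

    v_net: predicts a velocity field v(x_t, t, cond)
    log_dur: (B, L) log phoneme durations (the data sample x1)
    cond: (B, L, D) text/phoneme encoder states
    """
    x1 = log_dur
    x0 = torch.randn_like(x1)                         # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)   # uniform time in [0, 1)
    # Straight-line (optimal-transport) probability path.
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target_v = x1 - (1 - sigma_min) * x0              # constant target velocity
    return torch.nn.functional.mse_loss(v_net(x_t, t, cond), target_v)
```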
- Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
  - Session: Speech Synthesis: Prosody
  - Keywords: TTS, prosody, duration
  - Interest: ⭐⭐⭐
  - Notes:
    - "designed to precisely control the length of generated speech while maintaining speech quality at different speech rates"
    - "a novel duration model based on MaskGIT to enhance the diversity and quality of the phoneme durations"
- Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling [demo]
  - Session: Speech Synthesis: Prosody
  - Keywords: TTS, prosody, diffusion
  - Interest: ⭐⭐
  - Notes:
    - Targets the problem of the intonation not resembling the reference
    - Contributions:
      1. Speaker timbre is a global attribute: a speaker encoder extracts a global speaker embedding (input: mel spectrograms)
      2. Diffusion model as a pitch predictor: matches speech prosody diversity by leveraging its natural advantage in generating content diversity
      3. Prosody shows both global consistency and local variations: prosody is modeled hierarchically (frame-level, phoneme-level, and word-level) to improve the prosody of synthesized speech
- Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis [demo]
  - Session: Speech Synthesis: Paradigms and Methods 1
  - Keywords: TTS
  - Interest: ⭐
  - Notes:
    - Can choose between a natural speaking style and a clearly articulated one
    - The idea section seems worth a closer look
- Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [demo]
  - Session: Speech Synthesis: Paradigms and Methods 1
  - Keywords: TTS, codec
  - Interest: ⭐⭐
  - Notes:
    - Single-codebook codec; compression and reconstruction on the mel spectrogram
    - "Single-Codec performs compression and reconstruction on Mel Spectrogram instead of the raw waveform, enabling efficient compression of speech information while preserving important details, as stated in Tortoise-TTS"
- ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis [demo]
  - Session: Speech Synthesis: Paradigms and Methods 1
  - Keywords: TTS, cross-lingual, code-switching
  - Interest: ⭐⭐⭐
  - Notes:
    - From Hyundai Motor
    - Supports English-Korean code-switching within a single sentence (cross-lingual and code-mixed speech with high naturalness); this part of the idea deserves a close look
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning
  - Session: Speech Synthesis: Paradigms and Methods 1
  - Keywords: TTS
  - Interest: ⭐
  - Notes:
    - From Zoom~
    - "GAN-based, FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning"
    - "Multi-feature Generative Adversarial Training pipeline which uses our discriminator to enhance both acoustic and prosodic features for natural and expressive TTS"
- Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
  - Session: Speech Synthesis: Paradigms and Methods 1
  - Keywords: TTS, markup, expressive
  - Interest: ⭐⭐⭐
  - Notes:
    - Uses a pre-trained Voicebox to generate speech with the three kinds of control below
      - Punctuation: It's good!
      - Emphasis: It's good
      - Laughter: It's good [laughter]
    - "efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules"
- Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [code] [demo]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS
  - Interest: ⭐⭐⭐
  - Notes:
    - Neural codec language model
    - "In contrast with previous TTS codec LM model that leverages decoder-only (GPT) transformers, Small-E relies on encoder-decoder architecture"
    - Can be easily pretrained and finetuned on midrange GPUs
    - Trained on long context
- Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [demo] [nvidia blog]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS, duration, LLM
  - Interest: ⭐⭐⭐
  - Notes:
    - NVIDIA, T5-TTS (T5: text-to-text model)
    - "first attempt at synthesizing multi-codebook neural audio codecs with an encoder-decoder architecture"
    - Makes the cross-attention heads learn monotonic alignment
    - Handles consecutively repeated words and sentences very naturally
- Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment [demo (not loading)]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS
  - Interest: ⭐⭐
  - Notes: can generate natural long-form speech from a sentence-level corpus alone
- (paper title) [code] [demo]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS, text, speech editing
  - Interest: ⭐⭐⭐
  - Notes:
    - Text-based speech editing
    - Acoustic and prosody consistency losses
      - Acoustic: quantify the smooth transition between the editing region and the adjacent context
      - Prosody: for capturing the prosody feature from the predicted masked region while also analyzing the overall prosody characteristics present in the original speech
- High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model [demo]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS, text
  - Interest: ⭐⭐⭐
  - Notes:
    - Excellent quality, and the controlled audio also sounds natural
    - Interpreting: text-to-semantic-token stage
      - k-means clustering on wav2vec 2.0 (see the sketch below)
      - mainly focuses on phonetic information, but also handles some prosodic information such as speech rate and overall pitch contour
    - Speaking: semantic-token-to-acoustic-token stage (HiFi-Codec)
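A minimal sketch of deriving discrete semantic tokens via k-means over wav2vec 2.0 hidden states, as the note describes; the checkpoint, layer index, and cluster count are common conventions and my assumptions, not this paper's setup.

```python
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2Model

def semantic_tokens(wavs, n_clusters=512, layer=12):
    """Quantize wav2vec 2.0 hidden states into discrete semantic
    token IDs via k-means (sketch). wavs: (B, num_samples) float tensor."""
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")
    model.eval()
    with torch.no_grad():
        out = model(wavs, output_hidden_states=True)
    feats = out.hidden_states[layer].flatten(0, 1).numpy()   # (B*T, D)
    km = KMeans(n_clusters=n_clusters).fit(feats)
    return km.labels_.reshape(wavs.size(0), -1)              # (B, T) token IDs
```

In practice the k-means codebook would be fit once over a large corpus and then reused to tokenize new utterances, rather than refit per batch as in this toy version.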
- (paper title) [code] [demo]
  - Session: Speech Synthesis: Paradigms and Methods 2
  - Keywords: TTS, vision, text
  - Interest: ⭐
  - Notes:
    - "generate speech and co-verbal facial movements from text, animating a virtual avatar"
    - "The proposed model generates mel-spectrograms and facial features (head, eyes, jaw and lip movements) to drive the virtual avatar's action units"
Speech Emotion Recognition
- An Effective Local Prototypical Mapping Network for Speech Emotion Recognition
  - Session: Corpora-based Approaches in Automatic Emotion Recognition
  - Keywords: emotion, prototype selection
  - Interest: ⭐⭐⭐
  - Notes:
    - Achieves 77.42% WA / 75.82% UA on IEMOCAP (see the WA/UA sketch below)
    - Proposes an MIL-based prototype selection method
    - Having tried to use MIL during my master's, I definitely want to read this
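For reference on the two IEMOCAP metrics: WA (weighted accuracy) is overall accuracy, while UA (unweighted accuracy) is the mean of per-class recalls, so every emotion class counts equally. A minimal sketch:

```python
from sklearn.metrics import accuracy_score, recall_score

def wa_ua(y_true, y_pred):
    """WA = overall accuracy; UA = macro-averaged per-class recall,
    which weights every emotion class equally regardless of its size."""
    wa = accuracy_score(y_true, y_pred)
    ua = recall_score(y_true, y_pred, average="macro")
    return wa, ua

# Example: on imbalanced labels WA and UA diverge.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1]
print(wa_ua(y_true, y_pred))  # WA = 5/6 ≈ 0.833, UA = mean(1.0, 1.0, 0.0) ≈ 0.667
```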
- Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction
  - Session: Corpora-based Approaches in Automatic Emotion Recognition
  - Keywords: emotion, joint training
  - Interest: ⭐⭐⭐
  - Notes:
    - Achieves 79.50% WA / 79.62% UA on IEMOCAP
    - Trains ASR and SER separately, then performs joint training
    - Shows that text is also an important signal for SER
Audio Captioning
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding [code]
  - Session: Audio Captioning, Tagging, and Audio-Text Retrieval
  - Keywords:
  - Interest: ⭐
  - Notes: outperforms the winner of DCASE 2023 Task 6A on almost all metrics
- Streaming Audio Transformers for Online Audio Tagging [code]
  - Session: Audio Captioning, Tagging, and Audio-Text Retrieval
  - Keywords: streaming
  - Interest: ⭐
  - Notes:
    - About 2 seconds of latency
    - "The best model, SAT-B, achieves an mAP of 45.1 with a 2s delay, using 8.2 Gflops and 36 MB of memory during inference."
- Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging [code]
  - Session: Audio Captioning, Tagging, and Audio-Text Retrieval
  - Keywords:
  - Interest: ⭐
  - Notes:
- ParaCLAP – Towards a general language-audio model for computational paralinguistic tasks [code]
  - Session: Audio Captioning, Tagging, and Audio-Text Retrieval
  - Keywords:
  - Interest: ⭐
  - Notes: starts from the SER task and extends to EMOTION, VAD, and GENDER; claims to "surpass the performance of open-source state-of-the-art models"
Etc
- Universal Score-based Speech Enhancement with High Content Preservation [code]
  - Session: Generative Speech Enhancement
  - Keywords:
  - Interest: ⭐
  - Notes:
- SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
  - Session: Speech Synthesis: Evaluation
  - Keywords: VC, evaluation
  - Interest: ⭐
  - Notes:
    - Assesses VC speaker similarity
    - Code of the predecessor paper: SVSNet: An end-to-end speaker voice similarity assessment model
- LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning [code] [demo]
  - Session: Speech Synthesis: Tools and Data
  - Keywords:
  - Interest: ⭐
  - Notes: looks useful for voice tagging
- Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline [code] [demo]
  - Session: Speech Synthesis: Voice Conversion 3
  - Keywords: VC, dataset
  - Interest: ⭐⭐
- VoxSim: A perceptual voice similarity dataset
  - Session: Oth
  - Keywords: dataset, speaker similarity
  - Interest: ⭐
  - Notes: 41k utterance pairs from the VoxCeleb dataset; 70k speaker similarity scores collected through a listening test
- SAMSEMO: New dataset for multilingual and multimodal emotion recognition [code]
  - Session: Oth
  - Keywords: dataset, multi-lingual, emotion
  - Interest: ⭐⭐