Voicebank: real French voices (CML-TTS) + anonymous pool + Qwen3 guard

Replaces the Kokoro-generated voicebank (English timbre on phonemized French -> an accent that Qwen3 cloned) with 41 real French voices from CML-TTS (studio audiobooks): 1 dedicated narrator, 18F/14M named, 4F/4M anonymous and reserved.

- scripts/import_voices.py: multi-shard parquet import, 1 clip per speaker (the cleanest by levenshtein), gender estimated from F0 (YIN, octave-error resistant), speech-rate filter (ref_text aligned with the audio).
- VoiceEntry.anonymous + assign_voices: "anonyme (...)" extras draw from a reserved pool, never mixed with the named voices; dedicated narrator (fr_narrator replaces fr_f_siwis).
- dedup._anon_attrs: gender/age inferred from the anonymous name (so the voice has the right gender).
- tts/qwen3.py: anti-drift guard (rejects/retries outputs that loop or are cut off, by estimating the chunk's plausible duration).

Known limitation: Qwen3 cannot synthesize 1-2 word fragments (reporting clauses, titles) -> gaps; to be addressed (Kokoro fallback or merging the reporting clauses).

Also includes earlier work in progress (pluggable mlx/lmstudio LLM backend refactor, benchmark, frontend/API adjustments).

Claude-Session: https://claude.ai/code/session_01XSVvcy1mfb4k1xDgib9vVU
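The anti-drift guard in tts/qwen3.py is not part of this excerpt. As a rough illustration of "estimating the chunk's plausible duration", here is a minimal sketch: it derives a duration window from the word count and retries when the synthesized audio falls outside it. Every name here (`plausible_duration`, `is_plausible`, `synthesize_with_guard`, the `synth` callable, the `slack` factor) is hypothetical, and reusing the 1.5-4.5 words/s range from the reference-clip filter is an assumption, not the commit's actual thresholds.

```python
from __future__ import annotations

from typing import Callable, Optional

# Assumed speech-rate bounds in words per second. The import script uses this
# range to filter reference clips; applying it to synthesized chunks is a guess.
MIN_WPS, MAX_WPS = 1.5, 4.5

Audio = tuple[list[float], int]  # (samples, sample_rate)


def plausible_duration(text: str) -> tuple[float, float]:
    """Return the (shortest, longest) plausible duration of `text` in seconds."""
    n_words = max(1, len(text.split()))
    return n_words / MAX_WPS, n_words / MIN_WPS


def is_plausible(text: str, duration_s: float, slack: float = 1.5) -> bool:
    """True if the duration fits the window, widened by a tolerance factor."""
    shortest, longest = plausible_duration(text)
    return shortest / slack <= duration_s <= longest * slack


def synthesize_with_guard(text: str, synth: Callable[[str], Audio],
                          max_tries: int = 3) -> Optional[Audio]:
    """Call `synth` until its output has a plausible duration, else give up."""
    for _ in range(max_tries):
        samples, sr = synth(text)
        if is_plausible(text, len(samples) / sr):
            return samples, sr
    return None  # the caller decides on a fallback
```

A looping output shows up as a duration far above the window and a cut-off output as one far below it. A duration check alone would not help with the 1-2 word fragments mentioned under the known limitation, since those fail for a different reason.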
61
backend/scripts/delta_alternation.py
Normal file
@@ -0,0 +1,61 @@
"""Measure the effect of the alternation pass on speaker attribution (before/after).

For each model: load it once, analyze the chapter, intercept the speakers
JUST before `_repair_alternation` (the "before" state), then read the "after"
state and score both against the reference. This isolates the gain of the
deterministic pass, independently of the model's cost.
"""
from __future__ import annotations

import copy
import os
import sys

from inkflow.analysis import segmenter
from inkflow.analysis.benchmark import _load_reference, _score_counts, _counts_to_score
from inkflow.analysis.llm.client import LLM
from inkflow.analysis.llm.factory import reset_llm_cache
from inkflow.epub.parser import load_book, load_chapter_text
from inkflow.store import artifacts

SLUG = "la-colere-de-tiamat"
CH = int(os.environ.get("DELTA_CH", "5"))  # chapter index, overridable via DELTA_CH


def main(model_ids: list[str]) -> None:
    book = load_book(SLUG)
    chapter = next(c for c in book.chapters if c.index == CH)
    ct = load_chapter_text(SLUG, chapter)
    cast = artifacts.load_cast(SLUG)
    ref = _load_reference(SLUG, CH)

    orig_repair = segmenter._repair_alternation
    print(f"{'model':<40} {'before':>7} {'after':>7} {'delta':>7}")
    for model_id in model_ids:
        captured: dict[str, list] = {}

        def spy(segments, **kw):  # capture the state before the repair pass
            captured["before"] = copy.deepcopy(segments)
            # Forward the result in case the caller uses the return value.
            return orig_repair(segments, **kw)

        segmenter._repair_alternation = spy
        try:
            gemma = LLM(model_id=model_id)
            analysis, _ = segmenter.analyze_chapter(
                chapter, ct, gemma, book_chars=list(cast.characters),
                dedup_gemma=None)
        finally:
            # Always restore the original function, even if the analysis fails.
            segmenter._repair_alternation = orig_repair
            reset_llm_cache()

        from inkflow.models import ChapterAnalysis
        before = ChapterAnalysis(index=CH, title=ct.title,
                                 segments=captured["before"])
        s_before = _counts_to_score(CH, _score_counts(ref, before, cast))
        s_after = _counts_to_score(CH, _score_counts(ref, analysis, cast))
        b = s_before.speaker_acc_dialogue
        a = s_after.speaker_acc_dialogue
        print(f"{model_id:<40} {b:>6.1%} {a:>6.1%} {a - b:>+6.1%}")


if __name__ == "__main__":
    main(sys.argv[1:] or ["mlx-community/gemma-3-4b-it-4bit"])
299
backend/scripts/import_voices.py
Normal file
@@ -0,0 +1,299 @@
"""Import real French voices into the voicebank (clips + ref_text).

Problem solved: `build_voicebank()` used to generate the reference clips *with
Kokoro itself*, and most voices borrowed an **English** Kokoro timbre reading
phonemized French. The result was a strong English accent that Qwen3 cloned
faithfully. This script **replaces the whole bank** with real recordings of
French speakers, which gives Qwen3 a genuinely French-speaking timbre
reference.

Source: **CML-TTS French** (`ylacombe/cml-tts`, config `french`), CC-BY,
non-gated. An **audiobook** corpus tailored for TTS: studio voices, narrator
register, real prose. Parquet shards (with embedded 24 kHz WAV audio) are
downloaded via `huggingface_hub`, shard after shard, until the quotas are filled.

Role allocation (each voice = one distinct speaker, `speaker_id`):
  - 1 dedicated **narrator** (`fr_narrator`).
  - N **named voices** per gender (`fr_f_*`, `fr_m_*`) for the characters.
  - M **anonymous voices** per gender (`fr_anon_f_*`, `fr_anon_m_*`, `anonymous=True`),
    reserved for "anonyme (...)" extras by `assign_voices` (never mixed
    with the named voices).

Quality: one clip per speaker, the cleanest one (lowest `levenshtein`), bounded duration.
Gender is absent from the corpus -> estimated from **F0 (YIN, octave-error resistant)**.

Usage (from backend/, venv active):
    python scripts/import_voices.py                 # default quotas, REPLACES the bank
    python scripts/import_voices.py --named-f 18 --named-m 14 --anon 4
    python scripts/import_voices.py --shards french/dev/0002.parquet french/dev/0000.parquet

The clip is written at its native sr (24 kHz); Qwen3 resamples the reference.
The resulting bank has a `ref_audio` everywhere, so `build_voicebank()` (legacy)
will not regenerate it. `kokoro_voice` stays populated (same-gender preset)
for the Kokoro preview/draft; the final timbre comes from ref_audio via Qwen3.
"""
from __future__ import annotations

import argparse
import io
import sys
from pathlib import Path

import numpy as np
import soundfile as sf

# Lets the script run without `pip install -e`: add backend/ to the path.
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from inkflow.casting.voicebank import save_voicebank  # noqa: E402
from inkflow.config import VOICEBANK_DIR  # noqa: E402
from inkflow.models import VoiceEntry, Voicebank  # noqa: E402

# Fallback Kokoro presets per gender (preview/draft only; the final timbre
# comes from the ref_audio cloned by Qwen3). We cycle through them to vary the previews.
_KOKORO_BY_GENDER = {
    "female": ["af_bella", "af_heart", "af_nicole", "bf_emma"],
    "male": ["am_fenrir", "am_michael", "bm_george", "am_eric"],
}
# Default CML-TTS French shards (refs/convert/parquet branch). dev/test
# share a small fixed pool of speakers (~17F/17M in total); the variety is
# in train (each shard = a few distinct readers). We read test (the largest)
# first, then train shards until the quotas are filled.
_DEFAULT_SHARDS = (
    ["french/test/0000.parquet", "french/dev/0002.parquet"]
    + [f"french/train/{i:04d}.parquet" for i in range(12)]
)


def _to_mono(arr: np.ndarray) -> np.ndarray:
    arr = np.asarray(arr, dtype=np.float32)
    if arr.ndim > 1:
        arr = arr.mean(axis=1)  # soundfile layout is (frames, channels)
    return arr


def _yin_f0(frame: np.ndarray, sr: int, lo: int, hi: int, thresh: float = 0.15) -> float:
    """F0 of a frame via YIN (octave-error resistant). 0.0 if unvoiced.

    1) difference function d(tau); 2) cumulative mean normalized difference d'(tau);
    3) first tau below the absolute threshold (avoids picking the upper octave).
    Steps (2)-(3) are what make YIN robust to the octave errors of plain
    autocorrelation (which made a man pass for a woman).
    """
    n = len(frame)
    diff = np.zeros(hi + 1)
    for tau in range(1, hi + 1):
        d = frame[: n - tau] - frame[tau:n]
        diff[tau] = np.dot(d, d)
    cum = np.cumsum(diff[1:])
    cmnd = np.ones(hi + 1)
    taus = np.arange(1, hi + 1)
    cmnd[1:] = diff[1:] * taus / np.maximum(cum, 1e-9)
    tau = -1
    t = lo
    while t < hi:
        if cmnd[t] < thresh:
            while t + 1 < hi and cmnd[t + 1] < cmnd[t]:
                t += 1  # walk down to the local minimum
            tau = t
            break
        t += 1
    if tau == -1:  # no clear dip -> global minimum of the band
        tau = lo + int(np.argmin(cmnd[lo:hi]))
        if cmnd[tau] > 0.6:  # really no periodicity -> unvoiced
            return 0.0
    return sr / tau


def estimate_gender(arr: np.ndarray, sr: int) -> tuple[str, float]:
    """Estimate gender from the median F0 (per-frame YIN, pure numpy).

    Spoken voice: M ~85-180 Hz (median ~120), F ~165-255 Hz (median ~210).
    Returns ("unknown", med) if the median falls in the ambiguous 150-180 Hz
    zone: we would rather discard the speaker than misclassify them (there are
    enough candidates).
    """
    win = int(0.04 * sr)
    hop = win // 2
    lo = max(1, int(sr / 350))  # shortest lag searched, i.e. 350 Hz
    hi = int(sr / 70)           # longest lag searched, i.e. 70 Hz
    energy_thresh = 0.10 * np.sqrt(np.mean(arr ** 2) + 1e-9)
    f0s: list[float] = []
    for start in range(0, max(0, len(arr) - win), hop):
        frame = arr[start:start + win].astype(np.float64)
        if np.sqrt(np.mean(frame ** 2)) < energy_thresh:
            continue  # skip near-silent frames
        f0 = _yin_f0(frame - frame.mean(), sr, lo, hi)
        if f0 > 0:
            f0s.append(f0)
    if len(f0s) < 10:
        return "unknown", 0.0
    med = float(np.median(f0s))
    if 150 <= med <= 180:
        return "unknown", med
    return ("male" if med < 165 else "female"), med


def _iter_parquet_rows(dataset: str, shard: str):
    """Download the parquet shard (embedded audio) and iterate over its rows as dicts."""
    from huggingface_hub import hf_hub_download
    import pyarrow.parquet as pq

    print(f"  · downloading {shard}…", flush=True)
    path = hf_hub_download(dataset, shard, repo_type="dataset",
                           revision="refs/convert/parquet")
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches(batch_size=128):
        cols = {name: batch.column(name) for name in batch.schema.names}
        for i in range(batch.num_rows):
            yield {name: col[i].as_py() for name, col in cols.items()}


def _gather_voices(dataset, shards, min_dur, max_dur, max_lev, need_f, need_m):
    """Collect distinct speakers classified by gender (YIN), shard by shard.

    Stops as soon as each gender has enough candidates. Returns
    {"female": [(spk, lev, bytes, text), ...sorted by quality], "male": [...]}.
    """
    best: dict[object, dict] = {}         # speaker_id -> best clip seen so far
    classified: dict[object, str] = {}    # speaker_id -> gender (cache)
    members = {"female": [], "male": []}  # gender -> speaker_ids

    for shard in shards:
        for row in _iter_parquet_rows(dataset, shard):
            dur = row.get("duration") or 0.0
            if not (min_dur <= dur <= max_dur):
                continue
            nwords = row.get("num_words") or 0
            # Speech rate: a ref_text that does not cover the audio (truncated
            # fragment, or audio full of silence) breaks Qwen3 cloning (empty
            # output). We require a plausible rate of 1.5-4.5 words/s.
            wps = nwords / dur if dur else 0
            if nwords < 8 or not (1.5 <= wps <= 4.5):
                continue
            lev = (row.get("levenshtein") or 0) / max(nwords, 1)
            if lev > max_lev:
                continue
            spk = row.get("speaker_id")
            text = (row.get("text") or "").strip()
            if spk is None or len(text) < 15:
                continue
            cur = best.get(spk)
            if cur is None or lev < cur["lev"]:
                best[spk] = {"lev": lev, "bytes": row["audio"]["bytes"], "text": text}

        # Classify the speakers that are new in this shard.
        for spk, c in best.items():
            if spk in classified:
                continue
            arr, sr = sf.read(io.BytesIO(c["bytes"]), dtype="float32")
            g, _ = estimate_gender(_to_mono(arr), sr)
            classified[spk] = g
            if g in members:
                members[g].append(spk)
        nf, nm = len(members["female"]), len(members["male"])
        print(f"    -> {nf} female / {nm} male candidates", flush=True)
        if nf >= need_f and nm >= need_m:
            break

    # Build the buckets from `best`, so that a cleaner clip found in a later
    # shard replaces the one seen when the speaker was first classified.
    buckets = {
        g: sorted(((spk, best[spk]["lev"], best[spk]["bytes"], best[spk]["text"])
                   for spk in spks), key=lambda t: t[1])  # cleanest first
        for g, spks in members.items()
    }
    return buckets


def _write_clip(vid: str, raw: bytes) -> tuple[str, int]:
    """Write the clip as mono WAV at its native sample rate; return (relative path, sr)."""
    arr, sr = sf.read(io.BytesIO(raw), dtype="float32")
    arr = _to_mono(arr)
    rel = f"clips/{vid}.wav"
    sf.write(str(VOICEBANK_DIR / rel), arr, sr)
    return rel, sr


def _entry(vid, gender, idx, spk, text, *, anonymous, label) -> VoiceEntry:
    # `spk` is a (speaker_id, lev, bytes, text) tuple from `_gather_voices`.
    kokoro = _KOKORO_BY_GENDER[gender][(idx - 1) % len(_KOKORO_BY_GENDER[gender])]
    rel, _ = _write_clip(vid, spk[2])
    return VoiceEntry(id=vid, kokoro_voice=kokoro, gender=gender, age="adult",
                      lang="fr", label=label, ref_audio=rel, ref_text=text,
                      anonymous=anonymous)


def import_voices(*, dataset, shards, named_f, named_m, anon, min_dur, max_dur,
                  max_lev) -> Voicebank:
    need_f = named_f + anon + 1  # +1 for the narrator (female)
    need_m = named_m + anon
    print(f"Target: {need_f} female / {need_m} male (distinct speakers).", flush=True)
    buckets = _gather_voices(dataset, shards, min_dur, max_dur, max_lev, need_f, need_m)

    fem, mal = buckets["female"], buckets["male"]
    if len(fem) < need_f or len(mal) < need_m:
        print(f"⚠ Not enough speakers (F {len(fem)}/{need_f}, M {len(mal)}/{need_m}): "
              "quotas reduced. Add shards via --shards.", flush=True)
        named_f = min(named_f, max(0, len(fem) - anon - 1))
        named_m = min(named_m, max(0, len(mal) - anon))
    if not fem:
        # Bail out before wiping the existing bank: the narrator needs a female voice.
        raise SystemExit("No female candidate voice: cannot pick a narrator.")

    # Full replacement: remove the existing clips.
    clips = VOICEBANK_DIR / "clips"
    clips.mkdir(parents=True, exist_ok=True)
    for old in clips.glob("*.wav"):
        old.unlink()

    entries: list[VoiceEntry] = []
    fi = mi = 0  # cursors into the quality-sorted buckets

    # 1) Narrator (first female voice, the cleanest one).
    spk = fem[fi]
    fi += 1
    entries.append(_entry("fr_narrator", "female", 1, spk, spk[3],
                          anonymous=False, label="Narrateur (FR)"))
    # 2) Named voices.
    for i in range(1, named_f + 1):
        spk = fem[fi]
        fi += 1
        entries.append(_entry(f"fr_f_{i}", "female", i, spk, spk[3],
                              anonymous=False, label=f"Voix F {i} (FR)"))
    for i in range(1, named_m + 1):
        spk = mal[mi]
        mi += 1
        entries.append(_entry(f"fr_m_{i}", "male", i, spk, spk[3],
                              anonymous=False, label=f"Voix H {i} (FR)"))
    # 3) Anonymous voices (reserved for extras).
    for i in range(1, anon + 1):
        if fi >= len(fem):
            break
        spk = fem[fi]
        fi += 1
        entries.append(_entry(f"fr_anon_f_{i}", "female", i, spk, spk[3],
                              anonymous=True, label=f"Anonyme F {i} (FR)"))
    for i in range(1, anon + 1):
        if mi >= len(mal):
            break
        spk = mal[mi]
        mi += 1
        entries.append(_entry(f"fr_anon_m_{i}", "male", i, spk, spk[3],
                              anonymous=True, label=f"Anonyme H {i} (FR)"))

    vb = Voicebank(entries=entries)
    save_voicebank(vb)
    na = sum(1 for e in entries if e.anonymous)
    print(f"\n✓ {len(entries)} voices → {VOICEBANK_DIR / 'metadata.json'}")
    print(f"  narrator 1 · named {len(entries) - na - 1} · anonymous {na}")
    for e in entries:
        tag = " [anon]" if e.anonymous else ""
        print(f"    {e.id:14} {e.gender:6} kokoro={e.kokoro_voice}{tag}")
    return vb


def main() -> None:
    # RawDescriptionHelpFormatter keeps the docstring's layout (usage examples) in --help.
    p = argparse.ArgumentParser(description=__doc__,
                                formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--dataset", default="ylacombe/cml-tts")
    p.add_argument("--shards", nargs="+", default=_DEFAULT_SHARDS,
                   help="Parquet shards to consume in order until the quotas are filled.")
    p.add_argument("--named-f", type=int, default=18, help="Named female voices.")
    p.add_argument("--named-m", type=int, default=14, help="Named male voices.")
    p.add_argument("--anon", type=int, default=4, help="Anonymous voices per gender.")
    p.add_argument("--min-dur", type=float, default=6.0)
    p.add_argument("--max-dur", type=float, default=15.0)
    p.add_argument("--max-lev", type=float, default=0.5,
                   help="Max Levenshtein distance per word (quality; lower = cleaner).")
    args = p.parse_args()
    import_voices(dataset=args.dataset, shards=args.shards, named_f=args.named_f,
                  named_m=args.named_m, anon=args.anon, min_dur=args.min_dur,
                  max_dur=args.max_dur, max_lev=args.max_lev)


if __name__ == "__main__":
    main()
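
The F0-based gender estimate is easy to sanity-check on synthetic tones. The sketch below is a dependency-free variant of the three YIN steps described in `_yin_f0` (difference function, cumulative mean normalized difference, first dip below the threshold), written in plain Python rather than NumPy so it runs anywhere. It is an illustration, not the script's implementation: `yin_f0`, `classify`, and `tone` are hypothetical names, the 8 kHz sample rate is chosen only to keep the frame small, and the global-minimum fallback of the original is omitted (the sketch returns 0.0 when no lag dips below the threshold).

```python
import math


def yin_f0(frame, sr, f_min=70.0, f_max=350.0, thresh=0.15):
    """Estimate the F0 of one frame; 0.0 if no lag dips below `thresh`."""
    n = len(frame)
    lo, hi = max(1, int(sr / f_max)), int(sr / f_min)
    # 1) Difference function d(tau).
    diff = [0.0] * (hi + 1)
    for tau in range(1, hi + 1):
        diff[tau] = sum((frame[i] - frame[i + tau]) ** 2 for i in range(n - tau))
    # 2) Cumulative mean normalized difference d'(tau).
    cmnd = [1.0] * (hi + 1)
    running = 0.0
    for tau in range(1, hi + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / max(running, 1e-9)
    # 3) First lag under the threshold, walked down to its local minimum.
    for t in range(lo, hi):
        if cmnd[t] < thresh:
            while t + 1 < hi and cmnd[t + 1] < cmnd[t]:
                t += 1
            return sr / t
    return 0.0


def classify(median_f0):
    """Same decision rule as the script: 150-180 Hz is treated as ambiguous."""
    if 150 <= median_f0 <= 180:
        return "unknown"
    return "male" if median_f0 < 150 else "female"


def tone(freq, sr=8000, seconds=0.04):
    """One 40 ms frame of a pure sine wave."""
    return [math.sin(2 * math.pi * freq * i / sr) for i in range(int(sr * seconds))]


low = yin_f0(tone(120.0), 8000)   # typical male median per the docstring
high = yin_f0(tone(210.0), 8000)  # typical female median per the docstring
```

Because the lag is an integer number of samples, the estimate is quantized (finer at higher sample rates), which is harmless for a median-based male/female decision.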