Voice Deepfake Detector & Generator

Detect Deepfake Audio

Upload a WAV audio file to classify it as real or deepfake using our trained ML model.

Recent Detections

No detections yet. Upload an audio file to get started.

Model Evaluation Results

Three feature extraction strategies were trained and compared using 5-fold stratified cross-validation on the combined OSR + synthetic dataset.

Comparison of model performance metrics across different feature extraction methods
Model	Feature Dims	Accuracy	Precision	Recall	F1	Inference (1k samples)

Results file not found. Run the training pipeline to generate metrics:

cd training
pip install -r requirements.txt
python train.py --csv ../osr_features.csv --output ../models/

Confusion Matrices
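
The Accuracy, Precision, Recall, and F1 columns in the comparison table all follow from a binary confusion matrix. As a minimal sketch, assuming "deepfake" is treated as the positive class (the page does not say which class is positive) and using a hypothetical helper name, the four values can be derived like this:

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the four table metrics from confusion-matrix counts.

    Assumption: the positive class is "deepfake", so
    tp = deepfakes flagged as deepfake, fp = real clips flagged as deepfake,
    fn = deepfakes passed as real,     tn = real clips passed as real.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only; these are not results from this project.
example = binary_metrics(tp=45, fp=5, fn=10, tn=40)
```

For a vishing detector, recall is often the number to watch, because a missed deepfake (a false negative) is usually more costly than a false alarm.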

System Architecture

flowchart LR
  subgraph Frontend["Frontend (GitHub Pages)"]
    UI["index.html\n+ JS / CSS"]
  end
  subgraph Backend["Backend (local / container)"]
    API["FastAPI\n:8000"]
    DET["/detect\nPOST"]
    GEN["/generate\nPOST"]
    API --> DET
    API --> GEN
  end
  subgraph Models["Models (models/)"]
    M1["MFCC\n.pkl"]
    M2["FFT\n.pkl"]
    M3["Hybrid\n.pkl (best)"]
  end
  subgraph Training["Training Pipeline"]
    T["train.py\n5-fold CV"]
  end

  UI -- "HTTP / CORS" --> API
  DET -- "load" --> M3
  GEN -- "IndexTTS2\n/ gTTS fallback" --> GEN
  T -- "saves" --> Models
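
The UI-to-API hop in the diagram can also be exercised from a script. The sketch below builds the upload request with the standard library only. Several details are assumptions, since the page does not document the API contract: the host `localhost:8000`, a multipart form field named `file`, and a JSON response body. The function names are hypothetical.

```python
import json
import urllib.request
import uuid
from pathlib import Path

# Assumptions (not confirmed by the source): the API listens on localhost:8000,
# /detect accepts a multipart form field named "file", and replies with JSON.
API_URL = "http://localhost:8000/detect"


def build_detect_request(wav_bytes: bytes, filename: str = "sample.wav",
                         url: str = API_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying a single WAV file."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def detect(path: str) -> dict:
    """Send a WAV file to the backend and return the decoded JSON reply."""
    wav = Path(path)
    request = build_detect_request(wav.read_bytes(), wav.name)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `detect("clip.wav")` requires the backend to be running. `build_detect_request` can be inspected offline to check the payload before sending anything.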

About This Project

Overview

Voice Deepfake Vishing Detector & Generator is a graduation-level research project that builds a full pipeline for detecting AI-synthesised voices and demonstrating responsible voice-cloning for academic purposes.

Detection Pipeline

MFCC-only: 13-dimensional Mel Frequency Cepstral Coefficient means — fast, lightweight, baseline model.
FFT/Spectral-only: 6-dimensional feature vector (centroid, bandwidth, rolloff, log band energies) — frequency-domain characteristics.
Hybrid (recommended): 19-dimensional MFCC + FFT concatenation — best accuracy/recall for IoT-grade deployment.
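
The FFT/Spectral feature set is described here only by name, so the following is a minimal sketch of one way to compute such a 6-dimensional vector, not the project's actual extractor. It assumes three equal-width frequency bands (inferred from the dimension count), an 85% rolloff threshold, and magnitude-weighted centroid and bandwidth. It uses a naive standard-library DFT so that it runs without dependencies; a real pipeline would more likely use NumPy or librosa. All function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for a short frame)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def spectral_features(samples, sample_rate, rolloff_pct=0.85, eps=1e-10):
    """Return [centroid, bandwidth, rolloff, log_e_low, log_e_mid, log_e_high]."""
    mags = magnitude_spectrum(samples)
    n = len(samples)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    total = sum(mags) + eps

    # Magnitude-weighted mean frequency and spread around it.
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    bandwidth = math.sqrt(
        sum(((f - centroid) ** 2) * m for f, m in zip(freqs, mags)) / total
    )

    # Rolloff: lowest frequency below which rolloff_pct of the magnitude lies.
    threshold = rolloff_pct * total
    running, rolloff = 0.0, freqs[-1]
    for f, m in zip(freqs, mags):
        running += m
        if running >= threshold:
            rolloff = f
            break

    # Assumed band layout: three equal-width bands over the spectrum.
    third = len(mags) // 3
    bands = [mags[:third], mags[third:2 * third], mags[2 * third:]]
    log_energies = [math.log(sum(m * m for m in band) + eps) for band in bands]

    return [centroid, bandwidth, rolloff] + log_energies


# A 1 kHz test tone sampled at 8 kHz; its energy sits in the lowest band.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
features = spectral_features(tone, sample_rate=8000)
```

Concatenating 13 MFCC means with this vector would give the 19-dimensional hybrid input described in the list.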

All models use Gradient Boosting with StandardScaler pre-processing and 5-fold cross-validation evaluation.
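
To illustrate the evaluation setup, here is a dependency-free sketch of stratified fold assignment and standard scaling. It is not the code in `train.py`, and the helper names are hypothetical; scikit-learn's `StratifiedKFold` and `StandardScaler` provide the same behaviours in a real pipeline. The classifier step is omitted because Gradient Boosting needs an external library. The point of the sketch is that the scaler is fitted on the training fold only, so no statistics leak from the held-out fold.

```python
import random
from statistics import mean, pstdev


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal each class round-robin so every fold gets its share.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def fit_scaler(rows):
    """Per-column mean and population std (constant columns get std 1.0)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]
    return means, stds


def transform(rows, means, stds):
    """Standardise rows with statistics learned elsewhere (the training fold)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def scaled_folds(features, labels, k=5, seed=0):
    """Yield (train_X, train_y, test_X, test_y) with leak-free scaling."""
    for train, test in stratified_kfold(labels, k, seed):
        means, stds = fit_scaler([features[i] for i in train])
        yield (transform([features[i] for i in train], means, stds),
               [labels[i] for i in train],
               transform([features[i] for i in test], means, stds),
               [labels[i] for i in test])
```

Each tuple from `scaled_folds` can be passed to any classifier's fit and predict steps, and the per-fold metrics averaged afterwards.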

Generation Pipeline

Voice cloning uses IndexTTS2 (index-tts/index-tts by Bilibili) — an industrial-level zero-shot TTS system that clones voice timbre and emotional expression from a short reference clip. It supports duration control and emotion-style separation.

When IndexTTS2 is unavailable (not installed or INDEXTTS_DIR not set), the system falls back to gTTS with a clear warning. gTTS produces a generic voice — it is NOT a voice clone.
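
The fallback rule can be sketched as a small selection function. This is an illustration under assumptions, not the backend's actual code: the import name `indextts` is a guess at the package name, the extra check that `INDEXTTS_DIR` points at an existing directory goes beyond what is stated here, and the function name and return values are hypothetical.

```python
import importlib.util
import os
import warnings


def choose_tts_backend(env=None, module_name="indextts"):
    """Return "indextts2" when it looks usable, otherwise "gtts" with a warning.

    module_name is a hypothetical import name for IndexTTS2; adjust it to
    whatever the installed package actually exposes.
    """
    env = os.environ if env is None else env
    model_dir = env.get("INDEXTTS_DIR", "")
    installed = importlib.util.find_spec(module_name) is not None

    # Assumption: a usable setup needs both the package and a real directory.
    if installed and model_dir and os.path.isdir(model_dir):
        return "indextts2"

    warnings.warn(
        "IndexTTS2 unavailable; falling back to gTTS, which produces a "
        "generic voice and is not a voice clone.",
        RuntimeWarning,
        stacklevel=2,
    )
    return "gtts"
```

Emitting the warning at selection time makes the downgrade visible in server logs, so a generic gTTS voice is never mistaken for a cloning result.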

Deployment

Frontend: Deployed automatically to GitHub Pages on every push to main.
Backend: Run locally with uvicorn, or via Docker/GHCR container.
CI: GitHub Actions runs lint (ruff) + unit tests on every PR.

Ethics & Security

Research & educational use only.
Consent is mandatory before cloning any voice.
No audio is stored server-side beyond the single request lifetime.
TLS protects audio in transit but does not verify that a voice is genuine.
