Detect Deepfake Audio
Upload a WAV audio file to classify it as real or deepfake using our trained ML model.
Generate Voice Clone
Model Evaluation Results
Three feature extraction strategies were trained and compared using 5-fold stratified cross-validation on the combined OSR + synthetic dataset.
| Model | Feature Dims | Accuracy | Precision | Recall | F1 | Inference (1k samples) |
|---|---|---|---|---|---|---|
Results file not found. Run the training pipeline to generate metrics:
cd training && pip install -r requirements.txt && python train.py --csv ../osr_features.csv --output ../models/
Confusion Matrices
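The Accuracy, Precision, Recall, and F1 columns reported for each model can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of that arithmetic, assuming the deepfake class is treated as the positive class (the page does not state which class is positive). The function name `binary_metrics` is hypothetical and is not taken from `train.py`.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the four summary metrics from confusion-matrix counts.

    Assumption: "positive" means the deepfake class.
    tp = deepfakes flagged as deepfake, fp = real clips flagged as deepfake,
    fn = deepfakes missed,              tn = real clips passed as real.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only; these are not results from this project.
    print(binary_metrics(tp=40, fp=10, fn=20, tn=30))
```

For a vishing detector, recall on the deepfake class is usually the metric to watch, since a missed deepfake (a false negative) is the costly error.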
System Architecture
flowchart LR
subgraph Frontend["Frontend (GitHub Pages)"]
UI["index.html\n+ JS / CSS"]
end
subgraph Backend["Backend (local / container)"]
API["FastAPI\n:8000"]
DET["/detect\nPOST"]
GEN["/generate\nPOST"]
API --> DET
API --> GEN
end
subgraph Models["Models (models/)"]
M1["MFCC\n.pkl"]
M2["FFT\n.pkl"]
M3["Hybrid\n.pkl (best)"]
end
subgraph Training["Training Pipeline"]
T["train.py\n5-fold CV"]
end
UI -- "HTTP / CORS" --> API
DET -- "load" --> M3
GEN -- "Coqui TTS\n/ gTTS fallback" --> GEN
T -- "saves" --> Models
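As a rough illustration of the `UI -> API -> /detect` path in the diagram, the sketch below builds a `POST` request to `/detect` on port 8000 using only the Python standard library. Several details are assumptions, not documented behaviour: that the endpoint accepts a multipart file upload, that the form field is called `file`, and that the backend listens on `localhost`. The helper name `build_detect_request` is hypothetical.

```python
import urllib.request
import uuid


def build_detect_request(wav_bytes, filename="sample.wav",
                         base_url="http://localhost:8000",
                         field_name="file"):
    """Build (but do not send) a multipart/form-data POST to /detect.

    Assumptions: multipart upload, form field "file", host "localhost".
    Only the port (8000) and the route (/detect) come from the diagram.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="{field_name}"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: audio/wav\r\n\r\n",
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/detect",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


if __name__ == "__main__":
    req = build_detect_request(b"RIFF....WAVE")  # placeholder bytes, not real audio
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` requires the backend to be running. The response format is not described here, so the sketch stops before parsing it.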
About This Project
Overview
Voice Deepfake Vishing Detector & Generator is a graduation-level research project that builds a full pipeline for detecting AI-synthesised voices and demonstrating responsible voice-cloning for academic purposes.
Detection Pipeline
- MFCC-only: 13-dimensional Mel Frequency Cepstral Coefficient means — fast, lightweight, baseline model.
- FFT/Spectral-only: 6-dimensional feature vector (centroid, bandwidth, rolloff, log band energies) — frequency-domain characteristics.
- Hybrid (recommended): 19-dimensional MFCC + FFT concatenation — best accuracy/recall for IoT-grade deployment.
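To make the feature dimensions concrete, here is a minimal standard-library sketch of a 6-dimensional spectral vector and its concatenation with 13 MFCC means into the 19-dimensional hybrid vector. It is not the project's extraction code. The 85% rolloff threshold, the split into three equal-width bands, and the use of a single frame (rather than an average over frames) are assumptions chosen so that the dimensions add up. The MFCC values are taken as given because computing them is out of scope for a short example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_features(frame, sample_rate, rolloff_pct=0.85, n_bands=3):
    """Return [centroid, bandwidth, rolloff, log band energy x n_bands].

    Assumptions: rolloff_pct=0.85 and three equal-width bands.
    """
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    total = sum(mags) or 1e-12  # guard against an all-zero (silent) frame

    # Magnitude-weighted mean frequency and spread around it.
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    bandwidth = math.sqrt(
        sum((f - centroid) ** 2 * m for f, m in zip(freqs, mags)) / total)

    # Lowest frequency below which rolloff_pct of the magnitude lies.
    rolloff, cumulative = freqs[-1], 0.0
    for f, m in zip(freqs, mags):
        cumulative += m
        if cumulative >= rolloff_pct * total:
            rolloff = f
            break

    # Log energy in n_bands equal-width slices of the spectrum.
    edges = [i * len(mags) // n_bands for i in range(n_bands + 1)]
    log_energies = [
        math.log(sum(m * m for m in mags[lo:hi]) + 1e-12)
        for lo, hi in zip(edges, edges[1:])
    ]
    return [centroid, bandwidth, rolloff] + log_energies


def hybrid_vector(mfcc_means, spectral):
    """Concatenate 13 MFCC means with 6 spectral features (19 values)."""
    if len(mfcc_means) != 13 or len(spectral) != 6:
        raise ValueError("expected 13 MFCC means and 6 spectral features")
    return list(mfcc_means) + list(spectral)


if __name__ == "__main__":
    # A 1 kHz test tone sampled at 8 kHz; the MFCC means are dummy zeros.
    tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
    print(len(hybrid_vector([0.0] * 13, spectral_features(tone, 8000))))
```

A real pipeline would more likely use an FFT from a numerical library and average the per-frame values across the whole clip. The naive DFT here only keeps the example dependency-free.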
All models use Gradient Boosting with StandardScaler pre-processing and 5-fold cross-validation evaluation.
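The evaluation protocol can be sketched without any third-party dependency. The code below shows, under stated assumptions, what stratified 5-fold splitting does (each fold keeps roughly the overall real/deepfake ratio) and why the scaler statistics should be learned from the training folds only. It stands in for the scikit-learn `StratifiedKFold` and `StandardScaler` classes rather than reproducing `train.py`. All function names are hypothetical, and the Gradient Boosting classifier itself is omitted.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds that each keep the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test


def standardize(train_rows, test_rows):
    """Scale both sets using mean/std learned from the training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has zero spread; use 1.0 to avoid dividing by zero.
    scales = [statistics.pstdev(c) or 1.0 for c in columns]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, scales)]
                for row in rows]

    return scale(train_rows), scale(test_rows)


if __name__ == "__main__":
    labels = [0] * 60 + [1] * 40  # illustrative labels, not project data
    for train, test in cross_validation_splits(labels):
        print(len(train), len(test), sum(labels[i] for i in test))
```

Fitting the scaler inside each fold, rather than once on the full dataset, keeps test-fold statistics from leaking into training and inflating the reported scores.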
Generation Pipeline
Voice cloning uses IndexTTS2 (index-tts/index-tts by Bilibili), an industrial-level zero-shot TTS system that clones voice timbre and emotional expression from a short reference clip. It supports duration control and lets emotion be controlled separately from speaker timbre.
When IndexTTS2 is unavailable (not installed or INDEXTTS_DIR not set), the system falls back to gTTS with a clear warning. gTTS produces a generic voice — it is NOT a voice clone.
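The fallback rule described above can be sketched as a small selection function. This is an illustration of the stated logic, not the project's code: the function name `choose_tts_backend`, the returned labels, and the importable module name `indextts` are assumptions. Only the `INDEXTTS_DIR` variable and the two conditions (not installed, or variable not set) come from the text.

```python
import importlib.util
import os
import warnings


def choose_tts_backend(environ=None, module_name="indextts"):
    """Return "indextts2" when it looks usable, otherwise "gtts".

    Assumption: IndexTTS2 is importable as `module_name` when installed.
    """
    environ = os.environ if environ is None else environ
    has_dir = bool(environ.get("INDEXTTS_DIR"))
    installed = importlib.util.find_spec(module_name) is not None
    if has_dir and installed:
        return "indextts2"
    # Make the downgrade explicit so generic gTTS output is never
    # mistaken for a cloned voice.
    warnings.warn(
        "IndexTTS2 unavailable; falling back to gTTS, which produces a "
        "generic voice and is NOT a voice clone.",
        RuntimeWarning,
        stacklevel=2,
    )
    return "gtts"


if __name__ == "__main__":
    print(choose_tts_backend())
```

Passing the environment in as a parameter makes the decision easy to unit-test without modifying the real process environment.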
Deployment
- Frontend: Deployed automatically to GitHub Pages on every push to main.
- Backend: Run locally with uvicorn, or via Docker/GHCR container.
- CI: GitHub Actions runs lint (ruff) + unit tests on every PR.
Ethics & Security
- Research & educational use only.
- Consent is mandatory before cloning any voice.
- No audio is stored server-side beyond the single request lifetime.
- TLS protects audio in transit but does not verify that a voice is authentic.
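One common way to keep uploaded audio from outliving a request is to write it into a temporary directory that is removed as soon as processing returns. The sketch below shows that pattern with the standard library. It is an assumption about how such a guarantee could be met, not a description of the backend's actual code, and the names `process_upload` and `analyse` are hypothetical.

```python
import tempfile
from pathlib import Path


def process_upload(wav_bytes, analyse):
    """Run `analyse(path)` on the upload, then delete it unconditionally.

    The temporary directory and everything in it are removed when the
    `with` block exits, including when `analyse` raises an exception.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "upload.wav"
        path.write_bytes(wav_bytes)
        return analyse(path)


if __name__ == "__main__":
    size = process_upload(b"RIFF....WAVE", lambda p: p.stat().st_size)
    print(size)
```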