# English | [中文版](./README.md)
A real-time interactive streaming digital human system that achieves synchronized audio-video conversation and largely meets the requirements of commercial applications.
[wav2lip Demo](https://www.bilibili.com/video/BV1scwBeyELA/) | [ernerf Demo](https://www.bilibili.com/video/BV1G1421z73r/) | [musetalk Demo](https://www.bilibili.com/video/BV1gm421N7vQ/)
Mirror repository for users in mainland China:
## Features
1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human.
2. Supports voice cloning.
3. Supports interrupting the digital human while it is speaking.
4. Supports full-body video stitching.
5. Supports WebRTC and virtual camera output.
6. Supports motion choreography: plays custom videos when the digital human is not speaking.
7. Supports custom digital human avatars.
## 1. Installation
Tested on Ubuntu 24.04, Python 3.10, PyTorch 2.5.0, and CUDA 12.4.
### 1.1 Install Dependencies
```bash
conda create -n nerfstream python=3.10
conda activate nerfstream
# If your CUDA version is not 12.4 (check with "nvidia-smi"), install the PyTorch build that matches your CUDA version instead of the command below
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
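After installation, a quick sanity check (a minimal sketch; run it inside the activated `nerfstream` environment) confirms that PyTorch is installed and can see the GPU:

```python
# Report the installed PyTorch build and whether CUDA is visible.
try:
    import torch
    status = f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"
except ImportError:
    status = "PyTorch is not installed in this environment"
print(status)
```

If CUDA shows as unavailable, recheck the driver with `nvidia-smi` and reinstall the matching PyTorch build.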
For common installation issues, refer to the [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html).
For CUDA environment setup on Linux, refer to this article:
Troubleshooting for video connection issues:
## 2. Quick Start
- Download Models
Quark Cloud Drive:
Google Drive:
1. Copy `wav2lip256.pth` to the `models` directory of this project and rename it to `wav2lip.pth`.
2. Extract the `wav2lip256_avatar1.tar.gz` archive and copy the entire extracted folder to `data/avatars` of this project.
- Run the Project
Execute: `python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1`
The server must open the following ports: TCP: 8010; UDP: 1-65535
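If the server uses `ufw`, the required ports can be opened as follows (a sketch; adapt it to your firewall, and note that cloud providers usually also require matching security-group rules):

```shell
# Open the signaling port (TCP 8010) and the UDP range used for WebRTC media.
sudo ufw allow 8010/tcp
sudo ufw allow 1:65535/udp
```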
You can access the client in two ways:
(1) Open `http://serverip:8010/webrtcapi.html` in a browser. First click "start" to play the digital human video; then enter any text in the input box and submit it. The digital human will read the text aloud.
(2) Use the desktop client (download link: ).
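The text submission made by the web page can also be reproduced programmatically. The endpoint path and field names below are assumptions inferred from the bundled web client, not a documented API; verify them in `webrtcapi.html` before relying on them. The snippet only builds the JSON body:

```python
import json

# Hypothetical request body for submitting text to the digital human;
# the field names are assumptions -- check webrtcapi.html for the real ones.
payload = {"text": "Hello, world", "sessionid": 0}
body = json.dumps(payload)
print(body)
# Then POST it to the server, e.g.:
#   curl -X POST http://serverip:8010/human \
#        -H 'Content-Type: application/json' -d "$body"
```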
- Quick Experience
Create an instance with this image to get the project running immediately.
If you cannot access Hugging Face, run the following command before starting the project:
```bash
export HF_ENDPOINT=https://hf-mirror.com
```
## 3. More Usage
For detailed usage instructions:
## 4. Docker Run
No prior installation is required; run directly with Docker:
```bash
docker run --gpus all -it --network=host --rm registry.cn-zhangjiakou.aliyuncs.com/codewithgpu3/lipku-livetalking:toza2irpHZ
```
The code is located in `/root/livetalking`. First run `git pull` to fetch the latest code, then execute commands as described in Sections 2 and 3.
The following images are available:
- AutoDL Image:
[AutoDL Tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
- UCloud Image:
Supports opening any port; no additional SRS service deployment is required.
[UCloud Tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
## 5. Performance
- Performance depends mainly on the CPU and GPU: compressing each video stream consumes CPU, and the CPU load grows with the video resolution; each lip-sync inference depends on GPU performance.
- The number of concurrent streams while the digital humans are not speaking is limited by the CPU; the number of digital humans that can speak simultaneously is limited by the GPU.
- In the backend logs, `inferfps` refers to the GPU inference frame rate, and `finalfps` refers to the final streaming frame rate. Both need to be above 25 fps to achieve real-time performance. If `inferfps` is above 25 but `finalfps` is below 25, it indicates insufficient CPU performance.
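The rule of thumb above can be captured in a small helper (a sketch; `inferfps` and `finalfps` are the values printed in the backend logs):

```python
def diagnose_bottleneck(inferfps: float, finalfps: float, target: float = 25.0) -> str:
    """Interpret the backend fps logs: both rates must stay above `target` for real time."""
    if inferfps >= target and finalfps >= target:
        return "real-time"   # both stages keep up
    if inferfps >= target:
        return "CPU-bound"   # GPU inference keeps up, but encoding/streaming does not
    return "GPU-bound"       # inference itself is too slow

print(diagnose_bottleneck(60, 30))  # real-time
print(diagnose_bottleneck(40, 20))  # CPU-bound
print(diagnose_bottleneck(18, 18))  # GPU-bound
```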
- Real-Time Inference Performance
| Model | GPU Model | FPS |
| :---------- | :--------- | :--- |
| wav2lip256 | RTX 3060 | 60 |
| wav2lip256 | RTX 3080Ti | 120 |
| musetalk | RTX 3080Ti | 42 |
| musetalk | RTX 3090 | 45 |
| musetalk | RTX 4090 | 72 |
A GPU of RTX 3060 or higher is sufficient for wav2lip256, while musetalk requires an RTX 3080Ti or higher.
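Since each real-time stream needs roughly 25 fps of inference, the table above also gives a rough upper bound on concurrently speaking streams per GPU (a back-of-the-envelope estimate that assumes inference throughput scales linearly and ignores the CPU limit):

```python
def max_speaking_streams(infer_fps: float, per_stream_fps: float = 25.0) -> int:
    """Rough upper bound on simultaneously speaking digital humans per GPU."""
    return int(infer_fps // per_stream_fps)

# Using the benchmark numbers from the table above:
print(max_speaking_streams(120))  # wav2lip256 on RTX 3080Ti -> 4
print(max_speaking_streams(45))   # musetalk on RTX 3090 -> 1
```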
## 6. Commercial Version
The following extended features are available for users who are familiar with the open-source project and need to expand product capabilities:
1. High-definition wav2lip model.
2. Full voice interaction: supports interrupting the digital human’s response via a wake word or button to ask a new question.
3. Real-time synchronized subtitles: provides the frontend with events for the start and end of each sentence spoken by the digital human.
4. Each connection can specify a corresponding avatar and voice; accelerated avatar image loading.
5. Supports avatars (digital human images) with unlimited duration.
6. Provides a real-time audio stream input interface.
7. Transparent background for the digital human, supporting dynamic background overlay.
8. Real-time avatar switching, supporting multiple digital humans in the same scene.
9. Camera-driven digital human movements and facial expressions.
For more details:
## 7. Statement
Videos developed based on this project and published on platforms such as Bilibili, WeChat Channels, and Douyin must include the LiveTalking watermark and logo.
---
If this project is helpful to you, please give it a "Star". Contributions from developers interested in improving this project are also welcome.
* Knowledge Planet (for high-quality FAQs, best practices, and Q&A): https://t.zsxq.com/7NMyO
* WeChat Official Account: 数字人技术 (Digital Human Technology)
