For locally run, open-source AI tools on a Debian machine (assuming no internet at runtime, but initial installation and model downloads can occur beforehand), you can achieve text narration (via TTS) and video presentation composition by combining compatible components. Since LLMs are already installed, they can generate scripts, prompts, or content offline. Focus on tools that leverage Python (available via apt or pip) and common AI libraries like PyTorch or Hugging Face Transformers, which support local execution.
For narration, the following neural TTS systems convert text to natural-sounding audio. They run entirely offline after model setup, with models downloadable from Hugging Face or GitHub.
Piper TTS: A fast, lightweight neural TTS engine optimized for low-resource devices. Supports multiple languages and voices; output is high-quality and prosodic. Install via pip install piper-tts or from source (GitHub: rhasspy/piper). Models (e.g., ONNX format) are small (~100-500 MB) and run on CPU; GPU optional for faster inference. No internet needed post-download.
Coqui TTS: High-quality, multilingual AI TTS with voice cloning capabilities (the original company has shut down, but the open-source project remains usable). Supports fine-tuning on custom voices. Install via pip install TTS (GitHub: coqui-ai/TTS). Download models from Hugging Face (e.g., XTTS-v2 for zero-shot cloning). Runs on CPU/GPU; requires ~2-8 GB RAM/VRAM depending on model size. Excellent for narrative styles.
Tortoise TTS: Slow but ultra-high-fidelity TTS with strong expressiveness and voice cloning from short samples. Fast forks like Tortoise-TTS-fast improve speed. Install via pip install tortoise-tts (GitHub: neonbjb/tortoise-tts). Models are ~1-5 GB; best on GPU (4+ GB VRAM recommended) but CPU fallback available.
MeloTTS: Extremely fast TTS with good quality; supports English plus several other languages. Install from GitHub (myshell-ai/MeloTTS) via pip. Low hardware needs (~2 GB RAM); runs on CPU.
Parler TTS: Decoder-only TTS for controllable, expressive speech. Install from GitHub (huggingface/parler-tts) via pip. Supports streaming; ~1-3 GB models, CPU/GPU compatible.
Other options like StyleTTS2 (for prosody control) or Fish Speech (for quick cloning) are available via GitHub installs. Use these with your local LLM to generate text scripts, then pipe the output to TTS for audio files (e.g., WAV/MP3), as sketched below.
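As a concrete starting point, here is a minimal sketch (not a definitive implementation) of piping an LLM-generated script into Piper or Coqui TTS. The voice/model names and file paths are illustrative assumptions; substitute whatever models you downloaded during setup, and note that the Coqui Python API can differ slightly between releases.

```python
# Minimal sketch: turn an LLM-generated script into narration audio offline.
# Assumes the `piper` CLI is on PATH and a voice model (file name illustrative)
# was downloaded beforehand; the Coqui model name below is also illustrative.
import subprocess

script_text = "Welcome to this presentation on local AI tools for Debian."

# Option 1: Piper (reads text from stdin, writes a WAV file)
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "narration.wav"],
    input=script_text.encode("utf-8"),
    check=True,
)

# Option 2: Coqui TTS Python API (pip install TTS)
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/glow-tts")
tts.tts_to_file(text=script_text, file_path="narration.wav")
```

Either route yields a WAV file that the composition step further below can attach to the video.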
Direct text-to-video AI generation is compute-intensive but feasible locally. For "composing" presentations (e.g., slideshow-style videos with narration, transitions, and visuals), combine AI-generated elements (text/scripts from LLM, images from diffusion models, audio from TTS) using composition tools. Full generative text-to-video models create short clips from prompts.
The following models create videos from text prompts, suitable for dynamic presentation segments (e.g., animated explanations or scenes). They require a decent GPU (NVIDIA preferred, via CUDA), and models are downloadable from Hugging Face.
CogVideoX: Generates 6-second videos (720x480, 8 FPS) from English text prompts with high fidelity and motion adherence. Open-source (THUDM/CogVideoX on Hugging Face). Install via pip install transformers diffusers accelerate. Runs locally offline; enable CPU offload/tiling for memory efficiency. Hardware: 4-5 GB VRAM (e.g., RTX 3060+); inference takes ~3 minutes on a mid-range GPU. Great for short illustrative clips in presentations; a minimal Diffusers sketch follows this list.
Open-Sora: Supports text-to-video (and image-to-video) for 2-15 second clips at up to 720p. Handles various aspect ratios; two-stage pipeline (text-to-image via Flux, then to video). Open-source (hpcaitech/Open-Sora on GitHub). Install via conda/Python 3.10 + pip dependencies (e.g., xformers for speed). Offline after model download. Hardware: Single GPU (50+ GB VRAM for low-res) or multi-GPU (2-8 cards, 40-60 GB each) for higher res; ~1-30 minutes per video. Ideal for presentation visuals like animated diagrams.
Wan 2.2: Efficient text-to-video model with strong stylization and text readability. Open-weight (Wan-AI/Wan2.2-T2V on Hugging Face). Install via Diffusers library. Offline capable; optimized for local GPUs (8-16 GB VRAM recommended). Generates short clips with good motion; suitable for educational or demo videos.
Other models like Mochi 1 (genmo-ai/models on GitHub) require high-end hardware (4x H100 GPUs) and are less practical for standard Debian setups. VideoCrafter (TencentARC/VideoCrafter on GitHub) is another option for text/image-to-video.
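For the generative route, the sketch below shows roughly how CogVideoX can be driven through the Diffusers library. It assumes the smaller THUDM/CogVideoX-2b checkpoint, a CUDA-capable GPU, and a recent diffusers release that ships CogVideoXPipeline; exact option names may shift between versions.

```python
# Minimal sketch: generate a ~6-second clip with CogVideoX via Diffusers.
# Assumes THUDM/CogVideoX-2b weights were downloaded beforehand and a recent
# diffusers release (with CogVideoXPipeline) plus a CUDA GPU are available.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keeps VRAM usage low by staging weights on the CPU
pipe.vae.enable_tiling()         # reduces peak memory when decoding frames

prompt = "A clean animated diagram showing data flowing from a laptop to a server."
frames = pipe(prompt=prompt, num_frames=49, num_inference_steps=50).frames[0]
export_to_video(frames, "clip.mp4", fps=8)  # 49 frames at 8 FPS is roughly 6 seconds
```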
To "compose" full videos (e.g., stitch AI-generated images/slides, add narration audio, transitions): Use these non-AI but scriptable tools with AI outputs.
MoviePy: Python library for video editing/composition. Open-source (GitHub: Zulko/moviepy); install via pip install moviepy. Combine LLM-generated text (as slides via Pillow images), TTS audio, and AI images (e.g., from Stable Diffusion) into videos with effects/transitions. Runs on CPU; no GPU needed. Example: Generate slides as PNGs, overlay text, add voiceover, export MP4.
FFmpeg: Command-line tool for video assembly (available via apt install ffmpeg). Open-source; stitch images/audio into videos offline. Use with scripts to automate (e.g., LLM outputs script, TTS makes audio, Stable Diffusion makes frames).
Pair these with Stable Diffusion (install via pip install diffusers, with models from Hugging Face) for generating images from text prompts.
For a more automated storyteller-style tool, check StoryTeller (GitHub: jaketae/storyteller). It's open-source and combines Stable Diffusion (images), TTS (narration), and an LLM (story generation) into animated videos from prompts. It originally uses cloud APIs, but you can modify the code to use your local LLM and TTS (e.g., replace the OpenAI calls with local inference). Install via GitHub clone + pip deps; runs offline on Linux with a GPU for Stable Diffusion. An end-to-end sketch combining Stable Diffusion, TTS output, and MoviePy follows below.
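Putting the pieces together, here is a hedged end-to-end sketch: one Stable Diffusion slide, the narration WAV from the TTS step above, and MoviePy for assembly. The model ID and file names are illustrative, and the MoviePy calls assume the 1.x API (moviepy.editor; 2.x renames set_duration/set_audio to with_duration/with_audio).

```python
# Minimal composition sketch: one AI-generated slide + TTS narration -> MP4.
# Assumes the stabilityai/stable-diffusion-2-1 weights were downloaded beforehand,
# narration.wav exists (see the TTS sketch earlier), and MoviePy 1.x is installed.
import torch
from diffusers import StableDiffusionPipeline
from moviepy.editor import ImageClip, AudioFileClip

# 1) Generate a slide image from a text prompt (GPU strongly recommended).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
sd = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=dtype
).to(device)
sd("A minimalist infographic-style illustration of renewable energy, flat design").images[0].save("slide1.png")

# 2) Pair the slide with the narration audio and export an MP4.
audio = AudioFileClip("narration.wav")
clip = ImageClip("slide1.png").set_duration(audio.duration).set_audio(audio)
clip.write_videofile("presentation.mp4", fps=24)
```

For multi-slide presentations, build one clip per slide and join them with concatenate_videoclips before exporting.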
These tools are Debian-compatible (via apt/pip/conda). Start with lighter ones like Piper + MoviePy for testing; scale to GPU-heavy models like CogVideoX for advanced generation. Ensure NVIDIA drivers/CUDA are set up for GPU acceleration. If hardware is limited, stick to CPU-friendly options like MeloTTS and basic composition.
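If you are unsure whether the GPU is actually visible to these tools, a quick sanity check from Python (assuming PyTorch is already installed) is:

```python
# Quick check that PyTorch can see the NVIDIA GPU (requires torch to be installed).
import torch

if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; stick to CPU-friendly tools (Piper, MeloTTS, MoviePy).")
```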