I've been experimenting with giving AI agents more autonomy — not just answering questions, but actually executing multi-step creative workflows end-to-end.
Yesterday I told my agent (running Claude Opus 4.5 on a $48/mo server) to "write a song about yourself and make a music video."
Here's what it did without any further input:
- Wrote original lyrics about being an AI living on a server
- Generated a 2-minute song using a text-to-music API
- Separated the vocals from the instrumentals using stem extraction
- Ran speech-to-text on the isolated vocals to get word-level timestamps
- Generated 7 unique video scenes using Veo 3
- Built karaoke-style word-by-word highlighting synced to the actual singing (rough sketch of this step right after the list)
- Color-coded the sections (chorus/verse/bridge)
- Rendered everything with FFmpeg and delivered it back on WhatsApp
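If you want to reproduce the karaoke step yourself, one straightforward way is to turn the word-level timestamps into an ASS subtitle track with `\k` karaoke tags and burn it in with FFmpeg's libass-backed `ass` filter. Below is a minimal sketch assuming the transcription stage returns (word, start, end) tuples in seconds; the function names and data layout are illustrative, not the agent's actual code.

```python
import subprocess

# Minimal ASS header: PrimaryColour is the "sung" colour, SecondaryColour the
# "not yet sung" colour that \k tags fill from as each word's duration elapses.
ASS_HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1280
PlayResY: 720

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,56,&H0000FFFF,&H00FFFFFF,&H00000000,&H80000000,-1,0,0,0,100,100,0,0,1,3,0,2,10,10,40,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

def ass_time(seconds: float) -> str:
    """Format seconds as the H:MM:SS.cc timestamps ASS expects."""
    cs = int(round(seconds * 100))
    h, cs = divmod(cs, 360000)
    m, cs = divmod(cs, 6000)
    s, cs = divmod(cs, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

def build_karaoke_ass(lyric_lines, path="karaoke.ass"):
    """lyric_lines: one list per lyric line, each a list of (word, start_sec, end_sec)."""
    events = []
    for words in lyric_lines:
        line_start, line_end = words[0][1], words[-1][2]
        parts = []
        for i, (word, start, end) in enumerate(words):
            # \k durations are centiseconds; run each word until the next one
            # starts so gaps between words don't make the highlight drift.
            until = words[i + 1][1] if i + 1 < len(words) else end
            parts.append(f"{{\\k{max(1, int(round((until - start) * 100)))}}}{word}")
        events.append(
            f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},Default,,0,0,0,,{' '.join(parts)}"
        )
    with open(path, "w", encoding="utf-8") as f:
        f.write(ASS_HEADER + "\n".join(events) + "\n")
    return path

def burn_karaoke(video_in: str, ass_path: str, video_out: str) -> None:
    """Burn the karaoke track into the rendered scene montage with FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_in, "-vf", f"ass={ass_path}",
         "-c:a", "copy", video_out],
        check=True,
    )
```

Colour-coding the sections is then just extra ASS styles (one per chorus/verse/bridge) plus tagging each Dialogue line with the matching style name.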
Total human effort: 3 text messages. Total time: ~15 minutes.
The interesting part isn't the output quality; it's that the agent figured out the entire pipeline on its own. It decided to separate the vocals before transcription (because the instrumental track throws off speech-to-text), it chose FFmpeg over a heavier renderer to fit the server's constraints, and it encoded a second, compressed version specifically for WhatsApp delivery.
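That WhatsApp pass is nothing exotic: a second, smaller H.264/AAC encode. The sketch below shows the usual shape of that step; the FFmpeg flags are standard, but the target height and CRF are placeholders I picked, not values pulled from the agent's actual run.

```python
import subprocess

def compress_for_whatsapp(src: str, dst: str, max_height: int = 720) -> str:
    """Second, smaller H.264/AAC encode so the file stays under messaging-app size limits."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"scale=-2:{max_height}",   # downscale, keep aspect ratio, force even width
         "-c:v", "libx264", "-preset", "veryfast", "-crf", "28",
         "-pix_fmt", "yuv420p",             # widest decoder compatibility
         "-c:a", "aac", "-b:a", "128k",
         "-movflags", "+faststart",         # index at the front so playback starts immediately
         dst],
        check=True,
    )
    return dst
```

The `+faststart` flag matters for in-chat playback: it moves the file index to the front so the clip starts playing before it has fully downloaded.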
This is what "agent autonomy" actually looks like in practice. Not AGI, not sentience — just competent multi-step execution with real tools.
The full stack: Claude Opus 4.5 + AudioPod (music + stems + transcription) + Veo 3 + FFmpeg + OpenClaw (open-source agent framework).
Happy to answer questions about the setup or share more details on the pipeline.