I told my AI agent to write a song about itself and make a music video. It delivered a full karaoke video autonomously.

February 1, 2026

According to /u/Alternative-Theme885, an AI agent autonomously wrote a song about itself and produced a complete music video in roughly 15 minutes. It handled the whole job: writing lyrics about being a server-based AI, generating the music, separating the vocals, and assembling a karaoke-style video. The notable part is that it worked out the workflow on its own, deciding to split the vocals before transcription and choosing lightweight tools like FFmpeg to stay efficient. This isn't a story about AI becoming sentient; it's about capable, multi-step automation. The stack combined Claude Opus 4.5, Veo 3, and some open-source helpers, orchestrated end to end with minimal human effort, and the poster argues this kind of multi-step execution is exactly what 'agent autonomy' looks like today.

I've been experimenting with giving AI agents more autonomy — not just answering questions, but actually executing multi-step creative workflows end-to-end.

Yesterday I told my agent (running Claude Opus 4.5 on a $48/mo server) to "write a song about yourself and make a music video."

Here's what it did without any further input:

  1. Wrote original lyrics about being an AI living on a server
  2. Generated a 2-minute song using a text-to-music API
  3. Separated the vocals from the instrumentals using stem extraction
  4. Ran speech-to-text on the isolated vocals to get word-level timestamps
  5. Generated 7 unique video scenes using Veo 3
  6. Built karaoke-style word-by-word highlighting synced to the actual singing
  7. Color-coded the sections (chorus/verse/bridge)
  8. Rendered everything with FFmpeg and delivered it back on WhatsApp

Total human effort: 3 text messages. Total time: ~15 minutes.
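
A detail worth spelling out: the bridge from step 4 to steps 6 and 7 is mostly subtitle math. I can't show the agent's exact code, but the usual way to get word-by-word karaoke highlighting is to convert each word's duration into an ASS \k tag (measured in centiseconds), which libass then sweeps from the style's SecondaryColour to its PrimaryColour as the word is "sung". Here's a minimal sketch assuming the transcription step returns word/start/end fields; the field names and everything else below are my reconstruction, not the agent's actual code:

    def to_ass_time(seconds: float) -> str:
        """Format seconds as ASS H:MM:SS.cc (centisecond precision)."""
        cs = round(seconds * 100)
        h, rem = divmod(cs, 360000)
        m, rem = divmod(rem, 6000)
        s, cs = divmod(rem, 100)
        return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

    def karaoke_line(words: list[dict], style: str = "Verse") -> str:
        """Build one ASS Dialogue line with per-word \\k highlighting.

        Each \\k duration is in centiseconds; the named style controls
        the colours, which is also how chorus/verse/bridge sections can
        be colour-coded (one style per section type).
        """
        start, end = words[0]["start"], words[-1]["end"]
        parts = []
        for w in words:
            dur_cs = round((w["end"] - w["start"]) * 100)
            parts.append(f"{{\\k{dur_cs}}}{w['word']}")
        text = " ".join(parts)
        return (
            f"Dialogue: 0,{to_ass_time(start)},{to_ass_time(end)},"
            f"{style},,0,0,0,,{text}"
        )

    # One lyric line with three timed words:
    print(karaoke_line([
        {"word": "I", "start": 12.00, "end": 12.30},
        {"word": "live", "start": 12.30, "end": 12.75},
        {"word": "on", "start": 12.75, "end": 13.00},
    ]))

Lines like these go into the [Events] section of a .ass file (with a normal [Script Info] / [V4+ Styles] header defining the Verse/Chorus/Bridge styles), and FFmpeg can burn the whole thing into the video.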

The interesting part isn't the output quality — it's that the agent figured out the entire pipeline itself. It decided to separate vocals before transcription (because raw music confuses speech-to-text). It chose FFmpeg over a heavier renderer because of server constraints. It compressed a second version for WhatsApp delivery.
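
I don't know the exact FFmpeg invocations it settled on, but the render and WhatsApp-compression steps it describes map onto something like the sketch below, with karaoke.ass being the subtitle file built from the word timestamps (see the earlier sketch). Filenames, CRF values, and the 720p target are illustrative choices of mine, not what the agent actually used:

    import subprocess

    # 1) Burn the karaoke subtitles into the concatenated scene video
    #    and mux in the generated song (needs ffmpeg built with libass).
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "scenes.mp4",           # Veo 3 scenes, already concatenated
        "-i", "song.mp3",             # generated track
        "-vf", "ass=karaoke.ass",     # subtitle burn-in via libass
        "-map", "0:v", "-map", "1:a",
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-shortest",
        "music_video.mp4",
    ], check=True)

    # 2) Smaller second encode for WhatsApp delivery: 720p, higher CRF,
    #    lower audio bitrate.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "music_video.mp4",
        "-vf", "scale=-2:720",
        "-c:v", "libx264", "-crf", "28", "-preset", "veryfast",
        "-c:a", "aac", "-b:a", "96k",
        "music_video_whatsapp.mp4",
    ], check=True)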

This is what "agent autonomy" actually looks like in practice. Not AGI, not sentience — just competent multi-step execution with real tools.

The full stack: Claude Opus 4.5 + AudioPod (music + stems + transcription) + Veo 3 + FFmpeg + OpenClaw (open-source agent framework).
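
The post credits AudioPod with the music, stem separation, and transcription. If you want to reproduce just the stems-plus-word-timestamps part with open-source pieces, Demucs and openai-whisper are common stand-ins; this is my substitution, not what the agent ran, and the output path depends on the Demucs model version:

    import subprocess
    import whisper  # pip install demucs openai-whisper

    # Split the vocals from the instrumental (Demucs two-stem mode).
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", "separated", "song.mp3"],
        check=True,
    )

    # Transcribe the isolated vocals with word-level timestamps.
    model = whisper.load_model("small")
    result = model.transcribe(
        "separated/htdemucs/song/vocals.wav",  # default htdemucs output path
        word_timestamps=True,
    )
    words = [w for seg in result["segments"] for w in seg["words"]]
    print(words[:3])  # each entry has "word", "start", "end" keys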

Happy to answer questions about the setup or share more details on the pipeline.

submitted by /u/Alternative-Theme885