The question I was asking was simple and, depending on your tolerance for technical optimism, either exciting or absurd: can a single AI-generated prompt produce a complete music video from a poem written 115 years ago?
Not a series of clips that required individual direction. Not a storyboard that a human then had to execute. One prompt, dropped into CapCut, producing a coherent visual essay from beginning to end — from Miniver Cheevy in the candlelit tavern to the rusted sword lying in a modern puddle.
The answer, as of this experiment, is: almost. Which is both less than I wanted and more than I expected.
What Songbird Is
Songbird is a tool I built on top of the Musinique workflow — a directorial AI layer that sits between lyrics and image generation. It takes song text and transforms it into a continuous sequence of visual prompts: not disconnected moments, not a grab-bag of related images, but a directed sequence with the logic of a camera operator who has read the script.
The core principle is continuity. Songbird maintains visual threading across an entire piece — the same character, consistent lighting, a camera path that flows rather than cuts randomly. It is built around three ideas borrowed from film: entry (inheriting motion from the preceding moment), beat (one clear event per unit), and exit (a lead-in that points toward what comes next). The result, when it works, feels less like a slideshow and more like a rough cut.
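To make the entry-beat-exit idea concrete, here is a minimal sketch of how one visual unit could be represented, written in Python for illustration. The class and field names are my own shorthand, not Songbird's actual internals, and the tavern content is hypothetical.

from dataclasses import dataclass

@dataclass
class VisualUnit:
    # One unit of a directed sequence: inherit motion, state one event, point forward.
    entry: str   # motion or state inherited from the previous unit
    beat: str    # the single clear event of this unit
    exit: str    # lead-in that points toward the next unit

def to_prompt(unit: VisualUnit) -> str:
    # Flatten one unit into a prose instruction a video model can read.
    return f"Continuing {unit.entry}, {unit.beat}; end as {unit.exit}."

# Hypothetical content for the opening tavern unit:
tavern = VisualUnit(
    entry="a slow push-in through candlelit smoke",
    beat="a lean man in a worn coat raises a toast to an empty chair",
    exit="the candle gutters, pulling the eye toward a cold stone doorway",
)
print(to_prompt(tavern))

The point of the structure is that no unit stands alone: each one ends by handing something to the next, which is what makes the output read as a rough cut rather than a slideshow.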
For this experiment, I ran Edwin Arlington Robinson’s “Miniver Cheevy” through the song mode — music-video performance logic — combined with the vocal clone rendering I’ve been developing for Musinique’s public domain poetry series. The voice is a deep baritone trained on my own recordings, extended and processed to carry Robinson’s clipped iambic tetrameter. The production: sparse gospel-blues, brushed percussion, slide guitar in the minor key. The poem as a song that sounds like it was always a song.
Then Songbird generated the prompt. One prompt. Into CapCut.
What the Single Prompt Does
This is the part that is genuinely new, and genuinely interesting to think about.
Standard AI video generation requires what amounts to frame-by-frame direction — you generate an image, evaluate it, prompt the next one, maintain continuity manually through reference images and careful reuse of character seeds and settings. It is time-consuming and requires an eye for where coherence breaks. A four-minute video built this way can take many hours of iteration.
The Songbird-to-CapCut pipeline is attempting something different: compress the entire directorial logic into a single compound prompt that CapCut’s AI video tools then execute as a sequence. The prompt carries not just visual descriptions but implied camera movement, tonal continuity, character consistency, and narrative arc. It is, in essence, a storyboard written in natural language.
For “Miniver Cheevy,” the prompt built seven visual units from Robinson’s eight stanzas — the tavern fantasy, the bare modern room with the suit, the rocking horse in the neon rain, the armor at the rain-streaked window, the rusted sword on the sidewalk, the cracked helmet on the desk, the faded movie poster on the brick wall. Each unit specified not just what was in frame but what the frame felt like — the quality of light, the angle that implied the character’s relationship to what he was seeing, the transition logic that connected one image to the next.
CapCut received this as a single input and generated the full sequence.
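For a rough sense of what "a single input" means here, the sketch below joins per-unit directions into one continuous string. This is my reconstruction of the shape of the thing, not the literal Songbird output; the dictionary keys and the two sample units are invented for illustration.

def compile_compound_prompt(units, style):
    # Join per-unit directions into one continuous prompt, carrying style, light,
    # camera movement, and transition logic in a single string.
    parts = [f"Overall style: {style}."]
    for i, u in enumerate(units, start=1):
        parts.append(f"Shot {i}: {u['frame']} Light: {u['light']}. Camera: {u['camera']}.")
        if u.get("transition"):
            parts.append(f"Transition: {u['transition']}.")
    return " ".join(parts)

# Hypothetical first two of the seven units:
units = [
    {"frame": "A lean man in a worn coat toasts an empty chair in a candlelit tavern.",
     "light": "warm candlelight against cold stone",
     "camera": "slow push-in",
     "transition": "match-cut from the candle flame to a bare ceiling bulb"},
    {"frame": "The same man stands in a bare modern room beside a cheap suit on a hanger.",
     "light": "flat fluorescent with residual amber from the hallway",
     "camera": "static wide shot",
     "transition": "pull back through a rain-streaked window to the street"},
]
single_input = compile_compound_prompt(units, "sparse gospel-blues music video, minor key, one continuous character")

Everything the sequence needs to stay coherent has to survive inside that one string, because there is no second pass.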
What Worked
The visual coherence was better than I expected. The character — lean, worn, dressed in contemporary poverty with remnants of medieval fantasy draped over him — remained legible across the sequence without reference-image locking. The tavern established a color temperature (candlelight warm, stone cold) that persisted as a visual motif through the harder images: even the contemporary sidewalk with the rusty sword had a residual warmth in the amber of distant shop windows.
The transition logic held. This surprised me most. The rocking horse in the rain is a pivot in the poem — it is the moment where Miniver’s fantasy is shown as a child’s toy, stationary, chipped, ignored by the city. The prompt specified the transition into this image as a pull-back from the previous close shot, creating the sudden scale of the commercial street behind the small figure of the horse. CapCut executed this as written. The image arrived with the emotional weight the poem requires.
Songbird’s entry-beat-exit logic appears to be doing work. The prompt is not just a list of image descriptions. It is a sequence with momentum, and that momentum is partly what CapCut is reading.
What Didn’t Work
The poem’s most important device — Robinson’s truncated fourth line, the two-beat “And he had reasons” and “And kept on drinking” that deflates each stanza’s rhetoric — does not survive the translation. A visual prompt cannot embed a structural joke. It can depict a cough. It can frame a man staring at an untouched glass. It cannot reproduce the specific timing of a line that arrives half a beat before you expect the stanza to end.
This is not a failure of the tool. It is a limitation of the medium. Poetry that depends on formal compression — the counted syllable, the withheld line — is fundamentally resistant to visual equivalence. You can evoke it. You cannot reproduce it. The visual essay is in conversation with the poem; it is not a translation of it.
There was also character drift in the later images. By the sixth visual unit — the cracked helmet on the desk — the character had lost some definition. The desk and pencils read clearly. The helmet read clearly. The implied human presence (whose desk is this? Robinson’s or Miniver’s?) was ambiguous in a way that was not fully intentional. A second-pass prompt could resolve this. The single-prompt workflow, by design, cannot revise itself.
Why This Matters for Public Domain Poetry
Musinique has been working with public domain texts for almost a year now — setting Poe, Eliot, and others to music through voice synthesis, using the Lyrical Literacy framework to ask what it sounds like when a poem is given back its sonic dimension. The Songbird experiment extends this into a third medium: the visual essay.
What is at stake is not novelty. It is accessibility. Robinson’s poetry is not widely read. “Miniver Cheevy” is in every major American literature anthology, which means it is assigned and summarized and rarely inhabited. The poem is brilliant and people do not know it.
A four-minute video, rendered from a vocal clone in a minor-key blues production, with seven visual units that track the poem’s argument from fantasy to cough — this is a different object than the poem. It does not replace the poem. What it does is create an entry point. A reason to go back to the text. A version of the question that the poem is asking, in a medium that reaches people who are not already reading 1910 American verse.
This is what AI tools should be for: not replacing the original but building the bridge to it.
The Workflow in Practice
For those interested in the technical construction: the process ran in three stages.
First, the vocal clone render. The poem was segmented into singable phrases and processed through the Musinique production chain — baritone voice with minor-key backing, brushed drums, slide guitar. The vocal clone extends my own voice into registers and timbres required by Robinson’s particular cadences. The production treats the poem as a song that was always a song, not a poem that was reluctantly set to music.
Second, Songbird’s song prompt generation. The lyrics were passed into Songbird in performance mode, which generated the seven-unit directorial sequence with full continuity parameters — character, lighting, color temperature, camera movement, transition logic, beat structure. This is the stage most people would handle manually with a storyboard. Songbird collapses it to a single output.
Third, CapCut execution. The Songbird prompt was dropped into CapCut as a single input. No frame-by-frame direction. No mid-sequence correction. One pass.
The total pipeline from poem to video draft: approximately two hours. The majority of that time was in the vocal clone production, not the image generation.
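Treated as a script, the handoff between the three stages looks roughly like the sketch below. Every function name here is hypothetical shorthand for a manual or tool-specific step, not an actual API exposed by Musinique, Songbird, or CapCut; the returns are placeholders.

def render_vocal_track(poem_text: str) -> str:
    # Stage 1: segment into singable phrases and render the vocal clone mix (audio chain).
    return "miniver_cheevy_vocal_mix.wav"

def generate_songbird_prompt(lyrics: str, mode: str = "song") -> str:
    # Stage 2: produce the single compound directorial prompt with full continuity parameters.
    return f"[{mode} mode] seven-unit compound prompt derived from: {lyrics[:40]}..."

def submit_to_capcut(prompt: str, audio_path: str) -> str:
    # Stage 3: one pass through CapCut's AI video tools; no frame-by-frame direction.
    return "miniver_cheevy_draft.mp4"

poem = "Miniver Cheevy, child of scorn, grew lean while he assailed the seasons..."
audio = render_vocal_track(poem)
prompt = generate_songbird_prompt(poem, mode="song")
draft = submit_to_capcut(prompt, audio)

What matters in the shape is that it is three handoffs in a straight line, not an iterative loop — which is both the appeal of the workflow and the source of its inability to correct itself.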
What Comes Next
The single-prompt workflow is still a rough draft of itself. The character drift issue in the later units needs a consistency layer that Songbird does not yet have — likely a reference-image anchor system that can be passed alongside the text prompt to maintain character fidelity across longer sequences.
The formal compression problem — how do you visualize a two-beat punch line — is harder and may not have a clean solution. The visual essay can be about that rhythm. It cannot perform it. This is a productive constraint rather than a failure: it forces the video into genuine collaboration with the poem rather than replacement of it.
The next Songbird experiment will run a longer text — probably a full Robinson poem like “Richard Cory” or one of the Tilbury Town pieces — to test whether the continuity logic holds across more complex narrative arcs. The public domain poetry series has at least a dozen candidates.
For now: Miniver Cheevy, born too late, has a music video. He would have had thoughts about this. He would have told you about the movies they would have made in a more cinematically enlightened era. He would have thought, and thought, and thought, and thought about it.
Robinson just made the thing.
Produced by Musinique LLC. Vocal clone: Nik Bear Brown baritone, processed through the Musinique production chain. Visual prompts: Songbird song mode. Video generation: CapCut AI. Source poem: Edwin Arlington Robinson, “Miniver Cheevy” (1910), public domain.
Tags: Songbird AI video prompt generator, CapCut single-prompt music video, public domain poetry vocal clone, Miniver Cheevy Robinson visual essay, Musinique AI poetry visualization experiment