I spent an entire Saturday training a clone of my own voice.
Nearly five hours talking into a microphone, fifteen stories lost to bad audio, my voice gone hoarse by the end, and a bald eagle getting chased around outside my window most of the day.
By the time I stopped, ElevenLabs had what it needed, and I had a clone that sounded like me.
So I did the obvious next thing.
I dropped my voice onto a blog post, generated the narration, and hit play, fully expecting it to come out as smoothly as the stock AI voice I'd been using all along.
It was awful.
The AI generation sounded WAY better than I did
Here's the part I didn't expect - the generic AI voice I'd been using on the site sounded great with almost no effort.
I picked it, dropped in the text, and it read like a professional narrator. Done.
But my own voice, the one I'd given up a whole Saturday to capture, came out flat, rushed, and off-putting.
It tripped over words.
It ran sentences together with no breath.
It hit some lines like it was shouting and it mumbled others.
It was soooo discouraging.
It felt like I had wasted an entire day for nothing. I couldn't figure out why it was such a FAIL.
It made no sense to me at first. How could a stock AI voice work so easily when my own voice came out so bad?
I asked Claude Code, and it said this was a known limitation of ElevenLabs with long-form audio.
I know Claude lies, and I don't know why I accepted that as fact. I wasted hours as a result.
Here's the answer. If you're about to clone your own voice, this is the thing I wish I knew up front.
Here's the deal
A stock AI voice is tuned end to end by the company that built it. The pacing, the pauses, the rise and fall of a sentence, all of that's baked in at the factory.
You're not just borrowing the sound. You're borrowing years of work on how to actually speak.
A voice clone only copies one thing: what you sound like.
The timbre, the character, the grain of your voice. It doesn't copy the skill of speaking.
Underneath, it's still running on a base engine that, left alone, defaults to short, clipped, rushed delivery.
So you get your exact voice doing a bad job of reading.
It's uncanny in the worst way.
It sounds like you, so every wrong pause and rushed line feels personal, like anyone listening might think that's actually how you read, or how you sound.
Cloning my voice was only step one. Now the real work begins...
I'm black and white, and this was neither
When I'm working with AI on something tangible, I'm very black and white.
Building an app, creating a system, setting up a flow. I know the outcome I want, and I know how to work with AI to get there.
There's a finish line and I can see it the whole way.
Creative work doesn't give me that. Images, design, voice, anything more intangible.
I know this about myself, so I took an AI video creation course just so I could learn how to flex that muscle. I learned how to do character consistency and then taught it to Claude for these blog posts.
When I cloned my own voice, I thought that was all I had to do, so I didn't research it or try to learn more about it. I figured the voice was cloned and the job was done.
I didn't anticipate the problems before I hit them.
For instance, I had no way of knowing that #1 would read as "hashtag one".
Every time I fixed one thing, another one I'd never have predicted was waiting behind it. It felt like a never-ending list of walls.
I tried to fix it like an engineering problem, one line at a time. And that's where the real time, and the real money, disappeared.
The line-by-line battle
The clone couldn't say "resume." It said re-ZOOM, or re-SHU-may - every single time.
It couldn't say "row" the way you'd say a row of data. It rhymed it with "cow."
"Read" came out "red."
"Close" came out like standing close to someone instead of closing a door.
I spent hours, and a genuinely embarrassing pile of tokens, trying to coax one word into sounding right.
Half a session's worth of usage, on a single word (re-SHU-may).
Looking back it's funny. In the moment it was maddening.
I'd fix the word, regenerate, listen, and it would be wrong a different way.
The answer turned out to be the opposite of what my instinct told me.
We were stuck on one paragraph that kept butchering the same two words, take after take, and I'd poured an insane amount of time into it.
Finally I said, "I don't understand how you can get the whole page right and then fight me on this one paragraph an hour later. Throw the entire thing out and start over."
Claude did - and it came back perfect.
That was a real lightbulb moment.
Deleting a bad take and regenerating it from scratch worked far more often than nudging it ever did.
Playing Whack-a-Mole
This is the part I hope to help others with - because it cost me the most and it was the hardest to even see.
I would fix one section of a post. I'd listen, it sounded good, and I'd think I was finished.
Then I'd do one last sweep, play it again top to bottom, and other sections that had been fine were now broken.
Areas I hadn't even touched.
It felt like the work was actively fighting me. Like every step forward knocked something else over.
There were two things happening.
The first was that some of my fixes were global without me realizing it.
When I corrected a word like "Sonnet" so it would stop saying "Sunnet" in one sentence, the correction applied to every single place that word appeared on the page, not just the one spot I was asking it to fix. Now it was broken everywhere.
So fixing one line quietly re-rolled a dozen others, and some of those came back worse.
I was breaking three things every time I fixed one.
The second was sneakier.
The page wasn't always loading my new audio at all. There's a little version stamp that tells a browser "this file changed, go get the fresh one."
That stamp wasn't updating.
So my browser kept serving a stale mix of old and new pieces, and it sounded exactly like my finished sections had fallen apart, when really they'd never loaded.
I lost most of a day chasing problems that were already fixed.
It was like playing Whack-a-Mole.
And then there were nuances, like putting a period at the end of each bullet in a list so it would stop reading the whole list as one sentence.
Or teaching it that "1M" was "one million", not "onem".
You can imagine how much fun I had with this one:
The question that brought me back
I stepped away from this more than once.
When I added up how long it was taking to put my own voice on a single post against what I was spending in tokens and time, the payoff just wasn't there.
Claude Code kept telling me this was a known issue, over and over, until I started to believe it was simply how this worked - that cloning a voice for long-form would always take this long, and there was nothing to be done about it.
I know Claude lies, and I don't know why I chose to believe it on this issue. Not again 😜
Then I was asked a question.
What is one thing you have tried with AI that you couldn't make work?
My answer came out fast. My cloned voice on long-form narration.
Saying it out loud is what did it.
Giving up on something isn't my personality. If someone else has made this work, there's a way to make it work, and I'll find it.
So I came back to it.
I sat down and asked what the real problem was underneath all of it. I'm not one to choose an easy way out or put band-aids on things.
I like to go to the root.
I don't know why I was so blind to it this time.
I tackled it from there, and that's what changed everything.
Here's how I stopped it from ever happening again
The turning point wasn't a clever trick. It was a decision to stop fixing walls one at a time and start writing down what I was learning as I went.
I had Claude make a list of every change that I made to a post that day. And for each change, I asked:
- What happened?
- What caused it?
- How do we fix it?
And the important one... what do we need to do so this never happens again?
I then had it go back through every post I'd already narrated and run the same four questions across all of it.
Every problem I'd solved before, written down with its cause and its permanent fix.
That became a playbook, a checklist Claude now runs before ever assigning audio to a post. It works down the list and checks each item off first, so the post is set up correctly before the voice ever touches it.
Here's what the playbook protects against.
Sometimes a line reads better out loud than it looks on the page, like a number that should be spoken in full words, or a word the clone keeps mispronouncing. So I attach a spoken-only version to that exact spot. The page still shows the normal text, and only the narration of that one line changes.
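Here is a minimal sketch of the spoken-only idea, not the actual playbook code. The `Line` class, its field names, and the sample sentences are all hypothetical; the only point is that the page text and the narration text are stored separately, and the narration falls back to the page text when no override exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    """One line of a post (hypothetical structure, for illustration only)."""
    text: str                     # what the reader sees on the page
    spoken: Optional[str] = None  # what the voice reads, if it should differ

    def narration_text(self) -> str:
        # Fall back to the visible text when no spoken-only version is attached.
        return self.spoken if self.spoken is not None else self.text


post = [
    Line("My #1 rule: keep it simple."),
    Line("We passed 1M views.", spoken="We passed one million views."),
    Line("Nothing special about this line."),
]

# Attach a spoken-only version to exactly one spot.
post[0].spoken = "My number one rule: keep it simple."

page = [line.text for line in post]                # unchanged display text
script = [line.narration_text() for line in post]  # what goes to the voice
```

The page still shows "#1" and "1M", while the script sent to the voice spells them out. Lines without an override are read exactly as written.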
When a line comes out wrong, I stopped trying to coax it into shape with hint after hint. I throw that one take out and generate it fresh. A new take almost always lands, where fighting the same one rarely does.
Every paragraph is stored on its own, so when I change one, only that one regenerates. The rest are mathematically untouched. They literally can't change, which means "I fixed one thing and broke another" is no longer possible.
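One way to get that guarantee is to key each clip by a hash of its paragraph's exact text. This is a sketch under assumptions: the function names and the `audio/<hash>.mp3` path layout are made up for illustration, and the real setup may store things differently.

```python
from __future__ import annotations

import hashlib


def fingerprint(paragraph: str) -> str:
    """Stable ID derived from the paragraph's exact text."""
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()[:16]


def plan_regeneration(
    paragraphs: list[str], existing: dict[str, str]
) -> tuple[list[str], dict[str, str]]:
    """Return (paragraphs that need new audio, updated fingerprint -> clip map)."""
    to_generate: list[str] = []
    updated: dict[str, str] = {}
    for p in paragraphs:
        key = fingerprint(p)
        if key in existing:
            updated[key] = existing[key]       # same text, so reuse the clip as-is
        else:
            updated[key] = f"audio/{key}.mp3"  # hypothetical path for the new clip
            to_generate.append(p)
    return to_generate, updated


v1 = ["First paragraph.", "Second paragraph.", "Third paragraph."]
_, clips = plan_regeneration(v1, {})

v2 = ["First paragraph.", "Second paragraph, edited.", "Third paragraph."]
todo, clips2 = plan_regeneration(v2, clips)
```

After the edit, `todo` holds only the changed paragraph. The first and third paragraphs hash to the same keys as before, so their clips are never sent back to the voice engine.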
The version stamp now bumps automatically, with a guard that warns me if a file changed but the stamp didn't. No more stale mixes. No more chasing ghosts.
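A common way to make a version stamp bump on its own is to derive it from the file's contents, so a new take always gets a new stamp. The sketch below assumes that approach; the function names, the `?v=` query parameter, and the file paths are illustrative rather than a description of the real site.

```python
from __future__ import annotations

import hashlib


def version_stamp(audio_bytes: bytes) -> str:
    """Stamp derived from the file contents: new audio means a new stamp."""
    return hashlib.sha256(audio_bytes).hexdigest()[:10]


def audio_url(path: str, audio_bytes: bytes) -> str:
    """URL the page would load; the stamp forces browsers to fetch fresh audio."""
    return f"{path}?v={version_stamp(audio_bytes)}"


def stale_stamps(files: dict[str, bytes], published: dict[str, str]) -> list[str]:
    """Guard: paths whose published stamp no longer matches the file contents."""
    return sorted(
        path for path, data in files.items()
        if published.get(path) != version_stamp(data)
    )


files = {"audio/p1.mp3": b"take-1", "audio/p2.mp3": b"take-1"}
published = {path: version_stamp(data) for path, data in files.items()}

files["audio/p2.mp3"] = b"take-2"  # regenerate one clip, forget to republish
stale = stale_stamps(files, published)
```

Here `stale` names `audio/p2.mp3`, the one file that changed without its stamp changing. Running a check like this before publishing catches the case where the browser would keep serving the old take.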
Before I generate anything, I run a scanner over the post. It flags the known traps ahead of time: words the clone mispronounces, numbers it'll stumble on, lines it'll rush. I fix them before the first take instead of discovering them by ear over twenty takes.
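A scanner like this can start out very small. The sketch below is a guess at the shape of one, seeded with the traps mentioned in this post; the word list, the patterns, and the `scan` function are hypothetical and would grow as new walls turn up.

```python
from __future__ import annotations

import re

# Hypothetical trap list, seeded with the examples from this post.
HETERONYMS = {"resume", "row", "read", "close"}
PATTERNS = [
    (re.compile(r"#\d+"), "hash-number, may be read as 'hashtag'"),
    (re.compile(r"\b\d+(?:\.\d+)?[KMB]\b"), "abbreviated number, spell it out"),
]


def scan(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line number, matched text, reason) for each suspected trap."""
    findings = []
    for n, line in enumerate(lines, start=1):
        for pattern, reason in PATTERNS:
            for match in pattern.finditer(line):
                findings.append((n, match.group(), reason))
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() in HETERONYMS:
                findings.append((n, word, "word with more than one pronunciation"))
        # A bullet with no end punctuation tends to run into the next one.
        if line.lstrip().startswith("- ") and line.rstrip()[-1] not in ".?!:":
            findings.append((n, line.strip(), "bullet without end punctuation"))
    return findings


draft = [
    "Rule #1: update your resume.",
    "- What happened",
    "We passed 1M views.",
]
findings = scan(draft)
for line_no, text, reason in findings:
    print(f"line {line_no}: {text!r} ({reason})")
```

Each finding points at a line to review before the first take. The scanner only flags; deciding whether "read" should be "reed" or "red" in a given sentence is still a human call.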
When a word comes out wrong in one spot, I fix only that spot. Applying a pronunciation fix across the whole page re-rolls every line that uses the word, and some of those come back worse than they started. So I keep each fix pinned to the one place it belongs.
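The difference between a global fix and a pinned one is easy to see in a small sketch. Everything here is illustrative: `apply_fix` is a made-up helper, and "Sonnit" is a hypothetical respelling, not necessarily the one that worked.

```python
from __future__ import annotations

from typing import Optional


def apply_fix(
    paragraphs: list[str],
    word: str,
    respelling: str,
    only_index: Optional[int] = None,
) -> tuple[list[str], list[int]]:
    """Return (new spoken text, indexes whose text changed and need a new take)."""
    out: list[str] = []
    changed: list[int] = []
    for i, p in enumerate(paragraphs):
        in_scope = only_index is None or i == only_index
        if in_scope and word in p:
            out.append(p.replace(word, respelling))
            changed.append(i)
        else:
            out.append(p)  # untouched, so its existing audio stays valid
    return out, changed


spoken = [
    "Sonnet is fast.",
    "No mention here.",
    "I use Sonnet daily.",
    "Sonnet again.",
]

# Global fix: every paragraph containing the word gets re-rolled.
_, global_rerolls = apply_fix(spoken, "Sonnet", "Sonnit")

# Pinned fix: only the paragraph that actually sounded wrong.
pinned, pinned_rerolls = apply_fix(spoken, "Sonnet", "Sonnit", only_index=2)
```

The global call re-rolls three paragraphs to fix one; the pinned call re-rolls exactly the one that was wrong and leaves the takes that already sounded fine alone.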
When something sounds too loud or too shrill, I stopped guessing and re-rolling blind. I run a tool that measures each clip's volume, brightness, and pitch and names the one that's off. Then I fix the thing that's actually wrong instead of chasing the wrong problem for an hour.
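To give a feel for what such a tool does, here is a rough sketch that measures loudness and a crude brightness proxy using only the standard library. It is not the tool described above: it assumes clips are already decoded into lists of samples between -1 and 1, it uses zero-crossing rate as a stand-in for brightness, it skips pitch entirely, and the 3 dB tolerance is an arbitrary choice.

```python
from __future__ import annotations

import math
import statistics


def rms_db(samples: list[float]) -> float:
    """Loudness as RMS level in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9))  # floor avoids log of zero on silence


def zero_crossing_rate(samples: list[float]) -> float:
    """Fraction of adjacent sample pairs that change sign (rough brightness proxy)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


def loudness_outlier(clips: dict[str, list[float]], tolerance_db: float = 3.0):
    """Name the clip furthest from the median level, if it exceeds the tolerance."""
    levels = {name: rms_db(samples) for name, samples in clips.items()}
    median = statistics.median(levels.values())
    name = max(levels, key=lambda n: abs(levels[n] - median))
    offset = levels[name] - median
    return (name, round(offset, 1)) if abs(offset) > tolerance_db else None


def tone(amplitude: float, freq: int = 220, rate: int = 8000, seconds: float = 0.25):
    """Synthetic test clip: a sine wave standing in for real narration audio."""
    return [
        amplitude * math.sin(2 * math.pi * freq * t / rate)
        for t in range(int(rate * seconds))
    ]


clips = {"p1": tone(0.2), "p2": tone(0.2), "p3": tone(0.6)}
result = loudness_outlier(clips)  # p3 is roughly 9.5 dB above the others

dull = zero_crossing_rate(tone(0.2, freq=220))
bright = zero_crossing_rate(tone(0.2, freq=880))  # higher rate than `dull`
```

With numbers like these, "too loud" becomes "p3 is about 9.5 dB hot", which points at one clip to regenerate instead of a whole section to re-roll by ear.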
The result is that the last full post I added my cloned voice to did not require even ONE edit.
Instead of eating a small fortune in tokens and spending half a day on corrections, it was right on the first pass.
It was a costly lesson spread over a couple of weeks (on and off), but the payoff is almost instant narration in my own voice moving forward.
And my reminder - much as I love Claude, sometimes it lies 🤦
A quick word on the other tools
When I couldn't get ElevenLabs to work, Claude recommended some other tools, so here's a quick note on those.
I tried Hume, which has a slick feature where you can direct the delivery like an actor. The direction worked, but its free clone only takes thirty to forty-five seconds of training audio, against the three hours ElevenLabs had, and it just didn't sound like me.
I tried Fish Audio too. I couldn't get past the setup, because the whole interface came up in another language.
ElevenLabs stayed. The clone was always good. I just had to learn how to drive it.
What I'd tell you if you're about to do this
Cloning your voice isn't the hard part. Spending the Saturday isn't the hard part.
The hard part is everything after, when you realize a clone gives you your voice but none of the skill of speaking - you have to build that part out yourself.
Write down every wall as you hit it. Fix things locally, never globally.
When a take is bad, throw it out instead of fighting it.
Or better yet - reach out and I'll give you my playbook...you're welcome 😜
If you're working through something like this and it feels like every fix breaks something else, that feeling is real, and it has a cause, and it's fixable. Let's connect.