Austin Dome is the founder and technical mind behind Polyglot Music. We sat down with him to talk about the broader state of music AI: where the real breakthroughs are, where the hype outpaces reality, and what the next five years look like.
This conversation has been edited for length and clarity.
On Neural Audio Codecs
The Codec: You spend a lot of time thinking about neural audio codecs. For readers who aren't deep in the research, what are they and why do they matter?
Austin Dome: At the simplest level, a neural audio codec is a model that compresses audio into a compact representation and then reconstructs it. Think of it like JPEG for sound, but instead of hand-designed compression, a neural network learns what to keep and what to discard.
They matter because almost every interesting AI audio system being built right now operates on top of a codec. Generation models, voice models, music models: they don't work with raw audio directly, because a raw waveform, at tens of thousands of samples per second, is too long and too unstructured a sequence to model. They work in the compressed space that a codec provides. So the codec is the foundation layer. The quality of the codec determines the ceiling for everything built on top of it.
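[Ed. note: for readers who want to try this, here is a minimal sketch using Meta's open-source encodec package. The file path is a placeholder, and the shapes in the comments assume the package's published 24 kHz model.]

```python
# Compress audio into discrete codes and reconstruct it with EnCodec
# (pip install encodec). A sketch, not production code.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # at 6 kbps the model uses 8 codebooks per frame

wav, sr = torchaudio.load("song.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                     # list of (codes, scale) pairs
codes = torch.cat([c for c, _ in frames], dim=-1)  # [batch, 8, n_frames]

# 24,000 raw samples per second go in; 75 code frames per second come out.
# Generation models operate on `codes`, not on the waveform itself.
with torch.no_grad():
    reconstruction = model.decode(frames)          # back to a waveform
```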
The Codec: And there's a real competition happening between different codec approaches right now.
Austin Dome: Absolutely. You have the standard approach (EnCodec from Meta, DAC from Descript, SoundStream from Google), which compresses everything into one stream. Great reconstruction quality, but everything is tangled together. Then you have factorized approaches that try to separate the audio into distinct components: voice identity, content, pitch, that kind of thing. The tradeoff is quality versus controllability. Neither approach has won yet. Whoever figures out how to get both simultaneously will have a serious advantage.
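[Ed. note: to make the "one stream" idea concrete, here is a toy sketch of residual vector quantization, the mechanism EnCodec, DAC, and SoundStream share: each codebook quantizes whatever the previous stages left behind. All sizes below are illustrative, not taken from any particular model.]

```python
import torch

def rvq_encode(latents: torch.Tensor, codebooks: list) -> torch.Tensor:
    """Toy residual vector quantization.

    latents:   [T, D] encoder output, one D-dim vector per audio frame.
    codebooks: list of [K, D] tables; each stage quantizes the residual
               the earlier stages couldn't capture.
    Returns:   [n_codebooks, T] integer codes, the single token stream.
    """
    residual = latents
    codes = []
    for table in codebooks:
        distances = torch.cdist(residual, table)  # [T, K] pairwise distances
        idx = distances.argmin(dim=-1)            # nearest codebook entry per frame
        codes.append(idx)
        residual = residual - table[idx]          # hand the leftover to the next stage
    return torch.stack(codes)

# Illustrative sizes: 75 frames, 128-dim latents, 8 codebooks of 1024 entries.
latents = torch.randn(75, 128)
codebooks = [torch.randn(1024, 128) for _ in range(8)]
print(rvq_encode(latents, codebooks).shape)  # torch.Size([8, 75])
```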
On the State of Music Generation
The Codec: Text-to-music generation has gotten a lot of attention, particularly from Suno, Udio, and others. Where do you think that stands, honestly?
Austin Dome: The quality jump in the past 18 months is undeniable. The first time you hear output from a modern generation model, it's genuinely impressive. Coherent song structures, recognizable genres, vocals that don't sound robotic. That's real progress.
But I think the honest assessment is that we're at the "impressive demo" stage, not the "replacing human musicians" stage. Generated music lacks intentionality. It sounds plausible without being memorable. And the controllability isn't there yet. You can describe what you want in text, but you can't precisely shape the output the way a producer can. That gap matters.
The Codec: So it's useful but overhyped?
Austin Dome: I'd say the technology is real, but the commercial narrative is ahead of it. It's great for content creators who need background music, for prototyping ideas, for certain production workflows. But the idea that it's going to replace professional music creation in the near term? That's hype. The music industry is a taste-driven business. People connect with human artistry, human stories, human performances. AI is a tool that amplifies human creativity. It doesn't replace it.
"People connect with human artistry, human stories, human performances. AI is a tool, not a replacement."
On Voice Cloning and Singing Synthesis
The Codec: Voice cloning has gotten remarkably good for speech. How far behind is singing?
Austin Dome: Speech voice cloning is basically solved in controlled conditions. Three seconds of reference audio, and you get output that most people can't distinguish from the real person. That's an incredible achievement.
Singing is fundamentally harder. The pitch range is much wider, spanning two to three octaves instead of one. The dynamic variation is extreme. And you have all these specialized techniques (vibrato, belting, falsetto, vocal runs) that have no equivalent in speech. A voice model trained on speech doesn't know what a singer sounds like at the top of their range, or how they transition between registers. It's a different problem.
The gap is closing, but anyone who tells you singing voice AI is at the same level as speech is either misinformed or selling something.
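[Ed. note: a rough sense of the scale gap, with our illustrative numbers rather than Dome's: each octave doubles frequency, so a three-octave singing range covers an 8x frequency span, against roughly one octave for conversational speech.]

```python
import math

def octave_span(f_low_hz: float, f_high_hz: float) -> float:
    # An octave doubles frequency, so the span in octaves is log2 of the ratio.
    return math.log2(f_high_hz / f_low_hz)

# Illustrative figures, not measurements from any specific singer or dataset.
print(octave_span(110.0, 880.0))  # A2 to A5, a wide singing range: 3.0 octaves
print(octave_span(90.0, 200.0))   # typical conversational pitch: ~1.15 octaves
```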
The Codec: What's the timeline for singing to catch up?
Austin Dome: For constrained use cases (specific genres, controlled conditions), production quality is probably 12 to 18 months away. For the general case (any song, any style, any voice), it's further out, maybe two to three years. The tail of edge cases in music is very long. A pop ballad, a death metal track, and a jazz vocal improvisation are essentially different problems.
On the Real Challenges Ahead
The Codec: If you had to name the biggest unsolved problems in music AI right now, what would they be?
Austin Dome: Three things.
First, long-form coherence. Current models do well for 30 to 90 seconds. But a real song is three to five minutes, with structure: verse, chorus, bridge, builds, dynamics. Maintaining musical coherence and intentional structure over that duration is a hard problem that nobody's cleanly solved.
Second, precise controllability. Right now, most systems take a text description and give you something back. You can't say "make the vocal breathier in the second verse" or "move the chord change two beats earlier." That kind of precise, production-level control is what professionals actually need, and it's still mostly out of reach.
Third, and I think this is the most underrated one, singing voice quality. Speech synthesis is basically solved. Singing is nowhere close. The pitch precision requirements are an order of magnitude tighter, the dynamic range is extreme, and you have all these specialized vocal techniques like vibrato, belting, and falsetto that speech models have never encountered. Closing that gap is going to take dedicated architectures, not just scaling up what works for speech.
Making music accessible across languages remains one of the industry's biggest unsolved challenges.
On the Business Landscape
The Codec: A lot of money has gone into music AI companies. What do you think investors are getting right and wrong?
Austin Dome: What they're getting right is that this is a real category. The technology works, the applications are genuine, and the music industry is hungry for tools that help them reach global audiences more efficiently. That thesis is correct.
What I think some investors are getting wrong is overvaluing pure generation and undervaluing infrastructure. Generation models are impressive but they're commoditizing fast because the underlying research is open, the architectures are converging, and differentiation is hard. The more durable businesses are probably the ones building infrastructure that the industry depends on regardless of which generation model wins. Rights management, distribution technology, production workflows, audio processing tools. Those have switching costs. A generation API doesn't.
The Codec: That sounds like you're describing your own business.
Austin Dome: [laughs] I'm describing a category, not a company. But yes, I think infrastructure plays are undervalued broadly across music AI. The picks-and-shovels thesis applies here.
On Where Things Are Headed
The Codec: Fast forward five years. What does the music industry look like with mature AI?
Austin Dome: I think AI becomes an invisible part of every production workflow. Not replacing producers, but handling the tedious parts (rough mixing, reference matching, stem cleanup, pitch correction) so humans can focus on the creative decisions that actually matter. The best producers will be more productive, not less relevant.
And I think we'll see entirely new music formats. Interactive music that adapts to the listener. Personalized arrangements. Songs that evolve. We're still thinking about music as a fixed recording because that's all the technology has allowed. When the technology allows more, the creative possibilities expand enormously.
The Codec: Any predictions you'd want to be held to?
Austin Dome: By 2028, AI-powered tools will be involved in the production of more than half of commercially released music. Not AI-generated, but AI-assisted. Stem separation, mastering, lyrics tools, arrangement suggestions. The technology will be so embedded in the workflow that people stop thinking of it as "AI music" and just think of it as "music." That's the real inflection point.