New Apple research could unlock fast-talking Siri


Hopes for a more accurate and functional Siri voice assistant currently rest heavily on the near-term solution: Apple’s recently announced partnership with Google to use the latter’s Gemini technology to improve its own AI offerings. But in the longer term, a new research paper suggests a method that could allow Apple to make Siri faster on its own.
The paper, on principled coarse-graining for speculative speech decoding, was written by five researchers from Apple and Tel Aviv University and published late last month (via 9to5Mac). It describes a method that could, in the researchers’ words, “accelerate the generation of speech tokens while maintaining speech quality.”
The key to speed, the researchers say, is to avoid unnecessary rigor. “For speech LLMs that generate acoustic tokens,” they write, “exact token matching is too restrictive: many discrete tokens are acoustically or semantically interchangeable, which reduces acceptance rates and limits speedups.” In other words, beyond a certain level of similarity it does not matter which of two candidate speech tokens is selected, since they sound or mean essentially the same thing; insisting on determining the single “correct” one wastes time and processing resources.
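The acceptance relaxation can be illustrated with a toy example. In standard speculative decoding, a cheap draft model proposes tokens and the full target model verifies them; a draft token survives only if it matches exactly. The relaxed check accepts any token from the same similarity group. The vocabulary, groups, and token sequences below are invented purely for illustration:

```python
def accept_exact(draft_token, target_token):
    # Standard speculative decoding: the draft token is kept only if it is
    # exactly the token the target model verifies.
    return draft_token == target_token

def accept_group(draft_token, target_token, groups):
    # Relaxed, group-level check: the draft token is kept whenever it falls
    # in the same acoustic similarity group as the target's token, on the
    # grounds that tokens in a group are effectively interchangeable.
    return groups[draft_token] == groups[target_token]

# Hypothetical 8-token vocabulary partitioned into 3 similarity groups.
groups = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2}

draft  = [0, 3, 5, 7, 2]   # tokens proposed by the draft model
target = [1, 3, 6, 7, 4]   # tokens the target model would produce

exact_hits = sum(accept_exact(d, t) for d, t in zip(draft, target))
group_hits = sum(accept_group(d, t, groups) for d, t in zip(draft, target))
print(exact_hits, group_hits)  # prints "2 5"
```

Higher acceptance means fewer round trips to the slow target model, which is where the speedup comes from.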
The proposed solution is to group acoustically similar tokens together.
“We propose the Coarse-Graining Principle (PCG), a framework that replaces exact token matching with group-level verification,” the paper explains. “We construct acoustic similarity groups (ASGs) in the token embedding space of the target model, capturing its internal semantic and acoustic similarity organization. PCG performs speculative sampling on the coarse-grained distribution over the ASGs and performs rejection sampling at the group level.”
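To make the idea concrete, here is a rough Python sketch of what group-level speculative sampling can look like. This is an illustration, not Apple’s implementation: the tiny vocabulary, the fixed group assignment, and the function names are all invented (the paper constructs ASGs in the target model’s token embedding space, which is omitted here).

```python
import random

def coarse(dist, groups, n_groups):
    # Collapse a per-token probability distribution into a per-group one.
    g = [0.0] * n_groups
    for tok, p in enumerate(dist):
        g[groups[tok]] += p
    return g

def pcg_step(draft_dist, target_dist, groups, n_groups, rng):
    """One group-level speculative-sampling step (illustrative sketch only).

    Sample a token from the draft model, then accept or reject its *group*
    with the usual speculative-decoding rule min(1, q_G / p_G) applied to
    the coarse-grained distributions, rather than to individual tokens.
    """
    p_groups = coarse(draft_dist, groups, n_groups)
    q_groups = coarse(target_dist, groups, n_groups)

    draft_tok = rng.choices(range(len(draft_dist)), weights=draft_dist)[0]
    g = groups[draft_tok]
    if rng.random() < min(1.0, q_groups[g] / p_groups[g]):
        # Group accepted: keep the cheap draft token, treating all tokens
        # in the group as acoustically interchangeable.
        return draft_tok

    # Group rejected: resample a group from the target's residual
    # distribution, then pick a concrete token from that group under
    # the target model.
    residual = [max(q - p, 0.0) for q, p in zip(q_groups, p_groups)]
    if sum(residual) == 0.0:
        residual = q_groups
    new_g = rng.choices(range(n_groups), weights=residual)[0]
    members = [t for t in range(len(target_dist)) if groups[t] == new_g]
    return rng.choices(members, weights=[target_dist[t] for t in members])[0]

# Hypothetical 4-token vocabulary split into 2 acoustic similarity groups.
rng = random.Random(0)
groups = {0: 0, 1: 0, 2: 1, 3: 1}
tok = pcg_step([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], groups, 2, rng)
```

Because acceptance is decided per group rather than per token, more draft proposals survive verification, which is the mechanism behind the claimed speedup.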
The researchers say this increases speed without significantly reducing reliability. In their experiments (see page 4 of the paper), raising the number of tokens generated per second costs a little accuracy, but far less than with standard speculative decoding.
The paper is fairly technical, but it isn’t very long; check out the PDF to read the whole thing.