Mozer compression is a lossy speech coding method invented by Forrest S. Mozer. It compresses recorded speech tightly while keeping it simple enough to play back on the cheap integrated circuits and 8-bit microprocessors of the late 1970s. Producing understandable speech at 1 to 4 kbps on affordable hardware was a notable achievement at the time, and Mozer's patented technology was in demand. He first licensed it to TeleSensory Systems for use in calculators for the blind and then to National Semiconductor for use in the DigiTalker chip. In 1984, he co-founded Electronic Speech Systems (ESS), and the codec ended up playing speech in Commodore 64 games such as Impossible Mission and Ghostbusters.
The patent describes the basic technique. It exploits the frequency response of the ear, the ear's relative insensitivity to phase, and the short-term periodicity of voiced sounds. The steps are:
- Pre-emphasis filter if needed.
- Determine pitch periods and split the speech into short-time periodic segments. Store the length, period, and possibly volume of these segments.
- Optionally combine identical speech fragments from different times and replace them with pointers to the previous fragment.
- Mozer phase adjustment on individual periods, which consists of a Fourier transform X(f) = FFT(x(t)), a step of replacing all X(f) with their absolute value |X(f)| to discard phase, and an inverse Fourier transform. This concentrates energy in the center of the period, and it makes the pitch period even (symmetric) so that only half need be coded.
- Silence the half of the period with less energy so that only half of the half need be coded.
- Quantize differences between samples to the values [-3, -1, 0, 1, 3] and encode.
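The phase-adjustment and quantization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's or any chip's implementation: the function names `zero_phase` and `quantize_deltas`, the naive DFT, the rotation that centers the peak, and the greedy choice of delta are all assumptions made for the example.

```python
import cmath
import math

DELTAS = (-3, -1, 0, 1, 3)  # the five step values listed above


def dft(x):
    """Naive discrete Fourier transform; fine for one short pitch period."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]


def zero_phase(period):
    """Discard phase: keep |X(f)|, inverse-transform, and center the peak."""
    n = len(period)
    mags = [abs(c) for c in dft(period)]
    # The magnitude spectrum of a real signal is real and symmetric, so its
    # inverse transform reduces to a cosine sum and is real and even.
    even = [sum(mags[f] * math.cos(2 * math.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]
    half = n // 2
    return even[half:] + even[:half]  # rotate so the peak sits at index n // 2


def quantize_deltas(samples, start=0):
    """Greedy DPCM (assumed strategy): pick the step that lands closest."""
    recon, steps = start, []
    for s in samples:
        step = min(DELTAS, key=lambda d: abs(s - (recon + d)))
        steps.append(step)
        recon += step
    return steps


# One 48-sample period with two harmonics at arbitrary phases.
period = [math.sin(2 * math.pi * t / 48 + 1.0)
          + 0.5 * math.sin(2 * math.pi * 3 * t / 48 + 2.0) for t in range(48)]
adjusted = zero_phase(period)
```

Because `adjusted` is symmetric about index `n // 2`, a coder would only need to keep one half of it and mirror that half on playback.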
Note: Doing some of these steps with a neural net is still patented as of 2011, but my initial experiments show that a neural net may not be necessary for reasonable quality.
So far, gains come from repeating a pitch period (estimated factor of 3), phase adjustment allowing mirroring (factor of 2), and silencing half the period (factor of 2). The remaining samples are encoded with DPCM at only two bits per sample (factor of 4 vs. 8-bit LPCM). Together this gives a reduction of 3 × 2 × 2 × 4 = 48 relative to 8-bit LPCM, or a factor of 24 relative to 4-bit ADPCM, not counting the storage for the periods and the repeat counts.
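The arithmetic behind these factors can be checked with a short script. The factor values come from the paragraph above; the 8 kHz sample rate is a hypothetical value, not taken from the source, and is used only to show the order of magnitude of the resulting bit rate.

```python
import math

# Estimated savings from each step, as listed in the text.
factors = {
    "repeat pitch periods": 3,
    "mirror symmetric period": 2,
    "silence half the period": 2,
    "2-bit DPCM vs. 8-bit LPCM": 4,
}

vs_lpcm8 = math.prod(factors.values())   # reduction relative to 8-bit LPCM
vs_adpcm4 = vs_lpcm8 * 4 // 8            # 4-bit ADPCM already halves 8-bit LPCM
bits_per_sample = 8 / vs_lpcm8           # effective bits per output sample

SAMPLE_RATE_HZ = 8000                    # assumption: hypothetical rate, not from the source
bits_per_second = SAMPLE_RATE_HZ * bits_per_sample

print(vs_lpcm8, vs_adpcm4, round(bits_per_second))
```

Under these assumptions the script reports factors of 48 and 24 and roughly 1.3 kbps, before the overhead for period lengths and repeat counts is added.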
A decoder for the S14001A and DigiTalker formats is implemented in the emulator MAME, and its source code reveals a few of the low-level design decisions that went into the codec. The older bitstream format, used in the S14001A chip, selects its deltas from either [-3, -1, 0, 1] or [-1, 0, 1, 3] depending on the delta index of the previous sample, as a measure against slope overload. This is what allows five delta values to be encoded in two bits. The length of a period is fixed at 48 samples (12 stored), and all changes in pitch are made by raising or lowering the playback rate. I disagree with some of these low-level design decisions and intend to depart from the DigiTalker format in my own codec based on the same principles.
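The two-table idea can be illustrated with a small decoder sketch. It is not MAME's code and does not reproduce the real bitstream: the names `FALLING`, `RISING`, and `decode_deltas`, the initial state, and in particular the rule that a previous code of 2 or 3 selects the upward-biased table are assumptions chosen to show how four codes can reach five delta values.

```python
FALLING = (-3, -1, 0, 1)  # table biased toward downward steps
RISING = (-1, 0, 1, 3)    # table biased toward upward steps


def decode_deltas(codes, start=0):
    """Decode 2-bit codes into samples, switching tables on the previous code.

    Assumed (hypothetical) rule: a previous code of 2 or 3 selects RISING,
    otherwise FALLING. Decoding starts as if the previous code were 0.
    """
    sample, prev, out = start, 0, []
    for code in codes:
        table = RISING if prev >= 2 else FALLING
        sample += table[code]
        out.append(sample)
        prev = code
    return out


samples = decode_deltas([3, 3, 0, 0])
```

Here the second code 3 produces a step of +3 because the previous code already pointed upward, while the final code 0 produces -3 once the signal has turned downward, which is the kind of behavior that helps against slope overload.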
Several generations of products were based on Mozer's technology:
| Year | Product |
|---|---|
| 1976 | S14001A (TeleSensory Inc) |
| 1979 | DigiTalker (National Semiconductor) |
| 1982 | MX? (ESS): this and later versions were decoded in software on 8-bit CPUs and microcontrollers |
| 1983 | MX (Sensory Inc): almost identical to DigiTalker, except with some bit endianness issues cleaned up |
| 1985 or so | CX, PX, and other later technologies |
At the time, Mozer himself did all the compression and mastering of audio for the chips at roughly $400 per word.