<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
<title>Ogg Vorbis Documentation</title>
<style type="text/css">
body {
  margin: 0 18px 0 18px;
  padding-bottom: 30px;
  font-family: Verdana, Arial, Helvetica, sans-serif;
  color: #333333;
  font-size: .8em;
}
a {
  color: #3366cc;
}
img {
  border: 0;
}
#xiphlogo {
  margin: 30px 0 16px 0;
}
#content p {
  line-height: 1.4;
}
h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
  font-weight: bold;
  color: #ff9900;
  margin: 1.3em 0 8px 0;
}
h1 {
  font-size: 1.3em;
}
h2 {
  font-size: 1.2em;
}
h3 {
  font-size: 1.1em;
}
li {
  line-height: 1.4;
}
#copyright {
  margin-top: 30px;
  line-height: 1.5em;
  text-align: center;
  font-size: .8em;
  color: #888888;
  clear: both;
}
</style>
</head>
<body>
<div id="xiphlogo">
<a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.Org"/></a>
</div>
<h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>
<h2>Abstract</h2>
<p>The Vorbis audio CODEC provides channel coupling
mechanisms designed to reduce effective bitrate by both eliminating
interchannel redundancy and eliminating stereo image information
labeled inaudible or undesirable according to spatial psychoacoustic
models. This document describes both the mechanical coupling
mechanisms available within the Vorbis specification and the
specific stereo coupling models used by the reference
<tt>libvorbis</tt> codec provided by xiph.org.</p>
<h2>Mechanisms</h2>
<p>In encoder release beta 4 and earlier, Vorbis supported multiple
channel encoding, but the channels were encoded entirely separately
with no cross-analysis or redundancy elimination between channels.
This multichannel strategy is very similar to mp3's <em>dual
stereo</em> mode, and Vorbis uses the same name for its analogous
uncoupled multichannel modes.</p>
<p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
later implement, a coupled channel strategy. Vorbis has two specific
mechanisms that may be used alone or in conjunction to implement
channel coupling. The first is <em>channel interleaving</em> via
residue backend type 2, and the second is <em>square polar
mapping</em>. These two general mechanisms are particularly well
suited to coupling due to the structure of Vorbis encoding, as we'll
explore below. Using both, we can implement totally
<em>lossless stereo image coupling</em> [bit-for-bit decode-identical
to uncoupled modes] as well as various lossy models that seek to
eliminate inaudible or unimportant aspects of the stereo image in
order to reduce bitrate. The exact coupling implementation is
generalized to allow the encoder a great deal of flexibility in
implementation of a stereo or surround model without requiring any
significant complexity increase over the combinatorially simpler
mid/side joint stereo of mp3 and other current audio codecs.</p>
<p>A particular Vorbis bitstream may apply channel coupling directly to
more than a pair of channels; polar mapping is hierarchical such that
polar coupling may be extrapolated to an arbitrary number of channels
and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
surround. However, the scope of this document restricts itself to the
stereo coupling case.</p>
<a name="sqpm"></a>
<h3>Square Polar Mapping</h3>
<h4>maximal correlation</h4>
<p>Recall that the basic structure of a Vorbis I stream begins with a
spectral 'floor' function, generated from the input audio, that serves
as an MDCT-domain whitening filter. This floor is meant to represent the
rough envelope of the frequency spectrum, using whatever metric the
encoder cares to define. This floor is subtracted from the log
frequency spectrum, effectively normalizing the spectrum by frequency.
Each input channel is associated with a unique floor function.</p>
<p>The basic idea behind any stereo coupling is that the left and right
channels usually correlate. This correlation is even stronger if one
first accounts for energy differences in any given frequency band
across left and right; think for example of individual instruments
mixed into different portions of the stereo image, or a stereo
recording with a dominant feature not perfectly in the center. The
floor functions, each specific to a channel, provide the perfect means
of normalizing left and right energies across the spectrum to maximize
correlation before coupling. This feature of the Vorbis format is not
a convenient accident.</p>
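<p>As a minimal sketch of this normalization (not the libvorbis API; the
function name <tt>remove_floor</tt> and the use of plain dB arrays are
assumptions made for illustration), subtracting each channel's own floor
from its log spectrum leaves residues that match even when one channel is
considerably quieter than the other:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: subtract a channel's floor curve from its
   log-magnitude spectrum. Both inputs are assumed to be in dB. */
static void remove_floor(const float *log_spectrum, const float *floor_curve,
                         float *residue, size_t n)
{
    assert(log_spectrum && floor_curve && residue);
    for (size_t i = 0; i < n; i++)
        residue[i] = log_spectrum[i] - floor_curve[i];
}
```

<p>If the right channel carries the same material as the left but 12 dB
quieter, and each floor tracks its own channel's envelope, the two
residue vectors produced by <tt>remove_floor</tt> come out identical,
which is the best case for the coupling described next.</p>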
<p>Because we strive to maximally correlate the left and right channels
and generally succeed in doing so, left and right residue is typically
nearly identical. We could use channel interleaving (discussed below)
alone to efficiently remove the redundancy between the left and right
channels as a side effect of entropy encoding, but a polar
representation gives additional benefits when left/right correlation is
strong.</p>
<h4>point and diffuse imaging</h4>
<p>The first advantage of a polar representation is that it effectively
separates the spatial audio information into a 'point image'
(magnitude) at a given frequency and located somewhere in the sound
field, and a 'diffuse image' (angle) that fills a large amount of
space simultaneously. Even if we preserve only the magnitude (point)
data, a detailed and carefully chosen floor function in each channel
provides us with a free, fine-grained, frequency-relative intensity
stereo*. Angle information represents diffuse sound fields, such as
reverberation that fills the entire space simultaneously.</p>
<p>*<em>Because the Vorbis model supports a number of different possible
stereo models and these models may be mixed, we do not use the term
'intensity stereo' when talking about Vorbis; instead we use the terms
'point stereo', 'phase stereo' and subcategories of each.</em></p>
<p>The majority of a stereo image is representable by polar magnitude
alone, as strong sounds tend to be produced at near-point sources;
even non-diffuse, fast, sharp echoes track very accurately using
magnitude representation almost alone (for those experimenting with
Vorbis tuning, this strategy works much better with the precise,
piecewise control of floor 1; the continuous approximation of floor 0
results in unstable imaging). Reverberation and diffuse sounds tend
to contain less energy and be psychoacoustically dominated by the
point sources embedded in them. Thus, we again tend to concentrate
more represented energy into a predictably smaller number of values.
Separating representation of point and diffuse imaging also allows us
to model and manipulate point and diffuse qualities separately.</p>
<h4>controlling bit leakage and symbol crosstalk</h4>
<p>Because polar
representation concentrates represented energy into fewer large
values, we reduce bit 'leakage' during cascading (multistage VQ
encoding) as a secondary benefit. A single large, monolithic VQ
codebook is more efficient than a cascaded book due to entropy
'crosstalk' among symbols between different stages of a multistage cascade.
Polar representation is a way of further concentrating entropy into
predictable locations so that codebook design can take steps to
improve multistage codebook efficiency. It also allows us to cascade
various elements of the stereo image independently.</p>
<h4>eliminating trigonometry and rounding</h4>
<p>Rounding and computational complexity are potential problems with a
polar representation. As our encoding process involves quantization,
mixing a polar representation and quantization makes it potentially
impossible, depending on implementation, to construct a coupled stereo
mechanism that results in bit-identical decompressed output compared
to an uncoupled encoding, should the encoder desire it.</p>
<p>Vorbis uses a mapping that preserves the most useful qualities of
polar representation, relies only on addition/subtraction (during
decode; high quality encoding still requires some trig), and makes it
trivial before or after quantization to represent an angle/magnitude
through a one-to-one mapping from possible left/right value
permutations. We do this by basing our polar representation on the
unit square rather than the unit circle.</p>
<p>Given a magnitude and angle, we recover left and right using the
following function (note that A/B may be left/right or right/left
depending on the coupling definition used by the encoder):</p>
<pre>
/* square polar decode: recover A and B from magnitude and angle */
if(magnitude>0){
  if(angle>0){
    A=magnitude;
    B=magnitude-angle;
  }else{
    B=magnitude;
    A=magnitude+angle;
  }
}else{
  /* mirrored case for zero or negative magnitude */
  if(angle>0){
    A=magnitude;
    B=magnitude+angle;
  }else{
    B=magnitude;
    A=magnitude-angle;
  }
}
</pre>
<p>The function is antisymmetric for positive and negative magnitudes in
order to eliminate a redundant value when quantizing. For example, if
we're quantizing to integer values, we can visualize a magnitude of 5
and an angle of -2 as follows:</p>
<p><img src="squarepolar.png" alt="square polar"/></p>
<p>This representation neither loses nor replicates any values; if A
and B each range over the integers -5 through 5, the number of possible
Cartesian permutations is 121. Represented in square polar notation, the
possible values are:</p>
<pre>
 0, 0

-1,-2  -1,-1  -1, 0  -1, 1

 1,-2   1,-1   1, 0   1, 1

-2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3

 2,-4   2,-3   ... following the pattern ...

 ...    5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9
</pre>
<p>...for a grand total of 121 possible values, the same number as in
Cartesian representation. (Note that, for example, <tt>5,-10</tt> is
the same as <tt>-5,10</tt>, so there's no reason to represent
both; <tt>2,10</tt> cannot happen, and there's no reason to account for it.)
It's also obvious that this mapping is exactly reversible.</p>
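<p>The reversibility claim can be checked with a small C sketch. The
decode function follows the mapping given above; the encode function is
a hypothetical inverse derived from it for illustration and is not taken
from libvorbis. All function names are assumptions.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Decoder side: recover the A/B pair from magnitude/angle. */
static void sqpolar_decode(int magnitude, int angle, int *a, int *b)
{
    assert(a && b);
    if (magnitude > 0) {
        if (angle > 0) { *a = magnitude; *b = magnitude - angle; }
        else           { *b = magnitude; *a = magnitude + angle; }
    } else {
        if (angle > 0) { *a = magnitude; *b = magnitude + angle; }
        else           { *b = magnitude; *a = magnitude - angle; }
    }
}

/* Encoder side (hypothetical inverse): the value with the larger
   absolute value becomes the magnitude; ties go to B. */
static void sqpolar_encode(int a, int b, int *magnitude, int *angle)
{
    assert(magnitude && angle);
    if (abs(a) > abs(b)) {
        *magnitude = a;
        *angle = (a > 0) ? a - b : b - a;   /* always positive */
    } else {
        *magnitude = b;
        *angle = (b > 0) ? a - b : b - a;   /* always zero or negative */
    }
}

/* Count the A/B pairs in [-range, range] that survive an
   encode/decode round trip unchanged. */
static int sqpolar_count_round_trips(int range)
{
    int ok = 0;
    for (int a = -range; a <= range; a++) {
        for (int b = -range; b <= range; b++) {
            int m, ang, a2, b2;
            sqpolar_encode(a, b, &m, &ang);
            sqpolar_decode(m, ang, &a2, &b2);
            if (a2 == a && b2 == b)
                ok++;
        }
    }
    return ok;
}
```

<p>With these definitions, decoding a magnitude of 5 and an angle of -2
gives A=3 and B=5, both <tt>5,-10</tt> and <tt>-5,10</tt> decode to
A=-5, B=5, and <tt>sqpolar_count_round_trips(5)</tt> returns 121, one
for every Cartesian pair.</p>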
<h3>Channel interleaving</h3>
<p>We can remap an A/B vector using polar mapping into a magnitude/angle
vector, and it's clear that, in general, this concentrates energy in
the magnitude vector and reduces the amount of information to encode
in the angle vector. Encoding these vectors independently with
residue backend #0 or residue backend #1 will result in bitrate
savings. However, there are still implicit correlations between the
magnitude and angle vectors. The most obvious is that the amplitude
of the angle is bounded by its corresponding magnitude value.</p>
<p>Entropy coding the results, then, further benefits from the entropy
model being able to compress magnitude and angle simultaneously. For
this reason, Vorbis implements residue backend #2, which pre-interleaves
a number of input vectors (in the stereo case, two, A and B) into a
single output vector (with the elements in the order of
A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
each vector to be coded by the vector quantization backend consists of
matching magnitude and angle values.</p>
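<p>The interleaving order described above can be sketched as follows.
This is an illustration of the element ordering only, not the residue
backend itself; <tt>interleave_pair</tt> is a hypothetical name.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors into one 2n-element vector in the
   order A_0, B_0, A_1, B_1, ... A_n-1, B_n-1. */
static void interleave_pair(const int *a, const int *b, int *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}
```

<p>When the two inputs are the magnitude and angle vectors, each
adjacent pair in the output holds a magnitude and its matching angle,
so a codebook that spans an even number of elements sees them
together.</p>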
<p>The astute reader, at this point, will notice that in the theoretical
case in which we can use monolithic codebooks of arbitrarily large
size, we can directly interleave and encode left and right without
polar mapping; in fact, the polar mapping does not appear to lend any
benefit whatsoever to the efficiency of the entropy coding. Indeed,
it is perfectly possible and reasonable to build a Vorbis encoder that
dispenses with polar mapping entirely and merely interleaves the
channels. Libvorbis-based encoders may configure such an encoding and
it will work as intended.</p>
<p>However, when we leave the ideal/theoretical domain, we notice that
polar mapping does give additional practical benefits, as discussed in
the above section on polar mapping and summarized again here:</p>
<ul>
<li>Polar mapping aids in controlling entropy 'leakage' between stages
of a cascaded codebook.</li>
<li>Polar mapping separates the stereo image
into point and diffuse components which may be analyzed and handled
differently.</li>
</ul>
<h2>Stereo Models</h2>
<h3>Dual Stereo</h3>
<p>Dual stereo refers to stereo encoding where the channels are entirely
separate; they are analyzed and encoded as entirely distinct entities.
This terminology is familiar from mp3.</p>
<h3>Lossless Stereo</h3>
<p>Using polar mapping and/or channel interleaving, it's possible to
couple Vorbis channels losslessly, that is, construct a stereo
coupling encoding that both saves space and decodes
bit-identically to dual stereo. OggEnc 1.0 and later use this
mode in all high-bitrate encoding.</p>
<p>Overall, this stereo mode is overkill; however, it offers a safe
alternative to users concerned about the slightest possible
degradation to the stereo image or archival quality audio.</p>
<h3>Phase Stereo</h3>
<p>Phase stereo is the least aggressive means of gracefully dropping
resolution from the stereo image; it affects only diffuse imaging.</p>
<p>It's often quoted that the human ear is deaf to signal phase above
about 4kHz; this is nearly true and a passable rule of thumb, but it
can be demonstrated that even an average user can tell the difference
between high frequency in-phase and out-of-phase noise. Obviously
then, the statement is not entirely true. However, it's also the case
that one must resort to nearly such an extreme demonstration before
finding the counterexample.</p>
<p>'Phase stereo' is simply a more aggressive quantization of the polar
angle vector; above 4kHz it's generally quite safe to quantize noise
and noisy elements to only a handful of allowed phases, or to thin the
phase with respect to the magnitude. The phases of high amplitude
pure tones may or may not be preserved more carefully (they are
relatively rare and L/R tend to be in phase, so there is generally
little reason not to spend a few more bits on them).</p>
<h4>example: eight phase stereo</h4>
<p>Vorbis may implement phase stereo coupling by preserving the entirety
of the magnitude vector (essential to fine amplitude and energy
resolution overall) and quantizing the angle vector to one of only
four possible values. Given that the magnitude vector may be positive
or negative, this results in left and right phase having eight
possible permutations, thus 'eight phase stereo':</p>
<p><img src="eightphase.png" alt="eight phase"/></p>
<p>Left and right may be in phase (positive or negative), the most common
case by far, or out of phase by 90 or 180 degrees.</p>
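<p>A rough sketch of such an angle quantizer follows. It assumes that
the four allowed angles for a magnitude m are 0 (in phase), |m| and -|m|
(one channel silent, read here as the 90 degree cases) and -2|m| (180
degrees out of phase); that reading of the figure, the nearest-value
rule and the function name are assumptions, not the rc2 encoder's
actual method.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Snap a square polar angle to the nearest of four allowed values for
   the given magnitude, leaving the magnitude itself untouched.
   Assumed allowed set: -2|m|, -|m|, 0, |m|. Ties go to the earlier
   (more negative) entry. */
static int eight_phase_quantize_angle(int magnitude, int angle)
{
    int k = abs(magnitude);
    const int allowed[4] = { -2 * k, -k, 0, k };
    int best = allowed[0];

    assert(angle >= -2 * k && angle <= 2 * k);  /* valid square polar range */
    for (int i = 1; i < 4; i++) {
        if (abs(angle - allowed[i]) < abs(angle - best))
            best = allowed[i];
    }
    return best;
}
```

<p>Under these assumptions a magnitude of 5 with an angle of -2 snaps to
angle 0 (in phase), while an angle of 4 snaps to 5, which decodes to one
channel at full magnitude and the other at zero.</p>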
<h4>example: four phase stereo</h4>
<p>Similarly, four phase stereo takes the quantization one step further;
it allows only in-phase and 180 degree out-of-phase signals:</p>
<p><img src="fourphase.png" alt="four phase"/></p>
<h3>Point Stereo</h3>
<p>Point stereo eliminates the possibility of out-of-phase signal
entirely. Any diffuse quality to a sound source tends to collapse
inward to a point somewhere within the stereo image. A practical
example would be balanced reverberations within a large, live space;
normally the sound is diffuse and soft, giving a sonic impression of
volume. In point stereo, the reverberations would still exist, but
sound fairly firmly centered within the image (assuming the
reverberation was centered overall; if the reverberation is stronger
to the left, then the point of localization in point stereo would be
to the left). This effect is most noticeable at low and mid
frequencies and using headphones (which grant perfect stereo
separation). Point stereo is a graceful but generally easy to
detect degradation to the sound quality and is thus used in frequency
ranges where it is least noticeable.</p>
<h3>Mixed Stereo</h3>
<p>Mixed stereo is the simultaneous use of more than one of the above
stereo encoding models, generally using more aggressive modes in
higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>
<p>It is also the case that near-DC frequencies should be encoded using
lossless coupling to avoid frame blocking artifacts.</p>
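<p>One very simple mixed model can be sketched as follows: keep the
angle vector intact (lossless coupling) for the lowest bins and drop it
to zero (point stereo) above a cutoff. The cutoff parameter, the
function name and the choice of a single frequency split are
assumptions for illustration; a real encoder would also weigh amplitude
and phase when choosing a mode.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mixed model: bins below lossless_bins keep their angle
   (lossless coupling); bins at or above it have the angle zeroed, so
   both channels decode to the same magnitude (point stereo). */
static void mixed_stereo_apply(int *angle, size_t n, size_t lossless_bins)
{
    assert(angle);
    for (size_t i = lossless_bins; i < n; i++)
        angle[i] = 0;
}
```

<p>With an angle vector of <tt>3, -1, 2, -4, 1</tt> and
<tt>lossless_bins</tt> set to 2, the first two angles survive and the
remaining three become zero.</p>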
<h3>Vorbis Stereo Modes</h3>
<p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
constructed out of lossless and point stereo. Phase stereo was used
in the rc2 encoder, but is not currently used for simplicity's sake. It
will likely be re-added to the stereo model in the future.</p>
<div id="copyright">
The Xiph Fish Logo is a
trademark (&trade;) of Xiph.Org.<br/>
These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
</div>
</body>
</html>