Auditory acquisition of information

Sensory systems have evolved because they provide an advantage in the evolutionary process of selection. The advantage of any sensory system is that it provides information about what is going on in the environment and thus enables appropriate action and reaction. Within the physical limits set by the basic mode and design of a particular sensory system, the latter will optimally fulfil the requirement to acquire information about external objects. One can be sure about that because any sensory system that once existed but was a failure in this respect was inevitably "filtered out" by evolutionary selection [87], [93], [104].

In view of these notions it is not surprising that higher animals, such as Man, possess an auditory system that can acquire detailed information about sound sources even from large distances. How is this accomplished? When one takes into account a number of serious physical limitations and obstacles in the way of that auditory achievement one indeed wonders how it can be accomplished.

The physical input to the auditory system (of Man) is confined to the sound-pressure oscillations at the two eardrums (the ear signals). From the complex sound field that environmental sources produce outside a listener's head, only two single spots are (and can be) "listened to". Obviously, movement of the head relaxes this restriction somewhat, but, as we know, humans at least hardly depend on movement in the sound field. Ordinarily, humans are not even able to move their pinnae.
Ordinarily, there are multiple sources in the environment, emitting sound. These individual sound waves become mixed in the sound field, i.e., superimposed, and what is available at the eardrum is only a mixture of several or many source signals. There is no physical principle according to which that signal can be decomposed into the contributions of the individual sound sources.
In practically any situation, the sound signals emitted by sources are subject to considerable linear distortion before they arrive at a listener's eardrums. So, the sound signal provided by an individual source is not only mixed with other signals but is also considerably distorted by quasi-random changes of both its amplitude and phase spectrum.

How is it conceivable that under such conditions the auditory system can achieve a fairly detailed subjective representation both of the general acoustical nature of the environment and of individual external sound sources? There does not exist any physical principle according to which the individual source signals can be recovered from the ear signals, and this is true in a twofold sense:

Firstly (as already mentioned above), there is no physical principle for decomposing an ear signal into the contributions provided by each of the external sources.
Secondly, there is no physical principle according to which the linear distortion imposed on the individual source signals can be "stripped off" to recover the original source signals, i.e., without detailed knowledge of the characteristics of the transmission paths.
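
Both points can be illustrated with a small numerical sketch. It is only an illustration with made-up sample values; the helper names mix and convolve are assumptions introduced here and are not part of the original argument. Two different pairs of sources yield the same mixture, and two different combinations of source and transmission path yield the same received signal, so the ear signal alone cannot tell them apart.

```python
def mix(*signals):
    """Superimpose equally long signals sample by sample (hypothetical helper)."""
    return [sum(samples) for samples in zip(*signals)]


def convolve(x, h):
    """Plain linear convolution: signal x passed through a path with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# First ambiguity: two different source pairs, one and the same mixture.
src_a, src_b = [1.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5]
alt_a, alt_b = [1.5, 0.0, -0.5, 0.5], [0.0, 0.5, 0.0, 0.0]
mixture_1 = mix(src_a, src_b)
mixture_2 = mix(alt_a, alt_b)

# Second ambiguity: two different (source, path) combinations, one received signal.
received_1 = convolve([1.0, 2.0], [1.0, 0.5])   # source [1, 2] through path [1, 0.5]
received_2 = convolve([1.0, 2.5, 1.0], [1.0])   # a different source through a "transparent" path

print(mixture_1 == mixture_2, received_1 == received_2)
```

In both cases the observable signal is identical although the underlying sources differ, which is why additional knowledge about sources or paths is needed to choose between the alternatives.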

The first of the latter two notions makes apparent that auditory segregation of sound objects may be accomplished only (if at all) in a "non-physical" manner, i.e., by intelligently interpreting the ear signals, taking advantage of permanent "auditory knowledge" about what kinds of sound sources exist in the real world, and what kinds of clues source signals can contribute to the ear signals. As that permanent auditory knowledge base must be conceived as having emerged principally by evolution, it cannot be expected to include knowledge of such kinds of sound sources as, e.g., earphones, radios, telephones, and digital music synthesizers. This is why the auditory system can be deceived by those devices, which evoke auditory "images" of acoustic objects that are not really present.

From the second notion it follows that the process of interpretation cannot depend on features of sound signals that are vulnerable to linear distortion. In other words, the auditory system must to a large extent be designed such that it primarily takes advantage of ear-signal parameters that are highly robust with regard to quasi-random distortion of the amplitude and phase spectrum.

This is where pitch enters the stage. In its primary version, i.e., spectral pitch, it provides an immediate auditory representation of spectral frequency, i.e., of the frequency of Fourier components, or, in more general terms, of spectral discontinuities. This is enormously significant, as spectral frequencies, although ordinarily time variant, are highly robust with regard to linear distortion. That is to say, while the amplitude and phase spectrum of a source signal is ordinarily considerably distorted on the way from source to ear, the spectral frequencies remain practically unaffected. Anyone who has ever listened to a musical performance in a concert hall (who has not?) should immediately understand that this is true. In fact, this experience is so common that one easily forgets to wonder about it.
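
The robustness of spectral frequency can be checked with a small numerical sketch. The following is a minimal illustration under stated assumptions, not a model of any real acoustic path: the frame length N, the two component frequencies (DFT bins 5 and 12) and the four-tap impulse response named path are arbitrary values chosen here, and the helper functions are hypothetical. The transmission path is modelled as a linear time-invariant filter acting on a periodic signal.

```python
import cmath
import math

N = 64  # samples per analysis frame (assumed value)


def dft_bin(x, k):
    """Complex DFT coefficient of x at bin k."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))


def circular_filter(x, h):
    """Steady-state response of a linear time-invariant path h to a periodic signal x."""
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]


def component_bins(x, threshold=1e-6):
    """Bins (0 .. N/2) that carry a non-negligible spectral component."""
    n = len(x)
    return [k for k in range(n // 2 + 1) if abs(dft_bin(x, k)) / n > threshold]


def amplitude_and_phase(x, k):
    """Amplitude and phase (radians) of the component at bin k."""
    c = dft_bin(x, k)
    return 2 * abs(c) / len(x), cmath.phase(c)


# Source signal: two sinusoidal components (assumed example values).
source = [math.sin(2 * math.pi * 5 * t / N)
          + 0.5 * math.sin(2 * math.pi * 12 * t / N + 0.7) for t in range(N)]

# Arbitrary impulse response standing in for the transmission path.
path = [1.0, 0.6, -0.3, 0.2]
ear = circular_filter(source, path)

print("component bins at source:", component_bins(source))
print("component bins at ear:   ", component_bins(ear))
print("bin 5 at source (amplitude, phase):", amplitude_and_phase(source, 5))
print("bin 5 at ear    (amplitude, phase):", amplitude_and_phase(ear, 5))
```

In this sketch the filter changes the amplitude and phase of each component, while the set of frequencies at which components occur stays the same. A path whose transfer function happens to be zero at a component frequency would remove that component, but it still could not shift it to another frequency.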

Explaining why spectral frequency is so robust is a matter of linear-systems theory; it is not necessary to discuss this here (cf. [87], [93], [96], [104] p. 9). What matters here is the notion that it is auditory extraction of pitches that to a large extent provides for acquisition of information about the characteristics of sound sources. Of course, this is possible only through time variance of both frequency and pitch, as any parameter that is constant in time cannot convey any information.

Auditory representation of a sound signal by a time-variant pattern of spectral pitches obviously requires some kind of Fourier analysis. Consequently, in higher vertebrates a very efficient and selective peripheral system for time-variant Fourier analysis has evolved, i.e., in the cochlea. The phylogenesis of the auditory system indeed reveals that there must be a strong selection pressure on the evolution of auditory frequency analysis, indicating a pronounced advantage of that achievement (cf. Manley 1986a). From primitive ears that have only a rudimentary apparatus for frequency analysis one may learn that even a primitive kind of frequency analysis is advantageous as compared to having no frequency analysis at all. The above considerations on the robustness of spectral frequency explain why this is so.

However, there is even more benefit implied in auditory frequency analysis. In fact, auditory frequency analysis not only provides for overcoming the problems introduced by linear distortion of the acoustic transmission path; it also provides the key to solving the second major problem, i.e., auditory segregation of sound objects. Obviously, formation of spectral pitches on the basis of Fourier analysis as such is a kind of sound segregation, namely, segregation of the ear signal into spectral pitches. This, of course, does not immediately yield the desired end result, i.e., representation of multiple sound objects, but provides a reasonable basis for attaining that goal by subsequent auditory interpretation of the spectral-pitch pattern. For instance, in the topic virtual pitch I have outlined that, and how, the auditory system can both discover and represent the periodic type of sound objects, namely, by formation of virtual pitch. This kind of interpretation is analogous to the interpretation of primary visual contours by the visual system, i.e., to achieve a subjective representation of the three-dimensional external world.

I have concluded from these notions that in the auditory system pitch plays a role that in many respects is analogous to that of contour in vision [75], [76], [87], [88], [93], [94], [96], [104] p. 23-30. Pitch may indeed be regarded as "auditory contour". Just like visual contour, pitch comes in two varieties: primary and virtual. To appreciate that analogy, one should be prepared to abstract from the fact that the eye's primary receptor field (the retina) is two-dimensional, while the auditory one (the low-high dimension) is one-dimensional. This merely has the consequence that visual primary contour (a "line") is one-dimensional, while auditory primary contour, i.e., spectral pitch, is zero-dimensional (a "point"). Just as a visual Gestalt is defined by a set of contours that occur at appropriate places of the visual receptor plane, an auditory Gestalt is defined by a particular combination of spectral pitches that either occur or are missing at definite points of the auditory low-high dimension.

The latter notion strongly suggests another most interesting analogy, namely, that to the array of bits of a binary number, or "computer word". I appreciate the latter analogy particularly because it makes drastically apparent that pitch combinations do indeed convey information.
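
The analogy can be made concrete with a toy sketch. It rests on an assumption that goes beyond the text: the low-high dimension is divided into a small number of discrete "slots", and a spectral pitch is either present or missing in each slot. The function names and the choice of eight slots are hypothetical and serve only as illustration.

```python
def pattern_to_word(present_slots, n_slots):
    """Encode which slots of the low-high dimension hold a spectral pitch as an integer word."""
    word = 0
    for slot in present_slots:
        if not 0 <= slot < n_slots:
            raise ValueError(f"slot {slot} lies outside 0..{n_slots - 1}")
        word |= 1 << slot  # set the bit belonging to this slot
    return word


def word_to_pattern(word, n_slots):
    """Recover the set of occupied slots from the integer word."""
    return {slot for slot in range(n_slots) if (word >> slot) & 1}


N_SLOTS = 8              # assumed resolution of the low-high dimension
pattern = {0, 3, 5}      # spectral pitches present in slots 0, 3 and 5

word = pattern_to_word(pattern, N_SLOTS)
print(format(word, "08b"))   # bit pattern, lowest slot rightmost
print(2 ** N_SLOTS)          # number of distinguishable pitch combinations
```

With n slots there are 2 to the power n distinguishable combinations of present and missing pitches, which is the sense in which a pitch combination resembles a computer word.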

Many of the above notions may appear more or less self-evident - once one has become aware of them. However, from the contemporary literature it becomes apparent that only a few authors, such as Bregman (1990a) and Hartmann (1988a, 1996a), are pursuing the same, or at least a similar, approach.
