If I were to write a game engine today ... the audio system would

These are some thoughts on what a modern audio system would require.



  • The compressed audio codecs should be distinct from the sound system.
  • The original PCM data should be kept to allow repeated conversion and analysis.
  • All volume/filter changes should be faded over time.
  • Implementation


    The codec layer

    This layer compresses and decompresses the audio data. For example, Vorbis, MP3, XMA2, ATRAC, ADPCM, and Opus would all be handled here. The compression code would be editor only and the decompression code would be run time.

    XMA2 and MP3 require special handling to loop seamlessly, so the ability to loop would be a property. The codecs should be able to take a streaming source. Internal loop points are critical for efficient use of the channels, and should be another supported feature. For example, if you have a machine gun and fire off a unique sound with a long tail per bullet, the channels will be swamped and the system will slow down. If an internal loop point repeated the shot sound until it was 'released', only a single channel would be used.

    The input should always be 16 bit PCM so the compression qualities can be experimented with. It should support mono, stereo, 4.0, and 5.1 audio files. The audio properties should be kept to a minimum: can be looped, has subtitles, number of channels, length in milliseconds.
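    A minimal sketch of what this codec-layer interface could look like; the names and the in-memory decoder are my own illustrations, not from any shipped engine. The key idea is that decode always yields 16 bit PCM, and the internal loop repeats a span until released so one channel can serve the machine-gun case.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal properties, as suggested above.
struct AudioProperties {
    bool     canLoop;         // XMA2/MP3 need special handling to loop seamlessly
    bool     hasSubtitles;    // dialog only
    uint8_t  channelCount;    // 1 (mono), 2 (stereo), 4 (4.0), 6 (5.1)
    uint32_t lengthMs;        // decoded length in milliseconds
};

class IAudioDecoder {
public:
    virtual ~IAudioDecoder() = default;
    // Pull-model decode so large files can stream instead of being resident.
    virtual size_t decode(int16_t* pcmOut, size_t maxSamples) = 0;
    // Repeat [loopStartSample, loopEndSample) until release(), then play the tail.
    virtual void setInternalLoop(uint32_t loopStartSample, uint32_t loopEndSample) = 0;
    virtual void release() = 0;
    virtual AudioProperties properties() const = 0;
};

// Toy in-memory "codec" to show the loop/release behaviour.
class MemoryPcmDecoder : public IAudioDecoder {
public:
    explicit MemoryPcmDecoder(std::vector<int16_t> samples)
        : samples_(std::move(samples)) {}

    size_t decode(int16_t* pcmOut, size_t maxSamples) override {
        size_t written = 0;
        while (written < maxSamples && pos_ < samples_.size()) {
            pcmOut[written++] = samples_[pos_++];
            // While looping, wrap back to the loop start at the loop end.
            if (looping_ && pos_ == loopEnd_) pos_ = loopStart_;
        }
        return written;
    }
    void setInternalLoop(uint32_t s, uint32_t e) override {
        loopStart_ = s; loopEnd_ = e; looping_ = true;
    }
    void release() override { looping_ = false; }  // stop looping, run out the tail
    AudioProperties properties() const override {
        return {true, false, 1, uint32_t(samples_.size() * 1000 / 44100)};
    }
private:
    std::vector<int16_t> samples_;
    size_t   pos_ = 0;
    uint32_t loopStart_ = 0, loopEnd_ = 0;
    bool     looping_ = false;
};
```

    A machine-gun directive would set the internal loop over the shot body when the trigger is held and call release() when it is let go, using one channel instead of one per bullet.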

    Unfortunately, each dialog audio file needs subtitles. This would be additional metadata: a subtitle id/default subtitle and whether the audio contains any age restricted words. The subtitles need basic markup for when the sound designer decides to do something 'special' or hide bad words. For example, temporarily and artificially silence and slow down a line of dialog and then put it back to normal later on so the audio matches the animation (a Gears of War 'bug' I addressed). Age restricted dialog is tedious to handle; having some metadata here to silence or beep the dialog would work. Recording an alternate line of dialog for every mature line wastes voice actor time, sound designer time, and client memory.
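    One possible shape for that dialog metadata; everything here is illustrative. Timed markup events let the subtitle track follow whatever the designer did to the audio, and mature-word time ranges let a single recording serve both age ratings by beeping or silencing just those spans.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical markup vocabulary for 'special' subtitle handling.
enum class SubtitleMarkup : uint8_t { None, SlowDown, Silence, Resume };

struct SubtitleEvent {
    uint32_t       timeMs;   // offset into the audio file
    std::string    text;     // text revealed at this time
    SubtitleMarkup markup;
};

struct MatureRange { uint32_t startMs, endMs; };  // beeped/silenced when restricted

struct DialogMetadata {
    uint32_t                   subtitleId;    // key into the localization table
    std::vector<SubtitleEvent> events;
    std::vector<MatureRange>   matureRanges;  // empty for clean lines
    bool isMature() const { return !matureRanges.empty(); }
};

// At playback time: should this moment be beeped/silenced?
inline bool shouldCensor(const DialogMetadata& d, uint32_t nowMs, bool matureAllowed) {
    if (matureAllowed) return false;
    for (const auto& r : d.matureRanges)
        if (nowMs >= r.startMs && nowMs < r.endMs) return true;
    return false;
}
```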

    Automatic lip syncing can be achieved with phoneme extraction and blending viseme mouth animations. This is not perfect, but much better than nothing at all and works for all languages. The lips sync, but the eyes remain the same and there are no facial expressions. For hero scenes, hand animation is better - but that is very animator resource intensive. TTS (text to speech) could be used to stub in dialog and have the phonemes extracted. This would give the story designer a chance to prototype scenes with actual lines of dialog with lip syncing without having to bother the voice actor at all. This would help reduce pickup sessions. TTS could be used to add in a last minute line if the schedule does not allow for any more retakes. (Film Actors Guild notwithstanding).

    A common memory optimization is to resample audio down to the minimum sample rate that maintains the same perceived quality. Kate Bush requires a higher sample rate than Barry White to sound 'lossless'. This is a manual process, very tedious, and format dependent. Automating this would be a boon for memory usage.
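    One way this could be automated (my assumption, not a production-proven detector): analyse the spectrum, find the highest frequency carrying meaningful energy, and pick the smallest standard rate whose Nyquist limit still covers it. A real tool would window the signal, average many frames, and use an FFT; the naive DFT below just keeps the idea visible.

```cpp
#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Highest frequency (Hz) whose DFT magnitude exceeds floorRatio of the peak.
inline double highestActiveHz(const std::vector<double>& samples, double sampleRate,
                              double floorRatio = 1e-4) {
    const double kTau = 6.283185307179586;
    const size_t n = samples.size();
    std::vector<double> mag(n / 2, 0.0);
    double peak = 0.0;
    for (size_t k = 0; k < n / 2; ++k) {        // analyse up to Nyquist
        double re = 0.0, im = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double ph = kTau * double(k) * double(i) / double(n);
            re += samples[i] * std::cos(ph);
            im -= samples[i] * std::sin(ph);
        }
        mag[k] = std::sqrt(re * re + im * im);
        if (mag[k] > peak) peak = mag[k];
    }
    double highest = 0.0;
    for (size_t k = 0; k < n / 2; ++k)
        if (mag[k] > peak * floorRatio)
            highest = double(k) * sampleRate / double(n);
    return highest;
}

// Smallest common rate whose Nyquist frequency covers the content.
inline int minimalSampleRate(double highestHz) {
    for (int rate : {8000, 11025, 16000, 22050, 32000, 44100, 48000})
        if (highestHz * 2.0 <= double(rate)) return rate;
    return 48000;
}
```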

    The hardware layer

    This should be based on the Direct Sound 2 feature set. It's been a while, but it has the buses and channeling required to handle every use case (such as individual speaker volume control; the front center can be boosted for dialog). I think the Mac has a similar API. OpenAL would be able to use a small subset of these features.

    What would be nice is to have a custom spatialization and attenuation module rather than relying on the host hardware layer. This is not easy. Each bit of hardware handles it slightly differently, and quite poorly in the case of OpenAL. Likewise, reverb.
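    A minimal sketch of what such a custom module might compute, assuming a simple inverse-distance falloff and constant-power stereo panning (both my choices for the example, not a spec): doing this in software keeps the behaviour identical across DirectSound, OpenAL, and console back ends.

```cpp
#include <algorithm>
#include <cmath>

struct SpeakerGains { float left, right; };

// Inverse-distance attenuation, clamped between a reference and a maximum
// distance: 1.0 at refDist, falling off as refDist / d beyond it.
inline float attenuate(float distance, float refDist, float maxDist) {
    float d = std::min(std::max(distance, refDist), maxDist);
    return refDist / d;
}

// Constant-power pan: L^2 + R^2 == 1 everywhere, so a sound sweeping across
// the stereo field does not dip in perceived loudness at the centre.
inline SpeakerGains panConstantPower(float pan /* -1 = left, +1 = right */) {
    const float kPi = 3.14159265f;
    float angle = (pan + 1.0f) * 0.25f * kPi;   // map [-1,1] -> [0, pi/2]
    return {std::cos(angle), std::sin(angle)};
}
```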

    The game layer

    Each audio file would have an associated set of directives (think Unreal Engine sound cue). For example, StereoAmbientEffect, MonoSpotEffect, or JohnDialog. There would only be a few of these, used by many audio files; all dialog from the character John would share the same directives. Each directive would be part of a sound group (e.g. dialog) that could be used by the game code to control the overall feel of the audioscape. For example, setting the audio mode to 'intense dialog' would dampen all the generic effects and ambient sounds but boost the player dialog. This should be controlled by the game coder and not be automatic on the playing of a directive. Another feature would be an audio mode that applies a filter on the final output; the interface would be the same, but the implementation would be very different. The directives would include nodes such as loop, attenuation (from points and planes), modulation, spatialization, volume envelope, filter (lo/hi/custom), random, internal loop, and doppler.

    The directive properties should be minimal - group (from a hierarchy) and priority.

    Prioritization - the sounds should be sorted by attenuated volume and the quietest ones thrown away. This requires care if an audio sample is recorded loudly but some directives artificially reduce the volume to make it sound correct (a Quake 4 bug I fixed). This can be mitigated by properly explaining to the sound designer how prioritization works and giving the directives a priority, e.g. 'this audio is background, so it is safe to attenuate out' versus 'this is the player shooting a gun; never ever drop it'.
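    A sketch of that culling pass, under my assumption that priority is an integer band on the directive: sounds over the channel budget are dropped quietest-first, but only within the same band, so a 'never drop' directive survives even if its sample was authored quiet and further attenuated.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ActiveSound {
    int   id;
    int   priority;          // higher = more important, set on the directive
    float attenuatedVolume;  // post-attenuation, post-directive volume
};

// Keep at most channelBudget sounds: priority first, then loudness.
inline std::vector<ActiveSound> cullToBudget(std::vector<ActiveSound> sounds,
                                             size_t channelBudget) {
    std::sort(sounds.begin(), sounds.end(),
              [](const ActiveSound& a, const ActiveSound& b) {
                  if (a.priority != b.priority) return a.priority > b.priority;
                  return a.attenuatedVolume > b.attenuatedVolume;
              });
    if (sounds.size() > channelBudget) sounds.resize(channelBudget);
    return sounds;
}
```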


    Once an ambient sound gets attenuated out of range, it is completely forgotten by the entire system and gets restarted should it come back into range. This can produce some annoying anomalies. It could be fixed by starting the sound at the current time modulo the length of the sound.
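    That fix amounts to one line: treat the ambient loop as if it had been playing the whole time, and on re-entry start it at the offset it would have reached.

```cpp
#include <cstdint>

// Offset to resume a looping ambient sound at, as if it never stopped:
// elapsed world time modulo the sound's length.
inline uint32_t resumeOffsetMs(uint32_t worldTimeMs, uint32_t soundLengthMs) {
    return soundLengthMs ? worldTimeMs % soundLengthMs : 0;
}
```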