Make sure the audio being analyzed conforms to the following guidelines:
The audio must be recorded at 16 kHz or greater.
The audio file must be a 16-, 24-, or 32-bit audio file.
Be aware that there are multiple 32-bit WAV file standards. FaceFX assumes type 1 (32-bit).
The audio must be in one of the following formats supported by libsndfile:
AIFF (Apple/SGI) (.aiff)
AIFF (Apple/SGI) (.aif)
AU (Sun/NeXT) (.au)
AVR (Audio Visual Research) (.avr)
CAF (Apple Core Audio File) (.caf)
FLAC (FLAC Lossless Audio Codec) (.flac)
HTK (HMM Tool Kit) (.htk)
IFF (Amiga IFF/SVX8/SV16) (.iff)
MAT4 (GNU Octave 2.0 / Matlab 4.2) (.mat)
MAT5 (GNU Octave 2.1 / Matlab 5.0) (.mat)
MPC (Akai MPC 2k) (.mpc)
OGG (OGG Container format) (.oga)
OGG (OGG Container format) (.ogg)
PAF (Ensoniq PARIS) (.paf)
PVF (Portable Voice Format) (.pvf)
RAW (header-less) (.raw)
RF64 (RIFF 64) (.rf64)
SD2 (Sound Designer II) (.sd2)
SDS (Midi Sample Dump Standard) (.sds)
SF (Berkeley/IRCAM/CARL) (.sf)
VOC (Creative Labs) (.voc)
W64 (SoundFoundry WAVE 64) (.w64)
WAV (Microsoft) (.wav)
WAV (NIST Sphere) (.wav)
WAVEX (Microsoft) (.wav)
WVE (Psion Series 3) (.wve)
XI (FastTracker 2) (.xi)
Only mono data is used. FaceFX automatically reduces multi-channel audio to one channel.
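The sample rate, bit depth, and channel guidelines above can be checked before a file is handed to FaceFX. The following is a minimal sketch, not part of FaceFX: it uses Python's standard `wave` module, which reads only PCM WAV files, so the other formats in the list would need a different reader. The function name `check_wav_properties` and the constants are hypothetical names chosen for illustration.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000        # guideline: 16 kHz or greater
ALLOWED_SAMPLE_WIDTHS = {2, 3, 4}  # bytes per sample: 16, 24, or 32 bit


def check_wav_properties(path):
    """Return a list of findings for a PCM WAV file; an empty list means no findings."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    findings = []
    if rate < MIN_SAMPLE_RATE_HZ:
        findings.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width not in ALLOWED_SAMPLE_WIDTHS:
        findings.append(f"bit depth {width * 8} is not 16, 24, or 32")
    if channels != 1:
        # Not an error: the audio is reduced to one channel automatically.
        findings.append(f"{channels} channels found; only mono data will be used")
    return findings
```

A call such as `check_wav_properties("line_01.wav")` (a hypothetical file name) returns an empty list for a 16 kHz, 16-bit mono file.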
The audio file should be between 0.5 and 90 seconds long. FaceFX will attempt to analyze longer WAV files by breaking the sound into chunks that are less than 90 seconds long and analyzing the chunks without text. For audio files shorter than 0.5 seconds, FaceFX will create placeholder animations (a simple mouth open and mouth closed animation). These default minimum and maximum values can be changed with the set command using the a_audiomin and a_audiomax variables.
There is a hard maximum on audio file duration, which defaults to 10 minutes and is set by the g_maxaudioduration console variable. Any audio file with a duration greater than this limit will fail to load. This is to prevent out-of-memory crashes on very large audio files.
Audio files longer than a_audiomax are analyzed without text.
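The duration rules above amount to a small decision table. The sketch below restates them in Python for use in a pre-flight script. It is an illustration of the documented behavior, not FaceFX code: the names `AudioHandling` and `classify_duration` are hypothetical, and the parameter defaults mirror the documented defaults for a_audiomin, a_audiomax, and g_maxaudioduration. Treating a duration exactly equal to a limit as within it follows the "less than" and "greater than" wording above, but is otherwise an assumption.

```python
from enum import Enum


class AudioHandling(Enum):
    PLACEHOLDER = "placeholder animation"
    NORMAL = "analyzed normally"
    CHUNKED_NO_TEXT = "split into chunks and analyzed without text"
    REJECTED = "fails to load"


def classify_duration(seconds, audio_min=0.5, audio_max=90.0, max_duration=600.0):
    """Map an audio duration in seconds to the handling described above."""
    if seconds > max_duration:   # over the hard limit (default 10 minutes)
        return AudioHandling.REJECTED
    if seconds > audio_max:      # longer than a_audiomax
        return AudioHandling.CHUNKED_NO_TEXT
    if seconds < audio_min:      # shorter than a_audiomin
        return AudioHandling.PLACEHOLDER
    return AudioHandling.NORMAL
```

For example, `classify_duration(120.0)` returns `AudioHandling.CHUNKED_NO_TEXT`, which flags a file whose text will be ignored during analysis.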
For best results, speech should not start within the first 100 milliseconds of the audio file. This gives the speech detection algorithms time to initialize properly, and it also reduces the importance of how negative keyframes are handled in the FaceFX integration.
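One way to satisfy the 100 millisecond guideline for a recording that starts too abruptly is to prepend silence. The sketch below does this for PCM WAV files with Python's standard `wave` module. It is an assumption-laden illustration rather than a FaceFX feature: the function name `prepend_silence` is hypothetical, it handles only 16-, 24-, and 32-bit integer PCM (where silence is all zero bytes), and padding shifts the speech later by the same amount, which may or may not suit a given pipeline.

```python
import wave


def prepend_silence(src_path, dst_path, silence_ms=100):
    """Copy a PCM WAV file to dst_path with leading silence; return frames added."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    if params.sampwidth < 2:
        # 8-bit WAV is unsigned, so zero bytes would not be silence.
        raise ValueError("only 16-, 24-, and 32-bit PCM is handled in this sketch")

    silent_frames = params.framerate * silence_ms // 1000
    silence = b"\x00" * (silent_frames * params.sampwidth * params.nchannels)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(silence + frames)
    return silent_frames
```

At 16 kHz the default setting adds 1600 frames of silence to the start of the copy and leaves the source file unchanged.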