The characteristics of spontaneously spoken language, i.e., utterances that are not read or rehearsed, differ significantly from those of written text. The incremental and irrevocable nature of speech production often results in speech artifacts, e.g., stuttering, self-corrections or disrupted sentences, which are referred to as (speech) disfluencies and have been shown to decrease the performance of natural language processing systems. Hence, separating disfluent speech material from the utterances is a beneficial preprocessing step.
In our research on the detection of disfluencies, we analyzed the speech material in our corpus, compared existing disfluency annotation schemes with respect to their usefulness for our purposes, and synthesized a set of 15 disfluency types.
Furthermore, we developed a system for the automatic detection (and correction) of speech disfluencies. The system is a hybrid that combines several detection approaches: hand-written rules for the easily detectable types (e.g., hesitations or slips of the tongue) and machine learning techniques for the disfluency classes that are harder to detect. The rules were written based on lexical information alone, whereas the machine learning approach uses multi-modal features (e.g., lexical, prosodic and structural) for the detection. In addition, the system is built in a multi-step arrangement, where the sequence of the detection modules is learned during the training phase. This makes it possible to address the problem of nested disfluencies (disfluencies contained in other disfluencies), which can hardly be detected in a single-step system.
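
To illustrate why a multi-step arrangement helps with nested disfluencies, the following is a minimal sketch and not the system described above. All names (`HESITATIONS`, `remove_hesitations`, `remove_repetitions`, `run_pipeline`) are hypothetical. The repetition module is a simple lexical stand-in for a learned classifier, and the module order is fixed by hand here, whereas the system learns it during training.

```python
from typing import Callable, List

Tokens = List[str]
Module = Callable[[Tokens], Tokens]

# Hypothetical filler lexicon; a real rule set would be corpus-specific.
HESITATIONS = {"uh", "um", "er", "ehm"}


def remove_hesitations(tokens: Tokens) -> Tokens:
    """Rule-based module: drop tokens listed in the filler lexicon."""
    return [t for t in tokens if t.lower() not in HESITATIONS]


def remove_repetitions(tokens: Tokens, max_len: int = 3) -> Tokens:
    """Stand-in for a learned module: when a word sequence of up to
    max_len tokens is immediately repeated, drop the first copy."""
    out: Tokens = []
    i = 0
    while i < len(tokens):
        skipped = False
        for n in range(max_len, 0, -1):
            # Slices of unequal length never compare equal, so no
            # explicit bounds check is needed here.
            if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                i += n  # skip the first copy, keep the repeated one
                skipped = True
                break
        if not skipped:
            out.append(tokens[i])
            i += 1
    return out


def run_pipeline(tokens: Tokens, modules: List[Module]) -> Tokens:
    """Apply the detection modules one after another, in the given order."""
    for module in modules:
        tokens = module(tokens)
    return tokens


utterance = "I want uh I want to go".split()

# Single step: the hesitation hides the repetition, so nothing is removed.
single_step = run_pipeline(utterance, [remove_repetitions])

# Multi-step: removing the inner hesitation first exposes the outer repetition.
multi_step = run_pipeline(utterance, [remove_hesitations, remove_repetitions])
```

In this toy utterance, the hesitation "uh" is nested inside the repetition "I want ... I want". The repetition module alone leaves the utterance unchanged, while the two-step sequence reduces it to "I want to go".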