Special Sessions & Challenges

The Organizing Committee of INTERSPEECH 2021 proudly announces the following special sessions and challenges.

Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.

Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous and independent reviewers.

While speech recognition systems generally work well on the average population with typical speech characteristics, performance on subgroups with unique speaking patterns is usually significantly worse.

Speech that contains non-standard speech patterns (acoustic-phonetic, phonotactic, lexical and prosodic patterns) is particularly challenging, both because of the small population exhibiting these speech patterns and because of their generally higher variance. In the case of dysarthric speech, which is often correlated with mobility or other accessibility limitations, the accuracy of existing speech recognition systems is often particularly poor, rendering the technology unusable for many speakers who could benefit the most.

In this oral session, we seek to promote interdisciplinary collaborations between researchers and practitioners addressing this problem, to build community and stimulate research. We invite papers analyzing and improving systems dealing with atypical speech.

Topics of interest include, but are not limited to:

  • Automatic Speech Recognition (ASR) of atypical speech
  • Speech-to-Speech conversion/normalization (e.g. from atypical to typical)
  • Voice enhancement and convergence to improve intelligibility of spoken content of atypical speech
  • Automated classification of atypical speech conditions
  • Robustness of speech processing systems for atypical speech in common application scenarios
  • Data augmentation techniques to deal with data sparsity
  • Creating, managing the quality of, and sharing atypical speech data sets
  • Multi-modal integration (e.g. video and voice) and its application

URL

Organizers

  • Jordan R. Green, MGH Institute of Health Professions, Harvard University
  • Michael P. Brenner, Harvard University, Google
  • Fadi Biadsy, Google
  • Bob MacDonald, Google
  • Katrin Tomanek, Google

Oriental languages are rich and complex. With their great diversity in both acoustics and linguistics, oriental languages are a treasure for multilingual research. The Oriental Language Recognition (OLR) challenge has been conducted for five years with great success and has demonstrated many novel and interesting techniques devised by the participants.

The main goal of this special session is to summarize the technical advances of OLR 2020, but it welcomes all submissions related to language recognition and multilingual speech processing.

URL

Organizers

  • Dong Wang (Tsinghua University)
  • Qingyang Hong (Xiamen University)
  • Xiaolei Zhang (Northwestern Polytechnical University)
  • Ming Li (Duke Kunshan University)
  • Yufeng Hao (Speechocean)

The ConferencingSpeech 2021 challenge aims to stimulate research in multi-channel speech enhancement, targeting the processing of far-field speech captured by microphone arrays in video conferencing rooms. To reflect real video conferencing applications, the challenge database is recorded from real speakers. The number of speakers and the distances between speakers and microphone arrays vary with the size of the meeting rooms. Multiple microphone arrays with three different geometric topologies are placed in each recording environment.

The challenge will have two tasks:

  • Task 1 is multi-channel speech enhancement with a single microphone array, focusing on practical applications with real-time requirements.
  • Task 2 is multi-channel speech enhancement with multiple distributed microphone arrays; it is a non-real-time track without constraints, so participants can explore any algorithm to obtain high speech quality.

To focus on the development of algorithms, the challenge requires a closed training condition: only the provided lists of open-source clean speech and noise datasets may be used for training. In addition, the challenge will provide a development set, scripts for simulating the training data, and baseline systems for participants to develop their systems. The final ranking of the challenge will be decided by subjective evaluation, performed using Absolute Category Rating (ACR) to estimate a Mean Opinion Score (MOS) through the Tencent Online Media Subjective Evaluation platform.
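
For orientation, the sketch below shows how ACR ratings reduce to a MOS: each listener rates a clip on a 1-5 scale, and the MOS is simply the mean rating per clip (and, in turn, per system). The ratings and clip names are made up for illustration; the actual scoring is carried out on the organizers' subjective evaluation platform.

    from statistics import mean

    # Illustrative ACR ratings (1 = bad ... 5 = excellent) from five listeners
    # per processed clip; the names and values are hypothetical.
    ratings = {
        "system_A_clip_01": [4, 5, 4, 3, 4],
        "system_A_clip_02": [3, 4, 4, 4, 5],
    }

    # MOS per clip is the mean rating; a system-level MOS averages over clips.
    mos_per_clip = {clip: mean(scores) for clip, scores in ratings.items()}
    system_mos = mean(mos_per_clip.values())
    print(mos_per_clip, system_mos)   # {'system_A_clip_01': 4.0, 'system_A_clip_02': 4.0} 4.0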

More details about the data and the challenge can be found in the evaluation plan of the ConferencingSpeech 2021 challenge.

In addition to papers related to the ConferencingSpeech 2021 challenge, papers on multi-channel speech enhancement in general are encouraged for submission to this special session.

URL

Organizers

  • Wei Rao, Tencent Ethereal Audio Lab, China
  • Lei Xie, Northwestern Polytechnical University, China
  • Yannan Wang, Tencent Ethereal Audio Lab, China
  • Tao Yu, Tencent Ethereal Audio Lab, USA
  • Shinji Watanabe, Carnegie Mellon University / Johns Hopkins University, USA
  • Zheng-Hua Tan, Aalborg University, Denmark
  • Hui Bu, AISHELL foundation, China
  • Shidong Shang, Tencent Ethereal Audio Lab, China

The appraisal of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the assessment of the outcome of the treatment. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.

The aim of this special session is to contribute to improving the clinical assessment of voice quality via a translational approach that focuses on quantifying and explaining relationships between several levels of description. The objective is to gather new insights, advances in knowledge, and practical tools to assist researchers and clinicians in obtaining effective descriptions of voice quality and reliable measures of its acoustic correlates. Topics of interest include, but are not limited to: (i) statistical analysis and automatic classification, possibly relying on state-of-the-art machine learning approaches, of distinct types of voice quality via non-obtrusively recorded features; (ii) analysis and simulation of vocal fold vibrations by means of analytical, kinematic or mechanical modelling; (iii) interpretation and modeling of acoustic emission and/or high-speed video recordings such as videolaryngoscopy and videokymography; (iv) synthesis of disordered voices jointly with auditory experimentation involving synthetic and natural disordered voice stimuli.

URL

Organizers

  • Philipp Aichinger (philipp.aichinger@meduniwien.ac.at)
  • Abeer Alwan (alwan@ee.ucla.edu)
  • Carlo Drioli (carlo.drioli@uniud.it)
  • Jody Kreiman (jkreiman@ucla.edu)
  • Jean Schoentgen (jschoent@ulb.ac.be)

Air-traffic management (ATM) is a dedicated domain in which, in addition to the voice signal, other contextual information (e.g. air traffic surveillance data, meteorological data) plays an important role. Automatic speech recognition is the first challenge in the whole chain: further processing usually requires transforming the recognized word sequence into a conceptual form, which is the more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g. word error rate) are less important, and other performance criteria are employed, either objective (such as command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times) or subjective (such as a decrease in the users' workload).
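
As a hedged illustration of why application-level criteria differ from word error rate, the sketch below scores callsign detection accuracy by exact match of the extracted callsign per utterance; real ATM evaluations define their own extraction and scoring rules, and the callsigns shown are invented.

    # Hypothetical per-utterance callsigns extracted from reference transcripts
    # and from ASR hypotheses; exact match is a simplification for illustration.
    def callsign_detection_accuracy(ref_callsigns, hyp_callsigns):
        correct = sum(r == h for r, h in zip(ref_callsigns, hyp_callsigns))
        return correct / len(ref_callsigns)

    refs = ["DLH4TK", "AUA123", "SWR56A"]
    hyps = ["DLH4TK", "AUA124", "SWR56A"]   # one misrecognized digit
    print(callsign_detection_accuracy(refs, hyps))   # 0.666..., despite a very low WER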

The main objective of the special session is to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near-future R&D plans to enable the integration of speech technologies into the challenging, yet highly safety-oriented, air-traffic management domain.

URL

Organizers

  • Hartmut Helmke (DLR)
  • Pavel Kolcarek (Honeywell)
  • Petr Motlicek (Idiap Research Institute)

Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease of cognitive functioning. The main risk factor for dementia is age, and therefore its greatest incidence is amongst the elderly. Due to the severity of the situation worldwide, institutions and researchers are investing considerably in dementia prevention and early detection, focusing on disease progression. There is a need for cost-effective and scalable methods for the detection of dementia, from its most subtle forms, such as the preclinical stage of Subjective Memory Loss (SML), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer's Dementia (AD) itself.

The ADReSSo (ADReSS, speech only) Challenge targets a difficult automatic prediction problem of societal and medical relevance, namely the detection of Alzheimer's Dementia (AD). The challenge builds on the success of the ADReSS Challenge (Luz et al., 2020), the first such shared-task event focused on AD, which attracted 34 teams from across the world. While a number of researchers have proposed speech processing and natural language processing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied data sets, consequently hindering reproducibility and comparability of approaches. The ADReSSo Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a new shared standardized dataset.

The approaches that performed best on the original ADReSS dataset employed features extracted from manual transcripts, which were provided. The ADReSSo challenge provides a more challenging and improved spontaneous speech dataset and requires the creation of models straight from speech, without manual transcription. In keeping with the objectives of AD prediction evaluation, the ADReSSo challenge's dataset will be statistically balanced so as to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances in gender and age distribution. The task focuses on AD recognition using spontaneous speech, which marks a departure from neuropsychological and clinical evaluation approaches. Spontaneous speech analysis has the potential to enable novel applications for speech technology in longitudinal, unobtrusive monitoring of cognitive health, in line with the theme of this year's INTERSPEECH, "Speech Everywhere!".

Important Dates

  • January 18, 2021: ADReSSo Challenge announced.
  • March 20, 2021: Model submission deadline.
  • March 26, 2021: Paper submission deadline.
  • April 2, 2021: Paper update deadline.
  • June 2, 2021: Paper acceptance/rejection notification.
  • August 31 - September 3, 2021: INTERSPEECH 2021.

URL

Organizers

  • Saturnino Luz, Usher Institute, University of Edinburgh
  • Fasih Haider, University of Edinburgh
  • Sofia de la Fuente, University of Edinburgh
  • Davida Fromm, Carnegie Mellon University
  • Brian MacWhinney, Carnegie Mellon University

Are you searching for new challenges in speaker recognition? Join the SdSV Challenge 2021, which focuses on the analysis and exploration of new ideas for short-duration speaker verification.

Following the success of the SdSV Challenge 2020, the SdSV Challenge 2021 focuses on systematic benchmarking and analysis of the effect of varying degrees of phonetic variability on short-duration speaker recognition. The challenge consists of two tasks.

  • Task 1 is defined as speaker verification in text-dependent mode where the lexical content (in both English and Persian) of the test utterances is also taken into consideration.
  • Task 2 is defined as speaker verification in text-independent mode with same- and cross-language trials.

The main purpose of this challenge is to encourage participants to build single but competitive systems, to perform analyses, and to explore new ideas, such as multi-task learning, unsupervised/self-supervised learning, single-shot learning, and disentangled representation learning, for short-duration speaker verification. Participating teams will get access to a training set and a test set drawn from the DeepMine corpus, the largest public corpus designed for short-duration speaker verification, with voice recordings of 1,800 speakers. The challenge leaderboard is hosted on CodaLab.

URL

Organizers

  • Hossein Zeinali (Amirkabir University of Technology, Iran)
  • Kong Aik Lee (I2R, A*STAR, Singapore)
  • Jahangir Alam (CRIM, Canada)
  • Lukáš Burget (Brno University of Technology, Czech Republic)

The INTERSPEECH 2021 Acoustic Echo Cancellation (AEC) challenge is designed to stimulate research in the AEC domain by open-sourcing a large training dataset, a test set, and a subjective evaluation framework. We provide two new open-source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort; it consists of real recordings collected from over 5,000 diverse audio devices and environments. The second is a synthetic dataset with added room impulse responses and background noise derived from the INTERSPEECH 2020 DNS Challenge. An initial test set will be released for researchers to use during development, and a blind test set near the end of the challenge will be used to decide the final competition winners. We believe these datasets are large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products.
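
As a rough sketch of how such synthetic echo data is commonly constructed (the challenge's own scripts and data define the actual recipe), the far-end signal is convolved with a room impulse response to form the echo path and then mixed with near-end speech and noise at the microphone; all signals below are random placeholders.

    import numpy as np

    def simulate_mic(far_end, near_end, rir, noise, echo_gain=1.0):
        # Far-end (loudspeaker) signal passed through the room impulse response ...
        echo = np.convolve(far_end, rir)[: len(near_end)]
        # ... mixed with near-end speech and background noise at the microphone.
        return near_end + echo_gain * echo + noise[: len(near_end)]

    rng = np.random.default_rng(0)
    far_end = rng.standard_normal(16000)                              # 1 s at 16 kHz
    near_end = rng.standard_normal(16000)
    rir = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)  # decaying RIR
    noise = 0.01 * rng.standard_normal(16000)
    mic = simulate_mic(far_end, near_end, rir, noise)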

The dataset and rules are available here.

Please feel free to reach out to us if you have any questions or need clarification about any aspect of the challenge.

URL

Organizers

  • Ross Cutler, Microsoft Corp, USA
  • Ando Saabas, Microsoft Corp, Tallinn
  • Tanel Parnamaa, Microsoft Corp, Tallinn
  • Markus Loide, Microsoft Corp, Tallinn
  • Sten Sootla, Microsoft Corp, Tallinn
  • Hannes Gamper, Microsoft Corp, USA
  • Sebastian Braun, Microsoft Corp, USA
  • Karsten Sorensen, Microsoft Corp, USA
  • Robert Aichner, Microsoft Corp, USA
  • Sriram Srinivasan, Microsoft Corp, USA

Non-autoregressive modeling is a direction in speech processing research that has recently emerged. One advantage of non-autoregressive models is their decoding speed: decoding consists only of forward propagation through a neural network, so complicated left-to-right beam search is not necessary. In addition, they do not assume a left-to-right generation order and thus represent a paradigm shift in speech processing, where left-to-right, autoregressive models have long been considered the standard. This special session aims to facilitate knowledge sharing between researchers involved in non-autoregressive modeling across various speech processing fields, including, but not limited to, automatic speech recognition, speech translation, and text-to-speech, via panel discussions with leading researchers followed by a poster session.
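
To make the contrast concrete, the sketch below uses CTC-style greedy decoding as one familiar example of a non-autoregressive model: a single forward pass yields per-frame token distributions, and decoding is just a frame-wise argmax followed by collapsing repeats and removing blanks, with no left-to-right beam search. The vocabulary and logits are toy placeholders.

    import numpy as np

    BLANK = 0
    VOCAB = {1: "h", 2: "e", 3: "l", 4: "o"}

    def ctc_greedy_decode(frame_logits):
        ids = frame_logits.argmax(axis=-1)   # one shot over all frames, no autoregression
        out, prev = [], BLANK
        for i in ids:
            if i != prev and i != BLANK:     # collapse repeats, drop blanks
                out.append(VOCAB[int(i)])
            prev = i
        return "".join(out)

    # Toy logits for 8 frames over a 5-symbol vocabulary (blank + 4 letters).
    logits = np.zeros((8, 5))
    for t, sym in enumerate([1, 1, 2, 3, 0, 3, 4, 4]):   # "h h e l - l o o"
        logits[t, sym] = 1.0
    print(ctc_greedy_decode(logits))                      # -> "hello"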

URL

Organizers

  • Katrin Kirchhoff (Amazon)
  • Shinji Watanabe (Carnegie Mellon University)
  • Yuya Fujita (Yahoo Japan Corporation)

The COVID-19 pandemic has resulted in more than 93 million infections and more than 2 million deaths. Large-scale testing, social distancing, and face masks have been critical measures to help contain the spread of the infection. While the list of symptoms is regularly updated, it is established that, in symptomatic cases, COVID-19 seriously impairs normal functioning of the respiratory system. Does this alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. A COVID-19 diagnosis methodology based on acoustic signal analysis, if successful, could provide a remote, scalable, and economical means of testing individuals, supplementing existing nucleotide-based COVID-19 testing methods such as RT-PCR and RAT.

The DiCOVA Challenge is designed to find answers to this question by enabling participants to analyze an acoustic dataset gathered from COVID-19-positive and non-COVID-19 individuals. The findings will be presented in a special session at INTERSPEECH 2021. The timeliness and global societal importance of the challenge warrant focused effort from researchers across the globe, including the fields of medical and respiratory sciences, mathematical sciences, and machine learning. We look forward to your participation!

URL

Organizers

  • Neeraj Sharma (Indian Institute of Science, Bangalore, India)
  • Prasanta Kumar Ghosh (Indian Institute of Science, Bangalore, India)
  • Srikanth Raj Chetupalli (Indian Institute of Science, Bangalore, India)
  • Sriram Ganapathy (Indian Institute of Science, Bangalore, India)

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized DNS challenge special sessions at INTERSPEECH 2020 and ICASSP 2020, open-sourcing training and test datasets for the wideband scenario, as well as a subjective evaluation framework based on ITU-T standard P.808 that was used to evaluate challenge submissions. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this edition of the challenge, organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full-band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wideband and (ii) full-band scenarios. We are also making available DNSMOS, a reliable non-intrusive objective speech quality metric for wideband, for participants to use during their development phase. The final evaluation will be based on the ITU-T P.835 subjective evaluation framework, which gives the quality of speech and noise in addition to the overall quality of the speech.

We will have two tracks in this challenge:

  • Track 1: Real-Time Denoising track for the wideband scenario
    The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or an equivalent processor. For example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency, including the frame size T, the stride time Ts, and any look-ahead, must be less than or equal to 40 ms. For example, for a real-time system that receives 20 ms audio chunks, a frame length of 20 ms with a stride of 10 ms yields an algorithmic latency of 30 ms and satisfies the requirement, whereas a frame of 32 ms with a stride of 16 ms yields an algorithmic latency of 48 ms and does not, since the total exceeds 40 ms. If your frame size plus stride, T1 = T + Ts, is less than 40 ms, you may use up to (40 - T1) ms of future information (see the sketch after this list).
  • Track 2: Real-Time Denoising track for the full-band scenario
    Satisfy the Track 1 requirements, but at 48 kHz.
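
The following sketch simply restates the Track 1 latency budget above as arithmetic; the frame, stride and look-ahead values are whatever a submitted system actually uses.

    def within_latency_budget(frame_ms, stride_ms, lookahead_ms=0.0, budget_ms=40.0):
        # Total algorithmic latency = frame size T + stride Ts + any look-ahead.
        total = frame_ms + stride_ms + lookahead_ms
        print(f"T + Ts + look-ahead = {total} ms (budget {budget_ms} ms)")
        return total <= budget_ms

    within_latency_budget(20, 10)        # 30 ms -> satisfies the requirement
    within_latency_budget(32, 16)        # 48 ms -> exceeds the 40 ms budget
    within_latency_budget(20, 10, 10)    # 40 ms -> uses the remaining (40 - T1) ms of look-ahead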

More details about the datasets and the challenge are available in the paper and on the challenge GitHub page. Participants must adhere to the rules of the challenge.

URL

Organizers

  • Chandan K A Reddy (Microsoft Corp, USA)
  • Hari Dubey (Microsoft Corp, USA)
  • Kazuhito Koishida (Microsoft Corp, USA)
  • Arun Nair (Johns Hopkins University, USA)
  • Vishak Gopal (Microsoft Corp, USA)
  • Ross Cutler (Microsoft Corp, USA)
  • Robert Aichner (Microsoft Corp, USA)
  • Sebastian Braun (Microsoft Research, USA)
  • Hannes Gamper (Microsoft Research, USA)
  • Sriram Srinivasan (Microsoft Corp, USA)

This special session focuses on privacy-preserving machine learning (PPML) techniques in speech, language and audio processing, including centralized, distributed and on-device processing approaches. Novel contributions and overviews on the theory and applications of PPML in speech, language and audio are invited. We also encourage submissions related to ethical and regulatory aspects of PPML in this context.

Sending speech, language or audio data to a cloud server exposes private information. One approach, called anonymization, preprocesses the data so as to hide information that could identify the user by disentangling it from other useful attributes. PPML is a different approach, which addresses the problem by moving computation close to the clients. Thanks to recent advances in edge computing and neural processing units on mobile devices, PPML is now a feasible technology for most speech, language and audio applications, enabling companies to train on customer data without requiring that data to be shared. With PPML, data can remain on a customer's device, where it is used for model training. During training, models from several clients are typically shared with aggregator nodes that perform model averaging and sync the new model back to each client; the averaged model is then used for further training on each client. This process continues and enables each client to benefit from the training data of all other clients, which was not possible with conventional audio/speech ML. On top of that, high-quality synthetic data can also be used for training thanks to advances in speech, text, and audio synthesis.
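
A minimal sketch of the aggregation step described above follows; the equal-weight average and the dict-of-arrays model format are illustrative assumptions, not a prescribed protocol.

    import numpy as np

    def average_models(client_models):
        # Element-wise average of per-client parameter dicts; the result is the
        # new global model that the aggregator syncs back to every client.
        keys = client_models[0].keys()
        return {k: np.mean([m[k] for m in client_models], axis=0) for k in keys}

    # Three clients, each having trained the same two-parameter model locally.
    clients = [
        {"w": np.array([0.9, 1.1]), "b": np.array([0.1])},
        {"w": np.array([1.0, 1.0]), "b": np.array([0.0])},
        {"w": np.array([1.1, 0.9]), "b": np.array([-0.1])},
    ]
    global_model = average_models(clients)    # {'w': [1., 1.], 'b': [0.]}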

URL

Organizers

  • Harishchandra Dubey (Microsoft)
  • Amin Fazel (Amazon, Alexa)
  • Mirco Ravanelli (MILA, Université de Montréal)
  • Emmanuel Vincent (Inria)

Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 13th edition, we introduce four new tasks and Sub-Challenges:

  • COVID-19 Cough based recognition,
  • COVID-19 Speech based recognition,
  • Escalation level assessment in spoken dialogues,
  • Primates classification based on their vocalisations.

Sub-Challenges allow contributors to find their own features with their own machine learning algorithms. However, a standard feature set and tools, including recent deep learning approaches, are provided and may be used. Participants have five trials on the test set per Sub-Challenge. Participation has to be accompanied by a paper presenting the results, which undergoes the Interspeech peer review.

Contributions using the provided or equivalent data are sought, including (but not limited to):

  • Participation in a Sub-Challenge
  • Contributions around the Challenge topics

Results of the Challenge and Prizes will be presented at Interspeech 2021 in Brno, Czechia.

URL

Organizers

  • Björn Schuller (University of Augsburg, Germany / Imperial College, UK)
  • Anton Batliner (University of Augsburg, Germany)
  • Christian Bergler (FAU, Germany)
  • Cecilia Mascolo (University of Cambridge, UK)
  • Jing Han (University of Cambridge, UK)
  • Iulia Lefter (Delft University of Technology, The Netherlands)
  • Heysem Kaya (Utrecht University, The Netherlands)

The goal of the OpenASR (Open Automatic Speech Recognition) Challenge is to assess the state of the art of ASR technologies for low-resource languages.

The OpenASR Challenge is an open challenge created out of the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program, which encompasses additional tasks including CLIR (cross-language information retrieval), domain classification, and summarization. For every year of MATERIAL, NIST supports a simplified, smaller-scale evaluation open to all, focusing on a particular technology aspect of MATERIAL. The capabilities tested in the open challenges are expected to ultimately support the MATERIAL task of effective triage and analysis of large volumes of data in a variety of less-studied languages.

The special session aims to bring together researchers from all sectors working on ASR for low-resource languages to discuss the state of the art and future directions. It will allow for fruitful exchanges between OpenASR20 Challenge participants and other researchers working on low-resource ASR. We invite contributions from OpenASR20 participants, MATERIAL performers, as well as any other researchers with relevant work in the low-resource ASR problem space.

Topics:

  • OpenASR20 Challenge reports, including
    • Cross-lingual training techniques to compensate for the ten-hour training condition
    • Factors influencing ASR performance on low-resource languages by gender and dialect
    • Resource conditions used for the unconstrained development condition
  • IARPA MATERIAL performer reports on low-resource ASR, including
    • Low Resource ASR tailored to MATERIAL’s Cross Language Information Retrieval Evaluation
    • Genre mismatch condition between speech training data and evaluation
  • Other topics focused on low-resource ASR challenges and solutions

URL

Organizers

  • Peter Bell, University of Edinburgh
  • Jayadev Billa, University of Southern California Information Sciences Institute
  • William Hartmann, Raytheon BBN Technologies
  • Kay Peterson, National Institute of Standards and Technology