The Organizing Committee of INTERSPEECH 2021 is proudly announcing the following special sessions and challenges for INTERSPEECH 2021.
Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.
Papers have to be submitted following the same schedule and procedure as regular papers; the papers undergo the same review process by anonymous and independent reviewers.
The Zero Resource Speech Challenge 2021 is the fifth iteration of the Zero Resource Speech Challenge series. The overall goal of the series is to advance research in unsupervised training of speech and dialogue tools, taking inspiration from the fact that young infants learn to perceive and produce speech with no textual supervision, and with applications in speech technology for under-resourced languages.
The 2021 edition takes a novel angle, language modelling from speech. Self-supervised pre-training for ASR has recently shown impressive results that suggest that some sequential models trained from raw speech may be doing more than just acoustic modelling, and may be learning higher-order facts about the trained language. The Zero Resource Speech Challenge 2021 proposes common evaluations to assess whether it is possible to learn lexical, syntactic, or even semantic knowledge from raw speech alone. The challenge invites any type of model, trained on raw speech, capable of assigning scores and distances to novel speech examples. The challenge invites both high-compute-budget models, which require larger-scale compute resources to train or use, and low-compute-budget models.
While speech recognition systems generally work well on the average population with typical speech characteristics, performance on subgroups with unique speaking patterns is usually significantly worse.
Speech that contains non-standard speech patterns (acoustic-phonetic phonotactic, lexical and prosodic patterns) is particularly challenging, both because of the small population with these speech patterns, and because of the generally higher variance of speech patterns. In the case of dysarthric speech, which is often correlated with mobility or other accessibility limitations, accuracy of existing speech recognition systems is often particularly poor, rendering the technology unusable for many speakers who could benefit the most.
In this oral session, we seek to promote interdisciplinary collaborations between researchers and practitioners addressing this problem, to build community and stimulate research. We invite papers analyzing and improving systems dealing with atypical speech.
Topics of interest include, but are not limited to:
With this shared task, which follows the one organized at Interspeech 2020 (http://www.interspeech2020.org/Special_Sessions_and_Challenge/), we intend to advance the research addressing non native children's ASR technology. To reach this goal we will distribute a new set of data, in addition to that used for the 2020 challenge, that will contain additional training data for the English language (acquired from speakers of different native languages) as well as data for developing a German ASR for non native children. The spoken responses in the data set were produced in the context of both English and German speaking proficiency examination.
The following data will be released for this shared task:
For both languages, English and German, a baseline ASR system together with evaluation scripts will be provided.
Oriental languages are rich and complex. With the great diversity in terms of both acoustics and linguistics, oriental language is a treasure for multilingual research. The Oriental Language Recognition (OLR) challenge has been conducted for 5 years with big success, and demonstrated many novel and interesting techniques devised by the participants.
The main goal of this special session is to summarize the technical advance of OLR 2020, but it will welcome all submissions related to language recognition and multilingual soeecg processing.
The ConferencingSpeech 2021 challenge is proposed to stimulate research in multi-channel speech enhancement and aims for processing the far-field speech from microphone arrays in the video conferencing rooms. Targeting the real video conferencing room application, the ConferencingSpeech 2021 challenge database is recorded from real speakers. The number of speakers and distances between speakers and microphone arrays vary according to the sizes of meeting rooms. Multiple microphone arrays from three different types of geometric topology are allocated in each recording environment.
The challenge will have two tasks:
To focus on the development of algorithms, the challenge requires the close training condition. Only provided lists of open source clean speech datasets and noise dataset could be used for training. In addition, the challenge will provide the development set, scripts for simulating the training data, baseline systems for participants to develop their systems. The final ranking of the challenge will be decided by the subjective evaluation. The subjective evaluation will be performed using Absolute Category Ratings (ACR) to estimate a Mean Opinion Score (MOS) through Tencent Online Media Subjective Evaluation platform.
More details about the data and challenge can be found from the evaluation plan of ConferencingSpeech 2021 challenge.
Besides the submitted paper related to ConferencingSpeech 2021 challenge, Paper on multi-channel speech enhancement are all encouraged to submit to this special session.
The appraisal of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the assessment of the outcome of the treatment. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.
Aim of the special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach, which focuses on quantifying and explaining relationships between several levels of description. The objective is to gather new insights, advancement of knowledge and practical tools to assist researchers and clinicians in obtaining effective descriptions of voice quality and reliable measures of its acoustic correlates. Topics of interest include, but are not limited to, (i) the statistical analysis and automatic classification, possibly relying on state-of-the-art machine learning approaches, of distinct types of voice quality via non-obtrusively recorded features, (ii) the analysis and simulation of vocal fold vibrations by means of analytical, kinematic or mechanical modelling, (iii) the interpretation and modeling of both acoustic emission and/or high– speed video recordings such as videolaryngoscopy and videokymography, (iv) the synthesis of disordered voices jointly with auditory experimentation involving synthetic and natural disordered voice stimuli.
Air-traffic management is a dedicated domain where in addition to using the voice signal, other contextual information (i.e. air traffic surveillance data, meteorological data, etc.) plays an important role. Automatic speech recognition is the first challenge in the whole chain. Further processing usually requires transforming the recognized word sequence into the conceptual form, a more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g. word error rate) are less important, and other performance criteria (i.e. objective such as command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times, or subjective such as decrease of a workload of the users) are employed.
The main objective of the special session is to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near future R&D plans to enable an integration of speech technologies to the challenging, but highly safety oriented air-traffic management domain.
Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease of cognitive functioning. The main risk factor for dementia is age, and therefore its greatest incidence is amongst the elderly. Due to the severity of the situation worldwide, institutions and researchers are investing considerably on dementia prevention and early detection, focusing on disease progression. There is a need for cost-effective and scalable methods for detection of dementia from its most subtle forms, such as the preclinical stage of Subjective Memory Loss (SML), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer's Dementia (AD) itself.
The ADReSSo (ADReSS, speech only) targets a difficult automatic prediction problem of societal and medical relevance, namely, the detection of Alzheimer's Dementia (AD). The challenge builds on the success of the ADReSS Challenge (Luz et Al, 2020), the first such shared-task event focused on AD, which attracted 34 teams from across the world. While a number of researchers have proposed speech processing and natural language procesing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied data sets, consequently hindering reproducibility and comparability of approaches. The ADReSSo Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a new shared standardized dataset. The approaches that performed best on the original ADReSS dataset employed features extracted from manual transcripts, which were provided. The ADReSSo challenge provides a more challenging and improved spontaneous speech dataset, and requires the creation of models straight from speech, without manual transcription. In keeping with the objectives of AD prediction evaluation, the ADReSSo challenge's dataset will be statistically balanced so as to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances of gender and age distribution. This task focuses AD recognition using spontaneous speech, which marks a departure from neuropsychological and clinical evaluation approaches. Spontaneous speech analysis has the potential to enable novel applications for speech technology in longitudinal, unobtrusive monitoring of cognitive health, in line with the theme of this year's INTERSPEECH, "Speech Everywhere!".
Are you searching for new challenges in speaker recognition? Join SdSV Challenge 2021 which focuses on the analysis and exploration of new ideas for short duration speaker verification.
Following the success of the SdSV Challenge 2020, the SdSV Challenge 2021 focuses on systematic benchmark and analysis on varying degrees of phonetic variability on short-duration speaker recognition. The challenge consists of two tasks.
The main purpose of this challenge is to encourage participants on building single but competitive systems, to perform analysis as well as to explore new ideas, such as multi-task learning, unsupervised/self-supervised learning, single-shot learning, disentangled representation learning and so on, for short-duration speaker verification. The participating teams will get access to a train set and the test set drawn from the DeepMine corpus which is the largest public corpus designed for short-duration speaker verification with voice recordings of 1800 speakers. The challenge leaderboard is hosted at CodaLab.
Multilinguality and code-switching are widely prevalent linguistic phenomena in multilingual societies such as India. There is growing interest in building ASR systems that cater to multilingual settings, without the advantage of having very large quantities of labeled speech in multiple languages. This special session focuses on building multilingual and code-switched ASR systems for low-resource Indian languages by introducing a challenge with two tasks that spans seven different Indian languages and totals 590 hours of labeled speech.
Participants will be required to use only the released data (and no additional external data) to build ASR systems for the two above-mentioned tasks.
The INTERSPEECH 2021 Acoustic Echo Cancellation (AEC) challenge is designed to stimulate research in the AEC domain by open sourcing a large training dataset, test set, and subjective evaluation framework. We provide two new open source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort. This dataset consists of real recordings that have been collected from over 5,000 diverse audio devices and environments. The second is a synthetic dataset with added room impulse responses and background noise derived from the INTERSPEECH 2020 DNS Challenge. An initial test set will be released for the researchers to use during development and a blind test near the end which will be used to decide the final competition winners. We believe these datasets are large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products.
The dataset and rules are available here.
Please feel free to reach out to us, if you have any questions or need clarification about any aspect of the challenge.
Non-autoregressive modeling is a new direction in speech processing research that has recently emerged. One advantage of non-autoregressive models is their decoding speed: decoding is only composed of forward propagation through a neural network, hence complicated left-to-right beam search is not necessary. In addition, they do not assume a left-to-right generation order and thus represent a paradigm shift in speech processing, where left-to-right, autoregressive models have been believed to be legitimate. This special session aims to facilitate knowledge sharing between researchers involved in non-autoregressive modeling across various speech processing fields, including, but not limited to, automatic speech recognition, speech translation, and text to speech, via panel discussions with leading researchers followed by a poster session.
The COVID-19 pandemic has resulted in more than 93 million infections, and more than 2 million casualties. Large scale testing, social distancing, and face masks have been critical measures to help contain the spread of the infection. While the list of symptoms is regularly updated, it is established that in symptomatic cases COVID-19 seriously impairs normal functioning of the respiratory system. Does this alter the acoustic characteristics of breathe, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. A COVID-19 diagnosis methodology based on acoustic signal analysis, if successful, can provide a remote, scalable, and economical means for testing of individuals. This can supplement the existing nucleotides based COVID-19 testing methods, such as RT-PCR and RAT.
The DiCOVA Challenge is designed to find answers to the question by enabling participants to analyze an acoustic dataset gathered from COVID-19 positive and non-COVID-19 individuals. The findings will be presented in a special session at Interspeech 2021. The timeliness, and the global societal importance of the challenge warrants focussed effort from researchers across the globe, including from the fields of medical and respiratory sciences, mathematical sciences, and machine learning engineers. We look forward to your participation!
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH 2020 and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which was used to evaluate challenge submissions. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wide band, and (ii) full band scenarios. We are also making available a reliable non-intrusive objective speech quality metric for wide band called DNSMOS for the participants to use during their development phase. The final evaluation will be based on ITU-T P.835 subjective evaluation framework that gives the quality of speech and noise in addition to the overall quality of the speech.
We will have two tracks in this challenge:
More details about the datasets and the challenge are available in the paper and the challenge github page. Participants must adhere to the rules of the challenge.
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this naturalistic data resource. As an initial step to motivate a stream-lined and collaborative effort from the speech and language community, UTDallas-CRSS is hosting a series of progressively complex tasks to promote advanced research on naturalistic “Big Data” corpora. This began with ISCA INTERSPEECH-2019: "The FEARLESS STEPS Challenge: (FS-#1)". This first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the “First Step” towards extracting high-level information from such massive unlabeled corpora. This was followed with ISCA INTERSPEECH-2020 which held the Special Session for FEARLESS STEPS Challenge (FS-2), which focused on developing supervised learning strategies for the 100 hour Challenge Corpus.
As a natural progression following the successful Inaugural Challenge FS#1 and FEARLESS STEPS Challenge Phase-#2, the FEARLESS STEPS Challenge Phase-#3 focuses on development of single-channel supervised learning strategies with an aim to test system generalizability to varying channel and mission data. FS#3 also provides an additional challenge task of Conversational Analysis, motivating researchers to work on natural language understanding and group dynamics analysis. FS#3 provides 80 hours of ground-truth data through Training and Development sets, with 20 hours of blind-set Apollo-11 evaluation data, 5 hours of unseen channel evaluation data and an additional 5 hours of blind-set Apollo-13 mission evaluation data. Based on feedback from the Fearless Steps participants, additional Tracks for streamlined speech recognition, speaker diarization, and conversational analysis have been included in the FS#3. The results for this Challenge will be presented at the ISCA INTERSPEECH-2021 Special Session. We encourage participants to explore any and all research tasks of interest with the Fearless Steps Corpus – with suggested Task Domains listed below. Research participants can however, also utilize the FS#3 corpus to explore additional problems dealing with naturalistic data, which we welcome as part of the special session.
This special session focuses on privacy-preserving machine learning (PPML) techniques in speech, language and audio processing, including centralized, distributed and on-device processing approaches. Novel contributions and overviews on the theory and applications of PPML in speech, language and audio are invited. We encourage submissions related to ethical and regulatory aspects of PPML in this context. Sending speech, language or audio data to a cloud server exposes private information. One approach called anonymization is to preprocess the data so as to hide information which could identify the user by disentangling it from other useful attributes. PPML is a different approach, which solves this problem by moving computation near the clients. Due to recent advances in Edge Computing and Neural Processing Units on mobile devices, PPML is now a feasible technology for most speech, language and audio applications that enables companies to train on customer data without needing them to share the data. With PPML, data can sit on a customer's device where it is used for model training. During the training process, models from several clients are often shared with aggregator nodes that perform model averaging and sync the new models to each client. Next, the new averaged model is used for training on each client. This process continues and enables each client to benefit from training data on all other clients. Such processes were not possible in conventional audio/speech ML. On top of that, high-quality synthetic data can also be used for training thanks to advances in speech, text, and audio synthesis.
Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 13th edition, we introduce four new tasks and Sub-Challenges:
Sub-Challenges allow contributors to find their own features with their own machine learning algorithm. However, a standard feature set and tools including recent deep learning approaches are provided that may be used. Participants have five trials on the test set per Sub-Challenge. Participation has to be accompanied by a paper presenting the results that undergoes the Interspeech peer-review.
Contributions using the provided or equivalent data are sought for (but not limited to):
Results of the Challenge and Prizes will be presented at Interspeech 2021 in Brno, Czechia.
The goal of the OpenASR (Open Automatic Speech Recognition) Challenge is to assess the state of the art of ASR technologies for low-resource languages.
The OpenASR Challenge is an open challenge created out of the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program that encompasses more tasks, including CLIR (cross-language information retrieval), domain classification, and summarization. For every year of MATERIAL, NIST supports a simplified, smaller scale evaluation open to all, focusing on a particular technology aspect of MATERIAL. The capabilities tested in the open challenges are expected to ultimately support the MATERIAL task of effective triage and analysis of large volumes of data, in a variety of less-studied languages.
The special session aims to bring together researchers from all sectors working on ASR for low-resource languages to discuss the state of the art and future directions. It will allow for fruitful exchanges between OpenASR20 Challenge participants and other researchers working on low-resource ASR. We invite contributions from OpenASR20 participants, MATERIAL performers, as well as any other researchers with relevant work in the low-resource ASR problem space.
In the last decade, machine learning (ML) and deep learning (DL) has achieved remarkable success in speech-related tasks, e.g., speaker verification (SV), automatic speech recognition(ASR) and keyword spotting (KWS). However, in practice, it is very difficult to get proper performance without expertise of machine learning and speech processing. Automated Machine Learning (AutoML) is proposed to explore automatic pipeline to train effective models given a specific task requirement without any human intervention. Moreover, some methods belonging to AutoML, such as Automated Deep Learning (AutoDL) and meta-learning have been used in KWS and SV tasks respectively. A series of AutoML competitions, e.g., automated natural language processing (AutoNLP) and Automated computer vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners.
Keyword spotting, usually as the entrance of smart device terminals, such as mobile phone, smart speakers, or other intelligent terminals, has received a lot of attention in both academia and industry. Meanwhile, out of consideration of fun and security, the personalized wake-up mode has more application scenarios and requirements. Conventionally, the solution pipeline is combined of KWS and text dependent speaker verification (TDSV) system, and in which case, two systems are optimized separately. On the other hand, there are always few data belonging to the target speaker, so both of KWS and speaker verification(SV) in that case can be considered as low resource tasks.
In this challenge, we propose the automated machine learning for Personalized Keyword Spotting (Auto-KWS) which aims at proposing automated solutions for personalized keyword spotting tasks. Basically, there are several specific questions that can be further participants explored, including but not limited to:
Additionally, participants should also consider:
We have already organized two successful automated speech classification challenge AutoSpeech1 in ACML2019 and AutoSpeech2020 in INTERSPEECH2020, which are the first two challenges that combine AutoML and speech tasks. This time, our challenge Auto-KWS will focus on personalized keyword spotting tasks for the first time, and the released database will also serve as a benchmark for researches in this filed and boost the idea exchanging and discussion in this area.
Learning prosodic representations from the speech signal have become common practice in emotion classification and often yield a better performance over the respective baselines. Likewise, novel TTS systems learn prosodic embeddings that allow them to produce prosodically varied, expressive speech. However, it often remains opaque how to interpret these learned prosodic representations, let alone how to meaningfully modify them or to use them to perform specific tasks.
This special session focuses on the interpretation, modification and application of learned prosodic representations in emotional speech classification and synthesis. It aims to bring together the different communities of explainable artificial intelligence (XAI), synthesis and classification to tackle a common problem.
Topics of interest include, but are not limited to:
We strongly encourage interdisciplinary submissions that combine methods from synthesis, analysis and explainable artificial intelligence.