Special Sessions & Challenges

The Organizing Committee of INTERSPEECH 2021 is proud to announce the following special sessions and challenges.

Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.

Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous and independent reviewers.

The Zero Resource Speech Challenge 2021 is the fifth iteration of the Zero Resource Speech Challenge series. The overall goal of the series is to advance research in unsupervised training of speech and dialogue tools, taking inspiration from the fact that young infants learn to perceive and produce speech with no textual supervision, and with applications in speech technology for under-resourced languages.

The 2021 edition takes a novel angle: language modelling from speech. Self-supervised pre-training for ASR has recently shown impressive results suggesting that some sequential models trained from raw speech may be doing more than acoustic modelling, and may be learning higher-order facts about the language they are trained on. The Zero Resource Speech Challenge 2021 proposes common evaluations to assess whether it is possible to learn lexical, syntactic, or even semantic knowledge from raw speech alone. The challenge invites any type of model, trained on raw speech, that is capable of assigning scores and distances to novel speech examples. Both high-compute-budget models, which require larger-scale compute resources to train or use, and low-compute-budget models are welcome.
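As an illustration of the kind of model interface the challenge expects (a scoring function over raw speech), the sketch below shows a hypothetical spot-the-word style lexical check: a model assigning a pseudo log-probability to waveforms should score a real word above a matched nonword, and accuracy is the fraction of pairs where it does. The function names and pair format are assumptions for illustration, not the official evaluation API.

```python
# Minimal sketch of a spot-the-word style lexical check, assuming a model
# that assigns a pseudo log-probability score to a raw-speech waveform.
# `score_utterance` and the pair format are illustrative assumptions, not
# the official ZeroSpeech 2021 evaluation interface.
from typing import Callable, Iterable, Tuple
import numpy as np

def lexical_accuracy(
    score_utterance: Callable[[np.ndarray], float],
    pairs: Iterable[Tuple[np.ndarray, np.ndarray]],
) -> float:
    """Fraction of (word, nonword) waveform pairs where the real word scores higher."""
    correct, total = 0, 0
    for word_wav, nonword_wav in pairs:
        correct += score_utterance(word_wav) > score_utterance(nonword_wav)
        total += 1
    return correct / max(total, 1)
```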

URL

Organizers

  • Emmanuel Dupoux (EHESS / Cognitive Machine Learning / Facebook)
  • Ewan Dunbar (University of Toronto)
  • Mathieu Bernard (INRIA)
  • Nicolas Hamilakis (École Normale Supérieure)
  • Maureen de Seyssel (INRIA)
  • Tu Anh Nguyen (INRIA/Facebook)

While speech recognition systems generally work well on the average population with typical speech characteristics, performance on subgroups with unique speaking patterns is usually significantly worse.

Speech that contains non-standard speech patterns (acoustic-phonetic, phonotactic, lexical and prosodic patterns) is particularly challenging, both because of the small population with these speech patterns and because of the generally higher variance of the patterns themselves. In the case of dysarthric speech, which is often correlated with mobility or other accessibility limitations, the accuracy of existing speech recognition systems is often particularly poor, rendering the technology unusable for many speakers who could benefit the most.

In this oral session, we seek to promote interdisciplinary collaborations between researchers and practitioners addressing this problem, to build community and stimulate research. We invite papers analyzing and improving systems dealing with atypical speech.

Topics of interest include, but are not limited to:

  • Automatic Speech Recognition (ASR) of atypical speech
  • Speech-to-Speech conversion/normalization (e.g. from atypical to typical)
  • Voice enhancement and convergence to improve intelligibility of spoken content of atypical speech
  • Automated classification of atypical speech conditions
  • Robustness of speech processing systems for atypical speech in common application scenarios
  • Data augmentation techniques to deal with data sparsity
  • Aspects of creating, managing the quality of, and sharing data sets of atypical speech
  • Multi-modal integration (e.g. video and voice) and its application

URL

Organizers

  • Jordan R. Green, MGH Institute of Health Professions, Harvard University
  • Michael P. Brenner, Harvard University, Google
  • Fadi Biadsy, Google
  • Bob MacDonald, Google
  • Katrin Tomanek, Google

With this shared task, which follows the one organized at Interspeech 2020 (http://www.interspeech2020.org/Special_Sessions_and_Challenge/), we intend to advance research on ASR technology for non-native children's speech. To reach this goal we will distribute a new set of data, in addition to that used for the 2020 challenge, containing additional training data for English (acquired from speakers of different native languages) as well as data for developing a German ASR system for non-native children. The spoken responses in the data set were produced in the context of both English and German speaking proficiency examinations.

The following data will be released for this shared task:

  • ~100 hours of English transcribed speech, to be used as training set
  • ~6 hours of English transcribed speech (3 hours to be used as development set and 3 hours as test set)
  • ~5 hours of German transcribed speech, to be used as training set
  • ~60 hours of German untranscribed speech, to be used as training set
  • ~2.5 hours of German transcribed speech (1 hour to be used as development set and 1.5 hours as test set)

For both languages, English and German, a baseline ASR system together with evaluation scripts will be provided.

Important Dates

  • Release of training data, development data, and baseline system: February 10, 2021
  • Test data released: March 10, 2021
  • Submission of results on test set: March 17, 2021
  • Test results announced: March 20, 2021

URL

Organizers

  • Daniele Falavigna, Fondazione Bruno Kessler
  • Abhinav Misra, Educational Testing Service
  • Chee Wee Leong, Educational Testing Service
  • Kate Knill, Cambridge University
  • Linlin Wang, Cambridge University

Oriental languages are rich and complex. With their great diversity in terms of both acoustics and linguistics, oriental languages are a treasure for multilingual research. The Oriental Language Recognition (OLR) challenge has been conducted for five years with great success and has demonstrated many novel and interesting techniques devised by the participants.

The main goal of this special session is to summarize the technical advances of OLR 2020, but it welcomes all submissions related to language recognition and multilingual speech processing.

URL

Organizers

  • Dong Wang (Tsinghua University)
  • Qingyang Hong (Xiamen University)
  • Xiaolei Zhang (Northwestern Polytechnical University)
  • Ming Li (Duke Kunshan University)
  • Yufeng Hao (Speechocean)

The ConferencingSpeech 2021 challenge is proposed to stimulate research in multi-channel speech enhancement and aims at processing far-field speech captured by microphone arrays in video conferencing rooms. Targeting real video conferencing applications, the ConferencingSpeech 2021 challenge database is recorded from real speakers. The number of speakers and the distances between speakers and microphone arrays vary according to the sizes of the meeting rooms. Multiple microphone arrays of three different geometric topologies are placed in each recording environment.

The challenge will have two tasks:

  • Task 1 is multi-channel speech enhancement with a single microphone array, focusing on practical applications with a real-time requirement.
  • Task 2 is multi-channel speech enhancement with multiple distributed microphone arrays; it is a non-real-time track without any constraints, so participants can explore any algorithms to obtain high speech quality.

To keep the focus on the development of algorithms, the challenge requires a closed training condition: only the provided lists of open-source clean speech datasets and noise datasets may be used for training. In addition, the challenge will provide a development set, scripts for simulating the training data, and baseline systems for participants to develop their systems. The final ranking of the challenge will be decided by subjective evaluation, performed using Absolute Category Ratings (ACR) to estimate a Mean Opinion Score (MOS) through the Tencent Online Media Subjective Evaluation platform.
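For reference, the ACR-to-MOS computation itself is straightforward: each listener rates a processed clip on a 1-to-5 category scale, and the MOS is the mean rating per condition, often reported with a confidence interval. The sketch below assumes the standard 1-5 ACR scale; the actual listening test is run on the Tencent platform mentioned above, not with this script.

```python
# Minimal sketch of the ACR -> MOS computation, assuming ratings on the
# standard 1-5 category scale. The real evaluation is run on the Tencent
# Online Media Subjective Evaluation platform, not with this script.
import math
from statistics import mean, stdev
from typing import List, Tuple

def mos(ratings: List[int]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence-interval half-width) for one condition."""
    m = mean(ratings)
    ci = 1.96 * stdev(ratings) / math.sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, ci

print(mos([4, 5, 3, 4, 4, 2, 5, 4]))  # MOS of 3.875 for this toy set of ratings
```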

More details about the data and challenge can be found from the evaluation plan of ConferencingSpeech 2021 challenge.

Besides papers related to the ConferencingSpeech 2021 challenge, papers on multi-channel speech enhancement in general are also encouraged for submission to this special session.

URL

Organizers

  • Wei Rao, Tencent Ethereal Audio Lab, China
  • Lei Xie, Northwestern Polytechnical University, China
  • Yannan Wang, Tencent Ethereal Audio Lab, China
  • Tao Yu, Tencent Ethereal Audio Lab, USA
  • Shinji Watanabe, Associate Professor, Carnegie Mellon University / Johns Hopkins University, USA
  • Zheng-Hua Tan, Aalborg University, Denmark
  • Hui Bu, AISHELL foundation, China
  • Shidong Shang, Tencent Ethereal Audio Lab, China

The appraisal of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the assessment of the outcome of the treatment. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.

The aim of the special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach that focuses on quantifying and explaining relationships between several levels of description. The objective is to gather new insights, advances in knowledge, and practical tools to assist researchers and clinicians in obtaining effective descriptions of voice quality and reliable measures of its acoustic correlates. Topics of interest include, but are not limited to, (i) statistical analysis and automatic classification, possibly relying on state-of-the-art machine learning approaches, of distinct types of voice quality via non-obtrusively recorded features, (ii) analysis and simulation of vocal fold vibrations by means of analytical, kinematic or mechanical modelling, (iii) interpretation and modelling of acoustic emission and/or high-speed video recordings such as videolaryngoscopy and videokymography, and (iv) synthesis of disordered voices together with auditory experimentation involving synthetic and natural disordered voice stimuli.

URL

Organizers

  • Philipp Aichinger (philipp.aichinger@meduniwien.ac.at)
  • Abeer Alwan (alwan@ee.ucla.edu)
  • Carlo Drioli (carlo.drioli@uniud.it)
  • Jody Kreiman (jkreiman@ucla.edu)
  • Jean Schoentgen (jschoent@ulb.ac.be)

Air-traffic management (ATM) is a specialized domain in which, in addition to the voice signal, other contextual information (e.g. air traffic surveillance data, meteorological data) plays an important role. Automatic speech recognition is the first challenge in the whole chain. Further processing usually requires transforming the recognized word sequence into a conceptual form, which is the more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g. word error rate) are less important, and other performance criteria are employed: objective criteria such as command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times, or subjective criteria such as a decrease in user workload.
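The session does not prescribe particular formulas for these criteria; purely as an illustration of what a concept-level criterion looks like (in contrast to word error rate), the sketch below computes a simple callsign detection accuracy. The function and its exact definition are assumptions for illustration.

```python
# Illustration only: a concept-level criterion compares extracted ATM concepts
# (here, callsigns) rather than the raw word sequence. The exact metric
# definitions used in ATM evaluations may differ from this sketch.
from typing import List

def callsign_detection_accuracy(hyp_callsigns: List[str], ref_callsigns: List[str]) -> float:
    """Fraction of reference callsigns that also appear among the hypotheses."""
    hyp = {c.upper() for c in hyp_callsigns}
    ref = [c.upper() for c in ref_callsigns]
    return sum(c in hyp for c in ref) / max(len(ref), 1)

print(callsign_detection_accuracy(["DLH42X", "AFR1234"], ["DLH42X", "AFR1234", "BAW77"]))  # ≈ 0.67
```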

The main objective of the special session is to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near-future R&D plans to enable the integration of speech technologies into the challenging, but highly safety-oriented, air-traffic management domain.

URL

Organizers

  • Hartmut Helmke (DLR)
  • Pavel Kolcarek (Honeywell)
  • Petr Motlicek (Idiap Research Institute)

Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease in cognitive functioning. The main risk factor for dementia is age, and therefore its greatest incidence is amongst the elderly. Due to the severity of the situation worldwide, institutions and researchers are investing considerably in dementia prevention and early detection, focusing on disease progression. There is a need for cost-effective and scalable methods for the detection of dementia, from its most subtle forms, such as the preclinical stage of Subjective Memory Loss (SML), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer's Dementia (AD) itself.

The ADReSSo Challenge (ADReSS, speech only) targets a difficult automatic prediction problem of societal and medical relevance, namely the detection of Alzheimer's Dementia (AD). The challenge builds on the success of the ADReSS Challenge (Luz et al., 2020), the first such shared-task event focused on AD, which attracted 34 teams from across the world. While a number of researchers have proposed speech processing and natural language processing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied data sets, consequently hindering reproducibility and comparability of approaches. The ADReSSo Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a new shared standardized dataset.

The approaches that performed best on the original ADReSS dataset employed features extracted from manual transcripts, which were provided. The ADReSSo Challenge provides a more challenging and improved spontaneous speech dataset and requires the creation of models straight from speech, without manual transcription. In keeping with the objectives of AD prediction evaluation, the ADReSSo Challenge's dataset will be statistically balanced so as to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances of gender and age distribution. This task focuses on AD recognition using spontaneous speech, which marks a departure from neuropsychological and clinical evaluation approaches. Spontaneous speech analysis has the potential to enable novel applications for speech technology in longitudinal, unobtrusive monitoring of cognitive health, in line with the theme of this year's INTERSPEECH, "Speech Everywhere!".

Important Dates

  • January 18, 2021: ADReSSo Challenge announced.
  • March 20, 2021: Model submission deadline.
  • March 26, 2021: Paper submission deadline.
  • April 2, 2021: Paper update deadline.
  • June 2, 2021: Paper acceptance/rejection notification.
  • August 31 - September 3, 2021: INTERSPEECH 2021.

URL

Organizers

  • Saturnino Luz, Usher Institute, University of Edinburgh
  • Fasih Haider, University of Edinburgh
  • Sofia de la Fuente, University of Edinburgh
  • Davida Fromm, Carnegie Mellon University
  • Brian MacWhinney, Carnegie Mellon University

Are you searching for new challenges in speaker recognition? Join the SdSV Challenge 2021, which focuses on the analysis and exploration of new ideas for short-duration speaker verification.

Following the success of the SdSV Challenge 2020, the SdSV Challenge 2021 focuses on systematic benchmarking and analysis of the effect of varying degrees of phonetic variability on short-duration speaker recognition. The challenge consists of two tasks.

  • Task 1 is defined as speaker verification in text-dependent mode where the lexical content (in both English and Persian) of the test utterances is also taken into consideration.
  • Task 2 is defined as speaker verification in text-independent mode with same- and cross-language trials.

The main purpose of this challenge is to encourage participants to build single but competitive systems, to perform analysis, and to explore new ideas, such as multi-task learning, unsupervised/self-supervised learning, single-shot learning, disentangled representation learning and so on, for short-duration speaker verification. The participating teams will get access to a train set and a test set drawn from the DeepMine corpus, the largest public corpus designed for short-duration speaker verification, with voice recordings of 1800 speakers. The challenge leaderboard is hosted on CodaLab.

URL

Organizers

  • Hossein Zeinali (Amirkabir University of Technology, Iran)
  • Kong Aik Lee (I2R, A*STAR, Singapore)
  • Jahangir Alam (CRIM, Canada)
  • Lukáš Burget (Brno University of Technology, Czech Republic)

Multilinguality and code-switching are widely prevalent linguistic phenomena in multilingual societies such as India. There is growing interest in building ASR systems that cater to multilingual settings without the advantage of having very large quantities of labeled speech in multiple languages. This special session focuses on building multilingual and code-switched ASR systems for low-resource Indian languages by introducing a challenge with two tasks that span seven different Indian languages and a total of 590 hours of labeled speech.

  • Task 1: This task involves building a multilingual ASR system in six Indian languages, namely Hindi, Marathi, Odia, Telugu, Tamil, and Gujarati. The blind test set for the final leaderboard will comprise recordings from a subset of (or all) these six languages.
  • Task 2: This task involves building a code-switching ASR system for Hindi-English and Bengali-English speech. The blind test set will comprise recordings from either of these code-switched language pairs. (Any of the speech data from Task 1 can be used to aid the systems built for Task 2.)

Participants will be required to use only the released data (and no additional external data) to build ASR systems for the two above-mentioned tasks.

URL

Organizers

  • Kalika Bali (Microsoft Research India)
  • Samarth Bharadwaj (IBM Research India)
  • Prasanta Kumar Ghosh (IISc, Bangalore)
  • Preethi Jyothi (IIT Bombay)
  • Shreya Khare (IBM Research India)
  • Ashish Mittal (IBM Research India)
  • Jai Nanavati (Navana Tech India Private Limited)
  • Raoul Nanavati (Navana Tech India Private Limited)
  • Srinivasa Raghavan (Navana Tech India Private Limited)
  • Sunita Sarawagi (IIT Bombay)
  • Vivek Seshadri (Microsoft Research India)

The INTERSPEECH 2021 Acoustic Echo Cancellation (AEC) challenge is designed to stimulate research in the AEC domain by open-sourcing a large training dataset, test set, and subjective evaluation framework. We provide two new open-source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort; it consists of real recordings collected from over 5,000 diverse audio devices and environments. The second is a synthetic dataset with added room impulse responses and background noise derived from the INTERSPEECH 2020 DNS Challenge. An initial test set will be released for researchers to use during development, and a blind test set will be released near the end, which will be used to decide the final competition winners. We believe these datasets are large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products.

The dataset and rules are available here.

Please feel free to reach out to us if you have any questions or need clarification about any aspect of the challenge.

URL

Organizers

  • Ross Cutler, Microsoft Corp, USA
  • Ando Saabas, Microsoft Corp, Tallinn
  • Tanel Parnamaa, Microsoft Corp, Tallinn
  • Markus Loide, Microsoft Corp, Tallinn
  • Sten Sootla, Microsoft Corp, Tallinn
  • Hannes Gamper, Microsoft Corp, USA
  • Sebastian Braun, Microsoft Corp, USA
  • Karsten Sorensen, Microsoft Corp, USA
  • Robert Aichner, Microsoft Corp, USA
  • Sriram Srinivasan, Microsoft Corp, USA

Non-autoregressive modeling is a direction in speech processing research that has recently emerged. One advantage of non-autoregressive models is their decoding speed: decoding consists only of forward propagation through a neural network, so complicated left-to-right beam search is not necessary. In addition, they do not assume a left-to-right generation order and thus represent a paradigm shift in speech processing, where left-to-right, autoregressive models have long been taken for granted. This special session aims to facilitate knowledge sharing between researchers involved in non-autoregressive modeling across various speech processing fields, including, but not limited to, automatic speech recognition, speech translation, and text-to-speech, via panel discussions with leading researchers followed by a poster session.
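To make the decoding-speed argument concrete, here is a minimal sketch of one common non-autoregressive recipe, CTC-style greedy decoding: the transcript is read off a single forward pass by taking per-frame argmaxes, collapsing repeats, and dropping blanks, with no left-to-right beam search. It is a generic illustration, not tied to any particular system in the session.

```python
# Minimal sketch of non-autoregressive (CTC-style greedy) decoding: a single
# forward pass yields per-frame label posteriors, and the transcript is read
# off by frame-wise argmax, collapsing repeats and dropping blanks.
# No left-to-right beam search or autoregressive token feedback is involved.
from typing import List
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> List[int]:
    """log_probs: (time, vocab) scores from one forward pass of the network."""
    best = log_probs.argmax(axis=-1)             # frame-wise argmax
    out, prev = [], None
    for token in best:
        if token != blank_id and token != prev:  # collapse repeats, drop blanks
            out.append(int(token))
        prev = token
    return out

# Toy example: 5 frames over a 4-symbol vocabulary (symbol 0 is the blank).
toy = np.log(np.array([[.7, .1, .1, .1],
                       [.1, .7, .1, .1],
                       [.1, .7, .1, .1],
                       [.7, .1, .1, .1],
                       [.1, .1, .1, .7]]))
print(ctc_greedy_decode(toy))  # -> [1, 3]
```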

URL

Organizers

  • Katrin Kirchhoff (Amazon)
  • Shinji Watanabe (Carnegie Mellon University)
  • Yuya Fujita (Yahoo Japan Corporation)

The COVID-19 pandemic has resulted in more than 93 million infections and more than 2 million casualties. Large-scale testing, social distancing, and face masks have been critical measures to help contain the spread of the infection. While the list of symptoms is regularly updated, it is established that in symptomatic cases COVID-19 seriously impairs the normal functioning of the respiratory system. Does this alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. A COVID-19 diagnosis methodology based on acoustic signal analysis, if successful, can provide a remote, scalable, and economical means for testing individuals. This can supplement existing nucleotide-based COVID-19 testing methods, such as RT-PCR and RAT.

The DiCOVA Challenge is designed to find answers to this question by enabling participants to analyze an acoustic dataset gathered from COVID-19-positive and non-COVID-19 individuals. The findings will be presented in a special session at Interspeech 2021. The timeliness and global societal importance of the challenge warrant a focused effort from researchers across the globe, including those from the fields of medical and respiratory sciences, mathematical sciences, and machine learning. We look forward to your participation!

URL

Organizers

  • Neeraj Sharma (Indian Institute of Science, Bangalore, India)
  • Prasanta Kumar Ghosh (Indian Institute of Science, Bangalore, India)
  • Srikanth Raj Chetupalli (Indian Institute of Science, Bangalore, India)
  • Sriram Ganapathy (Indian Institute of Science, Bangalore, India)

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH 2020 and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which was used to evaluate challenge submissions. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge, organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wide band and (ii) full band scenarios. We are also making available a reliable non-intrusive objective speech quality metric for wide band, called DNSMOS, for participants to use during their development phase. The final evaluation will be based on the ITU-T P.835 subjective evaluation framework, which rates the quality of speech and noise in addition to the overall quality of the speech.

We will have two tracks in this challenge:

  • Track 1: Real-Time Denoising track for wide band scenario
    The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or an equivalent processor. For example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency allowed, including the frame size T, stride time Ts, and any look-ahead, must be less than or equal to 40ms. For example, for a real-time system that receives 20ms audio chunks, if you use a frame length of 20ms with a stride of 10ms, resulting in an algorithmic latency of 30ms, then you satisfy the latency requirements. If you use a frame of size 32ms with a stride of 16ms, resulting in an algorithmic latency of 48ms, then your method does not satisfy the latency requirements as the total algorithmic latency exceeds 40ms. If your frame size plus stride T1 = T + Ts is less than 40ms, then you can use up to (40 - T1) ms of future information. (A worked check of these constraints is sketched after this list.)
  • Track 2: Real-Time Denoising track for full band scenario
    Satisfy Track 1 requirements but at 48 kHz.
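As referenced in Track 1 above, the latency budget can be checked mechanically. The sketch below simply reproduces the worked examples from the track description (total algorithmic latency = frame size T + stride Ts + any look-ahead, capped at 40 ms); it is an illustration of the rule, not official challenge tooling.

```python
# Sketch of the Track 1 latency rule as stated above: total algorithmic
# latency = frame size T + stride Ts + look-ahead, and must not exceed 40 ms.
# Illustration of the rule only, not official challenge tooling.
def algorithmic_latency_ms(frame_ms: float, stride_ms: float, lookahead_ms: float = 0.0) -> float:
    return frame_ms + stride_ms + lookahead_ms

def satisfies_track1(frame_ms: float, stride_ms: float, lookahead_ms: float = 0.0) -> bool:
    return algorithmic_latency_ms(frame_ms, stride_ms, lookahead_ms) <= 40.0

print(satisfies_track1(20, 10))  # True:  20 ms frame + 10 ms stride = 30 ms
print(satisfies_track1(32, 16))  # False: 32 ms frame + 16 ms stride = 48 ms > 40 ms
print(40 - (20 + 10))            # up to 10 ms of future information may be used
```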

More details about the datasets and the challenge are available in the paper and on the challenge GitHub page. Participants must adhere to the rules of the challenge.

URL

Organizers

  • Chandan K A Reddy (Microsoft Corp, USA)
  • Hari Dubey (Microsoft Corp, USA)
  • Kazuhito Koishada (Microsoft Corp, USA)
  • Arun Nair (Johns Hopkins University, USA)
  • Vishak Gopal (Microsoft Corp, USA)
  • Ross Cutler (Microsoft Corp, USA)
  • Robert Aichner (Microsoft Corp, USA)
  • Sebastian Braun (Microsoft Research, USA)
  • Hannes Gamper (Microsoft Research, USA)
  • Sriram Srinivasan (Microsoft Corp, USA)

The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this naturalistic data resource. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, UTDallas-CRSS is hosting a series of progressively complex tasks to promote advanced research on naturalistic “Big Data” corpora. This began at ISCA INTERSPEECH-2019 with "The FEARLESS STEPS Challenge" (FS#1). This first edition of the challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the “First Step” towards extracting high-level information from such massive unlabeled corpora. This was followed at ISCA INTERSPEECH-2020 by the Special Session for the FEARLESS STEPS Challenge (FS#2), which focused on developing supervised learning strategies for the 100-hour Challenge Corpus.

As a natural progression following the successful inaugural challenge FS#1 and FEARLESS STEPS Challenge Phase-#2, FEARLESS STEPS Challenge Phase-#3 focuses on the development of single-channel supervised learning strategies with an aim to test system generalizability to varying channel and mission data. FS#3 also provides an additional challenge task of Conversational Analysis, motivating researchers to work on natural language understanding and group dynamics analysis. FS#3 provides 80 hours of ground-truth data through Training and Development sets, along with 20 hours of blind-set Apollo-11 evaluation data, 5 hours of unseen-channel evaluation data, and an additional 5 hours of blind-set Apollo-13 mission evaluation data. Based on feedback from the Fearless Steps participants, additional tracks for streamlined speech recognition, speaker diarization, and conversational analysis have been included in FS#3. The results of this Challenge will be presented at the ISCA INTERSPEECH-2021 Special Session. We encourage participants to explore any and all research tasks of interest with the Fearless Steps Corpus, with suggested task domains listed below. Research participants can, however, also utilize the FS#3 corpus to explore additional problems dealing with naturalistic data, which we welcome as part of the special session.

Timeline

  • Challenge Start Date (Data Release): February 7, 2021
  • Deadline for INTERSPEECH-2021 papers dealing with FEARLESS STEPS: April 2, 2021

Challenge Tasks in Phase-3 (FS#3):

  1. Speech Activity Detection (SAD)
  2. Speaker Recognition:
    • 2a. Track 1: Speaker Identification (SID)
    • 2b. Track 2: Speaker Verification (SV)
  3. Speaker Diarization (SD):
    • 3a. Track 1: Diarization using reference SAD
    • 3b. Track 2: Diarization using system SAD
  4. Automatic Speech Recognition (ASR):
    • 4a. Track 1: ASR using reference Diarization
    • 4b. Track 2: Continuous stream ASR
  5. Conversational Analysis
    • 5a. Track 1: Hotspot Detection
    • 5b. Track 2: Extractive Summarization

URL

Organizers

  • John H.L. Hansen, University of Texas at Dallas
  • Christopher Cieri, Linguistic Data Consortium
  • Omid Sadjadi, NIST
  • Aditya Joglekar, University of Texas at Dallas
  • Meena Chandra Shekar, University of Texas at Dallas

This special session focuses on privacy-preserving machine learning (PPML) techniques in speech, language and audio processing, including centralized, distributed and on-device processing approaches. Novel contributions and overviews on the theory and applications of PPML in speech, language and audio are invited. We encourage submissions related to ethical and regulatory aspects of PPML in this context.

Sending speech, language or audio data to a cloud server exposes private information. One approach, called anonymization, is to preprocess the data so as to hide information which could identify the user, by disentangling it from other useful attributes. PPML is a different approach, which solves this problem by moving computation near the clients. Due to recent advances in edge computing and neural processing units on mobile devices, PPML is now a feasible technology for most speech, language and audio applications; it enables companies to train on customer data without requiring customers to share that data. With PPML, data can sit on a customer's device, where it is used for model training. During the training process, models from several clients are often shared with aggregator nodes that perform model averaging and sync the new models to each client. Next, the new averaged model is used for training on each client. This process continues and enables each client to benefit from training data on all other clients. Such processes were not possible in conventional audio/speech ML. On top of that, high-quality synthetic data can also be used for training thanks to advances in speech, text, and audio synthesis.
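As a concrete (and deliberately simplified) illustration of the training loop described above, the sketch below runs a few rounds of federated averaging on a toy linear-regression problem: each client updates a copy of the model on data that stays local, and an aggregator averages the parameters and syncs them back. It is a generic sketch under these assumptions, not any specific PPML framework's API.

```python
# Simplified sketch of federated averaging, as described above: each client
# trains on data that never leaves it, an aggregator averages the locally
# updated parameters, and the averaged model is synced back to every client.
# Generic illustration on a toy linear model, not any PPML framework's API.
from typing import List, Tuple
import numpy as np

def local_update(weights: np.ndarray, x: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 1) -> np.ndarray:
    """One client's local training (linear model, squared loss, gradient descent)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w: np.ndarray,
                    clients: List[Tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Aggregator step: average the clients' locally updated models (FedAvg)."""
    return np.mean([local_update(global_w, x, y) for x, y in clients], axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
clients = []
for _ in range(3):                          # three clients with private data
    x = rng.normal(size=(50, 2))
    clients.append((x, x @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):                         # communication rounds
    w = federated_round(w, clients)
print(w)  # approaches [1.5, -2.0] without any client sharing raw data
```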

URL

Organizers

  • Harishchandra Dubey (Microsoft)
  • Amin Fazel (Amazon, Alexa)
  • Mirco Ravanelli (Mila, Université de Montréal)
  • Emmanuel Vincent (Inria)

Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 13th edition, we introduce four new tasks and Sub-Challenges:

  • COVID-19 Cough-based recognition,
  • COVID-19 Speech-based recognition,
  • Escalation level assessment in spoken dialogues,
  • Classification of primates based on their vocalisations.

Sub-Challenges allow contributors to find their own features with their own machine learning algorithms. However, a standard feature set and tools, including recent deep learning approaches, are provided and may be used. Participants have five trials on the test set per Sub-Challenge. Participation has to be accompanied by a paper presenting the results, which undergoes the Interspeech peer review.

Contributions using the provided or equivalent data are sought, including (but not limited to):

  • Participation in a Sub-Challenge
  • Contributions around the Challenge topics

Results of the Challenge and Prizes will be presented at Interspeech 2021 in Brno, Czechia.

URL

Organizers

  • Björn Schuller (University of Augsburg, Germany / Imperial College, UK)
  • Anton Batliner (University of Augsburg, Germany)
  • Christian Bergler (FAU, Germany)
  • Cecilia Mascolo (University of Cambridge, UK)
  • Jing Han (University of Cambridge, UK)
  • Iulia Lefter (Delft University of Technology, The Netherlands)
  • Heysem Kaya (Utrecht University, The Netherlands)

The goal of the OpenASR (Open Automatic Speech Recognition) Challenge is to assess the state of the art of ASR technologies for low-resource languages.

The OpenASR Challenge is an open challenge created out of the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program, which encompasses more tasks, including CLIR (cross-language information retrieval), domain classification, and summarization. For every year of MATERIAL, NIST supports a simplified, smaller-scale evaluation open to all, focusing on a particular technology aspect of MATERIAL. The capabilities tested in the open challenges are expected to ultimately support the MATERIAL task of effective triage and analysis of large volumes of data in a variety of less-studied languages.

The special session aims to bring together researchers from all sectors working on ASR for low-resource languages to discuss the state of the art and future directions. It will allow for fruitful exchanges between OpenASR20 Challenge participants and other researchers working on low-resource ASR. We invite contributions from OpenASR20 participants, MATERIAL performers, as well as any other researchers with relevant work in the low-resource ASR problem space.

Topics:

  • OpenASR20 Challenge reports, including
    • Cross-lingual training techniques to compensate for ten-hour training condition
    • Factors influencing ASR performance on low-resource languages by gender and dialect
    • Resource conditions used for unconstrained development condition
  • IARPA MATERIAL performer reports on low-resource ASR, including
    • Low Resource ASR tailored to MATERIAL’s Cross Language Information Retrieval Evaluation
    • Genre mismatch condition between speech training data and evaluation
  • Other topics focused on low-resource ASR challenges and solutions

URL

Organizers

  • Peter Bell, University of Edinburgh
  • Jayadev Billa, University of Southern California Information Sciences Institute
  • William Hartmann, Raytheon BBN Technologies
  • Kay Peterson, National Institute of Standards and Technology

In the last decade, machine learning (ML) and deep learning (DL) have achieved remarkable success in speech-related tasks, e.g., speaker verification (SV), automatic speech recognition (ASR) and keyword spotting (KWS). However, in practice, it is very difficult to obtain proper performance without expertise in machine learning and speech processing. Automated Machine Learning (AutoML) has been proposed to explore automatic pipelines that train effective models for a specific task requirement without any human intervention. Moreover, some methods belonging to AutoML, such as Automated Deep Learning (AutoDL) and meta-learning, have been used in KWS and SV tasks, respectively. A series of AutoML competitions, e.g., automated natural language processing (AutoNLP) and automated computer vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners.

Keyword spotting, which usually serves as the entry point to smart devices such as mobile phones, smart speakers, or other intelligent terminals, has received a lot of attention in both academia and industry. Meanwhile, out of considerations of fun and security, the personalized wake-up mode has more application scenarios and requirements. Conventionally, the solution pipeline combines KWS with a text-dependent speaker verification (TDSV) system, in which case the two systems are optimized separately. On the other hand, there is always little data available from the target speaker, so in this case both KWS and speaker verification (SV) can be considered low-resource tasks.
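To illustrate the conventional pipeline described above, the sketch below gates device wake-up on both a keyword-spotting score and a text-dependent speaker-verification score; the model interfaces and thresholds are placeholder assumptions, not part of the Auto-KWS baseline.

```python
# Sketch of the conventional, separately optimized KWS + TDSV pipeline
# described above: the device wakes up only if the keyword is detected AND
# the target speaker is verified. The model interfaces and thresholds are
# placeholders, not part of the Auto-KWS baseline.
from dataclasses import dataclass
import numpy as np

@dataclass
class PersonalizedWakeup:
    kws_threshold: float
    sv_threshold: float
    enrolled_embedding: np.ndarray          # target speaker's enrollment embedding

    def kws_score(self, audio: np.ndarray) -> float:
        raise NotImplementedError           # plug in a keyword-spotting model

    def speaker_embedding(self, audio: np.ndarray) -> np.ndarray:
        raise NotImplementedError           # plug in a speaker encoder

    def accept(self, audio: np.ndarray) -> bool:
        if self.kws_score(audio) < self.kws_threshold:
            return False                    # keyword not detected
        emb = self.speaker_embedding(audio)
        cos = float(emb @ self.enrolled_embedding /
                    (np.linalg.norm(emb) * np.linalg.norm(self.enrolled_embedding)))
        return cos >= self.sv_threshold     # target speaker verified
```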

In this challenge, we propose Automated Machine Learning for Personalized Keyword Spotting (Auto-KWS), which aims at automated solutions for personalized keyword spotting tasks. There are several specific questions that participants can further explore, including but not limited to:

  • How to automatically handle multilingual, multi accent or various keywords?
  • How to make better use of additional tagged corpus automatically?
  • How to integrate keyword spotting task and speaker verification task?
  • How to jointly optimize personalized keyword spotting with speaker verification?
  • How to design multi-task learning for personalized keyword spotting with speaker verification?
  • How to automatically design effective neural network structures?
  • How to reasonably use meta-learning, few-shot learning, or other AutoML technologies in this task?

Additionally, participants should also consider:

  • How to automatically and efficiently select appropriate machine learning model and hyper-parameters?
  • How to make the solution more generic, i.e., how to make it applicable for unseen tasks?
  • How to keep the computational and memory cost acceptable?

We have already organized two successful automated speech classification challenges, AutoSpeech at ACML 2019 and AutoSpeech 2020 at INTERSPEECH 2020, which were the first two challenges combining AutoML and speech tasks. This time, our Auto-KWS challenge will focus on personalized keyword spotting tasks for the first time, and the released database will also serve as a benchmark for research in this field and boost the exchange of ideas and discussion in this area.

Timeline

  • Feb 26th: Feedback Phase starts
  • Mar 26th: Feedback Phase ends, Private Phase starts
  • Mar 26th: Interspeech paper submission deadline
  • Jun 2nd: Interspeech paper notification
  • Aug 31st: Interspeech starts

URL

Organizers

  • Hung-Yi Lee, Professor, College of Electrical Engineering and Computer Science, National Taiwan University, Taiwan
  • Lei Xie, Professor, Audio, Speech and Language Processing Lab (NPU-ASLP), Northwestern Polytechnical University, Xi'an, China
  • Tom Ko, Assistant Professor, Southern University of Science and Technology, China
  • Wei-Wei Tu, 4Paradigm Inc., China
  • Isabelle Guyon, Université Paris-Saclay, France, and ChaLearn, USA
  • Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, China
  • Xiawei Guo, 4Paradigm Inc., China
  • Yuxuan He, 4Paradigm Inc., China
  • Shouxiang Liu, 4Paradigm Inc., China
  • Jingsong Wang, 4Paradigm Inc., China
  • Zhen Xu, 4Paradigm Inc., China
  • Chunyu Zhao, 4Paradigm Inc., China

Learning prosodic representations from the speech signal has become common practice in emotion classification and often yields better performance than the respective baselines. Likewise, novel TTS systems learn prosodic embeddings that allow them to produce prosodically varied, expressive speech. However, it often remains opaque how to interpret these learned prosodic representations, let alone how to meaningfully modify them or use them to perform specific tasks.

This special session focuses on the interpretation, modification and application of learned prosodic representations in emotional speech classification and synthesis. It aims to bring together the different communities of explainable artificial intelligence (XAI), synthesis and classification to tackle a common problem.

Topics of interest include, but are not limited to:

  1. Latent space exploration
  2. Interpretation of learned representations using methods from Explainable Artificial Intelligence (XAI)
  3. Controlled embedding modification in expressive speech synthesis
  4. Application and comparison of learned representations for classification

We strongly encourage interdisciplinary submissions that combine methods from synthesis, analysis and explainable artificial intelligence.

URL

Organizers

  • Pol van Rijn, Neuroscience department, Max Planck Institute for Empirical Aesthetics, Frankfurt
  • Silvan Mertes, Lab for Human-Centered Artificial Intelligence, Augsburg University
  • Dominik Schiller, Lab for Human-Centered Artificial Intelligence, Augsburg University