Interspeech 2021 will feature six high-quality tutorials.

Monday August 30, 11:00-14:00 CEST

This tutorial covers the theory and practical applications of intonation research. The following three topics will be introduced to speech technology engineers and researchers new to the field of intonation and prosody: a. the fundamentals of the autosegmental-metrical theory of intonational phonology (AM), a widely accepted phonological framework of intonation; b. a range of automatic and manual annotation methods that can create fast or detailed transcriptions of prosody; c. state-of-the-art modelling techniques for explaining intonation.


  • Amalia Arvaniti
  • Kathleen (Katie) Jepson
  • Cong Zhang
  • Katherine Marcoux

Amalia Arvaniti

Amalia Arvaniti is the Chair of English Language and Linguistics at Radboud University, Netherlands. She received her Ph.D. from the University of Cambridge (1991) and has since held appointments at the University of Kent (2012-2020), the University of California, San Diego (2002-2012), the University of Cyprus (1995-2001), and the University of Oxford (1991-1994). She has published extensively on prosody, particularly on the phonetics and phonology of intonation, and the nature and measurement of speech rhythm. Her research is currently supported by an ERC Advanced grant titled SPRINT, which investigates the role of variation in the phonetics and phonology of the intonation systems of English and Greek.

Amalia was co-editor and then editor of the Journal of the International Phonetic Association (2014-2015 and 2015-2019 respectively). She also serves on the editorial boards of the Journal of Phonetics, the Journal of Greek Linguistics, and the Studies in Laboratory Phonology series of Language Science Press; from 2000 to 2020 she was also on the editorial board of Phonology. She is currently the President of the Executive Permanent Council for the Organisation of the International Congress of Phonetic Sciences (2019-2023).

Kathleen Jepson

Kathleen Jepson is a postdoctoral researcher on Amalia Arvaniti’s ERC-funded SPRINT project, based at Radboud University. She received her Bachelor’s degree (Honours) from the Australian National University in 2013, and her PhD in Linguistics from the University of Melbourne in 2019. Kathleen’s doctoral research, supervised by Prof. Janet Fletcher, Dr. Ruth Singer, and Dr. Hywel Stoakes, was a description of aspects of the prosodic system of Djambarrpuyŋu, an Australian Indigenous language. She has experience in conducting data collection for prosodic analysis in remote locations, and developing analyses of under-described languages. Kathleen’s research interests include the production and perception of prosody, particularly intonation, as well as language description of under-resourced languages in Australia and the Pacific region.

Cong Zhang

Cong Zhang is a postdoctoral researcher on the ERC-funded SPRINT project at Radboud University. She is in charge of collecting, analysing, modelling, and interpreting the English intonation data. Cong received her DPhil degree from the University of Oxford in 2018, with a thesis examining the interaction of tone and intonation in Tianjin Mandarin. Following her DPhil, she worked as a TTS Linguistics Engineer at A-Lab, Rokid Inc., where she led a project to develop a Singing Voice Synthesis system. Cong’s research covers various aspects of speech prosody; she is also interested in bridging the gap between linguistic theory and speech technology.

Katherine Marcoux

Katherine Marcoux is the lab manager of SPRINT. She assists with various aspects of the research process, mainly focusing on data analysis. Marcoux completed her MSc at the Universitat Pompeu Fabra, after which she began her PhD thesis at Radboud University investigating the production and perception of native and non-native Lombard speech. She is currently finalizing her doctoral manuscript.

Dealing with overlapping speech remains one of the great challenges of speech processing. Target speech extraction consists of directly estimating the speech of a desired speaker in a speech mixture, given clues about that speaker, such as a short enrollment utterance or video of the speaker. It is an emerging field of research that has gained increasing attention because it provides a practical alternative to blind source separation for processing overlapping speech. Indeed, by focusing on extracting only one speaker, target speech extraction can relax some of the limitations of blind source separation, such as the need to know the number of speakers and the speaker permutation ambiguity.

In this tutorial, we will present an in-depth review of neural target speech extraction, including audio, visual, and multi-channel approaches, covering the basic concepts up to the most recent developments in the field. We will provide a unified presentation of the different approaches to emphasize their similarities and differences. We will also discuss extensions to other tasks such as speech recognition, voice activity detection, and diarization.

Marc Delcroix

Marc Delcroix received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and the École Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from Hokkaido University, Sapporo, Japan, in 2007. He was a Research Associate with NTT Communication Science Laboratories (CS labs), Kyoto, Japan, from 2007 to 2008 and 2010 to 2012, where he then became a Permanent Research Scientist in 2012. He was a Visiting Lecturer with Waseda University, Tokyo, Japan, from 2015 to 2018. He is currently a Distinguished Researcher with CS labs.

His research interests cover various aspects of speech signal processing, such as robust speech recognition, speech enhancement, target speech extraction, and model adaptation. Together with Kateřina Žmolíková, he pioneered the field of neural network-based target speech extraction, and he has been actively pursuing research in that direction, also publishing early works on target-speaker ASR and audio-visual target speech extraction, and presenting a show-and-tell on the topic at ICASSP'19.

Dr. Delcroix is a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee (SLTC). He was one of the organizers of the REVERB Challenge 2014 and of ASRU 2017. He was also a senior affiliate at the Jelinek workshop on speech and language technology (JSALT) in 2015 and 2020.

Kateřina Žmolíková

Kateřina Žmolíková received the B.Sc. degree in information technology in 2014 and the Ing. degree in mathematical methods in information technology in 2016 from the Faculty of Information Technology, Brno University of Technology (BUT), Czech Republic, where she is currently working towards her Ph.D. degree. Since 2013, she has been part of the Speech@FIT research group at BUT. She completed internships at Toshiba Research Laboratory in Cambridge in 2014 and in the Signal Processing Research Group at NTT in Kyoto in 2017. She also took part in the Jelinek workshop on speech and language technology in 2015 and 2020.

In her Ph.D. research, she focuses on target speech extraction using neural networks. Her research covers the general design of target-informed neural networks and their integration with multi-channel and automatic speech recognition systems. She has experience teaching courses in signal processing and speech processing at the Brno University of Technology. She has also taken part in outreach programs for high-school students, presenting lectures introducing speech technology.

This tutorial introduces k2, the cutting-edge successor to the Kaldi speech processing toolkit, which consists of several Python-centric modules for building speech recognition systems, along with its enabling counterparts, Lhotse and Icefall. Participants will learn how to perform swift data manipulation with Lhotse; how to build and leverage auto-differentiable weighted finite state transducers with k2; and how the two can be combined to create PyTorch-based, state-of-the-art hybrid ASR system recipes from Snowfall, the precursor to Icefall.

Dr. Daniel Povey is an expert in ASR, best known as the lead author of the Kaldi toolkit and also for popularizing discriminative training (now known as "sequence training" in the form of MMI and MPE). He has worked in various research positions at IBM, Microsoft and Johns Hopkins University, and is now Chief Voice Scientist of Xiaomi Corporation in Beijing, China.

Dr. Piotr Żelasko is an expert in ASR and spoken language understanding, with extensive experience in developing practical and scalable ASR solutions for industrial-strength use. He has worked with two successful speech processing start-ups: Techmo (Poland) and IntelligentWire (USA, acquired by Avaya). At present, he is a research scientist at Johns Hopkins University.

Prof. Sanjeev Khudanpur has 25+ years of experience working on almost all aspects of human language technology, including ASR, machine translation, and information retrieval. He has led a number of research projects from NSF, DARPA, IARPA, and industry sponsors, and has published extensively. He has trained more than 40 PhD and Masters students to use Kaldi for their dissertation work.

Monday August 30, 15:00-18:00 CEST

Weighted finite-state automata (WFSAs) have been a critical building block in modern automatic speech recognition. However, their use in conjunction with "end-to-end" deep learning systems is limited by the lack of efficient frameworks with support for automatic differentiation. This limitation is being overcome with the advent of new frameworks like GTN and k2. This tutorial will cover the basics of WFSAs and review their application in speech recognition. We will then explain the core concepts of automatic differentiation and show how to use it with WFSAs to rapidly experiment with new and existing algorithms. We will conclude with a discussion of the open challenges and opportunities for WFSAs to grow as a central component in automatic speech recognition and related applications.

Awni Hannun

Awni is a research scientist at the Facebook AI Research (FAIR) lab, focusing on low-resource machine learning, speech recognition, and privacy. He earned a Ph.D. in computer science from Stanford University. Prior to Facebook, he worked as a research scientist in Baidu's Silicon Valley AI Lab, where he co-led the Deep Speech projects.

SpeechBrain is a novel open-source speech toolkit natively designed to support various speech and audio processing applications. It currently supports a large variety of tasks, including speech recognition, speaker recognition, speech enhancement, speech separation, and multi-microphone signal processing. The toolkit is flexible, modular, easy to use, and well documented, and can be used to quickly develop speech technologies. With this tutorial, we would like to present SpeechBrain to INTERSPEECH attendees for the first time. First, the design and the general architecture of SpeechBrain will be discussed. Then, its flexibility and simplicity will be shown through practical examples on different speech tasks.

Mirco Ravanelli is currently a postdoctoral researcher at Mila (Université de Montréal), working under the supervision of Prof. Yoshua Bengio. His main research interests are deep learning, speech recognition, far-field speech recognition, cooperative learning, and self-supervised learning. He is the author or co-author of more than 40 papers on these research topics. He received his PhD (with cum laude distinction) from the University of Trento in December 2017. Mirco is an active member of the speech and machine learning communities. He is the founder and leader of the SpeechBrain project.

Titouan Parcollet is an associate professor in computer science at the Laboratoire Informatique d’Avignon (LIA), Avignon University (FR), and a visiting scholar at the Cambridge Machine Learning Systems Lab, University of Cambridge (UK). Previously, he was a senior research associate at the University of Oxford (UK) within the Oxford Machine Learning Systems group. He received his PhD in computer science from the University of Avignon (France), in partnership with Orkis, focusing on quaternion neural networks, automatic speech recognition, and representation learning. His current work involves efficient speech recognition, federated learning, and self-supervised learning. He is also currently collaborating with the Mila-Quebec AI institute on the SpeechBrain project.

Training Automatic Speech Recognition (ASR) models usually requires transcribing large quantities of audio, which is both expensive and time-consuming. To overcome this limitation, many semi-supervised training approaches have been proposed to take advantage of abundant unpaired audio and text data. In this tutorial we describe the concepts and implementation of semi-supervised speech applications, namely Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). We begin the tutorial with the concepts behind the core building blocks, which include speech pre-processing, Transformers, Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). We also describe the state-of-the-art approaches in this domain and the key ideas underlying them.

We walk through the code for implementations. We provide details for installation prerequisites and code using Jupyter notebooks with comments on concepts, key steps, visualization and results.

We believe that a self-contained tutorial giving a good overview of the core techniques with sufficient mathematical background along with actual code will be of immense help to participants.

Omprakash Sonie

Om is a data scientist at Flipkart who has been working on Speech Recognition Systems, Recommender Systems and Natural Language Processing. Om is passionate about providing guidance to budding data scientists for quality machine learning, deep learning and reinforcement learning using the DeepThinking.AI platform. Om is the organiser of a local Deep Learning meetup. Om plans to write books on Code to Concept for Machine. Om (as primary author) has presented tutorials and conducted hands-on workshops at KDD, WWW (TheWeb), RecSys (2018, 2019), ECIR, IJCAI, GTC-Nvidia and various meet-ups.

Venkateshan Kannan

Venkateshan is a data scientist at Flipkart who is presently working in the domain of speech recognition. In the past, he has worked on diverse problems related to complex networks, information theory, disease modeling, dynamic assignment algorithms, and vehicle route optimization. He has a PhD in theoretical physics.

The tutorials will take place at the Faculty of Information Technology of Brno University of Technology and will be accessible both in person and virtually. A special registration and payment are needed for tutorials; please see the "Tutorials" button on the registration form.