Keynote 1, Tue 31 Aug 17:00 CEST, Room A+B


When research on automatic speech recognition started, the statistical (or data-driven) approach was associated with methods like the Bayes decision rule, hidden Markov models, Gaussian models and the expectation-maximization algorithm. Later extensions included discriminative training and hybrid hidden Markov models using multi-layer perceptrons and recurrent neural networks. Some of the methods originally developed for speech recognition turned out to be seminal for other language processing tasks like machine translation, handwritten character recognition and sign language processing. Today's research on speech and language processing is dominated by deep learning, which is typically identified with methods like attention modelling, sequence-to-sequence processing and end-to-end processing.

In this talk, I will present my personal view of the historical development of research on speech and language processing. I will put particular emphasis on the framework of the Bayes decision rule and on the question of how the various approaches that have been developed fit into this framework.


Hermann Ney is a professor of computer science at RWTH Aachen University, Germany. His main research interests lie in the area of statistical classification, machine learning and neural networks with specific applications to speech recognition, handwriting recognition, machine translation and other tasks in natural language processing.

He and his team have participated in a large number of large-scale joint projects, including the German project VERBMOBIL, the European projects TC-STAR, QUAERO, TRANSLECTURES and EU-BRIDGE, and the US projects GALE, BOLT and BABEL. His work has resulted in more than 700 conference and journal papers, with an h-index of 100+ and 60,000+ citations (based on Google Scholar). More than 50 of his former PhD students work for IT companies on speech and language technology.

The results of his research have contributed to various operational research prototypes and commercial systems. In 1993, Philips Dictation Systems Vienna introduced a large-vocabulary continuous-speech recognition product for medical applications. In 1997, Philips Dialogue Systems Aachen introduced a spoken dialogue system for train timetable information via the telephone. In VERBMOBIL, his team introduced the phrase-based approach to data-driven machine translation, which in 2008 was used by his former PhD students at Google as a starting point for the Google Translate service. In TC-STAR, his team built the first research prototype system for spoken language translation of real-life domains.

Awards: 2005 Technical Achievement Award of the IEEE Signal Processing Society; 2013 Award of Honour of the International Association for Machine Translation; 2019 IEEE James L. Flanagan Speech and Audio Processing Award; 2021 ISCA Medal for Scientific Achievements.

Keynote 2, Wed 1 Sep 15:00 CEST, Room A+B


Conversational AI (ConvAI) systems have applications ranging from personal assistance and health assistance to customer service. They have been in place since the first call-centre agent went live in the late 1990s. More recently, smart speakers and smartphones have been powered by conversational AI with an architecture similar to that of the systems from the 1990s. Meanwhile, research on ConvAI systems has made leaps and bounds in recent years with sequence-to-sequence, generation-based models. Thanks to the advent of large-scale pre-trained language models, state-of-the-art ConvAI systems can generate surprisingly human-like responses to user queries in open-domain conversations, known as chit-chat. However, these generation-based ConvAI systems are difficult to control and can produce inappropriate, biased and sometimes even toxic responses. In addition, unlike previous modular conversational AI systems, it is challenging to incorporate external knowledge into these models for task-oriented dialog scenarios such as personal assistance and customer service, and to maintain consistency.

With great power comes great responsibility. We must address the many ethical and technical challenges of generation-based conversational AI systems to control for bias, safety, consistency, style, knowledge incorporation, etc. In this talk, I will introduce state-of-the-art generation-based conversational AI approaches, and will point out the remaining challenges of conversational AI and possible directions for future research, including how to mitigate inappropriate responses. I will also present some ethical guidelines that conversational AI systems can follow.


Pascale Fung is a Professor at the Department of Electronic & Computer Engineering and the Department of Computer Science & Engineering at The Hong Kong University of Science & Technology (HKUST), and a visiting professor at the Central Academy of Fine Arts in Beijing. She is an elected Fellow of the Association for Computational Linguistics (ACL) for her “significant contributions towards statistical NLP, comparable corpora, and building intelligent systems that can understand and empathize with humans”. She is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) for her “contributions to human-machine interactions”, and an elected Fellow of the International Speech Communication Association for “fundamental contributions to the interdisciplinary area of spoken language human-machine interactions”. She is the Director of the HKUST Centre for AI Research (CAiRE), an interdisciplinary research centre spanning all four schools at HKUST, and the founding chair of the Women Faculty Association at HKUST. She is an expert on the Global Future Council, a think tank for the World Economic Forum, where she has advocated for AI ethics issues since 2015, and she represents HKUST on the Partnership on AI to Benefit People and Society. She has been invited as an AI expert to the UN panel on Lethal Autonomous Weapons, the UN Economic and Social Council, and various official EU panels. She is a member of the IEEE Working Group developing an IEEE standard, Recommended Practice for Organizational Governance of Artificial Intelligence, and serves on the Board of Governors of the IEEE Signal Processing Society. Her research team has won several best and outstanding paper awards at ACL and NeurIPS conferences and workshops. She is currently Editor-in-Chief of the ACL Rolling Review system and the Diversity and Inclusion Chair of NeurIPS 2021.

Keynote 3, Thu 2 Sep 15:00 CEST, Room A+B


As we navigate everyday life, we continuously parse a cacophony of sounds that constantly impinge on our senses. This ability to sieve through everyday sounds and pick out signals of interest may seem intuitive and effortless, but it is a real feat that involves complex brain networks balancing the sensory signal with our goals, expectations, attentional state and prior knowledge (what we hear, what we want to hear, what we expect to hear, what we know). A similar challenge faces computer systems that need to adapt to dynamic inputs, evolving objectives and novel surroundings. A growing body of work in neuroscience has been amending our view of processing in the brain, replacing the conventional picture of ‘static’ processing with a more ‘active’ and malleable mapping that rapidly adapts to the task at hand and to the listening conditions. After all, humans and most animals are not specialists but generalists, whose perception is shaped by experience, context and changing behavioral demands. The talk will discuss theoretical formulations of these adaptive processes and lessons for leveraging attentional feedback in algorithms that detect and separate sounds of interest (e.g. speech, music) amidst competing distractors.


Mounya Elhilali is a professor of Electrical and Computer Engineering at Johns Hopkins University, with a joint appointment in the Department of Psychology and Brain Sciences. She directs the Laboratory for Computational Audio Perception and is affiliated with the Center for Speech and Language Processing and the Center for Hearing and Balance. Her research examines sound processing by humans and machines in noisy soundscapes, and reverse-engineers the intelligent processing of sounds by brain networks, with applications to speech and audio technologies and medical systems. She received her Ph.D. in Electrical and Computer Engineering from the University of Maryland, College Park. Dr. Elhilali was named the Charles Renn Faculty Scholar in 2015, received a Johns Hopkins Catalyst Award in 2017, and was recognized as an outstanding woman innovator in 2020. She is a recipient of the National Science Foundation CAREER Award and the Office of Naval Research Young Investigator Award.

Keynote 4, Fri 3 Sep 15:00 CEST, Room A+B


Statistical language modeling has been labeled an AI-complete problem by many famous researchers in the past. However, despite all the progress made in the last decade, it remains unclear how much progress we have made towards truly intelligent language models.

In this talk, I will present my view of what has been accomplished so far and what scientific challenges still lie ahead of us. We need to focus more on developing new mathematical models with certain properties: the ability to learn continually and without explicit supervision, to generalize to novel tasks from limited amounts of data, and to form non-trivial long-term memory. I will describe some of our attempts to develop such models within the framework of complex systems.


Tomas Mikolov is a researcher at CIIRC, Prague, where he leads a research team focusing on the development of novel techniques in the areas of complex systems, artificial life and evolution. Previously, he worked at Facebook AI and Google Brain, where he led the development of popular machine learning tools such as word2vec and fastText. He obtained his PhD from the Brno University of Technology in 2012 for his work on neural language models (the RNNLM project). His main research interest is to understand intelligence and to create artificial intelligence that can help people solve complex problems.