Long Dialogue Summarization: What Works and What’s Next

Mayuri Lalwani
6 min read · Nov 27, 2021

This post gives a high-level overview of the techniques available for long dialogue summarization, the challenges they face, and the scope for future work.


What is Dialogue Summarization and Why is it Needed?

Speech and text are the most common media for sharing ideas between humans, with speech dominating. With countless conversations happening on online messaging platforms, and with the surge of online meetings during the pandemic, extensive amounts of meeting transcripts are being produced. Readers therefore need a way to capture the salient points of the long conversations happening in meetings, interviews, or TV serials.

The goal of Dialogue summarization is to generate concise summaries for long dialogues such as meetings, TV serials and chit chats in order to help readers capture salient information more efficiently.

What is Document Summarization and What are its Methods?

AI has a tremendous impact on mundane day-to-day activities such as categorizing and classifying texts and documents. Automatic text summarization refers to creating a coherent, fluent summary consisting of only the main points of a document. The two main approaches are extractive summarization, which selects salient sentences verbatim, and abstractive summarization, which generates new sentences that paraphrase the source. Compared to documents, long dialogues are more interactive in nature, more context dependent, and their information is more sparsely distributed. Moreover, the informal language used in dialogues poses great difficulties in modeling the relevant and salient features.

Problems with Long Inputs with Classical Transformers

Recent transformer-based pretrained models face two main challenges here: dialogue length and context dependency. Real-world dialogues often exceed the input limits of these models, and their information is context dependent and sparsely distributed.

In this post we discuss three different strategies for dealing with the long-input problem.

Longformer

Different Attention Patterns

Transformers have achieved state-of-the-art results on many NLP tasks. However, Transformers cannot process long sequences directly: the self-attention operation scales quadratically with the length of the input sequence, demanding too much time and memory. The Longformer addresses this issue with an attention mechanism that scales linearly with sequence length.

The Longformer-Encoder-Decoder (LED) variant has both encoder and decoder Transformer stacks; instead of full self-attention, the encoder uses an efficient local + global attention pattern, while the decoder applies full self-attention over the encoded tokens and the previously decoded positions. LED can accept up to 16K tokens and has shown tremendous improvement on long dialogue summarization.
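To make the local + global pattern concrete, here is a minimal sketch (not the actual LED implementation) of which positions may attend to which: every token sees a fixed window of neighbors, while a few designated global tokens see, and are seen by, every position. The number of attended pairs grows roughly linearly with sequence length instead of quadratically.

```python
def sparse_attention_mask(seq_len, window, global_positions):
    """Return mask[i][j] == True if position i may attend to position j."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    half = window // 2
    for i in range(seq_len):
        # local sliding window around each position
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            mask[i][j] = True
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True  # the global token attends everywhere
            mask[j][g] = True  # every token attends to the global token
    return mask

# Global attention on position 0, mirroring the advice to put global
# attention on the first <s> token for summarization.
mask = sparse_attention_mask(seq_len=8, window=4, global_positions=[0])
attended_pairs = sum(row.count(True) for row in mask)  # far fewer than 8 * 8
```

Doubling `seq_len` roughly doubles `attended_pairs` here, whereas full self-attention would quadruple them.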

Important Tips -

  • LEDForConditionalGeneration is an extension of BartForConditionalGeneration exchanging the traditional self-attention layer with Longformer’s chunked self-attention layer. LEDTokenizer is an alias of BartTokenizer.
  • LED works very well on long-range sequence-to-sequence tasks where the input_ids largely exceed a length of 1024 tokens.
  • LED pads the input_ids to be a multiple of config.attention_window if required. Therefore a small speed-up is gained, when LEDTokenizer is used with the pad_to_multiple_of argument.
  • LED makes use of global attention by means of the global_attention_mask (see LongformerModel). For summarization, it is advised to put global attention only on the first <s> token. For question answering, it is advised to put global attention on all tokens of the question.
  • To fine-tune LED on all 16384 tokens, it is necessary to enable gradient checkpointing by executing model.gradient_checkpointing_enable().

Retrieve Then Summarize

This is a two-stage process called the retrieve-then-summarize pipeline. It first retrieves the most relevant sub-text from the dialogue and then feeds it to a summarizer to produce the summary of the long dialogue.

Different retrievers could be used for this process:

  • TF-IDF — This technique is based on a bag-of-words representation: it measures the term frequency (TF) and normalizes it by the inverse document frequency (IDF).
  • BM25 — Similar to TF-IDF, but it also takes document length and term saturation into account.
  • Locator — A learned retriever that uses convolutional neural networks on top of BERT representations.

The output of the retriever is then fed to a fine-tuned BART model to generate the summary.
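The pipeline can be sketched in a few lines. Below is a toy, from-scratch TF-IDF retriever over dialogue turns; the query, turns, and `retrieve` helper are all illustrative inventions, and the summarization step (a fine-tuned BART model in practice) is left as a stub.

```python
import math
from collections import Counter

def tfidf_scores(query, turns):
    """Score each dialogue turn against the query with a bare-bones TF-IDF."""
    docs = [t.lower().split() for t in turns]
    n = len(docs)
    # document frequency of every term across the dialogue turns
    df = Counter(term for d in docs for term in set(d))
    idf = {term: math.log(n / df[term]) for term in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append(sum(tf[w] * idf.get(w, 0.0) for w in query.lower().split()))
    return scores

def retrieve(query, turns, k=2):
    """Return the k highest-scoring turns, preserving dialogue order."""
    scores = tfidf_scores(query, turns)
    top = sorted(range(len(turns)), key=lambda i: scores[i], reverse=True)[:k]
    return [turns[i] for i in sorted(top)]

turns = [
    "Alice: let's finalize the release date",
    "Bob: lunch was great today",
    "Alice: the release date should be next Friday",
    "Bob: sounds good to me",
]
relevant = retrieve("release date", turns, k=2)
# `relevant` would then be joined and fed to the summarizer, e.g.
# summary = summarizer(" ".join(relevant))  # fine-tuned BART, not shown
```

Off-topic turns like the lunch remark score zero and are dropped before the summarizer ever sees them, which is exactly how the pipeline sidesteps the input-length limit.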

HMNet- Hierarchical Meeting Summarization Network

HMNet is an end-to-end encoder-decoder deep learning framework that generates summaries for meeting transcripts. It handles the long-sequence problem by employing a two-level transformer structure to encode the transcript:

  1. Word-level Transformer
  2. Turn-level Transformer

The decoder is a transformer that finally generates the summary tokens.
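The two-level idea can be illustrated with trivial stand-ins for the transformer blocks: the word-level "encoder" below pools toy word vectors within each turn into one turn vector, and the turn-level "encoder" then mixes each turn vector with dialogue-wide context. Everything here (the embedding, the pooling, the mixing rule) is a deliberately simplified invention; the real HMNet uses transformer layers with role and part-of-speech features at both levels.

```python
DIM = 4  # toy embedding size

def embed(word):
    """Deterministic toy word embedding (stand-in for learned vectors)."""
    s = sum(ord(c) for c in word)
    return [((s >> i) % 100) / 100.0 for i in range(DIM)]

def word_level_encode(turn):
    """Word-level stage: compress one turn's words into a single vector."""
    vecs = [embed(w) for w in turn.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def turn_level_encode(turn_vectors):
    """Turn-level stage: contextualize each turn with the dialogue mean."""
    mean = [sum(col) / len(turn_vectors) for col in zip(*turn_vectors)]
    return [[(v + m) / 2 for v, m in zip(tv, mean)] for tv in turn_vectors]

transcript = [
    "PM: welcome everyone to the kickoff",
    "Dev: the prototype is nearly ready",
    "PM: great, let's schedule a demo",
]
turn_vecs = [word_level_encode(t) for t in transcript]  # one vector per turn
encoded = turn_level_encode(turn_vecs)                  # what the decoder attends over
```

The key point is the shape change: however many words a turn contains, the turn-level stage only ever sees one vector per turn, which is what keeps very long transcripts tractable.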

Evaluation Metrics -


ROUGE — One of the most popular and standard metrics for evaluating summarization tasks is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It is a set of metrics: ROUGE-1 and ROUGE-2 count unigram and bigram overlap with the reference summary, and ROUGE-L measures the longest common subsequence.
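As a quick illustration of what the metric actually measures, here is ROUGE-1 computed from scratch on a toy pair. For real evaluations an established implementation (such as the `rouge-score` package) should be used instead.

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each match counted at most min(cand, ref) times
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge1("the cat sat", "the cat sat on the mat")
# all 3 candidate unigrams appear in the 6-word reference:
# recall = 3/6, precision = 3/3
```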

Stronger evaluation metrics that correlate more highly with human judgement have been proposed recently. BERTScore and MoverScore combine contextualized representations with a distance measure and therefore perform best.

New Frontiers and their challenges

Multi-modal Dialogue Summarization -

Non-verbal information, such as visual recordings of a meeting, can add more information and enrich the textual dialogue. Visual features provide additional insights, such as emotional cues or facial expressions, but privacy issues make such data hard to acquire. Future work can consider building multi-modal summarization models under a federated learning framework.

Multi-modal messages such as pictures, emojis, and voice messages are frequently used in conversations on WhatsApp and WeChat. They enrich the dialogue flow, so future work should consider designing systems that incorporate the entities these pictures and emojis introduce, and the emotions associated with them, to produce more meaningful summaries.

Multi-domain Dialogue Summarization-

Shared information between different domains can effectively be used to transfer learning across tasks, for example using a meeting dataset to help learn an email-summarization task. Recent works have proposed diverse dialogue summarization datasets that aggregate information from different domains.

Future work can consider meta learning techniques or pre-trained models to use such datasets.

Conclusion

This post presents a survey of the progress of dialogue summarization. It covers the different techniques available for summarization and their challenges, and highlights new frontiers. I hope this article aids research on dialogue summarization.

Thanks for reading!!

References :

https://arxiv.org/pdf/2004.05150.pdf

https://arxiv.org/pdf/2109.04609.pdf

https://huggingface.co/transformers/model_doc/led.html

https://aclanthology.org/2020.findings-emnlp.19.pdf

https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460

https://arxiv.org/pdf/2107.03175.pdf


Mayuri Lalwani

M.S. in Software Engineering at San Jose State University