Text generation for Floor Plan

Mayuri Lalwani
5 min read · Apr 23, 2021

What is Image Captioning?

The task of image captioning is to generate a textual description of an image. The process comprises two components: image understanding and language generation. Both are challenging, open problems in AI.

A picture may be worth a thousand words, but sometimes it’s the words that are most useful. In the real estate domain in particular, floor plans are used by architects to show the interior of a building. A floor plan is essentially a graphical document, and unlike a natural photograph it requires specific domain knowledge to interpret. Template-based retrieval, n-grams, grammar rules, RNNs, LSTMs, and GRUs are some of the approaches used to solve this problem. In this article we discuss text generation from floor plan images using two techniques: Description Synthesis from Image Cue (DSIC) and Transformer Based Description Generation (TBDG).

Techniques for generating text for floor plans:

1. Description Synthesis from Image Cue (DSIC) –

The most popular approach for language modeling and text generation is the encoder-decoder framework. Fig. 1 shows a typical architecture of the model. Here, an RPN (Region Proposal Network) acts as the encoder and a hierarchical RNN acts as the decoder.

The first step is to extract region-wise visual features from the image. A CNN followed by the RPN is used to generate region proposals. Since there are on average 5 sentences per paragraph, the top 5 region proposals are pooled into a single vector using a projection matrix. This projection matrix is trained end-to-end with the sentence RNN and the word RNN.
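To make the pooling step concrete, here is a minimal PyTorch sketch. The feature dimension, the pooled dimension, and the element-wise max aggregation are assumptions on my part; the text above only states that the top 5 region proposals are combined through a learned projection matrix trained with the rest of the network.

```python
import torch
import torch.nn as nn

class RegionPooler(nn.Module):
    """Pools the top-K region features into a single vector via a
    learned projection matrix (trained end-to-end with the RNNs)."""
    def __init__(self, feat_dim=4096, pooled_dim=1024, k=5):  # dims are hypothetical
        super().__init__()
        self.k = k
        self.proj = nn.Linear(feat_dim, pooled_dim)  # the projection matrix

    def forward(self, region_feats):                 # (num_regions, feat_dim)
        top_k = region_feats[: self.k]               # keep the top-5 proposals
        # Element-wise max over projected features is one plausible aggregation.
        return self.proj(top_k).max(dim=0).values    # pooled vector (pooled_dim,)
```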

This network uses two RNNs in a hierarchy. One, the single-layered sentence RNN (S-RNN), learns sentence topic vectors from the pooled features, while the other, the word RNN (W-RNN), learns the words for each sentence topic vector. LSTM networks are used for both RNNs: the sentence LSTM has 512 units and the word LSTM has 512 units, so the fully connected layer size is 1024.

Fig. 1: Hierarchical RNN for generating a paragraph from a floor plan

The pooled vector for each image and its corresponding paragraph is fed as input to the S-RNN. Each input generates up to 5 sentences of up to 60 words each. The output of the S-RNN is given as input to the W-RNN, which generates the words of each sentence. To decide whether to keep generating sentences, a probability distribution is produced at each step and thresholded at 0.5, as in the sketch below.
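Here is a minimal sketch of that hierarchical decoding loop. The callables `sentence_rnn` and `word_rnn` are hypothetical stand-ins for the S-RNN and W-RNN; the 5-sentence and 60-word limits and the 0.5 threshold come from the text above.

```python
MAX_SENTENCES, MAX_WORDS, STOP_THRESHOLD = 5, 60, 0.5

def generate_paragraph(pooled_vec, sentence_rnn, word_rnn):
    paragraph, state = [], None
    for _ in range(MAX_SENTENCES):
        # The S-RNN emits a topic vector plus a probability of stopping.
        topic, stop_prob, state = sentence_rnn(pooled_vec, state)
        if stop_prob > STOP_THRESHOLD:
            break
        # The W-RNN decodes one sentence of at most MAX_WORDS words.
        paragraph.append(word_rnn(topic, max_len=MAX_WORDS))
    return paragraph
```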

The training loss is a weighted sum of two cross-entropy losses:

Loss(sent) = cross-entropy loss over the sentence-topic probabilities.

Loss(word) = cross-entropy loss over word generation.
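As a concrete illustration, here is a minimal PyTorch sketch of the combined loss. The weights `lambda_sent` and `lambda_word` are hypothetical hyperparameters; the text only states that a weighted sum of the two losses is used.

```python
import torch.nn.functional as F

def dsic_loss(stop_logits, stop_targets, word_logits, word_targets,
              lambda_sent=5.0, lambda_word=1.0):  # weights are hypothetical
    # Sentence-level term: cross-entropy over the sentence-topic decision.
    loss_sent = F.cross_entropy(stop_logits, stop_targets)
    # Word-level term: cross-entropy over the vocabulary at each time step.
    loss_word = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        word_targets.reshape(-1))
    return lambda_sent * loss_sent + lambda_word * loss_word
```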

2. Transformer Based Description Generation (TBDG) –

As shown in Fig. 2, TBDG is a transformer-style model that gives the entire input sequence to the decoder. The encoded word cues (We) are fed as input to the encoder (a Bi-LSTM unit). The decoder is an LSTM network with 256 units, connected to a time-distributed dense layer with a softmax activation function.
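A minimal Keras sketch of this encoder/decoder stack follows. The vocabulary size, sequence length, embedding size, and encoder unit count are hypothetical; the text only specifies a Bi-LSTM encoder and a 256-unit LSTM decoder followed by a time-distributed dense layer with softmax.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # hypothetical
MAX_LEN = 60        # hypothetical

inp = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, 128)(inp)            # embed word cues (We)
enc = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)  # encoder
dec = layers.LSTM(256, return_sequences=True)(enc)    # 256-unit decoder LSTM
out = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(dec)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```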

In TBDG, the RPN learns region-wise captions from the dataset, and the model generates a paragraph-based description rather than a paragraph assembled sentence by sentence, which makes it different from DSIC. Since a floor plan is an image document, additional knowledge is needed to generate the descriptions. The data is therefore in tuple format (I, We, K), where I is the floor plan image, We are the word cues, and K is the paragraph description of the floor plan.

The corpus K is preprocessed by removing extra lines, whitespace, unknown symbols, and punctuation, and is tokenized using the PTB tokenizer [2]. A vocabulary of the most frequently occurring words is then built from it, as sketched below.
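Here is a minimal preprocessing sketch, assuming NLTK’s TreebankWordTokenizer as the PTB tokenizer [2]; the cleaning rules mirror the ones listed above, and the vocabulary size is a hypothetical value.

```python
import re
from collections import Counter
from nltk.tokenize import TreebankWordTokenizer

def preprocess(corpus, vocab_size=5000):                 # vocab_size is hypothetical
    text = re.sub(r"\s+", " ", corpus)                   # collapse extra lines/whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())      # drop symbols and punctuation
    tokens = TreebankWordTokenizer().tokenize(text)      # PTB-style tokenization
    vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
    return tokens, vocab
```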

The region-wise captions generated by the RPN in the DSIC model serve as the extra knowledge and are fed to the decoder unit as input. The top 5 captions with the highest probabilities are selected and merged into a one-dimensional vector to generate a paragraph describing the floor plan.

If the paragraphs are too long, training may run into the vanishing gradient problem. This can be mitigated by selecting a few keywords from common categories (kitchen, bedroom, bathroom, porch, and so on) in the corpus. To shorten each paragraph, only the sentences containing these keywords are kept, as in the sketch below.
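A minimal sketch of this keyword-based sentence filtering follows; the keyword set is illustrative, limited to the categories named in the text.

```python
KEYWORDS = {"kitchen", "bedroom", "bathroom", "porch"}  # illustrative subset

def shorten(paragraph):
    sentences = paragraph.split(". ")
    # Keep only sentences that mention at least one room-category keyword.
    kept = [s for s in sentences if KEYWORDS & set(s.lower().split())]
    return ". ".join(kept)
```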

Fig. 2: Framework of TBDG for generating a paragraph description from an input floor plan image

Other state-of-the-art systems for automatic image captioning:

The VIVO (VIsual VOcabulary pre-training) system by Microsoft can accurately caption an image even when no explicit or direct caption for its content appears in the training data.

Fig. 3: An example of VIVO captioning (source: [3])

An example of the VIVO system is shown in Fig. 3. The old model would caption the image as “a man in a blue shirt”. The new VIVO-based model generates the far more informative caption “a man in surgical scrubs looking at a tablet”.

Conclusion and Future Work

In this article we briefly discussed the DSIC and TBDG techniques for generating textual descriptions specifically for floor plan images. Floor plan images differ from natural images: they are 2D line drawings with binary pixel values. Because of this lack of information at every pixel, many state-of-the-art text generation methods for natural images do not work well on floor plans.

The transformer-based technique, TBDG, takes both images and word cues to create a paragraph. The other technique we discussed, DSIC, is a hierarchical recurrent neural network model that generates descriptions by learning features directly from images.

For future work, these models can be made more general by improving the network architecture and redesigning how word cues are taken, so as to accommodate a wider variety of floor plan images.

This article is based on a survey paper; the link is shared below. Hope you enjoyed reading this article. Thanks for reading.

Reference links:

1. Survey Paper: https://arxiv.org/pdf/2103.08298.pdf

2. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a Large Annotated Corpus of English: The Penn Treebank (1993)

3. https://pureai.com/articles/2020/10/14/automatic-imagecaptioning.aspx


Mayuri Lalwani

M.S. in Software Engineering at San Jose State University