
Researchers from MIT and the MIT-IBM Watson AI Lab have developed a new method that aims to improve vision-and-language navigation. In this multistep task, visual observations are converted into natural-language descriptions, which are then fed into a single large language model that carries out all parts of the navigation task. Because the representations are purely text-based, the method makes it possible to generate large amounts of synthetic training data and can be applied in situations that lack sufficient visual data for training. The researchers also found that combining their language-based inputs with visual signals leads to better navigation performance.

The researchers' method differs from traditional navigation techniques that rely on visual representations in several ways. First, instead of encoding visual features from images of the robot's surroundings, which is computationally intensive, the method generates text captions that describe the robot's point of view. Second, it uses a large language model to process those captions and predict the actions the robot should take to fulfill a user's language-based instructions. Finally, because the representations are entirely text-based, the method can generate a huge amount of synthetic training data, which is not possible with traditional vision-based techniques.
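To make the pipeline concrete, below is a minimal sketch of how such a caption-then-prompt navigation loop could be wired together. The captioning model, the language-model client, the action set, and the prompt wording are placeholders chosen for illustration, not the specific components used by the researchers.

```python
from typing import Callable, List

# Illustrative action space; the real system's actions may differ.
ACTIONS = ["move forward", "turn left", "turn right", "stop"]


def build_prompt(instruction: str, captions: List[str], actions: List[str]) -> str:
    """Assemble the user's instruction and the text-only history into one prompt."""
    history = "\n".join(
        f"Step {i}: view was '{view}', action taken was '{act}'"
        for i, (view, act) in enumerate(zip(captions, actions), start=1)
    )
    return (
        f"Instruction: {instruction}\n"
        f"History so far:\n{history or '(none)'}\n"
        f"Current view: {captions[-1]}\n"
        f"Reply with exactly one of: {', '.join(ACTIONS)}."
    )


def navigate(
    instruction: str,
    get_view: Callable[[], object],    # returns the robot's current camera image
    caption: Callable[[object], str],  # off-the-shelf image-captioning model
    llm: Callable[[str], str],         # large language model: prompt -> reply
    max_steps: int = 20,
) -> List[str]:
    """Caption the current view, ask the LLM for the next action, repeat until 'stop'."""
    captions: List[str] = []
    taken: List[str] = []
    for _ in range(max_steps):
        captions.append(caption(get_view()))
        reply = llm(build_prompt(instruction, captions, taken)).strip().lower()
        action = reply if reply in ACTIONS else "stop"  # fall back on malformed output
        taken.append(action)
        if action == "stop":
            break
    return taken
```

Because the entire episode history in this loop is plain text, it can be logged, replayed, or written from scratch without any image data, which is what opens the door to purely textual training examples.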

According to the research findings, language-based inputs offer two main advantages over visual representations. Although the language-based approach does not outperform techniques that use visual features, it performs well in situations that lack enough visual data for training, in part because purely text-based training examples can be synthesized at scale. Moreover, combining language-based representations with vision-based methods improves an agent's ability to navigate.
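Since both the observations and the actions live entirely in text, new training episodes can, in principle, be written by a language model itself rather than collected from a simulator. The sketch below illustrates one such approach; the episode format, the seed example, and the synthesize_episodes helper are hypothetical and are not the researchers' actual data pipeline.

```python
import json
from typing import Callable, Dict, List

# A hand-written seed episode used as an in-context example. The format
# (an instruction plus a list of view/action steps) is illustrative only.
SEED_EPISODE: Dict = {
    "instruction": "Walk past the grey couch and stop at the kitchen doorway.",
    "trajectory": [
        {"view": "a living room with a grey couch on the left", "action": "move forward"},
        {"view": "a hallway leading toward a kitchen", "action": "move forward"},
        {"view": "a kitchen doorway directly ahead", "action": "stop"},
    ],
}


def synthesize_episodes(llm: Callable[[str], str], n: int) -> List[Dict]:
    """Ask a language model to write new text-only navigation episodes."""
    prompt = (
        "Here is a navigation episode written entirely in text:\n"
        f"{json.dumps(SEED_EPISODE, indent=2)}\n\n"
        "Write one new episode in the same JSON format, with a different "
        "instruction and a plausible sequence of views and actions. "
        "Return only the JSON object."
    )
    episodes: List[Dict] = []
    for _ in range(n):
        try:
            episodes.append(json.loads(llm(prompt)))
        except json.JSONDecodeError:
            continue  # skip malformed generations
    return episodes
```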