The NASA-QA dataset provides a benchmark for evaluating model performance on extractive question answering over NASA-related themes [1]. It helps researchers and practitioners carry out natural language processing tasks more accurately and effectively in domains such as Earth sciences, astronomy, physics, astrophysics, heliophysics, planetary sciences, and biology.
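To make "evaluating extractive question answering" concrete, the sketch below scores a predicted answer span against a reference span using exact match and token-level F1. These are the metrics commonly used for SQuAD-style extractive QA; the text above does not say which metrics NASA-QA itself reports, so treat the choice as an assumption. The function names and the example strings are hypothetical and are not taken from the dataset.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized spans are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two spans."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either span is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical prediction/reference pair, for illustration only.
    print(exact_match("The Sun", "sun"))
    print(token_f1("solar wind particles", "solar wind"))
```

A benchmark score is then typically the average of these per-question values over the evaluation split.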
Large Language Models (LLMs) are trained on vast amounts of text data, often sourced from the internet. This data includes diverse topics and language styles, allowing the models to learn grammar, semantics, and general language understanding. The training process involves predicting the next word in a sequence, given the context of preceding words, and adjusting the model's parameters to minimize prediction errors.
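The training objective described above can be sketched with a deliberately tiny model. The example below is an illustrative simplification, not how any production LLM is implemented: it uses whole words as tokens, conditions on only the single preceding word (a bigram model) rather than the full context, and stores its parameters in a plain table of logits. The function names, the toy sentence, and the hyperparameters are all hypothetical. The loop itself is the same in spirit: predict the next word, measure the cross-entropy error, and adjust the parameters by gradient descent to reduce it.

```python
import math


def train_bigram(tokens, epochs=200, lr=0.5):
    """Fit a softmax bigram model by gradient descent on next-word cross-entropy."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # logits[i][j] is the score for word j following word i (the parameters).
    logits = [[0.0] * size for _ in range(size)]
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    losses = []
    for _ in range(epochs):
        total_loss = 0.0
        grads = [[0.0] * size for _ in range(size)]
        for context, target in pairs:
            row = logits[context]
            peak = max(row)  # subtract the max for numerical stability
            exps = [math.exp(x - peak) for x in row]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total_loss += -math.log(probs[target])
            # Gradient of cross-entropy w.r.t. logits: probs minus one-hot target.
            for j in range(size):
                grads[context][j] += probs[j] - (1.0 if j == target else 0.0)
        count = len(pairs)
        for i in range(size):
            for j in range(size):
                logits[i][j] -= lr * grads[i][j] / count
        losses.append(total_loss / count)
    return vocab, logits, losses


def predict_next(vocab, logits, word):
    """Return the highest-scoring next word for the given preceding word."""
    row = logits[vocab.index(word)]
    return vocab[row.index(max(row))]


if __name__ == "__main__":
    # Toy corpus, for illustration only.
    text = "solar wind flows from the sun and solar wind reaches earth".split()
    vocab, logits, losses = train_bigram(text)
    print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
    print(predict_next(vocab, logits, "solar"))
```

Real LLMs replace the logit table with a neural network that reads the entire preceding context and operate on subword tokens, but the prediction error being minimized has the same form.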
INDUS specializes in Earth sciences, astronomy, physics, astrophysics, heliophysics, planetary sciences, and biology, among other multidisciplinary subjects. It is designed to cover the body of knowledge in these fields and to address different needs through several types of models: an encoder model, a contrastive-learning-based general text embedding model, and smaller model versions.