It is also the first model to exceed this threshold, with the prior state-of-the-art result at 86.4%.

Also, it's worth reading the Gemini authors' discussion of the nuance of these evaluations in the paper (also on the same page), pulling it out for ease:

"Evaluation on these benchmarks is challenging and may be affected by data contamination. We performed an extensive leaked data analysis after training to ensure the results we report here are as scientifically sound as possible, but still found some minor issues and decided not to report results on e.g. LAMBADA (Paperno et al., 2016). As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in the Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data. So, we evaluate Gemini models on several new held-out evaluation datasets that were recently released, such as WMT23 and Math-AMC 2022-2023 problems, or internally generated from non-web sources, such as Natural2Code."
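For a concrete picture of what a "leaked data analysis" can involve, here is a minimal sketch of one common decontamination heuristic: flagging evaluation examples that share long word n-grams with the pretraining corpus. This is purely illustrative and is not the Gemini authors' actual pipeline; the function names, the 13-gram window, and the toy strings are all my own assumptions.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercase word n-grams; 13-grams are a common (assumed) window for contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_examples: list, pretrain_ngrams: set, n: int = 13) -> float:
    """Fraction of evaluation examples sharing at least one n-gram with the pretraining text."""
    flagged = sum(1 for ex in eval_examples if ngrams(ex, n) & pretrain_ngrams)
    return flagged / max(len(eval_examples), 1)


# Hypothetical usage: build the n-gram index from pretraining documents,
# then measure how many benchmark items overlap it.
pretrain_ngrams = ngrams("... concatenated pretraining documents would go here ...")
rate = contamination_rate(["an eval question ...", "another eval question ..."], pretrain_ngrams)
print(f"Contaminated fraction: {rate:.1%}")
```

A check along these lines explains both findings in the quoted passage: items that slip past the filter inflate scores (as the HellaSwag finetuning experiment shows), which is why held-out sets like WMT23, Math-AMC 2022-2023, and Natural2Code are more trustworthy.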