I worked at more than one startup that was trying to develop and commercialize foundation models before the technology was ready. We didn't have the "chatbot" paradigm and were always focused on evaluation for a specific task.
I built a model trainer with eval capabilities that I felt was a failure. I mean, it worked, but it felt like a terrible bodge, just like the tools you're talking about. Part of it is that some of the models we were training were small and could be run inside scikit-learn's model selection tools, which I've come to see as "basically adequate" for classical ML. Other models might take a few days to train on a big machine, which forced us to develop model selection tools that worked with processes too big to fit in a single address space, and those tools also gave us inferior model selection for the small models. (The facilities for model selection in Hugging Face are just atrocious, in my mind.)
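To make that concrete, here's roughly the kind of thing I mean by scikit-learn's model selection tools being adequate for classical ML: a toy grid search over a small text classification pipeline. The data and parameter grid here are made up for illustration, not from the actual system I built.

    # Minimal sketch: cross-validated hyperparameter search with scikit-learn.
    # Works fine when every candidate model fits and trains in one process.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Tiny illustrative dataset
    texts = ["great product", "terrible support", "works fine", "total bodge"]
    labels = [1, 0, 1, 0]

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # GridSearchCV clones the estimator and fans fits out over local workers.
    # That's exactly what breaks down once a single fit takes days on a big
    # machine and can't live inside one address space.
    search = GridSearchCV(
        pipeline,
        param_grid={"clf__C": [0.1, 1.0, 10.0]},
        cv=2,
        n_jobs=-1,
    )
    search.fit(texts, labels)
    print(search.best_params_, search.best_score_)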
I see a lot of bad frameworks for LLMs that make the same mistakes I was making back then, but I'm not sure what the answer is, although I think it can be solved for particular domains. For instance, I have a design for a text classifier trainer which I think could handle a wide range of problems where the training set is between 50 and 500,000 examples.
I saw a lot of lost opportunities in the 2010s where people could have built a workable A.I. application if they had been willing to build training and eval sets, but they wouldn't. I got pretty depressed when I talked to dozens of vendors in the full text search space and didn't find any that were using systematic evaluation to improve their relevance. I am really hopeful today that evaluation is a growing part of the conversation.
Sounds like a nightmare. How do you deal with the nondeterministic behaviour of the LLMs when trying to debug why they did something wrong?