Look at your data.
- Create a data annotation interface for adding free-form notes. This forces you to get close to the data.
- Annotate until you feel you’re not learning anything new.
- Now ask an LLM to create categories and assign each sample to one of them (multiclass classification).
- Aggregate and analyze failure modes
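
The last step can be sketched in a few lines of Python. This is a minimal illustration, assuming the LLM has already assigned exactly one category to each sample; the `labeled_samples` records, their field names, and the category names are all hypothetical.

```python
from collections import Counter

# Hypothetical output of the LLM categorization step: one category per sample.
# The ids, notes, and category names are made up for illustration.
labeled_samples = [
    {"id": 1, "note": "cites a paper that does not exist", "category": "hallucination"},
    {"id": 2, "note": "answer ignores the user's date range", "category": "ignored constraint"},
    {"id": 3, "note": "invents an API parameter", "category": "hallucination"},
    {"id": 4, "note": "reply is in the wrong language", "category": "wrong language"},
    {"id": 5, "note": "makes up a product feature", "category": "hallucination"},
]


def aggregate_failure_modes(samples):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(s["category"] for s in samples)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


for category, count, share in aggregate_failure_modes(labeled_samples):
    print(f"{category:20s} {count:3d}  {share:.0%}")
```

Sorting by frequency makes it easy to see which failure mode is worth fixing first.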
Now, with every change to the app (a prompt tweak, a model swap, etc.), rerun the evals and compare the results against the previous version.
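
One way to make that comparison concrete is a small harness that runs the same eval cases against two versions of the app. This is only a sketch: `pass_rate`, `compare`, the toy `baseline` and `candidate` functions, and the cases themselves are hypothetical stand-ins for a real app and real checks.

```python
from typing import Callable

# An eval case pairs an input with a check on the app's output.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(app: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check passes on the app's output."""
    passed = sum(1 for prompt, check in cases if check(app(prompt)))
    return passed / len(cases)


def compare(baseline, candidate, cases):
    """Pass rates for both versions, plus inputs that pass on baseline only."""
    regressions = [
        prompt
        for prompt, check in cases
        if check(baseline(prompt)) and not check(candidate(prompt))
    ]
    return {
        "baseline": pass_rate(baseline, cases),
        "candidate": pass_rate(candidate, cases),
        "regressions": regressions,
    }


# Toy stand-ins for two versions of the app (assumptions for illustration).
def baseline(prompt: str) -> str:
    return prompt.upper()


def candidate(prompt: str) -> str:
    return prompt


cases: list[EvalCase] = [
    ("hello", lambda out: out == "HELLO"),
    ("abc", lambda out: len(out) == 3),
]

report = compare(baseline, candidate, cases)
print(report)
```

Listing the regressions alongside the pass rates helps, since an unchanged overall score can hide cases that used to pass and now fail.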
Follow-up:
https://pbs.twimg.com/media/GnrgGMxXEAA__h0?format=jpg&name=4096x4096
Andrew Ng's recommendation on evals: https://x.com/AndrewYNg/status/1912908679344693711