Look at your data.

  • Create a data annotation interface for adding free-form notes. This forces you to get close to the data.
  • Annotate until you feel you’re not learning anything new.
  • Now ask an LLM to create categories and assign each sample to one of them (multiclass classification).
  • Aggregate and analyze failure modes.
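
A minimal sketch of the aggregation step, assuming each annotated sample already carries a category label from the LLM. The `samples` records, their field names, and the category names are made up for illustration; they are not from any real dataset or tool.

```python
from collections import Counter

# Hypothetical annotated samples: free-form note plus an LLM-assigned category.
samples = [
    {"id": 1, "note": "ignored the word limit", "category": "ignored instruction"},
    {"id": 2, "note": "made up a release date", "category": "hallucinated fact"},
    {"id": 3, "note": "answered in the wrong language", "category": "ignored instruction"},
    {"id": 4, "note": "returned prose instead of JSON", "category": "wrong format"},
    {"id": 5, "note": "skipped the second question", "category": "ignored instruction"},
]


def failure_mode_counts(samples):
    """Return (category, count, share) tuples, most frequent category first."""
    counts = Counter(s["category"] for s in samples)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


for category, n, share in failure_mode_counts(samples):
    print(f"{category:<22} {n:>3}  {share:.0%}")
```

Sorting by frequency puts the failure mode worth fixing first at the top of the list.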

Now, with every change to the app (a prompt tweak, a model swap, etc.), rerun the evals and compare the results against the previous version.
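
One way to do that comparison, as a rough sketch: it assumes each eval run is stored as a mapping from sample id to pass/fail. The `compare_runs` helper and the example data are hypothetical, not part of any eval framework.

```python
def pass_rate(results):
    """Fraction of samples that passed; results maps sample id -> bool."""
    return sum(results.values()) / len(results)


def compare_runs(baseline, candidate):
    """Compare two eval runs on the samples they have in common."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "baseline_pass_rate": pass_rate({k: baseline[k] for k in shared}),
        "candidate_pass_rate": pass_rate({k: candidate[k] for k in shared}),
        # Passed before the change, fails after it.
        "regressions": [k for k in shared if baseline[k] and not candidate[k]],
        # Failed before the change, passes after it.
        "fixes": [k for k in shared if not baseline[k] and candidate[k]],
    }


# Hypothetical results before and after a prompt change.
baseline = {"s1": True, "s2": False, "s3": True, "s4": False}
candidate = {"s1": True, "s2": True, "s3": False, "s4": True}

report = compare_runs(baseline, candidate)
print(report)
```

Listing regressions and fixes per sample, rather than only the two pass rates, shows when a change that looks like a net win has broken something that used to work.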

Follow-up:

https://pbs.twimg.com/media/GnrgGMxXEAA__h0?format=jpg&name=4096x4096

Andrew Ng's recommendation on evals: https://x.com/AndrewYNg/status/1912908679344693711