Interpreting vision models with sparse dictionary learning: a case for hierarchical learning
Recently, Anthropic demonstrated the power of sparse dictionary learning as an interpretability tool in a large language model. They applied the method to a ...