Posts by Tag

Alignment

Interpreting vision models with sparse dictionary learning: a case for hierarchical learning

4 minute read

Recently, Anthropic demonstrated the power of sparse dictionary learning as an interpretability tool in a large language model. They applied the method to a ...

Mitigating LLM Sycophancy

8 minute read

A sycophantic AI is easily manipulated by the user to give an incorrect response. The model provides the correct answer to a neutral prompt, but pri...

Demonstrating LLM Sycophancy

18 minute read

A sycophantic AI is easily manipulated by the user to give an incorrect response. The model provides the correct answer to a neutral prompt, but pri...

Safety

Interpreting vision models with sparse dictionary learning: a case for hierarchical learning

4 minute read

Recently, Anthropic demonstrated the power of sparse dictionary learning as an interpretability tool in a large language model. They applied the method to a ...

Mitigating LLM Sycophancy

8 minute read

A sycophantic AI is easily manipulated by the user to give an incorrect response. The model provides the correct answer to a neutral prompt, but pri...

Demonstrating LLM Sycophancy

18 minute read

A sycophantic AI is easily manipulated by the user to give an incorrect response. The model provides the correct answer to a neutral prompt, but pri...

Fail

Failing repeatedly

6 minute read

What we (don’t) say about failure

Startup

Failing repeatedly

6 minute read

What we (don’t) say about failure

advice

Getting a job in a new field

6 minute read

Here is all my advice about getting a job, particularly if you are coming from academia or switching fields. If you are considering a career change, I recomm...

Proposal

Interpreting vision models with sparse dictionary learning: a case for hierarchical learning

4 minute read

Recently, Anthropic demonstrated the power of sparse dictionary learning as an interpretability tool in a large language model. They applied the method to a ...
