Welcome to the AI world. It's exciting & scary but hopefully more exciting than scary. And hopefully we'll figure out how to make aliens on GPUs love us, rather than hate us (or just be indifferent towards us).
I'm spending most of my time thinking about technical AI alignment these days.
A couple of weeks back I realized that I have enough of a sense of how an LLM thinks to be able to write a prompt that would jailbreak both GPT-4 and Claude at the same time in a way that I hadn't seen anyone do it before... and it just worked. I eventually ended up coming up with the shortest, to my knowledge, prompt that only uses plain English that jailbreaks both GPT-4 and Claude (it's 2 sentences..).
The project I'm thinking the most about right now is investigating steganography in GPT-4 and Claude & I'm looking for a research assistant to join me on it. I think there's a ton to learn about how models really think from looking into this. Relevant: GPT-3 will ignore tools when it disagrees with them.
If you're interested, please send me a note with (1) whether we should expect to find it or not, (2) how you'd approach this & (3) what we can learn about LLMs from the project. If things are going well and we are both interested, this could be a longer-term engagement.
Some people in the AI field think the risks of AGI (and successor systems) are fictitious; we would be delighted if they turn out to be right, but we are going to operate as if these risks are existential.
Everyone on your timeline is fretting about AI risk ... Maybe you've even developed a slight distaste for it all ... That’s what I used to think too ... Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!
Have a great April & as always feel free to reach out!