Interpretability (understanding the internals of AI models) remains an unsolved problem, especially given how quickly AI capabilities are advancing and how frequently new architectures and developments are released.
Is interpretability needed? While it's possible that advanced AI will somehow be "naturally aligned" to be pro-human and pro-Earth, there is no benefit to assuming this is true. It seems unlikely that every advanced AI would be fully aligned across all possible scenarios and edge cases.
Neuronpedia's role is to accelerate our understanding of AI models, so that when they become powerful enough, we have a better chance of aligning them. If we can increase the probability of a good outcome by even 0.01%, the expected value is many, many current and future lives saved - certainly a worthwhile and meaningful endeavor.
Check out our announcement post for the details.
Code, issues, and documentation are available on our GitHub.
4TB+ of data is available in our Public Datasets.
Donors
David Chanin: "Neuronpedia is amazing! Thank you for building this"