AI Safety & Ethics
Ensuring intelligent systems are aligned, interpretable, and beneficial to all.
Our Mission
As AI systems gain capability, ensuring they behave as intended becomes one of the defining challenges of our time. Yazhvin's Safety & Ethics research bridges machine learning, philosophy, policy, and human factors to design AI that is robust, understandable, and aligned with human values.
Core Research Areas
Scalable Alignment
Training methods and reward modeling to align large models with human intent, including reinforcement learning from human feedback (RLHF) and preference learning.
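For concreteness, here is a minimal sketch of the pairwise objective behind most preference-learning reward models, the Bradley-Terry loss. The linear reward head and random embeddings are toy stand-ins for illustration, not a description of our training stack.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward of the human-preferred response
    above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a linear reward head over fixed-size response embeddings.
reward_head = torch.nn.Linear(128, 1)
chosen_emb = torch.randn(4, 128)    # embeddings of preferred responses
rejected_emb = torch.randn(4, 128)  # embeddings of dispreferred responses

loss = preference_loss(reward_head(chosen_emb).squeeze(-1),
                       reward_head(rejected_emb).squeeze(-1))
loss.backward()  # gradients flow into the reward head
```

In a full RLHF pipeline the trained reward model then scores policy samples during reinforcement learning; this sketch stops at the reward-modeling step.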
Interpretability
Techniques for understanding the internal mechanisms of neural networks, from activation tracing to causal attribution.
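As a concrete illustration of activation tracing, the sketch below uses a standard PyTorch forward hook to record a layer's activations on a toy network. The two-layer model is a placeholder; real interpretability work applies the same primitive at scale.

```python
import torch

# A toy two-layer network standing in for a large model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 8),
)

activations = {}

def trace(name):
    # Forward hook: record this module's output every time it fires.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model[1].register_forward_hook(trace("relu"))

_ = model(torch.randn(1, 16))
print(activations["relu"].shape)  # torch.Size([1, 32])
```

Captured activations like these are the raw material for downstream analyses such as causal attribution.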
Robustness & Red Teaming
Stress-testing systems against adversarial inputs, out-of-distribution conditions, and catastrophic failure modes.
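One widely used stress-testing primitive is the fast gradient sign method (FGSM), sketched below on a toy classifier. This is an illustrative example of adversarial perturbation, not a description of our red-teaming pipeline.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: a one-step adversarial perturbation
    of the inputs, bounded in L-infinity norm by eps."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

# Toy usage: a linear classifier on 10-dimensional inputs.
model = torch.nn.Linear(10, 3)
x, y = torch.randn(5, 10), torch.randint(0, 3, (5,))
x_adv = fgsm_perturb(model, x, y)
print((model(x_adv).argmax(dim=1) == y).float().mean())  # accuracy under attack
```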
Responsible Deployment
Developing frameworks for safe rollouts, impact assessments, and continual monitoring in real-world contexts.
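As one example of a continual-monitoring primitive, the sketch below computes the population stability index (PSI), a common score for detecting drift between a training-time baseline and live production inputs. The 0.2 threshold in the comment is a conventional rule of thumb, not a fixed standard.

```python
import numpy as np

def psi(baseline, live, bins=10):
    """Population Stability Index: sum over bins of
    (p_live - p_base) * ln(p_live / p_base)."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)
    p_base = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    p_live = np.clip(live_counts / live_counts.sum(), 1e-6, None)
    return float(np.sum((p_live - p_base) * np.log(p_live / p_base)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # feature values at training time
live = rng.normal(0.3, 1.0, 10_000)      # shifted values seen in production
print(psi(baseline, live))               # PSI > 0.2 usually means "investigate"
```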
Value Learning
Modeling diverse human values and incorporating them into agent behavior through debate, simulation, and deliberation.
Policy & Governance
Collaborating on global norms, safety benchmarks, and transparency mechanisms for foundation model regulation.
Building trustworthy AI is a collective responsibility.
Yazhvin is open-sourcing its alignment tools and publishing its safety evaluations. Join us in shaping a secure AI future.