AI Safety & Ethics

Ensuring intelligent systems are aligned, interpretable, and beneficial to all.

Our Mission

As AI systems gain capability, ensuring they behave as intended becomes one of the defining challenges of our time. Yazhvin's Safety & Ethics research bridges machine learning, philosophy, policy, and human factors to design AI that is robust, understandable, and aligned with human values.

Core Research Areas

Scalable Alignment

Training methods and reward modeling that align large models with human intent, including reinforcement learning from human feedback (RLHF) and preference learning.
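
For a concrete flavor of preference learning, here is a minimal sketch of a Bradley-Terry-style reward-model loss in PyTorch. The toy linear reward model and random preference pairs are stand-ins for illustration, not Yazhvin's actual tooling.

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_model, chosen, rejected):
        # Bradley-Terry pairwise loss: push the reward of the human-preferred
        # response above the reward of the rejected one.
        r_chosen = reward_model(chosen)      # shape: (batch,)
        r_rejected = reward_model(rejected)  # shape: (batch,)
        # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
        # response reliably outscores the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy usage: a linear "reward model" over fixed-size feature vectors.
    model = torch.nn.Linear(16, 1)
    reward = lambda x: model(x).squeeze(-1)
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
    preference_loss(reward, chosen, rejected).backward()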

Interpretability

Techniques to understand internal mechanisms of neural networks — from activation tracing to causal attribution.
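
As an illustration of activation tracing, the sketch below uses PyTorch forward hooks to record the intermediate activations of a small network during a forward pass; the toy architecture and module names are assumptions made for the example.

    import torch
    import torch.nn as nn

    # A toy two-layer network standing in for a model under study.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    activations = {}

    def make_hook(name):
        # Record a detached copy of each module's output on the forward pass.
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            module.register_forward_hook(make_hook(name))

    model(torch.randn(1, 16))
    for name, act in activations.items():
        print(name, tuple(act.shape))  # e.g. 0 (1, 32) then 2 (1, 4)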

Robustness & Red Teaming

Stress-testing systems for adversarial behavior, out-of-distribution generalization, and catastrophic failures.
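
One of the simplest adversarial stress tests is the fast gradient sign method (FGSM): perturb an input in the direction that most increases the model's loss and check whether the prediction flips. The linear classifier below is a hypothetical stand-in, not one of the systems under evaluation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fgsm(model, x, label, eps=0.03):
        # Nudge the input along the sign of the loss gradient, bounded by
        # eps per coordinate, to maximally increase the classification loss.
        x = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x), label).backward()
        return (x + eps * x.grad.sign()).detach()

    # Toy usage: the perturbed input should score a higher loss.
    model = nn.Linear(16, 3)
    x, label = torch.randn(1, 16), torch.tensor([0])
    x_adv = fgsm(model, x, label)
    print(F.cross_entropy(model(x), label).item(),
          F.cross_entropy(model(x_adv), label).item())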

Responsible Deployment

Developing frameworks for safe rollouts, impact assessments, and continual monitoring in real-world contexts.
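
In practice, continual monitoring often reduces to watching for distribution drift between deployment-time data and live traffic. The sketch below computes a population stability index (PSI), one common drift score; the 0.2 alert threshold is a widely used rule of thumb, and the data here is synthetic.

    import numpy as np

    def psi(expected, observed, bins=10, eps=1e-6):
        # Population stability index between a baseline sample and a live
        # sample, computed over shared histogram bins.
        edges = np.histogram_bin_edges(expected, bins=bins)
        e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
        o = np.histogram(observed, bins=edges)[0] / len(observed) + eps
        return float(np.sum((o - e) * np.log(o / e)))

    # Toy usage: live data whose mean has shifted relative to the baseline.
    baseline = np.random.normal(0.0, 1.0, 10_000)
    live = np.random.normal(0.5, 1.0, 10_000)
    score = psi(baseline, live)
    print(score, "drift" if score > 0.2 else "stable")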

Value Learning

Modeling and incorporating diverse human values into agent behavior through debate, simulation, and deliberation.

Policy & Governance

Collaborating on global norms, safety benchmarks, and transparency mechanisms for foundation model regulation.

Building trustworthy AI is a collective responsibility.

Yazhvin is open-sourcing its alignment tools and publishing its safety evaluations. Join us in shaping a secure AI future.