Technical Blogs

Insights and tutorials on AI4Science, LLM Serving, and cutting-edge research

Deep dives into technical concepts, research findings, and practical implementations

LLM Serving · Featured · Part 1

DeepSeek Model Structure Analysis: MLA, MTP, and MoE Deep Dive

📚 DeepSeek-MoE Inference Series · Part 1 of 5

22 min read

A comprehensive analysis of the core components of the DeepSeek architecture: Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP), and Mixture of Experts (MoE). The post examines the technical foundations behind DeepSeek's high-performance inference, with detailed architectural and implementation analysis; a toy MoE routing sketch follows the references below.

DeepSeek · MLA · MTP · MoE · Model Architecture · LLM Serving · Technical Analysis

🔗 Key References:

Multi-head Latent Attention (MLA): technical deep dive into the MLA architecture

Multi-Token Prediction (MTP): understanding the MTP implementation and its benefits

Mixture of Experts (MoE): MoE architecture and scaling strategies
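
To ground the MoE component before the full post, here is a minimal top-k routing sketch in PyTorch. It illustrates the general technique (a learned gate picks k experts per token and mixes their outputs), not DeepSeek's actual implementation; the class name and all dimensions are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer with top-k gating (illustrative only)."""

    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, dim]
        probs = F.softmax(self.gate(x), dim=-1)             # routing probabilities
        weight, expert_idx = probs.topk(self.k, dim=-1)     # k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)  # renormalize mixture weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():  # run expert e only on tokens routed to it
                out[token_ids] += weight[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

tokens = torch.randn(4, 64)
print(TopKMoE(dim=64)(tokens).shape)  # torch.Size([4, 64])
```

DeepSeek's design layers refinements such as fine-grained expert segmentation and shared experts on top of this basic routing pattern.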

LLM Serving · Featured · Series Overview

DeepSeek-MoE Inference: Optimizing Mixture of Experts for Production

📚 DeepSeek-MoE Inference Series · Series Overview (5 parts)

18 min read

A deep dive into DeepSeek-MoE inference optimization, covering expert routing strategies, memory management, and performance tuning for production deployments. Learn advanced techniques for scaling Mixture-of-Experts models with vLLM and distributed serving architectures.

DeepSeek-MoE · Mixture of Experts · LLM Serving · vLLM · Expert Routing · Performance Optimization
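
As a concrete starting point for the serving material, the sketch below runs offline inference through vLLM's Python API. The model name and parallelism degree are placeholders (any MoE checkpoint you can load works), and expert-parallel options vary across vLLM versions, so treat this as a minimal baseline rather than a tuned production setup.

```python
from vllm import LLM, SamplingParams

# Placeholders: swap in your own checkpoint and GPU count.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed example MoE checkpoint
    tensor_parallel_size=2,                # shard weights across 2 GPUs
    trust_remote_code=True,                # DeepSeek repos ship custom modeling code
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain expert routing in one paragraph."], params)
print(outputs[0].outputs[0].text)
```
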
LLM Serving · Featured

vLLM Deep Dive: Anatomy of High-Performance LLM Serving

15 min read

A comprehensive deep dive into the anatomy of vLLM: its architecture, optimization techniques, and performance characteristics. Learn how vLLM achieves high-throughput serving through PagedAttention, continuous batching, and advanced memory management.

vLLM · LLM Serving · PagedAttention · Performance Optimization · Memory Management
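
To make the PagedAttention idea concrete, here is a toy block-table allocator in Python. It captures only the core concept (the KV cache is carved into fixed-size blocks that a sequence claims on demand, so no contiguous region is reserved up front); all names are invented, and this is not vLLM's actual allocator.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks."""

    def __init__(self, free_blocks: list):
        self.free = free_blocks   # shared pool of physical block ids
        self.blocks = []          # logical order -> physical block id
        self.n_tokens = 0

    def append_token(self):
        """Reserve space for one more token; return (physical_block, offset)."""
        if self.n_tokens % BLOCK_SIZE == 0:      # last block full (or none yet)
            self.blocks.append(self.free.pop())  # claim a fresh block lazily
        offset = self.n_tokens % BLOCK_SIZE
        self.n_tokens += 1
        return self.blocks[-1], offset

pool = list(range(1024))   # physical blocks shared by all sequences
seq = BlockTable(pool)
for _ in range(33):        # 33 tokens occupy 3 blocks: 16 + 16 + 1
    block_id, offset = seq.append_token()
print(seq.blocks, seq.n_tokens)  # [1023, 1022, 1021] 33
```

Because blocks are claimed per token rather than preallocated for a maximum sequence length, waste is bounded by at most one partially filled block per sequence, which is what lets vLLM pack many more requests into the same GPU memory.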
