airMeng

AI Framework Engineer

GitHub

About

AI Framework Engineer in Shanghai

Work Experience

AI Framework Engineer

Intel

Jan 2020 - Present

Projects

LLAMA.cpp SYCL

Developed a SYCL/DPCPP backend for llama.cpp, modeled on the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation. Work with the community owners and respond to Intel-related issues. Published blog: Run LLMs on Intel GPUs Using llama.cpp https://medium.com/intel-analytics-software/run-llm-on-intel-gpus-using-llama-cpp-579d3595954e

Intel-Extension-For-Transformers

Extended the Hugging Face transformers APIs for Transformer-based models, for collaboration with the ecosystem. Wrote highly optimized, hand-written x86 assembly [kernels](https://github.com/intel/neural-speed/tree/main) for Intel hardware, targeting advanced compression algorithms, especially for LLMs. Developed [GPU kernels](https://github.com/intel/neural-speed/tree/xetla) for Intel client GPUs using efficient low-level SYCL programming (ESIMD, the "Explicit SIMD" SYCL extension).

Intel Neural Compressor

Python package for SOTA low-bit LLM quantization. Worked on the ONNX Runtime backend, which was eventually integrated by the ONNX community.

Intel-Extension-for-PyTorch

Working on weight-only quantization optimization for Intel client GPUs. Enabled on Windows and Linux, achieving a geomean >2x performance gain compared with the baseline F16 implementation. Blogs: Llama2 support on MTL <https://www.intel.com/content/www/us/en/developer/articles/technical/weight-only-quantization-in-llm-inference.html> Llama3 day0 support on MTL iGPU <https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html>

Awards

2023 Intel China Employee of the Year (EOY)

Jan 2023

Publications

Method and apparatus for accelerating deep learning inference based on hw-aware sparsity pattern

US Patent

Jan 2022

HW-aware sparsity patterns

Methods and apparatus to perform artificial intelligence-based sparse computation based on hybrid pattern and dynamic encoding

US Patent

Jan 2022

Method and apparatus for optimizing inference of deep neural networks

US Patent

Jan 2021

HW-aware cost model to predict performance for quantization recipes