Shanghai
AI Framework Engineer in Shanghai
AI Framework Engineer
Intel
Jan 2020 - Present
llama.cpp SYCL
Developed the SYCL/DPC++ backend for llama.cpp, ported from the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation. Co-worked with the community maintainers and responded to Intel-related issues. Published blog: Run LLMs on Intel GPUs Using llama.cpp <https://medium.com/intel-analytics-software/run-llm-on-intel-gpus-using-llama-cpp-579d3595954e>
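The blog above walks through the native llama.cpp CLI; as an illustrative sketch only (not part of the original work description), the same SYCL backend can also be reached from Python via the third-party llama-cpp-python bindings. The model path and prompt below are placeholders.

```python
# Build the bindings against the SYCL backend (after sourcing the oneAPI environment):
#   CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
#       pip install llama-cpp-python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the Intel GPU; the model path is a placeholder.
llm = Llama(model_path="llama-2-7b.Q4_0.gguf", n_gpu_layers=-1)

output = llm("Q: What is SYCL? A:", max_tokens=64)
print(output["choices"][0]["text"])
```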
Intel-Extension-For-Transformers
Extended the Hugging Face Transformers APIs for Transformer-based models to enable collaboration with the ecosystem. Wrote highly optimized, hand-written x86 assembly [kernels](https://github.com/intel/neural-speed/tree/main) for Intel hardware, targeting advanced compression algorithms, especially for LLMs. Developed [GPU kernels](https://github.com/intel/neural-speed/tree/xetla) for Intel client GPUs based on SYCL efficient low-level programming (ESIMD, the "Explicit SIMD" SYCL extension).
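As a usage sketch of the extended Transformers API (modeled on the project's public examples; the model id and generation settings are illustrative), 4-bit weight-only inference looks roughly like this:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# load_in_4bit selects the weight-only quantized path backed by the optimized kernels.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```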
Intel Neural Compressor
Python package for SOTA low-bit LLM quantization. Worked on the ONNX Runtime backend, which was eventually adopted by the ONNX community.
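A minimal post-training quantization sketch with Intel Neural Compressor on an ONNX model; the model path and calibration loader are placeholders, and config details vary slightly across releases:

```python
import numpy as np
from neural_compressor import PostTrainingQuantConfig, quantization

class DummyCalibDataloader:
    """Placeholder calibration loader; yield (input, label) pairs that match the
    real model's input signature. INC expects a batch_size attribute and __iter__."""
    batch_size = 1
    def __iter__(self):
        for _ in range(10):
            yield np.random.rand(1, 3, 224, 224).astype(np.float32), None

# Default static INT8 post-training quantization of an ONNX model (path is a placeholder).
conf = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit("model_fp32.onnx", conf, calib_dataloader=DummyCalibDataloader())
q_model.save("model_int8.onnx")
```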
Intel-Extension-for-PyTorch
Worked on weight-only-quantization optimization for Intel client GPUs, enabled on both Windows and Linux, achieving geomean >2x performance gains compared with the standard FP16 implementation. Blogs: Llama2 support on MTL <https://www.intel.com/content/www/us/en/developer/articles/technical/weight-only-quantization-in-llm-inference.html>; Llama3 day-0 support on the MTL iGPU <https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html>
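A hedged sketch of running an FP16 Llama checkpoint on the Intel GPU ("xpu") device with Intel Extension for PyTorch; the weight-only-quantization config itself is omitted because its exact API differs between IPEX releases, and the model id is illustrative:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("xpu")

# Generic IPEX optimization pass; the weight-only-quantization path additionally
# swaps linear layers for low-bit kernels via a quantization config (omitted here).
model = ipex.optimize(model, dtype=torch.float16)

inputs = tokenizer("Once upon a time,", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```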
2023 Intel China Employee of the Year (EOY)
Jan 2023
Method and apparatus for accelerating deep learning inference based on HW-aware sparsity pattern
US Patent
Jan 2022
HW-aware sparsity patterns
Methods and apparatus to perform artificial intelligence-based sparse computation based on hybrid pattern and dynamic encoding
US Patent
Jan 2022
Method and apparatus for optimizing inference of deep neural networks
US Patent
Jan 2021
HW-aware cost model to predict performance for quantization recipes