Job Description
We are looking for a senior-level engineer to focus on high-performance model inference across PC and Android platforms. The role centers on optimizing LLM/multimodal models for low latency and efficient memory use, implementing C++ runtimes, applying advanced acceleration techniques, and collaborating closely with research teams to bring optimized inference solutions into production environments.
Key Responsibilities
• Design and implement optimized model inference pipelines for PC (x86-64, Intel/AMD) and Android (ARM).
• Apply quantization, operator/kernel fusion, memory optimization, and runtime scheduling techniques (see the quantization sketch after this list).
• Work with at least one major inference stack: llama.cpp, Qualcomm AI SDKs (QNN/QAIRT/QSDK), or MediaTek NeuroPilot; experience with OpenVINO, Ryzen AI, or other inference SDKs is a plus.
• Profile and tune CPU/GPU/NPU performance using industry-standard profiling tools.
• Collaborate with model researchers to translate new methods into efficient runtime implementations.
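To make the quantization responsibility concrete, below is a minimal C++ sketch of symmetric per-tensor int8 quantization, the starting point for most on-device quantization schemes. All names are illustrative and not tied to any particular SDK.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor int8 quantization: maps floats in
    // [-max_abs, +max_abs] onto [-127, 127] with a single scale factor.
    struct QuantizedTensor {
        std::vector<int8_t> data;
        float scale;  // dequantized value = data[i] * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& src) {
        float max_abs = 0.0f;
        for (float v : src) max_abs = std::max(max_abs, std::fabs(v));

        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        q.data.reserve(src.size());
        for (float v : src) {
            int r = static_cast<int>(std::lround(v / q.scale));
            q.data.push_back(static_cast<int8_t>(std::clamp(r, -127, 127)));
        }
        return q;
    }

    float dequantize_at(const QuantizedTensor& q, std::size_t i) {
        return q.data[i] * q.scale;
    }

Production runtimes typically refine this into per-channel or group-wise scales (as in llama.cpp's quantization formats), but the scale-and-clamp core is the same.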
Required Qualifications
• Master’s degree or above, with 3+ years of experience in model inference, runtime engineering, or performance optimization.
• Strong C++ programming skills; familiarity with Android NDK/JNI is a plus (see the JNI sketch after this list).
• Solid understanding of transformer architectures, inference mechanisms, and acceleration methods.
• Hands-on experience with at least one of: llama.cpp, Qualcomm AI SDKs (QNN/QAIRT/QSDK), or MediaTek NeuroPilot; experience with OpenVINO, Ryzen AI, or other inference SDKs is a plus.
• Ability to read technical papers and documentation in English; strong English communication skills preferred.
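As context for the NDK/JNI item above, a native inference engine is usually exposed to an Android app through a thin JNI bridge. A minimal sketch, assuming a hypothetical Kotlin class com.example.inference.NativeRunner with an external fun generate(prompt: String): String:

    #include <jni.h>
    #include <string>

    // Hypothetical entry point for a Kotlin class
    // com.example.inference.NativeRunner declaring:
    //     external fun generate(prompt: String): String
    extern "C" JNIEXPORT jstring JNICALL
    Java_com_example_inference_NativeRunner_generate(JNIEnv* env, jobject /*thiz*/,
                                                     jstring prompt) {
        std::string input;
        if (const char* utf = env->GetStringUTFChars(prompt, nullptr)) {
            input.assign(utf);
            env->ReleaseStringUTFChars(prompt, utf);
        }

        // Placeholder for the real inference call; an actual runtime would
        // hold model/context state behind a long-lived handle, not per call.
        std::string output = "echo: " + input;

        return env->NewStringUTF(output.c_str());
    }

In practice the model and context are created once, stored behind a long-lived handle (e.g., a jlong holding a native pointer), and reused across calls rather than rebuilt per request.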
Preferred
• Experience with ONNX Runtime, TVM, XNNPACK, or mobile performance tools.
• Contributions to open-source inference or optimization frameworks.