Alay Shah

Machine Learning Software Engineer

TensorOpera AI (formerly FEDML, Inc)

Biography

Alay Shah is a Machine Learning Software Engineer at TensorOpera AI, currently building an AI/ML platform that facilitates distributed execution and deployment for GenAI tasks. Anything at the intersection of AI/ML and distributed systems excites him the most!

Interests

Large Language Models
Distributed Systems
Deep Learning
Computer Vision
Artificial Intelligence
Machine Learning

Education

Master’s in Computer Science, 2021
University of Southern California
Post Graduate Diploma in Data Science, 2018
International Institute of Information Technology
Bachelor’s in Mechanical Engineering, 2017
Gujarat Technological University

Experience

Machine Learning Software Engineer

TensorOpera AI (formerly FEDML, Inc)

September 2023 – Present, California

Building an AI/ML platform that facilitates distributed execution and deployment for GenAI tasks.
Led the development of a hardware-agnostic orchestration and scheduling layer that enables spot jobs and model deployments on a decentralized, geo-distributed compute plane.
Projects where I’ve made significant contributions: Launch, Deploy, Compute, ScaleLLM, Storage
Technologies: Python, Java, SQL, Git, Docker, Kubernetes, PyTorch, TensorRT, Redis, R2, Postgres, MQTT, Bash, Jenkins, Jira, Telemetry and Observability

Backend Software Engineer

Palantir Technologies, Inc.

July 2021 – September 2023, California

Designed and developed a Datadog-like in-house observability service for cloud resources, resulting in a 75% reduction in downtime and a 50% improvement in response time.
Contributed to developing systems for the automatic creation of runtime environments. (Patent filed, currently pending approval)
Developed a service to automatically transition assets to a multi-tenant environment, reducing human effort by 90%.
Contributed to improving authorization frameworks for data protection in multi-tenant setups.
Technologies: Java, Python, Golang, Bash, SQL, Git, Docker, Kubernetes, AWS, Observability

Research Assistant

USC Viterbi School of Engineering

August 2020 – May 2021, California

Advised by Professor Salman Avestimehr
Research Areas: Distributed Systems, Deep Learning, Computer Vision, Federated Learning, Machine Learning
Projects: FedCV, FedSegment
Technologies: Python, PyTorch, Communication Protocols

Software Engineer Intern

Amazon Web Services

May 2020 – August 2020, Washington

Developed a visualization dashboard for forecast reports, aiding evaluation and improvement of ML model forecasts through user-friendly, scenario-driven charts with customizable options.
Technologies: Vue, JavaScript, Java

Skills

Programming Languages

Python, Java, Golang, C/C++, Bash, SQL

Tools & Technologies

Docker, Kubernetes, AWS, Git, Redis, R2, Postgres, MQTT, Telemetry & Observability

Machine Learning

PyTorch, TensorFlow, Keras, TensorRT

Featured Publications

Yuhang Yao, Han Jin, Alay Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

July, 2024

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

In this work, we optimized both the engine and the platform. Our study reveals that, with the growing complexity of LLM applications, platform latency will be the major bottleneck. Our take is that, instead of optimizing local inference speed, industrial research should focus more on simplifying the serving gateway and optimizing the platform.

Chaoyang He, Alay Shah, Zhenheng Tang, Di Fan, Adarshan Naiynar Sivashunmugam, Keerti Bhogaraju, Mita Shimpi, Li Shen, Xiaowen Chu, Mahdi Soltanolkotabi, Salman Avestimehr

November, 2021 (2022 AAAI Conference on Artificial Intelligence)

FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks

In this work, we propose FedCV, an easy-to-use federated learning framework for diverse computer vision tasks, including image classification, image segmentation, and object detection.

Patents

Zsombor Jancso, Akshay Agrawal, Alay Shah, Anshul Ajit Lodha, David Cohen, Ilya Nepomnyashchiy, Justin Cassidy, Jessie Anderson, Michael Glazer, Rory Grant, Vibha Kathuria, Volodymyr Kot, Xinyi Fu

Nov 29, 2023

Systems and methods to automatically create runtime environments.

In some examples, methods and systems to automatically create runtime environments are provided. For example, a method includes: receiving a request to create a runtime environment; automatically generating a cluster of nodes based on the request, wherein the cluster of nodes are configured to run one or more containerized applications for the runtime environment; automatically applying a manifest onto the cluster of nodes, wherein the manifest includes one or more configurations associated with the runtime environment; and automatically deploying one or more software products into the cluster of nodes.

Projects

FedSegment

A Federated Learning Framework for Image Segmentation

Distributed Healthcare Resource Allocation System with Dynamic Offloading

Implemented a distributed computational-offloading system based on a client-server architecture using UDP and TCP sockets.
