Alay Shah

Machine Learning Software Engineer

TensorOpera AI (formerly FEDML, Inc)

Biography

Alay Shah is a Machine Learning Software Engineer at TensorOpera AI, currently building an AI/ML platform that facilitates distributed execution and deployment for GenAI tasks. Anything at the intersection of AI/ML and distributed systems excites him the most!

Interests
  • Large Language Models
  • Distributed Systems
  • Deep Learning
  • Computer Vision
  • Artificial Intelligence
  • Machine Learning
Education
  • Master’s in Computer Science, 2021

    University of Southern California

  • Post Graduate Diploma in Data Science, 2018

    International Institute of Information Technology

  • Bachelor’s in Mechanical Engineering, 2017

    Gujarat Technological University

Experience

TensorOpera AI (formerly FEDML, Inc)
Machine Learning Software Engineer
September 2023 – Present, California
  • Building an AI/ML platform that facilitates distributed execution and deployment for GenAI tasks.
  • Led the development of a hardware-agnostic orchestration and scheduling layer that enables spot jobs and model deployments on a decentralized, geo-distributed compute plane.
  • Projects where I’ve made significant contributions: Launch, Deploy, Compute, ScaleLLM, Storage
  • Technologies: Python, Java, SQL, Git, Docker, Kubernetes, PyTorch, TensorRT, Redis, R2, Postgres, MQTT, Bash, Jenkins, Jira, Telemetry and Observability

Palantir Technologies, Inc.
Backend Software Engineer
July 2021 – September 2023, California
  • Designed and developed a Datadog-like in-house observability service for cloud resources, resulting in a 75% reduction in downtime and a 50% improvement in response time.
  • Contributed to developing systems for the automatic creation of runtime environments. (Patent filed, currently pending approval)
  • Developed a service to automatically transition assets to a multi-tenant environment, reducing human effort by 90%.
  • Contributed to improving authorization frameworks for data protection in multi-tenant setups.
  • Technologies: Java, Python, Golang, Bash, SQL, Git, Docker, Kubernetes, AWS, Observability

USC Viterbi School of Engineering
Research Assistant
August 2020 – May 2021, California
  • Advised by Professor Salman Avestimehr
  • Research Areas: Distributed Systems, Deep Learning, Computer Vision, Federated Learning, Machine Learning
  • Projects: FedCV, FedSegment
  • Technologies: Python, PyTorch, Communication Protocols

Amazon Web Services
Software Engineer Intern
May 2020 – August 2020, Washington
  • Developed a visualization dashboard for forecast reports, aiding evaluation and improvement of ML model forecasts through user-friendly, scenario-driven charts with customizable options.
  • Technologies: Vue, JavaScript, Java

Skills

Programming Languages

Python, Java, Golang, C/C++, Bash, SQL

Tools & Technologies

Docker, Kubernetes, AWS, Git, Redis, R2, Postgres, MQTT, Telemetry & Observability

Machine Learning

PyTorch, TensorFlow, Keras, TensorRT

Projects

FedSegment
A Federated Learning Framework for Image Segmentation

Distributed Healthcare Resource Allocation System with Dynamic Offloading
Implemented a distributed computational offloading system based on a client-server architecture using UDP and TCP sockets.

Contact