Leveraging Machine Learning for Improved Distributed System Performance

September 13, 2024
3:00pm EST
112 Nelson Auditorium, SEC Anderson
Speaker: Hifza Khalid - PhD Defense
Host: Alva Couch

Abstract

Our research focuses on improving the performance of large distributed systems using machine learning techniques. In the first project, we analyzed Linux performance data to recommend optimal settings for network applications in a client-server setting, and found that a lack of diversity in the data limited our ability to make tuning recommendations for system configuration. In the second project, we used an Alibaba dataset to model batch task scheduling, achieving high prediction accuracy for task arrivals, resource requirements, and task lifetimes, although lifetime predictions were less accurate when CPU and memory data were combined. The third project applied deep reinforcement learning to task scheduling, aiming to maximize cloud resource utilization by strategically delaying batch tasks and consolidating them onto fewer machines. Our results showed that REINFORCE significantly improved CPU and memory utilization and reduced resource fragmentation across machines compared to traditional methods, while DDQN, though more sample-efficient, performed worse under high load due to dropped jobs and sub-optimal planning.