Leveraging Machine Learning for Improved Distributed System Performance
PhD Defense

Abstract
Our research focuses on enhancing the performance of large distributed systems through machine learning techniques. In the first project, we analyzed Linux performance data to recommend optimal configuration settings for network applications in a client-server setting; we found that a lack of diversity in the collected data limited our ability to make reliable tuning recommendations. In the second project, we used an Alibaba cluster dataset to model batch task scheduling, achieving high prediction accuracy for task arrivals, resource requirements, and task lifetimes, although lifetime predictions were less accurate when CPU and memory data were combined. The third project applied deep reinforcement learning to task scheduling, aiming to maximize cloud resource utilization by strategically delaying batch tasks and consolidating them onto fewer machines. Our results show that REINFORCE significantly improved CPU and memory utilization and reduced resource fragmentation across machines compared to traditional methods, while DDQN, although more sample-efficient, performed worse under high load due to dropped jobs and suboptimal planning.
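To make the third project's approach concrete, the sketch below shows REINFORCE on a toy consolidation problem: a linear softmax policy places each arriving task on one of a few machines and is rewarded for packing work onto machines that are already in use, with a penalty when a placement overflows capacity (a dropped task). The environment, features, reward shaping, and all names are hypothetical simplifications for illustration only, not the simulator, state space, or reward design used in the thesis.

# Minimal REINFORCE sketch for a toy task-consolidation scheduler.
# Everything here is a simplified, hypothetical stand-in for the thesis setup.
import numpy as np

rng = np.random.default_rng(0)

N_MACHINES = 4          # machines the policy can place a task on
CAPACITY = 1.0          # normalized CPU capacity per machine
EPISODE_LEN = 30        # tasks scheduled per episode
LR = 0.05
GAMMA = 0.99

# Linear softmax policy: features are current machine loads plus the task demand.
theta = np.zeros((N_MACHINES + 1, N_MACHINES))

def policy(state):
    """Return placement probabilities over machines for the current task."""
    logits = state @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def run_episode():
    loads = np.zeros(N_MACHINES)
    trajectory = []                        # (state, action, reward) tuples
    for _ in range(EPISODE_LEN):
        demand = rng.uniform(0.05, 0.3)    # toy CPU demand of the arriving task
        state = np.concatenate([loads, [demand]])
        probs = policy(state)
        action = rng.choice(N_MACHINES, p=probs)
        # Reward consolidation: prefer machines already in use,
        # penalize placements that exceed capacity (task is dropped).
        if loads[action] + demand <= CAPACITY:
            reward = 1.0 if loads[action] > 0 else 0.0
            loads[action] += demand
        else:
            reward = -1.0                  # drop / fragmentation penalty
        trajectory.append((state, action, reward))
    return trajectory

for episode in range(500):
    traj = run_episode()
    # Compute discounted returns G_t and apply the REINFORCE policy-gradient update.
    G = 0.0
    for state, action, reward in reversed(traj):
        G = reward + GAMMA * G
        probs = policy(state)
        grad_log = -np.outer(state, probs)  # gradient of log softmax(a|s) ...
        grad_log[:, action] += state        # ... w.r.t. theta
        theta += LR * G * grad_log

The Monte Carlo return in this update is what makes REINFORCE sample-hungry compared with DDQN's replay-based learning, which matches the trade-off reported above between sample efficiency and scheduling quality under high load.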