BIG DATA TECHNOLOGIES LECTURE 7: TASK SCHEDULING FOR EFFICIENT EXECUTION
Assoc. Prof., Ph.D. Habil. Marc FRÎNCU
marc.frincu@e-uvt.ro

MOTIVATION
In a large parallel/distributed system there are usually M processes (threads) and N interconnected cores/processors
- M >> N

ISSUES
How to plan each task on the available resources s.t. it:
- Finishes as soon as possible
- Finishes by a deadline
- Maximizes its throughput
- Is cost effective (cloud)
- Is power efficient?

EXECUTION CYCLE OF A TASK

HOW DO WE PLAN A TASK?
Scheduling decisions are taken when a task changes state:
- Between running and waiting
- Between running and ready
- Between waiting and ready
Two ways:
- Preemptive: processes can be stopped and later resumed
- Nonpreemptive (cooperative): processes execute without interruption
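The difference between the two ways can be sketched on a single core. This is a minimal illustration, not a real scheduler: the task names, the burst lengths, and the round-robin policy with a fixed `quantum` are all assumptions chosen for the example.

```python
from collections import deque

def run_nonpreemptive(bursts):
    """Run each task to completion in submission order; return finish times."""
    clock, finish = 0, {}
    for name, burst in bursts.items():
        clock += burst              # the task keeps the core until it is done
        finish[name] = clock
    return finish

def run_preemptive(bursts, quantum):
    """Round-robin: stop a task after `quantum` time units, resume it later."""
    remaining = dict(bursts)
    queue = deque(bursts)           # ready queue, in submission order
    clock, finish = 0, {}
    while queue:
        name = queue.popleft()
        used = min(quantum, remaining[name])
        clock += used
        remaining[name] -= used
        if remaining[name] == 0:
            finish[name] = clock    # task completed within this time slice
        else:
            queue.append(name)      # preempted: back to the ready queue
    return finish

# Hypothetical workload: task name -> CPU time needed.
bursts = {"A": 5, "B": 2, "C": 1}
print(run_nonpreemptive(bursts))
print(run_preemptive(bursts, quantum=2))
```

With preemption the short tasks B and C no longer wait for the long task A to finish, at the price of A finishing later.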

HOW TO CHOOSE THE NEXT TASK?
CPU usage
- Keep the CPU/core as busy as possible
Throughput
- Number of processed tasks per second
- High data volumes
Turnaround time
- Time difference between starting a task and finishing it
Waiting time
- Time between the submission of a task and its actual execution start
Response time
- Time between starting a task and its first response
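Some of these criteria can be computed for a concrete schedule. The sketch below assumes a single core, first-come-first-served order, and a hypothetical list of `(submit_time, burst)` pairs; it measures throughput and CPU usage over the interval from time 0 to the last completion.

```python
def fcfs_metrics(tasks):
    """tasks: list of (submit_time, burst), sorted by submit_time, one core."""
    clock, busy, waits = 0, 0, []
    for submit, burst in tasks:
        start = max(clock, submit)      # core may sit idle until the task arrives
        waits.append(start - submit)    # waiting time: submission -> execution start
        clock = start + burst
        busy += burst
    makespan = clock                    # time at which the last task finishes
    return {
        "avg_waiting": sum(waits) / len(waits),
        "throughput": len(tasks) / makespan,  # tasks per time unit
        "cpu_usage": busy / makespan,         # fraction of time the core was busy
    }

# Hypothetical workload (time units are arbitrary).
tasks = [(0, 4), (1, 3), (10, 2)]
metrics = fcfs_metrics(tasks)
print(metrics)
```

A scheduler typically cannot optimize all criteria at once: reordering tasks to reduce the average waiting time, for example, may delay an individual task.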

EXECUTION MODES (SIMPLIFIED)
Task
- One or more processes (threads) made up of instructions (operations)
- They are either independent or dependent (workflows)
- They have priorities, deadlines, etc.
- Can be preemptive or nonpreemptive
Resource
- Set of physical machines and their network interconnect
- Machines
  - Dedicated
  - Parallel
    - Identical (a task execution takes the same time on any machine)
    - Uniform (a task will execute 2x faster on a 2x larger machine)
    - Unrelated (execution time unrelated to CPU speed)
Planning
- Function that maps tasks to resources
- Minimizes a given objective (set)
- Time, energy, throughput, cost, lateness
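The three parallel machine models and the idea of planning as a mapping can be sketched as follows. All task names, machine names, work amounts, speeds, and table entries are hypothetical, and the makespan (finish time of the most loaded machine) stands in for the "time" objective.

```python
# Hypothetical tasks and machines.
work = {"t1": 8, "t2": 4}        # amount of work per task
speed = {"m1": 1, "m2": 2}       # relative machine speed
table = {                        # arbitrary per-pair times (unrelated model)
    ("t1", "m1"): 8, ("t1", "m2"): 9,
    ("t2", "m1"): 1, ("t2", "m2"): 4,
}

def identical(task, machine):
    return work[task]                    # same time on any machine

def uniform(task, machine):
    return work[task] / speed[machine]   # 2x faster machine -> half the time

def unrelated(task, machine):
    return table[(task, machine)]        # no relation to machine speed

def makespan(mapping, exec_time):
    """mapping: task -> machine. Returns the load of the busiest machine."""
    load = {}
    for task, machine in mapping.items():
        load[machine] = load.get(machine, 0) + exec_time(task, machine)
    return max(load.values())

# A plan is a function (here a dict) that maps tasks to resources.
mapping = {"t1": "m2", "t2": "m1"}
for model in (identical, uniform, unrelated):
    print(model.__name__, makespan(mapping, model))
```

The same mapping yields a different objective value under each machine model, which is why the model has to be fixed before schedules can be compared.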

OPTIMAL VS. SUBOPTIMAL SCHEDULING
Optimal scheduling is achieved when a certain optimality criterion is minimized (global minimum reached)
NP-complete problem
- Optimal solutions exist only for a small number of problems
  - Linear programming
  - Few resources and tasks
- Scheduling
  - Greedy heuristics
- Heuristic algorithms are usually suboptimal
  - They must provide a solution close to the optimal
  - ε-optimal = suboptimal / optimal
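The gap between a greedy heuristic and the optimum can be measured on a tiny instance. The sketch below assumes identical machines and the makespan as the criterion. The greedy rule (longest task first onto the least-loaded machine) is one common choice rather than the only one, and the exhaustive search is feasible only because the instance is very small.

```python
from itertools import product

def greedy_makespan(durations, n_machines):
    """Greedy: longest task first, onto the currently least-loaded machine."""
    loads = [0] * n_machines
    for d in sorted(durations, reverse=True):
        loads[loads.index(min(loads))] += d
    return max(loads)

def optimal_makespan(durations, n_machines):
    """Try every assignment of tasks to machines (exponential in task count)."""
    best = sum(durations)  # upper bound: everything on one machine
    for assignment in product(range(n_machines), repeat=len(durations)):
        loads = [0] * n_machines
        for d, m in zip(durations, assignment):
            loads[m] += d
        best = min(best, max(loads))
    return best

# Hypothetical task durations on 2 machines.
durations = [3, 3, 2, 2, 2]
greedy = greedy_makespan(durations, 2)
optimal = optimal_makespan(durations, 2)
ratio = greedy / optimal   # suboptimal / optimal, as on the slide
print(greedy, optimal, ratio)
```

Here the greedy schedule needs 7 time units while the optimum ({3, 3} on one machine, {2, 2, 2} on the other) needs 6, giving a ratio of 7/6.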

OFFLINE VS. ONLINE SCHEDULING
OFFLINE
- All tasks are allocated to resources before their execution starts
  - Allocation cannot be modified later
  - Relies on an analytical solution and on historical data + status of the resources
  - Deterministic process
  - All tasks are known before scheduling
ONLINE
- During runtime, as tasks arrive
  - Considers the current state of the system
  - Nondeterministic process
  - Optimal scheduling is not possible: we do not know the future processes
Dynamic
- Tasks are rescheduled
  - If the selected resource cannot meet the task's objectives, move the task to another resource
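Online scheduling can be sketched as a loop that places each task when it arrives, using only the current state of the machines. The policy below (send the task to the machine that becomes free earliest) and the arrival list are assumptions made for illustration; the sketch does not cover dynamic rescheduling.

```python
import heapq

def online_schedule(arrivals, n_machines):
    """arrivals: (arrival_time, duration) pairs in arrival order.
    Returns one (machine, start, finish) tuple per task."""
    free_at = [(0, m) for m in range(n_machines)]  # (time machine is free, id)
    heapq.heapify(free_at)
    plan = []
    for arrival, duration in arrivals:
        # Decide now, knowing only the current state, not the future tasks.
        t_free, machine = heapq.heappop(free_at)
        start = max(t_free, arrival)
        finish = start + duration
        plan.append((machine, start, finish))
        heapq.heappush(free_at, (finish, machine))
    return plan

# Hypothetical arrival stream on 2 machines.
arrivals = [(0, 5), (1, 2), (2, 4), (3, 1)]
plan = online_schedule(arrivals, 2)
for entry in plan:
    print(entry)
```

An offline scheduler would see all four tasks up front and could reorder them before anything starts; the online loop must commit to each placement as the task arrives.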

SCHEDULING VS. LOAD BALANCING
Scheduling
- Choosing tasks/resources s.t. several objectives are met
- NP-complete problem
Load balancing
- Choosing resources s.t. system load is balanced
- Bin-packing
  - NP-hard problem
  - How do we select tasks with different requirements s.t. their sum is close to the maximum usage of the resource?
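Because bin packing is NP-hard, heuristics are used in practice. The sketch below shows first-fit decreasing, one well-known heuristic that is not guaranteed to be optimal. The demands (for example, CPU percentages requested by tasks) and the capacity are hypothetical values.

```python
def first_fit_decreasing(demands, capacity):
    """Pack demands into as few bins (resources) as the heuristic manages.
    Returns a list of bins, each a list of the demands placed in it."""
    bins = []
    for d in sorted(demands, reverse=True):   # largest demand first
        if d > capacity:
            raise ValueError(f"demand {d} exceeds capacity {capacity}")
        for b in bins:
            if sum(b) + d <= capacity:        # first bin with enough room
                b.append(d)
                break
        else:
            bins.append([d])                  # no bin fits: open a new one
    return bins

# Hypothetical task requirements and per-resource capacity.
demands = [4, 8, 1, 4, 2, 1]
bins = first_fit_decreasing(demands, capacity=10)
print(bins)
```

In this instance both resources end up filled exactly to their capacity of 10, which is the situation the slide's question aims for.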

CLASSIFICATION OF SCHEDULING ALGORITHMS

COMPLEXITY OF THE SCHEDULING PROBLEM
[Figure: types of systems placed along two axes, heterogeneity degree and communication impact. Labels: single machines (multicore/multiprocessor/manycore), clusters, hybrid systems (CPU, GPU, FPGA), distributed systems/cloud.]

LECTURE SOURCES
- https://www.it.uu.se/edu/course/homepage/oskomp/vt07/lectures/scheduling_algorithms/handout.pdf
- http://web.info.uvt.ro/~mfrincu/frincu-thesis-en.pdf