CSCE 513 Computer Architecture, Lecture 18: Linux Performance II

Topics
  • Measuring process time
  • timeval, timespec
  • Improving performance
November 27, 2017

Linux: System Info
saluda> lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:               2393.830
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
saluda>
CSCE 513 Fall 2017

Control Panel, System and Sec…, System … (screenshot)


Task Manager
Ctrl+Alt+Delete, then select Task Manager

/proc
/usr/bin/procinfo
r2d2> procinfo
The program 'procinfo' is currently not installed. To run 'procinfo' please ask your administrator to install the package 'procinfo'

time
file times - number of seconds since Jan 1, 1970
time of day - struct timeval
  seconds
  microseconds
Functions:
  § gettimeofday
  § getrusage

timeval structure
struct timeval {
    __kernel_time_t      tv_sec;   /* seconds */
    __kernel_suseconds_t tv_usec;  /* microseconds */
};

Gettimeofday: wall time
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval tvalBefore, tvalAfter;
    gettimeofday(&tvalBefore, NULL);
    int i = 0;
    while (i < 10000) { i++; }
    gettimeofday(&tvalAfter, NULL);
    printf("Time in microseconds: %ld microseconds\n",
           ((tvalAfter.tv_sec - tvalBefore.tv_sec) * 1000000L
            + tvalAfter.tv_usec) - tvalBefore.tv_usec);
    return 0;
}
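The subtraction inside the printf call above can be factored into a small helper. The following is a minimal sketch, not part of the original slides; the names elapsed_usec and time_call_usec are hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed wall-clock time from *start to *end, in microseconds.
 * Hypothetical helper; the slides do the same arithmetic inline. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    long long sec  = (long long)end->tv_sec  - (long long)start->tv_sec;
    long long usec = (long long)end->tv_usec - (long long)start->tv_usec;
    return sec * 1000000LL + usec;   /* usec may be negative; the sum is still correct */
}

/* Time a caller-supplied function with gettimeofday.
 * Returns elapsed microseconds, or -1 if gettimeofday fails. */
long long time_call_usec(void (*work)(void))
{
    struct timeval before, after;

    if (gettimeofday(&before, NULL) != 0)
        return -1;
    work();
    if (gettimeofday(&after, NULL) != 0)
        return -1;
    return elapsed_usec(&before, &after);
}
```

Because gettimeofday reports wall-clock time, the result includes any time the process spent waiting or descheduled, not only CPU time.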

Time Command
r2d2> time gcc mt_mm.c -lpthread
real    0m0.071s
user    0m0.036s
sys     0m0.020s

r2d2> time ./a.out 100 100 8
With 1 threads: 0.012000 seconds
With 2 threads: 0.008001 seconds
With 3 threads: 0.008000 seconds
…
With 6 threads: 0.016001 seconds
With 7 threads: 0.008001 seconds
With 8 threads: 0.004000 seconds
real    0m0.038s
user    0m0.076s
sys     0m0.008s
Note: user time (0.076 s) exceeds real time (0.038 s) because CPU time is summed over all threads, which run concurrently on separate cores.

r2d2> man getrusage
GETRUSAGE(2)          Linux Programmer's Manual          GETRUSAGE(2)
NAME
    getrusage - get resource usage
SYNOPSIS
    #include <sys/time.h>
    #include <sys/resource.h>
    int getrusage(int who, struct rusage *usage);
DESCRIPTION
    getrusage() returns resource usage measures for who, which can be one of the following:
      RUSAGE_SELF
      RUSAGE_CHILDREN
      RUSAGE_THREAD

I-7 links
  • Move to multicore, NY Times
  • http://www.nytimes.com/2004/05/08/business/08chip.html?ex=1399348800&en=98cc44ca97b1a562&ei=5007
  • http://software.intel.com/en-us/articles/adding-parallelism-sample-code
  • http://software.intel.com/en-us/articles/performance-tools-for-software-developers-how-do-i-run-the-intel-ipp-samples
  • http://software.intel.com/en-us/code-downloads
  • http://software.intel.com/en-us/intel-parallel-universe-magazine

matmul.c - get seconds used
struct timeval {
    __kernel_time_t      tv_sec;   /* seconds */
    __kernel_suseconds_t tv_usec;  /* microseconds */
};

double seconds(int nmode){
    struct rusage buf;
    double temp;

    getrusage( nmode, &buf );
    /* Get system time and user time in micro-seconds. */
    temp = (double)buf.ru_utime.tv_sec*1.0e6 + (double)buf.ru_utime.tv_usec
         + (double)buf.ru_stime.tv_sec*1.0e6 + (double)buf.ru_stime.tv_usec;
    /* Return the sum of system and user time in SECONDS. */
    return( temp*1.0e-6 );
}
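The seconds() function above returns user and system time as one sum. When the two components are wanted separately, as the time command reports them, a variant along the following lines may help. This is a sketch under the assumption that the caller passes RUSAGE_SELF or a similar constant; tv_to_seconds and cpu_seconds are hypothetical names.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds as a double. */
double tv_to_seconds(const struct timeval *tv)
{
    return (double)tv->tv_sec + (double)tv->tv_usec * 1.0e-6;
}

/* Store user and system CPU time (in seconds) separately.
 * Returns 0 on success, -1 if getrusage fails. */
int cpu_seconds(int who, double *user, double *sys)
{
    struct rusage buf;

    if (getrusage(who, &buf) != 0)
        return -1;
    *user = tv_to_seconds(&buf.ru_utime);   /* time spent in user mode   */
    *sys  = tv_to_seconds(&buf.ru_stime);   /* time spent in kernel mode */
    return 0;
}
```

Unlike the original, this version checks the return value of getrusage rather than assuming the call succeeds.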

Matrix Multiplication
tstart = seconds(RUSAGE_SELF);
for(i=0; i<m; ++i){
    for(j=0; j<n; ++j){
        C[i][j] = 0.0;
        for(x=0; x<k; ++x){
            C[i][j] = C[i][j] + A[i][x] * B[x][j];
        }
    }
}
tend = seconds(RUSAGE_SELF);

allocmatrix
double **allocmatrix(int nrows, int ncols)
{
    double **m;
    int i;

    m=(double **) malloc((unsigned)(nrows)*sizeof(double*));
    if (!m) nerror("allocation failure 1 in matrix()");
    for(i=0; i<nrows; i++) {
        m[i]=(double *) malloc((unsigned)(ncols) * sizeof(double));
        if (!m[i]) nerror("allocation failure 2 in matrix()");
    }
    return m;
}

More Notes on Matrix Multiplication
  § drand48, random for generating elements of arrays randomly
  § nerror(error_text)
      fprintf(stderr, "Run-time error...\n%s\n", error_text);
  § freematrix
  § m = atoi(argv[1]);
      atoi = alpha to integer
      atoi(argv[1]) is equivalent to strtol(argv[1], (char **) NULL, 10);
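atoi gives no way to detect bad input, whereas strtol reports where parsing stopped and sets errno on overflow. The following sketch shows one way to read a matrix dimension from argv with those checks. The helper name parse_dim and the rule that a dimension must be positive are assumptions for illustration, not something the slides specify.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a positive matrix dimension from text (hypothetical helper).
 * Returns 0 and stores the value in *out on success, -1 on any error. */
int parse_dim(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return -1;                      /* no digits, or trailing junk */
    if (errno == ERANGE || value <= 0 || value > INT_MAX)
        return -1;                      /* overflow or not a usable size */
    *out = (int)value;
    return 0;
}
```

A caller might write: if (parse_dim(argv[1], &m) != 0) nerror("bad dimension");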

struct timespec: finer process times
/* /usr/include/time.h contains the timespec definition
 *   struct timespec {
 *       __time_t tv_sec;
 *       long int tv_nsec;
 *   };
 */
int clock_getres(clockid_t clk_id, struct timespec *res);
int clock_gettime(clockid_t clk_id, struct timespec *tp);
int clock_settime(clockid_t clk_id, const struct timespec *tp);

time0.c
struct timespec clk_resolution;
struct timespec start;
struct timespec finish;
int retval;

retval = clock_getres(CLOCK_MONOTONIC, &clk_resolution);
retval = clock_gettime(CLOCK_MONOTONIC, &start);
… do something …
retval = clock_gettime(CLOCK_MONOTONIC, &finish);
dumptimespec(&finish);

void dumptimespec(struct timespec *ts){
    printf("seconds=%ld, nanoseconds=%ld\n", (long)ts->tv_sec, ts->tv_nsec);
}
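time0.c prints the finish time, but the quantity of interest is usually finish minus start. The following is a minimal sketch of that subtraction; timespec_diff_ns and time_call_ns are hypothetical names, and the feature-test macro is only needed when compiling in a strict ISO C mode. On older glibc versions, linking may also require -lrt.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict -std= modes */
#endif
#include <time.h>

/* Return finish - start in nanoseconds. */
long long timespec_diff_ns(const struct timespec *start,
                           const struct timespec *finish)
{
    long long sec  = (long long)finish->tv_sec  - (long long)start->tv_sec;
    long long nsec = (long long)finish->tv_nsec - (long long)start->tv_nsec;
    return sec * 1000000000LL + nsec;
}

/* Time a caller-supplied function with CLOCK_MONOTONIC.
 * Returns elapsed nanoseconds, or -1 if clock_gettime fails. */
long long time_call_ns(void (*work)(void))
{
    struct timespec start, finish;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;
    work();
    if (clock_gettime(CLOCK_MONOTONIC, &finish) != 0)
        return -1;
    return timespec_diff_ns(&start, &finish);
}
```

CLOCK_MONOTONIC is used, as in time0.c, because it is not affected by adjustments to the system's wall-clock time.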

Amdahl's Law: % Parallelizable
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the % of the time the enhancement can be used.
Ref. CAAQA
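In its usual form, Amdahl's Law gives the overall speedup as 1 / ((1 - F) + F / S), where F is the fraction of the original execution time that can use the enhancement and S is the speedup of that fraction. A small sketch for experimenting with the numbers follows; the function name amdahl_speedup is hypothetical.

```c
/* Overall speedup predicted by Amdahl's Law.
 *   fraction: share of the original execution time that can use the
 *             enhancement, between 0.0 and 1.0
 *   factor:   speedup of the enhanced portion, greater than 0 */
double amdahl_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

For example, if 90% of a program is parallelizable, the overall speedup stays below 10 however many cores are added, since the remaining 10% still runs serially.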

Cache Performance (figure from CSAPP, Bryant and O'Hallaron)

Row major strides vs column strides
  • arrays stored in row major order
  • row major strides
  • column strides
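The two traversal orders can be sketched as follows. Both functions compute the same sum, but the first touches consecutive addresses (stride of one element) while the second jumps a full row between accesses (stride of COLS elements), which tends to use the cache poorly on large arrays. The sizes and function names are hypothetical.

```c
enum { ROWS = 4, COLS = 5 };   /* hypothetical sizes, small for illustration */

/* Row-major traversal: the inner loop walks along a row, so consecutive
 * accesses are adjacent in memory. */
double sum_row_major(double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            sum += a[i][j];
    return sum;
}

/* Column traversal: the inner loop walks down a column, so consecutive
 * accesses are COLS * sizeof(double) bytes apart. */
double sum_column_major(double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int j = 0; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i)
            sum += a[i][j];
    return sum;
}
```

With arrays this small the difference is not measurable; it shows up once a row no longer fits in a few cache lines.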

Matrix Multiply blocking-loop (CAAQA page 90)
/* BS is the blocking factor (called B in CAAQA; renamed here so it does
   not clash with matrix B). C must be zeroed before these loops. */
for (jj = 0; jj < N; jj = jj + BS)
  for (kk = 0; kk < N; kk = kk + BS)
    for (i = 0; i < N; ++i) {
      for (j = jj; j < min(jj + BS, N); ++j) {
        r = 0.0;
        for (k = kk; k < min(kk + BS, N); ++k) {
          r = r + A[i][k] * B[k][j];
        }
        C[i][j] = C[i][j] + r;
      }
    }
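A self-contained version of the blocked loop, alongside a naive version for checking results, might look like the sketch below. It assumes square n x n matrices stored as flat row-major arrays, which differs from the double ** layout used by allocmatrix; the function names and the bs parameter are hypothetical.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Naive C = A * B for n x n matrices in flat row-major storage. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double r = 0.0;
            for (int k = 0; k < n; ++k)
                r += A[i * n + k] * B[k * n + j];
            C[i * n + j] = r;
        }
}

/* Blocked C = A * B with blocking factor bs (bs >= 1).
 * Each (jj, kk) pair adds one partial product into C, so C is zeroed first. */
void matmul_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n * n; ++i)
        C[i] = 0.0;

    for (int jj = 0; jj < n; jj += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int i = 0; i < n; ++i)
                for (int j = jj; j < MIN(jj + bs, n); ++j) {
                    double r = 0.0;
                    for (int k = kk; k < MIN(kk + bs, n); ++k)
                        r += A[i * n + k] * B[k * n + j];
                    C[i * n + j] += r;
                }
}
```

The blocking factor is normally chosen so that the bs x bs block of B and the strip of A being reused fit in cache together; the best value depends on the machine and is found by measurement.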

cpp macros
cpp - C preprocessor directives
#define MAXLINE 1024
#define min(a, b) ((a)<(b) ? (a) : (b))
A safer version that evaluates each argument only once (uses the GNU C extensions __typeof__ and statement expressions):
#define max(a, b) ({ __typeof__ (a) _a = (a); __typeof__ (b) _b = (b); _a > _b ? _a : _b; })
http://stackoverflow.com/questions/3437404/min-and-max-in-c
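The reason the simple min macro is risky is that an argument with a side effect can be evaluated twice. The sketch below makes that visible by counting increments; it compares the macro with an ordinary inline function instead of the GNU statement-expression form, so it compiles with any C99 compiler. All names are hypothetical.

```c
#define MIN_MACRO(a, b) ((a) < (b) ? (a) : (b))

/* A function evaluates each argument exactly once before the call. */
static inline int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* How many times x is incremented by MIN_MACRO(++x, limit).
 * The macro expands to ((++x) < (limit) ? (++x) : (limit)), so ++x runs
 * a second time whenever the comparison is true. */
int macro_increments(int start, int limit)
{
    int x = start;
    (void)MIN_MACRO(++x, limit);
    return x - start;
}

/* Same experiment with the function: ++x always runs exactly once. */
int function_increments(int start, int limit)
{
    int x = start;
    (void)min_int(++x, limit);
    return x - start;
}
```

With start = 0 and limit = 10 the macro increments x twice, while the function increments it once.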

Pointer Arithmetic
If p is a pointer, what is p++?
Decl: char *, float *, double *, struct v *
p++: p+1, p+4, p+8, p + sizeof(struct v)
p+4 p+1 p+4 p+8 p*5 p p*4 p*8
If a is the name of an array, then a evaluates to the address of the base of the array, i.e. the same address as &a[0][0] for a two-dimensional array.
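The byte distances in the table can be checked directly: adding 1 to a pointer advances it by the size of the pointed-to type. In the sketch below, the struct definition and the function names are hypothetical, and the values 4 and 8 for float and double are typical rather than guaranteed by the C standard.

```c
#include <stddef.h>

struct v { double x; int tag; };   /* hypothetical struct for illustration */

/* Each function returns the number of bytes between p and p + 1
 * for one pointer type, measured by casting both to char *. */
size_t step_char(void)
{
    char a[2];
    return (size_t)((char *)(a + 1) - (char *)a);
}

size_t step_float(void)
{
    float a[2];
    return (size_t)((char *)(a + 1) - (char *)a);
}

size_t step_double(void)
{
    double a[2];
    return (size_t)((char *)(a + 1) - (char *)a);
}

size_t step_struct(void)
{
    struct v a[2];
    return (size_t)((char *)(a + 1) - (char *)a);
}
```

Only addresses are computed here; the array contents are never read, so leaving them uninitialized is harmless.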

Array reference macro
&v = address-of operator
Recall from data structures (addresses in bytes):
&A[i][j] = &A[0][0] + skip i rows + skip j elements
         = &A[0][0] + i * rowsize + j * elementsize
         = &A[0][0] + i * numcols * elementsize + j * elementsize
A(i, j) = macro giving the address of A[i][j] for an array with n columns, where A points to the first element:
#define A(i, j) (A + (i)*n + (j))
C pointer arithmetic already scales by the element size, so elementsize does not appear in the macro.

Matrix multiply with address macro
for(i=0; i<m; ++i){
    for(j=0; j<n; ++j){
        *C(i, j) = 0.0;
        for(x=0; x<k; ++x){
            *C(i, j) = *C(i, j) + *A(i, x) * *B(x, j);
        }
    }
}
Note the pointer dereference: each macro yields an address, so every use needs a leading *.
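Put together with the macro definitions, the loop might look like the following sketch. It assumes each matrix is one contiguous row-major block (A is m x k, B is k x n, C is m x n), so each macro uses that matrix's own column count; the function name matmul_flat is hypothetical.

```c
/* C (m x n) = A (m x k) * B (k x n), all stored as flat row-major arrays. */
void matmul_flat(int m, int k, int n,
                 const double *A, const double *B, double *C)
{
/* Each macro yields the ADDRESS of element (i, j); the multiplier is the
 * number of columns of that particular matrix. A function-like macro is
 * only expanded when its name is followed by '(', so the plain pointer
 * names A, B and C remain usable inside the macro bodies. */
#define A(i, j) (A + (i) * k + (j))
#define B(i, j) (B + (i) * n + (j))
#define C(i, j) (C + (i) * n + (j))

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            *C(i, j) = 0.0;
            for (int x = 0; x < k; ++x)
                *C(i, j) = *C(i, j) + *A(i, x) * *B(x, j);
        }
    }

#undef A
#undef B
#undef C
}
```

The macros are undefined at the end of the function so they do not leak into the rest of the translation unit.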

Libraries
  • BLAS - Basic Linear Algebra Subprograms (1970's)
  • BLAS 2
  • BLAS 3
  • BOOST

Reading and writing matrices
fprintf(fp, "%e ", a[i][j]);
fscanf(fp, "%le", &a[i][j]);   /* %le reads a double; plain %e expects a float * */
Problem? Every element is converted between binary and text, which is slow, and "%e" prints only six digits after the decimal point, so precision is lost.
Solution: fread and fwrite
  § read and write without conversion

fread and fwrite
#include <stdio.h>
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *str);
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *str);
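A sketch of writing and reading a matrix with these calls follows. It assumes the array-of-row-pointers layout produced by allocmatrix, so each row is transferred with its own call; the function names are hypothetical.

```c
#include <stdio.h>

/* Write an nrows x ncols matrix of doubles in binary, one row per fwrite.
 * Returns 0 on success, -1 on a short write. */
int write_matrix(FILE *fp, double **a, int nrows, int ncols)
{
    for (int i = 0; i < nrows; ++i)
        if (fwrite(a[i], sizeof(double), (size_t)ncols, fp) != (size_t)ncols)
            return -1;
    return 0;
}

/* Read an nrows x ncols matrix previously written by write_matrix.
 * The rows of a must already be allocated.
 * Returns 0 on success, -1 on a short read. */
int read_matrix(FILE *fp, double **a, int nrows, int ncols)
{
    for (int i = 0; i < nrows; ++i)
        if (fread(a[i], sizeof(double), (size_t)ncols, fp) != (size_t)ncols)
            return -1;
    return 0;
}
```

The values round-trip bit for bit, but the file is tied to the writing machine's representation of double, so it is not portable across systems that differ in byte order or floating-point format.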

Valgrind Suite of Tools
Valgrind includes six production-quality tools:
  • a memory error detector,
  • two thread error detectors,
  • a cache and branch-prediction profiler,
  • a call-graph generating cache and branch-prediction profiler, and
  • a heap profiler.
Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.
http://valgrind.org/

Linux 101 Hacks eBook, by Ramesh Natarajan
Chapter 12: System Monitoring and Performance
  • Hack 89. Free Command
  • Hack 90. Top Command
  • Hack 91. Df Command
  • Hack 92. Du Command
  • Hack 93. Lsof Commands
  • Hack 94. Vmstat Command
  • Hack 95. Netstat Command
  • Hack 96. Sysctl Command
  • Hack 97. Nice Command
  • Hack 98. Renice Command
  • Hack 99. Kill Command
  • Hack 100. Ps Command
  • Hack 101. Sar Command
http://www.thegeekstuff.com/linux-101-hacks-ebook/

GNU Binutils
The GNU Binutils are a collection of binary tools. The main ones are:
  § ld - the GNU linker.
  § as - the GNU assembler.
But they also include:
  § addr2line - Converts addresses into filenames and line numbers.
  § ar - A utility for creating, modifying and extracting from archives.
  § c++filt - Filter to demangle encoded C++ symbols.
  § dlltool - Creates files for building and using DLLs.
  § gold - A new, faster, ELF only linker, still in beta test.
  § gprof - Displays profiling information.

GNU Binutils (continued)
  § nlmconv - Converts object code into an NLM.
  § nm - Lists symbols from object files.
  § objcopy - Copies and translates object files.
  § objdump - Displays information from object files.
  § ranlib - Generates an index to the contents of an archive.
  § readelf - Displays information from any ELF format object file.
  § size - Lists the section sizes of an object or archive file.
  § strings - Lists printable strings from files.
  § strip - Discards symbols.
  § windmc - A Windows compatible message compiler.
  § windres - A compiler for Windows resource files.
