hw1: second submission

hw1: added figure for stride/csize graph
2022-10-05 20:23:30 +02:00 · 2022-10-05 20:04:15 +02:00
3 changed files with 135 additions and 13 deletions
--- a/Project1/project_1_maggioni_claudio.pdf
+++ b/Project1/project_1_maggioni_claudio.pdf
--- a/Project1/project_1_maggioni_claudio.tex
+++ b/Project1/project_1_maggioni_claudio.tex
@@ -6,6 +6,10 @@
 \usepackage{graphicx}
 \usepackage{tikz}
 \usepackage{multirow}
+\usepackage{makecell}
+\usepackage{booktabs}
+\usepackage[nomessages]{fp}
+\usetikzlibrary{decorations.markings}

 \begin{document}

@@ -13,20 +17,22 @@
 \setduedate{12.10.2022 (midnight)}

 \serieheader{High-Performance Computing Lab}{2022}{Student: Claudio
-Maggioni}{Discussed with: ---}{Solution for Project 1}{}
+Maggioni}{Discussed with: --}{Solution for Project 1}{}
 \newline

-\assignmentpolicy
-In this project you will practice memory access optimization,
-performance-oriented programming, and OpenMP parallelizaton on the ICS Cluster.  
+%\assignmentpolicy
+%In this project you will practice memory access optimization,
+%performance-oriented programming, and OpenMP parallelization on the ICS Cluster.
+
+\tableofcontents

 \section{Explaining Memory Hierarchies \punkte{25}}

 \subsection{Memory Hierarchy Parameters of the Cluster}

-By identifying the memory hierarchy parameters through \texttt{likwid-topology} 
-for the cache topology and \texttt{free -g} for the amount of primary memory I
-find the following values:
+By invoking \texttt{likwid-topology} for the cache topology and \texttt{free -g}
+for the amount of primary memory, the following memory hierarchy parameters are
+found:

 \begin{center}
 \begin{tabular}{llll}
@@ -41,10 +47,11 @@ All values are reported using base 2 IEC byte units. The cluster has 2 sockets
 and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \texttt{likwid-topology -g} is shown in Figure \ref{fig:topo}.

+\pagebreak[4]
+
 \begin{figure}[t]
    \begin{center}
       Socket 0:\vspace{0.3cm}
-       
        \begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}
            \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\\hline
            32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32
@@ -70,6 +77,75 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by

 \subsection{Memory Access Pattern of \texttt{membench.c}}

+\begin{figure}[t]
+\begin{center}
+\begin{tikzpicture}
+    \tikzset{->-/.style={decoration={
+  markings,
+  mark=at position .75 with {\arrow{>}}},postaction={decorate}}};
+
+    \draw (0,0) grid (5,1);
+    \draw [dashed] (5,0) -- (5.5,0);
+    \draw [dashed] (5,1) -- (5.5,1);
+    \draw [dashed] (6.5,0) -- (7,0);
+    \draw [dashed] (6.5,1) -- (7,1);
+    \draw (7,0) grid (12,1);
+
+    \foreach \r in {0,1,...,4}{
+        \fill (\r + 0.5,0.5) circle [radius=2pt];
+        \draw[->-] (\r-0.5,0.5) to[bend left] (\r+0.5,0.5);
+        \draw (\r + 0.5, -0.5) node {$\r$};
+    }
+    \draw[->-] (4.5,0.5) to[bend left] (5.5,0.5);
+    \foreach \r in {7,8,...,11}{
+        \fill (\r + 0.5,0.5) circle [radius=2pt];
+        \FPeval{l}{round(\r + 128 - 12, 0)}
+        \draw[->-] (\r-0.5,0.5) to[bend left] (\r+0.5,0.5);
+        \draw (\r + 0.5, -0.5) node {$\l$};
+    }
+    
+    \draw (0,-3) grid (3,-2);
+    \draw [dashed] (3,-2) -- (3.5,-2);
+    \draw [dashed] (3,-3) -- (3.5,-3);
+    \draw [dashed] (4,-2) -- (4.5,-2);
+    \draw [dashed] (4,-3) -- (4.5,-3);
+    \draw (4.5,-2) -- (7.5,-2);
+    \draw (4.5,-3) -- (7.5,-3);
+    \foreach \r in {4.5,5.5,...,7.5}{
+        \draw (\r,-3) -- (\r,-2);
+    }
+    \draw [dashed] (7.5,-2) -- (8,-2);
+    \draw [dashed] (7.5,-3) -- (8,-3);
+    \draw [dashed] (8.5,-2) -- (9,-2);
+    \draw [dashed] (8.5,-3) -- (9,-3);
+    \draw (9,-3) grid (12,-2);
+
+    \fill (0.5,-2.5) circle [radius=2pt];
+    \fill (6,-2.5) circle [radius=2pt];
+    \fill (11.5,-2.5) circle [radius=2pt];
+    \foreach \r in {0,1,2}{
+        \draw (\r + 0.5, -3.5) node {$\r$};
+    }
+    \foreach \r in {9,10,11}{
+        \FPeval{l}{round(\r - 12, 0)}
+        \draw (\r + 0.5, -3.5) node {\tiny $2^{20}  \l$};
+    }
+    \foreach \r in {4.5,5.5}{
+        \FPeval{l}{round(\r - 6.5, 0)}
+        \draw (\r + 0.5, -3.5) node {\tiny $2^{10}  \l$};
+    }
+    \draw (7,-3.5) node {\tiny $2^{10}$};
+    \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
+    \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
+    \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
+\end{tikzpicture}
+\end{center}
+    \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
+    128} and \texttt{stride = 1} (above) and for \texttt{csize = $2^{20}$} and
+    \texttt{stride = $2^{10}$} (below)}
+    \label{fig:access}
+\end{figure}
+
 The benchmark \texttt{membench.c} measures the average time of repeated read and
 write operations across a set of indices of a stack-allocated array of 32-bit
 signed integers. The indices vary according to the access pattern used, which in
@@ -84,7 +160,8 @@ and so on and so forth.
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
 access all indices between 0 and 127 sequentially, and for \texttt{csize =
 $2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
-index $2^{10}-1$, and finally index $2^{20}-1$.
+index $2^{10}-1$, and finally index $2^{20}-1$. The access patterns for these
+two configurations are shown visually in Figure \ref{fig:access}.

 \subsection{Analyzing Benchmark Results}

@@ -212,8 +289,9 @@ implementing the pseudocode, my implementation:
 \end{figure}

 The results of the matrix multiplication benchmark for the naive, blocked, and
-BLAS implementations are shown in Figure \ref{fig:bench}. The blocked
-implementation achieves approximately 50\% more FLOPS than the naive
+BLAS implementations are shown in Figure \ref{fig:bench} as a graph of GFlop/s
+over matrix size and in Figure \ref{fig:benchtab} as a table. The blocked
+implementation achieves on average 50\% more FLOPS than the naive
 implementation thanks to the optimisations in spatial and temporal cache locality
 described. However, the blocked implementation achieves less than a tenth of the
 FLOPS of the Intel MKL BLAS based one, due to the microarchitecture
@@ -221,9 +299,53 @@ optimization the latter one is able to exploit.

 \begin{figure}[t]
    \includegraphics[width=\textwidth]{timing.pdf}
-    \caption{Results of the matrix multiplication benchmark for the naive,
-    blocked, and BLAS implementations}
+    \caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,
+    blocked, and BLAS implementations. The Y-axis is log-scaled.}
    \label{fig:bench}
 \end{figure}

+\begin{figure}[t]
+\begin{center}
+\begin{tabular}{c|cc|cc|cc}
+    \toprule
+    & \multicolumn{2}{c|}{Naive} & \multicolumn{2}{c|}{Blocked} &
+    \multicolumn{2}{c}{BLAS} \\
+    \makecell{Size} & \makecell{MFLOPS} &
+    \makecell{\% CPU} & \makecell{MFLOPS} &
+    \makecell{\% CPU} & \makecell{MFLOPS} &
+    \makecell{\% CPU} \\
+    \midrule
+        31 & 2393.33 & 6.50 & 2112.63 & 5.74 & 23449.20 & 63.72 \\
+        32 & 2400.13 & 6.52 & 2187.44 & 5.94 & 28198.90 & 76.63 \\
+        96 & 1998.74 & 5.43 & 2325.39 & 6.32 & 32542.30 & 88.43 \\
+        97 & 1996.01 & 5.42 & 2322.81 & 6.31 & 29801.30 & 80.98 \\
+        127 & 1923.81 & 5.23 & 2330.30 & 6.33 & 28557.80 & 77.60 \\
+        128 & 1731.98 & 4.71 & 2282.93 & 6.20 & 32643.30 & 88.70 \\
+        129 & 1903.31 & 5.17 & 2334.25 & 6.34 & 31198.20 & 84.78 \\
+        191 & 1736.78 & 4.72 & 2345.91 & 6.37 & 32247.30 & 87.63 \\
+        192 & 1694.44 & 4.60 & 2345.38 & 6.37 & 32830.60 & 89.21 \\
+        229 & 1715.10 & 4.66 & 2351.01 & 6.39 & 34360.90 & 93.37 \\
+        255 & 1720.39 & 4.67 & 2335.21 & 6.35 & 33477.70 & 90.97 \\
+        256 & 777.65 & 2.11 & 2306.48 & 6.27 & 33473.90 & 90.96 \\
+        257 & 1729.27 & 4.70 & 2330.68 & 6.33 & 33686.50 & 91.54 \\
+        319 & 1704.80 & 4.63 & 2360.03 & 6.41 & 34335.20 & 93.30 \\
+        320 & 1414.84 & 3.84 & 2364.53 & 6.43 & 36438.10 & 99.02 \\
+        321 & 1741.30 & 4.73 & 2366.38 & 6.43 & 35433.70 & 96.29 \\
+        417 & 1733.00 & 4.71 & 2378.34 & 6.46 & 36133.70 & 98.19 \\
+        479 & 1731.17 & 4.70 & 2233.05 & 6.07 & 32951.40 & 89.54 \\
+        480 & 1678.77 & 4.56 & 2187.87 & 5.95 & 37260.00 & 101.25 \\
+        511 & 1733.60 & 4.71 & 2224.61 & 6.05 & 34128.00 & 92.74 \\
+        512 & 782.96 & 2.13 & 2284.85 & 6.21 & 36526.40 & 99.26 \\
+        639 & 1714.42 & 4.66 & 2292.78 & 6.23 & 35249.20 & 95.79 \\
+        640 & 663.42 & 1.80 & 2264.70 & 6.15 & 36538.70 & 99.29 \\
+        767 & 1690.82 & 4.59 & 2324.83 & 6.32 & 35718.50 & 97.06 \\
+        768 & 792.04 & 2.15 & 2363.92 & 6.42 & 32116.80 & 87.27 \\
+        769 & 1696.95 & 4.61 & 2321.31 & 6.31 & 33033.90 & 89.77 \\
+    \bottomrule
+\end{tabular}
+\end{center}
+    \caption{MFlop/s and CPU utilisation per matrix size of the matrix
+    multiplication benchmark for the naive, blocked, and BLAS implementations.}
+    \label{fig:benchtab}
+\end{figure}
 \end{document}
--- a/project_1_maggioni_claudio.zip
+++ b/project_1_maggioni_claudio.zip
Author	SHA1	Message	Date
Claudio Maggioni	66cdcedb9b	hw1: second submission	2022-10-05 20:23:30 +02:00
Claudio Maggioni	7e716e1db2	hw1: added figure for stride/csize graph	2022-10-05 20:04:15 +02:00