Detecting Memory-Boundedness with Hardware Performance Counters

Daniel Molka; Robert Schöne; Daniel Hackenberg; Wolfgang E. Nagel

doi:10.1145/3030207.3030223

Publication: Contribution to conferences › Paper › Contributed

Contributors

Daniel Molka, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) (Author)
Robert Schöne, Professur für Rechnerarchitektur, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) (Author)
Daniel Hackenberg, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) (Author)
Wolfgang E. Nagel, Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Professur für Rechnerarchitektur (Author)

Abstract

Modern processors incorporate several performance monitoring units, which can be used to count events that occur within different components of the processor. They provide access to information on hardware resource usage and can therefore be used to detect performance bottlenecks. Thus, many performance measurement tools are able to record them alongside information about the application behavior. However, the exact meaning of the supported hardware events is often incomprehensible due to the system complexity and incomplete or even inaccurate documentation. For most events it is also not documented whether a certain rate indicates a saturated resource. Therefore, it is usually difficult to draw conclusions on the performance impact from the observed event rates. In this paper, we evaluate whether hardware performance counters can be used to measure the capacity utilization within the memory hierarchy and estimate the impact of memory accesses on the achieved performance. The presented approach is based on a small selection of micro-benchmarks that constantly stress individual components in the memory subsystem, ranging from caches to main memory. These workloads are used to identify hardware performance counters that provide good estimates for the utilization of individual components in the memory hierarchy. However, since access latencies can be overlapped with computing instructions, a high utilization of the memory hierarchy does not necessarily result in low performance. We therefore also investigate which stall counters provide good estimates for the number of cycles that are actually spent waiting for the memory hierarchy.
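
The two quantities the abstract distinguishes, utilization of a memory-hierarchy component and cycles actually stalled on memory, can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the thresholds, and all numbers are hypothetical, and the counter values are assumed to have been read already by some measurement tool.

```python
# Illustrative sketch only. Names, thresholds, and numbers are hypothetical
# and are not taken from the paper.

def utilization(observed_rate, saturated_rate):
    """Fraction of a component's capacity in use.

    saturated_rate is the event rate assumed to have been measured while a
    micro-benchmark kept that component (e.g. a cache level) saturated.
    """
    if saturated_rate <= 0:
        raise ValueError("saturated_rate must be positive")
    return min(observed_rate / saturated_rate, 1.0)


def memory_stall_fraction(stall_cycles, total_cycles):
    """Share of all cycles attributed to waiting on the memory hierarchy."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return stall_cycles / total_cycles


def looks_memory_bound(util, stall_frac, util_threshold=0.8, stall_threshold=0.5):
    """Hypothetical rule: high utilization AND many memory stall cycles.

    High utilization alone is not enough, because memory latency can be
    overlapped with computation.
    """
    return util >= util_threshold and stall_frac >= stall_threshold


if __name__ == "__main__":
    util = utilization(observed_rate=1.8e9, saturated_rate=2.0e9)
    # Same utilization, two different stall profiles.
    print(looks_memory_bound(util, memory_stall_fraction(6e8, 1e9)))
    print(looks_memory_bound(util, memory_stall_fraction(1e8, 1e9)))
```

With these made-up inputs, the two calls differ only in the stall fraction, which mirrors the abstract's point that a highly utilized memory hierarchy does not by itself imply low performance.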

Details

Original language	English
Pages	27-38
Number of pages	12
Publication status	Published - 2017
Peer review status	No

External IDs

Scopus	85019043873
ORCID	/0000-0002-8491-770X/work/141543277
ORCID	/0009-0003-0666-4166/work/151475570

Keywords

Benchmarking, hardware performance counters, performance analysis