Do humans and Convolutional Neural Networks attend to similar areas during scene classification: Effects of task and image type

Romy Müller; Marcel Dürschmidt; Julian Ullrich; Carsten Knoll; Sascha Weber; Steffen Seitz

doi:10.3390/app14062648

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

Romy Müller, Chair of Engineering Psychology and Applied Cognitive Research (Author)
Marcel Dürschmidt, Technische Universität Dresden (Author)
Julian Ullrich, Technische Universität Dresden, Heinrich Heine Universität Düsseldorf (Author)
Carsten Knoll, Chair of Fundamentals of Electronics (Author)
Sascha Weber, Chair of Engineering Psychology and Applied Cognitive Research (Author)
Steffen Seitz, Chair of Fundamentals of Electronics (Author)

Abstract

Deep neural networks are powerful image classifiers, but do they attend to similar image areas as humans? While previous studies have investigated how this similarity is shaped by technological factors, little is known about the role of factors that affect human attention. Therefore, we investigated the interactive effects of task and image characteristics. We varied the intentionality of the tasks used to elicit human attention maps (i.e., spontaneous gaze, gaze-pointing, manual area selection). Moreover, we varied the type of image to be categorized (i.e., singular objects, indoor scenes consisting of object arrangements, landscapes without distinct objects). The human attention maps generated in this way were compared to the attention maps of a convolutional neural network (CNN) as revealed by a method of explainable artificial intelligence (Grad-CAM). The influence of human tasks strongly depended on image type: for objects, human manual selection produced attention maps that were most similar to those of the CNN, while the specific eye movement task had little impact. For indoor scenes, spontaneous gaze produced the least similarity, while for landscapes, similarity was equally low across all human tasks. Our results highlight the importance of taking human factors into account when comparing the attention of humans and CNNs.

Details

Original language	English
Article number	2648
Number of pages	31
Journal	Applied Sciences (Switzerland)
Volume	14 (2024)
Issue number	6
Publication status	Published - 21 March 2024
Peer-reviewed	Yes

External IDs

ORCID	/0000-0002-8389-8869/work/156335443
Mendeley	5cfd767f-16fb-330a-a715-9f88325506b0
Scopus	85192517094

Keywords

convolutional neural networks (CNN), categorization, image classification, attention maps, explainable artificial intelligence (XAI), scene viewing, eye movements

Library keywords

600 Technology

Research portal of TU Dresden