Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). Given the success of replacing MLPs with KANs, this work asks whether a similar replacement inside the attention mechanism can bring benefits. In this paper, we design the first learnable attention, called Kolmogorov-Arnold Attention (KArAt), for ViTs; it can operate on any basis, ranging from Fourier, wavelets, and splines to rational functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. Adopting the Fourier basis, Fourier-KArAt and its variants in some cases outperform their traditional softmax counterparts, and otherwise show comparable performance, on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy Fourier KArAt in ConViT and Swin-Transformer, and use it for detection and segmentation with ViT-Det. We dissect these architectures' performance on the classification task by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets, and contrast them with vanilla ViTs. KArAt's learnable activation yields better attention scores across all ViTs, indicating stronger token-to-token interactions that contribute to better inference. Still, its generalizability does not scale with larger ViTs, and many factors, including the present computing interface, affect the relative performance of the parameter- and memory-heavy KArAts. We note that the goal of this paper is neither to produce efficient attention nor to challenge the traditional activations; by designing KArAt, we are the first to show that attention can be learned, and we encourage researchers to explore KArAt in conjunction with more advanced architectures that require a careful understanding of learnable activations.
KANs have been integrated with standard neural network architectures, primarily conventional MLPs and convolutional neural networks (CNNs), but their exploration in more advanced architectures, such as Transformers, remains limited. VisionKAN replaces the MLP layers inside the encoder blocks of data-efficient image Transformers (DeiTs) with KANs and proposes DeiT+KAN, while Yang & Wang proposed two variants: ViT+KAN, which replaces the MLP layers inside ViT's encoder blocks, and the Kolmogorov-Arnold Transformer (KAT), similar to ViT+KAN but with a refined group-KAN strategy. From these works, we realized that simply replacing MLPs with KANs might not guarantee better generalizability, but a properly designed KAN could. We note that while DeiT+KAN and ViT+KAN replace the MLP layer with a KAN in the Transformer's encoder block, KAT implements a more sophisticated group-KAN strategy that reuses a learnable function among a group of units in the same layer and chooses different bases for different encoder blocks. Importantly, the community has so far refrained from designing a learnable multi-head attention module, the heart of the Transformer. Therefore, with the rise of second-generation Transformers, such as Google's TITAN and Sakana AI's Transformer2, which mimic the human brain, we ask: Is it worth deploying learnable multi-head self-attention in (vision) Transformers?
Let \(\mathcal{A}^{i,j}\in \mathbb{R}^{N\times N}\) be the attention matrix for \(i^{\rm th}\) head in the \(j^{\rm th}\) encoder block. For \(k\in[N]\), the softmax activation function, \(\sigma:\mathbb{R}^N\to (0,1)^N\), operates on the \(k^{\rm th}\) row vector, \(\mathcal{A}^{i,j}_{k,:}\in\mathbb{R}^{1\times N}\) of \(\mathcal{A}^{i,j}\) to produce component-wise output
\(\sigma(\mathcal{A}^{i,j}_{k,l})=\tfrac{e^{\mathcal{A}^{i,j}_{k,l}}}{\sum_{n=1}^N e^{\mathcal{A}^{i,j}_{k,n}}}\) for \(l\in[N]\).
Instead of using the softmax function, we can use a learnable activation function \(\tilde{\sigma}\) on the row vectors of each attention head \(\mathcal{A}^{i,j}\). With any choice of basis functions (e.g., B-splines, Fourier, fractals, wavelets, etc.), the activated attention row vector \(\tilde{\sigma}(\mathcal{A}^{i,j}_{k,:})\) for \(k\in [N]\) can be written as \(\tilde{\sigma}(\mathcal{A}^{i,j}_{k,:}) = \left(\Phi^{i,j}\left[(\mathcal{A}^{i,j}_{k,:})^\top\right]\right)^\top\).
Instead of using an \(N\times N\) operator \(\Phi^{i,j}\), we use a reduced-size operator \(\widehat{\Phi}^{i,j}\) of dimension \(r\times N\), with \(r\ll N\), so that the learned activation is \(r\)-dimensional for each \(k\in[N]\). That is, \(\widehat{\Phi}^{i,j}\) down-projects each attention row vector of \(\mathcal{A}^{i,j}\) to a lower-dimensional subspace, which significantly reduces the computational overhead. Next, we use another learnable weight matrix \(W^{i,j}\in\mathbb{R}^{N\times r}\) to project the activated rows back to their original dimension. For each \(k\in[N]\), this operation computes \(\widehat{\sigma}(\mathcal{A}^{i,j}_{k,:}) =\left[W^{i,j}\widehat{\Phi}^{i,j}\left[(\mathcal{A}^{i,j}_{k,:})^\top\right]\right]^{\top}\).
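To make the shapes concrete, here is a minimal PyTorch sketch of this down-project/up-project step; the values of \(N\) and \(r\) are illustrative assumptions, and a plain linear layer stands in for the learnable operator \(\widehat{\Phi}^{i,j}\) (a Fourier-basis version is sketched further below).

```python
import torch
import torch.nn as nn

# Illustrative shapes only; N and r here are assumptions, not the paper's exact settings.
B, N, r = 4, 197, 16

A = torch.randn(B, N, N)            # pre-activation attention scores of one head, A^{i,j}

# Stand-in for \hat{Phi}^{i,j}: a plain linear map R^N -> R^r. In KArAt this operator is
# built from learnable univariate basis functions (see the Fourier sketch further below);
# only the shapes matter for this example.
Phi_hat = nn.Linear(N, r, bias=False)

# nn.Linear(r, N) stores its weight as an N x r matrix, matching W^{i,j} in R^{N x r}.
W = nn.Linear(r, N, bias=False)

low = Phi_hat(A)                    # each attention row is down-projected: (B, N, r)
out = W(low)                        # rows projected back to the original dimension: (B, N, N)
```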
We consider two configurations for updating the operator \(\widehat{\Phi}^{i,j}\). (a) Blockwise: In this configuration, each encoder block learns the attention through \(h\) distinct operators \(\widehat{\Phi}\), one per head, totaling \(hL\) operators. Like the MHSA architecture in ViTs, the blockwise configuration is designed to learn as many different data representations as possible. (b) Universal: The motivation behind this configuration comes from KAT, where the MLP layers are replaced with different variations of KAN layers, including Group-KAN, in which the KAN layer shares rational base functions and their coefficients among groups of edges. Inspired by this, in our universal configuration all \(L\) encoder blocks share the same \(h\) operators, i.e., \(\widehat{\Phi}^{i,j} = \widehat{\Phi}^{i}\) for \(j=1,2,\ldots, L\). Rather than learning the attention through \(hL\) operators, this configuration uses only \(h\) operators, sharing all learnable units and their parameters in each head across the \(L\) blocks. We postulate that the blockwise mode, with more operators, captures more nuances from the data, whereas the universal mode is suitable for learning simpler decision boundaries.
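The two configurations differ only in how many operators are instantiated and how they are shared. The sketch below illustrates this bookkeeping, assuming a hypothetical `KArAtUnit` module that bundles one \(\widehat{\Phi}\) with its up-projection \(W\); it is a shape-level illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KArAtUnit(nn.Module):
    # Hypothetical stand-in bundling one operator \hat{Phi}^i with its up-projection W^i.
    def __init__(self, n_tokens: int, r: int):
        super().__init__()
        self.down = nn.Linear(n_tokens, r, bias=False)      # plays the role of \hat{Phi}
        self.up = nn.Linear(r, n_tokens, bias=False)        # plays the role of W

    def forward(self, rows: torch.Tensor) -> torch.Tensor:  # (..., N) -> (..., N)
        return self.up(self.down(rows))

def make_operators(mode: str, L: int, h: int, n_tokens: int, r: int) -> nn.ModuleList:
    if mode == "blockwise":
        # h distinct operators per encoder block: h * L operators in total.
        return nn.ModuleList(
            nn.ModuleList(KArAtUnit(n_tokens, r) for _ in range(h)) for _ in range(L)
        )
    if mode == "universal":
        # The same h operators (and their parameters) are reused in every block.
        shared = nn.ModuleList(KArAtUnit(n_tokens, r) for _ in range(h))
        return nn.ModuleList(shared for _ in range(L))
    raise ValueError(f"unknown mode: {mode}")

# ops[j][i] is the operator for head i in encoder block j under either configuration.
ops = make_operators("universal", L=12, h=3, n_tokens=197, r=16)
```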
Basis | Function Representation, \(\phi(x)\) | Initialized Specifications |
---|---|---|
Fourier | \(\sum_{k=1}^{G} (a_k \cos(kx) + b_k \sin(kx))\) | \(a_k, b_k \sim \mathcal{N}(0, 1)\), \(G\) denotes the grid size |
Rational \((m,n)\) | \(\frac{a_0 + a_1 x + \dots + a_m x^m}{1 + |b_1 x + \dots + b_n x^n|}\) | \(a_i, b_j \sim \mathcal{N}(0, 1)\), \(i=0, \dots, m\), \(j=1, \dots, n\) |
Mexican Hat | \(\frac{2}{\pi^{1/4}\sqrt{3}} \left(1 - \tilde{x}^2\right) e^{-\frac{\tilde{x}^2}{2}}\), \(\tilde{x} = \frac{x-\tau}{s}\) | \(w \sim \mathcal{N}(0, 1)\), (Translation) \(\tau = 0\), (Scale) \(s=1\) |
Morlet* | \(w \cos(\omega_0 \tilde{x}) e^{-\frac{\tilde{x}^2}{2}}\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
DOG | \(-w \frac{d}{dx}\left(e^{-\frac{\tilde{x}^2}{2}}\right)\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
Meyer** | \(w \sin(\pi|\tilde{x}|) (m \circ \nu)(\tilde{x})\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
Shannon | \(w~\text{sinc}\left(\frac{\tilde{x}}{\pi}\right)\omega(\tilde{x})\), \(\tilde{x} = \frac{x-\tau}{s}\) | \(\omega(\tilde{x})\) is the symmetric Hamming window |
By design, KArAt can utilize any basis function for activating the attention units. Therefore, in addition to the Fourier basis, we embed five different wavelet bases and the rational function basis into KArAt; the function representations for these choices are outlined in the table above. Along with the basis, we can also choose the base activation \(b(x)\) from Zero, SiLU, and Identity. With this, the general expression of our Fourier KArAt is \(\widehat{\phi}^{i,j}_{pq}(\mathcal{A}^{i,j}_{k,q})= b(\mathcal{A}^{i,j}_{k,q}) + \sum_{m=1}^{G}\left(a_{pqm}\cos{(m\mathcal{A}^{i,j}_{k,q})}+b_{pqm}\sin{(m\mathcal{A}^{i,j}_{k,q})}\right)\). However, empirically, we found that the Zero base activation (\(b(x)=0\)) with the Fourier basis outperforms the rest of the combinations. For more details, see the paper.
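For illustration, the following is a minimal PyTorch sketch of the modular Fourier operator with zero base activation \(b(x)=0\); the module name, the choice \(r=16\), and the batch shapes are our assumptions for the example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FourierKArAtOperator(nn.Module):
    """Sketch of \hat{Phi}^{i,j} with a Fourier basis and zero base activation b(x) = 0:
    y_p = sum_q sum_{m=1}^G ( a[p,q,m] * cos(m * x_q) + b[p,q,m] * sin(m * x_q) )."""
    def __init__(self, n_tokens: int, r: int, grid: int):
        super().__init__()
        self.a = nn.Parameter(torch.randn(r, n_tokens, grid))   # a_{pqm} ~ N(0, 1)
        self.b = nn.Parameter(torch.randn(r, n_tokens, grid))   # b_{pqm} ~ N(0, 1)
        self.register_buffer("m", torch.arange(1, grid + 1).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds attention rows with shape (..., N); the output has shape (..., r).
        mx = x.unsqueeze(-1) * self.m                            # (..., N, G)
        out = torch.einsum('...qm,pqm->...p', torch.cos(mx), self.a)
        out = out + torch.einsum('...qm,pqm->...p', torch.sin(mx), self.b)
        return out

# Activate the attention rows of one head (grid size G = 3) and project back with W^{i,j}.
N, r, G = 197, 16, 3
phi_hat, W = FourierKArAtOperator(N, r, G), nn.Linear(r, N, bias=False)
A = torch.randn(4, N, N)                # (batch, rows, N) pre-activation attention scores
activated = W(phi_hat(A))               # (4, N, N) learned attention, replacing softmax
```

With grid size \(G\), such an operator stores \(2rNG\) Fourier coefficients plus \(Nr\) entries for the up-projection per head, which is roughly where the attention-activation parameter overhead reported below comes from.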
Model | CIFAR-10 Acc.@1 | CIFAR-10 Acc.@5 | CIFAR-100 Acc.@1 | CIFAR-100 Acc.@5 | ImageNet-1K Acc.@1 | ImageNet-1K Acc.@5 | Parameters |
---|---|---|---|---|---|---|---|
ViT-Base | 83.45 | 99.19 | 58.07 | 83.70 | 72.90 | 90.56 | 85.81M |
+ G1B | 81.81 (1.97% ↓) | 99.01 | 55.92 (3.70% ↓) | 82.04 | 68.03 (6.68% ↓) | 86.41 | 87.51M (1.98% ↑) |
+ G1U | 80.75 (3.24% ↓) | 98.76 | 57.36 (1.22% ↓) | 82.89 | 68.83 (5.58% ↓) | 87.69 | 85.95M (0.16% ↑) |
ViT-Small | 81.08 | 99.02 | 53.47 | 82.52 | 70.50 | 89.34 | 22.05M |
+ G3B | 79.78 (1.60% ↓) | 98.70 | 54.11 (1.20% ↑) | 81.02 | 67.77 (3.87% ↓) | 87.51 | 23.58M (6.94% ↑) |
+ G3U | 79.52 (1.92% ↓) | 98.85 | 53.86 (0.73% ↑) | 81.45 | 67.76 (3.89% ↓) | 87.60 | 22.18M (0.56% ↑) |
ViT-Tiny | 72.76 | 98.14 | 43.53 | 75.00 | 59.15 | 82.07 | 5.53M |
+ G3B | 76.69 (5.40% ↑) | 98.57 | 46.29 (6.34% ↑) | 77.02 | 59.11 (0.07% ↓) | 82.01 | 6.29M (13.74% ↑) |
+ G3U | 75.56 (3.85% ↑) | 98.48 | 46.75 (7.40% ↑) | 76.81 | 57.97 (1.99% ↓) | 81.03 | 5.59M (1.08% ↑) |
Model | SVHN Acc.@1 | Flowers 102 Acc.@1 | STL-10 Acc.@1 |
---|---|---|---|
ViT-Base | 97.74 | 92.24 | 97.26 |
+ G1B | 96.83 | 89.66 | 95.30 |
+ G1U | 97.21 | 89.43 | 95.78 |
ViT-Small | 97.48 | 91.46 | 96.09 |
+ G3B | 97.04 | 89.67 | 95.26 |
+ G3U | 97.11 | 90.08 | 95.45 |
ViT-Tiny | 96.69 | 84.21 | 93.20 |
+ G3B | 96.37 | 83.67 | 93.09 |
+ G3U | 96.39 | 83.70 | 92.93 |
Model | CIFAR-10 Acc.@1 | CIFAR-10 Acc.@5 | ImageNet-1K Acc.@1 | ImageNet-1K Acc.@5 |
---|---|---|---|---|
ConViT-Tiny | 71.36 | 97.86 | 57.91 | 81.79 |
+ G3B | 75.57 | 98.61 | 56.57 | 80.75 |
+ G3U | 74.51 | 98.63 | 56.51 | 80.93 |
Swin-Tiny | 84.83 | 99.43 | 76.14 | 92.81 |
+ G3B | 79.34 | 98.81 | 73.19 | 90.97 |
Model (Initialization) | Box AP | Box AP50 | Box AP75 | Mask AP | Mask AP50 | Mask AP75 |
---|---|---|---|---|---|---|
ViT-Det-Base (ImageNet-1K) | 32.34 | 49.16 | 34.43 | 28.35 | 46.04 | 29.48 |
+ G1B | 22.48 | 36.82 | 23.13 | 20.11 | 34.09 | 20.46 |
+ G1U | 26.68 | 43.25 | 27.88 | 24.01 | 40.21 | 24.66 |
ViT-Det-Base (Random) | 15.28 | 26.88 | 15.22 | 13.39 | 24.34 | 12.86 |
+ G1B | 10.32 | 18.77 | 10.05 | 8.94 | 16.51 | 8.50 |
+ G1U | 10.99 | 20.04 | 10.82 | 9.69 | 17.99 | 9.25 |
The overall computation for the Fourier KArAt variants is higher than for their conventional softmax MHSA counterparts. Most notably, the Fourier KArAt variants have longer training times, and the universal mode (GnU) consistently trains slightly faster than the blockwise mode (GnB). During training, we monitored GPU memory requirements and, as expected, the Fourier KArAt variants use significantly more memory than traditional MHSA; in particular, the GPU memory requirement scales by 2.5-3× compared to traditional softmax MHSA. We also see slightly faster inference in universal mode than in blockwise mode, except for ViT-Base. While there is a large training-time discrepancy between vanilla ViTs and Fourier KArAt ViTs, the inference speeds of the Fourier KArAt variants are comparable to their vanilla counterparts. Although there is a minor difference in throughput between the universal and blockwise modes during inference, theoretically both variants of any model with the same grid size have the same number of FLOPs.
Model | Attention Activation Params | Total Params | Attention Activation GFLOPs | Total GFLOPs | GPU Memory |
---|---|---|---|---|---|
ViT-Base | 0 | 85.81M | 0.016 | 17.595 | 7.44 GB |
+ G1B | 1.70M | 87.51M | 0.268 | 17.847 | 17.36 GB |
+ G1U | 0.14M | 85.95M | 0.268 | 17.847 | 16.97 GB |
ViT-Small | 0 | 22.05M | 0.008 | 4.614 | 4.15 GB |
+ G3B | 1.53M | 23.58M | 0.335 | 4.941 | 11.73 GB |
+ G3U | 0.13M | 22.18M | 0.335 | 4.941 | 11.21 GB |
ViT-Tiny | 0 | 5.53M | 0.005 | 1.262 | 2.94 GB |
+ G3B | 0.76M | 6.29M | 0.168 | 1.425 | 7.48 GB |
+ G3U | 0.06M | 5.59M | 0.168 | 1.425 | 7.29 GB |
Existing literature has established that the attention matrices in vision Transformers are usually sparse and low-rank. We empirically verify this claim and note that the attention matrix has a low-rank structure both before and after the softmax operation. Traditional softmax attention and our learnable Fourier KArAt have almost identical low-rank structures before the attention activation is applied. After the activations are applied, traditional MHSA has significantly larger singular values than its KArAt variant, G3B. We also notice that, before the activation, both traditional MHSA and Fourier KArAt feature a sharp drop in singular values between the 50th and 75th indices; this drop vanishes after the softmax, indicating a normalization effect. However, due to the hidden dimension \(r\), Fourier KArAt enforces a much lower rank than traditional MHSA.
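This kind of rank analysis is straightforward to reproduce; the sketch below computes the singular-value spectrum of one head's attention scores before and after the row-wise softmax (a random matrix is used here as a placeholder, whereas the actual scores would be captured with a forward hook during inference).

```python
import torch

N = 197
scores = torch.randn(N, N)   # placeholder; real A^{i,j} would be captured from a forward pass

sv_before = torch.linalg.svdvals(scores)                          # spectrum before activation
sv_after = torch.linalg.svdvals(torch.softmax(scores, dim=-1))    # spectrum after row-wise softmax

# A rapid decay (many near-zero singular values) indicates a low-rank structure;
# comparing the two spectra shows how the activation reshapes the rank profile.
print(sv_before[:5], sv_after[:5])
```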
These visualizations show that Fourier KArAt significantly impacts the smoothness of the loss surfaces of ViT architectures. ViT-Tiny, with the fewest parameters, has the smoothest loss landscape and modest generalizability. In contrast, ViT-Tiny+Fourier KArAt's loss landscape is spiky, indicating that the model is full of small-volume minima; however, because the model has a modest number of parameters, the gradient-descent optimizer can still find a solution with better generalizability than vanilla ViT-Tiny. ViT-Base has many more parameters than ViT-Tiny, and its loss surface is much spikier. Finally, the loss surface of ViT-Base+Fourier KArAt is the spikiest, making it a narrow-margin model with sharp minima, where small perturbations in the parameter space lead to high misclassification because such sharp regions occupy exponentially larger volume in high-dimensional spaces.
From the figure above, we observe that vanilla softmax attention focuses on sparse feature interactions, while Fourier KArAt captures the dominant feature interactions. Since Fourier KArAt, unlike vanilla softmax attention, is not restricted to values in \([0, 1]\), it also has the flexibility to capture negative interactions.
@article{maity2025karat,
title={Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?},
author={Maity, Subhajit and Hitsman, Killian and Li, Xin and Dutta, Aritra},
journal={arXiv preprint arXiv:2503.10632},
year={2025}
}
Acknowledgements: We thank Dr. Srijan Das from the University of North Carolina at Charlotte for his valuable feedback and suggestions and for arranging the computational resources.
Copyright: CC BY-NC-SA 4.0 © Subhajit Maity | Last updated: 26 Jun 2025