Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). Given the success of replacing MLPs with KANs, this work asks whether a similar replacement inside the attention mechanism can bring benefits. In this paper, we design the first learnable attention, called Kolmogorov-Arnold Attention (KArAt), for ViTs; it can operate on any basis, ranging from Fourier, wavelets, and splines to rational functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. Adopting the Fourier basis, Fourier-KArAt and its variants in some cases outperform their traditional softmax counterparts, and otherwise show comparable performance, on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy Fourier KArAt in ConViT and Swin-Transformer, and use it for detection and segmentation with ViT-Det. We dissect these architectures' performance on the classification task by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets, and contrast them with vanilla ViTs. KArAt's learnable activation yields better attention scores across all ViTs, indicating stronger token-to-token interactions that contribute to better inference. Still, its generalizability does not scale with larger ViTs, and many factors, including the present computing interface, affect the relative performance of the parameter- and memory-heavy KArAts. We note that the goal of this paper is neither to produce efficient attention nor to challenge the traditional activations; by designing KArAt, we are the first to show that attention can be learned, and we encourage researchers to explore KArAt in conjunction with more advanced architectures that require a careful understanding of learnable activations.
KANs have been integrated with standard neural network architectures, primarily conventional MLPs and convolutional neural networks (CNNs), but their exploration in more advanced architectures, such as Transformers, remains limited. VisionKAN replaces the MLP layers inside the encoder blocks of data-efficient image Transformers (DeiTs) with KANs and proposes DeiT+KAN, while Yang & Wang proposed two variants: ViT+KAN, which replaces the MLP layers inside ViT's encoder blocks, and the Kolmogorov-Arnold Transformer (KAT), similar to ViT+KAN but with a refined group-KAN strategy. From these works, we realized that simply replacing MLPs with KANs might not guarantee better generalizability, but a properly designed KAN could. We note that while DeiT+KAN and ViT+KAN replace the MLP layer with a KAN in the Transformer's encoder block, KAT implements a more sophisticated group-KAN strategy that reuses a learnable function among a group of units in the same layer and chooses different bases for different encoder blocks. Importantly, the community has so far refrained from designing a learnable multi-head attention module, the heart of the Transformer. Therefore, with the rise of second-generation Transformers, such as Google's TITAN and Sakana AI's Transformer2, which mimic the human brain, we ask: Is it worth deploying learnable multi-head self-attention in (vision) Transformers?
Let \(\mathcal{A}^{i,j}\in \mathbb{R}^{N\times N}\) be the attention matrix for \(i^{\rm th}\) head in the \(j^{\rm th}\) encoder block. For \(k\in[N]\), the softmax activation function, \(\sigma:\mathbb{R}^N\to (0,1)^N\), operates on the \(k^{\rm th}\) row vector, \(\mathcal{A}^{i,j}_{k,:}\in\mathbb{R}^{1\times N}\) of \(\mathcal{A}^{i,j}\) to produce component-wise output
\(\sigma(\mathcal{A}^{i,j}_{k,l})=\tfrac{e^{\mathcal{A}^{i,j}_{k,l}}}{\sum_{n=1}^N e^{\mathcal{A}^{i,j}_{k,n}}}\) for \(l\in[N]\).
Instead of using the softmax function, we can use a learnable activation function \(\tilde{\sigma}\) on the row vectors of each attention head \(\mathcal{A}^{i,j}\). With any choice of basis functions (e.g., B-splines, Fourier, fractals, wavelets, etc.), the activated attention row vector \(\tilde{\sigma}(\mathcal{A}^{i,j}_{k,:})\) for \(k\in [N]\) can be written as \(\tilde{\sigma}(\mathcal{A}^{i,j}_{k,:}) = \left(\Phi^{i,j}\left[(\mathcal{A}^{i,j}_{k,:})^\top\right]\right)^\top\).
Instead of using an \(N\times N\) operator \(\Phi^{i,j}\), we use a reduced-size operator \(\widehat{\Phi}^{i,j}\) of dimension \(r\times N\), with \(r\ll N\), so that the learned activation is \(r\)-dimensional for each \(k\in[N]\). That is, \(\widehat{\Phi}^{i,j}\) down-projects each attention row vector of \(\mathcal{A}^{i,j}\) to a lower-dimensional subspace, which significantly reduces the computational overhead. Next, we use another learnable weight matrix \(W^{i,j}\in\mathbb{R}^{N\times r}\) to project the activated rows back to their original dimension. For each \(k\in[N]\), this operation computes \(\widehat{\sigma}(\mathcal{A}^{i,j}_{k,:}) =\left[W^{i,j}\widehat{\Phi}^{i,j}\left[(\mathcal{A}^{i,j}_{k,:})^\top\right]\right]^{\top}\).
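To make the shapes concrete, here is a minimal PyTorch sketch of this down-project/up-project step; the values of \(N\) and \(r\) are illustrative assumptions, and a plain linear layer stands in for the learnable operator \(\widehat{\Phi}^{i,j}\) (a Fourier-basis version is sketched further below).

```python
import torch
import torch.nn as nn

# Illustrative shapes only; N and r here are assumptions, not the paper's exact settings.
B, N, r = 4, 197, 16

A = torch.randn(B, N, N)            # pre-activation attention scores of one head, A^{i,j}

# Stand-in for \hat{Phi}^{i,j}: a plain linear map R^N -> R^r. In KArAt this operator is
# built from learnable univariate basis functions (see the Fourier sketch further below);
# only the shapes matter for this example.
Phi_hat = nn.Linear(N, r, bias=False)

# nn.Linear(r, N) stores its weight as an N x r matrix, matching W^{i,j} in R^{N x r}.
W = nn.Linear(r, N, bias=False)

low = Phi_hat(A)                    # each attention row is down-projected: (B, N, r)
out = W(low)                        # rows projected back to the original dimension: (B, N, N)
```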
We consider two configurations for updating the operator \(\widehat{\Phi}^{i,j}\). (a) Blockwise: In this configuration, each encoder block learns the attention through \(h\) distinct operators \(\widehat{\Phi}\), one per head, totaling \(hL\) operators. Like the MHSA architecture in ViTs, the blockwise configuration is designed to learn as many different data representations as possible. (b) Universal: The motivation behind this configuration comes from KAT, where the MLP layers are replaced with different variations of KAN layers, including Group-KAN, in which the KAN layer shares rational base functions and their coefficients among groups of edges. Inspired by this, in our universal configuration all \(L\) encoder blocks share the same \(h\) operators, i.e., \(\widehat{\Phi}^{i,j} = \widehat{\Phi}^{i}\) for \(j=1,2,\ldots, L\). Rather than learning the attention through \(hL\) operators, this configuration uses only \(h\) operators, sharing all learnable units and their parameters in each head across the \(L\) blocks. We postulate that the blockwise mode, with more operators, captures more nuances from the data, whereas the universal mode is suitable for learning simpler decision boundaries.
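The two configurations differ only in how many operators are instantiated and how they are shared. The sketch below illustrates this bookkeeping, assuming a hypothetical `KArAtUnit` module that bundles one \(\widehat{\Phi}\) with its up-projection \(W\); it is a shape-level illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KArAtUnit(nn.Module):
    # Hypothetical stand-in bundling one operator \hat{Phi}^i with its up-projection W^i.
    def __init__(self, n_tokens: int, r: int):
        super().__init__()
        self.down = nn.Linear(n_tokens, r, bias=False)      # plays the role of \hat{Phi}
        self.up = nn.Linear(r, n_tokens, bias=False)        # plays the role of W

    def forward(self, rows: torch.Tensor) -> torch.Tensor:  # (..., N) -> (..., N)
        return self.up(self.down(rows))

def make_operators(mode: str, L: int, h: int, n_tokens: int, r: int) -> nn.ModuleList:
    if mode == "blockwise":
        # h distinct operators per encoder block: h * L operators in total.
        return nn.ModuleList(
            nn.ModuleList(KArAtUnit(n_tokens, r) for _ in range(h)) for _ in range(L)
        )
    if mode == "universal":
        # The same h operators (and their parameters) are reused in every block.
        shared = nn.ModuleList(KArAtUnit(n_tokens, r) for _ in range(h))
        return nn.ModuleList(shared for _ in range(L))
    raise ValueError(f"unknown mode: {mode}")

# ops[j][i] is the operator for head i in encoder block j under either configuration.
ops = make_operators("universal", L=12, h=3, n_tokens=197, r=16)
```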
Basis | Function Representation, \(\phi(x)\) | Initialized Specifications |
---|---|---|
Fourier | \(\sum_{k=1}^{G} (a_k \cos(kx) + b_k \sin(kx))\) | \(a_k, b_k \sim \mathcal{N}(0, 1)\), \(G\) denotes the grid size |
Rational \((m,n)\) | \(\frac{a_0 + a_1 x + \dots + a_m x^m}{1 + |b_1 x + \dots + b_n x^n|}\) | \(a_i, b_j \sim \mathcal{N}(0, 1)\), \(i=0, \dots, m\), \(j=1, \dots, n\) |
Mexican Hat | \(\frac{2}{\pi^{1/4}\sqrt{3}} \left(1 - \tilde{x}^2\right) e^{-\frac{\tilde{x}^2}{2}}\), \(\tilde{x} = \frac{x-\tau}{s}\) | \(w \sim \mathcal{N}(0, 1)\), (Translation) \(\tau = 0\), (Scale) \(s=1\) |
Morlet* | \(w \cos(\omega_0 \tilde{x}) e^{-\frac{\tilde{x}^2}{2}}\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
DOG | \(-w \frac{d}{dx}\left(e^{-\frac{\tilde{x}^2}{2}}\right)\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
Meyer** | \(w \sin(\pi|\tilde{x}|) (m \circ \nu)(\tilde{x})\), \(\tilde{x} = \frac{x-\tau}{s}\) | |
Shannon | \(w~\text{sinc}\left(\frac{\tilde{x}}{\pi}\right)\omega(\tilde{x})\), \(\tilde{x} = \frac{x-\tau}{s}\) | \(\omega(\tilde{x})\) is the symmetric Hamming window |
By design, KArAt can utilize any basis function for activating the attention units. Therefore, in addition to the Fourier basis, we embed five different wavelet bases and the rational function basis into KArAt; the function representations for these choices are outlined in the table above. Along with the basis, we can also choose the base activation \(b(x)\) from Zero, SiLU, and Identity. With this, the general expression of our Fourier KArAt is \(\widehat{\phi}^{i,j}_{pq}(\mathcal{A}^{i,j}_{k,q})= b(\mathcal{A}^{i,j}_{k,q}) + \sum_{m=1}^{G}\left(a_{pqm}\cos{(m\mathcal{A}^{i,j}_{k,q})}+b_{pqm}\sin{(m\mathcal{A}^{i,j}_{k,q})}\right)\). However, empirically, we found that the Zero base activation (\(b(x)=0\)) with the Fourier basis outperforms the rest of the combinations. For more details, see the paper.
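For illustration, the following is a minimal PyTorch sketch of the modular Fourier operator with zero base activation \(b(x)=0\); the module name, the choice \(r=16\), and the batch shapes are our assumptions for the example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FourierKArAtOperator(nn.Module):
    """Sketch of \hat{Phi}^{i,j} with a Fourier basis and zero base activation b(x) = 0:
    y_p = sum_q sum_{m=1}^G ( a[p,q,m] * cos(m * x_q) + b[p,q,m] * sin(m * x_q) )."""
    def __init__(self, n_tokens: int, r: int, grid: int):
        super().__init__()
        self.a = nn.Parameter(torch.randn(r, n_tokens, grid))   # a_{pqm} ~ N(0, 1)
        self.b = nn.Parameter(torch.randn(r, n_tokens, grid))   # b_{pqm} ~ N(0, 1)
        self.register_buffer("m", torch.arange(1, grid + 1).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds attention rows with shape (..., N); the output has shape (..., r).
        mx = x.unsqueeze(-1) * self.m                            # (..., N, G)
        out = torch.einsum('...qm,pqm->...p', torch.cos(mx), self.a)
        out = out + torch.einsum('...qm,pqm->...p', torch.sin(mx), self.b)
        return out

# Activate the attention rows of one head (grid size G = 3) and project back with W^{i,j}.
N, r, G = 197, 16, 3
phi_hat, W = FourierKArAtOperator(N, r, G), nn.Linear(r, N, bias=False)
A = torch.randn(4, N, N)                # (batch, rows, N) pre-activation attention scores
activated = W(phi_hat(A))               # (4, N, N) learned attention, replacing softmax
```

With grid size \(G\), such an operator stores \(2rNG\) Fourier coefficients plus \(Nr\) entries for the up-projection per head, which is roughly where the attention-activation parameter overhead reported below comes from.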
Model | CIFAR-10 Acc.@1 | CIFAR-10 Acc.@5 | CIFAR-100 Acc.@1 | CIFAR-100 Acc.@5 | ImageNet-1K Acc.@1 | ImageNet-1K Acc.@5 | Parameters |
---|---|---|---|---|---|---|---|
ViT-Base | 83.45 | 99.19 | 58.07 | 83.70 | 72.90 | 90.56 | 85.81M |
+ G1B | 81.81 (1.97% ↓) | 99.01 | 55.92 (3.70% ↓) | 82.04 | 68.03 (6.68% ↓) | 86.41 | 87.51M (1.98% ↑) |
+ G1U | 80.75 (3.24% ↓) | 98.76 | 57.36 (1.22% ↓) | 82.89 | 68.83 (5.58% ↓) | 87.69 | 85.95M (0.16% ↑) |
ViT-Small | 81.08 | 99.02 | 53.47 | 82.52 | 70.50 | 89.34 | 22.05M |
+ G3B | 79.78 (1.60% ↓) | 98.70 | 54.11 (1.20% ↑) | 81.02 | 67.77 (3.87% ↓) | 87.51 | 23.58M (6.94% ↑) |
+ G3U | 79.52 (1.92% ↓) | 98.85 | 53.86 (0.73% ↑) | 81.45 | 67.76 (3.89% ↓) | 87.60 | 22.18M (0.56% ↑) |
ViT-Tiny | 72.76 | 98.14 | 43.53 | 75.00 | 59.15 | 82.07 | 5.53M |
+ G3B | 76.69 (5.40% ↑) | 98.57 | 46.29 (6.34% ↑) | 77.02 | 59.11 (0.07% ↓) | 82.01 | 6.29M (13.74% ↑) |
+ G3U | 75.56 (3.85% ↑) | 98.48 | 46.75 (7.40% ↑) | 76.81 | 57.97 (1.99% ↓) | 81.03 | 5.59M (1.08% ↑) |
Model | SVHN Acc.@1 | Flowers 102 Acc.@1 | STL-10 Acc.@1 |
---|---|---|---|
ViT-Base | 97.74 | 92.24 | 97.26 |
+ G1B | 96.83 | 89.66 | 95.30 |
+ G1U | 97.21 | 89.43 | 95.78 |
ViT-Small | 97.48 | 91.46 | 96.09 |
+ G3B | 97.04 | 89.67 | 95.26 |
+ G3U | 97.11 | 90.08 | 95.45 |
ViT-Tiny | 96.69 | 84.21 | 93.20 |
+ G3B | 96.37 | 83.67 | 93.09 |
+ G3U | 96.39 | 83.70 | 92.93 |
Model | CIFAR-10 Acc.@1 | CIFAR-10 Acc.@5 | ImageNet-1K Acc.@1 | ImageNet-1K Acc.@5 |
---|---|---|---|---|
ConViT-Tiny | 71.36 | 97.86 | 57.91 | 81.79 |
+ G3B | 75.57 | 98.61 | 56.57 | 80.75 |
+ G3U | 74.51 | 98.63 | 56.51 | 80.93 |
Swin-Tiny | 84.83 | 99.43 | 76.14 | 92.81 |
+ G3B | 79.34 | 98.81 | 73.19 | 90.97 |
Model (Initialization) | Box AP | Box AP50 | Box AP75 | Mask AP | Mask AP50 | Mask AP75 |
---|---|---|---|---|---|---|
ViT-Det-Base (ImageNet-1K) | 32.34 | 49.16 | 34.43 | 28.35 | 46.04 | 29.48 |
+ G1B | 22.48 | 36.82 | 23.13 | 20.11 | 34.09 | 20.46 |
+ G1U | 26.68 | 43.25 | 27.88 | 24.01 | 40.21 | 24.66 |
ViT-Det-Base (Random) | 15.28 | 26.88 | 15.22 | 13.39 | 24.34 | 12.86 |
+ G1B | 10.32 | 18.77 | 10.05 | 8.94 | 16.51 | 8.50 |
+ G1U | 10.99 | 20.04 | 10.82 | 9.69 | 17.99 | 9.25 |
The overall computation for the Fourier KArAt variants is higher than for their conventional softmax MHSA counterparts. Most notably, the Fourier KArAt variants have longer training times, and the universal mode (GnU) consistently trains slightly faster than the blockwise mode (GnB). During training, we monitored GPU memory requirements and, as expected, the Fourier KArAt variants use significantly more memory than traditional MHSA; in particular, the GPU memory requirement scales by 2.5-3× compared to traditional softmax MHSA. We also see slightly faster inference in universal mode than in blockwise mode, except for ViT-Base. While there is a large training-time discrepancy between vanilla ViTs and Fourier KArAt ViTs, the inference speeds of the Fourier KArAt variants are comparable to their vanilla counterparts. Although there is a minor difference in throughput between the universal and blockwise modes during inference, theoretically both variants of any model with the same grid size have the same number of FLOPs.
Model | Attention Activation Params | Total Params | Attention Activation GFLOPs | Total GFLOPs | GPU Memory |
---|---|---|---|---|---|
ViT-Base | 0 | 85.81M | 0.016 | 17.595 | 7.44 GB |
+ G1B | 1.70M | 87.51M | 0.268 | 17.847 | 17.36 GB |
+ G1U | 0.14M | 85.95M | 0.268 | 17.847 | 16.97 GB |
ViT-Small | 0 | 22.05M | 0.008 | 4.614 | 4.15 GB |
+ G3B | 1.53M | 23.58M | 0.335 | 4.941 | 11.73 GB |
+ G3U | 0.13M | 22.18M | 0.335 | 4.941 | 11.21 GB |
ViT-Tiny | 0 | 5.53M | 0.005 | 1.262 | 2.94 GB |
+ G3B | 0.76M | 6.29M | 0.168 | 1.425 | 7.48 GB |
+ G3U | 0.06M | 5.59M | 0.168 | 1.425 | 7.29 GB |
Existing literature has established that the attention matrices in vision Transformers are usually sparse and low-rank. We empirically verify this claim and note that the attention matrix has a low-rank structure both before and after the softmax operation. Traditional softmax attention and our learnable Fourier KArAt have almost identical low-rank structures before the attention activation is applied. After the activations are applied, traditional MHSA has significantly larger singular values than its KArAt variant, G3B. We also notice that, before the activation, both traditional MHSA and Fourier KArAt feature a sharp drop in singular values between the 50th and 75th indices; this drop vanishes after the softmax, indicating a normalization effect. However, due to the hidden dimension \(r\), Fourier KArAt enforces a much lower rank than traditional MHSA.
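This kind of rank analysis is straightforward to reproduce; the sketch below computes the singular-value spectrum of one head's attention scores before and after the row-wise softmax (a random matrix is used here as a placeholder, whereas the actual scores would be captured with a forward hook during inference).

```python
import torch

N = 197
scores = torch.randn(N, N)   # placeholder; real A^{i,j} would be captured from a forward pass

sv_before = torch.linalg.svdvals(scores)                          # spectrum before activation
sv_after = torch.linalg.svdvals(torch.softmax(scores, dim=-1))    # spectrum after row-wise softmax

# A rapid decay (many near-zero singular values) indicates a low-rank structure;
# comparing the two spectra shows how the activation reshapes the rank profile.
print(sv_before[:5], sv_after[:5])
```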
These visualizations show that Fourier KArAt significantly impacts the smoothness of the loss surfaces of ViT architectures. ViT-Tiny, with the fewest parameters, has the smoothest loss landscape and modest generalizability. In contrast, ViT-Tiny+Fourier KArAt's loss landscape is spiky, indicating that the model is full of small-volume minima; however, because the model has a modest number of parameters, the gradient-descent optimizer can still find a solution with better generalizability than vanilla ViT-Tiny. ViT-Base has many more parameters than ViT-Tiny, and its loss surface is much spikier. Finally, the loss surface of ViT-Base+Fourier KArAt is the spikiest, making it a narrow-margin model with sharp minima, where small perturbations in the parameter space lead to high misclassification because such sharp regions occupy exponentially larger volume in high-dimensional spaces.
From the figure above, we observe that vanilla softmax attention focuses on sparse feature interactions, while Fourier KArAt captures the dominant feature interactions. Since Fourier KArAt, unlike vanilla softmax attention, is not restricted to values in \([0, 1]\), it also has the flexibility to capture negative interactions.
@article{maity2025karat,
title={Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?},
author={Maity, Subhajit and Hitsman, Killian and Li, Xin and Dutta, Aritra},
journal={arXiv preprint arXiv:2503.10632},
year={2025}
}
Acknowledgements: We thank Dr. Srijan Das from the University of North Carolina at Charlotte for his valuable feedback and suggestions and for arranging the computational resources.
Copyright: CC BY-NC-SA 4.0 © Subhajit Maity | Last updated: 26 Jun 2025