References
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Tay, Yi, et al. "Efficient transformers: A survey." ACM Computing Surveys (CSUR) 55.6 (2022): 1-28.

[3] Cho, Sungjun, et al. "Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost." arXiv preprint arXiv:2210.15541 (2022).

[4] Rohe, Karl, et al. "A note on quickly sampling a sparse matrix with low rank expectation." The Journal of Machine Learning Research 19.1 (2018): 3040-3052.

[5] Tay, Yi, et al. "Long range arena: A benchmark for efficient transformers." arXiv preprint arXiv:2011.04006 (2020).

[6] Wang, Alex, et al. "GLUE: A multi-task benchmark and analysis platform for natural language understanding." arXiv preprint arXiv:1804.07461 (2018).