References
[1] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International Conference on Machine Learning. PMLR, 2020.

[2] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.

[3] Mu, Norman, et al. "SLIP: Self-supervision meets language-image pre-training." arXiv preprint arXiv:2112.12750 (2021).

[4] Li, Yangguang, et al. "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm." arXiv preprint arXiv:2110.05208 (2021).

[5] Lee, Janghyeon, et al. "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training." arXiv preprint arXiv:2209.13430 (2022).