References

1. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

2. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015.

3. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

4. Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

5. Balažević, Ivana, et al. "Towards In-context Scene Understanding." arXiv preprint arXiv:2306.01667 (2023).

6. Grill, Jean-Bastien, et al. "Bootstrap your own latent: A new approach to self-supervised learning." Advances in neural information processing systems 33 (2020): 21271-21284.

7. Gupta, Agrim, et al. "Siamese Masked Autoencoders." arXiv preprint arXiv:2305.14344 (2023).

8. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

9. Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).

10. Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.

11. Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

12. Touvron, Hugo, et al. "LLaMA: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).

13. Schick, Timo, et al. "Toolformer: Language models can teach themselves to use tools." arXiv preprint arXiv:2302.04761 (2023).

14. Hao, Shibo, et al. "ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings." arXiv preprint arXiv:2305.11554 (2023).

15. Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." Advances in Neural Information Processing Systems 35 (2022): 23716-23736.

16. Yang, Zhengyuan, et al. "The dawn of LMMs: Preliminary explorations with GPT-4V(ision)." arXiv preprint arXiv:2309.17421 (2023).

17. Liu, Haotian, et al. "Visual instruction tuning." arXiv preprint arXiv:2304.08485 (2023).

18. Dai, Wenliang, et al. "InstructBLIP: Towards general-purpose vision-language models with instruction tuning." arXiv preprint arXiv:2305.06500 (2023).

19. Huang, Shaohan, et al. "Language is not all you need: Aligning perception with language models." arXiv preprint arXiv:2302.14045 (2023).

20. Sun, Yasheng, et al. "ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation." arXiv preprint arXiv:2308.00906 (2023).

21. Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

22. Saxena, Saurabh, et al. "The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation." arXiv preprint arXiv:2306.01923 (2023).

23. Chen, Shoufa, et al. "DiffusionDet: Diffusion model for object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

24. Tian, Yonglong, et al. "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners." arXiv preprint arXiv:2306.00984 (2023).

25. Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.

26. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

27. Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.

28. Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." arXiv preprint arXiv:2305.18290 (2023).

29. Fan, Ying, et al. "DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models." arXiv preprint arXiv:2305.16381 (2023).

30. Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).