References
[1] Brown et al., “Language Models are Few-Shot Learners”, In Advances in Neural Information Processing Systems, 2020
[2] Ramesh et al., “Zero-Shot Text-to-Image Generation”, In Proceedings of the International Conference on Machine Learning, 2021
[3] Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, In Proceedings of the International Conference on Machine Learning, 2021
[4] van den Oord et al., “Neural Discrete Representation Learning”, In Advances in Neural Information Processing Systems, 2017
[5] Ding et al., “CogView: Mastering Text-to-Image Generation via Transformers”, In Advances in Neural Information Processing Systems, 2021
[6] Kim et al., “L-Verse: Bidirectional Generation Between Image and Text”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
[7] Dhariwal et al., “Diffusion Models Beat GANs on Image Synthesis”, In Advances in Neural Information Processing Systems, 2021
[8] Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, Arxiv Preprint, 2021
[9] Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, Arxiv Preprint, 2022
[10] Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, Arxiv Preprint, 2022
[11] Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Journal of Machine Learning Research, 2020
[12] Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning”, Arxiv Preprint, 2022