MG-VTON: A Lightweight Virtual Try-on Model with Knowledge Distillation for Real-time Performance

Authors

  • Xuan Yu

DOI:

https://doi.org/10.54691/prt02137

Keywords:

Virtual try-on, Generative Adversarial Networks (GANs), knowledge distillation, real-time performance.

Abstract

In recent years, virtual try-on technology has attracted growing public attention and has become a key tool for companies seeking to boost sales and improve user experience. Existing virtual try-on methods fall into two main categories: those based on Generative Adversarial Networks (GANs) and those based on diffusion models. GAN-based methods are widely used thanks to their compact architectures and fast inference, but leave room for improvement in image quality and detail fidelity. Diffusion-based methods, in contrast, excel at generating high-quality, realistic images, but their high computational cost and slow inference limit their practicality in real-time applications. To address these issues, this paper proposes MG-VTON, a lightweight and efficient virtual try-on model that does not require human parsing. By introducing knowledge distillation, we streamline the model, significantly improving computational efficiency and inference speed while still generating high-quality, realistic try-on results. This work offers new insights for the further development of virtual try-on technology, enhancing the user experience and giving companies more competitive solutions for digital apparel presentation and marketing.
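The abstract does not detail MG-VTON's distillation procedure, but the general teacher-student setup it invokes can be sketched as follows. This is a minimal PyTorch illustration of response-based knowledge distillation for an image-to-image try-on generator; the TinyGenerator stand-in networks, the L1 objectives, and the loss weight alpha are illustrative assumptions, not MG-VTON's actual architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    """Stand-in try-on generator: takes a person image and a garment
    image (3 channels each) and outputs a 3-channel try-on image.
    A hypothetical placeholder, not the MG-VTON architecture."""
    def __init__(self, width: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, person, cloth):
        return self.net(torch.cat([person, cloth], dim=1))

teacher = TinyGenerator(width=64).eval()   # frozen, high-capacity model
student = TinyGenerator(width=16)          # lightweight, trainable model
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(person, cloth, target, alpha=0.5):
    """One step of response-based distillation: the student is pulled
    toward both the ground-truth image and the teacher's output."""
    with torch.no_grad():
        teacher_out = teacher(person, cloth)
    student_out = student(person, cloth)
    loss = (F.l1_loss(student_out, target)                 # supervised term
            + alpha * F.l1_loss(student_out, teacher_out)) # distillation term
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Smoke test with random tensors standing in for a training batch.
person = torch.rand(2, 3, 256, 192)
cloth = torch.rand(2, 3, 256, 192)
target = torch.rand(2, 3, 256, 192)
print(distill_step(person, cloth, target))
```

After distillation, only the compact student is deployed, which is what makes real-time inference feasible; the heavier teacher is used during training only.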




Published

19-03-2025
