Let us know if you have any comments, questions, or want to share your creations (Notebooks at the end of the post)
Ernesto @vedax & diavlex @diavlex_ai
Artists combine colors and brushstrokes to paint their masterpieces; they do not create paintings pixel by pixel. However, most current generative AI Art methods based on Machine Learning (ML) are still centered on teaching machines how to ‘paint’ at the pixel level in order to achieve or mimic some painting style, e.g., GAN-based approaches and style transfer. This might be effective, but it is not very intuitive, especially when explaining the process to artists, who are familiar with colors and brushstrokes.
Our goal is to teach a machine how to paint using a combination of colors and strokes by telling it in natural language what to paint. How can we achieve this?
Materials and Approach
We need two basic ingredients:
An ML model that connects text with images, that is, one able to associate text with visual concepts. We use CLIP (Contrastive Language–Image Pre-training) for this task.
A neural painter, a model that takes brushstroke parameters (actions) and renders the corresponding strokes on a canvas.
The following steps capture the essence of our idea. Note that we use Python-based pseudocode, which does not necessarily reflect the models' APIs. Please have a look at the notebook itself for the actual code and methods used.
Specify what to paint, e.g.,
prompt = "black sheep"
Encode the text using CLIP's language portion to obtain the text features
text_features = clip_model.encode_text(prompt)
Initialize a list of brushstrokes or actions and ask the neural painter to paint on a canvas. At the beginning the canvas will look random.
canvas = neural_painter.paint(actions)
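To make the actions-to-canvas mapping concrete, here is a toy stand-in painter (not the actual trained neural painter from the notebook): each action is an illustrative (x, y, radius, intensity) tuple rendered as a soft blob on a grayscale canvas.

```python
import numpy as np

def toy_paint(actions, size=64):
    """Render (x, y, radius, intensity) 'strokes' as soft Gaussian blobs
    on a grayscale canvas. A simple stand-in for the trained neural
    painter, which maps action vectors to realistic brushstrokes."""
    canvas = np.zeros((size, size))
    ys, xs = np.mgrid[0:size, 0:size]
    for x, y, radius, intensity in actions:
        blob = intensity * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * radius ** 2))
        canvas = np.clip(canvas + blob, 0.0, 1.0)  # composite strokes additively
    return canvas

# Random initial actions produce a random-looking canvas, as described above.
rng = np.random.default_rng(0)
actions = [(rng.uniform(0, 64), rng.uniform(0, 64),
            rng.uniform(2, 8), rng.uniform(0.2, 1.0)) for _ in range(10)]
canvas = toy_paint(actions)
```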
Use the vision portion of the CLIP model to extract the image features of this initial canvas:
image_features = clip_model.encode_image(canvas)
The goal is to teach the neural painter to modify the strokes (i.e., its actions) depending on how different what it is painting is from the initial text request (prompt). For example, in the ideal case, the cosine similarity between the text and image feature vectors would be 1.0.
Following this intuition, we use the cosine distance, which measures how different the two vectors are, as the loss to guide the optimization process. In our case the cosine distance corresponds to
loss = 1.0 - cos(text_features, image_features).
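In plain NumPy, this loss can be computed as follows (the vectors here are simple stand-ins for the CLIP embeddings):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

v = np.array([1.0, 2.0, 3.0])
print(cosine_distance(v, v))   # identical vectors -> distance of about 0.0
print(cosine_distance(v, -v))  # opposite vectors  -> distance of about 2.0
```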
We minimize this loss by adapting the neural painter actions, which in the end should produce a canvas as close as possible to the original request.
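Putting the steps together, here is a minimal sketch of the optimization loop in NumPy with a numerical gradient. All names are illustrative stand-ins: the notebook instead backpropagates through the actual neural painter and CLIP with automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins: a fixed random linear "painter + image encoder" and a
# fixed "text feature" target. In the notebook these are the neural
# painter and CLIP, differentiated with autograd instead.
W = rng.normal(size=(16, 8))           # maps 8 action params to 16-dim features
text_features = rng.normal(size=16)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def loss_fn(actions):
    image_features = W @ actions       # "paint" the canvas, then "encode" it
    return cosine_distance(image_features, text_features)

actions = rng.normal(size=8)           # random initial strokes
initial_loss = loss_fn(actions)

lr, eps = 0.1, 1e-5
for _ in range(200):                   # gradient descent via finite differences
    grad = np.zeros_like(actions)
    for i in range(len(actions)):
        d = np.zeros_like(actions); d[i] = eps
        grad[i] = (loss_fn(actions + d) - loss_fn(actions - d)) / (2 * eps)
    actions -= lr * grad

final_loss = loss_fn(actions)          # should be lower than initial_loss
```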
Enjoy Neural Painting! ;)
Notebook at Kaggle
Note: If you are part of Kaggle's community this is probably the better option, given that Kaggle currently offers an NVIDIA TESLA P100 for all notebook sessions, whereas on Google Colab you might get a less powerful GPU for your session.
Notebook at Colab
- Paper: https://arxiv.org/abs/2111.10283
- Blogpost: https://www.libreai.com/the-joy-of-neural-painting/
- Code Repo [MIT License]: https://github.com/libreai/neural-painters-x – Part of the code is used in this notebook.
- Paper: https://arxiv.org/abs/1904.08410
- Blogpost: https://reiinakano.com/2019/01/27/world-painters.html
- Code Repo [MIT License]: https://github.com/reiinakano/neural-painters-pytorch – Part of the code is used in this notebook.
 Learning Transferable Visual Models From Natural Language Supervision. CLIP. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. OpenAI. 2021.
- Paper: https://arxiv.org/abs/2103.00020
- Blogpost: https://openai.com/blog/clip/
- Code Repo [MIT License]: https://github.com/openai/CLIP