Hitchhiker's Guide to Text-To-Image Generation
Hardip Patel
Presentation in blog format
Intro
Full-time: Backend-heavy full-stack developer
Spare time: Work on my Accountability site
Hobbies:
Snooker (very recent)
Box Cricket
Try new hobbies
Currently Reading
Make by Pieter Levels
Why this topic? ...even though you're a GenAI "noob"
Provide beginner's perspective
Wanted to help close the barrier-to-entry gap
Inspiration ...for getting into it
Wanted to create a dynamically updating hero picture for my Accountability site
Pieter Levels (check Photo AI)
Sayak Paul
Overpowered
Abhishek Thakur
Journey Overview
Tried Midjourney on Discord very, very early
Tested Automatic1111 after watching Overpowered
Reached saturation with the UI, so wanted to try with code
So hopped on to Google Colab
Tried ComfyUI for this talk, and it is quite awesome, to say the least
What is Stable Diffusion?
Text-to-image model, a combination of...
Language model, which transforms text into a latent representation
Generative image model, which produces an image conditioned on that representation
Based on diffusion (probabilistic) models
A class of latent variable generative models
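To make "diffusion" a little more concrete, here is a minimal sketch of the forward (noising) process that such models learn to reverse. It is not Stable Diffusion's actual code: the linear schedule values follow the common DDPM defaults and are assumptions here, and the four-value "image" is a toy stand-in. Stable Diffusion applies the same idea to latents rather than pixels.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [signal * value + sigma * rng.gauss(0.0, 1.0) for value in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)
pixels = [0.5, -0.25, 1.0, 0.0]                          # toy "image"
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)   # mostly signal
mostly_noise = add_noise(pixels, 999, alpha_bars, rng)    # almost pure noise
```

Early steps keep most of the original signal, while the last step is close to pure noise. Generation runs in the opposite direction: start from noise and remove a little of it at each step.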
UI Tools for No-Code
Automatic1111
ComfyUI
Invoke AI
DiffusionBee
Automatic1111
Installation Link
Widely used
Good extension support
Most compatible
But unstable...
ComfyUI
Installation Link
Tutorial/Guide
Gaining traction lately
Intuitive UI
Very stable
Terminologies (1/5)
PyTorch
deep learning framework based on Torch
Base Model
Foundational model upon which specific model variants are built
For example, Stable Diffusion v1.5, v2, XL 0.9, XL 1.0
Checkpoint (Model)
Pretrained Weights
Determines the types of images the model generates, based on what it was trained on
For example, Juggernaut XL, Anything v3.0, epicRealism, etc...
Terminologies (2/5)
Guidance Scale (CFG)
Controls how closely image generation follows the text prompt
LoRA (Low-Rank Adaptation)
Adds specific styles or characters while maintaining manageable file sizes
PEFT (Parameter-Efficient Fine-Tuning)
Adapts a pretrained model by training a small number of extra parameters while keeping the original parameters frozen
Used to create LoRAs
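Two of the terms on this slide come down to simple arithmetic, so here is a rough sketch of both. The function names and toy numbers are made up for illustration. The guidance formula is the usual classifier-free guidance combination, and the parameter count assumes a LoRA update factored as two matrices of rank r.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: move the prediction toward the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


def lora_param_counts(d_out, d_in, rank):
    """Compare a full weight update (d_out x d_in) with a LoRA update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


uncond = [0.10, -0.20, 0.30]   # toy noise prediction for an empty prompt
text = [0.30, -0.10, 0.10]     # toy noise prediction for the actual prompt
same_as_text = apply_guidance(uncond, text, 1.0)   # scale 1 = plain conditional output
stronger = apply_guidance(uncond, text, 7.5)       # higher scale = follows prompt harder

full, lora = lora_param_counts(d_out=768, d_in=768, rank=4)
print(f"full update: {full} params, LoRA update: {lora} params")
```

For a 768 x 768 layer, the rank-4 update trains 6,144 values instead of 589,824, which is why LoRA files stay small.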
Terminologies (3/5)
Weights
Numerical values associated with the connections between neurons in neural network architecture
Visualize
Prompt
Text-based instruction describing the image you want
Terminologies (4/5)
Text encoder
Transformer language model
Turns the tokenized prompt into embeddings that are fed into the U-Net
U-Net
Takes encoded text (plain text processed into a format it can understand) and a noisy array of numbers as inputs, and predicts the noise to remove at each step
VAE
Encodes and decodes images to and from a smaller latent space
Visualize
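To give a feel for how much smaller the latent space is, here is a small shape calculation. The downsampling factor of 8 and the 4 latent channels match Stable Diffusion v1.x as far as I know. Treat them as assumptions, since other models can differ.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the VAE latent for an RGB image.
    Factor 8 and 4 channels are assumed Stable Diffusion v1.x values."""
    if height % downsample or width % downsample:
        raise ValueError("image sides should be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def count(shape):
    """Total number of values in an array of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


image_shape = (3, 512, 512)          # RGB image, 512 x 512
latent = latent_shape(512, 512)      # what the U-Net actually works on
ratio = count(image_shape) / count(latent)
print(latent, ratio)
```

The U-Net denoises this small latent array, and the VAE decoder turns the final latent back into a full-size image.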
Terminologies (5/5)
Pipeline
Bundles all the components needed to run a diffusion model for inference
Provides flexibility
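Seed
Number that initializes the random starting noise; the same seed and settings reproduce the same image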
Fine-Tuning
Further train a model that was pretrained on a broad dataset, using a narrower dataset
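A seed fixes the random noise that generation starts from, which is what makes a result reproducible. The sketch below shows only that idea using Python's standard library. `initial_noise` is a made-up helper, not part of any diffusion library.

```python
import random


def initial_noise(seed, size=4):
    """Draw the starting noise from a generator initialized with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


run_a = initial_noise(seed=42)
run_b = initial_noise(seed=42)   # same seed: identical starting noise
run_c = initial_noise(seed=43)   # different seed: different starting noise
```

Keeping the seed fixed while changing one setting at a time (prompt, guidance scale, steps) is a handy way to see what each setting does.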
Code demo
Inference Code
Model Fine-Tuning code
Prepare images for training using BIRME
Don't use a trigger token that the model has already been trained on
Inference code with trained model
ComfyUI with trained model
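For readers who cannot open the demo notebooks, here is a rough sketch of what inference code can look like with the Hugging Face `diffusers` library. It is not the code from the talk. The `generate` wrapper, its default values, and the CUDA GPU are assumptions, and `model_id` and `lora_path` are placeholders you would fill in yourself.

```python
def generate(model_id, prompt, seed=42, guidance_scale=7.5, steps=30, lora_path=None):
    """Generate one image from a text prompt and return it as a PIL image."""
    # Imported inside the function so the heavy libraries load only when it is called.
    import torch
    from diffusers import StableDiffusionPipeline

    # The pipeline bundles the text encoder, U-Net, VAE and scheduler.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")  # assumes a CUDA GPU, e.g. a Colab GPU runtime

    if lora_path is not None:
        pipe.load_lora_weights(lora_path)  # optional fine-tuned LoRA weights

    # A seeded generator makes the result reproducible.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(
        prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=steps,
        generator=generator,
    )
    return result.images[0]


# Example usage (placeholder model ID, needs a GPU and a model download):
# image = generate("<model-id>", "a snooker table at sunset, photorealistic")
# image.save("hero.png")
```

When using a fine-tuned LoRA, the prompt would normally include the trigger token chosen during training.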
Further capabilities of Stable Diffusion
Inpainting
Restore/Repair image
Outpainting
Extend canvas of the image
Image To Image
Generates a new image from an input image and a text prompt
The new image follows the composition and colors of the input image
Depth To Image
Uses the depth of the input image to guide the composition of the new image
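Inpainting and outpainting both depend on a mask that tells the model which pixels to generate and which to keep. The sketch below covers only that bookkeeping on a tiny grayscale grid. `prepare_outpaint` and `composite` are made-up helpers, and the `generated` grid is a stand-in for what a diffusion model would produce.

```python
def prepare_outpaint(image, pad, fill=0):
    """Pad a 2D grayscale image (list of rows) on all sides and build a mask.

    Mask value 1 marks new pixels to generate; 0 marks original pixels to keep.
    """
    height, width = len(image), len(image[0])
    new_width = width + 2 * pad
    canvas = [[fill] * new_width for _ in range(height + 2 * pad)]
    mask = [[1] * new_width for _ in range(height + 2 * pad)]
    for y in range(height):
        for x in range(width):
            canvas[y + pad][x + pad] = image[y][x]
            mask[y + pad][x + pad] = 0
    return canvas, mask


def composite(original, generated, mask):
    """Keep original pixels where the mask is 0, take generated ones where it is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


image = [[5, 6], [7, 8]]
canvas, mask = prepare_outpaint(image, pad=1)
generated = [[9] * 4 for _ in range(4)]   # stand-in for model output
result = composite(canvas, generated, mask)
```

For inpainting the idea is the same, except the mask marks a region inside the image instead of a border around it.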
THAT'S ALL FOLKS!
Credits
Towards Data Science
Hugging Face
Google Colab
BIRME
Automatic1111
ComfyUI