RawGen

Learning Camera Raw Image Generation

A unified framework for text-to-raw and sRGB-to-raw generation across arbitrary camera sensors.

Dongyoung Kim1 · Junyong Lee1* · Abhijith Punnappurath1* ·
Mahmoud Afifi1* · Sangmin Han2 · Alex Levinshtein1 · Michael S. Brown1

1AI Center – Toronto, Samsung Electronics    2Yonsei University    * Equal Contribution

Abstract
The Challenge
Cameras capture scene-referred linear raw images, which ISPs process into 8-bit sRGB. While raw data is more faithful for low-level vision tasks, large-scale raw datasets remain scarce and device-specific. Existing diffusion models synthesize photo-finished sRGB rather than physically consistent linear representations.

RawGen introduces a generative approach that learns the complex distribution of raw sensor data directly, enabling high-fidelity generation from either text descriptions or standard sRGB images across arbitrary camera sensors.

Many-to-One Reconstruction: Maps multiple photo-finished sRGB renditions to a single canonical linear reference, making the model robust to unknown ISPs.

Camera-Agnostic Generation: First unified framework for text-driven linear and camera-specific raw synthesis without retraining.

Scalable Data Synthesis: Text-to-raw alleviates data acquisition challenges in illuminant estimation, neural ISP, and denoising.

Experimental Results
Results Showcase

RawGen generates camera raw images from sRGB inputs or text prompts.

Multi-Sensor Universal Adaptation
Any Camera. One Model.

Each camera sensor has a unique spectral sensitivity; even under the same lighting and scene, different sensors produce different colors. RawGen learns this device-specific mapping, generating raw images for any target camera from a single trained model.

Method
The RawGen Architecture

A three-stage generative pipeline built on FLUX.1-Kontext.

Fig. 2: Method overview — raw → CIE XYZ anchoring, diverse sRGB rendering, DiT fine-tuning, VAE decoder fine-tuning, and I2R / T2R inference.
Overview

Three-Stage Generative Pipeline

RawGen converts arbitrary sRGB images or text prompts into sensor-realistic camera raw via three stages: (1) a many-to-one data construction that pairs diverse sRGB renditions with a single CIE XYZ anchor, (2) DiT denoiser and VAE decoder fine-tuning on the constructed pairs, and (3) unified Image-to-Raw / Text-to-Raw inference.

Step 1 of 4: Data Construction

Many-to-One Training Data

From a raw image, we derive a CIE XYZ anchor, then generate N sRGB variants by randomizing white balance, tone curve, and contrast.

$z_{\text{sRGB}}^{(n)} = E_{\text{VAE}}\!\left(I_{\text{sRGB}}^{(n)}\right), \quad z_{\text{XYZ}} = E_{\text{VAE}}(I_{\text{XYZ}})$
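The many-to-one construction can be sketched in a few lines of NumPy. The `random_render` function below is a toy stand-in for a full rendering pipeline: the specific gain, gamma, and contrast ranges are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def random_render(xyz, rng):
    """Render one sRGB-like variant from a linear XYZ anchor by
    randomizing white-balance gains, a gamma tone curve, and contrast.
    (Toy stand-in for a full ISP; parameter ranges are illustrative.)"""
    wb = rng.uniform(0.7, 1.3, size=3)      # per-channel white-balance gain
    gamma = rng.uniform(1.8, 2.6)           # randomized tone curve
    contrast = rng.uniform(0.9, 1.1)        # mild contrast jitter
    img = np.clip(xyz * wb, 0.0, 1.0) ** (1.0 / gamma)
    return np.clip((img - 0.5) * contrast + 0.5, 0.0, 1.0)

rng = np.random.default_rng(0)
xyz_anchor = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # toy linear XYZ image
variants = [random_render(xyz_anchor, rng) for _ in range(5)]
# Many-to-one: all N variants pair with the single xyz_anchor target.
```

Every variant shares the same XYZ target, which is what teaches the model to collapse diverse photo finishes onto one canonical linear image.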
Step 2 of 4: Denoiser Fine-Tuning

DiT Denoiser Tuning

We tune the DiT to predict the rectified-flow velocity target using LoRA:

$\mathcal{L}_{\text{denoise}} = \mathbb{E}_{n,t,\epsilon}\left\| v_{\text{gt}} - v_\theta(z_t, t;\, z_{\text{sRGB}}^{(n)})\right\|_2^2$

sRGB context and noisy target latents are concatenated along the sequence dimension.
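To make the objective concrete, here is a minimal NumPy sketch of the loss above. The `eps − z` sign convention for the velocity target is an assumption (conventions vary across rectified-flow implementations), and `v_theta` is a zero-output placeholder for the LoRA-tuned DiT; only the sequence-dimension concatenation mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, D = 2, 8, 4                      # batch, sequence length, latent dim
z_xyz  = rng.normal(size=(B, L, D))    # clean XYZ target latents
z_srgb = rng.normal(size=(B, L, D))    # sRGB conditioning latents
eps    = rng.normal(size=(B, L, D))    # Gaussian noise
t      = rng.uniform(size=(B, 1, 1))   # per-sample timestep

z_t  = (1.0 - t) * z_xyz + t * eps     # rectified-flow interpolation
v_gt = eps - z_xyz                     # velocity target (sign convention assumed)

def v_theta(z_t, t, z_cond):
    """Placeholder denoiser: a real DiT would consume the condition and
    noisy target concatenated along the sequence axis."""
    tokens = np.concatenate([z_cond, z_t], axis=1)  # (B, 2L, D) sequence concat
    return np.zeros_like(z_t)                       # stand-in prediction

loss = np.mean(np.sum((v_gt - v_theta(z_t, t, z_srgb)) ** 2, axis=-1))
```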

Step 3 of 4: Decoder Fine-Tuning

VAE Decoder Tuning

The VAE decoder is retargeted from sRGB to linear XYZ:

$\hat{I}_{\text{XYZ}} = D_{\text{VAE}}(z_{\text{XYZ}}), \quad \mathcal{L}_{\text{recon}} = \|\hat{I}_{\text{XYZ}} - I_{\text{XYZ}}\|_1$
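A toy sketch of the reconstruction objective, with a perturbed copy of the target standing in for the decoder output (the decoder network itself is not modeled here):

```python
import numpy as np

rng = np.random.default_rng(0)
i_xyz = rng.uniform(size=(4, 4, 3))                 # ground-truth linear XYZ
# Stand-in for D_VAE(z_XYZ): the true target plus small decoding error.
i_xyz_hat = np.clip(i_xyz + rng.normal(scale=0.01, size=i_xyz.shape), 0.0, 1.0)
l_recon = np.mean(np.abs(i_xyz_hat - i_xyz))        # L1 reconstruction loss
```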
Step 4 of 4: Inference

I2R & T2R Inference

🖼️ Image-to-Raw

sRGB → VAE enc → DiT → XYZ latent → VAE dec → camera raw

✍️ Text-to-Raw

Text → FLUX.1 latent → DiT → XYZ latent → VAE dec → camera raw
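The Image-to-Raw path above can be sketched as a chain of stage stubs. Every function body here is a toy placeholder, and mapping XYZ to a camera's raw space with a per-camera 3×3 matrix is our simplifying assumption, not the paper's stated mechanism.

```python
import numpy as np

# Stand-in stages; each is a toy placeholder, not the real network.
def vae_encode(img):  return img.reshape(-1, 3).mean(axis=0)  # sRGB -> latent
def dit_denoise(z):   return z * 0.5                          # latent -> XYZ latent
def vae_decode(z):    return np.tile(z, (4, 4, 1))            # XYZ latent -> image

def xyz_to_raw(xyz, ccm):
    """Map linear XYZ to a camera-specific raw space via a 3x3 color
    matrix (an assumed simplification of sensor spectral sensitivity)."""
    return np.clip(xyz @ ccm.T, 0.0, 1.0)

ccm = np.eye(3) * np.array([0.9, 1.0, 1.1])   # toy per-camera matrix
srgb = np.full((4, 4, 3), 0.5)

# I2R: sRGB -> VAE enc -> DiT -> XYZ latent -> VAE dec -> camera raw
raw = xyz_to_raw(vae_decode(dit_denoise(vae_encode(srgb))), ccm)
```

The T2R path swaps only the front of the chain: a text encoder produces the initial latent instead of the VAE encoder.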

Many-to-One Robustness
One Scene, Five Edits, One Reconstruction

We evaluate on the MIT-Adobe FiveK dataset, where each scene is retouched by five professional photographers (Expert A–E) with distinct aesthetic preferences. Conventional methods fail when the input sRGB comes from an unknown rendering style, but RawGen's many-to-one training maps all five diverse renditions to a single canonical CIE XYZ, reliably recovering the same linear image regardless of photo-finishing.

Many-to-one comparison grid over five scenes (columns: sRGB Input, CIE XYZ Net, RawGen (Ours), Ground Truth).
Each cell reports PSNR ↑ / SSIM ↑.

| Method | Expert A | Expert B | Expert C | Expert D | Expert E |
|---|---|---|---|---|---|
| CIE XYZ Net | 19.60 / 0.786 | 21.04 / 0.843 | 19.49 / 0.780 | 19.44 / 0.806 | 18.64 / 0.792 |
| InvISP | 16.04 / 0.685 | 16.04 / 0.685 | 14.92 / 0.632 | 14.24 / 0.680 | 13.30 / 0.647 |
| Raw-Diffusion | 19.30 / 0.783 | 20.66 / 0.840 | 18.79 / 0.771 | 18.69 / 0.797 | 17.49 / 0.775 |
| RawGen (Ours) | 23.20 / 0.843 | 24.35 / 0.858 | 23.37 / 0.839 | 23.51 / 0.853 | 23.89 / 0.850 |

Table 1: sRGB-to-XYZ on MIT-Adobe FiveK across five expert styles.

Downstream Utility
Scalable Raw Synthesis for Downstream Tasks

Collecting device-specific raw data at scale is costly and labor-intensive. RawGen's Text-to-Raw pipeline generates diverse, camera-specific raw images from text prompts alone, without physical capture. With just 3K synthetic samples, downstream tasks like illuminant estimation, neural ISP learning, and raw denoising approach or exceed real-data performance.

Pipeline: a text prompt (e.g., "A sunlit kitchen with warm tones") is fed to RawGen's Text-to-Raw model to produce 3K camera-specific synthetic raw images, which then train three downstream tasks: illuminant estimation, neural ISP, and raw denoising.
| Method | Illuminant Est. (° ↓) Mean / Med / W25% | Neural ISP (PSNR ↑ / SSIM ↑) | Denoise ISO 1600 (PSNR ↑ / SSIM ↑) | Denoise ISO 3200 (PSNR ↑ / SSIM ↑) |
|---|---|---|---|---|
| EnlightenGAN | 7.01 / 6.82 / 11.07 | 35.58 / 0.965 | 48.82 / 0.991 | 47.25 / 0.988 |
| UPI | 6.26 / 5.89 / 10.33 | 36.43 / 0.966 | 49.05 / 0.990 | 47.51 / 0.988 |
| Graphics2RAW | 4.21 / 3.38 / 8.57 | 38.10 / 0.974 | 49.37 / 0.991 | 48.16 / 0.989 |
| RawGen | 3.14 / 2.11 / 7.37 | 38.42 / 0.970 | 50.63 / 0.994 | 48.57 / 0.992 |
| Real | 3.02 / 2.17 / 6.77 | 38.32 / 0.974 | 49.80 / 0.993 | 48.25 / 0.990 |

Table 2: Downstream tasks trained on 3K synthetic raw images versus alternative synthesis methods and real data.
Raw-Domain Editing
Generate Once, Edit Forever

RawGen decouples content synthesis from rendering by producing a scene-referred linear image. Every downstream edit becomes a standard ISP operation, with no re-inference or retraining.

🔃
Inverse-ISP Methods

Device-Specific Inversion

Tuned to specific cameras and fixed imaging assumptions. Heterogeneous sRGB inputs outside the training distribution often cause failures.

⚠ device-dependent
🔄
Camera-Controllable Generation

Re-Inference Per Edit

>20 sec / edit

Re-runs full diffusion for every parameter change. Only supports parameters explicitly learned during training. New edit types require retraining.

✕ fixed parameter set
RawGen (Ours)

Generate Once, Edit Freely

~0.53 ms / edit

Generate a linear raw image once, then apply any ISP operation (white balance, exposure, tone mapping) with any software pipeline.

✓ all edit types supported
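As a minimal illustration of why edits are cheap on a scene-referred linear image, the sketch below applies white balance, exposure, and a simple gamma tone curve as plain array math. `edit_linear` and its parameter choices are our illustrative assumptions, not a specific ISP.

```python
import numpy as np

def edit_linear(xyz, wb_gains, ev, gamma=2.2):
    """Apply standard ISP edits to a scene-referred linear image:
    white balance (per-channel gains), exposure (EV scaling), then a
    simple gamma tone curve for display. Illustrative only."""
    out = xyz * np.asarray(wb_gains)               # white balance
    out = out * (2.0 ** ev)                        # exposure, in stops
    return np.clip(out, 0.0, 1.0) ** (1.0 / gamma) # tone mapping

linear = np.full((2, 2, 3), 0.25)                  # toy linear raw/XYZ image
warm = edit_linear(linear, [1.2, 1.0, 0.8], ev=+1.0)  # warmer, +1 stop
cool = edit_linear(linear, [0.8, 1.0, 1.2], ev=-1.0)  # cooler, -1 stop
# No re-inference: each edit is just array math on the same generated image.
```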
Citation
BibTeX

If you find RawGen useful, please cite our work.

@article{kim2026rawgen,
  title     = {RawGen: Learning Camera Raw Image Generation},
  author    = {Dongyoung Kim and Junyong Lee and Abhijith Punnappurath and Mahmoud Afifi and Sangmin Han and Alex Levinshtein and Michael S. Brown},
  journal   = {arXiv preprint},
  year      = {2026}
}