Learning Camera Raw Image Generation
A unified framework for text-to-raw and sRGB-to-raw generation across arbitrary camera sensors.
¹AI Center – Toronto, Samsung Electronics · ²Yonsei University · *Equal contribution
RawGen introduces a generative approach that learns the complex distribution of raw sensor data directly, enabling high-fidelity generation from either text descriptions or standard sRGB images across arbitrary camera sensors.
Many-to-One Reconstruction: Maps multiple photo-finished sRGB renditions to a single canonical linear reference, remaining robust to unknown ISPs.
Camera-Agnostic Generation: First unified framework for text-driven linear and camera-specific raw synthesis without retraining.
Scalable Data Synthesis: Text-to-raw alleviates data acquisition challenges in illuminant estimation, neural ISP, and denoising.


RawGen generates camera raw images from sRGB inputs or text prompts.
Each camera sensor has a unique spectral sensitivity; even under the same lighting and scene, different sensors produce different colors. RawGen learns this device-specific mapping, generating raw images for any target camera from a single trained model.
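The device-specific mapping can be pictured as a per-camera color transform from a shared scene representation: the same linear CIE XYZ scene, pushed through two different sensor matrices, yields two different raw responses. The sketch below uses illustrative matrix values, not real calibration data:

```python
import numpy as np

# Hypothetical 3x3 XYZ -> camera-raw matrices for two sensors (illustrative
# values only; real matrices come from each camera's spectral calibration).
CAM_A = np.array([[0.73, 0.22, 0.05],
                  [0.28, 0.67, 0.05],
                  [0.02, 0.15, 0.83]])
CAM_B = np.array([[0.81, 0.15, 0.04],
                  [0.22, 0.71, 0.07],
                  [0.01, 0.10, 0.89]])

def xyz_to_raw(xyz, cam_matrix):
    """Map a linear CIE XYZ image (H, W, 3) to sensor raw via a 3x3 matrix."""
    return np.clip(xyz @ cam_matrix.T, 0.0, 1.0)

xyz = np.random.default_rng(0).random((4, 4, 3))   # toy linear XYZ patch
raw_a = xyz_to_raw(xyz, CAM_A)                     # same scene ...
raw_b = xyz_to_raw(xyz, CAM_B)                     # ... different raw colors
```

RawGen learns this sensor-dependent rendering implicitly, so a single model can target any of the supported cameras.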
A three-stage generative pipeline built on FLUX.1-Kontext.
RawGen converts arbitrary sRGB images or text prompts into sensor-realistic camera raw via three stages: (1) a many-to-one data construction that pairs diverse sRGB renditions with a single CIE XYZ anchor, (2) DiT denoiser and VAE decoder fine-tuning on the constructed pairs, and (3) unified Image-to-Raw / Text-to-Raw inference.
From a raw image, we derive a CIE XYZ anchor. Then we generate N sRGB variants by randomizing white balance, tone-curve, and contrast.
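A minimal sketch of this many-to-one construction, assuming a simple parametric ISP (per-channel white-balance gains, a gamma tone curve, and a contrast scale; the paper's exact augmentation ranges are not specified here):

```python
import numpy as np

def make_srgb_variants(xyz, n, seed=0):
    """Render n photo-finished sRGB variants from one linear XYZ anchor by
    randomizing white balance, tone curve, and contrast (illustrative ISP)."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(n):
        wb = rng.uniform(0.7, 1.3, size=3)   # per-channel WB gains
        gamma = rng.uniform(1.8, 2.6)        # tone-curve exponent
        contrast = rng.uniform(0.8, 1.2)     # contrast scale
        img = np.clip(xyz * wb, 0.0, 1.0) ** (1.0 / gamma)
        img = np.clip((img - 0.5) * contrast + 0.5, 0.0, 1.0)
        variants.append(img)
    return variants  # many sRGB renditions, one XYZ target

xyz_anchor = np.random.default_rng(1).random((8, 8, 3))
srgbs = make_srgb_variants(xyz_anchor, n=5)
```

Every variant pairs with the same XYZ anchor, which is what teaches the model to invert arbitrary photo-finishing back to one canonical linear image.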
We tune the DiT to predict the rectified-flow velocity target using LoRA:
sRGB context and noisy target latents are concatenated along the sequence dimension.
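The rectified-flow objective and the sequence-dimension concatenation can be sketched as follows (NumPy stand-in for the latent tensors; the actual model is a LoRA-adapted DiT):

```python
import numpy as np

def rectified_flow_batch(x1, rng):
    """Build a rectified-flow training pair: interpolate noise -> data and
    return the constant velocity target v = x1 - x0 (training sketch)."""
    x0 = rng.standard_normal(x1.shape)           # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))    # per-sample timestep
    xt = (1.0 - t) * x0 + t * x1                 # straight-line interpolant
    v_target = x1 - x0                           # velocity the DiT predicts
    return xt, t, v_target

rng = np.random.default_rng(0)
ctx = rng.standard_normal((2, 16, 64))   # sRGB context latent tokens
x1 = rng.standard_normal((2, 16, 64))    # clean XYZ target latent tokens
xt, t, v = rectified_flow_batch(x1, rng)
# Context and noisy target concatenated along the sequence dimension:
tokens = np.concatenate([ctx, xt], axis=1)   # shape (2, 32, 64)
```

The DiT sees both token groups jointly, so the predicted velocity for the target tokens is conditioned on the sRGB context.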
The VAE decoder is retargeted from sRGB to linear XYZ:
sRGB → VAE enc → DiT → XYZ latent → VAE dec → camera raw
Text → FLUX.1 latent → DiT → XYZ latent → VAE dec → camera raw
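The two inference paths share the denoiser and the retargeted decoder, differing only in how the context is produced. A dispatch sketch with placeholder callables standing in for the fine-tuned FLUX.1-Kontext modules (not the real API):

```python
import numpy as np

def generate_raw(dit, xyz_decoder, context, mode):
    """Unified inference sketch: both modes denoise to an XYZ latent, then
    the retargeted VAE decoder emits a linear camera raw image."""
    if mode not in ("image", "text"):
        raise ValueError(f"unknown mode: {mode}")
    # mode "image": context is a VAE-encoded sRGB latent
    # mode "text":  context is a FLUX.1 text-conditioned latent
    xyz_latent = dit(context)         # rectified-flow denoising to XYZ latent
    return xyz_decoder(xyz_latent)    # decode to linear camera raw

# Toy stand-ins for the trained modules:
dit = lambda c: c * 0.5
xyz_decoder = lambda z: np.clip(z, 0.0, 1.0)
raw = generate_raw(dit, xyz_decoder, np.ones((2, 2, 3)), mode="image")
```

Because both paths converge on the same XYZ latent space, a single trained model serves text-to-raw and sRGB-to-raw without separate heads.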
We evaluate on the MIT-Adobe FiveK dataset, where each scene is retouched by five professional photographers (Expert A–E) with distinct aesthetic preferences. Conventional methods fail when the input sRGB comes from an unknown rendering style, but RawGen's many-to-one training maps all five diverse renditions to a single canonical CIE XYZ, reliably recovering the same linear image regardless of photo-finishing.
| Method | Expert A PSNR | SSIM | Expert B PSNR | SSIM | Expert C PSNR | SSIM | Expert D PSNR | SSIM | Expert E PSNR | SSIM |
|---|---|---|---|---|---|---|---|---|---|---|
| CIE XYZ Net | 19.60 | .786 | 21.04 | .843 | 19.49 | .780 | 19.44 | .806 | 18.64 | .792 |
| InvISP | 16.04 | .685 | 16.04 | .685 | 14.92 | .632 | 14.24 | .680 | 13.30 | .647 |
| Raw-Diffusion | 19.30 | .783 | 20.66 | .840 | 18.79 | .771 | 18.69 | .797 | 17.49 | .775 |
| RawGen (Ours) | 23.20 | .843 | 24.35 | .858 | 23.37 | .839 | 23.51 | .853 | 23.89 | .850 |
Table 1: sRGB-to-XYZ reconstruction on MIT-Adobe FiveK across five expert styles (PSNR↑ / SSIM↑).
Collecting device-specific raw data at scale is costly and labor-intensive. RawGen's Text-to-Raw pipeline generates diverse, camera-specific raw images from text prompts alone, without physical capture. With just 3K synthetic samples, downstream tasks like illuminant estimation, neural ISP learning, and raw denoising approach or exceed real-data performance.
| Method | Illuminant Mean | Illuminant Med | Illuminant W25% | Neural ISP PSNR | SSIM | Denoise 1600 PSNR | SSIM | Denoise 3200 PSNR | SSIM |
|---|---|---|---|---|---|---|---|---|---|
| EnlightenGAN | 7.01 | 6.82 | 11.07 | 35.58 | .965 | 48.82 | .991 | 47.25 | .988 |
| UPI | 6.26 | 5.89 | 10.33 | 36.43 | .966 | 49.05 | .990 | 47.51 | .988 |
| Graphics2RAW | 4.21 | 3.38 | 8.57 | 38.10 | .974 | 49.37 | .991 | 48.16 | .989 |
| RawGen | 3.14 | 2.11 | 7.37 | 38.42 | .970 | 50.63 | .994 | 48.57 | .992 |
| Real | 3.02 | 2.17 | 6.77 | 38.32 | .974 | 49.80 | .993 | 48.25 | .990 |
RawGen decouples content synthesis from rendering by producing a scene-referred linear image. Every downstream edit becomes a standard ISP operation, with no re-inference or retraining.
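Because the output is scene-referred and linear, the downstream edits named above are plain array operations, with no model in the loop. A minimal sketch:

```python
import numpy as np

def apply_isp_edits(raw, wb=(1.0, 1.0, 1.0), ev=0.0, gamma=2.2):
    """Standard scene-referred ISP edits on a linear raw image: white
    balance, exposure (in EV stops), and a display tone curve. No
    re-inference or retraining is needed."""
    img = raw * np.asarray(wb)                       # white-balance gains
    img = img * (2.0 ** ev)                          # exposure in EV stops
    return np.clip(img, 0.0, 1.0) ** (1.0 / gamma)   # tone mapping

raw = np.full((2, 2, 3), 0.25)                  # toy linear raw patch
warm = apply_isp_edits(raw, wb=(1.2, 1.0, 0.8), ev=0.5)
```

Each new rendition costs one pass of cheap per-pixel math, whereas re-generating through a diffusion model would cost a full sampling run per edit.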
- **Inverse ISP methods** — Tuned to specific cameras and fixed imaging assumptions; heterogeneous sRGB inputs outside the training distribution often cause failures. (⚠ device-dependent)
- **Diffusion-based editing** — Re-runs the full diffusion process for every parameter change, supports only parameters explicitly learned during training, and requires retraining for new edit types. (✕ fixed parameter set)
- **RawGen** — Generates a linear raw image once, then applies any ISP operation (white balance, exposure, tone mapping) with any software pipeline. (✓ all edit types supported)

If you find RawGen useful, please cite our work.
@article{kim2026rawgen,
title = {RawGen: Learning Camera Raw Image Generation},
author = {Dongyoung Kim and Junyong Lee and Abhijith Punnappurath and Mahmoud Afifi and Sangmin Han and Alex Levinshtein and Michael S. Brown},
journal = {arXiv preprint},
year = {2026}
}