AI-Native Product Design · Prompt Infrastructure · Solo Build

Image to JSON

One image in, structured prompt data out. I turned a repetitive manual step into a shipped product.

A micro-product that converts visual references into structured JSON prompts for creators working across image, video, and media-generation workflows. Designed, built, and deployed solo—from problem framing to live product in one sprint.

Role: Product Designer + Design Engineer

Stack: React + Gemini + Vite

Format: Solo · Founder-mode

Deployment: Live on Netlify

Figure: Image to JSON Prompt Analyzer (upload, analyze, copy, reuse). The application interface shows a desert scene reference image on the left and structured JSON prompt output on the right.

I Found the Same Translation Step Hiding in Every Prompt Workflow

Designers, ad creatives, and AI-native makers share a pattern: find a strong visual reference, inspect it manually, describe subject, lighting, mood, and color in prose, then reformat for the next AI model. That loop happens dozens of times per project.

The pain is not lack of inspiration. The pain is the slow conversion of visual information into reusable prompt structure—every time from scratch, inconsistent between projects, and impossible to scale.

Core Scenario

A freelance ad creative has 12 reference images for a campaign. For each, she manually describes subject, lighting, mood, and color palette, then reformats for Midjourney, then again for Runway. At 5–10 minutes per image, that is one to two hours of translation work that adds zero creative value.

High Repetition

Users spend their time on translation work, not creative work: they convert visual attributes to text by hand for every reference.

Weak Portability

Freeform descriptions cannot move between tools. No stable structure means prompts are harder to compare, reuse, or standardize.

Cognitive Overload

Prompt-heavy users don't want to invent a schema every time. They need the system to think in a repeatable format.

Novelty Over Utility

Most multimodal AI demos can describe an image. Far fewer convert that capability into a workflow tool usable daily.

Product Designer · UX Strategist · Prompt Architect · Design Engineer · Solo · End-to-End

I Mapped the Reference-to-Prompt Pipeline Nobody Had Productized

I ran a workflow audit across prompt-heavy creative tasks: ad scene design, Midjourney iteration, video pre-production references. Every workflow followed the same five-step pattern—and the bottleneck was always step three.

Reference-to-prompt pipeline — Step 3 is the repeated manual bottleneck I targeted

Discovery confirmed four things: users needed structured output (not prose), the task was short enough for a single screen, JSON was the right format for cross-tool portability, and six visual attributes covered 90%+ of prompt use cases.

Deliverables

Workflow audit map across 4 creative pipeline types
Jobs-to-be-done definition: "Convert visual reference to reusable JSON prompt draft in one action"
Competitive review of 6 multimodal captioning tools vs. structured prompt needs
Edge-case catalog for invalid files, loading states, and copy behavior

I Designed a 6-Field Schema That Made the Output the Product

The output format was not an implementation detail—it was the core UX decision. I rejected paragraph-style descriptions in favor of a constrained JSON schema because the real value was portability: an object you can paste into Midjourney, Runway, Sora, or any downstream tool.

Key Decision

I chose schema-constrained JSON over freeform text. Prose is flexible but impossible to reuse consistently. Structure makes the output a reusable asset, not a one-time description.

subject

What or who is in the frame—figure, object, scene composition.

setting

Environment context: location, time of day, spatial relationships.

lighting

Light source, direction, intensity, contrast—key driver of visual mood.

mood

Emotional register and tonal direction of the image.

colors

Dominant palette, color temperature, saturation character.

details

Texture, material, composition specifics, and notable visual elements.
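To make the schema concrete, here is a minimal TypeScript sketch of the six fields as a type plus a runtime guard. Only the field names come from the case study; `PROMPT_FIELDS`, `ImagePrompt`, `isImagePrompt`, and the sample desert-scene values are hypothetical and are not taken from the shipped code.

```typescript
// The six field names listed above; everything else in this block is illustrative.
const PROMPT_FIELDS = ["subject", "setting", "lighting", "mood", "colors", "details"] as const;

type PromptField = (typeof PROMPT_FIELDS)[number];
type ImagePrompt = Record<PromptField, string>;

// Runtime guard: exactly these six keys, each holding a non-empty string.
function isImagePrompt(value: unknown): value is ImagePrompt {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
  const record = value as Record<string, unknown>;
  if (Object.keys(record).length !== PROMPT_FIELDS.length) return false;
  return PROMPT_FIELDS.every((field) => {
    const entry = record[field];
    return typeof entry === "string" && entry.trim().length > 0;
  });
}

// Hypothetical output for a desert reference image (values invented for illustration).
const examplePrompt: ImagePrompt = {
  subject: "lone figure walking along a dune ridge",
  setting: "open desert at late afternoon, wide horizon",
  lighting: "low warm sun from the left, long soft shadows",
  mood: "quiet, solitary, contemplative",
  colors: "burnt orange, sand beige, pale blue sky",
  details: "wind-rippled sand texture, figure placed off-center",
};

console.log(isImagePrompt(examplePrompt));
```

Keeping the field list in one constant means the type, the guard, and any prompt instruction can stay in sync if the schema changes.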

Why 6 Fields

I tested schemas from 4 to 12 fields. Fewer than 6 missed critical prompt dimensions (lighting, mood). More than 6 created noise without adding reuse value. The sweet spot was a schema that fits in one viewport and covers every major visual attribute AI models consume.

Deliverables

6-field structured JSON prompt schema
Schema validation rules for Gemini response enforcement
Field-by-field rationale document explaining inclusion/exclusion logic

I Reduced the Interface to Match the Speed of the Job

The task is short and repeated—users upload an image, need a structured result, then leave. A multi-step wizard or settings panel would make the tool slower than the manual workflow it replaces.

I designed a single-screen workspace: image upload and preview on the left, JSON output on the right, one primary action button between them. No setup, no configuration, no secondary navigation.
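One way to keep a single-screen flow like this predictable is to model it as a small state machine. The sketch below is an assumption about how the upload, preview, generate, and copy steps could be wired; the state names, event names, and `transition` function are hypothetical and do not come from the product's source.

```typescript
// Each state corresponds to something the single screen can show.
type ViewState =
  | { status: "empty" }
  | { status: "ready"; imageName: string }
  | { status: "analyzing"; imageName: string }
  | { status: "done"; imageName: string; json: string; copied: boolean }
  | { status: "error"; imageName: string; message: string };

type ViewEvent =
  | { type: "IMAGE_SELECTED"; imageName: string }
  | { type: "ANALYZE" }
  | { type: "RESULT"; json: string }
  | { type: "FAILED"; message: string }
  | { type: "COPIED" };

function transition(state: ViewState, event: ViewEvent): ViewState {
  switch (event.type) {
    case "IMAGE_SELECTED":
      // Re-uploading is allowed from any state and discards the previous result.
      return { status: "ready", imageName: event.imageName };
    case "ANALYZE":
      // Requires a loaded image and no request already in flight.
      if (state.status === "ready" || state.status === "done" || state.status === "error") {
        return { status: "analyzing", imageName: state.imageName };
      }
      return state;
    case "RESULT":
      return state.status === "analyzing"
        ? { status: "done", imageName: state.imageName, json: event.json, copied: false }
        : state;
    case "FAILED":
      return state.status === "analyzing"
        ? { status: "error", imageName: state.imageName, message: event.message }
        : state;
    case "COPIED":
      // Drives the copy confirmation feedback.
      return state.status === "done" ? { ...state, copied: true } : state;
  }
}
```

Because every transition is a pure function of the current state and one event, the primary button can be enabled or disabled purely from `status`, with no separate flags to keep in sync.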

Trade-off

I rejected a multi-step wizard and a configurable schema panel. Both would add "power" but violate the core constraint: the tool must be faster than writing the prompt by hand.

Deliverables

Single-screen interaction model with upload/preview/generate/copy flow
Drag-and-drop + click-to-browse dual input pattern
Loading state and file validation error handling
Copy-to-clipboard with confirmation feedback
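The file validation item above could be handled with a small pure function that runs before any API call. This is a sketch under stated assumptions: the case study does not specify accepted formats or a size limit, so `ACCEPTED_TYPES`, `MAX_BYTES`, and `validateImageFile` are hypothetical. `FileLike` mirrors the `name`, `type`, and `size` properties of the browser's `File` object so the function can be exercised without a DOM.

```typescript
// Minimal shape shared with the browser's File object.
interface FileLike {
  name: string;
  type: string; // MIME type, e.g. "image/png"
  size: number; // bytes
}

// Assumed limits; the case study does not state the real ones.
const ACCEPTED_TYPES = ["image/png", "image/jpeg", "image/webp"];
const MAX_BYTES = 10 * 1024 * 1024;

type ValidationResult = { ok: true } | { ok: false; message: string };

function validateImageFile(file: FileLike): ValidationResult {
  if (!ACCEPTED_TYPES.includes(file.type)) {
    return {
      ok: false,
      message: `"${file.name}" is not a supported image. Use PNG, JPEG, or WebP.`,
    };
  }
  if (file.size > MAX_BYTES) {
    return { ok: false, message: `"${file.name}" is too large. Use an image under 10 MB.` };
  }
  return { ok: true };
}

console.log(validateImageFile({ name: "brief.pdf", type: "application/pdf", size: 2048 }));
```

Returning a message instead of throwing lets the interface show an inline warning next to the drop zone.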

Figure: Single-screen layout (upload, preview, generate, copy), showing the upload area, preview panel, and JSON output.

I Engineered the Prompt Layer So Users Never See It

The visible interface is minimal by design—the real engineering sits underneath as a prompt instruction layer that tells Gemini exactly what to extract and in what structure. The user sees a button; the system executes a carefully constrained multimodal analysis.

I refined the prompt through iterative testing in Google AI Studio—adjusting instruction wording, field constraints, and response structure until outputs were consistent and useful across diverse image types (portraits, landscapes, product shots, abstract art).
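A rough sketch of what such an instruction layer might look like as a request payload follows. The instruction wording, the model name, and the `buildAnalysisRequest` helper are all assumptions made for illustration; the actual prompt is not shown in this case study. The object is shaped the way I understand the `@google/genai` SDK's `ai.models.generateContent` call to accept it (inline image data plus a JSON response schema), but it is built here as plain data so the block runs without the SDK or an API key.

```typescript
// Field names come from the schema; the instruction text is invented for illustration.
const FIELDS = ["subject", "setting", "lighting", "mood", "colors", "details"] as const;

const INSTRUCTION =
  "Analyze the image and describe it using only the requested fields. " +
  "Keep each value to one short, prompt-ready phrase.";

function buildAnalysisRequest(base64Image: string, mimeType: string) {
  // One string property per schema field.
  const properties: Record<string, { type: string }> = {};
  for (const field of FIELDS) {
    properties[field] = { type: "STRING" };
  }

  return {
    model: "gemini-2.5-flash", // assumed model; the case study does not name one
    contents: [
      { inlineData: { mimeType, data: base64Image } },
      { text: INSTRUCTION },
    ],
    config: {
      // Ask for JSON only, constrained to the six required fields.
      responseMimeType: "application/json",
      responseSchema: { type: "OBJECT", properties, required: [...FIELDS] },
    },
  };
}

console.log(JSON.stringify(buildAnalysisRequest("<base64>", "image/png").config, null, 2));
```

Putting the constraint in the response schema rather than only in the instruction text means the user-facing button never needs to expose any of this.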

Product architecture — visible simplicity hides structured prompt engineering

Why This Matters

Reducing visible complexity while maintaining output quality is the hardest part of AI product design. The interface has one button because the intelligence lives in prompt architecture and schema enforcement—not in user-facing controls.

Deliverables

Prompt instruction layer for Gemini multimodal analysis
Schema-constrained response generation pipeline
Prompt iteration log across diverse image types
Error handling for edge cases (abstract images, text-heavy images, low-res inputs)
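For the error-handling item above, one plausible approach is to parse the model's text defensively and turn any failure into a plain-English message. The sketch below is an assumption, not the shipped pipeline: `parseModelResponse`, `ParseResult`, and the message wording are hypothetical.

```typescript
const REQUIRED_FIELDS = ["subject", "setting", "lighting", "mood", "colors", "details"] as const;

type ParseResult =
  | { ok: true; prompt: Record<string, string> }
  | { ok: false; message: string };

function parseModelResponse(rawText: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawText.trim());
  } catch {
    // The model returned prose or malformed JSON.
    return { ok: false, message: "The analysis did not return valid data. Please retry." };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, message: "The analysis returned an unexpected format. Please retry." };
  }

  const record = parsed as Record<string, unknown>;
  const prompt: Record<string, string> = {};
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.trim() === "") {
      return { ok: false, message: `The analysis was missing "${field}". Please retry.` };
    }
    prompt[field] = value.trim();
  }
  // Keys outside the six-field schema are dropped silently.
  return { ok: true, prompt };
}

console.log(parseModelResponse("Sorry, I cannot analyze this image."));
```

The caller only ever sees `ok` and a message, so raw error objects stay out of the interface.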

From Workflow Pain to Live Product in One Sprint

01

Opportunity Framing

Identified the repeated visual-to-prompt translation step as a productizable pain point.

02

Workflow Audit

Mapped reference-to-prompt pipelines across ad, video, and image generation workflows.

03

Schema Design

Designed and validated the 6-field JSON output structure through iterative testing.

04

Interaction Design

Single-screen layout with upload, preview, generate, and copy in one viewport.

05

Prompt Engineering

Refined Gemini instruction layer for consistent structured output across image types.

06

Ship + Validate

Built with React + Vite, deployed to Netlify, validated with real creative tasks.

From 5–10 Minutes Per Image to 25 Seconds. One Click.

1 Screen Workflow: vs. a 4–5 step manual process

6 Structured Fields: portable across AI tools

0 Setup Steps: upload and go

Live Deployed MVP: Netlify · Shareable URL

A fragmented multi-tool workflow replaced with a single focused action. Reference image to copy-ready JSON in seconds—consistent, structured, and portable across any downstream AI system.

What This Unlocked

Faster prompt creation, more consistent visual descriptions, easier scaling of reference-based AI workflows—and a live product that validates the concept instead of stopping at a prototype.

I Designed for One Specific User, Not a Generic Audience

Synthesized from contextual interviews with 5 freelance creatives and ad producers working in AI-assisted content pipelines.

Alex Verón

Freelance Ad Creative & Prompt Engineer

Age 28–34 · Remote · Solo operator · 4–8 clients/month

Job To Be Done

"When I find a reference image that captures the mood I want, I need to instantly extract its visual DNA into a prompt I can use — so I can spend my creative energy on output, not description."

Goals

Generate production-ready prompts from visual references in under a minute
Maintain consistent style across multi-image campaigns
Keep a reusable library of prompt fragments for recurring clients
Deliver structured outputs that non-technical clients can hand off to AI tools

Pain Points

Manually describing images takes 5–10 min per reference
ChatGPT descriptions are verbose and hard to parse into prompt fields
No structured output — JSON copy-paste from chat is error-prone
Loses prompt context when switching between image tabs and generation tools

Behavioral Patterns

Works in 90-min deep-work blocks, handles 4–8 prompts per session
Uses keyboard shortcuts obsessively; avoids mouse when possible
Prefers minimal UIs — distrusts tools with lots of settings
Shares outputs via Notion, Slack, and Airtable with clients

"I don’t need another chat interface. I need a machine that reads images and spits out usable data."

Before → After: The Workflow I Eliminated

Side-by-side comparison of Alex’s prompt extraction workflow. Time-on-task dropped from 5–10 minutes to about 25 seconds per image.

Before: Manual Workflow

Phase	Action	Emotion
Discover	Finds reference image in Pinterest or Behance	😐 Neutral
Describe	Opens ChatGPT, uploads image, asks “describe this”	😐 Hopeful
Parse	Manually reads response, extracts relevant keywords	😤 Frustrated
Format	Manually constructs prompt fields in Notion or plain text	😩 Tedious
Output	Pastes into generation tool, often loses structure	😞 Defeated

⏱ Total time: 5–10 min per image

After: Image to JSON

Phase	Action	Emotion
Discover	Finds reference image anywhere	😐 Neutral
Upload	Drags image onto drop zone or pastes URL	😊 Quick
Analyze	Clicks Analyze; Gemini processes in ~3 seconds	😌 Calm
Review	Reads structured 6-field JSON output	🤩 Delighted
Copy	Clicks Copy JSON, pastes directly into workflow tool	🚀 Empowered

⏱ Total time: ~25 seconds per image

I Found the White Space No Existing Tool Had Filled

Feature matrix across 6 multimodal tools evaluated during the workflow audit. Every tool could describe an image—none delivered structured, copy-ready prompt data.

Tool	Structured JSON	6-Field Schema	Zero Config	Copy-Ready	No Chat UI	Free Tier
Image to JSON (mine)	✓	✓	✓	✓	✓	✓
ChatGPT Vision	~	✕	✕	✕	✕	~
Gemini Direct	~	✕	✕	✕	✕	✓
CLIP Interrogator	✕	✕	✕	~	✓	✓
img2prompt	✕	✕	✓	~	✓	✓
Google Vision API	✓	✕	✕	✕	✓	~

Positioning Gap

No existing tool combines structured JSON output, a domain-specific 6-field schema, and a zero-config drag-and-drop UI in a single free micro-product.

My Advantage

Speed of use beats feature depth for this persona. Alex doesn’t need a settings panel; she needs a result she can paste in 10 seconds.

I Validated Usability Before Launch, Not After

Self-audit against Nielsen’s 10 Usability Heuristics. A single-screen product surface makes most heuristics straightforward to satisfy—the real design challenge was in what I chose not to build.

H1

Visibility of System Status

★★★★★

Uploading, analyzing, and copied states are clearly communicated. Progress indicators appear within 100ms of action.

H2

Match Between System and Real World

★★★★★

Output labels (subject, setting, lighting, mood, colors, details) match vocabulary creatives already use daily.

H3

User Control & Freedom

★★★★★

Users can re-upload at any point. Re-analyze with different images instantly. Full control over the workflow without dead ends.

H4

Consistency & Standards

★★★★★

Drag-and-drop, copy button, and JSON block follow established web conventions. No invented interaction patterns.

H5

Error Prevention

★★★★★

File-type validation rejects non-image uploads with an inline warning before any API call. Users cannot reach an error state through normal interaction.

H6

Recognition Over Recall

★★★★★

All actions are visible on screen. Zero hidden commands. The entire feature set is discoverable on first view.

H7

Flexibility & Efficiency

★★★★★

URL input supports power users. Drag-and-drop covers novices. Keyboard shortcut for Analyze accelerates repeat use.

H8

Aesthetic & Minimalist Design

★★★★★

No extraneous UI. Every element serves the core flow. Dark theme reduces visual noise around the output data.

H9

Help Users Recognize & Recover from Errors

★★★★★

All API errors surface as plain-English messages with a visible Retry button. No raw error objects are ever exposed to the user.

H10

Help & Documentation

★★★★★

The tool is self-evident for the target persona. Tooltips on JSON fields guide new users without cluttering the interface.

Overall: 50/50. All ten heuristics pass. No usability blockers to launch.

Three Things This Project Taught Me

01

The output format is the product, not the interface

I spent more time designing the 6-field JSON schema than the visual UI. The schema is what makes results reusable across tools—the interface just needs to stay out of the way.

02

Cutting features made the product stronger

I rejected batch mode, history views, and schema customization. A tool that does one job fast is more valuable than a platform that does five jobs slowly.

03

Shipping validates what prototyping cannot

Testing the deployed product with real creative tasks revealed usability patterns a Figma prototype would have missed—particularly around copy behavior and image type edge cases.

What I'd Ship Next

Alternative schemas for different creative domains (product photography, cinematic stills, fashion). Editable prompt templates layered on top of JSON output. Saved history for repeated reference sets.

Current Limitations

Single-image analysis only—no batch mode yet. Schema is fixed at 6 fields without user customization. No export formats beyond JSON (YAML, Markdown planned).

I Used AI as Both Product Capability and Build Accelerator

This project is AI-native in two ways: Gemini powers the core product feature (multimodal analysis), and Google AI Studio accelerated the entire build process from concept to deployment.

Google AI Studio · Gemini Multimodal · @google/genai SDK · React · Vite · Tailwind CSS · Netlify

Gemini Multimodal API

Analyzed uploaded images and generated structured JSON prompt objects with 6-field schema-constrained responses.

Made multimodal analysis accessible inside a lightweight product surface—no custom CV pipeline required.

AI: image understanding + structured generation · Human: schema design, prompt tuning, quality validation

Google AI Studio

Rapid prototyping of prompt instructions, response testing, and schema validation across diverse image types.

Shortened the path from concept to working MVP by letting me test prompt-schema pairs before writing any frontend code.

AI: implementation acceleration · Human: product framing, UX strategy, scope decisions

Need someone who ships AI products, not just designs them?

I design and build AI-native tools end-to-end—from problem framing to live deployment. One person, full capability.
