
How AI Photo Animation Works: The Technology Behind Moving Photos

A deep dive into the AI technology that powers photo animation — from GANs to diffusion models and Seedance 2.0.

AI technology, photo animation, Seedance, deep learning

From Parlor Trick to Breakthrough Technology

The idea of making still photos move is older than you might think. Harry Potter's moving portraits captured the imagination long before AI made it real. But the actual technology has undergone a remarkable transformation in just six years — from crude face warping to photorealistic video generation.

Understanding how this technology works does not require a PhD in machine learning. The core concepts are intuitive, and knowing them helps you appreciate why some tools produce dramatically better results than others.

The Evolution of AI Photo Animation

First Order Motion Model (2019)

The modern era of AI photo animation began in 2019 with the First Order Motion Model, published by researchers at the University of Trento. This approach worked by detecting keypoints on a source face, then transferring motion from a driving video to the source image.

The results were impressive for the time but had clear limitations. The model struggled with large head movements, often produced warping artifacts around the edges of the face, and required a separate driving video to define the motion pattern.
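The keypoint-transfer idea can be sketched in a few lines. This is a toy illustration, not the actual First Order Motion Model: keypoint detection and the final image warp are stubbed out (a real system learns both from data), and all names are illustrative.

```python
# Toy sketch of keypoint-based motion transfer, in the spirit of the
# First Order Motion Model. Detection and warping are omitted; only the
# motion-retargeting step is shown.

def transfer_motion(source_kp, driving_kps):
    """For each driving frame, move the source keypoints by the driving
    keypoints' displacement relative to the first driving frame."""
    first = driving_kps[0]
    frames = []
    for frame_kp in driving_kps:
        moved = [
            (sx + (dx - fx), sy + (dy - fy))
            for (sx, sy), (dx, dy), (fx, fy) in zip(source_kp, frame_kp, first)
        ]
        frames.append(moved)
    return frames

# Example: one keypoint on the source face, driven by a 3-frame downward nod.
source = [(50.0, 40.0)]
driving = [[(30.0, 30.0)], [(30.0, 33.0)], [(30.0, 36.0)]]
print(transfer_motion(source, driving))
# The source keypoint follows the motion: y = 40.0, 43.0, 46.0
```

Because the motion comes entirely from the driving video, any motion is possible, but the source face is only ever warped, which is exactly why large head movements produced artifacts.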

Generative Adversarial Networks (2020-2022)

The next major advance came from GANs — Generative Adversarial Networks. This is the technology behind MyHeritage's Deep Nostalgia and several similar tools from that era.

A GAN consists of two neural networks in competition. The generator creates synthetic images, while the discriminator tries to distinguish them from real ones. Through this adversarial process, the generator learns to produce increasingly realistic outputs. (For how Deep Nostalgia stacks up today, see our full comparison of Deep Nostalgia alternatives.)

For photo animation, GAN-based systems were trained on video datasets to learn how faces move. When given a still photo, the generator would produce a sequence of frames showing plausible facial motion.
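The adversarial objective can be made concrete with the two standard loss functions. This is a minimal sketch of the loss computation only, not a full training loop; `d_real` and `d_fake` stand in for the discriminator's outputs on real and generated frames.

```python
import math

def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(real) -> 1 and D(fake) -> 0.
    return -math.log(d_real) - math.log(1.0 - d_fake)

def generator_loss(d_fake):
    # The generator wants the discriminator fooled: D(fake) -> 1.
    return -math.log(d_fake)

# When the discriminator is maximally uncertain (both outputs 0.5),
# its loss is 2*ln(2) ~= 1.386, the classic GAN equilibrium value.
print(round(discriminator_loss(0.5, 0.5), 3))  # 1.386
```

The instability mentioned below comes from this tug-of-war: if either network gets too far ahead, the gradients for the other collapse, which is the root of artifacts like flickering and mode collapse.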

| Aspect | First Order Motion | GAN-Based |
| --- | --- | --- |
| Year of prominence | 2019-2020 | 2020-2022 |
| Motion source | External driving video | Learned motion patterns |
| Quality | Moderate, visible artifacts | Good, some uncanny valley |
| Flexibility | Any motion possible | Limited to trained patterns |
| Speed | Fast | Moderate |

GAN-based animation was a clear step forward but carried its own set of problems. The motion patterns were often templated — every face performed roughly the same sequence of movements. The adversarial training process could be unstable, leading to occasional artifacts like flickering, distorted teeth, or unnatural eye movement. And the resolution was typically limited.

Diffusion Models (2023-Present)

The most significant leap came with diffusion models, which have largely displaced GANs as the state of the art for image and video generation.

Diffusion models work on a fundamentally different principle. Instead of learning through adversarial competition, they learn to reverse a gradual noise-addition process. During training, the model observes how clean video frames are progressively corrupted with random noise. It then learns to reverse this process — starting from pure noise and progressively refining it into a clean, realistic video frame.
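The forward and reverse processes above reduce to two simple formulas. The sketch below shows them for a single scalar "pixel": the forward step blends signal with noise according to a schedule value `alpha_bar` (1 keeps the signal, 0 yields pure noise), and the reverse step recovers the clean value exactly when the model's noise prediction is correct. A real model predicts the noise with a neural network; here we pass in the true noise to show the reversal.

```python
import math

def forward_noise(x0, eps, alpha_bar):
    # Forward process: blend the clean value with a noise sample.
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps

def denoise(xt, eps_pred, alpha_bar):
    # Reverse step: if the noise prediction is right,
    # the clean value is recovered exactly.
    return (xt - math.sqrt(1 - alpha_bar) * eps_pred) / math.sqrt(alpha_bar)

x0, eps, a = 0.8, -1.2, 0.5          # clean value, noise sample, schedule
xt = forward_noise(x0, eps, a)       # corrupted value
print(round(denoise(xt, eps, a), 6))  # 0.8 -- reversal recovers the signal
```

Training amounts to teaching a network to predict `eps` from `xt`; generation then runs the reverse step repeatedly, starting from pure noise.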

This approach produces several advantages over GANs: training is far more stable (no risk of mode collapse), output resolution is higher, temporal consistency across frames is stronger, and the generated motion is unique to each image rather than drawn from a fixed set of templates.

How Seedance 2.0 Works

Seedance 2.0, the video diffusion model that powers Incarn, represents the current frontier of this technology. Here is what happens under the hood when you upload a photo.

Image Understanding

The model first analyzes the source photograph using a vision encoder. This step extracts detailed information about the subject: facial structure, expression, head pose, lighting direction, image composition, and the relationship between foreground and background elements.

This is not simple face detection. The model builds a rich internal representation of the entire image, understanding spatial relationships and physical plausibility.

Motion Planning

Based on its understanding of the image, the model plans a motion sequence that would be natural for the specific subject and pose. A person with a slight smile might break into a fuller smile. A subject looking slightly off-camera might turn toward the viewer.

This is where Seedance 2.0 differs most dramatically from older tools. There is no library of preset motions. The model generates a unique motion plan for each image based on what it has learned about how real people move in similar poses and expressions.

Frame Generation via Diffusion

The model then generates video frames through the iterative diffusion process. Starting from structured noise conditioned on the source image, it refines each frame over multiple steps — typically 20 to 50 denoising steps — until a clean, detailed video frame emerges.
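The iterative refinement loop can be caricatured as follows. This is not a real sampler: the "model" is stubbed out as the conditioning target, and each step closes part of the gap, standing in for one denoising pass. It only illustrates why many small steps produce a clean result.

```python
def refine(noise, condition, steps=30):
    """Caricature of conditioned iterative denoising: start from noise and
    nudge the estimate toward what the (stubbed) model believes the frame
    should look like. A real sampler predicts noise at each step."""
    x = noise
    for _ in range(steps):
        model_estimate = condition          # stub: a real network goes here
        x = x + 0.2 * (model_estimate - x)  # partial step, like one denoise
    return x

# After 30 steps (mid-range of the 20-50 cited), the estimate has
# converged to the conditioned value.
frame = refine(noise=3.0, condition=0.5, steps=30)
print(round(frame, 4))
```

The key property is that each pass only refines, never commits: errors made early can still be corrected later, which is part of why diffusion outputs are so clean.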

Each frame is generated with awareness of all other frames in the sequence, ensuring temporal consistency. This prevents the flickering and frame-to-frame inconsistency that plagued earlier approaches.

Post-Processing

Final post-processing steps handle color consistency, edge refinement, and format encoding. The result is a short, high-definition video clip — typically 3 to 5 seconds — ready for viewing and download.

Technical Comparison Across Generations

| Capability | First Order (2019) | GAN-Based (2021) | Diffusion (2025+) |
| --- | --- | --- | --- |
| Output resolution | 256x256 typical | 512x512 typical | Up to 1080p |
| Temporal consistency | Low, flickering common | Moderate | High |
| Motion diversity | Depends on driving video | Limited templates | Unique per image |
| Fine detail (hair, fabric) | Poor | Fair | Excellent |
| Handling of occlusions | Poor | Moderate | Good |
| Training stability | Moderate | Low (mode collapse risk) | High |
| Inference speed | Fast (<5s) | Moderate (10-30s) | Moderate (30-60s) |

What Makes a Good Animation

Understanding the technology explains why certain photos animate better than others — and what to look for when evaluating animation quality.

Facial Landmark Clarity

The model's ability to generate natural motion depends heavily on accurately understanding the source face. Photos where facial landmarks (eyes, nose, mouth, jawline) are clearly visible give the model the best foundation to work with.

Pose Plausibility

The animation must be physically plausible for the subject's pose. A person photographed mid-turn has different plausible next movements than someone facing the camera directly. Advanced models like Seedance 2.0 account for this; simpler models apply the same motion regardless.

Temporal Coherence

The hallmark of a good animation is temporal coherence — the sense that each frame flows naturally from the previous one. Poor temporal coherence manifests as jitter, flickering, or unnatural jumps in movement. Diffusion models achieve better coherence because they generate all frames with global awareness of the complete sequence.

The Uncanny Valley

The uncanny valley — the discomfort humans feel when something looks almost but not quite human — remains the central challenge. GAN-based animations often fall into this valley with unnatural eye movements or rigid facial expressions. Diffusion models have pushed the boundary significantly, producing animations that feel natural to most viewers, though they are not yet indistinguishable from real video.

The Computational Challenge

Generating animated video from a single photo is computationally expensive. Each frame requires dozens of denoising steps, and a 3-second video at 24 frames per second means generating 72 individual frames with full temporal awareness.

This is why tools like Incarn run on cloud GPU infrastructure rather than in your browser. The processing for a single animation involves billions of mathematical operations — workloads that require dedicated AI accelerator hardware.

The trade-off is speed versus quality. The iterative refinement process that makes diffusion models so good also makes them slower than real-time. A typical animation takes 30 to 60 seconds to generate — fast enough for a great user experience, but not instant.

What Comes Next

The field is advancing rapidly. Several trends point to where AI photo animation is heading.

Higher resolution and longer duration. Current models produce excellent results at standard HD resolution for a few seconds. Next-generation models will push toward 4K output and longer, more complex motion sequences.

Better physics understanding. Future models will better simulate the physical world — how hair falls, how fabric drapes, how light interacts with moving surfaces. This will further reduce artifacts and push animations closer to photorealistic video.

Real-time generation. As hardware improves and model architectures become more efficient, processing times will shrink. Real-time photo animation on consumer devices is likely within a few years.

Interactive control. Users will gain more control over the type and direction of motion. Rather than accepting whatever the model generates, you might specify "look left and smile" or "nod slowly."

Try It Yourself

The best way to understand the technology is to see it in action. Incarn lets you animate a photo for free without creating an account — upload any portrait and see the result in under a minute. If you want a step-by-step walkthrough, check out our complete guide to animating old photos.

The gap between a still photo and a moving portrait is not just technical. It is emotional. And that is what makes this technology worth understanding.

Frequently Asked Questions

Is AI photo animation the same as deepfake technology?

They share underlying AI architectures, but the intent and application are different. AI photo animation generates natural motion for a person within their own photograph. Deepfakes typically involve mapping one person's appearance onto another person's movements, often without consent. Responsible photo animation tools like Incarn are designed for personal and family use, animating your own photos rather than impersonating others.

Why do some photos animate better than others?

The quality of animation depends primarily on three factors: face visibility (clear, unobstructed facial features), image resolution (higher resolution provides more detail for the model), and lighting (even lighting helps the model accurately interpret facial structure). Photos with all three factors in their favor will produce the most natural-looking animations.
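Those three factors can be framed as a simple pre-upload checklist. The thresholds below are purely illustrative assumptions, not Incarn's actual acceptance criteria, and `animation_readiness` is a hypothetical helper.

```python
# Hypothetical pre-upload checklist reflecting the three factors above.
# The 512px threshold is an illustrative assumption, not a real criterion.

def animation_readiness(width, height, face_visible, evenly_lit):
    issues = []
    if min(width, height) < 512:   # low resolution starves the model of detail
        issues.append("resolution below 512px on the short side")
    if not face_visible:
        issues.append("facial features obstructed")
    if not evenly_lit:
        issues.append("harsh or uneven lighting")
    return issues or ["looks good"]

print(animation_readiness(1024, 768, face_visible=True, evenly_lit=True))
```

A photo that clears all three checks gives the model the richest possible foundation for natural motion.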

How does Seedance 2.0 compare to other video diffusion models?

Seedance 2.0 is among the leading video diffusion models specifically optimized for image-to-video generation, which is the core task in photo animation. While other models like Stable Video Diffusion and Runway Gen-3 also use diffusion architectures, Seedance 2.0 has been fine-tuned for portrait animation quality — producing more natural facial movement and better temporal consistency for this specific use case. We go into more detail in our Seedance 2.0 vs Kling comparison.

Will AI-animated photos keep improving over time?

Yes. Each new generation of models produces noticeably better results. Photos animated today will likely look outdated compared to animations generated two years from now. This is one reason to preserve your original high-quality scans — you can re-animate them with future tools for even better results.

Ready to try it yourself?

Animate your first photo for free, no signup required.

Try Incarn for free →

Continue reading

Introducing Incarn: The Easiest Way to Animate Family Photos with AI

Seedance 2.0 vs Kling: Which AI Video Model Produces Better Animations?

5 Unique Family Reunion Gift Ideas Using AI Photo Animation