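You wrote what you thought was a perfect prompt, hit generate, waited tens of seconds, and got back something that looks, well, very "AI". The image is blurry, the character drifts, the camera angle seems picked at random, and the motion is as stiff as a slide deck flipping pages.
That frustration is all too familiar.
The problem is usually not Kling 3.0 itself. The model is already quite capable: 15 seconds of native video, multi-shot with up to 6 shots, and built-in audio with lip-synced dialogue. The problem is that we are still writing video prompts the way we write image prompts.
In one sentence: Kling 3.0 needs you to think like a director, not like a photographer.
You are not describing a still frame; you are writing a shooting script for a piece of film. That shift in mindset underpins every technique that follows.
Start with an actual video generated with Kling 3.0 to get a feel for what a "director-minded" prompt produces: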
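Kling 3.0 Prompt Structure in Detail
When writing AI video prompts, many people habitually pile on everything they want at once: "a pretty woman running by the sea, lovely sunlight, a good-looking shot". That might have scraped by in the image-generation era, but on Kling 3.0 it is almost bound to fail.
Why? Because video is an art of time. It has motion, rhythm, and camera language. The model needs you to spell these out; otherwise it can only guess.
Based on extensive hands-on testing and a close reading of the official documentation, a reliable Kling 3.0 prompt usually covers five layers: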
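Scene and environment. Not "a beach" but "a tropical Southeast Asian beach at dusk, wet sand mirroring the orange-red sky, silhouettes of coconut palms in the distance". Concrete spatial detail tells the model what to render.
Character and appearance. Not "a woman" but "an Asian woman in a white linen dress, long black hair lifted by the sea breeze, barefoot". Every visual anchor helps the model keep the character consistent.
Action timeline. This is what separates a video prompt from an image prompt. Describe the actions in time order: "She walks slowly along the shoreline, looking down at the waves around her feet, then lifts her head to gaze into the distance, her hair spreading in the wind." The actions unfold in sequence; they are not piled up side by side.
Camera movement. Kling 3.0 really does understand film terminology. "Slow tracking shot from behind", "Medium close-up with shallow depth of field", "Dolly push-in": these are not decoration; they directly shape the look of the footage.
Audio and atmosphere. If audio generation is enabled, you can add: "gentle wave sounds, distant seagull calls, soft ambient wind". The model generates matching sound effects.
Here is the gap between a weak prompt and a strong one: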
| Layer | Weak | Strong |
|---|---|---|
| Scene | A beach | Tropical shoreline at golden hour, wet sand reflecting amber sky |
| Character | A woman | Young woman in white linen dress, bare feet, black hair caught by wind |
| Action | She walks | She walks slowly along the waterline, pauses to look down at the foam around her ankles, then lifts her gaze toward the horizon |
| Camera | Camera follows her | Slow tracking shot from behind, gradually pushing in to a medium close-up as she turns |
| Audio | Nice music | Gentle wave crashes, distant seagull, soft wind through hair |
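The gap is not in word count; it is in information density.
Here is an actual result. The video below shows how Kling 3.0 handles character consistency: the same character keeps stable facial features and clothing details across different scenes: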
Clean Kling 3.0 Prompt Framework
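So you know the five layers. How do you organize them when you actually sit down to write? After repeated testing I settled on a framework that I call SCALE:
Shot (camera type + movement) → Character (character + appearance) → Action (action timeline) → Lighting & Location (light + setting) → Extra (audio / style / technical parameters)
Why does the shot come first? Because camera language is close to the signal Kling 3.0 is most sensitive to. A prompt that opens with "Handheld shoulder-cam with subtle drift" and one that opens with "Static tripod, locked-off wide shot" produce completely different styles. The shot sets the "temperament" of the whole video.
Here is a concrete example, taken apart: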
Slow dolly push-in through a narrow neon-lit ramen shop at night.
A young Japanese couple sits side by side at the counter, steam
rising from their bowls. The woman picks up her chopsticks, twirls
the noodles slowly, and takes a bite while the man watches with a
slight smile. Flickering magenta neon sign reflects off the foggy
window behind them. Shot on 35mm film with warm grain and shallow
focus, glowing bokeh from street lights outside. Soft clinking of
bowls, quiet slurping sounds, distant city hum.
Layer by layer:
- Shot: Slow dolly push-in
- Character: A young Japanese couple, each with a specific action assigned
- Action: picks up → twirls → takes a bite, while the man watches; the actions have an order and a rhythm
- Lighting & Location: narrow neon-lit ramen shop, flickering magenta neon, foggy window
- Extra: 35mm film, warm grain, shallow focus, specific sound effects
Notice that the whole prompt reads like one continuous shot description, not a keyword list. That narrative flow is the key to getting coherent video out of Kling 3.0.
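If you write many prompts, it can help to keep the five SCALE parts as separate fields and join them only at the end. The sketch below is one way to do that in Python. The `ScalePrompt` class and its field names are made up for illustration; it only builds a string and does not call any Kling API.

```python
from dataclasses import dataclass


@dataclass
class ScalePrompt:
    """Hypothetical container for the five SCALE parts (not a Kling API type)."""
    shot: str                # camera type and movement
    character: str           # who is on screen and what they look like
    action: str              # what happens, in time order
    lighting_location: str   # light sources and setting
    extra: str = ""          # audio, film stock, style notes (optional)

    def render(self) -> str:
        """Join the non-empty parts in S-C-A-L-E order as one paragraph."""
        parts = [self.shot, self.character, self.action,
                 self.lighting_location, self.extra]
        sentences = []
        for part in parts:
            text = part.strip()
            if not text:
                continue
            # End each part as a sentence so the result reads as prose,
            # not as a comma-separated keyword list.
            if text[-1] not in ".!?":
                text += "."
            sentences.append(text)
        return " ".join(sentences)


prompt = ScalePrompt(
    shot="Slow dolly push-in through a narrow neon-lit ramen shop at night",
    character="A young Japanese couple sits side by side at the counter",
    action="The woman twirls her noodles slowly and takes a bite while the man watches",
    lighting_location="Flickering magenta neon reflects off the foggy window behind them",
    extra="Shot on 35mm film with warm grain. Soft clinking of bowls, distant city hum",
)
print(prompt.render())
```

Keeping the shot as the first field mirrors the ordering argument above: whatever you put in `shot` is what the rendered prompt opens with.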
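Here is the ramen shop scene generated with a similar structure; compare the narrative-flow prompt with keyword stacking:
On camera language, these are terms Kling 3.0 interprets particularly reliably: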
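- Dolly push-in / pull-out: the camera physically moves toward or away from the subject
- Tracking shot: the camera follows the subject
- Handheld / shoulder-cam: handheld or shoulder-mounted, with natural shake
- Static tripod: fixed camera on a tripod
- Whip-pan: a very fast horizontal pan
- Crash zoom: a sudden, rapid zoom
- Rack focus: focus shifts between planes (for example foreground to background)
- Low-angle / High-angle: camera below or above the subject
- FPV drone: first-person drone view
- Dutch angle: tilted framing
None of this is fancy vocabulary for its own sake. Each term maps to a specific camera behavior, and Kling 3.0 can genuinely tell them apart and execute them.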
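A quick way to catch a prompt with no camera direction at all is to scan it for the terms in the list above. The following is a rough sketch, not an official checker: `CAMERA_TERMS` and `find_camera_terms` are hypothetical names, and plain substring matching will miss inflected forms such as "rack-focusing".

```python
# Terms taken from the list above, lowercased for case-insensitive matching.
CAMERA_TERMS = [
    "dolly", "tracking shot", "handheld", "shoulder-cam", "static tripod",
    "whip-pan", "crash zoom", "rack focus", "low-angle", "high-angle",
    "fpv drone", "dutch angle",
]


def find_camera_terms(prompt: str) -> list[str]:
    """Return the known camera terms that occur in the prompt, in list order."""
    lowered = prompt.lower()
    return [term for term in CAMERA_TERMS if term in lowered]


print(find_camera_terms("Slow dolly push-in through a ramen shop"))
print(find_camera_terms("A woman walks on a beach"))
```

An empty result does not prove the prompt lacks camera direction (it may use a term outside this list), but it is a cheap reminder to check.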
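Color and mood control is part of camera language too. The video below shows what naming specific light sources and color tones achieves; compared with a vague description like "dramatic lighting", the difference is obvious at a glance: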
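Common Failures and How to Fix Them
After a few hundred runs, I found that nearly every failed prompt falls into one of the following categories.
Problem 1: "Catch-all description" syndrome
You write "cinematic, beautiful, high quality, 4K, masterpiece" and nothing improves. These words carry almost no information for Kling 3.0. The model needs concrete instructions, not a parade of adjectives.
Fix: replace each adjective with a concrete visual reference. "cinematic" → "shot on 35mm film with anamorphic lens flare". "beautiful lighting" → "golden hour backlight catching dust particles through warehouse windows".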
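The filler words named above are easy to flag mechanically before you submit a prompt. This is a minimal sketch under the assumption that the five words quoted above are the ones you want to catch; `FILLER_WORDS` and `find_filler` are illustrative names, and a hit is a cue to reconsider the word, not a hard rule.

```python
import re

# The low-information words quoted in Problem 1 above.
FILLER_WORDS = ["cinematic", "beautiful", "high quality", "4k", "masterpiece"]


def find_filler(prompt: str) -> list[str]:
    """Return the filler words present in the prompt, matched as whole words."""
    found = []
    for word in FILLER_WORDS:
        # \b keeps "4k" from matching inside longer tokens such as "64kb".
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            found.append(word)
    return found


print(find_filler("A beautiful girl running on the beach, cinematic, 4K"))
print(find_filler("shot on 35mm film with anamorphic lens flare"))
```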
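Problem 2: No camera instruction
This may be the most common problem. You describe the character and scene in detail but never tell the model what the camera is doing. The model then picks an angle on its own, and it is likely not the one you wanted.
Fix: include at least one explicit camera description in every prompt. Even the simplest "medium shot, static camera" is far better than nothing.
Problem 3: Stacked actions instead of an action sequence
Wrong: "She is running, laughing, waving, and her hair is flowing." Four parallel actions, and the model does not know which comes first.
Right: "She breaks into a run along the pier, a laugh escaping her lips, then turns back and waves with one hand while her hair streams behind her." The actions have an order and a cause-and-effect link.
Problem 4: Pronouns in dialogue scenes
In a two-person dialogue, if you write "he says... then she replies... then he...", the model easily loses track of who is who.
The right approach is to give each character a label and keep it consistent: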
[Detective Morris, gruff baritone]: "Where were you last night?"
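[Sarah, nervous whisper]: "I was home. You can check the cameras."
Carry the character name and a voice description on every line of dialogue, and the model's lip sync and character placement become much more accurate.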
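Consistency is easier when the label is generated rather than retyped. The sketch below defines each speaker once and formats every line from that definition, so the name and voice description cannot drift between lines. `Speaker` and `dialogue_line` are hypothetical helpers; the bracket format simply copies the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """A character label defined once and reused for every line."""
    name: str
    voice: str  # short voice description, e.g. "gruff baritone"


def dialogue_line(speaker: Speaker, text: str) -> str:
    """Format one line as [Name, voice]: "text", matching the example above."""
    return f'[{speaker.name}, {speaker.voice}]: "{text}"'


morris = Speaker("Detective Morris", "gruff baritone")
sarah = Speaker("Sarah", "nervous whisper")

script = "\n".join([
    dialogue_line(morris, "Where were you last night?"),
    dialogue_line(sarah, "I was home. You can check the cameras."),
])
print(script)
```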
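Problem 5: Re-describing the image content in Image-to-Video
When generating video from an image, many people describe in the prompt what is already in the image. That is wasted effort. Kling 3.0 can already "see" the image you uploaded, and it automatically preserves the identity, layout, and text details in it.
Your prompt should focus on change: starting from this image, what happens next? How the character moves, how the camera moves, how the scene evolves.
When the action description is specific enough, Kling 3.0 can render quite realistic physical motion: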
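Writing Multi-shot Prompts
Multi-shot is one of Kling 3.0's strongest narrative capabilities: up to 6 consecutive shots in a single generation, with a total length of up to 15 seconds. Using it well takes some know-how.
The basic format is simple: label each shot "Shot 1", "Shot 2", and so on to separate them. Each shot should be a self-contained shot description that covers the five layers described earlier.
Pacing is the key. One important rule of thumb: 4 to 6 shots across 10 to 15 seconds gives the best rhythm, with shot count and duration scaled together (about 4 shots for 10 seconds, 6 for 15) so that each shot gets roughly 2.5 seconds. Cram 6 shots into 10 seconds and each shot gets under 2 seconds, so the whole video feels fast-forwarded. Conversely, 2 shots stretched over 15 seconds feels draggy.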
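The pacing rule is simple arithmetic, so it can be checked before generating. In this sketch the 6-shot and 15-second limits and the 2-second lower bound come from the text above; the 5-second upper bound is my own assumption for where a shot starts to drag, and `pacing_note` is a hypothetical helper, so adjust both to taste.

```python
MAX_SHOTS = 6        # from the text: up to 6 shots per generation
MAX_SECONDS = 15.0   # from the text: up to 15 seconds in total


def pacing_note(shot_count: int, total_seconds: float,
                min_per_shot: float = 2.0,
                max_per_shot: float = 5.0) -> str:
    """Classify average shot length as "rushed", "slow", or "ok".

    min_per_shot follows the "under 2 seconds feels fast-forwarded" remark;
    max_per_shot is an assumed threshold, not a documented limit.
    """
    if not 1 <= shot_count <= MAX_SHOTS:
        raise ValueError(f"shot_count must be between 1 and {MAX_SHOTS}")
    if not 0 < total_seconds <= MAX_SECONDS:
        raise ValueError(f"total_seconds must be in (0, {MAX_SECONDS}]")
    per_shot = total_seconds / shot_count
    if per_shot < min_per_shot:
        return "rushed"
    if per_shot > max_per_shot:
        return "slow"
    return "ok"


print(pacing_note(6, 10))   # about 1.7 s per shot
print(pacing_note(2, 15))   # 7.5 s per shot
print(pacing_note(5, 15))   # 3.0 s per shot
```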
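Think about the transitions between shots as well. Kling 3.0 handles transitions between shots automatically, but you can steer the transition style through changes in shot type. Cutting from a wide establishing shot to a close-up, for example, is a big jump that the model handles naturally. But if two consecutive shots are both medium shots from similar angles, the transition can feel abrupt.
Here is a complete multi-shot prompt example: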
Shot 1: Wide establishing shot of a rain-soaked Tokyo alley at
night. Neon signs in Japanese flicker above, their colors bleeding
into puddles on the asphalt. A lone figure in a dark trench coat
stands at the far end, face hidden. Camera slowly dollies forward.
Shot 2: Medium close-up from behind as the figure begins walking.
Rain drips from the coat's collar. Footsteps echo on wet ground.
The camera tracks at shoulder height, slightly handheld.
Shot 3: The figure stops at a ramen stall, pulls back the noren
curtain. Warm golden light spills out, contrasting the cold blue
of the alley. Cut to a frontal medium shot as they sit down.
Shot 4: Close-up on hands wrapping around a steaming bowl. Steam
rises into frame. The figure lifts the bowl slightly, inhaling.
Camera pushes in slowly with shallow depth of field.
Shot 5: Wide shot from inside the stall looking out. The figure
eats quietly while rain intensifies outside. Neon reflections
dance across the wet street. Soft ambient rain, distant traffic,
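quiet slurping.
Five shots, moving from far to near and back to far, with a clear scene progression, mood shift, and camera logic. Kling 3.0 produces striking results from prompts like this.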
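If you draft shots as separate strings, a small helper can number them and enforce the shot limit. This sketch assumes the "Shot N:" labeling shown in the example above and the 6-shot ceiling stated earlier; `build_multishot` is an illustrative name, not part of any Kling tooling.

```python
def build_multishot(shots: list[str], max_shots: int = 6) -> str:
    """Number each shot description as "Shot N: ..." and join them by line."""
    cleaned = [shot.strip() for shot in shots if shot.strip()]
    if not cleaned:
        raise ValueError("at least one shot description is required")
    if len(cleaned) > max_shots:
        raise ValueError(f"too many shots: {len(cleaned)} > {max_shots}")
    return "\n".join(
        f"Shot {index}: {text}" for index, text in enumerate(cleaned, start=1)
    )


print(build_multishot([
    "Wide establishing shot of a rain-soaked Tokyo alley at night.",
    "Medium close-up from behind as the figure begins walking.",
    "Close-up on hands wrapping around a steaming bowl.",
]))
```

Because numbering is derived from list position, reordering or deleting a shot never leaves a gap in the labels.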
The two videos below show Kling 3.0's multi-shot scene generation and its frame control, respectively:
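Writing Motion Control Prompts
Kling Motion Control is an easily misused feature. Many people misunderstand how it works and end up writing prompts that are entirely wrong for it.
The logic of Motion Control: you supply a reference video as the source of motion and an image as the source of character and appearance, and the model "maps" the motion from the reference video onto your character.
Hence the key counterintuitive rule: your prompt does not need to describe any motion. The motion is already defined by the reference video. Your prompt should cover only two things.
What the character looks like:
- Clothing style (formal, casual, streetwear, business)
- Age and demeanor
- Facial details and skin texture
- Overall visual tone (realistic, stylized)
What the environment looks like:
- Type of location (studio, office, street, classroom)
- Lighting setup (soft light, shallow depth of field, cinematic)
- Atmosphere (professional, warm, tense)
A correct Kling Motion Control prompt looks like this: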
Confident marketing spokesperson in tailored navy suit with
crisp white shirt, clean-shaven, professional grooming. Modern
corporate studio with soft diffused lighting and shallow depth
of field, subtle grey gradient background. Cinematic realism,
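commercial broadcast quality.
Note that not a single word describes motion, because the motion comes entirely from the reference video you upload: expressions, gestures, mouth shapes, and body language are all determined by the reference video.
Never write it like this: "The spokesperson raises their hand and gestures while speaking to the camera, then nods and points forward." Writing it this way interferes with the model, because it no longer knows whether to follow the reference video or your prompt.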
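A crude guard against this mistake is to scan a Motion Control prompt for common action verbs before submitting it. The word list below is a small hand-picked assumption seeded from the bad example above, not an exhaustive or official list, and `find_action_words` is a hypothetical helper.

```python
import re

# Hand-picked action words; extend this list for your own prompts.
ACTION_WORDS = ["raises", "gestures", "nods", "points", "walks", "runs",
                "turns", "waves", "speaking", "speaks"]


def find_action_words(prompt: str) -> list[str]:
    """Return the listed action words that appear as whole words in the prompt."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return [word for word in ACTION_WORDS if word in words]


bad = ("The spokesperson raises their hand and gestures while speaking "
       "to the camera, then nods and points forward.")
good = ("Confident marketing spokesperson in tailored navy suit, "
        "modern corporate studio with soft diffused lighting")

print(find_action_words(bad))
print(find_action_words(good))
```

Matching whole words matters here: "spokesperson" must not trigger a hit just because it resembles "speaks".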
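Typical uses of Motion Control include brand promo videos, corporate training content, AI virtual presenters, and multilingual localization (the same motion mapped onto characters with different looks).
10 Copy-and-Paste Prompt Templates
Every template below has been tested; copy it as-is or adjust it to your needs.
Template 1: Cinematic narrative (Multi-shot)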
Shot 1: Wide aerial drone shot descending toward a foggy mountain
village at dawn. Smoke rises from chimneys, a river cuts through
the valley. Atmospheric, muted earth tones.
Shot 2: Medium tracking shot following an old man walking along
a cobblestone path with a wooden cane. He wears a weathered brown
jacket and flat cap. Morning dew on surrounding grass.
Shot 3: Close-up of his weathered hands pushing open a heavy
wooden door. Paint peeling, iron hinges creaking. Camera follows
him inside.
Shot 4: Interior medium shot. Warm amber light from a fireplace
illuminates a cluttered workshop. He sits at a bench, picks up a
half-finished wooden carving, and examines it closely.
Shot 5: Macro close-up of his hands carving with a small knife.
Wood shavings curl and fall. Firelight flickers across the
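detailed carving of a bird. Soft crackling fire, faint wind outside.
Template 2: Dialogue scene (Lip-sync)
The keys to a dialogue scene are audio sync and character labels. The video below shows Kling 3.0's audio-video sync capability: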
Medium shot, warmly lit coffee shop interior with exposed brick
walls. Two women sit across from each other at a small wooden table.
[Anna, cheerful mid-range voice]: "I finally quit my job yesterday."
She leans back with a relieved smile, wrapping both hands around
her mug.
[Maya, surprised, slightly high-pitched]: "Wait, seriously? What
happened?"
Maya sets down her cup and leans forward. Camera slowly pushes in
to a tighter two-shot. Ambient cafe noise, soft jazz in background,
ceramic clinking.
Template 3: High-speed action / chase
Dynamic FPV drone shot racing through a narrow urban alley at
night. A parkour runner in black athletic gear sprints ahead,
vaulting over dumpsters and sliding under scaffolding. Sparks
fly as his hand drags along a metal railing. Camera performs
a 360-degree barrel roll as he leaps between rooftops. Neon
signs blur past, rain streaks across the lens. Motion blur,
high contrast, adrenaline pacing. Heavy breathing, shoes
impacting concrete, distant sirens.
Template 4: Macro / ASMR food
Extreme macro close-up, shallow depth of field. A hand slowly
drizzles warm honey over a stack of golden pancakes. The honey
catches morning window light, forming a thick glossy ribbon that
pools between layers. Steam rises gently. Camera holds static on
tripod, then slowly pushes in as butter begins to melt and slide
down the side. Crispy sizzle, viscous honey dripping, soft plate
clink. Shot on macro 100mm lens, warm color grade.
Template 5: Product showcase
Slow orbit camera movement circling a matte black wireless
headphone floating against a clean dark gradient background.
Subtle rim lighting traces the product edge in cool white.
The headphones rotate 180 degrees revealing cushion detail and
brushed metal accents. Camera pulls back slightly as soft
volumetric light rays sweep across frame. Minimal, premium
aesthetic. Clean electronic ambient tone, soft bass pulse.
Template 6: Street documentary style
The handheld VHS look is something Kling 3.0 is very good at. Look at this result: the grain, chromatic aberration, and shake are all there:
Handheld shoulder-cam following a street musician playing
saxophone on a rainy Paris sidewalk at dusk. He wears a
worn leather jacket, fedora tilted low. Camera sways naturally
with operator movement, occasionally rack-focusing between
the musician and blurred pedestrians with umbrellas. Wet
cobblestones reflect warm streetlamp glow and blue twilight
sky. Shot on 35mm film with visible grain. Raw saxophone
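melody, rain pattering on pavement, distant cafe chatter.
Template 7: Sci-fi / cyberpunk
This genre has a lot in common with the mirror-maze experiment video below: both are surreal scenes, and the key is using specific light and texture to make the image believable: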
Slow tracking shot through a massive cyberpunk marketplace.
Holographic vendor signs flicker in Mandarin and English.
A woman with chrome cybernetic arm browses a stall selling
glowing circuit boards. Dense fog at ground level, magenta
and cyan neon reflecting off her metallic implants. She
picks up a component, inspects it, nods to the vendor.
Camera cranes up revealing the towering megastructure above.
Synth ambient drone, electronic market chatter, servo whirr
from her arm.
Template 8: Natural landscape
Timelapse-style wide shot of rolling Tuscan hills at golden
hour. Cypress trees cast long shadows across wheat fields.
Clouds drift slowly, light shifts from warm gold to deep
amber. A narrow dirt road winds through the landscape.
Camera mounted on slow dolly track, imperceptible lateral
movement adds parallax depth. Transition to normal speed
as a flock of birds lifts from a distant tree line. Wind
through grass, distant church bell, birdsong.
Template 9: Text / logo animation
Clean white background. Bold black 3D text "KLING 3.0"
assembles letter by letter from floating particles. Each
letter materializes with a subtle metallic sheen and casts
a soft shadow. Once complete, the camera performs a slow
push-in while the text rotates 15 degrees. A thin gold
underline draws itself beneath the text. Text remains sharp
and readable throughout the entire animation. Minimal,
corporate. Soft whoosh per letter, subtle impact on completion.
Template 10: Motion Control character replacement
Professional female news anchor in crisp navy blazer and
pearl earrings, natural makeup, warm skin tone. Modern
broadcast studio with large LED screen showing world map
behind her. Three-point studio lighting: key light at 45
degrees camera-left, fill light camera-right, backlight
rim on hair. Clean depth of field, sharp on subject,
slightly soft background. Broadcast television quality,
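16:9 framing, color-calibrated.
(Use with a reference video; all speaking motion, gestures, and expressions in the video are mapped onto this character automatically.)
While writing this Kling 3.0 prompt tutorial, I looked back at my earliest prompts and honestly laughed. "A beautiful girl running on the beach, cinematic, 4K": back then I wondered why the results were so bad.
Writing prompts for video generation really is a different craft from writing them for images. What matters is not how many words you write but whether you think like a director: what the camera is doing, what the character is doing, how those actions relate to each other, how the environment supports them, and where the light comes from.
If you remember only one thing, make it this: when you write a prompt, close your eyes and imagine yourself sitting behind the monitor, calling "action" to the camera operator and the actors. The information you would need to give them is exactly what Kling 3.0 needs you to write.
Go try it. Take a prompt that failed before, rewrite it with the SCALE framework, and compare the results. Chances are you will come back to read this guide a second time.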

