Commandline video editing
What is video editing?
How MLT works
An example
Numbers in MLT
Blanks
Producers
Advantages of MLT over GUI-based video editors
Storyboarding
Camera matching

Commandline video editing

This is a tutorial for MLT, a free and open-source commandline video editor. I do not assume any previous familiarity with video editing. However, since many of you may already have some idea about video editing, it is important to point out at the outset that MLT is completely different from other video editors.

What is video editing?

Any video consists of mainly three things:

some movie shots
some still pictures
some audio

Not every video needs all of these, and some may have more (like subtitles). Creating a complete video by combining these ingredients (possibly with minor modifications) is called video editing.

How MLT works

First you must procure all the ingredients (record the movie segments, the audio segments, create the still frames). Then you have to tell MLT how you want to combine them. Finally you let the software do the combination, and produce the required video.

This is basically how any video editing software works. The main difference between MLT and a typical video editor is in how you specify the combination plan. Most video editors use a GUI and rely on drag-and-drop. They try to mimic the classical film editing techniques, where real celluloid films were cut with real scissors. MLT takes a more modern approach: it requires the specification as an XML file.

An example

<?xml version="1.0" ?>
<mlt>
	<profile description="DV/DVD NTSC"
		display_aspect_den="600" display_aspect_num="1024"
		frame_rate_den="1" frame_rate_num="30"
		height="600" width="1024"/>
	<multitrack>
		<playlist id="Background Track">
			<producer in="0" length="5000" novdpau="1">
				<property name="mlt_service">color</property>
			</producer>
		</playlist>
		<playlist id="Track 2">
			<blank length="5"/>
			<producer in="1" out="100">
				<property name="resource">output.mp4</property>
			</producer>
			<producer in="600" out="600">
				<property name="resource">output.mp4</property>
			</producer>
		</playlist>
	</multitrack>
</mlt>

This file may appear daunting at first, but it actually contains some obvious pieces of information in a structured way. Everything lies inside the <mlt>...</mlt> tags. There are just two things inside:

A <profile/> tag (containing information about the overall video, like its size, frame rate etc).
A <multitrack>...</multitrack> tag that contains the body of the video.

To understand the structure of the multitrack, think of the video in layers. An example is a newscast where you see the newscaster talking about an incident. As she is speaking, a video of the incident appears in an inset. Here there are two layers. The bottom layer holds the newscaster video, the top one holds the inset video. Similarly, you may have a layer for headlines or subtitles etc. Each layer can hold different contents at different time points. For example, the inset layer will show videos of different incidents at different time points, and will be blank the rest of the time.

In MLT language, a layer is called a playlist. It can hold two types of things:

a blank section
a non-blank section, called a producer in MLT.

To get into deeper details, we need to use numbers. So let us talk a little bit about how MLT uses numbers.

Numbers in MLT

A computer can use integers and floating point numbers. Integers are stored exactly, while floating point numbers are stored approximately. This approximation may produce too much error to be tolerated in a video. So MLT does not allow floating point numbers at all. Everything is expressed using only integers. Non-integer numbers are expressed as fractions. Take, for example, the aspect ratio of the display, i.e., width/height. This is expressed as the fraction display_aspect_num / display_aspect_den. Similarly, the frame rate (i.e., number of frames per second) is expressed as frame_rate_num / frame_rate_den.

Time is always expressed in integer number of frame units. Dividing it by the frame rate will give you the time in seconds.
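As a rough illustration of this integer-only bookkeeping, the sketch below uses Python's fractions.Fraction to hold the aspect ratio and frame rate from the example profile, and to convert a frame count to seconds without any rounding. The helper name frames_to_seconds is made up for this tutorial and is not part of MLT.

```python
from fractions import Fraction

# Values copied from the <profile/> tag of the example above.
DISPLAY_ASPECT = Fraction(1024, 600)  # display_aspect_num / display_aspect_den
FRAME_RATE = Fraction(30, 1)          # frame_rate_num / frame_rate_den


def frames_to_seconds(frames: int, rate: Fraction) -> Fraction:
    """Convert a time in frame units to seconds, exactly (hypothetical helper)."""
    return Fraction(frames) / rate


if __name__ == "__main__":
    # Fraction reduces 1024/600 to lowest terms automatically.
    print(DISPLAY_ASPECT)
    # 600 frames at 30 frames per second.
    print(frames_to_seconds(600, FRAME_RATE))
    # A rate such as 30000/1001 cannot be stored exactly as a float,
    # but as a fraction the conversion stays exact.
    print(frames_to_seconds(30000, Fraction(30000, 1001)))
```

Keeping numerator and denominator separate means that a long video never accumulates drift from repeated floating point rounding.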

Blanks

The only specification for a blank is its duration, or length, a number expressed in frames.
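To see how a blank pushes the rest of a playlist later in time, here is a small sketch that walks through "Track 2" of the example and works out where each entry starts. It is only an illustration: the function name layout is invented here, and it assumes that both the in and out frames of a producer are played, so that a producer lasts out - in + 1 frames.

```python
import xml.etree.ElementTree as ET

# "Track 2" from the example file, copied as a string.
TRACK = """
<playlist id="Track 2">
  <blank length="5"/>
  <producer in="1" out="100">
    <property name="resource">output.mp4</property>
  </producer>
  <producer in="600" out="600">
    <property name="resource">output.mp4</property>
  </producer>
</playlist>
"""


def layout(playlist_xml: str):
    """Return (tag, start_frame, length) for each entry of a playlist.

    Assumption: in and out are both inclusive, so a producer's
    length is out - in + 1 frames.
    """
    entries = []
    position = 0  # running position on the playlist, in frames
    for item in ET.fromstring(playlist_xml):
        if item.tag == "blank":
            length = int(item.get("length"))
        else:
            length = int(item.get("out")) - int(item.get("in")) + 1
        entries.append((item.tag, position, length))
        position += length
    return entries


if __name__ == "__main__":
    for tag, start, length in layout(TRACK):
        print(tag, "starts at frame", start, "and lasts", length, "frames")
```

Under that assumption the 5-frame blank simply delays the first clip by 5 frames, and the second producer contributes a single frame.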

Producers

A non-blank segment is called a producer in MLT, because something needs to produce the frames during that segment. Typically a producer is a portion of a media file (movie, image or audio). I said "portion" because you may need only a part of the contents of that file. There are three important specifications for a producer:

The name of the media file
The starting time
The finishing time

There is a subtlety in specifying the two time points. As usual, these are expressed in frame units, which require the frame rate to convert to seconds. However, it is quite possible that the media file uses a different frame rate. So the time points in frame units are first converted to seconds using the frame rate of the current video. Then the producer is queried with that time in seconds. It is up to the producer to interpret the time using its own frame rate. This mechanism allows us to use media files of different frame rates inside the same MLT file. Thus when we write

<producer in="600" out="1500">
	<property name="resource">output.mp4</property>
</producer>

and we have frame rate 30, we mean "play output.mp4 from 20 sec to 50 sec". It is important to understand that the 20 sec is from the start of output.mp4, and not of the current video!
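The two-step conversion described above can be sketched in a few lines. This follows the description in this tutorial rather than MLT's actual source code; the function name source_frame, the 25 frames per second source rate, and the choice to round down are all assumptions made for the illustration.

```python
from fractions import Fraction


def source_frame(project_frame: int, project_rate: Fraction,
                 source_rate: Fraction) -> int:
    """Map a frame number of the current video to a frame of the media file.

    Step 1: frame units -> seconds, using the frame rate of the current video.
    Step 2: seconds -> frame of the media file, using the file's own rate.
    Rounding down in step 2 is an assumption of this sketch.
    """
    seconds = Fraction(project_frame) / project_rate
    return int(seconds * source_rate)


if __name__ == "__main__":
    project_rate = Fraction(30, 1)  # from the <profile/> of the example
    source_rate = Fraction(25, 1)   # hypothetical rate of output.mp4
    # in="600" and out="1500" are 20 sec and 50 sec of the current video.
    print(source_frame(600, project_rate, source_rate))
    print(source_frame(1500, project_rate, source_rate))
```

With these made-up rates, the producer would be asked for the part of output.mp4 between its own frames 500 and 1250, which is still 20 sec to 50 sec of that file.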

Advantages of MLT over GUI-based video editors

MLT requires less computing resources to run. A GUI-based video editor will need lots of RAM just to produce thumbnails and previews. These often require entire files in RAM. On the other hand, MLT can manage by storing only parts of a file at a time. I am not sure whether MLT actually does that, but I can edit videos where one media file is 1.3 GB in size using just a netbook running on an Atom processor. Typical GUI-based video editors grind to a halt just trying to load it.
Text editors are more sophisticated than video editors. You can use regular expressions, autocompletion etc. Imagine editing a video in a crowded bus! I can manage that with MLT. Just unthinkable with a GUI-based video editor!
If you are producing many short videos of similar structure, then text file manipulation can save a lot of time.
The MLT format is closer to our way of thinking. For example, if we say that we have one second of a video clip followed by one minute of another, we do not speak the word "minute" slowly so that it takes exactly sixty times as long as the word "second". But in a GUI-based video editor that's precisely how it is done, forcing us to zoom the timeline in and out again and again.
MLT makes it easy to add comments to parts of the video. Also, one can comment out a section for experimenting.
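As an illustration of the point about many short videos of similar structure, the sketch below stamps out one MLT file per clip from a single template. Everything specific here is made up for the example: the template is a cut-down file with one track, and the names make_project, lesson1.mp4 and lesson2.mp4 are hypothetical.

```python
import xml.etree.ElementTree as ET

# A cut-down template with a single track; PLACEHOLDER marks the clip name.
TEMPLATE = """<mlt>
  <multitrack>
    <playlist id="Track 1">
      <producer in="0" out="299">
        <property name="resource">PLACEHOLDER</property>
      </producer>
    </playlist>
  </multitrack>
</mlt>"""


def make_project(clip: str) -> str:
    """Return the template as XML text with every resource set to clip."""
    root = ET.fromstring(TEMPLATE)
    for prop in root.iter("property"):
        if prop.get("name") == "resource":
            prop.text = clip
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical clip names; each would get its own MLT file.
    for clip in ["lesson1.mp4", "lesson2.mp4"]:
        print(make_project(clip))
```

The same effect can be had with a search-and-replace in a text editor; the point is only that the editing plan is ordinary text that any tool can rewrite.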

Storyboarding

There are three different types of storyboards:

For shooting the videos (showing camera angles, parts of face/body covered etc)
For special effects (showing where to cut)
For compositing (showing placements, motions of PIPs etc)

We shall discuss the last one here. For compositing, our visual ingredients may be considered as rectangles that you glue on the screen. Each frame of the storyboard is a picture with an arrow. The arrow may have zero or more line segments. Actually, "an arrow with zero line segments" means "no arrow". We associate $k+1$ time points with a frame whose arrow has $k$ line segments. If the time points are labelled as $t_0,...,t_k,$ then $t_0$ corresponds to when we enter this frame. For $i \geq 1$, $t_i$ shows when the motion shown by the $i$-th segment starts. In most cases we have $k=0$ or $1.$

Having at most one arrow per frame means we are allowing at most one thing to move at a time. If we allow multiple motions, then we can introduce as many arrows as needed. By the way, motion means compositing motion (i.e., when the rectangles move, resize or change transparency). A video running in a fixed rectangle with a fixed transparency value is not considered moving.

An example will help. Consider the storyboard here. The first screen is blank. No arrow, so no motion. The 0 written below the frame means this screen appears at time 0. The second frame shows a heading LSQ. It appears at time 20. Then at time 30 the panel starts sliding down. At 100 the motion stops and we enter frame 3. A second downward slide starts at time 950, and so on.
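The rule "k line segments, k+1 time points" is easy to check mechanically. The sketch below stores the three frames just described as small records and rejects a frame whose time points do not match its arrow. The class name StoryboardFrame and its fields are invented for this illustration; they do not come from any storyboarding software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoryboardFrame:
    label: str
    segments: int     # k, the number of line segments in the arrow
    times: List[int]  # t_0, ..., t_k in frame units

    def __post_init__(self):
        # A frame whose arrow has k segments carries exactly k+1 time points.
        if len(self.times) != self.segments + 1:
            raise ValueError(f"{self.label}: expected {self.segments + 1} time points")
        # Time points must not go backwards.
        if self.times != sorted(self.times):
            raise ValueError(f"{self.label}: time points out of order")

    def entered_at(self) -> int:
        """t_0: when this frame is entered."""
        return self.times[0]

    def motion_starts(self) -> List[int]:
        """t_1, ..., t_k: when the motion of each segment starts."""
        return self.times[1:]


# The three frames from the example in the text.
FRAMES = [
    StoryboardFrame("blank screen", 0, [0]),
    StoryboardFrame("heading LSQ", 1, [20, 30]),
    StoryboardFrame("frame 3", 1, [100, 950]),
]

if __name__ == "__main__":
    for f in FRAMES:
        print(f.label, "entered at", f.entered_at(), "motions at", f.motion_starts())
```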

The Storyboarder software is great for creating storyboards. The Xsheet app is also good.

The following format is useful (I have written an ANTLR4 compiler for it). Each block is followed by a callout number that the explanation below refers to; the numbers are not part of the file.

//This is a sample file
//to showcase various features of 
//the new language.
1

v vid: DSCN2305.MOV 600
b vid2: DSCN2303.MOV 600
p red: red.svg
2

-----------
3

start=0:	vid[in]
4

redenters=start+50: red[in,x=20,y=10,w=30,h=30]
5

redwillmove=redenters+100
redwillmove-50: red[w=32,h=32]
redwillmove:	red[>]
6

aha= redwillmove+150:	red[y=60]
        vid[out]
        vid2[in]
7

silly= aha+250:    red[x=10]
8

last = silly+300
last:    red[out]
        vid2[out]
9

--------
10

end: 299
11

Explanation of the code:

1:: A comment. Starts with // and continues till the end of the line.
2:: The assets are declared here. There are five types of assets: audio (a), video without audio (v), video with audio (b), picture (p) and panels (d). For types a, v and b, we may have an additional start frame [defaults to 0].
3:: Separator, consists of one or more hyphens. This marks the end of asset declaration and start of video specifications.
4:: We define an identifier start to be the number 0. Also at this frame we want asset vid to appear on screen.
5:
6:
7:
8:
9:
10:
11:
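To get a feel for how the identifiers in the sample chain together, the sketch below resolves the definitions to plain frame numbers. It is not the ANTLR4 compiler mentioned above and makes no claim about the real grammar: it assumes only the two shapes of definition visible in the sample (name = number, and name = earlier name + number), with the colon and the actions stripped away by hand. The function name resolve is invented here.

```python
import re

# The identifier definitions from the sample, with the actions removed.
DEFINITIONS = """
start=0
redenters=start+50
redwillmove=redenters+100
aha= redwillmove+150
silly= aha+250
last = silly+300
"""

# name = [earlier_name +] number   (assumed shape, see the text above)
PATTERN = re.compile(r"^\s*(\w+)\s*=\s*(?:(\w+)\s*\+\s*)?(\d+)\s*$")


def resolve(definitions: str) -> dict:
    """Turn chained identifier definitions into absolute frame numbers."""
    values = {}
    for line in definitions.splitlines():
        if not line.strip():
            continue
        match = PATTERN.match(line)
        if match is None:
            raise ValueError(f"unsupported definition: {line!r}")
        name, base, offset = match.groups()
        # A definition may only refer to an identifier defined earlier.
        values[name] = (values[base] if base else 0) + int(offset)
    return values


if __name__ == "__main__":
    for name, frame in resolve(DEFINITIONS).items():
        print(name, "=", frame)
```

Defining each time point relative to the previous one means that lengthening one step shifts everything after it, without retyping any later number.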

The grammar is in this file. The actions are here.

Camera matching

Click here.

