Saturday, Feb 10, 2018 - 23:08 SGT
Posted By: Gilbert

Fake It Until

Dragonboated for the first time in what, sixteen years, for my unit's Cohesion Day - there's a reassuring simplicity to just pulling, in rhythm, over and over again.

Returning to my usual life, I'll continue with the first of two overdue tech discussions:


It was coming. Hey, netizens have been pulling humorous face-swapping photos for eons, first with Photoshop and then with apps (also: the ubiquitous Snapchat dogface filter), and there was Face2Face a couple of years back [paper], which admittedly focused purely on mouth gestures. How hard, then, could transplanting the eyes and nose be?

As it turns out, not very. Some months ago, FakeApp was released, allowing seamless face transfer on videos. Expectedly, this has energized dabblers to put it to productive use, such as subbing Nicolas Cage in as every actor in a movie, but probably mostly to, ahem, more risqué ends.

What manner of sorcery is this?!

And yes, the technology. A brief comparison with the now-dated Face2Face mouth expression transfer tech: then, a similarity-based energy metric was used to retrieve the closest-seeming mouth appearance from a frame in the target (to be doctored) video, relative to the source gesture. And, actually, the DeepFake pipeline remains broadly the same:

For each frame in the video,
  1. Locate the face (apparently with Histogram of Oriented Gradients [HOG] landmarks)
  2. Transfer a desired face over the original face
Point 1 is, of course, not that impressive - it has been in every run-of-the-mill smartphone camera for ages. It is the convincing face transfer that has had observers agog, and perhaps not surprisingly, the not-so-secret behind this wizardry is (deep) neural nets - more specifically, deep convolutional autoencoders.
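As a rough sketch of that two-step loop (not the actual DeepFake code), the per-frame flow might look like the following. The names `swap_faces`, `locate_face` and `transfer_face` are hypothetical, frames are toy nested lists rather than real images, and the detector and generator are trivial stand-ins.

```python
from typing import Callable, List, Optional, Tuple

Frame = List[List[int]]          # toy grayscale frame: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width)


def swap_faces(
    frames: List[Frame],
    locate_face: Callable[[Frame], Optional[Box]],
    transfer_face: Callable[[Frame], Frame],
) -> List[Frame]:
    """Apply the two-step loop to every frame; both callables are hypothetical."""
    output = []
    for frame in frames:
        result = [row[:] for row in frame]  # copy so the input stays untouched
        box = locate_face(frame)            # step 1: locate the face
        if box is not None:
            top, left, height, width = box
            crop = [row[left:left + width] for row in frame[top:top + height]]
            new_face = transfer_face(crop)  # step 2: generate the replacement
            for dy in range(height):
                result[top + dy][left:left + width] = new_face[dy]
        output.append(result)
    return output


# Stand-ins for a real detector and a real face generator (assumptions).
def fixed_box(frame: Frame) -> Optional[Box]:
    return (1, 1, 2, 2)


def invert(crop: Frame) -> Frame:
    return [[255 - p for p in row] for row in crop]


frames = [[[0] * 4 for _ in range(4)]]
swapped = swap_faces(frames, fixed_box, invert)
```

Frames with no detected face pass through unchanged, which is one simple way to handle misses; a real tool may well do something different.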

So, what's an autoencoder? It can be thought of as a function consisting of two parts: an encoder that converts an input into another (usually compressed) representation, followed by a decoder that converts the representation back into (a perhaps slightly different version of) the input. If it helps, compression can be thought of as a form of autoencoding: when you zip a file, the original binary data is converted (encoded) into a (hopefully smaller) representation, the zipped file; and when that zipped file is uncompressed (decoded), the original file is obtained.
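To make the zip analogy concrete, here is a minimal sketch using Python's standard zlib module. It only illustrates the encode/decode round trip: zlib is a hand-designed, lossless codec, whereas a neural autoencoder learns its mapping from data and usually returns only an approximation of the input. The function names `encode` and `decode` are just labels chosen for this example.

```python
import zlib


def encode(data: bytes) -> bytes:
    """The 'encoder': turn the input into a (hopefully smaller) representation."""
    return zlib.compress(data)


def decode(code: bytes) -> bytes:
    """The 'decoder': turn the representation back into the input."""
    return zlib.decompress(code)


original = b"pull, in rhythm, " * 100
code = encode(original)
restored = decode(code)

print(len(original), len(code), restored == original)
```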

Furthering this intuition, image compression with popular formats such as JPEG is, however, not quite autoencoding - it is more a matter of discarding relatively unimportant data, such that the compressed representation remains directly interpretable as a (lower-quality version of the) image. Certainly, neural network based autoencoders (henceforth, just autoencoders) can be trained on image data, although as the Keras tutorial notes, this is seldom worth it for purely compression purposes.

The beauty of autoencoders lies instead in their flexibility - just throw them any (and enough) data, and they'll generally learn a decent representation for you. This property is cleverly exploited in the DeepFake setup, which utilizes a single encoder, and two decoders:

The heart of the system
(Source: DeepFakes Explained video, at 6:30)

Before we continue, a quick technical note: DeepFake is built on Google's TensorFlow library (which would have saved me rolling my own GPU code some years back), and the more user-friendly GUI FakeApp is a 1.8GB torrent download. The actual underlying scripts are however much more lightweight, and can be gotten from an unofficial GitHub repo. The autoencoder architecture can then be examined at plugins\Model_Original.py, with a low-memory variant Model_LowMem.py apparently the same except for the dimensionality of the dense (representation) layer of the encoder being halved, from 1024 to 512. Some user-added generative adversarial net scripts are also included.

The slightly surprising part here is that there are no constraints applied on the dense layer, unlike for example in variational autoencoders; as such, it seems that faces from different people with the same expressions do naturally map to similar representations in the dense layer. In other words, if a photo of Person A with mouth open has a vector representation va in the trained encoder, Person B with the same expression would produce a vector vb ≈ va. Then again, since the basic structure of (aligned) faces is all but identical on major landmarks like the eyes, nose and mouth, perhaps this is not that unexpected.

What remains is conceptually straightforward - with hundreds (preferably thousands or more) of images of both subjects (victims?), we train the shared encoder to produce the same dense representations, for the same expressions of each subject. Then, to morph Person B's face onto Person A's face in the original video, we detect Person A's face in each frame, and run it through the encoder before decoding it with the decoder for Person B. Recall that Decoder B is specialized to generate only Person B's faces - therefore, we'll get an image of Person B with the same expression as that of Person A, which we can then merge straight back into the video.
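The wiring of one shared encoder and two decoders can be sketched with a deliberately tiny toy, assuming nothing about the real scripts. Here a "face" is just three made-up pixel values driven by a single expression value, the layers are plain linear maps instead of deep convolutional ones, and every name and number is invented for illustration. The toy shows only how training and swapping are connected; it does not guarantee that the swapped output carries the right expression.

```python
# Toy "faces": 3 pixel values driven by one expression value e in [0, 1].
def face_a(e):
    return [0.2 + 0.6 * e, 0.8, 0.3 + 0.4 * e]


def face_b(e):
    return [0.3 + 0.5 * e, 0.6, 0.2 + 0.5 * e]


EXPRESSIONS = [0.0, 0.25, 0.5, 0.75, 1.0]
DATA = {"A": [face_a(e) for e in EXPRESSIONS],
        "B": [face_b(e) for e in EXPRESSIONS]}

# Shared linear encoder (3 pixels -> 1 code) and one linear decoder per person.
enc = {"w": [0.1, 0.1, 0.1], "b": 0.0}
dec = {k: {"u": [0.1, 0.1, 0.1], "c": [0.0, 0.0, 0.0]} for k in DATA}


def encode(x):
    return sum(w * xi for w, xi in zip(enc["w"], x)) + enc["b"]


def decode(z, who):
    d = dec[who]
    return [u * z + c for u, c in zip(d["u"], d["c"])]


def total_loss():
    """Sum over both people of the mean squared reconstruction error."""
    loss = 0.0
    for who, faces in DATA.items():
        for x in faces:
            x_hat = decode(encode(x), who)
            loss += sum((h - xi) ** 2 for h, xi in zip(x_hat, x)) / len(faces)
    return loss


def train_step(lr=0.05):
    """One full-batch gradient step: each person's faces update the shared
    encoder and only that person's own decoder."""
    g_w, g_b = [0.0] * 3, 0.0
    g_dec = {k: {"u": [0.0] * 3, "c": [0.0] * 3} for k in DATA}
    for who, faces in DATA.items():
        scale = 2.0 / len(faces)
        for x in faces:
            z = encode(x)
            err = [h - xi for h, xi in zip(decode(z, who), x)]
            dz = 0.0  # gradient of the loss with respect to the code z
            for i in range(3):
                g_dec[who]["u"][i] += scale * err[i] * z
                g_dec[who]["c"][i] += scale * err[i]
                dz += scale * err[i] * dec[who]["u"][i]
            for i in range(3):
                g_w[i] += dz * x[i]
            g_b += dz
    for i in range(3):
        enc["w"][i] -= lr * g_w[i]
        for who in DATA:
            dec[who]["u"][i] -= lr * g_dec[who]["u"][i]
            dec[who]["c"][i] -= lr * g_dec[who]["c"][i]
    enc["b"] -= lr * g_b


loss_before = total_loss()
for _ in range(3000):
    train_step()
loss_after = total_loss()

# The swap: encode a face of Person A, then decode with Person B's decoder.
swapped = decode(encode(face_a(0.5)), "B")
print(loss_before, loss_after, swapped)
```

Note that the only thing tying the two people together is the encoder they share. Nothing in the loss forces the same expression onto the same code, so any alignment comes from the two sets of faces being structurally similar.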

Perhaps the most famous example
(Source: dailymail.co.uk)

A pertinent point, then, is that the generated and overlaid face is not truly that of the target person. Strictly speaking, it is an indirect representation, akin to a sketch artist producing his own rendition of his sitter. This setup also allows the decoder to make up for missing data to an extent, by using its learned conceptual expression of a particular face, to "fill in" for missing data. And, despite some very impressive examples, creating good fake videos still takes some work.

Firstly, a ton of images of both subjects are required, with insufficient images leading to bad outputs; this is however perhaps not that big of a problem for celebrities. Secondly, rarer profile poses may be an issue, as are close-ups, with the autoencoder apparently working on 64 pixel square inputs. This may explain why the successful examples tend to be on clips where the subject is some distance away, and mostly facing the camera straight-on. Lastly, unlike Face2Face for example, there don't seem to be any flow constraints between frames. This may contribute to sudden strange "flashes", if a frame in the middle of a sequence has a less-compatible face generated for it.
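A quick back-of-the-envelope calculation shows why close-ups are a problem at that resolution. Assuming the generated face really is 64 pixels square, the sketch below computes how much it must be stretched to cover a face of a given size in the frame. The face sizes are made-up examples and `upscale_factor` is a hypothetical helper.

```python
def upscale_factor(face_px_in_frame: int, model_px: int = 64) -> float:
    """How much a model_px-square output must be stretched to cover the face."""
    return face_px_in_frame / model_px


# Distant subject, medium shot, close-up (illustrative sizes only).
for face_px in (64, 128, 512):
    print(face_px, upscale_factor(face_px))
```

A face spanning 512 pixels would need an 8x enlargement per side, so each generated pixel has to cover an 8 by 8 patch of the frame, and fine detail has nowhere to come from.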

Of course, none of these weaknesses are insurmountable, especially for well-funded professional outfits (such as the now-exposed American Deep State). This has clear implications for video as evidence, particularly combined with voice synthesis - perhaps even by the same encoder-decoder mechanism - that would allow convincing evidence to be produced of any public figure saying and doing just about anything. Very fortunately, interest thus far has been mainly restricted to naughty vids, which has led to the deepfakes subreddit being banned (go to r/fakeapp instead), together with the vids themselves on major platforms - but, let's be honest, bans have never worked...

[To be continued with machine translation...]


Copyright © 2006-2020 GLYS. All Rights Reserved.