These days, most people are accustomed to augmented reality apps, even if they aren’t quite what science fiction once envisioned. We’re still some way from the Star Trek Holodeck, but you can wear new glasses or a funny hat in Snapchat or Facebook Messenger and nobody is surprised anymore. In fact, the surprise only comes when the algorithm fails and the digital enhancement disappears, leaving you staring at your own face again.
Pose estimation is the process by which a computer system identifies a human and estimates or approximates the way in which they are standing (their pose). This is most often achieved using a single 2D image or video input.
It isn’t difficult for a human to identify another human face or body, so why should it be difficult for a computer algorithm to do the same? Generally speaking, humans have one head with two eyes, a nose, and a mouth. Their bodies have two arms and two legs…surely, it’s just a matter of counting!
Why Pose Estimation is Difficult
One factor that humans have as an advantage over many digital systems is stereoscopic vision. Pose estimation, even for Snapchat, usually involves taking a single 2D image and inferring the existence of 3D objects within it.
For a human with experience of the real world, this is not too hard, but for an algorithm it can be quite tricky. Even a human can be fooled by optical illusions and patterns that look like faces, so it is difficult for software with no real-world experience to determine what it is looking at.
But this is just the start of the problem for pose estimation software. It isn’t only a matter of seeing faces in patterns where none exist; sometimes real faces don’t look like faces at all, depending on the angle. And if you’re working from just an outline or silhouette of a whole body, how can you tell whether it is facing towards you or away from you?
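To see why a single 2D image loses so much information, consider the standard pinhole camera model. The sketch below (illustrative only; the joint coordinates are made up) shows that two different 3D points, one near and one far, can project to exactly the same pixel, so depth cannot be recovered from one image alone.

```python
# Minimal pinhole-camera sketch: every 3D point along a ray through the
# camera centre lands on the same 2D pixel, so a single image cannot
# tell a near, small pose from a far, large one.

def project(point3d, focal_length=1.0):
    """Perspective-project a 3D point (x, y, z) onto the image plane."""
    x, y, z = point3d
    return (focal_length * x / z, focal_length * y / z)

near_shoulder = (0.5, 0.2, 2.0)  # hypothetical joint 2 m from the camera
far_shoulder = (1.0, 0.4, 4.0)   # twice as far away, twice the offsets

# Both project to the identical pixel (0.25, 0.1)
print(project(near_shoulder))
print(project(far_shoulder))
```

This is exactly the ambiguity that stereoscopic vision resolves: a second viewpoint turns the one ray into two, and their intersection pins down the depth.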
It is truly surprising and exhilarating to discover how far pose estimation has come over the last three decades.
The origins of Pose Estimation
Back in 1989, a paper from the University of Washington Department of Electrical Engineering (“Pose Estimation from Corresponding Point Data” by Robert M. Haralick, Chung-Nan Lee, Xinhua Zhuang, Vinay G. Vaidya, and Man Bae Kim) attempted to tackle the problems facing engineers working on “computer vision”.
Their system required clear inputs and user intervention to determine the location of bodies within the images they were working with. However, they did find ways of recovering 3D position and orientation from 2D perspective projections, effectively building the basis for creating a 3D digital environment from 2D input.
Further research from the University of Texas at Austin helped to establish the requirements for detecting human motion. Unlike machines, humans move in an elastic and non-rigid way. This presents problems of identification as it cannot be guaranteed that even an individual human will move in the same way twice.
Based on studies from the 70s, it was found that human movement could be inferred from light sources and contours around articulation points – essentially, identifying the joints of the human body and calculating human-like movement based on those spatial points.
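Once joints are treated as spatial points, movement can be described numerically. The toy example below (not from the cited research; the keypoint coordinates are invented) shows the kind of calculation this enables: given 2D keypoints for a shoulder, elbow, and wrist, compute the interior angle at the elbow joint.

```python
import math

def joint_angle(a, b, c):
    """Interior angle at vertex b, in degrees, formed by points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])  # vector from elbow to shoulder
    v2 = (c[0] - b[0], c[1] - b[1])  # vector from elbow to wrist
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(dot / norm))

shoulder, elbow, wrist = (0, 0), (1, 0), (1, 1)  # hypothetical keypoints
print(joint_angle(shoulder, elbow, wrist))  # a right-angled elbow: 90.0
```

Tracking how such angles change from frame to frame is one simple way to characterize “human-like” movement without modelling the whole body.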
By combining these works, an algorithm to detect human shapes became a workable proposition.
It’s not science…
Science relies on facts and figures to prove points. It has long been said that correlation does not indicate causation, but pose estimation turns that maxim to its advantage. By 1998, identifying humans and human motion within video images had proven more efficient using correlation: if it looks like a human, it probably is.
This reduces the need for exacting tests to determine whether a subject is definitely human, which minimizes the computing power required. What once took a powerful computer many hours to analyze could be matched by a lower-powered machine in minutes.
Finding points in 3D space using 2D inputs became simpler, and the use of neural networks – the basis of much AI – allowed “noisy” inputs to be used. This meant that busy backgrounds could be ignored, and the subject could be found. Geometric body models had been used with limited success, but a system that “learned” about the human form would be much more flexible.
By 2006, research was accelerating and papers on the subject of computer vision and human motion capture were averaging over 100 per year. New algorithms and methods were being created at a rapid rate for quickly determining what human shapes were, tracking such shapes, and estimating the pose.
The problem of noisy backgrounds had been overcome, and humans and their motions could be identified with relative ease. Handheld devices opened new opportunities (for example, AR hats and glasses) and it was not long before stores realized the possibilities – what if users could try on outfits without having to buy them first?
This would take pose estimation to a new level. A hat and glasses is one thing, but identifying the human while simultaneously creating an internal model that the software could “dress” seems nigh-on impossible.
Could these problems be overcome? And could a handheld device be powerful enough to even attempt to solve these problems?
You’ll find out in part two of this three-part series!