From avatars to visual capture

Figure 10.14: A collection of starter avatars offered by Second Life.

Figure 10.15: Holographic communication research from Microsoft in 2016. A 3D representation of a person is extracted in real time and superimposed in the world, as seen through augmented reality glasses (Hololens).

How should others see you in VR? This is one of the most intriguing questions because it depends on both the social context and the technological limitations. A clear spectrum of possibilities exists. At one extreme, a user may represent himself through an avatar, which is a 3D representation that might not correspond at all to his visible, audible, and behavioral characteristics; see Figure 10.14. At the other extreme, a user might be captured using imaging technology and reproduced in the virtual world with a highly accurate 3D representation; see Figure 10.15. In this case, it may seem as if the person were teleported directly from the real world to the virtual world. Many other possibilities exist along this spectrum, and it is worth considering the tradeoffs.

One major appeal of an avatar is anonymity, which offers the chance to play a different role or exhibit different personality traits in a social setting. In a phenomenon called the Proteus effect, it has been observed that a person's behavior changes based on the virtual characteristics of the avatar, which is similar to the way in which people have been known to behave differently when wearing a uniform or costume [360]. The user might want to live a fantasy, or try to see the world from a different perspective. For example, people might develop a sense of empathy if they are able to experience the world from an avatar that appears to be different in terms of race, gender, height, weight, age, and so on.

Users may also want to experiment with other forms of embodiment. For example, a group of children might want to inhabit the bodies of animals while talking and moving about. Imagine if you could have people perceive you as an alien, an insect, an automobile, or even a talking block of cheese. People were delightfully surprised in 1986 when Pixar brought a desk lamp to life in the animated short Luxo Jr. Hollywood movies over the past decades have been filled with animated characters, and we have the opportunity to embody some of them while inhabiting a virtual world!

Figure 10.16: The Digital Emily project from 2009: (a) A real person is imaged. (b) Geometric models are animated along with sophisticated rendering techniques to produce realistic facial movement.

Now consider moving toward physical realism. Based on the current technology, three major kinds of similarity can be independently considered:

  1. Visual appearance: How close does the avatar seem to the actual person in terms of visible characteristics?
  2. Auditory appearance: How much does the sound coming from the avatar match the voice, language, and speech patterns of the person?
  3. Behavioral appearance: How closely do the avatar's motions match the body language, gait, facial expressions, and other motions of the person?

The first kind of similarity could start to match the person by making a kinematic model in the virtual world (recall Section 9.4) that corresponds in size and mobility to the actual person. Other simple matching such as hair color, skin tone, and eye color could be performed. To further improve realism, texture mapping could be used to map skin and clothes onto the avatar. For example, a picture of the user's face could be texture mapped onto the avatar face. Highly accurate matching might also be made by constructing synthetic models, or combining information from both imaging and synthetic sources. Some of the best synthetic matching performed to date has been by researchers at the USC Institute for Creative Technologies; see Figure 10.16. A frustrating problem, as mentioned in Section 1.1, is the uncanny valley. People often describe computer-generated animation that tends toward human realism as resembling zombies or talking cadavers. Thus, being far from perfectly matched is usually much better than "almost" matched in terms of visual appearance.

For the auditory part, users of Second Life and similar systems have preferred text messaging. This interaction is treated as if they were talking aloud, in the sense that text messages can only be seen by avatars that would have been close enough to hear them if they had been spoken at the same distance in the real world. Texting helps to ensure anonymity. Recording and reproducing voice is simple in VR, making it much simpler to match auditory appearance than visual appearance. One must take care to render the audio with proper localization, so that it appears to others to be coming from the mouth of the avatar; see Chapter 11. If desired, anonymity can be easily preserved in spite of audio recording by using real-time voice-changing software (such as MorphVOX or Voxal Voice Changer); this might be preferred to texting in some settings.
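The distance rule for text messages can be illustrated with a minimal sketch. It is not how Second Life or any particular system implements chat; the `Avatar` class, the `recipients` function, and the 20-meter `hearing_radius` default are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Avatar:
    name: str
    position: tuple  # (x, y, z) in meters; units are an assumption


def recipients(sender, avatars, hearing_radius=20.0):
    """Return the names of avatars close enough to 'hear' the sender.

    hearing_radius is a hypothetical cutoff, in meters, standing in for
    the distance at which speech would be audible in the real world.
    """
    return [
        a.name
        for a in avatars
        if a is not sender
        and math.dist(sender.position, a.position) <= hearing_radius
    ]


alice = Avatar("alice", (0.0, 0.0, 0.0))
bob = Avatar("bob", (3.0, 4.0, 0.0))        # 5 m from alice
carol = Avatar("carol", (100.0, 0.0, 0.0))  # 100 m from alice
nearby = recipients(alice, [alice, bob, carol])
```

Here only `bob` would receive a message typed by `alice`. A hard cutoff is the simplest choice; a system could instead fade or truncate messages with distance.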

Finally, note that the behavioral appearance could be matched perfectly, while the avatar has a completely different visual appearance. This is the main motivation for motion capture systems, in which the movements of a real actor are recorded and then used to animate an avatar in a motion picture. Note that movie production is usually a long, off-line process. Accurate, real-time performance that perfectly matches the visual and behavioral appearance of a person is currently unattainable in low-cost VR systems. Furthermore, capturing the user's face is difficult if part of it is covered by a headset, although some recent progress has been made in this area [180].

Figure 10.17: Oculus Social Alpha, which was an application for Samsung Gear VR. Multiple users could meet in a virtual world and socialize. In this case, they are watching a movie together in a theater. Their head movements are provided using head tracking data. They are also able to talk to each other with localized audio.

On the other hand, current tracking systems can be leveraged to provide accurately matched behavioral appearance in some instances. For example, head tracking can be directly linked to the avatar head so that others can know where the head is turned. Users can also understand head nods or gestures, such as "yes" or "no". Figure 10.17 shows a simple VR experience in which friends can watch a movie together while being represented by avatar heads that are tracked (they can also talk to each other). In some systems, eye tracking could also be used so that users can see where the avatar is looking; however, in some cases, this might enter back into the uncanny valley. If the hands are tracked, which could be done using controllers such as those shown in Figure 10.12, then they can also be brought into the virtual world.
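As a rough illustration of linking head tracking to an avatar head, the sketch below copies a tracked orientation onto the avatar and computes the direction the head is facing. It rests on several assumptions that are not from the text: the tracker reports a unit quaternion in (w, x, y, z) order, the coordinate frame is right-handed with y up, and a neutral head looks along the negative z axis. The names `rotate`, `AvatarHead`, and `update_from_tracker` are hypothetical.

```python
import math


def rotate(q, v):
    """Rotate 3D vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    # v' = v + w * t + (q_vec x t)
    return (
        v[0] + w * tx + (y * tz - z * ty),
        v[1] + w * ty + (z * tx - x * tz),
        v[2] + w * tz + (x * ty - y * tx),
    )


class AvatarHead:
    """Hypothetical avatar head driven directly by tracking data."""

    FORWARD = (0.0, 0.0, -1.0)  # assumed neutral looking direction

    def __init__(self):
        self.orientation = (1.0, 0.0, 0.0, 0.0)  # identity rotation

    def update_from_tracker(self, q):
        # Direct link: the avatar head takes on the tracked orientation.
        self.orientation = q

    def facing_direction(self):
        return rotate(self.orientation, self.FORWARD)


# Example: the user turns their head 90 degrees about the vertical axis.
half = math.radians(90.0) / 2.0
head = AvatarHead()
head.update_from_tracker((math.cos(half), 0.0, math.sin(half), 0.0))
direction = head.facing_direction()
```

Under these conventions, `direction` is approximately (-1, 0, 0), so other users would see the avatar head turned to the side. A nod or head shake would appear as a short oscillation of this orientation about the pitch or yaw axis over successive tracker samples.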

Steven M LaValle 2016-12-31