Trending in 2014:
- dashboard cams
- cop cams
- home security cameras
- GoPro
- Google Glass
- stealth video glasses
- first person hyperlapse video
- quantified self
We are recording more and more of our personal lives, and we are starting to do it with passive devices that film whatever is in front of them.
What happens when the technology advances a few more steps? We will have:
- unobtrusive devices
- recordings that are streamed directly to the cloud and stored forever
- batteries that last all day
- high-resolution video and high-quality stereo audio
- a digital rights system that protects everyone's privacy
Privacy

We enter a new era of digital rights when people begin recording their everyday conversations. If you film me talking to you, you own the recording. You can play it for yourself whenever you want. But there would be terrible consequences if you could make it public without my permission, or hide it from me. Most communication would go off camera.
There has to be a way for the subjects of a video to control access to it.
How do we know who is in a video? The video is geo-coded. If the direction it is pointed in is also known, and the positions of the people nearby are known (since they are also creating geo-coded videos), then we can be accurate about who is in the video, and therefore who needs to be consulted before access is granted to people who were not there. Face recognition serves as a double-check.
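As a sketch of that geometric test (everything here is invented for illustration: the function names, the 30-meter range, the flat-earth shortcut), a frame tagged with the camera's position, compass heading, and field of view could be checked against the known positions of nearby people like this:

```python
import math

def bearing_deg(cam_lat, cam_lon, lat, lon):
    """Approximate compass bearing from the camera to a nearby point.
    A flat-earth approximation is fine at the tens-of-meters range
    that matters for deciding who is in frame."""
    d_north = lat - cam_lat
    d_east = (lon - cam_lon) * math.cos(math.radians(cam_lat))
    return math.degrees(math.atan2(d_east, d_north)) % 360

def in_frame(cam_lat, cam_lon, heading_deg, fov_deg,
             person_lat, person_lon, max_range_m=30.0):
    """Is a person plausibly visible in this geo-coded frame?
    heading_deg: compass direction the camera is pointed (0 = north).
    fov_deg: horizontal field of view of the lens."""
    # Rough distance in meters (1 degree of latitude is about 111 km).
    d_north_m = (person_lat - cam_lat) * 111_000
    d_east_m = (person_lon - cam_lon) * 111_000 * math.cos(math.radians(cam_lat))
    if math.hypot(d_north_m, d_east_m) > max_range_m:
        return False
    # Angular offset between the camera heading and the bearing to the person.
    offset = (bearing_deg(cam_lat, cam_lon, person_lat, person_lon)
              - heading_deg + 180) % 360 - 180
    return abs(offset) <= fov_deg / 2

# Anyone who passes this test is treated as a subject of the video;
# face recognition then confirms or rejects each candidate.
print(in_frame(37.7749, -122.4194, heading_deg=90, fov_deg=70,
               person_lat=37.77491, person_lon=-122.41925))
```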
Some principles:
- I can always access my own feed, no matter who is in it.
- I can access any feed that I am in.
- If I am the only person in my feed, I have complete control over its access by other people.
- If my feed contains other people, I cannot necessarily share it. I need permission from the other people.
- I can release any feed containing me to any given group. But it only becomes available to that group when everyone in the feed allows access by that group.
- Each person can set a default access level. If I'm going out, I may want everything to be public by default. For many people, that will be the default most of the time. But there's an easy way to switch to private whenever I want, which means any video that includes me will require my explicit permission to share.

People can decide to make portions of their feeds available to various groups, and the general public is one such group.
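A sketch of how these principles might be enforced, assuming each clip carries the list of people who appear in it and each person's per-group grants (all of the names and structures below are hypothetical, not a real system):

```python
def can_view(viewer, clip_owner, subjects, grants, group_members):
    """Decide whether `viewer` may see a clip.
    subjects: set of people who appear in the clip.
    grants: person -> set of groups that person has allowed for this clip
            (a person's default access level can pre-populate this).
    group_members: group name -> set of people in that group."""
    # I can always access my own feed, and any feed that I am in.
    if viewer == clip_owner or viewer in subjects:
        return True
    # Otherwise the clip is visible only through a group that the owner
    # and every subject have allowed, and that the viewer belongs to.
    everyone = set(subjects) | {clip_owner}
    shared = set.intersection(*(grants.get(p, set()) for p in everyone))
    return any(viewer in group_members.get(g, set()) for g in shared)

# Alice filmed a clip containing Alice and Bob.
grants = {"alice": {"public", "friends"}, "bob": {"friends"}}
groups = {"public": {"carol", "dave"}, "friends": {"carol"}}
print(can_view("carol", "alice", {"alice", "bob"}, grants, groups))  # True: both allowed "friends"
print(can_view("dave", "alice", {"alice", "bob"}, grants, groups))   # False: Bob never allowed "public"
```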
The movie of your life
First person videos are always lacking something -- the hero of the story is behind the camera, never seen. As recording becomes omnipresent, there will be images of you in feeds filmed by other people. Software will be able to stitch together the information in multiple videos to create something resembling a movie about your day, automatically selecting scenes with interesting content, showing you as you appear to the people you have encountered.
Hitchcock said "drama is real life with all the boring parts cut out". The same software that presents the interesting moments of your day can make the movie of your life, at various running times. Everybody may have a novel in them, but someone has to write it. Your life movie is generated automatically. All you have to do is wear the device.
Passive filming produces videos that are mostly dead space. No one watches them from start to finish, much less live. But they capture rare moments of interest that are missed by active recording. The times we pose for a photo and say "cheese" are significant but dull. We will stop interrupting our lives to take pictures and start picking key frames from video feeds instead.
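The cutting itself could start very simply. As a sketch (the scoring is assumed to already exist; in practice it is the hard part, combining motion, detected speech, and familiar faces), selecting the highlights of a day-long feed might look like this:

```python
def select_highlights(scores, keep=5, min_gap=10):
    """Pick the `keep` most interesting moments from a day-long feed.
    scores: list of (timestamp_minutes, interest_score) pairs.
    min_gap: skip moments closer than this many minutes to one already
             chosen, so the cut samples the whole day, not one event."""
    chosen = []
    for t, s in sorted(scores, key=lambda p: p[1], reverse=True):
        if all(abs(t - c) >= min_gap for c, _ in chosen):
            chosen.append((t, s))
        if len(chosen) == keep:
            break
    return sorted(chosen)  # play the selected moments in chronological order

# A toy day: mostly dead space, with a few spikes of activity.
day = [(minute, 0.05) for minute in range(0, 960, 15)]
day += [(125, 0.9), (126, 0.8), (480, 0.7), (700, 0.95)]
print(select_highlights(day, keep=3))  # [(125, 0.9), (480, 0.7), (700, 0.95)]
```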
Our collective perceptual universe

Taken together, these videos form a complete record of whatever is seen and heard by all the people wearing recording devices, and we are stipulating that this is everyone.
The vast majority of the information we receive is through sound and image. By a remarkable coincidence, these are also the two senses that can be captured and replayed at will. Or maybe it's not a coincidence. We could have developed methods to record and play back sequences of smell, or touch, or taste, but why? Odorama never caught on. Audio and video are enough.
Metaphysics

There is a much richer physical reality that causes the video and audio, one far more difficult to capture, but we are also stipulating that the recording devices are as good as our senses. For our normal human purposes, the deep physical reality below our awareness is not important. We have only our perceptions of it to work with to build the world of experience and ideas. The perceptual world is becoming tractable in a way the physical world below it cannot be. These feeds form the bedrock and raw data for any true description of the perceptual world.
Street View Video

The feeds available to you are stitched together into an interface like Street View. You will be able to pan and zoom around the map and inside buildings, now with a lot more granularity and the added dimension of time. You can sit at a cafe and watch the world go by all day, thanks to feeds from multiple patrons, or see a concert from the point of view of every audience member.
Street View Video will become, over time, a partial 3D model of the physical world. How do we combine multiple feeds? Street View does its best to morph one street photo into the next as you virtually walk down the street. The corresponding operation for video is a lot more complicated. The work on hyperlapse video applies here, extended to multiple feeds from multiple sources. Each new feed of the same event from a different perspective clarifies some part of the unified model.
Your complete history is a combination of your own feed and the feeds of everyone around you. The 3D model gives out at the edges. You can't go watch that tree falling in a forest unless someone was there to see it.
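The full multi-feed problem is far beyond a few lines, but the basic pairwise step, relating two views of the same scene, is well understood. A sketch using OpenCV (my choice for illustration; nothing here commits to a particular library), given two grayscale frames from different wearers:

```python
import cv2
import numpy as np

def align(frame_a, frame_b, min_matches=10):
    """Estimate the homography relating two frames of the same scene.
    frame_a, frame_b: grayscale frames as uint8 numpy arrays."""
    orb = cv2.ORB_create(1000)                       # local feature detector
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None                                  # the views barely overlap
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC keeps the static scene and discards points on moving people,
    # giving the mapping a Street View-style morph between viewpoints needs.
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return homography
```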
Information extraction

We can also make symbolic models using this raw data, recognizing speech and faces across multiple videos. Individuals would be represented by point locations, and speech by text. This leaves behind most of the detail at the raw feed level in favor of a more compact model that's easier to navigate and contains much of the interesting information. You might explore at this level, then dive down at times into the raw feeds for the actual video.
The software that edits out the boring stuff will rely heavily on recognized speech and natural language processing methods like text summarization and named entity recognition.
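As a small example of what that looks like, a few lines of spaCy (my choice here; no particular tool is implied, and the transcript is made up) pulling the who/where/when out of a stretch of recognized speech:

```python
import spacy

# A small English pipeline (assumes `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

transcript = ("We met Maria outside the theater around noon "
              "and talked about the trip to Portland in March.")

doc = nlp(transcript)
# Named entities are the hooks for indexing and summarizing a feed.
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Maria PERSON, Portland GPE, March DATE
```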
Speech

The meaning of a word is only approximated by a dictionary definition. Meaning is determined by use. Which use? Some uses are more equal than others, but the full story of a word would include every use. Every time someone has uttered or written a word, they contribute some epsilon of meaning to it. Web-based written text allows this kind of analysis now; the new thing is that we can now see the much larger corpus of speech as well.
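A toy version of that idea is easy to state with word embeddings: train vectors on transcribed utterances so that words used in similar contexts land near each other, and every additional use nudges the vectors a little. The gensim library and the four-sentence corpus are only stand-ins:

```python
from gensim.models import Word2Vec

# Each utterance arrives as a list of tokens; the real corpus would be
# everything anyone says on camera, not four sentences repeated.
utterances = [
    ["grab", "a", "coffee", "before", "the", "meeting"],
    ["the", "coffee", "at", "that", "cafe", "is", "terrible"],
    ["meet", "me", "at", "the", "cafe", "after", "lunch"],
    ["lunch", "ran", "long", "so", "we", "skipped", "the", "meeting"],
] * 50

model = Word2Vec(utterances, vector_size=32, window=3, min_count=1,
                 epochs=20, seed=1)

# A word's neighbors in this space are an approximation of its use.
print(model.wv.most_similar("coffee", topn=3))
```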
Levels of summarization

At a much higher level, there is the News, where reporters write descriptions of events by summarizing the feeds of the participants. Summarization at this level is beyond the software I am imagining.
Beyond News is News Analysis and finally History, which summarizes thousands of news reports.
All levels of summarization will be traced back to the raw data they describe. In the future, there are no more arguments about facts, only about interpretations.
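One way to make that traceability concrete is for every statement, at every level, to keep pointers to the statements or feed segments it summarizes, so a reader can drill from History down to raw video. A sketch with invented names:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeedSegment:
    feed_id: str          # whose recording
    start: float          # seconds into the feed
    end: float

@dataclass
class Statement:
    text: str
    level: str                                    # "news", "analysis", "history"
    sources: List["Statement"] = field(default_factory=list)
    evidence: List[FeedSegment] = field(default_factory=list)

    def raw_evidence(self):
        """Walk down through the levels to the raw feed segments."""
        segments = list(self.evidence)
        for source in self.sources:
            segments.extend(source.raw_evidence())
        return segments

# A news sentence cites raw feeds; a history sentence cites news sentences.
news = Statement("The council meeting ended in a walkout.", "news",
                 evidence=[FeedSegment("feed:alice", 3600.0, 3720.0)])
history = Statement("Local government gridlock worsened that year.", "history",
                    sources=[news])
print(history.raw_evidence())     # the underlying segment(s) of raw video
```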
It is amazing that this is in the realm of possibility. The perceptual world is huge but finite and we may someday have enough storage for the whole thing.
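A back-of-envelope check on "huge but finite", with numbers that are only assumptions (population, waking hours, a modest video bitrate):

```python
people = 8e9                 # assumed world population
hours_per_day = 16           # waking hours recorded
bitrate_bps = 5e6            # ~5 Mbit/s: reasonable 1080p video plus stereo audio

seconds_per_year = 3600 * hours_per_day * 365
bytes_per_person_year = bitrate_bps / 8 * seconds_per_year
total_per_year = people * bytes_per_person_year

print(f"per person per year: {bytes_per_person_year / 1e12:.1f} TB")   # ~13 TB
print(f"everyone, per year:  {total_per_year / 1e21:.0f} ZB")          # ~100 ZB
```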