All Your Data Are Belong to Us

Proper geeks, video-game fans and internet historians will immediately comprehend the title. For the rest of you, leading normal balanced lives, it is a reference to the internet meme originating from the Japanese video game Zero Wing. When released in Europe on the Sega Mega Drive, the dramatic opening scene included this cutaway :

Pretty clear, you’re in trouble. Time for action. Beat the bad guys.

From that fairly innocuous translation hiccup, an internet meme grew, popularised on the geeky message boards of the day. Eventually it broke free of the virtual world into the real world, appearing on T-shirts, record covers, and memorably, the side of the road through the Nevada desert :

So enough of bases, what about data ? Who does it belong to ? Who owns your data ? If it’s not you, then do you really own it or are you perhaps leasing it to someone, in a trade for some kind of service ?

I was thinking about this in relation to my usual Big Data analytical touchstones, Google, Facebook and Amazon.

Google first, of course. Perhaps the biggest data (base) of them all. They store our email for us. They log our searches, our location, our likes, our friends, our videos, photos and anything else we offer to them through their excellent Gmail, YouTube, Maps, Android, Picasa and G+ services. Do we own that data ? I think we believe we do. But we certainly offer it with the understanding and agreement that Google can process it, analyse it, and profit from it. In return, we get our email service for free, a place to store and share our pictures, and somewhere to connect with friends and like-minded people. For many, it’s a good trade. For Google, it’s an excellent trade. $5 billion a month revenue from selling the results of their data analysis to advertisers, keen to target the exact person who is looking for what they are selling. It’s a great business.

My post this time is not about that business model. I covered it in an earlier post. Facebook is similar, collecting and storing your data, offering you services in return – easy sharing of photos with friends, chat, a gossip stream. The business model is proven, assuming a reasonable store of data, low ( or zero ) acquisition costs, low storage costs, and excellent services on the front-end and analytics on the back end. So far so good. My post this time is about ownership. All your data are belong to … who ?

And as ever, my usual question that provides deeper understanding, why ?

So the “who” first. I’ll use Facebook, because I used Google in the previous post. If you are one of the 1 billion who use Facebook, you send them data to store and distribute. Your “posts”, your “likes”, your biographical information ( date of birth, sex, people you add as relatives ) and your photos are all data. Facebook stores it and, like Google, LinkedIn and the rest, sells the analysis of your likes and loves, connections and location, to advertisers. So they store it, and in return, you get to choose who sees the data. Though not the analytics, an important distinction.

And that’s the “why?” answered. Why would you send fairly personal information – your age, relationship status, location, friends, holiday snaps, to a company, for free ? The answer is, they allow you to control the access to that information. They allow you to access others’ information that they permit you to see.

You can set up your “Friends” list, maybe another list for “Family”, and decide which bits of data you share with whom. Pictures to Family only maybe. Posts and likes to Friends. That’s all. That’s the why. Facebook ( and Google, LinkedIn, etc. ) are giant data stores, which offer the data provider the opportunity to selectively provide access. There is value in that. It’s not so dissimilar to a stock exchange. Data providers ( market participants, such as banks ) provide data, the exchange collates it, and provides it selectively to its paying customers.

In each case, the data store host ( Facebook, Google, LinkedIn, NYSE … ) acts as benevolent guardian of other people’s data. A kind of maître d’ at an enormous data restaurant. For this to work, the data store host must have data that people or businesses want. People or businesses want market data. People also want to know what their friends are doing and likewise want to share with their friends what they are doing. Facebook ( Google, etc ) doesn’t own the data. If you look hard enough, you can find a button that will let you download your entire Facebook history – posts, photos, activity – back to your computer. You definitely still own it. Facebook just hosts it and provides you with the ability to selectively share.

The interesting ( for me ) thing is, what if we didn’t need a benevolent guardian to store all the data collectively ? It works as a business model, because that data store is analysable for saleable information to advertisers. So economically, that’s the trade. Facebook offer us free data storage and selective sharing, and in return pay for their considerable storage and technology costs by selling advertising. But I’m a technologist. So technically, is it necessary ?

I think perhaps not. Let’s continue using Facebook. Just examine the technical aspect. I send my data ( photos, likes, location ) to Facebook. They store it. They have sharing controls that ensure, I hope, that only my nominated contact groups can access that data. Do I actually need to send them my data though ? Technically, all they need is a mapping of which bits I want to share to which groups and the location of those bits. They don’t actually need to host my data. I could host it myself. Sounds hard right ? But in fact, for things like photos, I already do. On my phone. Or my computer. The data I send to Facebook is often a duplicate of data I have elsewhere. Facebook have the directory, the list of unique users, and that’s highly significant. There needs to be a single directory of authenticated, unique, users against which one can apply permissions. But there does not need to be a single data store. Which flushes out a Big Data Big Point :

Data doesn’t have to be in one place

What if, instead of posting a photo, I merely updated Facebook’s database with a pointer to the location of my photo, and who I permitted to access ( view ) it ?

There would be some advantages for me. I would not have to upload a copy of my photo, saving time and bandwidth. I would retain actual ownership of the bits – if I deleted the bits ( the photo file ) on my PC or phone, it would immediately be inaccessible. A point anyone with teenagers obsessed with sending photos to their friends will understand. Once you distribute data outside your own ownership, you lose some control over it, possibly embarrassingly. So if I just send a pointer, I would not have to be concerned about that data clone I had uploaded to Facebook. I would retain direct control over the accessibility of my data, rather than give Facebook proxy rights.
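To make the pointer idea concrete, here is a minimal sketch in Python. It is purely illustrative and not how Facebook or anyone else actually works : the names `Directory`, `Pointer`, `fetch` and the `phone://` location string are all my own inventions, and a plain dictionary stands in for the files on my phone. The directory holds only groups, pointers and permissions. The bits stay with me.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Pointer:
    owner: str
    location: str  # where the owner keeps the bits (hypothetical address format)
    allowed_groups: Set[str] = field(default_factory=set)


class Directory:
    """Hypothetical clearing house: stores pointers and permissions, never the data."""

    def __init__(self) -> None:
        # (owner, group name) -> members of that group
        self.groups: Dict[Tuple[str, str], Set[str]] = {}
        # item id -> pointer to the owner's copy
        self.pointers: Dict[str, Pointer] = {}

    def set_group(self, owner: str, name: str, members: Set[str]) -> None:
        self.groups[(owner, name)] = set(members)

    def publish(self, item_id: str, owner: str, location: str, groups: Set[str]) -> None:
        self.pointers[item_id] = Pointer(owner, location, set(groups))

    def resolve(self, item_id: str, viewer: str) -> Optional[str]:
        """Return the location only if the viewer is the owner or in an allowed group."""
        ptr = self.pointers.get(item_id)
        if ptr is None:
            return None
        if viewer == ptr.owner:
            return ptr.location
        for group in ptr.allowed_groups:
            if viewer in self.groups.get((ptr.owner, group), set()):
                return ptr.location
        return None


def fetch(directory: Directory, owner_store: Dict[str, bytes],
          item_id: str, viewer: str) -> Optional[bytes]:
    """Ask the directory for permission, then read the bits from the owner's own store."""
    location = directory.resolve(item_id, viewer)
    if location is None:
        return None
    # If the owner has deleted the file, there is nothing left to serve.
    return owner_store.get(location)
```

In use, I would create a “Family” group, publish a pointer to a photo on my phone, and anyone in that group could fetch it. Anyone else gets nothing. And if I delete the photo from my own store, `fetch` returns nothing for everybody, permissions or not, because the directory never held a copy.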

Which reveals another Big Data Big Point :

To profit from it, you don’t need to own it

There would be some advantages for the data store provider too. It would scale much more efficiently : There are 1 billion Facebook users today. If there are 2 billion in a few years, Facebook, using its current model, will have to more than double its datacentre and systems capacity if it keeps requiring everyone to upload their data. Using my alternative “keep the data yourself, we’ll act as clearing house/exchange for the permissioning” system, the majority of the data storage requirements ( the actual photo.jpg file for example ) will not grow. Just the directory and permissioning schema. It doesn’t always follow that big data needs to be in one big place. With a little bit of software, you could selectively share your existing photo archive, or journal. It’s the permissioning that has the value to the end-user. It’s the analytics that has the value to the data provider. It’s also the reason why people have reasonable concerns about the privacy of that data, regardless of controls. There is a certain fear about giving too much away.
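A back-of-the-envelope sketch of that scaling argument, again in Python. The per-user figures below are placeholders I have made up for illustration, not real Facebook numbers, and `host_storage_bytes` is a hypothetical helper. The point is only the shape of the comparison : in both models the host’s storage grows with the number of users, but in the pointer model it grows by the size of a directory entry rather than the size of a photo archive.

```python
def host_storage_bytes(users: int,
                       directory_bytes_per_user: int,
                       content_bytes_per_user: int,
                       hosts_content: bool) -> int:
    """Storage the host must provide, under a simple per-user average."""
    per_user = directory_bytes_per_user
    if hosts_content:
        # Current model: the host also keeps a copy of everything uploaded.
        per_user += content_bytes_per_user
    return users * per_user


# Assumed, illustrative averages only.
DIRECTORY_BYTES = 100 * 1024          # permissions and pointers per user
CONTENT_BYTES = 500 * 1024 * 1024     # photos, videos and posts per user

for users in (1_000_000_000, 2_000_000_000):
    upload_model = host_storage_bytes(users, DIRECTORY_BYTES, CONTENT_BYTES, True)
    pointer_model = host_storage_bytes(users, DIRECTORY_BYTES, CONTENT_BYTES, False)
    print(users, upload_model, pointer_model)
```

Whatever the real averages are, the gap between the two columns is the content that users would keep on their own devices.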

I like the idea of retaining my own data, and having a service for permissioning selective sharing. It’s technically possible now. Economically ? Well I’m still working on that. The analytics are a lot harder to do if you don’t have direct access to the data. That’s exactly why you have to upload your posts, photos, videos and other information to Facebook, Google et al. Thereby illuminating our final Big Data Big Point :