Open Data: Why the Crowd Can Be Your Best Analytics Tool
By Sean Gorman
Sean Gorman is the president and founder of FortiusOne, which brings data and mapping solutions to the mass market through its location analysis software. With FortiusOne's GeoIQ platform, data is easily shared, visualized and analyzed for more collaborative and better-informed decisions.

The web will continue to generate data at an explosive rate, and it will generate even more now that mobile devices have created yet another path to reach that data. For example, mobile traffic alone is predicted to exceed two exabytes per month by 2013. There are more than 90 million tweets per day and more than 60 billion images on Facebook. This is just the tip of the iceberg. Out of this bounty of data emerged "data science" and a plethora of new tools to deal with the size and speed of information. Hadoop, HBase, Cassandra, MongoDB, Node.js, Hive, R, and Pig are just a few of the tools and techniques that have emerged to wrestle the growing juggernaut of data.

The explosion in new tools and the demand to implement them have far exceeded the number of data scientists available. When we look at the insight and intelligence that companies like LinkedIn, Facebook and Twitter have been able to mine about the preferences and behaviors of their users, it is no surprise that data scientists are in high demand. It is not just social media data, either: financial firms, consumer packaged goods companies, marketers and even governments are turning to these new skills and techniques to answer new business questions. The rapid rise in demand and the shortage of trained experts have led to the emergence of tools that democratize access to big data. Innovative startups like Datameer and Factual offer simple spreadsheet interfaces for basic slicing and dicing, and larger players like Google have launched Fusion Tables to allow slicing and visualization of medium-sized (100MB) data sets.
The Challenges of Big Data
This sprawling mass of emerging data brings with it a host of challenges. As we slice and dice data, how do we keep track of the many permutations it creates? Which bits are meaningful and validated? How do we move beyond just counting and binning the data to answer more meaningful questions for businesses? As a technology community, we've done a brilliant job of crowdsourcing data, making its creation and curation a social enterprise. We've even made the creation of code social through the open source movement and tools like GitHub. Yet for all our innovation, we've done little to harness the collective Internet community to analyze the data we create. While our analyses and visualizations are elegant and often beautiful, they are too often built in isolation.

If we were to peer into the not-too-distant future, how could we use the collective to analyze data and archive its evolution, letting others examine particular pieces of data further and run in new directions? Let's watch an analysis evolve socially as many hands look for patterns across a large data stream. Using hypothetical contributors, we'll start with a chunk of data comprising all tweets mentioning "Walmart" during Black Friday, November 26, 2010. "John" examines the data, extracts all the tweets that came from mobile devices and plots them on a map:
He posts the results and data on his blog so others can extend or tweak the analysis. "Kate," one of his readers, checks out the data and thinks it looks cool, but finds it too hard to see a pattern with so many dots on the map. Kate then takes John's data and forks it with her own analysis, counting all the tweets about Walmart in each county:
Seeing Kate's analysis, another reader, "Bill," wonders about the relationship between tweets mentioning Walmart and the locations of its stores. How often is a Walmart store nearby when someone tweets about Walmart? He finds that 67% of the variation in tweets is explained by the number of Walmart stores located in each county.
Another potential reader, "Lauren," a Walmart marketing VP, finds this pattern very intriguing. The analysis shows that when a promotion is sent to people discussing Walmart, there is a high likelihood that a store is nearby to redeem it. Next, her mind runs to other variables she could plug into the equation: population, demographics, competitor mix, weather, traffic, etc. She could fuse and filter this collection of contextual data to target advertisements: for example, someone tweeting from a mobile device a mile from a Walmart, in a location with a density of 30- to 40-year-old single moms and a forecasted heat wave. Leveraging these dynamic results, Lauren can query the inventory analytics and immediately push out a promotion for kiddie pools and squirt guns. She can automate this algorithm to generate new promotions based on the streaming data and adjust to inventory levels in real time.
One of the early premises of Web 2.0 was that data would be "the Intel inside," and that firms like NAVTEQ that provide data would be the big winners. Today we are seeing crowdsourcing increasingly commoditize data, with projects like OpenStreetMap replacing the NAVTEQs of the world. As the market moves up the chain, the future value will lie in the meaningful questions we can answer with data. This will mean more focus on the "science" side of "data science." The more social and collaborative we make the science, the better the answers we'll create, at the scale needed for an explosive market.