Analysis
By Emmanuel Letouzé
What is big data, and could it transform development policy? Emmanuel Letouzé takes a close look at this emerging field.
In just a few years 'big data' have affected industries and activities from marketing and advertising to intelligence gathering and law enforcement, stirring much excitement and scepticism.
With policymaking increasingly looking like big data's next frontier, is this phenomenon - what one expert, Andreas Weigend, calls the 'new oil' that needs to be refined - poised to be a blessing or a curse for human development and social progress? [1,2]
Optimists are calling it a revolution that will change, mostly for the better, "how we live, think and work" (see below for a video of The Economist's Kenneth Cukier). Some World Bank officials have even expressed the hope that "Africa's statistical tragedy" - that is, the dearth of reliable official statistics in some of the world's poorest places - may be partly fixed by big data.[3,4]
But sceptics and critics have been more circumspect, and some plainly antagonistic - referring to big data as a big ruse, a big hype, a big risk as well as, of course, 'big brother', in the wake of the revelations by former US National Security Agency contractor Edward Snowden.
Big data, especially as applied to development and public policy issues, is in its intellectual and operational infancy. Joe Hellerstein, computer scientist at the University of California, Berkeley, United States, made an early mention of an upcoming "Industrial Revolution of data" in November 2008, while
The Economist talked about a "data deluge" in early 2010.[5,6] 'Big data' itself became a mainstream term only a couple of years ago. Google searches (see Figure 1) are one metric that shows this: the number of searches that include the term did not take off until 2011-12.
In those two years, four major reports were published: by the UN Global Pulse, the World Economic Forum, the McKinsey Global Institute and Danah Boyd and Kate Crawford, researchers at Microsoft and academic institutions. [7-10]
Of course, the big data buzz could just be a bubble, or just hype: as some observers point out, automated analysis of large datasets is not new. So what is?
What is big data?
There is no single agreed definition of big data. For one, it is data generated through our increasing use of digital devices and web-supported tools and platforms in our daily lives.
In any given minute, hundreds of millions of individuals across the globe use some of the world's seven to eight billion mobile phones to make a call, send a text message or an email. Or they may wire money, buy a book, search online, pay for a meal with a credit card, update their Facebook status, send tweets, upload videos to YouTube, publish a blog post and so on.
Each of these actions leaves a digital trace. Added up, this digital information makes up the bulk of big data. Each year since 2012, well over 1.2 zettabytes (10^21 bytes) of data have been produced - enough to fill 80 billion 16GB iPhones, which, laid end to end, would circle the Earth more than 100 times (Table 1). And the volume of these data is growing fast. So volume, velocity and variety are the three 'Vs' that characterise big data, with the value that could be extracted from them often added as a fourth V.
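The orders of magnitude behind these comparisons can be sanity-checked in a few lines. The 16GB capacity comes from the text; the roughly 12.4cm handset length and the Earth's circumference are assumptions added here for illustration (and taking exactly 1.2 zettabytes gives 75 billion phones, consistent with the "well over" hedge):

```python
ZETTABYTE = 10 ** 21                     # bytes
data_per_year = 1.2 * ZETTABYTE          # the estimate quoted above
phone_capacity = 16 * 10 ** 9            # a 16GB phone, in bytes

phones_needed = data_per_year / phone_capacity
print(f"{phones_needed / 1e9:.0f} billion phones")   # → 75 billion

# Laid end to end, assuming roughly 12.4cm per handset:
EARTH_CIRCUMFERENCE_KM = 40_075
line_km = phones_needed * 0.124 / 1000
print(f"circles the Earth ~{line_km / EARTH_CIRCUMFERENCE_KM:.0f} times")
```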
And much as a population experiencing a sudden surge in fertility becomes both larger and younger, an ever-greater share of the world's data is recent: up to 90 per cent was created over just two years (2010-2012), according to one much-cited account.
Big data come in different types. One kind is small pieces of 'hard' data - numbers or facts, for example - described by Alex 'Sandy' Pentland, a professor at the Massachusetts Institute of Technology, United States, as "digital breadcrumbs". 
They are said to be 'structured' because they make up datasets of variables that can be easily tagged, categorised and organised (in columns and rows, for instance) for systematic analysis.
One example is Call Detail Records (CDRs) collected by mobile phone operators (Table 2). CDRs are metadata (data about data) that capture subscribers' use of their mobile phones - including an identification code and, at a minimum, the location of the phone tower that routed the call for both caller and receiver, plus the time and duration of the call. Large operators collect over six billion CDRs per day (Figure 2).
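Concretely, a single CDR can be thought of as a small structured record. The sketch below models one in Python; the field names and values are illustrative, not any operator's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallDetailRecord:
    """Metadata about one call - never its content."""
    caller_id: str         # pseudonymous subscriber code
    receiver_id: str
    caller_tower: str      # tower that routed the call for the caller
    receiver_tower: str    # ... and for the receiver
    start_time: datetime
    duration_s: int        # call duration in seconds

# One record: who called whom, via which towers, when, and for how long.
record = CallDetailRecord(
    caller_id="A93F", receiver_id="7C21",
    caller_tower="NBO-042", receiver_tower="NBO-117",
    start_time=datetime(2014, 3, 5, 14, 30), duration_s=125,
)
print(record.caller_tower, record.duration_s)
```

Because every call appends one such row, an operator's tables grow by billions of rows a day - structured, and therefore straightforward to query at scale.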
A second kind of big data comprises videos, documents, blog posts and other social media content. Most of these data are 'unstructured' - and so harder to analyse.
They differ from 'breadcrumbs' in that they are subject to their authors' editorial choices and, being subjective, may paint a misleading picture. For example, you might blog that you are boycotting a certain product, but your credit card statement may reveal a different preference based on actual purchases.
A third kind of big data is gathered remotely by digital sensors and reflects human actions. These might be 'smart meters' installed in homes to record electricity consumption, or satellite imagery that can pick up physical information such as vegetation cover as an indicator of deforestation. 
Some consider the universe of big data to be much wider - including, for instance, administrative records, price or weather data, or previously digitised books - which, taken collectively, may constitute a fourth kind.
But the bulk of big data is machine-readable, generated about and by people - some combination of the types mentioned above. These data were unavailable 10 years ago, before the age of Facebook or the explosion of mobile phone use - and they stem from powerful technological and societal changes.
Big data's main novelty is that they come from electronic sources and end up in databases whose primary purpose is not statistical inference.
In other words, they were not collected or sampled with the explicit intention of drawing conclusions from them. This also makes putting big data to use challenging.
So the term 'big data' may be a misnomer: size is not the defining feature. For example, an Excel spreadsheet of CDRs may be a small file, while the entire World Bank Development Indicators database is a big one - yet the latter results from fully controlled processes, including surveys and statistical imputations undertaken by official bodies.
The difference is primarily qualitative - it's in the kinds of information contained in the data and the way these are generated.
To add an extra layer of complexity, "Big Data is not about the data", as Harvard University professor Gary King puts it. 
It's about big data analytics, which broadly refers to improvements in computing power and analytical capacities - such as statistical machine-learning and algorithms that are able to look for and unveil patterns and trends in vast amounts of complex data.
This is the second feature of big data: the tools and methods, hardware and software now available to analyse digital data.
A third, less discussed but important property of big data is that it has become a 'movement'. And that movement is increasingly attracting multidisciplinary teams of social and computer scientists with a "mindset to turn mess into meaning", as data scientist Andreas Weigend puts it - in essence, defining big data as a movement to turn data into decisions.
Statements such as this have renewed interest in the prospects and promise of 'data-driven' or 'evidence-based' policymaking - although there are technical, technological, commercial and political implications that are far from trivial.
How exactly can big data - new kinds of data, new capacities to analyse it, with new intentions - affect societies? And what explains the buzz it has created?
The promise stems from two aspects: supply of ever-more data, and demand for better, faster and cheaper information - in other words there is both a push for and a pull towards big data.
People are frustrated with the current tools and systems available for decision-making. For instance, a good indicator of a region's poverty or underdevelopment is a lack of poverty or development data. 
Some countries (most of them with a recent history of conflict) haven't had a census in four decades or more. Their population size, structure and distribution are essentially anyone's guess.
Even where official figures exist, they are often based on incomplete data. Poor data also mean that some countries' official GDP figures get an overnight boost - of 40 per cent for Ghana in 2010 or 60 per cent for Nigeria in 2014 - when changes in the structure of their economies, such as the rise of the technology sector, are finally taken into account. [21-22]
This lack of reliable data lies behind the recent UN call for a 'Data Revolution'. The basic rationale is that, in the age of big data, economies should be steered by policymakers relying on better navigation instruments and indicators that let them design and implement more agile and better targeted policies and programmes.
Big data has even been said to hold the potential for national statistical systems in data-poor areas to 'leapfrog' ahead, much as many poor countries skipped the landline phase to jump straight into the mobile phone era. 
Supplying new knowledge
The appeal of potentially leaping ahead is also shaped by the 'supply side' of big data. There is early practical evidence and a growing body of work on big data's novel potential to understand and affect human populations and processes.
For example, big data has been used to track inflation online, estimate and predict changes in GDP in near real-time, monitor traffic or even a dengue outbreak. [23-26]
Monitoring social media data to analyse people's sentiments is opening new ways to measure welfare, while email and Twitter data could be used to study internal and international migration. [25,27] And an especially rich and growing academic literature is using CDRs to study migration patterns, socioeconomic levels and malaria spread, among others.
Guidance for analysing big data, published by UN Global Pulse, has focused on four fields: disaster response, public health, poverty and socioeconomic levels, and human mobility and transportation (See Box 1). 
Risks and challenges
Of course, big data's promise has been met with warnings about its perils. The risks, challenges and, more generally, the hard questions were articulated as early as 2011.
Perhaps the most severe risks - and the most urgent avenues for research and debate - are to individual rights: privacy, identity and security.
In addition to the obvious intrusion of surveillance activities and issues around their legality and legitimacy, there are important questions about 'data anonymization': what it means and its limits.
A study of movie rentals showed that even 'anonymised' data could be 'de-anonymised' - linked to a known individual by correlating the rental dates of as few as three movies with the dates of posts on an online movie platform.
Other research has found that CDRs recording location and time, even when stripped of any individual identifier, could be re-individualised: in that study, four data points were sufficient to uniquely single out 95 per cent of the individuals in the dataset.
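The intuition is easy to reproduce on synthetic data. The sketch below - with invented numbers of users, towers and trace lengths, not those of the study - measures how often a handful of location-time points matches exactly one user:

```python
import random

random.seed(7)
N_USERS, N_TOWERS, N_HOURS, TRACE_LEN = 500, 50, 24, 40

# Each user's 'trace' is a set of (tower, hour) points where they were observed.
traces = [{(random.randrange(N_TOWERS), random.randrange(N_HOURS))
           for _ in range(TRACE_LEN)} for _ in range(N_USERS)]

def unique_fraction(k: int) -> float:
    """Fraction of users pinned down uniquely by k points from their own trace."""
    hits = 0
    for trace in traces:
        points = set(random.sample(sorted(trace), min(k, len(trace))))
        matches = sum(points <= other for other in traces)
        hits += matches == 1
    return hits / N_USERS

for k in (1, 2, 4):
    print(k, round(unique_fraction(k), 2))
```

Even in this toy setting, a single point matches many users, while four points almost always single out exactly one - the pattern the CDR study found at national scale.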
Critics also point to the risks associated with basing decisions on biased data or dubious analyses (sometimes called threats to both external and internal validity). If policymakers come to believe that 'the data don't lie', such risks could be especially worrisome. Box 2 gives some examples.
Another risk is that analyses based on big data will focus too much on correlation and prediction - at the expense of cause, diagnostics or inference, without which policy is essentially blind.
A good example is 'predictive policing'. Since about 2010, police forces in some US and UK cities have crunched data to assess the likelihood of crime rising in certain areas, predicting increases from historical patterns.
Forces dispatch their resources accordingly, and this has reduced crime in most cases. But unless there is knowledge of why crime is rising, it is not possible to put in place preventive policy that tackles the root causes or contributing factors.
Yet another big risk that has not received the attention it merits is big data's potential to create a 'new digital divide' that may widen rather than close existing gaps in income and power worldwide. 
One of the 'three paradoxes' of big data is that because it requires analytical capacities and access to data that only a fraction of institutions, corporations and individuals have, the data revolution may disempower the very communities and countries it promises to serve.  People with the most data and capacities would be in the best position to exploit big data for economic advantage, even as they claim to use them to benefit others.
Lou del Bello talks to Sandy Pentland, director of the MIT human dynamics laboratory, about the potential benefits of conducting research using the world's digital "breadcrumbs". But Patrick Ball, executive director of the Human Rights Data Analysis Group, is sceptical about the accuracy of analyses using big data and emphasises their potential to misrepresent reality.
A related and basic challenge is that of putting the data to use. All discussions about the 'data revolution' assume that 'data matter'; that poor data are partly to blame for poor policies.
But history shows that a lack of data or information has played only a marginal role in the decisions leading to bad policies and poor outcomes. And a blind 'algorithmic' future may undercut the very processes that are meant to ensure that the way data are turned into decisions is subject to democratic oversight.
But since the growth in data production is highly unlikely to abate, the 'big data bubble' is similarly unlikely to burst in the near future.
The world can expect more papers and controversies about big data's potential and perils for development. The future of big data will likely be shaped by three main strands: academic research, legal and technical frameworks for the ethical use of data, and larger societal demands for greater accountability.
Research will continue to examine whether and how methodological and scientific frontiers can be pushed, especially in two areas: drawing stronger inferences, and measuring and correcting sample biases.
Policy debate will develop frameworks and standards - normative, legal and technical - for collecting, storing and sharing big data.
These developments fall under the umbrella term 'ethics of big data'. [37,38] Technical advances will help, for example by injecting 'noise' in datasets to make re-identification of the individuals represented in them more difficult. But a comprehensive approach to the ethics of big data would ideally encompass other humanistic considerations such as privacy and equality, and champion data literacy. 
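The noise-injection idea can be sketched with the Laplace mechanism used in differential privacy. This is a minimal illustration; the epsilon value and the count being released are invented:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample a Laplace(0, scale) variate by inverting its CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count perturbed by noise of scale 1/epsilon.

    A count has 'sensitivity' 1 (one person changes it by at most 1), so a
    smaller epsilon means more noise and stronger protection for any one
    individual represented in the data - at the cost of accuracy.
    """
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(noisy_count(1000, epsilon=0.5))   # the true value, give or take a few
```

Aggregates stay roughly right while any single person's presence in the data becomes much harder to infer - one technical route to the 'ethics of big data' goals described above.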
A third influence on the future of big data will be how it engages and evolves alongside the 'open data' movement and its underlying social drivers - where 'open data' refers to data that is easily accessible, machine-readable, available for free or at negligible cost, and with minimal limitations on its use, transformation and distribution (see Figure 4).
For the foreseeable future, the big data and open data movements will be the two main pillars of a larger 'data revolution'. Both rise against a background of increased public demand for more openness, agility, transparency and accountability for public data and actions.
The political overtones - so easily forgotten - are clear. And so a 'true' big data revolution should be one where data can be leveraged to change power structures and decision-making processes, not just create insights. 
Kaz Janowski talks to Philipp Schönrock, director of development think-tank CEPEI based in Colombia, about what's needed to make big data work for development.
The conversation explores the meaning of the 'data revolution' and data disaggregation, the role of traditional statistical offices and the private sector, the need for standards and a legal framework, and how to go about building trust in managing data.
Emmanuel Letouzé is a PhD Candidate at the University of California, Berkeley, a fellow at the Harvard Humanitarian Initiative, a visiting scholar at the MIT Media Lab and a research associate at the Overseas Development Institute. He can be contacted at email@example.com and on Twitter @Data4Dev.
Glossary
Algorithms (and algorithmic future): in mathematics and computer science, an algorithm is a series of predefined instructions or rules, written in a programming language, designed to tell a computer how to sequentially solve a recurrent problem through calculations and data processing.
The use of algorithms for decision-making has grown in several sectors and services such as policing and banking. This has led to hopes - and worries - about the advent of an 'algorithmic future' where algorithms may replace human functions, or even become an instrument for repression.
Big data: an umbrella term that, simply put, stands for one or more of three trends: the growing volume of digital data generated daily as a by-product of people's use of digital devices; the new technologies, tools and methods available to analyse large datasets that were not designed for analysis; and the intention to extract policymaking insights from these data and tools.
Call Detail Records (CDRs): the technical name for mobile phone data recorded by all telecom operators. CDRs contain information about the locations of those sending and receiving calls or text messages through operators' networks, as well as data on time and duration.
Data revolution: a common term in development discourse since the High-Level Panel of Eminent Persons on the Post-2015 Development Agenda called for a 'data revolution' to "strengthen data and statistics for accountability and decision-making purposes". It refers to a larger phenomenon than big data or the 'social data revolution' - defined as the shift in human communication patterns towards greater personal information sharing, and the implications of this.
Data scientist or data science: a professional or a field that focuses on solving real-world problems using large amounts of data by combining skills from often distinct areas of expertise: maths, computer science (for example, hacking and coding), statistics, social science and even storytelling or art.
(New) digital divide: the differential access and ability to use information and communications technologies between individuals, communities and countries - and the resulting socioeconomic and political inequalities. The skills and tools required to absorb and analyse the growing amounts of data produced by such technologies may lead to a 'new digital divide'.
False positives versus false negatives (or type I versus type II errors): a false positive or type I error refers to a prediction or conclusion that turns out to be false - for example, a fire alarm going off when there is no fire, or an experiment indicating a medical treatment has worked when it has not.
A false negative or type II error refers to cases when a study or a monitoring system fails to identify an event or effect that has occurred. Attempts to predict rare events, such as political revolutions, using increasingly rich data and powerful tools are expected to lead to more false positive than false negative results (also known as over-prediction).
Internal versus external validity: internal validity refers to the extent to which a causal relationship can be confidently established between two phenomena - a reduction in speed limit and a fall in road deaths, for example.
This requires all other factors that may affect the outcome and offer alternative explanations to be taken into account; in this case, this would include a change in drinking habits.
External validity refers to the extent to which a study's conclusions can be confidently generalised to other situations and people. In other words, whether they would hold beyond the area and time for which they were established.
Statistical machine learning: a subset of data science, falling at the intersection of traditional statistics and machine learning. Machine learning refers to the construction and study of computer algorithms - step-by-step procedures used for calculations and classification - that can 'learn' when exposed to new data.
This enables better predictions and decisions to be made based on what was experienced in the past, as with filtering spam emails, for example. The addition of "statistical" reflects the emphasis on statistical analysis and methodology, which is the main approach to modern machine learning.
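The spam example can be made concrete with a toy word-count classifier - a minimal Naive Bayes sketch in which the training messages are invented:

```python
from collections import Counter
import math

# Tiny, invented training set: (message, is_spam)
training = [
    ("win cash prize now", True),
    ("claim your free prize", True),
    ("meeting agenda for monday", False),
    ("monday lunch with the team", False),
]

# 'Learning' here is just counting which words appear in which class.
spam_words, ham_words = Counter(), Counter()
for text, is_spam in training:
    (spam_words if is_spam else ham_words).update(text.split())

def spam_score(text: str) -> float:
    """Log-odds that a message is spam, with add-one smoothing."""
    vocab = set(spam_words) | set(ham_words)
    score = 0.0
    for word in text.split():
        p_spam = (spam_words[word] + 1) / (sum(spam_words.values()) + len(vocab))
        p_ham = (ham_words[word] + 1) / (sum(ham_words.values()) + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

print(spam_score("free cash prize") > 0)   # flagged as spam
print(spam_score("monday meeting") > 0)    # not flagged
```

Real filters learn from millions of labelled messages, but the principle is the same: the classifier's predictions improve as it is exposed to more data.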
This article is part of the Spotlight on Data for development.
References
1. Andreas Weigend
2. The new data refineries: transforming big data into decisions. (Technology Services Industry Association blog, covering a talk by Andreas Weigend, 6 January 2014)
3. Shanta Devarajan. Africa's statistical tragedy. (World Bank blog, 6 October 2011)
4. Marcelo Giugale. Fix Africa's statistics. (The World Post, 18 December 2012)
5. Joseph Hellerstein. The commoditization of massive data analysis. (O'Reilly.com blog, 19 November 2008)
6. Data, data everywhere. Kenneth Cukier interviewed for The Economist (25 February 2010)
7. Emmanuel Letouzé. Big data for development: opportunities and challenges. (UN Global Pulse, May 2012)
8. Big data, big impact: new possibilities for international development. (World Economic Forum, 2012)
9. James Manyika and others. Big data: the next frontier for innovation, competition and productivity. (McKinsey Global Institute, May 2011)
10. Danah Boyd and Kate Crawford. Six provocations for Big Data. (A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011)
11. The physical size of big data. Infographic by Domo (14 May 2013)
12. Christopher Frank. Improving decision making in the world of Big Data. (Forbes, 25 March 2012)
13. Reinventing society in the wake of Big Data. A conversation with Alex 'Sandy' Pentland (Edge, 30 August 2012)
14. Eric Bouillet and others. Processing 6 billion CDRs/day: from research to production (experience report). Pages 264-267 in Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems (2012)
15. Social impact through satellite remote sensing: visualising acute and chronic crises beyond the visible spectrum. (UN Global Pulse, 28 November 2011)
16. Michael Horrigan. Big Data: a perspective from the BLS. (AMSTATNEWS, the magazine of the American Statistical Association, 1 January 2013)
17. Gary King. Big Data is not about the data! Presentation (Harvard University, 19 November 2013)
18. Sanjeev Sardana. Big Data: it's not a buzzword, it's a movement. (Forbes blog, 20 November 2013)
19. C. Melamed. Development data: how accurate are the figures? (The Guardian, 31 January 2014)
20. 2010 World Population and Housing Census Programme. (United Nations Statistics Division)
21. Laura Gray. How to boost GDP stats by 60%. (BBC News Magazine, 9 December 2012)
22. Nigeria's economy will soon overtake South Africa's. (The Economist, 21 January 2014)
23. The Billion Prices Project. (Massachusetts Institute of Technology)
24. Measuring economic sentiment. (The Economist, 18 July 2012)
25. Piet Daas and Mark van der Loo. Big Data (and official statistics). Working paper prepared for the Meeting on the Management of Statistical Information Systems (23-25 April 2013)
26. Rebecca Tave Gluskin and others. Evaluation of Internet-based dengue query data: Google Dengue Trends. (PLOS Neglected Tropical Diseases, 27 February 2014)
27. Emilio Zagheni and others. Inferring international and internal migration patterns from Twitter data. (World Wide Web Conference, 7-11 April 2014, Seoul, Korea)
28. New primer on mobile phone network data for development. (UN Global Pulse, 5 November 2013)
29. Joshua Blumenstock and others. Motives for mobile phone-based giving: evidence in the aftermath of natural disasters (30 December 2013)
30. Michael Wu. Big Data Reduction 3: from descriptive to prescriptive. (Science of Social blog, Lithium, 10 April 2013)
31. Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. Pages 111-125 in Proceedings of the 2008 IEEE Symposium on Security and Privacy (IEEE Computer Society, Washington, DC, 2008)
32. Yves-Alexandre de Montjoye and others. Unique in the Crowd: the privacy bounds of human mobility. (Nature Scientific Reports, 25 March 2013)
33. Erica Goode. Sending the police before there's a crime. (The New York Times, 15 August 2011)
34. It is getting easier to foresee wrongdoing and spot likely wrongdoers. (The Economist, 18 July 2013)
35. Kate Crawford. Think again: Big Data. Why the rise of machines isn't all it's cracked up to be. (Foreign Policy, 9 May 2013)
36. Neil M. Richards and Jonathan H. King. Three paradoxes of Big Data. (Stanford Law Review, 3 September 2013)
37. Neil M. Richards and Jonathan H. King. Big Data ethics. (Wake Forest Law Review, 23 January 2014)
38. Neil M. Richards and Jonathan H. King. Gigabytes gone wild. (Al Jazeera America, 2 March 2014)
39. Rahul Bhargava. Toward a concept of popular data. (MIT Center for Civic Media, 18 November 2013)
40. James Manyika and others. Open data: unlocking innovation and performance with liquid information. (McKinsey Global Institute, October 2013)
41. Emmanuel Letouzé. The Big Data revolution should be about knowledge security. (Post-2015.org, 1 April 2014)