Guest Post: DJ Patil on why Data is still sexy

By now, I’ve published several guest posts (see here, here and here) on the future of data science. Here, DJ Patil describes interesting projects in data science, giving hope to those who’ve just read the guest post on why data science is dead. A Data Scientist in Residence at Greylock Partners, DJ Patil has held a variety of roles in academia, industry and government.

Three ways data science is changing the world

Data is still “sexy”. And I don’t see any reason why that would change. Don’t believe me? Just take a look at the statistics. A quick search for data jobs on LinkedIn yields over 61,000 results and Google Trends continues to show strong growth for Data Science and Big Data. The Insight Data Science Fellows Program continues to receive over 500 applicants for each class of 30 students, with a 100% placement rate in data careers afterwards.

At the same time, we’re having some very serious debates about the acceptable use of data. For example, when is it OK to collect data or metadata (which traces the patterns of the information gathered) about the citizens of a country? And if a person is identified by mistake, whether for something as trivial as a parking offence or as serious as a no-fly list, how do you challenge an algorithm and what’s the process for fixing the error? We’re also struggling to keep some of our most sensitive data, both at a personal and government level, secure.

Data is coming under a new level of scrutiny. That’s a good thing, and I’m looking forward to seeing how the broader public debates what is acceptable. For me personally, I like to focus on some of the incredible things people are doing with data to make the world a better place.

One of my favourite groups using data for good is Crisis Text Line (CTL), which sends support text messages to teenagers in distress. The organization was started by Nancy Lublin, a Young Global Leader, and inspired by the heart-breaking responses from teenagers to texting campaigns by America’s largest non-profit for young people. As a separate entity, CTL has a single goal: to help teenagers through the technology medium they are most comfortable with, text messaging. Once a teenager texts in, trained counsellors reply to everything from suicide attempts to self-harm to bullying.

Combining this new approach with data science, CTL built their entire system from the ground up with data in mind. This includes everything from actionable dashboards to data products that make counsellors both more efficient and more effective: models that predict when texting volume will be high, queues that match teenagers with the counsellors who are most effective, and a unique interface that lets counsellors work with multiple teenagers simultaneously. These approaches have enabled them to deliver more than a million text messages in the short time the service has been running. The impact, and the letters from parents of the teens the service has helped, are guaranteed to make you cry.
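The volume-prediction idea above can be sketched in a few lines. This is a purely hypothetical illustration with made-up numbers, not CTL’s actual system: flag the hours of day whose historical average message volume runs above the overall average, so extra counsellors can be scheduled for them.

```python
from collections import defaultdict

# (hour_of_day, message_count) observations -- invented sample data
observations = [(20, 120), (21, 150), (22, 140), (9, 40), (10, 35),
                (20, 130), (21, 160), (9, 45), (22, 155), (10, 30)]

# Group the counts by hour of day
counts_by_hour = defaultdict(list)
for hour, count in observations:
    counts_by_hour[hour].append(count)

# Average volume per hour, and the overall average across all observations
hourly_avg = {h: sum(c) / len(c) for h, c in counts_by_hour.items()}
overall_avg = sum(n for _, n in observations) / len(observations)

# Hours likely to need extra counsellors on shift
busy_hours = sorted(h for h, avg in hourly_avg.items() if avg > overall_avg)
```

On this toy data the evening hours come out as the busy ones; a real system would of course use far richer features (day of week, school terms, news events) and a proper forecasting model.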

Another example of great use of data is DataKind. Led by Jake Porway and Craig Barowsky, it rallies data scientists from disparate places to help non-profits with some of their most pressing data challenges. They do this through a combination of “data dives” (think of these as data hackathons, open to everyone around the world) and their data corps team – in their words, “an elite group of data scientists dedicated to using data in the service of humanity.” These leading experts spend three to six months working pro bono. Their projects include getting better data about food pricing and consumption to help inform monetary policy and thwart a food crisis in Kenya, and figuring out which trees New York State should prune to stop them causing damage in a storm.

This kind of thing isn’t just restricted to the non-profit space. Companies like Jawbone (founded by another Young Global Leader, Hosain Rahman) are using data in innovative ways to help improve your health. Their data has already yielded interesting trends on sleeping patterns. And this is just the beginning, as they start to apply their insights to help personalize advice to improve your health.

Governments are getting involved, too. Code for America has had a massive impact in bringing modern technological and data science approaches to critical services provided at the city level. Each year, data specialists are paired up with a city in need. Their work really shows the merits of their approaches and the impact a little data science can have. For example, in San Francisco the team has focused on helping those needing food assistance. Their approach doesn’t just focus on bureaucratic data, but on making sure people get the help they need. At the federal level, Todd Park has been leading a similar change through the Presidential Innovation Fellows. Now on their third set of fellows, the results have been fantastic.

These data projects give me confidence that the data revolution has only just started. The people who are driving these programmes do so because they’re passionate about both data and the problems they want to solve. In the few short years these programmes and projects have been active, we’ve seen remarkable results, and I expect the impact will continue to increase. The next couple of years are going to be a great test for how comfortable we want to be with data. It is essential that we define acceptable use of data and find ways to safeguard our personal information. In doing so, we must be careful that we don’t cut off the innovation and the opportunity for data to improve lives for those who need it most.

Full disclosure: I spend as much free time as I can helping DataKind, the Insight Data Science Fellows Program, Code for America and Crisis Text Line, and I’m damn proud of it.

Originally Published: March 11th


With permission from DJ Patil, original post can be found here on the World Economic Forum.


Guest Post: Ryan Swanstrom on Stats and Data

I’ve already posted two arguments on data science and whether it’s worth going into (see here and here). Here, Ryan Swanstrom adds his two cents on the difference between what a statistician does and what a data scientist does.

Data Science is more than just Statistics

I occasionally get comments and emails similar to the following question:

Should I attend a graduate program in data science or statistics?

I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.

Anyhow, this post is not intended to argue the merits of the term data science. It is about how statistics and data science compare. They are not the same thing; the approach to problems differs from the very beginning.


Statistics

This is the common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data is collected via a very structured and well-defined experiment. The experiment is run, and the hypothesis is validated or invalidated.

Data Science

On the other hand, the data science approach is slightly different. The data has already been collected, or is currently being collected: what can be predicted from it? How can existing data be used to help sell products, increase engagement, reach more people, and so on?


Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned with collecting data (because it usually already exists) and more concerned with what the outcome will be. Data science wants to predict that outcome.
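The contrast can be made concrete with a toy example (all numbers made up for illustration): the statistician starts from a hypothesis about two groups and designs a test of it, while the data scientist starts from data that already exists and fits a model to predict an unseen outcome.

```python
import statistics

# -- Statistics: start from a hypothesis, then test it --
# Hypothesis: group B's mean differs from group A's (invented measurements).
group_a = [4.1, 3.9, 4.3, 4.0, 4.2]
group_b = [4.8, 5.1, 4.9, 5.0, 5.2]
diff = statistics.mean(group_b) - statistics.mean(group_a)
# In practice you would run a significance test on this difference.

# -- Data science: start from existing data, then predict --
# Given past (hours_on_site, purchases) pairs, predict purchases for an
# unseen visitor via simple least-squares regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
predicted = slope * 6.0 + intercept  # prediction for a new input, x = 6
```

The first half asks “is there a real difference, and why?”; the second asks only “what will the next value be?” — which is exactly the distinction drawn above.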

Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.


What are your thoughts? Agree/Disagree?


With permission from Ryan Swanstrom, original post can be found on his blog Data Science 101.

Critical Analysis: Maternal Mortality

Maternal mortality down 45% globally, but 33 women an hour are still dying

By Leila Haddou, The Guardian, 7th May 2014

Screencap of Guardian article

Summary of the article

A report recently published by the World Health Organisation shows that deaths from preventable causes related to pregnancy and childbirth have dropped by 45% since 1990. However, 800 women still die every day from complications, 99% of them in developing countries. The key challenge is suggested to be the lack of accurate data: the number of women dying, and the reasons behind the deaths, often remain unrecorded and unreported. Improving data collection worldwide remains a priority, according to health groups.
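The headline figure is easy to verify from the numbers in the article: 800 deaths a day works out to roughly 33 an hour.

```python
# Sanity-checking the headline against the article's own figure
deaths_per_day = 800
deaths_per_hour = deaths_per_day / 24
print(round(deaths_per_hour))  # prints 33
```

This is the kind of one-line contextualising check that the review below argues the article could have done more of.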

Summary of the review

The article covered a relevant topic and was well written. There was an abundance of sources, and the graphics illustrated the issue well. However, the data was not contextualised: the article did not explain what the numbers actually meant, i.e. how large a hundred thousand, or a thousand, is in comparison to total population numbers.

Total: 9.5/10


1. Is the article factually accurate?

Yes – all the data comes from the World Health Organisation (WHO) report and was used correctly.

Score: 1

2. Is there a conflict of interest?

No – as far as the past employment of the Guardian’s trainee journalist Leila Haddou shows, she has not worked for any organisation that would create a conflict of interest. Her gender is not an issue when writing objectively on a topic that affects women. The World Health Organisation, which issued the report, is the directing and coordinating authority for health within the United Nations system; it is responsible for shaping the health research agenda and for monitoring and assessing health trends, so there is no conflict of interest there either.

Score: 1

3. Is it newsworthy?

Yes – while complications during pregnancy and deaths during childbirth are much more prevalent in developing countries, this is an issue that could potentially affect half the world’s population, making it of interest to the readers.

Score: 1

4. Are the sources authoritative?

The source of the report is the World Health Organisation, an international organisation with a reputation for solid data. The report was published in The Lancet, a well-established journal. The article also quotes high-ranking officials at UNICEF and the WHO, increasing the authoritativeness of the information.

Score: 1

5. Have they used two or more sources?

Yes – the article quotes two independent experts: Dr Geeta Rao Gupta, Deputy Executive Director of UNICEF, and Tim Evans, Director of Health, Nutrition and Population for the World Bank Group. Dr Marleen Temmerman, Director of Reproductive Health and Research at the WHO, is also quoted on the report.

Score: 1

6. Is the headline misleading?

No – the headline summarises the subject of the article well.

Score: 1

7. Is the article sensational or scaremongering?

No – the article clearly states the large drop in deaths in the first sentence. Some of the numbers are shocking, such as 33 deaths an hour, but the calculation is sound and the writer is not hyping the rest of the statistics.

Score: 1

8. Does it explain concepts properly?

There were not many concepts to be explained in this article. When the writer used the term ‘global Maternal Mortality Rate (MMR)’, it was immediately explained. The article did not explain HIV, malaria or AIDS, but these can be assumed to be terms understood by the readers.

Score: 1

9. Is the article well written and engaging?

The article is clearly written but it is not very engaging. It reads more like a general summary of the report with quotes from experts. While the information is important enough to be read even without being very engaging, it would have helped to contextualise the data a bit more. After a few sentences, the large numbers tend to lose their meaning.

Score: 0.5

10. Do the graphics help illustrate the story?

The graphs and tables were clear, simple and easy to take in. Another possibility would have been a map with shades of red to indicate the severity of the problem in a global context, but the bar charts also worked well and gave a more accurate image of the numbers.

Score: 1




Guest Post: Miko Matsumura on why Data Science is DEAD

This guest post provides an opposing opinion to John Foreman’s (see previous post). Miko Matsumura is a VP at Hazelcast who does not have a high opinion of data scientists in general. He writes: “You [data scientist] will be replaced by a placid and friendly automaton.”

So what do you think? Are Data Scientists turning into poor substitutes for future software? Or will they constantly remain ahead of it in their ability to cohesively combine data, science, business and a host of other fields?

Data Science is Dead


Fun fact: nothing on this blackboard makes any sense.


Science creates knowledge via controlled experiments; a data query isn’t an experiment. An experiment suggests controlled conditions, but data scientists stare at data that someone else collected, which includes any and all sample biases.

Now, before you drag out the pitchforks: I’m not a query hater. You won’t see me standing outside the Oracle Open World conference with a sign that says “NO SQL” on it. Queries are fine. Smart people don’t always have the right answer, but they need to ask the right questions. Yes, building a query is like “forming a hypothesis,” but at that point we enter the realm of observational or “soft” science. Yes, by this standard, Astronomy and Social Sciences are also not sciences. I have no idea what Computer Science is, but no, it’s not a science either.

Oh, what’s that? Your kind of “Data Science” includes things such as A/B testing, and your “experiments” actually involve executing designs that affect the world? Allow me to retort: that’s not Data Science, that’s actually doing a job. You might have a job title in Product Management or Marketing. But if your job title is “Data Scientist,” you are effectively removing yourself from the actual creation of data.

I do sympathize. I appreciate that it’s no longer sexy to be a Database Administrator, and I guess the term “Business Analyst” is a bit too 1980s. Slapping “Data Warehousing” on a resume is probably not going to land you a job, and it’s way down there with “Systems Analyst” on the cool-factor scale. If you’re going to make up a cool-sounding job title for yourself, “Data Scientist” seems to fit the bill. You can go buy a lab coat from a medical-supply surplus store and maybe some thick glasses from a costume shop. And it works! When you put “Data Scientist” on your LinkedIn profile, recruiters perk up, don’t they? Go to the Strata conference and look on the jobs board—every company wants to hire Data Scientists.

OK, so we want to be “Data Scientists” when we grow up, right? Wrong. Not only is Data Science not a science, it’s not even a good job prospect. In the immortal words of Admiral Ackbar: “It’s a trap.”

These companies expect data scientists to (from a real job posting): “develop and investigate hypotheses, structure experiments, and build mathematical models to identify… optimization points.” Those scientists will help build “a unique technology platform dedicated to… operation and real-time optimization.”

Well, that sounds like a reasonable—albeit buzzword-filled—job description, no? There is going to be a ton of data in the future, certainly. And interpreting that data will determine the fate of many a business empire. And those empires will need people who can formulate key questions, in order to help surface the insights needed to manage the daily chaos. Unfortunately, the winners who will be doing this kind of work will have job titles like CEO or CMO or Founder, not “Data Scientist.” Mark my words, after the “Big Data” buzz cools a bit it will be clear to everyone that “Data Science” is dead and the job function of “Data Scientist” will have jumped the shark.

Yes, more and more companies are hoarding every single piece of data that flows through their infrastructure. As Google Chairman Eric Schmidt famously pointed out, we now create as much data every two days as humanity did from the dawn of civilization up to 2003.

Unfortunately, unless this is structured data, you will be subjected to the data equivalent of dumpster diving. Surfacing insight from a rotting pile of enterprise data is a ghastly process—at best. Sure, you might find the data equivalent of a flat-screen television, but you’ll need to clean off the rotting banana peels, and only if you’re lucky can you take it home and find that, oh man, it works! Despite that unappetizing prospect, companies continue to burn millions of dollars to collect and gamely pick through the data under their respective roofs. What’s the time-to-value of the average “Big Data” project? How about “never”?

If the data does happen to be structured data, you will probably be given a job title like Database Administrator, or Data Warehouse Analyst.

When it comes to sorting data, true salvation may lie in automation and other next-generation processes, such as machine learning and evolutionary algorithms; converging transactional and analytic systems also looks promising, because those methods deliver real-time analytic insight while it’s still actionable (the longer data sits in your store, the less interesting it becomes). These systems will require a lot of new architecture, but they will eventually produce actionable results—you can’t say the same of “data dumpster diving.” That doesn’t give “Data Scientists” much job security: like workers in many other industries, you will be replaced by a placid and friendly automaton.

So go ahead: put “Data Scientist” on your resume. It may get you additional calls from recruiters, and maybe even a spiffy new job, where you’ll be the King or Queen of a rotting whale-carcass of data. And when you talk to Master Data Management and Data Integration vendors about ways to, er, dispose of that corpse, you’ll realize that the “Big Data” vendors have filled your executives’ heads with sky-high expectations (and filled their inboxes with invoices worth significant amounts of money). Don’t be the data scientist tasked with the crime-scene cleanup of most companies’ “Big Data”—be the developer, programmer, or entrepreneur who can think, code, and create the future.


With permission from Miko Matsumura, original post can be accessed here on Dice.

Guest Post: John Foreman giving hope for Data Scientists

John Foreman is the chief data scientist at MailChimp and has done a lot of analytics work for large companies. He argues that a skilled data scientist’s work will cost more than $30 per hour.


The $30/hr Data Scientist

Yesterday a journalist asked me to comment on Vincent Granville’s post about the $30/hr data scientist for hire on Elance. What started as a quick reply in an email spiraled a bit, so I figured I’d post the entire reply here to get your thoughts in the comments.

When we ask the question, “Can someone do what a data scientist does for $30/hr?” we first need to answer the question, “What does a data scientist do?” And there are a multitude of answers to that question.


If by data scientist we mean “a person who can perform a data summary, aggregation or modeling task that has been well-defined for them in advance”, then it is by no means a surprise that there are folks who can do this at a $30/hr price point. Indeed, there’ll probably come a day when that task can be completed for free by software, without the freelancer. This is similar to the evolution of web development freelancing.

The key phrase, though, is “task that has been well-defined.”

The types of data scientists who command large salaries fit two definitions that are very different from what a freelancer at $30/hr can meet:

1) There’s the highly technical engineer: someone who is knowledgeable and skilled enough to select the correct tools and infrastructure in the polluted big-data landscape to solve a specific, highly technical data problem. Often these folks are working on problems that haven’t been solved before or, if they have, there are only a few poorly documented examples. Because these tasks might not even be solvable, they’re certainly not “well-defined.” A business wouldn’t trust important bits of infrastructure to $30/hr.

2) There’s the data scientist as communicator/translator. This person is someone who knows data science techniques intimately but whose strength is actually in the nontechnical — this person thrives on taking an ambiguous business situation and distilling it into a data science solution. Often managers and executives don’t know what’s possible. They know what problems they have, but they don’t know how or even if data science can solve those problems. These folks can’t hire someone halfway across the globe at $30/hr to figure that out for them. No, they need someone who’s deeply technical but also deeply personable in the office to talk things through with them and guide them.

All of the hype around data science is generating a lot of these articles about automating or replacing the role. But I think it’s important to realize that, just like “doctor,” “lawyer,” “consultant,” or “developer,” the “data scientist” is more of a spectrum or category than a single role.

A data scientist is not someone putting doors on an automobile in a factory. Some of them might be doing just that, i.e. rote modeling tasks, but not all of them. I believe that MOOCs will excel at training up an army of these lower-paid data scientists. And that’s great; they’ll fill a need, kinda like the need in the 90s for people with basic CompTIA certifications and the most basic of Cisco certs.

However, there will always be a place for those who excel at solving ambiguous technological & business problems. And they’ll cost more than $30/hr.


With permission from John Foreman, original post can be found here on his blog.


Big Data Will Change Our Lives

Here’s an infographic by Ryan Swanstrom on data in our lives.

Infographic: Big Data (from Data Science 101)

How-to…write an FOI request

Around 120,000 Freedom of Information (FOI) requests are made each year, and not only by journalists. An FOI request is one of the most useful tools for a data journalist: it allows you to ask any public sector organisation for recorded information on any subject. Regardless of your location, nationality or age, you can make a request, and the organisation will have to answer within 20 working days. Here, I explain how to do this effectively.

Step no. 1: Decide who to request the information from

Publicly funded organisations include government departments, local councils, universities, some museums and even the police. See the full list here.

Remember, before you make the request, search through the FOI answers already posted online (here). Government departments’ FOI answers can be searched here.

Step no. 2: Write the FOI request

You can contact the organisation by email or by letter, but requests can also be made through social media or verbally.

Make sure to include your name, contact information and a description of the information you want. Be polite and very detailed: the more you can tell them about what you want, the better they’ll be able to help you.

It can be useful to call the organisation before you send the FOI request and speak to an FOI officer. This way you can ensure they have the information you want and they can help you shape your request in the best way.

Ask them to give you a reference number; it’s easier to communicate with them about your request this way.

There is a limit to the number of hours an FOI officer will spend compiling information for your request. This translates into a cost cap of £450 (£600 for a central government request), and they are not required to answer your request if the cost would go higher than that. They might instead ask you to make your request more specific.

Step no. 3: The reply

Remember that there are humans at the other end using their valuable time to compile the information, so make sure you know what you’re asking for and why. Don’t waste their time on useless or already-available information.

Some sensitive information is not available to the general public; the organisation will tell you when information is withheld. The Data Protection Act also prevents you from receiving information that could be used to identify a specific person’s private records.

An example:

Here’s a basic example of what an FOI request can look like:

Dear Freedom of Information Officer,

I would like to request, under the Freedom of Information Act, records of….

By this, I specify:

I would like to receive all the records for the previous year (2013). If the amount of information exceeds the cost limit, please provide records for the last six months (1 July 2013 – 31 December 2013) only.

Please provide a full copy of any databases/spreadsheets from which information was extracted in response to this request. In the event that these records contain columns that hold personal information covered by data protection provisions, please simply delete the offending columns and send the rest of the data without them. I would like this data in a delimited text format, as is standard for most database software. I would like to receive all correspondence and information electronically.

If this information is held by another public body, please inform me of this and, if possible, transfer the request to that public body.

If you need any clarification on this request, please email or call me at the earliest opportunity.

Yours faithfully,

Safya Khan-Ruf