What does it mean to have the sexiest job of the 21st century?

Despite what you might think, there are no quick, simple, off-the-shelf data solutions. Data science is a creative process involving a lot of trial and error. Data analytics goes beyond technical skills: human, business and research aspects must also be taken into account.

Data hype is everywhere and (big) data success stories, seminars and training courses are emerging. Everyone is looking for that one multi-talented data scientist with unique skills who can solve every complex data problem in the blink of an eye. In their quest, many seem to underestimate what is really needed to solve such problems.

Data science is about solving real-world problems

A data scientist is not necessarily an expert in the installation of data tools. Installing and configuring platforms, libraries and tools is merely a prerequisite for data innovation. Data science is about solving real-world problems, not just about applying algorithms. Even the most intelligent algorithm will be useless without some prior understanding of the application domain and the right data to analyse. What you actually need first is business understanding: the process of understanding and extracting project objectives and requirements from a business perspective. You must also be able to convert these objectives into a data science solution.

Gaining insight

Understanding data means exploring data to gain insight into the phenomenon under study. This involves making assumptions about the underlying mechanisms and processes in order to generate hypotheses.

The change that data science can effect is sometimes found in the smallest things. A good example is the strategy that parcel delivery company UPS has successfully applied: UPS optimised its delivery routes with algorithms that have drivers avoid left turns. This resulted in the company saving 38 million litres of fuel per year, which in turn means a 20,000-tonne reduction in CO2 emissions.

Analytical skills through experience

Data science requires analytical skills. Mastering algorithms is not enough. The ability to abstract and conceptualise algorithmic solutions is a condition for being able to generalise them to wider application contexts.

As an example, in February 2013, the scientific journal Nature reported that GFT (Google Flu Trends) was predicting more than double the number of doctor's visits for flu-like illnesses reported by the Centers for Disease Control and Prevention (CDC), the US government organisation responsible for tracking, treating and preventing diseases. Google had selected the best matches among 50 million search terms to fit 1,152 data points. With so many candidate terms and so few data points, there is a high probability of finding search terms that match the flu data but are structurally unrelated to it and therefore do not predict the future. See also this article.

This illustrates that a large quantity of data does not mean that foundational issues of measurement and valid data construction can be ignored. It also shows that arrogance and ignorance when combining different data types can have problematic consequences.
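
The effect can be reproduced on purely synthetic data. The sketch below is an illustration only and is not Google's method: it draws many random noise series, keeps the one that correlates best with a random "history", and then checks that same series against a "future" that was not used for the selection. The function names `pearson` and `best_spurious_match`, and all parameter values, are assumptions made for this example.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def best_spurious_match(n_points=30, n_candidates=2000, seed=0):
    """Pick the noise series that best matches a noise target.

    Returns (correlation on the history used for selection,
             correlation of the same series on unseen future data).
    All series are independent random noise, so any match is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(2 * n_points)]
    history, future = target[:n_points], target[n_points:]

    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(2 * n_points)]
        r = pearson(candidate[:n_points], history)
        if abs(r) > abs(best_r):
            best_r, best_series = r, candidate

    future_r = pearson(best_series[n_points:], future)
    return best_r, future_r
```

Calling `best_spurious_match()` typically returns a strong correlation on the history and a much weaker one on the future, even though no series carries any real signal. The more candidates are searched relative to the number of data points, the more convincing the spurious match looks.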

Data requirements

Data requirements play a key role. Often a large amount of historical (time series) data is available, obtained from sensor measurements of various operational parameters, but the essential metadata is missing: for example, which periods were healthy operational periods in which no anomalies or faults occurred.

On 1 February 2003, Space Shuttle Columbia disintegrated upon re-entering Earth's atmosphere. In the wake of the disaster, NASA published a technical report in 2004 on a new health monitoring technique: Inductive System Health Monitoring (ISHM). ISHM was trained using data from previous successful flights, making it possible to characterise typical system behaviour. The aim was to test whether the technique could detect anomalies in data from the final flight.
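
The general idea of learning typical behaviour from healthy data and then flagging departures from it can be sketched in a few lines. The code below is a heavily simplified stand-in and does not reproduce the technique from the NASA report: it learns a single min/max envelope per sensor from nominal samples and measures how far a new sample falls outside it. The class name `NominalEnvelope` and the example sensor values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NominalEnvelope:
    """Per-sensor value range observed during healthy operation."""
    lows: List[float]
    highs: List[float]

    @classmethod
    def fit(cls, nominal: Sequence[Sequence[float]]) -> "NominalEnvelope":
        """Learn the envelope from rows of nominal sensor readings."""
        columns = list(zip(*nominal))
        return cls([min(c) for c in columns], [max(c) for c in columns])

    def deviation(self, sample: Sequence[float]) -> float:
        """Euclidean distance from the sample to the envelope (0.0 if inside)."""
        total = 0.0
        for value, low, high in zip(sample, self.lows, self.highs):
            if value < low:
                total += (low - value) ** 2
            elif value > high:
                total += (value - high) ** 2
        return total ** 0.5


# Hypothetical readings (temperature, pressure) from healthy periods only.
healthy = [[20.0, 1.0], [22.0, 1.5], [21.0, 1.25]]
envelope = NominalEnvelope.fit(healthy)
inside_score = envelope.deviation([21.0, 1.25])   # within the learned range
outside_score = envelope.deviation([25.0, 5.5])   # both sensors out of range
```

A sample inside the learned range scores zero, and the score grows as readings move away from anything seen during healthy operation. This only works if the training data really is healthy, which is why metadata that labels the healthy periods matters so much.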

ISHM can be deployed more broadly, for example to monitor the operational behaviour of industrial installations. Here, we are thinking in particular of applications which are hard to model (simulate) with a computer or which require computer models that are too complex for real-time monitoring. One example is a wind farm.

Skills acquired through experience

A data scientist should rarely, if ever, despair. There are no off-the-shelf solutions for complex industrial challenges that go beyond simple aggregation and statistics. Consequently, data innovation requires a professional, hand-crafted and iterative approach that involves a substantial amount of black art. Naturally, this skill is hard to find in textbooks and can only be acquired through experience.

In a data innovation project, little time is spent on standard machine learning. Most time is spent on gathering, integrating, cleaning and preprocessing data. There is also a lot of trial and error in selecting the algorithms and repeating 'model > evaluate > refine' until the result is satisfactory.
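
The 'model > evaluate > refine' cycle can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not a recommended workflow: the model is a toy k-nearest-neighbour regressor on one-dimensional data, the evaluation metric is mean absolute error on held-out data, and "refine" simply means trying the next candidate value of k. The function names and the stopping threshold `target_error` are hypothetical.

```python
from statistics import fmean


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return fmean(y for _, y in nearest)


def evaluate(train, valid, k):
    """Mean absolute error of the k-NN model on held-out validation data."""
    return fmean(abs(knn_predict(train, x, k) - y) for x, y in valid)


def refine_until_satisfied(train, valid, candidates, target_error):
    """Try candidate settings in turn and stop once the error is acceptable.

    Returns (best k, its validation error, list of all (k, error) tried).
    """
    history = []
    best_k, best_err = None, float("inf")
    for k in candidates:                    # model
        err = evaluate(train, valid, k)     # evaluate
        history.append((k, err))
        if err < best_err:                  # refine: keep the best so far
            best_k, best_err = k, err
        if best_err <= target_error:        # satisfactory: stop iterating
            break
    return best_k, best_err, history
```

In practice each pass through the loop usually sends you back to the data as well, for example to clean or add features, which is where most of the project time goes.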

White raven: a team

Data alone does not provide answers to everything. People will always remain an important factor in the data innovation process. Not everything can be automated and common sense is needed to check and validate results. Data and algorithms are only as usable as the human decisions that accompany them.

Data innovation requires a unique combination of knowledge and skills, such as domain knowledge, creativity, analytical thinking, statistics and self-learning systems, programming and visualisation. It also requires a complementary team of people, because all of these skills and knowledge are rarely found in a single person.

At TechBoost, organised by the UGent Alumni Association on 27 April, Elena Tsiporkova, who leads the EluciDATA Innovation Lab at Sirris, explained what it means to have the sexiest job of the 21st century. Her presentation showed how and why 'big data' and 'data scientist' are so much more than over-hyped concepts.

A photo report (http://www.flickr.com/photos/techboost2017/) and video report (https://www.youtube.com/watch?v=zbEHFO9JaCA) of the full event are available.