Data readiness: ensuring a solid base for predictive operations and condition monitoring

Digital manufacturing is forecast to bring about massive changes to the industrial world. Today, regardless of how mature the digitization projects in each company are, the main motivation is staying competitive, boosting revenues, and enhancing profits, while reducing costs and remaining agile. There are several ways in which a company can engage in Industry 4.0, but process and operations areas in particular are relying more and more on sensor data to improve plant performance.

However, one of the main questions that arise when starting a predictive operations or condition monitoring project is whether the data collected from machines will actually show something valuable. While the implementation and operation of infrastructure for data collection is well understood, it is less commonly understood that the available data must also be suitable for predictive methods if those methods are to produce the expected results. This suitability is called data readiness.

Although there is no general answer as to when your data will be ready for analytics applications, a number of necessary criteria need to be met. The most prominent are information content, structure, correct handling of imperfections, and documentation.

This assessment is normally delivered as a data readiness report, in which a data scientist either evaluates these criteria as a one-time activity or integrates the evaluation as a first, automatic step in an analytics pipeline. As many organizations do not have a data science team, it becomes essential to select partners for digitization projects that have the capacity to support domain experts in assessing the condition of the available data.

Documentation

Without documentation, data cannot be interpreted. The basic documentation requirements concern the origin and meaning of the available data. It needs to be clear what the source is, such as a specific sensor on an asset or an issue tracking system for factory maintenance. It also needs to be clear what the variable (tag or column) relates to (temperature or error code, for example), whether it is numerical or categorical data, and what scales or levels are used. For numerical data, it is important to clarify which units it is stored in and, for measurement data, what the measurement accuracy is.
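
As an illustration, these requirements can be captured in a small machine-readable record per variable, which also makes it possible to check automatically what is still undocumented. The following is a minimal sketch, not a standard format: the class TagDoc, its field names, the helper documentation_gaps, and the two example tags are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TagDoc:
    """Documentation record for one variable (tag or column). Hypothetical layout."""
    name: str                          # tag or column name as stored
    source: str                        # origin, e.g. a sensor on an asset
    meaning: str                       # what the variable relates to
    kind: str                          # "numerical" or "categorical"
    unit: Optional[str] = None         # storage unit (numerical data)
    is_measurement: bool = True        # False for e.g. setpoints or counters
    accuracy: Optional[float] = None   # measurement accuracy, in `unit`
    levels: Tuple[str, ...] = ()       # allowed levels (categorical data)


def documentation_gaps(doc: TagDoc) -> list:
    """Return the names of documentation fields that are still missing."""
    gaps = []
    if not doc.source:
        gaps.append("source")
    if not doc.meaning:
        gaps.append("meaning")
    if doc.kind == "numerical":
        if doc.unit is None:
            gaps.append("unit")
        if doc.is_measurement and doc.accuracy is None:
            gaps.append("accuracy")
    elif doc.kind == "categorical":
        if not doc.levels:
            gaps.append("levels")
    else:
        gaps.append("kind")  # neither numerical nor categorical
    return gaps


# Hypothetical example entries
bearing_temp = TagDoc(
    name="TT-101", source="temperature sensor on pump P1 bearing",
    meaning="bearing temperature", kind="numerical",
    unit="degC", accuracy=0.5)
error_code = TagDoc(
    name="ERR", source="maintenance issue tracking system",
    meaning="error code", kind="categorical")

print(documentation_gaps(bearing_temp))  # []
print(documentation_gaps(error_code))    # ['levels']
```

A check like this only confirms that the fields are filled in; whether the documented unit or accuracy is actually correct still has to be confirmed by someone who knows the asset.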

Handling the imperfections

Imperfections include missing data, outliers, and erroneous sensor readings or user inputs, to name a few. The problem with imperfections is that one cannot be sure how valid the output is if the data is imperfect or outside the range of values the underlying algorithms can cope with. In a worst-case scenario, an extreme outlier may totally destroy the result of an analysis or, even worse, distort it while the output still looks reasonable, which in turn leads us to draw the wrong conclusions.
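
To make the effect of a single extreme value concrete, the sketch below compares the mean and the median of a short series and flags outliers with a modified z-score based on the median absolute deviation (MAD). This is one common convention, chosen here for illustration only; the readings, the function name flag_outliers, and the threshold of 3.5 are assumptions, and a real threshold should be agreed with domain experts.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    The score uses the median and the median absolute deviation (MAD),
    which a single extreme value barely affects.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - median) / mad > threshold for v in values]


# Hypothetical temperature readings with one erroneous value
values = [20.1, 19.8, 20.3, 20.0, 950.0]

print(round(statistics.mean(values), 2))  # 206.04, dominated by the outlier
print(statistics.median(values))          # 20.1, barely affected
print(flag_outliers(values))              # [False, False, False, False, True]
```

The mean of about 206 still looks like a plausible number on its own, which is exactly why such a distortion can go unnoticed.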

The handling of missing data is a question of its own: should it be discarded (possibly together with other, non-missing data in the same table row), should it be interpolated, or should it be filled with a default value? Even more important is knowing what missing data looks like: is it coded as “”, “NA”, or maybe -999? Often there is no straightforward answer to these questions without understanding the underlying algorithms.
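
The three options can be sketched in a few lines of Python. The set of missing-value codes below simply mirrors the examples in the text and is an assumption; in a real project it has to come from the documentation of each source. The function names and sample values are hypothetical.

```python
import math

# Assumed text codes for "missing"; the real set must come from the
# documentation of the data source.
MISSING_CODES = {"", "NA", "-999"}


def parse_reading(raw: str) -> float:
    """Convert one raw text field to a float, mapping missing codes to NaN."""
    text = raw.strip()
    if text in MISSING_CODES:
        return math.nan
    return float(text)


def drop_missing(values):
    """Option 1: discard missing values."""
    return [v for v in values if not math.isnan(v)]


def fill_default(values, default):
    """Option 2: replace missing values with a default."""
    return [default if math.isnan(v) else v for v in values]


def interpolate_linear(values):
    """Option 3: fill gaps linearly between the nearest valid neighbours.

    Leading and trailing gaps have no neighbour on one side and stay NaN.
    """
    result = list(values)
    valid = [i for i, v in enumerate(result) if not math.isnan(v)]
    for left, right in zip(valid, valid[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


raw = ["20.0", "NA", "-999", "23.0", ""]
readings = [parse_reading(r) for r in raw]

print(drop_missing(readings))        # [20.0, 23.0]
print(fill_default(readings, 0.0))   # [20.0, 0.0, 0.0, 23.0, 0.0]
print(interpolate_linear(readings))  # [20.0, 21.0, 22.0, 23.0, nan]
```

Note that if -999 had been read as an ordinary number instead of a missing code, it would have entered the analysis as an extreme outlier. The three results also differ considerably, so the choice has to match what the downstream algorithm expects.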

Trickier imperfections arise from broken or incorrectly installed sensors: while the data may pass all standard checks, the stored values may be totally off. Here, the help of domain experts is crucial to detect those deviations.

Data quantity and information content

Employing machine learning-based methods obviously requires data and, in general, the more precise the predictions or detections need to be, the more data is needed. But before assessing the mere quantity of data (the number of measurements required), it is important to analyze the potential information content. Only relevant data will help achieve good prediction or detection results. Adding more but irrelevant input will most probably degrade the performance of the analysis, as it acts as noise in relation to the relevant information.

In practice, this means that more is not always better, so it is important to understand which tags, variables, or measurements could carry relevant information for a specific objective. This can be done either by discussing with domain experts which data is actually strongly related to the characteristic that will be predicted or detected, or by using training data with labeled ground truth to see which selection of variables gives the best performance. Either way, the goal is to increase the relevant information content and decrease the noise.
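
The second approach can be illustrated with a deliberately small sketch: every subset of variables is scored by the leave-one-out accuracy of a 1-nearest-neighbour classifier on labeled data. The classifier, the function names, and the data are assumptions made for illustration; the toy data is constructed so that the effect is clearly visible, and any other model or performance measure could take its place.

```python
from itertools import combinations


def loo_accuracy(rows, labels, columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    Each sample is classified by the label of its closest other sample,
    using squared Euclidean distance over the selected `columns` only.
    """
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[c] - other[c]) ** 2 for c in columns)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def rank_subsets(rows, labels, names):
    """Score every non-empty subset of variables, best first."""
    scores = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            scores[subset] = loo_accuracy(rows, labels, subset)
    # Highest accuracy first; among equal scores prefer fewer variables.
    return sorted(scores.items(), key=lambda item: (-item[1], len(item[0])))


# Hypothetical labeled data: 0 = healthy, 1 = faulty.
# "vibration" separates the classes, "hall_temperature" is unrelated.
names = ("vibration", "hall_temperature")
vibration = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
hall_temperature = [5.0, 15.0, 25.0, 35.0, 5.5, 15.5, 25.5, 35.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [{"vibration": v, "hall_temperature": t}
        for v, t in zip(vibration, hall_temperature)]

for subset, accuracy in rank_subsets(rows, labels, names):
    print(subset, accuracy)
# ('vibration',) 1.0
# ('hall_temperature',) 0.0
# ('vibration', 'hall_temperature') 0.0
```

In this toy case, adding the irrelevant variable to the relevant one drops the accuracy from 1.0 to 0.0, because the unrelated variable dominates the distance. Testing all subsets is only feasible for a handful of variables, since their number grows exponentially; with many tags, a pre-selection by domain experts or a greedy search is the more practical route.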

If the information content of the currently collected data is insufficient, it may be necessary to add new sources, for example by installing more or better sensors on the asset that will be monitored.

Data structure and merging different sources

Another common task for ensuring data readiness is bringing the data into a suitable structure and merging information from different sources, for example, a condition monitoring system and a maintenance logging system. Both activities depend on the envisaged analytics objectives, the data available in the organization, the storage format, and the documentation of the data sources. Other problems may be organizational in nature, as not all parts of the company may have access to the relevant data. Here the data readiness report may even reveal the need for organizational changes.
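
As a sketch of such a merge, the example below attaches to every maintenance log entry the most recent sensor reading taken at or before the time of the entry. The data, the tuple layout, and the function name attach_last_reading are hypothetical, and the sketch assumes that both sources use the same clock and time zone, which in practice has to be verified from the documentation.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical condition monitoring readings (timestamp, vibration),
# sorted by timestamp.
readings = [
    (datetime(2020, 3, 1, 8, 0), 1.1),
    (datetime(2020, 3, 1, 9, 0), 1.3),
    (datetime(2020, 3, 1, 10, 0), 2.9),
]

# Hypothetical maintenance log entries (timestamp, note).
maintenance = [
    (datetime(2020, 3, 1, 7, 30), "inspection"),
    (datetime(2020, 3, 1, 10, 15), "bearing replaced"),
]


def attach_last_reading(events, readings):
    """Add to each event the last reading at or before the event time.

    `readings` must be sorted by timestamp. Events that precede the
    first reading get None instead of a value.
    """
    times = [t for t, _ in readings]
    merged = []
    for when, note in events:
        idx = bisect_right(times, when) - 1
        value = readings[idx][1] if idx >= 0 else None
        merged.append((when, note, value))
    return merged


for when, note, value in attach_last_reading(maintenance, readings):
    print(when.isoformat(), note, value)
# 2020-03-01T07:30:00 inspection None
# 2020-03-01T10:15:00 bearing replaced 2.9
```

The None in the first row is deliberate: a merge should make gaps between the sources visible rather than hide them, so that they can be handled like any other missing data.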

Data readiness is a crucial requirement for implementing successful predictive operations strategies. While many of us may be more excited by devising or exploring new algorithms, there is no shortcut that lets us progress from available data directly to the application of machine learning. Most of the activities related to data readiness are well established, and the methods have been widely used. Apart from making data ready, time spent getting to know the available data, especially with domain expert support, is time very well spent. It will not only improve the quality of the input but also help you attach meaning to the data and achieve the expected business results.

This article was written by Fredrik Wartenberg. He is a Data Scientist at Viking Analytics, a start-up from Sweden that offers self-service analytics software used by domain experts to prepare, analyze, and organize large volumes of sensor data without advanced data analytics skills.