Hi all – we hope this article finds you well.
As data scientists we aim for a few things:
- Identify a business pain point – known, or not yet known and surfaced by our deep dives into the data;
- Convert the specific business case into a Machine Learning one – supervised, unsupervised, or reinforcement learning;
- Match the task’s requirements against the available data;
- Build the strongest model possible within the available timeline, while respecting the accuracy vs. interpretability tradeoff of the task at hand;
- Integrate the solution into the current strategy to achieve the highest impact (check out our finance industry article).
In today’s world of Industry 4.0, Big Data and AI, we have a broad spectrum of tools to address the above. To name a few:
- Tableau, Qlik, PowerBI (clearly) and Shiny (undoubtedly) all have great R integrations, so you can run advanced analytics directly in your dashboard or take advantage of visualization libraries available from R – such as Plotly.
- Execute R code within Python (the rpy2 package), Python code within R (the reticulate package), call R scripts from SAS (use proc options option=RLANG to verify permissions), run SQL in SAS (proc sql – available for a long time now) and in R (the sqldf package), etc.
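To make the last point concrete: sqldf gives R users plain SQL over data frames, and Python’s standard library offers the same flavour through sqlite3. A minimal sketch (the table name and sample data below are made up for illustration):

```python
import sqlite3

# In-memory database standing in for a small data frame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 120.0), ("US", 340.0), ("EU", 80.0)],
)

# Aggregate with plain SQL, much like sqldf does over an R data frame
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 200.0), ('US', 340.0)]
```

The same mix-and-match spirit applies in the other direction: keep your heavy lifting in one language and borrow the best tool from another.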
To sum up, a lot has already been achieved at the intersection of data science and visual intelligence. Its methods, goals, and applications evolve with time and technology. More than 25 years ago, Data Science referred mostly to gathering and cleaning data sets and then applying statistical methods to that data. Now everything is different. Yet there is more to come!
An example of the bigger picture – and how data scientists interact with each of its elements – can be found in the diagram below:
By now, data scientists such as ourselves have mostly been involved in the green sections of the diagram above: building advanced analytics solutions and presenting them in a way that lets the business both obtain an explanation with the utmost level of interpretability and appreciate the added value of our work. The rest has been, and still is, a little out of scope:
- Data ingestion – we take the data as provided. We usually require the assistance of data engineers to ingest it; thereafter, with the help and availability of a data steward and a business analyst, we understand and become acquainted with the data itself;
- Model deployment – DevOps or software engineering experts commonly lend a helping hand here. We are all quite aware that this step is crucial: the analytical exercise can fail, and even incur losses, if our models are not properly placed within the live/production environment;
- Roles and access: Qlik, for example, allows you to set different access roles and dashboard views per user.
In brief, do you know what major players such as Amazon Web Services®, Microsoft Azure®, Oracle®, etc. are doing? They are aiming to connect all the dots. Have they succeeded? Well, our experience shows that progress in the field is rather significant:
Let’s review some of the achievements of AWS and its SageMaker* – a fully managed machine learning service that offers:
- Data ingestion, interoperability, self-service. Direct connection to the stored data – your S3 bucket;
- Tools. A readily available Jupyter notebook on your EC2 instance with built-in Python, Spark, etc.;
- Self-service. Off-the-shelf containers with pre-built ML capabilities, optimized for the AWS environment;
- Monitoring. Track resource utilization with Amazon CloudWatch and audit account activity with AWS CloudTrail;
- Version control, interoperability. Connect to version control tools such as GitHub;
- Interoperability, tools. Install R Server and Shiny Server on your EC2 instance, or simply add the R kernel within Jupyter (as simple as: “!conda install --yes --name JupyterSystemEnv --channel r r-essentials=1.7”);
- Roles and access level. Set access control via an IAM role;
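On the last point, the execution role you hand to SageMaker needs a trust policy allowing the service to assume it. A standard snippet (the permissions attached to the role itself are account-specific and omitted here):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```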
The picture below shows a functional example of the extended train, host and predict path of your models within SageMaker, as provided by the SageMaker Python SDK. Just a couple of lines of code cover the full cycle for you:
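The train–host–predict path can be sketched with the SageMaker Python SDK roughly as follows. This is an illustrative sketch, not runnable as-is: the role ARN, S3 paths, entry-point script and framework version are placeholders, and exact parameter names vary between SDK versions.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Train: SageMaker spins up the instance, runs your script, stores the model in S3
estimator = SKLearn(
    entry_point="train.py",            # your training script (placeholder)
    role=role,
    instance_type="ml.m5.large",
    framework_version="1.2-1",
)
estimator.fit({"train": "s3://your-bucket/train/"})

# Host: one call stands up a managed HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")

# Predict, then tear the endpoint down to stop paying for it
predictions = predictor.predict(sample_features)
predictor.delete_endpoint()
```

Note the last line: endpoints bill by the hour whether or not they receive traffic, which ties directly into the pricing caveats below.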
The experience with Microsoft Azure is similar. At this link you will find an article about creating machine learning models in Power BI that is worth reading.
Based on our recent research and projects, here is a list of key points we want to draw attention to:
- Even though marketed as plug & play, these service providers do have a learning curve, and your initial interaction may be a little frustrating – especially in the areas you are not used to taking part in (one such area could be the deployment of your model);
- Self-service is a good thing but cannot substitute for an experienced professional – it has limited capabilities and lacks flexibility, and you still have to know what you are doing (e.g. when to use one regression technique vs. another);
- Pricing: without the need to set up and maintain your own environment, a business may reduce costs significantly. Still, it is better to make precise calculations on the pricing of the service you are going to use: some offers look good, but when your data is indeed BIG (e.g. thousands of sensors reporting every second across many manufacturing units), you may be surprised when the bill comes knocking on your accountant’s door.
All things considered, we believe great progress is being made in the areas of data storage, machine learning, visualization and interoperability. Some of the solutions presented above will keep evolving, especially in terms of user experience and documentation, since some of them are still fresh and newly born.
Our advice? Stay curious and make sure you educate yourself unceasingly. Just like we do!
With this series our aim is to increase data science coverage and to make data-driven decisions an integral part of more companies around the Globe.