What is data science?

Data science is a field that aims to turn data into understanding, predictions, and decision support by combining statistics, machine learning, programming, and business understanding.

 

In practice, data science covers work such as:

• analyzing phenomena and cause-and-effect relationships
• building predictive models (e.g. churn, demand, risk)
• classification and segmentation (e.g. customer groups)
• optimization (pricing, resources)
• designing and interpreting experiments (A/B tests)
• taking models to production together with data engineering and software teams

 

 

Tools for data science

 

 

1) Programming languages

• Python
• R
• SQL
• Julia
• Scala (especially with Spark)
• MATLAB (common in certain fields)
• Java (less common, but used for production models)
• C++ (in special cases)

 

 

2) Core data analysis libraries

Python

• pandas
• NumPy
• SciPy
• Polars
• PyArrow
• statsmodels

R

• tidyverse (dplyr, ggplot2, tidyr)
• data.table
• caret
• forecast

 

 

3) Visualization and exploratory analysis

• matplotlib
• seaborn
• plotly
• bokeh
• altair
• ggplot2
• Power BI
• Tableau
• Looker
• Superset
• Metabase

 

 

4) Notebook environments and workspaces

• Jupyter Notebook
• JupyterLab
• Google Colab
• Databricks Notebooks
• Kaggle Notebooks
• VS Code (Python + notebooks)
• RStudio

 

 

5) Machine learning (general ML)

• scikit-learn
• XGBoost
• LightGBM
• CatBoost
• H2O.ai
• Spark MLlib
• fastai

 

 

6) Deep learning

• PyTorch
• TensorFlow
• Keras
• JAX
• Flax
• Hugging Face Transformers
• ONNX
• OpenVINO (optimization / inference)

 

 

7) NLP (text analytics)

• spaCy
• NLTK
• Gensim
• Hugging Face
• SentenceTransformers
• BERTopic
• LangChain (LLM-based applications)
• LlamaIndex

 

 

8) Computer vision

• OpenCV
• torchvision
• TensorFlow Vision
• MMDetection
• YOLO (Ultralytics)
• Detectron2
• Albumentations

 

 

9) Time series analysis and forecasting

• Prophet
• statsmodels
• pmdarima
• sktime
• GluonTS
• Darts
• Nixtla (statsforecast, neuralforecast)
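
To give a feel for what the forecasting libraries above automate, here is a minimal sketch of simple exponential smoothing written with the Python standard library only. It is an illustration, not the implementation used by any of the listed tools; the function name, the `demand` values, and the choice of `alpha` are assumptions made for the example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return a one-step-ahead forecast for `series`.

    The level is updated as: level = alpha * value + (1 - alpha) * level,
    starting from the first observation.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")

    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical weekly demand figures, used only for illustration.
demand = [100, 110, 105, 120]
forecast = simple_exponential_smoothing(demand, alpha=0.5)
print(f"Next-period forecast: {forecast}")
```

A higher `alpha` makes the forecast react faster to recent observations, while a lower one smooths more heavily. The dedicated libraries typically estimate such parameters from the data and add trend and seasonality handling on top.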

 

 

10) Statistics and experimentation (A/B tests)

• statsmodels
• SciPy stats
• PyMC
• Stan
• brms (R)
• CausalImpact
• EconML
• DoWhy
• CausalML
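
As a rough illustration of interpreting an A/B test, the sketch below runs a two-proportion z-test on conversion counts using only the Python standard library. The function name and the conversion numbers are hypothetical, and the normal approximation it relies on assumes reasonably large samples; in real work one of the statistics packages above would normally be used instead.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200/5000 conversions in A, 250/5000 in B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be chance alone, but it says nothing about whether the effect is large enough to matter for the business.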

 

 

11) Optimization and operations research

• Google OR-Tools
• PuLP
• Pyomo
• CVXPY
• Gurobi
• CPLEX
• MOSEK

 

 

12) AutoML and no-code ML

• DataRobot
• H2O Driverless AI
• Google Vertex AI AutoML
• Azure AutoML
• SageMaker Autopilot
• Auto-sklearn
• TPOT
• PyCaret

 

 

13) MLOps (model management, deployment, monitoring)

Experiment tracking and model registry

• MLflow
• Weights & Biases
• Neptune.ai
• Comet
• ClearML

 

Deployment and serving

• FastAPI
• Flask
• BentoML
• TorchServe
• TensorFlow Serving
• KServe
• Seldon
• Ray Serve

 

Pipelines and orchestration

• Kubeflow
• Flyte
• Metaflow
• Airflow (also used in ML pipelines)
• Prefect
• Dagster

 

Monitoring

• Evidently AI
• Arize
• Fiddler
• WhyLabs
• Monte Carlo (also data observability)

 

 

14) Feature store (managing model inputs)

• Feast
• Tecton
• Hopsworks
• Databricks Feature Store
• Vertex AI Feature Store
• SageMaker Feature Store

 

 

15) Data access and databases (in DS work)

SQL / DW

• BigQuery
• Snowflake
• Redshift
• PostgreSQL
• MySQL
• SQL Server

 

NoSQL

• MongoDB
• DynamoDB
• Cassandra

 

Vector DB (LLM / semantic search)

• Pinecone
• Weaviate
• Milvus
• Qdrant
• Chroma
• FAISS
• Elastic (vector search)
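
The core idea behind semantic search in a vector database can be sketched as a brute-force nearest-neighbour lookup by cosine similarity. The example below uses only the Python standard library; the document names and the three-dimensional "embeddings" are made up for illustration, and the tools listed above rely on approximate indexes and their own APIs rather than this kind of exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy index: real embeddings have hundreds of dimensions.
index = {
    "pricing_faq": [0.9, 0.1, 0.0],
    "refund_policy": [0.1, 0.9, 0.1],
    "pricing_changelog": [0.7, 0.2, 0.3],
}
results = top_k([1.0, 0.0, 0.1], index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

An exhaustive scan like this is fine for a few thousand vectors; dedicated vector stores exist largely because it stops scaling well beyond that.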

 

 

16) Big data computing

• Apache Spark
• Databricks
• Dask
• Ray
• Apache Flink
• Hadoop (legacy)

 

 

17) Data quality (from a DS perspective)

• Great Expectations
• Soda
• Deequ
• Pandera
• TFDV (TensorFlow Data Validation)

 

 

18) Model evaluation and metrics

• scikit-learn metrics
• torchmetrics
• TensorBoard
• SHAP
• LIME
• ELI5
• Captum (deep learning explainability)
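
To make the metrics side concrete, here is a small standard-library sketch of precision, recall, and F1 for a binary classifier. The function name and the label lists are hypothetical; libraries such as scikit-learn provide tested equivalents, and this version is meant only to show what those metrics count.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute precision, recall and F1 for labels encoded as 0 (negative) and 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical churn labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

Which metric matters most depends on the cost of each error type: in a churn setting, for instance, a missed churner (false negative) may be more expensive than an unnecessary retention offer (false positive).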

 

 

19) Version control and reproducibility

• Git
• GitHub / GitLab
• DVC (Data Version Control)
• LakeFS
• Weights & Biases Artifacts
• Pachyderm

 

 

20) Documentation and collaboration

• Confluence
• Notion
• Google Docs
• Slack
• Jira
• Miro

 

Summary

Data science is modeling and analytics in which data is used to:

• understand what is happening and why
• predict what will happen next
• optimize decisions
• build data-driven products and automation