Feature Store: The Definitive Guide

A feature store is a data platform that supports the development and operation of machine learning systems by managing the storage and efficient querying of feature data. Machine learning systems can be real-time, batch, or stream processing systems, and the feature store is a general-purpose data platform that supports a range of write and read workloads: batch and streaming writes, batch and point read queries, and even approximate nearest neighbor search. Feature stores also provide compute support to the ML pipelines that create and use features, including ensuring that features are computed consistently in different (offline and online) ML pipelines.

What is a feature and why do I need a specialized store for them?

A feature is a measurable property of some entity that has predictive power for a machine learning model. Feature data is used to train ML models and to make predictions in batch ML systems and online ML systems. Features can be computed either when they are needed or in advance and used later for training and inference. One advantage of storing features is that they can be easily discovered and reused in different models, reducing the cost and time required to build new machine learning systems. For real-time ML systems, the feature store provides history and context to (stateless) online models. Online models tend to have no local state, but the feature store can enrich the set of features available to the model by providing, for example, historical feature data about users (retrieved with the user’s ID) as well as contextual data, such as what’s trending. The feature store also reduces the time required to make online predictions, as these features do not need to be computed on demand: they are precomputed.

How does the feature store relate to MLOps and ML systems?

In an MLOps platform, the feature store is the glue that ties together different ML pipelines to make a complete ML system:

feature pipelines compute features and then write those features (and labels/targets) to it;
training pipelines read features (and labels/targets) from it;
inference pipelines can read precomputed features from it.
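
As a rough sketch of how the three pipelines communicate only through the feature store, the following Python uses an in-memory list as a stand-in for a feature store. Every name here (feature_store, feature_pipeline, training_pipeline, inference_pipeline) is hypothetical, and the "model" is a toy threshold rather than a real ML model; a real system would use a feature store client and an ML framework.

```python
from statistics import mean

# Hypothetical in-memory stand-in for a feature store: a list of feature rows.
feature_store = []

def feature_pipeline(raw_events):
    """Compute features (and labels) from raw data and write them to the store."""
    for event in raw_events:
        feature_store.append({
            "user_id": event["user_id"],
            "avg_amount": mean(event["amounts"]),  # the feature
            "is_fraud": event["is_fraud"],         # the label
        })

def training_pipeline():
    """Read features and labels from the store and 'train' a toy threshold model."""
    fraud = [r["avg_amount"] for r in feature_store if r["is_fraud"]]
    legit = [r["avg_amount"] for r in feature_store if not r["is_fraud"]]
    # Toy model: a threshold halfway between the two class means.
    return (mean(fraud) + mean(legit)) / 2

def inference_pipeline(threshold, user_id):
    """Read a precomputed feature from the store and make a prediction."""
    row = next(r for r in feature_store if r["user_id"] == user_id)
    return row["avg_amount"] > threshold

feature_pipeline([
    {"user_id": 1, "amounts": [10, 20], "is_fraud": False},
    {"user_id": 2, "amounts": [900, 1100], "is_fraud": True},
])
threshold = training_pipeline()
prediction = inference_pipeline(threshold, user_id=2)
```

Because each function touches only the shared store, any one of them could be rewritten, rescheduled, or handed to another team without changing the other two.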

The main goals of MLOps are to decrease model iteration time, improve model performance, ensure governance of ML assets (features, models), and improve collaboration. By decomposing your ML system into separate feature, training, and inference (FTI) pipelines, your system becomes more modular, with three pipelines that can be independently developed, tested, and operated. This architecture scales from one developer to teams that take responsibility for the different ML pipelines: data engineers and data scientists typically build and operate feature pipelines; data scientists build and operate training pipelines, while ML engineers build and operate inference pipelines. The feature store enables the FTI pipeline architecture, which improves communication within and between data, ML, and operations teams.

What problems does a feature store solve?

The feature store solves many of the challenges that you typically face when you (1) deploy models to production, (2) scale the number of models you deploy to production, and (3) scale the size of your ML teams. The capabilities it provides include:

Support for collaborative development of ML systems based on centralized, governed access to feature data, along with a unified architecture for ML systems as feature, training, and inference pipelines;
Management of incremental datasets of feature data. You should be able to easily add new, update existing, and delete feature data using DataFrames. Feature data should be transparently and consistently replicated between the offline and online stores;
Backfilling of feature data from data sources using a feature pipeline, and of training data using a training pipeline;
History and context for stateless interactive (online) ML applications;
Easy feature reuse, by enabling developers to select existing features and reuse them for training and inference in an ML model;
Support for diverse feature computation frameworks, including batch, streaming, and request-time computation. This enables ML systems to be built based on their feature freshness requirements;
Validation of feature data as it is written, and monitoring of new feature data for drift;
A taxonomy for data transformations for machine learning based on the type of feature computed: (a) reusable features are computed by model-independent transformations, (b) features specific to one model are computed by model-dependent transformations, and (c) features computed with request-time data are computed by on-demand transformations. The feature store provides abstractions to prevent skew between data transformations performed in more than one ML pipeline;
A point-in-time consistent query engine to create training data from historical time-series feature data, potentially spread over many tables, without future data leakage;
A query engine to retrieve and join precomputed features at low latency for online inference using an entity key;
A query engine to find similar feature values using embedding vectors.

The table below shows you how the feature store can help you with common ML deployment scenarios.

When you are just putting ML in production, the feature store helps with managing incremental datasets, validating and monitoring features, deciding where to perform data transformations, and creating point-in-time consistent training data. Real-time ML extends the production ML scenario with the need for history and context information for stateless online models, low-latency retrieval of precomputed features, online similarity search, and either stream processing or on-demand feature computation. For ML at large scale, there is also the challenge of enabling collaboration between teams of data engineers, data scientists, and ML engineers, as well as the reuse of features in many models.

Collaborative Development

Feature stores are the key data layer in an MLOps platform. The main goals of MLOps are to decrease model iteration time, improve model performance, ensure governance of ML assets (features, models), and improve collaboration. The feature store enables different teams to take responsibility for the different ML pipelines: data engineers and data scientists typically build and operate feature pipelines; data scientists build and operate training pipelines, while ML engineers build and operate inference pipelines.

Feature stores enable the sharing of ML assets and improved communication within and between teams. Whether teams are building batch machine learning systems or real-time machine learning systems, they can use a shared language around feature, training, and inference pipelines to describe their responsibilities and interfaces.

A more detailed Feature Store Architecture is shown in the figure below.

The feature store keeps its historical feature data in an offline store (typically a columnar data store) and the most recent feature data, used by online models, in an online store (typically a row-oriented database or key-value store). If indexed embeddings are supported, they are stored in a vector database. Some feature stores provide the storage layer as part of the platform, while others have partially or fully pluggable storage layers.
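
To make the split between the two stores concrete, here is a minimal sketch in which a Python list stands in for the offline store and a dictionary for the online store. The names offline_store, online_store, and write_features are hypothetical, and the timestamp-based overwrite rule is an assumption made for illustration rather than a description of any particular product.

```python
# Hypothetical stand-ins: the offline store keeps full history,
# the online store keeps only the latest row per entity key.
offline_store = []   # append-only rows (a columnar store in practice)
online_store = {}    # entity key -> latest row (a key-value store in practice)

def write_features(rows):
    """Write each row to both stores so that they stay consistent."""
    for row in rows:
        offline_store.append(row)
        current = online_store.get(row["user_id"])
        # Only overwrite the online value with a newer (or equally new) row.
        if current is None or row["ts"] >= current["ts"]:
            online_store[row["user_id"]] = row

write_features([
    {"user_id": 7, "ts": 1, "clicks_7d": 3},
    {"user_id": 7, "ts": 2, "clicks_7d": 5},
    {"user_id": 8, "ts": 1, "clicks_7d": 1},
])

history = [r["clicks_7d"] for r in offline_store if r["user_id"] == 7]
latest = online_store[7]["clicks_7d"]
```

Training reads the full history from the offline store, while an online model needs only a single key lookup to get the latest values.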

The machine learning pipelines (feature pipelines, training pipelines, and inference pipelines) read and write features/labels from/to the feature store, and prediction logs are typically also stored there to support feature/model monitoring and debugging. Different data transformations (model-independent, model-dependent, and on-demand) are performed in the different ML pipelines; see the Taxonomy of Data Transformations for more details.

Incremental Datasets

Feature pipelines keep producing feature data as long as your ML system is running. Without a feature store, it is non-trivial to manage the mutable datasets updated by feature pipelines, as the datasets are spread across the different offline, online, and vector database stores. Each store has its own drivers, authentication, and authorization support, and synchronizing updates across all stores is challenging.

Feature stores make it easy to manage mutable datasets of features, called feature groups, by providing CRUD (create/read/update/delete) APIs. The following code snippet shows how to append, update, and delete feature data in a feature group using a Pandas DataFrame in Hopsworks. The updates are transparently synchronized across all of the underlying stores: the offline, online, and vector database stores.

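import hopsworks
import pandas as pd

# Connect to Hopsworks and get a handle to the feature store
project = hopsworks.login()
fs = project.get_feature_store()

# Placeholder rows; in practice, read from a data source and engineer features
df = pd.DataFrame({"year": [2023], "search_term": ["feature store"], "count": [42]})
fg = fs.get_or_create_feature_group(name="query_terms_yearly",
                              version=1,
                              description="Count of search term by year",
                              primary_key=['year', 'search_term'],
                              partition_key=['year'],
                              online_enabled=True
                              )
fg.insert(df)                # insert new rows or update existing ones
fg.commit_delete_record(df)  # delete the rows whose primary keys appear in df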

We can also update a feature group using a stream processing client (a streaming feature pipeline). The following code snippet uses PySpark streaming to update a feature group in Hopsworks. It computes the average amount of money spent on each credit card, over all of the card’s transactions, in 10-minute windows. It reads its input data as events from a Kafka cluster.

from pyspark.sql.functions import avg, window

# Read transaction events from Kafka (connection options elided)
df_read = spark.readStream.format("kafka")...option("subscribe", KAFKA_TOPIC_NAME).load()

# Deserialize data from Kafka and create streaming query
df_deser = df_read.selectExpr(....).select(...)

# Average amount per credit card over 10-minute windows
windowed10mSignalDF = df_deser \
    .selectExpr(...) \
    .withWatermark(...) \
    .groupBy(window("datetime", "10 minutes"), "cc_num").agg(avg("amount")) \
    .select(...)

card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)

# Start the streaming query that writes each window's aggregate to the feature group
query_10m = card_transactions_10m_agg.insert_stream(windowed10mSignalDF)
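
The windowed aggregation in the streaming snippet can be illustrated without Spark. The sketch below computes the same kind of 10-minute tumbling-window average per card in plain Python over a small, made-up list of events. It is a batch approximation only: it ignores watermarks and late-arriving data, and the helper names window_start and avg_amount_per_window are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts):
    """Round a timestamp down to the start of its 10-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def avg_amount_per_window(events):
    """Average transaction amount per (window start, card number)."""
    amounts = defaultdict(list)
    for e in events:
        amounts[(window_start(e["datetime"]), e["cc_num"])].append(e["amount"])
    return {key: sum(v) / len(v) for key, v in amounts.items()}

# Made-up events for illustration
events = [
    {"cc_num": "1234", "datetime": datetime(2024, 1, 1, 12, 1), "amount": 10.0},
    {"cc_num": "1234", "datetime": datetime(2024, 1, 1, 12, 9), "amount": 30.0},
    {"cc_num": "1234", "datetime": datetime(2024, 1, 1, 12, 11), "amount": 50.0},
]
aggregates = avg_amount_per_window(events)
```

The first two events fall into the 12:00 window and the third into the 12:10 window, so the card gets one aggregate row per window, which is the shape of data the streaming pipeline writes to the feature group.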

Some feature stores also support defining columns as embeddings that are indexed for similarity search. The following code snippet writes a DataFrame to a feature group in Hopsworks, and indexes the “embedding_body” column in the vector database. You need to create the vector embedding using a model, add it as a column to the DataFrame, and then write the DataFrame to Hopsworks.

import pandas as pd
from hsfs import embedding
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder rows; in practice, read from a data source and engineer features
df = pd.DataFrame({"id": [1, 2], "Article": ["First article text", "Second article text"]})

# Compute one embedding vector per article and store it as a column
embeddings_body = model.encode(df["Article"].tolist())
df["embedding_body"] = pd.Series(embeddings_body.tolist())

# Declare the column as an indexed embedding, giving its dimension
emb = embedding.EmbeddingIndex()
emb.add_embedding("embedding_body", len(df["embedding_body"][0]))

news_fg = fs.get_or_create_feature_group(
    name="news_fg",
    embedding_index=emb,
    primary_key=["id"],
    version=1,
    online_enabled=True
)
news_fg.insert(df)
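
Once embeddings are indexed, similarity search returns the entries whose vectors are closest to a query vector. The following sketch shows the idea with an exact, brute-force cosine similarity scan over a tiny hand-written index. This is an assumption made for illustration: vector databases typically use approximate nearest neighbor indexes instead of a full scan, and the names cosine_similarity and find_neighbors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_neighbors(query, index, k=1):
    """Return the ids of the k most similar embeddings (brute-force scan)."""
    ranked = sorted(
        index,
        key=lambda id_: cosine_similarity(query, index[id_]),
        reverse=True,
    )
    return ranked[:k]

# Tiny made-up index: id -> embedding vector
index = {
    "article_1": [1.0, 0.0, 0.0],
    "article_2": [0.0, 1.0, 0.0],
    "article_3": [0.9, 0.1, 0.0],
}
neighbors = find_neighbors([1.0, 0.05, 0.0], index, k=2)
```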

Backfill Feature Data and Training Data

Backfilling is the process of recomputing datasets from raw, historical data. Backfilling feature data involves running a feature pipeline with historical data to populate the feature store. This requires users to provide a start_time and an end_time for the range of data that is to be backfilled, and the data source needs to support timestamps, e.g., Type 2 slowly changing dimensions in a data warehouse table.

The same feature pipeline used to backfill features should also process “live” data. You just point the feature pipeline at the data source and the range of data to backfill (e.g., backfill the daily partitions with all users for the last 180 days). Both batch and streaming feature pipelines should be able to backfill features. Backfilling features is important because you may have existing historical data that can be leveraged to create training data for a model. If you couldn’t backfill features, you would have to start logging features in your production system and wait until sufficient data had been collected before you could start training your model.
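
A minimal sketch of this idea follows: one feature pipeline function, parameterized by start_time and end_time, serves both the backfill and the "live" run. The day-partitioned dictionary used as a data source, the feature names, and the half-open date range are all assumptions made for illustration.

```python
from datetime import date, timedelta

def feature_pipeline(source, start_time, end_time):
    """Compute daily features for every partition in [start_time, end_time)."""
    features = []
    day = start_time
    while day < end_time:
        amounts = source.get(day, [])
        features.append({
            "day": day,
            "txn_count": len(amounts),
            "txn_total": sum(amounts),
        })
        day += timedelta(days=1)
    return features

# Hypothetical data source, partitioned by day.
source = {
    date(2024, 1, 1): [5.0, 7.0],
    date(2024, 1, 2): [1.0],
    date(2024, 1, 3): [2.0, 2.0, 2.0],
}

# Backfill: run the pipeline over a historical range.
backfilled = feature_pipeline(source, date(2024, 1, 1), date(2024, 1, 3))
# Live run: the same code, pointed at the most recent partition only.
live = feature_pipeline(source, date(2024, 1, 3), date(2024, 1, 4))
```

Because the backfill and the scheduled run share one code path, the historical features are computed exactly as the live ones are.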

Point-in-Time Correct Training Data

If you want to create training data from time-series feature data without any future data leakage, you will need to perform a temporal join, sometimes called a point-in-time correct join.

For example, in the figure below, we can see that for the (red) label value, the correct feature values for Feature A and Feature B are 4 and 6, respectively. The training data would be incorrect if we included either the pink values (future data leakage) or the orange values (stale feature data). If you do not create point-in-time correct training data, your model may perform poorly and it will be very difficult to discover the root cause of the poor performance.

If your offline store supports AsOf Joins, feature retrieval involves joining Feature A and Feature B from their respective tables AsOf the timestamp value for each row in the Label table. The SQL query to create training data is an “AS OF LEFT JOIN”, as this query enforces the invariant that for every row in your Label table, there should be a row in your training dataset, and if there are missing feature values for a join, we should include NULL values (we can later impute missing values in model-dependent transformations). If your offline store does not support AsOf Joins, you can write alternative windowing code using state tables.
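
For small datasets, the same as-of left join semantics can be tried out with pandas, whose merge_asof function keeps every row of the left (label) table and attaches the most recent feature value at or before each label timestamp. The tables and values below are made up for illustration; only a single feature table is joined, and a second one would be joined the same way.

```python
import pandas as pd

# Made-up label and feature tables for one entity (id = 1)
labels = pd.DataFrame({
    "id": [1, 1],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-10"]),
    "label": [0, 1],
})
feature_a = pd.DataFrame({
    "id": [1, 1, 1],
    "ts": pd.to_datetime(["2024-01-07", "2024-01-09", "2024-01-12"]),
    "feature_a": [3, 4, 9],
})

# Both inputs must be sorted by the time column for merge_asof.
training_df = pd.merge_asof(
    labels.sort_values("ts"),
    feature_a.sort_values("ts"),
    on="ts",
    by="id",
    direction="backward",  # only use feature values at or before the label time
)
```

The label at 2024-01-10 picks up the value 4 (written on 2024-01-09) and never sees the future value 9, while the label at 2024-01-05 has no earlier feature value and gets a missing value that can be imputed later in a model-dependent transformation.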

As both AsOf Left Joins and windowing code over state tables result in complex SQL queries, many feature stores provide domain-specific language (DSL) support for executing the temporal query. The following code snippet, in Hopsworks, creates point-in-time consistent training data by first creating a feature view. The code (1) selects the columns to use as features and label(s) for the model, then (2) creates a feature view with the selected columns, defining the label column(s), and (3) uses the feature view object to create a point-in-time correct snapshot of training data.

fg_loans = fs.get_feature_group(name="loans", version=1)
fg_applicants = fs.get_feature_group(name="applicants", version=1)

# (1) Select the feature and label columns, joining the two feature groups
select = fg_loans.select_except(["issue_d", "id"]).join(
            fg_applicants.select_except(["earliest_cr_line", "id"]))

# (2) Create a feature view over the selected columns and declare the label
fv = fs.create_feature_view(name="loans_approvals",
            version=1,
            description="Loan applicant data",
            labels=["loan_status"],
            query=select
            )

# (3) Create a point-in-time correct train/test split
X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
#....
model.fit(X_train, y_train)

The following code snippet, in Hopsworks, uses the feature view we just defined to create point-in-time consistent batch inference data. The model makes predictions using the DataFrame df containing the batch inference data.

from datetime import datetime

# Version 1 is the feature view created in the previous snippet
fv = fs.get_feature_view(name="loans_approvals", version=1)
df = fv.get_batch_data(start_time="2023-12-23 00:00", end_time=datetime.now())

predictions_df = model.predict(df)

History and Context for Online Models

Online models are often hosted in model-serving infrastructure or stateless (AI-enabled) applications. In many user-facing applications, the actions taken by users are “information poor”, but we would still like to use a trained model to make an intelligent decision. For example, in TikTok, a user click contains a limited amount of information: you could not build the world’s best real-time recommendation system using just a single user click as an input feature.

The solution is to use the user’s ID to retrieve precomputed features from the online store containing the user’s personal history as well as context features (such as what videos or searches are trending). The precomputed features returned enrich any features that can be computed from the user input, building a rich feature vector that can be used by complex ML models. For example, in TikTok, you can retrieve precomputed features about the 10 most recent videos a user looked at (their category, how long the user engaged with them), along with what’s trending, what the user’s friends are looking at, and so on.

In many examples of online models, the entity is a simple user or product or booking. However, you will often need more complex data models, and it is beneficial if your online store supports multi-part primary keys (see Uber talk).
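
The enrichment step can be sketched as a simple key lookup. In the hypothetical example below, dictionaries stand in for the online store, and the feature names (videos_watched_7d, avg_watch_seconds, trending_category) are invented for illustration.

```python
# Hypothetical online store, keyed by entity ID.
user_features = {
    "user_42": {"videos_watched_7d": 120, "avg_watch_seconds": 18.5},
}
# Context features shared by all users.
context_features = {"trending_category": "cooking"}

def build_feature_vector(request):
    """Combine request-time features with precomputed history and context."""
    history = user_features.get(request["user_id"], {})
    return {
        "clicked_category": request["clicked_category"],  # from the request itself
        **history,                                        # looked up by user ID
        **context_features,                               # shared context
    }

vector = build_feature_vector(
    {"user_id": "user_42", "clicked_category": "cooking"}
)
```

The request carries a single feature, but the model receives four: the click plus the user's history and the current context.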

Feature Reuse

A common problem faced by organizations when they build their first ML models is that there is a lot of bespoke tooling for extracting data from existing backend systems so that it can be used to train an ML model. Then, when it comes to productionizing the ML model, more data pipelines are needed to continually extract new data and compute features so that the model can make continual predictions on the new feature data.

However, after the first set of pipelines has been written for the first model, organizations soon notice that one or more features used in an earlier model are needed in a new model. Meta reported that in their feature store “most features are used by many models”, and that the most popular 100 features are reused in over 100 different models. For expediency, though, developers typically rewrite the data pipelines for the new model. Now you have different models recomputing the same feature(s) with different pipelines. This leads to waste and a less maintainable (non-DRY) code base.

The benefits of feature reuse with a feature store include higher-quality features through increased usage and scrutiny, reduced storage costs, and fewer feature pipelines. In fact, the feature store decouples the number of models you run in production from the number of feature pipelines you have to maintain. Without a feature store, you typically write at least one feature pipeline per model. With a (large enough) feature store, you may not need to write any feature pipeline for your model if the features you need are already available there.

Multiple Feature Computation Models

The feature pipeline typically does not need GPUs, may be a batch program or streaming program, and may process small amounts of data with Pandas or Polars or large amounts of data with a framework such as Spark or dbt/SQL. Streaming feature pipelines can be implemented in Python (Bytewax) or, more commonly, in distributed frameworks such as PySpark, with its micro-batch computation model, or Flink/Beam, with their lower-latency per-event computation model.

The training pipeline is typically a Python program, as most ML frameworks provide Python APIs. It reads features and labels as input, trains a model, and outputs the trained model (typically to a model registry).

An inference pipeline then downloads a trained model and reads features as input (some may be computed from the user’s request, but most will be read as precomputed features from the feature store). Finally, it uses the features as input to the model to make predictions that are either returned to the client who requested them or stored in some data store (often called an inference store) for later retrieval.

Validate Feature Data and Monitor for Drift

Garbage in, garbage out is a well-known adage in the data world. Feature stores can provide support for validating feature data in feature pipelines. The following code snippet uses the Great Expectations library to define a data validation rule that is applied when feature data is written to a feature group in Hopsworks.

import great_expectations as ge
import pandas as pd
from great_expectations.core import ExpectationConfiguration

# Placeholder rows; in practice, read from a data source and engineer features
df = pd.DataFrame({"year": [2023], "search_term": ["feature store"], "count": [42]})

# define data validation rules in Great Expectations
ge_suite = ge.core.ExpectationSuite(
    expectation_suite_name="expectation_suite_101"
    )

# Rule: the search_term column must not contain null values
ge_suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "search_term"}
    )
)

fg = fs.get_or_create_feature_group(name="query_terms_yearly",
                              version=1,
                              description="Count of search term by year",
                              primary_key=['year', 'search_term'],
                              partition_key=['year'],
                              online_enabled=True,
                              expectation_suite=ge_suite
                              )
fg.insert(df) # data validation rules executed in client before insertion

The data validation results can then be viewed in the feature store, as shown below. In Hopsworks, you can trigger alerts if data validation fails, and you can decide whether a failed validation should block the insertion of the data or allow it.
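
The choice between rejecting and allowing an insertion when validation fails can be sketched in plain Python. This is not the Hopsworks or Great Expectations API: the functions validate_not_null and insert_with_validation, and the reject_on_failure flag, are hypothetical names used only to illustrate the two policies.

```python
def validate_not_null(rows, column):
    """Return the rows that violate a not-null expectation on `column`."""
    return [row for row in rows if row.get(column) is None]

def insert_with_validation(store, rows, column, reject_on_failure=True):
    """Insert rows unless validation fails and the policy is to reject."""
    failures = validate_not_null(rows, column)
    if failures and reject_on_failure:
        raise ValueError(f"{len(failures)} row(s) have a null '{column}'")
    store.extend(rows)
    return failures  # reported to the caller, e.g. to raise an alert

store = []
good = [{"year": 2023, "search_term": "feature store"}]
bad = [{"year": 2023, "search_term": None}]

insert_with_validation(store, good, "search_term")
try:
    insert_with_validation(store, bad, "search_term")
    rejected = False
except ValueError:
    rejected = True
```

With reject_on_failure=False, the same call would insert the rows and return the failing ones, so the data lands while the failures can still be reported.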

Feature monitoring is another useful capability provided by many feature stores. Whether you build a batch ML system or an online ML system, you should be able to monitor inference data for the system’s model to see if it is statistically significantly different from the model’s training data (data drift). If it is, you should alert users and ideally kick off the retraining of the model using more recent training data.

Here is an example code snippet from Hopsworks for defining a feature monitoring rule for the feature “amount” in the model’s prediction log (available for both batch and online ML systems). A job is run once per day to compare inference data for the last week for the amount feature, and if its mean value deviates more than 50% from the mean observed in the model’s training data, data drift is flagged and alerts are triggered.

# Compute statistics on a prediction log as a detection window
# (parentheses let the method chain continue onto the next line)
fg_mon = (
    pred_log.create_feature_monitoring("name",
        feature_name="amount", job_frequency="DAILY")
    .with_detection_window(row_percentage=0.8, time_offset="1w")
)

# Compare feature statistics with a reference window - e.g., training data
fg_mon.with_reference_training_dataset(version=1).compare_on(
    metric="mean", threshold=50)
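
The comparison that such a monitoring rule performs can be sketched in a few lines. The function below flags drift when the mean of a detection window deviates from the mean of a reference window by more than a percentage threshold. Treating the threshold as a relative difference in percent is an assumption based on the description above, and mean_drift and the sample values are hypothetical.

```python
from statistics import mean

def mean_drift(reference, detection, threshold_pct=50):
    """Flag drift when the detection mean deviates from the reference mean
    by more than threshold_pct percent (relative difference)."""
    ref_mean = mean(reference)
    deviation_pct = abs(mean(detection) - ref_mean) / abs(ref_mean) * 100
    return deviation_pct > threshold_pct, deviation_pct

training_amounts = [100, 110, 90, 100]     # reference window (training data)
last_week_amounts = [180, 150, 170, 160]   # detection window (inference data)

drifted, deviation = mean_drift(training_amounts, last_week_amounts)
```

Here the reference mean is 100 and the detection mean is 165, a 65% deviation, so the rule would flag drift and trigger an alert.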

Taxonomy of Data Transformations

When data scientists and data engineers talk about data transformations, they are not talking about the same thing. This can cause problems in communication, and it also contributes to the bigger problem of feature reuse in feature stores. There are three different types of data transformations, and they belong in different ML pipelines.

Data transformations, as understood by data engineers, is a catch-all term that covers data cleansing, aggregations, and any changes to your data to make it consumable by BI or ML. These data transformations are called model-independent transformations as they produce features that are reusable by many models.

In data science, data transformation is a more specific term that refers to encoding a variable (categorical or numerical) into a numerical format, scaling a numerical variable, or imputing a value for a variable, with the goal of improving the performance of your ML model. These data transformations are called model-dependent transformations and they are specific to one model.

Finally, there are data transformations that can only be performed at runtime for online models as they require parameters only available in the prediction request. These data transformations are called on-demand transformations, but they may also be needed in feature pipelines if you want to backfill feature data from historical data.

The feature store architecture diagram from earlier shows that model-independent transformations are only performed in feature pipelines (whether batch or streaming pipelines). However, model-dependent transformations are performed in both training and inference pipelines, and on-demand transformations can be applied in both feature and online inference pipelines. You need to ensure that equivalent transformations are performed in both pipelines - if there is skew between the transformations, you will have model performance bugs that will be very hard to identify and debug. Feature stores help prevent this problem of online-offline skew. For example, model-dependent transformations can be performed in scikit-learn pipelines or in feature views in Hopsworks, ensuring consistent transformations in both training and inference pipelines. Similarly, on-demand transformations are version-controlled Python or Pandas user-defined functions (UDFs) in Hopsworks that are applied in both feature and online inference pipelines.
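
The three transformation types can be sketched in a few lines of plain Python. All function names and values below are hypothetical and chosen only to show where each transformation lives; they are not Hopsworks APIs.

```python
from statistics import mean, pstdev

# Model-independent transformation (feature pipeline): aggregate raw events into
# a reusable feature. Many models can share this value, so it is stored.
def total_spend(transactions):
    return sum(t["amount"] for t in transactions)

# Model-dependent transformation (training and inference pipelines): standardize
# a value using statistics computed from ONE model's training data.
def fit_scaler(train_values):
    return {"mean": mean(train_values), "std": pstdev(train_values)}

def scale(value, params):
    return (value - params["mean"]) / params["std"]

# On-demand transformation (online inference, or a feature pipeline when
# backfilling): needs a parameter that only arrives with the prediction request.
def amount_ratio(request_amount, avg_amount):
    return request_amount / avg_amount

transactions = [{"amount": 10.0}, {"amount": 30.0}]
spend = total_spend(transactions)
params = fit_scaler([20.0, 40.0, 60.0])   # fitted once, on training data
scaled_spend = scale(spend, params)       # reuse the SAME params at inference
ratio = amount_ratio(request_amount=60.0, avg_amount=20.0)
print(spend, scaled_spend, ratio)
```

Skew appears when, for example, the inference pipeline scales with different `params` than the training pipeline did, which is why the same function and the same fitted parameters should be applied in both.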

Query Engine for Point-in-Time Consistent Feature Data for Training

Feature stores can use existing columnar data stores and data processing engines, such as Spark, to create point-in-time correct training data. However, as of December 2023, Spark, BigQuery, Snowflake, and Redshift do not support the ASOF LEFT JOIN query that is used to create training data from feature groups. Instead, feature stores built on these engines have to implement stateful windowed approaches.
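
As an illustration of what an ASOF LEFT JOIN computes, the sketch below joins each label row with the most recent feature row for the same entity whose timestamp is not later than the label's timestamp. It is a plain-Python sketch of the semantics only; the function name `asof_left_join`, the column names, and the integer timestamps are assumptions, and it is not how any of the engines above implement the join.

```python
from bisect import bisect_right

def asof_left_join(labels, features, feature_col, key="id", ts="ts"):
    """For each label row, attach the latest feature value with the same key
    whose timestamp is <= the label's timestamp (None if there is none)."""
    # Group feature rows by entity key, sorted by timestamp
    by_key = {}
    for row in sorted(features, key=lambda r: r[ts]):
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for label in labels:
        rows = by_key.get(label[key], [])
        timestamps = [r[ts] for r in rows]
        pos = bisect_right(timestamps, label[ts])  # rows[:pos] are not in the future
        value = rows[pos - 1][feature_col] if pos > 0 else None
        joined.append({**label, feature_col: value})
    return joined

labels = [
    {"id": 1, "ts": 5, "fraud": 0},
    {"id": 1, "ts": 12, "fraud": 1},
    {"id": 2, "ts": 3, "fraud": 0},
]
features = [
    {"id": 1, "ts": 1, "avg_amount": 10.0},
    {"id": 1, "ts": 10, "avg_amount": 25.0},
    {"id": 2, "ts": 4, "avg_amount": 7.0},
]

training_rows = asof_left_join(labels, features, "avg_amount")
for row in training_rows:
    print(row)
```

The third label gets no feature value because the only feature row for that entity was written after the label's timestamp; using it would leak future information into the training data. On DataFrames, `pandas.merge_asof` with `direction="backward"` gives the same backward-looking semantics.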

The other main performance bottleneck with many current data warehouses is that they expose their query interfaces to Python through a JDBC or ODBC API. These are row-oriented protocols, so data from the offline store has to be pivoted from columnar format to row-oriented format, and then back to a column-oriented format in Pandas. Pandas 2.x can use Arrow as its backing data format, so a query engine that returns Arrow data can avoid both conversions.

In open-source, reproducible benchmarks, KTH, Karolinska, and Hopsworks measured the throughput of a specialist DuckDB/ArrowFlight feature query engine that returns Pandas DataFrames to Python clients in training and batch inference pipelines. We can see from the table below that throughput improvements of 10-45X over JDBC/ODBC-based query engines can be achieved.

Query Engine for Low Latency Feature Data for Online Inference

The online feature store is typically built on existing low latency row-oriented data stores. These could be key-value stores, such as Redis or DynamoDB, or a key-value store with a SQL API, such as RonDB for Hopsworks.

Building the feature vector for an online model involves more than just retrieving precomputed features from the online feature store using an entity ID. Some features may be passed directly as request parameters, and some features may be computed on-demand, using either request parameters or data from a third-party API that is only available at runtime. Computing these on-demand features may even require historical feature values, retrieved as inference helper columns.

In the code snippet below, we can see how an online inference pipeline takes request parameters in the predict method, computes an on-demand feature, retrieves precomputed features using the request-supplied id, and builds the final feature vector used to make the prediction with the model.

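# On-demand feature computed from request parameters.
# grid_loc is assumed to be defined elsewhere.
def loc_diff(event_ts, cur_loc):
    return grid_loc(event_ts, cur_loc)

# feature_view and model are assumed to be initialized when the deployment starts.
def predict(id, event_ts, cur_loc, amount):
    # Compute the on-demand feature from the request parameters
    f1 = loc_diff(event_ts, cur_loc)
    # Retrieve precomputed features by entity ID and merge in the passed features
    df = feature_view.get_feature_vector(
        entry={"id": id},
        passed_features={"f1": f1, "amount": amount},
    )
    return model.predict(df)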
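
For readers who want to run the flow end to end, the following self-contained sketch replaces the feature store and the model with in-memory stand-ins. Everything in it is hypothetical: `ONLINE_STORE`, the column names, the distance formula in `loc_diff`, and `toy_model` are illustrative assumptions, not Hopsworks APIs.

```python
# In-memory stand-in for the online feature store, keyed by entity ID
ONLINE_STORE = {
    "card_42": {"avg_amount_7d": 20.0, "last_lat": 59.33, "last_lon": 18.06},
}

def loc_diff(cur_loc, last_loc):
    """On-demand feature: Manhattan distance in degrees between two (lat, lon) points."""
    return abs(cur_loc[0] - last_loc[0]) + abs(cur_loc[1] - last_loc[1])

def build_feature_vector(entity_id, cur_loc, amount):
    row = ONLINE_STORE[entity_id]                    # precomputed features
    last_loc = (row["last_lat"], row["last_lon"])    # inference helper columns
    f1 = loc_diff(cur_loc, last_loc)                 # on-demand feature
    return [row["avg_amount_7d"], f1, amount]        # amount comes from the request

def toy_model(feature_vector):
    """Placeholder for model.predict: flags large amounts far from the last location."""
    avg_amount, distance, amount = feature_vector
    return int(amount > 3 * avg_amount and distance > 1.0)

vector = build_feature_vector("card_42", cur_loc=(48.85, 2.35), amount=100.0)
prediction = toy_model(vector)
print(vector, prediction)
```

Note that `last_lat` and `last_lon` are read from the store only to compute the on-demand feature; they are not part of the vector passed to the model, which is the role inference helper columns play.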

In the figure below, we can see important system properties for online feature stores. If you are building your online AI application on top of an online feature store, it should have LATS properties (low Latency, high Availability, high Throughput, and scalable Storage), and it should also support fresh features (through streaming feature pipelines).

Some other important technical and performance considerations here for the online store are:

Projection pushdown can massively reduce network traffic and latency. When you have popular features in feature groups with lots of columns, your model may only require a few features. Projection pushdown only returns the features you need. Without projection pushdown (e.g., most key-value stores), the entire row is returned and the filtering is performed in the client. For rows of 10s of KB, this could mean 100s of times more data is transferred than needed, negatively impacting latency and throughput (and potentially also cost).
Your feature store should support a normalized data model, not just a star schema. For example, if your user provides a booking reference number that is used as the entity ID, can your online store also return features for the user and products referenced in the booking, or does either the user or application have to provide the user ID and product ID? For high performance, your online store should support pushdown LEFT JOINs to reduce the number of database round trips for building features from multiple feature groups.
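
Both considerations can be illustrated with SQLite from the Python standard library standing in for an online store with a SQL API. The table and column names below are hypothetical. The query returns only the three columns the model needs (projection pushdown) and resolves the user and product from the booking reference in a single round trip (pushdown LEFT JOINs).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookings (booking_ref TEXT PRIMARY KEY, user_id INTEGER,
                       product_id INTEGER, price REAL);
CREATE TABLE users    (user_id INTEGER PRIMARY KEY, age INTEGER,
                       country TEXT, lifetime_spend REAL);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT,
                       avg_rating REAL);
INSERT INTO bookings VALUES ('BK-1', 7, 100, 250.0);
INSERT INTO users    VALUES (7, 34, 'SE', 1800.0);
INSERT INTO products VALUES (100, 'hotel', 4.5);
""")

# The client supplies only the booking reference. The joins look up the user
# and product rows, and the SELECT list limits the result to the needed columns.
query = """
SELECT b.price, u.lifetime_spend, p.avg_rating
FROM bookings b
LEFT JOIN users u    ON u.user_id = b.user_id
LEFT JOIN products p ON p.product_id = b.product_id
WHERE b.booking_ref = ?
"""
feature_vector = conn.execute(query, ("BK-1",)).fetchone()
print(feature_vector)
```

Without the joins, the application would need one lookup for the booking and then two more for the user and the product, and without the projection each lookup would return every column of the row.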

Query Engine to find similar Feature Data using Embeddings

Real-time ML systems often use similarity search as a core functionality. For example, personalized recommendation engines typically use similarity search to generate candidates for recommendation, and then use a feature store to retrieve features for the candidates, before a ranking model personalizes the candidates for the user.

The example code snippet below is from Hopsworks, and shows how you can search for similar rows in a feature group with the text “Happy news for today” in the embedding_body column.

news_desc = "Happy news for today"
df = news_fg.find_neighbors(model.encode(news_desc), k=3)
# df now contains the 3 rows whose embedding_body values are most similar to news_desc
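
What such a search computes can be shown with an exact, brute-force version in plain Python. This is a sketch under stated assumptions: the three-dimensional embeddings are made up, `find_neighbors_exact` is a hypothetical helper, and a real vector index would use approximate nearest neighbour search instead of scanning every row.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_neighbors_exact(query, rows, k=3):
    """Exact (brute-force) k-nearest-neighbour search by cosine similarity."""
    scored = [(cosine_similarity(query, r["embedding"]), r["title"]) for r in rows]
    return sorted(scored, reverse=True)[:k]

rows = [
    {"title": "Local team wins final", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Markets fall sharply",  "embedding": [0.0, 0.2, 0.9]},
    {"title": "Festival draws crowds", "embedding": [0.8, 0.3, 0.1]},
]
# Stands in for the embedding a text encoder would produce for the query text
query_embedding = [1.0, 0.2, 0.0]

neighbors = find_neighbors_exact(query_embedding, rows, k=2)
for score, title in neighbors:
    print(f"{score:.3f}  {title}")
```

The brute-force scan costs one similarity computation per row, which is why vector indexes trade a small amount of accuracy for much faster lookups on large tables.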

Do I need a feature store?

Feature stores have historically been part of big data ML platforms, such as Uber’s Michelangelo, that manage the entire ML workflow, from specifying feature logic, to creating and operating feature pipelines, training pipelines, and inference pipelines.

More recent open-source feature stores provide open APIs that enable easy integration with existing ML pipelines written in Python, Spark, Flink, or SQL. Serverless feature stores further reduce the barriers to adoption for smaller teams. The key features needed by most teams include APIs for consistent reading/writing of point-in-time correct feature data, monitoring of features, feature discovery and reuse, and the versioning and tracking of feature data over time. Basically, feature stores are needed for MLOps and governance. Do you need GitHub to manage your source code? No, but it helps. Similarly, do you need a feature store to manage your features for ML? No, but it helps.

What is the difference between a feature store and a vector database?

Both feature stores and vector databases are data platforms used by machine learning systems. The feature store stores feature data and provides query APIs for efficient reading of large volumes of feature data (for model training and batch inference) and low latency retrieval of feature vectors (for online inference). In contrast, a vector database provides a query API to find similar vectors using approximate nearest neighbour (ANN) search.

The indexing and data models used by feature stores and vector databases are very different. The feature store has two data stores. The offline store, typically a data warehouse/lakehouse, is a columnar database with indexes that help improve query performance, such as (file) partitioning based on a partition column, skip indexes (which use file statistics to skip files when reading data), and bloom filters (which identify files to skip when looking for a row). The online store is a row-oriented database with indexes that help improve query performance, such as a hash index to look up a row, a tree index (such as a B-tree) for efficient range queries and row lookups, and a log-structured merge-tree (for improved write performance). In contrast, the vector database stores its data in a vector index that supports ANN search, such as FAISS (Facebook AI Similarity Search) or ScaNN by Google.
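
The difference between a hash index and an ordered index in a row-oriented store can be sketched with standard Python data structures. This is only an analogy under stated assumptions: a `dict` stands in for a hash index, a sorted list with binary search stands in for a B-tree, and the rows are made up.

```python
from bisect import bisect_left, bisect_right

rows = [
    {"id": 3, "ts": 30, "amount": 12.0},
    {"id": 1, "ts": 10, "amount": 5.0},
    {"id": 2, "ts": 20, "amount": 8.0},
    {"id": 4, "ts": 40, "amount": 20.0},
]

# Hash index: primary key -> row, for constant-time point lookups
hash_index = {row["id"]: row for row in rows}

# Ordered index: sorted keys support range queries as well as lookups
by_ts = sorted(rows, key=lambda r: r["ts"])
ts_keys = [r["ts"] for r in by_ts]

def range_query(lo, hi):
    """Return rows with lo <= ts <= hi using binary search on the sorted keys."""
    return by_ts[bisect_left(ts_keys, lo):bisect_right(ts_keys, hi)]

point = hash_index[2]
in_range = range_query(15, 30)
print(point, [r["id"] for r in in_range])
```

A hash index cannot answer the range query without scanning every key, while the ordered index answers both kinds of query at the cost of keeping the keys sorted on every write.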

Is there an integrated feature store and vector database?

Hopsworks is a feature store with an integrated vector database. You store tables of feature data in feature groups, and you can index a column that contains embeddings in a built-in vector database. This means you can search for rows of similar features using embeddings and ANN search. Hopsworks also supports filtering, so you can search for similar rows, but provide conditions on what type of data to return (e.g., only users whose age>18).

Resources on feature stores

Our research paper, "The Hopsworks Feature Store for Machine Learning", is the first feature store paper to appear at a top-tier database or systems conference (SIGMOD 2024). This article series describes, in lay terms, concepts and results from this study.

