
Research and application of soft measurement of oil well water cut based on big data analysis

Wu Lili

(1. National Engineering Laboratory for Low Permeability Oil and Gas Reservoirs; 2. Oil and Gas Technology Research Institute, PetroChina Changqing Oilfield Branch)

Abstract: The water cut of crude oil is an important indicator of the development status of an oilfield and is of great significance to production. To address the high data acquisition cost, poor anti-interference ability, and long measurement time of existing water cut detection methods, a soft measurement method for water cut based on machine learning is proposed. Using data from daily and monthly oil well reports, correlation analysis and principal component analysis were applied to identify the main controlling factors of water cut. Seven factors were selected: pump depth, daily oil production, daily water production, cumulative liquid production, cumulative oil production, cumulative water production, and cumulative production time. Several machine learning algorithms were then used to build soft measurement models of water cut, and the best-performing model was tuned. The application example shows that XGBoost, an ensemble algorithm, outperforms the traditional support vector machine model on all evaluation indicators. The model and the screening of the main controlling factors offer a new approach to water cut measurement that is more accurate and convenient.

Key words: water cut; machine learning; correlation analysis; soft measurement model

(1. National Engineering Laboratory of Tight Oil & Gas Field Exploration and Development, Xi'an 710018, China; 2. Oil & Gas Technology Research Institute, Changqing Oilfield Company, Xi'an 710018, China)

Abstract: The water cut of crude oil is an important index of the development status of an oilfield. To address the high data acquisition cost, poor anti-interference ability, and long measurement time of existing water cut detection methods, a soft-sensing method for water cut based on machine learning is proposed. Using monthly water cut report data from oil wells, correlation analysis and principal component analysis are applied to determine the main control factors affecting water cut. Seven are identified: pump depth, daily oil production, daily water production, cumulative liquid production, cumulative oil production, cumulative water production, and cumulative production time. Soft-sensing models of water cut are then built with several machine learning algorithms, and the best model is tuned. The application example shows that XGBoost, an ensemble algorithm, performs better than the traditional support vector machine model. The model and the selection of the main control factors provide a new approach to water cut measurement, making it more accurate and convenient.

Keywords: water content; machine learning; correlation analysis; soft sensor model

Crude oil is a mixed fluid composed of oil, gas and water, and its water content refers to the percentage of water in the liquid produced by the oil well, reflecting the water production status of the oil well. In the process of oilfield exploitation, storage and transportation, high-precision measurement of crude oil moisture content can optimize production parameters and improve oil recovery, which is a very important indicator of oilfield surface metering ^[1]. At present, a large number of oilfields in China are in the high water cut period, and timely and accurate prediction of water cut has become an urgent need.

At present, there are two main classes of methods for measuring the water cut of crude oil: offline measurement and online measurement ^[2]. Offline methods include the distillation method and the Karl Fischer method. The distillation method separates the oil and water and measures the mass of each to calculate the water cut. It has high accuracy, but it requires manual sampling each time, so the results are subject to sampling randomness and the process is time-consuming ^[3]. The Karl Fischer method determines the water content of crude oil by chemical titration, but it is only suitable for trace water content and produces large errors when the oilfield is in the high water cut stage ^[4]. Online measurement methods mainly include the density method, the radio frequency method, and the capacitance method ^[4-6]. Of these, the density and radio frequency methods are suitable for the high water cut stage, but online measurement usually needs to be carried out in the laboratory, is expensive, and requires a high level of technical skill. In summary, traditional methods for measuring the water cut of crude oil generally suffer from long measurement cycles, high labor and material costs, difficult data collection, and poor real-time performance.

With the development of machine learning and big data technology, combining intelligent algorithms with measurement techniques to construct soft measurement models of water cut and improve accuracy has become a new research direction ^[1,7-8]. Soft measurement uses easily available auxiliary variables to predict hard-to-obtain target variables by establishing a functional model between them. Compared with traditional methods, a soft measurement model has low maintenance cost, convenient deployment, and fast response. This paper proposes a soft measurement model of oil well water cut based on big data methods. Multi-source data are sorted and streamlined, the main controlling factors are screened from them, and a soft measurement model of water cut is then established with machine learning algorithms, evaluated, and optimized, providing a new approach to the soft measurement of oil well water cut.

1 Model Principles and Methods

Different wells and blocks differ in static parameters such as injection-production method, formation conditions, and equipment information, and the quality and characteristics of the collected data also differ, so traditional methods are not universally applicable. For this reason, a big-data soft measurement method for water cut is used. The collected raw data are first processed in combination with reservoir information. Correlation analysis is then used to delete highly correlated features and reduce redundant dimensions, and the PCA (Principal Component Analysis) method is used to determine the main control factor set of water cut, which serves as the input of the soft measurement model. Finally, big data algorithms are used to construct the soft measurement model of water cut, and the model is optimized according to the evaluation indexes (Fig. 1).

Fig.1 Flow chart of the establishment scheme of the soft measurement model of moisture content

1.1 Principles and methods of main control factor analysis

Correlation coefficients are statistical indicators that reflect the degree of correlation between variables. The two most commonly used are the Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. The former evaluates the linear relationship between two continuous variables, and the latter evaluates the monotonic relationship between two variables. To evaluate the correlation between variables accurately, the two methods were combined in this study to judge the strength of the correlation, and only one of each group of strongly correlated variables was retained to reduce redundant dimensions. On this basis, the PCA method is used to analyze the main control factors, reducing the dimensionality of the features while retaining as much information as possible, which speeds up model training and helps avoid overfitting.

1.1.1 Correlation analysis

The Pearson correlation coefficient is a linear correlation coefficient, denoted $r$, that reflects the correlation between two continuous variables X and Y. Its value lies between -1 and 1, and the greater the absolute value, the stronger the correlation. The Pearson correlation coefficient of two n-dimensional vectors $x$ and $y$ is calculated as follows:

$r_{xy} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} \sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}}$ (1)

where $\bar{x}$, $\bar{y}$ are the average values of the elements in $x$ and $y$;

$r_{xy}$ is the Pearson correlation coefficient, a real number in [-1, 1].

When $r_{xy} > 0$, the two variables are positively correlated; when $r_{xy} < 0$, they are negatively correlated. The larger $| r_{xy} |$ is, the more strongly correlated $x$ and $y$ are.

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between rank variables ^[9]. It also reflects the degree of correlation between two variables X and Y and is usually denoted by the Greek letter $ρ$. Its value lies between -1 and 1, and the greater the absolute value, the stronger the correlation. Unlike the Pearson coefficient, the Spearman coefficient evaluates the monotonic relationship between two continuous or ordinal variables. For a sample of size $n$, the $n$ raw data $X_{i}, Y_{i}$ are converted to rank data $x_{i}, y_{i}$, and the correlation coefficient $ρ$ is:

$ρ = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} \sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}}$ (2)

It is generally considered that an absolute correlation coefficient greater than 0.8 indicates an extremely strong correlation. Therefore, when the absolute correlation between two features exceeds 0.8, one of them can be removed according to field experience and the difficulty of data collection.
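The screening rule above can be illustrated with a minimal Python sketch. It computes both coefficients from their definitions and lists the feature pairs that exceed the 0.8 threshold. The feature names and values are made up for illustration, and requiring both coefficients to exceed the threshold is an assumption, since the text only says the two results are considered together.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the rank data (Eq. 2)."""
    return pearson(ranks(x), ranks(y))

def redundant_pairs(features, threshold=0.8):
    """Feature pairs whose |r| and |rho| both exceed the threshold (assumed rule)."""
    names = list(features)
    pairs = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            r = pearson(features[names[a]], features[names[b]])
            rho = spearman(features[names[a]], features[names[b]])
            if abs(r) > threshold and abs(rho) > threshold:
                pairs.append((names[a], names[b]))
    return pairs

# Hypothetical values for illustration only, not data from the study.
demo = {
    "daily_liquid": [1.0, 2.0, 3.0, 4.0, 5.0],
    "daily_water": [0.6, 1.5, 2.1, 3.2, 3.9],
    "pump_depth": [900.0, 650.0, 1200.0, 700.0, 1000.0],
}
print(redundant_pairs(demo))
```

Which member of a flagged pair to drop is left to the analyst, in line with the text's appeal to field experience and the difficulty of data collection.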

1.1.2 Analysis of main control factors

The PCA method is a statistical analysis method that simplifies the original multiple features into a few comprehensive indicators, and is a dimensionality reduction processing technology. The pumping well production system is a complex system with multiple features, and the original features are replaced by the reduced features, which reduces the data dimension without losing too much data information.

Let the n-dimensional set of input factors related to water cut be $D = (x^{(1)}, x^{(2)}, \cdots, x^{(n)})$, where $x^{(i)}$, $i = 1, 2, \cdots, n$, denotes each feature. The sample set after dimensionality reduction is $D'$, with dimension $n'$.

The specific algorithm process is as follows:

(1) Standardize all collected samples related to water cut;

(2) Calculate the covariance matrix $X X^{T}$ of the samples;

(3) Perform the eigenvalue decomposition of the covariance matrix $X X^{T}$ to obtain the eigenvalues $W = (w_{1}, w_{2}, \cdots, w_{n})$;

(4) Calculate the weight of each feature as $ω_{i} = \frac{w_{i}}{\sum_{j=1}^{n} w_{j}} \times 100\%$, which gives the weights $Ω = (ω_{1}, ω_{2}, \cdots, ω_{n})$. The weights are also called the explained variance; a larger value indicates that the dimension carries more information;

(5) Set the threshold of the main control factors. Add the weights in descending order; when the cumulative weight exceeds 85%, these factors are considered to represent all factors, and the remaining factors can be deleted.
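Steps (4) and (5) can be sketched in a few lines of Python. The eigenvalues and component names below are hypothetical placeholders; in practice they would come from the eigenvalue decomposition in step (3).

```python
def explained_variance_weights(eigenvalues):
    """Step (4): weight of each eigenvalue as a fraction of the total."""
    total = sum(eigenvalues)
    return [w / total for w in eigenvalues]

def select_by_cumulative_weight(names, eigenvalues, threshold=0.85):
    """Step (5): keep the largest weights until their sum exceeds the threshold."""
    weights = explained_variance_weights(eigenvalues)
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for name, weight in ranked:
        kept.append(name)
        cumulative += weight
        if cumulative > threshold:
            break
    return kept, cumulative

# Hypothetical eigenvalues for illustration only, not results from the study.
names = ["c1", "c2", "c3", "c4"]
eigenvalues = [4.0, 3.0, 2.0, 1.0]
kept, cumulative = select_by_cumulative_weight(names, eigenvalues)
print(kept, round(cumulative, 2))
```

With these placeholder values the weights are 0.4, 0.3, 0.2, and 0.1, so the first three components are kept because their cumulative weight of 0.9 is the first to exceed 0.85.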

1.2 Principles and Methods of Predictive Models

In recent years, with the rapid development of soft computing technology, machine learning methods have been widely used in the petroleum field. Support Vector Machine (SVM) and XGBoost (eXtreme Gradient Boosting), which is based on the ensemble method, are particularly prominent. In this study, the prediction performance of the two methods was compared and the better one was selected.

1.2.1 SVM

As a classical machine learning method, SVM has been widely used to solve classification and regression problems ^[10-13]. It has strong generalization ability and handles nonlinear problems well. When used for regression prediction, it is called support vector regression. Kernel functions are a key technique in support vector machines; three types are commonly used, namely linear, polynomial, and Gaussian kernel functions. A kernel function is a simple way of computing the inner product in a high-dimensional mapped space. Through it, the support vector machine maps the feature vectors to a higher-dimensional space, so that data that are not linearly separable in the original space become linearly separable in the mapped space. Different kernel functions lead to different SVM performance. In the model construction part, different kernel functions are tried and compared on the evaluation indexes, and the kernel function best suited to this dataset is selected for the SVM model.

1.2.2 XGBoost

In addition to the common SVM algorithm, the XGBoost method based on ensemble learning was also selected for modeling. Ensemble learning enhances the learning ability of a model by constructing and combining multiple weak learners; traditional ensemble learning methods mainly include GBDT (Gradient Boosting Decision Tree) and random forest. XGBoost is an ensemble algorithm that improves on GBDT, with greatly improved speed and efficiency ^[14]. As one of the most widely used machine learning methods in recent years, XGBoost has been widely recognized in both data science competitions and industry for its excellent performance.

XGBoost computes the predicted water cut by summing the outputs of the weak learners, with each new weak learner fitted to the gradient, or residual, of the loss function with respect to the current prediction. For a sample whose water cut is to be predicted, XGBoost builds $K$ trees during training, and the $K$ trained trees give $K$ prediction results. The predictions of all trees are added together to obtain the final prediction for the sample. The closer the predicted water cut $\hat{y}_{i}$ is to the true water cut $y_{i}$, the better the model, so the objective function of the model can be expressed as:

$L(ϕ) = \sum_{i=1}^{n} l(\hat{y}_{i}, y_{i}) + \sum_{k=1}^{K} Ω(f_{k})$ (3)

where the objective function consists of a loss function and a regularization term.

$l(\hat{y}_{i}, y_{i})$ denotes the loss function, which reflects the error between the true and predicted water cut. The loss function drives the model to fit the training data as closely as possible; the smaller the error, the better the model fits the data.

$Ω(f_{k})$ is the regularization term, which reflects the complexity of the $k$-th tree $f_{k}$ and penalizes models with high complexity, because the simpler the model, the smaller the risk of overfitting and the more stable the trained model. By minimizing the objective function, the model is trained iteratively until the optimal set of parameters is found, and the model built with this set of parameters is used to predict the water cut.
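The additive idea described above, in which each new tree fits the residual of the current prediction, can be illustrated with a deliberately simplified sketch. This is plain gradient boosting with squared-error loss and one-split trees on a single feature, not XGBoost itself: it omits the regularization term $Ω(f_{k})$ and the second-order information that XGBoost uses. All data values and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        split = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= split else right_mean

def boost(x, y, n_trees=50, learning_rate=0.25):
    """Build an additive model: mean baseline plus scaled stump outputs."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]

    def predict(xi):
        return base + learning_rate * sum(tree(xi) for tree in trees)

    return predict

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: daily oil production (t) against water cut (%).
x = [0.2, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
y = [95.0, 90.0, 82.0, 75.0, 70.0, 62.0, 55.0, 40.0]
model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
print(boosted_mse < baseline_mse)
```

The sum over trees in `predict` corresponds to adding the $K$ tree predictions together; a full implementation would also penalize tree complexity as in Eq. (3).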

1.2.3 Model evaluation index

In this study, three indicators were used to evaluate the performance of each prediction model: mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination ($R^{2}$). RMSE is very sensitive to unusually large or small errors in the predictions and therefore reflects the accuracy of the prediction results well. $R^{2}$ is an important statistic that reflects the goodness of fit of the model; it is the ratio of the regression sum of squares to the total sum of squares (Table 1).

Table 1 Evaluation indexes of the regression model

Evaluation indicator	Calculation formula	Judging criterion
Mean absolute error MAE	$MAE = \frac{1}{n} \sum_{i=1}^{n} | \hat{y}_{i} - y_{i} |$	The smaller the MAE, the smaller the prediction error of the model
Root mean square error RMSE	$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}$	The smaller the RMSE, the smaller the prediction error of the model, and the larger the RMSE, the greater the error
Coefficient of determination R²	$R^{2} = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (\bar{y} - y_{i})^{2}}$	$R^{2}$ is between 0 and 1, and the higher the value, the better the model fits

Note: $\hat{y}_{i}$ is the predicted water cut; $y_{i}$ is the true water cut; $\bar{y}$ is the average of the true values; $n$ is the number of samples in the dataset.
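The three indicators in Table 1 translate directly into code. The following is a minimal sketch using only the Python standard library; the sample values are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((mean_true - t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

For this toy input the single error of 1 gives an MAE of 1/3, an RMSE of the square root of 1/3, and an $R^{2}$ of 0.5.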

2 Selection of production characteristics and analysis of main control factors

2.1 Production feature selection

Compared with the monthly water cut data, the daily water cut data at the production site fluctuate little and have poor temporal consistency, so the monthly water cut and related data were used for analysis and modeling. The data were collected from about 1200 wells in the Wangnan operation area of Changqing Oilfield, and the monthly water cut data were first joined with the other data using the well as the primary key. Because the collected data come from different data tables, their structure is not suitable for direct use in the analysis of water cut. Attributes that are unrelated to water cut, duplicated, or empty were therefore deleted first, to ensure the integrity and validity of the data over the whole production period. Table 2 shows the statistics of each production characteristic that has a potential relationship with water cut after this processing.

Table 2 Characteristic factors of moisture content production

No.	Influencing factor	Minimum	Maximum	Average
F0	Production days in the current month/d	3.5	30	28.17
F1	Pump depth/m	500	1863	1151.26
F2	Daily liquid production/t	0.07	12.92	3.17
F3	Daily oil production/t	0	5.36	1.07
F4	Daily water production/m³	0	10.4	2.10
F5	Monthly liquid production/t	2	386	93.24
F6	Monthly oil production/t	0	158	31.59
F7	Monthly water production/m³	0	309	61.64
F8	Cumulative liquid production/m³	32	91800	14400
F9	Cumulative oil production/t	11	63300	7400
F10	Cumulative water production/10⁴m³	0.0019	4.60	0.58
F11	Cumulative production time/month	0	274	114.93

2.2 Correlation Analysis

The Pearson and Spearman correlation algorithms are used to analyze the correlation between the 12 features. The results of the two algorithms are considered together, and features that are strongly correlated with other features are deleted to reduce redundant information.

In the heat maps (Figures 2 and 3), a lighter color indicates a stronger positive correlation between variables, and a darker color indicates a stronger negative correlation. Feature pairs with a correlation greater than 0.8 were identified. The pairwise correlations among daily liquid production, daily water production, monthly liquid production, and monthly water production all exceeded 0.9, so these four features are strongly correlated with one another and only one of them was retained. The correlation values of these features are listed in Table 3 for quantitative comparison. Because daily water production had the highest weight of the four in the subsequent main control factor analysis and model training, it was retained, and daily liquid production, monthly liquid production, and monthly water production were deleted. The statistics of the remaining features are shown in Table 4.

Table 3 Analysis results of strong correlation characteristics

Feature	Method	Daily liquid production	Daily water production	Monthly liquid production	Monthly water production
Daily liquid production	Pearson	1	0.9177	0.9901	0.9101
Daily liquid production	Spearman	1	0.9298	0.9924	0.9250
Daily water production	Pearson	0.9177	1	0.9072	0.9916
Daily water production	Spearman	0.9298	1	0.9218	0.9942
Monthly liquid production	Pearson	0.9901	0.9072	1	0.9174
Monthly liquid production	Spearman	0.9924	0.9218	1	0.9299
Monthly water production	Pearson	0.9101	0.9916	0.9174	1
Monthly water production	Spearman	0.9250	0.9942	0.9299	1

Fig.2 Pearson correlation analysis heat map

Fig.3 Spearman correlation analysis heat map

Table 4 Dataset after removing redundant features

No.	Influencing factor	Minimum	Maximum	Average
F0	Production days in the current month/d	3.5	30	28.17
F1	Pump depth/m	500	1863	1151.26
F2	Daily oil production/t	0	5.36	1.07
F3	Daily water production/m³	0	10.4	2.10
F4	Monthly oil production/t	0	158	31.59
F5	Cumulative liquid production/m³	32	91800	14400
F6	Cumulative oil production/t	11	63300	7400
F7	Cumulative water production/m³	19	46000	5800
F8	Cumulative production time/month	0	274	114.93

2.3 Analysis of controlling factors

To further reduce the data dimension and find the main control factor set of water cut, the PCA method was applied to the features in Table 4 to calculate the explained variance of each dimension. The results are shown in Figure 4. The vertical axis represents the explained variance: the higher the value, the more information is carried, and a value of 1 means that the information carried equals that of the original data. The horizontal axis represents the features; the larger the explained variance of a feature, the more information it contains and the higher its weight. Among the 9 parameters, daily water production, daily oil production, and cumulative water production are the three features with the highest weights, while monthly oil production has the lowest weight.

Fig.4 Explained variance of each feature

In this study, a feature set is considered representative of the original data set when its cumulative explained variance is greater than 0.85. According to the analysis results, two dimensions, production days in the current month and monthly oil production, were deleted, and 7 effective characteristic parameters were finally determined as the input variables of the soft measurement model of water cut, as shown in Table 5.

Table 5 Main controlling factors of the soft measurement model of moisture content

No.	Influencing factor	Minimum	Maximum	Average
F0	Pump depth/m	500	1863	1151.26
F1	Daily oil production/t	0	5.36	1.075
F2	Daily water production/m³	0	10.4	2.10
F3	Cumulative liquid production/m³	32	91800	14400
F4	Cumulative oil production/t	11	63300	7400
F5	Cumulative water production/m³	19	46000	5800
F6	Cumulative production time/month	0	274	114.93

3 Construction and verification of soft measurement model of moisture content

3.1 Model construction and evaluation

The main controlling factors of water cut determined above were used as the input dimensions of the model. The dataset was collected from about 1300 wells in the Wangnan operation area of Changqing Oilfield, with a collection cutoff date of October 31, 2021. Each well contributed one record, giving about 1,300 records in total. Each record contains the current-month statistics of the seven main controlling factors listed in Table 5 and the water cut measured for the well in October. The seven main controlling factors of each well were used as the input variables of the soft measurement model, and the monthly reported water cut of each well was used as the label. The data were divided into a training set and a test set at a ratio of 8:2. The parameters of the optimal model were determined through repeated trials combined with grid search. The SVM was trained with the linear, polynomial, and Gaussian kernel functions, and the results of the different kernel functions on the dataset are shown in Table 6.

Table 6 Comparison of evaluation indexes using different kernel functions

Kernel function	MAE	RMSE	$R^{2}$
Linear kernel function	8.5581	11.9466	0.6624
Polynomial kernel function	11.8898	16.1351	0.3842
Gaussian kernel function	3.5317	6.1051	0.9118

MAE and RMSE measure the error between the true and predicted values; the smaller, the better. As Table 6 shows, the Gaussian kernel function outperforms the linear and polynomial kernel functions on both criteria. The coefficient of determination $R^{2}$ reflects the percentage of the fluctuation in water cut that can be explained by the fluctuation of the controlling factors; the closer its value is to 1, the better. On this indicator, the SVM model with the Gaussian kernel function is also far better than the other two. Therefore, the SVM model with the Gaussian kernel function achieves the best evaluation indexes and is used for soft measurement modeling of water cut.

For the XGBoost model, the main parameters of the XGBoost optimal model were determined after repeated attempts in grid search, as shown in Table 7, and the loss change during the training process is shown in Figure 5.

Table 7 XGBoost model parameters

Learning rate	The number of iterations	Sample sampling ratio	Maximum depth of the tree
0.25	200	0.8	5
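As a hedged illustration, the values in Table 7 could be written as a parameter dictionary for the scikit-learn style interface of the `xgboost` Python package. The mapping of the table headers to the parameter names `learning_rate`, `n_estimators`, `subsample`, and `max_depth` is an assumption, since the paper does not name the library parameters it used.

```python
# Table 7 values expressed with assumed xgboost parameter names.
xgb_params = {
    "learning_rate": 0.25,  # "Learning rate"
    "n_estimators": 200,    # "The number of iterations" (assumed: boosting rounds)
    "subsample": 0.8,       # "Sample sampling ratio" (assumed: row subsampling)
    "max_depth": 5,         # "Maximum depth of the tree"
}

# With the xgboost package installed, a regressor could then be created as:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**xgb_params)
print(xgb_params)
```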

Fig.5 Change of loss during XGBoost training

After the optimal structure of each model was determined, the prediction performance of the two soft measurement models was evaluated. Figure 6 shows how well the results of the two algorithms fit the true water cut. The horizontal axis represents the true water cut, the vertical axis represents the predicted water cut, and the blue and red data points are the results of SVM and XGBoost, respectively. The closer a point is to the diagonal, the smaller the difference between the prediction and the true value. Both models predict the water cut with high accuracy, but XGBoost performs better: at some points with large fluctuations its fit is greatly improved compared with SVM, and its predictions are closer to the true water cut and less scattered. Quantitatively, Table 8 lists the scores of SVM and XGBoost on the three evaluation indicators. The prediction error of the XGBoost model is smaller than that of SVM, and its accuracy is higher.

In summary, XGBoost is a better choice than the traditional machine learning method SVM in the construction of soft measurement models of moisture content.

Table 8 Comparison of SVM and XGBoost evaluation indexes

Model	MAE	RMSE	$R^{2}$
SVM	3.7291	6.2142	0.9021
XGBoost	1.9203	2.5445	0.9436

Fig.6 Comparison of SVM and XGBoost training

3.2 On-site effect verification

To verify the effect of the proposed soft measurement model of water cut, the model was deployed to production wells in the Wangnan operation area, and the seven main controlling factors were used to predict the water cut (Table 9). Here, absolute error = |true value - predicted value|, which reflects the deviation between the true and predicted values, and relative error = (absolute error / true value) × 100%, which reflects the degree of confidence in the measurement.

Table 9 Field application effect of the soft measurement model

Well No.	Pump depth/m	Daily oil production/t	Daily water production/m³	Cumulative liquid production/m³	Cumulative oil production/t	Cumulative water production/m³	Cumulative production time/month	True value/%	Predicted value/%	Absolute error/%	Relative error/%
Well 1	870	1.76	3.84	968	604	968	26	68.57	68.48	0.17	0.24
Well 2	700	1.54	5.79	3038	2220	4178	28	78.99	77.71	1.89	2.39
Well 3	617	1.41	3.05	1984	1848	4450	42	68.39	71.38	2.99	4.37
Well 4	773	0.4	7.26	2956	285	5543	25	94.78	94.26	0.52	0.54
Well 5	731	0.37	3.85	1932	440	5200	37	91.23	92.93	1.63	1.78
Well 6	663	0.23	1.5	0925	389	2309	28	86.63	89.71	3.08	3.55
Well 7	700	2.21	6.53	3607	1593	4488	22	74.71	73.06	1.65	2.20
Well 8	890	2.48	1.54	1804	2079	1343	26	38.31	35.48	2.83	7.38
Well 9	710	1.97	4.65	2751	2249	3704	28	70.24	71.67	1.43	2.04
Well 10	614	0.97	2.48	0987	3180	3147	49	71.88	71.72	0.16	0.22
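The two error definitions used in Table 9 can be checked with a short Python sketch. The example uses the true and predicted values of Well 3 from the table; the function name is illustrative.

```python
def prediction_errors(true_value, predicted_value):
    """Return (absolute error, relative error in percent) as defined for Table 9."""
    absolute_error = abs(true_value - predicted_value)
    relative_error = absolute_error / true_value * 100
    return absolute_error, relative_error

# Well 3 in Table 9: true value 68.39 %, predicted value 71.38 %.
abs_err, rel_err = prediction_errors(68.39, 71.38)
print(round(abs_err, 2), round(rel_err, 2))  # 2.99 4.37
```

Because the relative error divides by the true value, the same absolute error weighs more heavily for wells with low water cut, as Well 8 in the table shows.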

From the error between the real value and the predicted value, it can be seen that the soft measurement model of moisture content has a small error and high reliability, which meets the requirements of on-site production. It provides a new scheme for the measurement of on-site moisture content.

4 Conclusion

(1) Through main control factor analysis, the main influencing factors of monthly water cut were determined: pump depth, daily oil production, daily water production, cumulative liquid production, cumulative oil production, cumulative water production, and cumulative production time. This reduces the feature dimension while retaining the original information to the greatest extent and avoids the influence of redundant information and noise on model accuracy, providing an approach for the further use of big data technology in the soft measurement of water cut.

(2) Through verification, the tree model based on the ensemble method XGBoost is more suitable for the establishment of soft measurement model of water content than the traditional support vector machine model, which indicates that XGBoost has better fitting ability for linear and nonlinear datasets with high dimension, high coupling and high noise, and is more suitable for complex industrial production scenarios in oilfield sites, providing experience for modeling using big data methods in other scenarios.

(3) Water cut is an important indicator of oilfield development and plays an important role in guiding production and estimating the life of oil wells, but geological conditions, formation pressure, and exploitation methods all affect the water cut. Therefore, to further improve the accuracy of water cut prediction, the development and geological characteristics should be quantitatively analyzed and added as influencing factors in the training and establishment of the model, so that the model becomes more universal. This is the next step of the work.

References

[1] Liu C, Niu H, Wang J, et al. Research into prediction model of water content in crude oil of wellhead metering based on general regression neural network[C]//2009 IEEE International Conference on Intelligent Computing and Intelligent Systems. IEEE, 2009: 191-194.

[2] Tang Ying, Cui Lihong. Petroleum Knowledge,2019(3):48-49,51

[3] Chen Deen, Ma Dongchen, Chen Chunming. Journal of China Petroleum and Natural Gas Society,2008(1):132-135,392

[4] Ma Wentao, Guo Wenge, Yong Zhen, et al. Journal of Chongqing University of Science and Technology (Natural Science Edition),2016,18(3):46-49

[5] Huang Zhenghua. On-line measurement of crude oil moisture content by transmission line method[J].Surface Engineering of Oil and Gas Field,1998(2):49-51.

[6] Wang Zhiqiang. Experimental study on online measurement of turbine oil micro-water content by resonator perturbation method[D].Baoding:North China Electric Power University (Hebei),2007

[7] Makeyev Y V, Lifanov A P, Sovloukov A S. On-line microwave measurement of crude oil water content[C]//2009 19th International Crimean Conference Microwave & Telecommunication Technology. IEEE, 2009: 839-840.

[8] Li Zhiming, Kong Lingfu. Application of soft measurement based on SVM in crude oil moisture content estimation[J].Journal of Yanshan University,2006(4):328-333

[9] Myers J L, Well A D, Lorch Jr R F. Research design and statistical analysis[M]. Routledge, 2013.

[10] Ma Jian, Deng Xiaogang, Wang Lei. CIESC Journal,2018,69(3):1121-1128

[11] Guo Qiang. Soft measurement technology based on support vector machine and its application[D].Fushun:Liaoning Shihua University,2014

[12] Xu Lei, Hou Lei, Zhu Zhenyu, et al. Research on prediction of produced water treatment effect in oilfield based on two-layer decomposition algorithm and improved SVM[J].Bulletin of the Petroleum Sciences,2021,6(3):505-515.

[13] Gumus M, Kiran M S. Crude oil price forecasting using XGBoost[C]//2017 International conference on computer science and engineering (UBMK). IEEE, 2017: 1100-1103.

[14] Chen T, Guestrin C. Xgboost: A scalable tree boosting system[C]//Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016: 785-794.

Foundation Item:

About the first author:Wu Lili, master, senior engineer, is mainly engaged in the research and application of digital intelligence technology in oil and gas fields. Address: No. 51, Changqing Oilfield New Technology R&D Center, Mingguang Road, Weiyang District, Xi'an City, Shaanxi Province, postal code 710021.

Received: