Abstract
Machine learning techniques have fundamentally altered how oil and gas industry practitioners design fracture operations. In this paper, we perform data analytics using response surface methodology (RSM), a group of statistical techniques that develop a functional relationship between an output variable of interest and several associated input variables in order to optimize the output. We apply RSM to optimize horizontal well production based on the initial production (IP) of horizontal oil wells over 180 days (IP 180 Oil), as a function of five input variables: reservoir type, fracturing fluid (gal/ft), proppant (lbm/ft), cluster spacing, and stage length (ft). The RSM model correlates the initial production of each well to the input variables via a single equation, thus allowing exploration of the fitted response surface in order to maximize production. Although the five inputs were chosen primarily after consultation with industry professionals, we validate our selection by also applying an assortment of data-analytics-based methods that rank variable importance and thereby identify completion variables that may be predictive of initial production. The findings rank all five variables above the 50th percentile, indicating that the chosen variables have merit. This procedure is applied to a dataset of 201 horizontal wells from the Wolfcamp formations. The model fits reasonably well, with \(R^{2} = 61\%\), a very significant F-statistic p value, and a predicted versus observed values scatterplot indicating a good fit. The RSM analysis suggests that, within the feasible space defined by this dataset, maximum values of IP 180 Oil may be obtained by setting the fracturing fluid at approximately 1972 gal/ft, while simultaneously maximizing the remaining input variables (proppant loading, cluster spacing, stage length).
The outcome indicates possible directions to be taken in seeking a global optimum initial production for the setting of completion variables. Iteration of this scheme may lead to a near-optimum global solution. The real utility of this work may lie in indicating how different studies may be designed to optimize production, each with its own selection of inputs, and ultimately be combined in a meta-analysis.
Introduction
Challenges to oil production in the Wolfcamp range from low productivity and quick decline to high variability in well response to fracture treatment. Initial production of wells is used as a measure to assess well performance and completion effectiveness, and to compare the wells of a given company to those of its competitors and the production performance of one shale play to that of another. However, the challenge remains to determine which of these initial production metrics should be used. The O&G industry recognizes peak rate (the starting point of production decline); initial production (IP30) as a rolling average for 30-day production; and IP60, IP90, IP180, and IP365 as measures of production performance. Some schools of thought use cumulative production in the first months along with EUR to judge performance and profitability of oil wells.
A key question is, what are production metrics?
Production metrics may be expressed as initial production (IP30 through IP365, i.e., a rolling average of production over 30 through 365 days); cumulative production over the first days or months of the life of the well; and Estimated Ultimate Recovery (EUR). Table 1 lists these production metrics along with the benefits and challenges associated with each.
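As a rough illustration of the rolling-average metrics, the sketch below computes an IP value as the maximum of the rolling N-day average of daily rates, which is how the abbreviation list of this paper describes IP 30, 90, and 180. The function name `initial_production` and the daily rates are hypothetical, and operators may define IP differently.

```python
def initial_production(daily_rates, window):
    """Peak rolling-average rate over `window` consecutive days."""
    if window <= 0 or window > len(daily_rates):
        raise ValueError("window must be between 1 and the number of daily rates")
    # Sum of the first window, then slide it one day at a time.
    running = sum(daily_rates[:window])
    best = running
    for i in range(window, len(daily_rates)):
        running += daily_rates[i] - daily_rates[i - window]
        best = max(best, running)
    return best / window


# Hypothetical daily oil rates (bbl/d), shortened so the windows are easy to check by hand.
rates = [0, 120, 300, 280, 260, 240, 200, 180]
ip3 = initial_production(rates, 3)  # best 3-day window is days 3 to 5
```

With real data the same call would use `window=180` on at least 180 days of daily production to obtain an IP 180 style value.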
The IP rate of a well is established based on well type; reservoir considerations; historical and analogous performance; management and commercial strategy; operational considerations; target and economic reserves, and nodal analysis results (Olufemi and Ambastha 2017).
Empirical techniques to forecast oil production include decline curve analysis and capacitance/resistance modeling (CRM). CRM is a data-driven method for characterizing conventional reservoirs and optimizing oil production without complex reservoir simulations (Yousef et al. 2006).
Recent industry efforts established multistage fracture design optimization for unconventional shale oil and gas basins to locate the number of fracture stages that maximizes NPV (Rashid et al. 2014). The effect of both spacing and perforation friction on the propagation of concurrent hydraulic fractures from a horizontal well was studied (Izadi et al. 2015). The optimization indicated that while limited-entry perforating could help equalize the lengths of fractures simultaneously growing from multiple clusters, it did not equalize their widths, but equalized placement of proppant across the entire series of perforation clusters.
The data-based technique, which relies on empirical correlations, also correlates well production to reservoir characteristics. Due to the downhole tool restrictions currently in place, multistage fracturing is often conducted consecutively, stage by stage from well toe to heel, without real optimization (Xia et al. 2016; Roussel and Sharma 2011). Each stage usually contains several perforation sets, and thus several fractures propagate simultaneously if each perforation set produces only one fracture (Wu and Olson 2014).
There is a significant difference between data-driven reservoir modeling and empirical technologies. First, a model with good predictivity is necessary; then internal interactions between the variables can be checked to see if the model makes sense. The physical and geological interactions can indicate the predictive consistency of the built model (Mohaghegh 2017). A carefully worked-out effort to develop unconventional reservoirs may change the trend of the manufacturing approach (uniform spacing of wells). Utilization of stimulation software such as “KinetixIntersect” and comparison of the modeling outcome of the fracture simulator for a selected fracture stage with rate-transient analytical results led to variable Enhanced Oil Recovery (EOR) (Rodriguez 2019). Acknowledgment of variable recovery and pressure depletion associated with the producers’ drainage volume should be the basis for the planning of EOR techniques (Rodriguez 2019). Spacing and placement of perforation clusters, stress distribution, and fracture mechanics are the most important input variables of production performance (Cheng 2012). A case study implemented an engineering approach to optimize multiple sets of parallel horizontal wells in the oil segment of the Eagle Ford Shale. The data were used to design stimulation-stage intervals and evaluate surface injection responses during fracture treatments (Stegent et al. 2013).
The correlation between the style of completions and the initial production rate is well established (Griffin et al. 2013; Lafollette et al. 2012a, b). Initial production is a crucial metric to measure profitability of wells. Estimated ultimate recovery (EUR) of wells usually increases with the number of stages. As the number of fracture stages increases, the efficiency of incremental stages decreases in the Bakken Shale formation (Ran and Kelkar 2015). Beyond a certain number, therefore, the incremental cost would exceed the incremental benefits. All these efforts have led to a good understanding of the effect of individual variables on production, without agreement on an optimum methodology. Microseismic monitoring in unconventional reservoirs has proven to be highly effective in many field runs (McClure 2012).
Table 1 lists the benefits and challenges of commonly used production metrics.
Why do O&G operators need production metrics?

To measure performance of new wells versus old wells.

To compare their own wells versus competitors’ wells, shales, plays.

To correlate production metrics with the EUR of wells.
Hydraulic fracture operations lead to an increase in production. This increase may depend on various metrics (shown in Table 2): completed length of the horizontal well; quantity of perforations; number of clusters per stage; cluster spacing in length; size of proppant in lbm; concentration of proppant in fluid; fluid in gal/ft; fluids in bbls; average injection rate; fluid type, and proppant type and size, along with production metrics (e.g., Initial Production, Yield).
The data ranges used to develop the model are summarized in Table 2, which also represents completion variables that might affect initial production performance of horizontal wells.
The paper is organized as follows:

Description of problem

Data analytics methodology to select optimum completion variables

Development of model

Testing and validation of model.
For lack of standard metrics, optimum completion variables are determined by a trial-and-error approach (e.g., amount of fluid and proppant pumped per completed foot, cluster spacing, and fracture stage length). One unanswered question is the optimum number of completion variables needed per well for highest initial production. Another area of uncertainty is the degree to which each metric affects estimated ultimate recovery (EUR) from an individual well. One school of thought suggests the more, the better. Another suggests that fracture length and dimensionless fracture conductivity are the two primary variables that control the productivity index of a fractured well and thus initial production. For years, industry researchers have applied accepted methods to estimate post-fracture production increase. These approaches depend on a variety of fracture lengths and fracture conductivities. The procedures involve a combination of propped fracture lengths and conductivities that optimize post-fracture production response.
Graphical solutions rely on the interdependence between formation permeability, half-length of the fracture, and effective conductivity \(\left( {K_{\text{f}} \times w_{\text{fp}} } \right)\). Veatch et al. (2017) showed five commonly used charts for predicting semi-steady-state post-fracture production fold of increase, i.e., the ratio of production rate after fracture cleanup to the pre-fracture rate. The methods include the Prats; McGuire and Sikora; Holditch; Tinsley et al.; and Tannich and Nierode graphs. These charts may provide guidelines for determining approximate treatment size (fluid systems and proppants), injection rates, and proppant staging schedules (Veatch et al. 2017). Economic analysis based on IP alone can be misleading for gas producers considering lateral placement in the reservoir at the bottom or midway (Taylor et al. 2011). On the other hand, ultimate production (EUR) from Marcellus wells was precisely predicted using initial well production (Male et al. 2016). The wells in the study, which showed boundary-dominated flow, were used to forecast EUR of 5275 wells in Pennsylvania and West Virginia over a period of 25 years.
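For orientation, the two quantities behind the graphical methods mentioned above can be written compactly. The expressions below use the commonly quoted textbook definitions rather than anything derived in this paper. The symbols \(k\) (formation permeability), \(x_{\text{f}}\) (fracture half-length), \(C_{\text{fD}}\), \(q_{\text{post}}\) and \(q_{\text{pre}}\) are notation assumed here, while \(K_{\text{f}}\) and \(w_{\text{fp}}\) follow the abbreviation list.

```latex
% Dimensionless fracture conductivity (standard definition; k and x_f are assumed symbols)
C_{\text{fD}} = \frac{K_{\text{f}} \, w_{\text{fp}}}{k \, x_{\text{f}}}

% Fold of increase: post-cleanup rate divided by pre-fracture rate
\text{FOI} = \frac{q_{\text{post}}}{q_{\text{pre}}}
```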
This paper focuses on use of data from the Wolfcamp play of the Permian Basin to introduce an effective model to predict initial production as a production performance metric from completion variables. The work utilizes data analytics for effective selection of key completion variables that lead to optimum IP. For a list of various analytics tools used in unconventional reservoirs and development, see Table 6 in “Appendix 2”.
As far as we know, no simple model is available to estimate production metrics utilizing key completion variables such as fluid in gal/ft, cluster spacing, proppant in Lbs./ft, reservoir type and stage length of horizontal wells. For this paper, we are optimizing the initial production of oil for 180 days (IP 180 Oil).
Methodology
In this section, we first describe data sources, as well as the selection of input variables for optimizing IP 180 Oil. Summary measures commonly used in the machine learning literature to rank the relative merits of predictor variables are computed. Finally, a quadratic response surface model is fitted.
Data and nomenclature
The dataset consisted of measurements taken from 201 horizontal wells. The 209 originally listed variables of particular interest were analyzed. A list of 32 variables was chosen as given in Table 3. From this list, the following variables were chosen after the analyses described in the next subsection.

1.
Oil: natural logarithm of IP 180 Oil (output variable). Range of values: 4.860–8.011.

2.
Reservoir type: reservoir name (acts as blocking factor with 9 levels). Levels: WOLFCAMP, WOLFCAMP A, WOLFCAMP C, WOLFCAMP D, WOLFCAMP D4, WOLFCAMP LA, WOLFCAMP MA, WOLFCAMP UA, WOLFCAMP UAY.

3.
Fluid.gal/ft: amount of fluid pumped per completed foot. Range: 643–2687.

4.
Prop.lbs/ft: amount of proppant pumped per completed foot. Range: 724–2839.

5.
Cluster spacing: distance between the clusters within the stage. Range: 13–132.

6.
Stage length: length of each stage. Range: 156–404.
Selection of input variables
The selection of the five input variables discussed above is based primarily on expert opinion after extensive consultations with industry professionals. As a form of validation, we also applied a variety of purely machine-learning-based methods to rank the relative importance of variables in predicting IP 180 Oil. The randomForest package in R, which fits regression tree predictive models, provides two such measures (Liaw and Weiner 2018): %IncMSE (Fig. 1) and IncNodePurity (Fig. 2). The third measure is the absolute value of the (Pearson) correlation coefficient of each variable with IP 180 Oil (Fig. 3). Note that, unlike the regression-tree-based methods, correlations cannot be calculated between a categorical variable (in this case reservoir type) and a quantitative output.
We note that the five selected input variables, although not at the very top of these various rankings, do come out as being important. Namely: Reservoir is ranked 3rd in Fig. 1; Fluid gal/ft is ranked 12th in Fig. 3; Prop.lbs/ft is ranked 7th in Fig. 3; Cluster spacing is ranked 9th in Fig. 3; Stage length is ranked 19th in Fig. 3.
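The correlation-based ranking (the third measure) is simple enough to sketch directly. The snippet below ranks numeric predictors by the absolute Pearson correlation with the response. The variable names and toy values are hypothetical and are not taken from the 201-well dataset, and the categorical reservoir type is left out because, as noted above, this measure does not apply to it.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_by_abs_correlation(predictors, response):
    """Return (name, |r|) pairs sorted from most to least correlated."""
    scores = {name: abs(pearson(values, response)) for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two completion variables and log(IP 180 Oil) for five wells.
predictors = {
    "fluid_gal_ft": [800, 1200, 1600, 2000, 2400],
    "stage_length_ft": [300, 180, 260, 200, 240],
}
log_ip180 = [5.1, 5.9, 6.4, 6.8, 7.5]
ranking = rank_by_abs_correlation(predictors, log_ip180)
```

The two random-forest measures would require a tree-ensemble library and are not reproduced in this sketch.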
Fitting a response surface model
Response surface methodology (RSM) is a group of statistical procedures used in the development of a flexible functional relationship between a relevant output variable, y, and a number of related input variables, \(x_{i}\), with a view toward optimization of the output (Khuri and Mukhopadhyay 2010). If we have just one categorical input variable \(x_{0}\) with L levels, and k numerical input variables \(x_{1} , \ldots ,x_{k}\), such a relationship is usually approximated by a low-degree polynomial model of the form (Eq. 1) listed in “Appendix 4”. Since the model is of the form of a multiple linear regression, it is straightforward to estimate its coefficients via least squares and assess its goodness of fit. Denoting these estimates by \(\widehat{B}_{i }\) and \(\widehat{B}_{i,j }\), the resulting fitted model at a given value of the vector of input variables x is given by Eq. 2 listed in “Appendix 4”, which is the optimal predictor of y at x (in the least-squares sense).
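For readers without the appendix at hand, one common way to write a second-order response surface with a single categorical factor is sketched below. This is a generic textbook form and may differ in parameterization from Eq. 1 of “Appendix 4”. The symbols \(B_{0}\), \(A_{l}\) (intercept shift for level \(l\) of \(x_{0}\), with the first level as baseline) and \(\varepsilon\) are assumptions introduced here, while \(B_{i}\) and \(B_{i,j}\) mirror the estimates \(\widehat{B}_{i}\) and \(\widehat{B}_{i,j}\) named in the text.

```latex
% Assumed generic second-order model with a categorical intercept shift
y = B_{0} + \sum_{l=2}^{L} A_{l}\,\mathbf{1}\{x_{0} = l\}
      + \sum_{i=1}^{k} B_{i}\,x_{i}
      + \sum_{i=1}^{k}\sum_{j=i}^{k} B_{i,j}\,x_{i}x_{j}
      + \varepsilon

% Fitted surface obtained by replacing each coefficient with its least-squares estimate
\widehat{y}(x) = \widehat{B}_{0} + \sum_{l=2}^{L} \widehat{A}_{l}\,\mathbf{1}\{x_{0} = l\}
      + \sum_{i=1}^{k} \widehat{B}_{i}\,x_{i}
      + \sum_{i=1}^{k}\sum_{j=i}^{k} \widehat{B}_{i,j}\,x_{i}x_{j}
```

The model is nonlinear in the inputs but linear in the coefficients, which is why ordinary least squares applies.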
The main goal of RSM is determination of the optimum model-predicted output ŷ(x), over all x in the data region D. For our data, this region is the 4-dimensional rectangle delineated by the intervals (see the ranges listed under “Data and nomenclature”): 643 to 2687 for fluid (gal/ft), 724 to 2839 for proppant (lbs/ft), 13 to 132 for cluster spacing, and 156 to 404 for stage length (ft).
For ease of interpretation, we normalize these variables by subtracting the midpoint of the interval and dividing the result by the interval half-width:
\(x_{i} = \left( {\text{raw value}_{i} - \text{midpoint}_{i} } \right)/\text{half-width}_{i}\),
whence the (normalized) data region \(D_{\text{norm}}\) becomes the 4-dimensional cube delineated by the intervals \(-1 < x_{i} < 1\), for i = 1,…,4. On this normalized scale, the fitted model coefficients are given in Table 5 of “Appendix 1”.
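The normalization can be checked numerically with a short sketch. The ranges are those listed under “Data and nomenclature”. The dictionary keys and function names are hypothetical labels chosen for this illustration.

```python
# Observed ranges (min, max) from the "Data and nomenclature" list.
RANGES = {
    "fluid_gal_ft": (643.0, 2687.0),
    "prop_lbs_ft": (724.0, 2839.0),
    "cluster_spacing": (13.0, 132.0),
    "stage_length_ft": (156.0, 404.0),
}


def normalize(name, value):
    """Map a raw value to the coded scale: (value - midpoint) / half-width."""
    low, high = RANGES[name]
    midpoint = (low + high) / 2.0
    half_width = (high - low) / 2.0
    return (value - midpoint) / half_width


def denormalize(name, coded):
    """Inverse map from the coded scale back to raw units."""
    low, high = RANGES[name]
    return (low + high) / 2.0 + coded * (high - low) / 2.0


# The fluid setting of about 1972 gal/ft quoted in the abstract, on the coded scale.
coded_fluid = normalize("fluid_gal_ft", 1972.0)  # roughly 0.30
```

On this scale the reported fluid optimum sits inside the cube, at roughly 0.30, whereas the other three inputs sit on the boundary at 1.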
This model, a (typical) second-order polynomial, fits our dataset reasonably well, with \(R^{2} = 61\%\) and a very significant F-statistic p value. The predicted versus observed value plot in Fig. 4 shows relatively good in-sample predictions.
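As a reminder of what the reported \(R^{2}\) measures, the sketch below computes the in-sample coefficient of determination from observed and predicted values. The numbers are hypothetical and chosen only so that the arithmetic is easy to verify; they are not the paper's data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical log(IP 180 Oil) values and model predictions for four wells.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
r2 = r_squared(observed, predicted)  # SS_res = 1.0, SS_tot = 5.0
```

An \(R^{2}\) of 61% therefore means the fitted surface accounts for about 61% of the variance of the log-transformed IP 180 Oil in the sample.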
Results and discussion
The selection of an appropriate pool of input (or predictor) variables is a perennial problem in supervised machine learning applications. One may begin to understand the difficulty of the issue by realizing that the exact functional relationship between a predictor and the outcome of interest is (and forever will be) unknown. Moreover, predictors are typically correlated, thus further confounding the issue of what is predicting what. And how should one define “appropriate”? In classical regression analysis (Neter et al. 2004), one can begin to untangle these interrelated issues by measuring the decrease in prediction error variance (the so-called MSE or “mean squared error”) when each variable by itself is added to the model in turn: important predictors should lead to large decreases in MSE. But this is not the full story, as the results depend on assumed functional relationships (typically linear), and the amount of “dependence” (correlation) among the predictors.
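The single-variable MSE comparison described above can be sketched as follows, under the simplifying assumptions of a straight-line fit and an MSE that divides by the sample size rather than the residual degrees of freedom. The function name `mse_drop` and the data are hypothetical.

```python
def mse_drop(x, y):
    """Decrease in MSE when predictor x is added to an intercept-only model."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Intercept-only model predicts the mean of y for every observation.
    mse_null = sum((b - mean_y) ** 2 for b in y) / n
    mse_fit = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / n
    return mse_null - mse_fit


# Hypothetical response and two candidate predictors.
y = [1.0, 2.0, 3.0, 4.0]
drop_strong = mse_drop([1.0, 2.0, 3.0, 4.0], y)    # perfectly linear predictor
drop_weak = mse_drop([1.0, -1.0, -1.0, 1.0], y)    # uncorrelated with y
```

Because each predictor is scored alone, this comparison cannot separate the contributions of correlated predictors, which is the caveat raised in the paragraph above.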
The five input variables listed in the previous section were selected primarily on expert opinion after extensive consultations with industry professionals. As a form of validation, the three metrics displayed in Figs. 1, 2 and 3 provide different instances of purely machine-learning-based rankings of importance. The two tree-based measures, %IncMSE (Fig. 1) and IncNodePurity (Fig. 2), consist of drop-in-MSE types of metrics that use frequency of splits along the variable to determine rank (see “Appendix 3” for a summary). The measure in Fig. 3 is the usual correlation coefficient. A plethora of other measures could be used, e.g., one of the many statistical variable selection information-based criteria, such as AIC or BIC, in connection with the fitted RSM model. Although the five inputs we chose are not the top 5 by any of these measures, they do rank above the 50th percentile according to all of them. In any case, the aim of the present study is not to definitively “solve” this difficult ranking problem, but rather to sketch out a procedure for optimizing production that could be implemented following the selection of a few inputs based on pragmatic choices borne out of field experience.
Translated into the RSM framework of the “Methodology” section, this optimization goal becomes determination of the maximum model-predicted output ŷ(x), over all x in the data region D. Interestingly, the optimal x is obtained by maximizing all the variables except fluid in gal/ft, as shown in Table 4. Note that the different reservoir types simply act as an intercept in the model, so that any type other than WOLFCAMP increases this value; e.g., the largest increase is 1.0436 for WOLFCAMP UAY (see Table 5 of “Appendix 1”). Thus the location of the optimal point is unaffected by reservoir type, and all discussion and plots pertain to WOLFCAMP.
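The behavior described here (an interior optimum in one input, boundary optima in the others, and a location that does not depend on reservoir type) can be reproduced with a small grid search over the normalized cube. The coefficients below are hypothetical and are not the fitted values of Table 5. Interaction terms are omitted for brevity, and only the 1.0436 shift for WOLFCAMP UAY is taken from the text.

```python
from itertools import product

# Hypothetical coefficients on the normalized scale (NOT the paper's Table 5).
LINEAR = (0.6, 0.4, 0.3, 0.2)   # fluid, proppant, cluster spacing, stage length
QUAD_FLUID = -1.0               # concave in fluid, so its optimum is interior
RESERVOIR_SHIFT = {"WOLFCAMP": 0.0, "WOLFCAMP UAY": 1.0436}


def predict(x, reservoir):
    """Assumed quadratic surface: reservoir intercept shift + linear terms + fluid^2 term."""
    linear_part = sum(b * xi for b, xi in zip(LINEAR, x))
    return RESERVOIR_SHIFT[reservoir] + linear_part + QUAD_FLUID * x[0] ** 2


def grid_argmax(reservoir, steps_per_side=10):
    """Exhaustive search over a regular grid on the cube [-1, 1]^4."""
    axis = [i / steps_per_side for i in range(-steps_per_side, steps_per_side + 1)]
    return max(product(axis, repeat=4), key=lambda x: predict(x, reservoir))


best_base = grid_argmax("WOLFCAMP")
best_uay = grid_argmax("WOLFCAMP UAY")
```

Since the reservoir term adds the same constant at every grid point, both searches return the same location; only the predicted value at that point changes.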
Figure 5 shows contour plots of ŷ(x) in the vicinity of this optimal point (red dot). The variables not displayed on the axes are fixed at their optimal point values. Overlaid are the data points, displayed in blue. The top left panel would seem to indicate that the optimal point is in the vicinity of the data, an impression that is dispelled in the bottom left panel, where the red dot is seen to be quite far from the data cloud.
Finally, Fig. 6 is a zoomedout version of Fig. 5 plotted on the normalized variable scale. The data region \(D_{\text{norm}}\) is displayed as a red square, where the optimal point is still the red dot.
From this we can see that the red dot is in a flat region, thereby suggesting that no dramatic improvements appear to be feasible beyond this point. (The fact that the global optimal solution simultaneously sets all inputs at infinity is not of interest here).
Summary
This paper devises an optimization procedure to maximize the initial production of horizontal wells considering a pool of reservoir variables, summarized as follows:

1.
Choose an initial oil production metric, e.g., IP 180 Oil.

2.
Select a pool of “important” predictors/variables, e.g., number of days in production; horizontal well completion configurations; proppant size; fluid type; stage count; and lateral completed interval. This selection could (perhaps ideally) be made by combining pragmatic field experience with state-of-the-art data analytics.

3.
Use RSM to model the functional relationship between the pool of predictors and the metric.

4.
Based on the fitted RSM model, determine the optimum modelpredicted output ŷ(x), over all x in the feasible data region.
Response surface methodology suggests that, within the feasible space defined by this dataset, maximum values of IP 180 Oil may be obtained by setting fluid in gal/ft at approximately 1972, while simultaneously maximizing the remaining inputs (prop in lbs/ft, cluster spacing, stage length).
Note that this statement is not to be taken as absolute truth, but rather as suggestive of possible directions to be taken in seeking a global optimum value for the setting of inputs. More refined estimates may be obtained by collecting new data in the vicinity of this optimum, followed by a new analysis. Iteration of this scheme may lead to a near-optimum global solution.
The value of the new model proposed in the paper lies in its simplicity and the lower number of input variables needed for a practicing engineer to optimize oil production from shale horizontal wells. More importantly, it is based on completion design variables that may be chosen relatively quickly. It is also possible that the IP 180 Oil calculated from this model can be correlated with EUR.
It is projected that the calculated IP 180 Oil from this model may be used in several applications. First, it may work as a comparative scale to measure well performance of a key play versus that of a competitor. Second, it may be used as a tool to enable quick decision on certain completion variables needed for horizontal wells, which can facilitate further estimation of fluids gal/ft, proppants lbs/feet, cluster spacing and fracture stage length. Third, in case of multiple leases with a variety of IPs, the model may work as a quality check tool to optimize production of a lease. Estimation of IP will be used to differentiate among different reservoirs (Wolfcamp A, B, C, D, MA and LA).
Conclusion and recommendation
The objective of the paper was to propose a methodology for identifying the values of key completion variables that may lead to optimum production of horizontal wells. The study proceeded as follows.

Extensive literature review to identify all production metrics, understand the relationships between them, compare performance, and suggest which may be used.

Description of the problem.

The use of data analytics methodology to select optimum values of key completion variables, after elimination of non-effective variables such as Avg. Rate (bpm) from the available pool.

Testing and validation of the procedure on publicly available data.
The following conclusions may be drawn.

1.
Optimal production may be obtained by maximizing with respect to some pairs of variables (e.g., cluster spacing and proppants).

2.
Maximum IP over the data domain is obtained by maximizing all the variables (e.g., proppant of 2839 lbs/ft, cluster spacing of 132 ft, and stage length of 404 ft), except fluid in gal/ft.

3.
The proposed model may be used to predict initial production of horizontal wells of the Wolfcamp using few input completion variables with reasonable accuracy.

4.
Additional data might benefit the work for further testing of global areas of shale plays.
Abbreviations
 \(K_{\text{f}}\) :

Fracture permeability
 \(w_{\text{fp}}\) :

Propped fracture closed width
 IP:

The amount of oil produced by a new oil well, measured in B/D (barrels of oil per day) or BOE/D (barrels of oil equivalent per day)
 Proppant Lbs.:

Mass of proppants
 Age:

The difference in number of days between the time the well was completed and 01/01/2001
 Avg. Prop. Concentration:

The lbs of proppant pumped per fluid gallons pumped
 Avg. Rate:

The average rate at which the mixture is pumped downhole to create the fractures
 Cluster spacing:

The distance between the clusters within the stage
 Clusters per stage:

The number of clusters used in each stage
 Clusters:

A set of perforations arranged in a certain pattern to achieve the optimal completion
 Comp Date:

The date the well was completed
 Completed feet:

The calculated distance between the top perf and bottom perf where the fractures occur
 County Variable:

A numeric variable to distinguish between counties
 County:

The county where the well was drilled
 Fluid Bbls:

The amount of fluid pumped downhole to initiate fracture and place proppant
 Fluid Gal/Cluster:

The amount of fluid pumped per cluster
 Fluid Gal/Ft.:

The amount of fluid pumped per completed foot
 Fluid Gal/Perf:

The amount of fluid pumped per perforation
 IP 30, 90, & 180:

Calculations taking a rolling average by number of days described (30, 90, or 180) and then using the maximum value obtained for oil, gas and water
 ISIP/Ft.:

Instantaneous shut-in pressure once a stage has been completed and frac pumps are shut down
 Linear Gel:

A fracturing fluid supplemented with different polymers which increase its ability to carry proppant
 Max Prop. Concentration:

The proppant concentration begins with mostly fluid and then is built up to a maximum concentration of pounds of proppant per fluid gallon
 Max. Rate:

The maximum rate achieved pumping mixture into fractures
 Number of stages:

Whenever a plug is set, perforations are created, the reservoir is fractured with fluid, and then another plug is set, it is called a stage
 Perfs/Clusters:

The number of perforations used in each cluster
 Perfs:

The number of holes created from the charges that are inserted downhole in sets of perforations called clusters designed in different patterns to achieve optimal completion
 Prop. Lbs./Cluster:

The amount of proppant pumped per cluster
 Prop. Lbs./Ft.:

The amount of proppant pumped per completed foot
 Prop. Lbs./Perf:

The amount of proppant pumped per perforation
 Prop. Lbs.:

The amount of proppant pumped with the fluid to keep the created fractures open
 Rate/Cluster:

The average rate per cluster
 Rate/Ft.:

The average rate per completed foot
 Rate/Perf:

The average rate per perf
 Reservoir:

The formation in which the lateral was drilled (i.e., Wolfcamp A, Wolfcamp LA, Wolfcamp MA)
 Reservoir Variable:

A numeric variable distinguishing between reservoirs Wolfcamp C–D and Wolfcamp A
 Slickwater:

Water with chemicals added to speed up the rate at which it can be pumped to create more fractures
 Stage length:

The length of each stage, a good indicator for normalizing stages for lateral length
 TVD:

The furthest vertical depth drilled
 UWI:

A unique well identifier for every well; each set of numbers stands for a unique location and well identifier (i.e., county, state, horizontal, pilot)
 GOR Ratios:

A metric used to determine the amount of gas produced per oil produced (SCF/STB)
 Yield:

Condensate yield, MMSCF/STB
 WCUT:

Water cut; a metric used to determine the amount of water produced per oil produced. (Water/(Water + Oil))
 Oil, Gas, MBOE EUR:

The estimated ultimate recovery of oil and gas, MBOE = Oil + Gas/6
 Degradation:

The amount of overlap divided by total rectangular area
References
Bakshi A, Uniacke E, Korjani M (2017) A novel adaptive nonlinear regression method to predict shale oil well performance based on well completions and fracturing data. In: SPE-185695-MS presented at the SPE western regional meeting. Bakersfield, California, USA, pp 23–27. https://doi.org/10.2118/185695-MS
Bolen M, Crkvenjakov V, Converset J (2018) The role of big data in operational excellence and real time fleet performance management: the key to deepwater thriving in a low-cost oil environment. Society of Petroleum Engineers. https://doi.org/10.2118/189603-MS
Cheng Y (2012) Impacts of the number of perforation clusters and cluster spacing on production performance of horizontal shale gas wells. SPE Reserv Eval Eng 15(01):31–40. https://doi.org/10.2118/138843-PA
Feblowitz J (2013) Analytics in oil and gas: the big deal about big data. Society of Petroleum Engineers. https://doi.org/10.2118/163717-MS
Gaurav A (2017) Horizontal shale well EUR determination integrating geology, machine learning, pattern recognition and multivariate statistics focused on the Permian Basin. Society of Petroleum Engineers. https://doi.org/10.2118/187494-MS
Grieser WV, Shelley RF, Johnson BJ, Fielder EO, Heinze JR, Werline JR (2006) Data analysis of Barnett shale completions. Society of Petroleum Engineers. https://doi.org/10.2118/100674-PA
Griffin LG, Pearson CM, Strickland S, McChesney J, Wright CA, Mayer J, Weijers L (2013) The value proposition for applying advanced completion and stimulation designs to the Bakken Central Basin. Society of Petroleum Engineers. https://doi.org/10.2118/166479-MS
Gupta S, Fuehrer F, Jeyachandra BC (2014) Production forecasting in unconventional resources using data mining and time series analysis. In: SPE-171588-MS presented at the SPE/CSUR unconventional resources conference, 30 September–2 October, Calgary. https://doi.org/10.2118/171588-MS
https://www.investopedia.com/terms/e/estimatedultimaterecovery.asp
https://www.verdazo.com/blog/whatproductionperformancemeasureshouldiuse/
Izadi G, Settgast R, Moos D, Baba C, Jo H (2015) Fully 3D hydraulic fracture growth within multistage horizontal wells. International Society for Rock Mechanics and Rock Engineering
Jaripatke OA, Barman I, Ndungu JG, Schein GW, Flumerfelt RW, Burnett N, Bello HD, Barzola GJ (2018) Review of Permian completion designs and results. Soc Pet Eng. https://doi.org/10.2118/191560-MS
Johnston J, Guichard A (2015) New findings in drilling and wells using big data analytics. In: Offshore technology conference. https://doi.org/10.4043/26021-MS
Khuri AI, Mukhopadhyay S (2010) Response surface methodology. Wiley Interdiscip Rev Comput Stat 2(2):128–149
Lafollette R, Holcomb WD, Aragon J (2012a) Impact of completion system, staging, and hydraulic fracturing trends in the Bakken formation of the Eastern Williston Basin. Society of Petroleum Engineers. https://doi.org/10.2118/152530-MS
Lafollette R, Holcomb WD, Aragon J (2012b) Practical data mining: analysis of Barnett shale production results with emphasis on well completion and fracture stimulation. Society of Petroleum Engineers. https://doi.org/10.2118/152531-MS
Lehman LV, Jackson K, Noblett B (2016) Big data yields completion optimization: using drilling data to optimize completion efficiency in a low permeability formation. Soc Pet Eng. https://doi.org/10.2118/181273-MS
Liaw A, Weiner M (2018) “randomForest”: Breiman and Cutler’s random forests for classification and regression (R package version 4.6-14). R Foundation for Statistical Computing, Vienna
Male F, Marder MP, Browning J et al (2016) Marcellus wells’ ultimate production accurately predicted from initial production. In: SPE-180234-MS paper presented at the SPE low perm symposium held in Denver, Colorado, USA
Male F, Aiken C, Duncan IJ (2018) Using data analytics to assess the impact of technology change on production forecasting. Society of Petroleum Engineers. https://doi.org/10.2118/191536-MS
McClure MW (2012) Modeling and characterization of hydraulic stimulation and induced seismicity in geothermal and shale gas reservoirs. Ph.D. thesis. Stanford University, Stanford, California
Mohaghegh SD (2017) Datadriven reservoir modeling: topdown modeling (TDM): a paradigm shift in reservoir modeling, the art and science of building reservoir models based on field measurements. Society of Petroleum Engineers, Richardson
Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (2004) Applied linear statistical models, 5th edn. McGrawHill/Irwin, New York
Olaoye O, Zakhour N (2018) Utilizing a data analytics workflow to optimize completion practices while leveraging public data: a Permian Basin case study. Society of Petroleum Engineers. https://doi.org/10.2118/191416-MS
Olufemi A, Ambastha A (2017) Initial production rate estimation: impact on recovery and interplay of recovery process. In: SPE-189078-MS paper presented at the Nigeria annual international conference and exhibition held in Lagos, Nigeria
Pankaj P, Geetan S, MacDonald R, Shukla P, Sharma A, Menasria S et al (2018) Need for speed: data analytics coupled to reservoir characterization fast tracks well completion optimization. In: SPE Canada unconventional resources conference. Society of Petroleum Engineers, Alberta, Canada
Passman A (2018) Modeling and data analytics workflow from delineation to development: a case study utilizing limited data in the extensional deep dry Utica shale play of Pennsylvania. Society of Petroleum Engineers. https://doi.org/10.2118/191468-MS
Ran B, Kelkar M (2015) Fracture stages optimization in Bakken shale formation. In: Unconventional resources technology conference, no. July: 20–22. https://doi.org/10.15530/URTEC-2015-2154796
Rashid K, Bailey W, Couet B, Hudgens P, Heim R (2014) Sectorwide multistage hydraulic fracture optimization. Society of Exploration Geophysicists
Rodriguez A (2019) Inferences of two dynamic processes on recovery factor and well spacing for a shale oil reservoir. Society of Petroleum Engineers. https://doi.org/10.2118/197089-MS
Roussel NP, Sharma MM (2011) Strategies to minimize frac spacing and stimulate natural fractures in horizontal completions. In: SPE annual technical conference and exhibition, 30 October–2 November, Denver, Colorado, USA. https://doi.org/10.2118/146104-MS
Schuetter J, Mishra S, Zhong M, LaFollette R (2015) Data analytics for production optimization in unconventional reservoirs. In: Unconventional resources technology conference. https://doi.org/10.15530/URTEC-2015-2167005
Shelley RF, Lehman LV, Shah K (2011) Survey of more than 1,000 fracture stage database with net pressure in the Barnett shale. Part 2. Society of Petroleum Engineers. https://doi.org/10.2118/143330-MS
Stegent NA, Wagner A, Stringer C, Tompkins R, Smith N (2013) Engineering approach to optimize development strategy in the oil segment of the Eagle Ford shale: a case study. Society of Petroleum Engineers. https://doi.org/10.2118/158846-PA
Taylor RS, Barree R, Aguilera R (2011) Why not to base economic evaluations on initial production alone. In: SPE 148680 paper presented at the Canadian unconventional resources conference held in Calgary, Alberta, Canada
Veatch RW, King GE, Holditch SA (2017) Essentials of hydraulic fracturing: vertical and horizontal wellbores. PennWell Corporation, Tulsa
Wu K, Olson JE (2014) Simultaneous multifracture treatments: fully coupled fluid flow and fracture mechanics for horizontal wells. SPE J 20(2):334–346. https://doi.org/10.2118/167626-PA
Xia K, Fonseca E, Jones R, Zhan L (2016) A new perspective on multistage stimulation of multiple horizontal wells. In: The 50th US rock mechanics/geomechanics symposium, 26–29 June, Houston, Texas, USA
Yousef AA, Gentil PH, Jensen JL, Lake LW (2006) A capacitance model to infer interwell connectivity from production and injection rate fluctuations. SPE Reserv Eval Eng J 9(5):630–646. https://doi.org/10.2118/95322-PA
Acknowledgements
The authors thank Wade Baustian of Camino Natural Resources for his kind assistance and fruitful discussions, and Cimarex for providing the data used in this work.
Appendices
Appendix 1: Model coefficient details
See Table 5.
Appendix 2: Table of data analytics use in completion strategies in unconventional resources
See Table 6.
Appendix 3: Regression tree-based measures for ranking predictors

1. %IncMSE (percentage increase in mean squared error)
For each tree, the prediction error on the test (out-of-bag) portion of the well data is recorded as the MSE. The error is then recomputed after the values of the variable have been randomly rearranged (permuted). The increase in MSE caused by the permutation, averaged over all trees, measures the importance of the variable.
$${\text{MSE}} = {\text{mean}}\left( {\left( {y_{\text{measured}} - y_{\text{predicted}} } \right)^{2} } \right)$$
2. IncNodePurity (increase in node purity)
This is the total reduction in node impurity from splitting on the variable, averaged over all trees. For regression trees, node impurity is measured by the residual sum of squares.
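The permutation idea behind the first measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the randomForest implementation: a least-squares linear model stands in for the tree ensemble, a simple hold-out split replaces the out-of-bag samples, the data are synthetic, and the function names (`fit_linear`, `predict`, `pct_inc_mse`) are hypothetical.

```python
import random


def mse(y_measured, y_predicted):
    """MSE = mean((y_measured - y_predicted)^2)."""
    return sum((m - p) ** 2 for m, p in zip(y_measured, y_predicted)) / len(y_measured)


def fit_linear(X, y):
    """Least-squares fit of y = b0 + sum_j b_j * x_j via the normal equations.

    Assumption: a linear model is only a stand-in for the regression trees.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented system [A^T A | A^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda i: abs(A[i][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for i in range(p):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coef, X):
    """Evaluate the fitted linear model on each row of X."""
    return [coef[0] + sum(b * x for b, x in zip(coef[1:], r)) for r in X]


def pct_inc_mse(coef, X_test, y_test, n_repeats=20, seed=0):
    """Percentage increase in test MSE after permuting each predictor in turn."""
    rng = random.Random(seed)
    base = mse(y_test, predict(coef, X_test))
    scores = []
    for j in range(len(X_test[0])):
        increases = []
        for _ in range(n_repeats):
            col = [r[j] for r in X_test]
            rng.shuffle(col)  # randomly rearrange the values of variable j
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X_test, col)]
            increases.append(mse(y_test, predict(coef, X_perm)) - base)
        scores.append(100.0 * (sum(increases) / n_repeats) / base)
    return scores


# Synthetic example: only the first input drives the response.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
y = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

coef = fit_linear(X_train, y_train)
importance = pct_inc_mse(coef, X_test, y_test)
print(importance)  # the first entry should dominate the second
```

Permuting an informative variable breaks its link to the response and inflates the error, whereas permuting an uninformative one leaves the error nearly unchanged. The R package reports a scaled version of this quantity by default, so the numbers from this sketch are not directly comparable with its output.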
Appendix 4: Response surface model
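A second-order response surface model consistent with the coefficients named in this appendix can be sketched as follows. This is a reconstruction of the standard quadratic RSM form under stated assumptions, not necessarily the exact parameterization fitted in Table 5: \(y\) is assumed to denote the response (IP180 Oil), \(x_{1} , \ldots ,x_{k}\) the continuous inputs, and \(\ell = 0, \ldots ,L\) the levels of the categorical factor (reservoir type).

$$y = \sum\limits_{\ell = 0}^{L} {\beta_{0,\ell } x_{0,\ell } } + \sum\limits_{i = 1}^{k} {\beta_{i} x_{i} } + \sum\limits_{i = 1}^{k} {\sum\limits_{j = i}^{k} {\beta_{i,j} x_{i} x_{j} } } + \varepsilon$$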
where \(x_{0,\ell }\) is an indicator variable denoting the presence (\(x_{0,\ell } = 1\)) or absence (\(x_{0,\ell } = 0\)) of the factor at level ℓ in the model, \(\beta_{0,0} , \ldots ,\beta_{0,L} ,\;\beta_{1} , \ldots , \beta_{k} ,\; \beta_{1,1} ,\beta_{1,2} , \ldots ,\beta_{k,k}\) are unknown coefficients to be estimated, and ε is a random error term.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alzahabi, A., Trindade, A.A., Kamel, A. et al. Optimizing initial oil production of horizontal Wolfcamp wells utilizing data analytics. J Petrol Explor Prod Technol 10, 2357–2371 (2020). https://doi.org/10.1007/s13202-020-00926-0
Keywords
 Initial oil production
 Wolfcamp
 Horizontal wells
 Completion variables
 Data analytics
 Response surface methodology