July 19, 2024

Importing Python Libraries

Before we begin our Artificial Neural Network Python tutorial, we first need to import the libraries and modules that we are going to require.

  • pandas: used to load data in from a CSV file
  • matplotlib: used to create graphs of the data

Then, from Scikit-Learn, we will be importing the following modules:

  • train_test_split from model_selection: used to split our data into training and validation datasets
  • MLPRegressor from neural_network: this is the Neural Network algorithm we will be using
  • StandardScaler from preprocessing: used to standardise our data so that they are similarly scaled
  • metrics: used to assess our model performance
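The imports listed above can be sketched as follows. The aliases pd and plt are common conventions rather than requirements.

```python
# Data loading and plotting
import pandas as pd
import matplotlib.pyplot as plt

# Scikit-Learn modules listed above
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
```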

Data Source

The data used within this tutorial is a subset of the Volve Dataset that was released by Equinor in 2018. Full details of the dataset, including the licence, can be found at the link below.

The Volve data licence is based on the CC BY 4.0 license. Full details of the licence agreement can be found here:

https://cdn.sanity.io/files/h61q9gi9/global/de6532f6134b9a953f6c41bac47a0c055a3712d3.pdf?equinor-hrs-terms-and-conditions-for-licence-to-data-volve.pdf

Using Pandas to Load the Well Log Data

Once the libraries have been imported, we can move on to importing our data. This is done by calling pd.read_csv() and passing in the location of our raw data file.

As the CSV file contains numerous columns, we can pass in a list of names to the usecols parameter, as we only want to use a small selection for this tutorial.
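A minimal sketch of this step is shown below. The column names (WELL, DEPTH, RHOB, GR, NPHI, PEF, DT) and the extra CALI column are assumptions made for illustration, and a small in-memory CSV stands in for the raw data file so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the raw CSV file (assumed layout); in practice, pass the
# path to your own file to pd.read_csv() instead of this buffer.
raw_csv = io.StringIO(
    "WELL,DEPTH,RHOB,GR,NPHI,PEF,DT,CALI\n"
    "15/9-F-1 B,3100.0,2.45,55.1,0.21,6.2,78.3,8.5\n"
    "15/9-F-1 B,3100.1,2.47,,0.20,6.1,77.9,8.5\n"
)

# Only the columns named in usecols are loaded; CALI is skipped.
columns = ["WELL", "DEPTH", "RHOB", "GR", "NPHI", "PEF", "DT"]
df = pd.read_csv(raw_csv, usecols=columns)
print(df.head())
```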

The workflow for preprocessing data prior to running it through a machine learning algorithm will vary. For this tutorial we are going to:

  • remove missing values
  • split data into training, validation and testing datasets
  • standardise the range of values for each measurement

Dropping Missing Values

Missing data is one of the most common issues we face when working with real world data. It can be missing for a variety of reasons including:

  • sensor errors
  • human error
  • processing errors

For this tutorial, we are going to remove the rows that contain missing values. This is known as listwise deletion and is the quickest way to deal with the missing values. However, doing this reduces the size of the available dataset, and the cause and extent of the missing values should be fully understood before carrying on with a machine learning model.

To drop the missing values, we can use the pandas dropna() method and assign the result back to the df (dataframe) variable.
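As a small illustration, assuming a dataframe with two made-up columns, listwise deletion looks like this:

```python
import numpy as np
import pandas as pd

# Tiny stand-in dataframe with gaps in both columns (illustrative values)
df = pd.DataFrame({
    "GR": [55.1, np.nan, 60.3],
    "DT": [78.3, 77.9, np.nan],
})

# Check how many values are missing per column before removing anything
print(df.isna().sum())

# Drop every row that contains at least one missing value
df = df.dropna()
print(len(df))
```

Checking the counts first gives a quick sense of how much data the deletion will cost.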

Splitting Data into Training, Testing and Validation Datasets

When carrying out machine learning, we often split our data into multiple subsets for training, validation and testing.

One thing to note is terminology for testing and validation datasets can vary between articles, websites and videos. The definitions used here are illustrated and described as follows:

Examples of splitting data into training, validation and testing subsets. Image by author and from McDonald, 2021.

Training Dataset: Data used for training the model

Validation Dataset: Data used for validating the model and tuning the parameters.

Testing Dataset: Data set aside to test the final model on unseen data. This subset allows us to understand how well our model can generalise to new data.

For this tutorial, our dataset contains three separate wells, so we will split one off (15/9-F-1 B) as our testing dataset. This is often referred to as a blind test well. The other two wells will be used to train, validate and tune our model.

Once these lists have been created, we can then create two new dataframes for the subsets. This is achieved by checking if the well(s) within the lists are within the main dataframe (df).
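One way to sketch this split is with the pandas isin() method. The two training well names below are hypothetical placeholders, as is the WELL column name; only the blind test well name comes from the text above.

```python
import pandas as pd

# Stand-in dataframe; WELL-A and WELL-B are hypothetical well names
df = pd.DataFrame({
    "WELL": ["WELL-A", "WELL-B", "15/9-F-1 B", "WELL-A"],
    "DT": [80.1, 75.4, 90.2, 82.7],
})

# Lists of wells for each subset
training_wells = ["WELL-A", "WELL-B"]
test_well = ["15/9-F-1 B"]

# Keep the rows whose well name appears in the corresponding list
train_val_df = df[df["WELL"].isin(training_wells)].copy()
test_df = df[df["WELL"].isin(test_well)].copy()
```

The .copy() calls are optional, but they avoid pandas warnings if new columns are added to the subsets later.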

Once we have run the above code, we can view the statistics of the subsets using the describe() method.

train_val_df.describe()
Dataframe statistics of the training and validation subset containing two wells' worth of data from the Volve field. Image by the author.

We can see that we have 21,6888 rows of data to train and validate our model with.

We can repeat this with the testing dataset:

test_df.describe()
Dataframe statistics of the testing subset containing one well's worth of data from the Volve field. Image by the author.

Creating the Training and Validation Subsets

The next step is to further subdivide our train_val_df into the training and validation subsets.

To do this we first split our data up into features we are going to use for training (X) and our target feature (y). We then call upon the train_test_split() function to split our data.

Within this function we pass in our X and y variables, along with the test_size parameter, which sets the fraction of the data held out from training. This is entered as a decimal value between 0 and 1.

In this case, we have used 0.2, which means the held-out subset (our validation dataset here) will be 20% and our training dataset will be 80% of train_val_df.
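The split can be sketched as below. The feature and target column names are assumptions, the data is randomly generated as a stand-in for train_val_df, and random_state is added here only to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for train_val_df (assumed column names)
rng = np.random.default_rng(0)
train_val_df = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=["RHOB", "GR", "NPHI", "PEF", "DT"],
)

# Features used for training (X) and the target feature (y)
X = train_val_df[["RHOB", "GR", "NPHI", "PEF"]]
y = train_val_df["DT"]

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_val))
```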

Standardising Values

When working with measurements that have different scales and ranges, it is important to standardise them. This helps to reduce model training times and stops features with large numeric ranges from dominating models that are sensitive to feature scale, such as those that rely on distance-based calculations.

Standardising the data essentially involves calculating the mean of a feature, subtracting it from each data point and then dividing by the feature’s standard deviation.

Within scikit-learn we can use the StandardScaler class to transform our data.

First we fit the StandardScaler to the training data and transform that data in a single step using the fit_transform method.

When it comes to the validation data, we don’t want to fit the StandardScaler again, as it has already been fitted to the training data and refitting would let information from the validation data influence the scaling. Instead we just want to apply it. This is done using the transform method.
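The sketch below uses a few made-up values to show both calls, and checks the result against the mean and standard deviation formula described above. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements with very different ranges (two features)
X_train = np.array([[2.0, 50.0], [2.4, 70.0], [2.6, 90.0]])
X_val = np.array([[2.5, 60.0]])

scaler = StandardScaler()

# Fit to the training data and transform it in one step
X_train_scaled = scaler.fit_transform(X_train)

# Apply the already-fitted scaler to the validation data
X_val_scaled = scaler.transform(X_val)

# Same calculation by hand: subtract the training mean and divide by
# the training standard deviation of each feature
manual = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_val_scaled, manual))
```

Note that the validation data is scaled with the mean and standard deviation of the training data, not its own.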

Training the Model

To begin the Neural Network training process, we first have to create an instance of the MLPRegressor that we imported at the start.

When we create the MLPRegressor we can specify numerous parameters. You can find out more about these in the scikit-learn documentation. For this tutorial we will be using:

  • hidden_layer_sizes: Controls the architecture of the network.
  • activation: Hidden layer activation function.
  • random_state: Controls the random number generation for the weights and biases. When an integer is used, the model produces reproducible results.
  • max_iter: Controls the maximum number of iterations that the model will go to if convergence is not met beforehand.

After initialising the model, we can train it with the training data using fit(), and then use the predict() method to make predictions.
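A minimal sketch of these steps follows. The parameter values are illustrative choices rather than tuned settings, and the training data is synthetic so that the example runs on its own.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the scaled training and validation features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
X_val = rng.normal(size=(50, 4))

# Two hidden layers of 64 neurons each (illustrative architecture)
model = MLPRegressor(
    hidden_layer_sizes=(64, 64),
    activation="relu",
    random_state=42,
    max_iter=2000,
)

# Train on the training data, then predict for the validation features
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(y_pred[:5])
```

Each entry in hidden_layer_sizes adds one hidden layer with that many neurons, so changing the tuple is the simplest way to experiment with the architecture.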

Validating Model Results

Now that our model has been trained, we can begin evaluating the model’s performance on our validation dataset.

It is at this stage we can tweak our model and optimise it.

There are multiple statistical measurements we can use to measure how well our model has performed. For this tutorial we will be using the following three metrics:

Mean Absolute Error (MAE): Provides a measure of the absolute differences between the predicted value and the actual value.

Root Mean Square Error (RMSE): Indicates the magnitude of the prediction error.

To calculate RMSE using scikit-learn we first need to calculate the mean squared error and then take the square root of it, which can be achieved by raising the mse to the power of 0.5.

Coefficient of Determination (R2): Indicates the proportion of the variance in the actual values that is explained by the model's predictions. The closer the value is to 1, the better the predictions match the actual values.

We can calculate the above metrics as follows:
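The sketch below uses four made-up actual and predicted values in place of the real validation results, so the numbers it prints are not the tutorial's results.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values for illustration only
y_val = np.array([80.0, 75.0, 90.0, 85.0])
y_pred = np.array([78.0, 77.0, 91.0, 85.0])

mae = metrics.mean_absolute_error(y_val, y_pred)

# RMSE: take the mean squared error, then raise it to the power of 0.5
mse = metrics.mean_squared_error(y_val, y_pred)
rmse = mse ** 0.5

r2 = metrics.r2_score(y_val, y_pred)

print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")
```

All three functions take the actual values first and the predicted values second.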

When we execute the above code we get the following results back. Based on these numbers we can determine if our model is performing well or if it needs tweaking.

Metric values for validation data prediction. Image by the author.

Going Beyond Metrics

Simple metrics like the above are a nice way to see how a model has performed, but you should always check the actual data.

One way to do this is to use a scatter plot with the validation data on the x-axis and the predicted data on the y-axis. To help with the visualisation, we can add a 1 to 1 relationship line.

The code to do this is as follows.
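A sketch of one way to build this plot is given below. The helper name plot_actual_vs_predicted, the axis labels and the synthetic values are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(actual, predicted):
    """Scatter plot of actual vs predicted values with a 1 to 1 line."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(actual, predicted, s=10, alpha=0.5)

    # Use the same limits on both axes so the 1 to 1 line is a true diagonal
    lower = min(np.min(actual), np.min(predicted))
    upper = max(np.max(actual), np.max(predicted))
    ax.plot([lower, upper], [lower, upper], color="black", linestyle="--",
            label="1 to 1 line")
    ax.set_xlim(lower, upper)
    ax.set_ylim(lower, upper)

    ax.set_xlabel("Actual DT")
    ax.set_ylabel("Predicted DT")
    ax.legend()
    return fig, ax

# Synthetic stand-in for the validation target and model predictions
rng = np.random.default_rng(0)
y_val = rng.uniform(60, 120, size=100)
y_pred = y_val + rng.normal(scale=3, size=100)

fig, ax = plot_actual_vs_predicted(y_val, y_pred)
```

Calling plt.show() afterwards displays the figure. Points that fall on the dashed line are perfect predictions, so the spread around it gives a visual sense of the error.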

When we run the code above, we get back the following plot which shows us we have a reasonably good trend between the actual measurement and the predicted result.

Actual acoustic compressional slowness values versus predicted values for the Volve dataset. Image by the author.

Once we have finalised our model, we can finally test it out on the data we set aside for blind testing.

First we will create the features we will use for applying the model. Then we will apply the StandardScaler we fitted earlier in order to standardise our values.

And next, we will assign a new column to our dataframe for our predicted data.
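These steps can be sketched as follows. The column names, the TEST_DT name for the new column, the model settings and the synthetic data are all assumptions for illustration; the scaler and model are refitted here only so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

features = ["RHOB", "GR", "NPHI", "PEF"]  # assumed feature columns
rng = np.random.default_rng(0)

def synthetic_well(n_rows):
    """Build a random stand-in for one well's worth of data."""
    data = pd.DataFrame(rng.normal(size=(n_rows, 4)), columns=features)
    data["DT"] = data[features].sum(axis=1)
    return data

train_df = synthetic_well(200)
test_df = synthetic_well(50)

# Stand-ins for the scaler and model fitted earlier on the training data
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df[features])
model = MLPRegressor(hidden_layer_sizes=(32,), random_state=42, max_iter=1000)
model.fit(X_train, train_df["DT"])

# Blind test: only transform (never refit) the test well features
X_test = scaler.transform(test_df[features])

# Store the predictions in a new column alongside the actual measurement
test_df["TEST_DT"] = model.predict(X_test)
print(test_df[["DT", "TEST_DT"]].head())
```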

Once the prediction has been made, we can view the same scatter plot as above.

Within petrophysics and geoscience we often look at data on log plots, where measurements are plotted against depth. We can create a simple log plot of our predicted result and actual measurement within the test well like so.
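A minimal sketch of such a plot is shown below, assuming DEPTH, DT and TEST_DT columns and using synthetic values. It draws depth increasing downwards, which is a common log plot convention; the orientation of the original figure may differ.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_log(df, actual_col, predicted_col, depth_col="DEPTH"):
    """Plot actual and predicted curves against depth on one track."""
    fig, ax = plt.subplots(figsize=(4, 10))
    ax.plot(df[actual_col], df[depth_col], label="Actual")
    ax.plot(df[predicted_col], df[depth_col], label="Predicted")

    # Depth increases downwards on a log plot
    ax.invert_yaxis()
    ax.set_xlabel("DT")
    ax.set_ylabel("Depth (m)")
    ax.legend()
    ax.grid(True)
    return fig, ax

# Synthetic stand-in for the test well (assumed column names)
rng = np.random.default_rng(0)
depth = np.arange(3000.0, 3300.0, 1.0)
actual = 80 + 10 * np.sin(depth / 20)
test_df = pd.DataFrame({
    "DEPTH": depth,
    "DT": actual,
    "TEST_DT": actual + rng.normal(scale=2, size=depth.size),
})

fig, ax = plot_log(test_df, "DT", "TEST_DT")
```

Calling plt.show() afterwards displays the figure.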

This returns the following plot.

We can see our model has performed well on the unseen data. However, there are a few areas where the result does not match the true measurement, notably between 3100 and 3250 m.

This tells us that our model may not have enough training data covering these intervals, and as a result we may need to acquire more data if it is available.

Line plot (log plot) of our predicted measurement against our actual measurement. Image by the author.

Artificial Neural Networks are a popular machine learning technique. Within this tutorial we have covered a very quick and easy way to implement a model for predicting acoustic compressional slowness that yields reasonable results. We have also seen how to validate and test our model, which is an important part of the process.

There are many other ways to build a neural network within Python, such as TensorFlow and Keras. However, Scikit-learn provides a quick and easy-to-use tool to get started right away.

How to Create a Simple Neural Network Model in Python Republished from Source https://towardsdatascience.com/how-to-create-a-simple-neural-network-model-in-python-70697967738f?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed
