Training Your Object Recognition Model from Scratch

Nikita Beresnev, iOS Engineer
Engineering | Mar 3, 2020 | 15 min read

Machine learning has been around for a while now, but only recently has it begun to accelerate. Why? The reason is pretty simple. People have managed to radically improve the computational power of everyday devices in a relatively short period of time.

The first consumer PC was introduced in 1975. That’s a mere 45 years ago, and look where we are now. The tech we wear on our wrists or carry around in the pockets of our jeans is much more powerful than the computers available to the public a decade ago.

People are lazy by nature. We tend to invent things we don’t actually need for the sake of saving a few extra minutes or hours of our daily routine. We now have endless power and data, and we continue to automate any task we can. Funnily enough, our laziness yields pretty ingenious results.

For a while, image recognition, text generation or even playing games were believed to be something only humans could do. But we’ve managed to use modern tools to help shake ourselves of full responsibility in these cases, too. Machine learning has already been successful in doing “human” activities — like detecting cancer, preventing car crashes, etc. Machines keep surpassing limitations.

And yet, no matter how powerful they get, all machines understand are 0s and 1s. They will always need a bit of extra help from us. With this article, I want to show you exactly how to provide that help.


In the Apple ecosystem, the obvious choice is a Core ML model. However, there are multiple paths you can take to create one. For now, let's focus on solutions that do not require much machine learning expertise, that simplify development with out-of-the-box tooling offering decent performance on typical machine learning tasks, and that work best within the Apple ecosystem. Good examples are Turi Create and Create ML. Both come with pros and cons.

Turi Create

Pros:

  • More flexible (not tied to a UI)
  • Supports more use cases (one-shot object detection, etc.)
  • Not tied to macOS (also supports Windows and Linux)
  • Supports various annotation formats

Cons:

  • Cumbersome installation process

Create ML

Pros:

  • Has a pretty simple UI
  • Comes with the latest Xcode

Cons:

  • Not as flexible as Turi Create (more scenarios are supported by Turi Create, users can see the bounding boxes in the previews, etc.)
  • Currently, not much info is available

In this article, we’re going to concentrate mainly on the Turi Create solution and will briefly touch on Create ML.



Virtual Environment

We will be using virtualenv - a tool that’ll help us create an isolated Python environment, where we’ll be training our model. You can install virtualenv by running the following command in your terminal.

pip install virtualenv

To verify the currently installed version, run:

virtualenv --version
~> 16.5.0

If you’re looking for more information about virtual environments, the official virtualenv documentation is a good place to start.

Jupyter Notebook

In order to create a virtual environment, navigate to the location where you would like to store your project using the terminal and run:

virtualenv TuriSample

The created virtual environment TuriSample contains three directories (bin, include and lib), which contain your dependencies, a reference to the Python version used and other files you will need when running the environment.

To activate the environment, run the following command:

source TuriSample/bin/activate
~> (TuriSample) Nick Beresnev:Untracked strv$

To install Turi Create, run the following command from your environment:

pip install turicreate

When you’ve finished installing Turi Create, run the following command in order to install Jupyter Notebook:

pip install jupyter

After the installation finishes, you can open Jupyter by executing the following command:

jupyter notebook

This should open a window inside your browser. It will look like this:


The 6-Step Recipe

The 6-step recipe is a recommended set of steps for creating your own trained models. It consists of the following:

1. Task understanding

2. Data collection

3. Data annotation

4. Model training

5. Model evaluation

6. Model deployment

Task understanding

We have to clearly understand the problem we’re trying to solve, what kind of data set we require in order to proceed further, what our model is going to be responsible for, etc.

In this tutorial, we are going to work on an object detection task. The trained model will be able to simultaneously classify (establish the “what”) and localize (establish the “where”) mango, pineapple, banana and dragonfruit object instances in an image.

Data collection

In order to train our image recognition model, we need a representative data set for our problem. Typically, it is good to have many different examples. Variety in your data is the key to good general performance. Use many photos of your object instances in different contexts: from a variety of angles and scales, with various lighting conditions and with a mixture of random objects and backgrounds. It is also essential to have enough representatives for each object instance.


We have prepared some training data for you. You can find it in our repository.

Data annotation

A data set without respective annotations is absolutely useless, so we have to annotate every individual picture in our data set. You can use various formats but, in this tutorial, we will annotate our images in JSON or CSV format.

While annotation can be done manually, no one wants to write thousands of lines of JSON or CSV by hand. Instead, you can use tools like MakeML on macOS or IBM Cloud Annotations (if you decide to use the latter, check out this tutorial).


This is what an annotated image looks like:

As you can see, each annotation includes the path to the image, the coordinates of the bounding box and a label (a label is like a tagline; since the machine still doesn’t know what a mango is, the label helps it with categorization).

Coordinates might not be as straightforward as you think. Height and width are pretty clear: they are the dimensions of the box surrounding the object. The X and Y coordinates are the center of the bounding rectangle, and all coordinates are in pixels.

This is shown in the image below:

You can find annotations of the images from the previous steps in a CSV file we’ve prepared for you.
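To make the format concrete, here is a minimal sketch of one image's annotations in the layout Turi Create uses (it matches the SFrame output you'll see later), plus a small helper for converting the top-left-origin boxes that many annotation tools export into the center-based coordinates described above. The specific labels and numbers are made up for illustration:

```python
# One image's annotations in the Turi Create object-detection format:
# x/y are the CENTER of the bounding box, all values in pixels.
annotation = [
    {"label": "mango",
     "coordinates": {"x": 80, "y": 95, "width": 100, "height": 120}},
    {"label": "banana",
     "coordinates": {"x": 260, "y": 140, "width": 90, "height": 60}},
]

def top_left_to_center(x, y, width, height):
    """Convert a top-left-origin box (common in annotation tools)
    to the center-based box Turi Create expects."""
    return {"x": x + width / 2.0, "y": y + height / 2.0,
            "width": width, "height": height}

# A 100x120 box whose top-left corner is at (30, 35)
# has its center at (80, 95):
print(top_left_to_center(30, 35, 100, 120))
```

If your annotation tool exports top-left coordinates, run every box through a conversion like this before training; otherwise all predictions will be shifted by half a box.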

Model training

Once we have everything ready, we can proceed with training the model. Get ready for this to take some time. How long depends on how many images you have prepared, how many iterations you have chosen (the number of times your machine will repeatedly go through the provided data and improve the model’s accuracy), the machine learning algorithm you are using and so on. In our case, we will train a Convolutional Neural Network (CNN), the deep learning technique most commonly used for this task. Setting up the right parameters for the algorithm comes with many challenges but, luckily, the solution we are using provides sensible defaults that should work in our setting.

When it comes to the number of iterations, I would personally suggest 1,200. Choosing the wrong number of iterations may lead to bad performance on unseen data (the test set). When experimenting with the provided data and algorithm, iterations in the low hundreds seem too few for the model to be anywhere near as precise as we want it to be, while too many iterations let the model memorize our training data. Techniques (other than experimentation) for selecting a good number of iterations exist, but we are not going to dive into them here. Another thing to keep in mind is that our algorithm requires as much input data as possible to perform well.

It is typically advised for most use cases to split the data set into two parts before training a machine learning model. One part of the data is used for training purposes and the second part for the evaluation of the trained model. It is always a good idea to have a test set that can help you estimate the performance of your model on unseen data. You can think of this as studying for an exam using exams from previous years (the training) and then actually taking an exam (the testing) to see how well you learned. When it comes to splitting the data, it’s most common to use 80% of it for training purposes and the rest for testing purposes.

Macs with a dedicated GPU have proven to be the preferred machines in our setting. For example, a MacBook Pro with an iGPU (integrated GPU) could take almost 14 hours, while an iMac with a Radeon RX560 finishes within 30 minutes. Dedicated GPUs perform the necessary computations much faster than CPUs when training Convolutional Neural Networks.

To get started, open the terminal.

  • Navigate to the virtual environment you created earlier (example: cd ~/Documents/TuriSample)
  • Activate the virtual environment ( source bin/activate )
  • Open Jupyter Notebook by executing the following command:  jupyter notebook

You will see the following:

  • Click on New -> Choose Python version

This is what you should see:

Now you can continue with the “Working with Data” section.

Working with Data

Start by importing the Turi Create package: run import turicreate as tc in the field, followed by shift + enter to execute. All code snippets must be executed one by one.

The TuriSample directory should now contain the following items:

- annotations.csv

- Images

Loading Images

To be able to work with images, you have to load them into an SFrame, which is nothing more than a container to hold your assets.

~> images = tc.image_analysis.load_images("TuriSample/Images")

Then type images and press shift+enter:

~> images

And you should get an output similar to this:

path    image
Images/fruit106.JPG ...    Height: 3024 Width: 4032
Images/fruit111.JPG ...    Height: 3024 Width: 4032
Images/fruit112.JPG ...    Height: 3024 Width: 4032
Images/fruit114.JPG ...    Height: 3024 Width: 4032
Images/fruit118.JPG ...    Height: 3024 Width: 4032
Images/fruit123.JPG ...    Height: 3024 Width: 4032
Images/fruit127.JPG ...    Height: 3024 Width: 4032
Images/fruit128.JPG ...    Height: 3024 Width: 4032
Images/fruit129.JPG ...    Height: 3024 Width: 4032
Images/fruit130.JPG ...    Height: 3024 Width: 4032
[41 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

You can also run images.explore() to get a more interactive overview of your images.

Loading Annotations

To load annotations, run the following command:

~> annotations = tc.SFrame.read_csv("TuriSample/annotations.csv")

You should get an output similar to this:

Finished parsing file /Users/nickberesnev/Documents/TuriSample/annotations.csv
Parsing completed. Parsed 100 lines in 0.043964 secs.
Inferred types from first 100 line(s) of file as 
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument

Finished parsing file /Users/nickberesnev/Documents/TuriSample/annotations.csv
Parsing completed. Parsed 292 lines in 0.017157 secs.

You can list annotations by typing annotations , which will only show the first 10 items; the output is therefore the same as that of annotations.head(10).

You can also access each annotation similarly to how you would with arrays—by using index annotations[index].

Another option is using annotations.explore() to get a more interactive preview.

Joining SFrames

To be able to train your model, you have to join both SFrames (annotations and images).

Run the following commands:

~> joined_sframe = images.join(annotations)
~> joined_sframe

You should see an output similar to this:

path    image    annotation
fruit1.jpg ...    Height: 173 Width: 292    [{'coordinates': {'y':
95, 'x': 80, 'height': ...
fruit2.jpg ...    Height: 300 Width: 400    [{'coordinates': {'y':
150, 'x': 180, 'height': ...
[63 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In case you see 0 rows, the paths probably don't match. Check that the path column in annotations.csv matches the paths of the loaded images.


It is good practice to split the joined_sframe into training and testing subsets so you can test your model once it's trained. We will divide the data into 80% for training and 20% for testing, but the exact split is up to you.

Type the following:

~> training_sframe, testing_sframe = joined_sframe.random_split(0.8)
~> training_sframe

You should see an output similar to this:

path    image    annotation
Train Images/fruit2.jpg ...    Height: 300 Width: 400    [{'coordinates': {'y':
150, 'x': 180, 'height': ...
Train Images/fruit3.jpg ...    Height: 164 Width: 308    [{'coordinates': {'y':
82, 'x': 205, 'height': ...
[53 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Type the following:

~> testing_sframe

You should see an output similar to this:

Train Images/hand155.jpg ...    Height: 3024 Width: 4032    [{'coordinates': {'y':
1597, 'x': 2019, ...
[10 rows x 3 columns]

Training the Model

To train a sample model, we will use the object_detector toolkit and the previously created training_sframe , with 50 iterations.

This model will not have high accuracy. To get better accuracy, you would need to train a model with at least 300 iterations, possibly on more pictures. Run the following command in the field and go grab a snack, ’cause this might take some time.

~> model_50 = tc.object_detector.create(training_sframe, max_iterations=50)

After executing the command, you should see an output similar to this:

Using 'image' as feature column
Using 'annotation' as annotations column
Setting 'batch_size' to 32
Using CPU to create model
| Iteration    | Loss         | Elapsed Time |
| 1            | 15.497       | 8.3          |
| 2            | 15.292       | 20.7         |
| 3            | 15.029       | 39.3         |
| 4            | 15.052       | 51.9         |
| 5            | 15.267       | 65.2         |
| ...          | ...          | ...          |
| 46           | 11.389       | 617.4        |
| 47           | 11.421       | 630.8        |
| 48           | 11.349       | 644.3        |
| 49           | 11.315       | 660.5        |
| 50           | 11.319       | 674.5        |

Comparing models

Type the following command in the field to get information about the currently trained model:

~> model_50

After executing the command, you should see an output similar to this:

Class                                    : ObjectDetector

Model                                    : darknet-yolo
Number of classes                        : 2
Non-maximum suppression threshold        : 0.45
Input image shape                        : (3, 416, 416)

Training summary
Training time                            : 11m 14s
Training epochs                          : 31
Training iterations                      : 50
Number of examples (images)              : 51
Number of bounding boxes (instances)     : 482
Final loss (specific to model)           : 11.319

Alternative route: Create ML

Create ML is pretty straightforward. Make sure that the annotations are contained within the image folder.

The format should be the following:

You can find annotations of the images from the previous steps in a JSON file we’ve prepared for you.

  • Open the Create ML app, which is bundled with Xcode 11 (it can be opened by searching for it via Spotlight, or from Xcode via Xcode -> Open Developer Tool -> Create ML).
  • Create a new project from the offered templates (in our case, object detection).
  • Drag and drop the folder with the images into “Training Data”.
  • Set the number of iterations and click “Train”.
  • Once done, drag the produced model out of the “Output” section and drop it anywhere you want it to be saved.

Model evaluation

Before putting a trained model into production, we want to evaluate its behavior on unseen data. Understanding what to expect from our model in a setting similar to a production environment should be our number one priority before any deployment. One way to learn something about our model is to measure its performance using a metric (or metrics) that is easy to measure and interpret.

For identification tasks, we usually use accuracy. A score of 80% accuracy means we can expect the model to make the correct prediction in 8 out of 10 images. This metric is straightforward to measure and interpret (we can even use it to compare models). However, for our task, we will report something less direct: mean average precision (mAP), a value between 0 and 1 (or 0% and 100%), with higher being better. We choose this metric instead of accuracy because it can take a prediction consisting of an object type and a bounding box and compare it with the actual annotation.

Luckily, Turi Create already contains a function to evaluate the performance of the model: evaluate. It computes mAP with the intersection-over-union threshold set at 50% and returns mean_average_precision for the whole model, as well as average_precision for each of your classes (banana, mango, etc.). Let's call the function on the testing data so we can get a performance estimate on data unseen during training.

metrics_50 = model_50.evaluate(testing_sframe)

After executing the command, you should see an output similar to this:

Predicting 1/9
Predicting 9/9
{'average_precision_50': {'banana': 0.1721996964907307,
  'mango': 0.2069230769230769},
 'mean_average_precision_50': 0.1895613867069038}
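The 50% threshold in these metrics refers to intersection over union (IoU): a predicted bounding box only counts as correct if it overlaps the annotated box by at least 50%. A minimal sketch of how IoU is computed, using the center-based box format from the annotation step (this helper is illustrative, not Turi Create's internal implementation):

```python
def iou(box_a, box_b):
    """Intersection over union of two center-based boxes
    in the {'x', 'y', 'width', 'height'} annotation format."""
    def to_corners(b):
        return (b["x"] - b["width"] / 2.0, b["y"] - b["height"] / 2.0,
                b["x"] + b["width"] / 2.0, b["y"] + b["height"] / 2.0)

    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)

    # Overlapping region (zero if the boxes don't intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (box_a["width"] * box_a["height"]
             + box_b["width"] * box_b["height"] - inter)
    return inter / union if union > 0 else 0.0

# Two identical boxes overlap perfectly:
same = {"x": 50, "y": 50, "width": 100, "height": 100}
print(iou(same, same))     # 1.0

# A box shifted by half its width overlaps by only a third:
shifted = {"x": 100, "y": 50, "width": 100, "height": 100}
print(iou(same, shifted))  # ~0.33, below the 50% threshold
```

A prediction like the shifted box above would be rejected at the 50% threshold even though the label might be correct.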

Evaluating a model trained with 800 iterations yields the following results:

metrics_800 = model_800.evaluate(testing_sframe)

Predicting 1/9
Predicting 9/9
{'average_precision_50': {'banana': 0.5929196960546802,
  'mango': 0.7964529717866119},
 'mean_average_precision_50': 0.694686333920646}

average_precision_50 - average precision per class, with the intersection-over-union threshold at 50%

mean_average_precision_50 - the mean of average_precision_50 over all classes
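As a quick sanity check, the mean metric is simply the arithmetic mean of the per-class values, which we can verify against the 800-iteration output above:

```python
# Per-class average precision, copied from the model_800 evaluation output.
average_precision_50 = {
    "banana": 0.5929196960546802,
    "mango": 0.7964529717866119,
}

# The mean over all classes reproduces mean_average_precision_50
# from the same output (~0.6947).
mean_ap = sum(average_precision_50.values()) / len(average_precision_50)
print(round(mean_ap, 6))  # 0.694686
```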

After training with only 50 iterations, we got a model whose average precision for bananas was only about 17% at the 50% threshold (meaning a prediction only counted as correct when its predicted area overlapped at least 50% of the annotated bounding box). Results for the mango are similar. As you can see, after training with significantly more iterations, our model has become much more precise.

Model deployment

This is the final step of our journey. After we have learned how to collect the data, annotate it, train the model and evaluate it, we can finally export a model that can be used in our app.

In order to export our model, we need to type model_50.export_coreml("custom_model.mlmodel").

The model will be saved one level above the folder in which we were working. The next step is to open the provided Sample Project.


The app we prepared for you is pretty simple. It shows the camera stream, where bounding boxes are drawn in real time when the phone detects one of the following fruits: apple, banana, dragon fruit, pineapple or mango. Each bounding box contains a label with the confidence level next to it. A segmented control is also present on the screen, so the user can switch between two models: the one provided by us and the one you train yourself. The model provided by us was trained with 290 images and 2,000 iterations.

This is an example of the running app. The confidence level is constantly updated, as is the bounding box (both depend on many factors: lighting conditions, the angle, fruit color, etc.).

Our app contains three main parts: TrackItemType.swift, VisionService.swift and MLModelService.swift.

MLModelService.swift is responsible for loading and switching between the models.

TrackItemType.swift defines the classes/labels supported by our model.

And VisionService.swift is responsible for observing changes in the video stream and drawing the rectangles when a tracked object is detected.

As you can see, the project is already prepared for you. The only thing you have to do is import your custom model into Models/Trained Models under the name custom_model.mlmodel and build the app. Simple as that.

In order to copy the model to the project:

  • Open the Project navigator in Xcode
  • Drag & drop the custom_model.mlmodel file to the following location: TuriCreate/Models/Trained Models
  • Check the option “Copy items if needed”

After you’re done, don’t forget to change your team in the Signing & Capabilities section of the project. Connect an iPhone and press Run. (Small note: the app will crash if you try to run it in the Simulator, because we need camera capabilities for the app to work.)

That’s it! Feel free to train your own model and run the example app. But please note that our model was trained with a limited set of training images, therefore it might not be precise when using other types of fruit or even the same fruit but with different qualities (for example, red vs. green apple).

The app and the dataset can be found in our repository.

Big thank you to Jaime Lopez and our new Head of Machine Learning, Jan Maly, for their input. Having Jan join the STRV team was a long time coming. Businesses from all industries are exploring ways to utilize this new world of advanced data and analytics, which is exciting for both our clients and us — the engineers.

We’re all striving to continually be improving the customer experience, and to see businesses knock it out of the park. Today, that means accumulating boundless data and insight fast, and being smart in how we work with it. To those in the tech industry, it’s obvious that machine learning and AI are the future. And because of the endless potential to improve what we do and how we do it, we’re absolutely on board.

