First impression of how to do Computer Vision

3 min read · Feb 9, 2025

I started learning computer vision last week. This is a note about my first impression of it.

First of all

This article is written by a beginner who has just started learning Computer Vision. Please note that it may well contain mistakes or inaccuracies.

Background

At the beginning of this year, I decided to learn Computer Vision (CV) in 2025. Since then, I have researched how to approach it and tried a few learning materials such as books and online courses. Then I came across a YouTube video and thought it was a good way to get the big picture of CV.

So, last week, I started learning CV in earnest. This article is my first impression of it.

What I did

While watching the video, I tried to distill the pattern for training a model. Based on the video, the training process seems to consist of the steps below, which are repeated multiple times to complete one model.

Data preparation
Define architecture
Define loss function
Assess model
Data augmentation

Data preparation

In this step, the main goal seems to be understanding what data you have and standardizing it for programming. In the video, they mainly did two things.

Check raw data
Clean up data

Checking the raw data directly looked very important, because actually seeing the inputs seemed to give them insight into what to do with the data. Based on that insight, they cleaned up the data by removing noisy data, rescaling data, resizing images, and so on.
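To make this concrete for myself, here is a minimal sketch of what such a clean-up could look like with NumPy. This is not the code from the video. The rule for "noisy" data (a blank or constant image), the 224 x 224 target size, and the function names are all my own assumptions for illustration.

```python
import numpy as np


def is_usable(image: np.ndarray) -> bool:
    """Assumed noise rule: reject empty images and images with one constant value."""
    return image.size > 0 and float(image.max()) > float(image.min())


def resize_nearest(image: np.ndarray, height: int, width: int) -> np.ndarray:
    """Resize an H x W x C image with nearest-neighbour sampling."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]


def rescale(image: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0


def prepare(images: list[np.ndarray], size: int = 224) -> np.ndarray:
    """Drop unusable images, then resize and rescale the rest into one batch."""
    kept = [img for img in images if is_usable(img)]
    return np.stack([rescale(resize_nearest(img, size, size)) for img in kept])


# Synthetic stand-ins for raw images of different sizes.
rng = np.random.default_rng(0)
raw = [
    rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8),
    np.zeros((100, 100, 3), dtype=np.uint8),  # blank image, dropped as noise
    rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8),
]
batch = prepare(raw, size=224)
print(batch.shape, batch.dtype)
```

In a real project an image library would probably do the resizing with better interpolation. The point here is only the order of the steps: look at the data, drop what is unusable, then bring everything to one size and value range.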

Define architecture

This step looked very difficult to follow. At first, I couldn’t get a sense of why they added or chose each layer the way they did in the video.

Then I talked with a colleague, an ML engineer on our team, and he told me, “Most ML engineers don’t create the architecture by themselves. Normally, they choose it based on academic papers. Especially for computer vision, the trendy architectures are ResNet50 and ViT.”

After chatting with him, I decided to deprioritize this step for now. I’m more interested in the application of computer vision than in digging into the architecture itself.

So I’ll focus on building models with ResNet50 and ViT for a while. There are a lot of pre-trained models on Hugging Face, so my short-term goal is to be able to fine-tune those models.

Models - Hugging Face

huggingface.co
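As a starting point for that goal, loading a pre-trained model for fine-tuning might look roughly like the sketch below. It assumes the transformers library (with PyTorch) is installed. The helper names and the class names in the usage comment are hypothetical, and freezing the backbone first is just one common option, not something my colleague or the video prescribed.

```python
def build_label_maps(class_names):
    """Create the id <-> label dictionaries used by classification models."""
    id2label = {i: name for i, name in enumerate(class_names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(checkpoint, class_names, freeze_backbone=True):
    """Load a pre-trained image classifier with a fresh output layer."""
    # Imported here so the helper above also works without transformers.
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    id2label, label2id = build_label_maps(class_names)
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # allow replacing the original head
    )
    if freeze_backbone:
        # Assumption: train only the new classification head at first.
        for param in model.base_model.parameters():
            param.requires_grad = False
    return processor, model


# Hypothetical usage (downloads weights from the Hugging Face Hub):
# processor, model = load_for_finetuning("microsoft/resnet-50", ["cat", "dog"])
# processor, model = load_for_finetuning("google/vit-base-patch16-224", ["cat", "dog"])
```

The same function works for both ResNet50 and ViT checkpoints, which is convenient because I don't have to understand the architecture details to swap one for the other.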

Define loss function

This step is similar to the previous one. For now, I won’t design anything by myself. I will take the loss function from the academic papers I choose for the architecture.
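To get a feel for what a loss function actually computes, here is a small sketch of cross-entropy, which as far as I understand is a common choice for image classification. That it would be the loss used in the papers I end up following is my assumption, and the function name is my own.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy for logits of shape (N, C) and integer labels of shape (N,)."""
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for each sample.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# A model that is unsure (equal scores) versus one that is confidently right.
unsure = cross_entropy(np.zeros((4, 3)), np.array([0, 1, 2, 0]))
confident = cross_entropy(np.array([[10.0, 0.0, 0.0]]), np.array([0]))
print(unsure, confident)
```

The unsure model gets a loss equal to the natural log of the number of classes, while the confident and correct one gets a loss close to zero.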

Assess model

In this step, the key purpose seems to be deciding what to adjust next. So the main action seems to be visualizing the learning curves for the training data and the validation data and comparing them.

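What You Can Read from a Learning Curve (学習曲線から読み取れること) - Qiita

What is a learning curve? In machine learning, a learning curve is a graph that plots performance during training on the train data and on the validation data. In a typical learning curve graph, the horizontal axis is ep...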

qiita.com

This Japanese article is very simple and makes it easy to understand what to do next. I’ll follow its advice for a while.
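The comparison of training and validation curves could be sketched as a small rule of thumb like the one below. This is my own simplification, not a summary of the article. The thresholds `gap_tol` and `high_loss` are arbitrary illustrative values that would need tuning for a real task.

```python
def diagnose(train_loss, val_loss, gap_tol=0.1, high_loss=1.0):
    """Give a rough reading of the last points of two loss curves."""
    final_train, final_val = train_loss[-1], val_loss[-1]
    if final_train > high_loss:
        # The model has not even fit the training data yet.
        return "underfitting"
    if final_val - final_train > gap_tol:
        # Training loss is low, but validation loss lags far behind.
        return "overfitting"
    return "ok"


# Hypothetical loss values per epoch.
print(diagnose([2.0, 1.2, 0.6, 0.2, 0.05], [2.1, 1.4, 0.9, 0.8, 0.9]))
print(diagnose([2.3, 2.1, 2.0, 1.9], [2.3, 2.2, 2.1, 2.0]))
print(diagnose([2.0, 1.0, 0.5, 0.3], [2.1, 1.1, 0.6, 0.35]))
```

In practice I would still plot the curves and look at their whole shape, because a rule based only on the last epoch can miss things like a validation loss that started rising again.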

Data augmentation

This step is about data augmentation. My ML engineer colleague told me that it is one of the most important steps in computer vision. I’ll spend more time on data augmentation so that I can understand it and decide what to do by myself.

In the video, they often used the following augmentations.

Rotating images
Changing the saturation of images
Cropping images

I need to research more about what else I can do for data augmentation. However, those three techniques seem like a good starting point for me.
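To check my understanding of those three augmentations, I wrote a minimal NumPy sketch. It is not the code from the video. It assumes float RGB images with values in [0, 1], limits rotation to multiples of 90 degrees to keep things simple, and uses a plain channel average as the grayscale reference. All function names are my own.

```python
import numpy as np


def rotate90(image: np.ndarray, k: int) -> np.ndarray:
    """Rotate an H x W x C image by k * 90 degrees."""
    return np.rot90(image, k=k, axes=(0, 1))


def adjust_saturation(image: np.ndarray, factor: float) -> np.ndarray:
    """Blend an RGB image with its grayscale version.

    factor=0 gives grayscale, 1 keeps the image, values above 1 boost colour.
    """
    gray = image.mean(axis=2, keepdims=True)  # simple channel average
    return np.clip(gray + factor * (image - gray), 0.0, 1.0)


def random_crop(image: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Cut a random size x size window out of the image."""
    top = rng.integers(0, image.shape[0] - size + 1)
    left = rng.integers(0, image.shape[1] - size + 1)
    return image[top:top + size, left:left + size]


def augment(image: np.ndarray, rng: np.random.Generator, crop_size: int) -> np.ndarray:
    """Apply the three augmentations with randomly chosen settings."""
    image = rotate90(image, k=int(rng.integers(0, 4)))
    image = adjust_saturation(image, factor=float(rng.uniform(0.5, 1.5)))
    return random_crop(image, crop_size, rng)


rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))  # synthetic stand-in for a real photo
augmented = augment(image, rng, crop_size=24)
print(augmented.shape)
```

Because the settings are drawn at random on every call, the same source image produces a slightly different training example each time, which as I understand it is the whole point of augmentation.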

I have just started learning computer vision. It seems challenging, but I’m very excited to learn something completely new. I’m looking forward to seeing where I’ll be at the end of this year.

That’s it!