I started learning computer vision last week. This is a note about my first impressions of it.
First of all
This article is written by a beginner who has just started learning computer vision. Please note that it may well contain mistakes or inaccuracies.
Background
At the beginning of this year, I decided to make learning computer vision (CV) my goal for 2025. Since then, I have researched how to approach it and tried a few learning materials such as books and online courses. Then I came across a YouTube video and thought it was a good way to get the big picture of CV.
So, last week, I started learning CV in earnest. This article describes my first impressions.
What I did
While watching the video, I tried to boil model training down to a simple pattern. Based on the video, the process seems to consist of the following steps, which are repeated multiple times to complete one model.
- Data preparation
- Define architecture
- Define algorithm of loss
- Assess model
- Data augmentation
Data preparation
In this step, the main goal seems to be understanding the data you have and standardizing it so that it can be processed by code. In the video, they mainly did two things.
- Check raw data
- Clean up data
Checking the raw data directly looked very important, because actually looking at the images seemed to give them insight into what to do with the data. Based on that insight, they cleaned up the data, for example by removing noisy samples, rescaling values, and resizing images.
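The cleanup described above can be sketched in a few lines. This is only a minimal illustration under my own assumptions, not what the video did: images are tiny nested lists of grayscale values (0-255), "noisy" data just means empty or ragged samples, and all function names are hypothetical. In practice, an image library would handle the loading and resizing.

```python
# Minimal sketch of data cleanup. Images are nested lists of grayscale
# pixel values (0-255). All names here are hypothetical illustrations.

def is_valid(image):
    """Reject empty or ragged images (one very simple notion of noisy data)."""
    return bool(image) and bool(image[0]) and all(
        len(row) == len(image[0]) for row in image
    )

def resize_nearest(image, new_h, new_w):
    """Resize with nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def rescale(image):
    """Map 0-255 pixel values into the 0.0-1.0 range."""
    return [[px / 255.0 for px in row] for row in image]

def prepare(images, size=(2, 2)):
    """Drop invalid samples, then resize and rescale the rest."""
    cleaned = [img for img in images if is_valid(img)]
    return [rescale(resize_nearest(img, *size)) for img in cleaned]

raw = [
    [[0, 0, 255, 255],
     [0, 0, 255, 255],
     [255, 255, 0, 0],
     [255, 255, 0, 0]],
    [],                  # empty sample, dropped
    [[51, 102], [153]],  # ragged sample, dropped
]
prepared = prepare(raw)
print(prepared)  # [[[0.0, 1.0], [1.0, 0.0]]]
```

After this step every remaining image has the same size and the same value range, which is what the later steps rely on.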
Define architecture
This step looked very difficult to follow. At first, I couldn’t understand why they added or chose each layer the way they did in the video.
Then I talked with a colleague, who is an ML engineer on our team, and he told me, “Most ML engineers don’t design the architecture themselves. Normally, they pick one from academic papers. For computer vision in particular, the trendy architectures are ResNet50 and ViT.”
After chatting with him, I decided to deprioritize this step for now. I’m more interested in applying computer vision than in digging into the architecture itself.
So I’ll focus on building models with ResNet50 and ViT for a while. There are a lot of pre-trained models on Hugging Face, so my short-term goal is to be able to fine-tune those models.
Define algorithm of loss
This step is similar to the previous one. For now, I won’t design anything myself. I’ll take the loss function from the same academic papers I chose for the architecture.
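To get a feel for what a loss function does, here is a small sketch of cross-entropy, a common choice for image classification. This is my own assumption for illustration: the video and the papers may use a different loss, and the function names below are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(softmax(logits)[target_index])

# The model strongly favours class 0 in both cases below.
confident_right = cross_entropy([4.0, 0.0, 0.0], target_index=0)
confident_wrong = cross_entropy([4.0, 0.0, 0.0], target_index=1)
# No preference at all among three classes.
uniform = cross_entropy([0.0, 0.0, 0.0], target_index=2)

print(confident_right < uniform < confident_wrong)  # True
```

The loss is small when the model is confident and correct, and large when it is confident and wrong, which is the signal training tries to minimize.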
Assess model
In this step, the key purpose seems to be deciding what to adjust next. The main action is to visualize the learning curves for the training data and the validation data and compare them.
This Japanese article explains very simply how to decide what to do next. I’ll follow its advice for a while.
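The comparison of the two curves can be reduced to a rough rule of thumb. The sketch below is my own simplification, not taken from the article or the video: the function name and both thresholds are hypothetical, and real decisions would look at the whole curve rather than only the last epoch.

```python
def diagnose(train_losses, val_losses, gap_tolerance=0.1, high_loss=0.5):
    """Rough diagnosis from the final epoch's losses.

    gap_tolerance and high_loss are hypothetical thresholds that would
    need tuning for a real task.
    """
    final_train, final_val = train_losses[-1], val_losses[-1]
    if final_val - final_train > gap_tolerance:
        # Validation is much worse than training: the model memorizes.
        return "overfitting"
    if final_train > high_loss:
        # Even the training loss stays high: the model has not learned enough.
        return "underfitting"
    return "ok"

print(diagnose([1.0, 0.5, 0.2, 0.05], [1.0, 0.7, 0.6, 0.65]))   # overfitting
print(diagnose([1.0, 0.9, 0.85, 0.8], [1.0, 0.92, 0.88, 0.84]))  # underfitting
print(diagnose([1.0, 0.5, 0.3, 0.2], [1.0, 0.55, 0.35, 0.25]))   # ok
```

Typical next actions would be adding data or augmentation when overfitting, and training longer or using a larger model when underfitting.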
Data augmentation
This step is about data augmentation. My ML engineer colleague told me that it is one of the most important steps in computer vision. I’ll spend more time on data augmentation so that I understand it and can decide what to do by myself.
In the video, they often applied the following augmentations.
- Rotating images
- Changing the saturation of images
- Cropping images
I need to research what else I can do for data augmentation. However, those three techniques seem like a good starting point for me.
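The three augmentations above can be sketched with the standard library alone. This is a toy illustration under my own assumptions: an image is a nested list of (r, g, b) tuples with channels in 0.0-1.0, the function names are hypothetical, and the parameters are fixed here, whereas real pipelines usually pick them at random for each sample.

```python
import colorsys

def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*image[::-1])]

def scale_saturation(image, factor):
    """Multiply each pixel's HSV saturation by factor, clamped to 0.0-1.0."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            s = max(0.0, min(1.0, s * factor))
            new_row.append(colorsys.hsv_to_rgb(h, s, v))
        out.append(new_row)
    return out

def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

R, G, B, W = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)
image = [[R, G],
         [B, W]]

rotated = rotate_90_clockwise(image)                     # [[B, R], [W, G]]
gray = scale_saturation(image, 0.0)                      # every pixel loses its colour
cropped = crop(image, top=0, left=1, height=2, width=1)  # [[G], [W]]
```

Each function returns a new image and leaves the original untouched, so several variants can be generated from one source image.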
I have only just started learning computer vision. It seems challenging, but I’m very excited to learn something completely new. I’m looking forward to seeing where I’ll be at the end of this year.
That’s it!