Mennatullah Siam

The Co-Attention Mechanism

January 9, 2021

In this post I will explain the co-attention mechanism in both video object segmentation (VOS) [1] and few-shot object segmentation (FSS) [2]. I will mainly explain:

  • The COSNet model for video object segmentation [1]. [paper, Code]
  • Vanilla co-attention [1].
  • Co-attention with visual and semantic embeddings for few-shot object segmentation [2]. [paper, Code to be released April]
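To give a flavour of the idea, here is a minimal NumPy sketch of vanilla co-attention between two flattened feature maps. The shapes, the learnable weight `W`, and the function names are illustrative assumptions for this sketch, not COSNet's exact implementation:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_coattention(fa, fb, W):
    """Vanilla co-attention between two feature maps.

    fa, fb: (C, N) feature maps, spatially flattened to N = H*W locations.
    W:      (C, C) learnable weight of the affinity (hypothetical here).
    Returns attended features of the same (C, N) shape.
    """
    S = fa.T @ W @ fb            # (N, N) affinity between all location pairs
    A_a = softmax(S, axis=1)     # row-normalized: weights over fb's locations
    A_b = softmax(S, axis=0)     # column-normalized: weights over fa's locations
    fa_att = fb @ A_a.T          # each fa location attends over fb
    fb_att = fa @ A_b            # each fb location attends over fa
    return fa_att, fb_att

# Tiny usage example with random features (C=4 channels, N=6 locations)
rng = np.random.default_rng(0)
fa = rng.normal(size=(4, 6))
fb = rng.normal(size=(4, 6))
fa_att, fb_att = vanilla_coattention(fa, fb, np.eye(4))
```

The point of the symmetric softmax is that each location in one frame (or in the query image, for FSS) gets a convex combination of features from the other, so mutually salient regions reinforce each other.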

Few-Shot Segmentation - Part II

October 27, 2019

In this post I continue on few-shot segmentation, explaining the main work we drew inspiration from [4]. I will start by explaining the relation between metric learning and softmax classification that was detailed in [4], then move on to the imprinted weights idea.
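As a rough preview of the imprinted weights idea: a new class's classifier weight is set directly from the normalized embeddings of its few support examples, so the softmax classifier behaves like nearest-neighbour matching in embedding space. A minimal sketch, with all names and shapes being my own illustrative assumptions:

```python
import numpy as np

def imprint_weight(support_embeddings):
    """Imprint a classifier weight for a novel class.

    support_embeddings: (K, D) embeddings of K few-shot support examples.
    Each embedding is L2-normalized, averaged, and the mean is renormalized;
    the resulting (D,) vector can be appended as a new column of the
    softmax classifier's weight matrix.
    """
    e = support_embeddings / np.linalg.norm(
        support_embeddings, axis=1, keepdims=True)
    w = e.mean(axis=0)
    return w / np.linalg.norm(w)

# Usage: imprint a weight from 5 support embeddings of dimension 8
rng = np.random.default_rng(1)
w_novel = imprint_weight(rng.normal(size=(5, 8)))
```

With normalized embeddings and weights, the classifier's logit for the novel class is a cosine similarity, which is where the connection to metric learning comes from.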

Few-Shot Segmentation - Part I

October 13, 2019

In this post I will be writing about few-shot segmentation, on which I recently had a paper at ICCV'19. Two things got me interested in this topic: (1) the fact that I come from a developing country, and (2) that it could be a means for robots to operate in unstructured environments. I originally come from Egypt, so I understand very well that we have neither the computational resources nor the means to collect enormous datasets the way big tech companies do. I recently read an interesting article on algorithmic colonialism that I related to so much, quoting it: "data is the new oil". I concluded early in my PhD, while studying deep learning, that it is not going to be a fair game if things stay this way, which got me interested in both few-shot learning and domain adaptation. In few-shot learning you can learn from few labelled data, and with unsupervised domain adaptation you can learn from synthetic data and simulation environments. I would personally encourage more research on few-shot learning, self-supervised learning, and domain adaptation for people working in developing countries. Of course, the main problem in almost all developing countries is a corrupt government and ruler; few-shot learning is not going to solve that, but that's out of my hands currently.

Img2Img Translation and Beyond

December 30, 2018

In robotics, especially navigation in unstructured environments and manipulation, you are often faced with a continuously changing environment for the robot to operate in. Since training deep networks relies heavily on large-scale labelled datasets, it's not always clear how one can utilize deep learning and expect it to scale well within this changing environment. Well, people started to realize that we have lots of training data in synthetic environments and can easily create more. So why not use it to train deep networks and then deploy on real data, or use a mix of relatively smaller labelled real data and abundant synthetic data? The problem with such methods is the gap between the synthetic and real domains.

Few-shot Learning using HRI

Few-shot learning, the ability to learn from few labeled samples, is a vital step in robot manipulation. In order for robots to operate in dynamic and unstructured environments, they need to learn novel objects on the fly from few samples. Current object recognition methods using convolutional networks are based on supervised learning with large-scale datasets such as ImageNet, with hundreds or thousands of labeled examples per class. However, even large-scale datasets remain limited in multiple aspects: not all objects in our lives are within the 1000 labels provided in ImageNet. As humans, we can hold an object, inspect it from different viewpoints, and interact with it to learn more about it. Thus the robot should be able to teach itself from the few samples covering the different object viewpoints. If we are also aiming at human-centered artificial intelligence, a natural step is to teach robots about their environment through human-robot interaction: a human teacher can show the object in different poses and verbally instruct the robot on what it is and how it can be used. A further step is to combine that with the ability to learn about that object from large-scale web data.