
I'm curious to know about the capabilities of AI today in 2022. I know that AI has become pretty good at recognizing things like objects in photos. But what about when it comes to elements in HTML? Would it be feasible to use AI to determine things like:

  • Is there a call-to-action? Basically a button or main action that directs the user somewhere. The text in the call-to-action can obviously vary widely.
  • Is there a form on the page for the user to fill out?

The last time I tried running a rendered image of a website through image-recognition software such as Google Vision or Amazon Rekognition, it didn't detect these things, which didn't surprise me. However, maybe there's a better or alternate approach, such as using the source code? The end goal would be to determine whether the page is meant to capture leads, and form elements are some of the criteria we'd be looking for. Maybe this can also be seen as a categorization task.
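To make the source-code idea concrete: before reaching for ML at all, a rule-based baseline can scan the HTML for lead-capture signals. The tag and keyword lists below are my own illustrative assumptions (not a standard), and a real page would need rendering of JavaScript-generated content, but something like this could also produce weak labels for training a classifier later:

```python
from html.parser import HTMLParser

# Illustrative CTA phrases -- an assumption, extend for your own domain.
CTA_KEYWORDS = {"sign up", "subscribe", "get started", "contact us", "download"}

class LeadSignalParser(HTMLParser):
    """Collects simple lead-capture signals from raw HTML."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.has_email_input = False
        self.cta_texts = []
        self._in_button = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.has_form = True
        elif tag == "input" and attrs.get("type") == "email":
            self.has_email_input = True
        elif tag == "button" or (tag == "a" and "btn" in (attrs.get("class") or "")):
            self._in_button = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_button:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("button", "a") and self._in_button:
            self._in_button = False
            text = " ".join(self._buffer).strip().lower()
            if any(k in text for k in CTA_KEYWORDS):
                self.cta_texts.append(text)

def looks_like_lead_capture(html: str) -> bool:
    # Crude heuristic: a form plus either an email field or a CTA-like button.
    p = LeadSignalParser()
    p.feed(html)
    return p.has_form and (p.has_email_input or bool(p.cta_texts))
```

For example, `looks_like_lead_capture('<form><input type="email"><button>Sign Up</button></form>')` returns `True`, while a plain article page returns `False`. The point is that such heuristics set a baseline the ML approaches in the answers should beat.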

As I understand, AI is a broad term. So, if this was a feasible project, I'd also be curious to know what branch of AI would be the correct one to explore.

kenshin9
  • I think image recognition with screenshots of a webpage is the wrong approach for this sort of task. I think you can leverage the syntax in the source code of the webpage using a custom NLP model to create a much more powerful classification model. I've read a few papers that do this for fraud detection. But could you clarify what you specifically are looking for? A pretrained model? A network structure? – Lars Mar 11 '22 at 20:03
  • Thanks for the feedback. Honestly, I'm not exactly sure what I'm looking for, which is part of the problem. The thing I'm looking to solve is determining if the intent of a page is to gather leads. I had asked something similar to this elsewhere and NLP was also mentioned there. – kenshin9 Mar 11 '22 at 22:28

1 Answer


The branch of AI devoted to image-related processing is computer vision (powered by deep learning these days). In particular, for this project you would probably need to train an object-detection model (e.g. Faster R-CNN, YOLO) able to find the relevant parts of a website, like buttons and forms. An object detector finds the location (bounding box, i.e. a rectangle) and class (which kind of object it is) of each detection. Alternatively, you can look at a semantic or even instance segmentation model (e.g. Mask R-CNN), which instead provides segmentation masks and can therefore handle, in principle, arbitrarily complex shapes.

  • To train such a model, you have to collect a dataset of (image, label) pairs, where each label comprises the bounding box or segmentation mask of the page element of interest, plus its class (e.g. button, form, video, header, etc.).
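As a concrete illustration of what those labels look like, detectors are commonly trained on COCO-style JSON annotations. The file name, categories, and pixel coordinates below are invented examples for this project, not real data:

```python
import json

# Minimal COCO-style annotation for one webpage screenshot.
# All values are made-up examples; the category list is an assumption.
annotation = {
    "images": [
        {"id": 1, "file_name": "landing_page.png", "width": 1280, "height": 800}
    ],
    "categories": [
        {"id": 1, "name": "button"},
        {"id": 2, "name": "form"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, as in the COCO format
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [520, 610, 240, 48]},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [440, 380, 400, 260]},
    ],
}

# Such a dict serializes directly to the JSON file a detection library loads.
coco_json = json.dumps(annotation, indent=2)
```

Labeling a few hundred screenshots in this format is typically the main cost of the detection approach, since the model architectures themselves are available off the shelf.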

I think it would be interesting to train a multi-modal model that learns at the same time from images of webpages and from their source code. In this case you need to pair CV with NLP (natural language processing). In order of complexity, code (text in general) can be processed by recurrent neural networks, attention models, and large language models (e.g. BERT, GPT, etc.).

  • In this second solution, you have two neural networks: one processes the image, the other the text/code; you have to combine the two at some point, and then have the output layer.
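A minimal sketch of that two-branch ("late fusion") idea, with both encoders stubbed out as toy functions: in practice `encode_image` would be a CNN over the screenshot and `encode_text` a language model over the HTML, and the weights would be learned rather than hand-picked as they are here:

```python
import math

# Stub encoders producing fixed-size feature vectors.
# Real models (a CNN and a language model) would replace these toys.
def encode_image(pixels):
    return [sum(pixels) % 7, len(pixels) % 5, max(pixels)]

def encode_text(html):
    # Toy text features: raw tag counts plus a length signal.
    return [html.count("<form"), html.count("<button"), len(html) % 11]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fused_lead_score(pixels, html, weights, bias):
    # Late fusion: concatenate both feature vectors,
    # then a single linear output unit with a sigmoid.
    features = encode_image(pixels) + encode_text(html)
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
```

With hand-chosen weights that reward `<form>`/`<button>` counts, a page containing a form scores higher than one without, which is all the fused output layer has to learn at scale.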
Luca Anzalone