Optimizing Web Data Extraction Procedure With Natural Language Processing Model

Date

2024

Author

Phan, Ngọc Đông Minh

Abstract

Extracting valuable data from the vast resources of the web is crucial for business success. Insights derived from analysing consumer behaviour, market trends, and operational performance can inform strategic decision-making. However, traditional web scraping often requires significant programming expertise and adapts poorly to website changes. While no-code tools exist, they can be costly and have limited capabilities. This project addresses these challenges by developing a user-friendly, no-code web data extraction tool, with a focus on generalizing the web scraping and data extraction process. The tool aims to extract data from a wide range of websites while ensuring meaningful, structured, and analysable output. The project explores various methods of web scraping and extraction based on HTML tree parsing techniques. The tool uses scraping frameworks such as Playwright and BeautifulSoup to achieve robust capabilities. Additionally, the analysis module integrates a Large Language Model (LLM) to further enhance the usability of the extracted data. The research contributes to the field by offering a comprehensive, user-friendly tool that supports the complete process of gathering, extracting, and analysing web data with no coding involved. The tool will enable users across various fields to efficiently derive insights from web data, facilitating informed decision-making and strategic planning.

URI

http://keep.hcmiu.edu.vn:8080/handle/123456789/6776

Collections

Bachelor Thesis - Computer Science and Engineering