Building an Integrated Entertainment Articles Website Using a Web Crawler
Abstract
The amount of news published and read online has increased dramatically in recent
years, making news data a valuable resource for many research fields, such as social
science and linguistics. As the information available on the internet grows richer,
the problem of aggregating it becomes more and more urgent. Collecting a large amount
of data by hand takes considerable effort and is not very effective; hence, a technology
that can aggregate information automatically is necessary, and this need motivated the
integrated entertainment articles website.
This thesis studies how a web crawler works and builds an initial version of an
integrated entertainment articles website that aggregates information automatically
from major online newspaper sites such as DanTri, TienPhong, and VnExpress. The
application is built with Node.js, uses a MySQL database, and is designed to meet the
following criteria: fast collection speed, a compact database, and preservation of the
integrity of the original material.
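
As an illustration of the compact-database and integrity criteria, the following is a minimal sketch and not the thesis implementation. It assumes each collected article is identified by its URL and fingerprinted with a SHA-256 hash of its body, so that re-crawling an unchanged article stores nothing new and a stored body can later be checked against its fingerprint. The names `fingerprint`, `ArticleStore`, `save`, and `verify` are hypothetical, and an in-memory `Map` stands in for the MySQL table.

```javascript
// Sketch only: an in-memory Map stands in for a MySQL articles table.
const crypto = require("crypto");

// Hypothetical helper: SHA-256 fingerprint of the article body as fetched.
function fingerprint(body) {
  return crypto.createHash("sha256").update(body, "utf8").digest("hex");
}

// Hypothetical store; a real deployment would likely use a MySQL table
// with a unique index on the URL column instead of a Map.
class ArticleStore {
  constructor() {
    this.byUrl = new Map();
  }

  // Returns "inserted", "unchanged", or "updated".
  save({ url, source, title, body }) {
    const hash = fingerprint(body);
    const existing = this.byUrl.get(url);
    // Skip the write when the same URL already holds identical content.
    if (existing && existing.hash === hash) return "unchanged";
    this.byUrl.set(url, { url, source, title, body, hash });
    return existing ? "updated" : "inserted";
  }

  // Checks that the stored body still matches its recorded fingerprint.
  verify(url) {
    const row = this.byUrl.get(url);
    return row !== undefined && fingerprint(row.body) === row.hash;
  }
}

// Usage with made-up data: saving the same article twice keeps one row.
const store = new ArticleStore();
const article = {
  url: "https://example.com/a1", // placeholder URL, not a real article
  source: "VnExpress",
  title: "Sample",
  body: "Original text.",
};
const first = store.save(article);
const second = store.save(article);
console.log(first, second, store.byUrl.size);
```

Keying on the URL and comparing hashes before writing is one simple way to avoid duplicate rows across repeated crawls; other designs, such as hashing a normalized title, are equally plausible.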