How to Extract Content from HTML Files with htmlq

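Introduction: htmlq does for HTML data what sed and grep do for plain text. We can use it to search, slice, and filter HTML data. Let's see how to install this handy tool on Linux or Unix and use it to process HTML data.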

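What is htmlq?

htmlq is like jq, but for HTML. It uses CSS selectors to extract parts of an HTML file. In CSS, selectors target the HTML elements on a web page that we want to style; htmlq reuses the same syntax to pick the elements to print. For example, this tool makes it easy to extract image URLs or other URLs from a page.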
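
As a minimal sketch of the selector idea, the following assumes htmlq is already installed and on PATH. The sample page, the #gallery id, and the image paths are made up for illustration; the only htmlq option used is --attribute, as listed in its help output.

```shell
# Write a small sample page to a temporary file (hypothetical content).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
  <div id="gallery">
    <img src="/img/a.png" alt="first">
    <img src="/img/b.png" alt="second">
  </div>
  <img src="/img/logo.png" alt="logo">
</body></html>
EOF

# Select only the <img> elements inside #gallery and print their src attribute.
if command -v htmlq >/dev/null 2>&1; then
  images=$(htmlq --attribute src '#gallery img' < "$sample")
  printf '%s\n' "$images"
else
  echo "htmlq is not installed or not on PATH" >&2
fi

# Clean up the temporary file.
rm -f "$sample"
```

Changing the selector to plain img should also match the logo image, since the selector alone decides which elements are printed.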

Installing htmlq

First install cargo on the system (the example below uses yum), then use cargo to install htmlq:

[root@localhost ~]# yum -y install cargo
[root@localhost ~]# cargo install htmlq

Setting the executable path

Add $HOME/.cargo/bin to the PATH variable with the export command so that the shell can find the installed binary, then reload the profile:

[root@localhost ~]# echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> ~/.bash_profile
[root@localhost ~]# . ~/.bash_profile
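
Running the echo command above more than once appends duplicate lines to ~/.bash_profile. A hedged alternative for the current shell session is sketched below: it adds the directory only when it is missing, then reports whether htmlq can be found. The variable name cargo_bin is chosen here for illustration.

```shell
# Append $HOME/.cargo/bin to PATH only when it is not already listed.
cargo_bin="$HOME/.cargo/bin"
case ":$PATH:" in
  *":$cargo_bin:"*) ;;                         # already present, nothing to do
  *) PATH="$PATH:$cargo_bin"; export PATH ;;   # add it for this session
esac

# Report where the shell resolves htmlq, if anywhere.
if command -v htmlq >/dev/null 2>&1; then
  echo "htmlq found at: $(command -v htmlq)"
else
  echo "htmlq is not on PATH yet" >&2
fi
```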

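How do we extract content from an HTML file with htmlq?

The general pattern is to fetch a page with curl and pipe it to htmlq, where url and url2 stand for the addresses of the pages to query:

curl -s url | htmlq '#css-selector'
curl -s url2 | htmlq '.css-selector'
curl -s https://www.linuxprobe.com | htmlq --pretty '#content' | more

Let's find all the links on a page. For example:

[root@localhost ~]# curl -s https://www.linuxprobe.com | htmlq --attribute href a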
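
The raw link list usually contains duplicates and relative paths. As a rough sketch, the output can be piped through standard tools to keep only unique absolute URLs. The sample page and its example.com links are invented for illustration, and htmlq is assumed to be installed.

```shell
# A hypothetical page with a duplicate link and a relative link.
page=$(mktemp)
cat > "$page" <<'EOF'
<ul>
  <li><a href="https://example.com/b">B</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="https://example.com/a">A</a></li>
  <li><a href="https://example.com/b">B again</a></li>
</ul>
EOF

if command -v htmlq >/dev/null 2>&1; then
  # Print every href, keep http(s) URLs only, then sort and de-duplicate.
  links=$(htmlq --attribute href a < "$page" | grep -E '^https?://' | sort -u)
  printf '%s\n' "$links"
else
  echo "htmlq is not installed or not on PATH" >&2
fi

rm -f "$page"
```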

Pretty-print the HTML:

[root@localhost ~]# curl --silent https://mgdm.net | htmlq --pretty '#posts'

Help

Use the following command to view the help page:

[root@localhost ~]# htmlq --help
htmlq 0.3.0
Michael Maclean <michael@mgdm.net>
Runs CSS selectors on HTML
 
USAGE:
    htmlq [FLAGS] [OPTIONS] [selector]...
 
FLAGS:
    -B, --detect-base          Try to detect the base URL from the <base> tag in the document. If not found, default to
                               the value of --base, if supplied
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information
 
OPTIONS:
    -a, --attribute <attribute>    Only return this attribute (if present) from selected elements
    -b, --base <base>              Use this URL as the base for links
    -f, --filename <FILE>          The input file. Defaults to stdin
    -o, --output <FILE>            The output file. Defaults to stdout
 
ARGS:
    <selector>...    The CSS expression to select [default: html]
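
To illustrate a few of the options listed above, here is a minimal sketch that reads from a file with --filename, writes to a file with --output, and keeps only text with --text and --ignore-whitespace. The page content, the .note class, and the file names are assumptions made for the example.

```shell
# Create a scratch directory with a hypothetical input page.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><body>
  <h1>Release notes</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</body></html>
EOF

if command -v htmlq >/dev/null 2>&1; then
  # Read page.html, select elements with class "note",
  # and write only their text content to notes.txt.
  htmlq --filename "$workdir/page.html" \
        --output "$workdir/notes.txt" \
        --text --ignore-whitespace '.note'
  notes=$(cat "$workdir/notes.txt")
  printf '%s\n' "$notes"
else
  echo "htmlq is not installed or not on PATH" >&2
fi

# Remove the scratch directory.
rm -rf "$workdir"
```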

Summary

htmlq does for HTML data what sed and grep do for plain text. We can use it to search, slice, and filter HTML data.

Posted in: Linux tutorials

2024-07-25

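Reprint notice: unless otherwise stated, all articles on this site are published under the CC-4.0 license; please credit the source when reprinting.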
