Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

   IT问题网   2018-07-12 00:00:00

Question

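I am working on a Microsoft .NET application in C# for web harvesting, web scraping, web data extraction, screen scraping, or whatever you want to call it. For parsing HTML, I am trying to incorporate the HTML Agility Pack, but it is not as easy as I thought it would be. I have included some specs and images of what I have so far, and I would like your opinion on how I might proceed. Basically, I want to build something similar to the layout that Visual Web Ripper uses, but I have no idea how they did it. Any ideas?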

Images:

http://img69.imageshack.us/img69/8880/webharvester1.png

http://img198.imageshack.us/img198/9563/webharvester2.png

Description:

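My goal is to make a very user-friendly point-and-click application for downloading data and images from the web. I want to load the HTML page using a web browser control and output the parsed data and image links into a textbox. The user can then specify the HTML tags they want and download the data into a grid. Finally, they can export the data to whatever format they need.

I am trying to use the HTML Agility Pack to load the HTML when the page loads and display it in a textbox.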

// Load the web browser
private void Form6_Load(object sender, EventArgs e)
{
    // Navigate to the web page
    webBrowser.Navigate("http://www.webopedia.com/term/h/html.html");

    // Save the URL to memory
    siteMemoryArray[count] = urlTextBox.Text;

    // Load the HTML from the web browser
    HtmlWindow window = webBrowser.Document.Window;
    string str = window.Document.Body.OuterHtml;

    // Extract tags with HtmlAgilityPack and display them as text
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(str);

    HtmlAgilityPack.HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//a");

    foreach (HtmlAgilityPack.HtmlNode node in nodes)
    {
        textBox2.Text += node.OuterHtml + "\r\n";
    }
}
 

Regarding the line: HtmlWindow window = webBrowser.Document.Window;

I get the error: Object reference not set to an instance of an object.

Solution

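The page has probably not finished loading at the point where you reference the browser window. Navigate returns immediately and the page loads asynchronously, so webBrowser.Document can still be null on the next line. You can have the browser control raise an event when loading is finished (on the Windows Forms WebBrowser control this is the DocumentCompleted event) and read the document from that handler. See this SO answer for an example: C# how to wait for a webpage to finish loading before continuing (http://stackoverflow.com/questions/583897/c-sharp-how-to-wait-for-a-webpage-to-finish-loading-before-continuing)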
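
A minimal sketch of that approach, assuming a Windows Forms project that references the HtmlAgilityPack package. The form class ScraperForm and the way the controls are created here are hypothetical stand-ins for the asker's designer-generated form; only the idea of moving the parsing into a DocumentCompleted handler comes from the answer above.

```csharp
using System;
using System.Text;
using System.Windows.Forms;

// Hypothetical form: the control names mirror the question, but the layout
// code below is an assumption made so that the example is self-contained.
public class ScraperForm : Form
{
    private readonly WebBrowser webBrowser =
        new WebBrowser { Dock = DockStyle.Top, Height = 300 };
    private readonly TextBox textBox2 =
        new TextBox { Multiline = true, Dock = DockStyle.Fill, ScrollBars = ScrollBars.Vertical };

    public ScraperForm()
    {
        Controls.Add(textBox2);
        Controls.Add(webBrowser);

        // Subscribe before navigating so the handler cannot miss the event.
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
        Load += (sender, e) =>
            webBrowser.Navigate("http://www.webopedia.com/term/h/html.html");
    }

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event is raised once per frame; only react to the top-level document.
        if (e.Url != webBrowser.Url)
            return;
        if (webBrowser.Document == null || webBrowser.Document.Body == null)
            return;

        string html = webBrowser.Document.Body.OuterHtml;

        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlAgilityPack.HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//a");
        if (nodes == null)
            return;

        // Build the text once instead of appending to the TextBox in a loop.
        var sb = new StringBuilder();
        foreach (HtmlAgilityPack.HtmlNode node in nodes)
            sb.AppendLine(node.OuterHtml);
        textBox2.Text = sb.ToString();
    }

    [STAThread]
    private static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ScraperForm());
    }
}
```

Compared with the code in the question, the Load handler now only starts the navigation, and everything that touches webBrowser.Document runs after the control reports that the document is ready. The null checks are defensive additions, not something the original answer prescribes.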
