[Heritrix/crawler-beans.cxml]statisticsTracker
2016. 8. 5. 21:11
StatisticsTracker
• Performs statistics tracking and reporting.
At each snapshot, the corresponding information is written to the 'progress-statistics.log' file.
Default settings
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <!-- <property name="reports">
    <list>
      <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />
      <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />
      <bean id="sourceTagsReport" class="org.archive.crawler.reporting.SourceTagsReport" />
      <bean id="mimetypesReport" class="org.archive.crawler.reporting.MimetypesReport" />
      <bean id="responseCodeReport" class="org.archive.crawler.reporting.ResponseCodeReport" />
      <bean id="processorsReport" class="org.archive.crawler.reporting.ProcessorsReport" />
      <bean id="frontierSummaryReport" class="org.archive.crawler.reporting.FrontierSummaryReport" />
      <bean id="frontierNonemptyReport" class="org.archive.crawler.reporting.FrontierNonemptyReport" />
      <bean id="toeThreadsReport" class="org.archive.crawler.reporting.ToeThreadsReport" />
    </list>
  </property> -->
  <!-- <property name="reportsDir" value="${launchId}/reports" /> -->
  <!-- <property name="liveHostReportSize" value="20" /> -->
  <!-- <property name="intervalSeconds" value="20" /> -->
  <!-- <property name="keepSnapshotsCount" value="5" /> -->
</bean>
• [Property descriptions]
1. reports : lists the reports to generate. Each report is handled by its own bean.
1. CrawlSummaryReport : an overall summary of the crawl size
(ex)
crawl name: basic
crawl status: Finished
duration: 15s8ms
seeds crawled: 2
seeds uncrawled: 0
hosts visited: 11
URIs processed: 81
URI successes: 81
URI failures: 0
URI disregards: 0
novel URIs: 81
total crawled bytes: 813450 (794 KiB)
novel crawled bytes: 813450 (794 KiB)
URIs/sec: 5.4
KB/sec: 52
2. SeedsReport : the crawl result for each seed
[code] [status] [seed] [redirect]
3. HostsReport : per-host statistics
[#urls] [#bytes] [host] [#robots] [#remaining] [#novel-urls] [#novel-bytes] [#dup-by-hash-urls] [#dup-by-hash-bytes] [#not-modified-urls] [#not-modified-bytes]
4. SourceTagsReport : records the source tags for each host (in most cases the source tag is the seed)
[source] [host] [#urls]
5. MimetypesReport : statistics per MIME type
[#urls] [#bytes] [mime-types]
6. ResponseCodeReport : the number of URLs found per response/disposition code
[#urls] [rescode]
7. ProcessorsReport : lists the class of each processor registered with the CrawlController, together with the results each processor reports for its own function.
(ex)
CandidateChain - Processors report - 201608050647
Number of Processors: 2
Processor: org.archive.crawler.prefetch.CandidateScoper
Processor: org.archive.crawler.prefetch.FrontierPreparer
FetchChain - Processors report - 201608050647
Number of Processors: 9
Processor: org.archive.crawler.prefetch.Preselector
Processor: org.archive.crawler.prefetch.PreconditionEnforcer
Processor: org.archive.modules.fetcher.FetchDNS
Processor: org.archive.modules.fetcher.FetchHTTP
Function: Fetch HTTP URIs
CrawlURIs handled: 71
Recovery retries: 0
Processor: org.archive.modules.extractor.ExtractorHTTP
4 links from 71 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorHTML
1063 links from 26 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorCSS
15 links from 9 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorJS
7 links from 8 CrawlURIs
Processor: org.archive.modules.extractor.ExtractorSWF
0 links from 0 CrawlURIs
DispositionChain - Processors report - 201608050647
Number of Processors: 3
Processor: org.archive.modules.writer.WARCWriterProcessor
Function: Writes WARCs
Total CrawlURIs: 81
Revisit records: 0
Crawled content bytes (including http headers): 813450 (794 KiB)
Total uncompressed bytes (including all warc records): 994111 (971 KiB)
Total size on disk (compressed): 363358 (355 KiB)
Processor: org.archive.crawler.postprocessor.CandidatesProcessor
Processor: org.archive.crawler.postprocessor.DispositionProcessor
8. FrontierSummaryReport : a file usually checked while a crawl is running; it summarizes the state of each type of queue.
(ex)
Frontier report - 201608050647
Job being crawled: basic
-----===== STATS =====-----
Discovered: 81
Queued: 0
Finished: 81
Successfully: 81
Failed: 0
Disregarded: 0
-----===== QUEUES =====-----
Already included size: 81
pending: 0
All class queues map size: 11
Active queues: 0
In-process: 0
Ready: 0
Snoozed: 0
Inactive queues: 0 (p3: 0)
Retired queues: 0
Exhausted queues: 11
Last state: EMPTY
-----===== MANAGER THREAD =====-----
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
java.lang.Thread.getStackTrace(Thread.java:1479)
org.archive.crawler.framework.ToeThread.reportThread(ToeThread.java:484)
org.archive.crawler.frontier.WorkQueueFrontier.reportTo(WorkQueueFrontier.java:1304)
org.archive.crawler.reporting.FrontierSummaryReport.write(FrontierSummaryReport.java:39)
org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:910)
org.archive.crawler.reporting.StatisticsTracker.dumpReports(StatisticsTracker.java:938)
org.archive.crawler.reporting.StatisticsTracker.stop(StatisticsTracker.java:342)
org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:236)
org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:213)
org.springframework.context.support.DefaultLifecycleProcessor.access$2(DefaultLifecycleProcessor.java:206)
org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.stop(DefaultLifecycleProcessor.java:352)
.
.
.
-----===== 11 LONGEST QUEUES =====-----
LONGEST#0:
:
:
LONGEST#1:
:
:
-----===== IN-PROCESS QUEUES =====-----
-----===== READY QUEUES =====-----
-----===== SNOOZED QUEUES =====-----
-----===== INACTIVE QUEUES =====-----
-----===== RETIRED QUEUES =====-----
9. FrontierNonemptyReport : a file usually dumped at the end of a crawl; it holds information about the frontier queues that are not empty.
10. ToeThreadsReport : a report of the toe threads' call stacks
2. reportsDir : the directory in which the reports are saved
3. liveHostReportSize : the size used when reporting on the hosts currently being crawled (20 in the commented-out default above).
4. intervalSeconds : the interval, in seconds, at which progress information is written
5. keepSnapshotsCount : the number of sample snapshots kept for computing statistics.
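As an illustration of how these five properties fit together, here is a minimal sketch of a customized statisticsTracker bean. The property names and report classes are taken from the default configuration shown earlier. The selection of only three reports and the values 50, 60 and 10 are hypothetical choices made for this example, not recommendations. It also assumes that an explicit reports list replaces the default set rather than adding to it.

```xml
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <!-- Assumption: listing beans here replaces the default set of ten reports -->
  <property name="reports">
    <list>
      <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />
      <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />
    </list>
  </property>
  <!-- Same value as the commented-out default -->
  <property name="reportsDir" value="${launchId}/reports" />
  <!-- Hypothetical values for illustration; the commented-out defaults are 20, 20 and 5 -->
  <property name="liveHostReportSize" value="50" />
  <property name="intervalSeconds" value="60" />
  <property name="keepSnapshotsCount" value="10" />
</bean>
```

Any property left out of the bean keeps the default value shown in the commented-out lines of the original configuration.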