This manual covers how to find Sina Blog articles that are hidden from the public and transfer them in bulk to a WordPress-based blog for publication.
1. Project Objectives
- Full migration: Migrate all Sina Blog articles (including hidden articles) to WordPress.
- Identify the differences: Find posts that are visible only to the blogger (not visible to the public).
- Batch import: Generate JSON/HTML files that can be imported into WordPress.
2. Required environment
- Python 3.9+ (on Windows, installing the latest version of Python is recommended)
- Install the required libraries:
pip install requests beautifulsoup4 lxml
3. Scripts used
(1) claw_sina_blog.py
Function: Batch-crawls Sina Blog articles (public mode and login mode), saving the HTML and an index file, index.json.
New features: Supports --cookie / --cookie-file, which allow crawling articles that are visible only to the blogger.
Common parameters:
- --uid: Sina Blog UID (such as 1484861452).
- --outdir: output directory.
- --start / --end: the start and end pages.
- --cookie-file: read the cookie from a file (single line of text).
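The parameters above could be wired up roughly as follows. This is a minimal sketch, not the actual code of claw_sina_blog.py: the defaults, the helper names build_parser, resolve_cookie and build_headers, and the User-Agent value are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    """Options named in the manual; every default here is an assumption."""
    parser = argparse.ArgumentParser(description="Crawl Sina Blog article lists")
    parser.add_argument("--uid", required=True, help="Sina Blog UID")
    parser.add_argument("--outdir", default="out", help="output directory")
    parser.add_argument("--start", type=int, default=1, help="first list page")
    parser.add_argument("--end", type=int, default=None, help="last list page")
    parser.add_argument("--cookie", default=None, help="cookie string")
    parser.add_argument("--cookie-file", default=None, help="file holding the cookie")
    return parser


def resolve_cookie(cookie, cookie_file):
    """Prefer an inline cookie; otherwise read the single-line cookie file."""
    if cookie:
        return cookie.strip()
    if cookie_file:
        return Path(cookie_file).read_text(encoding="utf-8").strip()
    return None  # public mode: no cookie at all


def build_headers(cookie):
    """HTTP headers for a request; the Cookie header is sent only in login mode."""
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

The resulting dictionary can be passed as the headers argument of requests.get. Stripping the cookie matters because a trailing newline from the file would otherwise end up inside the header value.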
(2) dedup_index.py
Function: Deduplicates index.json and generates a clean version, index_clean.json.
The original crawl may contain duplicates (more than 3000 entries); after deduplication, about 387 articles should remain.
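The deduplication step can be sketched as below. It assumes that index.json is a JSON list of objects and that each entry carries a "url" field identifying the article; neither the schema nor the function names are confirmed by the manual, so treat them as hypothetical.

```python
import json
from pathlib import Path


def dedup_entries(entries, key="url"):
    """Keep the first entry for each key value, preserving crawl order."""
    seen = set()
    unique = []
    for entry in entries:
        marker = entry.get(key)
        if marker is None:
            unique.append(entry)  # no key to compare on: keep it untouched
            continue
        if marker in seen:
            continue  # duplicate of an earlier entry
        seen.add(marker)
        unique.append(entry)
    return unique


def dedup_index_file(src, dst, key="url"):
    """Read src, write the deduplicated list to dst, return (before, after)."""
    entries = json.loads(Path(src).read_text(encoding="utf-8"))
    unique = dedup_entries(entries, key)
    Path(dst).write_text(
        json.dumps(unique, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(entries), len(unique)
```

Comparing the two returned counts is a quick sanity check: a drop from several thousand entries to a few hundred is what the figures above would lead one to expect.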
(3) sina_hidden_finder.py
Function:
- Compares the public-mode and login-mode indexes to find articles that only the blogger can see.
- Generates a manifest (JSON/CSV).
- Batch-downloads the HTML of the hidden articles and saves it to downloads/.
Common parameters:
- --public-index: public-mode index.json.
- --owner-index: login-mode index_clean.json.
- --cookie-file: cookie file, for crawling while logged in.
- --outdir: output directory.
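The comparison at the heart of this script amounts to a set difference: anything present in the login-mode index but absent from the public index is treated as hidden. A minimal sketch follows, assuming both indexes are lists of objects with "url" and "title" fields (an assumed schema). The function names are hypothetical; the output file names are the ones listed under Step 4.

```python
import csv
import json
from pathlib import Path


def find_hidden(public_entries, owner_entries, key="url"):
    """Return owner-only entries: in the login-mode index, not in the public one."""
    public_keys = {e[key] for e in public_entries if key in e}
    return [e for e in owner_entries if key in e and e[key] not in public_keys]


def write_manifest(hidden, outdir, fields=("title", "url")):
    """Write the hidden-article list as hidden_only.json and hidden_only.csv."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "hidden_only.json").write_text(
        json.dumps(hidden, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # utf-8-sig helps Excel on Windows display Chinese titles correctly.
    with open(out / "hidden_only.csv", "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(hidden)
```

Matching on the URL rather than the title avoids false matches between posts that share a title, which does occur in the attached list.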
4. Operation steps
Step 1. Public mode crawling
python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_public --start 1 --end 8
Generates:
.\sina_public\index.json
Step 2. Login mode crawling
- Copy the cookie from the browser and save it as cookie.txt (a single line of text).
- Run:
python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_owner --start 1 --end 8 --cookie-file .\cookie.txt
Generates:
.\sina_owner\index.json
Step 3. Deduplication
python .\dedup_index.py
Generates:
.\sina_owner\index_clean.json
Step 4. Difference Extraction & Hidden Article Download
python .\sina_hidden_finder.py `
  --public-index .\sina_public\index.json `
  --owner-index .\sina_owner\index_clean.json `
  --cookie (Get-Content .\cookie.txt -Raw) `
  --outdir .\sina_hidden `
  --delay 1.0 `
  --jitter 0.3
Generates:
.\sina_hidden\hidden_only.json
.\sina_hidden\hidden_only.csv
.\sina_hidden\downloads\*.html
.\sina_hidden\hidden_index.json
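The --delay and --jitter options in Step 4 suggest a randomized pause between downloads. The manual does not say how the two values are combined, so the sketch below assumes a wait of delay plus or minus up to jitter seconds. The names polite_delay and fetch_all are hypothetical, and fetch stands for whatever function downloads a single page.

```python
import random
import time


def polite_delay(delay=1.0, jitter=0.3):
    """Pick a wait time in [delay - jitter, delay + jitter], never negative."""
    return max(0.0, delay + random.uniform(-jitter, jitter))


def fetch_all(urls, fetch, delay=1.0, jitter=0.3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            sleep(polite_delay(delay, jitter))
        results[url] = fetch(url)
    return results
```

Passing sleep as a parameter keeps the loop easy to test without actually waiting.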
5. Quick Lookup of Commonly Used Commands
Public Mode
python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_public --start 1 --end 8
Login Mode
python .\claw_sina_blog.py --uid 1484861452 --outdir .\sina_owner --start 1 --end 8 --cookie-file .\cookie.txt
Deduplication
python .\dedup_index.py
Extract hidden articles
python .\sina_hidden_finder.py --public-index .\sina_public\index.json --owner-index .\sina_owner\index_clean.json --cookie (Get-Content .\cookie.txt -Raw) --outdir .\sina_hidden
6. Frequently Asked Questions (FAQ)
- Q: The script keeps paging past the last page?
A: Add --end 8 to limit the number of pages, or use a modified version of crawl() that automatically detects duplicate pages.
- Q: The cookie has expired?
A: Copy the complete cookie from the browser again, overwrite cookie.txt, and run the script again.
- Q: index.json contains too many entries (>3000)?
A: This means there are duplicates; run dedup_index.py to remove them.
- Q: How do I import into WordPress?
A: Use the existing wp_batch_import_v4_1_3.py, passing in hidden_index.json or the full index_clean.json.
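The "modified version of crawl()" mentioned in the first answer is not shown in this manual. One way such duplicate-page detection might work is to stop as soon as a list page contributes no link that has not already been seen, which is what happens when the site keeps serving the last page for out-of-range page numbers. In this sketch, crawl_pages and fetch_page_links are hypothetical names, and fetch_page_links(page) is assumed to return the article URLs found on one list page.

```python
def crawl_pages(fetch_page_links, start=1, end=None, max_pages=1000):
    """Walk list pages until `end`, or until a page adds no new links."""
    seen = set()
    ordered = []
    page = start
    # max_pages is a safety cap in case neither stop condition triggers.
    while (end is None or page <= end) and page < start + max_pages:
        fresh = 0
        for url in fetch_page_links(page):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
                fresh += 1
        if fresh == 0:
            break  # empty or fully repeated page: assume we ran past the end
        page += 1
    return ordered
```

Because repeated links are skipped as they arrive, this approach also keeps most duplicates out of index.json in the first place.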
[Attachment: List of blog posts blocked by Sina Blog]
The Best Picture Books Published in the Last Two Months (May-July 2008) 2008/7/13 10:14
[Book Excerpt] The World is Made for Good People: Afanti’s Life and Childhood (3) 2008/7/24 17:06
[Book Excerpt] The World is Made for Good People: Afanti’s Life and Childhood (2) 2008/7/24 17:08
A picture book you can read while holding it and reciting it: “Su Wu Shepherding Sheep” 2008/8/30 23:29
[Olympic Side Story] Egg pancakes, pirated discs, and books reappear on Beijing streets… 2008/9/22 20:16
“How to Play with Picture Books” Lecture Transcript (Part 2) 2008/10/17 9:30
“How to Play with Picture Books” Lecture Transcript (Part 2) 2008/10/17 9:35
[Notes] “How can parents help their children fall in love with writing?” 2009/2/2 14:44
A‑Jia, Carrot Detective’s 2009 Summer Book Recommendations (Part 2) 2009/7/16 21:45
Random Notes: From Androgyny to Modern Witches to Lao Tzu’s Feminist Perspectives 2009/9/21 9:37
Continue to discuss parenting strategies with Yangyang’s dad: reading, asking questions, rewards, etc.… 2009/10/16 14:28
Continuing the Gossip Chapter of the Love Cultivation Chapter — A Story of a Mother Who Abandoned Her Child… 2010/1/24 21:52
German Contemporary Children’s Book Illustration Exhibition: Exhibition Overview, Exhibits, and Illustrators 2010/3/31 11:34
Reading History Excerpts: A “Superstitious” Account of Natural Disasters in Han Chinese Stories 2010/4/17 23:16
Masters of the Art of Storytelling for Children (V) 2010/5/27 22:13
Masters of the Art of Storytelling for Children (Part 6) 2010/5/30 21:57
[Repost] Example of a personal application process for “Green Child” 2010/6/17 10:19
Weibo Chat: Watching and Talking About Movies with Kids 2010/12/20 10:01
Chatting with Children about “Seven-Character Verse: The Long March” 2010/12/30 15:41
[Mandarin and Cantonese] “Five Hundred Words of Reflection on the Journey from Beijing to Fengxian County” Recitation and Pronunciation Demonstration 2011/1/2 23:42
[Reading Notes] Who actually planted the sparse fence in Du Fu’s “Presented to Wu Lang Again”? 2011/6/14 0:12
Noise and commotion? Colorful and harmonious… Reflections on “Sounds in the Park” 2011/6/23 23:24
[Repost] Ube Live Recording: Nao Matsui’s Favorite Picture Books — Japanese Picture Books 2011/8/28 21:40
A Brief Overview of the Development of Picture Books Worldwide (Based on Currently Published English Picture Books) 2011/9/21 23:01
[Postscript] The fascinating quiet, leisurely, and joyful life 2012/5/29 10:06
[Repost] Pre-registration Notice for the Fifth Red Mud Children’s Classic Book Study Session 2012/8/22 23:29
[Repost] Leo Lionni and his field mouse Alfred 2012/10/8 21:55
[Repost] Fairy Tale? Not a Fairy Tale? Introduction to the Red Mud Digital Platform Library 2012/10/12 9:25
[Reprint] Chen Lu: The Social Network Structure of Children’s Reading Promotion in the United States 2012/12/19 9:15
[Repost] Rain or shine – Red Mud Osen Book Club 2013/7/2 11:16
[Repost] Perfection (Hippo Reviews Lorenz’s Popular Science Classics Series, Volume 5) 2013/12/16 10:45
[Repost] Registration Notice for the 7th Red Mud Children’s Classic Book Study Session 2013/12/28 11:22
Recommend a small professional library software (free use within 1000 titles) 2014/5/14 9:59
Index of names, titles, and keywords for Dear Genius (sorted by original English title) 2014/6/15 22:30
Why is Dear Genius an important book? 2014/9/25 0:03
A Summer Journey Through American Children’s Literature 2014/10/29 18:28
Repost: Magic Children’s Book Club’s Zhang Hong — Review of the Ten Secrets of Picture Book Playing Lecture in Shanghai 2015/3/26 11:40
Key Points from Lectures 1 and 2 of the “Stories of Geniuses” Series 2015/6/12 8:37
The 8th Seed Storyteller Training Session 2: Inspector Radish — The Secret Path to Oz 2016/4/21 16:42
[Translation Notes] “The Boy and the Cherry Tree,” Dreams and Perseverance… 2016/5/20 10:54
Picture Book Creation Interview: Listen to Teacher Cai Gao Talk About “Meng Jiangnu Weeping at the Great Wall” 2016/5/21 14:51
Audio Column “Ajia Storytelling” Launch Notes 2016/12/15 21:02
How did the Chinese-flavored “Princess’s Kite” come about? 2017/1/2 16:44
[Repost] Barry Moser is a famous artist 2017/3/30 18:41
[Notes] Peter Rabbit’s Character Design and Development (Part 2) 2018/1/9 10:05
[Notes] Peter Rabbit’s Character Design and Development (Part 3) 2018/1/11 10:42
A Brief Overview of the Development of Picture Books Worldwide (Based on Currently Published English Picture Books) 2019/5/22 20:59