State any 5 issues of Data Wrangling in Python




Data wrangling, a term often used interchangeably with data cleaning or data preprocessing, is a crucial step in data analysis. Here are five common issues that can arise during the data wrangling process in Python:


1. Missing Data: Missing data occurs when there are empty or null values in the dataset. Dealing with missing data involves strategies such as imputation (replacing missing values with estimated ones), deletion of rows or columns with missing data, or using advanced techniques like interpolation or machine learning algorithms to fill in the missing values.


2. Data Inconsistencies: Inconsistencies can arise when the same information is represented differently across the dataset. This can include inconsistent formatting, variations in spelling or capitalization, or different encoding schemes. Cleaning inconsistent data requires standardizing formats, applying text cleaning techniques, or using regular expressions to identify and correct inconsistencies.


3. Outliers: Outliers are data points that significantly deviate from the rest of the dataset. These can be caused by measurement errors, data entry mistakes, or genuine anomalies. Dealing with outliers involves identifying them using statistical methods or visualizations and deciding whether to remove them, transform them, or handle them separately during the analysis.


4. Data Duplicates: Duplicate data refers to multiple records or observations that are identical or nearly identical. Duplicates can introduce bias and affect the accuracy of analysis. Handling duplicates involves identifying and removing them based on specific criteria, such as matching values in key fields or considering a combination of attributes.


5. Data Formatting and Type Inconsistencies: Inconsistent data formats and types can hinder data analysis and modeling. This can include inconsistent date formats, numerical values stored as strings, or categorical variables represented differently. To address these issues, it is necessary to convert data to the correct format and ensure consistent data types across the dataset using functions and methods available in Python libraries like pandas.
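As a rough illustration, the five issues above could be handled with pandas along the following lines. This is only a sketch under stated assumptions: the toy dataset, the column names (`city`, `signup`, `sales`), and the `clean` helper are hypothetical, and median imputation and the 1.5 × IQR rule are just one reasonable choice each among the strategies listed above.

```python
import pandas as pd

# Hypothetical toy dataset that exhibits all five issues at once.
raw = pd.DataFrame({
    "city": ["New York", "new york ", "Boston", "Boston",
             "Chicago", "Chicago", "Denver"],
    "signup": ["2023-01-05", "2023-01-06", "2023-02-10", "2023-02-10",
               "2023-03-15", "2023-03-16", "2023-04-01"],
    # Numbers stored as strings, one missing value, one extreme value.
    "sales": ["100", "110", "95", "95", None, "105", "5000"],
})


def clean(df):
    """Apply one possible fix for each of the five issues (illustrative only)."""
    df = df.copy()

    # Issue 5: convert strings to proper numeric and datetime types.
    df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
    df["signup"] = pd.to_datetime(df["signup"], format="%Y-%m-%d")

    # Issue 2: standardize whitespace and capitalization in text fields.
    df["city"] = df["city"].str.strip().str.title()

    # Issue 4: drop rows that are exact duplicates across all columns.
    df = df.drop_duplicates()

    # Issue 1: impute missing sales with the median (an assumed strategy).
    df["sales"] = df["sales"].fillna(df["sales"].median())

    # Issue 3: keep only rows within 1.5 * IQR of the quartiles.
    q1, q3 = df["sales"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[in_range].reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
print(cleaned.dtypes)
```

The order of the steps matters in this sketch: types are converted first so that the median and quartiles are computed on numbers, and duplicates are dropped before imputation so that repeated rows do not skew the statistics. Whether an extreme value should really be removed depends on the dataset and the analysis.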


These are just a few common issues encountered during data wrangling. Effective data wrangling practices involve a combination of Python programming skills, domain knowledge, and understanding of the dataset to handle these issues appropriately and prepare the data for further analysis.


OR (the answer below is based directly on the Data Analysis with Pandas book)



Data wrangling is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:

1. Human errors: Data is recorded (or even collected) incorrectly, such as putting 100 instead of 1000, or typos. In addition, there may be multiple versions of the same entry recorded, such as New York City, NYC, and nyc.

2. Computer error: Perhaps we weren't recording entries for a while (missing data).

3. Unexpected values: Maybe whoever was recording the data decided to use ? for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values.

4. Incomplete information: Think of a survey with optional questions; not everyone will answer them, so we have missing data, but not due to computer or human error.

5. Resolution: The data may have been collected per second, while we need hourly data for our analysis.

6. Relevance of the fields: Often, data is collected or generated as a product of some process rather than explicitly for our analysis. In order to get it to a usable state, we will have to clean it up.

7. Format of the data: The data may be recorded in a format that isn't conducive to analysis, which will require that we reshape it.

8. Misconfigurations in data-recording process: Data coming from sources such as misconfigured trackers and/or webhooks may be missing fields or passing them in the wrong order.
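A few of the issues in this list can be sketched with pandas as follows. The CSV text, the column names (`timestamp`, `city`, `reading`), and the `aliases` mapping are hypothetical and made up for illustration; the example covers only the ? placeholder, the New York City / NYC / nyc variants, and the change of resolution from per-second to hourly data.

```python
import io

import pandas as pd

# Hypothetical per-second readings with a "?" placeholder and
# several spellings of the same city.
csv_text = """timestamp,city,reading
2023-01-01 00:00:00,New York City,1.0
2023-01-01 00:00:01,NYC,?
2023-01-01 00:30:00,nyc,3.0
2023-01-01 01:00:00,NYC,5.0
2023-01-01 01:00:01,New York City,7.0
"""

# Unexpected values: read naively, the "?" turns the whole column into text.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive["reading"].dtype)

# Declaring "?" as a missing-value marker keeps the column numeric.
readings = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["?"],
    parse_dates=["timestamp"],
)
print(readings["reading"].dtype)

# Human errors: map the different spellings to one canonical name.
aliases = {"new york city": "New York City", "nyc": "New York City"}
readings["city"] = readings["city"].str.lower().map(aliases)

# Resolution: aggregate the readings into 60-minute bins.
# Taking the mean is an assumption; a sum or last value may suit other data.
hourly = readings.set_index("timestamp")["reading"].resample("60min").mean()
print(hourly)
```

Note that `map` turns any spelling missing from `aliases` into a missing value, so on real data the mapping would need to cover every variant, or unmatched entries would need to be handled separately.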




 
